The premise

After putting up with it for a long time, the author finally broke down and posted a rant as a "Boiling Point" post on Juejin:


Sadly, that emotional post seems to have been hidden. Canal has been running in production for a while now, so, having calmed down, here are the key points:

  • The latest RELEASE version is v1.1.4, published on 2019-9-2. It has not been updated for almost a year.
  • The Issue list contains many questions that are unanswered or unresolved, many of them quite old.
  • The master branch often receives commits with broken code that does not build. Because v1.1.4 has so many problems, the author once considered building the master code manually, but after importing the project gave up on the idea and went back to reading the MyBatis source code.

These are only the surface symptoms. Let's go through the pitfalls.

Parser thread blocking problem

This is a pitfall that practically every developer using Canal runs into. The configuration file $CANAL_HOME/conf/canal.properties contains one commented-out line: canal.instance.parser.parallelThreadSize = 16. This setting specifies the number of parallel threads for the parser instance. If it stays commented out, the parser thread blocks and nothing gets parsed.


It is recommended to enable the setting and keep the default value of 16.
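The fix described above amounts to re-enabling one commented-out line. The following is a minimal sketch of that edit on the text of a properties file; the function name `uncomment_property` is hypothetical and is not part of Canal.

```python
import re

# Property name taken from the text above.
PROPERTY = "canal.instance.parser.parallelThreadSize"


def uncomment_property(text: str, key: str = PROPERTY) -> str:
    """Return the properties text with a commented-out `key` line re-enabled."""
    pattern = re.compile(r"^\s*#+\s*(" + re.escape(key) + r"\s*=.*)$")
    lines = []
    for line in text.splitlines():
        match = pattern.match(line)
        # Keep only the "key = value" part when the line was commented out.
        lines.append(match.group(1) if match else line)
    # Preserve a trailing newline if the input had one.
    return "\n".join(lines) + ("\n" if text.endswith("\n") else "")
```

Reading canal.properties, passing its contents through this function, and writing the result back should leave every other line untouched. Back up the file first, and restart Canal afterwards.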

Blocking caused by table structure cache errors


This is the problem raised most often in the Issue list, and it has gone unresolved for a long time: the storage of table structure metadata (the configuration items use the term TSDB, so it is called the TSDB function below).


The TSDB function is enabled by default, which means the table structure used for parsing is cached in an H2 database. In practice, however, if the upstream table structure changes, the corresponding H2 cache is not updated. The usual result is a startling parse exception that looks like this:

Caused by: com.alibaba.otter.canal.parse.exception.CanalParseException: column size is not match for table: <database name>.<table name>, <number of columns in the new table structure> vs <number of columns in the cached table structure>;


This exception has a nasty consequence: the parser thread blocks, so binlog events are no longer received or parsed. The author read through many Issues about it, and everyone considers it a serious BUG. The most workable solution is to disable the TSDB function (crude, admittedly) by setting canal.instance.tsdb.enable to false. When disabling TSDB, you must first "stop" the Canal service, then "delete" $CANAL_HOME/conf/<target database instance id>/h2.mv.db, and only then "start" the Canal service again.
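The file-level part of that procedure can be sketched as below. This is only an illustration under stated assumptions: the function name `disable_tsdb` is hypothetical, which properties file holds the setting in your deployment is passed in by the caller, and stopping and starting the Canal service is left to the operator.

```python
from pathlib import Path

TSDB_KEY = "canal.instance.tsdb.enable"


def disable_tsdb(properties_file: Path, instance_dir: Path) -> bool:
    """Set canal.instance.tsdb.enable to false and delete the H2 cache file.

    Run this only while the Canal service is stopped.
    Returns True if an h2.mv.db file was found and deleted.
    """
    setting = TSDB_KEY + " = false"
    lines = properties_file.read_text(encoding="utf-8").splitlines()
    found = False
    for i, line in enumerate(lines):
        # Match the key whether or not the line is currently commented out.
        if line.lstrip("# \t").replace(" ", "").startswith(TSDB_KEY + "="):
            lines[i] = setting
            found = True
    if not found:
        lines.append(setting)
    properties_file.write_text("\n".join(lines) + "\n", encoding="utf-8")

    cache_file = instance_dir / "h2.mv.db"
    if cache_file.exists():
        cache_file.unlink()
        return True
    return False
```

The order matters: the service must already be stopped before this runs, and it is started again only after the cache file is gone.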

Because of this problem, the author disabled the TSDB function in production and added handling logic for DDL statements that sends an alert straight to DingTalk and @-mentions everyone in the group.


Every time I see this warning, I get nervous.
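The author's alerting code is not shown, so the following is only a sketch of what such DDL handling might look like. It assumes the consumer receives Canal's flat JSON messages (with `isDdl`, `database`, `table`, `type` and `sql` fields) and that the alert is a DingTalk-robot-style text payload with `isAtAll` set. The function name `build_ddl_alert` is hypothetical, and actually sending the payload is left out.

```python
import json
from typing import Optional


def build_ddl_alert(raw_message: str) -> Optional[dict]:
    """Return an alert payload if the message is a DDL event, else None."""
    event = json.loads(raw_message)
    if not event.get("isDdl"):
        return None
    content = "Canal DDL detected on {}.{} ({}): {}".format(
        event.get("database"),
        event.get("table"),
        event.get("type"),
        event.get("sql"),
    )
    # Assumed payload shape: a text message that mentions the whole group.
    return {
        "msgtype": "text",
        "text": {"content": content},
        "at": {"isAtAll": True},
    }
```

Keeping the payload construction separate from the HTTP call makes the DDL check easy to test without touching the chat service.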

Log problem

If the binlog position lies near the end of a large binlog file, Canal prints position-seeking logs frantically. One earlier restart produced several gigabytes of logs, more than 99% of which were lines about locating the binlog file and position. You can either adjust $CANAL_HOME/conf/logback.xml (not recommended), or set the following properties in $CANAL_HOME/conf/<target database instance id>/instance.properties to specify the starting point for parsing manually:

canal.instance.master.journal.name = <binlog file name>
canal.instance.master.position = <position within the binlog file>
canal.instance.master.timestamp = <timestamp>
canal.instance.master.gtid = <GTID value>


These properties must be updated or commented out before the next Canal restart. Otherwise Canal will re-parse from the old position or fail to find the file!
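Since forgetting this step is easy, commenting the four position properties out can be scripted. The following is a minimal sketch that works on the text of instance.properties; the function name `comment_out_position` is hypothetical, and other canal.instance.master.* settings are deliberately left alone.

```python
# The four position properties named in the text above.
POSITION_KEYS = (
    "canal.instance.master.journal.name",
    "canal.instance.master.position",
    "canal.instance.master.timestamp",
    "canal.instance.master.gtid",
)


def comment_out_position(text: str) -> str:
    """Comment out any active binlog position property in properties text."""
    out = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        # Lines that are already commented start with "#" and do not match.
        out.append("#" + line if key in POSITION_KEYS else line)
    # Preserve a trailing newline if the input had one.
    return "\n".join(out) + ("\n" if text.endswith("\n") else "")
```

Running this after Canal has caught up, and before the next restart, leaves the connection settings in place while removing the stale starting point.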

“Every time you restart the Canal service it is a thrill; no other open source software gives you that feeling.” The production server's disk is small (only 100GB was purchased when it was provisioned), and these logs carry little value, so the only option is to delete logs regularly. At first this was done by hand. Later, tired of the chore, the author wrote a Shell script that periodically deletes old log files.
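The author's Shell script is not shown. As a rough equivalent, the sketch below deletes log files older than a given number of days. The function name `delete_old_logs`, the default file pattern `*.log*` and the choice of modification time as the age criterion are all assumptions to adapt to your own log layout.

```python
import time
from pathlib import Path
from typing import List, Optional


def delete_old_logs(
    log_dir: Path,
    max_age_days: float,
    pattern: str = "*.log*",
    now: Optional[float] = None,
) -> List[Path]:
    """Delete files under log_dir matching pattern that are older than max_age_days.

    Age is measured by modification time. Returns the deleted paths.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for path in sorted(log_dir.rglob(pattern)):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Scheduling this from cron with, for example, a seven-day limit keeps the log file currently being written, since its modification time is always recent.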

Cloud RDS MySQL usage issues

If you happen to be using Aliyun's RDS MySQL, you may run into an even bigger pitfall. The main issues are:

  • RDS MySQL has disk space optimization rules. When a rule is triggered, binlog files are uploaded to OSS and the local binlog files are then deleted.
  • According to the Canal documentation, Canal automatically pulls the binlog files from OSS and parses them so that users do not notice anything, but this feature has a BUG and has never worked properly.
  • RDS MySQL is a black box. When something goes wrong, you can only locate the problem through MySQL queries; there is no way to log in to the server and inspect what is actually happening.

When this problem occurs, the usual exception is:

. sqlstate = HY000 errmsg = Could not find first log file name in binary log index file


This basically confirms that the feature is broken. See issue-2596, for example:


At present, the author’s approach is as follows:

  • Completely abandon Canal's feature of pulling binlog files from OSS.
  • Expand the RDS MySQL disk as much as possible and adjust the policy so that as many binlog files as possible stay local for as long as possible, so that they are fully parsed before being uploaded manually or uploaded automatically once an expiration rule is hit. This incurs considerable extra cost, so weigh it for yourself.

As of 2020-08-05, the master branch still has bugs when loading and parsing binlog files on OSS.

This problem has a serious consequence: “there is a high chance that some binlog files are never parsed at all.” Unless those binlog files can be restored to RDS MySQL, the upstream and downstream data must be synchronized manually.

To be continued

In addition, note that Canal is best deployed in master/standby mode. Zookeeper is recommended for storing positions and for cluster management, and Kafka is recommended for canal.serverMode (which supports TCP, Kafka, and RocketMQ; the master branch also has RabbitMQ connector support if you are willing to build it yourself). Each node has fairly high resource requirements: the author's production nodes are 2C8G low-frequency ECS instances, and they feel somewhat strained. In particular, when an instance restarts and the binlog position has to be relocated, CPU usage soars for a while.

The author found that Alibaba Cloud's DTS uses Canal as the underlying middleware for data synchronization, which shows that it has been put into production in real scenarios. I sincerely hope it does not end up as an abandoned KPI project. I don't know how many more problems I will run into, but whenever I do, I will keep updating this guide.

(C-2-D E-A-20200805)