As Internet businesses grow, the data volume of some single tables keeps getting larger. For the sake of service performance and stability, database sharding, table sharding, and data migration become necessary. This article introduces the vivo account team's practice in meeting those requirements.

I. Foreword

Canal is an open-source project from Alibaba. What can it do? In this article you will learn which business pain points the vivo account service solved with Canal, and hopefully draw some inspiration for your own business.

II. Introduction to Canal

1. Introduction

Canal [kə næl] performs incremental log parsing on MySQL databases and provides incremental data subscription and consumption.

Early on, because Alibaba ran dual data centers in Hangzhou and the United States, it needed cross-data-center synchronization, which was implemented mainly with business triggers that captured incremental changes. Starting in 2010, teams gradually tried parsing database logs to obtain incremental changes for synchronization, which gave rise to a large number of incremental database subscription and consumption services.

2. Working principle

2.1 MySQL Master/Slave Replication Principle

Canal's core mechanism depends on MySQL master/slave replication, so let's first briefly review how that replication works.

  • The MySQL master writes data changes to its binary log (binary log events).

  • The MySQL slave copies the master's binary log events to its relay log.

  • The MySQL slave replays the events in the relay log, so the data changes are reflected in its own data.

2.2 MySQL Binary Log Introduction

The MySQL binlog is the database's binary log. It records every statement through which users modify the database (data query statements such as SELECT are not recorded).

If we later set up a master/slave pair and need the slave to stay in sync with the master's contents, the binlog is what makes that possible.
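Concretely, Canal's QuickStart documentation requires the source MySQL to have binlog enabled in ROW format (so full row images are logged) and an account with SELECT, REPLICATION SLAVE, and REPLICATION CLIENT privileges. A minimal my.cnf sketch along those lines:

```ini
[mysqld]
log-bin=mysql-bin   # enable binary logging
binlog-format=ROW   # ROW format records full row images, which Canal relies on
server_id=1         # must be unique within the replication topology
```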

2.3 Working principle of Canal

Canal emulates the MySQL slave's interaction protocol:

  • Canal disguises itself as a MySQL slave and sends a dump request to the MySQL master.

  • The MySQL master receives the dump request and starts pushing its binary log to the "slave" (Canal).

  • Canal parses the binary log objects (originally a byte stream).

  • Canal delivers the parsed binlog data in a specific format for downstream consumption.

2.4 Canal overall architecture

Description:

  • Server represents a running instance of Canal, corresponding to a JVM
  • Instance corresponds to a data queue (one Server hosts 1..n Instances)

An Instance is composed of the following modules:

  • EventParser (data-source access; simulates the slave/master interaction and parses the protocol): connects to the database pretending to be a slave, sends the binlog dump request, receives the binlog and performs protocol parsing and data encapsulation, hands the data to the EventSink layer below for storage, and records the binlog synchronization position.

  • EventSink (the link between Parser and Store): data filtering, merging, processing, and routing to storage.

  • EventStore: manages data storage, including write management of newly parsed binlog objects, position management for subscriptions, and position management of acknowledgements after consumption.

  • MetaManager (incremental subscription & consumption information manager)

    Manages the publish/subscribe relationship for binlog objects as a whole, similar in spirit to an MQ.
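To make the Server/Instance split concrete, here is a minimal instance.properties sketch using commonly documented canal keys (all values are placeholders for your own environment):

```properties
# MySQL master this instance dumps the binlog from
canal.instance.master.address=127.0.0.1:3306
# account with replication privileges on the source MySQL
canal.instance.dbUsername=canal
canal.instance.dbPassword=canal
canal.instance.connectionCharset=UTF-8
# regex deciding which schemas/tables this instance subscribes to
canal.instance.filter.regex=.*\\..*
```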

2.5 Canal Data format

To understand Canal better, let's take a look at the binlog object format it encapsulates internally.

Canal can synchronize DCL, DML, and DDL.

Businesses typically care about the data changes produced by INSERT, UPDATE, and DELETE statements.

EntryProtocol.proto

```
Entry
    Header
        logfileName   [binlog file name]
        logfileOffset [binlog position]
        executeTime   [timestamp of the change in the binlog]
        schemaName    [database name]
        tableName     [table name]
        eventType     [insert/update/delete type]
    entryType         [transaction BEGIN / transaction END / ROWDATA]
    storeValue        [byte data, expandable; the corresponding type is RowChange]

RowChange
    isDdl             [whether this is a DDL change, e.g. CREATE TABLE / DROP TABLE]
    sql               [the specific DDL SQL]
    rowDatas          [the specific insert/update/delete change data; one binlog
                       event can correspond to multiple rows, e.g. a batch]
        beforeColumns [array of Column]
        afterColumns  [array of Column]

Column
    index             [column number]
    sqlType           [JDBC type]
    name              [column name]
    isKey             [whether the column is part of the primary key]
    updated           [whether the value was changed]
    isNull            [whether the value is NULL]
    value             [the concrete value]
```

2.6 Canal Example Demo

Let's look at the data model of the Canal objects parsed from the binlog and deepen our understanding through actual code logic (a client sketch follows the list below):

  • INSERT statements

  • DELETE statements

  • UPDATE statements
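The original screenshots are not reproduced here; as a stand-in, the sketch below follows the pattern of canal's official ClientExample: it connects to a canal server, pulls batches of entries, and prints the before/after column images per event type. Host 127.0.0.1, port 11111, and destination "example" are canal's defaults, so adjust them to your deployment.

```java
import java.net.InetSocketAddress;
import java.util.List;

import com.alibaba.otter.canal.client.CanalConnector;
import com.alibaba.otter.canal.client.CanalConnectors;
import com.alibaba.otter.canal.protocol.CanalEntry.Column;
import com.alibaba.otter.canal.protocol.CanalEntry.Entry;
import com.alibaba.otter.canal.protocol.CanalEntry.EntryType;
import com.alibaba.otter.canal.protocol.CanalEntry.EventType;
import com.alibaba.otter.canal.protocol.CanalEntry.RowChange;
import com.alibaba.otter.canal.protocol.CanalEntry.RowData;
import com.alibaba.otter.canal.protocol.Message;

public class SimpleCanalClient {

    public static void main(String[] args) throws Exception {
        CanalConnector connector = CanalConnectors.newSingleConnector(
                new InetSocketAddress("127.0.0.1", 11111), "example", "", "");
        connector.connect();
        connector.subscribe(".*\\..*");          // subscribe to all schemas/tables

        while (true) {
            Message message = connector.getWithoutAck(100);  // fetch up to 100 entries
            long batchId = message.getId();
            if (batchId == -1 || message.getEntries().isEmpty()) {
                Thread.sleep(1000);
                continue;
            }
            for (Entry entry : message.getEntries()) {
                // transaction BEGIN/END markers carry no row data
                if (entry.getEntryType() != EntryType.ROWDATA) {
                    continue;
                }
                RowChange rowChange = RowChange.parseFrom(entry.getStoreValue());
                EventType eventType = rowChange.getEventType();
                for (RowData rowData : rowChange.getRowDatasList()) {
                    if (eventType == EventType.DELETE) {
                        printColumns(rowData.getBeforeColumnsList()); // deleted row image
                    } else if (eventType == EventType.INSERT) {
                        printColumns(rowData.getAfterColumnsList());  // inserted row image
                    } else if (eventType == EventType.UPDATE) {
                        printColumns(rowData.getBeforeColumnsList()); // row before the update
                        printColumns(rowData.getAfterColumnsList());  // row after the update
                    }
                }
            }
            connector.ack(batchId);  // confirm consumption so the position advances
        }
    }

    private static void printColumns(List<Column> columns) {
        for (Column column : columns) {
            System.out.println(column.getName() + " = " + column.getValue()
                    + "    updated=" + column.getUpdated());
        }
    }
}
```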

2.7 Canal HA mechanism

The stability of online services is extremely important. Canal supports HA; its implementation relies on ZooKeeper, similar to HDFS HA.

Canal HA comes in two parts: Canal Server and Canal Client each have their own HA implementation.

  • Canal Server: to reduce the number of MySQL dump requests on the source database, the same instance must be running on only one server at a time; the instances on other servers stay in standby.

  • Canal Client: to preserve ordering, only one Canal Client may perform get/ack/rollback operations on an instance at a time; otherwise ordering cannot be guaranteed.

The design depends on two ZooKeeper features (not covered in detail here; see the ZooKeeper documentation):

  • Watcher mechanism

  • EPHEMERAL node (tied to the Session lifecycle)

General steps:

  • Whenever a Canal Server wants to start a Canal instance, it first attempts to create an EPHEMERAL node for that instance in ZooKeeper.

  • The server that creates the node successfully starts the instance; the servers that fail keep their copy of the instance in standby.

  • If the node created by Canal Server A disappears, ZooKeeper immediately notifies the other Canal Servers to repeat step 1, and a new Canal Server is selected to start the instance.

  • Each time a Canal Client connects, it first asks ZooKeeper which server is currently running the instance, then connects to it; if the connection becomes unavailable, it retries the lookup (a client-side sketch follows these steps).
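On the client side, canal ships a cluster-aware connector that performs exactly this ZooKeeper lookup. A minimal sketch (the ZooKeeper addresses are placeholders):

```java
import com.alibaba.otter.canal.client.CanalConnector;
import com.alibaba.otter.canal.client.CanalConnectors;

public class HaClientDemo {
    public static void main(String[] args) {
        // Asks ZooKeeper which Canal Server currently runs the "example"
        // instance and connects to it; if that server's EPHEMERAL node
        // disappears, the client re-queries ZooKeeper and reconnects.
        CanalConnector connector = CanalConnectors.newClusterConnector(
                "zk1:2181,zk2:2181,zk3:2181", "example", "", "");
        connector.connect();
        connector.subscribe(".*\\..*");
        // ... consume with getWithoutAck()/ack() as in the demo above ...
        connector.disconnect();
    }
}
```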

2.8 Canal Usage Scenarios

Canal's principles and mechanics have been covered above; now let's look at what problems it can solve in actual business scenarios.

2.8.1 Non-stop Data Migration

In the early stages of a service, storage is often designed in a rough-and-ready way so the business can move fast; the user table or the order table, for example, may be a single table. The common remedy is database and table sharding to solve the capacity and performance problems.

But data migration then faces its biggest constraint: the online service must keep running. If data changes during the migration, guaranteeing consistency is the hardest part.

With Canal, this can be solved by subscribing to the database binlog.

See the vivo account non-stop migration practice below.

2.8.2 Cache Refresh

Databases are not the only data source in an Internet business; Redis, for example, is used everywhere, and when the underlying data changes, the cache must be refreshed.

With Canal subscribing to the binlog of the tables in question, cache refreshing can be decoupled from the business code and done asynchronously, as in the sketch below.
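A minimal sketch of such an invalidator, assuming a hypothetical user table keyed by a user_id column and a user:cache:{id} Redis key scheme (neither comes from the article); it would be called for each RowData produced by a client like the one shown earlier:

```java
import java.util.List;

import com.alibaba.otter.canal.protocol.CanalEntry.Column;
import com.alibaba.otter.canal.protocol.CanalEntry.RowData;

import redis.clients.jedis.Jedis;

public class UserCacheInvalidator {

    private final Jedis jedis = new Jedis("127.0.0.1", 6379);

    // Drop the cache entry for the changed row so the next read reloads from the DB.
    public void onRowChanged(RowData rowData) {
        // DELETE events only carry a before image; other events carry an after image
        List<Column> columns = rowData.getAfterColumnsList().isEmpty()
                ? rowData.getBeforeColumnsList()
                : rowData.getAfterColumnsList();
        for (Column column : columns) {
            if ("user_id".equals(column.getName())) {
                jedis.del("user:cache:" + column.getValue());
            }
        }
    }
}
```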

2.8.3 Task Delivery

Another common scenario is task delivery: when data changes, other dependent systems need to be notified.

The principle is that a task system listens for database changes and writes the changed data to MQ/Kafka for delivery.

For example, when an account is cancelled, downstream services such as orders need to be notified so they can delete or archive that user's data.

With Canal, delivery accuracy can be guaranteed, and the MQ-publishing code is no longer scattered across business systems but gathered in one place, as shown in the figure below:
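As a code-level complement to that flow, here is a minimal producer-side sketch with the Kafka client; the topic name and the JSON payload convention are assumptions, not details from the article. Keying by user id keeps all events for one user in the same partition, preserving their order:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BinlogTaskDispatcher {

    private final KafkaProducer<String, String> producer;

    public BinlogTaskDispatcher(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    // Publish one change event (serialized to JSON upstream), keyed by user id.
    public void dispatch(String userId, String changeEventJson) {
        producer.send(new ProducerRecord<>("account-change-events", userId, changeEventJson));
    }
}
```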

2.8.4 Heterogeneous Data

In large website architectures, databases are routinely sharded to solve capacity and performance problems, but sharding brings new problems of its own.

For example, queries along a different dimension, or aggregate queries, become tricky. We usually solve this through data heterogeneity.

Data heterogeneity means aggregating the tables that would otherwise need join queries into one database, organized by some dimension.

Data heterogeneity can be realized based on Canal, as shown in the figure below:

3. Canal installation and use

For detailed installation, configuration, and usage of Canal, see the official documentation (linked in the references below).

III. Account Practice

1. Practice 1: Database and Table Sharding

1.1 Requirements

  • Pain points:

A huge table: a single table holding more than 300 million rows.

A conventional scheduled-task full migration takes a long time and degrades the service.

  • Core requirement:

Non-stop migration, keeping the service as continuous as possible.

"Change the tires of a car while it is running on the highway."

1.2 Migration Scheme

1.3 Migration Process

The overall process is roughly as follows:

  • Analyze existing pain points of accounts

The data volume of a single table exceeds 300 million rows

Too many unique user identifiers

Services are improperly divided

  • Determine the database/table sharding scheme

  • Determine the data migration scheme

A traditional scheduled-task migration takes too long, and to keep data consistent during migration the service would have to be taken down for maintenance, which hurts users badly.

We settled on Canal for the migration after a thorough investigation and evaluation, confirmed the choice with the middleware and DBA teams; it supports both full and incremental synchronization.

  • The migration is controlled by switching modes: from single-table mode, to double-write mode, to sharded-table mode (a sketch of the switch follows this list).

  • The data migration took a long time, and some unexpected problems came up along the way.

  • After the historical data migration, we officially switched to double-write mode: data is written to both the single table and the sharded tables, reads still go to the single table, and Canal still subscribes to the original single table for data changes.

  • After two weeks of smooth operation online, we officially switched to sharded-table mode. From then on nothing was written to the original single table, so it generated no new binlog. Some problems did surface after this switch (see the lessons below).
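The article does not show vivo's actual switch implementation; the sketch below is one hypothetical way to encode the three modes behind a single write path, with the mode flipped through a configuration switch rather than a redeploy:

```java
public class AccountUserWriter {

    public enum Mode { SINGLE_TABLE, DOUBLE_WRITE, SHARDED_TABLE }

    public interface UserDao { void save(long userId, String payload); }

    private volatile Mode mode = Mode.SINGLE_TABLE;  // updated from a config center
    private final UserDao singleTableDao;
    private final UserDao shardedTableDao;

    public AccountUserWriter(UserDao singleTableDao, UserDao shardedTableDao) {
        this.singleTableDao = singleTableDao;
        this.shardedTableDao = shardedTableDao;
    }

    public void setMode(Mode mode) { this.mode = mode; }

    public void saveUser(long userId, String payload) {
        switch (mode) {
            case SINGLE_TABLE:   // Canal copies the single-table binlog into the shards
                singleTableDao.save(userId, payload);
                break;
            case DOUBLE_WRITE:   // write both; reads still hit the single table
                singleTableDao.save(userId, payload);
                shardedTableDao.save(userId, payload);
                break;
            case SHARDED_TABLE:  // single table frozen, no new binlog is produced
                shardedTableDao.save(userId, payload);
                break;
        }
    }
}
```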

2. Practice 2: Transnational data migration

2.1 Requirements

In the early days of vivo's overseas business, data for some overseas countries was stored in the data center in Singapore, a neutral country. But as legal-compliance requirements in overseas countries grew stricter, especially the EU's GDPR, the vivo account team did a great deal of remediation work to stay compliant.

Compliance requirements in some non-EU countries changed accordingly. Australia, for example, now requires GDPR-level compliance locally, so Australian user data originally stored in the Singapore data center had to be migrated to the EU data center. That raised the overall complexity of the migration; the difficulties were:

  • Non-stop migration: users of already-shipped phones must keep access to account services.

  • Data consistency must be guaranteed even while users are changing their data.

  • Minimal impact on dependent services: the production account service must not be affected.

2.2 Migration Scheme

2.3 Migration Process

  • Set up a standby database in the Singapore data center and synchronize the binlog from master to slave.

  • Deploy the Canal server and client, and subscribe to and consume the binlog.

  • The client parses the subscribed binlog, encrypts the data, and transmits it to the GDPR-compliant data center in the EU.

  • The EU-side application parses the transmitted data and persists it locally.

  • After the data is in sync, O&M colleagues helped switch the DNS resolution of the upper-layer domain names to the EU data center, completing the data switchover.

  • Observe the Canal service in the Singapore data center; if nothing abnormal appears, stop it.

  • On the service side, the account switchover is performed.

  • After the switchover completes, clean up the data in the Singapore data center.

3. Experience summary

3.1 Data Serialization

Canal uses Protobuf as its underlying serialization format. When the canal-client received subscribed change data, NULL values were automatically converted to empty strings; because the ORM layer's comparison logic treated the two differently, some columns in the target table ended up updated to empty strings.
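CanalEntry.Column carries an isNull flag precisely because getValue() alone cannot distinguish SQL NULL from an empty string. A small defensive helper illustrating the fix:

```java
import com.alibaba.otter.canal.protocol.CanalEntry.Column;

public final class ColumnValues {

    private ColumnValues() {}

    // getValue() returns "" both for SQL NULL and for a genuine empty string;
    // only getIsNull() tells them apart, so consult it before writing downstream.
    public static String toSqlValue(Column column) {
        return column.getIsNull() ? null : column.getValue();
    }
}
```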

3.2 Data Consistency

The account service ran only a single canal-client node this time, yet due to business characteristics, data inconsistencies still appeared during the migration. Roughly:

  • A user changes their bound phone number to A.

  • Canal has not yet consumed the binlog up to that position.

  • The user changes the bound phone number again, to B.

  • When Canal later consumes the binlog update carrying phone number A, it overwrites the user's newly bound number B (a guard sketch follows this list).
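One common guard, sketched here as a hypothetical fix rather than vivo's actual one (the table and column names are illustrative), is to make the apply step conditional on an update-time or version column carried in the row image, so that a stale binlog event updates zero rows instead of overwriting a newer value:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OrderedApplier {

    // Apply the replicated change only if it is at least as new as the stored row.
    public void applyPhoneChange(Connection conn, long userId, String phone,
                                 long updateTimeMillis) throws SQLException {
        String sql = "UPDATE account SET phone = ?, update_time = ? "
                   + "WHERE user_id = ? AND update_time <= ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, phone);
            ps.setLong(2, updateTimeMillis);
            ps.setLong(3, userId);
            ps.setLong(4, updateTimeMillis);
            ps.executeUpdate();  // updates 0 rows for stale events, keeping newer data intact
        }
    }
}
```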

3.3 Database Master-slave Delay

For data consistency reasons (the account data did not justify splitting into separate databases), the table sharding was done within the same database. That means the sharded tables were written continuously during the migration, which raised the database load and caused read lag on the slave.

Solution: add rate control, with different policies for different conditions; for example, lower the write rate during heavy daytime traffic and raise it at night when traffic is light.
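A minimal sketch of such rate control using Guava's RateLimiter; the rates are illustrative:

```java
import com.google.common.util.concurrent.RateLimiter;

public class ThrottledMigrationWriter {

    // Start conservatively for daytime peak traffic; raise at night via config.
    private final RateLimiter limiter = RateLimiter.create(200.0);  // rows per second

    public void setRate(double rowsPerSecond) {
        limiter.setRate(rowsPerSecond);
    }

    public void write(Runnable insertOneRow) {
        limiter.acquire();   // blocks until a permit is available
        insertOneRow.run();
    }
}
```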

3.4 Monitoring Alarms

During the overall data migration, the vivo account team added a simple real-time monitor for data synchronization on the client side: in-memory counts broken down by business table.

The monitoring granularity was coarse, and it missed the inconsistencies described above: synchronization finished with no visible anomaly, which later led to service problems in sharded-table mode. Fortunately, the affected data could be compensated by other means, so the impact on online data was small.
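A rough idea of what "counting based on business table and memory" might look like (the shape is illustrative, not vivo's code): tally consumed events per table and event type, then periodically compare the tallies with row counts on the target tables; drift suggests lost or duplicated events.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class SyncMonitor {

    private final ConcurrentHashMap<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Called once per consumed binlog row, e.g. record("account", "UPDATE")
    public void record(String table, String eventType) {
        counters.computeIfAbsent(table + ":" + eventType, k -> new LongAdder()).increment();
    }

    public long count(String table, String eventType) {
        LongAdder adder = counters.get(table + ":" + eventType);
        return adder == null ? 0 : adder.sum();
    }
}
```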

IV. Further Thoughts

1. Analysis of existing problems

The figure above is a simple diagram of Canal's current architecture. Although HA makes it highly available overall, a closer look still reveals hidden risks; the nodes marked with a red X can be regarded as potential points of failure.

2. Reuse of common components

Based on the possible problem points above, we can try the optimizations shown in the figure above.

3. Extended application – multi-data center synchronization

In the Internet industry, multi-site active-active is on everyone's lips, and data synchronization is its foundation: only when the data in every stateful component, such as databases, caches, and MQ, can be synchronized, forming a large and complex data synchronization topology in which sites back up one another's data, can active-active across regions be achieved in the true sense.

That logic is beyond the scope of this article; for details you can refer to the following tutorial, which the author finds well explained: www.tianshouzhi.com/api/tutoria…

V. References

  • Github.com/alibaba/can…

  • Github.com/alibaba/ott…

Author: Vivo Product platform development team