Preface

I believe many of you are already familiar with these two concepts. This article is written in plain language throughout, and I hope it gives you a fresh perspective. If you are not yet familiar with them, I suggest that once you have worked through the article you explain your understanding of read/write separation and database and table sharding to a colleague or friend.

Read/write separation & database and table sharding

Read/write separation

What is read-write separation?

Read/write separation spreads read and write operations across different database nodes. This can slightly improve write performance and significantly improve read performance.

I drew a simple diagram to help those of you who are not yet clear on read/write separation.

In general, we choose one master and several slaves: one master database handles the writes and the slave databases handle the reads. Data is synchronized from the master to the slaves so that the data in the slave databases stays accurate. This architecture is relatively simple to implement and suits systems that read far more often than they write.

What problems does read/write separation cause, and how can they be solved?

Read/write separation is a great way to improve the concurrency of a database, but it can also lead to a problem: there are delays between the master and slave databases. For example, after you write to the master database, it takes time for the data from the master database to be synchronized to the slave database. This time difference causes data inconsistency between the master and slave databases. This is often referred to as master-slave synchronization delay.

There is no good solution to the master-slave synchronization delay problem. Depending on your business scenario, consider the following solutions.

1. Force read requests to the master database.

Since the data in the slave database may be out of date, read it directly from the master database instead. Although this increases the pressure on the master database, it is relatively simple to implement and is the most widely used method I know of.

For example, Sharding-JDBC supports this approach. With Sharding-JDBC's HintManager (its hint-based sharding value manager), we can force a request to be routed to the master database.

    HintManager hintManager = HintManager.getInstance();
    hintManager.setMasterRouteOnly();
    // Continue JDBC operations

With this approach, you can send every read request that must see the latest data to the master database.

2. Delay reading.

Some of you may think: since master-slave synchronization is delayed, I will just read the data after the delay has passed. For example, if the synchronization delay is 0.5s, I read the data after 1s. How convenient! It is convenient, but it is also pretty ridiculous.

However, the idea works much better if you build it into the business process: in data-sensitive scenarios, avoid issuing a read request immediately after the write request completes. For example, after a successful payment you are redirected to a payment success page, and you only return to your account page when you click the back button.

How is read/write separation implemented?

Regardless of the specific implementation of read/write separation, there are generally the following steps to achieve read/write separation:

  1. Deploy multiple databases. Choose one of them as the master database and the remaining one or more as slave databases.
  2. Keep the data in the master and slave databases synchronized in near real time. This process is known as master-slave replication.
  3. The system sends write requests to the master database and read requests to the slave databases.
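
Step 3 can be sketched in a few lines of Java. This is only a minimal illustration under my own assumptions: the class ReadWriteRouter, the node names and the forceMaster flag are hypothetical, and a statement counts as a read simply when it starts with SELECT. Real middleware does much more than this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical router: picks a database node name for each statement. */
final class ReadWriteRouter {
    private final String master;
    private final List<String> slaves;
    private final AtomicInteger next = new AtomicInteger();

    ReadWriteRouter(String master, List<String> slaves) {
        this.master = master;
        this.slaves = new ArrayList<>(slaves);
    }

    /**
     * Returns the node that should run the statement.
     * forceMaster = true models "this read must see the latest data".
     */
    String route(String sql, boolean forceMaster) {
        // Naive read detection (an assumption): the statement starts with SELECT.
        boolean isRead = sql.trim().regionMatches(true, 0, "select", 0, 6);
        if (!isRead || forceMaster || slaves.isEmpty()) {
            return master;
        }
        // Round-robin over the slaves to spread the read load.
        int index = Math.floorMod(next.getAndIncrement(), slaves.size());
        return slaves.get(index);
    }

    public static void main(String[] args) {
        List<String> slaveNames = new ArrayList<>();
        slaveNames.add("slave-1");
        slaveNames.add("slave-2");
        ReadWriteRouter router = new ReadWriteRouter("master", slaveNames);
        System.out.println(router.route("INSERT INTO t_user(name) VALUES ('a')", false));
        System.out.println(router.route("SELECT * FROM t_user", false));
        System.out.println(router.route("SELECT balance FROM t_account", true));
    }
}
```

Passing forceMaster = true is the same idea as forcing reads to the master database to work around synchronization delay.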

At the project level, there are two common ways to implement it:

1. Proxy mode

We can add a proxy layer between the application and the databases. All of the application's data requests go to the proxy layer, which separates the read and write requests and routes them to the corresponding database.

Middleware that provides this kind of functionality includes MySQL Router (official), Atlas (based on MySQL Proxy), MaxScale, and MyCat.

2. Component mode

In this mode, we introduce a third-party component into the application to route read and write requests for us.

This is the one I recommend. At present, this method is the most widely used in Internet companies, and there are many practical cases to learn from. Sharding-JDBC is recommended if you want to use this approach: you only need to add its jar as a dependency, which is very convenient and also saves a lot of operation and maintenance cost.

You can find how to use Sharding-JDBC for read/write separation in the official ShardingSphere documentation.

How does master-slave replication work?

The MySQL binlog records all changes to the data in the MySQL database (all DDL and DML statements executed by the database). Therefore, we can synchronize data from the master database to the slave database based on the MySQL binlog of the master database.

A more specific and detailed process looks like this (image from MySQL Master-slave Replication on the Same Machine):

  1. The master database writes the data changes made in the database to the binlog.
  2. The slave database connects to the master database.
  3. The slave database creates an I/O thread that requests the updated binlog from the master database.
  4. The master database creates a binlog dump thread to send the binlog, and the I/O thread in the slave database receives it.
  5. The I/O thread in the slave database writes the received binlog to the relay log.
  6. The SQL thread in the slave database reads the relay log and applies the changes locally (that is, it executes the SQL again).
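
If it helps, the six steps can be imitated with a toy model in Java. Everything here is an assumption made for illustration: the class names are hypothetical, lists stand in for the binlog and the relay log, plain method calls stand in for the dump, I/O and SQL threads, and a "change" is just a key and a value. It is nowhere near how MySQL is actually implemented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of binlog-based replication; every name here is hypothetical. */
final class ReplicationSketch {
    /** One change event: set key to value. */
    static final class Event {
        final String key;
        final String value;
        Event(String key, String value) { this.key = key; this.value = value; }
    }

    static final class Master {
        final Map<String, String> data = new HashMap<>();
        final List<Event> binlog = new ArrayList<>();

        void write(String key, String value) {
            data.put(key, value);
            binlog.add(new Event(key, value)); // step 1: record the change
        }

        /** Stands in for the binlog dump thread: events from a position onward. */
        List<Event> dump(int fromPosition) {
            return new ArrayList<>(binlog.subList(fromPosition, binlog.size()));
        }
    }

    static final class Slave {
        final Map<String, String> data = new HashMap<>();
        final List<Event> relayLog = new ArrayList<>();
        private int fetched; // binlog position already copied to the relay log
        private int applied; // relay log position already replayed

        /** Stands in for the I/O thread (steps 2 to 5). */
        void fetch(Master master) {
            List<Event> events = master.dump(fetched);
            relayLog.addAll(events);
            fetched += events.size();
        }

        /** Stands in for the SQL thread (step 6). */
        void apply() {
            while (applied < relayLog.size()) {
                Event e = relayLog.get(applied++);
                data.put(e.key, e.value);
            }
        }
    }
}
```

Between write on the master and apply on the slave, the slave still returns the old data. That window is exactly the master-slave synchronization delay discussed earlier.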

How’s that? After reading my explanation of the process of master slave replication, you should have figured it out!

When binlog is mentioned, you usually think of master-slave replication. Of course, besides master-slave replication, binlog can also help with data recovery.

Let me expand on this a little. I don't know if you have used Canal, an open source tool from Alibaba. It lets you synchronize data between MySQL and other data sources such as Elasticsearch or another MySQL database. Naturally, this tool also relies on binlog under the hood. Canal works by simulating the MySQL master-slave replication process: it parses the binlog and synchronizes the data to other data sources.

In addition, the commonly used distributed cache component Redis also achieves read/write separation through master/slave replication.

To summarize: MySQL master-slave replication depends on binlog. In addition, common tools that synchronize MySQL data to other data sources (such as Canal) also rely on binlog under the hood.

Database and table sharding

Read/write separation mainly deals with concurrent database reads and does not solve database storage problems. Just think: what if MySQL has too much data in a table?

In other words, how do we relieve MySQL's storage pressure?

One answer is database and table sharding.

What is database splitting?

Database splitting is the process of distributing the data in one database across several databases.

The following operations all involve database splitting:

  • You move the user table and the user order table from one database into two different databases.
  • Because the user table holds a large amount of data, you split it horizontally and then put the two resulting user tables in two different databases.

What is table splitting?

Table splitting means splitting the data of a single table, either vertically or horizontally.

What is vertical splitting?

Simply put, vertical splitting splits a table by columns. A table with many columns is split into multiple tables.

For example, we can pull some columns out of a user information table into a separate table.

What is horizontal splitting?

In simple terms, horizontal splitting is the splitting of data table rows. A table with many rows is split into multiple tables.

For example, we can split the user information table into multiple user information tables to avoid the performance impact of a single table having too much data.

There is a graphic in Architecture from Scratch that illustrates vertical and horizontal splitting.
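
Horizontal splitting needs a rule that decides which database and table a row belongs to. Below is a minimal Java sketch of one possible rule, taking the user id modulo the total number of tables. The class UserShardRouter and the naming pattern db_N.t_user_M are hypothetical, and modulo is just one strategy I picked for illustration; ranges or hashing are also common.

```java
/** Hypothetical routing rule for a horizontally split user table. */
final class UserShardRouter {
    private final int databaseCount;
    private final int tablesPerDatabase;

    UserShardRouter(int databaseCount, int tablesPerDatabase) {
        if (databaseCount <= 0 || tablesPerDatabase <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }
        this.databaseCount = databaseCount;
        this.tablesPerDatabase = tablesPerDatabase;
    }

    /** Returns a name such as "db_1.t_user_0" for the given user id. */
    String route(long userId) {
        int totalTables = databaseCount * tablesPerDatabase;
        // floorMod keeps the slot non-negative even for negative ids.
        int slot = (int) Math.floorMod(userId, (long) totalTables);
        int database = slot / tablesPerDatabase; // which database
        int table = slot % tablesPerDatabase;    // which table inside it
        return "db_" + database + ".t_user_" + table;
    }

    public static void main(String[] args) {
        UserShardRouter router = new UserShardRouter(2, 2);
        for (long id = 0; id < 5; id++) {
            System.out.println(id + " -> " + router.route(id));
        }
    }
}
```

With a modulo rule, changing the number of tables later moves most rows to a different table, so the counts are worth choosing carefully up front.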

When do we need database and table sharding?

Consider splitting databases and tables in the following scenarios:

  • A single table holds more than 10 million rows, and database reads and writes are slow (split tables).
  • The data in the database takes up more and more space, and backups take longer (split databases).
  • Application concurrency is too high (split databases).

What problems does database and table sharding bring?

Remember, any technology decision you make in the company is not only about whether the technology meets our requirements and fits the current business scenario, but also about the cost.

What challenges does the system face after database and table sharding is introduced?

  • Join operation: Tables that used to live in the same database are now spread across different databases, so they can no longer be joined with JOIN. As a result, we have to assemble the data manually in application code. For example, after you query a row in one database, you use it to look up the corresponding data in another database.
  • Transaction problem: Tables that used to live in the same database are now spread across different databases. If a single operation involves several databases, the database's built-in transactions can no longer meet our requirements.
  • Distributed ID: After the database is split, the data is spread across databases on different servers, and the database's auto-increment primary key can no longer guarantee that generated primary keys are unique. How do we generate globally unique primary keys for different data nodes? At this point, we need to introduce distributed IDs into our system.
  • ...
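
To give the distributed ID item some shape, here is a minimal Java sketch of a Snowflake-style generator. The article does not say which scheme to use, so Snowflake is my own pick for illustration. The bit layout (timestamp, 10 bits of worker id, 12 bits of sequence), the custom epoch and the class name are all assumptions, and failing when the clock moves backwards is only the simplest possible policy.

```java
/** Snowflake-style ID generator sketch; layout and epoch are assumptions. */
final class SnowflakeIdGenerator {
    private static final long EPOCH = 1577836800000L; // 2020-01-01T00:00:00Z, chosen arbitrarily
    private static final int WORKER_BITS = 10;
    private static final int SEQUENCE_BITS = 12;
    private static final long MAX_WORKER = (1L << WORKER_BITS) - 1;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

    private final long workerId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    SnowflakeIdGenerator(long workerId) {
        if (workerId < 0 || workerId > MAX_WORKER) {
            throw new IllegalArgumentException("workerId out of range");
        }
        this.workerId = workerId;
    }

    /** Returns an ID that is unique as long as every node has its own workerId. */
    synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            // Simplest policy: refuse to generate IDs if the clock went backwards.
            throw new IllegalStateException("clock moved backwards");
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // Sequence used up within this millisecond: wait for the next one.
                now = waitUntilAfter(lastTimestamp);
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }

    private static long waitUntilAfter(long timestamp) {
        long now = System.currentTimeMillis();
        while (now <= timestamp) {
            now = System.currentTimeMillis();
        }
        return now;
    }
}
```

Each data node gets its own workerId, so two nodes never produce the same ID even within the same millisecond, and IDs from a single node keep increasing over time.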

In addition, introducing database and table sharding typically requires the involvement of a DBA as well as additional database servers, all of which cost money.

Is there a recommended solution for database and table sharding?

The ShardingSphere project (including Sharding-JDBC, Sharding-Proxy and Sharding-Sidecar) was donated to Apache by Dangdang and is currently maintained mainly by some senior engineers in JD's (Jingdong's) data department.

ShardingSphere can fairly be called the first choice for database and table sharding today. It is feature-rich: besides read/write separation and database and table sharding, it also provides distributed transactions, database governance and other functions.

In addition, ShardingSphere has a mature ecosystem, an active community, thorough documentation, and frequent releases.

How do we migrate data after database and table sharding?

How do we migrate data from the old database (single database, single table) to the new database (the sharded database system)?

The simplest and most common solution is downtime migration: stop the service and run a script that writes the data of the old table to the new databases. Say you post an announcement that the system will undergo maintenance and an upgrade at 2am, when very few people are using it, for an estimated one hour. Then you run a script that writes the old single-table data to the new databases.

If you cannot afford downtime while migrating data, you can consider dual-write. The dual-write scheme targets scenarios where the service cannot be stopped for migration, and it is somewhat harder to implement. Here is how it works:

  • At the application level, every write we apply to the old database is also written to the new sharded database system. That way, we can make sure that the data in the new databases is up to date.

  • In addition, we write a script that copies the existing data from the old single database into the sharded database system.

  • We also write a script that checks the accuracy of the data in the new databases and looks for any missing data.

  • Finally, once the old database has been completely migrated to the new databases, we can change the relevant code in the application.
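
The dual-write flow can be sketched in Java as follows. This is only a toy under my own assumptions: two in-memory maps stand in for the old database and the new sharded system, the class and method names are hypothetical, and real concerns such as partial failures or writes racing with the backfill script are ignored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy dual-write migration; maps stand in for the old and new databases. */
final class DualWriteMigration {
    final Map<Long, String> oldDb = new HashMap<>();
    final Map<Long, String> newDb = new HashMap<>();

    /** Application write path during migration: write the old store, then the new one. */
    void save(long id, String row) {
        oldDb.put(id, row);
        newDb.put(id, row);
    }

    /** Backfill script: copy rows that dual-write has not already written. */
    void backfill() {
        for (Map.Entry<Long, String> e : oldDb.entrySet()) {
            newDb.putIfAbsent(e.getKey(), e.getValue());
        }
    }

    /** Verification script: ids whose rows are missing or different in the new store. */
    List<Long> findMismatches() {
        List<Long> mismatches = new ArrayList<>();
        for (Map.Entry<Long, String> e : oldDb.entrySet()) {
            if (!e.getValue().equals(newDb.get(e.getKey()))) {
                mismatches.add(e.getKey());
            }
        }
        return mismatches;
    }
}
```

Once findMismatches returns an empty list and stays empty, the application code can be switched over to the new databases.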

Dual-write requires us to change code in the application. Is there a way to migrate without downtime and without changing much application code?

We can use the database synchronization tool Canal to read the binlog of the old database and parse it. Based on the parsed result, we write the data to the new databases.

Are there alternatives to database and table sharding?

Besides the sharding solutions above, many companies use TiDB, a currently popular open source distributed relational database.

With TiDB, we do not have to worry about database storage pressure, which saves us a lot of trouble.

Moreover, TiDB natively supports horizontal scale-out and scale-in, offers financial-grade high availability, and is compatible with MySQL 5.7 protocol and