Large table optimization: splitting databases and tables (sharding)

Preface: this content comes mostly from notes I made earlier.

When a single MySQL table holds too many rows, the CRUD performance of the database drops significantly. Some common optimization measures are as follows:

1. Limit the scope of your data

Never allow queries that have no condition limiting the data range. For example, when users query their order history, we can restrict the query to a window of one month.
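One way to enforce this rule is in the code that builds the query. The sketch below is only an illustration: the `orders` table, its `user_id` and `created_at` columns, and the 31-day cap are assumptions, not part of the original notes.

```python
from datetime import datetime, timedelta

MAX_RANGE = timedelta(days=31)  # assumed cap: roughly one month


def bounded_order_query(user_id, start, end):
    """Build a parameterized order-history query and refuse unbounded ranges."""
    if start is None or end is None:
        raise ValueError("both a start and an end time are required")
    if end < start:
        raise ValueError("end must not be earlier than start")
    if end - start > MAX_RANGE:
        # Clamp to the most recent month of the requested range.
        start = end - MAX_RANGE
    sql = ("SELECT * FROM orders "  # hypothetical table and columns
           "WHERE user_id = %s AND created_at >= %s AND created_at < %s")
    return sql, (user_id, start, end)


# A request for more than a year of history is cut down to one month.
sql, params = bounded_order_query(7, datetime(2023, 1, 1), datetime(2024, 3, 31))
```

Clamping silently is one design choice; rejecting the request with an error is an equally reasonable one.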

2. Read/write separation

This is the classic database split scheme: the master database handles writes, and the slave databases handle reads.
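A minimal sketch of that routing decision, assuming connections are represented by plain names and that a statement starting with `SELECT` counts as a read. The `ReadWriteRouter` class is hypothetical, not the API of any real middleware.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the master and spread reads over the slaves."""

    def __init__(self, master, slaves):
        self.master = master
        # Round-robin over the slaves; fall back to the master if there are none.
        self._readers = itertools.cycle(slaves or [master])

    def route(self, sql):
        first_word = sql.lstrip().split(None, 1)[0].upper()
        if first_word == "SELECT":
            return next(self._readers)
        return self.master


router = ReadWriteRouter("master", ["slave1", "slave2"])
router.route("INSERT INTO orders VALUES (1)")  # goes to "master"
router.route("SELECT * FROM orders")           # goes to "slave1"
```

With asynchronous replication a slave can lag behind the master, so a read issued right after a write may need to go to the master as well.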

Splitting databases: the data is hashed on the order ID (or another ID field) and distributed to the corresponding database.

After the split, the load on each database drops sharply. With three databases, for example, each one carries roughly 1/3 of the original load, and SQL execution time drops as well.

The first step is to split the database into several databases, each containing exactly the same tables.
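As a rough illustration of hashing on the order ID, the sketch below spreads IDs over three databases with a simple modulo. The database names and the count of three are assumptions chosen to match the 1/3 figure above.

```python
from collections import Counter

DATABASES = ["order_db_0", "order_db_1", "order_db_2"]  # hypothetical names


def database_for(order_id):
    """Pick the database for an order by taking its ID modulo the database count."""
    return DATABASES[order_id % len(DATABASES)]


# 9,000 consecutive order IDs land evenly: 3,000 per database.
counts = Counter(database_for(order_id) for order_id in range(1, 9001))
```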

1. Sharding (sub-database, sub-table) middleware

When data arrives, how is it distributed to the various databases? This is the job of sharding middleware: database middleware that routes incoming data to different databases based on the order ID or other fields.

There are two types. One is a proxy that is deployed independently. The other works in client mode: it is a JAR package that each application includes as a dependency, and the application routes the data itself.

Sharding-JDBC (a client-layer solution) is relatively stable and is now widely used. It needs no additional deployment and no second forwarding hop, so performance is high. The downside is that when the library is upgraded, every system that uses it has to upgrade and redeploy, and every system is coupled to the Sharding-JDBC dependency.

Mycat (a proxy-layer solution) is relatively young, and you have to deploy, operate, and maintain it yourself. The advantage is that it is transparent to applications: because it is deployed independently, it can be upgraded on its own without touching the applications.

2. How exactly should the data be split?

There are two approaches: horizontal splitting and vertical splitting.

**Vertical split:** split by column. Each table keeps fewer columns and contains only a subset of the fields. Generally, the frequently accessed fields are placed in one table and the less frequently accessed fields in another. Because the frequently accessed table has smaller rows, more rows fit in the cache. For example, one large order table can be taken apart into an order table, an order payment table, and an order goods table.

Advantages of vertical splitting: each row is smaller, so a query reads fewer blocks and performs fewer I/O operations. Vertical splitting also simplifies the table structure and makes it easier to maintain.
Disadvantages of vertical splitting: the primary key is duplicated across tables, redundant columns have to be managed, and queries may need Join operations (which can be handled by joining in the application layer). Vertical splitting also makes transactions more complex.
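The application-layer join mentioned above can be sketched with Python's built-in `sqlite3` module standing in for MySQL. The `order_base` and `order_detail` tables and their columns are hypothetical; the point is only that the hot table is read alone, and the cold table is fetched by primary key when needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_base   (id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT);
CREATE TABLE order_detail (id INTEGER PRIMARY KEY, remark TEXT);
INSERT INTO order_base   VALUES (1, 42, 'PAID'), (2, 42, 'NEW');
INSERT INTO order_detail VALUES (1, 'leave at door'), (2, 'gift wrap');
""")


def load_orders(user_id, with_detail=False):
    """Read the hot table; join the cold table in the application only on request."""
    rows = conn.execute(
        "SELECT id, user_id, status FROM order_base WHERE user_id = ? ORDER BY id",
        (user_id,)).fetchall()
    orders = [{"id": r[0], "user_id": r[1], "status": r[2]} for r in rows]
    if with_detail and orders:
        marks = ",".join("?" * len(orders))
        details = dict(conn.execute(
            f"SELECT id, remark FROM order_detail WHERE id IN ({marks})",
            [o["id"] for o in orders]).fetchall())
        for o in orders:
            o["remark"] = details.get(o["id"])
    return orders
```

Most list pages would call `load_orders(42)` and never touch `order_detail`; a detail page would pass `with_detail=True`.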

**Horizontal split:** split by row. Suppose a single table holds 6 million rows; after the split, the rows are spread over several databases, each holding a portion (which can then be divided further into multiple tables).

1. By hash: take the ID modulo the number of databases to decide which database a row goes into. (The alternative is to split by time range on the order creation time, for example by month, such as January and February, or by week.) The advantage of hashing is that it balances requests and load evenly across the databases. The disadvantage is that expanding capacity is troublesome, because expansion requires data migration.

2. By range: expanding capacity later is convenient. The downside is that most queries target the most recent data, so the newest range takes most of the load while data from earlier months is rarely accessed. Whether to use range in production depends on the scenario: it fits when users access current and historical data evenly rather than only the latest data.
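The two rules, and the migration cost of hashing, can be sketched as follows. The table naming scheme and the shard counts are assumptions for illustration.

```python
from datetime import date


def hash_shard(order_id, shard_count):
    """Hash rule: ID modulo the number of shards."""
    return order_id % shard_count


def range_shard(created):
    """Range rule: one table per creation month (hypothetical naming scheme)."""
    return f"orders_{created.year}_{created.month:02d}"


def moved_on_expand(ids, old_count, new_count):
    """Count the rows whose shard changes when the shard count changes."""
    return sum(1 for i in ids
               if hash_shard(i, old_count) != hash_shard(i, new_count))


# Growing from 3 to 4 hash shards relocates 9,000 of 12,000 rows (75%).
moved = moved_on_expand(range(12000), 3, 4)

# A range rule never moves old rows; new months simply get new tables.
table = range_shard(date(2024, 1, 15))  # "orders_2024_01"
```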

With a relatively even distribution, SQL execution efficiency improves several times over, and disk usage on each database drops as well.

"Splitting tables" usually refers to horizontal splitting; vertical splitting is usually done at database design time.

**Advantages:** horizontal splitting can support very large data volumes. Note that splitting tables alone only solves the problem of a single table holding too much data. Because the tables still sit on the same machine, it does little to improve MySQL concurrency, so horizontal splitting is best done across databases.

**Disadvantages:** horizontal splitting supports very large data volumes and requires few changes on the application side, but sharded transactions are hard to solve, cross-node Joins perform poorly, and the logic is complicated. The author of "The Way to Train Java Engineers" recommends avoiding data sharding as much as possible because of the complexity it brings to logic, deployment, and operations. With proper optimization, a typical table can handle data volumes below 10 million rows. If sharding is unavoidable, prefer the client-side sharding architecture, which saves the extra network I/O hop through a middleware proxy.

3. How do you migrate a running, unsharded system to a sharded system without stopping it?

For example, how do you migrate 6 million rows into 3 databases with 4 tables each, so that each table holds 500,000 rows?

1. Migration with a long shutdown (post a notice stating when the site will be down for maintenance)

Run a temporary background migration program on several machines with, say, 20 threads each. It reads the rows from the old database and writes them through the database middleware, which distributes them according to the configured rules.

Then change the system configuration so that the application writes to the database middleware, which distributes the data. At that point the migration is done.

Cons: several hours of downtime. If the migration does not finish in time, roll back and continue the next early morning.

2. Dual-write migration without downtime

This is the more common approach.

Modify every place in the system that writes to the database so that it writes both to the old database and to the new sharded databases and tables.

Migrate the existing data in the background.

At the same time, start a background data migration tool. (A standard table design includes a last-modified-time field.) For each row, the tool checks whether the row already exists in the sharded tables. If it does not, the row is written directly. If it does, the timestamps of the two copies are compared, and the sharded copy is overwritten only when the copy in the old database is newer. After one full round, run a check to see whether the data on the two sides is consistent. For rows that differ, read them again from the single database and single table, compare the timestamps, and decide whether they need to be overwritten.

The last step is to modify the code so that it writes only to the sharded databases and tables, then deploy once more.
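The dual-write and backfill steps can be sketched with in-memory dictionaries standing in for the databases. The `DualWriter` class and the `shard_key` rule (3 databases with 4 tables each, matching the example above) are hypothetical.

```python
def shard_key(row_id):
    """Assumed routing rule: 3 databases x 4 tables."""
    return (row_id % 3, (row_id // 3) % 4)


class DualWriter:
    def __init__(self):
        self.old = {}     # id -> (value, updated_at): the old single table
        self.shards = {}  # (db, table) -> {id -> (value, updated_at)}

    def write(self, row_id, value, updated_at):
        """Online write path during migration: write both sides."""
        self.old[row_id] = (value, updated_at)
        self._write_shard(row_id, value, updated_at)

    def _write_shard(self, row_id, value, updated_at):
        self.shards.setdefault(shard_key(row_id), {})[row_id] = (value, updated_at)

    def backfill(self):
        """Background tool: copy rows that are missing or older in the shards."""
        copied = 0
        for row_id, (value, ts) in self.old.items():
            existing = self.shards.get(shard_key(row_id), {}).get(row_id)
            if existing is None or existing[1] < ts:
                self._write_shard(row_id, value, ts)
                copied += 1
        return copied

    def mismatches(self):
        """Consistency check: IDs whose two copies differ."""
        return [i for i, row in self.old.items()
                if self.shards.get(shard_key(i), {}).get(i) != row]


w = DualWriter()
w.old[1] = ("a", 100)   # rows that existed before dual-write was enabled
w.old[2] = ("b", 100)
w.write(2, "b2", 200)   # an online update arrives during the migration
w.backfill()            # copies only row 1; row 2 is already newer in the shards
```

The timestamp comparison is what keeps the backfill from overwriting a newer dual-written row with stale data.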

4. How do you design a sharding scheme that can be expanded and shrunk dynamically?

1. Choose a database middleware; research, study, and test it

2. Design the scheme: decide how many databases to use and how many tables each database is split into

3. Based on the chosen middleware, set up the sharded databases and tables in a test environment, then verify that reads and writes against them work normally

4. Complete the migration from a single database and single table to the sharded setup, using the dual-write scheme

5. The online system starts serving traffic from the sharded databases and tables

6. How do you expand capacity dynamically?

Solutions:

1. Expand with a shutdown

This works just like the shutdown migration: the service is taken offline, the data is moved through the database middleware, and the configuration is changed at the end.

Disadvantages: the downtime is long, and the expansion itself takes a long time.

2. Optimized scheme

Split into 32 databases with 32 tables each (32 * 32). This can basically bear tens of thousands of write requests per second.

Do the 32 * 32 split once, at the very beginning.

A MySQL server can host multiple databases. If each database server hosts 8 databases, four servers host all 32 databases, and each database has 32 tables.

To double the write concurrency, add four machines and migrate some of the databases onto them.

The usual reasons for expanding are that the database disks are almost full or that the databases can no longer bear the write concurrency.

Expansion only requires migrating whole databases.

You can grow to at most 32 servers, where each server hosts one database with 32 tables. The number of databases and the number of tables per database never change. Routing first takes the ID modulo 32 to pick the database, then divides the ID by 32 and takes the result modulo 32 to pick the table.

The biggest advantage is that the data never has to be re-sharded; you only change the configuration, that is, the database addresses.

In short, only the number of servers changes.

Shrinking works the same way: reduce the number of servers, with no need to extract and redistribute the data inside.

The whole idea: change the number of servers, and let the DBA move whole databases between them.
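A sketch of the fixed 32 * 32 routing, plus a server map that is the only thing to change on expansion. Assigning databases to servers in contiguous blocks is an assumption; any mapping kept in configuration would do.

```python
DB_COUNT = 32
TABLES_PER_DB = 32


def route(order_id):
    """Fixed rule: ID % 32 picks the database, (ID / 32) % 32 picks the table."""
    db = order_id % DB_COUNT
    table = (order_id // DB_COUNT) % TABLES_PER_DB
    return db, table


def server_map(server_count):
    """Map the 32 logical databases onto servers (assumed: contiguous blocks)."""
    if DB_COUNT % server_count != 0:
        raise ValueError("server count must divide 32")
    per_server = DB_COUNT // server_count
    return {db: db // per_server for db in range(DB_COUNT)}


db, table = route(1025)     # (1, 0): the same before and after expansion
before = server_map(4)      # 8 databases per server
after = server_map(8)       # 4 databases per server
before[db], after[db]       # database 1 stays on server 0 in both layouts
before[31], after[31]       # database 31 moves from server 3 to server 7
```

Because `route` never changes, no row ever moves between databases or tables; only whole databases move between servers.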

5. How do you handle the global primary key ID after splitting databases and tables?

If nothing is done, every table auto-increments its own IDs, so different tables will produce the same ID. A plain per-table auto-increment ID therefore cannot be used.

1. Database auto-increment ID (a dedicated global database for generating primary keys)

Set up one ID-generation database for the whole system. Before inserting a business row, first insert a row into a table in this global database to obtain an auto-increment ID (the first insert returns 1, and so on), then use that ID to write the row into its own sharded database.

Suitable for situations with low concurrency but large data volumes.
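A minimal sketch of the idea, with Python's built-in `sqlite3` standing in for the dedicated MySQL ID database. The `id_seq` table and its `stub` column are hypothetical.

```python
import sqlite3

# Stands in for the single global ID-generation database.
id_db = sqlite3.connect(":memory:")
id_db.execute(
    "CREATE TABLE id_seq (id INTEGER PRIMARY KEY AUTOINCREMENT, stub TEXT)")


def next_global_id():
    """Insert a placeholder row and return the auto-increment ID it received."""
    cur = id_db.execute("INSERT INTO id_seq (stub) VALUES ('x')")
    id_db.commit()
    return cur.lastrowid


first = next_global_id()    # 1
second = next_global_id()   # 2
target_db = second % 3      # then route the business row by this ID
```

Every ID request is a write to one database, which is why this approach suits low-concurrency workloads.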

2. UUID

It is almost guaranteed to be unique, but it is too long and performs poorly as a primary key, so it is not suitable for that purpose.

Suitable scenarios: randomly generated file names, serial numbers, and the like. Do not use a UUID as the primary key.
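To make the length concrete, the sketch below uses the standard `uuid` module. The file name pattern is only an illustration.

```python
import uuid

u = uuid.uuid4()
text_form = str(u)   # 36 characters, including 4 hyphens
raw_form = u.bytes   # 16 bytes, twice the size of an 8-byte BIGINT key

# Random UUIDs are not ordered, which is fine for something like a file name.
file_name = f"{uuid.uuid4().hex}.png"
```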

3. Current system time

Under high concurrency the same timestamp will be produced more than once, so on its own this is generally not considered.

Suitable scenarios: concatenate the time with other fields of the current business to form a globally unique number. For example, order number = timestamp + user ID + business code.
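One possible way to compose such an order number is shown below. The field widths (millisecond timestamp, 8-digit user ID, 2-digit business code) are assumptions.

```python
from datetime import datetime


def order_number(now, user_id, biz_code):
    """Concatenate a millisecond timestamp, a user ID, and a business code."""
    millis = now.microsecond // 1000
    return f"{now:%Y%m%d%H%M%S}{millis:03d}{user_id:08d}{biz_code:02d}"


number = order_number(datetime(2024, 1, 2, 3, 4, 5, 678000), 42, 7)
# "20240102030405" + "678" + "00000042" + "07"
```

Two orders collide only if the same user places them for the same business code within the same millisecond.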

4. The snowflake algorithm

The ID is a 64-bit long, laid out in binary as follows.

The first bit is 0, so the ID is always positive.

The next 41 bits are the timestamp in milliseconds, converted to a 41-bit binary value after some calculation.

The next five bits are the machine room ID in binary, so at most 32 machine rooms are supported.

The next five bits are the machine ID in binary, so at most 32 machines per machine room are supported. A value outside this range is rejected with an error.

The last twelve bits are a sequence number. If another ID is requested on the same machine in the same machine room within the same millisecond, the sequence number is incremented by 1. At most 4096 IDs can be generated per millisecond this way.
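The bit layout above can be sketched as follows. This is not a reference implementation: the start time `EPOCH_MS` (2020-01-01 UTC), the injectable `clock` parameter, and the error raised when the clock moves backwards are all assumptions added for illustration.

```python
import threading
import time


class Snowflake:
    EPOCH_MS = 1577836800000  # assumed start time: 2020-01-01 00:00:00 UTC

    def __init__(self, datacenter_id, worker_id, clock=None):
        if not 0 <= datacenter_id < 32:
            raise ValueError("datacenter_id must be in [0, 31]")
        if not 0 <= worker_id < 32:
            raise ValueError("worker_id must be in [0, 31]")
        self.datacenter_id = datacenter_id
        self.worker_id = worker_id
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                # Same millisecond: bump the 12-bit sequence number.
                self._sequence = (self._sequence + 1) & 0xFFF
                if self._sequence == 0:
                    # All 4096 values used; wait for the next millisecond.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            # 41 bits of time | 5 bits room | 5 bits machine | 12 bits sequence
            return (((now - self.EPOCH_MS) << 22)
                    | (self.datacenter_id << 17)
                    | (self.worker_id << 12)
                    | self._sequence)


gen = Snowflake(datacenter_id=3, worker_id=5)
new_id = gen.next_id()
```

Shifting right recovers each field: `(new_id >> 17) & 0x1F` is the machine room ID and `(new_id >> 12) & 0x1F` is the machine ID.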