Preface

Hello, everyone. I am the little boy picking up snails. Today, let's talk about splitting databases and tables (sharding).

  1. What is database and table splitting?
  2. Why do we need to split databases and tables?
  3. How do we split databases and tables?
  4. When should we start thinking about splitting databases and tables?
  5. What problems can splitting databases and tables cause?
  6. Sharding middleware

1. What is database and table splitting?

Database splitting: one database is split into multiple databases, deployed on different machines.

Table splitting: one database table is split into multiple tables.

2. Why do we need to split databases and tables?

2.1 Why split databases?

If traffic keeps growing, the database may hit a performance bottleneck, and that is when we need to consider splitting it. The pressure shows up in several aspects:

  • Disk storage

As business volume grows, the disk of a single MySQL machine will eventually fill up. Splitting into multiple databases sharply reduces the disk usage on each machine.

  • Concurrent connection support

We know that the number of database connections is limited. Under high concurrency, a flood of requests hits the database, and a single MySQL instance simply cannot withstand it! The currently popular microservices architecture emerged partly to cope with high concurrency: it splits orders, users, products, and other modules into separate applications, and splits the single database into several databases by functional module (an order database, a user database, a product database), so that the read and write pressure is shared.

2.2 Why split tables?

Too much data slows down SQL queries. If a query misses the index, a table with millions of rows can overwhelm the database.

Even if the SQL hits an index, queries become noticeably slower once a table grows past about 10 million rows. The reason is that the index is usually a B+ tree: more rows make the B+ tree taller, and a taller tree means slower lookups.

While we're at it, let's review how a B+ tree's height relates to the number of records it can hold.

The smallest storage unit of the InnoDB storage engine is the page, which is 16KB. In a B+ tree, leaf nodes store the row data, while non-leaf nodes store keys plus pointers. An index-organized table uses binary search over the non-leaf nodes' keys and follows the pointers to locate the page the data lives in, then finds the required row inside that data page. The B+ tree structure diagram is as follows:

Suppose the height of the B+ tree is 2, so there is one root node and a layer of leaf nodes. The total number of records this B+ tree can store = (number of pointers in the root node) × (number of records in a single leaf node).

  • If each row of data is 1KB, the number of records a single leaf node (one page) can store = 16KB / 1KB = 16.
  • How many pointers can a non-leaf node store? Suppose the primary key is a bigint, which is 8 bytes (an int is 32 bits, i.e. 4 bytes), and the pointer size in the InnoDB source code is 6 bytes, so each entry takes 8 + 6 = 14 bytes. A page can then hold 16KB / 14B = 16 × 1024 / 14 ≈ 1170 pointers.

Thus, a B+ tree of height 2 can hold 1170 × 16 = 18,720 such records. Similarly, a B+ tree of height 3 can store 1170 × 1170 × 16 = 21,902,400 records, roughly 20 million. The height of a B+ tree is normally 1 to 3 levels; once it reaches 4 levels, each lookup needs more disk reads and SQL slows down.
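
To make the arithmetic above concrete, here is a tiny sketch that recomputes the capacities, assuming a 16KB page, 1KB rows, and 14-byte non-leaf entries as in the example:

```java
public class BPlusTreeCapacity {
    public static void main(String[] args) {
        long pageSize = 16 * 1024;    // InnoDB page size: 16KB
        long rowSize = 1024;          // assumed row size: 1KB
        long entrySize = 8 + 6;       // bigint key (8B) + InnoDB pointer (6B)

        long rowsPerLeaf = pageSize / rowSize;       // 16 records per leaf page
        long pointersPerNode = pageSize / entrySize; // 1170 pointers per non-leaf page

        System.out.println("Height 2: " + pointersPerNode * rowsPerLeaf);                   // 18720
        System.out.println("Height 3: " + pointersPerNode * pointersPerNode * rowsPerLeaf); // 21902400
    }
}
```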

Therefore, once the amount of data in a single table exceeds tens of millions of rows, it is necessary to consider splitting the table.

3. How do we split databases and tables?

3.1 Vertical splitting

3.1.1 Vertical database splitting

In the early stages of a business, there are few functional modules, and in order to launch and iterate quickly, a single database is often used to store all the data. The database architecture is as follows:

However, as the business progresses and the system's functions fill out, the system can be split by business domain: a user database, an order database, a points database, a product database, and so on, each deployed on a different database server. This is vertical database splitting.

Vertical database splitting distributes the pressure of a single database across different databases, and it is a good way to cope with high-concurrency scenarios. The architecture after a vertical split is as follows:

3.1.2 Vertical table splitting

If a single table contains dozens or even hundreds of columns, it becomes messy to manage, and running select * every time wastes IO resources. In that case, we can split the infrequently used columns, and the large or long ones, into another table.

For example, suppose a user table contains user_id, user_name, mobile_no, age, email, nickname, address, and user_desc. If email, address, and user_desc are rarely used, we can split them into another table, say a user details table. This is vertical table splitting.

3.2 Horizontal splitting

3.2.1 Horizontal database splitting

Horizontal database splitting means distributing the rows of one table across different database servers. Each server has the same databases and tables, but the data sets in the tables differ. This effectively relieves the performance bottleneck and pressure on a single database.

The architecture of a horizontally split user database is as follows:

3.2.2 Horizontal table splitting

If the amount of data in a table is too large, the rows can be distributed across multiple tables according to certain rules (such as hash mod or range).

For example, an order table split by time range:
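
Diagram aside, here is a rough sketch of the routing an application might do for per-year order tables (the t_order_YYYY naming is a hypothetical convention):

```java
import java.time.LocalDate;

public class TimeRangeRouter {
    // Route an order to a per-year table, e.g. t_order_2023 (hypothetical naming).
    public static String tableFor(LocalDate orderCreateDate) {
        return "t_order_" + orderCreateDate.getYear();
    }

    public static void main(String[] args) {
        System.out.println(tableFor(LocalDate.of(2023, 6, 1))); // t_order_2023
    }
}
```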

3.3 Horizontal splitting strategies

There are several splitting strategies, each suited to different scenarios:

  • Range
  • Hash mod
  • Range + hash mod combined

3.3.1 Range

The range strategy splits a table by value ranges. For example, we can put rows whose primary key is between 0 and 10 million into one table, and rows between 10 million and 20 million into another. The diagram below:

Of course, sometimes we can also split orders by time range: orders from different years go into different tables. That is a range strategy as well.
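
A minimal sketch of range routing by primary key, assuming 10 million IDs per table and table names like t_order_0 (both are assumptions for illustration, not fixed conventions):

```java
public class RangeRouter {
    private static final long RANGE_SIZE = 10_000_000L; // assumed: 10 million ids per table

    // ids 0..9,999,999 -> t_order_0; 10,000,000..19,999,999 -> t_order_1; ...
    public static String tableFor(long orderId) {
        return "t_order_" + (orderId / RANGE_SIZE);
    }

    public static void main(String[] args) {
        System.out.println(tableFor(5_000_000L));  // t_order_0
        System.out.println(tableFor(15_000_000L)); // t_order_1
    }
}
```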

Advantages of this scheme:

  • This scheme makes capacity expansion easy and requires no data migration. Say the data is heading toward 50 million rows: we only need to add one more table horizontally, and the earlier 0~40 million rows do not need to be migrated.

Disadvantages:

  • This scheme has a hot-spot problem: order IDs keep increasing, so for any stretch of time new orders pile into the same table. For example, if the orders of the most recent month all have IDs between 10 million and 20 million, they all land in the order_1 table; since users mostly query their recent month's orders, the requests concentrate on order_1. This causes the hot data problem.

3.3.2 Hash mod

Hash mod strategy: pick a routing key (user_id or the order ID is generally used), take it modulo the total number of sub-tables, and distribute the rows to the tables by the remainder.

For example, suppose the original order table is split into four sub-tables, t_order_0 through t_order_3:

  • For id = 1: 1 mod 4 = 1, so the row goes into t_order_1;
  • For id = 3: 3 mod 4 = 3, so the row goes into t_order_3. (See the routing sketch below.)
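
A minimal sketch of this hash-mod routing with four tables (the t_order_N naming is carried over from the example above):

```java
public class HashModRouter {
    private static final int TABLE_COUNT = 4; // total number of sub-tables

    // The remainder of id / 4 picks the table: t_order_0 .. t_order_3.
    public static String tableFor(long id) {
        return "t_order_" + (id % TABLE_COUNT);
    }

    public static void main(String[] args) {
        System.out.println(tableFor(1)); // 1 mod 4 = 1 -> t_order_1
        System.out.println(tableFor(3)); // 3 mod 4 = 3 -> t_order_3
    }
}
```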

Advantages of this scheme:

  • Hash mod has no obvious hot-spot problem, since data is spread evenly across the tables.

Disadvantages:

  • If we start with four tables split by hash mod, and at some point the data volume hits a bottleneck and we need to expand, things get tricky. Say we grow from 4 tables to 8: a row with id = 5 used to live in t_order_1 (5 mod 4 = 1), but should now live in t_order_5 (5 mod 8 = 5). In other words, historical data has to be migrated.

3.3.3 Range + hash mod combined

Since range suffers from hot data, and hash mod makes expansion and data migration difficult, we can combine the two schemes, taking the strengths of each and discarding the weaknesses.

A relatively simple approach is to split at the database level with range first: for example, order IDs in the range 0 to 40 million go into order DB 1, IDs from 40 to 80 million go into order DB 2, and in a future expansion IDs from 80 to 120 million go into order DB 3. Then, within each order DB, the hash mod strategy distributes the orders across the different tables.
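
Putting the two together, here is a minimal sketch where range picks the database and hash mod picks the table (the sizes and the order_db_N / t_order_N names are assumptions for illustration):

```java
public class RangeHashRouter {
    private static final long DB_RANGE = 40_000_000L; // assumed: 40 million ids per database
    private static final int TABLES_PER_DB = 4;       // assumed: 4 tables per database

    // Range decides the database; hash mod decides the table inside it.
    public static String route(long orderId) {
        long dbIndex = orderId / DB_RANGE;         // ids 0..39,999,999 -> order_db_0, etc.
        long tableIndex = orderId % TABLES_PER_DB;
        return "order_db_" + dbIndex + ".t_order_" + tableIndex;
    }

    public static void main(String[] args) {
        System.out.println(route(5L));          // order_db_0.t_order_1
        System.out.println(route(50_000_000L)); // order_db_1.t_order_0
    }
}
```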

4. When should we start thinking about splitting databases and tables?

4.1 When should we split tables?

If your system is in a period of rapid growth, say the order table gains hundreds of thousands of rows every day and order queries are getting slower, you need to plan for table splitting. Generally, a B+ tree index performs best at a height of 2~3 levels; once the data reaches the tens of millions, the height may grow to 4 and queries become significantly slower. A common rule of thumb in the industry is to start considering table splitting at around 5 million rows.

4.2 When should we split databases?

When the business grows rapidly and multiple services share a single database, the database becomes a performance bottleneck, and it is time to consider splitting it. For example, orders, users, and so on can each be extracted into a new application (this is essentially the microservices idea), with the database split accordingly (an order database, a user database).

5. What problems can splitting databases and tables cause?

After splitting databases and tables, some new problems appear:

  • Transaction issues
  • Cross-database joins
  • Sorting problems
  • Paging problems
  • Distributed IDs

5.1 Transaction Issues

If two tables end up in different databases, a local transaction no longer works across them, and a distributed transaction is required.
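
To see why, here is a plain-JDBC sketch (the schemas and SQL are hypothetical): each database commits its own local transaction independently, so a crash between the two commits leaves the databases inconsistent. Closing that window is exactly what distributed-transaction schemes (two-phase commit, compensation, and so on) are for.

```java
import java.sql.Connection;

public class CrossDbWrite {
    // Two local transactions on two databases: each commit is independent.
    public static void placeOrder(Connection orderDb, Connection userDb) throws Exception {
        orderDb.setAutoCommit(false);
        userDb.setAutoCommit(false);
        try {
            orderDb.createStatement()
                   .executeUpdate("INSERT INTO t_order (id, user_id) VALUES (1, 100)");    // hypothetical schema
            userDb.createStatement()
                  .executeUpdate("UPDATE t_user SET points = points + 10 WHERE id = 100"); // hypothetical schema
            orderDb.commit();
            // A crash right here commits the order but loses the points update:
            // the two local transactions cannot protect each other.
            userDb.commit();
        } catch (Exception e) {
            orderDb.rollback();
            userDb.rollback();
            throw e;
        }
    }
}
```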

5.2 Cross-Database Joins

Cross-node join problem: this can be solved by running two queries and assembling the result in the application, as sketched below.
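
A sketch of the two-query approach, assuming a t_order table in the order database and a t_user table in the user database (both hypothetical):

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TwoStepJoin {
    // Instead of a cross-database "t_order JOIN t_user", query each database
    // separately and join the rows in application code.
    public static Map<Long, String> userNamesForOrders(Connection orderDb, Connection userDb)
            throws SQLException {
        // Step 1: collect the user ids referenced by orders.
        Set<Long> userIds = new HashSet<>();
        try (ResultSet rs = orderDb.createStatement()
                .executeQuery("SELECT DISTINCT user_id FROM t_order")) {
            while (rs.next()) userIds.add(rs.getLong(1));
        }
        // Step 2: fetch those users from the user database. (For brevity the IN
        // list is inlined; real code should use placeholders and handle an empty set.)
        String inList = userIds.toString().replace('[', '(').replace(']', ')');
        Map<Long, String> names = new HashMap<>();
        try (ResultSet rs = userDb.createStatement()
                .executeQuery("SELECT id, user_name FROM t_user WHERE id IN " + inList)) {
            while (rs.next()) names.put(rs.getLong(1), rs.getString(2));
        }
        return names;
    }
}
```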

5.3 Sorting Problems

count, order by, group by, and other aggregate functions across nodes: compute the result on each node separately, then merge them on the application side.
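
For instance, a global count can be computed by summing the per-node counts in application code (a sketch, assuming one JDBC connection per shard and the t_order table from earlier):

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;

public class ShardedCount {
    // COUNT(*) across shards: run the count on each node, sum on the application side.
    public static long totalOrders(List<Connection> shards) throws SQLException {
        long total = 0;
        for (Connection shard : shards) {
            try (ResultSet rs = shard.createStatement()
                    .executeQuery("SELECT COUNT(*) FROM t_order")) {
                rs.next();
                total += rs.getLong(1);
            }
        }
        return total;
    }
}
```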

5.4 Paging Problems

  • Scheme 1: find the qualifying rows on each node, then aggregate and page them in application code (see the sketch after this list).
  • Scheme 2: hand paging over to the front end: it passes pageSize and pageNo, each database node pages its own data, and the front end aggregates the results. The drawback is that this can produce empty reads, and if the pages need to be sorted across nodes it becomes difficult.
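
A sketch of scheme 1: every shard returns its own first pageNo × pageSize rows already sorted, the application merges and re-sorts them, then slices out the requested page. (Rows are modeled here as {orderId, createTime} pairs purely for illustration.)

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ShardedPaging {
    // perShardTopRows: each shard's first pageNo*pageSize rows, each row = {orderId, createTime}.
    public static List<long[]> page(List<List<long[]>> perShardTopRows, int pageNo, int pageSize) {
        List<long[]> merged = new ArrayList<>();
        perShardTopRows.forEach(merged::addAll);
        merged.sort(Comparator.comparingLong(row -> row[1])); // re-sort globally by createTime
        int from = (pageNo - 1) * pageSize;
        int to = Math.min(from + pageSize, merged.size());
        return from >= merged.size() ? List.of() : merged.subList(from, to);
    }
}
```

Note that each shard must return pageNo × pageSize rows, not just pageSize, otherwise the merged page may miss rows; this is why deep pagination gets expensive after sharding.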

5.5 Distributed IDs

After the database is sharded, we can no longer rely on the database's own primary key generation mechanism. The simplest option is a UUID, or we can use the Snowflake algorithm to generate distributed IDs.
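
A minimal Snowflake sketch with the common bit layout (41-bit timestamp, 10-bit worker id, 12-bit per-millisecond sequence); the custom epoch here is an arbitrary assumption:

```java
public class Snowflake {
    private static final long EPOCH = 1609459200000L; // assumed custom epoch: 2021-01-01
    private final long workerId;  // 10 bits: which machine generated the id
    private long lastTimestamp = -1L;
    private long sequence = 0L;   // 12 bits: counter within one millisecond

    public Snowflake(long workerId) { this.workerId = workerId & 0x3FF; }

    // Layout: | 41-bit timestamp | 10-bit worker id | 12-bit sequence |
    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF;  // up to 4096 ids per ms per worker
            if (sequence == 0) {                // exhausted: spin until the next ms
                while ((now = System.currentTimeMillis()) <= lastTimestamp) { }
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << 22) | (workerId << 12) | sequence;
    }

    public static void main(String[] args) {
        Snowflake generator = new Snowflake(1);
        System.out.println(generator.nextId()); // roughly time-ordered
    }
}
```

Because Snowflake IDs are roughly time-ordered, inserts keep appending to the right side of the B+ tree, which also helps insert performance.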

6. Sharding middleware

At present, there are many popular sharding middleware options:

  • Cobar
  • Mycat
  • Sharding-JDBC
  • Atlas
  • TDDL (Taobao)
  • Vitess