The company has recently been working on service separation and data sharding, because individual package tables hold too much data and are growing by about 600,000 records (60W) per day. I had heard of database and table sharding before and read a few blog posts, but I only had a vague idea of it, and looking back now, all of it is hazy.

Today I spent a whole afternoon reading articles on database and table sharding. Here is a summary of what I extracted from them (I am still looking forward to hands-on practice later). It covers the following aspects:

Part one: the problems a real website runs into as it grows.

Part two: the kinds of splitting, the difference between vertical and horizontal, and where each applies.

Part three: open source products and technologies on the market, with their advantages and disadvantages.

Part four: probably the most important, why horizontal database and table sharding is not recommended. Knowing this lets you plan carefully in the early stages and avoid the problems that sharding causes.

Terminology: "library" means database; "table" means table; splitting databases and tables is called sharding.

At the start of the database architecture's evolution, we could only use a single stand-alone database. Then, as requests kept growing, we separated writes from reads: reads go to slave replicas, writes go to the master, and data is replicated from the master to the slaves to keep it consistent. This architecture is master-slave replication. The slaves can scale horizontally, so a growing number of read requests is not a problem.
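The read/write separation described above can be sketched as a small router. This is only an illustration: the class name `ReadWriteRouter`, the server names, and the round-robin choice of replica are assumptions, not something the setup above prescribes.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the master and spread reads across the slaves."""

    WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "REPLACE"}

    def __init__(self, master, slaves):
        self.master = master
        # Round-robin over the slaves; fall back to the master if there are none.
        self._slaves = itertools.cycle(slaves or [master])

    def route(self, sql):
        """Return the name of the server that should run this statement."""
        parts = sql.split(None, 1)
        if not parts:
            raise ValueError("empty SQL statement")
        if parts[0].upper() in self.WRITE_VERBS:
            return self.master
        return next(self._slaves)


router = ReadWriteRouter("master", ["slave-1", "slave-2"])
print(router.route("INSERT INTO orders VALUES (1)"))  # goes to the master
print(router.route("SELECT * FROM orders"))           # goes to a slave
```

Adding more slaves only lengthens the list passed to the router, which is why reads scale out easily while every write still lands on the one master.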

But what happens when the number of write requests increases? Adding another master does not solve the problem: to keep data consistent, every write must be synchronized between the two masters, so each master still performs every write, and the setup becomes more complex.

In this case, we need sharding to split up the write load.

Every such problem comes down to something being too big or too small. What we face here is too much data.

Too many user requests. The TPS, memory, and IO of a single server are limited. Solution: spread requests across multiple servers. A user request and an SQL query are essentially the same thing, a request for a resource, except that the user request also passes through the gateway, routing, the HTTP server, and so on.

A single database is too large. The processing capacity of a single database is limited, the disk space on its server runs out, and operations on the single database hit an IO bottleneck. Solution: split it into more, smaller databases.

A single table is too large. CRUD operations become a problem: indexes bloat and queries time out. Solution: split it into multiple tables with smaller data sets.

Databases and tables are generally split either vertically or horizontally. These terms describe the shape of the resulting data sets; the split itself happens in physical space. We start from the problems we face and solve them in order. First, when there are too many user requests, we add machines (this is not the focus of this article).

Next, the single database is too large. Here we need to determine whether there is too much data because there are too many tables, or because a single table holds too much data. If there are too many tables, use vertical splitting to divide them into different databases by business.

If a single table holds too much data, use horizontal splitting: divide the table's rows among multiple tables according to some rule, possibly multiple tables across multiple databases. When splitting databases and tables, split vertically first and horizontally second, because vertical splitting is simpler and closer to how we handle real-world problems.

Vertical splitting

Vertical table split

That is, splitting a big table into small tables by column. When a table has many fields, the rarely used ones and the large or long ones (such as text columns) are moved into an “extension table”. This is typically done for wide tables with hundreds of columns, and it also avoids the “cross-page” problem, where a row holding too much data spills over more than one storage page.
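A minimal sketch of a vertical table split, assuming a hypothetical `product` table whose large `description` column moves to a `product_ext` extension table. SQLite is used here only so that the example runs on its own; the table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Frequently read, narrow columns stay in the main table.
    CREATE TABLE product (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        price INTEGER NOT NULL
    );
    -- Rarely read, large columns move to an extension table with the same key.
    CREATE TABLE product_ext (
        product_id  INTEGER PRIMARY KEY REFERENCES product(id),
        description TEXT
    );
""")
conn.execute("INSERT INTO product VALUES (1, 'keyboard', 4900)")
conn.execute("INSERT INTO product_ext VALUES (1, 'A long description ...')")


def list_products(conn):
    """The list query touches only the narrow main table."""
    return conn.execute(
        "SELECT id, name, price FROM product ORDER BY id").fetchall()


def product_detail(conn, product_id):
    """The detail query joins in the extension columns only when needed."""
    return conn.execute(
        "SELECT p.id, p.name, p.price, e.description "
        "FROM product p LEFT JOIN product_ext e ON e.product_id = p.id "
        "WHERE p.id = ?", (product_id,)).fetchone()
```

Most queries then read the small main table, and only the detail view pays for the large columns.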

Vertical database split

Vertical database splitting separates the different businesses in a system, for example one database for users (User), one for products (Product), and one for orders (Order). Once split, the databases go on multiple servers rather than one. Why? Imagine a shopping website serving the public, with CRUD on users, products, orders, and so on. Before the split, everything lands in a single database, whose processing capacity becomes the bottleneck. If the databases are split vertically but still kept on one database server, then as the number of users grows, that single server becomes the bottleneck, and its disk space, memory, and TPS run short. So we spread the databases across multiple servers, which relieves these problems and pushes back the resource limits of a single machine.

Much like the “governance” and “degradation” mechanisms for services, splitting databases by business lets the data of each business be managed, maintained, monitored, and scaled separately. The database is often the component most likely to become the bottleneck of an application, and because it is “stateful”, it is harder to scale horizontally than web and application servers. Database connections are a scarce resource, and the processing capacity of a single machine is limited. Under high concurrency, vertical database splitting can, to some extent, break through the single-machine limits on IO, connection count, and hardware resources.

Horizontal splitting

Horizontal table split

A single table with a large amount of data (such as an order table) is cut into multiple tables according to some rule (RANGE, HASH modulus, and so on). But these tables still live in the same database, so operations on that database still hit the IO bottleneck. Not recommended.

Horizontal database and table split

The data of a single table is sharded across multiple servers. Each server has the corresponding database and table, but each table holds a different subset of the data. Horizontal database and table splitting effectively relieves the performance bottleneck and load of a single machine and a single database, and breaks through the limits on IO, connection count, and hardware resources.
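As a rough sketch of how a row could be located once a table is spread over several servers, the function below maps a user ID to a database and a table by modulus. The counts and the `order_db_N` / `order_N` naming are assumptions made for illustration only.

```python
from typing import Tuple

DB_COUNT = 4        # assumed number of database servers
TABLES_PER_DB = 8   # assumed number of order tables in each database


def locate(user_id: int) -> Tuple[str, str]:
    """Map a user ID to the (database, table) holding that user's orders."""
    slot = user_id % (DB_COUNT * TABLES_PER_DB)  # 0 .. 31
    db_index = slot // TABLES_PER_DB             # which server
    table_index = slot % TABLES_PER_DB           # which table on that server
    return f"order_db_{db_index}", f"order_{table_index}"


print(locate(45))  # slot 13 -> ('order_db_1', 'order_5')
```

With this scheme, changing `DB_COUNT` or `TABLES_PER_DB` later changes where existing IDs map, so existing rows would have to be moved; that is one reason to choose the counts carefully up front.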

Rules for horizontal database and table sharding

RANGE: IDs 0 to 10000 go to one table, 10001 to 20000 to the next, and so on.

HASH: hash the user ID and use the result to assign the row to a database. In a shopping system, treat users and orders as the main tables and place the tables associated with them on the same shard, so that cross-database transactions and similar problems do not arise.

Geography: divide the business by region, such as East China, South China, and North China. Qiniu Cloud (Qiniuyun) presumably works this way.

Time: split by time, that is, move data older than 6 months or even one year into another table. As time passes, that data is less and less likely to be queried, so there is no need to keep it together with the "hot data". This is also known as "hot and cold data separation".
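The RANGE and time rules can be sketched as two small routing functions. The table names are hypothetical, and "6 months" is taken as 180 days purely as an assumption for the example.

```python
from datetime import datetime, timedelta

RANGE_SIZE = 10_000               # IDs per table, as in the RANGE rule
HOT_WINDOW = timedelta(days=180)  # assumption: "6 months" taken as 180 days


def table_by_range(record_id: int) -> str:
    """IDs 0..10000 go to table 0, 10001..20000 to table 1, and so on."""
    if record_id < 0:
        raise ValueError("record_id must be non-negative")
    return f"order_{max(record_id - 1, 0) // RANGE_SIZE}"


def table_by_age(created_at: datetime, now: datetime) -> str:
    """Recent rows stay in the hot table; older rows go to the archive table."""
    return "order_hot" if now - created_at <= HOT_WINDOW else "order_archive"
```

A range rule keeps neighboring IDs together, which suits range scans but sends all new rows to the newest table. A time rule makes the same trade in exchange for keeping the hot table small.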

Transaction support: once databases and tables are sharded, transactions become distributed transactions. Relying on the database's own distributed transaction management to execute them costs a lot of performance. Having the application coordinate instead, forming transactions in program logic, adds programming burden.

Merging result sets from multiple databases (group by, order by): TODO
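This item is still a TODO, so the following is only one possible approach, not a worked-out solution: each shard sorts or groups its own rows, and the application merges the partial results. The sample rows and function names are hypothetical.

```python
import heapq
import itertools
from collections import Counter

# Hypothetical per-shard results of
#   SELECT order_id, amount FROM orders ORDER BY amount DESC
shard_a = [(101, 900), (104, 300), (107, 50)]
shard_b = [(202, 700), (205, 300), (209, 10)]


def merge_order_by(shards, limit):
    """Merge already-sorted shard results into one globally sorted page."""
    merged = heapq.merge(*shards, key=lambda row: row[1], reverse=True)
    return list(itertools.islice(merged, limit))


def merge_group_by(shard_counts):
    """Sum per-shard GROUP BY counts into global counts."""
    total = Counter()
    for counts in shard_counts:
        total.update(counts)
    return dict(total)
```

Note that for a page of N rows, each shard must return its own top N rows, since any one shard might hold the entire global top N.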

Cross-database join: TODO. After sharding, join operations between tables are restricted: we cannot join tables that sit in different shard databases, nor tables sharded at different granularities. As a result, work that one query used to do may need several queries. Rough workarounds: Global tables: keep a copy of basic reference data in every database. Field redundancy: duplicate some fields so that they can be read without a join. System-layer assembly: query each source separately, then assemble the results in the application, which is more complex.
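The "system-layer assembly" workaround can be sketched as follows. Two in-memory SQLite databases stand in for two separate database servers; the schema and function name are assumptions for illustration.

```python
import sqlite3

# Two separate in-memory databases stand in for two database servers.
user_db = sqlite3.connect(":memory:")
order_db = sqlite3.connect(":memory:")

user_db.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT)")
user_db.executemany("INSERT INTO user VALUES (?, ?)",
                    [(1, "alice"), (2, "bob")])

order_db.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount INTEGER)")
order_db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(10, 1, 500), (11, 2, 300), (12, 1, 200)])


def orders_with_user_names():
    """Emulate `orders JOIN user` by querying each database and joining in memory."""
    orders = order_db.execute(
        "SELECT id, user_id, amount FROM orders ORDER BY id").fetchall()
    user_ids = sorted({user_id for _, user_id, _ in orders})
    if not user_ids:
        return []
    placeholders = ",".join("?" * len(user_ids))
    names = dict(user_db.execute(
        f"SELECT id, name FROM user WHERE id IN ({placeholders})", user_ids))
    return [(order_id, names.get(user_id), amount)
            for order_id, user_id, amount in orders]
```

What was one join becomes two queries plus a lookup in application memory, which is the extra complexity mentioned above.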

Many sharding middleware products are available today. MySQL Proxy and Amoeba are proxy-based; Hibernate Shards is built on the Hibernate framework; Dangdang's Sharding-JDBC works at the JDBC layer; Mogujie's TSharding is a MyBatis-based, Maven-plugin-style product; and Cobar Client works by overriding Spring's iBatis template class.

There are also open source products from major companies: