Every good developer and architect should master database and table sharding. In the era of the mobile Internet, huge user bases generate massive amounts of data every day, for example:

  • The users table

  • The orders table

  • The transactions table

Take Alipay, which has roughly 800 million users, and WeChat, which has over 1 billion. Order tables are even more extreme: Meituan's food-delivery service generates tens of millions of orders every day, and Taobao's historical orders must run into the tens or even hundreds of billions. Data at this scale is far more than a single table can hold.

Strictly speaking, MySQL can store hundreds of millions of rows in a single table, but performance suffers badly at that point. The common rule of thumb is that MySQL performs best when a single table stays under roughly 10 million rows, because the height of its B+Tree index then stays at about 3 to 5 levels.
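As a rough back-of-the-envelope check on that rule of thumb (the page, key, and row sizes here are illustrative assumptions, not numbers from the original text): with 16 KB InnoDB pages, a BIGINT key (8 bytes) plus a 6-byte page pointer gives about 16384 / 14 ≈ 1170 entries per internal node, and an average row of about 1 KB gives roughly 16 rows per leaf page, so a height-3 B+Tree holds on the order of 1170 × 1170 × 16 ≈ 21.9 million rows. Tables in the tens of millions of rows therefore keep the tree at about three levels, while billions of rows push it higher, adding an extra page read to every lookup.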

Since one table is not enough, the data has to be spread across multiple places. There are three common solutions today:

  1. Partitioning;

  2. Sharding (splitting databases and tables);

  3. NoSQL/NewSQL;

To clarify: splitting only databases, splitting only tables, or splitting both are all treated here as one "sharding" scheme, since splitting only databases or only tables is just a special case of splitting both. Representative NoSQL stores are MongoDB and Elasticsearch; a representative NewSQL database is TiDB.

Why Not NoSQL / NewSQL?

First, why not the third option, NoSQL/NewSQL? Mainly because an RDBMS has the following advantages:

  • A mature RDBMS ecosystem;

  • Proven RDBMS stability;

  • RDBMS transaction guarantees;

As relative newcomers, NoSQL/NewSQL systems cannot compete with RDBMSs when reliability is the primary concern. RDBMSs have evolved over decades and have become the core storage of choice wherever there is software.

Today, most companies keep their core data primarily in an RDBMS, with NoSQL/NewSQL serving as auxiliary storage.

Internet companies mostly build on MySQL, while deep-pocketed state-owned enterprises and banks tend to build on Oracle or DB2. However impressive the NoSQL/NewSQL marketing sounds, companies currently position it as a complement to, not a replacement for, the RDBMS.

Why Not Partitioning?

Let's look at the partitioned-table approach. Before judging it, it helps to understand how it works:

A partitioned table is implemented as multiple related underlying tables, each represented by its own handler object, so individual partitions can also be accessed directly.



The storage engine manages the underlying tables of a partitioned table just as it manages ordinary tables (all underlying tables must use the same storage engine), and an index on a partitioned table is simply the same index added to each underlying table.


From the storage engine's point of view, an underlying table is no different from an ordinary table; the engine does not need to know whether it is standalone or part of a partitioned table.
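To make this concrete, here is a minimal sketch of creating a hash-partitioned order table through plain JDBC (the table name, columns, connection URL, and partition count are illustrative assumptions, not from the article). MySQL itself maintains the 16 underlying tables, and the application keeps querying the single logical table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreatePartitionedOrderTable {
    public static void main(String[] args) throws Exception {
        // Illustrative DDL: one logical table, 16 underlying tables managed by MySQL.
        String ddl =
            "CREATE TABLE t_order (" +
            "  order_id BIGINT NOT NULL," +
            "  user_id  BIGINT NOT NULL," +
            "  amount   DECIMAL(10,2)," +
            "  PRIMARY KEY (order_id)" +
            ") ENGINE=InnoDB " +
            "PARTITION BY HASH(order_id) PARTITIONS 16";

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/demo", "root", "password");
             Statement stmt = conn.createStatement()) {
            stmt.execute(ddl);
            // The application still issues queries against the single entry point,
            // e.g. SELECT * FROM t_order WHERE order_id = 42;
            // MySQL routes the query to the right underlying partition.
        }
    }
}
```

Note that all partitions live inside one MySQL instance, which is exactly the limitation discussed below.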


Partitioning is actually a decent solution: it hides the sharding details from the user, and queries still work even when the condition does not contain the partitioning column (they just don't perform well).

Its drawbacks, however, are obvious: many resources are capped by the single machine, such as the number of connections and the network throughput.

Although each partition can be stored independently, the single entry point to a partitioned table is still one MySQL instance. Its concurrency is therefore mediocre, far below the high-concurrency requirements of Internet-scale services.

As for other drawbacks mentioned online, such as not being able to use foreign keys or full-text indexes, I don't consider them real disadvantages; a 21st-century project that still relies on foreign keys and database full-text indexes isn't even worth mocking.

So if you do use partitioned tables, your business should have the following two characteristics:

  1. The data volume is not massive (the number of partitions and the storage capacity are both limited);

  2. The concurrency requirement is not high.


Why Sharding?

Finally, we come to the Internet industry's standard way of handling massive data: sharding, i.e. splitting databases and tables.

Although everyone uses sharding to handle massive core data, there is no single unified middleware. Some well-known sharding middleware is listed here:

  • Alibaba's TDDL, DRDS, and Cobar

  • The open-source community's Sharding-JDBC (renamed Sharding-Sphere as of 3.x)

  • The community-driven MyCAT

  • 360's Atlas

  • Meituan's Zebra

Note: Zhang Liang, the author of Sharding-JDBC, used to work at Dangdang and is now at JD Finance. However, the copyright of Sharding-JDBC belongs to the open-source community, not to any company and not to Zhang Liang personally.

Other companies, such as NetEase, 58.com, and JD.com, have also developed their own middleware. In short, everyone rolls their own; a hundred flowers bloom.

All of this sharding middleware, however, falls into one of two categories:

  • CLIENT mode

  • PROXY mode

CLIENT mode is represented by Alibaba's TDDL and the open-source community's Sharding-JDBC (since 3.x, as Sharding-Sphere, it also supports a proxy mode).

The structure is as follows:


PROXY mode is represented by Alibaba's Cobar and the community-driven MyCAT. The structure is as follows:


Whether in CLIENT mode or PROXY mode, the core steps are the same: SQL parsing, rewriting, routing, execution, and result merging.

The author prefers CLIENT mode: its architecture is simple, its performance overhead is small, and its operations and maintenance cost is low.
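To illustrate what the routing and rewriting steps look like in CLIENT mode, here is a minimal sketch (the 4-database by 8-table layout, the table names, and the modulo rule are illustrative assumptions, not Sharding-JDBC's actual code): the client computes the target shard from the sharding column and rewrites the logical table name before sending the SQL over an ordinary JDBC connection:

```java
public class OrderShardRouter {
    private static final int DB_COUNT = 4;        // illustrative: 4 physical databases
    private static final int TABLES_PER_DB = 8;   // illustrative: 8 tables per database

    /** Route a user_id to a physical database, e.g. "order_db_1". */
    static String routeDatabase(long userId) {
        return "order_db_" + (userId % DB_COUNT);
    }

    /** Route a user_id to a physical table inside that database, e.g. "t_order_5". */
    static String routeTable(long userId) {
        return "t_order_" + (userId / DB_COUNT % TABLES_PER_DB);
    }

    /** Rewrite the logical SQL so that it targets the physical table. */
    static String rewrite(String logicalSql, long userId) {
        return logicalSql.replace("t_order", routeTable(userId));
    }

    public static void main(String[] args) {
        long userId = 123_456_789L;
        String logicalSql = "SELECT * FROM t_order WHERE user_id = ?";
        System.out.println(routeDatabase(userId) + " -> " + rewrite(logicalSql, userId));
    }
}
```

In practice, middleware such as Sharding-JDBC performs the parsing, rewriting, routing, execution, and result merging transparently behind a standard JDBC DataSource; the sketch above only shows the routing idea.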

Next, using a few common large tables as examples, let's look at how to put sharding into practice.


Practical cases

The first and most important step is choosing the sharding column. This choice directly determines whether the whole sharding scheme ultimately succeeds.

Choosing the sharding column is strongly tied to the business. The author's approach is to analyze your API traffic, prioritize the high-traffic APIs, and extract the SQL behind them; the conditions those SQL statements have in common become the candidate sharding columns.

For example, a typical OLTP system serves end users, and the SQL behind its APIs almost always filters on the user ID, so the user ID makes a very good sharding column.

There are several main approaches to sharding:

  1. Shard by a single sharding column;

  2. Shard by multiple sharding columns, keeping multiple sharded copies;

  3. Sharding column + ES;

Let's use a few real tables to illustrate how to shard.

The orders table

The core fields of the orders table are as follows:


Take Alibaba's order system as an example (see "Transformation of Enterprise IT Architecture: Alibaba's Middle-Platform Strategy and Architecture in Practice"): it uses three columns as three independent sharding columns.

Namely: order_id, user_id, and merchant_code.

Among them, user_id and merchant_code are the buyer ID and the seller ID: in Alibaba's order system both buyer-side and seller-side query traffic is heavy, and both have strict real-time requirements. Queries by order_id, i.e. looking up a single order, are naturally frequent as well.

One more point worth mentioning: with multiple sharding columns we have to weigh whether each sharded copy holds fully redundant data or only a redundant index (mapping) table.

Full redundancy is shown in the figure below: the tables sharded by each sharding column all hold the complete data. The advantage is that no second query is needed, so performance is better; the disadvantage is the wasted storage space (the light green field is the sharding column):


The redundant index-table approach is shown in the figure below: only the tables sharded by one sharding column hold the full data, while the tables sharded by the other columns are just mapping tables that point back to that primary sharding column.

The advantage is the saved space; the disadvantage is that every query on a sharding column other than the primary one requires two lookups, as sketched below.
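Here is a sketch of that two-step lookup (the table names, columns, and connection handling are illustrative assumptions; routing to the correct physical shard is omitted and was shown earlier): a seller-side query first hits the mapping table sharded by merchant_code to collect order IDs, then fetches the full rows from the order tables sharded by order_id:

```java
import java.sql.*;
import java.util.ArrayList;
import java.util.List;

public class SellerOrderQuery {

    // Step 1: the index table sharded by merchant_code only maps merchant_code -> order_id.
    static List<Long> findOrderIds(Connection merchantShard, String merchantCode) throws SQLException {
        List<Long> ids = new ArrayList<>();
        try (PreparedStatement ps = merchantShard.prepareStatement(
                "SELECT order_id FROM t_order_by_merchant WHERE merchant_code = ?")) {
            ps.setString(1, merchantCode);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) ids.add(rs.getLong(1));
            }
        }
        return ids;
    }

    // Step 2: use each order_id to hit the full-data table sharded by order_id.
    static void loadOrders(Connection orderShard, List<Long> orderIds) throws SQLException {
        try (PreparedStatement ps = orderShard.prepareStatement(
                "SELECT * FROM t_order WHERE order_id = ?")) {
            for (long id : orderIds) {
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getLong("order_id") + " " + rs.getBigDecimal("amount"));
                    }
                }
            }
        }
    }
}
```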

The relationship between the three tables is shown below (the light green fields are the sharding columns):


Fully redundant tables vs. redundant index tables

  1. Speed: fully redundant tables are faster, while redundant index tables need two queries; even with a cache there is still one extra network round trip.

  2. Storage cost: fully redundant tables cost several times as much storage as redundant index tables.

  3. Maintenance cost: fully redundant tables are more expensive to maintain, because a data change has to be applied to multiple tables.

Conclusion: choosing between fully redundant tables and index tables is an architectural trade-off; both have clear pros and cons. Alibaba's order table uses full redundancy.

The users table

The core fields of the users table are generally as follows:

In common scenarios, users can log in using mobile_no, email, or username.

Meanwhile, some user-related APIs all carry user_id, so it may be necessary to shard by all four columns, i.e. all four are sharding columns.

The accounts table

The core fields of the accounts table are generally as follows:

For the account-related APIs, account_no is the sharding column.

Complex queries

Everything above assumes the SQL condition contains a sharding column. But there will always be queries whose conditions contain no sharding column, and we cannot keep adding redundant sharded copies for every one of these low-volume queries.

So what do we do with SQL whose conditions contain no sharding column?

Take Sharding-JDBC as an example: such a statement is routed concurrently to every sharded database and table, executed on each, and the results are then merged. For the details of how merging works, see the author's Sharding-JDBC article series, which explains the merge logic through the source code.
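Here is a minimal sketch of that fan-out-and-merge behavior (the shard URLs, credentials, and thread pool are illustrative assumptions, and this is not Sharding-JDBC's internal code): the same SQL goes to every shard in parallel and the partial results are simply concatenated; sorting, paging, and aggregation merging are omitted:

```java
import java.sql.*;
import java.util.*;
import java.util.concurrent.*;

public class BroadcastQuery {

    static List<Map<String, Object>> queryAllShards(List<String> shardUrls, String sql)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shardUrls.size());
        List<Future<List<Map<String, Object>>>> futures = new ArrayList<>();
        for (String url : shardUrls) {
            futures.add(pool.submit(() -> queryOneShard(url, sql)));   // fan out
        }
        List<Map<String, Object>> merged = new ArrayList<>();
        for (Future<List<Map<String, Object>>> f : futures) {
            merged.addAll(f.get());                                    // merge (simple concat)
        }
        pool.shutdown();
        return merged;
    }

    static List<Map<String, Object>> queryOneShard(String jdbcUrl, String sql) throws SQLException {
        List<Map<String, Object>> rows = new ArrayList<>();
        try (Connection c = DriverManager.getConnection(jdbcUrl, "root", "password");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(sql)) {
            ResultSetMetaData md = rs.getMetaData();
            while (rs.next()) {
                Map<String, Object> row = new LinkedHashMap<>();
                for (int i = 1; i <= md.getColumnCount(); i++) {
                    row.put(md.getColumnLabel(i), rs.getObject(i));
                }
                rows.add(row);
            }
        }
        return rows;
    }
}
```

The overall latency is that of the slowest shard, which is exactly the weakest-link effect described next.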

Compared with queries that do carry a sharding column, this kind of query obviously performs much worse. With dozens or even hundreds of sharded tables, if even one table is slow for any reason, the entire SQL statement is slow; the weakest-link (bucket) effect applies perfectly.

Worse still are back-office (operations) systems with fuzzy searches or filters over ten-odd criteria. Such queries are hard to index even on a single table, let alone on sharded tables.

So what do we do? This is where the well-known Elasticsearch (ES) comes in: the sharded data is fully replicated into ES, and those complex queries are handed over to ES.
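To show why ES handles these ad-hoc conditions so easily, here is a minimal sketch of such a query against the ES search REST API (the index name, field names, and endpoint are illustrative assumptions): a bool query combines exact filters with a fuzzy match on the item title, a combination a B+Tree index cannot serve well:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OrderSearch {
    public static void main(String[] args) throws Exception {
        // Illustrative index/field names; the point is that arbitrary filter
        // combinations plus fuzzy title matching are easy to express in ES.
        String query = """
            {
              "query": {
                "bool": {
                  "filter": [
                    { "term":  { "user_id": 123456789 } },
                    { "term":  { "order_status": "PAID" } },
                    { "range": { "create_time": { "gte": "2023-01-01" } } }
                  ],
                  "must": [
                    { "match": { "item_title": "iphone case" } }
                  ]
                }
              }
            }""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/order_index/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());   // hits contain the matching order documents
    }
}
```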

Taobao's "My Orders" page, shown below, combines multiple filter conditions with fuzzy matching on product titles; even a single table cannot handle this (no index fits such a scenario), let alone sharded tables:



So, taking the orders table as an example, the overall architecture looks like this:


One caveat: do not use multiple sharding columns unless you really must, because the cost is high. For the users table mentioned above, the author would not recommend that approach.

The users table has a useful property: its size has a hard ceiling. Even if all 7 billion people on Earth were your users, the data volume would not be huge, so the author suggests the single sharding column + ES approach to keep the architecture simple.

ES + HBase briefly

Note up front that Solr + HBase combinations may appear more often in the community; for consistency, this article uses ES for all full-text indexing.

Whether ES + HBase beats Solr + HBase, or ES beats Solr in general, is outside the scope of this article; frankly, the debate is not very meaningful.

ES and Solr are both excellent, comparable middleware; ES has simply become more popular in recent years:


If you set aside all the historical baggage and weigh ES + HBase against Solr + HBase purely on their merits, the latter is clearly the better choice.

Solr + HBase integration is deep: with the indexer service, it has a mature solution, the Lily HBase Indexer, to the most critical problem of all, index consistency.

Further reading

Alibaba Cloud's HBase cloud database also uses Solr for full-text indexing. Interested readers can follow the link below to learn more:

https://help.aliyun.com/product/49055.html?spm=5176.124785.631202.con1.603452c0cz7bj2



ES + HBase principle

The scheme we just discussed keeps MySQL at the core: sharding + ES. As the data keeps growing, the sharded MySQL layer can keep scaling out, but the pressure then shifts to ES, and the architecture gradually starts to show problems.

Core tables that need sharding, such as order tables and loyalty-points tables, typically have dozens or even hundreds of columns (say 50), yet fewer than 10 of them (say 10) ever actually appear in query conditions.

Indexing the data of all 50 columns into ES puts heavy pressure on the ES cluster, and recovering from a subsequent ES shard failure takes a long time.

At this point we can reduce the load on ES so that the cluster's limited resources hold, as far as possible, only the data that is genuinely needed for conditional retrieval.

In other words, only the fields that can appear in query conditions are indexed into ES, cutting the load on the ES cluster to roughly 1/5 of before (the core table has 50 fields, but only 10 participate in conditions).

The full 50-field rows are stored in HBase. This is the classic ES + HBase combination: the index is separated from the data store.

As we all know, HBase on the Hadoop stack offers massive storage capacity and lightning-fast lookups by RowKey, while ES excels at multi-condition retrieval.

This solution exploits the strengths of both ES and HBase while sidestepping their weaknesses; it is a textbook case of playing to each system's strengths.

The two interact as follows: the application first searches ES for the RowKeys that match the filter conditions, then uses those RowKeys to fetch the rows from HBase. The second step takes almost no time at all, because this is exactly the scenario HBase is best at.
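Here is a minimal sketch of that interaction (the table name, column family, and RowKeys are illustrative assumptions, and the ES search is reduced to a stub; see the query sketch earlier): ES returns only the matching document IDs, which double as HBase RowKeys, and a single HBase multi-get then fetches the full 50-field rows:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class EsHBaseLookup {

    /** Step 1 (stub): run the conditional search in ES and return only the matching
     *  document IDs, which serve as HBase RowKeys (e.g. the order_id). */
    static List<String> searchRowKeysFromEs(String userQuery) {
        // In a real system this would be an ES bool query returning only IDs,
        // as sketched in the previous section.
        return List.of("202301010001", "202301010002");
    }

    /** Step 2: batch-get the full rows from HBase by RowKey. */
    static void loadFromHBase(List<String> rowKeys) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("t_order"))) {
            List<Get> gets = new ArrayList<>();
            for (String key : rowKeys) {
                gets.add(new Get(Bytes.toBytes(key)));
            }
            Result[] rows = table.get(gets);          // one multi-get, very fast by RowKey
            for (Result row : rows) {
                byte[] amount = row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("amount"));
                System.out.println(Bytes.toString(row.getRow()) + " -> " + Bytes.toString(amount));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        loadFromHBase(searchRowKeysFromEs("title contains 'iphone case', status = PAID"));
    }
}
```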

The interaction diagram is as follows:


Extending HBase's search capability

(Image source: HBase Technical Community, HBase Application Practice session, "HBase for Solr")


Conclusion

Finally, the schemes are summarized below (sharding column is abbreviated as SC):

|                          | Single SC | Multiple SC | SC + ES        | SC + ES + HBase |
| ------------------------ | --------- | ----------- | -------------- | --------------- |
| Applicable scenarios     | Narrow    | Average     | Fairly broad   | Very broad      |
| Query timeliness         | Real-time | Real-time   | Fairly timely  | Fairly timely   |
| Storage capacity         | Moderate  | Moderate    | Fairly large   | Massive         |
| Coding cost              | Small     | Fairly high | Moderate       | Moderate        |
| Architectural complexity | Simple    | Moderate    | Fairly complex | Very complex    |

In short, sharding massive data with meaningful concurrency is not something you solve simply by dropping in a sharding middleware; it is a systems engineering effort. You need to analyze all the business around the table and let the right middleware do what it does best.

For example, queries that carry a sharding column go to the sharded MySQL tables; fuzzy searches and filters over many ad-hoc criteria go to ES; and massive full-row storage goes to HBase.

Even after all of that, plenty of follow-up work remains, such as keeping the synchronized data consistent; and once the system has run for a while, some tables will gradually hit the single-table bottleneck again, at which point cold-data migration is needed as well.

In short, sharding is a very complex piece of systems engineering. Handling massive data is never simple; be ready for a fight!

End

Author: Alfy’s blog

Source:

https://www.jianshu.com/p/f29e73b97794

The copyright of this article belongs to the author

