Why split databases and tables?

The interview questions

Why split databases and tables (that is, how do you design the database layer of a high-concurrency system)? Which database and table sharding middleware have you used? What are the advantages and disadvantages of the different middleware? How exactly do you split a database vertically or horizontally?

What the interviewer is thinking

This question really comes down to high concurrency, because splitting databases and tables exists to solve two problems: high concurrency and large data volumes. Frankly, in interviews at Internet companies especially, you will almost certainly be asked about it. It is such a common technical topic that interviewers can hardly skip it, and if you don't know it, you really won't get through!

Analysis of interview questions

Why split databases and tables? (How do you design the database layer of a high-concurrency system?)

Put plainly, splitting databases and splitting tables are two different things, so don't confuse them. You might split only the databases and not the tables, or only the tables and not the databases; both are possible.

Let me walk you through a scenario.

Suppose we are a small start-up (or a new department just spun out of a BAT company) with 200,000 registered users, 10,000 daily active users, 1,000 new rows per day in a single table, and at most 10 concurrent requests per second at peak. Honestly, that's a trivial system. Pick someone with a few years of experience, give them a couple of fresh graduates, and build it however you like.

Then we got lucky and met a CEO who took the business to the next level. It grew rapidly, and within a few months we had 20 million registered users, 1 million daily active users, 100,000 new rows per day in a single table, and a peak of 1,000 requests per second! Along the way the company also closed two rounds of financing and took in several hundred million yuan, and its valuation reached a staggering several hundred million dollars. That's the pace of a little unicorn!

Fine, no problem yet. But now you are starting to feel some pressure. Why? With 100,000 new rows a day, that's 3 million more rows a month. A single table now holds several million rows and will soon pass 10 million. Still, we can just about hold on. Peak load is now 1,000 requests per second; we have deployed several machines online behind a load balancer, and the database can handle 1,000 QPS. But everyone is beginning to worry about what comes next...

Over the next few months, wow, the CEO really is amazing: the company reaches 100 million users and raises billions of yuan more! Its valuation hits a staggering several billion dollars, making it the hottest star start-up in China this year! We are so lucky.

But we are also unlucky, because by now there are tens of millions of daily active users, a single table gains 500,000 new rows per day, and the total size of one table has reached 20 to 30 million rows. It can't cope! The database disk is running out of space, and peak concurrency has reached an astonishing 5,000 to 8,000 requests per second. Believe me, your system won't survive this. It will go down!

So now you have a rough sense of why you split databases and tables: it follows the growth of your business. The better the business does, the more users you have, the more data and requests there are, and eventually a single database can't handle it.

Splitting tables

For example, if a single table holds tens of millions of rows, are you sure it can cope? It can't: when a single table holds too much data, SQL execution performance suffers badly and your queries may get slower and slower. In my experience, once a single table reaches several million rows, performance becomes noticeably worse, and it's time to split the table.

What does splitting a table mean? It means spreading the data of one table across multiple tables, so that each query goes to just one of them. For example, if you split by user ID, all of one user's data lives in one table, and any operation for that user touches only that table. This keeps the amount of data in each table within a manageable range, for example at most 2 million rows per table.
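As a rough illustration of splitting by user ID, here is a minimal Python sketch. The table count, the `orders_N` naming scheme, and the function names are all hypothetical, chosen only to show how one user's rows always land in the same table.

```python
# Hypothetical values: TABLE_COUNT and the "orders_N" naming are assumptions.
TABLE_COUNT = 4


def table_for_user(user_id, table_count=TABLE_COUNT):
    """Return the physical table that holds every row for this user."""
    return f"orders_{user_id % table_count}"


def build_query(user_id):
    """Build a parameterized query that touches only that user's table."""
    table = table_for_user(user_id)
    return f"SELECT * FROM {table} WHERE user_id = ?", (user_id,)


print(table_for_user(10))  # orders_2
print(build_query(5))
```

Because the table is derived from the user ID alone, a lookup for one user never has to scan the other tables.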

Splitting databases

What does splitting databases mean? In our experience, a single database can generally support up to about 2,000 concurrent requests, at which point it must be scaled out, and a healthy single database is best kept at around 1,000 requests per second, no more. So you split the data of one database across multiple databases, and each request accesses only one of them.

That is what splitting databases and tables means. Why do we do it? You get the idea.

Which database and table sharding middleware have you used? What are the advantages and disadvantages of the different middleware?

This question checks which sharding middleware you know, what the pros and cons of each are, and which ones you have actually used.

Some of the more common ones include:

Cobar
TDDL
Atlas
Sharding-jdbc
Mycat

Cobar

Developed and open-sourced by Alibaba's B2B team. It is a proxy-layer solution that sits between the application servers and the database servers. The application connects to the Cobar cluster through a JDBC driver; Cobar breaks down each SQL statement according to the statement itself and the sharding rules, then distributes it to the appropriate database instances in the MySQL cluster for execution. It was usable in earlier years, but it has not been updated recently, hardly anyone uses it, and it is more or less abandoned. It also does not support read/write splitting, stored procedures, cross-database joins, or pagination.

TDDL

Developed by the Taobao team. It is a client-layer solution. It supports basic CRUD syntax and read/write splitting, but not joins, multi-table queries, or similar syntax. It is not widely used today because it depends on Taobao's Diamond configuration management system.

Atlas

Open-sourced by Qihoo 360. It is a proxy-layer solution. Some companies used it in the past, but its big problem is that the community last maintained it about five years ago, so very few companies use it now.

Sharding-jdbc

Open-sourced by Dangdang. It is the client-layer solution of ShardingSphere, which also provides a proxy-layer solution, sharding-proxy. It has seen a fair amount of use because it supports a wide range of SQL syntax without many restrictions. As of April 2019 it had reached version 4.0.0-RC1, with support for splitting databases and tables, read/write splitting, distributed ID generation, and flexible transactions (best-effort delivery transactions and TCC transactions). Quite a few companies do use it: the official website lists the companies that have registered as users, from 2017 up to the present. The community is still developing and maintaining it and is fairly active, so I personally consider it a viable choice today.

Mycat

Built on top of Cobar. It is a proxy-layer solution with a very complete feature set. It is currently a very popular database middleware with an active community, and a number of companies have started using it. However, it is younger than Sharding-jdbc and less battle-tested.

Summary

In summary, the two options worth considering are Sharding-jdbc and Mycat.

Sharding-jdbc, as a client-layer solution, needs no separate deployment, has a low operations cost, avoids the extra forwarding hop of a proxy layer, and performs well. However, any upgrade requires every system to upgrade its dependency and be re-released, and every system is coupled to the Sharding-jdbc dependency.

Mycat, as a proxy-layer solution, has the drawback that you must deploy, operate, and maintain a set of middleware yourself, so the operations cost is high. Its advantage is that it is transparent to every project: when an upgrade is needed, you only upgrade the middleware itself.

Generally speaking, either solution is a reasonable choice, but I personally recommend Sharding-jdbc for small and medium-sized companies. The client-layer solution is lightweight and cheap to maintain, needs no dedicated staff, and suits companies whose systems are less complex and whose projects are fewer. Medium and large companies are better off with a proxy-layer solution such as Mycat: they tend to have many systems and projects, large teams, and enough people to dedicate someone to studying and maintaining Mycat, after which a large number of projects can use it directly and transparently.

How exactly do you split the database vertically or horizontally?

A horizontal split spreads the rows of one table across multiple tables in multiple databases. The table structure is the same in every database, but each holds different rows, and the rows of all of them together make up the full data set. The point of a horizontal split is to distribute data evenly across more databases, so that more databases can carry higher concurrency and their combined storage expands capacity.
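To make the idea concrete, here is a small sketch using Python's built-in `sqlite3` module, with in-memory connections standing in for separate physical databases. The `orders` schema, `DB_COUNT`, and the modulo routing on `user_id` are assumptions made for illustration, not something prescribed by any particular middleware.

```python
import sqlite3

DB_COUNT = 2  # assumed number of databases

# Each in-memory SQLite connection stands in for one physical database.
databases = [sqlite3.connect(":memory:") for _ in range(DB_COUNT)]
for db in databases:
    # Every database gets the same table structure.
    db.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, amount INTEGER)"
    )


def insert_order(order_id, user_id, amount):
    """Write the row to the single database chosen by user_id."""
    db = databases[user_id % DB_COUNT]
    db.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, user_id, amount))


def count_all_orders():
    """The full data set is the sum of the rows held by every database."""
    return sum(db.execute("SELECT COUNT(*) FROM orders").fetchone()[0] for db in databases)


for order_id, user_id in enumerate([1, 2, 3, 4, 5], start=100):
    insert_order(order_id, user_id, amount=10 * user_id)

per_db_counts = [db.execute("SELECT COUNT(*) FROM orders").fetchone()[0] for db in databases]
print(per_db_counts, count_all_orders())
```

Each database holds only part of the rows, yet the schema is identical everywhere, which is what distinguishes a horizontal split from a vertical one.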

A vertical split breaks a table with many fields into multiple tables, possibly in multiple databases. Each resulting table has a different structure and contains only some of the fields. Typically, the small number of frequently accessed fields go in one table, and the larger number of rarely accessed fields go in another. Because the database caches data, the fewer fields a frequently accessed row has, the more rows fit in the cache and the better the performance. This kind of split is more often done at the table level.

This is very common, and many of you have probably done it yourselves: taking one big table apart into, say, an order table, an order payment table, and an order items table.
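A vertical split along these lines might look like the following sketch, again using `sqlite3` in memory. The column names and the choice of which fields count as "hot" are hypothetical; the point is only that the frequently read table stays narrow while the rarely read details live elsewhere under the same key.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Narrow "hot" table: a few fields that are read on almost every request.
db.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT)")
# "Cold" table: rarely read payment details, keyed by the same order_id.
db.execute(
    "CREATE TABLE order_payment (order_id INTEGER PRIMARY KEY, channel TEXT, "
    "transaction_no TEXT, paid_at TEXT)"
)

db.execute("INSERT INTO orders VALUES (1, 42, 'PAID')")
db.execute("INSERT INTO order_payment VALUES (1, 'card', 'TX-0001', '2019-04-01 10:00:00')")

# Frequent path: only the narrow table is touched.
hot_row = db.execute("SELECT user_id, status FROM orders WHERE order_id = ?", (1,)).fetchone()

# Rare path: join the two halves back together on the shared key.
full_row = db.execute(
    "SELECT o.status, p.channel FROM orders o "
    "JOIN order_payment p ON p.order_id = o.order_id WHERE o.order_id = ?",
    (1,),
).fetchone()

print(hot_row, full_row)
```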

Then there is table splitting: one table becomes N tables so that the data volume of each stays within a certain range and SQL performance is preserved. Otherwise, the more data a single table holds, the worse SQL performs. A common target is about 2 million rows per table, but it depends on how you use the table: it could be 5 million, or it could be 1 million. The more complex your SQL, the fewer rows each table should hold.
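The arithmetic behind choosing N is simple ceiling division. The sketch below uses the 30 million row figure from the scenario earlier and the per-table targets mentioned here; the function name is hypothetical.

```python
def tables_needed(total_rows, max_rows_per_table=2_000_000):
    """Smallest number of tables that keeps every table at or under the cap."""
    return (total_rows + max_rows_per_table - 1) // max_rows_per_table


total_rows = 30_000_000
for cap in (1_000_000, 2_000_000, 5_000_000):
    print(cap, tables_needed(total_rows, cap))
```

In practice you would size for expected growth rather than the current row count, since changing the number of tables later means moving data.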

All of the database middleware mentioned above supports both splitting databases and splitting tables. Basically, once you have split them, the middleware routes each request automatically, based on the value of a field you specify such as userID, first to the right database and then to the right table.
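The two-level routing described above can be sketched as follows. This is not how any specific middleware is configured (each has its own rule syntax); `DB_COUNT`, `TABLES_PER_DB`, and the `db_N` / `orders_N` names are assumptions used to show one common way of deriving both the database and the table from a single key.

```python
# Hypothetical layout: 4 databases, each holding 8 tables.
DB_COUNT = 4
TABLES_PER_DB = 8


def route(user_id):
    """Map a user ID to a (database, table) pair."""
    slot = user_id % (DB_COUNT * TABLES_PER_DB)  # one slot per physical table
    db_index = slot // TABLES_PER_DB             # which database the slot is in
    table_index = slot % TABLES_PER_DB           # which table inside that database
    return f"db_{db_index}", f"orders_{table_index}"


print(route(35))  # ('db_0', 'orders_3')
print(route(29))  # ('db_3', 'orders_5')
```

Deriving both indexes from one slot number keeps the 32 physical tables evenly used, whereas taking `user_id % 4` for the database and `user_id % 8` for the table would leave most tables empty.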

You need to think through how your own project should split its databases and tables. In general: for a vertical split, work at the table level and separate out some of a table's fields. For a horizontal split, the trigger is that the concurrency or the data volume is more than one database can bear, so you split the database, and you decide which field to split on. For table splitting, consider the case where, even after splitting into several databases, concurrency and capacity are fine but individual tables are still too large; then you split those tables so that no single table holds too much data.

There are two ways to split databases and tables:

One is by range: each database holds a contiguous range of data, usually a time range. This is used less often because it easily creates hotspots, with most of the traffic hitting the newest data.
The other is to hash a field so that the data is spread evenly, which is more common.

The advantage of range splitting is that expansion is very easy: you just prepare one database per month, and when a new month begins, writes naturally go to the new database. The disadvantage is that most requests access the newest data. Whether range splitting is used in production depends on the scenario.
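A month-based range rule can be sketched in a few lines. The `orders_YYYY_MM` naming and the function name are hypothetical; the example also shows why the newest database becomes a hotspot.

```python
from datetime import date


def database_for(created_on):
    """Pick the monthly database for a record (the naming scheme is an assumption)."""
    return f"orders_{created_on.year:04d}_{created_on.month:02d}"


print(database_for(date(2019, 3, 31)))  # orders_2019_03
print(database_for(date(2019, 4, 1)))   # orders_2019_04

# Every write during the first week of April lands in the same database.
recent_days = [date(2019, 4, day) for day in range(1, 8)]
targets = {database_for(d) for d in recent_days}
print(targets)
```

Adding capacity only means creating next month's database in advance; no existing rows have to move.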

The advantage of hash splitting is that it evenly spreads both the data volume and the request load across the databases. The disadvantage is that expansion is cumbersome because it involves data migration: the existing data has to be rehashed and redistributed across the databases or tables.
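To get a feel for the migration cost, the sketch below counts how many keys change shard when the shard count grows. It assumes integer keys and a plain modulo as the hash, which is a simplification; the function names are hypothetical.

```python
def shard_for(key, shard_count):
    """Plain modulo routing, standing in for a real hash function."""
    return key % shard_count


def moved_fraction(keys, old_count, new_count):
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count))
    return moved / len(keys)


keys = range(10_000)
print(moved_fraction(keys, 4, 8))  # 0.5
print(moved_fraction(keys, 4, 5))  # 0.8
```

Under this simple scheme, doubling the shard count relocates half of the keys, while adding a single shard relocates most of them, which is one reason expansions are often planned as doublings.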
