This article is from the official account: Java Universe (wechat id: Javagogo)

The original link: mp.weixin.qq.com/s/9Cwf6dQQ5… Author: Pan Xinyu

Do we have to split tables and databases?

Of course not.

Although many Internet companies are large and have many users, don’t be misled by this. In fact, for more than 90% of systems, growing to millions or tens of millions of rows is already doing well. Open-source MySQL handles tens of millions of rows comfortably, let alone commercial databases.

In addition, when the data grows to a certain size, some of it can be handled at the business level. For example, depending on the business characteristics, invalid data, soft-deleted data, and data the business will no longer query can be archived together. This is a low-cost and effective approach.
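The archiving idea can be sketched in a few lines. This is only an illustration under assumptions: the `orders` and `orders_archive` tables, the `is_deleted` flag, and the use of SQLite as a stand-in for MySQL are all hypothetical and not taken from the original article.

```python
import sqlite3


def archive_soft_deleted(conn):
    """Move soft-deleted rows from orders to orders_archive; return rows moved."""
    # One transaction: either both the copy and the delete happen, or neither.
    with conn:
        conn.execute(
            "INSERT INTO orders_archive (id, user_id, amount) "
            "SELECT id, user_id, amount FROM orders WHERE is_deleted = 1"
        )
        cur = conn.execute("DELETE FROM orders WHERE is_deleted = 1")
        return cur.rowcount


def demo_archive():
    """Build a tiny hypothetical data set and archive it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, "
        "amount INTEGER, is_deleted INTEGER DEFAULT 0)"
    )
    conn.execute(
        "CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, "
        "user_id INTEGER, amount INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders (id, user_id, amount, is_deleted) VALUES (?, ?, ?, ?)",
        [(1, 10, 500, 0), (2, 10, 300, 1), (3, 11, 800, 1)],
    )
    moved = archive_soft_deleted(conn)
    live = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    archived = conn.execute("SELECT COUNT(*) FROM orders_archive").fetchone()[0]
    return moved, live, archived
```

In a real system the archive table would more likely live in cheaper storage and the job would run in batches; the sketch only shows that the hot table shrinks while no data is lost.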

Do we have to split the database?

Splitting the database certainly solves the storage problem. Suppose a single database can store at most 20 million rows. After splitting, the storage architecture becomes the multi-database architecture shown below. Each database can store 20 million rows, which raises the capacity ceiling.

Capacity increases, but the split brings several other problems.

  • Data in different databases can no longer be queried together directly by the database. For example, a query that spans several databases has to be run once per database, or the data has to be aggregated into another store and queried there.

  • The more databases there are, the greater the potential for problems and the higher the maintenance costs.

  • Cross-database transactions cannot be guaranteed by the database itself; only eventual consistency can be achieved, and only with the help of other middleware.
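To make the first problem concrete, here is a minimal sketch of what a query looks like once the rows live in separate databases. Two in-memory SQLite databases stand in for two split databases; the `orders` table and the `seller_id` and `amount` columns are hypothetical names chosen for the example.

```python
import sqlite3


def make_db(rows):
    """Create one split database holding the given (id, seller_id, amount) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, seller_id INTEGER, amount INTEGER)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return conn


def seller_total(databases, seller_id):
    """Sum a seller's order amounts by querying every database and merging in code."""
    total = 0
    for conn in databases:
        row = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE seller_id = ?",
            (seller_id,),
        ).fetchone()
        total += row[0]
    return total
```

A single `SELECT SUM(...)` used to be enough; after the split the application has to fan the query out and combine the partial results itself, and the cost grows with the number of databases.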

Therefore, when solving the capacity problem, choose according to the business scenario. Splitting the database is not the only option; splitting tables is also a choice.

Table splitting means that all the data stays in the same database instance, but the original large table is divided, according to certain rules, into many tables with fewer rows.

The difference between table splitting and database splitting is that after table splitting the sub-tables are still in the original database, whereas after database splitting the sub-tables are moved to new database instances that are physically deployed separately. The table-splitting architecture is shown in the figure below:
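One common "certain rule" is to take the order ID modulo the number of sub-tables. The following is a minimal sketch under that assumption; the `orders_N` naming and the default of four sub-tables are hypothetical, not something the original article specifies.

```python
def sub_table_name(order_id: int, table_count: int = 4) -> str:
    """Return the sub-table that holds the given order.

    All sub-tables live in the same database instance; only the table
    name changes from one order to another.
    """
    if table_count <= 0:
        raise ValueError("table_count must be positive")
    return f"orders_{order_id % table_count}"
```

With four sub-tables, `sub_table_name(10)` gives `orders_2`. The application builds its SQL against that name and still talks to a single database.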

Whether it is a ride-hailing order, a payment order in e-commerce, or a payment order for take-out or group buying, orders are the most important part of the back-end service because they are tied directly to the company’s revenue. Therefore, I will take the order business as the case for analysis.

Assume that orders are numerous but each order carries only a small amount of data. This situation suits table splitting. Many rows of small records make writes slow (because indexes have to be maintained) and queries slow, but the overall disk footprint stays manageable.

With table splitting, a large table becomes several small tables, which reduces the cost of maintaining indexes on write and improves query performance on each small table. If database splitting is used instead, the write and query problems are also solved, but each table occupies very little disk space, so resources are wasted. The comparison of the two schemes is shown in the following figure:

In the actual order scenario, the information a user submits with an order has to be recorded in detail, so a single order carries a large amount of data, and the case of many rows with little data per row does not arise. In other write-heavy services, however, that case is common, and there you may prefer table splitting, because it not only solves the capacity problem but also, to a certain extent, solves the three problems that database splitting brings.

After table splitting, you can still complete rich queries with joins, which is much simpler than after database splitting.

The sub-tables are still stored in one database, and there are no additional databases to manage. No database-splitting middleware has to be introduced, so maintenance and development costs are low.

Transactions can also be handled within the same database.
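The three points can be sketched with two sub-tables inside one SQLite database. The table names, columns, and modulo rule are hypothetical, and a `UNION ALL` stands in here for whatever join or combined query the business needs; the point is only that one SQL statement and one local transaction can span all sub-tables.

```python
import sqlite3

TABLE_COUNT = 2  # hypothetical number of sub-tables


def sub_table(order_id):
    """Pick the sub-table for an order by simple modulo."""
    return f"orders_{order_id % TABLE_COUNT}"


def open_db():
    """Create all sub-tables inside a single database."""
    conn = sqlite3.connect(":memory:")
    for i in range(TABLE_COUNT):
        conn.execute(
            f"CREATE TABLE orders_{i} "
            "(id INTEGER PRIMARY KEY, user_id INTEGER, amount INTEGER)"
        )
    return conn


def insert_orders_atomically(conn, orders):
    """Insert orders into their sub-tables inside one local transaction."""
    with conn:  # commits on success, rolls back everything on error
        for order_id, user_id, amount in orders:
            conn.execute(
                f"INSERT INTO {sub_table(order_id)} VALUES (?, ?, ?)",
                (order_id, user_id, amount),
            )


def user_total(conn, user_id):
    """Sum a user's orders across all sub-tables with a single SQL statement."""
    parts = " UNION ALL ".join(
        f"SELECT amount FROM orders_{i} WHERE user_id = ?" for i in range(TABLE_COUNT)
    )
    row = conn.execute(
        f"SELECT COALESCE(SUM(amount), 0) FROM ({parts})", [user_id] * TABLE_COUNT
    ).fetchone()
    return row[0]
```

If any insert in a batch fails, the whole batch is rolled back by the database itself; no distributed transaction or extra middleware is involved.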

How to choose the database-splitting dimension?

If you do want to split the database, how do you do it?

The first problem to solve is choosing the dimension by which to split.

The splitting dimension determines whether some queries can be served directly by a single database and whether data skew occurs.

Two common ways of splitting the database, each along a different dimension, are introduced here:

  1. Split by the field that directly satisfies the most important business scenario
  2. Split randomly by the finest granularity

Split by the field that directly satisfies the most important business scenario

In the order business, every order belongs to a user. You can split the database by the field that identifies the user who owns the order, so that all orders of the same user land in the same database.
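Routing by the owning user can be sketched as follows. The hash function (SHA-256), the string user ID, and the count of four databases are assumptions made for the example; the article does not say how the user field is mapped to a database.

```python
import hashlib


def database_for_user(user_id: str, database_count: int = 4) -> int:
    """Map a user ID to a database index in a stable way.

    Every order carries its buyer's user ID, so all orders of one user
    are routed to the same database.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % database_count
```

Because the result depends only on the user ID, "list my orders" for a buyer touches exactly one database, while a seller's orders end up spread over all of them.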

The layout after the split is as follows:

The order module not only provides the interface for submitting orders, but also lets sellers query and modify their own orders. Once the database is split by buyer, these seller-side query and modification requirements can no longer be served directly.

Here, please consider a question: what is the most important function of the order module?

The answer is to ensure that the order functions used by the customer (i.e. the buyer) work properly, such as placing an order, viewing the order immediately after placing it (without delay), and viewing the lists of orders awaiting payment, shipment, and delivery.

Relatively speaking, the functions used by the seller of the goods are not the highest priority. When we have to choose between seller functions and buyer functions, the seller is willing to take the lower priority; after all, the seller is the one who benefits from the sale.

After splitting by buyer, the buyer’s usage scenarios can be served directly by the split databases, without resorting to heterogeneous data copies (which lag behind the source) or other means, and this gives users a better experience. In addition, within a single database it is easy to modify several records of the same user, so there is no distributed transaction problem.

From the order example above, we can abstract a splitting criterion: choose the splitting field that directly satisfies the most important business scenario.

Many other businesses follow this guideline, for example:

  • For user-generated content (UGC) businesses such as Weibo and Zhihu, the database is split by user, because users check their own list right after posting something new.

  • In a payment system, payment records are likewise split by user.

  • On the technical side, monitoring data for microservices is split by microservice. The monitoring data of one microservice is stored in one database, so you can view all of that microservice’s monitoring data directly in a single database.

Although splitting this way directly satisfies the most important scenario, data skew may occur. For example, a very large customer (such as an enterprise customer) may place a huge number of orders, so one of the split databases ends up holding a huge amount of data, and the situation from before the split reappears. This is an extreme case.

Split randomly by the finest granularity

To deal with skew, you can split by the finest-grained unique identifier of the data.

The unique identifier of an order is the order number. When the order number is used as the splitting key, a user’s orders are hashed and distributed randomly and evenly across the databases. This solves the problem of uneven data in a single database.

Similarly:

  • Each microblog post of a user is assigned to a database at random;

  • Each payment record of a user is assigned to a database at random;

  • Each monitoring data point of the same microservice is assigned to a database at random.
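The difference between the two dimensions shows up clearly with a skewed data set. The sketch below is illustrative only: the data (one enterprise buyer with 9,000 orders plus 1,000 ordinary buyers with one order each), the SHA-256 hash, and the four databases are all invented for the example.

```python
import hashlib
from collections import Counter


def database_index(key: str, database_count: int) -> int:
    """Map any string key to a database index with a well-mixed hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % database_count


def distribution(orders, key_field, database_count=4):
    """Count how many orders each database receives for a given splitting field."""
    counts = Counter(database_index(o[key_field], database_count) for o in orders)
    return [counts.get(i, 0) for i in range(database_count)]


def sample_orders():
    """Hypothetical data: one huge enterprise buyer and many small buyers."""
    orders = [{"order_no": f"E{i}", "user_id": "enterprise-1"} for i in range(9000)]
    orders += [{"order_no": f"N{i}", "user_id": f"user-{i}"} for i in range(1000)]
    return orders
```

Calling `distribution(sample_orders(), "user_id")` puts all 9,000 enterprise orders into one database, whereas `distribution(sample_orders(), "order_no")` spreads the same orders over all four.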

Although splitting by the finest granularity solves the data balance problem, it brings other problems.

  • First, no query dimension other than the finest-grained key is supported directly. Other queries have to be served from heterogeneous data copies (the same data stored again under another dimension), but such copies lag behind the source, and the delay hurts the business.

  • Second, duplicate prevention can no longer be enforced at the database level. For example, the requirement that a user can pay only once for the same order cannot be satisfied directly once the payment system splits the database by payment number. Different payment records for the same order are scattered across different databases, so a unique index on the order number can no longer prevent duplicate payment.
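The duplicate-payment problem can be reproduced in a few lines. In this sketch the `payments` table, its columns, and the index name are hypothetical, SQLite stands in for the real database, and the two payment numbers are simply assumed to hash to different databases.

```python
import sqlite3


def make_payment_db():
    """Create one payment database with a unique index on the order number."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (payment_no TEXT PRIMARY KEY, order_no TEXT)")
    conn.execute("CREATE UNIQUE INDEX uk_order_no ON payments (order_no)")
    return conn


def try_pay(conn, payment_no, order_no):
    """Record a payment; return False if the database rejects it as a duplicate."""
    try:
        with conn:
            conn.execute("INSERT INTO payments VALUES (?, ?)", (payment_no, order_no))
        return True
    except sqlite3.IntegrityError:
        return False


# Single database: the unique index blocks the second payment for order O1.
single = make_payment_db()
single_results = [try_pay(single, "P1", "O1"), try_pay(single, "P2", "O1")]

# Split by payment number: P1 and P2 are assumed to land in different
# databases, so each local unique index sees order O1 only once.
shard_a, shard_b = make_payment_db(), make_payment_db()
split_results = [try_pay(shard_a, "P1", "O1"), try_pay(shard_b, "P2", "O1")]
```

Here `single_results` is `[True, False]` while `split_results` is `[True, True]`: every database still has its unique index, yet the duplicate payment goes through, so the check has to move to the application or to another component.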

All in all, both ways of splitting the database solve some problems while introducing new ones. In architecture, no single solution solves every problem; what matters is choosing the solution that best fits the scenario.

