Start with a few questions:

(1) When do we need to split databases and tables? What are the criteria?

(2) How much data can a single table store?

(3) Data is growing very fast. How much data must be generated per day before we need to consider splitting databases and tables?

Are you clear on all of these questions? You should have answers after reading this article.

Why split databases and tables?

The first answer to the question of why we need to split databases and tables is simple: database performance bottlenecks. In plain English, the database is on the verge of collapse.

A database performance bottleneck shows up externally in the following ways:

  • A large number of requests are blocked: In a high-concurrency scenario, a large number of requests need to operate on the database. As a result, the database runs out of connections and requests are blocked.
  • SQL operations become slow: If a table in the database holds hundreds of millions of rows, a SQL statement that fails to hit an index triggers a full table scan, which takes a long time.
  • Storage problems: Business volume grows rapidly and the amount of data in a single database keeps increasing, which puts great pressure on storage.

From the machine's perspective, a performance bottleneck comes down to CPU, memory, disk, network, and so on. The simplest and crudest fix is to upgrade the machine, but the cost is often too high relative to the benefit, so the key is to start from the software side.

Database optimization options

There are many database optimization schemes, which fall into two categories: software level and hardware level.

The software level includes SQL tuning, table structure optimization, read/write separation, database clustering, splitting databases and tables, and so on.

The hardware level is mainly about upgrading the machine.

SQL tuning

SQL tuning is often the first step in solving a database problem, and it usually delivers a large gain for little effort.

The main purpose of SQL tuning is to make slow SQL as fast as possible, which in practice mostly means making sure SQL execution hits an index.

Enable the slow query log

If you are using MySQL, you need to set a few parameters in the MySQL configuration file.

slow_query_log=on
long_query_time=1
slow_query_log_file=/path/to/log

Tuning tools

The explain command is commonly used to view the execution plan of a SQL statement. By looking at its output, it is easy to tell whether the statement performs a full table scan or hits an index.

explain select id, age, gender from user where name = 'laughter architect';

The output has a column called “type”. Common values are:

ALL, index, range, ref, eq_ref, const, system, NULL (left to right, from worst to best performance)

ALL means the statement performs a full table scan and needs to be optimized. Generally speaking, a query should reach at least the range level.
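
To see the difference between a full table scan and an index hit without a MySQL server at hand, here is a minimal sketch. It swaps in SQLite (from Python's standard library) for MySQL, so the command is EXPLAIN QUERY PLAN rather than MySQL's explain and the output wording differs. The sample rows and the index name idx_user_name are made up for illustration.

```python
import sqlite3

# SQLite stands in for MySQL here; EXPLAIN QUERY PLAN plays the role of explain.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, gender TEXT)"
)
conn.executemany(
    "INSERT INTO user (name, age, gender) VALUES (?, ?, ?)",
    [(f"user{i}", 20 + i % 30, "f" if i % 2 else "m") for i in range(1000)],
)

QUERY = "SELECT id, age, gender FROM user WHERE name = ?"


def plan(connection, sql, params):
    """Return the human-readable plan text for a statement."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the plan description.
    return " | ".join(row[-1] for row in rows)


before = plan(conn, QUERY, ("user42",))  # no index on name yet
conn.execute("CREATE INDEX idx_user_name ON user (name)")  # hypothetical index name
after = plan(conn, QUERY, ("user42",))   # the same query after adding the index

print("before:", before)
print("after: ", after)
```

The first plan reports a scan of the whole table, the second a search that uses the index, which is the same distinction the “type” column expresses in MySQL.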

Table structure optimization

Take a scenario as an example:

Suppose the database has a user table with a nickname field and an order table that references users by user_id, and order queries need to display the user's nickname. In general, a join is used to query the user table while querying the order table, so as to obtain the user nickname.

But as business volume grows, the order table and the user table are bound to grow sharply, and joining the two tables becomes expensive: just to fetch one nickname field, the query has to join against a user table with tens of millions of rows, and you can imagine how slow that is.

At this point, you can try adding the nickname field to the order table (order_id, user_id, nickname), which is often called a redundant field. The advantage is that displaying an order list no longer requires a join against the user table.

The drawback of redundant fields is that updating the field means updating multiple tables at the same time, so when choosing redundant fields, prefer fields that are not updated frequently.
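
The trade-off can be sketched in a few lines. This is only an illustration under assumptions: SQLite stands in for the real database, the order table is named orders because ORDER is a reserved word, and rename_user is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user (id INTEGER PRIMARY KEY, nickname TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, nickname TEXT);
    INSERT INTO user VALUES (1, 'alice');
    INSERT INTO orders VALUES (100, 1, 'alice');
    """
)

# Without the redundant field: every order query joins the user table.
joined = conn.execute(
    "SELECT o.order_id, u.nickname FROM orders o JOIN user u ON u.id = o.user_id"
).fetchall()

# With the redundant field: the order table answers the query by itself.
direct = conn.execute("SELECT order_id, nickname FROM orders").fetchall()


def rename_user(connection, user_id, nickname):
    """The cost of redundancy: a nickname change must update both tables."""
    with connection:  # one local transaction keeps the two copies consistent
        connection.execute(
            "UPDATE user SET nickname = ? WHERE id = ?", (nickname, user_id)
        )
        connection.execute(
            "UPDATE orders SET nickname = ? WHERE user_id = ?", (nickname, user_id)
        )


rename_user(conn, 1, "alice2")
```

Both queries return the same rows, but only the first one touches the user table. The extra UPDATE in rename_user is the price, which is why rarely updated fields make the best redundant fields.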

Architecture optimization

When a single database instance can no longer carry the load, we can add instances to form a cluster that serves requests.

When read requests clearly outnumber write requests, we can let the master instance handle writes and let the slave instances serve reads.
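
A minimal sketch of the routing decision behind read/write separation follows. The class ReadWriteRouter and the instance names are hypothetical, and real middleware also has to deal with transactions and replication lag, which this sketch ignores.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the master and spread reads across the slaves."""

    WRITE_VERBS = ("insert", "update", "delete", "replace")

    def __init__(self, master, slaves):
        self.master = master
        # Round-robin over the slaves; fall back to the master if there are none.
        self._readers = itertools.cycle(slaves or [master])

    def route(self, sql):
        verb = sql.lstrip().split(None, 1)[0].lower()
        if verb in self.WRITE_VERBS:
            return self.master
        return next(self._readers)


router = ReadWriteRouter("master", ["slave1", "slave2"])
targets = [
    router.route("SELECT * FROM user WHERE id = 1"),
    router.route("UPDATE user SET age = 30 WHERE id = 1"),
    router.route("SELECT * FROM user WHERE id = 2"),
]
print(targets)  # ['slave1', 'master', 'slave2']
```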

If read pressure on the instances is still high, you can add a cache such as Redis in front of the database so that requests hit the cache first, which reduces database access.
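
The read path with a cache in front can be sketched roughly as follows. A plain dict stands in for Redis here, and CachedUserReader is a hypothetical name; expiry and cache sizing are left out.

```python
class CachedUserReader:
    """Cache-aside read path: try the cache, fall back to the database."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db  # function: user_id -> row or None
        self._cache = {}                   # stand-in for Redis
        self.db_hits = 0

    def get(self, user_id):
        if user_id in self._cache:
            return self._cache[user_id]
        self.db_hits += 1
        row = self._load_from_db(user_id)
        if row is not None:
            self._cache[user_id] = row
        return row

    def invalidate(self, user_id):
        """Call after a write so the next read reloads fresh data."""
        self._cache.pop(user_id, None)


fake_db = {1: {"id": 1, "name": "alice"}}
reader = CachedUserReader(fake_db.get)
first = reader.get(1)   # goes to the database
second = reader.get(1)  # served from the cache
print(reader.db_hits)   # 1
```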

If the database is still the bottleneck after the cache has taken over part of the load, you can consider splitting databases and tables, which is discussed later.

Hardware optimization

Hardware is expensive, so it is generally not practical to upgrade hardware every time the database hits a performance bottleneck.

Upgrading hardware can greatly improve database performance early on, when business volume is small. At a later stage, the gains from hardware upgrades become much less noticeable.

Splitting databases and tables in detail

Here we take a mall system as an example to explain how the database evolves step by step.

Single application single database

In the early startup stage, when you want to build a mall system, it is basically one system that contains multiple basic functional modules and is packaged into a single WAR for deployment, which is a typical monolithic application.

Figure: The mall project uses a single database

As shown in the figure above, the mall system includes the home page Portal module, user module, order module, inventory module, and so on. All modules share one database, which usually contains many tables.

Since the user base was small, this architecture worked perfectly well in the early days, when developers could take a demo around to pitch investors.

Once investment is secured, the business needs to be rolled out at scale, and the system architecture needs to keep up with the rapid growth of the business.

Multiple applications and single database

In the early stage, in order to seize the market, the system kept iterating, the codebase kept growing, and the architecture became more and more bloated. Now, with access pressure increasing, splitting the system is imperative.

To keep the business running smoothly, the architecture is also refactored in several stages.

In the first stage, the monolithic mall system is split into sub-services by functional module, such as the Portal service, user service, order service, inventory service, and so on.

Figure: Multiple applications and a single database

As shown in the figure above, multiple services share one database, so that the underlying database access logic is left untouched and the impact is minimized.

Multiple applications and multiple databases

As business promotion ramps up, the database finally becomes the bottleneck, and having multiple services share one database is no longer feasible. We need to move the tables belonging to each service into a separate database, which is what “database splitting” actually means.

A single database can support only a limited amount of concurrency. After splitting into multiple databases, services no longer compete with each other for the same database, which improves service performance.

Figure: Multiple applications and multiple databases

As shown in the figure above, splitting one large database into several smaller databases, one per service, is a necessary “database splitting” operation once the system reaches a certain stage of development.

The same applies to the currently popular microservices architecture: if you only split the applications and not the database, you do not solve the fundamental problem, and the whole system will still hit a bottleneck easily.

Table splitting

That covers database splitting, so when should tables be split?

If the system is growing fast (take the mall system, where daily order volume may reach hundreds of thousands), the order table in the database grows particularly quickly, and once it reaches a certain size, query efficiency drops noticeably.

Therefore, when a single table grows too fast, the figure commonly quoted in the industry is to consider splitting the table once it exceeds 5 million rows. Of course, 5 million is just a rule of thumb, and you should decide based on your actual situation.
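
To put the rule of thumb in perspective, a quick back-of-the-envelope calculation helps. The figure of 200,000 orders per day below is a hypothetical value picked from the "hundreds of thousands" range, and days_until_threshold is an illustrative helper, not a formula from any tool.

```python
import math


def days_until_threshold(current_rows, rows_per_day, threshold=5_000_000):
    """Days of growth left before a table reaches the row threshold."""
    if rows_per_day <= 0:
        raise ValueError("rows_per_day must be positive")
    remaining = max(threshold - current_rows, 0)
    return math.ceil(remaining / rows_per_day)


# Hypothetical order table: empty today, 200,000 new orders per day.
print(days_until_threshold(0, 200_000))  # 25
```

At that growth rate, an unsplit order table passes the 5 million mark in under a month, so table splitting has to be planned very early for such a table.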

How to split tables?

Table splitting has several dimensions: one is horizontal versus vertical splitting, the other is single-database versus multi-database splitting.

Horizontal split and vertical split

Take the user table (user) as an example. The table has fields including id, name, age, sex, nickname, and description. If nickname and description are rarely used, we can move them into another table, the user details table. The user table is thus split into a user basic information table plus a user details table, and the two tables have different structures and are independent of each other. Seen from this angle, however, a vertical split does not fundamentally solve the problem of too much data in a single table, so we still need a horizontal split.
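
As a rough illustration of the vertical split, the sketch below uses SQLite in place of the real database. The table names user_base and user_detail are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user (
        id INTEGER PRIMARY KEY, name TEXT, age INTEGER, sex TEXT,
        nickname TEXT, description TEXT
    );
    INSERT INTO user VALUES (1, 'alice', 30, 'f', 'ali', 'likes hiking');

    -- Frequently read columns stay in the basic information table.
    CREATE TABLE user_base AS SELECT id, name, age, sex FROM user;
    -- Rarely read columns move to the details table, keyed by the same id.
    CREATE TABLE user_detail AS SELECT id, nickname, description FROM user;
    """
)

base_cols = [row[1] for row in conn.execute("PRAGMA table_info(user_base)")]
detail_cols = [row[1] for row in conn.execute("PRAGMA table_info(user_detail)")]
print(base_cols)    # ['id', 'name', 'age', 'sex']
print(detail_cols)  # ['id', 'nickname', 'description']
```

Each table is narrower than the original, but both still hold one row per user, which is why the row count problem remains.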

There is another way to split the table. For example, if the table has 10,000 rows, we can split it into two tables: rows with odd ids (1, 3, 5, 7…) go into user1, and rows with even ids (2, 4, 6, 8…) go into user2. This kind of split is a horizontal split.
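
The odd/even rule is just a routing function. A minimal sketch, with table_for_user as a hypothetical helper name:

```python
def table_for_user(user_id):
    """Odd ids live in user1, even ids in user2."""
    return "user1" if user_id % 2 == 1 else "user2"


placement = {}
for user_id in range(1, 9):
    placement.setdefault(table_for_user(user_id), []).append(user_id)

print(placement)  # {'user1': [1, 3, 5, 7], 'user2': [2, 4, 6, 8]}
```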

There are many ways to split horizontally. Besides splitting by id as described above, you can also split along the time dimension. An order table, for example, can be split by day, by month, and so on.

  • Daily table: Stores only the data of the current day.
  • Monthly table: A scheduled task can migrate all data of the previous day to the monthly table.
  • History table: A scheduled task can also migrate data that is more than 30 days old to the history table.
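
Read as a placement rule, the three tables above could look like the sketch below. The table names order_daily, order_monthly, and order_history are assumptions, and the function describes where a row should end up once the scheduled tasks have run, not the migration itself.

```python
from datetime import date, timedelta


def order_table_for(created_on, today):
    """Pick the table that should hold an order created on `created_on`."""
    age_days = (today - created_on).days
    if age_days <= 0:
        return "order_daily"    # current day's data
    if age_days <= 30:
        return "order_monthly"  # migrated out of the daily table
    return "order_history"      # more than 30 days old


today = date(2024, 3, 15)  # arbitrary example date
print(order_table_for(today, today))                       # order_daily
print(order_table_for(today - timedelta(days=1), today))   # order_monthly
print(order_table_for(today - timedelta(days=45), today))  # order_history
```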

To summarize the characteristics of horizontal and vertical splitting:

  • Vertical split: Divides by table or by field, so the resulting table structures differ.
  • Horizontal split: Divides by data, so the table structure is the same but the data differs.

Single-database split and multi-database split

Take horizontal splitting as an example: each table is split into multiple sub-tables that reside in the same database. For example, the user table below is split into the user1 and user2 tables.

Figure: Single-database split

Splitting a table into sub-tables within one database solves the single-table query performance problem to some extent, but it runs into another problem: the storage bottleneck of a single database.

So the more common practice in the industry is to spread the sub-tables across multiple databases. For example, in the following figure, the user table is split into two sub-tables, each of which lives in a different database.

Figure: Multi-database split
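
With sub-tables spread over several databases, routing has to return a database as well as a table. The sketch below assumes one sub-table per database and hypothetical names db1/db2. It keeps the odd/even convention for the two-way case, and the modulo rule is only one possible choice.

```python
def locate_user(user_id, db_count=2):
    """Map a user id to (database, table), one sub-table per database."""
    shard = user_id % db_count            # 0 .. db_count - 1
    index = shard if shard else db_count  # number the shards 1 .. db_count
    return f"db{index}", f"user{index}"


print(locate_user(7))  # ('db1', 'user1')
print(locate_user(8))  # ('db2', 'user2')
```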

Summary: The main purpose of table splitting is to reduce the size of a single table and solve the performance problem caused by the amount of data in a single table.

The complexity of splitting databases and tables

Since splitting databases and tables is so good, should we adopt it right at the start of a project? Not so fast. Splitting does solve many problems, but it also introduces a lot of complexity to the system, which is briefly described below.

(1) Cross-database join queries

Before splitting, we can easily use a join to query data across multiple tables. After splitting, the two tables may no longer be in the same database, so how do we join them?

There are several solutions:

  • Field redundancy: Put the fields to be joined into the main table to avoid the join.
  • Data abstraction: Aggregate the data through ETL or similar means to generate new tables.
  • Global tables: Some basic tables, for example, can be placed in every database.
  • Application-layer assembly: Query the basic data separately and let the application compute and assemble the result.
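
The last option, assembling the result in the application, can be sketched as two queries plus an in-memory merge. Two in-memory SQLite databases stand in for the separate order and user databases; the table name orders and the helper orders_with_nickname are assumptions for illustration.

```python
import sqlite3

# Two separate in-memory databases stand in for the order and user databases.
order_db = sqlite3.connect(":memory:")
user_db = sqlite3.connect(":memory:")
order_db.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO orders VALUES (100, 1), (101, 2), (102, 1);
    """
)
user_db.executescript(
    """
    CREATE TABLE user (id INTEGER PRIMARY KEY, nickname TEXT);
    INSERT INTO user VALUES (1, 'alice'), (2, 'bob');
    """
)


def orders_with_nickname():
    """Replace a cross-database JOIN with two queries and an in-memory merge."""
    orders = order_db.execute(
        "SELECT order_id, user_id FROM orders ORDER BY order_id"
    ).fetchall()
    user_ids = sorted({user_id for _, user_id in orders})
    if not user_ids:
        return []
    placeholders = ",".join("?" for _ in user_ids)
    nicknames = dict(
        user_db.execute(
            f"SELECT id, nickname FROM user WHERE id IN ({placeholders})", user_ids
        ).fetchall()
    )
    return [(order_id, nicknames.get(user_id)) for order_id, user_id in orders]


print(orders_with_nickname())  # [(100, 'alice'), (101, 'bob'), (102, 'alice')]
```

Fetching all needed users in one IN query, instead of one query per order, keeps the number of round trips to the user database constant.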

(2) Distributed transactions

Within a single database, local transactions are enough; across multiple databases, only distributed transactions can do the job.

Common solutions include reliable-message (MQ) based solutions, two-phase commit, flexible transactions, and so on.
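
To give a feel for one of these, here is a toy model of two-phase commit. It is an in-memory sketch with hypothetical class and database names; it leaves out everything that makes the real protocol hard, such as timeouts, coordinator crashes, and durable logs.

```python
class Participant:
    """One database taking part in a two-phase commit (toy in-memory model)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote yes only if the local work can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Commit everywhere only if every participant votes yes in phase 1."""
    votes = [p.prepare() for p in participants]
    if all(votes):
        for p in participants:
            p.commit()    # Phase 2: commit on every participant
        return True
    for p in participants:
        p.rollback()      # Phase 2: roll back on every participant
    return False


ok_group = [Participant("order_db"), Participant("stock_db")]
bad_group = [Participant("order_db"), Participant("stock_db", can_commit=False)]
print(two_phase_commit(ok_group), [p.state for p in ok_group])
print(two_phase_commit(bad_group), [p.state for p in bad_group])
```

In the second group a single "no" vote is enough to roll back the participant that had already prepared successfully.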

(3) Sorting, paging, and function calculation

When using SQL keywords such as order by and limit, special handling is needed. The usual approach follows the sharding idea:

First run the corresponding function on each shard, then aggregate the result sets from all shards and compute again to obtain the final result.
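
For order by with limit, the two steps can be sketched as below. Plain Python lists stand in for the shards and paged_query is a hypothetical helper. The key detail is that every shard has to return its first offset + limit rows, not just limit rows, because any of them might belong to the requested page.

```python
import heapq
from itertools import islice


def paged_query(shards, offset, limit, key=lambda row: row):
    """Emulate ORDER BY key LIMIT offset, limit across several shards."""
    # Step 1: each shard sorts locally and returns its first offset + limit rows.
    partials = [sorted(shard, key=key)[: offset + limit] for shard in shards]
    # Step 2: merge the sorted partial results and cut out the requested page.
    merged = heapq.merge(*partials, key=key)
    return list(islice(merged, offset, offset + limit))


shard_a = [5, 1, 9, 7]
shard_b = [2, 8, 3, 6]
print(paged_query([shard_a, shard_b], offset=2, limit=3))  # [3, 5, 6]
```

This also shows why deep pages are expensive after sharding: the amount of data each shard returns grows with the offset.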

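(4) Distributed ID

With MySQL, a single database with a single table can use an auto-increment id as the primary key. After splitting databases and tables this no longer works, because each table generates ids independently and the ids will collide.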

Common distributed ID solutions are:

  • UUID
  • A separately maintained ID table based on database auto-increment
  • Segment mode
  • Redis cache
  • Snowflake algorithm
  • Baidu uid-generator
  • Meituan Leaf
  • Didi Tinyid
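
As a rough illustration of one item in this list, here is a minimal sketch of a Snowflake-style generator. It assumes the commonly described layout (41-bit millisecond timestamp, 10-bit worker id, 12-bit sequence) and an arbitrary custom epoch. Production implementations differ in details such as how they handle clock rollback.

```python
import threading
import time


class SnowflakeGenerator:
    """64-bit ids: 41-bit ms timestamp | 10-bit worker id | 12-bit sequence."""

    EPOCH_MS = 1_577_836_800_000  # 2020-01-01 UTC, an arbitrary custom epoch
    WORKER_BITS = 10
    SEQUENCE_BITS = 12
    MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

    def __init__(self, worker_id):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = time.time_ns() // 1_000_000
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & self.MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted in this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = time.time_ns() // 1_000_000
            else:
                self._sequence = 0
            self._last_ms = now
            time_part = (now - self.EPOCH_MS) << (self.WORKER_BITS + self.SEQUENCE_BITS)
            return time_part | (self.worker_id << self.SEQUENCE_BITS) | self._sequence


generator = SnowflakeGenerator(worker_id=1)
ids = [generator.next_id() for _ in range(10_000)]
```

Each database or service instance gets its own worker_id, so ids generated on different shards cannot collide, and ids from one generator increase over time.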

These solutions will be described in a later article and are not covered here.

(5) Multiple data sources

After splitting databases and tables, the application may need to fetch data from multiple databases or sub-tables. The usual solutions are client-side adaptation and proxy-layer adaptation.

Middleware commonly used in the industry includes:

  • ShardingSphere (formerly Sharding-JDBC)
  • Mycat

Conclusion

If you run into a database problem, don’t rush to split databases and tables. First see whether it can be solved with conventional methods.

Splitting databases and tables brings huge complexity to the system, so it is not recommended to do it prematurely. As a system architect, you can make your system flexible and extensible, but don’t over-design or design too far ahead. On this point, the architect needs foresight and should anticipate what is coming. Did you get it?

Source: mp.weixin.qq.com/s/gmrrHz3Un…