How to optimize a table with tens of millions of rows is a very technical question. Our intuition usually jumps straight to sharding or partitioning the data. Here I want to supplement and organize that thinking, and summarize some experience in this area; suggestions are welcome.

At first I had only a flash of ideas; through constant self-questioning, I then produced the following outline based on the experience of several teams.

To understand this question, we must go back to basics. I divide it into three parts:

“Tens of millions”, “large table”, “optimization”,

They correspond to the three labels in the diagram: data volume, object, and target.

Let me explain it step by step and give you a series of solutions.

1. Data volume: tens of millions

In fact, "tens of millions" is just an impression of scale, the amount of data as we picture it. We need to refine the concept, because data volume changes with the business over time. We should view this indicator with dynamic thinking and adopt different strategies for different scenarios.

(1) The data volume is tens of millions, and may grow to hundreds of millions or more

This is typical of flow data and logging businesses: the data grows steadily over time, and crossing the ten-million-row threshold is very easy.

(2) The data volume is tens of millions and relatively stable

A relatively stable data volume usually means state-type data. For example, with 10 million users, each user's information is recorded in a corresponding row of the table; as the business grows, this level of data stays relatively stable.

(3) The data volume is tens of millions, but there should not be that much data

In this case we usually discover the data passively, and often too late. For example, you find a configuration table holding tens of millions of rows, or a table where data has accumulated for so long that 99% of it is outdated or garbage.

Data volume gives an overall picture; we also need a closer look at the data itself, which leads to the second part.

2. Objects: data tables

Data processing is like water flowing through multiple channels in the database: the data to be processed flows through them, but its use and ownership differ.

Generally, data is divided into three types according to service type:

(1) Flow data

Flow data is stateless, and there is no correlation between individual business operations. Each operation generates a new record, such as a transaction flow or a payment flow; as long as the new record can be inserted, the operation completes.

(2) State-type data

State-type data is stateful: multiple business operations rely on it, and its accuracy must be guaranteed. For example, the current balance must be read before a payment can succeed.

(3) Configuration data

This type of data has a small amount of data and a simple structure. Generally, it is static data with a low change frequency.

So far we have the overall background. If we want to optimize, we actually face a 3×3 matrix of scenarios (three data-volume patterns × three table types); if we also factor in the table's read/write ratio (read-heavy/write-light, read-light/write-heavy, and so on), the combinations multiply severalfold. Obviously, exhaustive enumeration is neither shown nor necessary; different policies can be formulated according to the data storage and business characteristics.

Instead, we grasp the key points and sort out some common optimization ideas, especially the core idea, which serves as the yardstick for the whole optimization design; the difficulty determines our motivation and the risk of doing it.

| Data volume growth | Table type | Business characteristics | Core optimization idea | Optimization difficulty |
| --- | --- | --- | --- | --- |
| Tens of millions, relatively stable | State table | OLTP business | Scale reads horizontally on demand; no need to split the table | |
| Tens of millions, possibly hundreds of millions or more | Flow table | Historical records of OLTP business | Business separation and distributed storage design | |
| Tens of millions, possibly hundreds of millions or more | Flow table | Data source for OLAP business statistics | Distributed storage designed for statistical query requirements | |
| Tens of millions, but there should not be that much data | Configuration table | General business | Keep it small and simple; avoid bloat | |

As for the optimization schemes, I will introduce them from a business-oriented dimension.

3. Goal: Optimization

At this stage we will talk about the optimization plans. The summary runs a bit long, but it is relatively complete.

The whole is divided into five parts:

In fact, the sharding (split database, split table) scheme we usually talk about is only a small part of these; expanded out, the picture is much richer.

It is not hard to understand why: to support tens of millions of rows of table data, DBAs inevitably maintain more than one large table. How to manage them better, support expansion as the business develops, and guarantee performance at the same time: these are the mountains placed in front of us.


Let’s take a look at the five types of improvement schemes:

Optimization design scheme 1: standard design

The first thing mentioned here is standard design, rather than other lofty-sounding schemes.

Hegel said that order is the first condition of freedom. Standards are especially important in collaborative work scenarios, where team members would otherwise tie each other down and cause problems.

For specification design, I would mention the following specifications, which are actually part of a development specification, for reference.

The essence of a specification is not to solve problems but to effectively eliminate potential ones, and for a large table, complying with the specifications is worthwhile. I have cleaned up some rules below that basically cover design and use. A common example is field design where every column, willy-nilly, becomes varchar(500); that is not a sound implementation. Let's expand on these specifications.

1) Configuration specifications

(1) MySQL database uses InnoDB storage engine by default.

(2) Ensure a uniform character set. Use utf8 for all MySQL database instances, databases, and tables, and use utf8 for all application connection and display settings.

Note: to store emoji data, the MySQL character set must be utf8mb4. In MySQL 8.0 the default is utf8mb4; it can be unified or customized based on the company's service situation.

(3) The default transaction isolation level of MySQL is REPEATABLE-READ (RR). It is recommended to set the isolation level to READ-COMMITTED (RC) during initialization, which is more suitable for OLTP services.

(4) Tables in a database should be reasonably planned and the data volume of a single table controlled. For MySQL, it is recommended to keep a single table within 20 million records.

(5) Within a MySQL instance, keep the number of databases and tables as small as possible: at most 50 databases, and at most 500 tables (including partitioned tables) per database.
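For item (3) above, a minimal sketch of setting RC at initialization (note: the variable is named tx_isolation before MySQL 5.7.20 and transaction_isolation from 5.7.20 on):

```sql
-- In my.cnf (takes effect for new connections):
--   [mysqld]
--   transaction_isolation = READ-COMMITTED

-- Or at runtime, for subsequent sessions:
SET GLOBAL transaction_isolation = 'READ-COMMITTED';

-- Verify the current setting:
SELECT @@transaction_isolation;
```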

2) Build table specifications

(1) InnoDB forbids the use of foreign key constraints, which can be guaranteed at the program level.

(2) Storing exact floating-point numbers must use DECIMAL instead of FLOAT and DOUBLE.

(3) There is no need to define the display width in the definition of an integer. For example, use INT instead of INT(4).

(4) ENUM is not recommended. TINYINT can be used instead.

(5) Avoid TEXT and BLOB types as far as possible. If they must be used, it is recommended to split large or infrequently used descriptive fields into a separate table; in addition, do not use the database to store images or files.

(6) Use YEAR(4) instead of YEAR(2) to store years.

(7) It is recommended to define the field as NOT NULL.

(8) It is suggested that the DBA provide a SQL audit tool, and that table-creation standards be verified through it.

3) Naming conventions

(1) Library, table, field all use lowercase.

(2) The database name, table name, field name, and index name are all lowercase letters and separated by underscores.

(3) It is recommended that the database name, table name, and field name contain no more than 12 characters. (The database name, table name, and field name support a maximum of 64 characters, but for standardization, easy identification, and reduction of transmission, the maximum number of characters is 12.)

(4) Database, table, and field names should be self-explanatory and not require comments.

Table 4-1 lists a brief summary of object naming conventions for your reference.

| Object | English name | MySQL prefix |
| --- | --- | --- |
| View | view | view_ |
| Function | function | func_ |
| Stored procedure | procedure | proc_ |
| Trigger | trigger | trig_ |
| Normal index | index | idx_ |
| Unique index | unique index | uniq_ |
| Primary key index | primary key | pk_ |

4) Index specification

(1) Recommended index naming rules: idx_col1_col2[_coln] for normal indexes and uniq_col1_col2[_coln] for unique indexes (if the field names are too long, abbreviations are advised).

(2) It is recommended that the number of fields in the index not exceed 5.

(3) The index number of a single table should be controlled within 5.

(4) InnoDB tables are generally recommended to have primary key columns, especially as a must in high availability cluster scenarios.

(5) When building compound indexes, give priority to the fields with high selectivity.

(6) Add index to UPDATE and DELETE statements based on WHERE condition.

(7) Do not use %-prefixed fuzzy queries, such as LIKE '%weibo'; they cannot use an index and lead to full table scans.

(8) Make rational use of covering indexes. For example, with a composite index idx_uid_email(uid, email) on table user_email, the query SELECT uid, email FROM user_email WHERE uid=xx; can be answered from the index alone.

(9) Avoid applying functions to indexed fields; otherwise the index cannot be used.

(10) Contact the DBA to confirm whether an index needs to be changed.
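To illustrate the covering-index rule, here is a hedged sketch; the table name user_email comes from the text, while the column list, types, and sizes are assumptions:

```sql
-- Hypothetical table; the composite index covers the query below
CREATE TABLE user_email (
  id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  uid   BIGINT UNSIGNED NOT NULL,
  email VARCHAR(64)     NOT NULL,
  PRIMARY KEY (id),
  KEY idx_uid_email (uid, email)
) ENGINE = InnoDB;

-- Both selected columns live in idx_uid_email, so the query is served
-- from the index alone; EXPLAIN shows "Using index" in the Extra column
EXPLAIN SELECT uid, email FROM user_email WHERE uid = 100;
```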


5) Application specifications

(1) Avoid stored procedures, triggers, custom functions, etc.; they easily couple business logic to the database, which becomes a bottleneck in a later distributed scheme.

(2) Consider UNION ALL and reduce the use of UNION. UNION ALL skips the deduplication and sorting steps, so it is faster than UNION; if deduplication is not needed, prefer UNION ALL.

(3) Prefer LIMIT N; avoid LIMIT M,N, especially on large tables or when M is large.

(4) Reduce or avoid sorting. For example, if a GROUP BY statement does not need sorted output, add ORDER BY NULL. (Before MySQL 8.0, GROUP BY sorted implicitly; in 8.0 it no longer does.)

(5) To count the records in a table, use COUNT(*) rather than COUNT(primary_key) or COUNT(1). Avoid frequent COUNT(*) on large InnoDB tables; for real-time counts use Memcached or Redis, and for non-real-time counts use a separate summary table that is updated periodically.

(6) When you MODIFY COLUMN or CHANGE COLUMN, the original comment attribute must be carried over; otherwise the comment is lost after the modification.

(7) Use Prepared Statement to improve performance and avoid SQL injection.

(8) The number of values in an IN list in a SQL statement should not be too large.

(9) UPDATE and DELETE statements must have explicit WHERE conditions.

(10) The value of the field in the WHERE condition must conform to the data type of the field to avoid implicit type conversion by MySQL.

(11) Do not use SELECT * FROM table_name; list the columns actually needed.

(12) INSERT INTO table_name VALUES (...) must specify the column names explicitly, and the number of value rows in one statement should not be excessive.
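As an illustration of the implicit-conversion rule (10), assume a hypothetical user_info table with a phone column stored as a string:

```sql
-- Hypothetical table: phone stored as VARCHAR, with an index on it
CREATE TABLE user_info (
  id    BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  phone VARCHAR(20)     NOT NULL,
  KEY idx_phone (phone)
) ENGINE = InnoDB;

-- Bad: the numeric literal forces MySQL to cast every row's phone value,
-- so idx_phone cannot be used
SELECT id FROM user_info WHERE phone = 13800000000;

-- Good: the literal matches the column type, so idx_phone is usable
SELECT id FROM user_info WHERE phone = '13800000000';
```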

Optimization design Scheme 2: business layer optimization

Business layer optimization should be the most profitable kind, and it is fully visible to the business layer. It mainly includes business splitting, data splitting, and two common read/write scenarios (read-heavy/write-light and read-light/write-heavy).

1) Business split

  • Split a mixed business into separate businesses

  • Separate state data from historical data

A business split separates a mixed business into clearer independent businesses: business 1, business 2, and so on. The total business volume is still large, but each part is relatively independent, and reliability is still guaranteed.

Let me give an example of separating state data from historical data.

For example, we have a table account, and assume the user's balance is 100.

After the data changes, we need to be able to trace the history of the change. If the account's state data is updated by adding 100, the balance becomes 200.

This procedure may correspond to an UPDATE statement or an INSERT statement.

For this we can split the data across two tables, account and account_hist: account_hist receives two INSERT records (the states before and after the change), while account receives a single UPDATE holding only the current balance.
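A minimal SQL sketch of this separation, using the account and account_hist names from the text (column names and types are assumptions):

```sql
-- Hypothetical schema: state table plus history table
CREATE TABLE account (
  uid     BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  balance DECIMAL(12,2)   NOT NULL
) ENGINE = InnoDB;

CREATE TABLE account_hist (
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  uid        BIGINT UNSIGNED NOT NULL,
  balance    DECIMAL(12,2)   NOT NULL,
  created_at DATETIME        NOT NULL DEFAULT CURRENT_TIMESTAMP,
  KEY idx_uid (uid)
) ENGINE = InnoDB;

-- History table: two INSERTs recording the before/after states
INSERT INTO account_hist (uid, balance) VALUES (1, 100);
INSERT INTO account_hist (uid, balance) VALUES (1, 200);

-- State table: a single UPDATE keeping only the current balance
UPDATE account SET balance = balance + 100 WHERE uid = 1;
```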

This is also a very basic hot/cold separation, which greatly reduces maintenance complexity and improves service response efficiency.

2) Data splitting

2.1 Splitting by date. This method is relatively common, especially along the date dimension; the change at the program level is very small, but the scalability benefit is great.

  • The data is split by date dimension, such as test_20191021
  • The data is broken down into weekly and monthly dimensions, such as test_201910
  • Data is broken down by quarter and year dimensions, such as test_2019
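Assuming a base table named test, the periodic tables above might be created like this (a sketch; routing writes to the current period's table is handled by the application or a scheduler):

```sql
-- Clone the base table definition for each period
CREATE TABLE test_20191021 LIKE test;  -- daily dimension
CREATE TABLE test_201910   LIKE test;  -- monthly dimension
CREATE TABLE test_2019     LIKE test;  -- yearly dimension

-- Expiring an old period is then a cheap DROP instead of a huge DELETE
DROP TABLE test_20190921;
```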

2.2 Using partitioning. Partitioning modes such as hash and range are also commonly used, but I do not recommend MySQL partitioned tables: as storage capacity grows, even though the data is split, it is ultimately difficult to scale horizontally. MySQL has better ways to scale.

2.3 Read more but Write Less Optimization scenario

By using caching technology such as Redis, read requests are served at the cache layer, which greatly reduces the pressure that hotspot-data queries put on the MySQL layer.

2.4 Read less write More optimization scenario, you can adopt three steps:

  1. The asynchronous submission mode is the most intuitive performance improvement for the application layer, resulting in the least synchronous wait.

  2. Using queue technology, a large number of write requests can be extended by queue to achieve batch data writing.

  3. Lowering the write frequency. This is a little harder to understand, so let me give an example.

For business data such as loyalty points, whose priority is lower than money, if updates are too frequent you can widen the update interval (for example, from once a minute to once every 10 minutes) to reduce the update frequency.

For example, suppose a status update sets the points to 200, and further updates keep arriving. If the data is updated very frequently in a short period, say 100 times in one minute with points ranging from 100 to 10000, you can batch-submit the changes by time window: instead of generating 100 transactions (200 SQL statements), the same net effect can be achieved with 2 SQL statements.
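A hedged sketch of the batching idea, with hypothetical user_points and user_points_hist tables: the minute's 100 changes collapse into one transaction that writes only the final state plus one summary history row:

```sql
-- Each original change wrote the new points value and a history row
-- (2 statements per change). Batching keeps only the final state.
START TRANSACTION;
UPDATE user_points
   SET points = 10000            -- final value after the minute's changes
 WHERE uid = 1;
INSERT INTO user_points_hist (uid, points) VALUES (1, 10000);
COMMIT;
```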

Business metrics, such as update frequency details, can be discussed and decided based on specific business scenarios.

Optimization design Scheme 3: architecture layer optimization

Architecture layer optimization is what we think of as the highly technical work; we need to introduce some new components at the architecture level depending on the business scenario.

3.1. System horizontal expansion scenario

3.1.1 Using middleware technology, data routing and horizontal scaling can be realized. Common middleware includes MyCAT, ShardingSphere, ProxySQL, etc.

3.1.2 Using read/write splitting. This addresses the demand for read scaling and focuses more on state tables. Where a certain replication delay is acceptable, read demand can be scaled horizontally across multiple replicas, which can also be implemented with middleware such as MyCAT, ProxySQL, MaxScale, MySQL Router, etc.

3.1.3 Using load balancing technology, such as LVS, or domain-name-based service discovery with Consul.

3.2. For combined OLTP+OLAP business scenarios, NewSQL can be adopted; HTAP technology stacks compatible with the MySQL protocol, such as TiDB, are preferred.

3.3. For offline statistics business scenarios, there are several options.

3.3.1 NoSQL-side systems fall mainly into two categories: data warehouse systems compatible with the MySQL protocol, such as Infobright or ColumnStore, and column-oriented stores in the heterogeneous direction, such as HBase.

3.3.2 Data warehouse systems based on MPP architecture, such as Greenplum, for example for T+1 statistics.

Optimization design scheme 4: database optimization

For database-level optimization there are in fact many cards to play, but relatively speaking the room for improvement is not as large. Let's go through them one by one.

4.1 Transaction Optimization

Choose the transaction model based on the business scenario: does the business strongly depend on transactions?

Let’s give a few small examples of transaction dimension reduction strategies.

4.1.1 Dimension Reduction Strategy 1: Convert stored procedure calls to transparent SQL calls

Using stored procedures is clearly not a good idea for new business. Compared with other commercial databases, MySQL's stored procedures are weaker in functionality and performance, and stored procedures are too "heavy" for today's lightweight business processing.

Some application architectures look distributed in their deployment, but the database layer is still driven by stored procedure calls. Because stored procedures encapsulate a lot of logic, they are hard to debug and not very portable; business logic and performance pressure all land on the database layer, which easily becomes the bottleneck and makes a truly distributed design hard to achieve.

Therefore, a clear direction for improvement is to transform stored procedures into transparent SQL calls, which can greatly improve business processing efficiency; the database interface calls become simple, clear, and controllable.

4.1.2 Dimension Reduction Policy 2: Convert DDL operations to DML operations

Some businesses frequently have urgent needs to add fields to a table, which exhausts DBAs and developers alike. Imagine a table with hundreds of fields, most of them named name1, name2 ... name100; the design is problematic in itself, never mind the performance. The root cause is dynamically changing business requirements: a piece of game equipment has 20 attributes, and a month later it has 40, so every piece of equipment carries 40 attributes whether used or not, with a great deal of redundancy.

In the design specification we already mentioned some basic design elements. To supplement: keep a limited set of fields, and achieve extensibility through configuration instead, for example by converting dynamically added fields into configuration information. Configuration information can be modified and supplemented in DML mode, making data entry more dynamic and easy to extend.
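One hedged way to realize this is an attribute (configuration) table, so new attributes are added with DML instead of DDL; the table and column names here are assumptions:

```sql
-- Narrow attribute table: adding a new equipment attribute is an INSERT,
-- not an ALTER TABLE ... ADD COLUMN on a large table
CREATE TABLE equip_attr (
  equip_id  BIGINT UNSIGNED NOT NULL,
  attr_name VARCHAR(32)     NOT NULL,
  attr_val  VARCHAR(128)    NOT NULL,
  PRIMARY KEY (equip_id, attr_name)
) ENGINE = InnoDB;

INSERT INTO equip_attr VALUES (1001, 'attack', '55');
INSERT INTO equip_attr VALUES (1001, 'durability', '90');
```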

4.1.3 Dimension Reduction Strategy 3: Convert Delete operation to efficient operation

Some businesses need to clean up periodic data regularly, for example keeping only one month of data in a table, with anything older cleaned out. If the table is large, the cost of a DELETE is too high, so we have two solutions to convert the Delete into a more efficient operation.

The first is to build periodic tables according to the business, designed along monthly, weekly, or daily dimensions, so that data cleanup becomes relatively controllable and efficient.

The second option is to use MySQL's RENAME TABLE operation. For example, to clear 99% of the data from a 20-million-row table, the 1% to retain can be quickly filtered into a new table by condition, and the tables then swapped, replacing the table wholesale.
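A sketch of the rename-swap cleanup (table and column names are assumptions; writes should be paused or the delta reconciled during the copy window):

```sql
-- Keep the ~1% of rows that match the retention condition, then swap
-- tables atomically instead of running a huge DELETE
CREATE TABLE big_table_new LIKE big_table;

INSERT INTO big_table_new
SELECT * FROM big_table
 WHERE create_time >= '2019-10-01';   -- retention condition (assumed column)

RENAME TABLE big_table     TO big_table_old,
             big_table_new TO big_table;

-- big_table_old can be archived or dropped at leisure
```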


4.2 SQL optimization

Relatively speaking, SQL calls for minimalist design, and many of the points are already in the standard design; if you comply with the specifications, the problems are eliminated in all likelihood. A few points to supplement:

4.2.1 SQL statement simplification. Simplification is a great tool for SQL optimization: because it is simple, it is superior.

4.2.2 Avoid or eliminate complex multi-table joins as far as possible; large-table joins are the nightmare of large-table processing. Once that door is opened, more and more requirements will demand joins, and there is no turning back for performance optimization. Moreover, large-table joins are a weak point of MySQL, whereas commercial databases have handled them maturely for a long time; even there, problems keep cropping up.

4.2.3 Avoid anti-joins and semi-joins in SQL as far as possible; they are weak spots of the optimizer. What are anti-joins and semi-joins? Simply put, NOT IN and NOT EXISTS are anti-joins, while IN and EXISTS are semi-joins. When these occur on tables with tens of millions of rows, the performance difference can be several orders of magnitude.
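For example, a NOT IN anti-join can often be rewritten as an outer join; the user_info and order_flow names here are hypothetical:

```sql
-- Users with no orders, anti-join form; often slow on very large tables:
SELECT u.uid
  FROM user_info u
 WHERE u.uid NOT IN (SELECT o.uid FROM order_flow o);

-- Equivalent outer-join rewrite, which the optimizer usually handles
-- better (assumes o.uid is NOT NULL and indexed):
SELECT u.uid
  FROM user_info u
  LEFT JOIN order_flow o ON o.uid = u.uid
 WHERE o.uid IS NULL;
```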

4.3 Index Optimization

Index optimization for a large table is a matter of degree.

4.3.1 First, there must be a primary key. This is the first rule of the standard design, and no refutation is accepted.

4.3.2 Secondly, SQL queries are based on indexes or unique indexes, making the query model as simple as possible.

4.3.3 Finally, eliminate range queries as far as possible; range scans should be reduced as much as possible on tables with tens of millions of rows.

Optimization design scheme 5: management optimization

This is probably the most easily overlooked part of the solution, so I put it at the end. Hats off to my operations colleagues, who have always handled, as a matter of course, many issues others assume are simply normal work.

Data cleanup for large tables is usually time-consuming, so it is recommended to improve the design with a hot/cold data separation strategy. That may sound abstract, so let me give an example: converting the Drop of a large table into a reversible DDL operation.

DROP is implicitly committed and irreversible; in database operations it is the notorious "run away afterwards" command. MySQL currently has no corresponding restore for DROP other than recovering from backup, but we can consider converting the Drop into a reversible DDL operation.

The idea is to create an archive database, for example testdb_arch, and use RENAME TABLE to move the table into it instead of dropping it. In terms of permissions, testdb_arch is invisible to the business; the rename smoothly implements the delete, and if the data is finally cleared after a retention period, the cleanup is invisible to existing business processes.
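A sketch of the reversible Drop (names follow the text's testdb / testdb_arch example; big_table is hypothetical):

```sql
-- "Drop" a table reversibly by renaming it into an archive database
-- that the business cannot see
CREATE DATABASE IF NOT EXISTS testdb_arch;

RENAME TABLE testdb.big_table TO testdb_arch.big_table;

-- Within the retention window the move is reversible:
-- RENAME TABLE testdb_arch.big_table TO testdb.big_table;

-- After the retention period expires, the real cleanup is invisible
-- to the business:
DROP TABLE testdb_arch.big_table;
```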

In addition, one more recommendation: for large-table changes, consider online changes during off-peak hours as much as possible, for example using the pt-osc (pt-online-schema-change) tool, or maintenance-window changes; no further elaboration here.

To sum up, it comes down to one sentence:

Optimizing a table with tens of millions of rows is grounded in the business scenario and carried out at an acceptable optimization cost; it is definitely not an isolated, single-level optimization.