
When the number of records in a single MySQL table grows too large, the performance of inserts, deletes, updates, and queries drops sharply. You can follow the steps below to optimize:

Single table optimization

Unless the data in the table is expected to keep growing, do not consider splitting at first: splitting introduces all kinds of complexity in logic, deployment, operations, and maintenance. In general, tables that are mostly integer values are fine below ten million rows, and tables that are mostly strings are fine below five million rows. In fact, single-table MySQL performance usually still has plenty of room for optimization, and can often handle data volumes well above ten million rows:

Fields

  • Prefer TINYINT, SMALLINT, and MEDIUMINT over INT for integer columns when the range allows; add UNSIGNED if the value can never be negative
  • Allocate only the length that is really needed for VARCHAR columns
  • Use enumerations or integers instead of string types
  • Prefer TIMESTAMP over DATETIME
  • Do not put too many columns in a single table; fewer than 20 is recommended
  • Avoid NULLable columns: they are hard to optimize in queries and take extra index space
  • Store IP addresses as integers (see the example table definition after this list)
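
A minimal sketch of a table definition applying these recommendations; the table and column names are hypothetical, and INET_ATON / INET_NTOA convert between dotted IP notation and the integer form:

CREATE TABLE access_log (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id MEDIUMINT UNSIGNED NOT NULL,              -- small unsigned integer instead of INT
    status TINYINT UNSIGNED NOT NULL,                 -- enumeration stored as a small integer
    client_ip INT UNSIGNED NOT NULL,                  -- IP address stored as an integer
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    note VARCHAR(64) NOT NULL DEFAULT '',             -- only as long as really needed, no NULL
    PRIMARY KEY (id)
);

INSERT INTO access_log (user_id, status, client_ip)
VALUES (42, 1, INET_ATON('192.168.0.1'));

SELECT INET_NTOA(client_ip) FROM access_log;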

Indexes

  • More indexes are not always better: create them for the actual queries, and consider indexing the columns that appear in WHERE and ORDER BY clauses; use EXPLAIN to check whether an index or a full table scan is used (see the sketch after this list)
  • Avoid testing columns for NULL in the WHERE clause, otherwise the engine will abandon the index and do a full table scan
  • Do not index columns with very few distinct values, such as "gender", which has only two or three values
  • Index character columns by prefix only
  • Character columns are best not used as primary keys
  • Do not use foreign keys; let the application enforce the constraint
  • Avoid UNIQUE constraints where possible; let the application guarantee uniqueness
  • Use multi-column indexes whose column order matches the query conditions, and remove unnecessary single-column indexes
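
A hedged sketch of these points (the orders and users tables and their columns are hypothetical): a composite index whose column order matches the query conditions, a prefix index on a character column, and an EXPLAIN to verify that the index is used:

-- composite index matching the WHERE and ORDER BY columns
CREATE INDEX idx_user_status_created ON orders (user_id, status, created_at);

-- prefix index on a character column
CREATE INDEX idx_email_prefix ON users (email(10));

-- verify that the plan uses the index instead of a full table scan
EXPLAIN SELECT * FROM orders
WHERE user_id = 42 AND status = 1
ORDER BY created_at DESC;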

Query SQL

  • Slow SQL can be found by enabling the slow query log
  • Do not operate on columns: in SELECT id FROM t WHERE age + 1 = 10, any operation on a column, including function calls and computed expressions, causes a full table scan. Move the operation to the right-hand side of the comparison whenever possible (a few rewrites are sketched after this list)
  • Keep SQL statements as simple as possible: a single SQL statement can only run on one CPU; break large statements into small ones to shorten lock time; one huge SQL statement can block the whole database
  • Do not use SELECT *
  • Rewrite OR as IN: OR is O(n) while IN is O(log n); keeping the number of IN values below 200 is recommended
  • Do not use stored functions or triggers; implement the logic in the application
  • Avoid LIKE '%xxx' queries
  • Use JOIN sparingly
  • Compare values of the same type, e.g. '123' with '123' and 123 with 123
  • Avoid != or <> in the WHERE clause, otherwise the engine will abandon the index and do a full table scan
  • For continuous values use BETWEEN instead of IN: SELECT id FROM t WHERE num BETWEEN 1 AND 5
  • Do not fetch the whole table; use LIMIT for pagination and keep the page size small
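
A minimal sketch of a few of these rewrites, reusing the hypothetical table t from the examples above:

-- move the operation off the column so an index on age can be used
SELECT id FROM t WHERE age + 1 = 10;   -- full table scan
SELECT id FROM t WHERE age = 10 - 1;   -- can use an index on age

-- rewrite OR as IN
SELECT id FROM t WHERE num = 1 OR num = 2 OR num = 3;
SELECT id FROM t WHERE num IN (1, 2, 3);

-- use BETWEEN for continuous values
SELECT id FROM t WHERE num BETWEEN 1 AND 5;

-- paginate instead of fetching the whole table
SELECT id, num FROM t ORDER BY id LIMIT 0, 100;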

Engines

MyISAM and InnoDB are two widely used engines:

MyISAM

The MyISAM engine is the default engine for MySQL 5.1 and earlier. It features:

  • Row-level locks are not supported: reads lock all the tables that need to be read, and writes take an exclusive lock on the table
  • Transactions not supported
  • Foreign keys are not supported
  • Safe recovery after a crash is not supported
  • New records can be inserted into a table while a query is being read
  • Supports indexing the first 500 characters of BLOB and TEXT columns, and supports full-text indexes
  • Supports delayed key writes, which greatly improves write performance (see the sketch after this list)
  • Tables that are not modified can be compressed to greatly reduce disk space usage
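
A hedged sketch of enabling delayed key writes when creating a MyISAM table (the table is hypothetical); compressing read-only tables is done offline with the myisampack tool:

CREATE TABLE log_archive (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    message TEXT,
    PRIMARY KEY (id)
) ENGINE=MyISAM DELAY_KEY_WRITE=1;   -- index blocks are flushed lazily, improving write throughput
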
InnoDB

InnoDB became the default storage engine in MySQL 5.5 and later. Its features:

  • Supports row-level locking and uses MVCC to support high concurrency
  • Support transactions
  • Support foreign keys
  • Supports safe recovery after a crash
  • Does not support full-text indexes (they were added to InnoDB in MySQL 5.6)

In general, MyISAM suits SELECT-intensive tables, while InnoDB suits INSERT- and UPDATE-intensive tables; the engine can be chosen per table, as sketched below.
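
A minimal sketch of choosing the engine per table (the table names are hypothetical):

-- choose the engine when creating the table
CREATE TABLE report_cache (
    id INT UNSIGNED NOT NULL PRIMARY KEY,
    body TEXT
) ENGINE=MyISAM;

-- or convert an existing table
ALTER TABLE orders ENGINE=InnoDB;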

System tuning parameters

You can use the following tools to benchmark:

  • Sysbench: a modular, cross-platform, and multithreaded performance testing tool
  • iibench-mysql: a Java-based insert performance test tool for MySQL/Percona/MariaDB indexes
  • tpcc-mysql: a TPC-C test tool developed by Percona

You can refer to the official documentation for details on tuning parameters. Some important ones are listed below (the two cache hit ratios are checked in the sketch after this list):

  • back_log: how many connection requests can be queued in a short time before MySQL temporarily stops answering new requests. In other words, if the number of MySQL connections reaches max_connections, new requests are put in a queue to wait for some connection to release resources; the size of that queue is back_log, and requests beyond it are refused. It can be raised from the default 50 to 500
  • wait_timeout: how long an idle database connection is kept. Idle connections occupy memory; this can be reduced from the default 8 hours to half an hour
  • max_user_connections: the maximum number of connections per user; the default 0 means no limit
  • thread_concurrency: set the number of concurrent threads to twice the number of CPU cores
  • skip_name_resolve: disables DNS resolution for incoming connections, eliminating DNS lookup time; note that all remote hosts must then connect by IP address
  • key_buffer_size: the cache size for index blocks. Increasing it improves index handling speed and has the biggest impact on MyISAM table performance. For roughly 4 GB of memory, 256M or 384M is reasonable. Check with show status like 'key_read%' and keep key_reads / key_read_requests below 0.1%
  • innodb_buffer_pool_size: caches data and index pages and has the biggest impact on InnoDB table performance. Check with show status like 'Innodb_buffer_pool_read%' and make sure (Innodb_buffer_pool_read_requests - Innodb_buffer_pool_reads) / Innodb_buffer_pool_read_requests is as high as possible
  • innodb_additional_mem_pool_size: the memory InnoDB uses to store the data dictionary and some internal data structures. When there are very many database objects, increase it so that all of this data fits in memory; if it is too small, MySQL writes warnings to the error log, which is the signal to enlarge it
  • innodb_log_buffer_size: the buffer InnoDB uses for transaction log records; it is generally recommended not to exceed 32 MB
  • query_cache_size: caches MySQL result sets, i.e. the results of SELECT statements (it applies only to SELECT). Any change to a table invalidates all cached results of SELECT statements that reference that table, so when data changes very frequently the query cache may not be worth the cost. Tune it by the hit rate (Qcache_hits / (Qcache_hits + Qcache_inserts) * 100%). It is generally not recommended to make it too large; 256 MB may be enough, and it can be raised for mostly static data. Check current usage with show status like 'Qcache_%'
  • read_buffer_size: the size of MySQL's sequential read buffer. A sequential scan of a table is given a read buffer; if sequential scans are very frequent, increasing this value can improve performance
  • sort_buffer_size: the buffer MySQL uses for sorting. To speed up ORDER BY, first check whether MySQL can use an index instead of an extra sort phase; if not, try increasing sort_buffer_size
  • read_rnd_buffer_size: MySQL's random read buffer size. When rows are read in non-sequential order (for example, in sort order), a random read buffer is allocated, and MySQL scans it first to avoid disk seeks and speed up the query. If a lot of data needs to be sorted, this can be raised, but MySQL allocates this buffer for every client connection, so set it carefully to avoid excessive memory use
  • record_buffer: each thread doing a sequential scan allocates a buffer of this size for every table it scans; if you do many sequential scans, you may want to increase it
  • thread_cache_size: holds threads that are not currently bound to a connection but are ready to serve a later one, so connection requests can be answered quickly without creating new threads
  • table_cache: similar to thread_cache_size, but caches table file handles; it is mainly useful for MyISAM and not very useful for InnoDB
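
A minimal sketch of checking the two hit ratios mentioned above from the status counters, and of raising the InnoDB buffer pool (the 2 GB value is only an example; resizing innodb_buffer_pool_size online requires MySQL 5.7+):

-- MyISAM key buffer: Key_reads / Key_read_requests should stay below 0.1%
SHOW GLOBAL STATUS LIKE 'Key_read%';

-- InnoDB buffer pool: (read_requests - reads) / read_requests should be close to 1
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';

-- example: set the buffer pool to 2 GB
SET GLOBAL innodb_buffer_pool_size = 2147483648;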

Upgrade the hardware

Scale up: depending on whether MySQL is CPU-bound or I/O-bound, adding CPU and memory or switching to SSDs can significantly improve MySQL performance

Read/write splitting

This is also a commonly used optimization: reads go to the replicas and writes go to the master. Generally, do not use dual-master or multi-master setups, which introduce a lot of complexity; prefer the other approaches in this article to improve performance first. Many splitting solutions also include read/write splitting

Caching

Caching can occur at these levels:

  • MySQL internal: covered by the related settings in the system tuning parameters section
  • Data access layer: for example, MyBatis caches SQL statements, while Hibernate can cache down to individual records; the cached objects are mainly persistent objects (Persistence Object)
  • Application service layer: caching can be controlled more precisely and with more strategies in code; the cached objects are data transfer objects (Data Transfer Object)
  • Web layer: caching of web pages
  • Browser client: the client-side cache

Caches can be added at one or more of these levels, in combination, depending on the situation. This section mainly covers caching at the service layer, which currently comes in two main flavors:

  • Write Through: after data is written to the database, the cache is updated as well, so the database and the cache stay consistent. This is how most application caching frameworks, such as Spring Cache, work. The implementation is simple and synchronization is reliable, but efficiency is mediocre.
  • Write Back: when data changes, only the cache is updated, and the cached data is asynchronously written back to the database in batches. This implementation is more complex and requires more application logic, and the database and cache may be out of sync for a while, but efficiency is very high.

Table partitioning

Partitioning, introduced in MySQL 5.1, is a simple form of horizontal splitting: the user adds partitioning parameters when creating the table, and it is transparent to the application, so no code changes are needed

To the user, a partitioned table is a single logical table, but underneath it is composed of multiple physical sub-tables. The partitioning code is effectively a black box that encapsulates the underlying tables from the SQL layer. The way MySQL implements partitioning also means that indexes are defined per partition sub-table; there is no global index

SQL statements need to be written with the partitioned table in mind: the partitioning column should appear in the query conditions so that the query can be narrowed to a small number of partitions, otherwise all partitions are scanned. You can check which partitions a given SQL statement touches with EXPLAIN PARTITIONS and optimize accordingly. In the example below, 5 records fall on two partitions:

mysql> explain partitions select count(1) from user_partition where id in(1, 2, 3, 4, 5);
+----+-------------+----------------+------------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table          | partitions | type  | possible_keys | key     | key_len | ref  | rows | Extra                    |
+----+-------------+----------------+------------+-------+---------------+---------+---------+------+------+--------------------------+
|  1 | SIMPLE      | user_partition | p1,p4      | range | PRIMARY       | PRIMARY | 8       | NULL |    5 | Using where; Using index |
+----+-------------+----------------+------------+-------+---------------+---------+---------+------+------+--------------------------+
1 row in set (0.00 sec)

The benefits of partitioning are:

  • More data can be stored in a single table
  • Data in a partitioned table is easier to maintain: large amounts of data can be removed in batches by dropping an entire partition, and new partitions can be added for newly inserted data; individual partitions can also be optimized, checked, and repaired independently (see the sketch after this list)
  • Some queries can be determined from the query conditions to touch only a few partitions, making them very fast
  • Partitioned table data can be distributed across different physical devices, making use of multiple pieces of hardware
  • Partitioned tables can be used to avoid certain special bottlenecks, such as mutually exclusive access to a single InnoDB index or inode lock contention in the ext3 file system
  • Individual partitions can be backed up and restored
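
A hedged sketch of the per-partition maintenance operations mentioned above, against a hypothetical table t that is RANGE-partitioned by year and has no MAXVALUE partition:

ALTER TABLE t DROP PARTITION p0;        -- remove old data in one cheap batch operation
ALTER TABLE t ADD PARTITION (PARTITION p9 VALUES LESS THAN (2030));  -- room for new data
ALTER TABLE t OPTIMIZE PARTITION p1;    -- maintain a single partition
ALTER TABLE t CHECK PARTITION p1;
ALTER TABLE t REPAIR PARTITION p1;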

Limitations and disadvantages of partitioning:

  • A table can have a maximum of 1024 partitions
  • If the table has a primary key or unique indexes, the partitioning columns must be part of every primary key and unique index
  • Partitioned tables cannot use foreign key constraints
  • A NULL value invalidates partition filtering
  • All partitions must use the same storage engine

Partition types:

  • RANGE partition: Allocates multiple rows to a partition based on column values belonging to a given contiguous interval
  • LIST partitioning: Similar to partitioning by RANGE, except that LIST partitioning is selected based on column values matching a value from a discrete set of values
  • HASH partitioning: the partition is selected based on the return value of a user-defined expression computed over the column values of the rows to be inserted; the expression can be any valid MySQL expression that produces a non-negative integer value
  • KEY partitioning: similar to HASH partitioning, except that only one or more columns (not an expression) are given and the MySQL server supplies its own hashing function, so the columns do not have to be integers (both are sketched below)
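
Hedged sketches of LIST, HASH, and KEY partitioning (the table definitions are hypothetical; a RANGE example follows in the next section):

-- LIST: partitions chosen from a discrete set of values
CREATE TABLE employees (
    id INT NOT NULL,
    store_id INT NOT NULL
)
PARTITION BY LIST (store_id) (
    PARTITION p_east VALUES IN (1, 2, 3),
    PARTITION p_west VALUES IN (4, 5, 6)
);

-- HASH: partition chosen by an expression that returns a non-negative integer
CREATE TABLE sessions (
    user_id INT NOT NULL,
    created DATE NOT NULL
)
PARTITION BY HASH (user_id) PARTITIONS 4;

-- KEY: like HASH, but MySQL supplies the hashing function
CREATE TABLE tokens (
    token CHAR(32) NOT NULL,
    PRIMARY KEY (token)
)
PARTITION BY KEY (token) PARTITIONS 4;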

Partitioning is suitable for the following scenarios:

  • The best fit is data with a strong time dimension, which can then be partitioned by time, as shown below:
CREATE TABLE members (
    firstname VARCHAR(25) NOT NULL,
    lastname VARCHAR(25) NOT NULL,
    username VARCHAR(16) NOT NULL,
    email VARCHAR(35),
    joined DATE NOT NULL
)
PARTITION BY RANGE( YEAR(joined) ) (
    PARTITION p0 VALUES LESS THAN (1960),
    PARTITION p1 VALUES LESS THAN (1970),
    PARTITION p2 VALUES LESS THAN (1980),
    PARTITION p3 VALUES LESS THAN (1990),
    PARTITION p4 VALUES LESS THAN MAXVALUE
);

Querying with a time-range condition is very efficient, and unneeded historical data is easy to delete in batches, as sketched below.
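
A minimal sketch against the members table above: a time-range condition lets the optimizer prune partitions, and old data is removed by dropping a partition:

-- only partitions p3 and p4 need to be scanned
EXPLAIN PARTITIONS
SELECT COUNT(*) FROM members WHERE joined BETWEEN '1985-01-01' AND '1995-12-31';

-- remove all members who joined before 1960 in one cheap operation
ALTER TABLE members DROP PARTITION p0;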

  • If the data has obvious hot spots, and data outside the hot set is rarely accessed, the hot data can be placed in a separate partition so that it has a chance of staying cached in memory; queries then only touch a small partition table, making efficient use of indexes and caches

In addition, MySQL has an earlier, simpler partitioning implementation, merge tables, which is more limited and lacks optimizations; its use is discouraged, and it should be replaced by the newer partitioning mechanism

Vertical splitting

Vertical database splitting divides tables among databases according to how closely the data is related. For example, if one database holds both user data and order data, vertical splitting puts the user data into a user database and the order data into an order database. Vertical table splitting splits a single table by columns: a wide table with many columns is commonly split into frequently used and rarely used columns; the resulting tables have the same number of rows but different columns, linked through the primary key

For example, the original user table is:

The vertical split is:
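
The original figures are not reproduced here; as a hedged sketch of the idea, with hypothetical column names, a wide Users table is split into a narrow Users table and a UserExtras table that share the primary key:

-- frequently used columns stay in the narrow, hot table
CREATE TABLE Users (
    id INT UNSIGNED NOT NULL PRIMARY KEY,
    name VARCHAR(32) NOT NULL,
    email VARCHAR(64) NOT NULL
);

-- rarely used, wide columns move to a companion table keyed by the same id
CREATE TABLE UserExtras (
    user_id INT UNSIGNED NOT NULL PRIMARY KEY,   -- same value as Users.id
    address VARCHAR(255),
    bio TEXT
);

-- the full row is reassembled with a primary-key join when needed
SELECT u.name, e.bio
FROM Users u JOIN UserExtras e ON e.user_id = u.id
WHERE u.id = 42;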

The advantages of vertical splitting are:

  • Rows become smaller, so more rows fit in one block, which means fewer I/O operations (fewer blocks read per query)
  • The cache can be used to the fullest: when splitting vertically, put columns that rarely change together and columns that change frequently together
  • Simple data maintenance

The disadvantage is that:

  • The primary key is redundant and redundant columns need to be managed
  • It leads to JOIN operations between the split tables (increasing CPU overhead); database pressure can be reduced by performing the join in the application server instead
  • The problem of a single table holding too much data still remains (which horizontal splitting addresses)
  • Transaction handling becomes more complex

Horizontal splitting

Overview

Horizontal splitting is a strategy of storing data in shards; it comes in two forms, splitting into multiple tables within one database and splitting across databases. Each shard of the data is placed in a different MySQL table or database to achieve a distributed effect that can support very large data volumes. The table partitioning described above is essentially a special kind of in-database table splitting

Splitting into tables within the same database only solves the problem of a single table being too large; because the data is not distributed to different machines, it does little to relieve the MySQL server's load: everything still competes for the same physical I/O, CPU, and network. That is what splitting across databases addresses

If the user table was split horizontally, the result would be:

In practice this is often combined with vertical splitting: Users_A_M and Users_N_Z are each split again into Users and UserExtras, giving four tables in total, as sketched below
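
A hedged sketch of the idea (the shard layout and routing rule are hypothetical): rows go to Users_A_M or Users_N_Z depending on the first letter of the user name, and the application routes each statement to the right shard:

-- two shards with identical structure
CREATE TABLE Users_A_M LIKE Users;
CREATE TABLE Users_N_Z LIKE Users;

-- the application routes by the shard key, e.g. a user named 'alice' goes to Users_A_M
INSERT INTO Users_A_M (id, name, email) VALUES (42, 'alice', 'alice@example.com');

-- a query that carries the shard key only touches one shard
SELECT * FROM Users_A_M WHERE name = 'alice';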

The advantages of horizontal splitting are:

  • There are no single-database bottlenecks from large data volumes or high concurrency
  • Relatively few changes are needed on the application side
  • The stability and load capacity of the system are improved

The disadvantage is that:

  • Shard transaction consistency is difficult to resolve
  • Cross-node Join has poor performance and complicated logic
  • Expanding the shards repeatedly and maintaining them is very difficult

Sharding principles

  • If you can avoid sharding, do so; refer to single table optimization first
  • Keep the number of shards as small as possible and distribute them evenly across the data nodes. The more shards a query spans, the worse its overall performance; although it is still better than having all the data in one shard, expand the number of shards only when necessary
  • Choose and plan the sharding rules carefully in advance. Consider the data's growth pattern, access pattern, shard affinity, and shard expansion. Newer sharding strategies include range sharding, enumeration sharding, and consistent-hash sharding, all of which help with expansion
  • Try not to let a single transaction span multiple shards; distributed transactions have always been a hard problem
  • Optimize query conditions and avoid SELECT *: large result sets consume a lot of bandwidth and CPU. Avoid returning large result sets, and build indexes for frequently used query statements
  • Reduce the likelihood of cross-database joins through data redundancy and by grouping related tables on the same shard

The choice of sharding rule deserves special emphasis. If a table's data has an obvious time dimension, such as orders or transaction records, sharding by time range is often appropriate: such data ages, we usually care most about recent data, and queries tend to filter on a time field. A good approach is to shard currently active data with a short time span and store historical data with a long span.

In general, the shard key is chosen according to the conditions of the most frequently executed SQL, because a query that carries no WHERE condition on the shard key has to visit every shard and performs worst; the more such SQL there is, the greater the impact on the system, so it should be avoided as much as possible.

Solutions

Due to the complicated logic involved in horizontal split, there are many mature solutions. These solutions fall into two broad categories: client architecture and proxy architecture.

Client architecture

The data access layer, such as JDBC data sources or MyBatis, is modified so that multiple data sources can be managed through configuration; the databases are connected directly, and shard aggregation is done inside the application module. This is usually delivered as a JAR package

Here is an example of a client architecture:

You can see that sharding is implemented in conjunction with the application server by modifying the Spring JDBC layer

The advantages of the client architecture are:

  • Use directly connected databases to reduce downtime risks associated with peripheral system dependencies
  • Low integration cost, no additional o&M components

The disadvantage is that:

  • Limited to the database access layer, so extensibility is limited; it may be inadequate for more complex systems
  • Putting the load of the sharding logic on the application server introduces additional risk
Proxy architecture

An independent piece of middleware manages all data sources and performs shard aggregation; the back-end database cluster is transparent to the front-end application, and the proxy component is deployed and operated independently

Here is an example of a proxy architecture:

Proxy components are deployed as a cluster to avoid a single point of failure, and may need a coordination service such as ZooKeeper to manage them

The advantages of the proxy architecture are:

  • Able to handle very complex requirements; it is not limited by the original implementation of the database access layer and is highly extensible
  • Transparent to the application server without any additional load

The disadvantage is that:

  • Independent proxy middleware needs to be deployed and operated, which is costly
  • The application needs to go through the proxy to connect to the database. An extra hop on the network leads to performance loss and additional risks
Comparison of different schemes
| Product        | Vendor             | Architecture | Supported databases | DB sharding | Table sharding | Read/write splitting | External dependencies | Open source | Implementation language | Client languages | Last update   | GitHub stars |
|----------------|--------------------|--------------|---------------------|-------------|----------------|----------------------|-----------------------|-------------|-------------------------|------------------|---------------|--------------|
| MySQL Fabric   | MySQL official     | Proxy        | MySQL               | Yes         | Yes            | Yes                  | None                  | Yes         | Python                  | Any              | 4 months ago  | 35           |
| Cobar          | Alibaba            | Proxy        | MySQL               | Yes         | No             | No                   | None                  | Yes         | Java                    | Any              | 2 years ago   | 1287         |
| Cobar Client   | Alibaba            | Client       | MySQL               | Yes         | No             | No                   | None                  | Yes         | Java                    | Java             | 3 years ago   | 344          |
| TDDL           | Taobao             | Client       | Any                 | Yes         | Yes            | Yes                  | Diamond               | Partially   | Java                    | Java             | Unknown       | 519          |
| Atlas          | Qihoo 360          | Proxy        | MySQL               | Yes         | Yes            | Yes                  | None                  | Yes         | C                       | Any              | 10 months ago | 1941         |
| Heisenberg     | Baidu (Xiong Zhao) | Proxy        | MySQL               | Yes         | Yes            | Yes                  | None                  | Yes         | Java                    | Any              | 2 months ago  | 197          |
| TribeDB        | Individual         | Proxy        | MySQL               | Yes         | Yes            | Yes                  | None                  | Yes         | NodeJS                  | Any              | 3 months ago  | 126          |
| ShardingJDBC   | Dangdang           | Client       | MySQL               | Yes         | Yes            | Yes                  | None                  | Yes         | Java                    | Java             | Within a day  | 1144         |
| Shark          | Individual         | Client       | MySQL               | Yes         | Yes            | No                   | None                  | Yes         | Java                    | Java             | 2 days ago    | 84           |
| KingShard      | Individual         | Proxy        | MySQL               | Yes         | Yes            | Yes                  | None                  | Yes         | Golang                  | Any              | 2 days ago    | 1836         |
| OneProxy       | Civilian Software  | Proxy        | MySQL               | Yes         | Yes            | Yes                  | None                  | No          | Unknown                 | Any              | Unknown       | Unknown      |
| MyCat          | Community          | Proxy        | MySQL               | Yes         | Yes            | Yes                  | None                  | Yes         | Java                    | Any              | 2 days ago    | 1270         |
| Vitess         | YouTube            | Proxy        | MySQL               | Yes         | Yes            | Yes                  | None                  | Yes         | Golang                  | Any              | Within a day  | 3636         |
| Mixer          | Individual         | Proxy        | MySQL               | Yes         | Yes            | No                   | None                  | Yes         | Golang                  | Any              | 9 months ago  | 472          |
| JetPants       | Tumblr             | Client       | MySQL               | Yes         | Yes            | No                   | None                  | Yes         | Ruby                    | Ruby             | 10 months ago | 957          |
| HibernateShard | Hibernate          | Client       | Any                 | Yes         | Yes            | No                   | None                  | Yes         | Java                    | Java             | 4 years ago   | 57           |
| MybatisShard   | MakerSoft          | Client       | Any                 | Yes         | Yes            | No                   | None                  | Yes         | Java                    | Java             | 11 months ago | 119          |
| Gizzard        | Twitter            | Proxy        | Any                 | Yes         | Yes            | No                   | None                  | Yes         | Java                    | Any              | 3 years ago   | 2087         |

With so many options, how do you choose? Think about it along these lines:

  1. Decide between the proxy and the client architecture. Small to medium-sized or simple scenarios tend to favor the client architecture, while complex scenarios or large-scale systems tend to favor the proxy architecture
  2. Check whether specific features are supported, for example cross-node ORDER BY; if you need such a feature, products that support it take priority
  3. Do not consider a product that has not been updated for a year: it means development has stalled, possibly with no maintenance or technical support
  4. Prefer vendors in the order large company -> community -> small company -> individual
  5. Choose products with a good reputation, judged by GitHub stars, the number and quality of users, and user feedback
  6. Prefer open source; projects with special needs often have to modify the source code

According to the above ideas, the following options are recommended:

  • Client architecture: ShardingJDBC
  • Proxy architecture: MyCat or Atlas

MySQL-compatible, horizontally scalable databases

There are also some open source databases that are compatible with the MySQL protocol, such as:

  • TiDB
  • Cubrid

However, their industrial maturity still lags behind MySQL and they require significant operations investment. If you want to migrate an existing MySQL deployment to a new database that can scale horizontally, you can also consider some cloud databases:

  • Alibaba Cloud PetaData
  • Alibaba Cloud OceanBase
  • Tencent Cloud DCDB

NoSQL

Sharding on top of MySQL is dancing in shackles. In fact, many large tables do not need ACID or an RDBMS like MySQL at all; consider migrating such tables to NoSQL to solve the horizontal scaling problem once and for all. Good candidates include:

  • Log, monitoring, and statistics data
  • Unstructured or weakly structured data
  • Data with weak transaction requirements and few associated (join) operations