When a single MySQL table holds too many records, the performance of inserts, deletes, updates, and queries degrades dramatically

Single table optimization

Unless the table's data will keep growing indefinitely, do not consider splitting at first: splitting introduces all kinds of complexity in logic, deployment, and operations. In general, a table whose columns are mostly integers is fine below roughly ten million rows, and a table dominated by string columns is fine below roughly five million rows.

In fact, in many cases a single MySQL table still has plenty of room for optimization, and can even comfortably support data volumes of ten million rows or more.

Fields

  • Prefer TINYINT, SMALLINT, and MEDIUMINT over INT for integer columns; add UNSIGNED when the value cannot be negative
  • Give VARCHAR columns only the length they really need
  • Use enumerations or integers instead of string types
  • Prefer TIMESTAMP over DATETIME
  • Do not put too many fields in a single table; fewer than 20 is recommended
  • Avoid NULL-able fields: they are hard to optimize for queries and take extra index space
  • Store IP addresses as integers (see the sketch after this list)
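
As a rough illustration of these guidelines, here is a minimal sketch of a table definition; the table and column names are invented for the example and are not from the article:

    CREATE TABLE user_login (
        id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
        age        TINYINT UNSIGNED NOT NULL DEFAULT 0,        -- small integer type instead of INT
        status     TINYINT UNSIGNED NOT NULL DEFAULT 0,        -- integer code instead of a string status
        nickname   VARCHAR(32) NOT NULL DEFAULT '',            -- allocate only the length actually needed
        gender     ENUM('unknown', 'male', 'female') NOT NULL DEFAULT 'unknown',
        ip         INT UNSIGNED NOT NULL DEFAULT 0,            -- IP address stored as an integer
        created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,  -- TIMESTAMP rather than DATETIME
        PRIMARY KEY (id)
    ) ENGINE = InnoDB;

    -- Converting an IP address to and from its integer form:
    INSERT INTO user_login (ip) VALUES (INET_ATON('192.168.0.1'));
    SELECT INET_NTOA(ip) FROM user_login;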

Indexes

  • Indexes are not "the more the better": create them specifically for your queries, consider indexing the columns that appear in WHERE and ORDER BY clauses, and use EXPLAIN to check whether an index or a full table scan is used (see the sketch after this list)
  • Avoid testing a field for NULL in the WHERE clause, otherwise the engine abandons the index and performs a full table scan
  • Do not index fields with very few distinct values, such as "gender", which has only two or three
  • Index only a prefix of long character fields
  • Character fields should preferably not be primary keys
  • Do not use foreign keys; enforce the constraints in application code
  • Avoid UNIQUE constraints where possible; guarantee uniqueness in application code
  • Use composite indexes whose column order matches the query conditions, and remove unnecessary single-column indexes
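
A minimal sketch of matching a composite index to a query and verifying it with EXPLAIN; the orders table and its columns are assumptions made up for the example:

    -- Query pattern: WHERE user_id = ? AND status = ? ORDER BY created_at
    -- A composite index in the same order as the conditions covers it:
    ALTER TABLE orders ADD INDEX idx_user_status_created (user_id, status, created_at);

    -- Check that the index is used instead of a full table scan
    -- (look at the "type", "key" and "rows" columns of the output):
    EXPLAIN SELECT id FROM orders
    WHERE user_id = 42 AND status = 1
    ORDER BY created_at DESC;

    -- Prefix index on a long character column:
    ALTER TABLE orders ADD INDEX idx_remark (remark(20));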

Query SQL

  • Enable the slow query log to find slow SQL
  • Do not operate on columns: in SELECT id FROM t WHERE age + 1 = 10, the expression on the column forces a full table scan, and the same goes for database functions and computed expressions on columns. Move the computation to the right-hand side of the comparison whenever possible
  • Keep SQL statements as simple as possible: a single SQL statement can only run on one CPU; split large statements into smaller ones to shorten lock times; one huge SQL statement can block the whole database
  • Do not use SELECT *
  • Rewrite OR as IN: OR is O(n) while IN is O(log n); keep the number of values in an IN list under about 200
  • Do not use stored functions and triggers; implement the logic in the application
  • Avoid queries of the form LIKE '%xxx', which cannot use an index
  • Use JOIN sparingly
  • Compare values of the same type, such as '123' with '123' and 123 with 123
  • Try to avoid != and <> in the WHERE clause, otherwise the engine abandons the index and performs a full table scan
  • For contiguous values use BETWEEN instead of IN: SELECT id FROM t WHERE num BETWEEN 1 AND 5
  • Do not fetch the whole table; paginate with LIMIT and keep the page size small (see the sketch after this list)
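
A short sketch of several of the points above: enabling the slow query log at runtime and rewriting queries so that indexed columns are not wrapped in expressions (table and column names are assumed for the example):

    -- Enable the slow query log and record statements slower than 1 second
    SET GLOBAL slow_query_log = ON;
    SET GLOBAL long_query_time = 1;

    -- Bad: the expression on the column prevents index use
    SELECT id FROM t WHERE age + 1 = 10;
    -- Better: move the arithmetic to the constant side
    SELECT id FROM t WHERE age = 10 - 1;

    -- Bad: OR chain; Better: IN (keep the value list under ~200 entries)
    SELECT id FROM t WHERE num = 1 OR num = 3 OR num = 5;
    SELECT id FROM t WHERE num IN (1, 3, 5);

    -- Contiguous values: prefer BETWEEN, and paginate with LIMIT
    SELECT id FROM t WHERE num BETWEEN 1 AND 5 LIMIT 0, 20;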

Engines

MyISAM and InnoDB are two widely used engines:

MyISAM

MyISAM is the default engine in MySQL 5.1 and earlier. Its characteristics:

  • Row-level locks are not supported: reads lock all the tables they touch, and writes take exclusive locks on the tables
  • Transactions are not supported
  • Foreign keys are not supported
  • Safe recovery after a crash is not supported
  • New records can be inserted into a table while queries are reading from it
  • Indexes on the first 500 characters of BLOB and TEXT columns are supported, as are full-text indexes
  • Delayed index updates are supported, which greatly improves write performance
  • Tables that are no longer modified can be compressed, greatly reducing disk space usage

InnoDB

InnoDB became the default engine in MySQL 5.5 and later. Its characteristics:

  • Row-level locking is supported, and MVCC is used to support high concurrency
  • Transactions are supported
  • Foreign keys are supported
  • Safe recovery after a crash is supported
  • Full-text indexes are not supported (prior to MySQL 5.6)

In general, MyISAM suits SELECT-intensive tables, while InnoDB suits INSERT- and UPDATE-intensive tables

System tuning parameters

You can use the following tools to benchmark:

  • sysbench: a modular, cross-platform, multithreaded performance testing tool
  • iibench-mysql: a Java-based insert-performance benchmark for MySQL/Percona/MariaDB indexes
  • tpcc-mysql: a TPC-C benchmark tool developed by Percona

You can refer to official documents for details about tuning parameters. Here are some important parameters:

  • back_log: how many connection requests MySQL can queue before it temporarily stops answering new ones. If the number of connections reaches max_connections, new requests wait in this backlog until a connection frees resources; once the number of waiting connections exceeds back_log, no further connection resources are granted. It can be raised from the default 50 to 500
  • wait_timeout: how long an idle database connection is kept. Idle connections consume memory; this can be reduced from the default 8 hours to half an hour
  • max_user_connections: the maximum number of connections per user. The default value 0 means no limit
  • thread_concurrency: set the number of concurrent threads to twice the number of CPU cores
  • skip_name_resolve: disables DNS resolution for incoming connections, eliminating DNS lookup time, but requires all remote hosts to connect by IP address
  • key_buffer_size: the cache size for index blocks. Increasing it improves index handling speed and has the greatest impact on MyISAM performance; for a server with about 4 GB of memory, 256 MB or 384 MB is reasonable. Check with show status like 'key_read%' and keep key_reads / key_read_requests below 0.1%
  • innodb_buffer_pool_size: caches data and index pages and has the greatest impact on InnoDB performance. Check with show status like 'Innodb_buffer_pool_read%' and keep (Innodb_buffer_pool_read_requests - Innodb_buffer_pool_reads) / Innodb_buffer_pool_read_requests as high as possible (see the sketch after this list)
  • innodb_additional_mem_pool_size: memory InnoDB uses to store the data dictionary and other internal structures. When there are many database objects, increase it so that everything fits in memory; if it is too small, MySQL writes warnings to the error log, which is the signal to raise it
  • innodb_log_buffer_size: the buffer InnoDB uses for transaction logs; generally no more than 32 MB is recommended
  • query_cache_size: caches result sets, i.e. the results of SELECT statements. Any change to a table's data invalidates every cached SELECT result that references the table, so when data changes very frequently the query cache can do more harm than good. Tune it according to the hit ratio Qcache_hits / (Qcache_hits + Qcache_inserts) * 100%. A large value is generally not recommended; 256 MB is often enough, and it can be enlarged for mostly static data. Check the current query cache usage with show status like 'Qcache_%'
  • read_buffer_size: the size of the sequential-read buffer. A thread doing a sequential scan of a table is allocated one buffer of this size; if sequential scans are very frequent, increasing this value can improve performance
  • sort_buffer_size: the buffer used for sorting. To speed up ORDER BY, first see whether MySQL can use an index instead of an extra sort phase; if not, try increasing sort_buffer_size
  • read_rnd_buffer_size: the random-read buffer size. When rows are read in non-sequential order (for example, in sort order), a random-read buffer is allocated, and MySQL scans it first to avoid disk seeks and speed up the query. If large amounts of data are sorted, this value can be raised, but it is allocated per client connection, so set it carefully to avoid excessive memory use
  • record_buffer: each thread doing a sequential scan allocates a buffer of this size for each table it scans; if you do many sequential scans, you may want to increase this value
  • table_cache: similar to thread_cache_size, but caches table file handles. It matters little for InnoDB and is mainly useful for MyISAM
  • thread_cache_size: caches threads that are not currently associated with a connection but are ready to serve a later one, so new connection requests can be handled without creating a new thread
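
A minimal sketch of checking the hit ratios mentioned above from a client session; the status variable names are standard MySQL status counters:

    -- MyISAM key buffer: keep key_reads / key_read_requests below 0.1%
    SHOW GLOBAL STATUS LIKE 'Key_read%';

    -- InnoDB buffer pool: keep
    -- (Innodb_buffer_pool_read_requests - Innodb_buffer_pool_reads)
    --   / Innodb_buffer_pool_read_requests
    -- as close to 100% as possible
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';

    -- Query cache usage and hit ratio
    SHOW GLOBAL STATUS LIKE 'Qcache%';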

Upgrade the hardware

Scaling up can significantly improve MySQL performance: add CPU and memory, or switch to SSDs, depending on whether the workload is CPU-bound or I/O-bound.

Read/write splitting

This is also a common optimization: reads go to replicas and writes go to the primary. Dual-master or multi-master setups are generally avoided because they introduce a great deal of complexity; prefer the other approaches in this article to improve performance first. Many sharding solutions also incorporate read/write splitting.

Caching

Caching can occur at these levels:

  • MySQL internal: covered by the settings described in the system tuning parameters section
  • Data access layer: for example, MyBatis can cache per SQL statement, while Hibernate can cache down to a single record; the cached objects are persistent objects (POs)
  • Application service layer: caching can be controlled more precisely and with more flexible strategies in code; the cached objects here are data transfer objects (DTOs)
  • Web layer: caches rendered web pages
  • Browser client: the client-side cache

Caches can be added at one or more of these levels in combination, depending on the situation. This section focuses on service-layer caching, which is currently implemented in two main ways:

  • Write Through: after data is written to the database, the cache is updated as well, keeping the database and the cache consistent. This is how most application caching frameworks, such as Spring Cache, work. It is simple to implement and keeps the two well synchronized, but its efficiency is only average.
  • Write Back: when data is written, only the cache is updated, and the cached data is flushed to the database asynchronously in batches. This is more complex to implement, requires more application logic, and the database and cache can be temporarily out of sync, but it is very efficient.

Table partitioning

Partitioning, introduced in MySQL 5.1, is a simple form of horizontal splitting: users only need to add partitioning parameters when creating the table, and it is transparent to the application, requiring no code changes

To the user, a partitioned table is a single logical table, but underneath it consists of multiple physical sub-tables. The partitioning code wraps a set of underlying tables and is a complete black box to the SQL layer. The way MySQL implements partitioning also means that indexes are defined per partition sub-table; there is no global index.

SQL statements need to be written with the partitioned table in mind: the partitioning column should appear in the query conditions so the query can be pruned to a small number of partitions, otherwise all partitions are scanned. You can check which partitions a given SQL statement touches with EXPLAIN PARTITIONS; for example, a query might match five records that all fall in just two partitions, as sketched below:
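
A minimal sketch, with invented table and column names, of a table partitioned by year and of checking partition pruning with EXPLAIN PARTITIONS (from MySQL 5.7 on, plain EXPLAIN also shows a partitions column):

    CREATE TABLE orders (
        id         BIGINT UNSIGNED NOT NULL,
        user_id    INT UNSIGNED NOT NULL,
        created_at DATETIME NOT NULL,
        PRIMARY KEY (id, created_at)   -- the partitioning column must be part of every unique key
    ) ENGINE = InnoDB
    PARTITION BY RANGE (YEAR(created_at)) (
        PARTITION p2015 VALUES LESS THAN (2016),
        PARTITION p2016 VALUES LESS THAN (2017),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

    -- Include the partitioning column in the WHERE clause so only a few
    -- partitions are scanned; EXPLAIN PARTITIONS shows which ones:
    EXPLAIN PARTITIONS
    SELECT id FROM orders
    WHERE created_at BETWEEN '2016-01-01' AND '2016-06-30';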

The benefits of partitioning are:

  • More data can be stored in a single table
  • The data in partitioned tables is easier to maintain, and large amounts of data can be removed in batches by cleaning the entire partition, or new partitions can be added to support newly inserted data. In addition, a separate partition can be optimized, checked, repaired, and so on
  • Some queries can be pruned, based on the query conditions, to only a few partitions, which can be very fast
  • Partitioned table data can also be distributed across different physical devices, making efficient use of multiple hardware devices
  • Partitioned tables can be used to avoid certain special bottlenecks, such as mutually exclusive access to a single InnoDB index or inode lock contention in the ext3 file system
  • Individual partitions can be backed up and restored

Limitations and disadvantages of partitioning:

  • A table can have a maximum of 1024 partitions
  • If the table has a primary key or unique index, the partitioning columns must be included in every primary key and unique index
  • Partitioned tables cannot use foreign key constraints
  • A NULL value in the partitioning column defeats partition pruning
  • All partitions must use the same storage engine

Type of partition:

  • RANGE partitioning: rows are assigned to a partition based on column values falling within a given contiguous interval
  • LIST partitioning: similar to RANGE partitioning, except that the partition is selected by matching the column value against a discrete set of values
  • HASH partitioning: the partition is selected from the return value of a user-defined expression computed over the column values of the row being inserted. The expression can be any valid MySQL expression that yields a non-negative integer
  • KEY partitioning: similar to HASH partitioning, except that only one or more columns (not arbitrary expressions) can be given and the MySQL server supplies its own hash function; the columns are not required to be integers (see the sketches after this list)
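
Sketches of the corresponding partitioning clauses (fragments to be appended to a CREATE TABLE statement); the column names are assumptions for the example:

    -- RANGE
    PARTITION BY RANGE (YEAR(created_at)) (
        PARTITION p2016 VALUES LESS THAN (2017),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    )

    -- LIST
    PARTITION BY LIST (region_id) (
        PARTITION p_north VALUES IN (1, 2, 3),
        PARTITION p_south VALUES IN (4, 5, 6)
    )

    -- HASH: the expression must yield a non-negative integer
    PARTITION BY HASH (user_id) PARTITIONS 8

    -- KEY: columns only, MySQL supplies the hash function
    PARTITION BY KEY (user_id) PARTITIONS 8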

Partitioning is suitable for the following scenarios:

The best fit is data with a strong time dimension, which can then be partitioned by time.

Queries that add a time-range condition are very efficient, and unneeded historical data is easy to delete in bulk, as the sketch below shows.
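
A sketch of the maintenance operations this enables, reusing the example orders table partitioned by year from the earlier sketch:

    -- Add a partition for newly arriving data by splitting the catch-all partition
    ALTER TABLE orders REORGANIZE PARTITION pmax INTO (
        PARTITION p2017 VALUES LESS THAN (2018),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

    -- Remove a whole year of historical data in one operation
    ALTER TABLE orders DROP PARTITION p2015;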

If the data has obvious hot spots and everything else is rarely accessed, the hot data can be placed in its own partition so that it has a chance to stay cached in memory; queries then touch only a small partition table and can make efficient use of indexes and caches

In addition, MySQL has an older, simpler partitioning mechanism, the MERGE table, which is more limited and poorly optimized; it is not recommended, and the newer partitioning mechanism should be used instead

Vertical splitting

Vertical splitting divides data according to how the tables in the database relate to each other. For example, if a database contains both user data and order data, a vertical split puts the user data in a user database and the order data in an order database. Vertical table splitting applies the same idea to a single table: a wide table with many columns is commonly split by separating frequently used columns from rarely used ones. The resulting tables hold the same number of rows but different columns, and are linked through the primary key

For example, an originally wide Users table can be split vertically into a core Users table and a UserExtras table for the less frequently used fields.
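
A minimal sketch of this split, with assumed column names; both tables share the same primary key:

    -- Frequently used fields stay in the core table
    CREATE TABLE Users (
        id    INT UNSIGNED NOT NULL,
        name  VARCHAR(32) NOT NULL DEFAULT '',
        email VARCHAR(64) NOT NULL DEFAULT '',
        PRIMARY KEY (id)
    ) ENGINE = InnoDB;

    -- Rarely used or large fields move to a companion table, joined on the same id
    CREATE TABLE UserExtras (
        id        INT UNSIGNED NOT NULL,
        address   VARCHAR(255) NOT NULL DEFAULT '',
        signature VARCHAR(255) NOT NULL DEFAULT '',
        PRIMARY KEY (id)
    ) ENGINE = InnoDB;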

The advantages of vertical splitting are:

  • Rows become smaller, so more of them fit in one block and fewer I/O operations (fewer blocks read) are needed per query
  • To make the most of the cache, put columns that rarely change together and columns that change frequently together when splitting vertically
  • Data maintenance is simple

The disadvantages are:

  • The primary key becomes redundant, and the redundant columns have to be managed
  • It forces JOINs across tables (adding CPU overhead); database pressure can be reduced by joining the data on the application server instead
  • A single table can still hold too much data (horizontal splitting addresses that)
  • Transaction handling becomes more complex

Horizontal splitting

Overview

Horizontal splitting stores data in shards, and comes in two flavors: splitting into multiple tables within one database (table sharding) and splitting across multiple databases (database sharding). Each shard of data is distributed to a different MySQL table or instance, achieving a distributed effect that can support very large data volumes. The table partitioning described earlier is essentially a special kind of in-database table sharding

Table sharding alone only solves the problem of a single table being too large. Because the data is not distributed to different machines, it does little to relieve the pressure on the MySQL server: everything still competes for the same physical I/O, CPU, and network. That is what database sharding addresses

If the Users table were split horizontally, the result might be Users_A_M and Users_N_Z, divided by the first letter of the user name.

In practice, vertical and horizontal splitting are often combined: Users_A_M and Users_N_Z are each further split into Users and UserExtras, giving four tables in total, as sketched below
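
A sketch of the combined result, reusing the Users and UserExtras definitions from the vertical-split sketch; the routing rule (first letter of the user name) is an assumption for the example:

    -- Horizontal shards by the first letter of the user name
    CREATE TABLE Users_A_M      LIKE Users;
    CREATE TABLE UserExtras_A_M LIKE UserExtras;
    CREATE TABLE Users_N_Z      LIKE Users;
    CREATE TABLE UserExtras_N_Z LIKE UserExtras;

    -- The application (or middleware) routes by the shard key, e.g.
    -- name 'alice' -> Users_A_M, name 'oscar' -> Users_N_Z
    SELECT id, name FROM Users_A_M WHERE name = 'alice';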

The advantages of horizontal splitting are:

  • There is no performance bottleneck from huge single-database data volumes or high concurrency
  • Relatively few changes are needed on the application side
  • The stability and load capacity of the system are improved

The disadvantages are:

  • Transaction consistency across shards is hard to achieve
  • Cross-node JOINs perform poorly and are complicated to implement
  • Expanding the data repeatedly and maintaining it is extremely difficult

Sharding principles

  • Apply the single-table optimizations above first if you can
  • Keep the number of shards as small as possible and distribute them evenly across the data nodes; the more shards a query crosses, the worse its overall performance (even if it beats putting all the data in one shard), so only increase the shard count when necessary
  • Choose and plan the sharding rule carefully and in advance, taking into account how the data grows, how it is accessed, how the shards relate to each other, and how they will be expanded. Common sharding strategies such as range sharding, enumeration (list) sharding, and consistent-hash sharding are all expansion-friendly
  • Try not to let a single transaction span multiple shards; distributed transactions are always a headache
  • Optimize query conditions and avoid SELECT *: large result sets consume a lot of bandwidth and CPU, so avoid returning them, and build indexes for frequently used query statements
  • Reduce the likelihood of cross-database joins through data redundancy and by grouping related tables on the same shard

The choice of sharding rule deserves special emphasis. If a table's data has an obvious time dimension, such as orders or transaction records, sharding by time range is often appropriate: the data is time-sensitive, we usually care most about recent data, and queries tend to filter on a time field. A good approach is to shard currently active data with a short time span and historical data with a long time span.

Generally speaking, the shard key is chosen according to the conditions of the most frequently executed SQL, because a query with no WHERE condition on the shard key must traverse every shard and performs worst. The more such SQL there is, the greater the impact on the system, so try to avoid it.
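
A sketch of the difference, assuming orders are sharded by month on the order time and queried through sharding middleware that exposes a logical orders table; the schema and values are invented for the example:

    -- Good: the shard key (order time) appears in the condition, so the
    -- middleware or application can route the query to a single shard
    SELECT id, amount FROM orders
    WHERE created_at BETWEEN '2017-01-01' AND '2017-01-31'
      AND user_id = 42;

    -- Bad: no condition on the shard key; every shard must be queried
    -- and the results merged
    SELECT id, amount FROM orders WHERE amount > 1000;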

Solutions

Because the logic involved in horizontal splitting is complicated, many mature solutions exist. They fall into two broad categories:

  • Client architecture
  • Proxy architecture

Client architecture

The data access layer (for example JDBC, the DataSource, or MyBatis) is modified so that multiple data sources can be managed through configuration; the application connects to the databases directly, and shard routing and result merging are done inside this module, which is usually shipped as a JAR package

Here is an example of a client architecture:

In this example, sharding is implemented inside the application server by modifying the Spring JDBC layer

The advantages of the client architecture are:

  • Direct database connections reduce the downtime risk that comes from depending on peripheral systems
  • Low integration cost; no additional operations components to maintain

The disadvantages are:

  • Limited to the database access layer, with limited extensibility; it may be inadequate for more complex systems
  • It places the load of the sharding logic on the application server, which introduces additional risk

Proxy architecture

Independent middleware manages all data sources and performs shard routing and result merging; the back-end database cluster is transparent to the front-end application, and the proxy component is deployed and operated on its own

Here is an example of a proxy architecture:

Proxy components are deployed as a cluster to avoid a single point of failure, and may require a coordination service such as ZooKeeper to manage them

The advantages of the proxy architecture are:

  • Can handle very complex requirements; not limited by the existing implementation of the database access layer; good extensibility
  • Transparent to the application server and adds no load to it

The disadvantages are:

  • The independent proxy middleware must be deployed and operated, which adds cost
  • The application connects to the database through the proxy, adding a network hop that costs some performance and introduces additional risk

Comparison of different schemes

Framework | Vendor | Architecture | Supported databases | Database sharding | Table sharding | Read/write splitting | External dependencies | Open source | Implementation language | Supported languages | GitHub stars
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
MySQL Fabric | MySQL (official) | Proxy | MySQL | Yes | Yes | Yes | None | Yes | Python | Any | 35
Cobar | Alibaba | Proxy | MySQL | Yes | No | No | None | Yes | Java | Any | 1287
Cobar Client | Alibaba | Client | MySQL | Yes | No | No | None | Yes | Java | Java | 344
TDDL | Taobao | Client | Any | Yes | Yes | Yes | Diamond | Partially | Java | Java | 519
Atlas | Qihoo 360 | Proxy | MySQL | Yes | Yes | Yes | None | Yes | C | Any | 1941
Heisenberg | Baidu (Xiong Zhao) | Proxy | MySQL | Yes | Yes | Yes | None | Yes | Java | Any | 197
TribeDB | Individual | Proxy | MySQL | Yes | Yes | Yes | None | Yes | NodeJS | Any | 126
Sharding-JDBC | Dangdang | Client | MySQL | Yes | Yes | Yes | None | Yes | Java | Java | 1144
Shark | Individual | Client | MySQL | Yes | Yes | No | None | Yes | Java | Java | 84
KingShard | Individual | Proxy | MySQL | Yes | Yes | Yes | None | Yes | Golang | Any | 1836
OneProxy | Civilian Software | Proxy | MySQL | Yes | Yes | Yes | None | No | Unknown | Any | Unknown
MyCat | Community | Proxy | MySQL | Yes | Yes | Yes | None | Yes | Java | Any | 1270
Vitess | YouTube | Proxy | MySQL | Yes | Yes | Yes | None | Yes | Golang | Any | 3636
Mixer | Individual | Proxy | MySQL | Yes | Yes | No | None | Yes | Golang | Any | 472
JetPants | Tumblr | Client | MySQL | Yes | Yes | No | None | Yes | Ruby | Ruby | 957
HibernateShard | Hibernate | Client | Any | Yes | Yes | No | None | Yes | Java | Java | 57
MybatisShard | MakerSoft | Client | Any | Yes | Yes | No | None | Yes | Java | Java | 119
Gizzard | Twitter | Proxy | Any | Yes | Yes | No | None | Yes | Java | Any | 2087

With so many solutions, how do you choose? Consider it along these lines:

  1. Decide between the proxy and client architectures. Small to medium-sized or simple scenarios tend to favor the client architecture; complex scenarios or large-scale systems tend to favor the proxy architecture
  2. Check whether the specific features you need, such as cross-node ORDER BY, are supported; if so, supporting that feature takes priority
  3. Do not consider a product that has not been updated for a year: that suggests development has stalled, perhaps even that there is no maintenance or technical support
  4. Prefer vendors roughly in the order large company -> community -> small company -> individual
  5. Choose products with a good reputation, judged by GitHub stars, the number and quality of users, and user feedback
  6. Prefer open source; projects with special requirements often need to modify the source code

According to the above ideas, the following options are recommended:

  • Client architecture: Sharding-JDBC
  • Proxy architecture: MyCat or Atlas

MySQL-compatible, horizontally scalable databases

There are also some open source databases that are compatible with the MySQL protocol, such as:

  • TiDB
  • Cubrid

However, their industrial maturity still lags behind MySQL, and they require significant operations investment. If you want to migrate existing MySQL data to a horizontally scalable database, you can also consider some cloud databases:

  • Alibaba Cloud PetaData
  • Alibaba Cloud OceanBase
  • Tencent Cloud DCDB

NoSQL

Sharding on MySQL is dancing in shackles. Many large tables do not actually need ACID guarantees or a relational database at all; consider migrating data like the following to NoSQL to solve the horizontal-scaling problem completely:

  • Log, monitoring, and statistics data
  • Unstructured or weakly structured data
  • Data with weak transaction requirements and few operations that relate it to other data