
When the number of records in a single MySQL table grows too large, the performance of inserts, deletes, updates, and queries drops sharply. You can follow the steps below to optimize:

Single table optimization

Unless the data in the table is expected to keep growing, do not consider splitting at first: splitting introduces all kinds of complexity in logic, deployment, operations, and maintenance. In general, tables that are mostly integer values are fine below ten million rows, and tables that are mostly strings are fine below five million rows. In fact, single-table MySQL performance usually still has plenty of room for optimization, and can often handle data volumes well above ten million rows:

Fields

  • Prefer TINYINT, SMALLINT, and MEDIUMINT over INT for integer columns when the range allows; add UNSIGNED if the value can never be negative
  • Allocate only the length that is really needed for VARCHAR columns
  • Use enumerations or integers instead of string types
  • Prefer TIMESTAMP over DATETIME
  • Do not put too many columns in a single table; fewer than 20 is recommended
  • Avoid NULLable columns: they are hard to optimize in queries and take extra index space
  • Store IP addresses as integers (see the example table definition after this list)
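
A minimal sketch of a table definition applying these recommendations; the table and column names are hypothetical, and INET_ATON / INET_NTOA convert between dotted IP notation and the integer form:

CREATE TABLE access_log (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id MEDIUMINT UNSIGNED NOT NULL,              -- small unsigned integer instead of INT
    status TINYINT UNSIGNED NOT NULL,                 -- enumeration stored as a small integer
    client_ip INT UNSIGNED NOT NULL,                  -- IP address stored as an integer
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    note VARCHAR(64) NOT NULL DEFAULT '',             -- only as long as really needed, no NULL
    PRIMARY KEY (id)
);

INSERT INTO access_log (user_id, status, client_ip)
VALUES (42, 1, INET_ATON('192.168.0.1'));

SELECT INET_NTOA(client_ip) FROM access_log;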

Indexes

  • More indexes are not always better: create them for the actual queries, and consider indexing the columns that appear in WHERE and ORDER BY clauses; use EXPLAIN to check whether an index or a full table scan is used (see the sketch after this list)
  • Avoid testing columns for NULL in the WHERE clause, otherwise the engine will abandon the index and do a full table scan
  • Do not index columns with very few distinct values, such as "gender", which has only two or three values
  • Index character columns by prefix only
  • Character columns are best not used as primary keys
  • Do not use foreign keys; let the application enforce the constraint
  • Avoid UNIQUE constraints where possible; let the application guarantee uniqueness
  • Use multi-column indexes whose column order matches the query conditions, and remove unnecessary single-column indexes
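
A hedged sketch of these points (the orders and users tables and their columns are hypothetical): a composite index whose column order matches the query conditions, a prefix index on a character column, and an EXPLAIN to verify that the index is used:

-- composite index matching the WHERE and ORDER BY columns
CREATE INDEX idx_user_status_created ON orders (user_id, status, created_at);

-- prefix index on a character column
CREATE INDEX idx_email_prefix ON users (email(10));

-- verify that the plan uses the index instead of a full table scan
EXPLAIN SELECT * FROM orders
WHERE user_id = 42 AND status = 1
ORDER BY created_at DESC;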

Query SQL

  • Slow SQL can be found by enabling the slow query log
  • Do not operate on columns: in SELECT id FROM t WHERE age + 1 = 10, any operation on a column, including function calls and computed expressions, causes a full table scan. Move the operation to the right-hand side of the comparison whenever possible (a few rewrites are sketched after this list)
  • Keep SQL statements as simple as possible: a single SQL statement can only run on one CPU; break large statements into small ones to shorten lock time; one huge SQL statement can block the whole database
  • Do not use SELECT *
  • Rewrite OR as IN: OR is O(n) while IN is O(log n); keeping the number of IN values below 200 is recommended
  • Do not use stored functions or triggers; implement the logic in the application
  • Avoid LIKE '%xxx' queries
  • Use JOIN sparingly
  • Compare values of the same type, e.g. '123' with '123' and 123 with 123
  • Avoid != or <> in the WHERE clause, otherwise the engine will abandon the index and do a full table scan
  • For continuous values use BETWEEN instead of IN: SELECT id FROM t WHERE num BETWEEN 1 AND 5
  • Do not fetch the whole table; use LIMIT for pagination and keep the page size small
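
A minimal sketch of a few of these rewrites, reusing the hypothetical table t from the examples above:

-- move the operation off the column so an index on age can be used
SELECT id FROM t WHERE age + 1 = 10;   -- full table scan
SELECT id FROM t WHERE age = 10 - 1;   -- can use an index on age

-- rewrite OR as IN
SELECT id FROM t WHERE num = 1 OR num = 2 OR num = 3;
SELECT id FROM t WHERE num IN (1, 2, 3);

-- use BETWEEN for continuous values
SELECT id FROM t WHERE num BETWEEN 1 AND 5;

-- paginate instead of fetching the whole table
SELECT id, num FROM t ORDER BY id LIMIT 0, 100;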

Engines

MyISAM and InnoDB are two widely used engines:

MyISAM

The MyISAM engine is the default engine for MySQL 5.1 and earlier. It features:

  • Row-level locks are not supported: reads lock all the tables that need to be read, and writes take an exclusive lock on the table
  • Transactions not supported
  • Foreign keys are not supported
  • Safe recovery after a crash is not supported
  • New records can be inserted into a table while a query is being read
  • Supports indexing the first 500 characters of BLOB and TEXT columns, and supports full-text indexes
  • Supports delayed key writes, which greatly improves write performance (see the sketch after this list)
  • Tables that are not modified can be compressed to greatly reduce disk space usage
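
A hedged sketch of enabling delayed key writes when creating a MyISAM table (the table is hypothetical); compressing read-only tables is done offline with the myisampack tool:

CREATE TABLE log_archive (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    message TEXT,
    PRIMARY KEY (id)
) ENGINE=MyISAM DELAY_KEY_WRITE=1;   -- index blocks are flushed lazily, improving write throughput
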
InnoDB

InnoDB became the default storage engine in MySQL 5.5 and later. Its features:

  • Supports row-level locking and uses MVCC to support high concurrency
  • Support transactions
  • Support foreign keys
  • Supports safe recovery after a crash
  • Does not support full-text indexes (they were added to InnoDB in MySQL 5.6)

In general, MyISAM suits SELECT-intensive tables, while InnoDB suits INSERT- and UPDATE-intensive tables; the engine can be chosen per table, as sketched below.
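
A minimal sketch of choosing the engine per table (the table names are hypothetical):

-- choose the engine when creating the table
CREATE TABLE report_cache (
    id INT UNSIGNED NOT NULL PRIMARY KEY,
    body TEXT
) ENGINE=MyISAM;

-- or convert an existing table
ALTER TABLE orders ENGINE=InnoDB;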

System tuning parameters

You can use the following tools to benchmark:

  • Sysbench: a modular, cross-platform, and multithreaded performance testing tool
  • iibench-mysql: a Java-based insert performance test tool for MySQL/Percona/MariaDB indexes
  • tpcc-mysql: a TPC-C test tool developed by Percona

You can refer to the official documentation for details on tuning parameters. Some important ones are listed below (the two cache hit ratios are checked in the sketch after this list):

  • back_log: how many connection requests can be queued in a short time before MySQL temporarily stops answering new requests. In other words, if the number of MySQL connections reaches max_connections, new requests are put in a queue to wait for some connection to release resources; the size of that queue is back_log, and requests beyond it are refused. It can be raised from the default 50 to 500
  • wait_timeout: how long an idle database connection is kept. Idle connections occupy memory; this can be reduced from the default 8 hours to half an hour
  • max_user_connections: the maximum number of connections per user; the default 0 means no limit
  • thread_concurrency: set the number of concurrent threads to twice the number of CPU cores
  • skip_name_resolve: disables DNS resolution for incoming connections, eliminating DNS lookup time; note that all remote hosts must then connect by IP address
  • key_buffer_size: the cache size for index blocks. Increasing it improves index handling speed and has the biggest impact on MyISAM table performance. For roughly 4 GB of memory, 256M or 384M is reasonable. Check with show status like 'key_read%' and keep key_reads / key_read_requests below 0.1%
  • innodb_buffer_pool_size: caches data and index pages and has the biggest impact on InnoDB table performance. Check with show status like 'Innodb_buffer_pool_read%' and make sure (Innodb_buffer_pool_read_requests - Innodb_buffer_pool_reads) / Innodb_buffer_pool_read_requests is as high as possible
  • innodb_additional_mem_pool_size: the memory InnoDB uses to store the data dictionary and some internal data structures. When there are very many database objects, increase it so that all of this data fits in memory; if it is too small, MySQL writes warnings to the error log, which is the signal to enlarge it
  • innodb_log_buffer_size: the buffer InnoDB uses for transaction log records; it is generally recommended not to exceed 32 MB
  • query_cache_size: caches MySQL result sets, i.e. the results of SELECT statements (it applies only to SELECT). Any change to a table invalidates all cached results of SELECT statements that reference that table, so when data changes very frequently the query cache may not be worth the cost. Tune it by the hit rate (Qcache_hits / (Qcache_hits + Qcache_inserts) * 100%). It is generally not recommended to make it too large; 256 MB may be enough, and it can be raised for mostly static data. Check current usage with show status like 'Qcache_%'
  • read_buffer_size: the size of MySQL's sequential read buffer. A sequential scan of a table is given a read buffer; if sequential scans are very frequent, increasing this value can improve performance
  • sort_buffer_size: the buffer MySQL uses for sorting. To speed up ORDER BY, first check whether MySQL can use an index instead of an extra sort phase; if not, try increasing sort_buffer_size
  • read_rnd_buffer_size: MySQL's random read buffer size. When rows are read in non-sequential order (for example, in sort order), a random read buffer is allocated, and MySQL scans it first to avoid disk seeks and speed up the query. If a lot of data needs to be sorted, this can be raised, but MySQL allocates this buffer for every client connection, so set it carefully to avoid excessive memory use
  • record_buffer: each thread doing a sequential scan allocates a buffer of this size for every table it scans; if you do many sequential scans, you may want to increase it
  • thread_cache_size: holds threads that are not currently bound to a connection but are ready to serve a later one, so connection requests can be answered quickly without creating new threads
  • table_cache: similar to thread_cache_size, but caches table file handles; it is mainly useful for MyISAM and not very useful for InnoDB
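
A minimal sketch of checking the two hit ratios mentioned above from the status counters, and of raising the InnoDB buffer pool (the 2 GB value is only an example; resizing innodb_buffer_pool_size online requires MySQL 5.7+):

-- MyISAM key buffer: Key_reads / Key_read_requests should stay below 0.1%
SHOW GLOBAL STATUS LIKE 'Key_read%';

-- InnoDB buffer pool: (read_requests - reads) / read_requests should be close to 1
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';

-- example: set the buffer pool to 2 GB
SET GLOBAL innodb_buffer_pool_size = 2147483648;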

Upgrade the hardware

Scale up: depending on whether MySQL is CPU-bound or I/O-bound, adding CPU and memory or switching to SSDs can significantly improve MySQL performance

Read/write splitting

This is also a commonly used optimization: reads go to the replicas and writes go to the master. Generally, do not use dual-master or multi-master setups, which introduce a lot of complexity; prefer the other approaches in this article to improve performance first. Many splitting solutions also include read/write splitting

Caching

Caching can occur at these levels:

  • MySQL internal: covered by the related settings in the system tuning parameters section
  • Data access layer: for example, MyBatis caches SQL statements, while Hibernate can cache down to individual records; the cached objects are mainly persistent objects (Persistence Object)
  • Application service layer: caching can be controlled more precisely and with more strategies in code; the cached objects are data transfer objects (Data Transfer Object)
  • Web layer: caching of web pages
  • Browser client: the client-side cache

Caches can be added at one or more of these levels, in combination, depending on the situation. This section mainly covers caching at the service layer, which currently comes in two main flavors:

  • Write Through: after data is written to the database, the cache is updated as well, so the database and the cache stay consistent. This is how most application caching frameworks, such as Spring Cache, work. The implementation is simple and synchronization is reliable, but efficiency is mediocre.
  • Write Back: when data changes, only the cache is updated, and the cached data is asynchronously written back to the database in batches. This implementation is more complex and requires more application logic, and the database and cache may be out of sync for a while, but efficiency is very high.

Table partitioning

Partitioning, introduced in MySQL 5.1, is a simple form of horizontal splitting: the user adds partitioning parameters when creating the table, and it is transparent to the application, so no code changes are needed

To the user, a partitioned table is a single logical table, but underneath it is composed of multiple physical sub-tables. The partitioning code is effectively a black box that encapsulates the underlying tables from the SQL layer. The way MySQL implements partitioning also means that indexes are defined per partition sub-table; there is no global index

SQL statements need to be written with the partitioned table in mind: the partitioning column should appear in the query conditions so that the query can be narrowed to a small number of partitions, otherwise all partitions are scanned. You can check which partitions a given SQL statement touches with EXPLAIN PARTITIONS and optimize accordingly. In the example below, 5 records fall on two partitions:

mysql> explain partitions select count(1) from user_partition where id in(1, 2, 3, 4, 5);
+----+-------------+----------------+------------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table          | partitions | type  | possible_keys | key     | key_len | ref  | rows | Extra                    |
+----+-------------+----------------+------------+-------+---------------+---------+---------+------+------+--------------------------+
|  1 | SIMPLE      | user_partition | p1,p4      | range | PRIMARY       | PRIMARY | 8       | NULL |    5 | Using where; Using index |
+----+-------------+----------------+------------+-------+---------------+---------+---------+------+------+--------------------------+
1 row in set (0.00 sec)

The benefits of partitioning are:

  • More data can be stored in a single table
  • Data in a partitioned table is easier to maintain: large amounts of data can be removed in batches by dropping an entire partition, and new partitions can be added for newly inserted data; individual partitions can also be optimized, checked, and repaired independently (see the sketch after this list)
  • Some queries can be determined from the query conditions to touch only a few partitions, making them very fast
  • Partitioned table data can be distributed across different physical devices, making use of multiple pieces of hardware
  • Partitioned tables can be used to avoid certain special bottlenecks, such as mutually exclusive access to a single InnoDB index or inode lock contention in the ext3 file system
  • Individual partitions can be backed up and restored
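
A hedged sketch of the per-partition maintenance operations mentioned above, against a hypothetical table t that is RANGE-partitioned by year and has no MAXVALUE partition:

ALTER TABLE t DROP PARTITION p0;        -- remove old data in one cheap batch operation
ALTER TABLE t ADD PARTITION (PARTITION p9 VALUES LESS THAN (2030));  -- room for new data
ALTER TABLE t OPTIMIZE PARTITION p1;    -- maintain a single partition
ALTER TABLE t CHECK PARTITION p1;
ALTER TABLE t REPAIR PARTITION p1;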

Limitations and disadvantages of partitioning:

  • A table can have a maximum of 1024 partitions
  • If the table has a primary key or unique indexes, the partitioning columns must be part of every primary key and unique index
  • Partitioned tables cannot use foreign key constraints
  • A NULL value invalidates partition filtering
  • All partitions must use the same storage engine

Partition types:

  • RANGE partition: Allocates multiple rows to a partition based on column values belonging to a given contiguous interval
  • LIST partitioning: Similar to partitioning by RANGE, except that LIST partitioning is selected based on column values matching a value from a discrete set of values
  • HASH partitioning: the partition is selected based on the return value of a user-defined expression computed over the column values of the rows to be inserted; the expression can be any valid MySQL expression that produces a non-negative integer value
  • KEY partitioning: similar to HASH partitioning, except that only one or more columns (not an expression) are given and the MySQL server supplies its own hashing function, so the columns do not have to be integers (both are sketched below)
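
Hedged sketches of LIST, HASH, and KEY partitioning (the table definitions are hypothetical; a RANGE example follows in the next section):

-- LIST: partitions chosen from a discrete set of values
CREATE TABLE employees (
    id INT NOT NULL,
    store_id INT NOT NULL
)
PARTITION BY LIST (store_id) (
    PARTITION p_east VALUES IN (1, 2, 3),
    PARTITION p_west VALUES IN (4, 5, 6)
);

-- HASH: partition chosen by an expression that returns a non-negative integer
CREATE TABLE sessions (
    user_id INT NOT NULL,
    created DATE NOT NULL
)
PARTITION BY HASH (user_id) PARTITIONS 4;

-- KEY: like HASH, but MySQL supplies the hashing function
CREATE TABLE tokens (
    token CHAR(32) NOT NULL,
    PRIMARY KEY (token)
)
PARTITION BY KEY (token) PARTITIONS 4;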

Partitioning is suitable for the following scenarios:

  • The best fit is data with a strong time dimension, which can then be partitioned by time, as shown below:
CREATE TABLE members (
    firstname VARCHAR(25) NOT NULL,
    lastname VARCHAR(25) NOT NULL,
    username VARCHAR(16) NOT NULL,
    email VARCHAR(35),
    joined DATE NOT NULL
)
PARTITION BY RANGE( YEAR(joined) ) (
    PARTITION p0 VALUES LESS THAN (1960),
    PARTITION p1 VALUES LESS THAN (1970),
    PARTITION p2 VALUES LESS THAN (1980),
    PARTITION p3 VALUES LESS THAN (1990),
    PARTITION p4 VALUES LESS THAN MAXVALUE
);

Querying with a time-range condition is very efficient, and unneeded historical data is easy to delete in batches, as sketched below.
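
A minimal sketch against the members table above: a time-range condition lets the optimizer prune partitions, and old data is removed by dropping a partition:

-- only partitions p3 and p4 need to be scanned
EXPLAIN PARTITIONS
SELECT COUNT(*) FROM members WHERE joined BETWEEN '1985-01-01' AND '1995-12-31';

-- remove all members who joined before 1960 in one cheap operation
ALTER TABLE members DROP PARTITION p0;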

  • If the data has obvious hot spots, and data outside the hot set is rarely accessed, the hot data can be placed in a separate partition so that it has a chance of staying cached in memory; queries then only touch a small partition table, making efficient use of indexes and caches

In addition, MySQL has an earlier, simpler partitioning implementation, merge tables, which is more limited and lacks optimizations; its use is discouraged, and it should be replaced by the newer partitioning mechanism

Vertical splitting

Vertical database splitting divides tables among databases according to how closely the data is related. For example, if one database holds both user data and order data, vertical splitting puts the user data into a user database and the order data into an order database. Vertical table splitting splits a single table by columns: a wide table with many columns is commonly split into frequently used and rarely used columns; the resulting tables have the same number of rows but different columns, linked through the primary key

For example, the original user table is:

The vertical split is:
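
The original figures are not reproduced here; as a hedged sketch of the idea, with hypothetical column names, a wide Users table is split into a narrow Users table and a UserExtras table that share the primary key:

-- frequently used columns stay in the narrow, hot table
CREATE TABLE Users (
    id INT UNSIGNED NOT NULL PRIMARY KEY,
    name VARCHAR(32) NOT NULL,
    email VARCHAR(64) NOT NULL
);

-- rarely used, wide columns move to a companion table keyed by the same id
CREATE TABLE UserExtras (
    user_id INT UNSIGNED NOT NULL PRIMARY KEY,   -- same value as Users.id
    address VARCHAR(255),
    bio TEXT
);

-- the full row is reassembled with a primary-key join when needed
SELECT u.name, e.bio
FROM Users u JOIN UserExtras e ON e.user_id = u.id
WHERE u.id = 42;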

The advantages of vertical splitting are:

  • Rows become smaller, so more rows fit in one block, which means fewer I/O operations (fewer blocks read per query)
  • The cache can be used to the fullest: when splitting vertically, put columns that rarely change together and columns that change frequently together
  • Simple data maintenance

The disadvantage is that:

  • The primary key is redundant and redundant columns need to be managed
  • It leads to JOIN operations between the split tables (increasing CPU overhead); database pressure can be reduced by performing the join in the application server instead
  • The problem of a single table holding too much data still remains (which horizontal splitting addresses)
  • Transaction handling becomes more complex

Horizontal splitting

Overview

Horizontal splitting is a strategy of storing data in shards; it comes in two forms, splitting into multiple tables within one database and splitting across databases. Each shard of the data is placed in a different MySQL table or database to achieve a distributed effect that can support very large data volumes. The table partitioning described above is essentially a special kind of in-database table splitting

Splitting into tables within the same database only solves the problem of a single table being too large; because the data is not distributed to different machines, it does little to relieve the MySQL server's load: everything still competes for the same physical I/O, CPU, and network. That is what splitting across databases addresses

If the user table was split horizontally, the result would be:

In practice this is often combined with vertical splitting: Users_A_M and Users_N_Z are each split again into Users and UserExtras, giving four tables in total, as sketched below
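
A hedged sketch of the idea (the shard layout and routing rule are hypothetical): rows go to Users_A_M or Users_N_Z depending on the first letter of the user name, and the application routes each statement to the right shard:

-- two shards with identical structure
CREATE TABLE Users_A_M LIKE Users;
CREATE TABLE Users_N_Z LIKE Users;

-- the application routes by the shard key, e.g. a user named 'alice' goes to Users_A_M
INSERT INTO Users_A_M (id, name, email) VALUES (42, 'alice', 'alice@example.com');

-- a query that carries the shard key only touches one shard
SELECT * FROM Users_A_M WHERE name = 'alice';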

The advantages of horizontal splitting are:

  • There are no single-database bottlenecks from large data volumes or high concurrency
  • Relatively few changes are needed on the application side
  • The stability and load capacity of the system are improved

The disadvantage is that:

  • Shard transaction consistency is difficult to resolve
  • Cross-node Join has poor performance and complicated logic
  • Expanding the shards repeatedly and maintaining them is very difficult

Sharding principles

  • If you can avoid sharding, do so; refer to single table optimization first
  • Keep the number of shards as small as possible and distribute them evenly across the data nodes. The more shards a query spans, the worse its overall performance; although it is still better than having all the data in one shard, expand the number of shards only when necessary
  • Choose and plan the sharding rules carefully in advance. Consider the data's growth pattern, access pattern, shard affinity, and shard expansion. Newer sharding strategies include range sharding, enumeration sharding, and consistent-hash sharding, all of which help with expansion
  • Try not to let a single transaction span multiple shards; distributed transactions have always been a hard problem
  • Optimize query conditions and avoid SELECT *: large result sets consume a lot of bandwidth and CPU. Avoid returning large result sets, and build indexes for frequently used query statements
  • Reduce the likelihood of cross-database joins through data redundancy and by grouping related tables on the same shard

The choice of sharding rule deserves special emphasis. If a table's data has an obvious time dimension, such as orders or transaction records, sharding by time range is often appropriate: such data ages, we usually care most about recent data, and queries tend to filter on a time field. A good approach is to shard currently active data with a short time span and store historical data with a long span.

In general, the shard key is chosen according to the conditions of the most frequently executed SQL, because a query that carries no WHERE condition on the shard key has to visit every shard and performs worst; the more such SQL there is, the greater the impact on the system, so it should be avoided as much as possible.

Solutions

Due to the complicated logic involved in horizontal split, there are many mature solutions. These solutions fall into two broad categories: client architecture and proxy architecture.

Client architecture

The data access layer, such as JDBC data sources or MyBatis, is modified so that multiple data sources can be managed through configuration; the databases are connected directly, and shard aggregation is done inside the application module. This is usually delivered as a JAR package

Here is an example of a client architecture:

You can see that sharding is implemented in conjunction with the application server by modifying the Spring JDBC layer

The advantages of the client architecture are:

  • Use directly connected databases to reduce downtime risks associated with peripheral system dependencies
  • Low integration cost, no additional o&M components

The disadvantage is that:

  • Limited to the database access layer, so extensibility is limited; it may be inadequate for more complex systems
  • Putting the load of the sharding logic on the application server introduces additional risk
Proxy architecture

An independent piece of middleware manages all data sources and performs shard aggregation; the back-end database cluster is transparent to the front-end application, and the proxy component is deployed and operated independently

Here is an example of a proxy architecture:

Proxy components are deployed as a cluster to avoid a single point of failure, and may need a coordination service such as ZooKeeper to manage them

The advantages of the proxy architecture are:

  • Able to handle very complex requirements; it is not limited by the original implementation of the database access layer and is highly extensible
  • Transparent to the application server without any additional load

The disadvantage is that:

  • Independent proxy middleware needs to be deployed and operated, which is costly
  • The application needs to go through the proxy to connect to the database. An extra hop on the network leads to performance loss and additional risks
Comparison of different schemes
| Product        | Vendor             | Architecture | Supported databases | DB sharding | Table sharding | Read/write splitting | External dependencies | Open source | Implementation language | Client languages | Last update   | GitHub stars |
|----------------|--------------------|--------------|---------------------|-------------|----------------|----------------------|-----------------------|-------------|-------------------------|------------------|---------------|--------------|
| MySQL Fabric   | MySQL official     | Proxy        | MySQL               | Yes         | Yes            | Yes                  | None                  | Yes         | Python                  | Any              | 4 months ago  | 35           |
| Cobar          | Alibaba            | Proxy        | MySQL               | Yes         | No             | No                   | None                  | Yes         | Java                    | Any              | 2 years ago   | 1287         |
| Cobar Client   | Alibaba            | Client       | MySQL               | Yes         | No             | No                   | None                  | Yes         | Java                    | Java             | 3 years ago   | 344          |
| TDDL           | Taobao             | Client       | Any                 | Yes         | Yes            | Yes                  | Diamond               | Partially   | Java                    | Java             | Unknown       | 519          |
| Atlas          | Qihoo 360          | Proxy        | MySQL               | Yes         | Yes            | Yes                  | None                  | Yes         | C                       | Any              | 10 months ago | 1941         |
| Heisenberg     | Baidu (Xiong Zhao) | Proxy        | MySQL               | Yes         | Yes            | Yes                  | None                  | Yes         | Java                    | Any              | 2 months ago  | 197          |
| TribeDB        | Individual         | Proxy        | MySQL               | Yes         | Yes            | Yes                  | None                  | Yes         | NodeJS                  | Any              | 3 months ago  | 126          |
| ShardingJDBC   | Dangdang           | Client       | MySQL               | Yes         | Yes            | Yes                  | None                  | Yes         | Java                    | Java             | Within a day  | 1144         |
| Shark          | Individual         | Client       | MySQL               | Yes         | Yes            | No                   | None                  | Yes         | Java                    | Java             | 2 days ago    | 84           |
| KingShard      | Individual         | Proxy        | MySQL               | Yes         | Yes            | Yes                  | None                  | Yes         | Golang                  | Any              | 2 days ago    | 1836         |
| OneProxy       | Civilian Software  | Proxy        | MySQL               | Yes         | Yes            | Yes                  | None                  | No          | Unknown                 | Any              | Unknown       | Unknown      |
| MyCat          | Community          | Proxy        | MySQL               | Yes         | Yes            | Yes                  | None                  | Yes         | Java                    | Any              | 2 days ago    | 1270         |
| Vitess         | YouTube            | Proxy        | MySQL               | Yes         | Yes            | Yes                  | None                  | Yes         | Golang                  | Any              | Within a day  | 3636         |
| Mixer          | Individual         | Proxy        | MySQL               | Yes         | Yes            | No                   | None                  | Yes         | Golang                  | Any              | 9 months ago  | 472          |
| JetPants       | Tumblr             | Client       | MySQL               | Yes         | Yes            | No                   | None                  | Yes         | Ruby                    | Ruby             | 10 months ago | 957          |
| HibernateShard | Hibernate          | Client       | Any                 | Yes         | Yes            | No                   | None                  | Yes         | Java                    | Java             | 4 years ago   | 57           |
| MybatisShard   | MakerSoft          | Client       | Any                 | Yes         | Yes            | No                   | None                  | Yes         | Java                    | Java             | 11 months ago | 119          |
| Gizzard        | Twitter            | Proxy        | Any                 | Yes         | Yes            | No                   | None                  | Yes         | Java                    | Any              | 3 years ago   | 2087         |

With so many options, how do you choose? Think about it along these lines:

  1. Decide between the proxy and the client architecture. Small to medium-sized or simple scenarios tend to favor the client architecture, while complex scenarios or large-scale systems tend to favor the proxy architecture
  2. Check whether specific features are supported, for example cross-node ORDER BY; if you need such a feature, products that support it take priority
  3. Do not consider a product that has not been updated for a year: it means development has stalled, possibly with no maintenance or technical support
  4. Prefer vendors in the order large company -> community -> small company -> individual
  5. Choose products with a good reputation, judged by GitHub stars, the number and quality of users, and user feedback
  6. Prefer open source; projects with special needs often have to modify the source code

According to the above ideas, the following options are recommended:

  • Client architecture: ShardingJDBC
  • Proxy architecture: MyCat or Atlas

MySQL-compatible, horizontally scalable databases

There are also some open source databases that are compatible with the MySQL protocol, such as:

  • TiDB
  • Cubrid

However, their industrial maturity still lags behind MySQL and they require significant operations investment. If you want to migrate an existing MySQL deployment to a new database that can scale horizontally, you can also consider some cloud databases:

  • Alibaba Cloud PetaData
  • Alibaba Cloud OceanBase
  • Tencent Cloud DCDB

NoSQL

Sharding on top of MySQL is dancing in shackles. In fact, many large tables do not need ACID or an RDBMS like MySQL at all; consider migrating such tables to NoSQL to solve the horizontal scaling problem once and for all. Good candidates include:

  • Log, monitoring, and statistics data
  • Unstructured or weakly structured data
  • Data with weak transaction requirements and few associated (join) operations