Problem overview

We use Alibaba Cloud RDS for MySQL (MySQL 5.6). One table, the users' internet access records, accumulated nearly 20 million rows in 6 months, and the data kept for the last year has reached 40 million rows. Queries are extremely slow, the application freezes in day-to-day use, and the business is severely affected.

Background: this is a legacy system. Whoever designed it had probably not yet graduated from university; the table design and the SQL are not merely bad, they are unbearable to look at. The original developers have all left and the system was handed to me to maintain. This is the legendary "if you can't maintain it, run away" situation, and I am the one who fell into the pit!!

I tried to solve the problem, hence this log.

Solution overview

Solution 1: Optimize the existing MySQL database. Advantages: no impact on the existing business, no changes to the source code, lowest cost. Disadvantages: optimization hits a ceiling; once the data volume passes 100 million rows, it is finished.
Solution 2: Upgrade to a database that is 100% compatible with MySQL. Advantages: no impact on the existing business, no changes to the source code, and database performance improves with almost no work. Disadvantages: costs more money.
Solution 3: Go straight to a big data solution and switch to a NewSQL/NoSQL database. Advantages: strong scalability, low cost, no data capacity bottleneck. Disadvantages: the source code must be modified.

The three solutions can be applied in sequence. Below the 100-million-row level there is no need to switch to NoSQL, because the development cost is too high. I tried all three, and each one produced a solution that went into production. Throughout the process I silently cursed the developers who ran away ten thousand times 🙂

Solution 1: Optimize the existing MySQL database

After phone calls with Alibaba Cloud database experts, searching Google for solutions, and asking people in Q&A groups, I summarized the following (all of it the essentials):

1. Consider performance when designing databases and creating tables
2. Pay attention to optimization when writing SQL
3. Partitioning
4. Table sharding
5. Database sharding

1. Consider performance when designing databases and creating tables

MySQL itself is highly flexible, which means performance is not guaranteed and depends heavily on the developer's skill. In other words, if the developer is competent, MySQL performs well. This is a common trait of many relational databases, which is why company DBAs are usually highly paid.

Note when designing tables:

Avoid NULL values in table fields. NULLs are difficult to optimize and take up extra index space; a default value such as 0 is recommended.
Use INT instead of BIGINT where possible, and add UNSIGNED if the value is never negative (this doubles the upper bound of the range). TINYINT, SMALLINT, and MEDIUMINT are better still when the values fit.
Use enumerations or integers instead of string types
Use TIMESTAMP instead of DATETIME whenever possible
Do not put too many fields in a single table; fewer than 20 is recommended
Use an integer to store IP addresses
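
To make the last tip concrete, here is a minimal sketch of converting an IPv4 address to an unsigned 32-bit integer and back, using only the Python standard library. The function names are my own for illustration. MySQL provides the built-in functions INET_ATON() and INET_NTOA() for the same conversion, and the resulting value fits an INT UNSIGNED column.

```python
import ipaddress


def ip_to_int(ip: str) -> int:
    """Convert a dotted-quad IPv4 string to an unsigned 32-bit integer."""
    return int(ipaddress.IPv4Address(ip))


def int_to_ip(value: int) -> str:
    """Convert an unsigned 32-bit integer back to a dotted-quad string."""
    return str(ipaddress.IPv4Address(value))


# 192*2**24 + 168*2**16 + 1*2**8 + 10
print(ip_to_int("192.168.1.10"))  # 3232235786
print(int_to_ip(3232235786))      # 192.168.1.10
```

Stored this way, an address takes 4 bytes instead of up to 15 characters, and range queries on subnets become plain integer comparisons.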

Indexes

More indexes are not always better. Create indexes for specific queries, and consider indexing the columns involved in WHERE and ORDER BY clauses. Use EXPLAIN to check whether a query uses an index or a full table scan
Try to avoid testing fields for NULL in the WHERE clause, as this can cause the engine to abandon the index and do a full table scan
Fields with very few distinct values, such as "gender" with only two or three values, are not worth indexing
For character fields, index only a prefix
Character fields should preferably not be primary keys
Do not use foreign keys; let the application enforce the constraints
Try not to use UNIQUE; let the application enforce the constraint
For multi-column indexes, keep the column order consistent with the query conditions, and remove unnecessary single-column indexes

In short, use the right data types and choose the right indexes

Select the right data type
(1) Use the smallest data type that can hold the data. Integers < date, time < char, varchar < blob
(2) Use simple data types. Integers are cheaper to process than characters because string comparison is more complex. For example, store times as INT and convert IP addresses to BIGINT with a conversion function
(3) Use reasonable field lengths; fixed-length tables are faster

Select the right index columns
(1) Columns that appear in WHERE, GROUP BY, and ORDER BY clauses
(2) Columns that appear in WHERE conditions with <, <=, =, >, >=, BETWEEN, IN, or LIKE with a string followed by a wildcard (%)
(3) Short columns: the more entries that fit on one page, the better
(4) Columns with high dispersion (many distinct values) go first in a composite index. Check the dispersion by counting the distinct values of each column; the larger the count, the higher the dispersion.
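
The dispersion check can be sketched as follows. This is an illustration under assumptions, not the original author's query: the table `access_log` and its columns are made up, and SQLite (bundled with Python) stands in for MySQL so the example runs anywhere. The `COUNT(DISTINCT col)` and `COUNT(*)` statements themselves are standard SQL and work the same way in MySQL.

```python
import sqlite3


def column_selectivity(conn, table, columns):
    """Return {column: distinct_count / row_count} for each column.

    Table and column names are interpolated into the SQL text, so they
    must come from trusted code, never from user input.
    """
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    result = {}
    for col in columns:
        distinct = conn.execute(
            f"SELECT COUNT(DISTINCT {col}) FROM {table}"
        ).fetchone()[0]
        result[col] = distinct / total if total else 0.0
    return result


# Hypothetical sample data: user_id is unique, gender has two values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_log (user_id INTEGER, gender INTEGER)")
conn.executemany(
    "INSERT INTO access_log VALUES (?, ?)",
    [(i, i % 2) for i in range(1000)],
)

selectivity = column_selectivity(conn, "access_log", ["user_id", "gender"])
# Most selective column first: a candidate order for a composite index.
index_order = sorted(selectivity, key=selectivity.get, reverse=True)
print(selectivity)   # {'user_id': 1.0, 'gender': 0.002}
print(index_order)   # ['user_id', 'gender']
```

A ratio near 1.0 means almost every row has its own value, while a ratio near 0 (as with gender) means an index on that column filters out very little.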

The original developers have run away and the tables already exist, so I cannot modify them. Therefore this approach cannot be carried out. Abandoned!

2. Pay attention to optimization when writing SQL

Use LIMIT to cap the number of rows a query returns
Avoid SELECT * and list only the fields you need
Use joins instead of subqueries
Split large DELETE or INSERT statements
Find slow SQL by enabling the slow query log
Do not operate on columns: in SELECT id FROM t WHERE age + 1 = 10, any operation on a column causes a table scan. This includes database functions, computed expressions, and so on. Where possible, move the operation to the right-hand side of the equals sign
Keep SQL statements as simple as possible: one SQL statement can run on only one CPU; break large statements into small ones to reduce lock time; one large SQL statement can block the entire database
Rewrite OR as IN: the cost of OR is on the order of n, while that of IN is on the order of log(n). Keeping the number of values in an IN list under 200 is recommended
Do not use functions or triggers; implement the logic in the application
Avoid queries of the form LIKE '%xxx' (leading wildcard)
Use JOIN sparingly
Compare values of the same type, such as '123' with '123', and 123 with 123
Avoid != or <> in WHERE clauses, otherwise the engine will abandon the index and do a full table scan
For a contiguous range of values, use BETWEEN instead of IN: SELECT id FROM t WHERE num BETWEEN 1 AND 5
Do not return the whole table for list data; paginate with LIMIT, and do not let the number of pages grow too large
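
The tip about not operating on columns can be demonstrated with a small experiment. This is a sketch under assumptions: it uses SQLite (bundled with Python) rather than MySQL, so the plan text differs from what MySQL's EXPLAIN prints, and the table `t` with its index `idx_age` is made up. The principle it shows is the same: an expression wrapped around the indexed column prevents an index lookup, while moving the arithmetic to the constant side allows one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, age INTEGER)")
conn.execute("CREATE INDEX idx_age ON t (age)")
conn.executemany(
    "INSERT INTO t (age) VALUES (?)", [(i % 80,) for i in range(1000)]
)


def plan(sql):
    """Return SQLite's query plan description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


slow_sql = "SELECT id FROM t WHERE age + 1 = 10"   # operation on the column
fast_sql = "SELECT id FROM t WHERE age = 10 - 1"   # operation on the constant

slow_plan = plan(slow_sql)   # begins with SCAN: every entry is examined
fast_plan = plan(fast_sql)   # begins with SEARCH: index lookup on age

print(slow_plan)
print(fast_plan)

# Both forms return the same rows; only the access path differs.
same_rows = sorted(conn.execute(slow_sql)) == sorted(conn.execute(fast_sql))
```

On MySQL the equivalent check is to run EXPLAIN on both statements and compare the access type and the key actually used.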

The original developers have run away and the program is already live, so I cannot modify the SQL. Therefore this approach cannot be carried out. Abandoned!

Storage engines

MyISAM and InnoDB are two widely used engines:

MyISAM
The MyISAM engine is the default engine for MySQL 5.1 and earlier. It features:

Row locks are not supported. A read locks every table it needs to read, and a write takes an exclusive lock on the table
Transactions are not supported
Foreign keys are not supported
Safe recovery after a crash is not supported
New records can be inserted into a table while queries are reading it
Indexes on the first 500 characters of BLOB and TEXT columns are supported, as are full-text indexes
Delayed index updates are supported, which greatly improves write performance
Tables that will not be modified can be compressed, which greatly reduces disk space usage

InnoDB
InnoDB became the default engine as of MySQL 5.5 and features:

Row locking is supported, and MVCC is used to support high concurrency
Transactions are supported
Foreign keys are supported
Safe recovery after a crash is supported
Full-text indexing is not supported before MySQL 5.6 (InnoDB gained FULLTEXT indexes in 5.6)

In general, MyISAM suits SELECT-intensive tables, while InnoDB suits INSERT- and UPDATE-intensive tables

MyISAM may be very fast and take little storage space, but the program requires transaction support, so InnoDB is a must. Therefore this approach cannot be carried out. Abandoned!

3. Partitioning

Partitioning, introduced in MySQL 5.1, is a simple form of horizontal splitting. The user adds partition parameters when creating the table; it is transparent to the application and requires no code changes

To the user, a partitioned table is a single logical table, but underneath it is composed of multiple physical sub-tables. The code that implements partitioning is really an object that wraps a set of underlying tables, and to the SQL layer it is a black box that completely encapsulates them. The way MySQL implements partitioning also means that indexes are defined per partition sub-table; there is no global index

The user's SQL needs to be optimized for the partitioned table: the WHERE condition should include the partitioning column so that the query is narrowed to a small number of partitions, otherwise all partitions are scanned. EXPLAIN PARTITIONS shows which partitions a given SQL statement will touch, which helps with SQL optimization. In my tests, even queries that did not include the partitioning column got faster, so this measure is worth a try.

The benefits of partitioning are:

More data can be stored in a single table
The data in partitioned tables is easier to maintain: large amounts of data can be removed in bulk by dropping an entire partition, and new partitions can be added to hold newly inserted data. In addition, a single partition can be optimized, checked, repaired, and so on
Some queries can be determined from their conditions to fall on only a few partitions, which makes them very fast
Partitioned table data can also be distributed across different physical devices to make efficient use of multiple hardware devices
Partitioned tables can be used to avoid specific bottlenecks, such as mutually exclusive access to a single InnoDB index and inode lock contention on ext3 file systems
Individual partitions can be backed up and restored

Limitations and disadvantages of partitioning:

A table can have a maximum of 1024 partitions (8192 from MySQL 5.6.7 onward)
If the table has a primary key or unique indexes, every primary key and unique index must include all columns used in the partitioning expression
Partitioned tables cannot use foreign key constraints
NULL values defeat partition filtering
All partitions must use the same storage engine

Types of partition:

RANGE partitioning: assigns rows to partitions based on column values falling within a given contiguous interval
LIST partitioning: similar to RANGE partitioning, except that the partition is selected by matching the column value against a discrete set of values
HASH partitioning: the partition is selected based on the return value of a user-defined expression computed from the column values of the row being inserted. The expression can be any valid MySQL expression that produces a non-negative integer value
KEY partitioning: similar to HASH partitioning, except that KEY partitioning takes only one or more columns rather than an expression, and the MySQL server provides its own hashing function. The columns do not have to be integers, because the server's hashing function always produces an integer result
For details on MySQL partitioning concepts, search online or check the official documentation.

First, I split the internet access record table into 12 partitions by month, which improved query efficiency by about 6 times, but the effect was still not noticeable in practice. So I switched to HASH partitioning on the ID column with 64 partitions, which significantly improved query speed. Problem solved!

Result: PARTITION BY HASH (ID) PARTITIONS 64

select count(*) from readroom_website; -- 11901336 rows

/* Affected rows: 0  Records found: 1  Warnings: 0  Duration for 1 query: 5.734 sec. */

select * from readroom_website where month(accesstime) = 11 limit 10;

/* Affected rows: 0  Records found: 10  Warnings: 0  Duration for 1 query: 0.719 sec. */
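
To illustrate how rows end up in the 64 partitions, here is a small model in Python. It is a simplification, not MySQL's implementation: it assumes plain (non-LINEAR) HASH partitioning, where the partition number is the partitioning expression modulo the partition count, and it reduces partition pruning to "an equality or IN condition on the partitioning column selects specific partitions, anything else scans them all". The function names are made up.

```python
from collections import Counter

NUM_PARTITIONS = 64


def hash_partition(row_id: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Partition number for PARTITION BY HASH(id): id modulo the count."""
    return row_id % num_partitions


def partitions_to_scan(id_values=None, num_partitions=NUM_PARTITIONS):
    """Partitions a query must touch in this simplified pruning model.

    id_values: the ids named in the WHERE clause, or None when the query
    has no condition on the partitioning column.
    """
    if id_values is None:
        return list(range(num_partitions))
    return sorted({hash_partition(i, num_partitions) for i in id_values})


# Sequential ids spread evenly: 640,000 rows give 10,000 per partition.
counts = Counter(hash_partition(i) for i in range(1, 640_001))
print(min(counts.values()), max(counts.values()))  # 10000 10000

print(partitions_to_scan([130]))         # [2]   (130 % 64 == 2)
print(partitions_to_scan([1, 65, 129]))  # [1]   (all three map to partition 1)
print(len(partitions_to_scan()))         # 64    (no id condition: scan all)
```

In this model the month(accesstime) query shown above has no condition on ID, so it would still visit all 64 partitions; EXPLAIN PARTITIONS on the real table is the way to confirm what MySQL actually does.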

4. Table sharding

Table sharding is for a large table that has been through all the optimizations above and is still slow to query: split the table into multiple tables, turn one query into multiple queries, and then combine the results before returning them to the user.

Tables can be split vertically or horizontally, usually using one field as the split key. For example, split on the ID field into 100 tables, named tableName_id%100
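
A minimal sketch of the tableName_id%100 routing follows. The helper names are made up, and the `%s` placeholders assume a MySQL driver that uses that parameter style (PyMySQL and mysqlclient do). The code only builds the per-table statements; executing them and merging the results is left to the application, which is exactly the extra development work described here.

```python
from collections import defaultdict

SHARD_COUNT = 100


def shard_table(base: str, row_id: int, shard_count: int = SHARD_COUNT) -> str:
    """Name of the table that holds a given id, e.g. readroom_website_45."""
    return f"{base}_{row_id % shard_count}"


def group_ids_by_table(base, ids, shard_count=SHARD_COUNT):
    """Group ids by the sharded table each one lives in."""
    groups = defaultdict(list)
    for row_id in ids:
        groups[shard_table(base, row_id, shard_count)].append(row_id)
    return dict(groups)


def build_queries(base, ids):
    """Turn one logical lookup into one parameterized query per table."""
    queries = []
    for table, table_ids in sorted(group_ids_by_table(base, ids).items()):
        placeholders = ", ".join(["%s"] * len(table_ids))
        sql = f"SELECT * FROM {table} WHERE id IN ({placeholders})"
        queries.append((sql, table_ids))
    return queries


print(shard_table("readroom_website", 12345))  # readroom_website_45
for sql, params in build_queries("readroom_website", [1, 101, 202]):
    print(sql, params)
```

Lookups by ID stay cheap because each one goes to a single table, but any query that does not filter on the split key has to fan out to all 100 tables.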

However, table sharding requires changes to the source code, which means a lot of work for developers and greatly increases development cost. It is only suitable when a large data volume is anticipated early in development and sharding is built in from the start. It is not suitable for retrofitting an application that is already live; the cost is too high!! Choosing this approach also costs more than the second and third solutions I describe. Therefore, it is not recommended.

5. Database sharding

This means splitting one database into many. My suggestion is to stop at read/write separation; true database sharding brings heavy development costs, and the loss outweighs the gain! Not recommended.
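
Read/write separation can be sketched as a small router that sends plain SELECT statements to replicas and everything else to the primary. This is an illustration only: the class and the connection names are hypothetical, statement classification by the first keyword is deliberately crude, and replication lag (a replica briefly returning stale data after a write) is not handled.

```python
import itertools


class ReadWriteRouter:
    """Pick a target for each statement: replicas for reads, primary for writes."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Round-robin over replicas; fall back to the primary if there are none.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql: str, in_transaction: bool = False):
        parts = sql.split()
        verb = parts[0].upper() if parts else ""
        is_plain_read = verb == "SELECT" and "FOR UPDATE" not in sql.upper()
        # Writes, locking reads, and anything inside a transaction must
        # see and modify the authoritative copy.
        if in_transaction or not is_plain_read or self._replicas is None:
            return self.primary
        return next(self._replicas)


router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
print(router.route("SELECT * FROM readroom_website LIMIT 10"))   # replica-1
print(router.route("UPDATE readroom_website SET accesstime = NOW() WHERE id = 1"))  # primary
```

In a real application the strings would be connection objects or connection pools, and the routing usually lives in the data access layer or a proxy so that business code does not change.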

Solution 2: Upgrade to a database that is 100% compatible with MySQL

MySQL does not perform well enough, so replace it. To make sure the source code needs no changes and the existing services migrate smoothly, the replacement must be 100% compatible with MySQL.

Open source options

TiDB https://github.com/pingcap/tidb
Cubrid https://www.cubrid.org/
Open source databases bring a lot of operation and maintenance cost, their production quality still lags behind MySQL, and there are many pitfalls to step in. If your company requires you to run your own database, choose this type of product.

Cloud database options

Alibaba Cloud POLARDB
https://www.aliyun.com/product/polardb?spm=a2c4g.11174283.cloudEssentials.47.7a984b5cS7h4wH

Official introduction: POLARDB is a next-generation relational, distributed, cloud-native database developed in-house by Alibaba Cloud. It is 100% compatible with MySQL, offers up to 100 TB of storage, and delivers up to 6 times the performance of MySQL. POLARDB combines the stability, reliability, and high performance of commercial databases with the simplicity, scalability, and continuous iteration of open source databases, at only 1/10 of the cost of a commercial database.

I signed up and tested it. It supports free MySQL data migration with no operational cost, performance improved about 10 times, and the price is similar to RDS. It is a good alternative!

Alibaba Cloud OceanBase
Taobao uses it, and it carried the Double Eleven load with outstanding performance. However, it is still in public beta and I could not try it. Worth looking forward to.
Alibaba Cloud HybridDB for MySQL (formerly PetaData)
https://www.aliyun.com/product/petadata?spm=a2c4g.11174283.cloudEssentials.54.7a984b5cS7h4wH

Official introduction: Cloud database HybridDB for MySQL (formerly PetaData) is a Hybrid Transaction/Analytical Processing (HTAP) relational database that supports both OLTP and OLAP.

I tested this one as well. It handles both OLAP and OLTP, but the price is too high, up to $10 per hour. Using it only for storage is wasteful; it suits businesses that need both storage and analysis.

Tencent Cloud DCDB
https://cloud.tencent.com/product/dcdb_for_tdsql

Official introduction: DCDB, also known as TDSQL, is a high-performance distributed database compatible with the MySQL protocol and syntax. It supports automatic horizontal splitting: the business sees a complete logical table, while the data is evenly split across multiple shards. Each shard runs in active/standby mode by default, and the product provides a full set of solutions for disaster recovery, restore, monitoring, and capacity expansion. It is suited to massive data scenarios at the TB or PB level.

I do not like using Tencent's products, so I will not say much. The reason is that when something goes wrong you cannot find anyone to help, and production problems that cannot be solved are a headache! But it is cheap, so it suits very small companies that want to experiment.

Solution 3: Remove MySQL and replace it with a big data engine to process data

Once the data volume passes 100 million rows, there is no choice but to move to big data.

Open source solutions
The Hadoop family: just throw HBase/Hive at the problem. But operation and maintenance costs are high, and an ordinary company cannot afford to play; without an investment of at least a hundred thousand there will be no good output!
Cloud solutions
Big data is the trend of the future. Professional companies provide professional services, and small companies or individuals buy those services. Big data will be present in every aspect of society, like water, electricity, and other public utilities.
The best one in China is Aliyun.
I chose Aliyun's MaxCompute together with DataWorks. It is very comfortable to use, billed by usage, and extremely low in cost.
MaxCompute can be thought of as comparable to the open source Hive. It lets you manipulate data with SQL, MapReduce, AI algorithms, Python scripts, and shell scripts. The data is presented as tables, stored in a distributed fashion, and processed by scheduled tasks in batch mode. DataWorks provides a workflow-based way to manage your data processing tasks, with scheduling and monitoring.
You can also choose Aliyun HBase and other products. I mainly do offline processing, so I chose MaxCompute. It is almost entirely operated through a graphical interface; about 300 lines of SQL, at a cost of less than 100 yuan, solved the data processing problem.

Author: Wang Shuai
