Cabbage Java self study room covers core knowledge

Database and table solution MyCat series Basic concepts database and table solution MyCat series data and table solution MyCat series cluster transactions database and table solution MyCat series route distribution solution

1. Mycat practice note

1.1. Principle of separate tables and libraries

Although dividing tables and libraries can solve the pressure of large tables on database system, but it is not omnipotent, there are also some disadvantages, so the first problem is not dividing libraries, which libraries, what rules, how many fragments.

  • Principle 1: If possible, it is not recommended to fragment tables less than 10 million. Proper indexes and read/write separation can solve performance problems.
  • Principle 2: The number of fragments should be as small as possible, and fragments should be evenly distributed on multiple Datahosts. Because the more cross-fragments a query SQL has, the worse the overall performance will be. Although it is better than the result of all data in a fragment, it should only be expanded when necessary to increase the number of fragments.
  • Rule 3: Carefully select sharding rules. When selecting sharding rules, consider the data growth mode, data access mode, fragment correlation, and fragment expansion. The latest sharding strategies are range sharding, enumeration sharding, and consistent Hash sharding, which are conducive to expansion.
  • Rule 4: Try not to have SQL in a transaction that spans multiple shards. Distributed transactions are always a problem.
  • Rule 5: Optimize the query conditions and avoid Select *. A large number of result sets consumes a large amount of bandwidth and CPU resources. Avoid returning a large number of result sets and create indexes for frequently used query statements.

Here especially emphasize the selection problem of fragmentation rules, if a table of data has the obvious time characteristic, such as orders, transaction records, etc., are often more appropriate shard with time range, because has the timeliness of data, we tend to focus on the recent data, often with time field to filter query conditions, a better solution is to, Currently active data is sharded with a short span, while historical data is stored with a long span.

Generally speaking, the choice of shard depends on the condition of the most frequently queried SQL, because the query SQL without any Where statement, will benefit all shards, the performance is relatively worst, so the more such SQL, the greater the impact on the system, so we should try to avoid this SQL.

How do you accurately count and analyze the most frequent SQL in your current system? There are a few simple ways to do this:

  1. Use special JDBC driver, intercept all business SQL, and write programs to analyze;
  2. Using Mycat SQL interceptor mechanism, write a plug-in, intercept the SQL, and statistical analysis;
  3. Open MySQL log, analyze all SQL statistics;
  4. Now various clouds (Ali cloud, Huawei cloud, etc.) provide RDS with SQL analysis;

Find out the most frequent SQL of each table, analyze its query conditions, as well as the relationship between them, combined with ER graph, we can choose the sharding strategy of each table more accurately.

For the problem that we often mention in the same database sub-table, here is some analysis and explanation. Sub-table in the same database, it simply solves the problem that the data of a single table is too large. Because it does not distribute the data of the table to different machines, it does not play a great role in reducing the pressure of MySQL server. Everyone is still competing for IO, CPU, network on the same physical machine. In addition, when the database is divided into tables, the SQL issued by the user program needs to be modified. We can imagine how anti-human the Join SQL will be when the two tables A and B are divided into 5 sub-tables respectively. This kind of complex SQL is also a big problem for DBA tuning. For this reason, Mycat and some mainstream database middleware do not support in-database sub-tables, but MySQL has its own solution for this, so it can be combined with Mycat sub-tables to achieve the best effect.

  • MySQL partition;
  • MERGE table (MERGE storage engine) : Informally speaking, MySQL partition is a large table, according to the conditions into several smaller tables. Mysql5.1 now supports table partitioning. For example, if the number of records in a user table exceeds 6 million, the table can be partitioned by the date of entry or by location. Of course, you can also partition according to other conditions.
  • RANGE: Allocates multiple rows to a partition based on the values of the columns that belong to a given contiguous RANGE.
  1. LIST partitioning: Similar to partitioning by RANGE, except that LIST partitioning is selected based on column values matching a value from a discrete set of values.
  2. HASH partition: A partition selected based on the return value of a user-defined expression computed using the column values of the rows to be inserted into the table. This function can contain any valid expression in MySQL that produces a non-negative integer value.
  3. KEY partitioning: Similar to HASH partitioning, except that KEY partitioning supports only one or more columns and the MySQL server provides its own HASH function. One or more columns must contain integer values.

In Mysql databases, Merge tables are somewhat similar to views. Mysql’s Merge engine type allows you to Merge many tables with the same structure into a single table. After that, you can execute the query and return the same results from multiple tables as from a single table. Each merged table must have exactly the same table definition and structure, but only support the MyISAM engine.

Mysql Merge table Merge table

  • Separate static and dynamic data;
  • Optimize queries with structured data;
  • Less data can be accessed when querying;
  • Easier to maintain large data sets.

In the case of large data volume and large query volume, do not attempt to use Merge tables to achieve the function similar to Oracle table partition, which will affect performance. My feeling is that it’s almost equivalent to union. Mycat recommended scheme is Mycat branch library +MySQL partition, which has the following advantages:

  • Fully combine distributed parallel capability and MySQL partition table optimization;
  • Can flexibly control the data scale of the table;
  • The table can be shard in two dimensions, MyCAT has a dimension branch library, MySQL has a dimension partition.

1.2. Principle of data splitting

  1. To reach a certain order of magnitude before splitting (8 million);
  2. Tables less than 8 million but associated with large tables (tables over 8 million) should also be split, which is called large table associated table;
  3. How to disassemble large table associated table: use global table less than 1 million; If the table is larger than 1 million and smaller than 8 million, use the same split strategy as the large table. If you cannot use the same rules as a large table, you can consider a step-by-step query from Java code, without an associative query, or use a global table as an exception.
  4. Item_sku table 2.5 million, associated with the large table, but can not use the same split strategy with the large table, also made global table. The exception global table must meet the following conditions: there are no aggressive concurrent updates, such as multiple threads updating the same record with id=1. Multithreaded updates that do not operate on the same row are not included in this column. Multithreaded update records on the same row of a global table are deadlocked. Batch inserts are ok.
  5. Split fields are not modifiable;
  6. The split field can be only one field. If you want to split two fields, you must create a redundant field. The redundant field value is combined with the values of the two fields (for example, the zone_YYYYmm field is combined with the values of the large region and month).
  7. Selection and evaluation of split algorithm: the number of single tables in each library shall not exceed 8 million after splitting according to the selected algorithm;
  8. Try not to dismantle what you can. If a table is not queried associated with other tables and the amount of data is small, use a single database instead of splitting the table.

1.3. Back-end storage selection

Try to use newer stable versions of Mysql. For now, 5.6 and 5.7 are a reasonable choice because Mysq has been heavily optimized. In addition, various Mysql variants can be considered. Here are some general guidelines:

  1. For transactional data tables, you can consider the official stable version of Mysql. If transactional data tables require high reliability, such as replacing Oracle, you can also choose Galera Cluster, which brings high data availability and high concurrent access with a certain loss of write performance.
  2. Based on data reliability requirements, various data synchronization schemes can be adopted, such as one master, multiple slave, and read/write separation to improve the concurrent read capability of data tables.
  3. Some tables can be stored in NoSQL mode, while the front-end access mode remains unchanged. Mycat supports back-end MongoDB and many NoSQL systems to enhance the query ability.
  4. Some tables can use MySQL in-memory tables to improve query and write speed, instead of part of the complex cache scheme.

1.4. Distribution of Datanodes

Datanodes represent a Database in the MySQL Database. Therefore, the distribution of datanodes in a fragmented table may be as follows:

  • All on the same DataHost;
  • On several datahosts, but there is continuity, such as Dn1 to DN5 on Server1, DN6 to DN10 on Server2, and so on;
  • For example, DN1, DN2 and D3 were distributed evenly on Server1,Server2 and Server3 respectively, dn4 and DN5 repeated the same;

In general, the first kind is not recommended, and the second for range shard, in most cases, the last one is the most ideal, because when a table of the data is uniformly distributed in several physical machine, across the shard query or random query, are carried out to a different machine, the highest parallelism, IO competition is minimal, so the performance is best.

When we have dozens of tables that are fragmented, how to design the distribution of Datanodes becomes a problem. The best way to solve this problem is to test run for a period of time, statistically observe the SQL execution of each DataNode, and see whether there is serious uneven phenomenon. Then, according to the statistical results, Remap the relationship between DataNode and DataHost.

Mycat 1.4 added the distribute function, which can be used on the dataNode attribute of the Table, to rearrange the reference order of these Datanodes in the Table’s fragmentation rule, so that they can be evenly distributed to several datahosts:

<table name="oc_call" primaryKey="ID" dataNode="distribute(dn1$0-372,dn2$0-372)" rule="latest-month-calldate" /> 
Copy the code

Where DN1xxx and DN2XXXX are 377 fragments defined on DataHost1 and DataHost2 respectively.

2. Routing and distribution process of Mycat

2.1. Functions of routing

In principle, you can think of myCat as an SQL forwarder. Mycat receives the SQL from the front end and forwards it to the mysql server in the background for execution. However, there are many mysql nodes (such as Dn1, dn2, dn3), which nodes should be forwarded to? That’s where route resolution comes in.

Routing ensures that the SQL is forwarded to the correct node. The range of forwarding is just right, not many hair. There are two kinds of problems:

  • Waste performance: for example, a select * from Orders where pro= ‘Wuhan’ statement, only dN1 node, can find the data, if the statement is forwarded to dN1, DN2, DN3 three nodes, such a range is multiple, performance is a waste.
  • Table not found: If a new node dn4 is added, but the datanode range of Orders is only DN1,dn2, and dn3, if the node is forwarded to dn1,dn2,dn3, and dn4 at the same time, the table Orders not exists is returned when the node is sent to DN4. If select * from Orders is forwarded only to Dn1, only the result set on Dn1 will be returned, and the result set on dn2 and DN3 will not be returned.

2.2. Route resolver

Mycat1.3 used FDB Parser (FoundationDB SQL Parser). The druid parser was introduced in 1.3, and the FDBParser was removed in 1.4. Keep only the DruidParser method.

Problems with the fdbParser:

  1. The barrier to modifying the parser source is too high. Use javacc parser, if you want to modify the source code of the parser must understand javacc principle (modify the source code of the parser is sometimes encountered unsupported syntax, to modify the parser to support).
  2. There is no good API interface to get the table names, split field conditions, etc., in the AST syntax tree, so it is difficult to have a good structure of the route parsing code, is written very difficult to understand.
  3. Too few supported statements. Such as insert into… . On duplicate key update… The annotated CREATE table statement is not supported, and many more are not listed.
  4. Parsing performance is poor. Our company’s SQL is typically very long (SELECT statements), and a longer SQL parse takes 3-4 seconds to parse the AST syntax tree, which is unacceptable in business.

Performance comparison of several parsers

Compare the performance of fDBParser, JSqlParser and DruidParser. for the same SQL statement, use the three parsers to parse the AST syntax tree. In SQL parses, the statement type may be defined by the parser), and the time comparison is executed 100,000 times or one million times.

100,000 times: Druid is 10 times faster than FDBParser and 6 times faster than JSQLParser. 1 million times: Druid is 15 times faster than FDBParser and nearly 10 times faster than JSQLParser.Copy the code

2.3. Druid: Two ways to resolve routes

Druid parsing can be done in vistor or Statement mode.

The use of Vistor mode

String sql = “select * from tableName”; 
MySqlStatementParser parser = new MySqlStatementParser(sql); 
SQLStatement statement = parser.parseStatement(); 
MycatSchemaStatVisitor visitor = new MycatSchemaStatVisitor(); 
stmt.accept(visitor); 
Copy the code

After the above steps, you can easily get the table name, condition, table alias map, field list, value class table, and so on from the visitor. You can use this information to do routing calculations.

The Statement mode is used

String sql = “select * from tableName”; 
MySqlStatementParser parser = new MySqlStatementParser(sql); 
SQLStatement statement = parser.parseStatement(); 
SQLSelectStatement selectStmt = (SQLSelectStatement) statement; 
Copy the code

Then you can get the information you want from selectStmt. If SQL = “delete from tableName”; MySqlDeleteStatement deleteStmt = (MySqlDeleteStatement) Statement

2.4. Route calculation

2.4.1. Route calculation interface

Entry method to compute the routing for IO. Mycat. Route. RouteService class method of the route. The method signature is as follows:

public RouteResultset route(SystemConfig sysconf, SchemaConfig schema,int sqlType, 
String stmt, String charset, ServerConnection sc) throws SQLNonTransientException 
Copy the code

2.4.2. Summary data flow diagram of route calculation

Input an SQL, after routing calculation, output routing results.

This figure is actually a simplification of the routing interface. The route interface also contains other input parameters, such as SystemConfig, SchemaConfig, sqlType, Charset, and ServerConnection. However, these parameters are not the most important parameters for route calculation. Parameters such as SystemConfig and SchemaConfig can be obtained by other methods, such as:

SystemConfig sysconf =MycatServer.getInstance().getConfig().getSystem(); 
SchemaConfig schema = MycatServer.getInstance().getConfig().getSchemas().get(sc.getSchema()); 
Copy the code

These parameters can be understood as some secondary parameters (secondary to route calculation itself, but useful to other processes, the specific use of which is not emphasized here). Another reason for passing these parameters is that the process of route calculation is long and requires many method calls. It is also a performance loss if each method takes a zigzag approach to calculate and obtain these parameters.

2.4.3. Data flow diagram of route calculation decomposition

Where each condition in conditions is a triplet < table name, field name, field value >.

2.4.4. Route calculation process

Overall route resolution process:

The RouteStrategy route is a subflow. See the RouteStrategy routing subflow to explain the subflow. HintHandler routing is also a sub-process, but it is not the main process, so this article will not focus on it. Route resolution sequence diagram:

The route resolution entry is entered from the Route method of the RouteService class, and depending on whether there are annotations, a HintHandler or RouteStrategy is used for route resolution. RouteStrategy Routing subprocess:

The process is common to both fDBParser and DruidParser parsing strategies. This process is encapsulated in the Route method of the AbstractRouteStrategy class and acts as a template method for both strategies. The Ast syntax tree parsing process corresponds to the routeNormalSqlWithAST method. In the next section, the Ast syntax tree parsing process is explained again (DruidMysqlRouteStrategy policy class is used as an example).

DruidMysqlRouteStrategy AST syntax tree parsing

DruidParser parses subprocesses:

DruidParser parses table names, conditional expressions, field lists, and value lists using the AST syntax tree (SQLStatement). The DruidParser parser parser uses table names, conditional expressions, field lists, and value lists to compute routes. This process is encapsulated in the Parser method of the DefaultDruidParser class.

2.4.5. Core elements of routing calculation

  1. The table name contained in the SQL;
  2. Conditions (Conditons) contained in SQL. Each Condition is a 3-tuple < table name, field name, and field value >.
  3. The schema corresponding to the table.
  4. Is the table sharded, and if so, what are the sharding fields? What is the sharding algorithm? The information in point 4 can be calculated in accordance with article 3.

Some of the above data can be used to calculate routes, so route calculation needs to solve the following problems:

Extract table names, conditions (fields, table to which fields belong, and field values) from SQL statements. With the table name, conditions, and then according to the table sharding rules can calculate the exact route.

2.4.6. Route calculation of a single table

Route calculation without table sentences

For example, in the SELECT 1 statement, return any dataNode of the schema.

/ / not the from the select statement or other the if (druidParser. GetCtx () getTables (). The size () = = 0) {return RouterUtil. RouteToSingleNode (RRS, schema.getRandomDataNode(), druidParser.getCtx().getSql());Copy the code

2.4.7. Route calculation of multiple tables

In multi-table routing calculation, there is a sub-flow named Single-table routing calculation, which references the preceding single-table routing calculation flow.

2.4.8. Route calculation of the global table

Global table INSERT, UPDATE statements: route to all nodes. Global table SELECT statement: route to any node.

// Global table if(tc.isGlobalTable()) {if(isSelect) {rs.setCacheable (false); return routeToSingleNode(rrs, tc.getRandomDataNode(),ctx.getSql()); } else { return routeToMultiNode(false, rrs, tc.getDataNodes(), ctx.getSql()); }}Copy the code

2.4.9. Route calculation of OR statement

The route of the OR statement needs special design and processing. If the general calculation flow is used, logical errors may occur, resulting in incorrect query results. The following scenario:

Travelrecord table is a fragment table, which is sharded according to the id range, id in the range of 1-2000000 in the first fragment, ID in the range of 2000001 — 5000000 in the second fragment, Select * from student where id =1 or 1=1; If the route is routed to the first shard, the query result is incorrect.

Or statement problem solution idea – equivalent substitution

  1. Split the equivalent substitution of the OR statement with the union statement

This is an equivalent substitution that we should all know.

Select * from travelrecord where id = 1 or id = 5000001 ; Select * from travelRecord where id = 1 unioin Select * from travelrecord where ID = 5000001;Copy the code
  1. The result set Union of a Union statement is equivalent to the Union of routes

There is no clear theoretical basis for this equivalence, but we can prove by contradiction: if the route sets are different, the result sets must be different, so the result sets must be the same, the route sets must be the same.

Select * from travelrecord where id = 1 or id = 5000001 ; Select * from travelRecord where id = 1; Select * from travelRecord where id = 5000001; The union of the route sets of.Copy the code

Select * from TravelRecord where ID = 1 and Select * from TravelRecord where ID = 5000001.

Or statement route parsing data structure decomposition

Each time a where condition is encountered, if the where condition contains or, the entire where condition is used as a single unit WhereUnit. If the unit is always true (or 1=1, 2>1), Discard (Discard the where condition, such as select * from tableName, and route to all nodes without any condition). Each WhereUnit splits into multiple splitedExpr according to or to form a splitedExprList. Each splitedExpr contains some and conditions (e.g. ClassId = 1 and age >20).

Since there can be multiple OR’s in a WHERE condition, and each OR has left and right expressions, one of the left and right expressions must be undetachable and the other may be detachable, so step by step, Until it is no longer separable.

2.4.10. Route calculation of system statements

Select @@xxx, show statement, desc, etc.

Such as:

show tables; 
show full tables from databaseName; 
show fields from tableName; 
show variables; 
Copy the code

These statements are not currently parsed using an SQL parser, but are processed specifically through string parsing and should be considered.

Database and table solution MyCat series Basic concepts database and table solution MyCat series data and table solution MyCat series cluster transactions database and table solution MyCat series route distribution solution