@[TOC]
In this part we mainly look at the underlying principles of ShardingSphere's database and table sharding, and go deep into the source code to understand how sharding is actually carried out.

On the one hand, we accumulated a large number of test examples while learning ShardingJDBC, which makes it a very good entry point for studying the underlying principles.

On the other hand, it also prepares us for learning ShardingProxy later. For ShardingProxy, learning only a few simple configurations and commands is not enough for real work, yet ShardingProxy is a black-box product, so it is hard to understand the underlying principles through ShardingProxy itself.

1. Kernel analysis

Although ShardingSphere has multiple products, their main data sharding process is completely consistent.

Parsing engine

The parsing process is divided into lexical parsing and syntactic parsing. The lexical parser breaks the SQL into indivisible atomic symbols called tokens, and classifies them into keywords, expressions, literals, and operators according to the dictionaries provided by the different database dialects. The syntax parser then converts the SQL into an abstract syntax tree (AST).

For example, for the following SQL statement:

SELECT id, name FROM t_user WHERE status = 'ACTIVE' AND age > 18

It will be parsed into an abstract syntax tree like this:

For ease of understanding, the figure marks the tokens for keywords in green and the tokens for variables in red, while gray indicates parts that need further splitting. By traversing the abstract syntax tree, all the positions that may need to be rewritten can be marked. The SQL parsing process is irreversible: every token is parsed in the order it appears in the original SQL, which gives high performance. During parsing, the similarities and differences among the SQL dialects of the various databases also have to be considered, and different parsing templates are provided for them.

Among them, SQL parsing is the core of any sharding product, and its performance and compatibility are the most important indicators. Before version 1.4.x, ShardingSphere used Druid, known for its parsing speed, as the SQL parser. From version 1.5.x it switched to a self-developed SQL parser to improve the performance and compatibility of SQL parsing in sharding scenarios. Starting from 3.0.x, ANTLR is used as the SQL parsing engine. ANTLR is an open-source parsing engine, and ShardingSphere adds some AST caching on top of it. Given the characteristics of ANTLR4, using PreparedStatement precompilation is recommended to improve SQL execution performance.
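
As a minimal sketch of that recommendation (the data source construction is omitted here; t_user and the column names are simply the placeholders from the SQL above), the logical SQL can be executed through a PreparedStatement so that the precompiled statement is reused:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

public class PreparedStatementDemo {

    // dataSource is assumed to be a ShardingSphere data source built elsewhere.
    static void queryActiveAdults(DataSource dataSource) throws SQLException {
        // The SQL is written against the logical table t_user; ShardingSphere parses, routes and rewrites it.
        String sql = "SELECT id, name FROM t_user WHERE status = ? AND age > ?";
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, "ACTIVE");
            ps.setInt(2, 18);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " -> " + rs.getString("name"));
                }
            }
        }
    }
}
```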

The overall structure of SQL parsing:

Routing engine

Routing paths are generated by matching sharding policies for databases and tables based on the parsing context.

The sharding routing of ShardingSphere is mainly divided into single-shard routing (the sharding key operator is =), multi-shard routing (the operator is IN), and range routing (the operator is BETWEEN). SQL that does not carry a sharding key falls back to broadcast routing.

Sharding strategies can either be built into the product or configured by the user. The built-in sharding strategies can be roughly divided into modulo on the trailing digits, hash, range, tag, time, and so on. User-configured sharding strategies are more flexible, and compound sharding strategies can be customized according to the user's needs.
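
As an illustration of a user-defined strategy, the sketch below routes an equal-sign query by taking the sharding key modulo the number of real tables. It is written against the 4.x Java API, and the course_1/course_2 naming is only an assumption, not a fixed convention:

```java
import java.util.Collection;
import org.apache.shardingsphere.api.sharding.standard.PreciseShardingAlgorithm;
import org.apache.shardingsphere.api.sharding.standard.PreciseShardingValue;

// A minimal custom precise sharding algorithm: single-shard routing for the = operator.
public class ModuloShardingTableAlgorithm implements PreciseShardingAlgorithm<Long> {

    @Override
    public String doSharding(Collection<String> availableTargetNames, PreciseShardingValue<Long> shardingValue) {
        // availableTargetNames holds the real table names, e.g. course_1 and course_2.
        long suffix = shardingValue.getValue() % availableTargetNames.size() + 1;
        for (String each : availableTargetNames) {
            if (each.endsWith(String.valueOf(suffix))) {
                return each;
            }
        }
        throw new UnsupportedOperationException("No target table matched for value " + shardingValue.getValue());
    }
}
```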

In practice, sharded routing with an explicit routing strategy should be used whenever possible, because broadcast routing affects too many nodes and is not conducive to cluster management and scaling.

Full-library-and-table routing: for DQL, DML, and DDL statements without a sharding key, all real tables in all databases are traversed and executed one by one. For example: select * from course or select * from course where ustatus='1'

Full-library routing: operations on the database itself traverse all real databases. For example: SET autocommit = 0

Full-instance routing: DCL statements are executed only once on each database instance. For example: CREATE USER customer@127.0.0.1 IDENTIFIED BY '123';

Unicast routing: the data only needs to be fetched from any one of the databases. For example: DESCRIBE course

Blocking route: blocks the SQL from being executed against any real database. For example, USE coursedb is not executed in any real database, because there is no need to switch databases when operating on logical (virtual) tables.

Rewrite engine

Users only need to write SQL against logical databases and logical tables; ShardingSphere's rewrite engine rewrites that SQL into statements that can be executed correctly against the real databases. For example, a query on a logical table such as course is rewritten into queries on real tables such as course_1 and course_2. SQL rewriting is divided into correctness rewriting and optimization rewriting.

Execution engine

ShardingSphere does not simply submit the rewritten SQL directly to the database for execution. The goal of the execution engine is to automatically balance resource control against execution efficiency.

ShardingSphere introduces the concept of connection modes, divided into MEMORY_STRICTLY (memory-limited) and CONNECTION_STRICTLY (connection-limited). The memory-limited mode does not restrict the number of database connections used by one query, typically using one database connection per real table. The connection-limited mode strictly restricts the number of connections, so larger queries are executed serially over the same connection.

The choice between the two modes involves the parameter spring.shardingsphere.props.max.connections.size.per.query=50 (the default value is 1; see the ConfigurationPropertyKey class in the source code). For each data source that a statement is routed to, ShardingSphere counts the number of SQL units to be executed on it and divides that count by this configuration item to get the number of SQL statements each database connection has to execute. If the result is greater than 1, the connection-limited mode is chosen; if it is less than or equal to 1, the memory-limited mode is chosen.
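
The decision can be sketched roughly like this (a simplification of the idea described above, not the literal source code):

```java
public class ConnectionModeSketch {

    enum ConnectionMode { MEMORY_STRICTLY, CONNECTION_STRICTLY }

    // sqlUnitCount: number of SQL statements routed to one data source for this query.
    // maxConnectionsSizePerQuery: the max.connections.size.per.query property, default 1.
    static ConnectionMode decide(int sqlUnitCount, int maxConnectionsSizePerQuery) {
        // More than one SQL per connection -> limit connections and read serially;
        // otherwise open one connection per SQL and merge in streaming fashion.
        return sqlUnitCount > maxConnectionsSizePerQuery
                ? ConnectionMode.CONNECTION_STRICTLY
                : ConnectionMode.MEMORY_STRICTLY;
    }

    public static void main(String[] args) {
        System.out.println(decide(4, 1));  // CONNECTION_STRICTLY
        System.out.println(decide(2, 50)); // MEMORY_STRICTLY
    }
}
```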

The memory-limited mode does not restrict the number of connections: multiple database connections are created and each connection concurrently reads only one data shard. This is the fastest way to read out all the required data. In the later merge stage it merges shard by shard as the data streams in, which is the streaming merge mentioned later. After a batch of data has been merged, its memory can be released, which improves merge efficiency and avoids memory overflow or frequent garbage collection. It offers high throughput and is suitable for OLAP scenarios.

The connection-limited mode restricts the number of connections, which means that at least one database connection reads more than one data shard, fetching the shards sequentially over that connection. All of the data is then read into memory for a unified merge, which is the in-memory merge referred to later. This kind of merging is more efficient for some operations; for example, a MAX merge can obtain the maximum value directly, while a streaming merge has to compare the rows one by one. It is suitable for OLTP scenarios.

Merge engine

Combining the multiple result sets obtained from the individual data nodes into one result set and returning it correctly to the requesting client is called result merging.

Among them, streaming merge means merging the data piece by piece as it is read, while in-memory merge means loading all result sets into memory and merging them there.

For example, an AVG merge cannot be computed by simply averaging the per-shard results. It has to be converted into an accumulation of COUNT and SUM, from which the overall average is then calculated.
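
A minimal sketch of that idea (the per-shard numbers are made up): each shard returns its own SUM and COUNT, and the overall average is the sum of the SUMs divided by the sum of the COUNTs.

```java
public class AvgMergeSketch {

    // Global average from per-shard SUM and COUNT values.
    static double mergeAvg(long[] sums, long[] counts) {
        long totalSum = 0;
        long totalCount = 0;
        for (int i = 0; i < sums.length; i++) {
            totalSum += sums[i];
            totalCount += counts[i];
        }
        // Averaging the per-shard averages would be wrong when shards hold different row counts.
        return (double) totalSum / totalCount;
    }

    public static void main(String[] args) {
        // shard 1: SUM=100, COUNT=10 (avg 10); shard 2: SUM=30, COUNT=2 (avg 15)
        // correct overall average = 130 / 12 ≈ 10.83, not (10 + 15) / 2 = 12.5
        System.out.println(mergeAvg(new long[]{100, 30}, new long[]{10, 2}));
    }
}
```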

The sorting and merging process is shown as follows:

Distributed primary key

Built-in key generators: UUID and SNOWFLAKE. The distributed key generator is also abstracted behind an interface, which makes it easy to implement a custom auto-increment key generator.

UUID

Use UUID.randomUUID() to generate a unique, non-repeating distributed primary key of type String. The drawback is that the generated keys are unordered.

SNOWFLAKE

The snowflake algorithm guarantees that keys generated by different processes do not collide, and that keys generated within the same process are ordered. In binary form the key consists of four parts, from high bits to low bits: a 1-bit sign, a 41-bit timestamp, a 10-bit worker process ID, and a 12-bit sequence number (a small assembly sketch follows the list below).

  • Sign bit (1 bit)

Reserved sign bit, always zero.

  • Timestamp bit (41bit)

The 41-bit timestamp can cover Math.pow(2, 41) / (365 * 24 * 60 * 60 * 1000L) ≈ 69.73 years without repetition.

  • Worker process bit (10bit)

This flag is unique within a Java process. In a distributed deployment, make sure each worker process gets a different ID. The value defaults to 0 and can be set through properties.

  • Serial number bit (12bit)

The sequence is used to generate different IDs within the same millisecond. If more than 4096 (2 to the 12th power) IDs are requested in one millisecond, the generator waits for the next millisecond before continuing to generate.
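
To make the layout concrete, the sketch below packs the three variable parts into one long value. It is only an illustration of the bit arrangement, not ShardingSphere's actual generator; the epoch handling and clock-rollback protection are omitted:

```java
public class SnowflakeLayoutSketch {

    // Bit widths from the description above: 41-bit timestamp, 10-bit worker id, 12-bit sequence.
    private static final long WORKER_ID_BITS = 10L;
    private static final long SEQUENCE_BITS = 12L;

    // timestamp is the number of milliseconds since a fixed epoch; the sign bit stays 0.
    static long compose(long timestamp, long workerId, long sequence) {
        return (timestamp << (WORKER_ID_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }

    public static void main(String[] args) {
        // worker 1, sequence number 2 within the same millisecond
        System.out.println(Long.toBinaryString(compose(1_000L, 1L, 2L)));
    }
}
```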

Advantages:

  • The timestamp occupies the high bits and the incrementing sequence the low bits, so the IDs as a whole trend upward.

  • Independent of third-party components, it has high stability and high performance in ID generation.

  • You can allocate the bits according to your business characteristics, which is very flexible.

Disadvantages:

  • If the clock on the machine is rolled back, duplicate IDs will be generated.

2. Source code environment installation

After importing the source package from the supporting materials into IDEA, run mvn install -Dmaven.test.skip=true -Dmaven.javadoc.skip=true -Dcheckstyle.skip=true -Drat.numUnapprovedLicenses=100 to complete the compilation.

Our source code debugging then starts from the test class ShardingjdbcDemo.java. This example reproduces the sharding rules for databases and tables configured in our earlier example application02.properties.

ShardingSphere's sharding features, whether used through JDBC or through Proxy, are eventually converted into the Java API configuration form. See the official site for the specific configuration instructions: https://shardingsphere.apache.org/document/legacy/4.x/document/cn/manual/sharding-jdbc/configuration/config-java/
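
As a rough sketch of that Java API configuration (based on the 4.x documentation linked above; the data source name m1, the course table and the inline expressions are placeholders in the style of the earlier properties examples, and class or package names may differ slightly between versions):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import javax.sql.DataSource;
import org.apache.shardingsphere.api.config.sharding.ShardingRuleConfiguration;
import org.apache.shardingsphere.api.config.sharding.TableRuleConfiguration;
import org.apache.shardingsphere.api.config.sharding.strategy.InlineShardingStrategyConfiguration;
import org.apache.shardingsphere.shardingjdbc.api.ShardingDataSourceFactory;

public class JavaApiConfigSketch {

    static DataSource build(DataSource realDataSource) throws Exception {
        Map<String, DataSource> dataSourceMap = new HashMap<>();
        dataSourceMap.put("m1", realDataSource);

        // Logical table course is mapped to the real tables m1.course_1 and m1.course_2.
        TableRuleConfiguration courseRule = new TableRuleConfiguration("course", "m1.course_$->{1..2}");
        courseRule.setTableShardingStrategyConfig(
                new InlineShardingStrategyConfiguration("cid", "course_$->{cid % 2 + 1}"));

        ShardingRuleConfiguration shardingRuleConfig = new ShardingRuleConfiguration();
        shardingRuleConfig.getTableRuleConfigs().add(courseRule);

        return ShardingDataSourceFactory.createDataSource(dataSourceMap, shardingRuleConfig, new Properties());
    }
}
```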

3. SPI extension points of ShardingSphere

ShardingSphere keeps a large number of SPI extension points in its source code in order to be compatible with more application scenarios. So before reading the source code, you need a good understanding of Java's SPI mechanism.

1. SPI mechanism

SPI’s full name is: Service Provider Interface. This is explained in more detail in the java.util.ServiceLoader documentation.

Briefly, the idea behind the Java SPI mechanism is this: the abstract modules in a system often have many different implementations, such as the logging module, the XML parsing module, the JDBC module, and so on. In object-oriented design we generally recommend that modules program against interfaces rather than hard-code concrete implementation classes.

Once a specific implementation class appears in the code, the pluggability principle is violated: replacing an implementation requires changing the code. A service discovery mechanism is therefore needed so that the implementation can be specified dynamically when the modules are assembled.

The Java SPI provides such a mechanism: finding a service implementation for an interface. Similar to the IoC idea of moving assembly control out of the program, this mechanism is especially important in modular design.

The convention of the Java SPI is that when a service provider supplies an implementation of a service interface, it creates a file named after the service interface in the META-INF/services/ directory of the JAR package. The file contains the name of the concrete class that implements the service interface.

When an external application assembles this module, it can find the implementation class name in the META-INF/services/ file of the JAR, then load and instantiate it to complete the injection of the module.

Based on this convention, the implementation class of a service interface can be found without specifying it in the code. The JDK provides a utility class for looking up service implementations: java.util.ServiceLoader.
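
A minimal, self-contained sketch of the convention (the interface and its method are made up):

```java
import java.util.ServiceLoader;

public class SpiDemo {

    // The service interface that provider modules implement.
    public interface HelloService {
        String sayHello();
    }

    // A provider JAR ships an implementation class plus a file under META-INF/services/
    // named after the fully qualified name of HelloService, whose content is the
    // fully qualified name of the implementation class.
    public static void main(String[] args) {
        ServiceLoader<HelloService> loader = ServiceLoader.load(HelloService.class);
        for (HelloService service : loader) {
            // ServiceLoader lazily instantiates each implementation it discovers.
            System.out.println(service.sayHello());
        }
    }
}
```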

2. SPI extension points in ShardingSphere

The design idea of ShardingSphere is to keep the main process closed inside the source code while opening extension points to the outside through SPI. All of ShardingSphere's SPI extension points are listed in detail in the developer manual section of the accompanying official document shardingsphere_docs_cn.pdf.

3. Customize the primary key generation strategy

Implement a custom distributed primary key generation strategy using SPI extension points provided by ShardingSphere. See sample code.
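
A sketch of what such an implementation can look like against the 4.x SPI (the interface name is taken from the 4.x developer manual; the generation logic and the type name MYKEY are made-up placeholders):

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.shardingsphere.spi.keygen.ShardingKeyGenerator;

// Register it by creating META-INF/services/org.apache.shardingsphere.spi.keygen.ShardingKeyGenerator
// containing this class's fully qualified name, then reference it by type "MYKEY" in the configuration.
public class MyKeyGenerator implements ShardingKeyGenerator {

    private final AtomicLong counter = new AtomicLong(0);

    private Properties properties = new Properties();

    @Override
    public Comparable<?> generateKey() {
        // Toy strategy: current millisecond followed by an in-process counter, for illustration only.
        return System.currentTimeMillis() * 1000 + counter.incrementAndGet() % 1000;
    }

    @Override
    public String getType() {
        return "MYKEY";
    }

    @Override
    public Properties getProperties() {
        return properties;
    }

    @Override
    public void setProperties(Properties properties) {
        this.properties = properties;
    }
}
```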

4. Big picture of the source code

Work through this part together with the video and the source code, paying particular attention to the SPI extension points.