@[toc]

A, ShardingSphere

ShardingSphere is an application framework originated from dangdang. It was born inside Dangdang in 2015 and was originally called ShardingJDBC. In 2016, Zhang Liang, one of the main developers, was brought into JINGdong Data Department and the component team continued to develop. In China, it has been tested by dangdang, Telecom Payment, JINGdong And other large Internet enterprises, and began to open source in 2017. And gradually from the original focus on relational database enhancement tools ShardingJDBC upgraded to a set of data sharding based data ecosystem, renamed ShardingSphere. As of April 2020, it has become a top project of the Apache Software Foundation.

ShardingSphere contains three important products, ShardingJDBC, ShardingProxy and ShardingSidecar. Sidecar is a branch library and table plug-in for service mesh positioning, which is currently under planning. Today we will focus on the JDBC and Proxy components of ShardingSphere.

Among them, ShardingJDBC is used to do the client side sub-database sub-table product, and ShardingProxy is used to do the server side sub-database sub-table product. What’s the difference between the two? Let’s take a look at two important figures from official sources:

ShardingJDBC:

ShardingJDBC is positioned as a lightweight Java framework that provides additional services in Java’s JDBC layer. It uses the client directly connected to the database, in the form of JAR package to provide services, without additional deployment and dependence, can be understood as an enhanced VERSION of THE JDBC driver, fully compatible with JDBC and various ORM frameworks.

ShardingProxy

ShardingProxy is positioned as a transparent database agent and provides server versions that encapsulate database binary protocols to support heterogeneous languages. Currently available in MySQL and PostgreSQL versions, it can use any MySQL/PostgreSQL compliant access client.

So what’s the difference between these two approaches?

Sharding-JDBC Sharding-Proxy
The database any MySQL/PostgreSQL
Connection consumption high low
Heterogeneous language Only the Java any
performance Low loss Loss is slightly high
There is no centralized is no
Static entry There is no There are

It is obvious that ShardingJDBC is only a toolkit of the client, which can be understood as a special JDBC driver package. All sub-database sub-table logic is controlled by the business side, so his function is relatively flexible and the database support is very much, but the business invasion is large, and the business side needs to customize all sub-database sub-table logic. ShardingProxy is an independently deployed service, which does not invade the business side. The business side can conduct data interaction just like using a common MySQL service. Basically, it does not feel the existence of back-end branch library and table logic, but this also means that the function is relatively fixed and the database that can be supported is relatively small. Both have their advantages and disadvantages.

Two, ShardingJDBC combat

Shardingjdbc’s core function is data sharding and read and write separation. Through shardingjdbc, applications can transparently use JDBC to access multiple data sources that have been divided into database, table and read and write separation, without caring about the number of data sources and how data is distributed.

1. Core Concepts:

  • Logical table: A collective name for tables of the same logical and data structure of a horizontally split database

  • Real table: A physical table that actually exists in a sharded database.

  • Data node: The smallest unit of a data fragment. Consists of a data source name and a data table

  • Binding table: Primary table and sub-table with the same sharding rules.

  • Broadcast table: Also known as a common table, refers to a table that exists in all primitive shard data sources, the table structure and the data in the table are exactly the same in each database. For example, dictionary tables.

  • Sharding key: A database field used for sharding and a key field for horizontal splitting of a database (table). If there is no fragment field in SQL, full routing is performed, resulting in poor performance.

  • Sharding algorithm: Data is sharded using the sharding algorithm. Data can be sharded by =, BETWEEN, and IN. The sharding algorithm needs to be implemented by the application developers themselves and can be implemented with high flexibility.

  • Sharding strategy: The sharding key + sharding algorithm is used for sharding. In ShardingJDBC, inline sharding policy is generally adopted based on Groovy expression. An algorithm expression containing sharding key is used to set sharding policy, such as t_user_$->{u_id%8} identifier according to u_id module 8, divided into 8 tables. The table names are T_user_0 through t_user_7.

2. Test project introduction

See the corresponding ShardingDemo item for the test project. First of all, let’s make a simple arrangement of the structure of the test project:

Note: 1. Introduce MyBatisPlus dependency to simplify JDBC operations so that we do not need to write SQL statements in code.

Entity corresponds to the table structure in the database. Interfaces in mapper correspond to JDBC operations.

3. All operations are performed using JUnit test cases. All subsequent testing will be done in conjunction with the configuration in application.properties and the JUnit test case.

4. As for ShardingSphere version, the latest version 5.0 is still being incubated, so we use the released version 4.1.1 to learn.

3. Fast combat

Let’s run a simple example to see how ShardingJDBC works.

Write the contents of the application01.properties file to the application.properties configuration file:

# Vertical table splitting strategy
Configure the real data source
spring.shardingsphere.datasource.names=m1
Configure data source 1
spring.shardingsphere.datasource.m1.type=com.alibaba.druid.pool.DruidDataSource
spring.shardingsphere.datasource.m1.driver-class-name=com.mysql.cj.jdbc.Driver
spring.shardingsphere.datasource.m1.url=jdbc:mysql://localhost:3306/coursedb? serverTimezone=GMT%2B8
spring.shardingsphere.datasource.m1.username=root
spring.shardingsphere.datasource.m1.password=root
Select * from database where table is stored and what table name is used. Course_1 and m1.course_2
spring.shardingsphere.sharding.tables.course.actual-data-nodes=m1.course_$->{1.. 2}
# specify the primary key generation strategy for the table
spring.shardingsphere.sharding.tables.course.key-generator.column=cid
spring.shardingsphere.sharding.tables.course.key-generator.type=SNOWFLAKE
# An optional parameter to the snowflake algorithm
spring.shardingsphere.sharding.tables.course.key-generator.props.worker.id=1
Use a custom primary key generation strategy
#spring.shardingsphere.sharding.tables.course.key-generator.type=MYKEY
#spring.shardingsphere.sharding.tables.course.key-generator.props.mykey-offset=88
Add the sharding policy convention CID value to the course_1 table with an even value. If it's odd, add it to the course_2 table.
Select the field to calculate
spring.shardingsphere.sharding.tables.course.table-strategy.inline.sharding-column= cid
Select * from table_name;
spring.shardingsphere.sharding.tables.course.table-strategy.inline.algorithm-expression=course_$->{cid%2+1}
Open SQL log output.
spring.shardingsphere.props.sql.show=true

spring.main.allow-bean-definition-overriding=true
Copy the code

1. First define a data source M1 and perform the actual JDBC parameter configuration for M1

2, spring. Shardingsphere. Sharding. Tables. Course beginning a series of properties that define a logical table called course.

The actual-data-nodes attribute defines the actual data distribution of the course logical table, which is distributed between two tables m1.course_1 and m1.course_2.

The key-generator property configures its primary key column and primary key generation policy. ShardingJDBC provides two distributed primary key generation strategies by default, UUID and SNOWFLAKE.

The table-strategy attribute configures the database and table strategy. The shard key is the CID attribute. The sharding algorithm is course_$->{cid%2+1}, which indicates the result of the cid module 2+1 and is prefixed with the course__ part of the table. Note that the result of this expression must correspond to a case in the actual data distribution, otherwise an error will be reported.

The sql.show property indicates that the actual SQL is to be printed in the log

3. See the SQL statements in the SQL folder in the example for the table structure of the CourseDB.

Then we execute the addcourse case in the test case.

After execution, we can see many logs like this on the console:

. 2020-12-15 18:35:16.426 INFO 22412 -- [main] ShardingSphere: Logic SQL: INSERT INTO course ( cname, user_id, cstatus ) VALUES ( ? ,? ,?) 2020-12-15 18:35:16.427 INFO 22412 -- [main] shardingsphere-sql: SQLStatement InsertStatementContext(super=CommonSQLStatementContext(sqlStatement=org.apache.shardingsphere.sql.parser.sql.statement.d ml.InsertStatement@1cbc5693, tablesContext=org.apache.shardingsphere.sql.parser.binder.segment.table.TablesContext@124d26ba), tablesContext=org.apache.shardingsphere.sql.parser.binder.segment.table.TablesContext@124d26ba, columnNames=[cname, user_id, cstatus], insertValueContexts=[InsertValueContext(parametersCount=3, valueExpressions=[ParameterMarkerExpressionSegment(startIndex=59, stopIndex=59, parameterMarkerIndex=0), ParameterMarkerExpressionSegment(startIndex=62, stopIndex=62, parameterMarkerIndex=1), ParameterMarkerExpressionSegment(startIndex=65, stopIndex=65, parameterMarkerIndex=2), DerivedParameterMarkerExpressionSegment(super=ParameterMarkerExpressionSegment(startIndex=0, stopIndex=0, parameterMarkerIndex=3))], parameters=[java, 1001, 1])], generatedKeyContext=Optional[GeneratedKeyContext(columnName=cid, generated=true, GeneratedValues =[545674405561237505])]) 2020-12-15 18:35:16.427 INFO 22412 -- [main] Shardingsphere-SQL: Actual SQL: m1 ::: INSERT INTO course_2 ( cname, user_id, cstatus , cid) VALUES (? ,? ,? ,?) ::: [java, 1001, 1, 545674405561237505] .....Copy the code

From this log, we can see that the Logic SQL executed in the program is processed by ShardingJDBC and converted into Actual SQL to be executed in the database. The result of this execution can be seen in MySQL, where five messages are inserted into both the course_1 and course_2 tables. This is ShardingJDBC help us to carry out the database sub-database sub-table operation.

The other profiles, in turn, correspond to several other repository and table strategies, which we can demonstrate one by one.

  • Application02. Properties: depots table sample configuration. Built-in sharding algorithm examples, inline, Standard, Complex, hint. Example of configuring broadcast tables.

  • Application03. Properties: binding representation example configuration

  • Application04. Properties: reading and writing separate sample configuration

Note that ShardingJDBC can only distribute read and write operations to different databases, and data synchronization between databases is still performed by the MySQL master/slave cluster.

4, ShardingJDBC sharding algorithm

ShardingJDBC after the completion of the whole combat, it can be seen that the core of the whole sub-database sub-table is the configuration of the sharding algorithm. All of our exercises use inline sharding, which provides a sharding key and a sharding expression to specify the sharding algorithm. This way is simple configuration, flexible function, is the best way to configure the sub-table, and for the vast majority of sub-library sharding scene, has been very good. However, when dealing with more complex sharding strategies, such as multiple sharding keys, sharding by range, etc., inline sharding is a bit inadequate. Therefore, we also need to learn several other sharding strategies provided by ShardingSphere.

ShardingSphere currently provides a total of five sharding strategies:

  • NoneShardingStrategy

    No shard. This is not strictly a sharding strategy anymore. However, ShardingSphere also provides such a configuration.

  • InlineShardingStrategy

    The most common sharding method

    • Configuration parameters: inline-shardingColumn Sharding key; The inline. AlgorithmExpression shard expression
    • Implementation: according to the fragment expression to fragment.
  • StandardShardingStrategy

    Standard sharding policy that supports only a single sharding key.

    • Parameters: standard.sharding-column Specifies the sharding key. Standard. precise-algorithm-class-name Specifies the name of the precise fragment algorithm. Standard. range-algorithm-class-name Specifies the class name of the range sharding algorithm

    • Implementation method:

      ShardingColumn specifies the sharding algorithm.

      PreciseAlgorithmClassName pointing to an implementation of a IO. Shardingsphere. API. Algorithm. Sharding. Standard. PreciseShardingAlgorithm the Java class name of the interface, Provided IN accordance with the = or logic IN the precise subdivision example: com. Roy. ShardingDemo. Algorithm. MyPreciseShardingAlgorithm

      RangeAlgorithmClassName pointing to an implementation of a IO. Shardingsphere. API. Algorithm. Sharding. Standard. RangeShardingAlgorithm the Java class name of the interface, Provides range sharding according to the Between condition. Example: com. Roy. ShardingDemo. Algorithm. MyRangeShardingAlgorithm

    • Description:

      The exact sharding algorithm is mandatory, while the range sharding algorithm is optional.

  • ComplexShardingStrategy

    Complex sharding strategies that support multiple sharding keys.

    • Configuration parameters: complex.sharding-columns Specifies the fragment key (multiple). Complex. algorithm-class-name Specifies the implementation class of the sharding algorithm.

    • Implementation method:

      ShardingColumn specifies multiple shard columns.

      AlgorithmClassName pointing to an implementation of a org.apache.shardingsphere.api.sharding.complex.Com plexKeysShardingAlgorithm the Java class name of the interface. Provides a comprehensive sharding algorithm based on multiple sharding columns. Example: com. Roy. ShardingDemo. Algorithm. MyComplexKeysShardingAlgorithm

  • HintShardingStrategy

    A mandatory sharding policy that does not require a sharding key. The sharding strategy, in a nutshell, means that its sharding key is no longer associated with the SQL statement, but specified separately by the program. For some complicated cases, such as SQL statements such as select count(*) from (select userid from t_user where userid in (1,3,5,7,9)), there is no way to specify a shard key through SQL statements. At this time, the program can execute a separate sharding key for it. For example, in the policy of odd and even sharding by userID, you can specify 1 as the sharding key, and then specify its own sharding policy.

    • Hint. Algorithm -class-name Indicates the implementation class of the sharding algorithm.

    • Implementation method:

      AlgorithmClassName pointing to an implementation of a org. Apache. Shardingsphere. API. Sharding. Hint. HintShardingAlgorithm the Java class name of the interface. Example: com. Roy. ShardingDemo. Algorithm. MyHintShardingAlgorithm

      In this algorithm class, we also need sharding keys. The shard key specified by HintManager. AddDatabaseShardingValue method (depots) and HintManager addTableShardingValue (table) to specify.

      Note that this shard key is thread-isolated and only valid for the current thread, so it is usually recommended to turn it off immediately after use or turn it on with a try resource.

    The Hint sharding strategy does not completely follow the SQL parse tree to construct the sharding strategy, and avoids SQL parsing. Therefore, the Hint sharding strategy may perform better for some complex statements (there are too many cases to analyze one by one).

    Note, however, that Hint enforcement routes have many limitations when used:

    -- No support for UNION
    SELECT * FROM t_order1 UNION SELECT * FROM t_order2
    INSERT INTOTbl_name (col1, col2,...).SELECTCol1, col2,...FROM tbl_name WHERE col3 = ?
    
    -- No support for multi-layer subqueries
    SELECT COUNT(*) FROM (SELECT * FROM t_order o WHERE o.id IN (SELECT id FROM t_order WHERE status =?). )-- Function calculation is not supported. ShardingSphere can only extract values for sharding through SQL literals
    SELECT * FROM t_order WHERE to_date(create_time, 'yyyy-mm-dd') = '2019-01-01';
    Copy the code

See the application02.properties configuration for an example.

It can also be seen from here that even with ShardingSphere framework, the support for SQL statements after database and table division is still very fragile.

5. SQL usage restrictions of ShardingSphere

See the website document: shardingsphere.apache.org/document/cu… The documentation details a wide range of SQL types supported and not supported by the current version of ShardingSphere. Keep an eye on these things.

Support the SQL

SQL A necessary condition for
SELECT * FROM tbl_name
SELECT * FROM tbl_name WHERE (col1 = ? or col2 = ?) and col3 = ?
SELECT * FROM tbl_name WHERE col1 = ? ORDER BY col2 DESC LIMIT ?
SELECT COUNT(*), SUM(col1), MIN(col1), MAX(col1), AVG(col1) FROM tbl_name WHERE col1 = ?
SELECT COUNT(col1) FROM tbl_name WHERE col2 = ? GROUP BY col1 ORDER BY col3 DESC LIMIT ?, ?
INSERT INTO tbl_name (col1, col2…) VALUES (? ,? ,… .).
INSERT INTO tbl_name VALUES (? ,? ,… .).
INSERT INTO tbl_name (col1, col2…) VALUES (? ,? ,… .). . (? ,? ,… .).
INSERT INTO tbl_name (col1, col2…) SELECT col1, col2,… FROM tbl_name WHERE col3 = ? INSERT and SELECT tables must be the same or bound table
REPLACE INTO tbl_name (col1, col2…) SELECT col1, col2,… FROM tbl_name WHERE col3 = ? The REPLACE and SELECT tables must be the same table or bound table
UPDATE tbl_name SET col1 = ? WHERE col2 = ?
DELETE FROM tbl_name WHERE col1 = ?
CREATE TABLE tbl_name (col1 int,…)
ALTER TABLE tbl_name ADD col1 varchar(10)
DROP TABLE tbl_name
TRUNCATE TABLE tbl_name
CREATE INDEX idx_name ON tbl_name
DROP INDEX idx_name ON tbl_name
DROP INDEX idx_name
SELECT DISTINCT * FROM tbl_name WHERE col1 = ?
SELECT COUNT(DISTINCT col1) FROM tbl_name
SELECT subquery_alias.col1 FROM (select tbl_name.col1 from tbl_name where tbl_name.col2=?) subquery_alias

Unsupported SQL

SQL Cause Not supported
INSERT INTO tbl_name (col1, col2…) VALUES(1+2, ? ,…). The VALUES statement does not support operational expressions
INSERT INTO tbl_name (col1, col2, …) SELECT * FROM tbl_name WHERE col3 = ? The SELECT clause does not support the asterisk abbreviation and the built-in distributed primary key generator
REPLACE INTO tbl_name (col1, col2, …) SELECT * FROM tbl_name WHERE col3 = ? The SELECT clause does not support the asterisk abbreviation and the built-in distributed primary key generator
SELECT * FROM tbl_name1 UNION SELECT * FROM tbl_name2 UNION
SELECT * FROM tbl_name1 UNION ALL SELECT * FROM tbl_name2 UNION ALL
SELECT SUM(DISTINCT col1), SUM(col1) FROM tbl_name For details about the DISTINCT support, see the description
SELECT * FROM tbl_name WHERE to_date(create_time, ‘YYYY-MM-dd’) =? It can lead to total routing
(SELECT * FROM tbl_name) The query in parentheses is not currently supported
SELECT MAX(tbl_name.col1) FROM tbl_name When a query column is a function expression, the table name cannot be used before the query column. If the query table has an alias, use the alias of the table

Description of the DISTINCT support

Support the SQL

SQL
SELECT DISTINCT * FROM tbl_name WHERE col1 = ?
SELECT DISTINCT col1 FROM tbl_name
SELECT DISTINCT col1, col2, col3 FROM tbl_name
SELECT DISTINCT col1 FROM tbl_name ORDER BY col1
SELECT DISTINCT col1 FROM tbl_name ORDER BY col2
SELECT DISTINCT(col1) FROM tbl_name
SELECT AVG(DISTINCT col1) FROM tbl_name
SELECT SUM(DISTINCT col1) FROM tbl_name
SELECT COUNT(DISTINCT col1) FROM tbl_name
SELECT COUNT(DISTINCT col1) FROM tbl_name GROUP BY col1
SELECT COUNT(DISTINCT col1 + col2) FROM tbl_name
SELECT COUNT(DISTINCT col1), SUM(DISTINCT col1) FROM tbl_name
SELECT COUNT(DISTINCT col1), col1 FROM tbl_name GROUP BY col1
SELECT col1, COUNT(DISTINCT col1) FROM tbl_name GROUP BY col1

Unsupported SQL

SQL Cause Not supported
SELECT SUM(DISTINCT tbl_name.col1), SUM(tbl_name.col1) FROM tbl_name When a query column is a function expression, the table name cannot be used before the query column. If the query table has an alias, use the alias of the table

6. Problems caused by separate database and separate table

1, sub-database sub-table, in fact, around is a core problem, is the problem of single database capacity. It is important to understand that there are many solutions to this problem, not just a database and a table. However, ShardingSphere hopes to manage hardware resources at the software level, so as to facilitate the horizontal expansion of the database, which is undoubtedly a low-cost way.

Can you think of a better solution?

2, under normal circumstances, if the stand-alone database capacity can not hold, should start from the cache technology to reduce the database access pressure. If the database traffic is still heavy after the cache is used, you can consider a database read-write separation policy. If the database pressure is still very large, and the continuous growth of business data is immeasurable, the last consideration is to divide the database and table, and the split data of a single table should be controlled within 10 million.

Of course, as Internet technology continues to evolve, so do the options for handling massive amounts of data. In the actual system design, it is best to use MySQL database only to store hot data with strong relationship, and take some other distributed storage products for mass data. Such as PostGreSQL, VoltDB and even HBase, Hive, ES and other big data components.

3, from the last part of ShardingJDBC sharding algorithm we can see, because the function of the SQL statement is too much too comprehensive, so the database table, support for SQL statements, is actually step by step difficult, a little not careful, will cause SQL statements do not support, business data chaos and many many problems. Therefore, in practical use, we will suggest this sub-library sub-table, if possible, do not use it.

If you want to use it in OLTP scenarios, the query speed in a large amount of data is preferred. In OLAP scenarios, where there is often a lot of complex SQL involved, the limitations of separate libraries and separate tables become more pronounced. Of course, this is also a direction for future improvement of ShardingSphere.

4. If it is determined to use sub-database and sub-table, we should consider the coupling degree and usage of business data at the beginning of system design, try to control the scope of use of business SQL statements, and weaken the database to the direction of simple data storage layer of increase, deletion, change and check. First, the vertical split strategy is planned in detail to make the data layer architecture clear. As for the horizontal split, will bring very, very many data problems to the later period, so it should be cautious, cautious and cautious. It is only occasionally used in marginal scenarios such as log tables and operation records.

7, sub-database sub-table scheme design actual combat

Next, we will design a scheme for the commodity management module of e-commerce to understand how to land the sub-database sub-table.

In a typical emporium scene, the general functional components of the commodity management module are shown as follows:

In view of this scenario, considering that commodity information will continue to grow, more and more situations, how to design the plan of database and table?

1. Consider vertical fragmentation of data in terms of business units. The business data of stores, products and commodities are vertically divided into three different libraries. Dictionary tables are redundant into three different libraries as broadcast tables.

2. Considering the growth of data, goods will be the fastest growing data in the future, while the growth of stores and products will gradually decrease. So we shard the goods list.

3. Configure the highly correlated commodity information table and supplementary commodity information table as the binding table. The overall database table is roughly as follows:

Consider: What problems does this kind of inventory and inventory solve? What are the supported scenarios? What are the unsupported scenarios?