There are many open source solutions for horizontally scaling relational databases, but few mature and stable products. Sharding-JDBC, a database middle layer developed in-house at Dangdang, has been widely adopted inside the company and is gaining traction in the open source community. It has grown from its infancy through stable operation and is now at a turning point.

Sharding-JDBC is a lightweight component that provides its services as a JAR. Its core idea is to do the most essential things in a small but beautiful way.

For such an excellent project, we planned the topic "In-depth Analysis of the Lightweight Database Middle Layer Sharding-JDBC" for Expert Q&A session 144 and invited @dang_liang (Zhang Liang) as the guest expert.

This article has sorted out some of the best questions and answers.

For a project like this, everyone naturally wants to know about its original motivation, its application scenarios, and related issues. Let's look at Zhang Liang's answers to these questions.

Q: What was Sharding-JDBC designed for? What scenarios is it meant to solve?

Sharding-JDBC is designed to be a database middle layer that handles database and table sharding transparently, so that business developers do not have to generate shard-key-specific SQL in their business code.

The first version of the sharding functionality was not today's Sharding-JDBC, but the database module of ddframe, an internal Dangdang framework. One of DD-RDB's features was database sharding; there was no table sharding, and only simple SQL parsing was done at the time. Later, as ddframe was adopted by more teams, database sharding alone gradually became insufficient, and DD-RDB also contained a lot of ORM-related code. To keep the core requirement of database and table sharding pure, we extracted the sharding part and named it Sharding-JDBC: a driver layer at the Java JDBC level that handles this requirement seamlessly.

Q: What scenarios is Sharding-JDBC suitable and unsuitable for? Is there a performance evaluation?

Sharding-JDBC is a good fit for scenarios where a large volume of relational data needs to be split horizontally across databases and tables.

For example: if a user table with 100 million rows is placed in a single MySQL database, query performance will be low. If it is split horizontally into 10 databases by user ID modulo 10, the data is distributed evenly, each database holds only about 10 million rows, and query performance improves greatly. There are many kinds of sharding strategies, including hash + mod, range, and tag.
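As a concrete illustration of the modulo strategy, here is a minimal Java sketch; the class name, table naming scheme, and shard counts are assumptions for the example and not Sharding-JDBC's API.

```java
// Illustrative only: route a user ID to one of 10 databases by modulo,
// and within each database to one of 4 tables.
public final class UserModuloSharding {

    private static final int DATABASE_COUNT = 10;
    private static final int TABLE_COUNT = 4;

    /** userId 123456789 -> "user_db_9" */
    static String databaseFor(long userId) {
        return "user_db_" + (userId % DATABASE_COUNT);
    }

    /** userId 123456789 -> "t_user_2" */
    static String tableFor(long userId) {
        return "t_user_" + ((userId / DATABASE_COUNT) % TABLE_COUNT);
    }

    public static void main(String[] args) {
        System.out.println(databaseFor(123456789L) + "." + tableFor(123456789L));
    }
}
```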

Sharding-JDBC also provides read/write splitting to reduce the load on the write database.

In addition, Sharding-JDBC can be used with any Java ORM or data access framework, such as JPA, Hibernate, MyBatis, or Spring JDBC Template.

Java ORM frameworks themselves use JDBC to interact with the database, which is why we chose to build at the JDBC layer rather than on top of a particular ORM framework. We want Sharding-JDBC to be compatible with as many Java database access layers as possible and to plug into business applications seamlessly.
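For example, because the sharding layer exposes a standard javax.sql.DataSource, a framework such as Spring's JdbcTemplate works against it unchanged. This is only a sketch: building `shardingDataSource` from the sharding rules is omitted because the factory API differs between versions, and the table and column names are made up.

```java
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class OrderDao {

    private final JdbcTemplate jdbcTemplate;

    public OrderDao(DataSource shardingDataSource) {
        // The data access layer sees only a plain DataSource; SQL parsing,
        // routing, and result merging happen inside the driver-level middleware.
        this.jdbcTemplate = new JdbcTemplate(shardingDataSource);
    }

    public Integer countOrders(long userId) {
        // Having the shard key (user_id) in the WHERE clause lets the
        // middleware route this query to a single database and table.
        return jdbcTemplate.queryForObject(
                "SELECT COUNT(*) FROM t_order WHERE user_id = ?",
                Integer.class, userId);
    }
}
```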

There are two main kinds of unsuitable scenarios:

  1. OLAP scenarios. Although Sharding-JDBC can perform aggregation and grouping queries, many OLAP-style queries are still slow, and complex SQL (such as subqueries) is not currently supported. Such queries are not well suited to a big-data, high-concurrency online database anyway; reasonable OLTP queries are recommended.
  2. Requirements for strong transactional consistency. Sharding-JDBC currently supports two transaction types: weak XA and flexible transactions (BASE). Because two-phase or three-phase XA commit performs poorly, Internet companies rarely adopt it. However, neither weak XA nor flexible transactions can guarantee full transactional consistency at every point in time. Flexible transactions can guarantee eventual consistency, but the time needed to reach it is not controllable. For scenarios that require strong consistency across databases, the rationality of the database schema should be reconsidered at the design level.

As for JTA, Sharding-JDBC does not currently implement the JTA standard. Since JTA is rarely used in Internet scenarios, JTA transactions are not supported for the time being.

Performance test documentation is available in the Git repository. In the single-database scenario, the performance loss from SQL parsing and routing is about 0.02%. In a dual-database scenario, data access is distributed and performance improves by about 94%.

So what does Zhang Liang think about similar projects?

Q: What is the difference between Sharding-JDBC and other similar products? Will it integrate with SparkSQL or Hive?

As far as I know, TDDL is the only other product similar to Sharding-JDBC that is built at the JDBC layer, and TDDL has not open-sourced its database and table sharding part. The benefits of sharding at the JDBC layer are that it is lightweight, simple, compatible, and requires no additional operations effort. The drawback is that it cannot cross languages; currently only Java is supported.

Integration with SparkSQL or Hive is not currently planned. Sharding-JDBC is positioned as a middle layer for relational databases and, for the sake of stability, will not change the database's storage engine. In the future we will build a Proxy-based Sharding-JDBC-Server and gradually consider introducing Spark and other big-data query approaches.

Q: Sharding-JDBC seems similar to Mycat. Does the difference lie in how SQL statements are parsed?

From a design-concept perspective they are indeed somewhat similar. The main flow is SQL parsing -> SQL rewriting -> SQL routing -> SQL execution -> result merging. But the architectures are different: Mycat is Proxy-based, implementing the MySQL protocol so that Mycat Server presents itself as a MySQL database, while Sharding-JDBC extends the JDBC interface and provides a lightweight service in the form of a JAR package.

SQL parsing used to be similar between Sharding-JDBC and Mycat, both of which used Druid as the underlying SQL-parsing library. But Sharding-JDBC is rewriting its SQL parsing as a completely homegrown version without Druid. Druid is an excellent connection pool and parses SQL well, but it is not a SQL parser designed for sharding. Its process is Lexer -> Parser -> AST -> Visitor: users implement its Visitor interface and put their business logic in the Visitor, so the ShardingContext has to be regenerated through the Visitor, and building the abstract syntax tree (AST) requires understanding the entire SQL statement. Sharding-JDBC's own SQL parser skips keywords that are irrelevant to sharding, simplifying the overall process to Lexer -> Parser -> ShardingContext, which is a breakthrough in both performance and implementation complexity.
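To make the idea of "skip what is irrelevant to sharding" concrete, here is a toy sketch, not Sharding-JDBC's real parser: the tokens are walked once and only the table name and shard-key conditions are collected into a sharding context, without ever building an AST. All names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy example only: collect the pieces a sharding router needs and skip the rest.
final class ToyShardingContext {
    final List<String> tables = new ArrayList<>();
    final List<String> shardConditions = new ArrayList<>();
}

final class ToyShardingParser {

    ToyShardingContext parse(String sql, String shardKey) {
        ToyShardingContext context = new ToyShardingContext();
        String[] tokens = sql.split("\\s+");                    // extremely naive "Lexer"
        for (int i = 0; i < tokens.length; i++) {
            if ("FROM".equalsIgnoreCase(tokens[i]) && i + 1 < tokens.length) {
                context.tables.add(tokens[i + 1]);              // logical table name
            } else if (tokens[i].equalsIgnoreCase(shardKey)
                    && i + 2 < tokens.length && "=".equals(tokens[i + 1])) {
                context.shardConditions.add(shardKey + " = " + tokens[i + 2]);
            }
            // every other token is skipped: no abstract syntax tree is built
        }
        return context;
    }
}
```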

Next, Zhang Liang answered some questions about Sharding-JDBC's features.

Q: Does Sharding-JDBC support read/write separation?

Sharding-JDBC has supported read/write splitting since 1.3.0. Its features include:

  1. One write database and multiple read databases are distinguished according to the configuration. Currently only a round-robin strategy is available for choosing the read database, and read/write splitting can be combined with database and table sharding.
  2. A Hint can force a query to go to the write database.
  3. If a DML statement has been executed in the same thread and database connection, all queries after that DML go to the write database, while DQL statements before the DML still go to the read database. The goal is to keep data consistent within the same user thread (a conceptual sketch of this rule follows the list).
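The following is purely an illustrative sketch of rule 3, not Sharding-JDBC's actual implementation: a per-thread flag flips to "route to the write database" after the first DML and stays set for the rest of the thread's work. The class name, replica count, and round-robin details are assumptions.

```java
// Illustrative sketch of the "after a DML, route queries to the write database" rule.
final class ReadWriteRouter {

    private static final ThreadLocal<Boolean> DML_EXECUTED =
            ThreadLocal.withInitial(() -> false);

    private static int roundRobin = 0;

    /** Decide which data source a statement should go to. */
    static String route(String sql, boolean forceMaster) {
        String upper = sql.trim().toUpperCase();
        boolean isDml = upper.startsWith("INSERT") || upper.startsWith("UPDATE")
                || upper.startsWith("DELETE");
        if (isDml) {
            DML_EXECUTED.set(true);             // later queries in this thread hit the write DB
        }
        if (forceMaster || isDml || DML_EXECUTED.get()) {
            return "write-db";
        }
        return nextReadDb();                    // round-robin among read replicas
    }

    private static synchronized String nextReadDb() {
        return "read-db-" + (roundRobin++ % 2); // assume two read replicas for the example
    }
}
```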

However, by design Sharding-JDBC is not responsible for master-slave switchover or master-slave data synchronization at the database level. Sharding-JDBC is still positioned as a lightweight, enhanced database driver, so data inconsistency caused by replication lag between master and slave is outside Sharding-JDBC's domain.

In addition, because Sharding-JDBC is primarily sharding middleware and read/write splitting was added later, it supports sharding combined with read/write splitting, but configuring read/write splitting on its own is not yet convenient. Read/write splitting will be refined into an independent API in the future.

Q: On top of an existing system architecture, can Sharding-JDBC be combined with a third-party database connection pool (such as C3P0 or Druid) to achieve sharding plus read/write splitting?

Yes, you can. Sharding-JDBC aims to do only sharding and read/write splitting; connection pooling should be left to the connection pool. The two do not interfere with each other.

Q: Can sharded databases and tables be queried with LIKE? How is the performance? Will all databases and tables be queried?

  • LIKE queries have limitations. Currently Sharding-JDBC does not support shard keys inside LIKE statements, but LIKE statements that do not involve shard keys execute correctly.
  • As for LIKE performance, that is database-dependent; Sharding-JDBC only parses the SQL and routes it to the correct data sources.
  • If the SQL does not contain a shard key, all databases and tables will be queried, regardless of whether there is a LIKE.

Q: Does Sharding-JDBC have a complete management configuration interface?

Sharding-JDBC does not yet have a management/configuration interface. It is currently delivered as a JAR and released together with the business application, which simplifies development and has no impact on DBAs; after sharding, the DBA therefore still sees the scattered databases.

In the future we will build a configuration center so that sharded data sources can be modified dynamically, along with a matching management interface. We also plan to manage database metadata centrally to provide friendlier services for DBAs.

Q: How does sharding-JDBC force a query to go to the primary database?

The read/write splitting documentation is at: dangdangdotcom.github.io/sharding-jd…

The general usage is as follows:

```java
HintManager hintManager = HintManager.getInstance();
hintManager.setMasterRouteOnly();
// Continue with normal JDBC operations; queries now go to the write (master) database.
```

There were also other Sharding-JDBC related questions, which Zhang Liang answered in detail.

Q: How does Sharding-JDBC address system robustness? Our back end has high reliability requirements, and we are also considering remote disaster recovery. If we use Sharding-JDBC, will the sharded schema make operations more difficult?

Because Sharding-JDBC is a JAR, its life cycle is the same as that of the business application: they live and die together. So the robustness that needs to be addressed is simply that of the business system using Sharding-JDBC.

The operational issues of a sharded schema would exist even without Sharding-JDBC. Sharding-JDBC certainly does not solve them, but it does not make them worse either.

Once the core is stable, we will consider building operations and maintenance tools for DBAs.

Q: Is Sharding-JDBC's SQL parser open source? Is the performance improvement significant? Also, for Dangdang's current business, can sharded database and table migration be automated?

Sharding-JDBC's own parser is open source and currently lives in the parser branch, but it is not finished yet and is still being iterated rapidly.

The SQL parsing principle is roughly the same as Druid's, but the process is simplified. Druid is designed to monitor, analyze, and normalize SQL, so its parsing scenarios require a complete understanding of the SQL; its process is (Lexer -> Parser -> AST -> Visitor) -> ShardingContext. Sharding-JDBC only needs to understand sharding-related keywords and skips everything else, so the ShardingContext is generated directly and no longer has to be assembled by visiting an abstract syntax tree.

The new parser's process is simplified to Lexer -> Parser -> ShardingContext, which brings breakthroughs in both performance and implementation complexity. Concrete performance test reports are still to come and will be released together with the new parser branch.

As for automated data migration when sharding, Dangdang has not automated it. Data migration is closer to database operations tooling than to JDBC, so it is out of scope for Sharding-JDBC for the time being.

Zhang Liang also shared a great deal about Sharding-JDBC's future plans.

Q: What’s next for Sharding-JDBC?

Sharding-JDBC is currently rewriting its SQL parser. Previously it used Druid as the basic SQL-parsing tool, but for various reasons we decided to parse SQL ourselves, which further improves the performance, accuracy, and compatibility of sharding.

After the new SQL parser is complete, we will focus on the TCC part of flexible transactions that was not finished before, support for more SQL types, and dynamic configuration.

After that, Sharding-JDBC will be split into Sharding-JDBC-Driver and Sharding-JDBC-Server to meet the needs of different users. Sharding-JDBC-Driver will be even more lightweight, while Sharding-JDBC-Server will provide services as a Proxy, which can better handle data migration, transactions, and OLAP requirements.

The heart of Sharding-JDBC should remain an enhanced JDBC driver that provides lightweight services as a JAR. Overall, the Sharding-JDBC ecosystem will be roughly divided into three rings:

  • The first ring is the core JDBC-related functionality: database and table sharding, read/write splitting, and distributed primary keys. This is the small-but-beautiful core.
  • The second ring is database-related but not JDBC-related functionality, provided as plug-ins: flexible transactions, database HA, database metadata management, dynamic configuration, and so on.
  • The third ring is business- or usability-oriented: multi-tenancy, account security, Spring custom namespaces, YAML configuration, and so on.

Many people have asked about support for other languages. Since Sharding-JDBC is developed against the JDBC interface provided by Java, Python, Node.js, and other languages are not currently supported.

However, its core modules such as parsing, routing, and result merging are almost the same as those needed for a Proxy-based version, so in the future we intend to offer a Sharding-JDBC-Server that supports any language.

A reader saw that we were considering developing Sharding-JDBC-Server and assumed the current Sharding-JDBC model has unsolvable problems. Sharding-JDBC has no unsolvable problems within its current positioning, but if you want to do more (the data migration, distributed transactions, metadata management, and so on mentioned earlier), you need to move toward a Proxy approach. Sharding-JDBC wants to offer two different ways of use so that users have more freedom of choice.

There is also the question of SQL support. Given limited time and energy, full SQL compatibility is currently impossible. Our goal is to support roughly 80% of the SQL most commonly used in OLTP. Queries with aggregation, grouping, and sorting are already supported. DISTINCT is not supported yet, and support for OR is not complete. However, DISTINCT can be rewritten as GROUP BY, and OR can be replaced by IN, so most SQL can be used with small modifications.
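For reference, the substitutions mentioned above look like the following (shown as Java string constants; the table and column names are invented for the example):

```java
// Hand-rewriting currently unsupported constructs into supported equivalents.
final class SqlRewriteExamples {

    // DISTINCT expressed with GROUP BY
    static final String WITH_DISTINCT = "SELECT DISTINCT user_id FROM t_order";
    static final String WITH_GROUP_BY = "SELECT user_id FROM t_order GROUP BY user_id";

    // OR expressed with IN
    static final String WITH_OR =
            "SELECT * FROM t_order WHERE status = 'PAID' OR status = 'SHIPPED'";
    static final String WITH_IN =
            "SELECT * FROM t_order WHERE status IN ('PAID', 'SHIPPED')";
}
```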

A reader then asked: since DISTINCT and GROUP BY are interchangeable and OR can be replaced by IN, could these rewrites be done automatically during SQL parsing?

While it is technically possible, the design is questionable.

If DISTINCT and OR can already be parsed, there is no need to rewrite the SQL; they can simply be supported directly. Rewritten SQL is painful for DBAs to debug, so we want to keep the SQL unchanged and rewrite only the necessary parts, such as the actual table names of the shards, converting AVG to COUNT and SUM, and the offset and row count of LIMIT.
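As a sketch of why AVG needs rewriting: an average cannot be merged from per-shard averages, so each shard is typically asked for COUNT and SUM and the middleware recombines them (LIMIT is handled similarly, by widening the offset and row count fetched from each shard before merging). The class and numbers below are illustrative, not Sharding-JDBC internals.

```java
import java.util.List;

// Illustrative result merging for AVG across shards.
final class AvgMerger {

    static final class Partial {
        final long count;
        final double sum;
        Partial(long count, double sum) { this.count = count; this.sum = sum; }
    }

    // SELECT AVG(price) FROM t_order is rewritten per shard roughly as
    // SELECT COUNT(price), SUM(price) FROM t_order_n, then merged here.
    static double mergeAvg(List<Partial> partials) {
        long totalCount = partials.stream().mapToLong(p -> p.count).sum();
        double totalSum = partials.stream().mapToDouble(p -> p.sum).sum();
        return totalCount == 0 ? 0.0 : totalSum / totalCount;
    }

    public static void main(String[] args) {
        // shard A: count=2, sum=30; shard B: count=3, sum=90 -> AVG = 120 / 5 = 24.0
        System.out.println(mergeAvg(List.of(new Partial(2, 30.0), new Partial(3, 90.0))));
    }
}
```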

SQL that is too complex, such as subqueries, may not be suitable for sharded databases with large data volumes and may need to be restructured.

We will continue to improve SQL compatibility in the future.

Currently Sharding-JDBC only supports MySQL; support for other databases (such as Oracle) is in the works. The new SQL parser is in progress, and support for Oracle, PostgreSQL, and SQL Server is planned for that release. For Oracle and SQL Server, the main work is recognizing their special keywords and pagination syntax.

The new SQL parser does require a lot of work. Although the overall code structure has been worked out, it will take some time to provide a stable service. A snapshot is expected in April and a stable release in June.

The new SQL parser is on the parser branch in Git.

Zhang Liang also shared a lot of database-related experience and lessons learned.

Q: I am currently using sharding code I wrote myself, but parsing statements is awkward. Some statements cannot be parsed, so I wrap JSqlParser, Druid, and hand-written regular expressions to extract table names. Do you have any suggestions?

JSqlParser parses SQL the JavaCC way, which is less efficient than Druid. Regular-expression parsing should perform even worse, and both are hard to tune.

Druid's lexer/parser approach to SQL requires more coding but brings significant performance gains, although Druid's SQL parser is somewhat hard for developers to get started with. Sharding-JDBC uses the same SQL-parsing approach as Druid, but optimized for sharding.

If the goal is just to get table names, JSqlParser and Druid both work, but for performance reasons Druid or Sharding-JDBC is recommended.

Q: Do you have any experience with distributed transactions?

Regarding distributed transactions, we believe that although multi-phase XA commit guarantees distributed data integrity well, it hurts application performance badly, so we do not consider it. We use two approaches: one we call weak XA, and the other is flexible transactions, i.e. BASE.

Weak XA means each database is responsible for committing and rolling back its own transaction, with no central coordinator. The advantage is that it is supported natively and has no performance impact. But if a problem occurs, for example two databases need to commit and one succeeds while the other fails because the connection drops during commit, the data becomes inconsistent, and that inconsistency is permanent.
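In JDBC terms, weak XA amounts to something like the sketch below: each shard's local transaction is committed on its own, and a failure between the two commits leaves the shards permanently inconsistent. Connection setup is omitted; `connectionA` and `connectionB` are assumed to point at two different databases.

```java
import java.sql.Connection;
import java.sql.SQLException;

// Illustrative weak-XA commit: each database commits its own local transaction.
// If connectionB.commit() fails after connectionA.commit() has succeeded
// (for example, the connection drops), there is no coordinator to repair it,
// so the inconsistency is permanent.
final class WeakXaCommit {

    static void commitAll(Connection connectionA, Connection connectionB) throws SQLException {
        connectionA.commit();
        connectionB.commit();
    }

    static void rollbackAll(Connection connectionA, Connection connectionB) throws SQLException {
        connectionA.rollback();
        connectionB.rollback();
    }
}
```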

Flexible transactions are a useful complement to weak XA. There are many types of flexible transactions.

Sharding-JDBC mainly implements best-effort delivery: we assume the transaction will eventually succeed through retries. If a statement fails to execute, it is recorded in a transaction log database and retried asynchronously until it succeeds (the number of attempts is configurable; if it still fails after too many attempts, it is stored for manual intervention). Along the way the data is inconsistent for a while, but it eventually becomes consistent. This achieves eventual consistency without sacrificing performance, at the cost of strong consistency. The drawback of best-effort delivery is that it is not flexible: it assumes the transaction must succeed and cannot be rolled back.
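A minimal sketch of the best-effort-delivery flow described above, with made-up names; in the real design the failed statement is recorded in a transaction log database and retried asynchronously rather than in a loop like this.

```java
// Conceptual sketch of best-effort delivery: record, retry, and hand over to
// manual intervention after too many failed attempts.
final class BestEffortDelivery {

    interface SqlExecutor {
        void execute(String sql, Object[] params) throws Exception;
    }

    interface TransactionLog {
        void record(String sql, Object[] params);        // written before execution
        void markSuccess(String sql);
        void moveToManualQueue(String sql);               // give up: manual intervention needed
    }

    private static final int MAX_ATTEMPTS = 5;            // assumed; configurable in reality

    static void executeWithRetry(String sql, Object[] params,
                                 SqlExecutor executor, TransactionLog log) {
        log.record(sql, params);
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                executor.execute(sql, params);
                log.markSuccess(sql);                     // eventually consistent once this succeeds
                return;
            } catch (Exception ignored) {
                // data is temporarily inconsistent here; keep trying
            }
        }
        log.moveToManualQueue(sql);
    }
}
```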

Another kind of flexible transaction is TCC, i.e. Try-Confirm-Cancel. A transaction manager controls the commit or rollback of the transaction, which is closer to a traditional transaction but still only eventually consistent. The drawback is that the business code has to implement the Try, Confirm, and Cancel interfaces itself, which has some impact on existing services. Sharding-JDBC will add TCC support in the future.
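A TCC participant is usually expressed as three business-level operations. The interface below is only a sketch of what business code has to provide (hypothetical names, not a Sharding-JDBC API), which illustrates why TCC is intrusive for existing services.

```java
// Illustrative TCC participant for an account deduction.
interface TccAccountService {

    /** Try: reserve resources, e.g. freeze the amount to be deducted. */
    boolean tryDeduct(long accountId, long orderId, long amountInCents);

    /** Confirm: make the deduction final once every participant's Try has succeeded. */
    void confirmDeduct(long accountId, long orderId);

    /** Cancel: release the frozen amount if any participant's Try failed. */
    void cancelDeduct(long accountId, long orderId);
}
```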

Other distributed transaction approaches, such as Google's F1, are not supported, because Sharding-JDBC keeps the database's original storage engine unchanged.

Q: When do I need to shard tables? At present we split only by business.

Sharding is divided into horizontal splitting and vertical splitting. Vertical splitting means splitting databases or tables by business. Horizontal splitting means splitting the same database or table into multiple databases or tables according to some sharding rule.

Both database sharding and table sharding can effectively address the query performance degradation caused by large data volumes. Database sharding can also relieve the pressure of high concurrency on the database, but sharding tables only (within one database) allows local transactions to be used instead of distributed transactions. So the appropriate use of database and table sharding depends on the business scenario.

As a basic library for development, Sharding-JDBC supports both database sharding and table sharding, leaving the choice to the business development engineers.

Q: How do you know which shard the user is in?

If the user name is the primary key (and thus the shard key), you can compute directly from the sharding strategy you defined which table in which database the user ends up in.

If it is not, the query has to be routed to every shard and searched one by one until the user is found (a sketch of both cases follows).
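The two cases can be summarized in a small sketch (illustrative names; the shard count is an assumed configuration value):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative routing decision: a single shard when the shard key is known,
// all shards ("full routing") when it is not.
final class UserLookupRouter {

    static List<String> targetTables(Long userId /* shard key, may be null */, int shardCount) {
        List<String> targets = new ArrayList<>();
        if (userId != null) {
            targets.add("t_user_" + (userId % shardCount));   // precise routing
        } else {
            for (int i = 0; i < shardCount; i++) {
                targets.add("t_user_" + i);                   // search every shard one by one
            }
        }
        return targets;
    }
}
```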

Zhang Liang also discussed many other database-related questions.

Q: I have a big question: how do you decide whether to heavily customize MySQL or switch databases? The cost of all the MySQL customization may not be cheaper than buying a high-end database, especially now that many databases are available as cloud services. Buying an Oracle or Microsoft cloud database should now be affordable for small and medium-sized enterprises.

This problem is quite big and needs to be discussed from the perspective of the whole.

MySQL plus open source sharding middleware keeps the core technology in the company's own hands, which is what most Internet companies prefer. This is not only about immediate economic cost, but also about the cost of solving technical problems and staying in control of the technology when the business changes. Mature companies tend to develop their own middleware, along with supporting systems for monitoring, governance, and more.

NoSQL is also an option, but it is positioned as a useful complement to relational databases rather than a full replacement, so relational databases plus sharding still have their place.

As for Oracle, my view is that it can be considered for important business data, while peripheral data does not need it. Of course, if the company is not strapped for cash and is purely business-oriented rather than technology-oriented, it can rely entirely on Oracle; it is just that Oracle does not fit every Internet scenario. With cloud databases like Oracle's, the key is how the company positions itself: if the core is the business, the technology can be outsourced entirely.

Personally, I think Oracle and cloud databases are reasonable when the business volume is small or moderate. For large Internet companies with explosive business growth, these solutions are not feasible, because unicorn-level companies face business scenarios that are almost unique, and third parties cannot provide the solutions off the shelf. Once a company grows to a certain scale, most will choose a combination of in-house development and open source to keep their technology adaptable and matched to the business.

Q: So the selection mainly comes down to the OLTP business model and changes in data volume? Choose free MySQL at the start, when technology and money may be lacking; then, once the business survives and grows and the data skyrockets, hire people to customize it or adopt an existing customized solution. Traditional commercial databases stay concentrated in key areas such as finance, and the areas MySQL cannot handle, such as OLAP, may go to big-data products rather than a database at all. Is my understanding correct?

Yes, correct.

In short: when just starting out, use MySQL. With plenty of money and a moderate business volume, use Oracle, or use Oracle for the core business and cloud databases for non-core business. With insufficient funds and a large business volume, use MySQL plus an open source middle layer. With a very large business volume, the only option is in-house development.

Big data and relational databases are not the same direction; big data is mainly used to store other kinds of data. Orders, transactions, and similar data are generally not put into big data systems, which are better suited to logs, browsing records, and so on.

Q: I would also like to ask about PostgreSQL (PG) for selection. In industry MySQL is far more popular and PG much less so. Is that because PG has not had success with heavily customized large-scale clusters? At least we have heard many cases of Google and others customizing MySQL; I don't think I've heard of PG ones.

With MySQL, despite all its shortcomings, I know it can be scaled big enough, because there are so many success stories and third-party open source customized clustering solutions. PG is better than MySQL in features, but it is not as flexible, so it is used less. If you have money, buy a commercial database; if not, use MySQL plus Hadoop.

Sharding-jdbc serves any relational database, whether MySQL or PG.

But to discuss it a bit more broadly: many companies rely on balance in technology selection, especially for databases. MySQL is not necessarily the best database, and PG is better in some respects, but MySQL is the most widely recognized in many ways, such as the surrounding MHA suites and developers' and DBAs' familiarity with it. As a result, even MySQL branches like MariaDB and Percona, however much they improve functionality and performance or change the engine, are generally not as well recognized as MySQL itself.

Q: There is something I have been puzzled about. There are two approaches among database/table sharding middleware:

  1. The enhanced-JDBC-driver approach of Sharding-JDBC and TDDL
  2. The approach of Mycat and similar products: add a middle layer and do the database and table sharding in that layer

What I want to ask is: what are the advantages and disadvantages of these two approaches, and which do large companies mostly choose?

Both approaches have advantages and disadvantages.

Advantages of the JDBC-driver approach:

  1. Lightweight, with a narrower scope: it only enhances JDBC and does not include HA, transactions, or database metadata management
  2. Less development work: no need to deal with NIO, the various database protocols, and so on
  3. No changes to operations: there is no need to worry about the HA of the middleware itself
  4. High performance: JDBC connects directly to the database without a second hop
  5. It can support any JDBC-compliant database, such as MySQL, Oracle, or SQL Server

Advantages of the Proxy approach:

  1. It can take on more responsibilities: data migration, distributed transactions, and so on fall within the Proxy's scope
  2. It can manage database connections more effectively
  3. It can integrate big-data thinking and handle OLTP and OLAP separately

Therefore, the two approaches complement each other. For teams that use Java, run only front-line OLTP Internet workloads, and may use multiple databases, JDBC-layer middleware is recommended. If you need to mix OLAP and OLTP, or perform heavyweight operations such as data migration and distributed transactions, consider Proxy-layer middleware, although open source solutions for data migration and distributed transactions are still uncommon. The NewSQL approach of changing the database engine is not discussed here.