Congratulations, your company has finally grown to a size where you need to consider high availability, and even splitting databases and tables. But do you know what it takes to split a database and its tables? The process is complicated, so plan ahead; don't wait until you are in the middle of it for all kinds of unexpected work to spin out of control.

This article is meant to survey the breadth of database middleware rather than the depth of any implementation, so the concepts of vertical and horizontal partitioning, and the reasons for them, are not explained in much detail. It is aimed at professionals with some R&D experience who are looking for a selection and implementation process.

Layers where you can cut in

The discussion below is scoped to Java and MySQL. Let's look at the layers at which database and table sharding can be implemented.

(1) Coding layer

Create multiple data sources in the same project and route between them directly in code with if/else logic based on some criterion. Spring provides an abstract class for dynamically switching data sources; see AbstractRoutingDataSource.
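As a rough idea of what that looks like, here is a minimal sketch built on Spring's AbstractRoutingDataSource; the ThreadLocal key and the shard names are assumptions for illustration, not anything prescribed by Spring:

```java
import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;

public class ShardRoutingDataSource extends AbstractRoutingDataSource {

    // Holds the lookup key (e.g. "ds_0", "ds_1") chosen for the current thread.
    private static final ThreadLocal<String> CURRENT_SHARD = new ThreadLocal<>();

    public static void setShard(String shardKey) {
        CURRENT_SHARD.set(shardKey);
    }

    public static void clear() {
        CURRENT_SHARD.remove();
    }

    @Override
    protected Object determineCurrentLookupKey() {
        // Spring calls this before borrowing a connection; the returned key must
        // match one of the keys registered via setTargetDataSources().
        return CURRENT_SHARD.get();
    }
}
```

The business code (or an AOP aspect) calls setShard(...) before touching the database and clear() afterwards; the actual data sources are registered through setTargetDataSources() in the Spring configuration.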

If the project is not very large, this approach gets the job done quickly. The disadvantages are just as obvious: a lot of code is needed to take care of each branch, and for cross-database queries, aggregations, and scenarios that require looping over shards and merging results, the workload is huge.

If the project spawns new projects, most of this routing code cannot be shared as a library; it gets shared by copy-and-paste. Keep that up and it becomes unmaintainable.

(2) Framework layer

This approach suits a company that has standardized on a single ORM framework, which in many cases is not realistic. It works by modifying or extending the existing ORM framework, for example by adding custom primitives or hints to the SQL.

Implementing interceptors (such as MyBatis' Interceptor interface) and adding some custom parsing to control where the data flows works better, but it changes part of the existing programming experience.
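For example, a minimal sketch of a MyBatis plugin that hooks statement preparation (assuming MyBatis 3.4+, where prepare takes a transaction timeout; the actual rewrite logic is only hinted at):

```java
import java.sql.Connection;
import java.util.Properties;
import org.apache.ibatis.executor.statement.StatementHandler;
import org.apache.ibatis.plugin.Interceptor;
import org.apache.ibatis.plugin.Intercepts;
import org.apache.ibatis.plugin.Invocation;
import org.apache.ibatis.plugin.Plugin;
import org.apache.ibatis.plugin.Signature;

// Intercepts statement preparation so the SQL can be inspected and, in a real
// implementation, parsed and rewritten (e.g. t_order -> t_order_3) before it
// reaches the database.
@Intercepts({
    @Signature(type = StatementHandler.class, method = "prepare",
               args = {Connection.class, Integer.class})
})
public class ShardingInterceptor implements Interceptor {

    @Override
    public Object intercept(Invocation invocation) throws Throwable {
        StatementHandler handler = (StatementHandler) invocation.getTarget();
        String sql = handler.getBoundSql().getSql();
        // Parse sql, extract the shard key, and decide the target table/database here.
        return invocation.proceed();
    }

    @Override
    public Object plugin(Object target) {
        return Plugin.wrap(target, this);
    }

    @Override
    public void setProperties(Properties properties) {
        // Read custom plugin configuration if needed.
    }
}
```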

In many cases this means modifying the framework's source code, which is not recommended.

(3) Driver layer

Because of the shortcomings of cutting in at the coding and framework layers, true database middleware starts at least at the driver layer. What does that mean? Essentially, you rewrite the JDBC driver, maintain a routing table in memory, and forward requests to the actual database connections.
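Stripped of SQL parsing and result merging, the core idea can be sketched like this (the shard naming and the modulus rule are assumptions for illustration, not how any particular product works):

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.util.Map;
import javax.sql.DataSource;

// Sketch of the driver-layer idea: keep a routing table in memory and hand
// each request to the connection pool of the shard it resolves to.
public class ShardedConnectionProvider {

    private final Map<String, DataSource> routingTable; // "ds_0" -> pool for db_0

    public ShardedConnectionProvider(Map<String, DataSource> routingTable) {
        this.routingTable = routingTable;
    }

    public Connection getConnection(long shardKeyValue) throws SQLException {
        // Route by shard key, e.g. user_id modulo the number of shards.
        String target = "ds_" + (shardKeyValue % routingTable.size());
        return routingTable.get(target).getConnection();
    }
}
```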

TDDL, ShardingJDBC, and the like cut in at this layer.

MySQL Connector/J's own failover-related features (specifically "load balancing", "replication", "Fabric", and so on) also modify the driver directly.
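For reference, Connector/J's multi-host URL schemes look like this (hosts, database name, and credentials are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectorJMultiHost {
    public static void main(String[] args) throws SQLException {
        // Round-robin load balancing across two MySQL servers.
        String lbUrl = "jdbc:mysql:loadbalance://db1:3306,db2:3306/app";
        // A replication (read/write splitting) URL would instead look like:
        //   jdbc:mysql:replication://master:3306,replica1:3306/app
        try (Connection c = DriverManager.getConnection(lbUrl, "user", "pass")) {
            // Statements issued on this connection are balanced across db1 and db2.
        }
    }
}
```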

The flow of a request is roughly: application code → sharding-aware JDBC driver (parse and route) → the actual database connections.

(4) Proxy layer

Proxy-layer database middleware disguises itself as a database and accepts connections from the business side. Requests from the business side are then load-balanced, parsed, or forwarded to the real databases.

MySQL Router, MyCat, and the like cut in at this layer.

The flow of a request is roughly: application code → proxy (parse, route, forward) → the real databases.

(5) Implementation layer

Here the database implementation itself provides the features: MySQL Cluster natively supports a range of clustering features, MariaDB Galera Cluster supports peer dual-master setups, Greenplum supports sharding, and so on.

Since it requires replacing the storage layer outright, this is generally ruled out as a solution.

The technologies eventually converge, and any one of them can work. The final choice, however, is shaped by factors such as developer familiarity, community activity, compatibility with your company, official maintenance, scalability, and the database products the company already runs. Choose or build a suitable one and your teammates will be much happier.

Comparing the driver layer and the proxy layer

From the layers described above, it is clear that when choosing or building middleware, the focus is on the driver layer and the proxy layer. At these two layers you get greater control and finer-grained management of database connections and routing. But the differences between them are also clear.

Characteristics of the driver layer

Supports Java only, supports many databases

Driver-layer middleware supports only one development language, Java, but supports virtually any back-end relational database. This solution is recommended if your development language is fixed and your back-end data sources are varied.

Too many database connections

Driver-layer middleware maintains a large number of database connections. For example, for a table split across 10 databases, each Java connection pool maintains 10 sets of connections. With many projects, connections explode: if each project runs 6 instances and the pool's minIdle is 5, just 3 projects already hold 10 * 6 * 5 * 3 = 900 connections. For a database like PostgreSQL, which spawns one process per connection, this can be stressful.

Data aggregation is performed at the business instance

Aggregations such as COUNT and SUM are done by issuing multiple queries and merging the results in the memory of the business instance.
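A minimal sketch of what that merging looks like for a COUNT(*), assuming a hypothetical list of per-shard DataSources and the same count SQL issued against each shard:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;

public class ShardedCount {

    // Issues the same COUNT(*) query against every shard and adds the partial
    // results together in the application's memory.
    public static long countAll(List<DataSource> shards, String sql) throws SQLException {
        long total = 0;
        for (DataSource shard : shards) {
            try (Connection conn = shard.getConnection();
                 PreparedStatement ps = conn.prepareStatement(sql);
                 ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    total += rs.getLong(1); // partial count from this shard
                }
            }
        }
        return total;
    }
}
```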

The routing table lives in the memory of the business instance and can be updated by polling or by passive notification.

Centralized management

Configuration for all clusters is managed centrally in one place, so the operation and maintenance burden is small, and DBAs can carry out the related operations.

Typical implementations: ShardingJDBC, TDDL.

Characteristics of the proxy layer

Supports heterogeneous languages, limited database support

Proxy-layer middleware is the opposite: it supports only one back-end relational database but many development languages. This solution is recommended if your systems are heterogeneous and all share the same SLA requirements.

Heavy burden of operation and maintenance

The proxy layer itself maintains only a limited number of database connections (except for pass-through, sticky connections like MySQL Router's). But as a standalone service, deploying it independently and making it highly available adds many extra nodes, not to mention the companies that also run shadow nodes. In addition, the proxy layer is the single entry point for all requests, so its stability requirements are extremely high; if a node is brought down by a memory-hungry aggregate query, the result can be a catastrophic accident.

Typical implementations: MyCat, MySQL Router.

What they have in common

There isn't enough space to discuss this in full. Visit any middleware's promotional page and you will see a long feature list, in effect a whitelist; you will also see a long list of restrictions, a blacklist. However you play it, even with its enhanced distributed capabilities, a sharding middleware is ultimately a crippled database.

Use restrictions

Ensure data balance. Split the data across databases as evenly as possible. For example, splitting a user database by province is uneven; taking the userID modulo the shard count is much more even.

Avoid deep paging without the shard key. The middleware has to pull every page up to the requested one from all databases and sort the data in memory, which easily causes memory overflow (see the sketch below).

Reduce subqueries. Subqueries can confuse the SQL parser and lead to parse errors, so minimize them.

Follow the minimum transaction principle. Keep the scope of a single transaction to as few databases as possible, that is, reduce cross-database operations, and place tables that are usually operated on together in the same database.

Special functions such as DISTINCT, HAVING, UNION, IN, and OR may be unsupported, or supported only with increased risk and the need for rework.
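To make the deep-paging point concrete, here is a rough sketch of the merge a middleware must perform for something like LIMIT 100000, 10 without a shard key (the per-shard result lists are hypothetical):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DeepPagingMerge {

    // Each shard has already returned its first (offset + size) ids, sorted.
    // The middleware merges them all in memory, sorts again, and throws away
    // everything except the requested page.
    public static List<Long> page(List<List<Long>> sortedIdsPerShard,
                                  int offset, int size) {
        List<Long> merged = new ArrayList<>();
        sortedIdsPerShard.forEach(merged::addAll);
        Collections.sort(merged);                       // merge-sort in memory
        int from = Math.min(offset, merged.size());
        int to = Math.min(offset + size, merged.size());
        return new ArrayList<>(merged.subList(from, to)); // the requested page
    }
}
```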

Products

The recommendation focuses on MyCat and ShardingJDBC. Beyond those, there is plenty of other middleware you may not be familiar with. Database middleware is hard to maintain, and you will find a lot of half-dead projects.

Here is a list, in no particular order (some of these offer only HA functionality without sharding):

Atlas, Kingshard, DBProxy, MySQL Router, MaxScale, 58 Oceanus, ArkProxy, Ctrip DAL, Tsharding, YouTube Vitess, NetEase DDB, Heisenberg, ProxySQL, Mango, DDAL, Datahekr, MTAtlas, MTDDL, Zebra, Cobar

Almost every big company has its own database middleware (and I found a few that like to stick the company prefix on open-source components and call them products), but those are not much use to the rest of us.

Process and solutions

No matter which layer you cut in at, sharding databases and tables involves the following work.

Information collection

Assess the impact on business and projects

The larger the scope of the project, the harder the split. Sometimes a single complex SQL statement involves four or five business parties, all of whom need to be brought in.

Determine the scope of the split: just a few tables, or all of them. The more you split, the more work it takes, growing almost linearly.

Some projects are a case of pulling one hair and moving the whole body: a change affects far more than a single branch of the call chain.

Identify participants

Besides the technical staff supporting the sharding components, the most important people to involve are those most familiar with the system and the existing code. Only they can decide which SQL should be discarded, what the blast radius of each SQL statement is, and so on.

Determine the sharding strategy

Determine the sharding dimensions and shard keys. Once the split key (the column used to route data) is chosen, it must not be changed, so it should be settled early in the architectural design, before any further work. Supporting multiple query dimensions means maintaining different shard keys for different query conditions, which involves data redundancy (covering tables, data synchronization) and is considerably more complex.
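As an example of what a routing rule on a split key might look like, here is a minimal modulus-based sketch; the 4-database by 8-table layout and the user_id key are illustrative assumptions:

```java
public final class ShardRoute {

    private static final int DB_COUNT = 4;
    private static final int TABLE_COUNT_PER_DB = 8;

    // Maps a user_id to a physical database and table name.
    public static String route(long userId) {
        int slot = (int) (userId % (DB_COUNT * TABLE_COUNT_PER_DB));
        int dbIndex = slot / TABLE_COUNT_PER_DB;    // which database
        int tableIndex = slot % TABLE_COUNT_PER_DB; // which table inside it
        return "db_" + dbIndex + ".t_order_" + tableIndex;
    }

    public static void main(String[] args) {
        System.out.println(route(1001L)); // prints db_1.t_order_1
    }
}
```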

Preparation

Standardize data structures

Table structures that don't meet the requirements need to be cleaned up in advance. For example, shard keys may have different field names or types across tables. When the sharding strategy is implemented, these quirks bloat the routing policy and make it hard to maintain.

Scan all SQL

Scan all the SQL in the project and determine, statement by statement, whether it can run correctly given the shard key. This process is certain to turn up a large number of non-compliant statements, each of which needs a rework plan; this is one of the main sources of workload.

Validation tool support

You can modify and validate SQL directly in the original project, but that has many problems, mainly inefficiency. I prefer to build some validation tools first: feed in a SQL statement or a list of them, print the routing information and results, and judge from there.
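A deliberately naive sketch of such a tool, which only flags statements that don't reference the shard key (user_id is an assumed split column; a real tool would use a SQL parser and print full routing results):

```java
import java.util.List;
import java.util.regex.Pattern;

public class SqlShardKeyChecker {

    private static final Pattern SHARD_KEY =
            Pattern.compile("\\buser_id\\b", Pattern.CASE_INSENSITIVE);

    // Prints OK for statements that carry the shard key and FIX for the ones
    // that need a rework plan.
    public static void check(List<String> sqlList) {
        for (String sql : sqlList) {
            boolean routable = SHARD_KEY.matcher(sql).find();
            System.out.printf("[%s] %s%n", routable ? "OK " : "FIX", sql);
        }
    }

    public static void main(String[] args) {
        check(List.of(
                "SELECT * FROM t_order WHERE user_id = ?",
                "SELECT COUNT(*) FROM t_order WHERE status = 'PAID'"));
    }
}
```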

Technical preparation

For each of the points listed below, try out an example and estimate how hard it will be for your team.

The checklist:

Collate all the SQL types the middleware does not support.

Consider the scenarios that are prone to crashes.

Give a handling plan for every unsupported SQL statement.

Consider a generic primary key generator (see the sketch below).

Consider how to handle SQL without the shard key.

Consider how to traverse the full set of databases, for example in scheduled tasks.

Consider how to rework cross-database and cross-table queries.

Prepare some tool sets.
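On the generic primary key generator, a snowflake-style ID generator is one common choice; a minimal sketch follows (the epoch and bit layout are illustrative assumptions, and a production version would also handle clock rollback):

```java
public class SnowflakeIdGenerator {

    private static final long EPOCH = 1577836800000L;   // 2020-01-01 UTC, arbitrary
    private static final long WORKER_BITS = 10L;
    private static final long SEQUENCE_BITS = 12L;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

    private final long workerId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeIdGenerator(long workerId) {
        this.workerId = workerId;
    }

    // Produces a roughly time-ordered 64-bit id: timestamp | worker id | sequence.
    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {                 // sequence exhausted this millisecond
                while (now <= lastTimestamp) {
                    now = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }
}
```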

Implementation stage

Data migration

Because the split changes how the data is distributed, data migration is unavoidable, whether full or incremental, so a data bus is necessary.

Ideally, all inserts, updates, and deletes are published as messages, so double-writing can be done by subscribing to MQ.

In practice, however, you usually have to simulate that state, for example by subscribing to the binlog with the Canal component.
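A minimal sketch of a Canal client consuming binlog events so they can be replayed into the new shards (the server address, destination, and table filter are placeholders):

```java
import java.net.InetSocketAddress;
import com.alibaba.otter.canal.client.CanalConnector;
import com.alibaba.otter.canal.client.CanalConnectors;
import com.alibaba.otter.canal.protocol.Message;

public class BinlogSync {
    public static void main(String[] args) throws InterruptedException {
        CanalConnector connector = CanalConnectors.newSingleConnector(
                new InetSocketAddress("127.0.0.1", 11111), "example", "", "");
        connector.connect();
        connector.subscribe("mydb\\.t_order.*"); // tables being migrated
        while (true) {
            Message message = connector.getWithoutAck(100); // fetch a batch
            long batchId = message.getId();
            if (batchId == -1 || message.getEntries().isEmpty()) {
                Thread.sleep(500);               // nothing new, back off briefly
            } else {
                // Parse the entries and write the changed rows to the target shards here.
            }
            connector.ack(batchId);              // confirm the batch was handled
        }
    }
}
```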

How to switch the data over safely is discussed in other chapters.

Adequate testing

Sharding must be tested thoroughly, and every SQL statement must be strictly validated. If unit tests or automated testing tools are available, complete coverage is necessary. Any data routed incorrectly, especially inserts, deletes, or updates, can create a lot of trouble.

During the testing phase, write the validation output to a separate log file; after sufficient testing, review that log for data that flowed to the wrong place.

SQL re-inspection

A SQL recheck is strongly recommended: verify the correctness of each statement against its functional description, in other words, a review.

Rehearse

Rehearse the solution several times in an offline environment to make sure everything is safe.

Develop new SQL specifications

After the split, the SQL in the project is shackled and can no longer be written freely. Many operations that normally work may not work in a sharded environment. So even after adequate testing, every SQL statement involved should go through a validation process before going live.

Digression

Don't take on this job without backing.

Sharding databases and tables is a strategic technical decision; in many cases it cannot be rolled back, or the rollback plan is complex. If the tables to be split involve multiple business parties and the company's technical staff are a tangle, the CTO should personally lead the coordination, supervised by a professional and careful architect. Without the backing of such a coordinator you will end up in an awkward position, with the process out of control and the delivery at risk.

Only those who have truly been through it know the pain!