Regarding database usage, there have been several trends at JD. In the early days, JD mainly used SQL Server and Oracle, with a small amount of MySQL. As business and technical experience accumulated, and driven by licensing costs, many businesses began to use MySQL, and many core businesses that originally ran on SQL Server and Oracle gradually began migrating to MySQL as well. A standalone MySQL instance often cannot support such businesses, so distributed solutions are needed. In addition, as data volume and traffic grow, businesses that started on MySQL also hit bottlenecks and eventually consider adopting a distributed solution. The trend at JD is shown in Figure 1.

Figure 1 Evolution of database usage by services

Distributed database solutions are common across Internet companies. The essence is to spread data across multiple storage nodes so as to relieve the pressure on any single node. The split can also be done at the business level according to the characteristics of the business itself, as shown in Figure 2. Suppose you have a user table split by ID into two shards: in the simplest case, rows with odd IDs land on one storage node and rows with even IDs land on the other. The actual deployment is shown in Figure 3.
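The odd/even split described above is just modulo routing on the split key. A minimal sketch (the function name `shard_for` is illustrative, not from the original system):

```python
def shard_for(user_id: int, shard_count: int = 2) -> int:
    """Route a row to a shard by its split key: here, user id modulo shard count."""
    return user_id % shard_count

# Odd ids land on shard 1, even ids on shard 0.
rows = [1, 2, 3, 4]
placement = {uid: shard_for(uid) for uid in rows}
```

With more shards, the same rule generalizes to `id % shard_count`, which is why the split key must appear in a query for single-shard routing.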

Besides splitting at the business level, more general solutions can be considered, which fall into two categories. One is the client-side solution, in which a dedicated client library is embedded in the business application, and that library handles splitting queries and aggregating results. This approach is somewhat invasive to the business; because a large number of application instances connect directly, the database may face heavy connection-count pressure, and version upgrades are difficult. The advantage is that the link is short: application instances connect directly to the database.

Figure 2 Schematic diagram of data splitting

The other is the middleware solution, in which a proxy is provided that is compatible with the database's wire protocol and syntax. The business can connect to the middleware with a traditional JDBC client, which greatly reduces the burden on business development. The drawbacks are that the middleware is somewhat harder to develop than a client library, and that it adds an extra hop on the network path, which in theory slightly affects performance. In practice these systems are accessed over the data-center intranet, and the network impact is negligible.

Figure 3 System deployment diagram

Based on the above analysis, in order to better support JD's massive businesses with large data volumes, we developed a distributed database middleware solution compatible with the MySQL protocol, which we call JProxy. After several rounds of evolution, this solution ultimately supported JD Group's migration off Oracle/SQL Server.

The first version of JProxy is shown in Figure 4. Each JProxy has a configuration file in which we configure the table-splitting and routing information for the corresponding business. After receiving a SQL statement, JProxy parses it and then, according to the routing information, decides whether the statement needs to be rewritten and which nodes it should be sent to. After each node returns its results, the results are merged and returned to the application according to the MySQL transport protocol.

For example, `select * from user where id = 1 or id = 2;` is rewritten into two statements: `select * from user where id = 1;` sent to one node and `select * from user where id = 2;` sent to the other. Finally, the two records obtained from the two nodes are returned to the application together.
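The rewrite step can be sketched as follows. This is a toy planner, assuming equality predicates on the split key only; a real parser handles full SQL, but the routing decision is the same:

```python
import re

def rewrite_or_query(sql: str, shard_count: int = 2):
    """Split a 'WHERE id = a OR id = b' query into one single-key
    statement per target shard (simplified: equality on the split key only)."""
    ids = [int(m) for m in re.findall(r"id\s*=\s*(\d+)", sql, re.IGNORECASE)]
    plan = {}
    for i in ids:
        shard = i % shard_count                  # same modulo rule as the data split
        plan.setdefault(shard, []).append(f"SELECT * FROM user WHERE id = {i}")
    return plan

plan = rewrite_or_query("SELECT * FROM user WHERE id = 1 OR id = 2")
```

Each shard then executes only the statements whose key values it owns, and the proxy merges the per-shard result sets.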

This scheme is feasible when a service has only a few tables. As services grow, the number of tables keeps increasing; in particular, when migrating a service off Oracle, a few tables may be switched in one round and a few more in the next, which requires frequently modifying the configuration files. In addition, at least two (often more) JProxy instances are deployed, as shown in Figure 5, which raises the problem of keeping all configuration files exactly consistent through constant modification. In the early days of operations we modified one configuration file by hand and then copied it to the other JProxy instances to keep their contents consistent. This process is mentally taxing and error-prone.

Figure 4 Version 1

Figure 5 Configuration file

In later versions we introduced the JManager module, which manages the routing metadata that used to live in the configuration files, as shown in Figure 6. JProxy now obtains its routing metadata uniformly through JManager: we only need to add or modify routing metadata in the metadata database through JManager and, once the operation completes, notify each JProxy to dynamically reload its routing information, ensuring that the routing information of every JProxy is exactly consistent. This solves the pain point of keeping routing metadata consistent.
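The update-then-notify flow can be sketched as below. The class names and metadata shape are hypothetical; the point is that every proxy reloads the same versioned snapshot, so no proxy can drift:

```python
class Proxy:
    """A JProxy-like node that holds a cached copy of the routing metadata."""
    def __init__(self):
        self.routes, self.version = {}, -1

    def reload(self, routes, version):
        self.routes, self.version = dict(routes), version


class MetaStore:
    """A JManager-like central store: the single writer of routing metadata."""
    def __init__(self):
        self.routes = {}      # table -> route description
        self.version = 0
        self.proxies = []

    def register(self, proxy):
        self.proxies.append(proxy)
        proxy.reload(self.routes, self.version)

    def update_route(self, table, route):
        self.routes[table] = route
        self.version += 1
        for p in self.proxies:          # notify every proxy to reload
            p.reload(self.routes, self.version)


manager = MetaStore()
p1, p2 = Proxy(), Proxy()
manager.register(p1)
manager.register(p2)
manager.update_route("user", {"split_key": "id", "shards": ["shard1", "shard2"]})
```

After `update_route` returns, `p1` and `p2` hold identical route tables at the same version.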

Figure 6 Version 2

There are two ways to expand capacity. The first is what we call the re-sharding scheme: simply put, one shard is split into two, two into four, and so on, as shown in Figure 7. Originally there is one MySQL instance holding shards shard1 and shard2; we then add a new MySQL instance, split shard1 into shard11 and shard12, split shard2 into shard21 and shard22, and place some of them on the new instance. This expansion mode is the most ideal, but it is somewhat troublesome to implement.

In the short term we chose a more conservative expansion mode that, given reasonable forecasting, is enough to support business growth. We call it the pre-sharding scheme: split the data in advance into enough shards for the foreseeable future. When data volume is small in the early stage, these shards can live on one or a few MySQL instances; when data volume grows later, new MySQL instances can be added to the cluster and existing shards migrated onto them, as shown in Figure 8. We split the data into four shards, shard1 through shard4, all initially on one MySQL instance. When the data volume grew, we added a new MySQL instance and migrated shard3 and shard4 onto it. The number of shards in the cluster did not change, but the capacity doubled.
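The pre-sharding idea reduces scale-out to editing a shard-to-instance placement map; no rows are resplit. A minimal sketch (instance names are made up for illustration):

```python
# Fixed shard set decided up front; capacity grows by moving whole shards.
placement = {"shard1": "mysql-a", "shard2": "mysql-a",
             "shard3": "mysql-a", "shard4": "mysql-a"}

def scale_out(placement, shards_to_move, new_instance):
    """Move the named shards to a new MySQL instance.
    The shard count never changes, only where each shard lives."""
    return {s: (new_instance if s in shards_to_move else inst)
            for s, inst in placement.items()}

after = scale_out(placement, {"shard3", "shard4"}, "mysql-b")
```

Because routing is keyed by shard, only this map (the routing metadata) changes after a migration; application SQL is untouched.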

Figure 7 Re-sharding scheme

Figure 8 Pre-sharding scheme

The pre-sharding scheme achieves capacity expansion through migration. Changing a shard's location involves a series of steps such as data migration, verification, and routing-metadata changes, so we introduced the JTransfer system, as shown in Figure 9. JTransfer achieves seamless online migration: you only need to submit a migration plan specifying which shard to move from which source instance to which target instance, and when the migration task should start; when the time comes, the system starts the migration automatically. The migration involves the shard's full base data plus the incremental data generated by business traffic during the migration. At the start, the full base data is dumped from the source instance and restored on the target instance; after it is verified, the incremental data begins to be applied. When the remaining increment is small enough that the system predicts it can finish quickly, we perform a brief lock to guarantee the final increment is fully applied. The lock duration is a parameter specified when the migration task is submitted, for example a maximum of 20 seconds; if, due to a sudden traffic surge, the remaining increment cannot be applied within those 20 seconds, the whole migration task is abandoned, ensuring minimal impact on online traffic. After the migration completes, the routing metadata is modified and pushed to all JProxy instances; finally the lock is released, and traffic is routed to the shard's new location.
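The decision logic can be modeled as below. This is a simplified simulation under stated assumptions (lag measured in rows, a constant apply rate, a surge modeled as extra rows arriving at lock time), not JTransfer's actual implementation:

```python
def migrate(initial_lag, apply_rate, max_lock_seconds=20, surge_rows=0):
    """Simulate JTransfer's cut-over decision.
    Phase 1 (full copy) is assumed done. Phase 2: apply increments until the
    remaining tail could drain inside the lock budget. Phase 3: take the short
    lock; if the tail (plus any surge) still exceeds the budget, abort."""
    lag = initial_lag
    while lag > apply_rate * max_lock_seconds:
        lag -= apply_rate                    # one second of catch-up applied
    lag += surge_rows                        # traffic spike while locking
    if lag / apply_rate > max_lock_seconds:
        return "aborted"                     # give up the whole task
    return "switched"                        # drain tail, push new routing, unlock
```

With a 20-second budget and 10 rows/s apply rate, a steady workload cuts over, while a 500-row surge at lock time forces an abort; online traffic is never blocked longer than the budget.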

Figure 9 Version 3

When running the system in production, many deployment and operations matters must be considered beyond the above. The first is keeping the system alive: building in self-protection, ensuring stability, and ensuring that performance meets business requirements.

In JProxy we adopt an event-driven network I/O model and take advantage of multi-core hardware to maximize the performance of the whole system. Under load testing, JProxy's throughput grows almost linearly with the number of MySQL instances, and the machines hosting JProxy remain unpressured throughout.

Guaranteeing performance is not enough; we also need to control the number of connections, the system's memory usage, and so on. Connection-count control is easy to understand. Memory control mainly refers to bounding how much memory the system needs while running, especially when pulling data: for a full-table query like `select * from table`, all the backend MySQL instances send data to the middleware concurrently over many connections, while there is only one connection from the middleware back to the application. If memory is not controlled, the middleware will run out of memory (OOM). In our implementation, we control the flow rate on the front and back ends of the middleware by holding data back in the TCP stack, keeping the memory of the whole system within a controllable range.
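The backpressure mechanism can be modeled with a bounded buffer. In this sketch, "stop reading from the backend" stands in for leaving data unread in the kernel's TCP receive buffer, which is how the real flow control works; the class and method names are illustrative:

```python
from collections import deque

class BoundedRelay:
    """Model of the middleware relay: many backend shards push result rows,
    one frontend connection drains them. When the buffer is full we stop
    accepting backend reads, so middleware memory stays bounded."""
    def __init__(self, max_rows):
        self.buf = deque()
        self.max_rows = max_rows

    def backend_ready(self):
        return len(self.buf) < self.max_rows   # read from backend only when there is room

    def push(self, row):
        if not self.backend_ready():
            raise BufferError("would exceed memory budget")
        self.buf.append(row)

    def drain_one(self):
        return self.buf.popleft()              # frontend consumed a row


relay = BoundedRelay(max_rows=2)
relay.push("row1")
relay.push("row2")
paused = not relay.backend_ready()   # buffer full: backend reads pause
relay.drain_one()
resumed = relay.backend_ready()      # room again: backend reads resume
```

Because the backends block (rather than the proxy buffering unboundedly), a `select * from table` that fans out to every shard cannot OOM the middleware.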

Access control also needs consideration: which IPs may access which IPs must be precisely controlled, down to per-table permissions for insert, delete, update, and select. We suggest that businesses write SQL carrying the split field wherever possible, so that each statement lands on a single shard and access stays simple and controllable. We provide fine-grained permission control at the table level, including checking whether a statement carries the split field, achieving maximal control over SQL. This ensures that SQL not meeting expectations is caught during the testing stage, greatly reducing the risk to later online operation.
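A rule check of this kind might look as follows. The rule schema and function are hypothetical illustrations of the checks described above (source IP, per-table operation grants, split-key requirement), not the actual JProxy configuration format:

```python
import re

# Hypothetical rule: on order_db.user, only SELECT/INSERT are granted,
# and every statement must carry the split key (here, the id column).
RULES = {("order_db", "user"): {"ops": {"SELECT", "INSERT"},
                                "require_split_key": True}}

def check(ip_allowed, db, table, op, sql):
    """Reject a statement unless the source IP is allowed, the operation is
    granted on this table, and (when required) the split key appears in it."""
    if not ip_allowed:
        return False
    rule = RULES.get((db, table))
    if rule is None or op not in rule["ops"]:
        return False
    if rule["require_split_key"] and not re.search(r"\bid\s*=", sql, re.IGNORECASE):
        return False
    return True
```

A statement without the split key, or an ungranted operation, is rejected in the test environment long before it can cause a shard-wide scan in production.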

In addition to basic stability, high availability must be considered across the whole system. JProxy is stateless: a service deploys at least two JProxies across racks in the same equipment room for high availability within the room, and two more JProxies in another room for cross-room high availability. Beyond the middleware itself, high availability at the database level must also be ensured; only full-link high availability is true high availability. The database layer is deployed as one master and one slave in the primary room, with another standby in the backup room, as shown in Figure 10. JProxy accesses MySQL through a domain name: if the master fails, JProxy reaches the new master after the master-slave switchover, and if the whole room's databases fail, the domain name is switched directly to the backup room so that JProxy reaches the databases there. Services likewise access JProxy through a domain name; if the JProxies in one room fail, the front-end domain name is switched directly to the backup room, just as for the databases, ensuring that services can still reach JProxy.

High data reliability is also a key point. We regularly back up the databases and store the backups in a dedicated storage system, ensuring that data can be restored even if it is deleted.

Figure 10 Deployment diagram

Monitoring and alerting are extremely important once the system runs online. Monitoring is layered, as shown in Figure 11, from host and operating-system information, to application-system information, to specific metrics inside a particular system. For the operating system and host, JD's MJDOS system brings disk/network/memory/CPU/load and other machine metrics into the monitoring system. This OS-level information is critical for diagnosing anomalies; for example, a service anomaly caused by network packet loss can be detected promptly in the monitoring system.

JD's UMP system provides unified monitoring and alerting, mainly for all service application systems. Each application system exposes interfaces according to certain rules and registers with UMP, after which UMP provides a complete set of monitoring and alerting services, most basically liveness monitoring and slow-query detection.

In addition to these two basic monitoring systems, we developed JMonitor, a customized monitoring system for the entire middleware stack. We built it because we need more customized monitoring information in order to locate problems immediately when the system misbehaves. For example, when a service sees its TP99 degrade, slow SQL is often involved. From sending a statement to receiving its results, a request passes from the application through JProxy to MySQL, and back from MySQL through JProxy to the application. Any hop on this path may be slow. We record slow SQL in fine detail, down to how much time each stage spent, so that when slow SQL appears we can quickly and accurately find the root cause and fix it.
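Per-stage attribution can be sketched as a list of timed hops. The stage names and the `timed_query` helper are illustrative; the real system records timestamps inside the proxy rather than wrapping callables:

```python
import time

def timed_query(stages):
    """Time each named hop of a query's path and report the slowest one,
    so a slow query can be blamed on a specific stage."""
    timings = {}
    for name, fn in stages:
        start = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - start
    slowest = max(timings, key=timings.get)
    return timings, slowest

timings, slowest = timed_query([
    ("parse", lambda: None),
    ("mysql", lambda: time.sleep(0.05)),   # simulate a slow backend hop
    ("merge", lambda: None),
])
```

When TP99 degrades, this kind of breakdown immediately says whether the time went into parsing, the MySQL round trip, or result merging.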

In addition, when helping businesses migrate off Oracle/SQL Server, we do not recommend cross-shard transactions. A situation still arises where every statement in a transaction carries the split field and each lands on a single node, yet the statements together span multiple shards, making the transaction cross-shard. Such transactions are well recorded, and the business side can find them directly through JMonitor for further improvement. Also, in the test environment the business may not have optimized its SQL much at first, which can produce a lot of slow SQL; JMonitor collects and analyzes these uniformly, helping the business side iterate and tune its statements quickly.

Figure 11 Monitoring system

As shown in Figure 12, a single SQL statement without the split field is sent to all shards. If there are 64 shards on a single MySQL instance, one statement from the application fans out into 64 statements on that instance, which is a significant resource consumption. If the SQL can be made to land on a single shard, the connection pressure on the MySQL instance is greatly reduced.
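The fan-out arithmetic is worth making explicit. A rough sketch under simple assumptions (each proxy issues one backend statement per shard touched; the numbers are illustrative):

```python
def backend_statements(proxies, shards_hit, per_shard=1):
    """Backend statements (and thus busy connections) one query wave can
    consume: an unrouted SQL touches every shard; a split-key SQL touches one."""
    return proxies * shards_hit * per_shard

broadcast = backend_statements(proxies=4, shards_hit=64)  # no split key
targeted  = backend_statements(proxies=4, shards_hit=1)   # split key present
```

Here the same logical query costs 256 backend statements when broadcast but only 4 when routed by the split key, a 64x difference in pressure on the MySQL instances.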

Figure 12 Connection number

Cross-shard distributed transactions should be avoided as far as possible. First, a distributed database middleware built on MySQL cannot guarantee strict distributed-transaction semantics; second, even if strict semantics could be achieved, cross-shard transactions should still be avoided: if several cross-shard distributed transactions deadlock on one shard, transactions on the other shards cannot proceed either, directly causing a large-scale deadlock. Even single-node transactions should be kept as small as possible to reduce the probability of deadlock.

Taking JD's sorting centers as an example, the centers vary greatly in size. Sorting centers in big cities such as Beijing and Shanghai hold large amounts of data, while those in other cities are relatively small, so we customize routing policies for them: data for a large sorting center can be placed on a specific high-performance MySQL instance, while data for the smaller centers is handled with the ordinary splitting scheme.

At the JProxy system level we can support multi-tenant mode. However, considering that the Oracle/SQL Server businesses are often very important and have large data volumes, we deploy a separate set for each business to avoid interference between businesses at the deployment level. Since fully independent deployment wastes resources, we introduced a container system that isolates operating-system resources through containers, ensuring full utilization of system resources. Many problems need not be solved in code: what is troublesome or cannot be controlled one hundred percent at the code level can be addressed at the architecture level; what the architecture cannot solve can be handled at the deployment level; and what deployment cannot solve can be handled at the product level. There are many ways to solve a problem, and they must be weighed from the perspective of the system as a whole. To borrow Deng Xiaoping's words, "it doesn't matter whether a cat is black or white; a cat that catches mice is a good cat." In the same way, a system that can support business development is a good system.

Finally, we briefly discuss why a distributed database middleware system based on MySQL cannot guarantee strict distributed-transaction semantics. Distributed-transaction semantics are essentially transaction semantics: the ACID properties of atomicity, consistency, isolation, and durability.

Atomicity means a transaction either succeeds or fails entirely; there is no intermediate state. Durability means that once a transaction has committed, it remains committed even after a system crash. Isolation means that concurrent transactions are isolated from and invisible to each other; database implementations offer multiple isolation levels. Consistency means the system stays in a consistent state: for example, if 500 yuan is transferred from account A to account B, the total amount in the system does not change; it cannot be the case that 500 yuan has been deducted from A but not yet added to B.

Figure 13 Serializable scheduling

Transaction execution in a database raises the problem of serializable scheduling. Suppose there are three transactions T1, T2, and T3; the effect of executing them concurrently should equal that of some serial execution, i.e. one of {T1/T2/T3, T1/T3/T2, T2/T1/T3, T2/T3/T1, T3/T1/T2, T3/T2/T1}. For distributed transactions, serializable scheduling requires that the scheduling order of the sub-transactions on each node be consistent with the order of the global distributed transactions. As shown in Figure 13, the three distributed transactions T1/T2/T3 are scheduled in one database in an order consistent with the global order, but in the other database the order becomes T3/T2/T1. From the global point of view, serializable scheduling is broken, and since serializable scheduling is what guarantees isolation, isolation is broken with it. In middleware solutions built on MySQL, each sub-transaction's ID is generated on its own MySQL instance; there is no global transaction ID to ensure that each sub-transaction's scheduling order matches the global distributed transaction, so isolation cannot be guaranteed, and therefore distributed transactions built on MySQL today cannot guarantee strict distributed-transaction semantics. Of course, with the introduction of Group Replication (GR) in MySQL, strong consistency in the CAP sense can be achieved; strict distributed transactions could then be supported by strengthening the related middleware functions and customizing related MySQL features.
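The Figure 13 argument can be checked mechanically. In this simplified model (each shard's local commit order stands in for its observed serial order, ignoring which data items conflict), a cross-shard schedule is globally serializable only if one total order is consistent with every shard's local order:

```python
from itertools import permutations

def globally_serializable(per_shard_orders):
    """Return True iff some single serial order of the transactions is
    consistent with the local commit order observed on every shard."""
    txns = set().union(*map(set, per_shard_orders))
    for candidate in permutations(txns):
        pos = {t: i for i, t in enumerate(candidate)}
        if all(all(pos[a] < pos[b] for a, b in zip(order, order[1:]))
               for order in per_shard_orders):
            return True
    return False

ok  = globally_serializable([["T1", "T2", "T3"], ["T1", "T2", "T3"]])
bad = globally_serializable([["T1", "T2", "T3"], ["T3", "T2", "T1"]])
```

The second case is exactly the Figure 13 situation: no serial order satisfies both T1 before T3 and T3 before T1, so no global transaction ordering exists, which is the isolation violation described above.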