Team Introduction

The Technical Architecture Department of JD Mall – Database Technology Department is responsible for providing efficient, stable database services for JD Mall and for developing database ecosystem products, mainly covering: the AIOps database system, the database kernel, a multi-model database system, an elastic database, BinLake, JTransfer, and a data knowledge graph. Adhering to a technology-first philosophy, the department continuously innovates and advances in the database field.

Operations automation grows out of the pain points of daily work. The JD database team serves tens of thousands of engineers across the mall, and that pressure pushes us to keep changing. Change does not happen overnight: we moved step by step from manual work to scripts, then to automation, platforms, and intelligence, and each hard-won shift drove the construction of the operations system. The essence of operations automation is to liberate operations staff: raise human efficiency, reduce human error, and learn to cultivate the good habit of being "lazy".

The construction of JD's automated operation and maintenance system began in 2012. The following two aspects are introduced below:

I. JD's intelligent database operation and maintenance platform

JD's business grows explosively every year, with a huge number of database servers and thousands of product lines. Supporting such a massive business system requires a complete automated operation and maintenance management platform. The JD MySQL database management platform, DBS for short, currently covers: a complete asset management system, a database process management system, a database fault monitoring system, a database reporting system, and a flexible set of auxiliary database operation tools. It spans every aspect of a DBA's operational work, making MySQL management automated, self-service, visualized, and intelligent; it avoids production accidents caused by manual DBA errors and keeps the JD database running safely, stably, and efficiently.

The core functional components are introduced below.

Asset management is the cornerstone of automated operation and maintenance, and its accuracy directly determines the reliability of the whole database management platform. Starting from the operating habits of both the database business side and the DBA, the JD database management platform covers the equipment room, host, service, cluster, instance, library, and table dimensions (a minimal metadata sketch follows the list):

  • Equipment room and host dimension: records hardware information.

  • Service dimension: records the service name, level, and owning department.

  • Cluster dimension: records the MySQL cluster architecture information.

  • Instance dimension: records MySQL parameters to support automated operation and maintenance.

  • Library dimension: records the database name and business contact information.
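To make the dimensions concrete, here is a minimal, purely illustrative sketch of such a metadata model in Python; the class and field names are hypothetical and not DBS's actual schema:

```python
# Illustrative sketch only: a toy metadata model for the dimensions above.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Host:
    ip: str
    room: str        # equipment room / data center
    cabinet: str     # rack, used later for anti-affinity checks
    hardware: str    # CPU/memory/disk summary


@dataclass
class Instance:
    host: Host
    port: int
    role: str                                   # "master" or "slave"
    params: dict = field(default_factory=dict)  # MySQL parameters


@dataclass
class Cluster:
    name: str
    service: str                                # owning business service
    level: str                                  # service level
    contacts: List[str] = field(default_factory=list)
    instances: List[Instance] = field(default_factory=list)
```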

Faced with complicated work such as adding databases and expanding capacity, an automated installation and deployment platform can completely liberate DBAs. JD's automated deployment system currently covers applying for servers, deploying database instances, data synchronization, consistency checks, and split-and-switch operations. The whole process is streamlined, including approval by the business side and the DBA at each stage, finally achieving fully automated, process-driven deployment of MySQL services, as shown in the following figure:

The main function points include the following:

  • Installation and deployment of MySQL instances, architecture construction, and domain name application. The primary and secondary instances of a cluster must not reside in the same cabinet, and hosts with better hardware are preferred for the primary library (see the placement sketch after this list).

  • Monitor deployment, backup deployment, asset registration.

  • MySQL services are created from images, relying on the Kubernetes (K8S) image repository.

  • Application accounts are created by the application owner through the automated onboarding system.

  • Data consistency verification is usually performed at night when services are off-peak.
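As a rough illustration of the placement rule above (master and slaves never share a cabinet; better hardware is preferred for the master), here is a hypothetical Python sketch; the host fields and scoring are assumptions, not the platform's real logic:

```python
# Illustrative sketch only: host selection under the placement rules above.
def pick_hosts(hosts, replicas=2):
    """Pick a master host plus `replicas` slave hosts in distinct cabinets."""
    # best hardware first, so the master lands on the strongest host
    ranked = sorted(hosts, key=lambda h: h["hw_score"], reverse=True)
    master = ranked[0]
    slaves, used_cabinets = [], {master["cabinet"]}
    for h in ranked[1:]:
        if h["cabinet"] not in used_cabinets:
            slaves.append(h)
            used_cabinets.add(h["cabinet"])
        if len(slaves) == replicas:
            return master, slaves
    raise RuntimeError("not enough cabinets for anti-affinity placement")


hosts = [
    {"ip": "10.0.0.1", "cabinet": "A1", "hw_score": 95},
    {"ip": "10.0.0.2", "cabinet": "A2", "hw_score": 80},
    {"ip": "10.0.0.3", "cabinet": "B1", "hw_score": 85},
]
master, slaves = pick_hosts(hosts)
```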

JD's intelligent analysis and diagnosis covers four important parts: collection of database monitoring metrics, diagnostic analysis, fault self-healing, and trend analysis:

(1) Monitoring system

The monitoring system provides an accurate data basis for database management and puts the running state of the production service system at the operator's fingertips. The core monitoring metrics include OS load, MySQL core metrics, and database logs. By analyzing the monitoring information, we can judge the state of a monitored database, predict possible problems, and propose optimizations, keeping the whole system stable and efficient.

JD's distributed monitoring system works in passive mode, with both the server and the agent deployed for high availability to prevent single points of failure. The overall architecture and flow chart is shown below:

(2) Monitoring performance analysis

Intelligent performance analysis is mainly a secondary analysis of database monitoring data, aimed at eliminating hidden risks. In production, some hidden dangers never reach the configured alarm threshold and hover just below the critical point; this is actually the most dangerous situation, since they can erupt at any time. To catch them, we aggregate and analyze the monitoring data period-over-period, year-over-year, and by TOP-N metrics, discovering hidden dangers in advance.
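As an illustration of this idea, the sketch below flags a metric that is still under the alarm threshold but has grown abnormally against its period-over-period and week-over-week baselines; the thresholds and data layout are hypothetical:

```python
# Illustrative sketch only: flagging "hidden dangers" below the alarm threshold
# by comparing against the same time yesterday and the same time last week.
def find_hidden_danger(series, now, alarm_threshold, growth_limit=1.5):
    """series: {unix_timestamp: value}, sampled on a fixed grid."""
    current = series[now]
    if current >= alarm_threshold:
        return "alarm"                    # already handled by normal alerting
    day = 24 * 60 * 60
    prev_day = series.get(now - day)      # period-over-period baseline
    prev_week = series.get(now - 7 * day) # week-over-week baseline
    for baseline in (prev_day, prev_week):
        if baseline and current > baseline * growth_limit:
            return "hidden_danger"        # near the critical point, report early
    return "ok"
```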

Slow SQL analysis:

Index analysis:

Spatial analysis and prediction:

Lock analysis:

(3) Fault self-healing

Faults take many forms, and the core of self-healing depends on auxiliary analysis of monitoring data: providing the most accurate root-cause information. The steps are as follows (a minimal sketch follows the list):

  • Alarm filtering: filter out unimportant and duplicate alarms

  • Derived alarm generation: generate derived alarms from the association relationships

  • Alarm association: associate derived alarms of different types that fall within the same time window

  • Weight calculation: compute the likelihood that each alarm is the root, from preset per-alarm weights

  • Root alarm generation: mark the derived alarm with the largest weight as the root alarm

  • Root alarm merging: if several alarm types share the same root alarm, merge them
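A minimal sketch of this pipeline, with hypothetical alarm fields and weights, might look like the following; it is only meant to make the steps concrete, not to reproduce the platform's real implementation:

```python
# Illustrative sketch only: the root-alarm pipeline reduced to its essentials.
from collections import defaultdict


def find_root_alarms(alarms, weights, window=60):
    """alarms: list of dicts with 'type', 'ts' (seconds), and 'source'."""
    # 1) filter out unimportant (unweighted) and duplicate alarms
    seen, filtered = set(), []
    for a in alarms:
        key = (a["type"], a["source"])
        if a["type"] in weights and key not in seen:
            seen.add(key)
            filtered.append(a)
    # 2) associate alarms that fall into the same time window
    buckets = defaultdict(list)
    for a in filtered:
        buckets[a["ts"] // window].append(a)
    # 3) in each window, the alarm with the largest weight becomes the root
    roots = {}
    for bucket in buckets.values():
        root = max(bucket, key=lambda a: weights[a["type"]])
        # 4) windows that produce the same root alarm are merged
        roots[(root["type"], root["source"])] = root
    return list(roots.values())
```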

JD's database server fleet is very large, which makes failures correspondingly more likely, while the requirements on system stability are all the more demanding. To guarantee database high availability and continuous 24/7 service, our team independently developed a database automatic switchover platform supporting both automatic and semi-automatic switchover modes, covering multi-dimensional switchover scenarios at the single-cluster, multi-cluster, and machine-room levels. The switchover process updates monitoring information, asset information, backup policies, and primary/secondary roles in one click, avoiding secondary faults caused by human error.

(1) Distributed detection

As the core component of the switchover system, distributed detection mainly solves the disaster recovery (DR) problem. Matching the multi-datacenter deployment of JD's database servers, each independent data center has a detection node, distinguished by a specially identified interface domain name. When a switchover is triggered, the switchover system randomly selects two detection interfaces, based on the IP address of the faulty host, to probe it: if either node finds the host alive, the host is considered alive; only if both nodes report it down is the host judged down.
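A minimal sketch of this two-node liveness decision, assuming a hypothetical `probe()` call against a detection node's interface, could look like:

```python
# Illustrative sketch only: a host is alive if either detection node sees it,
# and judged down only when both agree. `probe(node, ip)` is a hypothetical
# stand-in for the per-datacenter detection interface.
import random


def is_host_alive(host_ip, detection_nodes, probe):
    node_a, node_b = random.sample(detection_nodes, 2)
    return probe(node_a, host_ip) or probe(node_b, host_ip)
```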

(2) Master failover

If the primary database instance fails, the switchover system first checks instance liveness through the distributed detection system. Once the failure is confirmed, the system chooses automatic or manual switchover according to the switching flag in the instance's basic information; the two modes follow the same principle, except that after the switchover task is created, manual switchover requires the DBA to press the switch button. During the switchover, a test row is inserted to verify the instance's real running state, ruling out a hung instance or a read-only disk. If no slave library survives, the operation is abandoned and the DBA is notified by email and SMS. The new master is selected local-first, remote-later, preferring the fewest connections and breaking ties by the lowest QPS load (a selection sketch follows the example below). After the switchover succeeds, the corresponding metadata is updated, as shown in the following example:

In a cluster with one master and four slaves, the master 10.66.66.66:3366 fails and must be switched over as follows:

  • If the monitoring system detects that the primary database is down, it automatically creates a switchover task and performs an automatic switchover or manual switchover. The manual switchover is used as an example.

  • Select the target instance. All four slaves in the example are alive, so by the local-first, remote-later principle the local slaves 10.66.66.68:3366 and 10.66.66.69:3366 are compared by connection count; with the counts equal, QPS decides, and 10.66.66.69:3366, with the lower QPS load, is selected as the target instance:

  • Result of switchover:
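The selection rule in this example can be sketched as follows; the data layout is hypothetical, but the logic mirrors the description above (local first, fewest connections, QPS as tie-breaker):

```python
# Illustrative sketch only: new-master selection under "local first, remote
# later; fewest connections first, lowest QPS as tie-breaker".
def choose_new_master(slaves, main_room):
    """slaves: list of dicts with 'addr', 'room', 'connections', 'qps', 'alive'."""
    alive = [s for s in slaves if s.get("alive")]
    if not alive:
        raise RuntimeError("no surviving slave; abandon switchover, notify DBA")
    local = [s for s in alive if s["room"] == main_room]
    candidates = local or alive          # fall back to remote only if needed
    return min(candidates, key=lambda s: (s["connections"], s["qps"]))


slaves = [
    {"addr": "10.66.66.68:3366", "room": "LF", "connections": 120, "qps": 900, "alive": True},
    {"addr": "10.66.66.69:3366", "room": "LF", "connections": 120, "qps": 300, "alive": True},
    {"addr": "10.77.77.70:3366", "room": "MJQ", "connections": 50, "qps": 100, "alive": True},
]
# equal connection counts locally, so the lower-QPS 10.66.66.69:3366 wins
print(choose_new_master(slaves, main_room="LF")["addr"])
```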

(3) Slave Failover

If a secondary library instance fails, its domain name is moved to a healthy instance in the cluster; the target instance is selected by the same rule as for primary failover. Whether the switchover succeeds or fails, the responsible DBA is notified by email and SMS. After the failed instance recovers, the DBA decides whether to switch back. An example follows:

The slave library 10.88.88.89:3366 in a cluster with one master and four slaves is faulty and needs to be switched as follows:

The monitoring system automatically creates a task, applies the local-first, remote-later principle, checks connection counts and QPS, determines the target instance as 10.88.88.88:3366, and switches over automatically. The DBA can view the details in the switchover task list.

After the task succeeds, a switch-back button is displayed; the DBA can perform the switch-back and view its details.

(4) Planned switchover between master and slave

The planned primary/secondary switchover implements batch switchover by single cluster or multiple clusters. When performing batch switchover, you can view the detailed steps for switching subtasks. The architecture before and after the switchover is compared as follows:

Cluster 1

Create tasks in batches. By the same selection principle (local before remote, connection count before QPS), the master 10.66.66.66:3366 is switched to the new target master 10.88.88.89:3366.

Batch switchover:

Detailed information about subtask switchover can be viewed, including the switchover result, execution procedure, and architecture of each subtask:

All functional modules of the JD MySQL database switchover system have been componentized, which simplifies the DBA's workflow and shortens the database switchover time.

(1) Architectural design

From the beginning, the JD database backup system was designed to free DBAs from tedious backup management: automate the processing, reduce human intervention, and improve the availability of backup files. For backup-file availability, a polling recovery-test policy ensures that every cluster's backups are restore-tested within a set cycle. The system architecture is shown in the figure below:

The architecture has the following characteristics:

Diverse scheduling triggers

The scheduling center supports interval, crontab, and date triggers (see the sketch after this list):

  • Interval schedules a task periodically at a fixed interval, specified in weeks, days, hours, minutes, and seconds; a start time, end time, and time zone can be set.

  • Crontab behaves like the Linux crontab and supports year, month, day, week, day_of_week, hour, minute, and second; a start time, end time, and time zone can be set.

  • Date is a one-off schedule that runs at a fixed moment and supports time zone settings.
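Since the scheduling center is built on APScheduler (discussed below), the three trigger types map directly onto its API. A minimal sketch, with placeholder job bodies and hypothetical dates:

```python
# Illustrative sketch only: the three APScheduler trigger types named above.
from apscheduler.schedulers.background import BackgroundScheduler

sched = BackgroundScheduler(timezone="Asia/Shanghai")

# interval: run every 6 hours between a start and end time
sched.add_job(lambda: print("backup"), "interval", hours=6,
              start_date="2018-01-01 00:00:00", end_date="2018-12-31 23:59:59")

# cron: same semantics as the Linux crontab, e.g. 02:30 every day
sched.add_job(lambda: print("restore test"), "cron", hour=2, minute=30)

# date: a one-off run at a fixed moment
sched.add_job(lambda: print("one-time task"), "date",
              run_date="2018-06-18 00:00:00")

sched.start()
```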

Concurrency control

Because scheduled tasks are unevenly distributed, many tasks may fire at the same moment, which can easily overwhelm the scheduling system. Controlling the number of concurrently executing tasks lets scheduling run much more smoothly.

Layered triggering and execution

Triggering a task is lightweight, while executing one is generally heavy, so triggering and execution are designed as separate layers; this prevents a long-running execution from delaying subsequent triggers.

Tasks are not lost during maintenance

Linux crontab will not, after a restart, execute the tasks that should have run during maintenance; the APScheduler-based scheduling center, by contrast, runs tasks missed within a specified grace interval after startup, reducing executions lost to maintenance.
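The three points above map naturally onto APScheduler's configuration options; here is a minimal sketch, with hypothetical numbers:

```python
# Illustrative sketch only: APScheduler settings matching the points above.
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.executors.pool import ThreadPoolExecutor

sched = BackgroundScheduler(
    # concurrency control: cap how many jobs run at once
    executors={"default": ThreadPoolExecutor(max_workers=10)},
    job_defaults={
        "max_instances": 1,          # never run the same job concurrently
        "coalesce": True,            # collapse a backlog into a single run
        "misfire_grace_time": 3600,  # run jobs missed during maintenance
    },
)
```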

Add, delete, modify, and query backup policies

The company's previous backup system had to be pointed at a specific IP address, which often caused backup failures during server maintenance. The new backup policy was therefore combined with high availability from the start: a policy specifies a domain name rather than an IP. During a failover, DBS moves the slave-library domain name to another slave in the cluster, and the corresponding backup moves with it, keeping the backup server available.

Automatic retry on failure

Backups occasionally fail, so a retry function was added: a failed backup job is retried up to three times within six hours, achieving a higher overall backup success rate.
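A minimal sketch of this retry policy, with a hypothetical `run_backup` hook and an assumed pause between attempts:

```python
# Illustrative sketch only: retry a failed backup up to three times within a
# six-hour window, then escalate.
import time


def backup_with_retry(run_backup, retries=3, window=6 * 3600, pause=600):
    deadline = time.time() + window
    for attempt in range(1, retries + 1):
        if run_backup():
            return True
        if time.time() + pause > deadline:
            break
        time.sleep(pause)  # wait before the next attempt
    return False           # hand off to the automatic repair system / DBA
```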

Automatic recovery detection

Every step of a backup is strictly verified, but verification alone cannot guarantee that a backup file is usable. We therefore introduced an automatic recovery-detection mechanism that restore-tests backup files for the DBA, promptly discovering files that turn out unusable for reasons not anticipated during backup. Since restore testing and auditing are hard requirements, automatic recovery detection also relieves DBAs of this heavy testing work.

(2) Scheduling design

The whole automatic backup and recovery system consists of a scheduling system, a backup system, a recovery system, a recovery-detection system, and an automatic repair system. The scheduling system is the core, coordinating the operation of all the others; it can be deployed with a standby node for high availability, and its executors can be deployed as a cluster for both high availability and horizontal scaling.

Before each backup, the backup system checks the instance's health and the backup's running status, preventing invalid database instances from being backed up. The recovery system is used when data recovery or elastic capacity expansion requires restoring a backup file into a running database instance; a DBA can complete data recovery with a simple operation. Recovery detection, directed by the scheduling system, automatically tests the availability of backup files, helping DBAs find unusable files in time. Some backup failures are resolved by automatic retry, but others cannot be fixed by retrying and need repair, so an automatic repair system was developed to fix backup failures caused by environment problems.

The scheduling system is the brain of the whole backup and recovery system. We evaluated several implementations, including Linux crontab, Azkaban, and the Python open-source framework APScheduler, and concluded that APScheduler is the most flexible and compact, with the most diverse scheduling modes, and that a Python implementation would be cheaper to maintain later on; so the scheduling center was built on APScheduler.

(3) Front end of the system

The front end consists of four modules: backup policy management, backup details, backup blacklist management, and recovery details.

Backup policy management:

The backup policy management page shows the distribution of backup states, storage usage, and the current backup policy state of each cluster. An existing backup policy can be modified here (time, server, backup type), paused or resumed, and deleted; a cluster without a backup policy can have one added.

Backup details:

Backup details show the total number of recent backups, the number of successful backups, the success rate, the running status of the day's backup jobs, a 24-hour backup-job distribution curve, and per-backup records. Detailed backup records can be queried by cluster name and project name, helping DBAs better understand the backup status.

Restore test details:

The recovery-detection page shows the number of recovery tests in recent days, the number of successes, a success-rate bar chart, a pie chart of the day's recovery-detection task statuses, and the recent completion rate, giving the DBA a clear overview of recovery.

II. Database transformation

JD's database services were containerized before ContainerDB. Running database services in Docker containers had already delivered basic capabilities such as fast delivery and automatic failover, improving stability and efficiency to a certain extent, but operating and using the database services remained essentially the same as in the traditional model. The typical problems were as follows:

(1) The granularity of resource allocation is too large

Database server resource standards were fixed and too coarse-grained, leaving too few resource standards for database services to choose from.

(2) Serious waste of resources

Resource allocation standards were determined by DBAs from experience, which is highly subjective and cannot be accurately matched to a business's actual situation. In addition, when allocating resources, DBAs generally assume a service should not need migration or expansion for three years, and allocating that much at once wastes resources badly. Moreover, because the fixed resource standards were so large, hosts became fragmented: often a host could hold only one container, with the remaining resources unable to satisfy any standard, leaving host resource utilization low.

(3) Resources cannot be scheduled dynamically

Once a database service was delivered, its resources were fixed; there was no online dynamic scheduling based on database load. When disk usage grew too high, a DBA had to expand capacity manually, which is inefficient.

Because of these problems, ContainerDB is no longer a simple containerized database service. We wanted to make database services smarter and let database resources flow, so that ContainerDB can deliver resources in stages. ContainerDB's flexible, load-based scheduling gives JD's database resources the intelligence to truly flow, and it has successfully served multiple 618 and 11.11 campaigns.

ContainerDB assigns each business application a logic library, which defines the KeySpace used for Sharding Key hashing across all tables of the business. Multiple tables can be created in a logic library, but every table must define a Sharding Key. The Sharding Key divides a table's data into multiple Shards, each corresponding to a KeyRange; a KeyRange is a range of sharding indexes obtained by hashing the Sharding Key and taking a modulus. Each Shard is backed by a complete MySQL master-slave setup. Applications interact only with the Gate cluster, which automatically routes writes and queries based on metadata and the SQL statement. ContainerDB's monitoring center watches the usage of all basic services and resources in real time, and hooks registered with it automatically perform dynamic capacity expansion, fault self-healing, and shard management, all completely transparent to applications.
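To make the routing concrete, here is a purely illustrative sketch of how a Sharding Key could map to a KeyRange and hence a Shard; the hash function, index space, and ranges are assumptions, and the real routing lives in the Gate cluster:

```python
# Illustrative sketch only: Sharding Key -> sharding index -> KeyRange -> Shard.
import hashlib

SHARD_BITS = 16                 # hypothetical sharding index space: 0..65535
KEYRANGES = [                   # each shard owns one KeyRange
    ((0, 21845), "shard-0"),
    ((21846, 43690), "shard-1"),
    ((43691, 65535), "shard-2"),
]


def route(sharding_key: str) -> str:
    digest = hashlib.md5(sharding_key.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % (1 << SHARD_BITS)
    for (lo, hi), shard in KEYRANGES:
        if lo <= index <= hi:
            return shard
    raise ValueError("index outside all KeyRanges")


print(route("user-10086"))  # Gate would send this row's reads/writes here
```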

(1) Continuous delivery of streaming resources

One main cause of resource waste was that the initial allocation granularity was too large: resources were provisioned up front for three or even five years. But the resource pool is finite, so not every service could pre-claim resources, and some services ended up with none at all. ContainerDB instead delivers resources continuously, in a streaming mode: each service initially receives a standard 64 GB disk, and as the business develops and data volume grows, disk capacity is increased step by step up to a 256 GB cap.

This spreads the delivery of database resources over a much longer cycle, letting us provide database services to every business before its full three- or five-year resource budget is in place, and increasing the database platform's capacity to support the business.

(2) Elastic scheduling based on load

Database services use two types of resources: instantaneous resources and incremental resources.

Instantaneous resources are resources whose usage fluctuates sharply over short periods, mainly CPU and memory.

Incremental resources are resources whose usage does not fluctuate much over short periods but grows slowly and only ever increases; mainly disk. ContainerDB applies different scheduling policies to the two types. For instantaneous resources, ContainerDB defines three standards for each database:

  • Lower limit: 2C/4G, upper limit: 4C/8G

  • Lower limit: 4C/8G, upper limit: 8C/16G

  • Lower limit: 8C/16G, upper limit: 16C/32G

Each container initially receives its standard's lower limit. When CPU load is too high or memory is insufficient, the database tries to claim CPU or memory beyond the lower limit; once the load recovers, the extra resources are released until CPU and memory return to the lower limit.

Incremental resources (disk): 64 GB of disk is allocated when a service first comes online. When disk usage reaches 80% but capacity is below the 256 GB cap, a vertical upgrade is performed; once a container's disk reaches the 256 GB cap, online Resharding is performed instead (the decision logic is sketched after the list).

  • Vertical upgrade: a resource check first verifies that the host has enough free disk for the upgrade. If the check succeeds, a global resource lock is taken on the host and the disk is expanded in place by 64 GB; vertical upgrades are instantaneous and do not affect database service. If the check fails, a new container is provisioned with the same CPU and memory and 64 GB more disk, and the database service is migrated into it.

  • Online Resharding: two new Shards are requested, with database containers of the same disk, CPU, and memory standards as the current Shard. Master/slave relationships are established among the databases of the new Shards, then the schema is copied and filtered replication starts; finally the routing rules are updated, read and write traffic is switched to the new Shards, and the old Shard's resources are taken offline.
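The decision logic above can be sketched as follows; the thresholds come from the text, while the function and field names are hypothetical:

```python
# Illustrative sketch only: the disk-scaling decision (64 GB steps up to a
# 256 GB cap, then online Resharding).
DISK_STEP_GB, DISK_CAP_GB, USAGE_THRESHOLD = 64, 256, 0.80


def plan_disk_action(disk_gb, used_gb, host_free_gb):
    if used_gb / disk_gb < USAGE_THRESHOLD:
        return "noop"
    if disk_gb < DISK_CAP_GB:
        if host_free_gb >= DISK_STEP_GB:
            return "vertical_upgrade"        # grow in place under a host lock
        return "migrate_to_new_container"    # same CPU/mem, disk + 64 GB
    return "online_resharding"               # split into two new Shards


print(plan_disk_action(disk_gb=256, used_gb=210, host_free_gb=500))
# -> online_resharding
```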

Whether for a vertical upgrade or online Resharding, one point matters: while keeping each Shard's Master in the main machine room, resources must never all be placed on one host, rack, or machine room. ContainerDB therefore provides strong affinity/anti-affinity resource allocation. Its current affinity/anti-affinity policies are as follows:

Each KeySpace has a main equipment room, and the instances belonging to one Shard (currently one master and two slaves per Shard) are placed so that: the Master sits in the main equipment room; no two instances of the Shard share a rack; and the three instances never all sit in the same IDC. This prevents a single cabinet power failure from taking out both master and slave, and prevents an IDC fault from making all of a Shard's instances unavailable.

If resources in the pool are unevenly distributed, allocation may not be able to honor the anti-affinity policy. ContainerDB therefore runs a resident background process that continuously polls all Shards in the cluster, checks whether instance placement satisfies the anti-affinity rules, and attempts to redistribute instances when it does not; to avoid affecting online service, slave libraries are redistributed first.
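A minimal sketch of the anti-affinity check for one Shard (one master, two slaves), with hypothetical field names:

```python
# Illustrative sketch only: does a Shard's placement violate the rules above?
def violates_anti_affinity(shard, main_room):
    instances = shard["instances"]
    master = next(i for i in instances if i["role"] == "master")
    if master["idc"] != main_room:
        return True                      # master must sit in the main room
    racks = [i["rack"] for i in instances]
    if len(set(racks)) < len(racks):
        return True                      # two instances share a rack
    if len({i["idc"] for i in instances}) == 1:
        return True                      # all three instances in one IDC
    return False
```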

Based on elastic scheduling capabilities, ContainerDB implements the following three functions:

  • Online capacity expansion: when the database load of a Shard reaches the threshold, the Shard is automatically upgraded, migrated, or resharded online.

  • Online self-healing: when a MySQL instance in a Shard fails, ContainerDB checks whether the failed instance was the master. If so, the slave with the largest GTID (the most up-to-date) is promoted to the new master; if not, a replacement slave is provisioned directly.

  • Online access: users can start online data migration and onboarding tasks in self-service mode. Such a task migrates data online from a traditional MySQL database into ContainerDB and, once migration completes, switches the domain name automatically, completing a transparent migration of the business system's data source.

With online capacity expansion, online self-healing, and online access, ContainerDB keeps JD's database service Always Online.

(3) More than scheduling

Flexible, streaming resource delivery and scheduling are the cornerstones of ContainerDB, but beyond these two core capabilities, ContainerDB also improves ease of use, compatibility, and data security, including:

Data protection

In the traditional direct-connection scheme, when the Master becomes unreachable, a Slave is promoted and the original Master's domain name is moved to the new Master. Under network jitter, however, the DNS cache on the AppServer easily produces a double-Master situation and dirty writes. As the overall architecture diagram shows, ContainerDB connects users through Gate: Gate is a clustered service, multiple Gate instances are mapped to one domain name, and Gate reaches each MySQL server by IP address. When the Master of a MySQL service in ContainerDB becomes unreachable, a new Master is elected and the routing metadata is updated; traffic flows to the new Master only after the metadata update completes. This eliminates the double-Master and dirty-write risks caused by network jitter and DNS caching, strictly protecting the data.

Streaming query processing

ContainerDB implements a priority-queue-based merge sort at the Gate layer to provide fast streaming queries: for a large result set, the first rows are returned immediately, greatly improving the user experience.
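This kind of streaming merge can be illustrated with Python's heap-based merge; the row format here is hypothetical:

```python
# Illustrative sketch only: stream an ORDER BY result by lazily merging
# per-shard sorted streams, emitting rows without buffering the full set.
import heapq


def streaming_query(shard_cursors):
    """shard_cursors: iterables, each yielding rows already sorted by the key."""
    # heapq.merge pops the globally smallest row across shards lazily
    yield from heapq.merge(*shard_cursors, key=lambda row: row["sort_key"])


shards = [
    iter([{"sort_key": 1}, {"sort_key": 4}]),
    iter([{"sort_key": 2}, {"sort_key": 3}]),
]
for row in streaming_query(shards):
    print(row["sort_key"])   # 1, 2, 3, 4, emitted incrementally
```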

Transparent data migration

For online data migration and onboarding, ContainerDB developed JTransfer, which implements algorithms for copying the existing (stock) data and for continuously catching up on incremental changes. JTransfer migrates live data from a traditional MySQL database into ContainerDB: when ContainerDB lags the source MySQL by less than 5 seconds, writes to the source are paused; when the lag reaches 0, the source database's domain name is repointed to the Gate cluster. The AppServer never notices the migration.
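The lag-driven cutover can be sketched as a simple control loop; `get_lag`, `pause_source_writes`, and `switch_domain` are hypothetical hooks standing in for JTransfer's internals:

```python
# Illustrative sketch only: pause source writes when lag drops below 5 s,
# then repoint the domain name once the lag reaches 0.
import time


def cutover(get_lag, pause_source_writes, switch_domain, pause_at=5):
    writes_paused = False
    while True:
        lag = get_lag()              # seconds ContainerDB is behind the source
        if not writes_paused and lag < pause_at:
            pause_source_writes()    # stop writes once lag is under 5 seconds
            writes_paused = True
        if writes_paused and lag == 0:
            switch_domain()          # repoint source domain to the Gate cluster
            return
        time.sleep(1)
```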

Compatible with MySQL protocol

ContainerDB is fully compatible with the MySQL protocol, supports standard MySQL clients and official drivers, and supports most ANSI SQL syntax.

Routing rule transparency

ContainerDB connects to users through the Gate cluster. For each query, Gate derives all tables involved from the syntax tree and execution plan of the user's SQL, obtains each table's shard information from the Topology metadata, and finally routes the query or write to the correct Shard by combining the Join conditions with the predicates in the Where clause. Gate automates the whole process, which is completely transparent to the user.

Self-service access

ContainerDB abstracts database-service instantiation, DDL/DML execution, shard upgrade, capacity expansion, and other functions into independent interfaces, and, with the help of a process engine, offers streamlined self-service onboarding. Once a user's application for database service is approved, ContainerDB automatically pushes the database access password to the user's mailbox.

The past is gone, the future is here.

We will keep thinking about the value databases can create from the user's perspective. We believe JD's future database services will be:

  • More Smart: based on the monitoring data for CPU, memory, disk, and other resources in each database instance, we will apply deep learning and cluster analysis to identify each instance's resource affinity, intelligently raising the limits of its preferred resources and lowering the limits of the others.

  • More Quick: we will analyze in real time the mapping between hosts and containers, each container's resource limits, and each container's historical resource growth rate, and defragment a container's host in advance, ensuring every container can expand through a vertical upgrade and greatly speeding up capacity expansion.

  • More Cheap: we will provide a fully self-developed storage engine, and plan to integrate the query engine with the storage engine into a multi-model database engine, unifying multiple data models and greatly reducing both the resources a database service needs and the development cost.

  • More Friendly: Both ContainerDB and our own multi-model database are fully compatible with the MySQL protocol and syntax, making the migration cost of existing applications close to zero.

  • More Open: ContainerDB will embrace open source through JD.com's many internal scenarios, and we look forward to improving it together with others in the industry. Our multi-model database will eventually be open-sourced to the community as well, and we look forward to making it available to the industry.