Author: Zhang Liang, Head of Commercial data at Didi Cloud

I joined Didi in 2014 and have been responsible for building the LogAgent, Kafka, Elasticsearch, and OLAP engines. I have extensive experience in architecture design and development for high-concurrency, high-throughput scenarios, and have led the platform design and development of data systems such as the task scheduling system, the monitoring system, the log service, real-time computing, and the synchronization center.

I. Background of this share

At Didi I was responsible for building service systems around open source big data engines such as LogAgent, Kafka, Flink, Elasticsearch, and ClickHouse, and through many detours and pitfalls I have accumulated some practical experience. In recent years the pandemic has accelerated the pace of enterprise digital transformation. I have had in-depth exchanges with dozens of Internet, finance, securities, and education companies: building an autonomous, controllable, and secure service system on top of open source data engines has strong appeal to them, and a common source of confusion is how, starting from open source engines and taking the company's characteristics and stage of development into account, to build a service system with high ROI.

II. Construction practice

Didi’s big data infrastructure built on open source engines started from the BI requirements of data-driven business operations and decision making. With real-time data flow reaching 100 MB/s and storage reaching the PB level, operating open source data engines as a service runs into all kinds of challenges in stability, ease of use, and operational friendliness. The journey went through four stages: the engine experience period, the engine development period, the engine breakthrough period, and the engine governance period. Each stage had its own pain points and service challenges:

  • Engine experience period: with the business developing rapidly, engine selection, version selection, hardware model selection, and deployment architecture design are the keys to early stability.

  • Engine development period: as the number of engine users grows, day-to-day Q&A, rolling out best practices, and diagnosing online problems consume more than 60% of the team’s energy. It becomes urgent to build a big data PaaS layer that lowers the barrier for users to learn, apply, and operate the engines, and improves users’ self-service capability.

  • Engine breakthrough period: as cluster scale and business scenarios expand, you will inevitably hit the capability boundaries of the open source engine. You need to build an internal iteration mechanism that stays closely aligned with the open source community, so you can upgrade versions smoothly and enjoy the community’s technology dividend, while also doing bug fixes and feature enhancements on top of the open source engine.

  • Engine governance period: once the PaaS platform is built and engine versions iterate rapidly, three main categories of problems emerge: mixing workloads without distinguishing SLA scenarios, misuse beyond the engine’s capabilities, and abuse without cost awareness. As a result, the reputation of the engine service suffers and the ROI of resources (machines plus manpower) is low, so building a metadata-driven engine governance system becomes urgent.

1. Engine experience period

There are many open source engines solving specific technical problems; for message queues alone there are Kafka, Pulsar, RocketMQ, and others. Technology selection is crucial to the service SLA and to operations and maintenance, and it seriously affects the team’s happiness and sense of value.

1) Engine selection

Consider the number of GitHub stars, the number of contributors, and whether there are PMC members or committers in China. Other factors include how widely the engine is applied, whether it is endorsed by production practice at large companies, whether offline meetups are frequent, how fast online Q&A responses are, whether rich best-practice materials are available online, whether the deployment architecture is simple, and so on.

2) Model selection

Workloads can be classified into IOPS-bound and TPS-bound scenarios, and open source engines can be CPU-intensive, I/O-intensive, or hybrid. Take the distributed search engine Elasticsearch as an example: in enterprise-class search scenarios random I/O is frequent and disk IOPS is high, so SSD disks are required. For CPU consumption, you need to evaluate the CPU requirements of the business scenario based on query complexity and QPS. It is recommended to benchmark Elasticsearch’s performance in the specific scenario to provide a basis for hardware selection.

Based on engine principles, industry best practice, application scenarios, and stress test results, select candidate hardware models. The ideal state is to balance CPU, disk capacity, network I/O, and disk I/O so that the CPU becomes the bottleneck. Moreover, as the engine is continuously optimized and the hardware and software infrastructure evolves, replacing machines as their warranty expires is normal, so selecting the best model is a dynamically evolving process. In 2020, Didi’s operation and maintenance support team carried out a new round of hardware optimization and scenario tuning, which cut the cost of the Elasticsearch log clusters in half while average peak CPU utilization reached 50%.
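To make the balancing idea concrete, here is a hypothetical back-of-envelope sizing sketch (the function name, parameters, and numbers are invented for illustration; real sizing should come from benchmarks as described above). It estimates the node count each resource dimension would demand and reports which one is the bottleneck:

```python
import math

def estimate_nodes(qps, cpu_ms_per_query, iops_per_query,
                   cores_per_node, node_iops_budget, cpu_target=0.5):
    """Back-of-envelope node counts per resource; the bottleneck is the max.

    cpu_target is the desired peak CPU utilization (e.g. 0.5 = 50%).
    """
    # CPU: total core-seconds needed per second / usable core-seconds per node
    cpu_nodes = math.ceil(qps * cpu_ms_per_query / 1000
                          / (cores_per_node * cpu_target))
    # Disk: total IOPS demanded / IOPS one node's disks can sustain
    io_nodes = math.ceil(qps * iops_per_query / node_iops_budget)
    bottleneck = "cpu" if cpu_nodes >= io_nodes else "disk_iops"
    return {"cpu": cpu_nodes, "disk_iops": io_nodes,
            "nodes": max(cpu_nodes, io_nodes), "bottleneck": bottleneck}
```

If the disk line dominates, the model has too little IOPS per node (consider SSDs); if CPU dominates, the balance described above has been reached.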

3) Deployment selection

Take Elasticsearch as an example. The minimum highly available deployment contains three nodes, with each node acting as a master node, client node, and data node at the same time. This works for clusters of 3 to 5 nodes, but once the cluster grows to dozens of nodes and write throughput reaches hundreds of MB/s, contention for the nodes’ network processing thread pools and memory becomes pronounced. Because metadata processing and data processing are not isolated from each other, metadata synchronization can run into performance or consistency problems and render the cluster unavailable. So beyond a certain scale, consider a role-based deployment.

Didi Elasticsearch role deployment practice
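As a sketch of what role separation looks like in configuration (a hedged example, not Didi’s actual settings), the per-node role flags in `elasticsearch.yml` for Elasticsearch 6.x/7.x might be set as follows (7.9+ replaces these flags with `node.roles`):

```yaml
# elasticsearch.yml -- dedicated master-eligible node
node.master: true
node.data: false
node.ingest: false

# dedicated data node
node.master: false
node.data: true
node.ingest: false

# coordinating-only ("client") node: all roles disabled
node.master: false
node.data: false
node.ingest: false
```

Dedicated masters isolate cluster metadata handling from the data path, which addresses the resource-contention problem described above.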

Users applying for engine services need a reasonable deployment architecture based on the importance of the business. There are two common modes: single-tenant dedicated clusters versus large multi-tenant clusters. In the engine experience stage, control over the engine is limited; online services have high stability requirements and are sensitive to latency and jitter, so a dedicated cluster is recommended. Offline scenarios care about overall resource utilization and are insensitive to RT jitter, so a large multi-tenant cluster is advisable.

2. Engine development period

As engine-backed businesses keep growing, the number of clusters (10+) and cluster sizes (100+ nodes) grow rapidly. On the one hand, user consultations and questions surge, partly because the engines have a steep learning curve, but more importantly because high-frequency user operations such as resource application and schema changes are not supported by a platform. On the other hand, the operation and maintenance team hits the boundary of its support capability: the metric system is incomplete, problem diagnosis is inefficient, and there is no supporting system for degradation, rate limiting, security, or cross-AZ high availability, leaving operators exhausted. Two kinds of work are needed at this stage: improve the efficiency of engine users and of operations and security personnel by moving high-frequency changes and operations onto a platform, and make up for the open source engines’ shortcomings in operational friendliness by building out the engine metric system and high-availability system and upgrading the service architecture.

1) Engine PaaS platform construction

Nanny-style manual support for user operations such as high-frequency resource creation and schema changes can be handled in the early stage of engine construction, when there are not many users. As the number of engine service users grows, the engine’s steep learning curve is fully exposed: at best there is a Quick Start guide for novice users, while the official user manual focuses on feature descriptions and lacks scenario-based best-practice guidance tied to business scenarios. The “all in me” user service model becomes the bottleneck of the enablement service.

Most open source engines come with their own metric systems, which are complex and obscure without a deep understanding of the engine’s internals. In addition, engines expose their own atomic API capabilities; high-frequency operations implemented as scripts are opaque, and the lack of an enforced double-check mechanism makes safety and security incidents likely.

To sum up, a PaaS cloud management system is urgently needed. For ordinary users, it should reduce onboarding and usage costs, provide an FAQ for common problems, and accumulate best practices for business scenarios. For operation and maintenance support personnel, building a fully managed engine service requires a systematic engine metric system to improve problem-location efficiency, plus platform-based automation of cluster installation, deployment, upgrade, and scaling to improve the efficiency and safety of cluster changes.
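As an illustration of the double-check enforcement mentioned above, here is a minimal hypothetical sketch (the names and workflow are invented, not Logi’s actual implementation) in which a high-risk change can only execute after a second person approves it:

```python
from dataclasses import dataclass, field

@dataclass
class ChangeRequest:
    action: str            # e.g. "expand-cluster", "delete-index"
    requested_by: str
    approvals: set = field(default_factory=set)

class ChangeGate:
    """High-risk operations execute only after someone else approves."""
    def __init__(self, required_approvals=1):
        self.required = required_approvals

    def approve(self, req, approver):
        # Enforce the double check: the requester cannot self-approve.
        if approver == req.requested_by:
            raise PermissionError("requester cannot approve their own change")
        req.approvals.add(approver)

    def execute(self, req, run):
        if len(req.approvals) < self.required:
            raise RuntimeError("double check failed: not enough approvals")
        return run()    # run the actual change only once the gate is open
```

The point is that the platform, not operator discipline, guarantees that every dangerous script run gets a second pair of eyes.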

Take Didi’s Kafka PaaS cloud platform Logi-KafkaManager as an example of this platform construction philosophy; for the concrete design see “Didi open-sources Logi-KafkaManager, a one-stop Kafka monitoring and control platform”, now open source at github.com/didi/Logi-K… .

2) Upgrade the engine service architecture

Open source projects generally open up their core capabilities and define the interaction protocol between client and server, leaving the surrounding ecosystem integration and the maintenance of SDK versions in different languages to contributors. Many enterprise-level features, such as tenant definition, tenant authentication, and tenant rate limiting, are left for users to extend and implement themselves.

In the early days of applying an open source service, an engineer is often brought in because the business needs it, and only understands the engine’s basic principles. To fill the open source engine’s gaps in security, rate limiting, disaster recovery, monitoring, and so on, a business or middleware architect typically picks an open source SDK in a language they are familiar with and wraps it in a facade layer, minimizing intrusion into business code while maintaining, promoting, and upgrading the SDK version uniformly.

As the business keeps expanding, business services adopt middle-platform and microservice architectures, and personnel and organizational entropy grow, enhancing the SDK and shipping bug fixes becomes difficult. Upgrading the SDK carries an extremely high cost of communicating with business teams and landing the change: stress testing, demonstrating reduced intrusion and business benefit, coordinating with business release rhythms, and so on. The convergence period of an SDK version is measured in years, so the SDK-based engine service model cannot support rapid business growth.

For the above reasons, the industry generally adopts the classic Proxy architecture to build a bridge between server and client. Many of Didi’s internal engines have corresponding implementations, such as Kafka-Gateway, ES-Gateway, DB-Proxy, and Redis-Proxy. The following uses Didi Elasticsearch as an example to describe Proxy architecture construction.

  • Extending enterprise features such as security, rate limiting, and high availability at the Elasticsearch Proxy layer breaks the bottleneck on big data engine cluster scale. Users can enjoy multi-tenant shared clusters, exclusive clusters, or independent clusters at different SLA levels without paying attention to the details of the underlying physical clusters and resources, forming a fully managed service.

  • Cross-version upgrades involve many incompatibilities in storage format, communication protocol, and data model, so the Proxy architecture can be relied on to achieve smooth cross-version upgrades. The following is the architectural practice of the 2019 cross-version upgrade from ES 2.3 to 6.6.1; for details see: “Didi Elasticsearch platform’s road to cross-version upgrade and platform reconstruction”.

The Proxy-based platform architecture truly decouples the underlying engine service from the upper business architecture, laying a solid foundation for subsequent technological innovation in the underlying engine and rapid iteration of the service architecture.
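As a minimal, hypothetical sketch of the Proxy idea (the class names, tenant fields, and limits here are invented for illustration, not Didi’s actual ES-Gateway code), a tenant-aware proxy can combine authentication with per-tenant token-bucket rate limiting before forwarding a request to the tenant’s backing cluster:

```python
import time

class TokenBucket:
    """Per-tenant rate limiter: `rate` requests/s, bursting up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

class EsProxy:
    """Routes a tenant's request to its backing cluster, enforcing auth + quota."""
    def __init__(self):
        self.tenants = {}   # tenant -> (secret, cluster, bucket)

    def register(self, tenant, secret, cluster, rate=100, burst=100):
        self.tenants[tenant] = (secret, cluster, TokenBucket(rate, burst))

    def handle(self, tenant, secret, request):
        if tenant not in self.tenants or self.tenants[tenant][0] != secret:
            return (401, "unauthorized")
        _, cluster, bucket = self.tenants[tenant]
        if not bucket.allow():
            return (429, "rate limited")
        return (200, f"forwarded to {cluster}: {request}")
```

Because the tenant-to-cluster mapping lives inside the proxy, the underlying cluster can be split, migrated, or upgraded without the business clients noticing.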

3. Engine breakthrough period

With the rapid development of the business, real-time peak traffic reaches GB/s and offline data increments reach TB/day. Both data throughput and cluster scale reach the capacity limits of the open source engine. On the one hand, low-frequency failure scenarios such as disk failures, machine crashes, and Linux kernel problems occur frequently once cluster instances number in the hundreds, approaching a thousand; on the other hand, the engine has a high probability of triggering bugs in extreme scenarios. Both require deep control of the engine: the ability to do day-to-day bug fixes and internal version iteration, ultimately achieving autonomy and control over the engine.

1) Deep control of engine principles

As the engine-backed business expands and the engine’s commercial value grows, the company gradually sets up a dedicated engine R&D team. How do you carry out in-depth study and improvement of the open source engine, and then follow the community’s rhythm so that you enjoy its technology dividend with each iteration while still landing the enterprise’s enhanced features? Drawing on practical experience, and using Kafka as an example, here are some insights.

  • To get familiar with the open source code quickly, build a local debugging environment; get familiar with packaging, compilation, and deployment; run the test cases and learn to debug features through them; and read the official user and developer documentation: kafka.apache.org

  • Get familiar with the engine’s startup and day-to-day running logs, and with the main flow and operating principles of its functional modules.

  • Regularly share engine principles and exchange notes on the source code; participate in community meetups and follow the community issue list.
  • Online problem review: reading source code helps build a macro understanding of the engine, but online problems are the best learning material. Finding the root cause of an online fault and promptly re-examining the engine is an efficient way to turn knowledge into understanding. Engine engineers should treat problems like gold, eyes lighting up, digging to the bottom, to improve their mastery of the engine.

  • Carry out chaos engineering: sort out the engine’s abnormal scenarios, master the engine’s operating principles in each scenario by combining the abnormal operation logs with monitoring metrics, define the engine’s capability boundary, and prepare stability-guarantee plans.

2) Engine internal branch iteration

Online engine bugs: there are generally two solutions. One is that the community has already solved the corresponding issue/feature, and we merge the corresponding patch into the locally maintained branch version to fix the bug. The other is that we submit an issue to the community, propose a solution, test and review it following the community’s patch submission process, and finally merge it into the current version. Didi contributes 150+ patches annually to the Apache community, across the Hadoop, Spark, Hive, Flink, HBase, Kafka, Elasticsearch, Submarine, and Kylin open source projects.

Enterprise feature R&D on the engine: when serving internal users, the engine needs, on the one hand, to integrate with the enterprise’s LDAP, security system, and control platform, and on the other hand to carry enterprise-specific feature enhancements. Where the implementation is too localized or the scenario too niche for the community to accept the corresponding patch, we need to maintain and extend our own feature list. When maintaining enterprise features, keep modifications as cohesive as possible and prefer plug-in implementations against the engine’s extension interfaces wherever possible, to ease subsequent version upgrades and local branch maintenance. Take Didi Elasticsearch as an example.

As the community version develops and the enterprise’s internal branch evolves, you need to regularly follow the community version to enjoy its benefits. Formulating the upgrade plan must be very rigorous, involving many steps and evaluation items. Take Didi Elasticsearch 7.6 as an example.

This generally involves sorting out version features, trying out new features, and evaluating upgrade benefits and version compatibility; reverse-merging internal features into the open source version; functional, performance, and compatibility testing; and making detailed upgrade and rollback plans for each cluster while evaluating the changes required of users and their impact level.

4. Engine governance period

The core goal of engine governance is to pass business value upward and demand a technical dividend downward. The key is to see the problem clearly through data and find the entry point with the highest ROI. At the same time, the core starting point for driving business transformation is to clarify the long-term value of engine services and ensure continuous investment in, and value creation from, the technology.

1) Transparent transmission of business value

As the engine service business grows, business application scenarios keep evolving. On the one hand, resource ROI can be seen through resource utilization; on the other hand, the tiered guarantee system can be reviewed regularly based on users’ sensitivity to RT, their attention to service stability, and the importance of their application scenarios.

For core businesses, the energy invested in operations, the tilt of core hardware resources, the dedicated-cluster service model, and the high-availability architecture design spare no effort to hold the stability bottom line; this is where the greatest business value is created.

For non-core businesses, a business-value scoring model, combined with organizational levers and red/black list rankings, encourages a positive growth flywheel of “use it better, use it more”. Customers with high credit scores get preferential service response, resource application, and business guarantees, ultimately maximizing overall resource ROI.
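As a toy illustration of such a business-value scoring model for red/black list operation (the fields and weights are entirely hypothetical, invented for this sketch), tenants can be ranked so that the highest scorers make the red list and the lowest the black list:

```python
def value_score(tenant):
    """Hypothetical scoring: reward resource utilization, penalize
    incidents caused and monthly cost. Weights are illustrative only."""
    return (tenant["utilization"] * 100
            - tenant["incidents"] * 20
            - tenant["monthly_cost"] / 1000)

def red_black_list(tenants, top_n=1):
    """Rank tenants by score: best go on the red list, worst on the black list."""
    ranked = sorted(tenants, key=value_score, reverse=True)
    return ranked[:top_n], ranked[-top_n:]
```

The exact weights matter less than making the ranking transparent and metadata-driven, so the red/black list operation feels fair to users.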

2) Demanding the technical dividend

Service operation usually follows the 80/20 rule: 20% of the business is core, and as long as the tiered guarantee mechanism for it operates soundly, the pressure on service operators stays under control. With the fundamentals protected, the core business value lies in the remaining 80% of the business: governing and optimizing resources, performance, cost, and stability uniformly at the platform level, structurally reducing service costs, improving service quality, and building a positive word-of-mouth system where one success speaks for all. Based on a complete metric system, we can gain insight into the software bottlenecks of the open source system and optimize and innovate in stages according to ROI. For concrete cases, see:

  • Didi Offline Index fast build FastIndex architecture practice
  • ElasticSearch TPS write performance doubling technology

When the business reaches a certain scale, the commercial value of technological innovation shows: the optimizations above save the company nearly ten million yuan in cost every year, and the value of the technology is fully reflected.

Didi Logi

The Didi Logi log service suite has been polished inside Didi for more than 7 years, targeting log collection, log storage, log computing, log retrieval, and log analysis, with targeted optimizations in component-level PaaS construction and in engine stability and scalability.

Currently, the suite has open-sourced Didi Logi-KafkaManager, and will open source Logi-Agent, Logi-LogX, Logi-ElasticSearchManager, and other PaaS components in the future.

1. GitHub: z.didi.cn/4newP

2. Quick experience address: http://117.51.150.133:8080/kafka (account/password: admin/admin)

3. Daily FAQ: github.com/didi/Logi-K…

4. Upgrade manual: github.com/didi/Logi-K…

5. Didi Logi-KafkaManager cloud platform construction summary: mp.weixin.qq.com/s/9qSZIkqCn…

6. Series of video tutorials: mp.weixin.qq.com/s/9X7gH0tpt…

Didi Nightingale

Didi Nightingale is a distributed, highly available operation and maintenance monitoring system. Its biggest feature is hybrid cloud support: it covers both traditional physical machine/VM scenarios and Kubernetes container scenarios. Beyond monitoring, Didi Nightingale also provides CMDB and automated operations capabilities, and many companies build their own operations platforms on top of it.

GitHub: z.didi.cn/4wurz

Official documentation: n9e.didiyun.com

gocn.vip/topics/1081…

Audio Q&A: m.ximalaya.com/keji/450958…

Video tutorials: m.bilibili.com/space/44253…

Secondary development: xie.infoq.cn/article/30d…

If you run into problems using Didi Logi-KafkaManager or Nightingale, or have anything to discuss with the developers, you can scan the QR code below to join the Didi Logi and Nightingale open source user group and ask questions there.

In the group are the Didi Logi-KafkaManager and Nightingale project leads, Didi senior expert engineers Zhang Liang, Qin Xiaohui, and other technical experts, answering questions online. You are welcome to long-press the QR code and let the assistant add you to the group. (Note “Kafka” or “Nightingale”.)