What's Didi's secret to handling so much data?

Abstract

This talk covers the application scenarios and practices of real-time computing at Didi.

On August 12, 2017, Liang Liyin, head of Didi's Real-time Computing Platform, delivered a talk titled “Real-time Computing Practice of Didi Massive Data” at the NetEase Erudite Practice Day: Big Data and Artificial Intelligence Technology Conference. IT Dakashuo, the exclusive video partner, publishes this write-up with the review and authorization of the organizers and the speaker.

Word count: 1,260 | Reading time: 4 minutes

Video of the talk:
t.cn/RQXAmrK

Didi big data system

The main feature of Didi’s big data system is that the data is real-time: more than 90% of the data is collected in real time. There are three types of data sources. The first is Binlog data: everything written to the databases is collected in real time through the Binlog. The second is publicLog, which collects all logs on the servers in real time. The third is tracking (instrumentation) data reported from the client side.

Because basically all of our data is collected in real time, real-time technology is also widely used in the processing pipeline. Three products are used for real-time storage. The first is ES (Elasticsearch), used mainly for log retrieval and real-time analysis. The second is Druid, used for real-time reports and real-time monitoring. The third is HBase, used for queries and data scans.

Hive and Spark are used for the offline part: Hive handles ETL, and Spark handles data analysis and queries. Spark Streaming and Flink Streaming are used for stream computing.

In terms of scale, both our real-time storage and our offline clusters are at a leading level in China.

Real-time computing scenario

There are four scenarios for real-time computing: ETL, real-time reports, real-time monitoring, and real-time business.

Because 90% of our data is collected in real time, and ETL is the first step after collection, ETL is currently the largest workload. Real-time reports are used by operations and customer service to present reports.

Real-time monitoring is second only to ETL in scale. There are two types of internal monitoring requirements. One is machine-level monitoring, which uses other technical solutions. The other is real-time business monitoring, covering metrics such as daily order volume and balance rate, all of which run on the real-time computing system.

Real-time business is our key focus this year: we want to make breakthroughs in stream computing scenarios on the client side.

Real-time ETL

To make ETL easier to use, we turned it into a platform: users only need to configure a job on the Web to implement data cleaning. The current cleaning throughput is about 3.5 million records per second, and several PB of data are cleaned every day. All of this runs on Spark Streaming-based cloud computing.
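As a rough illustration of configuration-driven cleaning, the sketch below applies a small rule set to JSON log lines. The rule names (`required`, `rename`, `drop`), the field names, and the functions `clean_record` and `clean_batch` are all hypothetical; the talk does not describe the platform's actual configuration format. In a Spark Streaming job, a function like `clean_record` would presumably be applied to each micro-batch, but plain Python is used here so the example runs on its own.

```python
import json
from typing import Optional

# Hypothetical rules, standing in for what a user might configure on the Web.
CLEANING_CONFIG = {
    "required": ["order_id", "ts"],   # drop records missing any of these fields
    "rename": {"ts": "timestamp"},    # source field name -> target field name
    "drop": ["debug_info"],           # fields to strip from the output
}


def clean_record(raw: str, config: dict) -> Optional[dict]:
    """Parse one JSON log line and apply the configured rules.

    Returns None when the line is malformed or a required field is missing.
    """
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    if any(field not in record for field in config.get("required", [])):
        return None
    for field in config.get("drop", []):
        record.pop(field, None)
    for old, new in config.get("rename", {}).items():
        if old in record:
            record[new] = record.pop(old)
    return record


def clean_batch(lines, config):
    """Clean a batch of lines, discarding the ones that fail validation."""
    cleaned = (clean_record(line, config) for line in lines)
    return [record for record in cleaned if record is not None]


sample = [
    '{"order_id": 1, "ts": 1502500000, "debug_info": "x"}',
    '{"order_id": 2}',   # missing "ts", so it is dropped
    "not json",          # malformed, so it is dropped
]
result = clean_batch(sample, CLEANING_CONFIG)
```

Keeping the rules in data rather than code is what lets a Web form drive the job: changing the cleaning behavior means editing the configuration, not redeploying the job.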

Real-time reports

Spark Streaming and Druid are the main technologies used for real-time reports. Spark Streaming handles data cleaning. Druid can consume Kafka data in real time, but it places requirements on the data, so the data first goes through a round of cleaning and conversion.

Real-time reports cover several scenarios: the customer service dashboard, exception statistics, and the order heat map.

The customer service dashboard is a large screen that displays information such as the customer service call response rate, complaint hotspots, and queuing status.

Exception statistics monitor the requests sent from the client to the server, so that the request success rate, failure rate, and request count can be tracked.

The order heat map shows the number of orders, passengers, and drivers in a given area on a map.
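The talk does not say what Druid's requirements on the data are. As a minimal sketch, assume they include a timestamp on every row and flat columns rather than nested objects. The field names (`ts`, `order`, `status`) and the functions `flatten` and `to_report_row` are made up for illustration.

```python
from datetime import datetime, timezone


def flatten(record: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Turn nested dicts into a single-level dict with joined key names."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def to_report_row(event: dict) -> dict:
    """Flatten an event and replace epoch seconds with an ISO 8601 timestamp."""
    row = flatten(event)
    epoch_seconds = row.pop("ts")  # assumed field name for the event time
    row["timestamp"] = datetime.fromtimestamp(
        epoch_seconds, tz=timezone.utc
    ).isoformat()
    return row


event = {"ts": 0, "order": {"id": 42, "city": "beijing"}, "status": "ok"}
row = to_report_row(event)
```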
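The three metrics named here can be sketched as a per-window aggregation. The example below assumes each request arrives as a hypothetical `(epoch_seconds, succeeded)` pair and groups requests into fixed one-minute windows; the production system uses Spark Streaming and Druid, not this in-memory loop.

```python
from collections import defaultdict


def request_stats(events, window_seconds=60):
    """Aggregate (epoch_seconds, succeeded) pairs into fixed time windows.

    Returns a dict mapping each window's start time to its request count,
    success rate, and failure rate.
    """
    counts = defaultdict(lambda: {"total": 0, "failed": 0})
    for ts, succeeded in events:
        window_start = ts - ts % window_seconds
        counts[window_start]["total"] += 1
        if not succeeded:
            counts[window_start]["failed"] += 1

    stats = {}
    for window_start, c in sorted(counts.items()):
        failure_rate = c["failed"] / c["total"]
        stats[window_start] = {
            "requests": c["total"],
            "success_rate": 1 - failure_rate,
            "failure_rate": failure_rate,
        }
    return stats


events = [(0, True), (10, True), (20, False), (59, True), (60, False), (61, False)]
stats = request_stats(events)
```

In this sample, the first minute holds four requests with one failure and the second minute holds two requests that both failed.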
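One common way to prepare data for a map like this is to bucket coordinates into grid cells and count events per cell. The sketch below assumes that approach; the talk does not describe how the areas are actually defined. The cell size, the event kinds, and the function name `heat_map` are illustrative.

```python
import math
from collections import Counter, defaultdict


def heat_map(events, cells_per_degree=100):
    """Count events per grid cell.

    events: iterable of (kind, latitude, longitude), where kind is a label
    such as "order", "passenger", or "driver".
    With cells_per_degree=100, each cell spans 0.01 degrees on each side.
    """
    counts = defaultdict(Counter)
    for kind, lat, lng in events:
        cell = (
            math.floor(lat * cells_per_degree),
            math.floor(lng * cells_per_degree),
        )
        counts[cell][kind] += 1
    return {cell: dict(kinds) for cell, kinds in counts.items()}


events = [
    ("order", 39.905, 116.397),
    ("driver", 39.908, 116.391),   # falls in the same cell as the first order
    ("order", 39.915, 116.404),
]
heat = heat_map(events)
```

A front end can then color each cell by its count, one layer per event kind.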

We chose Druid for some of its features, such as flexible queries.

Real-time monitoring

To improve monitoring efficiency, we built a one-stop self-service monitoring platform and built out a full-link platform.

On this platform, Didi monitors about 200 internal data sources and 400 to 500 indicators.

Real-time business

Flink Streaming is a new engine that we introduced this year. Through real-time business, we want to provide better solutions to the problems of very high latency, data loss, and data duplication.

Facing the challenge

Lower real-time computing development costs: real-time computing is still much harder to develop for than Hive and similar tools, and we are exploring ways to make it easier.

Real-time business development and challenges: our technology is very mature in real-time ETL, real-time reports, and real-time monitoring, and it covers basically all of Didi's internal business scenarios. Real-time business has very high requirements for latency and fault tolerance, which is an important challenge we face now.

Reasonable allocation of resources across business peaks and valleys: what we need to work out now is how to allocate resources so that they are used more efficiently and save costs for the company.

That’s all for today. Thank you!
