Hundreds of millions of records: how can they be queried and analyzed simply and efficiently?

Abstract: During the 618 promotion, Xiao Zhang ran into a thorny problem: he needed to jointly analyze the revenue of the company's e-commerce department over the past year and one week of operating data from the offline stores.

What kind of data challenges does this create?

Data silos: The e-commerce department's data is stored in warehouse A, and the stores' operating revenue data is stored in warehouse B. How can a joint analysis across warehouses be done conveniently?
PB-level data volume: Multiple e-commerce platforms plus nationwide offline stores generate TB-level data every day, and the annual data volume reaches the PB level.

He immediately contacted the group's CTO, hoping to have each department's data exported to him within a day.

At this point, the CTO was stumped:

The company's existing resource pool can comfortably handle TB-level data, but Xiao Zhang's data volume is roughly estimated at PB level, far beyond what the pool can handle, so the export could only be completed by spending much more time. Expanding the resource pool for such an infrequent scenario would cost too much overall.

Faced with Xiao Zhang's problem, Yunhuhu recommended Huawei Cloud's powerful big data query and analysis tool, the Data Lake Exploration (DLI) service. DLI can support joint queries over EB-level data volumes for only 0.35 yuan per CU per hour (1 CU = 1 core and 4 GB of memory), and a monthly package for 1 CU costs only 150 yuan.
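To make the two prices concrete, here is a minimal Python sketch that compares pay-per-use billing with the monthly package. Only the two prices come from the article. The function names are hypothetical, and the sketch assumes that billing is strictly proportional to CU-hours and that one monthly package covers one CU for the whole month.

```python
# Prices quoted in the article; every helper below is a hypothetical illustration.
PAY_PER_USE_YUAN_PER_CU_HOUR = 0.35
MONTHLY_PACKAGE_YUAN_PER_CU = 150.0


def pay_per_use_cost(cu_count: int, hours: float) -> float:
    """Cost in yuan of running cu_count CUs for the given hours on pay-per-use."""
    return cu_count * hours * PAY_PER_USE_YUAN_PER_CU_HOUR


def break_even_hours() -> float:
    """Monthly hours per CU above which the monthly package becomes cheaper."""
    return MONTHLY_PACKAGE_YUAN_PER_CU / PAY_PER_USE_YUAN_PER_CU_HOUR


def cheaper_option(cu_count: int, hours_per_month: float) -> str:
    """Name the cheaper billing mode (assumes one package per CU per month)."""
    on_demand = pay_per_use_cost(cu_count, hours_per_month)
    package = cu_count * MONTHLY_PACKAGE_YUAN_PER_CU
    return "pay-per-use" if on_demand < package else "monthly package"


if __name__ == "__main__":
    print(f"Break-even: {break_even_hours():.1f} hours per CU per month")
    print(cheaper_option(cu_count=16, hours_per_month=40))
```

Under these assumptions the break-even point is roughly 429 hours per CU per month, so an occasional job like Xiao Zhang's one-off analysis favors pay-per-use, while a CU that runs most of the month favors the package.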

Data Lake Exploration (DLI) Service 2.0 is a serverless big data computing and analysis service that is fully compatible with the Spark and Flink ecosystems. Users can query and analyze heterogeneous data sources using standard SQL or their own programs.

How does DLI solve Xiao Zhang's problem?

DLI service architecture – Serverless

DLI is a serverless big data query and analysis service. Its advantages are as follows:

(1) Pay-per-use billing: charges are based on actual usage (data scanned or CUs), with no charge when nothing is running.

(2) Automatic scaling: The system automatically scales computing resources out and in based on service load.

The DLI serverless architecture easily addresses the problems of cost, insufficient resources, and ad hoc business requirements.

1. Spark + Flink, the core DLI engines

Spark is a unified analytics engine for large-scale data processing, focused on query, computation, and analysis. DLI builds on open-source Spark with extensive performance optimization and service-oriented adaptation. It is compatible with the Apache Spark ecosystem and interfaces, and its performance is 2.5 times higher than that of open-source Spark. It can query and analyze EB-level data within hours. DLI also provides a Flink engine for real-time processing.

2. DLI's trump card: cross-source analysis

DLI supports a variety of cloud services, self-built databases, and offline databases, and can analyze multiple data sources across databases directly, building a unified view of the enterprise's data.

Once Xiao Zhang connects offline warehouse A and warehouse B to DLI, he can run joint queries on DLI directly. This avoids migrating the data and rebuilding a warehouse just for the joint query, and makes cross-database queries easy to handle.
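The idea of one SQL statement joining two separate warehouses can be sketched locally. The Python example below is not DLI code: it uses the standard-library sqlite3 module as a stand-in, with two attached in-memory databases playing the roles of warehouse A and warehouse B. The table names, column names, and figures are all made up for illustration.

```python
import sqlite3


def joint_revenue():
    """Join revenue tables that live in two separate attached databases."""
    conn = sqlite3.connect(":memory:")
    # Each ATTACH creates an independent in-memory database (stand-in warehouse).
    conn.execute("ATTACH DATABASE ':memory:' AS warehouse_a")
    conn.execute("ATTACH DATABASE ':memory:' AS warehouse_b")

    # Hypothetical schemas and sample rows, invented for this sketch.
    conn.execute(
        "CREATE TABLE warehouse_a.ecommerce_revenue (region TEXT, revenue REAL)"
    )
    conn.execute(
        "CREATE TABLE warehouse_b.store_revenue (region TEXT, revenue REAL)"
    )
    conn.executemany(
        "INSERT INTO warehouse_a.ecommerce_revenue VALUES (?, ?)",
        [("north", 120.0), ("south", 80.0)],
    )
    conn.executemany(
        "INSERT INTO warehouse_b.store_revenue VALUES (?, ?)",
        [("north", 30.0), ("south", 45.0)],
    )

    # One standard SQL statement spans both warehouses; no data is copied first.
    rows = conn.execute(
        """
        SELECT a.region,
               a.revenue AS online,
               b.revenue AS offline,
               a.revenue + b.revenue AS total
        FROM warehouse_a.ecommerce_revenue AS a
        JOIN warehouse_b.store_revenue AS b ON a.region = b.region
        ORDER BY a.region
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in joint_revenue():
        print(row)
```

The point of the sketch is the shape of the query: both sources are addressed by name in a single join, which is the experience the cross-source feature is described as providing at much larger scale.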

Other benefits of the Data Lake Exploration (DLI) service

Pure SQL operation: Provides standard SQL interfaces, so users can query and analyze massive data using only SQL.
Separation of storage and compute: Storage and compute are decoupled, requested separately, and billed separately, which reduces costs and improves resource utilization.
Enterprise multi-tenancy: Computing resources are isolated by tenant, and data permissions are controlled down to the queue and job level, helping enterprises share data between departments and manage permissions.
O&M-free and highly available: Users do not need to handle underlying O&M or upgrades, and cross-AZ HA and cross-AZ active-active deployment are taken care of by the service.

Application scenarios of the Data Lake Exploration (DLI) service

1. Database analysis + DLI 2.0: one-click warehouse building that keeps the ease of use of a database

Pain points:

(1) Most databases cannot analyze the full data set

(2) Complex relationships across databases cannot be queried

(3) Analysis workloads affect other online data services

Solution:

Big data query and analysis can be completed using only standard SQL.

2. Precision marketing + DLI 2.0: intelligent e-commerce recommendations, with massive cross-database, cross-source data queried in seconds

Pain points:

(1) Too many data sources make joint analysis difficult

(2) Intelligent recommendations must be produced within a short time

Solution:

DLI's cross-source capability easily breaks down data silos. It currently supports 10 types of data sources as well as offline self-built data.

3. Log analysis + DLI 2.0: a must-have scenario for companies, where pay-per-use billing reduces cost

Pain points:

(1) Log analysis covers a long time span

(2) Many resources sit idle, with low utilization

Solution:

DLI is billed by usage, at only 0.35 yuan per CU per hour.

4. Real-time risk control + DLI 2.0: reducing risk events in finance, O&M, and other real-time scenarios

Pain points:

(1) Data is not refreshed in time, and risk events occur frequently

(2) Real-time data analysis requires an in-depth understanding of Flink's underlying architecture

Solution:

A risk control system has strict real-time requirements. DLI uses high-performance computing resources, and a single CPU can handle 1,000 to 20,000 messages per second.
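The throughput range above can be turned into a rough sizing estimate. The Python sketch below is only back-of-the-envelope arithmetic: the two per-CPU bounds come from the article, the helper names and the example target rate are hypothetical, and it assumes throughput scales linearly with the number of CPUs, which real streaming jobs rarely do exactly.

```python
import math

# Per-CPU throughput bounds quoted in the article (messages per second).
MIN_MSGS_PER_SEC_PER_CPU = 1_000
MAX_MSGS_PER_SEC_PER_CPU = 20_000


def cpus_needed(target_msgs_per_sec: int, msgs_per_sec_per_cpu: int) -> int:
    """CPUs required for a target rate, assuming linear scaling (an assumption)."""
    return math.ceil(target_msgs_per_sec / msgs_per_sec_per_cpu)


def cpu_range(target_msgs_per_sec: int) -> tuple:
    """(best case, worst case) CPU counts for the quoted throughput range."""
    best = cpus_needed(target_msgs_per_sec, MAX_MSGS_PER_SEC_PER_CPU)
    worst = cpus_needed(target_msgs_per_sec, MIN_MSGS_PER_SEC_PER_CPU)
    return best, worst


if __name__ == "__main__":
    best, worst = cpu_range(50_000)  # hypothetical target rate
    print(f"50,000 msg/s needs between {best} and {worst} CPUs")
```

The wide gap between the two bounds is the useful takeaway: where a workload lands depends on message size and per-message processing cost, so a measured benchmark is needed before committing to a resource size.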

Serverless big data services are a future-oriented model. As today's problems are solved one by one, their share of big data analysis will grow year by year, turning big data analytics into an accessible tool that every enterprise can afford, just like water and electricity. The Huawei Cloud Data Lake Exploration (DLI) service enables enterprises to easily complete batch and stream processing of heterogeneous data sources, and to mine and explore the value of their data.

For more information, visit the official Huawei Cloud Data Lake Exploration (DLI) service website.
