The Data Intelligence Group of the Baidu Map Open Platform Business Department is mainly responsible for big data computation and analysis for Baidu Map's internal businesses. It processes data on the scale of ten billion rows every day and provides different businesses with OLAP multi-dimensional analysis and query services that answer a single SQL query in milliseconds.
Baidu Map's Data Intelligence Group was one of the earliest teams in China to run Apache Kylin in a production environment. Apache Kylin was open-sourced in November 2014. At that time, our team needed to build a complete OLAP analysis and computation platform for big data, one that could provide multi-dimensional analysis and query services over ten billion rows with millisecond-to-second response for a single SQL query. During technology selection we evaluated Apache Drill, Presto, Impala, Spark SQL, and Apache Kylin. Apache Drill and Presto had few production use cases at the time, so problems encountered later would have been hard to discuss with other users, and Apache Drill as a whole was not yet mature. Impala and Spark SQL compute mainly in memory, which demands substantial machine resources. With dynamic computation a single SQL query can get a second-level response, but an interactive page usually issues several SQL queries, and at large data scale dynamic computation struggles to meet the requirement. We therefore focused on Apache Kylin, which precomputes Cubes with MapReduce and serves low-latency queries from the results, and we completed our first full production deployment of Apache Kylin around February 2015.
Apache Kylin is an open source distributed analysis engine that provides a SQL query interface and multi-dimensional analysis (OLAP) capabilities on top of Hadoop to support very large datasets. It was originally developed by eBay Inc., contributed to the open source community, and officially graduated to an Apache Top-Level Project in November 2015.
The challenge of big data multidimensional analysis
We ran several Cube tests on the Apache Kylin cluster, and the results showed that it can effectively solve the three pain points of big data computing and analysis.
Pain point 1: dynamic computation of multidimensional metrics over ten-billion-row datasets is slow. Apache Kylin solves this by precomputing a Cube result set and storing it in HBase.
Pain point 2: filtering on complex conditions. At query time, Apache Kylin solves this with its Router lookup algorithm and an optimized HBase Coprocessor.
Pain point 3: queries across large time ranges such as a month, a quarter, or a year. Apache Kylin solves this by storing the precomputed results in Cube Data Segments, which partition the data for storage and management.
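The idea behind these three points, computing aggregates ahead of time and answering queries by lookup, can be illustrated with a toy sketch. This is not Kylin's implementation: it uses in-memory dictionaries where Kylin uses HBase, and the names `build_segment`, `query`, and the sample rows are all hypothetical. It builds one aggregate table per monthly segment for every combination of dimensions, so a filtered query over a long time range becomes a key lookup in each segment followed by a merge.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("city", "platform")  # hypothetical dimension columns


def build_segment(rows):
    """Precompute SUM(pv) for every combination of dimensions in one segment."""
    segment = defaultdict(int)
    for row in rows:
        for size in range(len(DIMENSIONS) + 1):
            for dims in combinations(DIMENSIONS, size):
                key = (dims, tuple(row[d] for d in dims))
                segment[key] += row["pv"]
    return dict(segment)


def query(segments, months, filters):
    """Answer SUM(pv) under equality filters by lookup, merging the given segments."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    key = (dims, tuple(filters[d] for d in dims))
    return sum(segments[m].get(key, 0) for m in months if m in segments)


# Hypothetical raw rows, partitioned by month (one segment per month).
raw = {
    "2015-01": [
        {"city": "Beijing", "platform": "ios", "pv": 10},
        {"city": "Beijing", "platform": "android", "pv": 20},
        {"city": "Shanghai", "platform": "ios", "pv": 5},
    ],
    "2015-02": [
        {"city": "Beijing", "platform": "ios", "pv": 7},
        {"city": "Shanghai", "platform": "android", "pv": 3},
    ],
}
segments = {month: build_segment(rows) for month, rows in raw.items()}

# No raw rows are scanned at query time; only precomputed keys are read.
beijing_total = query(segments, ["2015-01", "2015-02"], {"city": "Beijing"})
grand_total = query(segments, ["2015-01", "2015-02"], {})
```

The trade-off shown here is the usual one for precomputation: build time and storage grow with the number of dimension combinations, in exchange for query cost that no longer depends on the number of raw rows.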
Solving these three pain points lets us achieve millisecond-level response for a single SQL query in multi-dimensional analysis products with a defined data model, at a scale of ten billion rows. This is why we are so interested in Apache Kylin. In big data query and analysis applications, a page usually needs several SQL queries. If a single SQL query takes 2 seconds to respond and the page issues 5 of them one after another, the page takes about 10 seconds in total, which is unacceptable. Apache Kylin is particularly strong in this scenario, where a single page issues multiple SQL queries.
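The page-level arithmetic can be written out as a small sketch. It assumes the queries run one after another, and the 100 ms figure in the second call is an illustrative assumption, not a measured Kylin result; `page_latency_ms` is a hypothetical helper name.

```python
def page_latency_ms(query_latencies_ms):
    """Total wait for a page whose SQL queries run sequentially."""
    return sum(query_latencies_ms)


# Figures from the text: 5 queries at 2 seconds each.
dynamic_page_ms = page_latency_ms([2000] * 5)

# Assumed figure for illustration only: 5 queries at 100 ms each.
precomputed_page_ms = page_latency_ms([100] * 5)
```

Because the per-query latency is multiplied by the number of queries on the page, reducing each query from seconds to milliseconds matters far more for an interactive page than it does for a single ad hoc query.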
In practice, to meet the needs of different businesses in the company, the big data OLAP platform run by our Data Intelligence team uses Apache Kylin, Impala, and Spark SQL as its back-end storage and query engines. For small and medium-sized datasets where the dimensions and metrics to analyze are relatively ad hoc, the platform offers Impala or Spark SQL. For a specific product with a very large dataset on the order of ten billion rows, we use Apache Kylin, because such a product demands high query performance and its dimensions and metrics are clearly defined. The rest of this article mainly describes how Apache Kylin is used inside Baidu Map.
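The engine choice described above can be summarized as a small routing rule. This is only a sketch of the stated policy: the `choose_engine` function, the `Engine` enum, and the exact row-count cut-off are assumptions, and the text does not say what happens for a very large dataset with ad hoc dimensions, so the fallback below is also an assumption.

```python
from enum import Enum


class Engine(Enum):
    KYLIN = "Apache Kylin"
    AD_HOC = "Impala or Spark SQL"


# Assumed cut-off, taken from the "ten billion rows" scale mentioned in the text.
LARGE_SCALE_ROWS = 10_000_000_000


def choose_engine(row_count, dimensions_fixed):
    """Pick a query engine from data scale and whether dimensions are predefined."""
    if row_count >= LARGE_SCALE_ROWS and dimensions_fixed:
        return Engine.KYLIN
    # Assumption: everything else falls back to the dynamic-computation engines.
    return Engine.AD_HOC


product_engine = choose_engine(12_000_000_000, dimensions_fixed=True)
exploration_engine = choose_engine(50_000_000, dimensions_fixed=False)
```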
Big data OLAP platform system architecture