The big data session of the NetEase Shufan Technology Salon, jointly held by NetEase Shufan and Intel, brought together five experts: Yao Qin, NetEase big data expert and Apache Spark Committer; Chen Qi, head of OLAP in the Youzan infrastructure group; Xu Cheng, Intel senior software development project manager and Apache Hive Committer; Lei Jianbo, NetEase Cloud Music data expert; and Gu Ping, NetEase Shufan big data product expert. They shared their teams’ latest practices on topics such as Serverless Spark, ClickHouse, Spark/Flink acceleration, data warehousing, and data products.

Kyuubi: an open source, enterprise-grade Serverless Spark framework

Yao Qin, NetEase big data expert and Apache Spark Committer, shared the original motivation, key design points, and practice of Kyuubi, NetEase’s open source project. Kyuubi is a distributed JDBC service that follows HiveServer2’s RPC implementation. First, with multi-tenant support for Spark, it can serve as an ideal platform for migrating Hive QL to Spark SQL. Second, both the compilation (including compiler optimization) and the runtime execution of the entire SQL are handled by Spark, which can achieve excellent performance. On this framework, NetEase Shufan integrates some advanced features of Kyuubi and Spark and has started the journey toward Serverless Spark (Spark as a Service).

Kyuubi encapsulates Spark’s high-level APIs and exposes them through a client/server (C/S) architecture. Users do not need to be aware of Spark concepts and frameworks and can focus on their own services and data. This can meet the direct big data needs of more people and more businesses.
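Because Kyuubi follows the HiveServer2 RPC implementation, a client can in principle reach it the same way it would reach HiveServer2. The following is a minimal sketch, not taken from the talk: it assumes the third-party PyHive package is installed and a Kyuubi server is reachable. The host name, the port 10009, the user name, and both helper function names are illustrative assumptions.

```python
def build_jdbc_url(host: str, port: int, database: str = "default") -> str:
    """Build a HiveServer2-style JDBC URL for JDBC clients such as beeline."""
    return f"jdbc:hive2://{host}:{port}/{database}"


def run_query(host: str, port: int, username: str, sql: str):
    """Run one SQL statement over the HiveServer2 Thrift protocol.

    Assumes PyHive is installed and the server is reachable. The import is
    lazy so that the URL helper above works without PyHive.
    """
    from pyhive import hive  # third-party package, assumed available

    conn = hive.connect(host=host, port=port, username=username)
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical endpoint; replace with a real deployment.
    print(build_jdbc_url("kyuubi.example.com", 10009))
```

The caller only supplies an endpoint and a SQL string; no Spark concepts such as sessions or executors appear on the client side, which is the point of the C/S design described above.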

Within NetEase, Kyuubi has helped the NetEase media business migrate smoothly from Hive QL tasks to Spark SQL, reducing overall time consumption by 70% and improving overall performance by 727% while saving 50% of computing resources. In addition, the team is helping business lines migrate Spark jobs from YARN clusters to Kubernetes.

Video playback: www.bilibili.com/video/BV116…

PPT download: sq.163yun.com/resource/do…

Kyuubi open source address: github.com/NetEase/kyu…

ClickHouse practice and optimization at Youzan

Chen Qi, head of OLAP in the Youzan infrastructure group, introduced ClickHouse at Youzan from three aspects: 1) The development of ClickHouse at Youzan, platform construction, and the implementation and optimization of application scenarios such as DMP, SCRM, and CDP. 2) Offline read/write separation for data at the hundreds-of-billions scale: temporary clusters are built on K8s for offline writes, which separates offline data reads from writes and addresses the write-heavy, read-light pattern of the business. 3) Youzan’s own POC exploration of a new database, which tries to integrate Doris and ClickHouse to address the pain points of both.

ClickHouse does not behave like a traditional distributed database. It is more “manual”: users have to design their own processes around it, for example for writes and materialized views. ClickHouse also lacks automatic rebalancing, which complicates operations and maintenance when scaling capacity out or in. Apache Doris, by contrast, is more like a distributed database and addresses some of these pain points, with automatic balancing and support for Shuffle Join, but so far its single-table performance, maturity, and stability are not as good as ClickHouse’s.
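To make the rebalancing pain point concrete, here is a toy simulation, not a model of ClickHouse internals. It assumes rows are routed by key modulo the shard count and that rows already written stay where they are when a shard is added. The function names and the routing rule are simplifying assumptions for illustration.

```python
from collections import Counter


def route(key: int, shard_count: int) -> int:
    """Pick a shard by key modulo shard count (an assumed, simplified rule)."""
    return key % shard_count


def simulate_expansion(old_shards: int, old_rows: int, new_rows: int) -> list[int]:
    """Return rows per shard after adding one shard without rebalancing."""
    rows = Counter()
    # Rows written before the expansion are spread over the old shards only.
    for key in range(old_rows):
        rows[route(key, old_shards)] += 1
    # After the expansion, new rows are routed over old_shards + 1 shards,
    # but nothing moves the rows that already exist.
    new_count = old_shards + 1
    for key in range(old_rows, old_rows + new_rows):
        rows[route(key, new_count)] += 1
    return [rows[shard] for shard in range(new_count)]


if __name__ == "__main__":
    print(simulate_expansion(old_shards=3, old_rows=900, new_rows=400))
```

In this toy run the new shard ends up holding far fewer rows than the old ones, which is the kind of skew an operator has to correct by hand when no automatic rebalancing is available.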

As a result, Youzan tried replacing the Impala-based operators of Apache Doris with ClickHouse’s high-performance operator implementation, with the aim of building a better distributed OLAP database in the future. Judging from the POC results, the approach is feasible.

Video playback: www.bilibili.com/video/BV1h6…

PPT download: sq.163yun.com/resource/do…

Accelerate big data analysis with Intel Optane PMEM technology

Xu Cheng shared how to use Intel’s open source project Optimized Analytics Package (OAP) to accelerate Spark and Flink. The talk covered where the existing Spark framework can still be improved, in particular in memory management and the shuffle implementation, along with a number of further optimization points. It also covered how to take advantage of new hardware such as Intel Optane PMEM, making use of Optane’s distinctive properties: persistence, in-place writes, byte addressing, and low latency.

Xu Cheng explained the features of the OAP Analytic Cache, including the use of high-performance modules, cache awareness for Spark/Arrow/Flink, disaggregated cache, pushdown of Filter/Project/Aggregation, and support for the QAT high-performance compression accelerator. Taking Spark cache awareness as an example, OAP extends the existing Spark Data Source Scan to identify cached hot data blocks and uses the Cache Location Provider to provide cache awareness at the scheduling level. It supports multiple Cache Location Providers for different usage scenarios.
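The idea of cache awareness at the scheduling level can be sketched as follows. This is an illustration only and not the OAP API: the `CacheLocationProvider` interface, the `InMemoryProvider` class, the `pick_host` function, and the node and block names are all hypothetical.

```python
from typing import Protocol


class CacheLocationProvider(Protocol):
    """Hypothetical interface: report which hosts hold a cached copy of a block."""

    def hosts_for(self, block_id: str) -> list[str]: ...


class InMemoryProvider:
    """Toy provider backed by a dict; a real one might query an external store."""

    def __init__(self, locations: dict[str, list[str]]) -> None:
        self._locations = locations

    def hosts_for(self, block_id: str) -> list[str]:
        return self._locations.get(block_id, [])


def pick_host(block_id: str, idle_hosts: list[str], provider: CacheLocationProvider) -> str:
    """Prefer an idle host that already caches the block, else use any idle host."""
    if not idle_hosts:
        raise ValueError("no idle hosts available")
    cached = set(provider.hosts_for(block_id))
    for host in idle_hosts:
        if host in cached:
            return host
    return idle_hosts[0]


if __name__ == "__main__":
    provider = InMemoryProvider({"block-1": ["node-b"]})
    print(pick_host("block-1", ["node-a", "node-b"], provider))
    print(pick_host("block-2", ["node-a", "node-b"], provider))
```

Keeping the provider behind a small interface is what would allow several providers to be swapped in for different usage scenarios without changing the scheduling logic.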

Video playback: www.bilibili.com/video/BV1zb…

PPT download: sq.163yun.com/resource/do…

OAP open source address: github.com/oap-project…

The road to building the NetEase Cloud Music data warehouse

Lei Jianbo, data expert at NetEase Cloud Music, explained that NetEase Cloud Music is lowering the threshold for using data, improving the effectiveness of decision-making and data utilization, and driving business growth with data through a standardized, shared, and self-service unified data warehouse system. He shared the practice and thinking of NetEase Cloud Music in meeting these challenges, along with the results achieved, from two aspects: traffic data management and data asset accumulation.

Event tracking is a major pain point in traffic data management: tracking formats differ widely; specifications and requirement reviews are missing before tracking points are added; client-side tracking implementations lack sound technical design and engineering standards; and most aggregated traffic requests require a new JIRA ticket. NetEase Cloud Music governs tracking by establishing tracking specifications beforehand, rebuilding the tracking process during implementation, and promoting gray-release audits afterwards. In this process, NetEase Cloud Music and NetEase Shufan jointly built the easyTracker tracking management platform and the easyFetch self-service data platform to ensure standardized tracking and self-service traffic data.

Video playback: www.bilibili.com/video/BV1To…

PPT download: sq.163yun.com/resource/do…

Data product practice at NetEase Yanxuan

Gu Ping, big data product expert at NetEase Shufan, shared the data product practice of NetEase Yanxuan, where he built the data product system and the data middle platform system from 0 to 1. NetEase Yanxuan’s business is moving toward a dual-engine model of “data middle platform support + data product driven” that releases the value of data to support innovative business exploration. Based on Yanxuan’s business practice, Gu Ping shared the ideas and steps for building data products that cover the marketing and supply chain systems, and introduced experience in managing the data middle platform that supports them.

To support Yanxuan’s “brand + platform” operating model, Yanxuan’s data products cover three levels (digital operation, digital management, and digital supply) and comprise four data products: the commodity data operation platform, the marketing data operation platform, the mobile data workbench, and the supply chain data operation platform. Among them, the mobile data workbench was the first data product Yanxuan developed. It is mainly aimed at data-driven management by the management team and helped drive the successful construction of the data product system from the top down. According to Gu Ping, data products can connect with business systems to provide anomaly monitoring, diagnosis, and decision-making suggestions, but they cannot be realized without the support of the data middle platform. Building on NetEase’s capabilities, Yanxuan carried out the construction of its data system with high efficiency and quality.

Video playback: www.bilibili.com/video/BV1Bb…

PPT download: sq.163yun.com/resource/do…