At the recent NetEase Spark Technology Salon, co-hosted by NetEase Spark and Intel, the big data session featured Yao Qin, NetEase big data expert and Apache Spark Committer; Chen Qi, head of OLAP in the Youzan infrastructure group; Xu Cheng, senior software development engineering manager at Intel and Apache Hive Committer; Lei Jianbo, data expert at NetEase Cloud Music; and Gu Ping, big data product expert at NetEase. They shared their teams’ latest practices on topics such as Serverless Spark, ClickHouse, Spark/Flink acceleration, data warehouses and data products.

Kyuubi: Open source enterprise Serverless Spark framework

Yao Qin, a NetEase big data expert and Apache Spark Committer, shared the motivation behind the open source project Kyuubi, its design highlights and its use at NetEase. Kyuubi is a distributed JDBC service. First, it follows HiveServer2’s RPC implementation and supports multi-tenancy for Spark, making it an ideal platform for migrating Hive QL to Spark SQL. Second, it leaves the entire SQL compiler (compilation and optimization) and runtime (execution) to Spark, which yields excellent performance. Building on this framework, NetEase Spark combines some advanced features of Kyuubi and Spark to begin its journey toward Serverless Spark (Spark as a Service).

Because Kyuubi encapsulates Spark’s high-level API and delivers it through a client/server (C/S) architecture, users do not need to be aware of Spark’s concepts and framework and can focus on their own business and data. This makes big data directly accessible to more people and more business lines.
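The client/server idea can be illustrated with a toy gateway. This is a minimal sketch and not Kyuubi's actual implementation: the `Gateway` and `Engine` classes are hypothetical names, and the one-engine-per-user policy is an assumption made only for illustration. The client submits SQL text and a user name; which engine runs the statement is decided on the server side.

```python
from dataclasses import dataclass, field


@dataclass
class Engine:
    """Stand-in for a per-user Spark application (hypothetical)."""
    owner: str
    statements: list = field(default_factory=list)

    def run(self, sql: str) -> str:
        # A real engine would compile and execute the SQL; here we only record it.
        self.statements.append(sql)
        return f"[{self.owner}] executed: {sql}"


class Gateway:
    """Toy SQL gateway: clients send SQL text, the gateway picks the engine."""

    def __init__(self):
        self._engines = {}

    def execute(self, user: str, sql: str) -> str:
        # Create an engine the first time a user appears and reuse it afterwards,
        # so tenants never share an engine in this sketch.
        if user not in self._engines:
            self._engines[user] = Engine(owner=user)
        return self._engines[user].run(sql)

    def engine_count(self) -> int:
        return len(self._engines)


gateway = Gateway()
print(gateway.execute("alice", "SELECT 1"))
print(gateway.execute("bob", "SELECT 2"))
print(gateway.execute("alice", "SELECT 3"))
print(gateway.engine_count())
```

The caller never touches an engine object directly, which is the sense in which users stay unaware of the execution framework behind the service.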

Inside NetEase, Kyuubi has helped the NetEase Media business migrate Hive QL tasks smoothly to Spark SQL, reducing overall time consumption by 70% and improving overall performance by 727% while saving 50% of computing resources. In addition, the team is helping business lines migrate Spark jobs from YARN clusters to Kubernetes.

Video playback: https://www.bilibili.com/vide…

PPT download: https://sq.163yun.com/resourc…

Kyuubi open source address: https://github.com/NetEase/ky…

ClickHouse use and optimization at Youzan

Chen Qi, head of OLAP in the Youzan infrastructure group, introduced the use and optimization of ClickHouse at Youzan from three aspects: 1) The development of ClickHouse at Youzan, platform construction, and application scenarios, such as its implementation and optimization for DMP, SCRM and CDP. 2) Offline read-write separation at the 100-billion data scale: offline writes go to a cluster built temporarily on K8S, which separates reads from writes for offline data and addresses write-heavy, read-light workloads. 3) A POC exploring a new database that tries to integrate Doris and ClickHouse to solve the pain points of both.

According to Chen Qi, ClickHouse is not a distributed database in the traditional sense. Much of it is “manual”: in many places, such as writes and materialized views, users have to design their own processes around it. ClickHouse also cannot rebalance data automatically, which makes scaling a cluster out or in particularly complicated to operate. By comparison, Apache Doris is closer to a true distributed database and addresses some of these pain points with features such as automatic balancing and Shuffle Join support, but so far its single-table performance, maturity and stability are not as good as ClickHouse’s.
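Why the lack of automatic rebalancing complicates expansion can be seen in a small simulation. This is a sketch under stated assumptions, not ClickHouse code: the modulo routing in `shard_for`, the row counts, and the shard counts are all made up for illustration. Rows written before the expansion stay on the old shards, while new rows spread over all shards, so the new shard stays nearly empty until someone moves data by hand.

```python
from collections import Counter


def shard_for(key: int, shard_count: int) -> int:
    """Route a row to a shard by key modulo shard count (illustrative only)."""
    return key % shard_count


def simulate(existing_rows: int, new_rows: int,
             old_shards: int, new_shards: int) -> Counter:
    """Row count per shard after adding shards WITHOUT moving old data."""
    placement = Counter({shard: 0 for shard in range(new_shards)})
    # Rows written before the expansion stay where they were.
    for key in range(existing_rows):
        placement[shard_for(key, old_shards)] += 1
    # Rows written afterwards are spread over all shards.
    for key in range(existing_rows, existing_rows + new_rows):
        placement[shard_for(key, new_shards)] += 1
    return placement


counts = simulate(existing_rows=9000, new_rows=1200, old_shards=3, new_shards=4)
for shard, rows in sorted(counts.items()):
    print(shard, rows)
```

With these numbers the three original shards each hold 3300 rows and the added shard holds 300, which is the kind of skew an operator would have to correct manually.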

The team is therefore trying to replace the Impala-based operators in Apache Doris with ClickHouse’s high-performance operators, with the aim of building a better distributed OLAP database in the future. Judging by the results of the POC, the approach is feasible.

Video playback: https://www.bilibili.com/vide…

PPT download: https://sq.163yun.com/resourc…

Accelerate big data analytics using Intel Optane PMEM technology

Xu Cheng introduced the Optimized Analytics Package (OAP). The existing Spark framework still has room for improvement in memory management, Shuffle implementation and other areas. Spark can also be enhanced to exploit new hardware such as Intel Optane PMEM (persistent memory), taking advantage of Optane’s distinctive properties: persistence, in-place writes, byte addressability and low latency.

Xu Cheng focused on the features of the OAP analytic cache. These include the use of high-performance modules, cache awareness in Spark, Arrow and Flink, a disaggregated cache, pushdown of Filter/Project/Aggregation, and support for the high-performance compression accelerator QAT. Taking Spark cache awareness as an example, OAP extends the existing Spark Data Source Scan to recognize cached hot data blocks, and uses a Cache Location Provider to provide cache awareness at the scheduling level. Multiple cache location providers are supported for different usage scenarios.
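Cache awareness at the scheduling level can be sketched as follows. This is a toy model and not OAP's API: the class `InMemoryCacheLocationProvider`, the function `pick_host`, and the block and host names are hypothetical. The scheduler asks a provider which hosts already cache a data block and prefers one of those hosts when it is free.

```python
from typing import Dict, List, Optional, Set


class InMemoryCacheLocationProvider:
    """Hypothetical provider: maps a data block id to the hosts caching it."""

    def __init__(self, locations: Dict[str, Set[str]]):
        self._locations = locations

    def hosts_for(self, block_id: str) -> Set[str]:
        return self._locations.get(block_id, set())


def pick_host(block_id: str, free_hosts: List[str], provider) -> Optional[str]:
    """Prefer a free host that already caches the block; else any free host."""
    if not free_hosts:
        return None
    cached = provider.hosts_for(block_id)
    for host in free_hosts:
        if host in cached:
            return host
    # No free host has the block cached, so fall back to the first free host.
    return free_hosts[0]


provider = InMemoryCacheLocationProvider({"part-0": {"node-b"}})
print(pick_host("part-0", ["node-a", "node-b"], provider))
print(pick_host("part-1", ["node-a", "node-b"], provider))
```

Because `pick_host` only depends on the `hosts_for` method, a different provider (for example one backed by an external service) could be swapped in without changing the scheduling logic, which mirrors the idea of supporting multiple providers for different scenarios.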

Video playback: https://www.bilibili.com/vide…

PPT download: https://sq.163yun.com/resourc…

OAP open source address: https://github.com/oap-project/

The road to building the NetEase Cloud Music data warehouse

According to Lei Jianbo, data expert at NetEase Cloud Music, NetEase Cloud Music is lowering the threshold for using data, making data-driven decisions more effective and achieving data-driven business growth through a standardized, shared and self-service unified data warehouse system. He shared NetEase Cloud Music’s practice and thinking in meeting these challenges, and the results achieved, from two angles: traffic data governance and data asset accumulation.

In traffic data governance, event tracking (literally “buried points”) is a major pain point: tracking formats differ widely, there are no standards or requirement reviews before tracking is added, client-side tracking lacks sound technical design and engineering conventions, and most traffic aggregation requests require a new JIRA ticket. NetEase Cloud Music governs this by establishing a tracking specification beforehand, rebuilding the tracking process during implementation, and promoting grayscale audits afterwards. Along the way, NetEase Cloud Music built systems such as the EasyTracker tracking management platform and the EasyFetch self-service data retrieval platform together with NetEase Sketch, to ensure standardized tracking and self-service access to traffic data.
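What a tracking specification established in advance buys can be shown with a small validator. This is a minimal sketch under assumptions: `TRACKING_SPEC`, its event names and its fields are invented for illustration and are not EasyTracker's format. Events are checked against the agreed specification before they reach the warehouse, so format drift is caught early.

```python
from typing import Dict, List

# Hypothetical specification: event name -> required field names and types.
TRACKING_SPEC: Dict[str, Dict[str, type]] = {
    "song_play": {"user_id": str, "song_id": str, "duration_ms": int},
    "page_view": {"user_id": str, "page": str},
}


def validate_event(event: dict) -> List[str]:
    """Return a list of problems; an empty list means the event conforms."""
    name = event.get("event")
    spec = TRACKING_SPEC.get(name)
    if spec is None:
        return [f"unknown event: {name!r}"]
    problems = []
    for field_name, expected_type in spec.items():
        if field_name not in event:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(event[field_name], expected_type):
            problems.append(
                f"wrong type for {field_name}: expected {expected_type.__name__}"
            )
    return problems


good = {"event": "song_play", "user_id": "u1", "song_id": "s1", "duration_ms": 1000}
bad = {"event": "song_play", "user_id": "u1", "duration_ms": "long"}
print(validate_event(good))
print(validate_event(bad))
```

A check like this could run in a grayscale audit step after release as well as in client-side tests before release; where it runs is a design choice the talk summary does not specify.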

Video playback: https://www.bilibili.com/vide…

PPT download: https://sq.163yun.com/resourc…

NetEase Yanxuan data product practice

Gu Ping, a big data product expert at NetEase, shared NetEase Yanxuan’s data product practice. He built Yanxuan’s data product system and data middle platform system from 0 to 1. The Yanxuan business is moving toward a twin-engine model of “middle platform support + data product driven” that releases the value of data to support innovative business exploration. Drawing on Yanxuan’s actual business, Gu Ping covered the ideas and steps behind developing data products for the marketing and supply chain systems, and shared experience in managing the data middle platform and data support.

To support Yanxuan’s “brand + platform” operating model, Yanxuan data products cover three levels: digital operation, digital management and digital supply. They comprise four products: the commodity data operation platform, the marketing data operation platform, the mobile data workbench, and the supply chain data operation platform. The mobile data workbench was the first data product Yanxuan developed. It mainly serves data-driven decision-making by management, which helped drive the construction of the data product system from the top down. Gu Ping said that data products can connect to business systems to provide anomaly monitoring, diagnosis and decision suggestions, but none of this is possible without the support of a data middle platform. Relying on NetEase’s capabilities, Yanxuan built its data system efficiently and with high quality.

Video playback: https://www.bilibili.com/vide…

PPT download: https://sq.163yun.com/resourc…