1. Foreword

I worked on Sina's advertising DMP team, running MapReduce and writing Storm, Spark Streaming, and Hive jobs. As the technical leader of Alibaba's cash red envelope campaign, I used Blink to handle Double Eleven's traffic peak of 800,000 QPS and was responsible for issuing billions of cash red envelopes; a simplified, desensitized version of the project code was open-sourced on Aliyun as a best practice for Blink in the e-commerce field. In the task engine, a technical product supporting more than ten Alibaba teams and dozens of businesses, I also boldly abandoned traditional database and table sharding in favor of the advanced cloud-native multi-model database Lindorm, gaining rich experience in using it and stepping into its pits.

Based on this experience, I would like to briefly describe the evolution of the big data technology stack over the last five years and explain some of the key technologies mentioned.

This article is intended to be an overall overview. Some of the arguments are based on personal experience and are not industry-accepted conclusions.

2. Divisions of big data technology

3. Historical evolution of big data technology

3.1 Historical evolution of streaming Computing

At present there are three mainstream streaming computing frameworks: Storm/JStorm, Spark Streaming, and Flink/Blink.

Apache Storm is a distributed real-time big data processing system, designed to handle large amounts of data in a fault-tolerant, horizontally scalable way; it was once the streaming framework with the highest adoption rate. In Storm you design a real-time computation structure called a topology, which is then submitted to the cluster; the master node distributes code to worker nodes, and the worker nodes execute it. A topology has two roles: Spout and Bolt. A Spout emits the data stream as tuples, and a Bolt transforms the stream. JStorm is Alibaba's reimplementation of Apache Storm in Java, claimed to have four times Storm's performance. It stopped being updated in 2016.
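The Spout/Bolt roles can be sketched in plain Python (stand-in classes only, not the actual Storm API; class and variable names are illustrative):

```python
# Minimal sketch of Storm's topology model: a Spout emits tuples,
# a Bolt transforms them. Plain Python stand-ins, not the Storm API.

class SentenceSpout:
    """Emits tuples into the topology (stand-in for a Storm Spout)."""
    def __init__(self, sentences):
        self.sentences = sentences

    def emit(self):
        for s in self.sentences:
            yield (s,)  # Storm passes data around as tuples

class SplitBolt:
    """Transforms the incoming stream (stand-in for a Storm Bolt)."""
    def process(self, tup):
        for word in tup[0].split():
            yield (word,)

spout = SentenceSpout(["storm processes streams", "one tuple at a time"])
bolt = SplitBolt()
words = [w for tup in spout.emit() for (w,) in bolt.process(tup)]
print(words)  # ['storm', 'processes', 'streams', 'one', 'tuple', 'at', 'a', 'time']
```

In real Storm the framework handles distribution, acking, and fault tolerance between these two roles; the sketch shows only the data flow.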

Spark Streaming, an extension of the core Spark API, does not process data one event at a time as Storm does. Instead, it slices the stream into segments at fixed time intervals and processes each segment as a batch. Spark's abstraction over a continuous data stream is the DStream (Discretized Stream): a sequence of small-batch RDDs (Resilient Distributed Datasets). An RDD is a distributed dataset that supports parallel operation through arbitrary function transformations and sliding windows (window computing).
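The micro-batch idea behind DStreams can be sketched as follows (plain Python, not the Spark API; the timestamps and the 2-second interval are made up for illustration):

```python
# Sketch of Spark Streaming's core idea: instead of handling events one
# at a time, slice the stream into fixed time intervals and process each
# slice as a small batch (a DStream is a sequence of such mini-batch RDDs).

def discretize(events, interval):
    """Group (timestamp, value) events into mini-batches per interval."""
    batches = {}
    for ts, value in events:
        batches.setdefault(ts // interval, []).append(value)
    return [batches[k] for k in sorted(batches)]

events = [(0, "a"), (1, "b"), (2, "c"), (5, "d")]
print(discretize(events, interval=2))  # [['a', 'b'], ['c'], ['d']]
```

Each resulting mini-batch would then be processed with ordinary batch operators, which is why Spark Streaming's latency is bounded below by the batch interval.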

Apache Flink is a computing framework for both stream and batch data. It treats batch data as a special case of stream data, offers low latency (in milliseconds), and guarantees that messages are neither lost nor duplicated. Flink creatively unifies stream and batch processing by treating the input as an unbounded stream, with batch processing a special kind of stream whose input is bounded. A Flink program consists of two basic building blocks: Stream, an intermediate result, and Transformation, an operation that computes over one or more input streams and outputs one or more result streams.

A comparison of the three computing frameworks is as follows:

(Image source: blog.csdn.net/jiangjunsho…)

According to the above analysis, Flink/Blink is the mainstream framework in the current streaming computing field.

3.2 Evolution of Offline Computing

Offline computing is computing in which all input data is known before the calculation begins and does not change during it. Offline computing mainly includes frameworks such as Hadoop MapReduce, Spark, and Hive/ODPS.

To process data with Hadoop MapReduce, developers must develop and debug in Java, Python, or another language, writing the Map and Reduce functions themselves and optimizing the performance of the Map and Reduce phases on their own. The development threshold is high, and the framework gives developers little leverage. For performance optimization, when a small table is joined with a large table, the small table can be loaded into the cache first (via the MapReduce API); in addition, the Combine and Partition interfaces can be overridden to compress the intermediate data shipped from Map to Reduce and improve processing performance.
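To make the hand-written Map/Reduce burden concrete, here is a word-count sketch in plain Python that mimics the three phases (map, shuffle/sort, reduce); a real job would use the Hadoop Java or Streaming API:

```python
# The classic MapReduce word count: the developer writes the Map and
# Reduce functions; the framework would handle the shuffle/sort between.

from itertools import groupby
from operator import itemgetter

def map_phase(line):
    for word in line.split():
        yield (word, 1)            # emit one (key, 1) pair per word

def reduce_phase(key, values):
    return (key, sum(values))      # sum the counts for each key

lines = ["hello world", "hello mapreduce"]
# shuffle/sort: group intermediate pairs by key, as the framework would
pairs = sorted(kv for line in lines for kv in map_phase(line))
result = dict(reduce_phase(k, [v for _, v in g])
              for k, g in groupby(pairs, key=itemgetter(0)))
print(result)  # {'hello': 2, 'mapreduce': 1, 'world': 1}
```

Even for this trivial job, the developer owns both functions and the data representation between them, which is exactly the overhead Hive's SQL layer later removed.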

Spark is a MapReduce-like framework based on in-memory computing. In offline data processing, Spark SQL is generally used to clean data, with the target files usually stored in HDFS or NFS.

Hive is a data warehouse architecture built on the Hadoop file system that analyzes and manages data stored in HDFS. Hive implements a SQL layer on top of Hadoop, translating SQL into MapReduce jobs for execution. Data developers and analysts can thus run statistics and analysis over massive data in SQL without writing MapReduce in a programming language, lowering the threshold of data development.

In today's offline data processing industry, Alibaba's ODPS platform (Alibaba's internal offline processing platform) uses its own cluster to process PB-scale data every day, Huawei still runs cloud ETL on Hadoop clusters, and Bytedance's data platform also mainly uses Hive for offline computing.

Overall, Hive has the lowest learning cost and is the most widely used across companies.

3.3 Evolution of Columnar NoSQL Databases

NoSQL is a broad concept, covering key-value databases, document-oriented databases, wide-column/column-family stores, graph databases, and more. This section describes HBase, the most popular column-family store, and its successor Lindorm.

HBase is a column-family-oriented non-relational (NoSQL) database built on HDFS. HBase cleverly spreads large, sparse tables across clusters of commodity servers; a single table can contain billions of rows and millions of columns, and the cluster scales horizontally by simply adding nodes.

Lindorm is a new-generation distributed database for online processing of massive data: a cloud-native database service suitable for any scale and multiple data models. Built on storage-compute separation and multi-model sharing and fusion, Lindorm offers elasticity, low cost, stability, reliability, simplicity, openness, and a friendly ecosystem.

In general, Lindorm is an upgraded version of HBase with better performance and stability. If you need to serve online traffic from massive data, consider Lindorm.

3.4 Historical Evolution of Big Data Development Languages

Scala used to be the darling of big data development: Kafka, the industry's most popular messaging middleware, is written in Scala, as is Spark, the killer framework of the big data field. Scala's functional programming style, Lambda expressions that come naturally to processing large amounts of data, and concise, elegant syntactic sugar are favored by programmers with an extreme pursuit of code beauty, despite its steep learning curve.

In the past, the Kafka + Scala + Spark + Spark Streaming stack could handle both batch and stream processing. The emergence of Flink/Blink, with its unified stream-batch model and gentler learning curve, completely broke that arrangement, and the SQL language's share of big data processing has grown greatly.

At present, big data development languages flourish, each leading in its own area. SQL (for writing Flink/Blink and Hive tasks) is widely used in data warehouse construction and data analysis; the JVM family (mainly Java and Scala) plays an important role in the Hadoop ecosystem and is the first choice for data platform development; Python is extremely popular in artificial intelligence; R is a powerful tool for data modeling and visualization. Each language has its own application scenarios; choose according to your job and interests.

3.5 Big data learning Suggestions

Based on the above analysis of the evolution of the big data technology stack, students interested in data development are advised to learn the following big data components:

1) Streaming, real-time computing: Flink

2) Offline computing: Hive

3) Columnar NoSQL database: Lindorm

4. Introduction to the Cloud-Native Multi-Model Database Lindorm

4.1 Introduction to Lindorm

Lindorm is a cloud-native database service for workloads of any scale and multiple data models. It supports low-cost storage and processing of massive data with flexible pay-as-you-go pricing, and provides wide-table, time-series, search, file, and other data models. Compatible with open standard interfaces such as HBase, Cassandra, Phoenix, OpenTSDB, Solr, and SQL, it is one of the databases providing key support for Alibaba's core businesses.

Lindorm innovatively adopts a cloud-native architecture of storage-compute separation and multi-model sharing and fusion to meet the demands of resource decoupling and elastic scaling in the cloud computing era. The cloud-native storage engine LindormStore serves as a unified storage base, on top of which sit dedicated vertical multi-model engines: a wide-table engine, a time-series engine, a search engine, and a file engine. On top of these engines, Lindorm provides unified SQL access, cross-model joint queries, and multiple open standard interfaces (HBase/Phoenix/Cassandra, OpenTSDB, Solr, and HDFS) so that existing storage workloads can migrate seamlessly. Finally, a unified data-stream bus handles data flow and real-time change capture between engines, enabling data migration, real-time subscription, data-lake transfer, data-warehouse backflow, multi-active deployment across units, and backup and recovery.

For application scenarios that currently combine HBase+ElasticSearch or HBase+OpenTSDB+ES, such as monitoring, social networking, and advertising, Lindorm's native multi-model capability resolves pain points such as complex architecture, painful querying, weak consistency, high cost, and mismatched features, making business innovation more efficient.

The official introduction is rather obscure, so let me translate it into plain words: the Lindorm team originally worked on HBase at Alibaba. They made a series of optimizations on top of HBase, solving a series of HBase problems, and on that basis gradually developed their own multi-model engine, trying to unify storage and analysis technology and cover the full functionality of MySQL, HBase, time-series databases, and more. Lindorm supports both SQL-like and HBase-like access modes, benchmarking against MySQL and HBase respectively as an upgraded replacement for each.

Lindorm claims to be cloud native, but in my opinion it was not cloud native from birth: when we first used Lindorm, there was no such thing as a cloud-native database.

4.2 Comparison of Lindorm and MySQL

Over the past 10 years, with the rapid development of Internet technology, databases have seen explosive growth, with all kinds of products: document databases, column stores, NewSQL databases, and so on. The root cause is the rapid expansion of data volume; traditional databases cannot meet the performance demands of big data, so enterprises and developers build different databases for different application types to meet specific data processing needs.

The first advantage of Lindorm over MySQL is what I call storage scalability. A MySQL table is bounded in size, while Lindorm, which claims to support trillions of rows per table, tens of millions of concurrent requests, and hundreds of petabytes of storage, can basically be considered unbounded; its storage scalability easily supports horizontal scaling of tables.

The second advantage of Lindorm over MySQL is that it needs no database or table sharding. Remember how we handle massive data in MySQL? We shard databases and tables: we split tables vertically by business, losing the convenience of wide tables; in more complex cases we also split horizontally, partitioning the same table according to a sharding algorithm. Many profound and obscure problems in the database field are introduced by sharding: distributed transactions, horizontal scaling, cross-database joins, cross-shard result-set unions, historical data cleanup, and so on.

So, what if you were told that at the database level you no longer need to handle most distributed transactions, no longer need to worry about horizontal scaling, and can query, sort, count, and group across would-be shard keys just as easily as with a single table?
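A toy sketch of why sharding complicates queries, assuming a simple hash(user_id) % N routing scheme (all names and the shard count are illustrative):

```python
# Sketch of why sharding complicates queries: rows are routed by
# user_id % N, so any query not keyed on user_id must fan out to
# every shard and merge results in the application.

N = 4
shards = [[] for _ in range(N)]

def hash_key(user_id):
    return user_id % N  # toy shard-routing function

def insert(user_id, amount):
    shards[hash_key(user_id)].append((user_id, amount))

for uid, amt in [(1, 10), (2, 20), (5, 30), (6, 40)]:
    insert(uid, amt)

# Query by shard key: only one shard is touched.
total_user_1 = sum(a for u, a in shards[hash_key(1)] if u == 1)

# Cross-shard aggregate: every shard touched, merged by the application.
grand_total = sum(a for shard in shards for _, a in shard)
print(total_user_1, grand_total)  # 10 100
```

With a single (logically unbounded) Lindorm table, the second query needs no application-side fan-out and merge, which is the convenience being described above.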

4.3 Comparison between Lindorm and HBase

Advantages of HBase: high read/write performance, batch import, no sharding required, storage-compute separation, low cost, and high flexibility.

As time went on, HBase's disadvantages also surfaced:

  • Row Key design is complex and there are no constraints on data types; queries work only on the primary key, which poorly supports complex business scenarios
  • Read/write latency spikes hurt the business's perceived experience
  • The client logic is heavy and its CPU load high; the client must connect directly to ZooKeeper and fetch routing information from the Meta table, making bugs hard to troubleshoot
  • Active/standby cluster switchover cannot guarantee consistency, and the standby cluster only receives replication traffic, wasting resources

Because of these shortcomings, we rarely dared to use HBase for truly online services, despite its fame. Lindorm, as an upgrade to HBase, changes all that. See the Lindorm website for a performance comparison: help.aliyun.com/document_d…

In short, Lindorm mainly solves HBase's read/write latency spikes and its lack of secondary indexes.

4.4 Pitfalls from Lindorm in Practice

4.4.1 Lindorm Secondary index

The real power of Lindorm is that, on top of massive data, it supports secondary indexes, something HBase cannot do. HBase supports queries only on the Row Key dimension and can only partially simulate secondary indexes through clever Row Key design.

But secondary indexes are also exactly where Lindorm stumbles: every secondary index you create copies the data once more, so Lindorm's operations and development colleagues recommend using secondary indexes with caution.

4.4.2 Lindorm Deep Paging

One great advantage of Lindorm is that it avoids sharding even at massive data scale, so naturally we ran many queries across what would have been shard keys (for example, with a sharded database it is hard to serve random spot checks and cross-user statistics online, and you can only run the data offline; with Lindorm you can). However, Lindorm does not support arbitrary paging queries the way Hive's API does, because it suffers from the deep-paging problem under large data volumes.

Fetching rows 50,001 to 50,020 out of 100,000 is extremely inefficient, because the first 50,000 rows must be scanned before the 20 rows you need are returned. This is because Lindorm implements select * from tableName limit offset, n much like an HBase scan: it scans the full data by Row Key and then returns the requested slice.

In practice, at an offset of around 30,000, request RT already exceeded the 3s timeout (standard storage, 2019 performance data). So do not simply treat Lindorm as a Hive-like wide table; you need a set of best practices for handling deep paging.
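One common best practice is keyset pagination (the "seek method"): instead of LIMIT offset, n, remember the last Row Key returned and resume the scan just after it. A toy sketch of the cost difference (plain Python; the row-key format is illustrative, and the list lookup stands in for what a real store does with an index seek):

```python
# Deep paging: LIMIT offset, n must scan and discard `offset` rows first
# (like an HBase scan by Row Key), while keyset pagination resumes from
# the last key seen and scans only n rows.

rows = [f"row{i:06d}" for i in range(100_000)]  # sorted by row key

def page_by_offset(offset, n):
    scanned = rows[:offset + n]     # everything before `offset` is wasted work
    return scanned[offset:], len(scanned)

def page_by_keyset(last_key, n):
    # resume the scan just after last_key; no rows are discarded
    start = 0 if last_key is None else rows.index(last_key) + 1
    return rows[start:start + n], n

page1, cost1 = page_by_offset(50_000, 20)
page2, cost2 = page_by_keyset("row049999", 20)
assert page1 == page2
print(cost1, cost2)  # 50020 20
```

Both approaches return the same 20 rows, but the scan cost differs by three orders of magnitude at this depth, which matches the RT blow-up described above.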

4.4.3 Lindorm Region Splitting

Like HBase, Lindorm gracefully splits regions as data grows. However, this splitting is not a cure-all: if Row Keys or requests are concentrated in a narrow range, Lindorm will still hit data skew and hotspot regions. Fortunately, regions can be split manually, and Row Keys can be re-hashed into a more reasonable distribution.
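A common manual remedy, described here as a general wide-table technique rather than taken from Lindorm's documentation, is to "salt" sequential Row Keys with a hash-derived bucket prefix so that consecutive writes spread across regions (the bucket count and key format are illustrative):

```python
# Hotspot mitigation by Row Key salting: prefix sequential keys with a
# stable hash-derived bucket so consecutive writes land in different
# regions instead of piling onto one.

import hashlib

BUCKETS = 4  # illustrative; tune to the number of regions/nodes

def salted_key(row_key: str) -> str:
    # stable bucket derived from a hash of the original key
    bucket = int(hashlib.md5(row_key.encode()).hexdigest(), 16) % BUCKETS
    return f"{bucket:02d}_{row_key}"

keys = [f"order_{i}" for i in range(8)]
salted = [salted_key(k) for k in keys]
print(salted)
```

The trade-off: point reads must recompute the salt from the original key, and full range scans must fan out over all BUCKETS prefixes and merge the results.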

4.5 Lindorm Q&A

Q: Where do HBase's read/write latency spikes come from? How does Lindorm eliminate them?

A: HBase is written in Java, and a major source of latency spikes is JVM GC. On one hand, Lindorm optimizes at the HBase source-code level to greatly reduce object creation at runtime and thus reduce GC pressure. On the other hand, the team worked with Alibaba's internal JVM team to adopt a GC algorithm better suited to Lindorm: the Lindorm team was one of the first two teams within Alibaba to use ZGC, running it fully for Lindorm since at least 2019, when ZGC was still experimental.

Q: Does Lindorm have ACID characteristics?

A: Lindorm supports ACID for single-row operations, as MySQL does, but it does not support cross-row transactions.

4.6 Lindorm summary

Lindorm can fully replace HBase, and all previous HBase scenarios can be migrated to it. Migrating from MySQL, by contrast, is not recommended by default; consider Lindorm only if the following conditions are met:

  • Truly massive data volume (a billion rows or more) with unpredictable growth
  • Range queries are not needed, or are used only on primary-key and secondary indexes (similar to HBase SCAN)
  • No multi-row transactions are required

I do not recommend Lindorm for trading orders or financial services, although the official website does show such examples. Lindorm's performance-storage access latency is as low as 0.2ms to 0.5ms (help.aliyun.com/document_d…)

5. Introduction to the Big Data Computing Engine Flink

5.1 Introduction to Flink

In some overseas communities, people divide big data computing engines into four generations.

The first generation was of course Hadoop MapReduce. It divides computation into two phases, Map and Reduce, and upper-layer applications must hand-write the map and reduce tasks.

The second generation, such as Tez, supports DAGs (directed acyclic graphs) and mainly targets batch tasks.

The third generation, such as Spark, features DAG support within jobs and real-time computing.

The fourth generation, represented by Flink and Blink, unifies batch and streaming, supports DAG computation, and pushes real-time performance further still.

Since the Google Dataflow model was proposed, stream-batch unification has become the mainstream trend for distributed computing engines. It means a single engine combines the low latency of streaming with the high throughput and stability of batch, providing one programming interface for both scenarios while guaranteeing consistent underlying execution semantics. Stream-batch unification greatly reduces development and maintenance costs for users, but it poses a great challenge to engine builders. Spark was among the first engines to propose the concept, but because its streaming is implemented as mini-batches underneath, its streaming semantics and latency make it hard to meet the extreme requirements of complex, large-scale real-time scenarios.

Flink follows the Dataflow model's idea that batch processing is a special case of stream processing. However, for reasons of execution efficiency, resource requirements, and complexity in batch scenarios, Flink's early design separated the streaming and batch APIs even though the runtime underneath was streaming. This lets Flink still apply batch optimization techniques at the execution level and simplify the architecture by dropping features unnecessary in batch mode, such as watermarks and checkpoints.

5.2 Introduction to Blink

Blink is Alibaba's internally developed version of Flink. In 2015, when selecting a new big data computing engine for the search and recommendation business, the Alibaba Cloud real-time computing team already had preliminary ideas about stream-batch unification as a technical direction. After in-depth research, feasibility verification, and reasoning about possible future problems, the team decided to adopt Flink. Although the Flink system was not particularly mature at the time, the team believed its streaming-first design best matched the general trend toward real-time data computing. As an old saying at Alibaba goes, "on the right path, fear no distance." Judging from the following years, Flink has indeed progressed smoothly, even beyond the team's expectations at the time.

For a long time, Blink's features and performance were far ahead of the community version of Flink. After Alibaba acquired Ververica, Flink's founding company, in 2019, it invested nearly 100 engineers in Flink development and community work. Blink had made many optimizations in Table/SQL; to merge these advanced features back into Flink, Alibaba's engineers pushed the community to rebuild the Table module architecture and promote the Table/SQL API to a first-class programming API.

In Tmall's core data scenarios on November 11, 2020, facing a transaction peak of 583,000 orders per second and a wireless traffic peak of over 100 million per second, all Tmall tasks achieved second-level latency, and the real-time computing cluster peaked at 4 billion records processed per second. Cluster resource utilization also improved greatly, with batch tasks executed during off-peak hours.

5.3 Comparison between Flink and Blink

(Image source: help.aliyun.com/document_d…)

5.4 Using Flink

5.4.1 Creating a data source table

Flink data source tables can be backed by both streaming and offline storage, and both can even serve as sources within the same Flink task. Streaming sources include dozens of types such as Kafka, Alibaba Cloud SLS logs, the DataHub data bus, and RocketMQ/MetaQ; offline sources include Hive and ODPS.

CREATE TABLE dwd_tb_trd_pay_ri (
    biz_order_id VARCHAR, -- order ID
    auction_id VARCHAR, -- goods ID
    auction_title VARCHAR, -- goods title
    buyer_nick VARCHAR, -- buyer nickname
    pay_time VARCHAR, -- payment time
    gmt_create VARCHAR, -- creation time
    gmt_modified VARCHAR, -- modification time
    biz_type VARCHAR, -- transaction type
    pay_status VARCHAR, -- payment status
    attributes VARCHAR, -- order tags
    from_group VARCHAR, -- order source
    div_idx_actual_total_fee DOUBLE -- payment amount
) WITH (
    type = 'datahub',
    endPoint = 'http://dh-cn-hangzhou.aliyun-inc.com',
    project = 'yourProjectName', -- your project
    topic = 'yourTopicName', -- your topic
    roleArn = 'yourRoleArn', -- your role ARN
    batchReadSize = '500'
);

5.4.2 Creating a Data Result Table

Flink results can be written to dozens of storage engines, including message queues (such as Kafka, RocketMQ/MetaQ, DataHub), databases (such as MySQL, Oracle), NoSQL stores (Redis, HBase, Lindorm), and log services (SLS).

CREATE TABLE tddl_output (
    gmt_create VARCHAR, -- creation time
    gmt_modified VARCHAR, -- modification time
    buyer_id BIGINT, -- buyer ID
    cumulate_amount BIGINT, -- amount
    effect_time BIGINT, -- payment time
    PRIMARY KEY (buyer_id, effect_time)
) WITH (
    type = 'rds',
    url = 'yourDatabaseURL', -- your database URL
    tableName = 'yourTableName', -- your table name
    userName = 'yourUserName', -- your username
    password = 'yourDatabasePassword' -- your password
);

5.4.3 Writing Service Logic

Business logic is written in SQL, with access to built-in functions, window functions, and user-defined functions (UDF, UDAF, UDTF).

INSERT INTO tddl_output
SELECT
    gmt_create,
    gmt_modified,
    buyer_id,
    div_idx_actual_total_fee
from dwd_tb_trd_pay_ri
where div_idx_actual_total_fee >0;   

5.4.4 Performance Tuning

During the development of a real-time computing job, once the business logic is implemented and the job is online and running, the job must be tuned to meet real-time performance requirements.

Job tuning mainly involves SQL optimization and parameter tuning. Blink's auto-configuration feature lowers the threshold for parameter tuning to some extent, but its results are generally below manual tuning. Tuning has a huge impact on performance and throughput: I have used it to quadruple a job's throughput at the same resource cost.

5.5 Flink advantages

  • Very low learning cost: with guidance from an experienced colleague, you can become familiar with the syntax and write a demo within a day.
  • Stream-batch unification: learn one set of SQL and you can develop both stream and batch jobs on Flink, lowering the development threshold.

6. Big Data at Bytedance

Although Bytedance has not contributed many open source frameworks in the big data field, in my experience its big data engineering strength is very strong. Its engineers have developed many data platforms and database products, supporting the rapid growth of popular apps such as Douyin, TikTok, and Toutiao.

6.1 Data Platform

Bytedance's data platform horizontally supports all of the company's business lines, including Toutiao, Douyin, Xigua (Watermelon) Video, and education, and can extend to external companies to solve EB-scale big data problems. The data platform comprises multiple products, including Fengshen, Dorado, and Libra.

Fengshen is an agile BI platform developed by the data platform itself, providing flexible and easy-to-use query analysis services and report making capabilities. The data content of Fengshen comes from various business lines and central departments. Rich and diversified content and flexible and powerful platform capabilities provide indispensable data support for the rapid development of business.

Dorado is a big data development platform integrating data integration, data development, task scheduling, and operations management. It provides a one-stop big data development solution, helping business departments build their own data mid-platform simply and efficiently and focus on mining and exploring data value. With Dorado you can write scheduling tasks for offline and streaming data, such as offline data synchronization and storage conversion, as well as development tasks for Hive, Spark, and Flink.

Libra is an A/B testing platform that provides large-scale online experimentation and statistical evaluation capabilities. Covering recommendation, advertising, search, UI, product-feature, and other experimental scenarios, it supports businesses in rapid trial-and-error iteration: bold assumptions, careful verification.

Fengshen and the other data platforms are built on ClickHouse underneath. The open source ClickHouse is a columnar database management system for online analytical processing (OLAP) that lets users interactively analyze multidimensional data from multiple angles. Bytedance's developers maintain a deeply optimized version of ClickHouse that provides enhanced query and write performance over massive data; applications include multidimensional analysis of massive data, machine-learning model evaluation, and microservice monitoring and statistics.

6.2 Storage System

Bytedance has developed a number of storage systems spanning SQL, NoSQL, NewSQL, and other database types, including products such as ByteKV, ByteSQL, and ByteGraph.

ByteKV is a distributed key-value database system developed by Bytedance that supports distributed transactions with strong consistency for online scenarios. It suits workloads with large data volumes (TBs to tens of PBs) that require strong consistency, transactions, and sequential scans, such as payments, virtual coins, metadata, orders, goods, and content.

ByteSQL is a high-performance, high-concurrency distributed table storage service built on ByteKV by Bytedance's infrastructure team, targeting strongly consistent online transaction processing (OLTP). It suits workloads with large single tables (TBs to tens of PBs) that require strong consistency, transactions, and global secondary indexes.

ByteGraph is a distributed graph data storage system developed by Bytedance. It supports the directed property graph data model and the Gremlin graph query language, with read/write throughput scalable to tens of millions of QPS at millisecond latency. ByteGraph currently runs on 200+ clusters worldwide, supporting almost all of the company's product lines, including Toutiao, Douyin, TikTok, Xigua, Huoshan, risk control, and knowledge graphs, and is widely used in social, PUGC, recommendation, risk-control, and other business scenarios.

These are just a few of Bytedance's in-house database products; many have yet to be publicized, so I will not go into further detail here.

7. Conclusion

This article has briefly discussed the changes in the big data technology stack over the past five years, supplemented with what I have seen and heard in actual work. Limited by personal knowledge and space, it may show only a thousandth of the charm of big data technology. Data, algorithms, and computing power have greatly enriched our lives and driven rapid economic and social development. As a veteran of the big data era I feel deeply honored, and I hope more developers will join in, push the technological singularity of artificial intelligence closer, and together witness an unprecedented revolution in science and technology.

References

  1. Three streaming computation frameworks: blog.csdn.net/jiangjunsho…
  2. Comparison of four data processing approaches: traditional ETL tools, MapReduce, Hive, Spark: www.sohu.com/a/230352341…

**Author:** Wang Weiqiang