The concept of the "data middle platform" has become popular recently, and there are many different definitions of it. But no matter how you define it, a sound data technology architecture is essential. Understanding the location, function, and meaning of each part of these architectures not only gives us a better grasp of the scope and boundaries of data products, of what technology can help us achieve, and of how to achieve it better; many of the design concepts behind the technology also help us understand the world and reason about complex systems. This article therefore aims to sort out the common open-source technical solutions on the market, along with the principles and application scenarios behind them, to help product managers gain a general and comprehensive understanding of the big data technology stack.

Generally speaking, we divide the whole data pipeline into four links: data acquisition and transmission, data storage, data computation and query, and finally data visualization and analysis. The framework diagram is as follows:

1. Data acquisition and transmission

This typically corresponds to a company’s logging platform, where data is collected and cached somewhere for consumption by subsequent computing processes.

Different data sources call for different collection methods, from app/server logs to business tables, as well as various APIs and data files. Log data gets "special care" because of its large volume, diverse structures, and complex environments. Frameworks currently on the market for log collection include Flume, Logstash, Filebeat, Fluentd, and Rsyslog. Let us look at the first two, which are the most widely used.

Flume is a real-time log collection engine developed by Cloudera. It is a highly available, highly reliable, distributed system for collecting, aggregating, and transmitting massive logs. Flume supports customizing all kinds of data senders in the logging system to collect data, and it also supports simple processing of the data before writing it to various data receivers. There are currently two versions, OG and NG. Its main characteristics:

  1. Focuses on data transmission, with an internal mechanism to prevent data loss; suited to critical log scenarios
  2. Written in Java; it has no rich plug-in ecosystem and mainly relies on secondary development
  3. Configuration is cumbersome, and monitoring data is only exposed through a monitoring port

Logstash is an open-source data collection engine from Elastic. It can dynamically unify data from different sources to a destination, to be analyzed with Elasticsearch and displayed with Kibana; it is the "L" in the famous ELK stack. Its main characteristics:

  1. No internal persistent queue, so some data may be lost in abnormal situations
  2. Written in Ruby, so it requires a Ruby runtime environment; rich plug-in ecosystem
  3. Simple configuration, with an emphasis on data preprocessing, which makes later analysis convenient

Judging from their design philosophies, Flume was not originally built to collect logs but to transmit data into HDFS, which is fundamentally different from Logstash. It therefore focuses on data transmission and safety, and requires more secondary development and configuration. Logstash, by contrast, clearly focuses on preprocessing log data first, paving the way for subsequent parsing. It is easy to use with the ELK stack and feels more like a ready-made boxed lunch, good to go out of the box.

1.2 How Log Collection Works

Let’s take Flume as an example to talk about how the log collection Agent works.

Flume consists of three components: Source, Channel, and Sink, corresponding to the three stages of collecting, caching, and saving data.

The Source component collects data from various types of sources, such as directories, HTTP endpoints, Kafka, etc. The Channel component caches the data, with memory, JDBC, and Kafka channel types available. Finally, the Sink component saves the data, with support for HDFS, HBase, Hive, Kafka, and others.

The following uses a real-time big data processing system to illustrate the important role Flume plays in practice. The overall architecture of the real-time processing system is as follows: an Agent is deployed on each web server; as soon as new log data appears, it is picked up by the Flume process, delivered to a Kafka topic, and then handed off to a series of subsequent operations.

1.3 Data Transmission Kafka

Kafka was originally developed by LinkedIn and open-sourced in early 2011; it graduated from the Apache Incubator on October 23, 2012. The goal of the project is to provide a unified, high-throughput, low-latency platform for processing real-time data. Its persistence layer is essentially a "massive publish/subscribe message queue built on a distributed transaction log architecture," which makes it valuable as enterprise-grade infrastructure for handling streaming data.
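To make the publish/subscribe model concrete, here is a minimal sketch using the third-party kafka-python client; the broker address and the topic name `app_log` are assumptions for illustration only.

```python
# Minimal Kafka producer/consumer sketch using the kafka-python client.
# Broker address and topic name are placeholders for illustration.
from kafka import KafkaProducer, KafkaConsumer

# Producer: an agent (e.g. Flume or an app server) pushes log lines to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("app_log", b'{"uid": 1001, "event": "page_view"}')
producer.flush()  # block until buffered records are delivered

# Consumer: a downstream job (e.g. a stream processor) reads from the same topic.
consumer = KafkaConsumer(
    "app_log",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the oldest available record
    consumer_timeout_ms=5000,      # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)
```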

2. Data storage

In terms of database storage, products can be classified along three dimensions: single-machine vs. distributed, relational vs. non-relational, and row-oriented vs. column-oriented storage. Each dimension has corresponding products that meet the needs of particular scenarios.

When the amount of data is small, a stand-alone database is generally adopted, such as MySQL, which is widely used and technically mature. Once the data volume reaches a certain scale, a distributed system must be adopted. The best-known one in the industry today is the Hadoop ecosystem under the Apache Foundation, which can be regarded as the classic model of storage and computation in the big data era.

HDFS

As the distributed file system in Hadoop, HDFS provides highly reliable underlying storage support for HBase and Hive; it is the open-source counterpart of Google's GFS and is commonly used in batch analysis scenarios.
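As a small illustration of how applications read and write HDFS, here is a sketch using the WebHDFS-based `hdfs` Python package; the NameNode URL, user, and paths are assumptions for illustration.

```python
# Sketch of writing and reading a file on HDFS through WebHDFS,
# using the third-party `hdfs` Python package. Host, port, user and
# paths are placeholders for illustration.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hadoop")

# Write a small text file (overwrite if it already exists).
client.write("/tmp/demo/log.txt",
             data="2019-01-01\tpage_view\tuid=1001\n",
             encoding="utf-8", overwrite=True)

# Read it back.
with client.read("/tmp/demo/log.txt", encoding="utf-8") as reader:
    print(reader.read())

# List the directory contents.
print(client.list("/tmp/demo"))
```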

HBase

HBase is the Hadoop database: a column-oriented, non-relational database running on top of HDFS. It has the random read and write capability that HDFS lacks, which makes it suitable for real-time analysis. Modeled on Google's Bigtable, HBase stores data in key-value form and can quickly locate and access the required rows among billions of rows of data spread across many hosts.
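To illustrate this key-value, random read/write style of access, here is a sketch using the happybase Python client, which talks to HBase through its Thrift gateway; the Thrift host, the table `user_profile`, its column family `info`, and the row keys are all assumptions for illustration.

```python
# Sketch of random writes and reads against an HBase table through the
# HBase Thrift gateway, using the happybase client. The table and column
# family are assumed to exist already.
import happybase

connection = happybase.Connection("hbase-thrift-host", port=9090)
table = connection.table("user_profile")

# Random write: rowkey -> {column_family:qualifier: value}, all bytes.
table.put(b"uid_1001", {b"info:city": b"Shanghai", b"info:age": b"28"})

# Random read by rowkey -- the capability HDFS alone lacks.
row = table.row(b"uid_1001")
print(row[b"info:city"])

# Scan a small range of rowkeys.
for key, data in table.scan(row_start=b"uid_1000", row_stop=b"uid_1010"):
    print(key, data)

connection.close()
```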

Hive and Pig

Hive and Pig are both query layers integrated on top of Hadoop, providing dynamic queries over static data. Under the hood, queries are compiled into MapReduce programs, sparing users the tedium of writing MR code by hand. The difference is that Hive's HiveQL is a SQL-like query language that requires the data to be stored in tables, while Pig is a data-flow-oriented language that is often used to develop concise scripts to transform data streams for embedding in larger applications.
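As a small example of the "SQL compiled into MapReduce" idea, here is a sketch that runs a HiveQL aggregation from Python through HiveServer2 using the PyHive library; the host, database, table, and columns are assumptions for illustration.

```python
# Sketch of running a HiveQL query from Python via HiveServer2,
# using the PyHive library. Hive compiles the query into MapReduce
# jobs underneath; host, table and columns are placeholders.
from pyhive import hive

conn = hive.Connection(host="hiveserver2-host", port=10000, database="default")
cursor = conn.cursor()

# A typical offline report: page views per day from a raw log table.
cursor.execute(
    """
    SELECT dt, COUNT(1) AS pv
    FROM app_log
    WHERE event = 'page_view'
    GROUP BY dt
    ORDER BY dt
    """
)
for dt, pv in cursor.fetchall():
    print(dt, pv)

conn.close()
```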

MapReduce

MR pioneered distributed computing at scale, making it possible to process huge volumes of data. In simple terms, it splits a large computing task into smaller pieces and then aggregates the results, improving computing efficiency. For example, suppose you are renovating your new home and need to buy many things from different places; shopping alone would take you ten days. Instead, you call up a group of friends (distributed), each of whom is responsible for buying the things from one place (Map), and finally they all bring them to your home (Reduce), so it can be done in a day.
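The shopping analogy maps directly onto the classic word-count example. The sketch below simulates the Map and Reduce phases in plain Python (no Hadoop required), just to show the shape of the computation.

```python
# Word count expressed in MapReduce style, simulated in plain Python.
# map_phase emits (word, 1) pairs; the shuffle groups pairs by key;
# reduce_phase sums the counts for each word.
from collections import defaultdict

def map_phase(line):
    # "Each friend buys things in one place": emit one (key, 1) per word.
    for word in line.split():
        yield word, 1

def reduce_phase(word, counts):
    # "Bring everything home": aggregate all partial results for a key.
    return word, sum(counts)

lines = ["hello hadoop", "hello spark", "hadoop and spark"]

# Shuffle: group the mapped pairs by key, as the framework would do.
grouped = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        grouped[word].append(count)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)  # {'hello': 2, 'hadoop': 2, 'spark': 2, 'and': 1}
```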

Other auxiliary tools

ZooKeeper provides coordination services and failover support, while Sqoop provides convenient import of RDBMS (relational database) data into Hadoop, which facilitates migrating data from traditional databases into HBase.

It is worth noting that the Hadoop ecosystem is essentially built on the three papers Google published in the early 2000s (on GFS, MapReduce, and Bigtable). Perhaps Google released them to lift the then-backward state of the industry, letting the rest of us follow a few steps behind. After all these years, it is hard to say how far Google's internal understanding and use of data has advanced.

3. Data calculation & query

3.1 Batch and stream computing

Big data processing scenarios can be divided into batch processing and stream processing, corresponding to offline analysis and real-time analysis respectively. Common framework categories are:

  1. Batch-only framework: Hadoop MapReduce
  2. Stream-only frameworks: Storm, Samza
  3. Hybrid frameworks: Spark, Flink

Due to space constraints, and since the Hadoop ecosystem has already been covered above, let us briefly introduce Spark and Flink:

3.2 Spark and Flink

Apache Spark is a next-generation batch processing framework that includes streaming capabilities.

Unlike MapReduce, Spark processes all data in memory in batch mode, which greatly improves computing performance. In stream processing mode, Spark Streaming implements a concept called micro-batching: a data stream is treated as a series of very small "batches" that can be processed with the native semantics of the batch engine. This works very well in practice, but it still falls short of a true stream processing framework in terms of performance.
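To make the micro-batch idea concrete, here is a sketch of a streaming word count using PySpark's Structured Streaming (which also processes the stream as a series of small batches), adapted from the standard socket example; the socket host and port are assumptions for illustration.

```python
# Streaming word count with Spark Structured Streaming (micro-batch model).
# Reads lines from a local socket (placeholder source), splits them into
# words, and keeps a running count; each trigger processes one micro-batch.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("MicroBatchWordCount").getOrCreate()

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")   # re-emit full counts after each micro-batch
         .format("console")
         .start())
query.awaitTermination()
```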

In summary, Spark is the best choice for handling tasks with diverse workloads. Spark’s batch capability provides an unmatched speed advantage at the expense of higher memory footprint. Spark Streaming is a good solution for workloads that value throughput rather than latency.

Flink, as a newer-generation processing framework with faster computation and lower latency, has been gradually gaining ground. However, adopting a framework, especially an open-source one, requires a long period of running, testing, and optimization, and big data technology keeps evolving along with the open-source community. In the near future, it is believed that Flink will gradually become the mainstream of big data processing, just as Spark replaced Storm.

3.3 Data Query

After the data has been processed, an efficient query engine is still needed before users can access and use it. OLAP query technology frameworks can be broadly divided into three categories:

  1. Pre-aggregation based on HBase: for example, OpenTSDB and Kylin. A pre-aggregated metric must be specified, and the aggregation is computed as data is loaded. This suits fixed, multi-dimensional business reports
  2. Columnar storage based on Parquet: for example Presto, Drill, Impala, etc., which essentially perform parallel computation entirely in memory. The Parquet format reduces storage space and improves I/O efficiency. These systems mainly target offline processing, so real-time data writes are hard to improve
  3. External indexes based on Lucene: for example Elasticsearch and Solr, which can cover far more query scenarios than traditional database storage. However, for log and behavioral time-series data, every search request still has to hit all shards, and support for aggregation-heavy analysis is also a weak point

Let us take Presto, Druid, and Kylin as examples to describe their characteristics:

  1. Presto: open-sourced by Facebook, Presto is a distributed SQL query framework that can federate queries across Hive, HBase, and relational databases. Its execution model is fundamentally different from Hive's and does not use MapReduce; because all of its processing happens in memory (similar to Spark above), it is an order of magnitude faster than Hive in most scenarios (a small query sketch follows this list)
  2. Druid: open-sourced by MetaMarkets, Druid is a distributed, column-oriented, near-real-time analytical data store with data latency within about five minutes. It maintains query and analysis performance over massive data sets under high concurrency, and provides querying, analysis, and visualization of massive real-time data
  3. Kylin: Cube pre-computation is its core technology. The basic idea is to build multi-dimensional indexes of the data in advance, so that queries only scan the indexes instead of the raw data, which speeds them up. The drawback is that whenever a dimension is added or removed, the historical data in the Cube must be recomputed, which is very time-consuming. Kyligence claims to have solved this problem at its recent product launch; we shall wait and see
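As referenced in the Presto item above, here is a minimal sketch of sending an ad-hoc query to a Presto coordinator from Python using the PyHive library; the coordinator host, catalog, schema, and table are assumptions for illustration.

```python
# Sketch of an ad-hoc query against a Presto coordinator using PyHive.
# Coordinator host, catalog, schema and table are placeholders; the same
# SQL could run over Hive, HBase or MySQL data through Presto connectors.
from pyhive import presto

cursor = presto.connect(
    host="presto-coordinator",
    port=8080,
    catalog="hive",
    schema="default",
).cursor()

cursor.execute(
    "SELECT event, COUNT(*) AS cnt FROM app_log "
    "GROUP BY event ORDER BY cnt DESC LIMIT 10"
)
for event, cnt in cursor.fetchall():
    print(event, cnt)
```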

The following figure is taken from Kuaishou's evaluation of OLAP technology selection, for your reference:

Most of the time there is no clear boundary between computation and query; the split into two parts here is just for convenience. In fact, for technically strong teams, these open-source systems can be mixed and tuned, for example using Kylin's pre-computation capability plus Druid's query engine to speed up queries, and so on.

4. Data visualization and analysis

For data visualization, three approaches are generally adopted to display data. The most basic is to use open-source chart libraries directly, such as Highcharts and D3 from abroad, and Baidu's ECharts and Ali's AntV libraries (G2, G6, F2) at home. One level up are open-source visualization frameworks from well-known companies, such as Airbnb's Superset, Redash, and Metabase. These frameworks generally cover data-source access, self-service report building, and report organization and display, and are relatively easy to adopt. The top layer is commercial visualization software, such as Tableau and Qlik from abroad, and FineReport and Yonghong BI at home. They cost money, but they offer richer visualizations and some technical support, making them a good choice for companies that do not have the energy to build visualization themselves.

4.1 Chart Libraries

To understand the open-source charting ecosystem, we first need to understand the native capabilities of SVG and Canvas. SVG stands for Scalable Vector Graphics; like HTML, it has its own namespace and uses XML tags to draw. Canvas, a new tag introduced in HTML5, is used for client-side graphics drawing and exposes a JavaScript-based drawing API.

D3.js stands for Data-Driven Documents and supports both SVG and Canvas. Compared with other products it is lower level and does not pre-package chart types. Developers can easily manipulate the DOM through D3's rich class library and draw any graphic they want, covering more comprehensive visualizations at the cost of higher development complexity.

Highcharts, from abroad, is a chart library built on SVG, while ECharts and G2 are both based on Canvas. ECharts ships with a complete set of charts and works out of the box, whereas G2 is a data-driven graphical grammar based on visual encoding, with a high degree of usability and extensibility. On top of G2, Ali later wrapped a React-based library, BizCharts, which focuses on e-commerce business chart visualization, distills the visual specifications of its e-commerce business lines, and implements common and custom charts in React projects.

For a comparison of ECharts and G2, we can borrow the words of the ECharts author: G2 is flour and ECharts is noodles; both are small but beautiful.
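For teams working in Python rather than JavaScript, ECharts can also be driven through the pyecharts wrapper; the sketch below renders a simple bar chart to a standalone HTML file. The data and titles are made up for illustration.

```python
# Render a simple ECharts bar chart from Python using the pyecharts wrapper.
# The data and titles are placeholders; render() writes a self-contained
# HTML file that embeds ECharts.
from pyecharts import options as opts
from pyecharts.charts import Bar

bar = (
    Bar()
    .add_xaxis(["Mon", "Tue", "Wed", "Thu", "Fri"])
    .add_yaxis("page views", [1200, 1500, 900, 1800, 2100])
    .set_global_opts(title_opts=opts.TitleOpts(title="Daily PV"))
)
bar.render("daily_pv.html")
```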

4.2 Visual Framework

Here we mainly introduce the two best-known frameworks in the industry, Superset and Metabase. The former is the more complete solution: it supports connecting different data sources to build the corresponding metrics, and then visualizing them with a rich set of chart types. It excels at time-series analysis, supporting moving averages, period-over-period offsets, and other analysis methods, and its deep integration with Druid lets it slice large data sets quickly. Its drawbacks are that it does not support grouping reports, which becomes troublesome once there are many of them; it provides no chart drill-down or linkage; and its permission management is unfriendly.

Metabase focuses on the experience of non-technical users (such as product managers and operations staff), giving them the freedom to explore data and answer their own questions, and its interface is more aesthetically pleasing. Its permission management is relatively complete, and charts and data can even be shared with people who have no account. Dashboards support categorization, which makes report management easier. Its disadvantages: time-series analysis does not support comparisons across different dates and requires custom SQL, and only one database can be queried at a time, which makes operation tedious.

As for data mining, it is currently dominated by commercial companies, which train and refine their own learning models through in-depth cooperation with specific industries; it will not be covered here.

To close, I would like to thank everyone who shares knowledge on the Internet. The links to the main references are listed below. My background is not technical, and all of the material here was compiled from online sources; it is the Internet's spirit of free sharing that makes this article possible. Special thanks as well to my friend Jun, a data technology lead at Zhuan, for his help. If there are any mistakes, please leave a comment.