It has been 10 years since the term “big data era” was coined. More and more enterprises have finished building their big data platforms, and with the explosion of the mobile Internet and the Internet of Things, the value of big data is being mined in more and more scenarios. The barrier to building a big data platform keeps getting lower: with the power of open source, any organization with basic R&D capability can build its own. However, people who are unfamiliar with the concepts of big data platforms, data warehouses, and data mining may struggle to complete the build, because a search on Baidu turns up so much material that it is hard to know what to choose. This article walks through how a big data platform is put together.

Architecture overview

Generally, the architecture of a big data platform looks like the diagram above, covering everything from external data collection to data processing, data presentation, and applications.

Data collection

Users' access to our products generates a large volume of behavior logs, so we need a dedicated log collection system to gather and ship them. Flume, provided by Cloudera, is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of logs. Flume supports customized data senders in the log system for data collection, and it can also apply simple processing to the data before writing it to various data receivers.

For data that is not used in real time, Flume can upload it directly to HDFS in the cluster. For data that is used in real time, Flume + Kafka can be used: the data goes straight into a message queue, and Kafka delivers it to the real-time computing engine for processing.
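
As a minimal sketch of the real-time path described above (assuming the kafka-python client, and with a purely illustrative broker address and topic name), a downstream consumer reading the log events that Flume pushes into Kafka might look like this:

```python
# Minimal sketch: consume behavior logs that Flume has pushed into Kafka.
# Assumptions: kafka-python is installed; the broker address and the
# topic name "app_logs" are illustrative, not from the article.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "app_logs",                          # hypothetical topic fed by Flume
    bootstrap_servers="kafka01:9092",    # hypothetical broker address
    group_id="realtime-etl",             # consumer group for the streaming job
    auto_offset_reset="latest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Hand the event off to the real-time computing engine / downstream logic.
    print(event.get("user_id"), event.get("action"))
```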

The data volume in a business database is much smaller than the volume of access logs. Non-real-time data is periodically imported into HDFS or Hive. A common tool for this is Sqoop, which moves data between Hadoop and relational databases: it can import data from a relational database (MySQL, Oracle, Postgres, etc.) into HDFS, and export HDFS data back to a relational database. For real-time database synchronization, Canal can be used as middleware to capture database logs (such as MySQL binlog), process them, and synchronize the changes into the big data platform's storage in real time.
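
A hedged sketch of the batch path: a daily Sqoop import of a MySQL table into Hive, wrapped in a small Python script. The JDBC URL, credentials, and table names are assumptions for illustration, not values from the article:

```python
# Sketch: run a daily Sqoop import of a MySQL table into a Hive table.
# The JDBC URL, credentials, and table names are illustrative assumptions.
import subprocess

sqoop_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db01:3306/shop",  # hypothetical business database
    "--username", "etl_user",
    "--password-file", "/user/etl/.mysql_pass",  # avoid plain-text passwords
    "--table", "orders",                         # hypothetical source table
    "--hive-import",
    "--hive-table", "ods.orders",                # hypothetical Hive ODS table
    "--num-mappers", "4",
]

subprocess.run(sqoop_cmd, check=True)
```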

Data storage

No matter which large-scale computing engine the upper layers use, the underlying data storage system is mainly HDFS. HDFS (Hadoop Distributed File System) is the core sub-project of the Hadoop project and the foundation of data storage and management in distributed computing, with high fault tolerance, high reliability, and high throughput.

HDFS stores data as one text file after another, which is convenient for analysis and statistics. Therefore, on top of HDFS, Hive is used to map the data files to a structured table schema, so the data can later be queried and managed with SQL-like statements.
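
As a minimal sketch of that mapping (the HDFS path, database, table, and column names are illustrative), a Hive external table can be declared over raw log files from a SparkSession with Hive support:

```python
# Sketch: map raw log files on HDFS to a queryable Hive table.
# The HDFS path, database, table, and column names are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-mapping-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS ods")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS ods.app_logs (
        user_id STRING,
        action  STRING,
        ts      STRING
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/logs/app_logs'
""")

# Once partitions are registered, the raw files can be queried with SQL.
spark.sql("""
    SELECT action, COUNT(*) AS cnt
    FROM ods.app_logs
    WHERE dt = '2020-01-01'
    GROUP BY action
""").show()
```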

Data processing

Data processing is what is usually called ETL. At this stage we need three things: a computing engine, a scheduling system, and metadata management.

Currently, Hive and Spark are the common engines for large-scale non-real-time computation. Hive is built on MapReduce; it is stable and reliable but slow. Spark computes in memory and is generally faster than MapReduce, but it is demanding on memory and can run out of it. Spark is compatible with Hive data sources. For stability, it is recommended to use Hive as the primary engine for daily ETL, especially for data without strict latency requirements, and to bring in engines such as Spark where the scenario calls for them.
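
A hedged sketch of one daily batch ETL step, written here with PySpark although the same SQL could run on Hive; the table names and partition value are assumptions:

```python
# Sketch: daily batch ETL that aggregates raw logs into a daily summary table.
# Table names (ods.app_logs, dws.daily_active_users) and the dt value are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("daily-etl-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

dt = "2020-01-01"  # in practice, injected by the scheduler
spark.sql(f"""
    INSERT OVERWRITE TABLE dws.daily_active_users PARTITION (dt = '{dt}')
    SELECT user_id, COUNT(*) AS event_cnt
    FROM ods.app_logs
    WHERE dt = '{dt}'
    GROUP BY user_id
""")
```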

Real-time computing engines have gone through three generations: Storm, Spark Streaming, and Flink. Alibaba acquired Data Artisans, the company behind Flink; the major companies are pushing it hard, the community is very active, and there are plenty of resources in China.
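
For illustration, here is a minimal streaming sketch using Spark Structured Streaming, one of the three generations mentioned above (the article itself leans towards Flink); the broker address and topic name are assumptions:

```python
# Sketch: count log events per minute from the Kafka topic fed by Flume.
# Uses Spark Structured Streaming for illustration; broker address and
# topic name are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("realtime-sketch").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka01:9092")
    .option("subscribe", "app_logs")
    .load()
    .selectExpr("CAST(value AS STRING) AS raw", "timestamp")
)

# Count events per one-minute window; a real job would parse the JSON payload first.
counts = events.groupBy(F.window("timestamp", "1 minute")).count()

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```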

For the scheduling system, the lightweight Azkaban is recommended; it is an open-source batch workflow task scheduler from LinkedIn. azkaban.github.io/

You typically need to develop your own metadata management system to plan the metadata in the data warehouse and the ETL process. Metadata is divided into business metadata and technical metadata.

  • Business metadata is used to support the various business condition selectors on the data service platform's Web UI, for example: mobile device model, brand, carrier, network, price range, device physical features, and application name. Some of this metadata comes from standard libraries provided by the underlying data team, such as brands and price ranges, and can be synchronized or read directly from the corresponding data tables. Some time-sensitive metadata, such as application information, needs to be regenerated daily through ETL. It is stored in MySQL to support application computation; for the data presented as filter conditions on the page, Redis is used: a daily/monthly job turns the data in MySQL into key-value pairs that are easy to query and writes them to Redis (a minimal sketch of this step follows the list).
  • Technical metadata mainly includes model descriptions, data lineage, change records, requirement sources, and model field information in the data warehouse.
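
A minimal sketch of the MySQL-to-Redis step referenced in the business metadata item above; the connection details, table name, and key pattern are illustrative assumptions:

```python
# Sketch: turn business metadata stored in MySQL into key-value pairs in Redis
# so the Web UI can look up filter options quickly. Connection details, the
# table name (meta_app_info) and the key pattern are illustrative assumptions.
import json

import pymysql
import redis

mysql_conn = pymysql.connect(host="db01", user="etl_user", password="***", database="meta")
r = redis.Redis(host="redis01", port=6379, db=0)

with mysql_conn.cursor() as cursor:
    cursor.execute("SELECT app_id, app_name, category FROM meta_app_info")
    for app_id, app_name, category in cursor.fetchall():
        # One key per application; the UI reads these to populate its selectors.
        r.set(f"meta:app:{app_id}", json.dumps({"name": app_name, "category": category}))

mysql_conn.close()
```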

Data flow

The diagram above shows how data flows through acquisition, processing, and presentation. In practice, the path from the data source to an analysis report or application usually consists of data acquisition and synchronization, data warehouse storage, ETL, statistical analysis, and writing the results into the application database for indicator display. That is the most basic pipeline. On top of the data warehouse there is also data analysis and mining, which uses machine learning and deep learning to further mine and analyze the existing model data, producing deeper data application products.

Data applications

As the saying goes, “even good wine fears a deep alley.” Why do all this groundwork before we get to data applications? For an enterprise, everything we do has to demonstrate value, and data applications are where the value of big data is realized. They include report indicators that support business analysis, personalized recommendations in a mall app based on user profiles, various data analysis reports, and so on.

Good data applications must be visualized; many traditional enterprises simply buy commercial BI tools such as FanRuan's. From the open-source community, the visualization tool Superset is recommended: it offers many chart types, supports many data sources, and is easy to use. Redash was recently acquired by Databricks, which wants to use it to round out its own big data processing platform. Clearly, visualization matters a great deal for demonstrating the value of enterprise data.

In closing

This article gives a preliminary understanding of how a big data platform works, which technology stacks are involved, and how data flows through it. That alone is not enough to build a big data platform from 0 to 1. After understanding the process, you still need hands-on practice: building a Hadoop cluster and a Spark cluster, constructing the data warehouse, standardizing the data analysis process, and so on.
