Let’s start by looking at the diagram below, which shows the architecture of one company’s big data platform; most platforms of this kind look broadly similar:

From the perspective of the overall architecture, the core layers of a big data platform are the **data acquisition layer, data storage and analysis layer, data sharing layer, and data application layer**. The names may vary from company to company, but the roles are essentially the same.

So, following the clues in this architecture diagram, let’s look at what the core technologies of big data consist of.

1. Data collection

The task of data collection is to gather data from the various data sources and store it in the data store; some simple cleaning may be done along the way.

There are many types of data sources:

  • Website logs:

In the Internet industry, website logs account for the largest share of data. The logs are spread across multiple log servers, and a Flume agent is typically deployed on each server to collect the logs in real time and write them to HDFS (a Flume configuration sketch follows this list).

  • Business databases:

Business databases come in many varieties, such as MySQL, Oracle, and SQL Server, so a tool is needed that can synchronize data from these databases to HDFS. Sqoop is one option, but it is heavyweight: no matter how small the data, it launches a MapReduce job to do the work, and every machine in the Hadoop cluster must be able to reach the business database. For this scenario, Taobao’s open-source DataX is a good solution (a DataX job sketch follows this list), and if you have the resources, secondary development on top of DataX works very well.

With some configuration and development, Flume can also synchronize database data to HDFS in real time.

  • FTP/HTTP data sources:

Data provided by some partners may need to be fetched periodically over FTP or HTTP; DataX can meet this requirement as well.

  • Other data sources:

For example, manually entered data can be handled simply by providing an input interface or a small program.
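
To make the website-log path concrete, here is a minimal sketch of a Flume agent configuration, assuming an Nginx access log and placeholder host names and paths; check the exact property names against the Flume release you deploy:

```properties
# Minimal Flume agent sketch: tail a web log and write it to HDFS.
# Hostnames and paths are placeholders.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: follow the web server's access log (exec source is the simplest option)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/nginx/access.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: roll files into date-partitioned HDFS directories
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/logs/website/%Y%m%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```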
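
And for the business-database path, a sketch of what a DataX job pulling a MySQL table into HDFS might look like. All connection details, tables, and columns are invented for illustration, and the mysqlreader/hdfswriter parameters should be verified against your DataX version:

```json
{
  "job": {
    "setting": { "speed": { "channel": 3 } },
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "etl_user",
          "password": "******",
          "column": ["id", "user_id", "amount", "created_at"],
          "connection": [{
            "table": ["orders"],
            "jdbcUrl": ["jdbc:mysql://db-host:3306/shop"]
          }]
        }
      },
      "writer": {
        "name": "hdfswriter",
        "parameter": {
          "defaultFS": "hdfs://namenode:8020",
          "path": "/warehouse/ods/orders",
          "fileName": "orders",
          "fileType": "text",
          "fieldDelimiter": "\t",
          "writeMode": "append",
          "column": [
            {"name": "id", "type": "bigint"},
            {"name": "user_id", "type": "bigint"},
            {"name": "amount", "type": "double"},
            {"name": "created_at", "type": "string"}
          ]
        }
      }
    }]
  }
}
```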

2. Data storage and analysis

Undoubtedly, HDFS is the standard data storage solution for a data warehouse/data platform in a big data environment.

For offline data analysis and computation, that is, the part without strict real-time requirements, Hive is in my opinion the first choice: it has rich data types and built-in functions; its ORC file format offers a very high compression ratio; and its convenient SQL support makes statistical analysis of structured data far more efficient than raw MapReduce. A requirement that one SQL statement can satisfy might take hundreds of lines of MR code to develop.
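
To make the “one SQL versus hundreds of lines of MR” comparison concrete, here is a minimal HiveQL sketch: an ORC-backed table plus a daily PV/UV aggregation. The table and column names are invented for illustration:

```sql
-- ORC-backed log table (names are placeholders)
CREATE TABLE IF NOT EXISTS ods_weblog (
  user_id STRING,
  url     STRING,
  ts      STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- Daily PV/UV per URL: one statement instead of a hand-written MR job
SELECT dt,
       url,
       COUNT(*)                AS pv,
       COUNT(DISTINCT user_id) AS uv
FROM ods_weblog
WHERE dt = '2016-01-01'
GROUP BY dt, url;
```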

Of course, the Hadoop framework also provides the MapReduce interface. If you really enjoy Java development, or are not comfortable with SQL, you can still use MapReduce for analysis and computation.

Spark has become very popular in the past two years. In practice, its performance is much better than MapReduce, and it works well with Hive and Yarn, so Spark and Spark SQL must be supported for analysis and computation. Because Hadoop Yarn is already in place, adopting Spark is very easy; there is no need to deploy a separate Spark cluster.
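
As a sketch of how easily Spark SQL rides on the existing cluster, here is a minimal PySpark program (Spark 2.x style) that runs the same aggregation against the Hive table assumed above; in practice the master is usually set via spark-submit rather than in code:

```python
from pyspark.sql import SparkSession

# Reuse the existing Yarn cluster and the Hive metastore;
# no standalone Spark cluster is required.
spark = (SparkSession.builder
         .appName("daily-uv")
         .master("yarn")  # normally passed as: spark-submit --master yarn
         .enableHiveSupport()
         .getOrCreate())

# The same aggregation as the Hive example, executed by Spark SQL
df = spark.sql("""
    SELECT dt, url, COUNT(*) AS pv, COUNT(DISTINCT user_id) AS uv
    FROM ods_weblog
    WHERE dt = '2016-01-01'
    GROUP BY dt, url
""")
df.show()
spark.stop()
```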

3. Data sharing

Data sharing here refers to the place where the results of the preceding analysis and computation are stored; in practice, this means relational databases and NoSQL databases.

The results computed by Hive, MR, Spark, and Spark SQL still reside in HDFS, but most business systems and applications cannot read HDFS directly, so a data sharing layer is needed from which services and products can easily obtain data. Mirroring the path from the data acquisition layer into HDFS, you need a tool to synchronize data from HDFS to the target data stores; again, DataX can do this (see the sketch below).
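
For illustration, a sketch of the reverse-direction DataX job, using the hdfsreader and mysqlwriter plugins to push a result file into a MySQL report table; as before, every path, table, and column name is a placeholder and the parameters should be checked against your DataX version:

```json
{
  "job": {
    "setting": { "speed": { "channel": 1 } },
    "content": [{
      "reader": {
        "name": "hdfsreader",
        "parameter": {
          "defaultFS": "hdfs://namenode:8020",
          "path": "/warehouse/ads/daily_uv/*",
          "fileType": "text",
          "fieldDelimiter": "\t",
          "column": ["*"]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "etl_user",
          "password": "******",
          "writeMode": "replace",
          "column": ["dt", "url", "pv", "uv"],
          "connection": [{
            "table": ["daily_uv"],
            "jdbcUrl": "jdbc:mysql://db-host:3306/report"
          }]
        }
      }
    }]
  }
}
```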

In addition, some real-time computation results may be written directly into the data sharing layer by the real-time computing module.

4. Data application

  • Business products (CRM, ERP, etc.)

The data used by business products already exists in the data sharing layer and can be read from it directly.

  • Reports (FineReport, business reports)

As with business products, the data used by reports is generally pre-aggregated and stored in the data sharing layer.

  • Ad-hoc query

Ad-hoc queries have many users: data developers, website and product operations staff, data analysts, and even department heads all need to query data on an ad-hoc basis.

Such ad-hoc queries usually cannot be satisfied by the existing reports and the data sharing layer, and have to go directly to the data storage layer.

Ad-hoc queries are generally issued in SQL, and the biggest difficulty is response speed: Spark SQL responds much faster than Hive while remaining compatible with it.

Of course, if you don’t mind having one more framework in your platform, you can also use Impala.

  • OLAP

Currently, many OLAP tools cannot read data directly from HDFS; they all synchronize the required data into a relational database first and perform OLAP there. When data volumes are large, however, a relational database cannot cope.

In this case, you need to do the relevant development yourself to fetch data from HDFS or HBase and provide the OLAP functionality, for example an interface that retrieves data from HBase based on the dimensions and metrics selected by the user (see the HBase sketch at the end of this list).

  • Other data interfaces

These include both generic and custom interfaces. For example, an interface that reads user attributes from Redis is generic: every business line can call it to obtain user attributes (a sketch follows below).
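
As a sketch of the OLAP interface described above, here is what fetching pre-aggregated metrics from HBase by user-selected dimension might look like with the happybase client; the table layout and row-key scheme are pure assumptions:

```python
import happybase  # third-party HBase client (pip install happybase)

# Assumed layout: row key = "<dt>|<dimension-value>", one column family "m"
# holding pre-aggregated metrics written by the offline jobs.
connection = happybase.Connection("hbase-thrift-host")
table = connection.table("olap_cube")

def query_metric(dt: str, dimension: str, metric: str) -> int:
    """Return one pre-aggregated metric for a date/dimension combination."""
    row = table.row(f"{dt}|{dimension}".encode())
    value = row.get(f"m:{metric}".encode(), b"0")
    return int(value)

print(query_metric("2016-01-01", "channel=ad", "pv"))
```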
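
And a minimal sketch of the generic Redis user-attribute interface, using the standard redis-py client; the key naming scheme is an assumption:

```python
import redis  # pip install redis

# Assumed key scheme: user attributes stored as a hash at "user:<id>",
# populated by the data sharing layer.
r = redis.Redis(host="redis-host", port=6379, decode_responses=True)

def get_user_attributes(user_id: str) -> dict:
    """Generic interface: any business line can call this to read user attributes."""
    return r.hgetall(f"user:{user_id}")

print(get_user_attributes("10001"))
```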

5. Real-time computing

Businesses increasingly demand real-time data from the warehouse, for example a real-time view of a website’s overall traffic, or real-time exposure and click counts for an advertisement. With massive data volumes, traditional databases and traditional implementations cannot keep up; what is needed is a distributed real-time computing framework with high throughput, low latency, and high reliability. Storm is the more mature option in this area, but I chose Spark Streaming, for the simple reason that I did not want to introduce yet another framework into the platform. Spark Streaming’s latency is slightly higher than Storm’s, which is negligible for our needs.

Spark Streaming is currently used to implement real-time website traffic statistics and real-time advertising effectiveness statistics.

Flume collects website logs and ad logs on the front-end log servers and sends them to Spark Streaming in real time; Spark Streaming computes the statistics and stores the results in Redis, from which services can read them in real time.
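
A minimal sketch of that pipeline using the DStream API (the Flume integration, pyspark.streaming.flume, shipped with Spark releases up to 2.2): counting page views per micro-batch and pushing the running total to Redis. Hosts, ports, and key names are assumptions:

```python
import redis
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.flume import FlumeUtils  # shipped with Spark <= 2.2

sc = SparkContext(appName="realtime-traffic")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Flume pushes events to this host:port (Avro sink on the Flume side)
stream = FlumeUtils.createStream(ssc, "spark-worker-host", 41414)

# Each Flume event is assumed to be one log line; count events per batch
counts = stream.map(lambda event: 1).reduce(lambda a, b: a + b)

def save_to_redis(rdd):
    # Runs on the driver for each batch; increment the running PV counter
    for total in rdd.collect():
        redis.Redis(host="redis-host").incrby("site:pv", total)

counts.foreachRDD(save_to_redis)
ssc.start()
ssc.awaitTermination()
```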

6. Task scheduling and monitoring

A data warehouse/data platform runs a great many programs and tasks, such as data collection tasks, data synchronization tasks, and data analysis tasks.

Besides running on a schedule, these tasks also have complex dependencies. For example, a data analysis task can only start after the corresponding data collection task has completed, and a data synchronization task can only start after the data analysis task has completed.

This calls for a solid task scheduling and monitoring system, which, as the center of the data warehouse/data platform, is responsible for scheduling and monitoring the allocation and execution of all tasks (a minimal dependency sketch follows).
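
To illustrate the dependency-driven behavior just described, here is a toy Python sketch with invented task names; a real scheduling system would add retries, alarms, and monitoring on top of this ordering guarantee:

```python
# Toy dependency-driven scheduler: run each task only after its
# dependencies have completed. Task names are illustrative.
deps = {
    "collect_logs":  [],
    "analyze_logs":  ["collect_logs"],   # analysis waits for collection
    "sync_to_mysql": ["analyze_logs"],   # sync waits for analysis
}

def run(task: str, done: set):
    if task in done:
        return
    for dep in deps[task]:          # finish all upstream tasks first
        run(dep, done)
    print(f"running {task}")        # a real scheduler submits, retries, alarms
    done.add(task)

finished: set = set()
for task in deps:
    run(task, finished)
```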