Keywords: Internet, big data, data warehouse, data platform, architecture

Contents: overall architecture, data acquisition, data storage and analysis, data sharing, data application, real-time computing, task scheduling and monitoring, metadata management, summary.

I have wanted to sort out this topic for a long time. Since this is a ramble, I will simply write down whatever comes to mind. I have always worked in the Internet industry, so everything below is written from that perspective.

First, let me list the uses of a data warehouse and data platform in the Internet industry:

- Integrate all of the company's business data and build a unified data center;
- Provide all kinds of reports, some for top management and some for the individual business lines;
- Provide data support for website operations, that is, let the operations team understand the effect of the website and products in a timely manner through data;
- Provide online or offline data support for each business line, and become the company's unified platform for exchanging and providing data;
- Analyze user behavior data and, through data mining, reduce input costs and improve results, for example targeted and precise advertising, personalized recommendations, and so on;
- Develop data products that benefit the company directly or indirectly;
- Build an open data platform and open up the company's data.

The uses listed above look similar to those of a traditional-industry data warehouse, and they all require the data warehouse/data platform to be very stable and reliable. In the Internet industry, however, besides the large data volume, more and more business requirements demand timeliness, often even real time. On the other hand, business in the Internet industry changes very quickly: you cannot build the data warehouse top-down once and for all, as a traditional industry can. New businesses must be able to enter the data warehouse quickly, and old, retired businesses must be easy to take offline from the existing data warehouse.

In fact, the data warehouse in the Internet industry is the so-called agile data warehouse: it must respond quickly not only to data but also to business changes.

Building an agile data warehouse requires not only the right architecture and technology but also, very importantly, the right approach to data modeling. If you start out trying to establish a data model compatible with all data and all businesses, you are back to traditional data warehouse construction, and it becomes difficult to respond quickly to business changes. In practice, it is usually best to do in-depth modeling only for the core, long-lived businesses first (for example, a website statistical analysis model and a user browsing-trajectory model based on website logs, or a user model based on the company's core user data), while other businesses generally use the "dimension + wide table" approach to build their data models; a rough sketch of that approach follows below. Modeling in depth, however, is a topic for another time.
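To make "dimension + wide table" concrete, here is a minimal, illustrative SparkSQL sketch that joins a fact table with two dimension tables into a single wide table; all database, table, and column names are hypothetical, not the actual model.

```python
from pyspark.sql import SparkSession

# Hypothetical example: build a wide table by joining a fact table with
# dimension tables, so downstream business queries need no further joins.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS dw.order_wide STORED AS ORC AS
    SELECT o.order_id,
           o.order_time,
           o.amount,
           u.user_id,
           u.city     AS user_city,          -- user dimension
           p.product_id,
           p.category AS product_category    -- product dimension
    FROM   ods.fact_order o
    LEFT JOIN dim.dim_user    u ON o.user_id    = u.user_id
    LEFT JOIN dim.dim_product p ON o.product_id = p.product_id
""")
```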

The overall architecture

Here is a diagram of the data platform architecture we currently use; it is similar to that of most companies:

Data warehouse, data platform architecture diagram

Logically, there are generally a data acquisition layer, a data storage and analysis layer, a data sharing layer, and a data application layer. They may go by different names, but the roles are essentially the same.

Let's look at them from the bottom up:


Data acquisition

The task of the data acquisition layer is to collect data from the various data sources into the data storage layer, possibly doing some simple cleaning along the way.

There are many types of data sources:

Website logs: in the Internet industry, website logs account for the largest share. The logs are stored on multiple log servers, and a Flume agent is usually deployed on each log server to collect the logs in real time and store them in HDFS.

Business databases: there are various kinds of business databases, such as MySQL, Oracle, SQL Server, and so on. What is needed here is a tool that can synchronize data from these databases into HDFS. Sqoop is one option, but Sqoop is too heavy: regardless of the data volume it launches a MapReduce job, and it requires every machine in the Hadoop cluster to be able to access the business database. For this scenario, DataX, open-sourced by Taobao, is a very good solution (see the article "Massive Data Exchange Tool of Heterogeneous Data Sources - Taobao DataX Download and Use"). If you have the resources, you can do secondary development on top of DataX; the DataHub we currently use is exactly that.

Flume can also, through configuration and some development, synchronize data from the database to HDFS in real time.

Data sources from FTP/HTTP: data provided by some partners may need to be fetched periodically over FTP/HTTP; DataX can also meet this requirement.

Other data sources: for example, some manually entered data; providing an interface or a small program is enough to handle it.

Data storage and analysis

HDFS is undoubtedly the ideal data storage solution for a data warehouse/data platform in a big data environment.

For offline data analysis and computing, that is, the part with loose real-time requirements, Hive is in my opinion still the first choice: rich data types, built-in functions, the ORC file format with a very high compression ratio, and very convenient SQL support. This makes statistical analysis of structured data in Hive far more efficient to develop than in MapReduce: a requirement that takes a single SQL statement might need hundreds of lines of MR code.

Of course, the Hadoop framework also provides a MapReduce interface. If you really enjoy Java development, or are not familiar with SQL, you can also use MapReduce for analysis and computing.

Spark has become very popular over the past two years. In practice its performance is much better than MapReduce, and it works well with Hive and YARN, so the platform must support analysis and computation with Spark and SparkSQL. Because Hadoop YARN is already in place, using Spark is very easy: there is no need to deploy a separate Spark cluster. For details about Spark on YARN, see the "Spark On Yarn" series of articles.
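As a rough illustration of both points, the single-SQL style of analysis and Spark running on the existing YARN cluster, here is a hedged PySpark sketch; the table and column names are hypothetical.

```python
from pyspark.sql import SparkSession

# Minimal sketch: SparkSQL on YARN querying an existing Hive table.
# Table/column names (dw.web_log, uid, dt ...) are hypothetical.
spark = (SparkSession.builder
         .appName("daily-pv-uv")
         .master("yarn")            # run on the existing Hadoop YARN cluster
         .enableHiveSupport()       # read Hive metastore tables directly
         .getOrCreate())

# One SQL statement replaces what would otherwise be a long MapReduce job:
daily_stats = spark.sql("""
    SELECT dt,
           COUNT(1)            AS pv,
           COUNT(DISTINCT uid) AS uv
    FROM   dw.web_log
    WHERE  dt = '2015-10-01'
    GROUP  BY dt
""")
daily_stats.show()
```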

The real-time computing part is explained separately later.

Data sharing

The data sharing layer here is simply the place where the results of the previous analysis and computing step are stored, which in practice means relational databases and NoSQL databases.

The results of Hive, MR, Spark, and SparkSQL analysis and computation still sit in HDFS, but most business systems and applications cannot obtain data directly from HDFS. Therefore a data sharing layer is needed so that business systems and products can obtain the data easily.

In the opposite direction from the data acquisition layer, you need a tool that synchronizes data from HDFS to the various target data sources; DataX can also do this.
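A DataX export job is described by a JSON configuration rather than code. As an illustrative alternative for the same step, pushing computed results out of HDFS/Hive into a relational database, here is a hedged PySpark sketch that writes a result table to MySQL over JDBC; the table name, JDBC URL, and credentials are hypothetical, and the MySQL JDBC driver must be on the Spark classpath.

```python
from pyspark.sql import SparkSession

# Sketch: push a computed result table from Hive/HDFS into MySQL so that
# business systems can read it without touching HDFS. All names are hypothetical.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

result = spark.table("dw.daily_traffic_summary")   # result of the offline computation

(result.write
       .format("jdbc")
       .option("url", "jdbc:mysql://db-host:3306/report")   # hypothetical target DB
       .option("dbtable", "daily_traffic_summary")
       .option("user", "report_writer")
       .option("password", "******")
       .mode("overwrite")          # replace the previous day's export
       .save())
```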

In addition, the real-time computing module may write some of its results directly into the data sharing layer.

Data applications

Business products: the data used by business products already exists in the data sharing layer, so it can be read directly from there.

Reports: like business products, the data used by reports is generally already summarized and stored in the data sharing layer.

Ad-hoc queries: there are many users of ad-hoc queries, including data developers, website and product operators, data analysts, and even department bosses; they all have the need to query data on demand.

Such ad-hoc queries usually cannot be satisfied by the existing reports and the data sharing layer, and need to go directly to the data storage layer.

Ad-hoc queries are generally done in SQL, and the biggest difficulty lies in response speed. Hive is a bit slow for this. My current solution is SparkSQL, whose response speed is much faster than Hive's and which is compatible with Hive.
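One common way to expose SparkSQL for ad-hoc SQL (an assumption here, not necessarily the exact setup described above) is the Spark Thrift Server, which speaks the HiveServer2 protocol, so ordinary SQL clients can connect to it. A minimal Python sketch using PyHive, with a hypothetical host, port, and table:

```python
from pyhive import hive   # pip install 'pyhive[hive]'

# Sketch: ad-hoc SQL against a Spark Thrift Server (HiveServer2-compatible).
# Host, port, and table name are hypothetical.
conn = hive.connect(host="spark-thrift-host", port=10000, username="analyst")
cursor = conn.cursor()
cursor.execute("""
    SELECT dt, COUNT(DISTINCT uid) AS uv
    FROM   dw.web_log
    WHERE  dt BETWEEN '2015-10-01' AND '2015-10-07'
    GROUP  BY dt
""")
for row in cursor.fetchall():
    print(row)
cursor.close()
conn.close()
```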

Of course, you can also use Impala if you don’t care about having one more framework in your platform.

OLAP: at present, many OLAP tools cannot obtain data directly from HDFS; they synchronize the required data into a relational database and run OLAP there. But when the data volume is huge, a relational database simply cannot cope.

In that case, you need to do some development yourself to fetch data from HDFS or HBase and implement the OLAP functionality.

For example, develop interfaces that fetch data from HBase according to the dimensions and indicators the user has selected.
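A hedged sketch of such an interface: it reads a pre-aggregated indicator from HBase through the happybase client, using a row key built from the user-selected dimensions. The table name, row-key layout, and column family are hypothetical.

```python
import happybase

# Sketch: fetch a pre-aggregated indicator from HBase for the dimensions the
# user selected in the OLAP front end. Table/rowkey/column names are hypothetical.
def get_indicator(date, city, indicator):
    connection = happybase.Connection(host="hbase-thrift-host", port=9090)
    try:
        table = connection.table("olap_traffic")
        # Row key designed as "<date>|<city>", so one dimension combination is a single get.
        row_key = "{0}|{1}".format(date, city).encode("utf-8")
        column = b"m:" + indicator.encode("utf-8")
        row = table.row(row_key, columns=[column])
        value = row.get(column)
        return int(value) if value is not None else 0
    finally:
        connection.close()

# Example: page views for Beijing on 2015-10-01.
print(get_indicator("2015-10-01", "beijing", "pv"))
```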

Other data interfaces: some are generic and some are custom. For example, an interface that reads user attributes from Redis is generic: any business can call it to obtain user attributes.
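A minimal sketch of such a generic interface, assuming (hypothetically) that user attributes are stored as a Redis hash keyed by user id:

```python
import redis

# Sketch: generic "get user attributes" interface backed by Redis.
# Key layout ("user:<uid>") and attribute fields are hypothetical.
_r = redis.Redis(host="redis-host", port=6379, db=0, decode_responses=True)

def get_user_attributes(uid):
    """Return all attributes of a user as a dict, e.g. {'gender': 'f', 'city': 'shanghai'}."""
    return _r.hgetall("user:{0}".format(uid))

# Any business can call the same interface:
print(get_user_attributes(10001))
```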

Real-time computing

Business now has more and more real-time requirements for the data warehouse, for example: knowing the overall website traffic in real time, or getting the exposure and clicks of an advertisement in real time. With massive data volumes, traditional databases and traditional implementation methods cannot keep up; what is needed is a distributed real-time computing framework with high throughput, low latency, and high reliability. Storm is the more mature option in this area, but I chose Spark Streaming, for the simple reasons that I did not want to introduce yet another framework into the platform, and that Spark Streaming's latency, although slightly higher than Storm's, is negligible for our needs.

Spark Streaming is currently used to implement real-time website traffic statistics and real-time advertising effectiveness statistics.

Flume collects website logs and advertising logs on the front-end log servers and sends them to Spark Streaming in real time. Spark Streaming computes the statistics and stores the results in Redis, from which the business systems can read them in real time.
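A rough sketch of that pipeline, using the old Spark Streaming Flume receiver (available in Spark 1.x/2.x, removed in Spark 3) and counting page views per micro-batch into Redis; hosts, ports, and key names are hypothetical.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.flume import FlumeUtils   # Spark 1.x/2.x only
import redis

# Rough sketch: Flume pushes log events to Spark Streaming, which counts page
# views per batch and keeps a running total in Redis. Hosts/ports/keys hypothetical.
sc = SparkContext(appName="realtime-traffic")
ssc = StreamingContext(sc, batchDuration=10)        # 10-second micro-batches

flume_stream = FlumeUtils.createStream(ssc, "spark-worker-host", 41414)
lines = flume_stream.map(lambda event: event[1])    # each event is (headers, body)

def save_pv(rdd):
    pv = rdd.count()                                # count() runs on the driver
    r = redis.Redis(host="redis-host", port=6379)
    r.incrby("traffic:pv:total", pv)                # running total read by businesses

lines.foreachRDD(save_pv)

ssc.start()
ssc.awaitTermination()
```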

Task scheduling and monitoring

In a data warehouse/data platform there are a great many programs and tasks, for example: data acquisition tasks, data synchronization tasks, and data analysis tasks.

Besides being run on fixed schedules, these tasks also have very complex dependencies. For example, a data analysis task can start only after the corresponding data acquisition task has completed, and a data synchronization task can start only after the data analysis task has completed.
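To make the dependency idea concrete, here is a tiny, hedged sketch, not the scheduling system from the article referenced below, that declares task dependencies and derives a legal execution order with a topological sort; the task names are hypothetical.

```python
from collections import deque

# Tiny sketch: declare task dependencies and compute an execution order.
# Task names are hypothetical; a real scheduler also handles retries, alarms, etc.
dependencies = {
    "collect_web_log": [],
    "analyze_traffic": ["collect_web_log"],   # analysis waits for collection
    "sync_to_mysql":   ["analyze_traffic"],   # export waits for analysis
}

def execution_order(deps):
    indegree = {t: len(parents) for t, parents in deps.items()}
    children = {t: [] for t in deps}
    for task, parents in deps.items():
        for p in parents:
            children[p].append(task)
    ready = deque(t for t, n in indegree.items() if n == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in children[task]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return order

print(execution_order(dependencies))
# ['collect_web_log', 'analyze_traffic', 'sync_to_mysql']
```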

This calls for a very robust task scheduling and monitoring system which, as the hub of the data warehouse/data platform, schedules and monitors the allocation and execution of all tasks.

I have written about this before in "Task Scheduling and Monitoring in Big Data Platform", so I will not repeat it here.

Metadata management

This is a part we would like to do well, but it is very complicated and, in my view, the return does not yet justify the cost, so we have not pursued it for now. Currently we only keep metadata about daily task runs.

Conclusion

In my opinion, an architecture is not better the more new technologies it uses; as long as it meets the needs, the simpler and more stable, the better. At present, on our data platform, developers focus more on the business than on the technology. Once the business and requirements are clear, most of the work is simple SQL development plus configuration in the scheduling system; if a task fails, they receive an alarm. In this way, more resources can be focused on the business itself.

Related reading:

Big data platform task scheduling and monitoring system

Spark On Yarn series articles

Taobao DataX data exchange tool for heterogeneous data sources download and use

All right, that’s it. We’ll talk about it later.