Although data analysis sits hidden behind business systems, it plays a very important role: its results are pivotal to decision-making and business development. As data technology has matured, terms such as data mining and data exploration have gained more and more exposure. But well before big data analysis systems such as the Hadoop family appeared, data analysis had already been through a period of rapid growth, dominated by BI systems, which developed very mature and stable technical solutions and a rich ecosystem. The general architecture of a BI system is as follows:

As can be seen, the core module of a BI system is the Cube. A Cube is a higher-level abstraction of the business model, on which various operations can be carried out, such as drilling up, drilling down, and slicing. Most BI systems are built on relational databases operated through SQL, but SQL is relatively weak at multidimensional operations and analysis, so the Cube has its own query language, MDX, whose expressions have much stronger multidimensional expressive power. Analysis systems with the Cube at their core therefore came to occupy a large share of statistical data analysis, and most database vendors provide BI software directly, making it easy to build an OLAP analysis system. However, the problems of BI gradually emerged over time:

  • BI systems mainly focus on analyzing the high-density, high-value structured data generated by business processes, and are very weak at handling unstructured and semi-structured data, such as storing and analyzing images, text, and audio.
  • Because the data warehouse is structured storage, data entering it from other systems goes through what is usually called an ETL process. ETL is strongly bound to the business and usually requires a dedicated ETL team to work with the business side and decide how to clean and transform the data.
  • As heterogeneous data sources increase (for example video, text, and image sources), very complex ETL programs are needed to parse their content into the warehouse, which causes ETL to become overly large and bloated.
  • Performance becomes a bottleneck when data volumes grow too large, showing significant strain at the TB/PB level.
  • Database normal forms and constraint rules focus on eliminating data redundancy in order to guarantee consistency. A data warehouse, however, does not need to modify data or provide such consistency guarantees; in principle its raw data is read-only, so these constraints become factors that merely hurt performance.
  • ETL preprocesses data based on prior assumptions, so the data available to machine learning has already been shaped by those assumptions, and the results are often not ideal. For example, to mine abnormal data with a data warehouse, the features to extract must be clearly defined when the data is loaded through ETL; otherwise they cannot be structured into the warehouse. In most cases, however, features need to be extracted from the heterogeneous raw data itself.
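The ETL process described above can be sketched minimally in Python. This is a toy illustration, not a real pipeline: the raw CSV, the field names, and the SQLite "warehouse" are all hypothetical stand-ins.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a business system (the "extract" step).
RAW_CSV = """user_id,amount,country
1,19.99,us
2,,de
3,7.50,US
"""

def transform(row):
    """Clean one record: drop rows missing an amount, normalize country codes."""
    if not row["amount"]:
        return None
    return (int(row["user_id"]), float(row["amount"]), row["country"].upper())

# "Load" into a structured warehouse table; SQLite stands in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INTEGER, amount REAL, country TEXT)")
rows = [t for r in csv.DictReader(io.StringIO(RAW_CSV)) if (t := transform(r))]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

count, total = conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
print(count, round(total, 2))  # 2 27.49
```

Even in this tiny sketch, the `transform` step encodes business decisions (which rows to drop, how to normalize fields), which is exactly why real ETL grows bloated as sources multiply.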

Against this series of problems, the big data analysis platforms led by the Hadoop system gradually demonstrated excellent performance, and the ecosystem surrounding Hadoop kept growing. Hadoop fundamentally solves the bottlenecks of the traditional data warehouse, but it also brings a series of problems of its own:

  1. Upgrading from a data warehouse to a big data architecture is not a smooth evolution; it basically amounts to tearing the system down and rebuilding it.
  2. Distributed storage under big data emphasizes the read-only nature of data, so storage systems such as Hive and HDFS do not support updates, and HDFS does not support parallel writes. These characteristics impose certain limitations.

The data analysis platform based on big data architecture focuses on solving the bottleneck of traditional data warehouse data analysis from the following dimensions:

  1. Distributed computing: multiple nodes compute in parallel, with an emphasis on data locality to minimize data transmission. For example, Spark represents data computation logic as RDDs, over which a series of optimizations can be performed to reduce data movement.
  2. Distributed storage: a large file is split into N blocks, each placed independently on a machine; this involves replication, sharding, and file management, and most storage-side optimizations happen here.
  3. Combining retrieval with storage: in the early days of big data, storage and computation components were relatively separate, but the trend now is to do more work on the storage side to make querying and computation more efficient. Efficient computation means locating and reading data quickly, so storage holds not just the data content but also extra information such as indexes. Formats like Parquet and CarbonData follow exactly this idea.
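The data-locality idea in point 1 can be illustrated with a pure-Python toy of partitioned map/merge computation; this is not Spark code, just the pattern that RDD lineage generalizes, with hypothetical log data.

```python
from collections import defaultdict
from functools import reduce

# Each "node" aggregates its own partition locally (data locality); only the
# small per-partition results cross the "network" for the final merge.
partitions = [  # data already resident on three hypothetical nodes
    ["error", "info", "error"],
    ["warn", "error"],
    ["info", "info", "warn"],
]

def map_partition(lines):
    """Local aggregation on one node: count log levels without moving raw data."""
    counts = defaultdict(int)
    for level in lines:
        counts[level] += 1
    return dict(counts)

def merge(a, b):
    """Combine two partial results (the only data that is transmitted)."""
    out = dict(a)
    for k, v in b.items():
        out[k] = out.get(k, 0) + v
    return out

partials = [map_partition(p) for p in partitions]  # runs in parallel on a cluster
totals = reduce(merge, partials)
print(totals)  # {'error': 3, 'info': 3, 'warn': 2}
```

In Spark the same shape would be a `map`/`reduceByKey` over an RDD; the point here is only that raw records never leave their node, while the tiny partial counts do.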

In general, there are the following big data architectures around Hadoop system at present:

Traditional big data architecture

It is called the traditional big data architecture because it is positioned to solve the problems of traditional BI. Put simply, the data analysis business has not changed at all, but data volume and performance problems prevent the old system from functioning normally, so it needs an upgrade; this architecture exists to solve exactly that problem. As you can see, it still retains an ETL step, which loads data into storage.

Advantages: simple and easy to understand. For a BI system, the basic approach is unchanged; only the technology selection changes, with BI components replaced by big data counterparts.

Disadvantages: for massive data, there is no complete Cube architecture as under BI. Kylin exists, but its limitations are obvious and it falls far short of the flexibility and stability of BI Cubes, so support for the business is not flexible enough; scenarios with many reports or complex drilling require too much manual customization. At the same time, the architecture is still batch-oriented and lacks real-time support.

Application scenario: data analysis needs are still mainly BI scenarios, but the data volume and performance problems of traditional BI prevent normal daily use.

Streaming architecture

Building on the traditional big data architecture, the streaming architecture is very radical: batch processing is removed entirely, and data is processed purely as a stream. Accordingly, there is no ETL at the ingestion end; it is replaced by a data channel. The results of stream processing are pushed directly to consumers as messages. A storage layer still exists, but storage is mostly window-based, kept not in a data lake but in peripheral systems.
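The window-based processing just described can be sketched as a tumbling-window aggregation; the window size, timestamps, and values below are all hypothetical.

```python
from collections import defaultdict

# Tumbling-window aggregation: the stream is consumed event by event, and only
# the per-window state is kept; there is no long-term data-lake store.
WINDOW = 60  # window size in seconds (hypothetical)

events = [  # (timestamp_seconds, value) arriving in order
    (5, 10), (42, 5), (61, 7), (95, 3), (130, 8),
]

windows = defaultdict(int)
for ts, value in events:
    windows[ts // WINDOW] += value  # assign each event to its window

for win, total in sorted(windows.items()):
    print(f"window [{win * WINDOW}, {(win + 1) * WINDOW}) -> total {total}")
```

Once a window closes, its aggregate is emitted downstream and its raw events are gone, which is precisely why this architecture struggles with replay and historical statistics.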

Advantages: no bloated ETL process, and very low end-to-end data latency.

Disadvantages: with no batch layer, a streaming architecture cannot support data replay or historical statistics well; for offline analysis, only in-window analysis is available.

Application scenario: early warning, monitoring, and data expiration period requirements.

Lambda architecture

The Lambda architecture is a pivotal architecture in big data systems, and most deployed architectures are Lambda or variants based on it. Lambda splits the data channel into two branches: a real-time stream and an offline path. The real-time stream follows the streaming architecture and guarantees timeliness, while the offline path uses batch processing and guarantees eventual consistency. What does that mean? The streaming channel favors incremental computation to guarantee timeliness, so its results serve mainly as approximate, auxiliary references; the batch layer recomputes over the full data set to guarantee eventual consistency. At the outermost layer, Lambda therefore has a merge step that combines the real-time and offline results, which is a very important action in Lambda. The idea behind the merge is as follows:
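As a toy illustration of that merge step: the batch view is authoritative up to its last run, and the speed layer fills in the gap since then with incremental counts. All names, timestamps, and values here are hypothetical.

```python
# Merge step of a Lambda architecture: batch view + incremental speed-layer view.
batch_view = {"page_a": 1000, "page_b": 400}  # full recompute up to t=100
batch_high_watermark = 100

# (timestamp, page) events; only events after the watermark matter here.
realtime_events = [(90, "page_a"), (105, "page_a"), (110, "page_c")]

def merged_view(batch, watermark, events):
    """Serve batch results, topped up with post-watermark incremental counts."""
    view = dict(batch)
    for ts, page in events:
        if ts > watermark:  # older events are already reflected in the batch view
            view[page] = view.get(page, 0) + 1
    return view

print(merged_view(batch_view, batch_high_watermark, realtime_events))
# {'page_a': 1001, 'page_b': 400, 'page_c': 1}
```

When the next batch run finishes, its watermark advances and any approximation the speed layer introduced is overwritten by the full recomputation, which is how eventual consistency is achieved.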

Advantages: Both real-time and offline, covering the data analysis scenario very well.

Disadvantages: the offline layer and the real-time stream face different scenarios but share the same internal processing logic, so there are many redundant and duplicated modules.

Application scenario: Both real-time and offline requirements exist.

Kappa architecture

The Kappa architecture optimizes Lambda by merging the batch and real-time paths into a single stream and replacing the data channel with a message queue. Stream processing therefore remains the main process in Kappa, but the data is persisted at the data-lake level; when offline analysis or recomputation is needed, the data in the lake can be replayed through the message queue.
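The Kappa idea can be shown in miniature: one stream-processing job plus a durable, replayable log. A Python list stands in for a Kafka-style topic here; the event names are hypothetical.

```python
# Kappa in miniature: a single stream job and an append-only, replayable log.
log = []  # stands in for a durable message-queue topic

def produce(event):
    log.append(event)

def run_processor(messages):
    """The single stream job: running sum per key."""
    state = {}
    for key, value in messages:
        state[key] = state.get(key, 0) + value
    return state

# Live traffic arrives and is processed as it lands in the log.
for event in [("clicks", 1), ("clicks", 2), ("views", 5)]:
    produce(event)
live_state = run_processor(log)

# Later, a logic change or bug fix requires recomputation: replay from offset 0
# through a fresh instance of the same processor -- no separate batch layer.
replayed_state = run_processor(log)
print(live_state == replayed_state)  # True: replay reproduces the same result
```

The single shared processor is what eliminates Lambda's duplicated logic; the hard part in practice is retaining and replaying the log at scale, which is the implementation difficulty noted below.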

Advantages: the Kappa architecture eliminates the redundancy of the Lambda architecture and is built on the forward-looking idea that data can be replayed; the overall architecture is very simple.

Disadvantages: although the Kappa architecture looks neat, it is relatively difficult to implement, especially the data-replay part.

Application scenario: similar to Lambda; this architecture is an optimization of Lambda.

Unifield architecture

All of the architectures above focus on massive data processing, while the Unifield architecture is more radical, integrating machine learning with data processing. At its core, Unifield is still based on Lambda, but it is transformed by adding a machine learning layer to the stream-processing side. After data enters the data lake through the data channel, a model-training stage is added, and the trained models are used in the streaming layer. The streaming layer not only applies the models but also trains them continuously.
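A minimal sketch of a streaming layer that both serves and keeps training a model, as described above: a running-mean predictor stands in for a real learned model, and the event stream and anomaly rule are hypothetical.

```python
# Streaming layer that uses a model per event, then updates it (online training).
class OnlineMeanModel:
    """Toy 'model': predicts the mean of everything seen so far."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict(self):
        return self.mean

    def update(self, x):  # continuous training step, one event at a time
        self.n += 1
        self.mean += (x - self.mean) / self.n

model = OnlineMeanModel()
anomalies = []
for value in [10, 11, 9, 10, 50, 10]:  # incoming stream
    # Serve the model first (hypothetical threshold rule after a warm-up)...
    if model.n > 2 and abs(value - model.predict()) > 3 * model.predict() ** 0.5:
        anomalies.append(value)
    model.update(value)  # ...then keep training it on the same event

print(anomalies)  # [50]
```

A real Unifield deployment would train far heavier models over the data lake and push them into the stream job, but the serve-then-update loop per event is the essential shape.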

Advantages: the Unifield architecture provides an architectural solution that combines data analysis with machine learning, solving the problem of how to integrate machine learning into a data platform.

Disadvantages: the Unifield architecture is more complex to implement. A machine learning stack differs greatly from a data analysis platform in both software packages and hardware deployment, so implementation is harder.

Application scenario: massive data needs to be analyzed, and there is substantial demand for, or planning toward, machine learning.

Conclusion

These are some of the most widely used architectures in data processing today. There are many others, but their ideas are more or less similar. As the fields of data and machine learning continue to evolve, these ideas too may eventually become obsolete.


For more insights, please follow our wechat official account: Sitvolk
