The concept of Big Data was proposed by John Mashey, chief scientist of SGI, at a USENIX conference in 1998, in a presentation titled Big Data and the Next Wave of InfraStress, which used Big Data to describe the data explosion. But it took years before big data really caught the industry's attention. The most important catalysts were the three papers Google published between 2003 and 2006 on GFS, MapReduce, and BigTable.

Big data refers to data sets so massive or complex that today's mainstream computer systems cannot acquire, store, manage, process, and refine them into decision-support information within a reasonable amount of time.

The 4V characteristics of big data are Variety, Volume, Velocity, and Value, as shown in the figure below. Variety refers to multiple sources and formats: data can come from search engines, social networks, call logs, sensors, and so on, and may be stored in structured or unstructured form. Volume indicates that the amount of data has grown from the TB to the PB scale; in the mobile Internet era in particular, unstructured data such as video and speech is growing rapidly. Velocity means the data is time-sensitive and must be processed quickly to obtain results, which fundamentally distinguishes it from traditional data mining. Value refers to the low value density of raw data: a large volume contains much irrelevant information and yields little value unless it is processed and refined.

Big data processing workflow

The general big data workflow consists of the following stages: data collection, data storage, data processing, and data presentation, as shown in the figure below.

In the era of big data, the variety and scale of data have made data collection more complex and diverse, spanning everything from structured to unstructured data.

When storage technology falters and fails to keep up with the speed at which data grows, distributed storage becomes an inevitable choice, and unstructured data places new requirements on storage formats. Emerging data sources have also made the volume of data produced grow explosively. Distributed storage and NoSQL were born in response to these requirements and solved the fundamental problem of big data storage.

Data processing includes data computation and analysis, which is the core of big data technology and will be introduced in detail in the rest of this article. Data presentation means reflecting the various indicators of the current platform or business through visual interfaces such as reports.

Evolution of big data

When it comes to big data technology, the most basic and core part is still big data analysis and computation. In 2017, big data analysis and computing technologies continued to develop rapidly; Hadoop, Spark, and artificial intelligence were all evolving and iterating on their own.

At present, most traditional data computing and data analysis services are based on the batch processing model: an ETL or OLTP system is used to build the data store, and online data services access that store by constructing SQL queries to obtain analysis results. This way of processing data became widespread as relational databases evolved in industry. In the era of big data, however, as more and more human activity is informationized and digitized, an increasing share of data processing needs are real-time and streaming. Andrew Ng has suggested that artificial intelligence is the future direction of big data. Batch computing and stream computing are covered in detail below; artificial intelligence will be covered in the next installment.

Batch computing

The traditional batch data processing model usually works as follows:

1. Use an ETL or OLTP system to construct the raw data store, which provides data services for subsequent analysis and computation. As shown in the figure below, the user loads the data, and the system performs a series of query optimizations on it, such as index construction, according to its own storage and computing conditions. For batch computing, therefore, the data must be fully loaded into the computing system; subsequent computation can only start after loading completes.

2. The user or system initiates a computing request against the data system. The computing system then schedules (starts) compute nodes to process a large amount of data, which may take minutes or even hours. Moreover, because the data has been accumulated before processing and lacks timeliness, the data involved in such a calculation is necessarily historical, so real-time results cannot be guaranteed.

3. The calculation results are returned. After the computation completes, the results are returned to the user as a result set; when the result set is too large, it stays in the data computing system and the user must integrate it into other systems. Once the result data is huge, this integration process is again lengthy and may take minutes or even hours.
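
As a rough sketch of this load-then-compute-then-return pattern, the following Python snippet uses SQLite purely as a stand-in for a batch data store; the table, columns, and file name are made up for illustration:

    import sqlite3

    # Step 1: load the full data set into the store before any computation starts.
    conn = sqlite3.connect("warehouse.db")   # stand-in for an ETL-built data store
    conn.execute("CREATE TABLE IF NOT EXISTS orders (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [("north", 10.0), ("south", 25.5), ("north", 7.2)])
    conn.commit()

    # Step 2: the user submits a computing request (a SQL query over the whole history).
    cursor = conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")

    # Step 3: the result set is returned only after the full computation completes.
    for region, total in cursor:
        print(region, total)
    conn.close()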

Typical example: Hadoop

Hadoop is an Apache open source project that provides reliable, scalable distributed computing tools. It consists of HDFS and MapReduce, which handle big data storage and computation respectively.

The Hadoop Distributed File System (HDFS) is an independent distributed file system that provides storage services for the MapReduce computing framework, with high fault tolerance and high availability. Based on block storage, HDFS offers streaming data access and replicates data blocks across DataNodes for backup. The default block size is 64 MB (128 MB since Hadoop 2.x), and users can customize it.

HDFS is a distributed file system with a master/slave architecture, consisting of the NameNode for directory management, DataNodes for data storage, and the Client through which applications access the file system. The NameNode is responsible for the file system namespace, cluster configuration management, and block replication. DataNodes are the basic storage units of the distributed file system. The Client is the application interface to the distributed file system. The architecture is shown below.

As shown in the figure above, the NameNode achieves high availability through an active/standby design, but in earlier versions only a single Standby NameNode was allowed. To achieve higher reliability, Hadoop 3.0 allows more than one Standby NameNode to be configured. For data storage, HDFS keeps data in multiple copies: the Client obtains from the NameNode the list of DataNodes on which the data should be stored, and the DataNode that receives the new data then synchronizes it to the other DataNodes, either synchronously or asynchronously. In recent years, however, data has grown far faster than anticipated, and the data generated inevitably splits into hot and cold data. Hot or cold, data is a core asset for a company, and no one wants to lose it; yet keeping multiple full copies of cold data wastes a large amount of storage space. Hadoop 3.0 therefore introduced Erasure Coding (EC), which significantly reduces storage overhead: cold data can be stored with EC to cut storage cost, at the price of extra CPU work to reconstruct the data when it is read.
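
To make the Client's role concrete, here is a minimal sketch of writing and reading an HDFS file from Python. It assumes pyarrow with a working libhdfs installation; the NameNode host, port, and file path are placeholders, not values from this article:

    from pyarrow import fs

    # Connect to the NameNode (placeholder host/port); block placement and
    # replication across DataNodes are handled by HDFS itself.
    hdfs = fs.HadoopFileSystem("namenode.example.com", 8020)

    # Write a file: the client asks the NameNode where to place blocks,
    # then streams the data to the chosen DataNodes.
    with hdfs.open_output_stream("/tmp/example.txt") as out:
        out.write(b"hello hdfs\n")

    # Read it back: block locations again come from the NameNode.
    with hdfs.open_input_stream("/tmp/example.txt") as src:
        print(src.read().decode())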

MapReduce is a distributed computing framework suited to offline big data computation. It adopts a functional programming model, using Map and Reduce functions to express complex parallel computation; its essence is to decompose a task and then merge the results. The Map function decomposes a job into multiple sub-tasks and sends them to the corresponding node servers for parallel computation. The Reduce function merges the results of the parallel computation and returns the final result to the central server.

Specifically, MapReduce splits a large amount of unprocessed data into smaller data sets. Map tasks process these data sets in parallel and store their results as <key, value> pairs; pairs with the same key are then merged and sent to Reduce for processing. The principle is shown in the figure below.
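
The classic word-count job illustrates this split-and-merge idea. The snippet below is a small pure-Python simulation of the Map, shuffle, and Reduce phases rather than the actual Hadoop API; the sample documents are made up:

    from collections import defaultdict
    from itertools import chain

    documents = ["big data needs big storage", "data drives decisions"]

    # Map: turn each document into <key, value> pairs of (word, 1).
    def map_phase(doc):
        return [(word, 1) for word in doc.split()]

    mapped = list(chain.from_iterable(map_phase(d) for d in documents))

    # Shuffle: group values that share the same key.
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce: merge the per-key values into the final result.
    def reduce_phase(word, counts):
        return word, sum(counts)

    result = dict(reduce_phase(w, c) for w, c in groups.items())
    print(result)   # {'big': 2, 'data': 2, 'needs': 1, ...}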

Stream computing

Unlike the batch computing model, stream computing emphasizes the flow of data through the computation and low latency. The stream data processing model is as follows:

1. Real-time integration tools transfer data changes, as they happen, into a streaming data store (that is, a message queue such as RabbitMQ). Data transmission thus becomes real-time: instead of accumulating a large volume of data over a long period, the data is amortized over each point in time and continuously transmitted in small batches, so the latency of data integration stays low.

2. The data-processing link is where the gap with the batch model is largest. Because data integration has changed from accumulation to real time, a stream computing job, unlike a batch job that waits until all data is ready before it runs, is a kind of permanent computing service: once started, it stays in a state of waiting for events to be triggered, and as soon as a small batch of data enters the streaming data store, the computation runs immediately and produces results quickly.

3. Unlike batch computing, where result data can be transferred to the online system only after the whole computation completes, a stream computing job can write results to the online system as soon as each small batch has been computed, without waiting for the results over the entire data set. This makes it possible to display the results of real-time computation in real time.
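
As a minimal sketch of this model, the snippet below uses an in-memory queue as a stand-in for a message queue such as RabbitMQ, and a running total as the permanently running computation; the event values are made up:

    import queue

    # Stand-in for the streaming data store (in practice RabbitMQ or similar).
    events = queue.Queue()
    for amount in [10.0, 25.5, 7.2, 3.3]:
        events.put(amount)           # real-time integration pushes small batches
    events.put(None)                 # sentinel just to end this demo

    # The streaming job is long-running: it waits for events and updates its
    # result immediately instead of waiting for all data to arrive.
    running_total = 0.0
    while True:
        item = events.get()
        if item is None:
            break
        running_total += item
        print(f"result pushed to the online system: {running_total}")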

Typical example: Spark

Spark is a fast, general-purpose cluster computing platform. It contains the Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX components, as shown in the figure below. Spark Core implements Spark's basic functionality, including task scheduling, memory management, and fault recovery, and defines the RDD API on which the other components are built.

Spark SQL is a library for processing structured data and supports querying data with SQL. Spark Streaming is a component for processing real-time data streams. MLlib is a library of common machine learning algorithms. GraphX is a library for graph processing and parallel graph computation.
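
As a small illustration of the Spark SQL component, the PySpark snippet below builds a DataFrame, registers it as a temporary view, and queries it with SQL; the table and column names are made up:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

    # Build a structured DataFrame and query it with SQL.
    df = spark.createDataFrame(
        [("north", 10.0), ("south", 25.5), ("north", 7.2)],
        ["region", "amount"],
    )
    df.createOrReplaceTempView("orders")

    spark.sql("SELECT region, SUM(amount) AS total FROM orders GROUP BY region").show()
    spark.stop()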

Spark proposes the concept of the Resilient Distributed Dataset (RDD). Each RDD is divided into multiple partitions that run on different nodes in the cluster. Working with data generally involves three steps: creating an RDD, transforming an existing RDD, and calling an action on an RDD to evaluate it.

In Spark, a computation is modeled as a directed acyclic graph (DAG), where each vertex represents a Resilient Distributed Dataset (RDD) and each edge represents an operation on an RDD. An RDD is a collection of objects divided into partitions (held in memory or spilled to disk). On the DAG, an edge E from vertex A to vertex B means that RDD B is the result of applying operation E to RDD A. There are two types of operations: transformations and actions. Transformations (e.g., map, filter, join) operate on an RDD and generate a new RDD; actions (e.g., count, collect) trigger execution of the DAG and return a result to the driver program.
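
The PySpark snippet below is a minimal sketch of the three steps described above: create an RDD, build up transformations (which only extend the DAG), and call an action to trigger execution; the input lines are made up:

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-demo")

    # 1. Create an RDD (partitioned across the cluster).
    lines = sc.parallelize(["big data needs big storage", "data drives decisions"])

    # 2. Transformations build up the DAG lazily; nothing runs yet.
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # 3. An action triggers execution of the DAG and returns the result.
    print(counts.collect())
    sc.stop()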

The following describes the differences between Spark and Hadoop:

  1. Compared with Hadoop, Spark is faster, with average processing speeds 10 to 100 times those of Hadoop. This is because Spark processes data in memory: the storage layer is touched only when data is first read in and when the final result is persisted, and all intermediate results stay in memory. Beyond in-memory processing, Spark can also speed up disk-bound tasks, because it analyzes the entire task set in advance and optimizes it as a whole. The difference is most pronounced in workloads with frequent iterations, where Hadoop must write data back to disk between iterations, introducing heavy disk I/O and lowering overall system performance (see the sketch after this list).

  2. Spark does not ship with a distributed file system like Hadoop's HDFS, but it can read data from HDFS, Cassandra, and other storage systems.

  3. Hadoop was designed with a greater emphasis on batch processing; Spark, on the other hand, can solve more problems because of its support for stream processing and machine learning.

  4. They point in different directions: Hadoop is essentially a distributed data infrastructure, while Spark is a data processing tool.
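
To illustrate the iterative scenario mentioned in point 1 above, the sketch below caches a small RDD in memory and runs a toy gradient-descent loop over it, so every pass reuses the in-memory data instead of re-reading it from storage; the data and learning rate are made up:

    from pyspark import SparkContext

    sc = SparkContext(appName="iteration-demo")

    # Cache the working set in memory so every iteration reuses it directly;
    # Hadoop MapReduce would write intermediate results back to disk between passes.
    data = sc.parallelize([1.0, 2.0, 3.0, 4.0, 5.0]).cache()
    n = data.count()

    # Toy iterative job: gradient descent toward the mean of the data.
    guess, lr = 0.0, 0.1
    for _ in range(50):
        gradient = data.map(lambda x: guess - x).sum() / n
        guess -= lr * gradient

    print("estimated mean:", guess)   # converges to about 3.0
    sc.stop()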

Conclusion

This article mainly introduced the definition and characteristics of big data and the general big data workflow, and focused on the first two steps in the evolution of big data technology: batch computing and stream computing. For batch computing, the key ideas behind Hadoop's HDFS and MapReduce were introduced. In scenarios with frequent iterative computation, however, Hadoop needs frequent disk I/O, which lowers system performance; Spark was created to keep data in memory under normal circumstances and introduces the RDD and DAG abstractions to manage data effectively. Next time, we'll cover step 3 in the evolution of big data technology: artificial intelligence.


Welcome to follow the WeChat official account Mukeda; all articles are also published there in sync.