What is big data?

The book “Big Data” defines it this way: big data means analyzing and processing all of the data, rather than taking the shortcut of random sampling (a sample survey).

This definition carries at least two messages:

1. Big data is massive data

2. There is no shortcut for big data processing, which places higher demands on analysis and processing technology

The big data processing workflow

The figure below illustrates this workflow:

1. At the bottom are the data sources, often amounting to hundreds of billions of records, including SCM (supply chain data), 4PL (logistics data), CRM (customer data), website logs, and other data

2. The second layer is the data processing layer. Data engineers extract, clean, transform, and load data from the sources according to agreed statistical definitions and metrics (ETL); a minimal sketch of this step appears after this list.

3. The third layer is the data warehouse. The processed data flows into the data warehouse for integration and storage, forming one data mart after another.

A data mart is a classified subset of stored data, that is, data organized and stored according to the needs of a particular department or group of users.

4. The fourth layer is BI (business intelligence), which models, mines, and analyzes the data according to business requirements and exposes the results through a unified data analysis platform

5. The fifth layer is the data access layer, which grants different roles and permissions to different consumers of the data so that it can drive the business.
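To make the data processing layer a little more concrete, here is a minimal ETL sketch in plain Java. It is purely illustrative: the file names raw_orders.csv and clean_orders.csv, the column layout, and the cleaning rules are assumptions for the example, not part of any real pipeline described above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Minimal ETL sketch: extract raw CSV rows, clean/transform them, load the result.
// File names and the column layout (id,customer,amount) are hypothetical.
public class SimpleEtl {
    public static void main(String[] args) throws IOException {
        // Extract: read raw rows from the source file
        List<String> raw = Files.readAllLines(Path.of("raw_orders.csv"));

        // Transform: drop malformed rows and normalize the customer name
        List<String> clean = raw.stream()
                .map(String::trim)
                .filter(line -> line.split(",").length == 3) // keep only complete rows
                .map(line -> {
                    String[] f = line.split(",");
                    return f[0] + "," + f[1].toUpperCase() + "," + f[2];
                })
                .collect(Collectors.toList());

        // Load: write the cleaned rows to the destination (the warehouse side)
        Files.write(Path.of("clean_orders.csv"), clean);
    }
}
```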

The sheer volume of big data determines how difficult it is to process and apply, and specific technical tools are needed to handle it.

Big data processing technology

Take the most commonly used Hadoop for example:

Hadoop is an open-source framework from the Apache Software Foundation that allows large data sets to be stored and processed in a distributed manner across clusters of computers using simple programming models.

A cluster is a group of two or more servers (nodes) that together provide data services. A single server cannot process massive amounts of data; the more servers, the more powerful the cluster.

Hadoop is like a data ecosystem, with different modules each doing their own job. The following is the ecosystem diagram from the Hadoop website.

Incidentally, the Hadoop logo is a nimble elephant. Its origin has been debated online, with some saying the elephant symbolizes a giant (big data) that Hadoop makes nimble. Officially, the name and logo come from a toy elephant that founder Doug Cutting’s son had named Hadoop.

As can be seen from the figure above, the core of Hadoop is HDFS, YARN, and MapReduce. Below is a brief introduction to the meaning and function of the major modules.

1. HDFS (Hadoop Distributed File System)

Data is distributed in blocks across different nodes in the cluster. When using HDFS, you do not need to care about which node a piece of data is stored on or fetched from; you simply manage and store files as if you were using a local file system.
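As a rough sketch of what “using HDFS like a local file system” looks like in code, the snippet below uses Hadoop’s Java FileSystem API to write a small file and check that it exists. The NameNode address hdfs://namenode:9000 and the path /demo/hello.txt are placeholder values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS sketch: write a small file and check that it exists.
// "hdfs://namenode:9000" and "/demo/hello.txt" are placeholder values.
public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // address of the NameNode

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/demo/hello.txt");

            // Create (or overwrite) the file; HDFS decides which DataNodes hold the blocks
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello hdfs");
            }

            System.out.println("exists: " + fs.exists(file)); // should print "exists: true"
        }
    }
}
```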

2. MapReduce (Distributed Computing Framework)

MapReduce is a distributed computing framework that distributes large, complex data sets across different nodes for processing; each node periodically reports its completed work and latest status. The figure below illustrates the MapReduce principle:

Suppose we need to count how many times each word appears in the input:

With a centralized approach, we would first count how many times a word such as Deer appears, then the next word, and so on until every word has been counted, which wastes a great deal of time and resources.

With distributed computing, the job becomes efficient: the data is split across three nodes, each node counts the words in its own share, and the counts for the same word are then aggregated to produce the final result.
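The word-count scenario above is essentially Hadoop’s classic WordCount program. The sketch below follows that well-known structure: the mapper emits a (word, 1) pair for every word, and after the shuffle the reducer sums the counts for each word. Input and output HDFS paths are passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic WordCount: the map step emits (word, 1) for every word,
// and the reduce step sums the 1s for each word after the shuffle.
public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Run against a directory of text files, it produces one output line per word with its total count.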

3. YARN (Resource Scheduler)

YARN is the equivalent of a computer’s task manager: it handles resource management and job scheduling for the cluster.

4. HBase (Distributed Database)

HBase is a non-relational (NoSQL) database. In some business scenarios, storing and querying data in HBase is more efficient.

The difference between relational and non-relational databases will be discussed in more detail in a future article.
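As a small taste of the NoSQL style of access, the sketch below uses the HBase Java client to write one cell and read it back by row key. The ZooKeeper address, the orders table, and the info:status column are placeholders, and the table is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Minimal HBase sketch: write one cell and read it back.
// The ZooKeeper host, the "orders" table, and the "info:status" column are placeholders;
// the table is assumed to exist already.
public class HBasePutGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zookeeper-host");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("orders"))) {

            // Put: row key "order-001", column family "info", qualifier "status"
            Put put = new Put(Bytes.toBytes("order-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("status"), Bytes.toBytes("shipped"));
            table.put(put);

            // Get: read the same cell back by row key
            Result result = table.get(new Get(Bytes.toBytes("order-001")));
            byte[] status = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("status"));
            System.out.println("status = " + Bytes.toString(status));
        }
    }
}
```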

5. Hive (Data Warehouse Tool)

Hive is a data warehouse tool built on Hadoop. It converts SQL-style queries into MapReduce jobs so that data in HDFS can be queried and analyzed. With Hive, users only need to know SQL; they can run analyses without writing MapReduce jobs themselves.
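To show what “only need to know SQL” means in practice, here is a hedged sketch that sends a SQL query to Hive through JDBC (HiveServer2). The server address, user name, and the web_logs table are assumptions for the example; the hive-jdbc driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal Hive-over-JDBC sketch. The HiveServer2 address, the user name,
// and the "web_logs" table are placeholders.
public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://hiveserver-host:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // Hive translates this SQL into MapReduce jobs behind the scenes
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS pv FROM web_logs GROUP BY page")) {

            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("pv"));
            }
        }
    }
}
```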

6. Spark (Big Data Computing Engine)

Spark is a fast, general-purpose computing engine designed for large-scale data processing.
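For comparison with the MapReduce version earlier, here is a minimal sketch of the same word count written against Spark’s Java API. The input path is a placeholder, and local[*] simply runs Spark locally for demonstration.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

// Minimal Spark word count; "input.txt" is a placeholder path (local file or HDFS path).
public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("input.txt");

            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // split into words
                    .mapToPair(word -> new Tuple2<>(word, 1))                      // (word, 1) pairs
                    .reduceByKey(Integer::sum);                                    // sum counts per word

            counts.collect().forEach(t -> System.out.println(t._1 + "\t" + t._2));
        }
    }
}
```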

7. Mahout (Machine Learning and Data Mining Library)

Mahout is an extensible machine learning and data mining library.

8. Sqoop (Data Transfer Tool)

Sqoop can import data from relational databases into Hadoop’s HDFS, or export HDFS data back into relational databases.

Besides the modules above, the Hadoop ecosystem also includes ZooKeeper, Chukwa, and others. Because the project is open source, more and better modules will keep appearing; if you are interested, you can read about them online.

Together, the components of Hadoop’s powerful ecosystem cover the entire big data processing workflow.
