Introduction

With the development of science and technology, we leave more and more data on the net: online shopping and commodity trading, browsing web pages, chatting on WeChat, recording our daily schedule on a phone, and so on. It is fair to say that in today's life, as long as you take part in it, you generate data all the time. But can all of this data be called big data? No, not yet. So what exactly is big data?

Overview of Big Data

Definition

Baidu Baike's definition: big data refers to a collection of data that cannot be captured, managed, and processed with conventional software tools within a tolerable time frame. It is a massive, fast-growing, and diversified information asset that requires new processing modes to deliver stronger decision-making, insight, discovery, and process-optimization capabilities. The characteristics of big data can be summarized as follows:

The amount of data is large, so tools are needed to collect it and then to analyze and compute on it.

Here's an example: Simon is a fan of Jackie Chan's movies. He found 100 GB of Jackie Chan's classic movies on a website and, in order to keep them, decided to store them all, then looked up each film by release year and watched it. Therefore, we can condense the definition of big data as follows:

Big data mainly solves the collection, storage, analysis and calculation of massive data.

Now that we call it big data, how do we measure the scale of data?

Data units

To measure the size of data, you need to know the units of data. The data units in ascending order are: Bit, Byte, KB, MB, GB, TB, PB, EB, ZB, YB, BB, NB, and DB. The conversion is as follows:

1 Byte = 8 bits

1 KB = 1,024 Bytes = 8192 bits

1 MB = 1,024 KB = 1,048,576 Bytes

1 GB = 1,024 MB = 1,048,576 KB

1 TB = 1,024 GB = 1,048,576 MB

1 PB = 1,024 TB = 1,048,576 GB

For most of us, the units we encounter most often are KB, MB, GB, and so on.
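For a concrete feel of these conversions, here is a minimal Java sketch (the class and variable names are just for illustration) that reproduces a couple of rows of the table above:

```java
public class UnitConversion {
    // Each step up the table multiplies by 1,024.
    static final long KB = 1024L;
    static final long MB = 1024L * KB;
    static final long GB = 1024L * MB;
    static final long TB = 1024L * GB;
    static final long PB = 1024L * TB;

    public static void main(String[] args) {
        long movies = 100 * GB;                       // Simon's 100 GB of movies
        System.out.println(movies + " bytes");        // prints 107374182400 bytes
        System.out.println(movies / MB + " MB");      // prints 102400 MB
        System.out.println(PB / GB + " GB per PB");   // prints 1048576 GB per PB
    }
}
```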

However, computing performance has changed dramatically. In the early days of computing, processing gigabytes of data was already the limit; today, with servers carrying 128 GB of memory each, it is not uncommon for many servers to process EB-level data in parallel.

Data meaning and value

Every age has its definition of data, and the key goal is to mine the meaning and value behind it.

What is the point of dealing with so much data in so many different formats?

Big data can be thought of as a collection of data from which we can infer an approximate objective law, and then use that law to predict the probability that the entity generating the data will behave the same way next time. For example, if a user of a movie website often watches Jackie Chan movies, then the next time that user visits the site we can place Jackie Chan movies near the top of the recommendation list, because the browsing data tells us the user likes Jackie Chan movies and we believe that interest will not change in the short term.

This is a simple application of big data in daily life: mining user preferences and building a recommendation model.
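To make the idea concrete, here is a minimal Java sketch, using made-up browsing data, that counts which tag a user watches most and puts it at the top of the recommendation order (all names and data here are hypothetical):

```java
import java.util.*;
import java.util.stream.*;

public class PreferenceSketch {
    public static void main(String[] args) {
        // Hypothetical browsing history: tags of movies the user watched recently.
        List<String> watched = List.of(
                "Jackie Chan", "Jackie Chan", "comedy", "Jackie Chan", "action");

        // Count how often each tag appears in the history.
        Map<String, Long> counts = watched.stream()
                .collect(Collectors.groupingBy(tag -> tag, Collectors.counting()));

        // Sort tags by frequency: the most-watched tag goes to the top.
        List<String> ranked = counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());

        System.out.println("Recommend first: " + ranked.get(0)); // Jackie Chan
    }
}
```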

In the data age, data is characterized by massive volume, diversity, real-time arrival, and uncertainty. So what approach or platform is best suited to storing and processing data with these characteristics? After years of technological development and natural selection, the Hadoop distributed model stands out.

Therefore, learning big data is inevitably tied to Hadoop. But for students who have only recently touched big data, or have not touched it at all, ask them what we should learn about Hadoop and they will certainly answer "distributed storage and computing"; these two concepts alone, however, are far too general. So how should we get a grip on learning Hadoop? Don't panic, just listen to Simon Lang.

Hadoop overview

Hadoop is a distributed system infrastructure developed by the Apache Foundation. It mainly solves massive data storage and massive data analysis and calculation problems. In a broad sense, Hadoop usually refers to the Hadoop ecosystem.

Let’s take a look at the structure of Hadoop and then introduce the Hadoop ecosystem.

Hadoop components

The structure of Hadoop differs between 1.x and 2.x/3.x, as shown in the figure below. Hadoop is mainly composed of computation, resource scheduling, data storage, and auxiliary tools.

  • In the Hadoop 1.x era, MapReduce handles both business logic (computation) and resource scheduling at the same time, so the coupling is high.
  • In the Hadoop 2.x/3.x era, Yarn was added. Yarn only schedules resources, and MapReduce only performs computation.

Note: Resource scheduling refers to allocating CPU and memory and choosing which servers perform the computation, etc.

Next, we introduce:

  • HDFS for storage
  • YARN for resource scheduling
  • MapReduce for computation

HDFS Architecture Overview

The Hadoop Distributed File System (HDFS for short) is, as the name says, a distributed file system. The structure of HDFS is as follows:

The HDFS architecture consists of the NameNode, the DataNode, and the Secondary NameNode.

  • The NameNode (nn)

The NameNode is the Master of HDFS; it is the manager. Its main responsibilities are:

① Manage the HDFS namespace

② Configure the replication policy (for example, how many copies of each block to keep)

③ Manage the mapping information of data blocks, that is, which DataNodes store which blocks in the distributed store

④ Process client read and write requests

  • The DataNode (dn)

The DataNode is the Slave. The NameNode gives commands and the DataNode performs the actual operations.

① Store actual data blocks

② Perform read/write operations on data blocks

  • Secondary NameNode (2nn)

① Assist the NameNode and share its workload; for example, periodically merge the Fsimage and Edits files and push the result to the NameNode

② In an emergency, assist in recovering the NameNode.

NOTE: The Secondary NameNode is not a hot standby for the NameNode; when the NameNode dies, it cannot immediately take over and provide services.
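To see how a client interacts with this architecture, here is a minimal Java sketch using the standard org.apache.hadoop.fs.FileSystem API; the NameNode address and file paths are assumptions for illustration. The client asks the NameNode about the namespace and block mapping, while the actual block data is written to and read from DataNodes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            // Upload a local file; the NameNode records the block mapping,
            // and the blocks themselves are written to DataNodes.
            fs.copyFromLocalFile(new Path("/tmp/movies.txt"),
                                 new Path("/user/simon/movies.txt"));

            // Ask the NameNode whether the file now exists in the namespace.
            System.out.println(fs.exists(new Path("/user/simon/movies.txt")));
        }
    }
}
```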

YARN Architecture Overview

Yet Another Resource Negotiator, or YARN for short, is Hadoop's resource manager. It provides server computing resources to computation frameworks and acts as a distributed operating-system platform, while frameworks such as MapReduce are the applications that run on top of that operating system. As shown below:

YARN consists of ResourceManager, NodeManager, ApplicationMaster, and Container.

  • ResourceManager (RM): the manager of the resources (memory and CPU) of the entire cluster
  • NodeManager (NM): the manager of the resources of a single node (server)
  • ApplicationMaster (AM): the manager of a single running job
  • Container: a container that encapsulates the resources required by a task, such as memory, CPU, disk, and network.

NOTE:

The ResourceManager is the Master node. Each worker node runs a NodeManager, and the RM allocates resources to the NMs. For each application there is also an ApplicationMaster (AM for short), which is responsible for communicating with the RM to obtain resources and with the NMs to start or stop tasks.
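As a small illustration of the RM/NM split, the sketch below uses the org.apache.hadoop.yarn.client.api.YarnClient API to ask the ResourceManager which NodeManagers are running and what resources each offers (it assumes a yarn-site.xml pointing at your ResourceManager is on the classpath):

```java
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClientSketch {
    public static void main(String[] args) throws Exception {
        // Reads the ResourceManager address from yarn-site.xml on the classpath.
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager which NodeManagers are running
        // and how much memory/CPU each one offers.
        for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + "  capability=" + node.getCapability());
        }
        yarnClient.stop();
    }
}
```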

MapReduce Architecture Overview

MapReduce is a programming framework for distributed computing programs and the core framework with which users develop Hadoop-based data-analysis applications. Its core function is to combine user-written business-logic code with built-in default components into a complete distributed computing program that runs concurrently on a Hadoop cluster.

MapReduce divides the computation into two phases, Map and Reduce (a minimal sketch follows the list below):

  • The Map phase processes the input data in parallel
  • In the Reduce phase, Map results are summarized
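As a concrete sketch of the two phases, here is the classic word-count example written against the org.apache.hadoop.mapreduce API: the Mapper emits (word, 1) pairs in parallel over splits of the input, and the Reducer sums the counts for each word.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map phase: split each input line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts emitted by the Map phase for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```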

The relationship between the three

The relationship between HDFS, YARN, and MapReduce is roughly as follows (a driver sketch follows the list):

  • Data is read from HDFS
  • YARN schedules the resources needed to process the data
  • After receiving commands from YARN, MapReduce starts the MapTask and ReduceTask tasks
  • The processed results are written back to HDFS
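A minimal driver sketch ties the three together: the job reads its input from HDFS, is submitted to YARN for scheduling, and writes its results back to HDFS. It reuses the WordCount classes from the sketch above; the input and output paths are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Map and Reduce classes from the WordCount sketch above.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Read input from HDFS and write the results back to HDFS.
        FileInputFormat.addInputPath(job, new Path("/user/simon/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/simon/output"));

        // Submitting the job hands it to YARN, which allocates containers
        // in which the MapTasks and ReduceTasks actually run.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```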

You thought that was all of Hadoop? NO! NO! NO! There is still a lot to learn in the Hadoop ecosystem.

Well, keep learning!

The Hadoop ecosystem

Let’s start with a brain map of the Hadoop ecosystem.

Damn, there's so much content, it's killing me. Hadoop is an open-source framework for distributed computing that provides a distributed file system subproject (HDFS) and a software architecture supporting MapReduce distributed computing. Since the brain map contains a bit too much, let's introduce a few components that occupy an important position in the Hadoop ecosystem; if you are interested in the other components, you can look them up on your own.

  • Hive

Hive is a data warehouse tool built on Hadoop. It can map structured data files to database tables and lets you run simple MapReduce statistics with SQL statements, without developing dedicated MapReduce applications. Hive is well suited to statistical analysis of a data warehouse.
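As a small illustration, the sketch below connects to a HiveServer2 instance over JDBC and runs an SQL statement that Hive turns into a distributed job behind the scenes; the host, database, table, and credentials are assumptions, and the hive-jdbc driver must be on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (from the hive-jdbc dependency).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 JDBC URL; host, port, and database are assumptions.
        String url = "jdbc:hive2://hive-host:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement()) {
            // A plain SQL query over a hypothetical table of page views;
            // Hive translates it into a distributed job, no MapReduce code needed.
            ResultSet rs = stmt.executeQuery(
                    "SELECT user_id, COUNT(*) AS views FROM page_views GROUP BY user_id");
            while (rs.next()) {
                System.out.println(rs.getString("user_id") + "\t" + rs.getLong("views"));
            }
        }
    }
}
```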

  • Hbase

HBase is a highly reliable, high-performance, column-oriented, scalable distributed storage system. With HBase, large-scale structured storage clusters can be built on inexpensive PC servers.

  • Sqoop

Sqoop is a tool used to transfer data between Hadoop and relational databases. It can import data from a relational database (MySQL, Oracle, Postgres, etc.) into Hadoop's HDFS, or export data from HDFS into a relational database.

  • Zookeeper

Zookeeper is a distributed and open source coordination service designed for distributed applications. It is used to solve data management problems frequently encountered in distributed applications, simplify the coordination and management of distributed applications, and provide high-performance distributed services.
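For a feel of what "coordination" means in practice, here is a minimal Java sketch using the ZooKeeper client API that creates a znode which other distributed processes could read or watch; the ensemble address and znode path are assumptions:

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkSketch {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Assumed ZooKeeper ensemble address; replace with your own.
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, event -> {
            if (event.getState() == KeeperState.SyncConnected) {
                connected.countDown();   // session established
            }
        });
        connected.await();               // wait until actually connected

        // Create a znode that other processes can read or watch,
        // e.g. to agree on a piece of shared configuration.
        String path = zk.create("/demo-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        System.out.println("created " + path + " -> "
                + new String(zk.getData(path, false, null)));
        zk.close();
    }
}
```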

  • Ambari

Ambari is a Web-based tool that enables provisioning, management, and monitoring of Hadoop clusters.

  • Oozie

Oozie is a workflow engine server that manages and coordinates tasks running on Hadoop platforms (HDFS, Pig, and MapReduce).

  • Hue

Hue is a Web-based monitoring and management system that implements web-based operations and management of HDFS, MapReduce/YARN, HBase, Hive, and Pig.


So much for the Hadoop ecosystem; if you're interested in the other components, feel free to explore them on your own.