
Three waves of informationization

| Wave | Approximate time | Marker | Problem solved | Representative companies |
| --- | --- | --- | --- | --- |
| First wave | Around 1980 | Personal computer | Information processing | Intel, AMD, IBM |
| Second wave | Around 1995 | The Internet | Information transmission | Yahoo, Google, Alibaba |
| Third wave | Around 2010 | Internet of Things, cloud computing, big data | Information explosion | Amazon, Google, Aliyun |

 

Characteristics of big data

  • Large volume of data (volume)
  • Wide variety of data types (variety)
  • Fast processing speed (velocity)
  • Low value density (value)
  • Veracity

Final storage location of HDFS data blocks

The blocks of a file are ultimately stored on the local disks of the DataNodes.
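To see this for a concrete file, `hdfs fsck` can list each block together with the DataNodes holding its replicas (the path below is illustrative):

```shell
# List each block of a file and the DataNodes storing its replicas
hdfs fsck /user/hadoop/input/data.txt -files -blocks -locations
```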

The role of the Master (primary server)

The Master server manages tables and Regions:

  • Handles user operations on tables, such as create, delete, alter, and query
  • Implements load balancing across Region servers
  • Adjusts the distribution of Regions after a Region is split or merged
  • Migrates the Regions of a failed Region server
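These Master-mediated operations can be tried from the HBase shell (started with bin/hbase shell; the table and column-family names below are illustrative):

```shell
# Inside the HBase shell; all of these operations go through the Master
create 'student', 'info', 'score'   # create a table with two column families
alter 'student', NAME => 'extra'    # add a column family
disable 'student'                   # a table must be disabled before dropping
drop 'student'                      # delete the table
balancer                            # manually trigger Region load balancing
```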

The role of the Region server

A Region server maintains the Regions assigned to it by the Master, handles the I/O requests for those Regions, and splits Regions that grow too large during operation.
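Splits normally happen automatically once a Region grows past its size threshold, but one can also be requested by hand from the HBase shell ('student' is an illustrative table name):

```shell
split 'student'   # ask the Region server to split this table's Region(s)
```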


HBase

HBase is a highly reliable, high-performance, scalable, distributed column-oriented database that supports real-time reads and writes; it generally uses HDFS as its underlying data storage.

HBase Access Interface

Native Java API. Features: the most conventional and efficient access method. Scenarios: suitable for parallel batch processing of HBase table data in Hadoop MapReduce jobs.

HBase Shell. Features: HBase's command-line tool and the simplest interface. Scenarios: suitable for HBase management and maintenance.
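A minimal HBase shell session, assuming a running HBase and an illustrative table named 'student':

```shell
create 'student', 'info'                     # table with one column family
put 'student', 'row1', 'info:name', 'Alice'  # write one cell
get 'student', 'row1'                        # read a single row
scan 'student'                               # scan the whole table
```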

HBase Programming Practice

 

Format the NameNode: ./bin/hdfs namenode -format

Create a directory: hadoop fs -mkdir [-p] <path> (-p creates parent directories as needed)
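A minimal first-run sketch, assuming a freshly formatted pseudo-distributed installation (the /user/hadoop paths are illustrative):

```shell
./sbin/start-dfs.sh                          # start the NameNode and DataNode
hadoop fs -mkdir -p /user/hadoop/input       # -p also creates missing parents
hadoop fs -put etc/hadoop/core-site.xml /user/hadoop/input   # upload a local file
hadoop fs -ls /user/hadoop/input             # verify
```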

Within a Region, each Store corresponds to one column family.

Hadoop core configuration files

core-site.xml

```xml
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
```

hadoop.tmp.dir: the base directory for temporary files generated while Hadoop is running

fs.defaultFS: the URI of the default file system
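Once HDFS is running, the effective value of a configuration key can be checked with `hdfs getconf`:

```shell
hdfs getconf -confKey fs.defaultFS   # should print hdfs://localhost:9000
```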

hdfs-site.xml



```xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>
```

dfs.replication: the number of replicas kept for each block; 1 is appropriate for a single-node (pseudo-distributed) setup

dfs.namenode.name.dir: the local disk directory where the NameNode stores its metadata (the fsimage file)

dfs.datanode.data.dir: the local disk directory where the DataNode stores data blocks
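Both the configured default and the replication of an individual file can be inspected from the command line (the file path is illustrative):

```shell
hdfs getconf -confKey dfs.replication              # configured default: 1
hadoop fs -stat "%r" /user/hadoop/input/data.txt   # actual replication of a file
```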

Characteristics of Hadoop

  • High reliability
  • High efficiency
  • High scalability
  • High fault tolerance
  • Low cost
  • Runs on the Linux platform
  • Supports multiple programming languages

Name nodes and data nodes

| NameNode | DataNode |
| --- | --- |
| Stores metadata | Stores file contents |
| Metadata is kept in memory | File contents are kept on disk |
| Maintains the mapping from file blocks to DataNodes | Maintains the mapping from blocks to local files on the DataNode |
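This division of labor is visible in the output of `hdfs dfsadmin -report`, which prints cluster-wide capacity from the NameNode followed by the status of each DataNode:

```shell
hdfs dfsadmin -report
```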

HBase functional components

  • Library functions (linked into every client)
  • One Master server
  • Multiple Region servers
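The Master and Region servers of a running deployment can be summarized from the HBase shell:

```shell
status             # counts of active/backup masters, Region servers, average load
status 'detailed'  # per-Region-server detail
```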

Cloud computing

· Cloud computing has three typical service models: IaaS (Infrastructure as a Service), PaaS (Platform as a Service), and SaaS (Software as a Service). Addendum: DaaS (Data as a Service).

· Cloud computing has three deployment types: public cloud, private cloud, and hybrid cloud.

· Key technologies of cloud computing: virtualization, distributed storage, distributed computing, multi-tenancy, etc.

· The concept of cloud computing: cloud computing provides scalable, inexpensive distributed computing capability over the network; wherever users have network access, they can obtain whatever IT resources they need. It is the most representative network computing technology and model of recent years, standing for a dynamic, scalable network application infrastructure with virtualization as its core technology and low cost as its goal.

The Internet of things

  • The Internet of Things has four layers: the perception layer, the network layer, the processing layer, and the application layer.

  • Connections among big data, cloud computing, and the Internet of Things: cloud computing provides the technical foundation for big data, while big data gives cloud computing a place to prove its worth; the Internet of Things is a major source of big data, while big data technology supports the analysis of IoT data; cloud computing provides the Internet of Things with massive data storage capacity, while the Internet of Things offers cloud computing a broad space of applications.

  • Differences among big data, cloud computing, and the Internet of Things: big data focuses on storing, processing, and analyzing massive data, discovering value in it to serve production and daily life; cloud computing essentially aims to integrate and optimize IT resources and deliver them cheaply to users as services over the network; the development goal of the Internet of Things is to connect things, and application innovation is the core of its development.