Hadoop and its commands

A Hadoop cluster contains multiple nodes.

Large amounts of data: how do we store and process them? Data volumes can reach several petabytes (PB).

HDFS focuses on distributed storage. A single machine cannot store that much data, hence the concept of a cluster.

Clusters generally have an odd number of nodes because of the election mechanism.

Suppose you have a 300 MB file and a cluster of five machines. How is the file stored in this case?

Instead of storing the 300 MB file directly on one node, HDFS cuts it up.

A block is 128 MB by default, so the file is split into three blocks: 128 MB, 128 MB, and 44 MB.

The leader (NameNode) is in charge of the cluster and decides which nodes the three blocks live on. The exact placement depends on the remaining space on the cluster nodes and the physical distance between them.
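If you want to confirm the block size and replication factor a cluster actually uses, the hdfs getconf subcommand can read them from the configuration. A minimal sketch, assuming a default configuration (the values shown are the defaults):

[hadoop@hadoop01 hadoop]$ hdfs getconf -confKey dfs.blocksize      # block size in bytes
134217728
[hadoop@hadoop01 hadoop]$ hdfs getconf -confKey dfs.replication    # number of replicas kept per block
3

134217728 bytes is the 128 MB default mentioned above; with replication 3, each of the three blocks is additionally copied across the DataNodes.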

HDFS:

NameNode does not store data; it is used only for overall coordination of the cluster.

DataNode reads and writes the actual data.

SecondaryNameNode assists the NameNode.

So HDFS runs these three processes.

There are two major big data ecosystems:

Hadoop ecosystem: focuses on 1. distributed storage and 2. analytical computing; Hive is used for analysis.

Spark ecosystem: based on in-memory computing; it can be used for offline analysis or real-time computing.

Hadoop overview

Hadoop is a framework for processing large-scale data.

Hadoop has three core components:

Distributed file system HDFS — solves big data storage

Distributed computing framework MapReduce — solves big data computation

Distributed resource management system Yarn — manages and schedules cluster resources
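As a quick way to see MapReduce running on Yarn, the Hadoop distribution ships an examples jar that can be submitted to the cluster. A minimal sketch, assuming the install path /opt/model/hadoop-3.2.1 used later in this article and that the services described below are already started:

# Submit the bundled pi-estimation example job: 2 map tasks, 10 samples each
[hadoop@hadoop01 hadoop]$ yarn jar /opt/model/hadoop-3.2.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 2 10

While it runs, the job is visible in the Yarn web UI described later.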

The relevant components are installed under the /opt/model folder.

Hadoop Configuration Modification

Version: Hadoop 3.2.1

Example: modify the mapping between the host name and the IP address.

[hadoop@hadoop01 model]$ sudo vi /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.100 hadoop01    # change this to your own IP address

After the host mapping is changed, you must restart the system for it to take effect
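Before restarting, you can sanity-check the mapping from the shell; a quick sketch using the host name and IP from the example above:

[hadoop@hadoop01 model]$ getent hosts hadoop01    # prints the /etc/hosts entry that will be used
192.168.1.100   hadoop01
[hadoop@hadoop01 model]$ ping -c 1 hadoop01       # should resolve to the IP you configured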

Modify hadoop-env.sh

  • Location: /opt/model/hadoop-3.2.1/etc/hadoop

  • [hadoop@hadoop01 hadoop]$ vi /opt/model/hadoop-3.2.1/etc/hadoop/hadoop-env.sh
    # line 37: set JAVA_HOME
    export JAVA_HOME=/opt/model/jdk1.8/
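A quick check that the JDK path written into hadoop-env.sh actually exists (the paths are the ones assumed above):

[hadoop@hadoop01 hadoop]$ /opt/model/jdk1.8/bin/java -version    # the JDK the export points at must be present
[hadoop@hadoop01 hadoop]$ hadoop version                         # should print the Hadoop version once the environment is correct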

HDFS Service Management

Start the services

[hadoop@hadoop01 hadoop-3.2.1]$ start-dfs.sh     # start the HDFS daemons
Starting namenodes on [hadoop01]
Starting datanodes
Starting secondary namenodes [hadoop01]
[hadoop@hadoop01 hadoop-3.2.1]$ start-yarn.sh    # start Yarn, which manages and allocates cluster resources
Starting resourcemanager
Starting nodemanagers

[Verify the services]

[hadoop@hadoop01 hadoop-3.2.1]$ jps
2640 NodeManager          # belongs to YARN
2246 SecondaryNameNode    # belongs to HDFS
3034 Jps
2059 DataNode             # belongs to HDFS
2523 ResourceManager      # belongs to YARN
1934 NameNode             # belongs to HDFS

Verify on the web pages that HDFS and Yarn have started successfully.

[Verify HDFS]

The port is 9870

http://192.168.0.104:9870



There is currently one node

Take a look at the overall directory space

These folders are in the root path
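If no browser is handy, reachability of the NameNode UI can also be checked from the shell; a minimal sketch using the address above:

[hadoop@hadoop01 hadoop]$ curl -s -o /dev/null -w "%{http_code}\n" http://192.168.0.104:9870    # 200 means the UI is up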

[Verify Yarn]

The port is 8088

http://192.168.0.104:8088

Stop the services

[hadoop@hadoop01 hadoop]$ stop-dfs.sh
[hadoop@hadoop01 hadoop]$ stop-yarn.sh

HDFS commands

Prerequisite: start-dfs.sh has been run.

HDFS common file commands

[hadoop@hadoop01 hadoop]$ hdfs --help
Usage: hdfs [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]

  OPTIONS is none or any of:

--buildpaths                       attempt to add class files from build tree
--config dir                       Hadoop config directory
--daemon (start|status|stop)       operate on a daemon
--debug                            turn on shell script debug mode
--help                             usage information
--hostnames list[,of,host,names]   hosts to use in worker mode
--hosts filename                   list of hosts to use in worker mode
--loglevel level                   set the log4j level for this command
--workers                          turn on worker mode

  SUBCOMMAND is one of:

    Admin Commands:

cacheadmin           configure the HDFS cache
crypto               configure HDFS encryption zones
debug                run a Debug Admin to execute HDFS debug commands
dfsadmin             run a DFS admin client
dfsrouteradmin       manage Router-based federation
ec                   run a HDFS ErasureCoding CLI
fsck                 run a DFS filesystem checking utility
haadmin              run a DFS HA admin client
jmxget               get JMX exported values from NameNode or DataNode.
oev                  apply the offline edits viewer to an edits file
oiv                  apply the offline fsimage viewer to an fsimage
oiv_legacy           apply the offline fsimage viewer to a legacy fsimage
storagepolicies      list/get/set/satisfyStoragePolicy block storage policies

    Client Commands:

classpath            prints the class path needed to get the hadoop jar and the required libraries
dfs                  run a filesystem command on the file system    # commonly used
envvars              display computed Hadoop environment variables
fetchdt              fetch a delegation token from the NameNode
getconf              get config values from configuration
groups               get the groups which users belong to
lsSnapshottableDir   list all snapshottable dirs owned by the current user
snapshotDiff         diff two snapshots of a directory or diff the current directory contents with a snapshot
version              print the version

    Daemon Commands:

balancer             run a cluster balancing utility
datanode             run a DFS datanode
dfsrouter            run the DFS router
diskbalancer         Distributes data evenly among disks on a given node
journalnode          run the DFS journalnode
mover                run a utility to move block replicas across storage types
namenode             run the DFS namenode
nfs3                 run an NFS version 3 gateway
portmap              run a portmap service
secondarynamenode    run the DFS secondary namenode
sps                  run external storagepolicysatisfier
zkfc                 run the ZK Failover Controller daemon

SUBCOMMAND may print help when invoked w/o parameters or with -h.
[hadoop@hadoop01 hadoop]$
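The --daemon option listed in the help above operates on a single process instead of the whole cluster, which is handy when only one daemon needs attention; a small sketch:

[hadoop@hadoop01 hadoop]$ hdfs --daemon status datanode    # is the local DataNode running?
[hadoop@hadoop01 hadoop]$ hdfs --daemon stop datanode      # stop only the local DataNode
[hadoop@hadoop01 hadoop]$ hdfs --daemon start datanode     # start it again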

To operate on the cluster, commands are prefixed with hdfs.

Operations on the cluster file system use the prefix hdfs dfs.

Function                                    Command
View a directory                            hdfs dfs -ls /
Create a directory                          hdfs dfs -mkdir -p /input/weather/data
Delete a directory                          hdfs dfs -rm -r /input/weather
Upload a file                               hdfs dfs -put <local file> <cluster path>
Append content to a file                    hdfs dfs -appendToFile <local file> <cluster file>
Download (defaults to the current folder)   hdfs dfs -get <cluster file>
View file contents                          hdfs dfs -cat <cluster file>
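Each subcommand in the table also documents its own options; a quick sketch of looking one up:

[hadoop@hadoop01 hadoop]$ hdfs dfs -usage put    # one-line usage summary
[hadoop@hadoop01 hadoop]$ hdfs dfs -help put     # detailed description of the put command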
#1) Directory management
[hadoop@hadoop01 hadoop]$ hdfs dfs -ls /
Found 4 items
drwxr-xr-x   - hadoop supergroup          0 2021-01-23 14:51 /flume
drwxrwxrwx   - hadoop supergroup          0 2020-07-20 10:51 /hive
drwxrwxrwx   - hadoop supergroup          0 2020-07-20 17:02 /tmp
drwxrwxrwx   - hadoop supergroup          0 2020-09-15 15:50 /warehouse
[hadoop@hadoop01 hadoop]$ hdfs dfs -mkdir -p /input/weather/data    # create a multi-level directory
[hadoop@hadoop01 hadoop]$ hdfs dfs -rm -r /input/weather            # delete a directory
Deleted /input/weather

#2) File management
#2.1) Upload the /opt/data/world.sql file to the /input directory of HDFS
[hadoop@hadoop01 data]$ hdfs dfs -put world.sql /input    # upload the file to the current default HDFS
#2.2) Upload a local file to a specified cluster
[hadoop@hadoop01 data]$ hdfs dfs -put market.sql hdfs://hadoop01:9000/input
#2.3) Upload order.log to /input
[hadoop@hadoop01 data]$ echo "s10001,Zhang San,fridge,5000,1,2021-01-19" >> order.log
[hadoop@hadoop01 data]$ cat order.log
s10001,Zhang San,fridge,5000,1,2021-01-19
[hadoop@hadoop01 data]$ hdfs dfs -put order.log /input
#2.4) A new order log is generated every day and appended to order.log
[hadoop@hadoop01 data]$ cat 2021_01_20_order.log
s10002,Zhang San,fridge,5000,2,2021-01-20
s10003,Li Si,washing machine,4000,1,2021-01-20
[hadoop@hadoop01 data]$ hdfs dfs -appendToFile 2021_01_20_order.log /input/order.log    # append the local 2021_01_20_order.log file to /input/order.log
#2.5) View the file
[hadoop@hadoop01 data]$ hdfs dfs -cat /input/order.log
s10001,Zhang San,fridge,5000,1,2021-01-19
s10002,Zhang San,fridge,5000,2,2021-01-20
s10003,Li Si,washing machine,4000,1,2021-01-20

#2.6) Download a cluster file to Linux
[hadoop@hadoop01 hdfs_data]$ hdfs dfs -get /input/order.log    # download to the current directory
[hadoop@hadoop01 hdfs_data]$ ll                                # check the downloaded file
-rw-rw-r-- 1 hadoop hadoop 159 Jan 20 16:49 order.log
[hadoop@hadoop01 hdfs_data]$ cat order.log                     # view the downloaded content
s10002,Zhang San,fridge,5000,2,2021-01-20
s10003,Li Si,washing machine,4000,1,2021-01-20
s10004,Wang Wu,color TV,6000,2,2021-01-20

HDFS dfsadmin command

#Viewing Cluster Status
[hadoop@hadoop01 hdfs_data]$ hdfs dfsadmin -report

You can also check this directly on the web UI.
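Besides -report, a couple of other dfsadmin subcommands are useful for quick checks; a short sketch:

[hadoop@hadoop01 hdfs_data]$ hdfs dfsadmin -safemode get      # whether the NameNode is in safe mode
[hadoop@hadoop01 hdfs_data]$ hdfs dfsadmin -printTopology     # which rack each DataNode belongs to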

Viewing the data blocks of files uploaded to HDFS

For example, where on the cluster is the uploaded order.log actually stored?

Go to the dfs folder under the Hadoop data directory.

It contains two folders; under data there is a numbered directory for each block pool.

Change into the block pool's subdirectory to see the list of data blocks.

We can look at the contents of a data block; for example, the movie review data that was uploaded yesterday.

cd subdir29/
ll
cat blk_1073749459
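Instead of browsing the DataNode directories by hand, hdfs fsck can report which blocks a file consists of and where the replicas live; a minimal sketch for the order.log uploaded earlier:

[hadoop@hadoop01 hdfs_data]$ hdfs fsck /input/order.log -files -blocks -locations
# the output lists each block ID (like the blk_1073749459 above) together with the DataNodes holding its replicas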