I. Working mechanism

1. Basic description

On a DataNode, each data block is stored on disk as two files: one holding the data itself, and one holding the block metadata, which includes the block length, checksums, and a timestamp.
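For illustration, the on-disk layout looks roughly like the sketch below (a hypothetical listing for a Hadoop 2.x data directory; the block pool ID and block IDs vary per cluster):

# Under a path like ${dfs.datanode.data.dir}/current/BP-.../current/finalized/
blk_1073741825            # the block data itself
blk_1073741825_1001.meta  # block metadata: length, checksums, timestamp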

After a DataNode starts, it registers with the NameNode and then periodically reports all of its block metadata to the NameNode.

A heartbeat mechanism runs between each DataNode and the NameNode every three seconds. The heartbeat response carries commands from the NameNode to the DataNode, such as replicating or deleting blocks. If the NameNode receives no heartbeat from a DataNode for more than 10 minutes, the node is considered unavailable.

2. Customize the timeout

The timeout and heartbeat interval can be modified in the hdfs-site.xml configuration file. Note the units: dfs.namenode.heartbeat.recheck-interval is in milliseconds, while dfs.heartbeat.interval is in seconds.

<property>
    <name>dfs.namenode.heartbeat.recheck-interval</name>
    <value>600000</value>
</property>
<property>
    <name>dfs.heartbeat.interval</name>
    <value>6</value>
</property>
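For reference, the NameNode derives the timeout from these two values using the standard heartbeat timeout formula (a worked calculation, not part of the original configuration):

timeout = 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval

With the values above: 2 * 600000 ms + 10 * 6 s = 1200 s + 60 s = 21 minutes. With the defaults (300000 ms and 3 s), this yields 630 s, i.e. the roughly 10 minutes mentioned earlier.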

3. A new node goes online

The current cluster nodes are hop01, hop02, and hop03; the new node hop04 is added on this basis.

Basic steps

Clone the hop04 environment from one of the current service nodes.

Modify the basic CentOS 7 configuration and delete the copied data and log files.

Start the DataNode and associate it with the cluster, as sketched below.
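A minimal sketch of that last step on hop04, assuming the same /opt/hadoop2.7 installation path used elsewhere in this article:

# Start the DataNode (and the NodeManager for YARN) on the new node
[root@hop04 hadoop2.7]# sbin/hadoop-daemon.sh start datanode
[root@hop04 hadoop2.7]# sbin/yarn-daemon.sh start nodemanager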

4. Configure multiple directories

Synchronize this configuration to all services in the cluster, format and restart HDFS and YARN, then upload a file to test.

<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///${hadoop.tmp.dir}/dfs/data01,file:///${hadoop.tmp.dir}/dfs/data02</value>
</property>
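One possible test sequence on hop01, assuming a standard Hadoop 2.7 layout with scripts under sbin/ (the uploaded file is arbitrary):

# Stop the services, format the NameNode, then restart and test
[root@hop01 hadoop2.7]# sbin/stop-yarn.sh
[root@hop01 hadoop2.7]# sbin/stop-dfs.sh
[root@hop01 hadoop2.7]# bin/hdfs namenode -format
[root@hop01 hadoop2.7]# sbin/start-dfs.sh
[root@hop01 hadoop2.7]# sbin/start-yarn.sh
# Upload a file and verify it lands under the new data directories
[root@hop01 hadoop2.7]# bin/hadoop fs -put README.txt /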

II. Blacklist and whitelist configuration

1. Set the whitelist

Configure the whitelist and distribute it to all services in the cluster.

[root@hop01 hadoop]# pwd
/opt/hadoop2.7/etc/hadoop
[root@hop01 hadoop]# vim dfs.hosts
hop01
hop02
hop03

Configure hdfs-site.xml and distribute the configuration to all services in the cluster.

<property>
    <name>dfs.hosts</name>
    <value>/opt/hadoop2.7/etc/hadoop/dfs.hosts</value>
</property>

Refresh the NameNode

[root@hop01 hadoop2.7]# hdfs dfsadmin -refreshNodes

Refresh the ResourceManager

[root@hop01 hadoop2.7]# yarn rmadmin -refreshNodes

2. Blacklist settings

Configure the blacklist and distribute it to all services in the cluster.

[root@hop01 hadoop]# pwd
/opt/hadoop2.7/etc/hadoop
[root@hop01 hadoop]# vim dfs.hosts.exclude
hop04

Configure hdfs-site.xml and distribute the configuration to all services in the cluster.

<property>
    <name>dfs.hosts.exclude</name>
    <value>/opt/hadoop2.7/etc/hadoop/dfs.hosts.exclude</value>
</property>

Refresh the NameNode

[root@hop01 hadoop2.7]# hdfs dfsadmin -refreshNodes

Refresh the ResourceManager

[root@hop01 hadoop2.7]# yarn rmadmin -refreshNodes
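After the refresh, the decommissioning progress of hop04 can be checked with the admin report (a hedged example; the output details vary by version):

# hop04 should show "Decommission in progress", then "Decommissioned"
# once its blocks have been replicated to the remaining nodes
[root@hop01 hadoop2.7]# hdfs dfsadmin -report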

III. File archiving

1. Basic description

HDFS is designed for large files and massive data. Large numbers of small files generate a large amount of metadata, which occupies too much NameNode memory and slows down the interaction between the NameNode and DataNodes.

HDFS can archive small files, which can be understood as packed storage: it reduces NameNode memory consumption and interaction overhead while still allowing access to the archived small files, improving overall efficiency.

2. Operation process

Create two directories

[root@hop01 hadoop2.7]# hadoop fs -mkdir -p /hopdir/harinput
[root@hop01 hadoop2.7]# hadoop fs -mkdir -p /hopdir/haroutput

Upload test files

[root@hop01 hadoop2.7]# hadoop fs -moveFromLocal LICENSE.txt /hopdir/harinput
[root@hop01 hadoop2.7]# hadoop fs -moveFromLocal README.txt /hopdir/harinput

Archive operation

[root@hop01 hadoop2.7]# bin/hadoop archive -archiveName output.har -p /hopdir/harinput /hopdir/haroutput
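For reference, the general form of the archive command as documented for Hadoop Archives (-p sets the parent path that the source paths are relative to):

hadoop archive -archiveName <name>.har -p <parent path> <src>* <dest>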

Viewing archive files

[root@hop01 hadoop2.7]# hadoop fs -lsr har:///hopdir/haroutput/output.har

Once the archive is in place, the original small files can be deleted.

Unarchiving

# Unarchive
[root@hop01 hadoop2.7]# hadoop fs -cp har:///hopdir/haroutput/output.har/* /hopdir/haroutput
# Check the files
[root@hop01 hadoop2.7]# hadoop fs -ls /hopdir/haroutput

IV. Recycle bin mechanism

1. Basic description

If the recycle bin function is enabled, deleted files can be recovered within a specified period, protecting against accidental deletion. HDFS implements this with a background thread in the NameNode called Emptier, which manages and monitors the files in the system recycle bin; files in the recycle bin that have expired are deleted automatically.

2. Enable the configuration

This configuration needs to be synchronized to all services in the cluster.

[root@hop01 hadoop]# vim /opt/hadoop2.7/etc/hadoop/core-site.xml
# Add the following content
<property>
    <name>fs.trash.interval</name>
    <value>1</value>
</property>

fs.trash.interval=0 disables the recycle bin mechanism; any positive value enables it, with the value giving the retention time in minutes, so the setting above keeps deleted files for one minute.
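A short usage sketch, assuming the configuration above is active and commands run as root (the trash lives under the standard /user/<username>/.Trash/Current layout):

# Delete a file; with the trash enabled it is moved, not destroyed
[root@hop01 hadoop2.7]# hadoop fs -rm /hopdir/README.txt
# The file now sits in the current user's trash directory
[root@hop01 hadoop2.7]# hadoop fs -ls /user/root/.Trash/Current/hopdir
# Restore it by moving it back before fs.trash.interval expires
[root@hop01 hadoop2.7]# hadoop fs -mv /user/root/.Trash/Current/hopdir/README.txt /hopdir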