The HDFS storage capacity is insufficient, so a new disk must be added to expand it. Capacity can be expanded at either the Linux level or the HDFS level. The DataNode data directories are listed in the dfs.datanode.data.dir property of hdfs-site.xml.

Linux level

At the Linux level, expand the partition or logical volume on which the HDFS DataNode data directory is mounted.

If the data directory sits directly on a fixed disk partition and the disk still has unallocated space, the partition itself can be extended; see blog.51cto.com/wutou/17829… for a reference. However, managing disk space with raw partitions is inconvenient and expansion is error-prone. Partitions also cannot pool the space of two disks: once one disk is full, adding a new disk does not let you extend a partition on the old one. This is exactly the problem that LVM (Logical Volume Manager) solves.

If the disk is managed with LVM, you can simply create a physical volume on the new disk, add it to the volume group that contains the logical volume mounted by the DataNode, and then extend the logical volume. For details, see linux.cn/article-321…
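The LVM route can be sketched as follows. The device name /dev/sdb and the volume-group/logical-volume names vg_data and lv_hdfs are placeholders for illustration; substitute your own:

```shell
pvcreate /dev/sdb                            # initialize the new disk as a physical volume
vgextend vg_data /dev/sdb                    # add it to the DataNode's volume group
lvextend -l +100%FREE /dev/vg_data/lv_hdfs   # grow the logical volume by all free space
resize2fs /dev/vg_data/lv_hdfs               # grow the ext4 filesystem (use xfs_growfs for XFS)
```

With `lvextend -r` the filesystem resize can be done in the same step, so the separate resize2fs call becomes unnecessary.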

HDFS level

Expand capacity

This approach does not expand the original volume. Instead, modify the dfs.datanode.data.dir value in hdfs-site.xml to add a new directory, for example:

<property>
	<name>dfs.datanode.data.dir</name>
	<value>
		/hadoop/hdfs/data,
		/data/hadoop/hdfs
	</value>
</property>

Multiple directories are separated by commas, so you only need to mount the new disk partition on the new directory (any partitioning scheme works; make the mount permanent via /etc/fstab) and set the owner of the newly added directory:

chown -R hdfs:hadoop dirPath
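Mounting the new partition permanently can be sketched as below; the device /dev/sdb1 is a placeholder, and /data/hadoop/hdfs matches the new directory added to dfs.datanode.data.dir above:

```shell
mkfs.ext4 /dev/sdb1                    # format the new partition (erases any existing data)
mkdir -p /data
mount /dev/sdb1 /data                  # mount for the current session
echo '/dev/sdb1 /data ext4 defaults 0 0' >> /etc/fstab   # make the mount survive reboots
mkdir -p /data/hadoop/hdfs             # the new dfs.datanode.data.dir entry
```

After this, apply the chown shown above so the DataNode process can write to the directory.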

After modifying dfs.datanode.data.dir, restart the DataNode. Then run

./bin/hdfs dfsadmin -report

to view the storage capacity of each DataNode.

Resolving data skew

However, this method only adds a new disk to HDFS; the space already consumed on the original disk stays consumed. If the old disk is smaller than the new one, it will still fill up first as HDFS keeps writing new data. To prevent HDFS from completely filling a disk (or even filling the system disk), set dfs.datanode.du.reserved in hdfs-site.xml to reserve some disk space:

<property>  
    <name>dfs.datanode.du.reserved</name>  
    <value>32212254720</value>  
</property>

This reserves 30 GB of disk space (the value is in bytes): once the free space on a volume drops to 30 GB or less, no further block replicas are written to it.
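The value is specified in bytes; as a quick sanity check (treating 30 GB as 30 GiB), the figure above works out as:

```shell
# dfs.datanode.du.reserved is expressed in bytes: 30 GiB = 30 * 1024^3.
echo $((30 * 1024 ** 3))   # prints 32212254720, the value configured above
```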

However, this still does not free the space already occupied on the old disk. Ideally you would move replicas from the old disk to the new one, but HDFS does not provide that function directly, so one workaround is simply to delete unneeded files (some of whose replicas may live on the old disk).

If you don’t want to delete any data, you can instead lower and then restore the replication factor. For example, reduce the replication factor from 3 to 2: HDFS deletes one replica of each block, including some replicas on the old disks. Then raise it back to 3: with the reserved space configured, the new replicas are written to the new disk. Use the HDFS setrep command to lower and raise the replication factor; -w waits for the operation to complete, and -R recursively adjusts every file under the given path.

hdfs dfs -setrep -w -R 2 path
hdfs dfs -setrep -w -R 3 path

Finally, use the Hadoop Balancer tool to rebalance data across the DataNodes.

Official description of Balancer:

The balancer is a tool that balances disk space usage on an HDFS cluster when some datanodes become full or when new empty nodes join the cluster.

The tool is deployed as an application program that can be run by the cluster administrator on a live HDFS cluster while applications adding and deleting files.

Balancer command

hdfs balancer [-policy <policy>] [-threshold <threshold>] [-exclude [-f <hosts-file> | <comma-separated list of hosts>]]  [-include [-f <hosts-file> | <comma-separated list of hosts>]] [-source [-f <hosts-file> | <comma-separated list of hosts>]] [-blockpools <comma-separated list of blockpool ids>] [-idleiterations <idleiterations>] [-runDuringUpgrade]
COMMAND_OPTION: Description
-policy <policy>: datanode (default): Cluster is balanced if each datanode is balanced. blockpool: Cluster is balanced if each block pool in each datanode is balanced.
-threshold <threshold>: Percentage of disk capacity. This overwrites the default threshold.
-exclude -f <hosts-file> | <comma-separated list of hosts>: Excludes the specified datanodes from being balanced by the balancer.
-include -f <hosts-file> | <comma-separated list of hosts>: Includes only the specified datanodes to be balanced by the balancer.
-source -f <hosts-file> | <comma-separated list of hosts>: Pick only the specified datanodes as source nodes.
-blockpools <comma-separated list of blockpool ids>: The balancer will only run on blockpools included in this list.
-idleiterations <idleiterations>: Maximum number of idle iterations before exit. This overwrites the default idleiterations.
-runDuringUpgrade: Whether to run the balancer during an ongoing HDFS upgrade. This is usually not desired since it will not affect used space on over-utilized machines.
-h | --help: Display the tool usage and exit.

Here I only adjust the allowed utilization difference between DataNodes:

hdfs balancer -threshold 5
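The threshold means each DataNode's disk utilization must end up within that many percentage points of the cluster-wide average. The condition can be illustrated with plain shell arithmetic (this is an illustration only, not an HDFS API):

```shell
# A DataNode counts as balanced when its utilization is within
# <threshold> percentage points of the cluster average.
is_balanced() {   # usage: is_balanced <node_pct> <avg_pct> <threshold>
  local diff=$(( $1 - $2 ))
  [ "${diff#-}" -le "$3" ]   # absolute difference <= threshold
}

# Example with a cluster average utilization of 60% and threshold 5:
is_balanced 63 60 5 && echo "63% node: balanced"
is_balanced 72 60 5 || echo "72% node: over-utilized, balancer moves blocks off it"
```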

This resolves the data imbalance across the Hadoop cluster.