A brief introduction to HDFS

HDFS (Hadoop Distributed File System) is the core sub-project of the Hadoop project. In big data development, massive amounts of data are stored and managed in a distributed fashion.

HDFS is a typical distributed system with master/slave architecture. An HDFS cluster consists of a NameNode and a number of DataNodes.

By analogy, the NameNode can be thought of as a warehouse manager who keeps track of every item in the warehouse, while a DataNode is the warehouse itself, holding the goods, which in this case are the data.
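To make the analogy concrete: a Java client asks the NameNode for a file's metadata and block locations, then reads the block data from the DataNodes that hold it. The sketch below is only illustrative; it assumes the hadoop-client dependency, cluster address, and example file used later in this article, and the class name is made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: ask the "warehouse manager" (NameNode) where the "goods" (blocks) live.
public class BlockLocationsDemo {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://192.168.222.10:9000"); // example cluster address used later in this article
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/input/wc.txt");                  // example file used later in this article
        FileStatus status = fs.getFileStatus(file);             // metadata kept by the NameNode
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // each block is physically stored on one or more DataNodes
            System.out.println("offset " + block.getOffset() + " -> hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}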

HDFS command-line operations

The command line interface is as follows:

$ bin/hadoop fs -<command> <file path>

or

$ bin/hdfs dfs -<command> <file path>

ls

Use the ls command to view directories and files on your HDFS system.

$ hadoop fs -ls /

Operation demonstration:

[root@centos01 ~]# hadoop fs -ls /
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2021-07-10 08:58 /input
drwx------   - hadoop supergroup          0 2021-07-10 08:38 /tmp

Recursively list all directories and files in the root directory of the HDFS file system:

[root@centos01 ~]# hadoop fs -ls -R /
drwxr-xr-x   - hadoop supergroup          0 2021-07-10 08:58 /input
-rw-r--r--   2 hadoop supergroup         83 2021-07-10 08:58 /input/wc.txt
drwx------   - hadoop supergroup          0 2021-07-10 08:38 /tmp
drwx------   - hadoop supergroup          0 2021-07-10 08:38 /tmp/hadoop-yarn
drwx------   - hadoop supergroup          0 2021-07-10 08:38 /tmp/hadoop-yarn/staging
drwx------   - hadoop supergroup          0 2021-07-10 08:38 /tmp/hadoop-yarn/staging/hadoop
drwx------   - hadoop supergroup          0 2021-07-10 08:49 /tmp/hadoop-yarn/staging/hadoop/.staging

put

Use the put command to upload local files to the HDFS file system. To upload the local file a.txt to the /input folder under the root of the HDFS file system, the command is as follows:

$ hadoop fs -put a.txt /input/

get

Use the get command to download files from the HDFS file system to the local file system. Note that the destination must not have the same name as an existing local file; otherwise the command reports that the file already exists.

$ hadoop fs -get /input/a.txt a.txt 

Download a folder to the local file system:

$ hadoop fs -get /input/ ./

Common commands

List the files in the HDFS root directory: $ hadoop dfs -ls /
Recursively list the root directory: $ hadoop dfs -ls -R /
Create the directory /input: $ hadoop dfs -mkdir /input
List the files in the HDFS folder named input: $ hadoop dfs -ls /input
Upload the local file test.txt to HDFS: $ hadoop fs -put /home/binguner/Desktop/test.txt /input
Save the HDFS file test.txt to the local Desktop folder: $ hadoop dfs -get /input/test.txt /home/binguenr/Desktop
Delete test.txt from HDFS: $ hadoop dfs -rmr /input/test.txt
Enter safe mode: $ hadoop dfsadmin -safemode enter
Leave safe mode: $ hadoop dfsadmin -safemode leave
Print a cluster status report: $ hadoop dfsadmin -report

Java API operations

1. Create the project

Add the Hadoop client dependency to the project's pom.xml file:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.8.2</version>
</dependency>

Create a new class com/homay/hadoopstudy/FileSystemCat.java:

package com.homay.hadoopstudy;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.InputStream;

/**
 * @author kaiyi
 * @date 2021/7/12 0:25
 */
public class FileSystemCat {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://192.168.222.10:9000");
        FileSystem fs = FileSystem.get(conf);

        // Open the file on HDFS and copy its contents to standard output
        InputStream in = fs.open(new Path("hdfs:/input/wc.txt"));
        IOUtils.copyBytes(in, System.out, 4096, false);
        IOUtils.closeStream(in);
    }
}
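For reference, the same ls, put, and get operations shown earlier on the command line can also be performed through the FileSystem API. The following is a minimal sketch under the same assumptions as above; the class name FileSystemOps and the local and HDFS paths are only illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemOps {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://192.168.222.10:9000"); // same example cluster as above
        FileSystem fs = FileSystem.get(conf);

        // ls: list the root directory (equivalent to `hadoop fs -ls /`)
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "  " + status.getLen());
        }

        // put: upload a local file into /input (equivalent to `hadoop fs -put a.txt /input/`)
        fs.copyFromLocalFile(new Path("a.txt"), new Path("/input/"));

        // get: download a file from HDFS to the local file system (equivalent to `hadoop fs -get`)
        fs.copyToLocalFile(new Path("/input/wc.txt"), new Path("wc-local.txt"));

        // mkdir and delete, as in the common-commands list above
        fs.mkdirs(new Path("/output"));
        fs.delete(new Path("/output"), true); // true = recursive

        fs.close();
    }
}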

View the Hadoop file:

[hadoop@centos01 sbin]$ hadoop dfs -ls -R /
WARNING: Use of this script to execute dfs is deprecated.
WARNING: Attempting to execute replacement "hdfs dfs" instead.

drwxr-xr-x   - hadoop supergroup          0 2021-07-10 08:58 /input
-rw-r--r--   2 hadoop supergroup         83 2021-07-10 08:58 /input/wc.txt
drwx------   - hadoop supergroup          0 2021-07-10 08:38 /tmp
drwx------   - hadoop supergroup          0 2021-07-10 08:38 /tmp/hadoop-yarn
drwx------   - hadoop supergroup          0 2021-07-10 08:38 /tmp/hadoop-yarn/staging
drwx------   - hadoop supergroup          0 2021-07-10 08:38 /tmp/hadoop-yarn/staging/hadoo
drwx------   - hadoop supergroup          0 2021-07-10 08:49 /tmp/hadoop-yarn/staging/hadoo
[hadoop@centos01 sbin]$ hadoop -v

2. Run

Running the file throws an error: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset

The exception occurs when connecting to the remote Hadoop cluster from the local machine; the log is as follows:

22:27:56.703 [main] DEBUG org.apache.hadoop.util.Shell - Failed to detect a valid hadoop home directory
java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
    at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
    at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)

The log message is clear: HADOOP_HOME and hadoop.home.dir are not set. What are these two settings for? They point to a local Hadoop installation, which raises the question of whether the Windows version of Hadoop must be downloaded just to set them. If you are connecting remotely to a Hadoop cluster running on Linux, there is no need to download and install the Windows version of Hadoop at all!

Connecting to a remote Hadoop cluster from a local Windows machine does, however, require a few Hadoop helper files to be configured locally, mainly hadoop.dll and winutils.exe.

winutils: Hadoop is primarily developed on and for Linux, so winutils.exe is used to emulate the Linux directory environment on Windows. This helper is required when Hadoop runs on Windows or when a remote Hadoop cluster is called from Windows. winutils is a set of Windows binaries matching specific Hadoop versions; it is built on a Windows VM and is used to run and test Hadoop-related applications on Windows systems.

The solution

Once the cause is clear, download the winutils build that matches the version of the Hadoop cluster you installed.

Download address: https://github.com/stevelough…

Note: if there is no download for your exact version, choose the closest available version. For example, if the cluster uses 2.8.5, the 2.8.3 files can be used.

Set the environment variable %HADOOP_HOME% to point to the directory one level above the bin directory that contains winutils.exe. That is:

  1. Create a new system variable named HADOOP_HOME.
  2. Copy the bin folder from the downloaded 2.8.3 folder to the location the variable points to:
  3. Restart IDEA; the problem is solved.

Note: there is no need to download and install the Windows version of Hadoop; just add winutils.exe as described above.
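As an alternative to editing system environment variables, the hadoop.home.dir property can also be set from code before any Hadoop classes are used. This is only a sketch; the class name and the C:\hadoop-2.8.3 path are assumptions about where the winutils files were unpacked:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HadoopHomeSetup {

    public static void main(String[] args) throws Exception {
        // Assumed path: wherever the downloaded bin\winutils.exe folder was placed.
        // Must be set before any Hadoop class that touches the Shell utility is loaded.
        System.setProperty("hadoop.home.dir", "C:\\hadoop-2.8.3");

        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://192.168.222.10:9000");
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
        fs.close();
    }
}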

Wrong FS: hdfs:/input/wc.txt, expected: file:///

Detailed error message:

23:51:26.466 [main] DEBUG org.apache.hadoop.fs.FileSystem - fs for file is class org.apache.hadoop.fs.LocalFileSystem
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs:/input/wc.txt, expected: file:///
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:730)

Solution:

Hadoop requires that core-site.xml and hdfs-site.xml on the cluster be placed under the current project.

1)hdfs-site.xml

2)core-site.xml

3)mapred-site.xml

The three files above are the configuration XML files from the Hadoop installation in your Linux environment. Copy core-site.xml and hdfs-site.xml from the Hadoop cluster into the src/main/resources directory of the project.

Then run the class again, and you can see that the Java call to the Hadoop API succeeds.
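Alternatively, here is a minimal sketch that loads the configuration files explicitly with Configuration.addResource, or binds the FileSystem to the cluster URI directly so that a path such as /input/wc.txt resolves against HDFS rather than file:/// (the conf/ paths and class name are just examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.InputStream;
import java.net.URI;

public class FileSystemCatQualified {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Option 1: load the cluster configuration files explicitly
        // (example paths; they must point at the copies taken from the cluster).
        conf.addResource(new Path("conf/core-site.xml"));
        conf.addResource(new Path("conf/hdfs-site.xml"));

        // Option 2: bind the FileSystem to the cluster URI directly,
        // so a path like /input/wc.txt is resolved against HDFS, not the local file system.
        FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.222.10:9000"), conf);

        InputStream in = fs.open(new Path("/input/wc.txt"));
        IOUtils.copyBytes(in, System.out, 4096, false);
        IOUtils.closeStream(in);
        fs.close();
    }
}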