Preface

This article introduces the HDFS architecture and its read and write processes, and gives a programming example of read and write operations, in the hope of providing a preliminary understanding of HDFS.

Introduction

HDFS (Hadoop Distributed File System) is a distributed file system that runs on commodity hardware. Its design is derived from The Google File System, a paper published by Google in 2003, and it is intended to solve the problem of storing and managing data at large scale.

Architecture

As the figure above shows, HDFS uses a standard master/slave architecture and consists of three parts:

  1. NameNode (Master node)
    • Manages the metadata, which consists of file path names, data block IDs, and storage locations.
    • Manages the HDFS namespace.
  2. SecondaryNameNode
    • Periodically merges the NameNode's edit log (the sequence of changes to the file system) into the fsimage (a snapshot of the entire file system) and copies the merged fsimage back to the NameNode.
    • Provides a checkpoint of the NameNode (it should not be thought of as a backup of the NameNode) that can be used to recover the NameNode.
  3. DataNode (Slave node)
    • Provides file storage and data block operations.
    • Periodically reports block information to the NameNode.

Here are some of the concepts that appear in the figure:

  1. Replication

    To ensure high data availability, HDFS stores three replicas of written data by default.

  2. Blocks

    A block is the basic unit of storage and operation (128 MB by default). A block here refers to a file system block, not a physical disk block, and its size is usually an integer multiple of the physical block size. Both the replication factor and the block size can be configured, as sketched after this list.
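
Both defaults can be changed in hdfs-site.xml. A minimal configuration sketch (the values shown here are simply the defaults; adjust them to your cluster):

    <property>
        <name>dfs.replication</name>
        <value>3</value><!-- number of replicas kept per block -->
    </property>
    <property>
        <name>dfs.blocksize</name>
        <value>134217728</value><!-- block size in bytes (128 MB) -->
    </property>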

Execution process

Reading a file

The process of reading a file can be summarized as:

  1. The Client sends a request to the NameNode to obtain the locations of the file's data blocks
  2. The Client connects to the nearest DataNodes that hold those blocks and reads the data (a way to inspect block locations is sketched below)
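
If you want to see which DataNodes hold the blocks of a particular file, fsck can print the block list and their locations. A minimal sketch (the <path> placeholder follows the same convention as the commands later in this article):

    hdfs fsck <path> -files -blocks -locations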

Writing a file

The process of writing a file can be summarized as:

  1. The Client sends a write request to the NameNode and obtains information such as the list of DataNodes that can be written to
  2. The Client splits the file into blocks according to the block size configured in HDFS
  3. The Client and the DataNodes assigned by the NameNode form a pipeline, and the data is written along it
  4. After the write is complete, the DataNodes report to the NameNode, which updates the metadata (a client-side sketch follows this list)
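
From the client's point of view, all of this happens behind a single output stream. The following is only a sketch (the class name, the HDFS path, the NameNode address hdfs://localhost:9000, and the replication/block-size values are illustrative assumptions); it also shows how a per-file replication factor and block size can be requested when creating the file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PipelineWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode address

            FileSystem fs = FileSystem.get(conf);
            // Ask for 2 replicas and a 64 MB block size for this file only.
            // The HDFS client splits the stream into blocks and drives the
            // DataNode pipeline; the application just writes bytes.
            FSDataOutputStream out = fs.create(
                    new Path("/tmp/pipeline-demo.txt"), // hypothetical HDFS path
                    true,                               // overwrite if the file exists
                    4096,                               // I/O buffer size in bytes
                    (short) 2,                          // replication factor
                    64L * 1024 * 1024);                 // block size in bytes
            out.writeBytes("Hello HDFS\n");
            out.close();
            fs.close();
        }
    }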

Common commands

File operations

  1. List files

    hdfs dfs -ls <path>
  2. Create a directory

    hdfs dfs -mkdir <path>
  3. Upload a file

    hdfs dfs -put <localsrc> <dst>
  4. Output file contents

    hdfs dfs -cat <src>
  5. Copy a file to a local directory

    hdfs dfs -get <src> <localdst>
  6. Delete files and directories

    hdfs dfs -rm <src>
    hdfs dfs -rmdir <dir>
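
A typical sequence for getting a local file into HDFS and checking it (the paths here are only examples):

    hdfs dfs -mkdir -p /tmp/demo
    hdfs dfs -put demo.txt /tmp/demo
    hdfs dfs -ls /tmp/demo
    hdfs dfs -cat /tmp/demo/demo.txt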

Management

  1. View statistics

    hdfs dfsadmin -report
  2. Enter and exit safe mode (in this mode no changes to the file system are allowed; a status check is sketched after this list)

    hdfs dfsadmin -safemode enter
    hdfs dfsadmin -safemode leave
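
To check the current safe mode status without changing it, the following can be used:

    hdfs dfsadmin -safemode get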

Programming example

  1. Create a Maven project in IDEA

    After selecting the relevant options, click Next and fill in the project information

  2. Add dependencies to pom.xml

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <!-- Select the version according to your Hadoop version -->
            <version>2.9.2</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.9.2</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.9.2</version>
        </dependency>
    </dependencies>
  3. Read and write files

    Create a Sample class and write the corresponding read and write functions

    • Sample class

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      import java.io.*;

      /**
       * @author ikroal
       */
      public class Sample {
          // The default HDFS address
          private static final String DEFAULT_FS = "hdfs://localhost:9000";
          private static final String PATH = DEFAULT_FS + "/tmp/demo.txt";
          private static final String DEFAULT_FILE = "demo.txt";

          public static void main(String[] args) {
              Configuration conf = new Configuration();
              FileSystem fs = null;
              conf.set("fs.defaultFS", DEFAULT_FS); // Set the HDFS address

              try {
                  fs = FileSystem.get(conf);
                  write(fs, DEFAULT_FILE, PATH);
                  read(fs, PATH);
              } catch (IOException e) {
                  e.printStackTrace();
              } finally {
                  try {
                      if (fs != null) {
                          fs.close();
                      }
                  } catch (IOException e) {
                      e.printStackTrace();
                  }
              }
          }
      }
    • Write a function

      /**
       * Write a file to HDFS
       * @param inputPath local file path
       * @param outPath   HDFS write path
       */
      public static void write(FileSystem fileSystem, String inputPath, String outPath) {
          FSDataOutputStream outputStream = null;
          FileInputStream inputStream = null;
          try {
              outputStream = fileSystem.create(new Path(outPath)); // Get the HDFS output stream
              inputStream = new FileInputStream(inputPath); // Read the local file
              int data;
              while ((data = inputStream.read()) != -1) { // Write operation
                  outputStream.write(data);
              }
          } catch (IOException e) {
              e.printStackTrace();
          } finally {
              try {
                  if (outputStream != null) {
                      outputStream.close();
                  }
                  if (inputStream != null) {
                      inputStream.close();
                  }
              } catch (IOException e) {
                  e.printStackTrace();
              }
          }
      }
    • Read function

      /**
       * Read a file from HDFS
       * @param path path of the file to read in HDFS
       */
      public static void read(FileSystem fileSystem, String path) {
          FSDataInputStream inputStream = null;
          BufferedReader reader = null;
          try {
              inputStream = fileSystem.open(new Path(path)); // Get the HDFS input stream
              reader = new BufferedReader(new InputStreamReader(inputStream));
              String content;
              while ((content = reader.readLine()) != null) { // Read and output to the console
                  System.out.println(content);
              }
          } catch (IOException e) {
              e.printStackTrace();
          } finally {
              try {
                  if (inputStream != null) {
                      inputStream.close();
                  }
                  if (reader != null) {
                      reader.close();
                  }
              } catch (IOException e) {
                  e.printStackTrace();
              }
          }
      }
  4. Create the file to be uploaded (in this case demo.txt) under the root of the project folder and write Hello World! into it

  5. Start Hadoop and run the program to see the results

    The write result can be viewed at http://localhost:50070/explorer.html#/

    The read result is printed to the console as the contents of the uploaded file (it can also be checked from the command line, as shown below)
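
    The same check can be done on the command line, using the HDFS path from the example above:

    hdfs dfs -cat /tmp/demo.txt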
