Goal:

  • Understand the background and definition of HDFS
  • Understand the advantages and disadvantages of HDFS
  • Understand the HDFS architecture
  • Learn the basic HDFS shell operations

1 Overview of HDFS

1.1 Background and definition

  • Background

    As data volumes grow, a single operating system can no longer hold all of the data, so the data ends up spread across disks managed by many operating systems. That is inconvenient to manage and maintain, so a system is needed to manage files across multiple machine nodes: a distributed file system.

    Hadoop mainly solves two problems: the storage of massive data, handled by the distributed file system HDFS, and the computation over massive data, handled by the distributed computing framework MapReduce.

  • Definition

    The Hadoop Distributed File System (HDFS) is a distributed file system: it stores files and locates them through a directory tree.

    HDFS application scenarios: suitable for write-once, read-many workloads; files cannot be modified in place after they are written. It is well suited to data analysis and not suited to use as a network drive.

1.2 Advantages and disadvantages

  • Advantages
    • High fault tolerance
      • Multiple replicas of the data are saved automatically; adding replicas improves fault tolerance.
      • If a replica is lost, it is restored automatically.
    • Suitable for handling big data
      • Data scale: can handle data at the GB, TB, or even PB scale, such as daily user behavior logs.
      • File count: can handle more than a million files.
    • Can be built on cheap machines, with multiple replicas providing reliability (the cost of horizontal scaling grows linearly)
      • Vertical scaling, that is, upgrading a single machine, runs into hardware bottlenecks, and its cost grows exponentially
    • Fast response to hardware failures
  • Disadvantages
    • Not suitable for low-latency data access; for example, millisecond-level reads and writes are not achievable
    • Cannot store large numbers of small files efficiently
      • Large numbers of small files consume a lot of NameNode (NN) memory for directory and block information, and the NN's memory is limited.
      • For a small file, the addressing (seek) time exceeds the read time, which conflicts with the design goals of HDFS.
    • Concurrent writes and random file modification are not supported
      • A file can have only one writer at a time; multiple threads cannot write to it simultaneously.
      • Only appending data is supported; random modification of a file is not.
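
To make the small-file problem concrete, here is a rough estimate of NameNode memory use. It assumes the commonly quoted rule of thumb of about 150 bytes of heap per file or block entry; that figure does not come from this article and varies between versions, and the function name and constants are illustrative only.

```python
# Rough NameNode heap estimate. BYTES_PER_OBJECT is an assumed rule of thumb,
# not a number taken from this article or guaranteed by Hadoop.
import math

BYTES_PER_OBJECT = 150          # assumed heap cost of one file or block entry
BLOCK_SIZE = 128 * 1024 ** 2    # 128 MB default block size (Hadoop 2.x)


def namenode_heap_bytes(file_count, file_size):
    """Estimate NameNode heap used by file_count files of file_size bytes each."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    objects = file_count * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT


total = 1024 ** 4  # the same 1 TB of data in both cases
small = namenode_heap_bytes(total // 1024 ** 2, 1024 ** 2)  # stored as 1 MB files
large = namenode_heap_bytes(total // 1024 ** 3, 1024 ** 3)  # stored as 1 GB files
print(f"1 MB files: {small / 1024 ** 2:.1f} MB of heap")
print(f"1 GB files: {large / 1024 ** 2:.1f} MB of heap")
```

Under these assumptions, 1 TB stored as 1 MB files costs about 300 MB of heap, while the same data stored as 1 GB files costs about 1.3 MB.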

2 HDFS architecture

2.1 Architecture

  • Client: the client side
    • File splitting: when uploading a file, the client splits it into blocks and uploads the blocks
    • Interacts with the NameNode (NN) to get file location information
    • Interacts with the DataNode (DN) to read or write data
    • Provides commands to manage HDFS, such as formatting the NN
    • Provides commands to access HDFS, such as adding, deleting, and querying files
  • NameNode (NN): the master; it acts as the director and administrator
    • Manages every DN
    • Manages file metadata, including file name, file size, and storage location
    • Configures the replication policy. When a DN goes down, the data on it is lost, so the NN must control how block replicas are kept across DNs.
    • Monitors the status of each DN in the cluster through an RPC heartbeat mechanism
    • Handles client read and write requests
    • The NN is a single point of failure. A standby NN can be enabled to keep the metadata safe
  • DataNode (DN): the slave; the NN issues commands and the DN performs the actual operations
    • Stores the actual data blocks
    • Performs read and write operations on data blocks
  • Secondary NameNode: not a hot standby for the NN. When the NN goes down, it cannot immediately take over and provide service
    • Helps the NN by sharing its workload, for example by periodically merging the Fsimage and Edits files and pushing the result to the NN
    • In an emergency, it can assist in recovering the NN
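
The file splitting done by the client can be sketched as follows. This is a simplified model, not the actual client code: the function name split_into_blocks is hypothetical, and the block size is the 128 MB default of Hadoop 2.x.

```python
# Simplified model of how a file is divided into fixed-size blocks.
MB = 1024 ** 2


def split_into_blocks(file_size, block_size=128 * MB):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block holds only the remaining bytes, not a full block.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


for index, (offset, length) in enumerate(split_into_blocks(300 * MB)):
    print(f"block {index}: offset={offset // MB} MB, length={length // MB} MB")
```

A 300 MB file becomes three blocks of 128 MB, 128 MB, and 44 MB. The NN records which DNs hold each block, and the client then reads or writes the blocks on those DNs.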

2.2 File Block Size

  • Files are stored in blocks, 128 MB by default (Hadoop 2.x)
    • The block size is set by the configuration parameter dfs.blocksize. The default is 128 MB in Hadoop 2.x and 64 MB in Hadoop 1.x.
    • Rule of thumb: a block is sized best when the addressing time is about 1% of the transfer time. With an addressing time of about 10 ms, the transfer time should be 10 ms / 0.01 = 1 s. At the current typical disk transfer rate of 100 MB/s, 1 s moves about 100 MB, which is why the default block size sits near that value, at 128 MB.
  • The size can be tuned: the HDFS block size depends mainly on the disk transfer rate
    • If blocks are too small, there are more blocks to locate, which increases the total addressing time; the program spends its time looking for the start of each block
    • If blocks are too large, the time to transfer the data from disk becomes much longer than the time to locate the start of the block, so the program processes each block very slowly.
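
The block-size arithmetic above can be checked with a few lines. The inputs (10 ms addressing time, 1% ratio, 100 MB/s) are the figures from the text; rounding up to the next power of two is an assumption made here to arrive at 128 MB, and both function names are illustrative.

```python
# Worked version of the block-size rule of thumb.
MB = 1024 ** 2


def ideal_block_bytes(seek_seconds, seek_ratio, transfer_bytes_per_second):
    """Block size for which seek time is seek_ratio of the transfer time."""
    transfer_seconds = seek_seconds / seek_ratio
    return transfer_seconds * transfer_bytes_per_second


def round_up_to_power_of_two_mb(size_bytes):
    """Smallest power-of-two number of MB that is >= size_bytes (assumed rounding)."""
    mb = 1
    while mb * MB < size_bytes:
        mb *= 2
    return mb


ideal = ideal_block_bytes(0.010, 0.01, 100 * MB)
print(f"ideal: {ideal / MB:.0f} MB, rounded: {round_up_to_power_of_two_mb(ideal)} MB")
```

With a faster disk the same rule gives a larger block: at 200 MB/s the ideal size is about 200 MB, which this rounding would take to 256 MB.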

3 HDFS Shell operations

3.1 Basic Syntax

  • hadoop fs: works with any file system that Hadoop supports, not just HDFS, so it is the more general command
  • hdfs dfs: intended for HDFS specifically

3.2 Common Commands

  • List all the commands that hadoop fs supports

    hadoop fs -help
  • Create nested directories under the HDFS root (-p creates any missing parent directories)

    hadoop fs -mkdir -p /test/a/b
  • Upload a local file to HDFS

    hadoop fs -put ./test.txt /test/a/b
  • Download files (or directories) from HDFS to a local directory

    hadoop fs -get /test/a/b/test.txt ../data
  • Delete the file (-R deletes recursively, so the same form also removes directories)

    hadoop fs -rm -R /test/a/b/test.txt
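
When these commands need to be scripted, one option is a thin wrapper that builds the argument lists. The sketch below is only an illustration: the helper names hadoop_fs and run are hypothetical, -ls is an extra listing step not covered above, and run is never called here because it needs the hadoop CLI on PATH and a reachable cluster.

```python
# Build `hadoop fs` command lines as argv lists; printing only, nothing is executed.
import shlex
import subprocess


def hadoop_fs(*args):
    """Return the argv list for one `hadoop fs` call."""
    return ["hadoop", "fs", *args]


def run(argv):
    """Run one command; requires the hadoop CLI and a running cluster."""
    return subprocess.run(argv, check=True)


steps = [
    hadoop_fs("-mkdir", "-p", "/test/a/b"),
    hadoop_fs("-put", "./test.txt", "/test/a/b"),
    hadoop_fs("-ls", "/test/a/b"),
    hadoop_fs("-rm", "/test/a/b/test.txt"),
]
for argv in steps:
    print(shlex.join(argv))  # replace with run(argv) on a real cluster
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.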

OK, that's all for today's introduction to HDFS. Next, we'll move on to the advanced part of HDFS.

Bye bye