Third, a preliminary understanding of HDFS

background

In reality, as the amount of data increases, an operating system can not store all the data, so it is not convenient to manage and maintain more operating system disks, so it needs a system to manage files on multiple machine nodes.

Hadoop mainly solves two problems: one is the storage of massive data, that is, the distributed file management system HDFS; The second is the computing problem of massive data, namely distributed computing MapReduce.
define

Hadoop Distribute File System (HDFS) is a distributed File System that stores files and is located based on the directory tree.

HDFS application scenarios: Suitable for multiple read and write scenarios, and does not support file modification. Suitable for data analysis, and not suitable for the use of network disk.

Client: indicates the Client
- File splitting: During file uploading, the client splits the file into blocks and uploads the file
- Interact with NN to get file location information
- Interacts with the DN to read or write data
- Client provides several commands to manage HDFS, such as NN formatting
- The Client can use commands to access the HDFS, such as adding, deleting, and querying the HDFS
NameNode: Master, is a director, administrator
- Manage each DN
- Manage file information, including file name, file size, file size, and storage location, that is, manage metadata information
- Configure a copy policy. After a DN hangs up, data is lost. Therefore, you need to control the backup of the DN.
- Control the status of each node (DN) in the cluster based on PRC heartbeat mechanism
- Process client read/write requests
- NN has a single point of failure. You can enable an alternate NN to ensure the security of metadata information
DataNode: slave, NN gives commands, and DN performs actual operations
- Store actual blocks of data
- Performs read and write operations on data blocks
Secondary NameNode: Is not the hot standby of the NN. When the NN dies, it cannot immediately replace the NN to provide services
- Assist NN to share its workload, such as regularly merging Fsimage and Edits and pushing them to NN
- In an emergency, you can assist in restoring NN

Block storage, 128 MB (2.x) by default
- Block size can be specified based on the configuration parameter dfs.blocksize. The default blocksize is 128 MB in hadoop2. x and 64 MB in hadoop1. x.
- A block in a cluster is best if the addressing time is 10ms, which is generally 1% of the transfer time, so the transfer time is 10ms/0.01=1s, compared to the current disk transfer rate of 100MB /s
Size can be set:The block size of HDFS depends on the disk transfer efficiency
- If HDFS blocks are set too small, it will increase the addressing time, and the program will always be looking for the starting location of the block
- If blocks in the HDFS are too large, the time it takes to transfer data from the disk is significantly longer than the time it takes to locate the starting position of the block, causing the application to process the data very slowly.

Hadoop fs: This command can be used in other file systems, not just HDFS, which means that this command is more widely used
HDFS DFS: HDFS distributed file system

See all the commands that Hadoop can execute
```
hadoop fs -help
Copy the code
```

Create folders in the Hadoop root directory

hadoop fs -mkdir -p /test/a/b
Copy the code

Upload the local file to the HDFS

hadoop fs -put ./test.txt /test/a/b
Copy the code

Bring files (directories) from Hadoop to the local directory
```
hadoop fs -get /test/a/b/test.txt .. /dataCopy the code
```

Delete the file

hadoop fs -rm -R /test/a/b/test.txt
Copy the code

OK, that’s all for today’s preliminary knowledge of HDFS, and then we’ll move on to the advanced part of HDFS.

Bye bye