Written in front: This article introduces HDFS's storage model, architecture design, and read/write processes in detail. As the foundation of divide-and-conquer, parallel computing in Hadoop's computing layer, HDFS lays the groundwork for the later introduction of MapReduce.

Storage model

  • A file is linearly split into blocks of bytes; each block has an offset and an id (see the sketch after this list)
  • Blocks of different files can have different sizes
  • Within a single file, every block is the same size except possibly the last one
  • The block size should be tuned to the I/O characteristics of the underlying hardware
  • Blocks are distributed across the nodes of the cluster, each with its own location information
  • Each block has replicas (replication); replicas are peers with no master-slave relationship, and replicas of the same block never sit on the same node
  • Replicas are the key to both reliability and performance
  • The block size and replication factor can be specified when a file is uploaded; after upload, only the replication factor can be changed
  • Only write-once, read-many is supported: in-place modification is not allowed (modifying one block would shift the offsets of every subsequent block, forcing them to be adjusted and their replicas redistributed, which would severely degrade performance)
  • Appending data to a file is supported
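
A minimal sketch of the block split, assuming a simple illustrative Block record (the names are stand-ins, not HDFS internals): given a file length and block size, every block gets an id and a byte offset, and only the last block may be shorter.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSplitSketch {
    // Illustrative block descriptor: HDFS tracks an id and an offset per block.
    record Block(long id, long offset, long length) {}

    static List<Block> split(long fileLength, long blockSize) {
        List<Block> blocks = new ArrayList<>();
        long id = 0;
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            // Every block is blockSize long except possibly the last one.
            long length = Math.min(blockSize, fileLength - offset);
            blocks.add(new Block(id++, offset, length));
        }
        return blocks;
    }

    public static void main(String[] args) {
        // A 300 MB file with a 128 MB block size yields blocks of
        // 128 MB, 128 MB, and 44 MB.
        long mb = 1024L * 1024L;
        split(300 * mb, 128 * mb).forEach(System.out::println);
    }
}
```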

Architecture design

  • HDFS is a master/slave (Master/Slaves) architecture
  • It consists of one NameNode (master) process and a number of DataNode (slave) processes
  • It is file-oriented: a file comprises file data (data) and file metadata (metadata)
  • The NameNode stores and manages the file metadata and maintains a hierarchical file directory tree (a virtual directory structure)
  • DataNodes store the file data (blocks) and serve block reads and writes
  • Each DataNode maintains a heartbeat with the NameNode and reports the blocks it holds
  • The client exchanges file metadata with the NameNode and file block data with the DataNodes

In the figure above, the metadata maintained by the NameNode records the file name, replication factor, and block ids of two files. File part-0 has a replication factor of 2 and blocks 1 and 3; file part-1 has a replication factor of 3 and blocks 2, 4, and 5. Among the DataNodes, blocks 1 and 3 each appear on two DataNodes, while blocks 2, 4, and 5 each appear on three DataNodes.
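
That metadata can be pictured as a simple table; a hypothetical sketch (the paths and the FileMeta shape are stand-ins, not the NameNode's real data structures):

```java
import java.util.List;
import java.util.Map;

public class MetadataSketch {
    // Hypothetical shape of one metadata entry: replication factor
    // plus the ordered block ids that make up the file.
    record FileMeta(int replication, List<Integer> blockIds) {}

    public static void main(String[] args) {
        // The two files from the figure above.
        Map<String, FileMeta> metadata = Map.of(
                "/user/demo/part-0", new FileMeta(2, List.of(1, 3)),
                "/user/demo/part-1", new FileMeta(3, List.of(2, 4, 5)));
        // Block-to-DataNode locations are reported by DataNodes at runtime;
        // the NameNode does not persist them.
        metadata.forEach((path, meta) -> System.out.println(path + " -> " + meta));
    }
}
```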

Role functions

NameNode

  • Stores the file metadata, directory structure, and file-to-block mapping entirely in memory (keeping everything in memory lets it serve requests quickly)
  • Needs a persistence scheme to keep the metadata reliable (memory is lost on power failure and limited in size)
  • Provides the replica placement policy

DataNode

  • Stores blocks on the local disk (HDFS does not store the data itself; it only manages the mapping, while block data lives on the DataNodes' local file systems)
  • Also stores a checksum for each block to guarantee the block's integrity (see the sketch after this list)
  • Maintains a heartbeat with the NameNode and reports its list of blocks and their state
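
A sketch of the per-chunk checksum idea, assuming CRC32 over 512-byte chunks (HDFS stores 4 checksum bytes per 512-byte data chunk; the helper below is illustrative, not the DataNode's actual code):

```java
import java.util.zip.CRC32;

public class ChecksumSketch {
    // Compute a CRC32 checksum over one 512-byte chunk, as stored
    // alongside the data (4 bytes per chunk in HDFS).
    static long chunkChecksum(byte[] chunk) {
        CRC32 crc = new CRC32();
        crc.update(chunk);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] chunk = new byte[512];          // one data chunk
        long stored = chunkChecksum(chunk);    // computed at write time
        // On read, the checksum is recomputed and compared to detect corruption:
        boolean intact = stored == chunkChecksum(chunk);
        System.out.println("chunk intact: " + intact);
    }
}
```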

Metadata persistence

  • Every operation that changes the file system metadata is recorded by the NameNode in a transaction log called the EditLog
  • The FsImage stores a full snapshot of the metadata state held in memory
  • Both the EditLog and the FsImage are saved on the local disk
  • The EditLog is complete and loses almost no data, but replaying it is slow and its volume keeps growing (it records every add/delete/update in real time; its biggest advantage is when it stays small with few records)
  • The FsImage recovers quickly and its size is close to that of the in-memory data, but it cannot be written in real time, so on its own it would lose more data (it is a full dump of memory to disk at a point in time, taken at intervals; its biggest advantage comes from rolling its checkpoint forward frequently)
  • The NameNode therefore combines the two in an FsImage + EditLog scheme (see the sketch after this list):
    • The incremental EditLog is rolled into the FsImage periodically, which keeps the FsImage checkpoint recent and the EditLog small (similar to how Redis combines RDB snapshots with its AOF log)
    • Concretely: suppose the NameNode writes an FsImage only once, at its first boot at 8:00. By 9:00 the EditLog holds the transactions from 8:00 to 9:00; merging them into the FsImage moves its checkpoint to 9:00. Doing this merge on the running NameNode itself would cost it resources, so the work can be handed to another machine, the SecondaryNameNode, discussed later.
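
A minimal sketch of the rolling merge, with the metadata store reduced to a map and EditLog entries to operation pairs (the real FsImage and EditLog are binary formats; this only illustrates the scheme):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CheckpointSketch {
    // FsImage stand-in: a full snapshot of metadata at some point in time.
    static Map<String, String> fsImage = new HashMap<>();
    // EditLog stand-in: the transactions recorded since that point.
    static List<String[]> editLog = new ArrayList<>();

    static void logEdit(String op, String path) {
        editLog.add(new String[] {op, path});  // every metadata change is logged first
    }

    // Rolling checkpoint: replay the EditLog into the FsImage, then truncate it.
    static void checkpoint() {
        for (String[] edit : editLog) {
            if (edit[0].equals("create")) fsImage.put(edit[1], "meta");
            else if (edit[0].equals("delete")) fsImage.remove(edit[1]);
        }
        editLog.clear();  // the merged transactions are no longer needed
    }

    public static void main(String[] args) {
        logEdit("create", "/a");  // activity between 8:00 and 9:00
        logEdit("create", "/b");
        logEdit("delete", "/a");
        checkpoint();             // FsImage checkpoint moves forward to 9:00
        System.out.println(fsImage.keySet());  // prints [/b]
    }
}
```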

Loading EditLog and FsImage when HDFS starts

  • HDFS is formatted when it is first set up; formatting generates an empty FsImage file
  • When the NameNode starts, it reads the EditLog and FsImage from disk
  • It applies all transactions in the EditLog to the in-memory FsImage
  • It saves this new version of the FsImage from memory to the local disk
  • It then deletes the old EditLog, because its transactions have already been applied to the FsImage

Safe mode

  • After startup, the NameNode enters a special state called safe mode; while in safe mode the NameNode does not replicate data blocks
  • The NameNode receives heartbeat signals and block status reports from all DataNodes
  • A data block is considered safely replicated once the NameNode's checks confirm that its number of replicas has reached the configured minimum
  • Once a configurable percentage of data blocks has been confirmed safe by the NameNode, and after an additional 30-second wait, the NameNode exits safe mode (see the sketch after this list)
  • It then determines which data blocks still fall short of the specified number of replicas and copies them to other DataNodes
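
The exit condition can be written as a simple predicate; a sketch assuming a 0.999 threshold (a common default; the exact configuration property names vary across Hadoop versions, so plain constants are used here):

```java
public class SafeModeSketch {
    // Configurable fraction of blocks that must be safely replicated.
    static final double THRESHOLD = 0.999;
    // Extra wait after the threshold is reached, per the text above.
    static final long EXTENSION_MS = 30_000;

    static boolean canLeaveSafeMode(long safeBlocks, long totalBlocks,
                                    long msSinceThresholdReached) {
        double safeRatio = totalBlocks == 0 ? 1.0 : (double) safeBlocks / totalBlocks;
        return safeRatio >= THRESHOLD && msSinceThresholdReached >= EXTENSION_MS;
    }

    public static void main(String[] args) {
        System.out.println(canLeaveSafeMode(9_990, 10_000, 31_000)); // true
        System.out.println(canLeaveSafeMode(9_000, 10_000, 31_000)); // false
    }
}
```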

SecondaryNameNode (SNN)

  • In non-HA mode, the SNN is normally a separate node that periodically merges the NameNode's EditLog into the FsImage, keeping the EditLog small and reducing the NameNode's startup time
  • The merge interval is set by the configuration property fs.checkpoint.period; the default is 3600 seconds (one hour)
  • The property fs.checkpoint.size caps the EditLog size that triggers a merge regardless of the interval; the default maximum is 64 MB (a configuration sketch follows this list)
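
A sketch of setting these two properties programmatically with Hadoop's Configuration API (they are equally at home in the XML configuration files; the values shown are the defaults named above):

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Checkpoint every 3600 seconds (the default interval)...
        conf.setLong("fs.checkpoint.period", 3600);
        // ...or earlier, once the EditLog reaches 64 MB.
        conf.setLong("fs.checkpoint.size", 64L * 1024 * 1024);
        System.out.println(conf.get("fs.checkpoint.period"));
    }
}
```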

Block replica placement policy

  • First replica: placed on the DataNode that is uploading the file; if the upload is submitted from outside the cluster, a node is chosen at random among those whose disk is not too full and whose CPU is not too busy
  • Second replica: placed on a node on a different rack from the first replica
  • Third replica: placed on a node on the same rack as the second replica
  • Further replicas: random nodes (the sketch after this list walks through the first three choices)
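
A hypothetical sketch of that selection order (the Node type, rack strings, and helper logic are stand-ins; the real policy lives inside the NameNode):

```java
import java.util.ArrayList;
import java.util.List;

public class PlacementSketch {
    record Node(String name, String rack) {}

    // Illustrative placement following the rules above.
    static List<Node> place(Node writer, List<Node> cluster, int replicas) {
        List<Node> chosen = new ArrayList<>();
        // 1st replica: the writer's own node (assumed in-cluster and not busy).
        chosen.add(writer);
        // 2nd replica: a node on a different rack than the first.
        cluster.stream().filter(n -> !n.rack().equals(writer.rack()))
               .findFirst().ifPresent(chosen::add);
        // 3rd replica: another node on the same rack as the second.
        if (chosen.size() == 2) {
            Node second = chosen.get(1);
            cluster.stream()
                   .filter(n -> n.rack().equals(second.rack()) && !n.equals(second))
                   .findFirst().ifPresent(chosen::add);
        }
        // Further replicas would be picked at random from the remaining nodes.
        return chosen.subList(0, Math.min(replicas, chosen.size()));
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(new Node("dn1", "r1"), new Node("dn2", "r2"),
                                     new Node("dn3", "r2"), new Node("dn4", "r3"));
        // Prints dn1 (writer), dn2 (other rack), dn3 (same rack as dn2).
        System.out.println(place(new Node("dn1", "r1"), cluster, 3));
    }
}
```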

HDFS writing process

  • The client establishes a connection with the NameNode and asks it to create the file's metadata. The NameNode checks whether the metadata is valid and, if so, triggers the replica placement policy and returns an ordered list of DataNodes to the client
  • The client establishes a TCP connection with the nearest DataNode; the first DataNode establishes a TCP connection with the second, and so on (a pipeline of DataNodes)
  • The client cuts each block into 64 KB packets (filled with 512 B data chunks plus a 4 B checksum each, 516 B per chunk) and sends them to the first DataNode. The first DataNode keeps one copy in memory and writes one to disk, then forwards the packet to the second DataNode while the client is already sending the next packet to the first DataNode, and so on. This streaming is really a variant of parallelism
  • When a block finishes transmitting, the client asks the NameNode for the placement of the next block
  • With this transmission mode, the replication factor is transparent to the client. When a block finishes, each DataNode reports to the NameNode on its own while the client carries on transmitting the next block, so the client transfer and the block reports also proceed in parallel (a client-side sketch follows this list)
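
A sketch of the client side of an upload, using Hadoop's FileSystem API (the NameNode address and path are assumptions; the pipeline and packet mechanics above happen inside the client library):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; adjust to your cluster.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/demo/part-0");
            // create(path, overwrite, bufferSize, replication, blockSize):
            // block size and replication factor are fixed at upload time.
            try (FSDataOutputStream out =
                     fs.create(path, true, 4096, (short) 2, 128L * 1024 * 1024)) {
                out.writeUTF("hello hdfs");
            }
            // Only the replication factor can be changed after upload.
            fs.setReplication(path, (short) 3);
        }
    }
}
```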

HDFS reading process

  • The client establishes a connection with the NameNode and obtains the file's block information
  • It then connects to the DataNode nearest to the client (preferring a replica on the same machine, then the same rack, then another rack); that DataNode reads the block and returns it
  • The client downloads the blocks and verifies their integrity against the checksums

HDFS also lets a client specify a file offset, connect directly to the DataNodes holding the corresponding blocks, and fetch the data from there. This offset-based access is the core capability behind the divide-and-conquer, parallel computing done by the computing layer.
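
A sketch of a read that uses exactly that offset-based access (the NameNode address, path, and block size are assumptions; the file is assumed to be longer than 128 MB):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");  // assumed address
        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/demo/part-0"))) {
            // Seek straight to a block boundary: a task assigned the second
            // 128 MB block starts here and touches only the DataNodes that
            // hold that block, which is what enables parallel computation.
            in.seek(128L * 1024 * 1024);
            byte[] buf = new byte[4096];
            int read = in.read(buf);
            System.out.println("read " + read + " bytes from offset 128 MB");
        }
    }
}
```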