Written in front: This article introduces HDFS's storage model, architecture design, and read/write processes in detail. As the foundation of divide-and-conquer, parallel computing in Hadoop's computing layer, HDFS lays the groundwork for the later introduction of MapReduce.

Storage model

  • A file is linearly split into blocks of bytes; each block has an offset and an id (see the sketch after this list)
  • Blocks of different files can have different sizes
  • Within a single file, every block is the same size except possibly the last one
  • The block size should be tuned to the I/O characteristics of the underlying hardware
  • Blocks are distributed across the nodes of the cluster, each with its own location information
  • Each block has replicas (replication); replicas are peers with no master-slave relationship, and replicas of the same block never sit on the same node
  • Replicas are the key to both reliability and performance
  • The block size and replication factor can be specified when a file is uploaded; after upload, only the replication factor can be changed
  • Only write-once, read-many is supported: in-place modification is not allowed (modifying one block would shift the offsets of every subsequent block, forcing them to be adjusted and their replicas redistributed, which would severely degrade performance)
  • Appending data to a file is supported
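
A minimal sketch of the block split, assuming a simple illustrative Block record (the names are stand-ins, not HDFS internals): given a file length and block size, every block gets an id and a byte offset, and only the last block may be shorter.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSplitSketch {
    // Illustrative block descriptor: HDFS tracks an id and an offset per block.
    record Block(long id, long offset, long length) {}

    static List<Block> split(long fileLength, long blockSize) {
        List<Block> blocks = new ArrayList<>();
        long id = 0;
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            // Every block is blockSize long except possibly the last one.
            long length = Math.min(blockSize, fileLength - offset);
            blocks.add(new Block(id++, offset, length));
        }
        return blocks;
    }

    public static void main(String[] args) {
        // A 300 MB file with a 128 MB block size yields blocks of
        // 128 MB, 128 MB, and 44 MB.
        long mb = 1024L * 1024L;
        split(300 * mb, 128 * mb).forEach(System.out::println);
    }
}
```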

Architecture design

  • HDFS is a master/slave (Master/Slaves) architecture
  • It consists of one NameNode (master) process and a number of DataNode (slave) processes
  • It is file-oriented: a file comprises file data (data) and file metadata (metadata)
  • The NameNode stores and manages the file metadata and maintains a hierarchical file directory tree (a virtual directory structure)
  • DataNodes store the file data (blocks) and serve block reads and writes
  • Each DataNode maintains a heartbeat with the NameNode and reports the blocks it holds
  • The client exchanges file metadata with the NameNode and file block data with the DataNodes

In the figure above, the metadata maintained by the NameNode records the file name, replication factor, and block ids of two files. File part-0 has a replication factor of 2 and blocks 1 and 3; file part-1 has a replication factor of 3 and blocks 2, 4, and 5. Among the DataNodes, blocks 1 and 3 each appear on two DataNodes, while blocks 2, 4, and 5 each appear on three DataNodes.
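
That metadata can be pictured as a simple table; a hypothetical sketch (the paths and the FileMeta shape are stand-ins, not the NameNode's real data structures):

```java
import java.util.List;
import java.util.Map;

public class MetadataSketch {
    // Hypothetical shape of one metadata entry: replication factor
    // plus the ordered block ids that make up the file.
    record FileMeta(int replication, List<Integer> blockIds) {}

    public static void main(String[] args) {
        // The two files from the figure above.
        Map<String, FileMeta> metadata = Map.of(
                "/user/demo/part-0", new FileMeta(2, List.of(1, 3)),
                "/user/demo/part-1", new FileMeta(3, List.of(2, 4, 5)));
        // Block-to-DataNode locations are reported by DataNodes at runtime;
        // the NameNode does not persist them.
        metadata.forEach((path, meta) -> System.out.println(path + " -> " + meta));
    }
}
```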

Role functions

NameNode

  • Stores the file metadata, directory structure, and file-to-block mapping entirely in memory (keeping everything in memory lets it serve requests quickly)
  • Needs a persistence scheme to keep the metadata reliable (memory is lost on power failure and limited in size)
  • Provides the replica placement policy

DataNode

  • Stores blocks on the local disk (HDFS does not store the data itself; it only manages the mapping, while block data lives on the DataNodes' local file systems)
  • Also stores a checksum for each block to guarantee the block's integrity (see the sketch after this list)
  • Maintains a heartbeat with the NameNode and reports its list of blocks and their state
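
A sketch of the per-chunk checksum idea, assuming CRC32 over 512-byte chunks (HDFS stores 4 checksum bytes per 512-byte data chunk; the helper below is illustrative, not the DataNode's actual code):

```java
import java.util.zip.CRC32;

public class ChecksumSketch {
    // Compute a CRC32 checksum over one 512-byte chunk, as stored
    // alongside the data (4 bytes per chunk in HDFS).
    static long chunkChecksum(byte[] chunk) {
        CRC32 crc = new CRC32();
        crc.update(chunk);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] chunk = new byte[512];          // one data chunk
        long stored = chunkChecksum(chunk);    // computed at write time
        // On read, the checksum is recomputed and compared to detect corruption:
        boolean intact = stored == chunkChecksum(chunk);
        System.out.println("chunk intact: " + intact);
    }
}
```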

Metadata persistence

  • Every operation that changes the file system metadata is recorded by the NameNode in a transaction log called the EditLog
  • The FsImage stores a full snapshot of the metadata state held in memory
  • Both the EditLog and the FsImage are saved on the local disk
  • The EditLog is complete and loses almost no data, but replaying it is slow and its volume keeps growing (it records every add/delete/update in real time; its biggest advantage is when it stays small with few records)
  • The FsImage recovers quickly and its size is close to that of the in-memory data, but it cannot be written in real time, so on its own it would lose more data (it is a full dump of memory to disk at a point in time, taken at intervals; its biggest advantage comes from rolling its checkpoint forward frequently)
  • The NameNode therefore combines the two in an FsImage + EditLog scheme (see the sketch after this list):
    • The incremental EditLog is rolled into the FsImage periodically, which keeps the FsImage checkpoint recent and the EditLog small (similar to how Redis combines RDB snapshots with its AOF log)
    • Concretely: suppose the NameNode writes an FsImage only once, at its first boot at 8:00. By 9:00 the EditLog holds the transactions from 8:00 to 9:00; merging them into the FsImage moves its checkpoint to 9:00. Doing this merge on the running NameNode itself would cost it resources, so the work can be handed to another machine, the SecondaryNameNode, discussed later.
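
A minimal sketch of the rolling merge, with the metadata store reduced to a map and EditLog entries to operation pairs (the real FsImage and EditLog are binary formats; this only illustrates the scheme):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CheckpointSketch {
    // FsImage stand-in: a full snapshot of metadata at some point in time.
    static Map<String, String> fsImage = new HashMap<>();
    // EditLog stand-in: the transactions recorded since that point.
    static List<String[]> editLog = new ArrayList<>();

    static void logEdit(String op, String path) {
        editLog.add(new String[] {op, path});  // every metadata change is logged first
    }

    // Rolling checkpoint: replay the EditLog into the FsImage, then truncate it.
    static void checkpoint() {
        for (String[] edit : editLog) {
            if (edit[0].equals("create")) fsImage.put(edit[1], "meta");
            else if (edit[0].equals("delete")) fsImage.remove(edit[1]);
        }
        editLog.clear();  // the merged transactions are no longer needed
    }

    public static void main(String[] args) {
        logEdit("create", "/a");  // activity between 8:00 and 9:00
        logEdit("create", "/b");
        logEdit("delete", "/a");
        checkpoint();             // FsImage checkpoint moves forward to 9:00
        System.out.println(fsImage.keySet());  // prints [/b]
    }
}
```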

Loading EditLog and FsImage when HDFS starts

  • HDFS is formatted when it is first set up; formatting generates an empty FsImage file
  • When the NameNode starts, it reads the EditLog and FsImage from disk
  • It applies all transactions in the EditLog to the in-memory FsImage
  • It saves this new version of the FsImage from memory to the local disk
  • It then deletes the old EditLog, because its transactions have already been applied to the FsImage

Safe mode

  • After startup, the NameNode enters a special state called safe mode; while in safe mode the NameNode does not replicate data blocks
  • The NameNode receives heartbeat signals and block status reports from all DataNodes
  • A data block is considered safely replicated once the NameNode's checks confirm that its number of replicas has reached the configured minimum
  • Once a configurable percentage of data blocks has been confirmed safe by the NameNode, and after an additional 30-second wait, the NameNode exits safe mode (see the sketch after this list)
  • It then determines which data blocks still fall short of the specified number of replicas and copies them to other DataNodes
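
The exit condition can be written as a simple predicate; a sketch assuming a 0.999 threshold (a common default; the exact configuration property names vary across Hadoop versions, so plain constants are used here):

```java
public class SafeModeSketch {
    // Configurable fraction of blocks that must be safely replicated.
    static final double THRESHOLD = 0.999;
    // Extra wait after the threshold is reached, per the text above.
    static final long EXTENSION_MS = 30_000;

    static boolean canLeaveSafeMode(long safeBlocks, long totalBlocks,
                                    long msSinceThresholdReached) {
        double safeRatio = totalBlocks == 0 ? 1.0 : (double) safeBlocks / totalBlocks;
        return safeRatio >= THRESHOLD && msSinceThresholdReached >= EXTENSION_MS;
    }

    public static void main(String[] args) {
        System.out.println(canLeaveSafeMode(9_990, 10_000, 31_000)); // true
        System.out.println(canLeaveSafeMode(9_000, 10_000, 31_000)); // false
    }
}
```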

SecondaryNameNode (SNN)

  • In non-HA mode, the SNN is normally a separate node that periodically merges the NameNode's EditLog into the FsImage, keeping the EditLog small and reducing the NameNode's startup time
  • The merge interval is set by the configuration property fs.checkpoint.period; the default is 3600 seconds (one hour)
  • The property fs.checkpoint.size caps the EditLog size that triggers a merge regardless of the interval; the default maximum is 64 MB (a configuration sketch follows this list)
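
A sketch of setting these two properties programmatically with Hadoop's Configuration API (they are equally at home in the XML configuration files; the values shown are the defaults named above):

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Checkpoint every 3600 seconds (the default interval)...
        conf.setLong("fs.checkpoint.period", 3600);
        // ...or earlier, once the EditLog reaches 64 MB.
        conf.setLong("fs.checkpoint.size", 64L * 1024 * 1024);
        System.out.println(conf.get("fs.checkpoint.period"));
    }
}
```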

Block replica placement policy

  • First replica: placed on the DataNode that is uploading the file; if the upload is submitted from outside the cluster, a node is chosen at random among those whose disk is not too full and whose CPU is not too busy
  • Second replica: placed on a node on a different rack from the first replica
  • Third replica: placed on a node on the same rack as the second replica
  • Further replicas: random nodes (the sketch after this list walks through the first three choices)
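
A hypothetical sketch of that selection order (the Node type, rack strings, and helper logic are stand-ins; the real policy lives inside the NameNode):

```java
import java.util.ArrayList;
import java.util.List;

public class PlacementSketch {
    record Node(String name, String rack) {}

    // Illustrative placement following the rules above.
    static List<Node> place(Node writer, List<Node> cluster, int replicas) {
        List<Node> chosen = new ArrayList<>();
        // 1st replica: the writer's own node (assumed in-cluster and not busy).
        chosen.add(writer);
        // 2nd replica: a node on a different rack than the first.
        cluster.stream().filter(n -> !n.rack().equals(writer.rack()))
               .findFirst().ifPresent(chosen::add);
        // 3rd replica: another node on the same rack as the second.
        if (chosen.size() == 2) {
            Node second = chosen.get(1);
            cluster.stream()
                   .filter(n -> n.rack().equals(second.rack()) && !n.equals(second))
                   .findFirst().ifPresent(chosen::add);
        }
        // Further replicas would be picked at random from the remaining nodes.
        return chosen.subList(0, Math.min(replicas, chosen.size()));
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(new Node("dn1", "r1"), new Node("dn2", "r2"),
                                     new Node("dn3", "r2"), new Node("dn4", "r3"));
        // Prints dn1 (writer), dn2 (other rack), dn3 (same rack as dn2).
        System.out.println(place(new Node("dn1", "r1"), cluster, 3));
    }
}
```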

HDFS writing process

  • The client establishes a connection with the NameNode and asks it to create the file's metadata. The NameNode checks whether the metadata is valid and, if so, triggers the replica placement policy and returns an ordered list of DataNodes to the client
  • The client establishes a TCP connection with the nearest DataNode; the first DataNode establishes a TCP connection with the second, and so on (a pipeline of DataNodes)
  • The client cuts each block into 64 KB packets (filled with 512 B data chunks plus a 4 B checksum each, 516 B per chunk) and sends them to the first DataNode. The first DataNode keeps one copy in memory and writes one to disk, then forwards the packet to the second DataNode while the client is already sending the next packet to the first DataNode, and so on. This streaming is really a variant of parallelism
  • When a block finishes transmitting, the client asks the NameNode for the placement of the next block
  • With this transmission mode, the replication factor is transparent to the client. When a block finishes, each DataNode reports to the NameNode on its own while the client carries on transmitting the next block, so the client transfer and the block reports also proceed in parallel (a client-side sketch follows this list)
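
A sketch of the client side of an upload, using Hadoop's FileSystem API (the NameNode address and path are assumptions; the pipeline and packet mechanics above happen inside the client library):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; adjust to your cluster.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/demo/part-0");
            // create(path, overwrite, bufferSize, replication, blockSize):
            // block size and replication factor are fixed at upload time.
            try (FSDataOutputStream out =
                     fs.create(path, true, 4096, (short) 2, 128L * 1024 * 1024)) {
                out.writeUTF("hello hdfs");
            }
            // Only the replication factor can be changed after upload.
            fs.setReplication(path, (short) 3);
        }
    }
}
```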

HDFS reading process

  • The client establishes a connection with the NameNode and obtains the file's block information
  • It then connects to the DataNode nearest to the client (preferring a replica on the same machine, then the same rack, then another rack); that DataNode reads the block and returns it
  • The client downloads the blocks and verifies their integrity against the checksums

HDFS also lets a client specify a file offset, connect directly to the DataNodes holding the corresponding blocks, and fetch the data from there. This offset-based access is the core capability behind the divide-and-conquer, parallel computing done by the computing layer.
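
A sketch of a read that uses exactly that offset-based access (the NameNode address, path, and block size are assumptions; the file is assumed to be longer than 128 MB):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");  // assumed address
        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/demo/part-0"))) {
            // Seek straight to a block boundary: a task assigned the second
            // 128 MB block starts here and touches only the DataNodes that
            // hold that block, which is what enables parallel computation.
            in.seek(128L * 1024 * 1024);
            byte[] buf = new byte[4096];
            int read = in.read(buf);
            System.out.println("read " + read + " bytes from offset 128 MB");
        }
    }
}
```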