1 Purpose

The default HDFS 3x replication policy incurs 200% overhead in storage space and in other resources such as network bandwidth, so replication is expensive. For warm and cold datasets with relatively low I/O activity, the additional replicas are rarely accessed during normal operation, yet they consume the same amount of resources as the first replica.

A natural improvement, therefore, is to use erasure coding (EC) in place of replication. EC provides the same level of fault tolerance as replication while using much less storage space; in typical EC setups the storage overhead is no more than 50%. The replication factor of an EC file is meaningless: it is always 1 and cannot be changed with the -setrep command.

2 Background

The most prominent use of EC in storage systems is Redundant Array of Inexpensive Disks (RAID). RAID implements EC through striping, which divides logically sequential data (such as a file) into smaller units (such as a bit, byte, or block) and stores consecutive units on different disks. The unit of this striped distribution is called a striping cell (or cell). For each stripe of original data cells, a certain number of parity cells are calculated and stored, a process called encoding. An error on any striping cell can be recovered through a decoding calculation over the remaining data cells and parity cells.

Integrating EC with HDFS improves storage efficiency while providing the same level of fault tolerance as traditional replication-based HDFS deployments. For example, a 3x-replicated file with 6 blocks consumes 6 * 3 = 18 blocks of disk space, but with EC (6 data, 3 parity) it consumes only 9 blocks of disk space.

3 Architecture

Striping has several critical advantages in the context of EC. First, it enables online EC (writing data immediately in EC format), avoiding a conversion phase and immediately saving storage space. Online EC also improves I/O performance by leveraging multiple disk spindles in parallel; this is especially desirable in clusters with high-end networking. Second, it naturally distributes a small file to multiple DataNodes without needing to bundle multiple files into a single coding group. This greatly simplifies file operations such as deletion, quota reporting, and migration between federated namespaces.

In typical HDFS clusters, small files can account for more than three-quarters of total storage consumption. To better support small files, the first phase of this work adds EC with striping to HDFS. In the future, HDFS will also support a contiguous EC layout. For more information, see the design document and the discussion on HDFS-7285 (issues.apache.org/jira/browse…).

  • NameNode extensions

Striped HDFS files logically consist of block groups, each of which contains a certain number of internal blocks. To reduce the NameNode memory consumption from these additional blocks, a new hierarchical block naming protocol was introduced: the ID of a block group can be inferred from the ID of any of its internal blocks, which allows management at the level of the block group rather than of the individual block.
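As a rough illustration of that inference (a sketch only: the exact bit layout is an internal detail of the NameNode's block ID manager, and the 4-bit index below is an assumption, not the documented protocol):

    # Assumption: the low 4 bits of an internal block ID carry its index within
    # the block group, so masking them off recovers the block group ID.
    internal_id=$(( (1 << 40) + 11 ))      # hypothetical internal block ID
    group_id=$(( internal_id & ~0xF ))
    index=$(( internal_id & 0xF ))
    echo "block group ID: $group_id, index within group: $index"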

  • Client extensions

The client read and write paths were enhanced to work on multiple internal blocks of a block group in parallel. On the output/write path, DFSStripedOutputStream manages a set of data streamers, one for each DataNode storing an internal block of the current block group. The streamers mostly work asynchronously, while a coordinator takes charge of operations on the entire block group, including ending the current block group, allocating a new block group, and so on. On the input/read path, DFSStripedInputStream translates a requested logical byte range into ranges of internal blocks stored on the DataNodes, then issues the read requests in parallel. Upon failure, it issues additional read requests for decoding.
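As a sketch of how this looks to users (paths and policy name are illustrative, and the policy must already be enabled), ordinary client commands go through the striped streams transparently once a directory has an EC policy:

    hdfs dfs -mkdir -p /ec/data
    hdfs ec -setPolicy -path /ec/data -policy RS-6-3-1024k
    hdfs dfs -put localfile.bin /ec/data/            # written via DFSStripedOutputStream
    hdfs dfs -get /ec/data/localfile.bin /tmp/       # read via DFSStripedInputStream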

  • DataNode extensions

The DataNode runs an additional ErasureCodingWorker (ECWorker) task that reconstructs failed erasure-coded blocks in the background. The NameNode detects failed EC blocks and chooses a DataNode to do the recovery work; the recovery task is passed to that DataNode as a heartbeat response. This process is similar to how replicated blocks are re-replicated after failure. Reconstruction performs three key tasks:

(1) Read data from the source node

Input data is read from source nodes in parallel, using a dedicated thread pool. Based on the EC policy, the worker schedules read requests against all source targets and reads only the minimum number of input blocks needed for reconstruction.

(2) Decode data and generate output data

New data and parity blocks are decoded from the input data; all missing data and parity blocks are decoded together.

(3) Transfer the generated data block to the target node

After decoding is complete, the recovered blocks are transferred to the target DataNodes.

  • Erasure coding policies

To accommodate heterogeneous workloads, files and directories in an HDFS cluster are allowed to have different replication and erasure coding policies. The erasure coding policy encapsulates how a file is encoded and decoded. Each policy is defined by the following pieces of information:

The EC schema: this includes the number of data blocks and parity blocks in an EC group (e.g. 6+3), as well as the codec algorithm (e.g. Reed-Solomon, XOR).

The size of a striping cell: this determines the granularity of striped reads and writes, including buffer sizes and encoding work.

Policies are named codec-numDataBlocks-numParityBlocks-cellSize. Currently, six built-in policies are supported: RS-3-2-1024k, RS-6-3-1024k, RS-10-4-1024k, RS-LEGACY-6-3-1024k, XOR-2-1-1024k, and REPLICATION.

REPLICATION is a special policy. It can only be set on a directory, to force the directory to adopt the 3x replication scheme instead of inheriting its ancestor's erasure coding policy. This makes it possible to interleave 3x-replicated directories with erasure-coded directories.

The REPLICATION policy is always enabled. All other built-in policies are disabled by default.
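For example, the registered policies and their enabled/disabled state can be inspected with the admin command described later:

    hdfs ec -listPolicies     # REPLICATION is reported as enabled; other built-ins as disabled until enabled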

Similar to HDFS storage policies, erasure coding policies are set on directories. When a file is created, it inherits the EC policy of its nearest ancestor directory.

Directory-level EC policies affect only new files created within the directory. Once a file has been created, its erasure coding policy can be queried but not changed. If an erasure-coded file is renamed into a directory with a different EC policy, the file keeps its existing policy. Converting a file to a different EC policy requires rewriting its data; do this by copying the file (for example, via distcp) rather than renaming it, as sketched below.
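A sketch of the conversion path, with illustrative paths:

    # Renaming keeps the old policy; copying rewrites the data under the
    # destination directory's EC policy.
    hdfs ec -getPolicy -path /archive                       # policy of the destination directory
    hadoop distcp /user/alice/logs-2018 /archive/logs-2018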

Users can define their own EC policies through an XML file, which must contain the following three parts:

1. layoutversion: the version of the EC policy XML file format;

2. schemas: all user-defined EC schemas;

3. policies: all user-defined EC policies; each policy consists of a schema ID and the size of the striping cell (cellsize).

A sample EC policy XML file named user_ec_policies.xml.template is available in the Hadoop conf directory for reference.
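A hedged sketch of registering user-defined policies (start from the bundled template rather than writing the XML from scratch; the paths below are illustrative):

    cp $HADOOP_CONF_DIR/user_ec_policies.xml.template /tmp/my_ec_policies.xml
    # edit /tmp/my_ec_policies.xml: layout version, schemas, and policies
    hdfs ec -addPolicies -policyFile /tmp/my_ec_policies.xml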

  • Intel ISA-L

Intel ISA-L stands for Intel Intelligent Storage Acceleration Library. ISA-L is an open-source collection of optimized low-level functions designed for storage applications. It includes fast block Reed-Solomon type erasure codes optimized for the Intel AVX and AVX2 instruction sets. HDFS erasure coding can leverage ISA-L to accelerate encoding and decoding calculations. ISA-L supports most major operating systems, including Linux and Windows. ISA-L is not enabled by default; see the instructions below for how to enable it.

4 Deployment

4.1 Cluster and hardware configuration

Erasure coding places additional demands on the cluster in terms of CPU and network.

Encoding and decoding consume additional CPU resources on both HDFS clients and DataNodes.

Erasure-coded files require at least as many DataNodes in the cluster as the configured EC stripe width. For the EC policy RS(6,3), this means a minimum of 9 DataNodes.

Erasure-coded files are also spread across racks for rack-level fault tolerance. This means that when reading and writing striped files, most operations are off-rack; network bisection bandwidth is therefore very important.

For rack fault tolerance, it is also important to have at least as many racks as the configured EC stripe width. For the EC policy RS(6,3), this means a minimum of 9 racks, and ideally 10 or 11 to handle both planned and unplanned outages. For clusters with fewer racks than the stripe width, HDFS cannot maintain rack-level fault tolerance, but will still attempt to spread a striped file across multiple nodes to preserve node-level fault tolerance.

4.2 Configuration items

By default, all built-in erasure coding policies are disabled, except the one defined in dfs.namenode.ec.system.default.policy, which is enabled by default. The cluster administrator can enable a set of policies through the hdfs ec [-enablePolicy -policy <policyName>] command, based on the size of the cluster and the desired fault-tolerance properties. For example, for a cluster with 9 racks, a policy like RS-10-4-1024k will not preserve rack-level fault tolerance, while RS-6-3-1024k or RS-3-2-1024k might be more appropriate. If the administrator only cares about node-level fault tolerance, RS-10-4-1024k is still appropriate as long as there are at least 14 DataNodes in the cluster.
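For example, to enable a policy suited to a nine-rack cluster:

    hdfs ec -enablePolicy -policy RS-6-3-1024k
    hdfs ec -listPolicies      # confirm the policy is now reported as enabled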

The system default EC policy is set via the dfs.namenode.ec.system.default.policy configuration key. With this configuration, the default EC policy is used when no policy name is passed to the -setPolicy command. By default, dfs.namenode.ec.system.default.policy is RS-6-3-1024k.
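The effective default can be checked from a client, for example:

    hdfs getconf -confKey dfs.namenode.ec.system.default.policy    # prints RS-6-3-1024k unless overridden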

The codec implementations for Reed-Solomon and XOR can be configured with the following client and DataNode configuration keys: io.erasurecode.codec.rs.rawcoders for the default RS codec, io.erasurecode.codec.rs-legacy.rawcoders for the legacy RS codec, and io.erasurecode.codec.xor.rawcoders for the XOR codec. Users can also configure a self-defined codec with a key of the form io.erasurecode.codec.self-defined-codec.rawcoders. The value of each key is a list of coder names with a fall-back mechanism: the coder factories are loaded in the order specified by the configuration value until one is loaded successfully. The default RS and XOR codec configurations prefer the native implementation over the pure Java one. There is no native RS-LEGACY implementation, so its default is the pure Java implementation only. All of these codecs have pure Java implementations; for the default RS codec there is also a native implementation that leverages the Intel ISA-L library to improve performance, and a native ISA-L implementation is likewise supported for the XOR codec. See the section "Enabling Intel ISA-L" below for more details.

Background recovery of erasure-coded blocks on the DataNode can also be tuned with the following configuration parameters:

(1) dfs.datanode.ec.reconstruction.stripedread.timeout.millis: timeout for striped reads; the default is 5000 ms.

(2) dfs.datanode.ec.reconstruction.stripedread.buffer.size: buffer size for the reader service; the default is 64 KB.

(3) dfs.datanode.ec.reconstruction.threads: number of threads the DataNode uses for background reconstruction; the default is 8.

(4) dfs.datanode.ec.reconstruction.xmits.weight: relative weight of xmits used by an EC background recovery task compared with replicated block recovery; the default is 0.5. Setting it to 0 disables weight calculation for EC recovery tasks, i.e., an EC task always counts as 1 xmit. The xmits of an erasure coding recovery task are calculated as the maximum of the number of read streams and the number of write streams. For example, if an EC recovery task needs to read from 6 nodes and write to 2 nodes, its xmits are max(6, 2) * 0.5 = 3 (see the sketch after this list). A recovery task for a replicated file always counts as 1 xmit. The NameNode uses dfs.namenode.replication.max-streams minus the total xmitsInProgress on the DataNode (which combines the xmits from replicated files and EC files) to schedule recovery tasks to that DataNode.
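A small sketch of the xmit arithmetic from item (4), with illustrative numbers:

    reads=6; writes=2; weight=0.5    # weight = dfs.datanode.ec.reconstruction.xmits.weight
    awk -v r="$reads" -v w="$writes" -v wt="$weight" \
        'BEGIN { m = (r > w ? r : w); print "xmits =", m * wt }'   # prints: xmits = 3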

4.3 Enabling Intel ISA-L

The HDFS native implementation of the default RS codec leverages the Intel ISA-L library to improve encoding and decoding calculations. To enable and use Intel ISA-L, perform the following three steps:

(1) Build the ISA-L library. See the official site at github.com/01org/isa-l… for details.

(2) Build Hadoop with ISA-L support. See the "Intel ISA-L build options" section of the build instructions in the source code (BUILDING.txt).

(3) Use -Dbundle.isal to copy the contents of the isal.lib directory into the final tar file, deploy Hadoop with the tar file, and make sure ISA-L is available on HDFS clients and DataNodes.

To verify that ISA-L is correctly detected by Hadoop, run the hadoop checknative command.
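For example:

    hadoop checknative    # the output includes an ISA-L line indicating whether the library was loaded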

4.4 Administrative commands

HDFS provides an ec subcommand to perform administrative operations related to erasure coding.

hdfs ec [generic options]
     [-setPolicy -path <path> [-policy <policyName>] [-replicate]]
     [-getPolicy -path <path>]
     [-unsetPolicy -path <path>]
     [-listPolicies]
     [-addPolicies -policyFile <file>]
     [-listCodecs]
     [-enablePolicy -policy <policyName>]
     [-disablePolicy -policy <policyName>]
     [-help [cmd ...]]

Below are details about each command.

  • [-setPolicy -path <path> [-policy <policyName>] [-replicate]]

Sets an erasure coding policy on a directory at the specified path.

path: a directory in HDFS. This is a mandatory parameter. Setting a policy only affects newly created files, not existing files.

policyName: the erasure coding policy to be used for files under this directory. This parameter may be omitted if the dfs.namenode.ec.system.default.policy configuration is set, in which case the EC policy of the path is set to the default value from the configuration.

-replicate: applies the special REPLICATION policy to the directory, forcing a 3x replication scheme.

-replicate and -policy are optional parameters. You cannot specify them at the same time.
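Usage sketches with illustrative paths:

    hdfs ec -setPolicy -path /warehouse/cold -policy RS-6-3-1024k
    hdfs ec -setPolicy -path /warehouse/hot -replicate   # force 3x replication on this directory
    hdfs ec -setPolicy -path /warehouse/misc             # fall back to the system default EC policy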

  • [-getPolicy -path <path>]

Gets details of the erasure coding policy of a file or directory at the specified path.

  • [-unsetPolicy -path <path>]

Unsets an erasure coding policy previously set by a call to setPolicy on a directory. If the directory inherits its erasure coding policy from an ancestor directory, unsetPolicy is a no-op. Unsetting the policy on a directory that does not have an explicit policy set does not return an error.

  • [-listPolicies]

Lists all erasure coding policies registered in HDFS (enabled, disabled, and removed). Only enabled policies are suitable for use with the setPolicy command.

  • [-addPolicies -policyFile <file>]

Adds a list of erasure coding policies. See etc/hadoop/user_ec_policies.xml.template for a sample policy file. The maximum cell size is defined by the property dfs.namenode.ec.policies.max.cellsize, with a default of 4 MB. Currently, HDFS allows users to add 64 policies in total, with added policy IDs in the range 64 to 127. Adding a policy fails if 64 policies have already been added.

  • [-listCodecs]

Gets the list of supported erasure coding codecs and coders in the system. A coder is an implementation of a codec; a codec can have different implementations and thus different coders. The coders of a codec are listed in fall-back order.

  • [-removePolicy -policy <policyName>]

Removes an erasure coding policy.

  • [-enablePolicy -policy <policyName>]

Enables an erasure coding policy.

  • [-disablePolicy -policy <policyName>]

Disables an erasure coding policy.

5 Limitations

Due to substantial technical challenges, certain HDFS operations, namely hflush, hsync, concat, setReplication, truncate, and append, are not supported on erasure-coded files.

  • append() and truncate() on an erasure-coded file throw IOException.
  • concat() throws IOException if files with different erasure coding policies, or replicated files, are mixed together.
  • setReplication() has no effect on erasure-coded files.
  • hflush() and hsync() on DFSStripedOutputStream are no-ops, so calling hflush() or hsync() on an erasure-coded file does not guarantee that data is persisted.

A client can use the StreamCapabilities API to query whether an OutputStream supports hflush() and hsync(). If the client requires data persistence via hflush() and hsync(), the current remedy is to create such files as regular 3x-replicated files in a non-erasure-coded directory, or to use the FSDataOutputStreamBuilder#replicate() API to create 3x-replicated files in an erasure-coded directory.

See the original documentation: hadoop.apache.org/docs/r3.1.0…