• The source document: hadoop.apache.org/docs/r2.10….

HDFS High Availability Using the Quorum Journal Manager

Purpose

  • This guide provides an overview of the HDFS High Availability (HA) feature and how to configure and manage an HA HDFS cluster, using the Quorum Journal Manager (QJM) feature.

  • This document assumes that the reader has a general understanding of general components and node types in an HDFS cluster. Please refer to the HDFS Architecture guide for details.

Note: Using the Quorum Journal Manager or Conventional Shared Storage

  • This guide discusses how to configure and use HDFS HA using the Quorum Journal Manager (QJM) to share edit logs between the Active and Standby NameNodes. For information on how to configure HDFS HA using NFS for shared storage instead of the QJM, please see this alternative guide. For information on how to configure HDFS HA with Observer NameNode, please see this guide.

Background

  • Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.

  • This impacted the total availability of the HDFS cluster in two major ways:

    • In the case of an unplanned event such as a machine crash, the cluster would be unavailable until an operator restarted the NameNode.

    • Planned maintenance events such as software or hardware upgrades on the NameNode machine would result in windows of cluster downtime.

  • The HDFS High Availability feature addresses the above problems by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.

Architecture

  • In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.

  • In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called “JournalNodes” (JNs). When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. The Standby node is capable of reading the edits from the JNs, and is constantly watching them for changes to the edit log. As the Standby Node sees the edits, it applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.

  • In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.

  • It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called “split-brain scenario,” the JournalNodes will only ever allow a single NameNode to be a writer at a time. During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state, allowing the new Active to safely proceed with failover.

Hardware Resources

  • In order to deploy an HA cluster, you should prepare the following:

    • NameNode machines – the machines on which you run the Active and Standby NameNodes should have equivalent hardware to each other, and equivalent hardware to what would be used in a non-HA cluster.

    • JournalNode machines – the machines on which you run the JournalNodes. The JournalNode daemon is relatively lightweight, so these daemons may reasonably be collocated on machines with other Hadoop daemons, for example NameNodes, the JobTracker, or the YARN ResourceManager. Note: There must be at least 3 JournalNode daemons, since edit log modifications must be written to a majority of JNs. This will allow the system to tolerate the failure of a single machine. You may also run more than 3 JournalNodes, but in order to actually increase the number of failures the system can tolerate, you should run an odd number of JNs, (i.e. 3, 5, 7, etc.). Note that when running with N JournalNodes, the system can tolerate at most (N – 1) / 2 failures and continue to function normally.

  • Note that, in an HA cluster, the Standby NameNode also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to be HA-enabled to reuse the hardware which they had previously dedicated to the Secondary NameNode.

Deployment

Configuration Overview

  • Similar to Federation configuration, HA configuration is backward compatible and allows existing single NameNode configurations to work without change. The new configuration is designed such that all the nodes in the cluster may have the same configuration without the need for deploying different configuration files to different machines based on the type of the node.

  • Like HDFS Federation, HA clusters reuse the nameservice ID to identify a single HDFS instance that may in fact consist of multiple HA NameNodes. In addition, a new abstraction called NameNode ID is added with HA. Each distinct NameNode in the cluster has a different NameNode ID to distinguish it. To support a single configuration file for all of the NameNodes, the relevant configuration parameters are suffixed with the nameservice ID as well as the NameNode ID.

Configuration Details

  • To configure HA NameNodes, you must add several configuration options to your hdfs-site.xml configuration file.

  • The order in which you set these configurations is unimportant, but the values you choose for dfs.nameservices and dfs.ha.namenodes.[nameservice ID] will determine the keys of those that follow. Thus, you should decide on these values before setting the rest of the configuration options.

    • dfs.nameservices – the logical name for this new nameservice

    • Choose a logical name for this nameservice, for example “mycluster”, and use this logical name for the value of this config option. The name you choose is arbitrary. It will be used both for configuration and as the authority component of absolute HDFS paths in the cluster.

    • Note: If you are also using HDFS Federation, this configuration setting should also include the list of other nameservices, HA or otherwise, as a comma-separated list.

    •     <property>
            <name>dfs.nameservices</name>
            <value>mycluster</value>
          </property>
    • dfs.ha.namenodes.[nameservice ID] – unique identifiers for each NameNode in the nameservice

    • Configure with a list of comma-separated NameNode IDs. This will be used by DataNodes to determine all the NameNodes in the cluster. For example, if you used “mycluster” as the nameservice ID previously, and you wanted to use “nn1” and “nn2” as the individual IDs of the NameNodes, you would configure this as such:

    •     <property>
            <name>dfs.ha.namenodes.mycluster</name>
            <value>nn1,nn2</value>
          </property>
    • Note: Currently, only a maximum of two NameNodes may be configured per nameservice.

    • dfs.namenode.rpc-address.[nameservice ID].[name node ID] – the fully-qualified RPC address for each NameNode to listen on

    • For both of the previously-configured NameNode IDs, set the full address and IPC port of the NameNode process. Note that this results in two separate configuration options. For example:

    •     <property>
            <name>dfs.namenode.rpc-address.mycluster.nn1</name>
            <value>machine1.example.com:8020</value>
          </property>
          <property>
            <name>dfs.namenode.rpc-address.mycluster.nn2</name>
            <value>machine2.example.com:8020</value>
          </property>
    • Note: You may similarly configure the “servicerpc-address” setting if you so desire.

    • dfs.namenode.http-address.[nameservice ID].[name node ID] – the fully-qualified HTTP address for each NameNode to listen on

    • Similarly to rpc-address above, set the addresses for both NameNodes’ HTTP servers to listen on. For example:

    •     <property>
            <name>dfs.namenode.http-address.mycluster.nn1</name>
            <value>machine1.example.com:50070</value>
          </property>
          <property>
            <name>dfs.namenode.http-address.mycluster.nn2</name>
            <value>machine2.example.com:50070</value>
          </property>
    • Note: If you have Hadoop’s security features enabled, you should also set the https-address similarly for each NameNode.

    • dfs.namenode.shared.edits.dir – the URI which identifies the group of JNs where the NameNodes will write/read edits

    • This is where one configures the addresses of the JournalNodes which provide the shared edits storage, written to by the Active NameNode and read by the Standby NameNode to stay up-to-date with all the file system changes the Active NameNode makes. Though you must specify several JournalNode addresses, you should only configure one of these URIs. The URI should be of the form: qjournal://host1:port1;host2:port2;host3:port3/journalId. The Journal ID is a unique identifier for this nameservice, which allows a single set of JournalNodes to provide storage for multiple federated namesystems. Though not a requirement, it’s a good idea to reuse the nameservice ID for the journal identifier.

    • For example, if the JournalNodes for this cluster were running on the machines “node1.example.com”, “node2.example.com”, and “node3.example.com” and the nameservice ID were “mycluster”, you would use the following as the value for this setting (the default port for the JournalNode is 8485):

    •     <property>
            <name>dfs.namenode.shared.edits.dir</name>
            <value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
          </property>
    • dfs.client.failover.proxy.provider.[nameservice ID] – the Java class that HDFS clients use to contact the Active NameNode

    • Configure the name of the Java class which will be used by the DFS Client to determine which NameNode is the current Active, and therefore which NameNode is currently serving client requests. The two implementations which currently ship with Hadoop are the ConfiguredFailoverProxyProvider and the RequestHedgingProxyProvider (which, for the first call, concurrently invokes all namenodes to determine the active one, and on subsequent requests, invokes the active namenode until a fail-over happens), so use one of these unless you are using a custom proxy provider. For example:

    •     <property>
            <name>dfs.client.failover.proxy.provider.mycluster</name>
            <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
          </property>
    • dfs.ha.fencing.methods – a list of scripts or Java classes which will be used to fence the Active NameNode during a failover

    • It is desirable for correctness of the system that only one NameNode be in the Active state at any given time. Importantly, when using the Quorum Journal Manager, only one NameNode will ever be allowed to write to the JournalNodes, so there is no potential for corrupting the file system metadata from a split-brain scenario. However, when a failover occurs, it is still possible that the previous Active NameNode could serve read requests to clients, which may be out of date until that NameNode shuts down when trying to write to the JournalNodes. For this reason, it is still desirable to configure some fencing methods even when using the Quorum Journal Manager. However, to improve the availability of the system in the event the fencing mechanisms fail, it is advisable to configure a fencing method which is guaranteed to return success as the last fencing method in the list. Note that if you choose to use no actual fencing methods, you still must configure something for this setting, for example “shell(/bin/true)”.

    • The fencing methods used during a failover are configured as a carriage-return-separated list, which will be attempted in order until one indicates that fencing has succeeded. There are two methods which ship with Hadoop: shell and sshfence. For information on implementing your own custom fencing method, see the org.apache.hadoop.ha.NodeFencer class.

    • sshfence – SSH to the Active NameNode and kill the process

    • The sshfence option SSHes to the target node and uses fuser to kill the process listening on the service’s TCP port. In order for this fencing option to work, it must be able to SSH to the target node without providing a passphrase. Thus, one must also configure the dfs.ha.fencing.ssh.private-key-files option, which is a comma-separated list of SSH private key files. For example:

    •     <property>
            <name>dfs.ha.fencing.methods</name>
            <value>sshfence</value>
          </property>
      
          <property>
            <name>dfs.ha.fencing.ssh.private-key-files</name>
            <value>/home/exampleuser/.ssh/id_rsa</value>
          </property>
    • Optionally, one may configure a non-standard username or port to perform the SSH. One may also configure a timeout, in milliseconds, for the SSH, after which this fencing method will be considered to have failed. It may be configured like so:

    •     <property>
            <name>dfs.ha.fencing.methods</name>
            <value>sshfence([[username][:port]])</value>
          </property>
          <property>
            <name>dfs.ha.fencing.ssh.connect-timeout</name>
            <value>30000</value>
          </property>
    • shell – run an arbitrary shell command to fence the Active NameNode

    • The shell fencing method runs an arbitrary shell command. It may be configured like so:

    •     <property>
            <name>dfs.ha.fencing.methods</name>
            <value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
          </property>
    • The string between ‘(’ and ‘)’ is passed directly to a bash shell and may not include any closing parentheses.

    • The shell command will be run with an environment set up to contain all of the current Hadoop configuration variables, with the ‘_’ character replacing any ‘.’ characters in the configuration keys. The configuration used has already had any NameNode-specific configurations promoted to their generic forms; for example, dfs_namenode_rpc-address will contain the RPC address of the target node, even though the configuration may specify that variable as dfs.namenode.rpc-address.ns1.nn1.

    • Additionally, the following variables referring to the target node to be fenced are also available:

      • $target_host hostname of the node to be fenced
      • $target_port IPC port of the node to be fenced
      • $target_address the above two, combined as host:port
      • $target_nameserviceid the nameservice ID of the NN to be fenced
      • $target_namenodeid the namenode ID of the NN to be fenced
    • These environment variables may also be used as substitutions in the shell command itself. For example:

    •     <property>
            <name>dfs.ha.fencing.methods</name>
            <value>shell(/path/to/my/script.sh --nameservice=$target_nameserviceid $target_host:$target_port)</value>
          </property>
    • If the shell command returns an exit code of 0, the fencing is determined to be successful. If it returns any other exit code, the fencing was not successful and the next fencing method in the list will be attempted.

    • Note: This fencing method does not implement any timeout. If timeouts are necessary, they should be implemented in the shell script itself (e.g. by forking a subshell to kill its parent in some number of seconds); a sketch of this approach follows below.

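    • As an illustration of that approach only, here is a minimal sketch of a custom shell fencing script that enforces its own 30-second timeout by forking a watchdog subshell. The $target_host variable is supplied by the fencing framework as described above; the fuser-based kill on port 8020 mirrors the earlier RPC address examples and what sshfence does, but the exact fencing action is an assumption to adapt to your cluster:

    •     #!/usr/bin/env bash
          # Hypothetical fencing sketch: kill whatever is listening on the
          # NameNode RPC port (8020 in the examples above) on the target host.
          ( sleep 30; kill $$ ) &               # watchdog: abort this script after 30 seconds
          WATCHDOG=$!
          ssh "$target_host" "fuser -k 8020/tcp"
          STATUS=$?
          kill "$WATCHDOG" 2>/dev/null
          exit $STATUS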

    • fs.defaultFS – the default path prefix used by the Hadoop FS client when none is given

    • Optionally, you may now configure the default path for Hadoop clients to use the new HA-enabled logical URI. If you used “mycluster” as the nameservice ID earlier, this will be the value of the authority portion of all of your HDFS paths. This may be configured like so, in your core-site.xml file:

    •     <property>
            <name>fs.defaultFS</name>
            <value>hdfs://mycluster</value>
          </property>
    • dfs.journalnode.edits.dir – the path where the JournalNode daemon will store its local state

    • This is the absolute path on the JournalNode machines where the edits and other local state used by the JNs will be stored. You may only use a single path for this configuration. Redundancy for this data is provided by running multiple separate JournalNodes, or by configuring this directory on a locally-attached RAID array. For example:

    •     <property>
            <name>dfs.journalnode.edits.dir</name>
            <value>/path/to/journal/node/local/data</value>
          </property>

Deployment details

  • After all of the necessary configuration options have been set, you must start the JournalNode daemons on the set of machines where they will run. This can be done by running the command “hadoop-daemon.sh start journalnode” and waiting for the daemon to start on each of the relevant machines.

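  • For example, a sketch of that step on each of the JournalNode machines from the earlier example (node1.example.com, node2.example.com, node3.example.com); the $HADOOP_PREFIX layout matches the other commands in this guide and is an assumption about your installation:

    • [hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh start journalnode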

  • Once the JournalNodes have been started, one must initially synchronize the two HA NameNodes’ on-disk metadata.

    • If you are setting up a fresh HDFS cluster, you should first run the format command (hdfs namenode -format) on one of NameNodes.

    • If you have already formatted the NameNode, or are converting a non-HA-enabled cluster to be HA-enabled, you should now copy over the contents of your NameNode metadata directories to the other, unformatted NameNode by running the command “hdfs namenode -bootstrapStandby” on the unformatted NameNode. Running this command will also ensure that the JournalNodes (as configured by dfs.namenode.shared.edits.dir) contain sufficient edits transactions to be able to start both NameNodes.

    • If you are converting a non-HA NameNode to be HA, you should run the command “hdfs namenode -initializeSharedEdits”, which will initialize the JournalNodes with the edits data from the local NameNode edits directories. The commands for all three cases are summarized in the sketch after this list.

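    • As a rough summary of the three cases above (a sketch; run each command as the HDFS superuser on the machine indicated by the comment):

    •     # Fresh cluster: format one of the NameNodes
          [hdfs]$ hdfs namenode -format

          # Existing formatted NameNode: copy its metadata to the other, unformatted NameNode
          [hdfs]$ hdfs namenode -bootstrapStandby       # run on the unformatted NameNode

          # Converting a non-HA NameNode to HA: seed the JournalNodes with the local edits
          [hdfs]$ hdfs namenode -initializeSharedEdits  # run on the existing NameNode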

  • At this point you may start both of your HA NameNodes as you normally would start a NameNode.

  • You can visit each of the NameNodes’ web pages separately by browsing to their configured HTTP addresses. You should notice that next to the configured address will be the HA state of the NameNode (either “standby” or “active”). Whenever an HA NameNode starts, it is initially in the Standby state.

Administrative commands

  • Now that your HA NameNodes are configured and started, you will have access to some additional commands to administer your HA HDFS cluster. Specifically, you should familiarize yourself with all of the subcommands of the “hdfs haadmin” command. Running this command without any additional arguments will display the following usage information:

  •   Usage: haadmin
          [-transitionToActive <serviceId>]
          [-transitionToStandby <serviceId>]
          [-failover [--forcefence] [--forceactive] <serviceId> <serviceId>]
          [-getServiceState <serviceId>]
          [-getAllServiceState]
          [-checkHealth <serviceId>]
          [-help <command>]
  • This guide describes high-level uses of each of these subcommands. For specific usage information of each subcommand, you should run “hdfs haadmin -help <command>”.

    • transitionToActive and transitionToStandby – transition the state of the given NameNode to Active or Standby

    • These subcommands cause a given NameNode to transition to the Active or Standby state, respectively. These commands do not attempt to perform any fencing, and thus should rarely be used. Instead, one should almost always prefer to use the “hdfs haadmin -failover” subcommand.

    • failover – initiate a failover between two NameNodes

    • This subcommand causes a failover from the first provided NameNode to the second. If the first NameNode is in the Standby state, this command simply transitions the second to the Active state without error. If the first NameNode is in the Active state, an attempt will be made to gracefully transition it to the Standby state. If this fails, the fencing methods (as configured by dfs.ha.fencing.methods) will be attempted in order until one succeeds. Only after this process will the second NameNode be transitioned to the Active state. If no fencing method succeeds, the second NameNode will not be transitioned to the Active state, and an error will be returned.

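    • For example, a sketch of failing over from nn1 to nn2, the NameNode IDs configured earlier:

    • [hdfs]$ hdfs haadmin -failover nn1 nn2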

    • getServiceState – determine whether the given NameNode is Active or Standby

    • Connect to the provided NameNode to determine its current state, printing either “standby” or “active” to STDOUT appropriately. This subcommand might be used by cron jobs or monitoring scripts which need to behave differently based on whether the NameNode is currently Active or Standby.

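    • For example, a monitoring script or cron job might check the state of nn1 (the NameNode ID configured earlier) like this; the output line is illustrative:

    •     [hdfs]$ hdfs haadmin -getServiceState nn1
          active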

    • getAllServiceState – return the state of all the NameNodes

    • Connect to the configured NameNodes to determine the current state, print either “standby” or “active” to STDOUT appropriately.

    • checkHealth – check the health of the given NameNode

    • Connect to the provided NameNode to check its health. The NameNode is capable of performing some diagnostics on itself, including checking if internal services are running as expected. This command will return 0 if the NameNode is healthy, non-zero otherwise. One might use this command for monitoring purposes.

    • Note: This is not yet implemented, and at present will always return success, unless the given NameNode is completely down.

Load Balancer Setup

  • If you are running a set of NameNodes behind a Load Balancer (e.g. Azure or AWS) and would like the Load Balancer to point to the active NN, you can use the /isActive HTTP endpoint as a health probe. http://NN_HOSTNAME/isActive will return a 200 status code response if the NN is in Active HA State, 405 otherwise.
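
  • For example, a sketch of such a probe against the NameNode HTTP port used in the earlier examples (50070; adjust it to your configured dfs.namenode.http-address):

  •     # Prints 200 when this NameNode is Active, 405 otherwise
        $ curl -s -o /dev/null -w '%{http_code}\n' http://machine1.example.com:50070/isActive
        200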

In-progress Edit Log Tailing

  • Under the default settings, the Standby NameNode will only apply edits that are present in an edit log segment which has been finalized. If it is desirable to have a Standby NameNode which has more up-to-date namespace information, it is possible to enable tailing of in-progress edit segments. This setting will attempt to fetch edits from an in-memory cache on the JournalNodes and can reduce the lag time before a transaction is applied on the Standby NameNode to the order of milliseconds. If an edit cannot be served from the cache, the Standby will still be able to retrieve it, but the lag time will be much longer. The relevant configurations are listed below, followed by a sample hdfs-site.xml sketch:

    • dfs.ha.tail-edits.in-progress – Whether or not to enable tailing on in-progress edit logs. This will also enable the in-memory edit cache on the JournalNodes. Disabled by default.

    • dfs.journalnode.edit-cache-size.bytes – The size of the in-memory cache of edits on the JournalNode. Edits take around 200 bytes each in a typical environment, so, for example, the default of 1048576 (1MB) can hold around 5000 transactions. It is recommended to monitor the JournalNode metrics RpcRequestCacheMissAmountNumMisses and RpcRequestCacheMissAmountAvgTxns, which respectively count the number of requests unable to be served by the cache, and the extra number of transactions which would have needed to have been in the cache for the request to succeed. For example, if a request attempted to fetch edits starting at transaction ID 10, but the oldest data in the cache was at transaction ID 20, a value of 10 would be added to the average.
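
    • A sample hdfs-site.xml sketch enabling in-progress tailing; the cache size shown is simply the documented default and is given only for illustration:

    •     <property>
            <name>dfs.ha.tail-edits.in-progress</name>
            <value>true</value>
          </property>
          <property>
            <name>dfs.journalnode.edit-cache-size.bytes</name>
            <value>1048576</value>
          </property>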

  • This feature is primarily useful in conjunction with the Standby/Observer Read feature. Using this feature, read requests can be serviced from non-active NameNodes; thus tailing in-progress edits provides these nodes with the ability to serve requests with data which is much more fresh. See the Apache JIRA ticket HDFS-12943 for more information on this feature.

Automatic Failover

Introduction

  • The above sections describe how to configure manual failover. In that mode, the system will not automatically trigger a failover from the active to the standby NameNode, even if the active node has failed. This section describes how to configure and deploy automatic failover.

Components

  • Automatic failover adds two new components to an HDFS deployment: a ZooKeeper quorum, and the ZKFailoverController process (abbreviated as ZKFC).

  • Apache ZooKeeper is a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures. The implementation of automatic HDFS failover relies on ZooKeeper for the following things:

    • Failure detection – each of the NameNode machines in the cluster maintains a persistent session in ZooKeeper. If the machine crashes, the ZooKeeper session will expire, notifying the other NameNode that a failover should be triggered.

    • Active NameNode election – ZooKeeper provides a simple mechanism to exclusively elect a node as active. If the current active NameNode crashes, another node may take a special exclusive lock in ZooKeeper indicating that it should become the next active.

  • The ZKFailoverController (ZKFC) is a new component which is a ZooKeeper client which also monitors and manages the state of the NameNode. Each of the machines which runs a NameNode also runs a ZKFC, and that ZKFC is responsible for:

    • Health monitoring – the ZKFC pings its local NameNode on a periodic basis with a health-check command. So long as the NameNode responds in a timely fashion with a healthy status, the ZKFC considers the node healthy. If the node has crashed, frozen, or otherwise entered an unhealthy state, the health monitor will mark it as unhealthy.

    • ZooKeeper session management – when the local NameNode is healthy, the ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it also holds a special “lock” znode. This lock uses ZooKeeper’s support for “ephemeral” nodes; if the session expires, the lock node will be automatically deleted.

    • ZooKeeper-based election – if the local NameNode is healthy, and the ZKFC sees that no other node currently holds the lock znode, it will itself try to acquire the lock. If it succeeds, then it has “won the election”, and is responsible for running a failover to make its local NameNode active. The failover process is similar to the manual failover described above: first, the previous active is fenced if necessary, and then the local NameNode transitions to active state.

  • For more details on the design of automatic failover, refer to the design document attached to HDFS-2185 on the Apache HDFS JIRA.

Deploying ZooKeeper

  • In a typical deployment, ZooKeeper daemons are configured to run on three or five nodes. Since ZooKeeper itself has light resource requirements, it is acceptable to collocate the ZooKeeper nodes on the same hardware as the HDFS NameNode and Standby Node. Many operators choose to deploy the third ZooKeeper process on the same node as the YARN ResourceManager. It is advisable to configure the ZooKeeper nodes to store their data on separate disk drives from the HDFS metadata for best performance and isolation.

  • The setup of ZooKeeper is out of scope for this document. We will assume that you have set up a ZooKeeper cluster running on three or more nodes, and have verified its correct operation by connecting using the ZK CLI.

Before you begin

  • Before you begin configuring automatic failover, you should shut down your cluster. It is not currently possible to transition from a manual failover setup to an automatic failover setup while the cluster is running.
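
  • For example (a sketch, assuming the stock scripts and the $HADOOP_PREFIX layout used for the other commands in this guide):

  •    [hdfs]$ $HADOOP_PREFIX/sbin/stop-dfs.sh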

Configuring automatic failover

  • The configuration of automatic failover requires the addition of two new parameters to your configuration. In your hdfs-site.xml file, add:
  •    <property>
         <name>dfs.ha.automatic-failover.enabled</name>
         <value>true</value>
       </property>
  • This specifies that the cluster should be set up for automatic failover. In your core-site.xml file, add:
  •    <property>
         <name>ha.zookeeper.quorum</name>
         <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
       </property>
  • This lists the host-port pairs running the ZooKeeper service.

  • As with the parameters described earlier in the document, these settings may be configured on a per-nameservice basis by suffixing the configuration key with the nameservice ID. For example, in a cluster with federation enabled, you can explicitly enable automatic failover for only one of the nameservices by setting dfs.ha.automatic-failover.enabled.my-nameservice-id.

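  • For example, a sketch of enabling it for just that one nameservice in hdfs-site.xml, using the hypothetical nameservice ID from the sentence above:

  •    <property>
         <name>dfs.ha.automatic-failover.enabled.my-nameservice-id</name>
         <value>true</value>
       </property>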

  • There are also several other configuration parameters which may be set to control the behavior of automatic failover; however, they are not necessary for most installations. Please refer to the configuration key specific documentation for details.

Initializing HA state in ZooKeeper

  • After the configuration keys have been added, the next step is to initialize required state in ZooKeeper. You can do so by running the following command from one of the NameNode hosts.

    • [hdfs]$ $HADOOP_PREFIX/bin/hdfs zkfc -formatZK
  • This will create a znode in ZooKeeper inside of which the automatic failover system stores its data.

Starting the cluster with start-dfs.sh

  • Since automatic failover has been enabled in the configuration, the start-dfs.sh script will now automatically start a ZKFC daemon on any machine that runs a NameNode. When the ZKFCs start, they will automatically select one of the NameNodes to become active.
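
  • For example (a sketch, following the $HADOOP_PREFIX convention used for the other commands in this guide):

  •    [hdfs]$ $HADOOP_PREFIX/sbin/start-dfs.sh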

Starting the cluster manually

  • If you manually manage the services on your cluster, you will need to manually start the zkfc daemon on each of the machines that runs a NameNode. You can start the daemon by running:

    • [hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --script $HADOOP_PREFIX/bin/hdfs start zkfc

Securing access to ZooKeeper

  • If you are running a secure cluster, you will likely want to ensure that the information stored in ZooKeeper is also secured. This prevents malicious clients from modifying the metadata in ZooKeeper or potentially triggering a false failover.

  • In order to secure the information in ZooKeeper, first add the following to your core-site.xml file:

  •    <property>
         <name>ha.zookeeper.auth</name>
         <value>@/path/to/zk-auth.txt</value>
       </property>
       <property>
         <name>ha.zookeeper.acl</name>
         <value>@/path/to/zk-acl.txt</value>
       </property>
  • Please note the ‘@’ character in these values – this specifies that the configurations are not inline, but rather point to a file on disk.

  • The first configured file specifies a list of ZooKeeper authentications, in the same format as used by the ZK CLI. For example, you may specify something like:

    • digest:hdfs-zkfcs:mypassword
  • … where hdfs-zkfcs is a unique username for ZooKeeper, and mypassword is some unique string used as a password.

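  • For example, you might create the auth file like this (a sketch only; the path matches the placeholder used in core-site.xml above, and the file should be readable only by the user that runs the ZKFCs):

    • [hdfs]$ echo "digest:hdfs-zkfcs:mypassword" > /path/to/zk-auth.txt
      [hdfs]$ chmod 400 /path/to/zk-auth.txt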

  • Next, generate a ZooKeeper ACL that corresponds to this authentication, using a command like the following:

    • [hdfs]$ java -cp $ZK_HOME/lib/*:$ZK_HOME/zookeeper-3.4.2.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider hdfs-zkfcs:mypassword
      output: hdfs-zkfcs:mypassword->hdfs-zkfcs:P/OQvnYyU/nF/mGYvB/xurX8dYs=

  • Copy and paste the section of this output after the ‘->’ string into the file zk-acl.txt (the ACL file configured above), prefixed by the string “digest:”. For example:

    • digest:hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=:rwcda
  • In order for these ACLs to take effect, you should then rerun the zkfc -formatZK command as described above.

  • After doing so, you may verify the ACLs from the ZK CLI as follows:

    • [zk: localhost:2181(CONNECTED) 1] getAcl /hadoop-ha
      'digest,'hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=
      : cdrwa

Verifying automatic failover

  • Once automatic failover has been set up, you should test its operation. To do so, first locate the active NameNode. You can tell which node is active by visiting the NameNode web interfaces – each node reports its HA state at the top of the page.

  • Once you have located your active NameNode, you may cause a failure on that node. For example, you can use kill -9 <NN pid> to simulate a JVM crash. Or, you could power cycle the machine or unplug its network interface to simulate a different kind of outage. After triggering the outage you wish to test, the other NameNode should automatically become active within several seconds. The amount of time required to detect a failure and trigger a failover depends on the configuration of ha.zookeeper.session-timeout.ms, but defaults to 5 seconds.
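
  • For example, a test run might look like the following (a sketch only; nn1 and nn2 are placeholder NameNode IDs from your dfs.ha.namenodes.* configuration, and <NN pid> is the process ID of the currently active NameNode):

    • [hdfs]$ hdfs haadmin -getServiceState nn1
      active
      [hdfs]$ kill -9 <NN pid>
      [hdfs]$ hdfs haadmin -getServiceState nn2
      active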

  • If the test does not succeed, you may have a misconfiguration. Check the logs for the zkfc daemons as well as the NameNode daemons in order to further diagnose the issue.

Automatic Failover FAQ

  • Is it important that I start the ZKFC and NameNode daemons in any particular order?

  • No. On any given node you may start the ZKFC before or after its corresponding NameNode.

  • What additional monitoring should I put in place?

  • You should add monitoring on each host that runs a NameNode to ensure that the ZKFC remains running. In some types of ZooKeeper failures, for example, the ZKFC may unexpectedly exit, and should be restarted to ensure that the system is ready for automatic failover.

  • Additionally, you should monitor each of the servers in the ZooKeeper quorum. If ZooKeeper crashes, then automatic failover will not function.

  • What happens if ZooKeeper goes down?

  • If the ZooKeeper cluster crashes, no automatic failovers will be triggered. However, HDFS will continue to run without any impact. When ZooKeeper is restarted, HDFS will reconnect with no issues.

  • Can I designate one of my NameNodes as primary/preferred?

  • No. Currently, this is not supported. Whichever NameNode is started first will become active. You may choose to start the cluster in a specific order such that your preferred node starts first.

  • How can I initiate a manual failover when automatic failover is configured?

  • Even if automatic failover is configured, you may initiate a manual failover using the same hdfs haadmin command. It will perform a coordinated failover.
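
  • For example (nn1 and nn2 are placeholder NameNode IDs; this requests that the active role move from nn1 to nn2):

    • [hdfs]$ hdfs haadmin -failover nn1 nn2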

HDFS Upgrade/Finalization/Rollback with HA Enabled

  • When moving between versions of HDFS, sometimes the newer software can simply be installed and the cluster restarted. Sometimes, however, upgrading the version of HDFS you’re running may require changing on-disk data. In this case, one must use the HDFS Upgrade/Finalize/Rollback facility after installing the new software. This process is made more complex in an HA environment, since the on-disk metadata that the NN relies upon is by definition distributed, both on the two HA NNs in the pair, and on the JournalNodes in the case that QJM is being used for the shared edits storage. This documentation section describes the procedure to use the HDFS Upgrade/Finalize/Rollback facility in an HA setup.

  • To perform an HA upgrade, the operator must do the following (a command-level sketch follows the list of steps):

    • Shut down all of the NNs as normal, and install the newer software.

    • Start up all of the JNs. Note that it is critical that all the JNs be running when performing the upgrade, rollback, or finalization operations. If any of the JNs are down at the time of running any of these operations, the operation will fail.

    • Start one of the NNs with the ‘-upgrade’ flag.

    • On start, this NN will not enter the standby state as usual in an HA setup. Rather, this NN will immediately enter the active state, perform an upgrade of its local storage dirs, and also perform an upgrade of the shared edit log.

    • At this point the other NN in the HA pair will be out of sync with the upgraded NN. In order to bring it back in sync and once again have a highly available setup, you should re-bootstrap this NameNode by running the NN with the ‘-bootstrapStandby’ flag. It is an error to start this second NN with the ‘-upgrade’ flag.
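
  • As a rough command-level sketch of the steps above (illustrative only; run each command on the appropriate host, and note that the NameNode commands shown run in the foreground rather than via the daemon scripts):

    • # on each JournalNode host
      [hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --script $HADOOP_PREFIX/bin/hdfs start journalnode

      # on the NameNode chosen to drive the upgrade
      [hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -upgrade

      # on the other NameNode, once the first NN is active
      [hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -bootstrapStandby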

  • Note that if at any time you want to restart the NameNodes before finalizing or rolling back the upgrade, you should start the NNs as normal, i.e. without any special startup flag.

  • To finalize an HA upgrade, the operator will use the `hdfs dfsadmin -finalizeUpgrade’ command while the NNs are running and one of them is active. The active NN at the time this happens will perform the finalization of the shared log, and the NN whose local storage directories contain the previous FS state will delete its local state.

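  • For example:

    • [hdfs]$ $HADOOP_PREFIX/bin/hdfs dfsadmin -finalizeUpgrade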

  • To perform a rollback of an upgrade, both NNs should first be shut down. The operator should run the roll back command on the NN where they initiated the upgrade procedure, which will perform the rollback on the local dirs there, as well as on the shared log, either NFS or on the JNs. Afterward, this NN should be started and the operator should run `-bootstrapStandby’ on the other NN to bring the two NNs in sync with this rolled-back file system state.

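  • As a rough sketch of the rollback sequence (run only after both NNs have been shut down; the bootstrap step runs on the NN that did not drive the rollback):

    • # on the NN where the upgrade was originally started
      [hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -rollback

      # then start that NN normally, and on the other NN run:
      [hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -bootstrapStandby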