This article describes the architecture and features of HDFS router-based Federation.


Abstract

In order to solve the problem of scaling HDFS horizontally, the Hadoop community implemented the ViewFs-based Federation architecture in earlier versions, and in recent versions it has implemented the Router-based Federation architecture. Many features have been built on top of this architecture to enhance cluster management capabilities. The Router moves the mount table out of the Client and thus solves the problem of inconsistent mount tables. In this article, we introduce the architecture and features of HDFS Router-based Federation.

Background

In the single-cluster HDFS architecture, the block manager and the namespace consume more and more NameNode resources as the cluster grows, until the NameNode can no longer provide reliable service. Hence the Federation architecture.

The Federation architecture refers to a Federation cluster consisting of multiple subclusters, which usually share DataNodes. The mapping between the Federation namespace and the subcluster namespaces is maintained by the mount table, which is stored in the client’s local configuration file and parsed by the client in order to access the correct subcluster. In the community implementation, a new protocol, viewfs://, is used to access the Federation namespace.

Xiaomi has made many internal optimizations to the Federation implementation. To improve ease of use and to let users cooperate with cluster migration, we hide the details from users as much as possible, so that they can access both Federation clusters and ordinary clusters without modifying code or even configuration.

  • Added a layer of encapsulation on top of the community ViewFs, so that the original hdfs:// protocol is still used

  • The mount table is stored in a Zookeeper cluster, and the client periodically checks whether the mount table has been updated

  • Enabled rename between subclusters within a Federation cluster

However, this Federation architecture also brings some problems:

  • The mount table logic is implemented in the client, so modifying it requires considering compatibility between old and new clients and distributing new client versions

  • When rebalancing data between subcluster namespaces, it is difficult to ensure that all clients switch to the new mount table at the same time

Therefore, the community proposed a new Federation architecture: Router-based Federation (RBF).

Router

In order to shield users from the details of the Federation implementation and to take the mount table configuration and logic out of the client side, it was natural to introduce a proxy service: clients send requests directly to the proxy, which parses the mount table and forwards each request to the correct subcluster. We call this proxy service the Router.

Router Rpc Forwarding

Let’s first look at how the Router proxies RPCs.


The figure shows the invocation relationships. The Router starts a RouterRpcServer service. This class implements ClientProtocol, just as NameNodeRpcServer does, so the Router can be accessed as if it were a NameNode. The Router also implements other protocols that allow administrators to manage the Router or the cluster state. After RouterRpcServer receives an RPC, the Router resolves the mount table to obtain the corresponding subcluster and path. It then obtains an RPC client for the target NameNode from the ConnectionManager and forwards the RPC through that client. The ConnectionManager maintains a set of connection pools; the UserGroupInformation of the RPC, the NameNode address, and the protocol together form the key of a connection pool. When a pool is constructed, a certain number of RPC clients are created, and each incoming RPC uses an idle RPC client from the pool to send the request. When there are not enough idle RPC clients, a Creator thread in the background asynchronously builds new connections, while a Cleaner thread in the background is responsible for cleaning up connection pools. How the Router impersonates the user is explained in the Router Security section below.
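To make the pool key described above concrete, here is a minimal sketch of how connections could be grouped by (user, NameNode address, protocol); the class and field names are assumptions for illustration, not the actual ConnectionManager code.

import java.util.Objects;
import org.apache.hadoop.security.UserGroupInformation;

// Illustrative sketch only: connections are grouped by (user, NameNode address, protocol).
public class ConnectionPoolKey {
    private final UserGroupInformation ugi;  // identity of the user issuing the RPC
    private final String namenodeAddress;    // target NameNode RPC address
    private final Class<?> protocol;         // e.g. ClientProtocol

    public ConnectionPoolKey(UserGroupInformation ugi, String namenodeAddress, Class<?> protocol) {
        this.ugi = ugi;
        this.namenodeAddress = namenodeAddress;
        this.protocol = protocol;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ConnectionPoolKey)) {
            return false;
        }
        ConnectionPoolKey other = (ConnectionPoolKey) o;
        return ugi.getUserName().equals(other.ugi.getUserName())
            && namenodeAddress.equals(other.namenodeAddress)
            && protocol.equals(other.protocol);
    }

    @Override
    public int hashCode() {
        return Objects.hash(ugi.getUserName(), namenodeAddress, protocol);
    }
}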



MountTableResolver

In the Router, each mapping from a path in the Federation namespace to a path in a subcluster namespace corresponds to a MountTable record, and together these records form the cluster’s mount table. The MountTableResolver manages them in a member of type TreeMap<String, MountTable>, where the key is a path under the Federation namespace and the value is the corresponding MountTable. Unlike the community’s original ViewFs implementation, the Router supports nested mount tables.
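To make the longest-prefix resolution concrete, below is a small self-contained sketch of the idea (a descending scan over a TreeMap); it illustrates the lookup logic only and is not the actual MountTableResolver code.

import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch: resolve a Federation path to the deepest matching mount point.
public class SimpleMountResolver {
    // Key: path in the Federation namespace; value: "subcluster:path" destination.
    private final TreeMap<String, String> mounts = new TreeMap<>();

    public void addMount(String src, String dest) {
        mounts.put(src, dest);
    }

    public String resolve(String path) {
        // Descending order visits deeper (longer) mount points before their parents,
        // so the first prefix match is the longest one (this is what allows nesting).
        for (Map.Entry<String, String> e : mounts.descendingMap().entrySet()) {
            String mount = e.getKey();
            if (path.equals(mount) || path.startsWith(mount + "/")) {
                return e.getValue();
            }
        }
        return null; // the real Router can fall back to a default subcluster
    }

    public static void main(String[] args) {
        SimpleMountResolver resolver = new SimpleMountResolver();
        resolver.addMount("/data", "ns0:/data");
        resolver.addMount("/data/logs", "ns1:/logs"); // nested mount point
        System.out.println(resolver.resolve("/data/logs/2024")); // ns1:/logs
        System.out.println(resolver.resolve("/data/raw"));       // ns0:/data
    }
}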

The community has also implemented a Resolver that allows a single path to be mounted across multiple subclusters; it decides which subcluster a subdirectory maps to based on specified rules such as consistent hashing. The mount table is set by the administrator using commands, but where should the mount table be persisted so that all Routers can read the latest mount table and a Router does not need to have the mount table set again after restarting?


State Store

In order to easily manage the configuration and state of the Router, the State Store was introduced. It is an abstraction of the storage service that holds the Router’s state, and there are currently two implementations: one based on the file system and one based on Zookeeper.

The StateStoreDriver is responsible for communicating with the State Store; it defines basic get/put interfaces and is maintained by the StateStoreConnectionMonitorService. The StateStoreService is the service through which the Router manages the State Store: it pulls data from the State Store and updates the caches of the registered RecordStores. The items stored in the State Store are called Records, and currently only a Protobuf-based serialization implementation exists. For example, the mount table mentioned above is an implementation of a RecordStore, and each MountTable entry is a Record that is serialized with Protobuf and stored in the State Store.
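As a rough illustration of what “basic get/put interfaces” means here, the sketch below outlines the kind of operations such a driver exposes; the names and signatures are assumptions for illustration, not the actual Hadoop StateStoreDriver API.

import java.io.IOException;
import java.util.List;

// Illustrative sketch of a State Store driver: records (e.g. MountTable entries)
// are read and written as whole serialized objects.
public interface SimpleStateStoreDriver {
    // Fetch all records of one type, e.g. every MountTable entry.
    <T> List<T> get(Class<T> recordClass) throws IOException;

    // Insert or update a batch of records, e.g. after an admin edits the mount table.
    <T> boolean put(List<T> records, boolean allowUpdate) throws IOException;

    // Delete a single record, e.g. when a mount point is removed.
    <T> boolean remove(T record) throws IOException;
}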



Router Security

With the above architecture, the Router can function as a stateless proxy layer. However, the Client no longer communicates with the NameNode directly, so the security authentication scheme of a non-RBF cluster no longer works, and a Router-layer security authentication scheme is needed.

In HDFS practice, there are two authentication schemes, Kerberos and Delegation Token, which are used for different applications. Let’s start by looking at how Kerberos is implemented at the Router layer.


Obviously, we can register the Router as a service with Kerberos, so that the Router authenticates the Client. At the same time, the Router acts as an HDFS superuser and proxies the Client’s user information, which can be implemented quite simply in code:

// The Router logs in with its own Kerberos credentials and impersonates the caller (ugi).
UserGroupInformation routerUser = UserGroupInformation.getLoginUser();
connUGI = UserGroupInformation.createProxyUser(
    ugi.getUserName(), routerUser);

The Delegation Token is not so easy to implement.

As the community currently implements it, the Delegation Token is constructed by the Router, which also authenticates the Client. In order for all Routers to recognize a Delegation Token constructed by any one of them, the tokens need to be stored in the State Store so they can be synchronized between Routers. The advantage of this approach is that it is simple to implement, and authentication can be completed at the Router layer without communicating with all NameNodes. The downside is that synchronization between Routers is required, which can cause performance problems; and because Zookeeper does not guarantee strongly consistent reads, a Router may not see a Delegation Token constructed by another Router, resulting in Client authentication failures.

Other Features

The community has also implemented a number of interesting features within this architecture:

  • Permissions can be set on the mount table

  • Aggregate quotas. Since the subdirectories under a mount point can be mounted on different subclusters, aggregate quotas are needed to manage the quotas across all subclusters.

  • Disabled Namespace.

Community Practice

The RBF architecture of the community is shown below

The recommended practice mentioned in the community documentation is to start a Router service on every NameNode machine, to use the Zookeeper implementation of the State Store, and to keep the Federation namespace path and the subcluster namespace path the same in each mount table mapping.
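From the client’s point of view, accessing the Federation namespace through a Router looks just like talking to a single NameNode. The sketch below illustrates this; the Router host and port are example values for this sketch, not required defaults.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative client usage: the Router is addressed with an ordinary hdfs:// URI.
public class RouterClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "router-host:8888" is an example address; point it at your Router RPC endpoint.
        FileSystem fs = FileSystem.get(URI.create("hdfs://router-host:8888/"), conf);
        // The Router resolves the path against the mount table and forwards
        // the RPC to the NameNode of the owning subcluster.
        System.out.println(fs.exists(new Path("/user/alice/data")));
        fs.close();
    }
}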


Future Works

  • Zookeeper-based Delegation Token authentication can fail, so we need to call the synchronization interface or switch to a more strongly consistent store, such as etcd.

  • The Router hides the Client IP from the NameNode. The Router should also send the RPC context to the NameNode so that the NameNode can record the correct Client information in the audit log.

  • Rebalance. The community does not yet have cross-subcluster migration, but Xiaomi has implemented a Federation rename. We may incorporate this feature into a community Rebalance solution in the future.


This article was first published on the public account “Miui Cloud Technology” (ID: Mi-cloud-tech). Please credit the public account as the source when reprinting.