In the era of big data with information explosion, how to solve the storage problem of massive data with lower cost has become an important part of enterprise big data business. The new generation of object storage service US3 developed by UCloud in the past period of time has introduced the solution of computing storage separation and big data backup for big data business scenarios.

The main reasons behind this include:

1. Due to the rapid development of network technology, network transmission performance is no longer the bottleneck of high throughput business demand in the big data scenario; 2. The operation and maintenance of HDFS storage solution in the Hadoop technology stack are complex and costly; 3. Object storage service US3, built on the cloud platform based on the mass storage resource pool, has the advantages of on-demand use, simple operation, reliability and stability, and low price, and is the best storage solution choice to replace HDFS. Therefore, in order to make it more convenient for users to use US3 to realize computing storage separation and big data backup solution in Hadoop scenario, US3 has developed three components of US3Hadoop Adapter, US3VMDS and US3DISTCP.

This paper mainly introduces some ideas and problem solving in the development and design process of US3Hadoop adapter.

Overall design idea

Storage operations in the Hadoop ecosystem are mostly done through a common FileSystem base class, Filesystem. The US3Hadoop Adapter (referred to as the Adapter) implements this base class through the US3Filesystem to operate US3. DistributedFilesystem similar to HDFS implementation and S3afilesystem based on AWS S3 protocol implementation. The adapter sends both IO and index requests directly to US3, as shown in the diagram below:

Here the index operation is mainly does not involve reading and writing data API, such as: Headfile, ListObjects, Rename, Deletefile, Copy(for modifying metadata); API for IO operations, such as getFile, putFile (files smaller than 4M) have been shard upload related 4 APIs: InitiateMultiPartupload, UploadPart, FinishMultiPartupload, AbortMultiPartupload. Now that US3 has these APIs, see how they match up with the FileSystem member methods and see which FileSystem methods need to be overridden. Combined with the actual requirements and the reference implementation of DistributedFilesystem and S3afilesystem, we identified the main methods that need to be overridden: Initialize, create, rename, getFileStatus, open, listStatus, mkdirs, setOwner, setPerpermission, setReplication, setWorkingDirectory, GetWorkingDirectory, GetSchem, GetUri, GetDefaultBlockSize, Delete. Some methods that are difficult to simulate, such as the Append member method, are overridden as exceptions.

In fact, judging from the above description of FileSystem member methods, the semantics are similar to the interface semantics of a stand-alone FileSystem engine, which basically organizes and manages filesystems in a directory tree structure. The ListObjects API provided by US3 also happens to provide a way to pull the directory tree. When Hadoop calls the listStatus method, it can pull all the children of the current directory (prefix) through the listObjects loop to return the corresponding result.

Setting the user/group to which the file belongs, operation permissions and other related operations use the metadata function of US3 to map these information to the KV metadata pair of the file. The file stream will be cached in memory for up to 4MB of data first, and then the decision of using the PutFile or the sharded upload API will be made based on the subsequent operation.

To read the file stream, GetFile returns the stream instance to read the expected data. While the implementation of these methods may seem straightforward, there is a lot of potential for optimization.

By analyzing the FileSystem calls, we can see that indexing operations account for more than 70% of the big data scenarios, and GetFileStatus has the highest proportion of indexing operations, so it is necessary to optimize it. So what’s the optimization point?

Because the “directories” in US3 are keys ending in a ‘/’, Filesystem operates on files through a Path structure that does not end in a ‘/’. If the Key is a directory in US3, the HeadFile will return a 404, which can only be confirmed by using “Key/” in the Head of US3 for the second time. If the Key directory still does not exist, the getFileStatus delay will increase significantly.

So the US3 adapter does two things when it creates a directory :1. Write to US3 an empty file with MIME-type “file/path” and file name “Key”; 2. Write to US3 an empty file with the MIME-type “application/x-director” and the file name “Key/”;

The MIME-type for ordinary files is “application/octet-stream”. This way, GetFileStatus uses the HeadFile API once to determine whether the current Key is a file or a directory. At the same time, when the directory is empty, the directory can also be displayed on the US3 console. Moreover, since the Hadoop scenario mainly writes large files, the write time of adding an empty file index is at the MS level, and the delay is basically negligible.

In addition, GetFileStatus has a distinct “spatio-temporal locality” in Hadoop usage, with keys that were recently operated on by GetFileStatus in a specific Filesystem instance being operated on multiple times in a short period of time. Taking advantage of this feature, the US3Filesystem implementation inserts a fileStatus with a 3S lifetime into the Cache before the fileStatus returns. If the Key is deleted from the Cache, the delete operation will mark a valid fileStatus in the Cache as a 404 Cache with a lifespan of 3 3s. Or simply insert a 404 Cache with a 3S life cycle. In the case of rename, the Cache of the source is reused to construct the Cache of the destination Key, and the source is deleted, thus reducing the amount of interaction with US3. Cache hits (US level) reduce latency on GetFileStatus by a factor of 100.

Of course, this introduces some consistency issues, but only if there is at least one “write” in multiple concurrent jobs, such as in the case of delete and rename. If there is only a read, it does not matter. But the big data scenario is basically the latter.

ListObjects consistency issues

The ListObjects interface in US3 is similar to the other object storage solutions in that it is only ultimately consistent (although a strongly consistent ListObjects interface will be introduced by US3 in the future), so adapters for other object storage implementations will also be written to a file. Occasionally the file does not exist when ListStatus is called immediately. Other object storage schemes sometimes mitigate this by introducing a middleware service (typically a database) that writes the file index to the middleware when a file is written, and merges the middleware index information when ListStatus is written. This further improves consistency.

But not enough. For example, if the object is written to the store successfully, but the program crashes while writing to the middleware, this leads to the problem of inconsistency, which goes back to the problem of final consistency.

The implementation of the US3Hadoop adapter is relatively simple and efficient, does not require additional services, and provides read-your-writes consistency at the index operation level, which is roughly equivalent to strong consistency in most Hadoop scenarios. Unlike the S3afilesystem implementation, which returns immediately after create or rename or delete, the US3Hadoop adapter internally calls the ListObjects interface for a “check” and returns until the “check” results meet expectations.

Of course, there is also optimization space, such as delete a directory, the corresponding will pull out all the files in the directory first, and then call Deletefile API to delete, if every Deletefile API deletion is “checking” once, then the entire delay will be doubled. The US3Hadoop adapter does this by “checking” only the last index operation. This is because the index oplog is synchronized to the list service in time order. If the last index is checked successfully, then the previous oplog must have written successfully to the list service.

Deep customization of Rename

The aforementioned rename is also an important optimization of US3. Other implementations of object storage schemes typically use the Copy interface to Copy the file before deleting the source file. If the rename file is large, then the entire process of rename is bound to cause high latency.

US3 specifically developed the Rename API for this scenario, so the US3Hadoop adapter implements Rename with relatively light semantics, and the latency is maintained at the MS level.

Ensure the efficiency and stability of READ

Read is a high-frequency operation in big data scenarios, so the read stream implementation of the US3Hadoop adapter is not a simple encapsulation of the body of the HTTP response, but takes into account various optimizations. For example, the optimization of read stream can reduce the frequency of network IO system call and the waiting delay of read operation by adding pre-read Buffer, especially the IO of large batch sequential reads.

In addition, the read stream of FileSystem has a seek interface, which means it needs to support random reads. There are two scenarios here:

If the Underlay Stream seeks to a position prior to that of the read Stream, then the body of the HTTP response as the Underlay Stream is deprecated and closed. If the Underlay Stream seeks to a position prior to that of the read Stream, then the body of the HTTP response is deprecated and closed. Get the body Stream of its HTTP response as the new Underlay Stream. When seek occurs, the US3Hadoop adapter is implemented to only mark the position of the stream that has been read. When seek occurs, the US3Hadoop adapter only marks the position of the stream that has been read. When seek occurs, the US3Hadoop adapter only marks the position of the stream that is read. The Underlay Stream is closed and opened by deferring the read process. Also, if the seek position is still in the Buffer, the Underlay Stream is not reopened. Instead, the consumption offset of the Buffer is changed.

2. Seek to the post position of the read stream. The same lazy Stream opening is used here, but it is not necessary to close the old Underlay Stream and open the new Underlay Stream at the target location when you are sure you want to do a real seek. Because the current read position may be close to the post position of seek (say only 100KB), it may be completely within the range of the read Buffer. This can also be achieved by changing the consumption offset of the Buffer.

The US3Hadoop adapter does this, but the current rule is that the distance between seek’s post position and the current read stream position is less than or equal to the sum of the remaining read Buffer plus 16K. Then locate the post position of seek directly by modifying the consumption offset of the preread Buffer and consuming the data in the Underlay Stream. The 16K is added to account for the data that TCP receives in the cache. Of course, the subsequent cost of determining the consumption of N bytes of data from a ready Underlay Stream is roughly equal to the cost of relaising a GetFile API before the HTTP response body is ready to be transferred, and the N bytes are also taken into account in the offset calculation.

The final Stream optimization also takes into account the case of an Underlay Stream exception, such as the HBase scenario holding an open Stream for a long time but not operating on it due to other operations. US3 may voluntarily close and release the TCP connection corresponding to the Underlay Stream. Subsequent operations on the Underlay Stream report TCP RST exceptions. To provide availability, the implementation of the US3Hadoop adapter is to reopen the Underlay Stream at the point where it has been read.

Write in the last

The implementation of the US3Hadoop adapter further optimizes the relevant core problem points, improves the reliability and stability of Hadoop’s access to US3, and plays an important role in bridging Hadoop and US3 in many customer cases, helping users improve the efficiency of storage, reading and writing in big data scenarios.

However, the US3Haoop adapter still has a lot of room for improvement. Compared with HDFS, there is still a gap in the delay of index and IO, and the atomicity guarantee is relatively weak. These are the problems we need to think about and solve next. The current release of US3VMDS addresses most of the indexing delay issues, resulting in significantly improved performance for US3 operating through the US3Hadoop adapter and approaching native HDFS performance in some scenarios. Specific data can refer to the official documentation (…

In the future, US3 products will continue to improve and optimize the storage solutions for big data scenarios, so as to reduce the cost of big data storage and further improve the user experience in big data scenarios.