GooseFS, the data lake accelerator, is a high-performance, highly available, and elastic distributed caching solution from Tencent Cloud. Building on the cost advantage of Cloud Object Storage (COS) as the data lake's storage base, it gives computing applications in the data lake ecosystem a unified entry point and accelerates storage access for workloads such as massive data analysis, machine learning, and artificial intelligence.

GooseFS adopts a distributed cluster architecture that is elastic, highly reliable, and highly available. It provides a unified namespace and access protocol for upper-layer computing applications, making it easier for users to manage and move data between different storage systems.

Zero. Product background

In recent years, the trend toward using object storage as unified data lake storage has become increasingly clear. Object storage is low-cost, highly reliable, and elastic, which makes it well suited to storing massive data volumes. More and more enterprises are migrating big data storage from HDFS to object storage, adopting either pure object storage or a hybrid of object storage and HDFS to implement enterprise-level tiering of hot and cold data. Under a data lake architecture, however, enterprises still face the following problems:

Performance: In big data scenarios, both the Map and Reduce stages frequently perform List and Rename operations on files, but the flat design of object storage makes these operations a natural performance bottleneck. In addition, accessing data across data centers further increases request latency under a data lake architecture. As stream-batch integration becomes more widespread, big data workloads also have increasingly strict real-time requirements. Hot data therefore needs to sit as close to the compute side as possible to improve business performance.
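To make the Rename bottleneck concrete, here is a minimal sketch that models a flat object store as a Python dict. The store, the key names, and both helper functions are hypothetical illustrations, not the COS API. The point is only that a "directory" in a flat key space is a shared prefix, so renaming it touches every object underneath.

```python
# Hypothetical in-memory model of a flat object store: each key is a full
# path, and there are no real directories.
def list_prefix(store, prefix):
    """Listing a 'directory' means scanning all keys for a shared prefix."""
    return sorted(key for key in store if key.startswith(prefix))


def rename_prefix(store, src, dst):
    """Renaming a 'directory' rewrites every object under the prefix."""
    moved = 0
    # list_prefix returns a new list, so mutating the dict here is safe.
    for key in list_prefix(store, src):
        store[dst + key[len(src):]] = store.pop(key)
        moved += 1
    return moved


store = {
    "job/_temporary/part-0": b"a",
    "job/_temporary/part-1": b"b",
    "job/_SUCCESS": b"",
}
moved = rename_prefix(store, "job/_temporary/", "job/output/")
print(moved, list_prefix(store, "job/"))
```

In a hierarchical file system the same rename is typically a single metadata update, which is why job commit steps that rely on Rename tend to be slower against a flat object store.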

Cost: Offline big data jobs often need to pull large amounts of the same data into the computing cluster repeatedly and analyze it as quickly as possible. Under the data lake's storage-compute separation architecture, this puts heavy pressure on storage bandwidth. Peak bandwidth is high while average bandwidth is low, so capacity provisioned for the peak is easily wasted and costs rise. Caching hot data on compute nodes reduces bandwidth consumption and therefore business cost.
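The effect of caching on storage bandwidth can be estimated with simple arithmetic. The sketch below assumes that every cache hit avoids a read from remote storage; the function name and all figures are hypothetical and are not measurements of GooseFS.

```python
def backend_traffic(requested_gb, cache_hit_ratio):
    """Data that still has to be read from remote storage, in GB."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return requested_gb * (1.0 - cache_hit_ratio)


# Hypothetical job: 400 GB requested in total, 75% of it repeated hot data
# that can be served from a cache on the compute nodes.
without_cache = backend_traffic(400, 0.0)
with_cache = backend_traffic(400, 0.75)
print(without_cache, with_cache)
```

Under these assumptions the remote store serves a quarter of the original traffic, so the peak bandwidth that must be provisioned shrinks accordingly.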

Operations and maintenance: Many businesses build hybrid storage architectures from different storage services such as HDFS and object storage. They then have to maintain several different storage interfaces, which increases operational complexity. If a single storage service can connect to different back-end storage systems and present a consistent view to upper-layer computing workloads, it greatly reduces development effort and makes storage services more efficient to use.

One. Product functions

GooseFS aims to provide a one-stop caching solution, with natural strengths in data locality, caching, and unified storage access semantics. It plays a core role in the Tencent Cloud data lake ecosystem, connecting the compute layer above with the storage layer below, as shown below.

GooseFS is designed and developed on the basis of the open source big data caching solution Alluxio, and adds key features as well as stability and performance optimizations beyond the open source version. It is also deeply integrated with the Tencent Cloud ecosystem, connecting to computing services such as Tencent Cloud TKE and EMR so that users can use it out of the box.

The main functions are as follows:

Cache acceleration and data locality: GooseFS can be co-deployed with compute nodes to improve data locality, using its caching capability to address storage performance problems and to make reading and writing files in COS more efficient.

Converged storage semantics: GooseFS exposes a unified interface protocol to the layer above and connects to COS, Cloud HDFS (CHDFS), and the private-cloud storage product CSP underneath. It has been specially optimized for Tencent Cloud COS, CHDFS, CSP, and related products, and suits a variety of ecosystems and application scenarios.

Integration with Tencent Cloud ecosystem services: This includes monitoring, logging, and authentication support. GooseFS already connects to Tencent Cloud EMR, Tencent Cloud TKE, and Tencent Cloud EKS. It also supports integration with Tencent Cloud monitoring, Tencent Cloud Log Service (CLS), Tencent Cloud ES, Prometheus, Grafana, and other services.

Metadata management: GooseFS supports asynchronously caching data stored on COS or CHDFS to local nodes at the Hive table or table partition level, and allows a different metadata management scheme to be configured for each namespace.

Two. Product advantages

GooseFS offers several clear advantages in data lake scenarios:

1. Data I/O performance

A GooseFS deployment provides a distributed, shared cache close to the compute side. Upper-layer computing applications can transparently and efficiently cache frequently accessed hot data from remote storage in this cache, which accelerates data I/O.
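The caching behavior described above can be illustrated with a small read-through cache that uses least-recently-used (LRU) eviction. This is a single-process sketch under stated assumptions and not GooseFS's implementation: the class, the `fetch_from_remote` function, and the choice of LRU are all hypothetical.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Serve reads from a local cache and fall back to remote storage on a miss."""

    def __init__(self, fetch_remote, capacity):
        self._fetch_remote = fetch_remote  # callable standing in for remote storage
        self._capacity = capacity          # maximum number of cached entries
        self._blocks = OrderedDict()       # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self._blocks:
            self._blocks.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._blocks[key]
        self.misses += 1
        data = self._fetch_remote(key)
        self._blocks[key] = data
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)  # evict the least recently used entry
        return data


remote_reads = []


def fetch_from_remote(key):
    """Hypothetical stand-in for a read against remote object storage."""
    remote_reads.append(key)
    return f"data:{key}".encode()


cache = ReadThroughCache(fetch_from_remote, capacity=2)
for key in ["a", "b", "a", "c", "a", "b"]:
    cache.read(key)
print(cache.hits, cache.misses, remote_reads)
```

The caller only ever calls `read`, which is what makes the cache transparent: repeated reads of hot keys are served locally, and only misses reach the remote store.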

GooseFS provides a metadata-aware Table feature, which speeds up metadata operations such as listing and renaming files in big data scenarios. In addition, businesses can choose among storage media such as memory, HDD, SSD, and NVMe SSD as needed to balance cost against data access performance.

2. Storage integration

GooseFS provides a unified namespace and a unified interface protocol to the layer above, while connecting to different storage services such as COS, CHDFS, and CSP underneath, which simplifies configuration and operations for the business. Storage integration breaks down the barriers between different data stores, makes it easier for upper-layer applications to manage and move data, and improves data utilization.
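One way to picture a unified namespace is as a mount table that maps logical paths to back-end storage locations. The sketch below resolves a logical path by longest-prefix match. The mount points, bucket and host names, and the `resolve` function are hypothetical placeholders and do not reflect GooseFS's actual configuration format.

```python
# Hypothetical mount table: logical path prefix -> back-end location.
MOUNTS = {
    "/warehouse": "cosn://example-bucket/warehouse",
    "/warehouse/archive": "ofs://example-mount/archive",
    "/logs": "hdfs://example-namenode:8020/logs",
}


def resolve(path, mounts=MOUNTS):
    """Map a logical path to a back-end URI using the longest matching mount."""
    best = None
    for mount_point in mounts:
        # Match whole path components only, so "/logs2" does not match "/logs".
        if path == mount_point or path.startswith(mount_point + "/"):
            if best is None or len(mount_point) > len(best):
                best = mount_point
    if best is None:
        raise KeyError(f"no mount covers {path}")
    return mounts[best] + path[len(best):]


print(resolve("/warehouse/sales/2021.parquet"))
print(resolve("/warehouse/archive/old.orc"))
```

Applications address every file through one logical path scheme, and only the mount table needs to change when data moves between back ends.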

3. Ecosystem compatibility

GooseFS is fully compatible with Tencent Cloud's big data platform frameworks and also supports custom local deployment on the customer side. Businesses can use GooseFS within Tencent Cloud's Elastic MapReduce product to accelerate big data workloads, and can just as conveniently deploy it themselves on public cloud CVM instances or in a self-built IDC. In addition, GooseFS supports transparent acceleration, allowing object storage to be accessed through the COSN interface. For those who already use COSN, the COS big data plug-in, introducing GooseFS is straightforward.

Three. Closing notes

GooseFS is designed to provide a one-stop data lake cache acceleration solution that lets users manage and move data across different storage systems and improve data utilization.

To learn more about GooseFS, or to deploy and try it yourself, see the GooseFS configuration documentation by clicking Read More.
