Today, we are excited to release the file system for the Curve project, along with a new deployment tool. This is the first beta release of CurveFS and a step toward more usable cloud-native software-defined storage, thanks to the concerted efforts of the Curve community.

Release address:

Github.com/opencurve/c…

In the first half of 2021, the Curve team decided to develop a distributed shared file system. Our roadmap outlines the key features we plan to implement, including:

  • A FUSE-based, POSIX-compliant user-mode file read/write interface

  • Support for storing data in object storage systems

  • Support for cloud-native deployment, O&M, and usage

  • Support for multiple file systems

The first version of CurveFS already implements these features, with more under development. Please try it out.

Why CurveFS

The NetEase Shufan storage team, which supports digital businesses across many domains and develops the open-source Curve project, was among the first to feel the need for a new generation of distributed file system in practice, a need echoed by Curve community members.

From conversations with NetEase internal product teams and Shufan commercial customers, we found that the distributed file system most users rely on is CephFS (typically used with Kubernetes to provide PVs). In recent years, users have run into problems in the following scenarios that are hard to solve completely:

Scenario 1: Machine learning workloads that need both performance and capacity

In a machine learning scenario, training is expected to finish as quickly as possible, while training results need to be kept for a long time even though they are accessed infrequently. Such data should therefore be moved, actively or passively, into a capacity-oriented storage pool. "Actively" means that the business itself can change the storage type (capacity or performance) of a directory; "passively" means that data is migrated between storage pools according to configured lifecycle rules (or cache management policies).

In this scenario, CurveFS can use multi-level caches (the CurveFS client-side memory cache, client-side disk cache, CurveBS-based data cache, and CurveBS-based high-performance data pool) to speed up reading and writing of training samples, while cold data settles into the capacity-oriented storage pool, namely object storage. When users want to train on a sample set that has already settled into the cold data pool, they can proactively warm the data up in advance to speed up training (or have it loaded passively). This feature will be supported in a later version.

Scenario 2: Business B, which expects rapid, elastic cross-cloud delivery

A private CephFS deployment normally requires multiple storage nodes, which means a long lead time for procuring and racking machines. In public cloud scenarios, deploying your own storage cluster is usually impractical, so the cloud provider's storage services are used instead; a business that wants unified multi-cloud management then has to do extra development for each cloud. What we want is one-click deployment to multiple public clouds, with uniform usage semantics for the business.

In this scenario, CurveFS can use an existing object storage service to stand up a distributed shared file system with almost unlimited capacity, and the deployment process is simple and fast. In addition, if the performance of the object storage engine does not meet requirements, CurveFS can use cloud disks such as EBS and ESSD to accelerate reads and writes (both client-side and server-side caching are supported).

A storage engine based on object storage also makes cross-cloud deployment easy: O&M personnel can reuse the same set of deployment tools across clouds by changing only a few parameters.

Scenario 3: Service C, which needs large capacity at low cost

Here capacity comes first, but writes still need to be fast. Today these deployments all use three replicas, which is relatively expensive, so lower-cost options are wanted. In this scenario, CurveFS can use the client-side memory cache and disk cache to speed up writes and then asynchronously upload data to a low-cost object storage cluster (which typically uses EC erasure coding to reduce replica redundancy). For users who do not want to buy storage servers to run their own cluster, using a public cloud object storage service is a cost-effective way to meet low-cost, large-capacity storage requirements.

Scenario 4: Automatic hot/cold data separation for ES middleware

Hot data is kept on high-performance storage hardware and cold data on lower-performance hardware, and today this separation has to be configured manually; what the business wants is for the underlying storage engine to sink cold data automatically. In this scenario, CurveFS can keep hot data in the three-replica CurveBS cluster and, according to a configured lifecycle rule, move data that has not been accessed for a certain period of time to the object storage cluster.

Scenario 5: A service that needs unified S3 and POSIX access

The service wants to produce data through a mounted FUSE client and also access the same data through an S3 interface, so CurveFS supports accessing files in a file system through both S3 and POSIX interfaces. With S3 access supported, Curve becomes a unified storage system covering block, file, and object storage, which brings users more convenience.

A scenario planned for the future:

Curve acts as a unified storage layer in front of multiple storage systems (such as HDFS and S3-compatible object storage), taking over access to all of them and providing unified cache acceleration.

Of course, we also encountered problems in CephFS that were hard to solve through configuration changes or light secondary development, which is another motivation for building our own file system:

  • Serious performance bottlenecks in some scenarios: in particular, metadata latency cannot meet service requirements even with multiple MDSes and static directory pinning enabled, SSD-backed metadata pools, and the kernel client

  • High-availability risk: with multiple MDSes and static directory pinning enabled, failover after an active MDS fails takes a long time and interrupts services

  • Metadata load-balancing problems: static directory pinning is barely usable, hard to operate, and hard to put into practice, while dynamic directory migration is currently not reliable enough and causes frequent back-and-forth migrations that hurt the stability of metadata access

  • The metadata lock implementation is complex, hard to understand, and has a steep learning curve: it is functionally comprehensive, but performance inevitably suffers, it is hard for developers to maintain and extend, and problems are very difficult to analyze when they occur

  • Balance problems: Ceph places objects with the CRUSH algorithm, which can leave the cluster less than ideally balanced; the resulting bottleneck ("short board") effect reduces usable cluster capacity and raises costs

CurveFS architecture design

The architecture of CurveFS is shown below:

CurveFS consists of three parts:

  1. curve-fuse, the client, which interacts with the metadata cluster to handle requests for creating, deleting, updating, and querying file metadata, and with the data cluster to handle the same operations on file data.

  2. The metadata cluster (Metaserver Cluster), which receives and processes metadata (inode and dentry) requests. Its architecture is similar to CurveBS, featuring high reliability, high availability, and high scalability:

    1. MDS is used to manage cluster topology and schedule resources.

    2. Metaserver is the data node; one Metaserver manages one physical disk. CurveFS uses Raft to ensure the reliability and availability of metadata. The basic unit of Raft replication is the copyset, and one Metaserver hosts multiple copyset replication groups.

  3. The data cluster (Data Cluster), which receives and processes requests for creating, deleting, updating, and querying file data. It currently supports two storage types: S3-compatible object storage and CurveBS (under development).

Main features

Overview

CurveFS has the following features:

  • POSIX compatibility: it can be used like a local file system, so services connect to it seamlessly

  • High scalability: the metadata cluster can scale out linearly

  • Caching: the client provides both memory-cache and disk-cache acceleration

  • Support for storing data in S3-compatible object storage and in CurveBS (under development)

Client

The CurveFS client, called curve-fuse, interconnects with FUSE to implement complete file system functionality. curve-fuse supports storing data in S3-compatible object storage and in Curve block storage (support for other block storage is also planned). The S3 backend is supported today, the CurveBS backend is still being completed, and in the future the two may be mixed so that data flows between S3 and Curve block storage according to how hot or cold it is. The architecture of curve-fuse is shown below:

The curve-fuse architecture diagram

Curve-fuse contains several main modules:

  • libfuse: curve-fuse hooks into its lowlevel FUSE API to implement a user-mode file system (a minimal registration sketch appears at the end of this section);

  • Metadata cache: caches fsinfo, inodes, and dentries;

  • Meta RPC client: talks to the metadata cluster, sending metadata operations with timeout and retry handling;

  • S3 client: Connects to S3 interfaces to store data in S3.

  • S3 data cache: the cache layer for the S3 data path, accelerating read and write performance of S3-backed data;

  • Curve client: connects to the Curve block storage SDK to store data in a Curve block storage cluster;

  • Volume data cache: the cache layer used when data is stored in Curve block storage, accelerating reads and writes (under development).

curve-fuse is wired up to a complete set of FUSE callbacks and is essentially POSIX-compatible; it currently passes 100% of PJDTest.
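
To make the libfuse lowlevel interface mentioned in the module list concrete, here is a minimal sketch, not taken from the Curve codebase, of how a lowlevel libfuse (version 3) file system registers a handler and enters its service loop; curve-fuse registers analogous callbacks that forward to its metadata cache, Meta RPC client, and data backends. The demo_getattr handler and the way the mountpoint is read from the command line are simplifications for illustration.

```cpp
// Minimal sketch of a lowlevel libfuse (v3) file system: register a handler
// and enter the service loop. NOT Curve source; curve-fuse registers analogous
// callbacks that forward to its metadata cache, Meta RPC client, and data backends.
#define FUSE_USE_VERSION 34
#include <fuse_lowlevel.h>
#include <sys/stat.h>

static void demo_getattr(fuse_req_t req, fuse_ino_t ino,
                         struct fuse_file_info *fi) {
    (void)fi;
    struct stat st = {};
    st.st_ino = ino;
    st.st_mode = (ino == FUSE_ROOT_ID) ? (S_IFDIR | 0755) : (S_IFREG | 0644);
    st.st_nlink = 1;
    // In curve-fuse this is where the inode cache / Meta RPC client would be consulted.
    fuse_reply_attr(req, &st, 1.0 /* attribute timeout in seconds */);
}

int main(int argc, char *argv[]) {
    struct fuse_args args = FUSE_ARGS_INIT(argc, argv);
    struct fuse_lowlevel_ops ops = {};
    ops.getattr = demo_getattr;  // a real client also sets lookup/read/write/...

    struct fuse_session *se = fuse_session_new(&args, &ops, sizeof(ops), nullptr);
    if (se == nullptr) return 1;
    // Simplification: take the mountpoint as the last command-line argument.
    if (fuse_session_mount(se, argv[argc - 1]) != 0) return 1;

    int rc = fuse_session_loop(se);  // serve requests until unmounted

    fuse_session_unmount(se);
    fuse_session_destroy(se);
    fuse_opt_free_args(&args);
    return rc;
}
```

Built against libfuse 3 and run with a mountpoint argument, this toy file system answers getattr requests for that directory until it is unmounted.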

S3 storage engine support

The S3 client is responsible for translating the read and write semantics of files into the upload and download semantics of S3 storage. Considering the relatively poor performance of S3 storage, we implemented DataCache and DiskCache layers here. The overall architecture is as follows:

S3ClientAdaptor mainly includes the following modules:

  • FsCacheManager: Manages the cache of the entire file system, including inode to FileCacheManager mapping, read and write cache size statistics, and control

  • FileCacheManager: Manages the cache of individual files

  • ChunkCacheManager: manages the cache for a single chunk of a file

  • DataCache: the smallest unit of cache management, corresponding to a contiguous data range within a chunk. At the DataCache layer, data is ultimately mapped to one or more objects in S3 storage for upload (the mapping is sketched after this list)

  • DiskCache: manages the local disk cache. Data can be persisted to the local disk first and uploaded to S3 storage asynchronously, effectively reducing latency and improving throughput

  • S3Client: calls the backend S3 storage interface, currently via the AWS SDK
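
As a rough illustration of how the DataCache layer maps file data onto S3 objects, the sketch below splits a file-level write into chunk-aligned ranges and derives a candidate object key for each. It is not Curve source code; the 64 MiB chunk size and the `<inode>_<chunkIndex>_<version>` key scheme are assumptions made only for illustration.

```cpp
// Illustrative sketch of the DataCache-to-S3 mapping: split a file-level write
// into chunk-aligned ranges and derive an object key per range. NOT Curve source;
// the 64 MiB chunk size and the "<inode>_<chunkIndex>_<version>" key scheme are
// assumptions made purely for illustration.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

constexpr uint64_t kChunkSize = 64ULL << 20;  // assumed chunk granularity

struct ObjectRange {
    std::string key;         // S3 object key this range maps to
    uint64_t offsetInChunk;  // start offset inside the chunk
    uint64_t length;         // number of bytes in this range
};

// Split a write at (offset, length) of one inode into per-chunk object ranges.
std::vector<ObjectRange> MapWriteToObjects(uint64_t inodeId,
                                           uint64_t offset, uint64_t length) {
    std::vector<ObjectRange> out;
    const uint64_t end = offset + length;
    while (offset < end) {
        uint64_t chunkIndex = offset / kChunkSize;
        uint64_t inChunk = offset % kChunkSize;
        uint64_t n = std::min(end - offset, kChunkSize - inChunk);
        // Hypothetical key scheme: <inode>_<chunkIndex>_<version>
        out.push_back({std::to_string(inodeId) + "_" + std::to_string(chunkIndex) + "_0",
                       inChunk, n});
        offset += n;
    }
    return out;
}

int main() {
    // A 100 MiB write starting at offset 10 MiB spans two chunks.
    for (const auto &r : MapWriteToObjects(42, 10ULL << 20, 100ULL << 20))
        std::cout << r.key << " +" << r.offsetInChunk << " len " << r.length << "\n";
    return 0;
}
```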

MDS

MDS refers to the metadata management service. The CurveFS MDS is similar to the CurveBS MDS (an introduction to the CurveBS MDS: zhuanlan.zhihu.com/p/333878236…).

The CurveFS MDS provides the following functions:

  • The topology module manages topology information for the entire cluster, as well as the topology's whole lifecycle

  • The FS sub-module manages file system superblock information; it supports creating, deleting, mounting, and querying a file system, and is responsible for how a file system's inode and dentry metadata is distributed across Metaservers

  • The heartbeat module maintains heartbeats with Metaservers and collects Metaserver status

  • The scheduling system handles scheduling: CurveFS metadata relies on a consensus protocol for reliability, and when a replica becomes unavailable the scheduler recovers it automatically (scheduling is under development)

As a centralized metadata management service, the MDS's performance, reliability, and availability are all very important.

  • **Performance:** metadata on the MDS is cached in memory to speed up lookups. In addition, when a file system is created, the MDS allocates shards to it for storing inode and dentry information; in the system such a shard is called a partition. Once partitions are allocated, the file system's metadata operations go directly from the client to the Metaserver, so inode and dentry management no longer passes through the MDS

  • **Reliability and availability:** MDS metadata is persisted to etcd, relying on etcd's three replicas for metadata reliability. Multiple MDS instances can be deployed, but only one provides service at a time; if the active MDS fails for some reason, a new active MDS is automatically elected from the remaining instances

MetaServer

The Metaservers form a distributed metadata management system that provides the metadata service to clients. File system metadata is sharded, and each shard is kept consistent across three replicas; such a group of three replicas is called a copyset and runs the Raft consensus protocol internally. A copyset can manage multiple metadata shards, so metadata management for the entire file system looks like this:

There are two copysets, each with three replicas placed on three machines. P1, P2, P3, and P4 are metadata shards; P1 and P3 belong to one file system, while P2 and P4 belong to another.

Metadata management

File system metadata is managed in shards, each called a Partition. A Partition provides interfaces for creating, deleting, updating, and querying dentries and inodes.

An inode corresponds to a file or directory in the file system and records its metadata, such as atime/ctime/mtime. When an inode represents a file, it also records the file's data addressing information. Each Partition manages inodes within a fixed range, divided by inode id; for example, inode ids [1, 200] are managed by Partition 1, inode ids [201, 400] by Partition 2, and so on.

A dentry is a directory entry in the file system that records the mapping from file name to inode. The dentries of all files and directories under a parent directory are managed by the Partition where the parent directory's inode resides.
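
A minimal sketch of the inode-range routing described above, using the example ranges from the text; it is not Curve source code, and the structure and names are illustrative only.

```cpp
// Sketch of routing an inode id to the Partition that owns its range, following
// the example ranges in the text ([1-200] -> Partition 1, [201-400] -> Partition 2).
// NOT Curve source; the structure and names are illustrative only.
#include <cstdint>
#include <iostream>
#include <map>
#include <optional>

struct Partition {
    uint32_t id;
    uint64_t start;  // first inode id managed (inclusive)
    uint64_t end;    // last inode id managed (inclusive)
};

class PartitionTable {
 public:
    void Add(const Partition &p) { byStart_[p.start] = p; }

    // Find the partition whose [start, end] range contains inodeId, if any.
    std::optional<Partition> Route(uint64_t inodeId) const {
        auto it = byStart_.upper_bound(inodeId);
        if (it == byStart_.begin()) return std::nullopt;
        --it;  // greatest start <= inodeId
        if (inodeId <= it->second.end) return it->second;
        return std::nullopt;
    }

 private:
    std::map<uint64_t, Partition> byStart_;  // keyed by range start
};

int main() {
    PartitionTable table;
    table.Add({1, 1, 200});
    table.Add({2, 201, 400});
    std::cout << "inode 250 -> Partition " << table.Route(250)->id << "\n";  // prints 2
    return 0;
}
```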

Consistency

File system metadata shards are stored as three replicas, and the Raft algorithm ensures consistency among them; metadata requests from clients are processed by the Raft leader. At the implementation level, we use the open-source braft (github.com/baidu/braft…).

High availability

The guarantee of high availability comes mainly from two mechanisms. First, the Raft algorithm ensures data consistency, and Raft's heartbeat mechanism lets the remaining replicas in a replication group quickly elect a new leader and keep serving requests when the Raft leader fails.

Second, Raft's quorum-based protocol needs only two of the three replicas to survive, but running on two replicas for a long time is itself an availability risk. We therefore add a periodic heartbeat between Metaserver and MDS: each Metaserver regularly reports statistics such as memory usage, disk capacity, and replication group information to the MDS. When a Metaserver process exits, its replication group information stops being reported, the MDS notices that some replication groups have only two live replicas, and it issues a configuration change request through the heartbeat to restore those groups to the normal three-replica state.
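
The sketch below condenses the MDS-side check just described: each copyset's members are expected to keep reporting via heartbeat, and any copyset with fewer than three recently-reporting members is flagged so a configuration change can be scheduled. It is not Curve source code; the names, timeout, and data layout are assumptions for illustration.

```cpp
// Condensed sketch of the MDS-side check described above: flag copysets whose
// members have stopped heartbeating so a configuration change can be scheduled.
// NOT Curve source; names, timeout, and data layout are illustrative only.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

using Clock = std::chrono::steady_clock;

struct CopysetState {
    // metaserverId -> time of the last heartbeat report mentioning this copyset
    std::map<uint32_t, Clock::time_point> lastReport;
};

// Return the ids of copysets with fewer than expectedReplicas recently-reporting members.
std::vector<uint64_t> FindUnderReplicated(const std::map<uint64_t, CopysetState> &copysets,
                                          Clock::time_point now,
                                          std::chrono::seconds heartbeatTimeout,
                                          int expectedReplicas = 3) {
    std::vector<uint64_t> result;
    for (const auto &entry : copysets) {
        int alive = 0;
        for (const auto &report : entry.second.lastReport)
            if (now - report.second < heartbeatTimeout) ++alive;
        if (alive < expectedReplicas) result.push_back(entry.first);
    }
    return result;
}

int main() {
    // One copyset whose third replica has not reported for two minutes.
    const auto now = Clock::now();
    std::map<uint64_t, CopysetState> copysets;
    copysets[1].lastReport = {{101, now}, {102, now}, {103, now - std::chrono::seconds(120)}};
    for (uint64_t id : FindUnderReplicated(copysets, now, std::chrono::seconds(30)))
        std::cout << "copyset " << id << " needs a configuration change\n";
    return 0;
}
```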

New deployment tool CurveAdm

To make Curve easier to operate and maintain, we designed and developed the CurveAdm project, which is mainly used to deploy and manage Curve clusters. It currently supports deploying CurveFS (CurveBS support is under development).

Project Address:

Github.com/opencurve/c…

CurveFS deployment process:

Github.com/opencurve/c…

CurveAdm’s design architecture is shown below:

  • CurveAdm embeds SQLite (the entire database is a single file) to store the cluster topology and per-service information such as serviceId and containerId (a schema sketch follows this list)

  • CurveAdm logs in to target machines over SSH and controls containers by issuing Docker CLI commands. The container images are the Curve releases, with the default version being latest
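
As a rough sketch of the single-file bookkeeping described above, the snippet below creates one table mapping each deployed service to its serviceId and containerId. CurveAdm itself is written in Go, and this C++/SQLite snippet and its schema are illustrative only, not the tool's actual implementation.

```cpp
// Illustrative single-file SQLite bookkeeping in the spirit of CurveAdm's design.
// CurveAdm itself is written in Go; this C++ snippet and the schema are NOT its
// actual implementation, only a sketch of serviceId/containerId record keeping.
#include <sqlite3.h>
#include <cstdio>

int main() {
    sqlite3 *db = nullptr;
    if (sqlite3_open("curveadm.db", &db) != SQLITE_OK) return 1;  // one file = whole DB

    const char *ddl =
        "CREATE TABLE IF NOT EXISTS services ("
        "  service_id   TEXT PRIMARY KEY,"  // logical service identity (illustrative)
        "  container_id TEXT,"              // docker container running the service
        "  host         TEXT,"              // machine reached over SSH
        "  role         TEXT"               // e.g. mds / metaserver / etcd
        ");";

    char *err = nullptr;
    if (sqlite3_exec(db, ddl, nullptr, nullptr, &err) != SQLITE_OK) {
        std::fprintf(stderr, "sqlite error: %s\n", err);
        sqlite3_free(err);
        sqlite3_close(db);
        return 1;
    }
    sqlite3_close(db);
    return 0;
}
```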

CurveAdm has several advantages over the previous Ansible-based deployment tool:

  • CurveAdm runs cross-platform, is packaged independently with no extra dependencies, can be installed with one click, and is easy to use

  • Curve components run in containers, which solves component dependency and distribution compatibility issues

  • It is developed in Golang, so iteration is fast and it is highly customizable

  • Cluster logs can be collected, packaged, and encrypted by the tool itself and uploaded to the Curve team, making analysis and troubleshooting easier

  • CurveAdm itself supports one-click self-update for easy upgrades

Currently, the following functions are supported:

If you are interested in the CurveAdm project, feel free to contribute (submit issues or requirements, develop features, write documentation, etc.).

Problems to be solved

This is the first beta release of CurveFS and it is not recommended for production use. Problems still to be solved include:

  • Shared read and write not supported (under development)

  • Disk cache space management policies and flow control

  • Random read/write performance: this is constrained by the characteristics of the S3 engine; we will continue to optimize, for example with concurrent multipart uploads and range reads

  • Automatic recovery of abnormal nodes (under development)

  • Recycle bin: mistakenly deleted data can be retrieved and recovered

  • Concurrent read/write feature: Multiple nodes sharing a file system can read/write data at the same time

  • Monitoring integration: collecting metrics with Prometheus and displaying them with Grafana

You are welcome to submit issues and bug reports on GitHub, or add the WeChat account opencurve to be invited into the user group.

Outlook for future releases

The Curve project usually releases a major version every six months and a minor version every quarter. CurveFS is a large new addition, and this initial version still has many incomplete features that need continued work. Our main development goals for the next major version (subject to adjustment based on actual demand) are:

  • CurveBS storage engine support

  • Cross-engine data lifecycle management

  • CSI plug-in

  • Improvements to the deployment tool

  • Kubernetes-based cluster deployment: Helm-based deployment is supported and will be further optimized toward a higher level of cloud-native operability

  • Concurrent read/write from multiple nodes

  • O&M tooling improvements (monitoring, alerting, and problem diagnosis)

  • Recycle bin

  • Optimization for HDD scenarios

  • NFS, S3, and HDFS

  • Snapshots

If you have related needs, please get in touch with us.

What is Curve

Positioning of Curve

Positioning: an open-source, cloud-native software-defined storage system that is high-performance, easy to operate and maintain, and supports a wide range of scenarios.

Vision: Easy-to-use cloud native software-defined storage.

CurveBS introduction

CurveBS is one of the core components of the Curve cloud-native software-defined storage system. It features high performance, high reliability, and ease of operation and maintenance, and adapts well to cloud-native scenarios with a separated storage and compute architecture. CurveFS will also support using CurveBS as a storage engine. The overall architecture of CurveBS is shown below:

Detailed design documentation is available in previous articles:

  • www.opencurve.io/docs/home/

  • Github.com/opencurve/c…

  • Github.com/opencurve/c…

  • zhuanlan.zhihu.com/p/311590077

Recent plans

  • PolarFS adaptation: integration of a single PFSD with a single CurveBS volume is complete; support for multiple PFSDs with a single CurveBS volume will follow. Code repository: github.com/skypexu/pol…

  • ARM64 platform adaptation: basic functional testing is complete; performance optimization and stability verification will follow. Code repository: github.com/opencurve/c…

  • FIO CurveBS engine: supported. Code repository: github.com/skypexu/fio…

  • NVMe/RDMA adaptation: validation and performance optimization are planned for the near future

  • iSCSI interface support: widely used and highly general; support is planned for the near future

  • Raft optimization: explore optimizing log management, improving I/O concurrency, supporting follower reads, and Raft degradation (continuing to provide service when only one of the three replicas remains available)

For more information about Curve, see:

  • Curve homepage: www.opencurve.io/
  • Source code address: github.com/opencurve/c…
  • Roadmap: github.com/opencurve/c…
  • Collection of technical deep-dive articles: zhuanlan.zhihu.com/p/311590077