Hello, everyone. I am Xiaoyan, a developer of the Yanrong cloud storage system. In this article, I will discuss Ceph data recovery with you.

As we all know, Ceph is the most popular open-source distributed storage system of recent years. Unlike many other distributed storage systems, Ceph provides block storage, file storage, and object storage interfaces on top of a core object storage layer called RADOS, which is why Ceph is often called unified storage. In this article, we will discuss the data recovery logic at the core RADOS level.

Like other distributed systems, Ceph uses multiple replicas to ensure high data reliability (note: at the implementation level, erasure coding can be understood as a variant of the replica mechanism). Given a piece of data, Ceph automatically stores multiple copies (usually three) in the background, preventing data loss and even keeping the data online when a disk is damaged, a server fails, or a rack loses power. However, Ceph must recover the lost replicas promptly after a failure in order to maintain high data reliability.

The multi-replica mechanism is therefore one of the core mechanisms of a distributed storage system: it brings high data reliability and improves data availability. Nothing is perfect, however, and the multi-replica mechanism also introduces one of the hardest problems in distributed systems: data consistency. Data consistency in Ceph is not the topic of this article; I will cover the consistency aspects of data recovery in a later article when I have the opportunity.

To summarize, we have discussed why Ceph uses multiple replicas and why lost replicas must be rebuilt through data recovery in a timely manner, and we have noted that data consistency is one of the difficulties in the recovery process. In addition, Ceph data recovery (and distributed storage data recovery in general) faces several other difficulties:

How to detect faults and trigger data recovery automatically. This reduces the burden on operations staff and allows faults to be detected and handled in a more timely manner.

How to minimize the consumption of cluster resources during data recovery. The most obvious example is network bandwidth: whether Ceph copies an entire 4 MB object or only the data that actually differs directly affects the amount of data transmitted over the network.

Whether data recovery affects users’ online workloads, and how Ceph controls and mitigates this impact.

With these questions in mind, let’s talk about the Ceph data recovery process and key details.

Ceph failure handling process

Let’s take a look at the main process of Ceph failure handling, which consists of three steps:

  1. Perceive the cluster status: First, Ceph must be able to detect cluster faults in time, determine the status of the nodes in the cluster and which nodes have left it, and thus provide an authoritative basis for determining which data replicas are affected by the fault.

  2. Identify the data affected by the failure: Ceph computes which data replicas are missing based on the new cluster status.

  3. Restore the affected data.

Let’s take a closer look at each of these steps.

Perceiving the cluster status

A Ceph cluster is divided into a MON cluster and an OSD cluster. The MON cluster consists of an odd number of Monitor nodes, which form a decision-making quorum through the Paxos algorithm and jointly decide on and broadcast key cluster events. “OSD node out” and “OSD node in” are two such key cluster events.

The MON cluster manages the membership status of the Ceph cluster and stores OSD node status information in the OSDMap. Each OSD node periodically sends heartbeat packets to the MON cluster and to its peer OSD nodes to declare that it is online. The MON confirms that an OSD node is online when it receives the node’s heartbeat messages, and it also receives failure reports that OSD nodes raise about their peers. Based on the heartbeat interval, the failure reports, and other information, the MON decides whether an OSD node is online, updates the OSDMap, and notifies the nodes of the latest cluster status. For example, when a server goes down, heartbeat communication between its OSD nodes and the MON cluster times out, or the failure reports sent by peer OSDs exceed the threshold, and the MON cluster then marks those OSD nodes as down.
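To make this detection rule concrete, here is a minimal Python sketch of the kind of decision the MON makes. It is not Ceph source code: the thresholds loosely mirror real options such as osd_heartbeat_grace and mon_osd_min_down_reporters, and the data structures are simplified assumptions.

```python
import time

# Simplified sketch (not Ceph source code) of how a monitor could decide that
# an OSD is down, based on heartbeat age and peer failure reports.
HEARTBEAT_GRACE = 20      # seconds without a heartbeat before the OSD is suspect
MIN_DOWN_REPORTERS = 2    # distinct peers that must report the OSD as failed

def osd_should_be_marked_down(last_heartbeat, failure_reporters, now=None):
    """Return True if the OSD should be marked down in the next OSDMap epoch."""
    now = time.time() if now is None else now
    heartbeat_timed_out = (now - last_heartbeat) > HEARTBEAT_GRACE
    enough_reports = len(failure_reporters) >= MIN_DOWN_REPORTERS
    return heartbeat_timed_out or enough_reports

# Example: osd.3 has been silent for 30 s and two peers reported it as failed.
print(osd_should_be_marked_down(time.time() - 30, {"osd.1", "osd.7"}))  # True
```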

After determining that an OSD node is down, the MON uses its messaging mechanism to distribute the latest OSDMap to only some OSD nodes rather than pushing it to every node. When an OSD node (as a client or a peer) discovers during a message exchange that its own OSDMap version is too old to process I/O requests, it requests the latest OSDMap from the MON. Because the other two replicas of the PGs on each OSD node may reside on any other OSD node in the cluster, after a period of time every OSD node in the cluster has received the OSDMap update.
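The propagation logic can be pictured as a simple epoch comparison: whoever notices during a message exchange that its map is older fetches the newer one. The sketch below is only an illustration of that idea, not Ceph’s implementation.

```python
# Illustrative sketch (not Ceph code) of lazy OSDMap propagation: maps carry an
# epoch number, and whoever notices it is behind fetches the newer map.
class Daemon:
    def __init__(self, name, osdmap_epoch):
        self.name = name
        self.epoch = osdmap_epoch

    def handle_message(self, sender_epoch, fetch_latest_map):
        """On any incoming message, compare epochs and catch up if stale."""
        if sender_epoch > self.epoch:
            self.epoch = fetch_latest_map()   # e.g. request the map from a MON
            print(f"{self.name} updated OSDMap to epoch {self.epoch}")

MON_EPOCH = 42
osd = Daemon("osd.5", osdmap_epoch=40)
osd.handle_message(sender_epoch=42, fetch_latest_map=lambda: MON_EPOCH)
```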

Identify affected data

In Ceph, object data is maintained by Placement Groups (PGs). As the smallest data management unit in Ceph, a PG directly manages object data, and each OSD node manages a certain number of PGs. Client I/O requests for object data are evenly distributed among PGs based on the hash of the object ID. Each PG maintains a PGLog to record its data changes, and these records are persisted to the backend storage.
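As a rough illustration of how an object is mapped to a PG, the sketch below hashes the object name and takes it modulo the pool’s PG count. The real code uses Ceph’s rjenkins hash and a stable modulo, so treat this only as a conceptual approximation.

```python
import hashlib

# Simplified sketch of object-to-PG placement. Real Ceph uses its rjenkins hash
# and a "stable mod" of pg_num; md5 here is only to make the idea concrete.
def object_to_pg(pool_id, object_name, pg_num):
    h = int.from_bytes(hashlib.md5(object_name.encode()).digest()[:4], "little")
    pg_id = h % pg_num                 # choose one of the pool's PGs
    return f"{pool_id}.{pg_id:x}"      # PG ids are written like "1.a3"

print(object_to_pg(pool_id=1,
                   object_name="rbd_data.1234.0000000000000001",
                   pg_num=256))
```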

The PGLog records each operation and the corresponding PG version; every data change increments the PG version. By default, 3000 records are kept in the PGLog. In general, the PGLogs of the different replicas of the same PG should be consistent, but after a fault occurs they may diverge.
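Conceptually, a PGLog entry ties an operation on an object to the PG version at which it happened. The sketch below uses illustrative field names, not Ceph’s actual log format.

```python
from dataclasses import dataclass

# Conceptual sketch of a PG log entry: each data change bumps the PG version
# and records which object changed and how. Field names are illustrative only.
@dataclass
class PGLogEntry:
    op: str           # "modify", "delete", ...
    object_id: str    # the object affected by this operation
    version: tuple    # (epoch, version) of the PG after this change

pg_log = [
    PGLogEntry("modify", "obj_A", (120, 1001)),
    PGLogEntry("modify", "obj_B", (120, 1002)),
    PGLogEntry("delete", "obj_C", (121, 1003)),
]
print(len(pg_log), "entries; roughly the last 3000 are kept by default")
```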

When an OSD receives an OSDMap update message, it scans all PGs on that OSD node, deletes PGs that no longer exist (for example, PGs that have been deleted), and initializes the remaining PGs. If a PG on this OSD node is a Primary PG, it starts the Peering process. During Peering, the PG checks the consistency of its replicas against the PGLog and tries to compute which data each replica is missing, finally producing a complete list of missing objects that serves as the basis for the subsequent Recovery operation. For a PG whose missing data cannot be computed from the PGLog, the entire PG must be copied through a Backfill operation. Note that the PG’s data is not reliable until Peering completes, so the PG suspends all client I/O requests during Peering.
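To make the Peering idea concrete, the sketch below diffs an authoritative PG log against a replica’s last known version: entries the replica has not seen become its missing list, and if the logs no longer overlap, only Backfill can repair it. This is a simplification, not the real algorithm.

```python
# Simplified sketch of the Peering idea: diff the authoritative PG log against
# a replica's log. Entries newer than the replica's last version are the
# objects it is missing; if the logs no longer overlap, only Backfill (a full
# PG copy) can repair it.
def compute_missing(authoritative_log, replica_last_version):
    """authoritative_log: list of (version, object_id), oldest first."""
    oldest_version = authoritative_log[0][0]
    if replica_last_version < oldest_version:
        # Replica fell behind the retained log: logs no longer overlap.
        return [], True                      # -> Backfill the whole PG
    missing = [obj for ver, obj in authoritative_log
               if ver > replica_last_version]
    return missing, False                    # -> Recovery of just these objects

log = [((120, 1001), "obj_A"), ((120, 1002), "obj_B"), ((121, 1003), "obj_C")]
print(compute_missing(log, replica_last_version=(120, 1001)))
# (['obj_B', 'obj_C'], False)
```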

Data recovery

After Peering completes, the PG enters the Active state and, depending on the state of its replicas, marks itself as Degraded or Undersized. In the Degraded state, the number of entries kept in the PGLog is expanded to 10000 by default, providing more history so that data can be recovered from the log once the failed replica comes back online. After entering the Active state, the PG becomes available and begins to accept data I/O requests, and it decides whether to perform Recovery and Backfill operations according to the Peering results.
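The log-length behaviour can be summarized with a tiny sketch. The option names osd_min_pg_log_entries and osd_max_pg_log_entries are real Ceph settings (3000 and 10000 by default in the versions discussed here), but the function itself is only an illustration.

```python
# Illustration only: how many PG log entries are retained depends on whether
# the PG is degraded. The option names are real Ceph settings; this logic is a
# simplification of the actual trimming code.
OSD_MIN_PG_LOG_ENTRIES = 3000    # normal (clean) PG
OSD_MAX_PG_LOG_ENTRIES = 10000   # degraded PG keeps more history

def pg_log_retention(is_degraded):
    return OSD_MAX_PG_LOG_ENTRIES if is_degraded else OSD_MIN_PG_LOG_ENTRIES

print(pg_log_retention(is_degraded=True))   # 10000
```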

The Primary PG copies object data according to the missing lists. For data missing on a Replica PG, the Primary pushes the missing objects to it (Push); for data missing on the Primary itself, the Primary fetches the missing objects from a replica (Pull). During Recovery, the PG transfers entire 4 MB objects. For PGs that cannot rely on the PGLog for Recovery, a Backfill operation performs a full copy of the PG’s data. After the data is synchronized, the PG is marked Clean: all replicas are consistent and data recovery is complete.
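Here is a hedged sketch of the Recovery direction decision: the Primary pulls objects it is missing itself and pushes objects its replicas are missing. The code mirrors Ceph’s Push/Pull operations only conceptually; the data structures are assumptions.

```python
# Illustrative sketch of the Recovery direction: the Primary pulls objects it
# is missing itself, and pushes objects that its replicas are missing.
def plan_recovery(primary_missing, replica_missing):
    ops = []
    for obj in primary_missing:
        ops.append(("PULL", obj))               # fetch from a replica that has it
    for replica, objs in replica_missing.items():
        for obj in objs:
            ops.append(("PUSH", obj, replica))  # send our copy to that replica
    return ops

print(plan_recovery(primary_missing=["obj_B"],
                    replica_missing={"osd.7": ["obj_C"]}))
```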

Control recovery impact

From Ceph’s failure handling process, we can see how it deals with the common problems associated with cluster failures. The first is reducing resource consumption: for failures such as a power outage followed by a restart, Ceph can recover only the changed data, reducing the amount of data to be recovered. In addition, the MON does not proactively push the cluster status to all OSD nodes; instead, OSD nodes fetch the latest OSDMap themselves, which prevents bursts of traffic when a large-scale cluster failure occurs.

In addition, Ceph processes I/O through the Primary PG. If the OSD hosting the Primary PG goes down, I/O cannot be handled properly. To ensure that normal service I/O is not interrupted during data recovery, the MON assigns a PG Temp to handle I/O requests temporarily and removes it after data recovery is complete.
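The PG Temp idea can be pictured as temporarily swapping the acting set so that an OSD holding a complete copy keeps serving I/O while the newly mapped primary backfills. The sketch below is conceptual only; the names are illustrative.

```python
# Conceptual sketch of PG Temp: while the newly mapped primary has no data yet,
# the MON records a temporary acting set so that an OSD with a complete copy
# keeps serving I/O; the pg_temp entry is removed once recovery finishes.
def acting_set(up_set, pg_temp=None):
    """Return which OSDs actually serve I/O for this PG right now."""
    return pg_temp if pg_temp else up_set

up = ["osd.9", "osd.2", "osd.5"]        # osd.9 is newly mapped and still empty
temp = ["osd.2", "osd.5", "osd.9"]      # osd.2, which has full data, acts as primary
print(acting_set(up, pg_temp=temp))     # served by osd.2 while osd.9 backfills
```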

During recovery, Ceph allows users to adjust the number of recovery threads, the number of concurrent recovery operations, and the priority of recovery traffic on the network to limit the recovery speed and reduce the impact on normal services.
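The option names below are real Ceph settings commonly used to throttle recovery (defaults vary by release); the snippet is just a convenient way to list them here, not a configuration API.

```python
# Real Ceph option names that throttle recovery; values shown are typical
# defaults on older releases and are listed here for illustration only.
recovery_throttles = {
    "osd_max_backfills": 1,            # concurrent backfills per OSD
    "osd_recovery_max_active": 3,      # concurrent recovery ops per OSD
    "osd_recovery_op_priority": 3,     # lower than client op priority
}
for opt, value in recovery_throttles.items():
    print(f"{opt} = {value}")
```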

Conclusion

Above, I have discussed the main technical process and key points of Ceph data recovery. As you can see, Ceph’s data recovery is well designed down to the details. However, the granularity of recovery is still relatively coarse, which costs extra I/O, and passive OSDMap updates may delay a PG’s recovery of faulty data, lowering data reliability for a period of time.

These points also remind us to pay attention to the following details in daily Ceph operations:

The PGLog is the key basis for Ceph data recovery, but the number of entries it keeps is limited. After a fault occurs, bring the failed node back online as soon as possible to avoid a Backfill operation; this can greatly shorten the recovery time.

PGs block client I/O during the Peering phase, so make sure PGs can complete Peering quickly.

If the PGLog is lost during the failure, Peering cannot proceed and the PG enters the Incomplete state. In this case, the faulty node should be brought back online to help complete the data repair.

Although Ceph’s Recovery operation avoids much unnecessary object data copying, it still copies whole objects. For further optimization, we could record in the PGLog the specific data range each operation touches, or use an rsync-like mechanism to recover only the differences between object replicas, as sketched below.
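To illustrate the suggested optimization (a hypothetical extension, not current Ceph behaviour), suppose each PGLog entry also recorded the byte range an operation modified; recovery could then transfer only those extents instead of whole 4 MB objects.

```python
# Hypothetical sketch of the optimization suggested above, NOT current Ceph
# behaviour: if each log entry also recorded the modified byte range, recovery
# could transfer only that extent instead of the whole 4 MB object.
def bytes_to_transfer(missing_entries, object_size=4 * 1024 * 1024):
    full_copy = len({e["object"] for e in missing_entries}) * object_size
    delta_copy = sum(e["length"] for e in missing_entries)
    return full_copy, delta_copy

entries = [
    {"object": "obj_A", "offset": 0,      "length": 64 * 1024},
    {"object": "obj_A", "offset": 131072, "length": 4096},
]
full, delta = bytes_to_transfer(entries)
print(f"full object copy: {full} bytes, extent-based copy: {delta} bytes")
```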

By briefly walking through how Ceph handles cluster failures, this article should help us maintain Ceph clusters better in daily use and ensure the continuous availability of business data. In the future, we will publish a series of articles analyzing the details of cloud computing and storage technologies, along with the problems and solutions encountered in real application scenarios. Please look forward to the next technical article from Yanrong.