1. Crush algorithm and its functions

CRUSH, short for Controlled Replication Under Scalable Hashing, is a controllable, scalable, and distributed replica placement algorithm. Ceph uses CRUSH to compute where data should be stored, which determines how data is written and later retrieved. Its main functions are:

  • Ensure balanced data distribution

    Data is spread evenly across nodes, so that read and write load is balanced across nodes and disks.

  • Flexible scalability of clusters

    Nodes can be added or removed easily and quickly; failed nodes are detected and handled automatically; data is rebalanced automatically, moving as little data as possible.

  • Support for larger clusters

    The data distribution algorithm keeps its metadata and computation small: as the cluster grows, the overhead of data placement stays small instead of growing linearly.

2. Crush algorithm description

The PG-to-OSD mapping algorithm is the CRUSH algorithm. It is a pseudo-random process that selects an OSD set from all OSD nodes.

CRUSH enables Ceph clients to communicate with OSDs directly, rather than through a centralized server or broker. By determining data placement and retrieval algorithmically, Ceph avoids single points of failure, performance bottlenecks, and physical limits on scalability.

The CRUSH map describes all hardware resources of the system as a tree, and then applies fault-tolerance rules to this structure to produce a logical tree. The leaf nodes of the tree are devices, i.e. OSDs; all other nodes are called bucket nodes, virtual nodes abstracted from the physical structure: data center, machine room, rack, and host.
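For a rough picture (a minimal sketch, not an actual CRUSH map; all bucket and host names below are made up), such a hierarchy might be modelled like this:

```python
# Hypothetical CRUSH-map-like hierarchy: every non-leaf node is a bucket,
# every leaf is an OSD device. Names and layout are illustrative only.
crush_map = {
    "root default": {
        "datacenter dc1": {
            "room room1": {
                "rack rack1": {
                    "host node1": ["osd.0", "osd.1"],
                    "host node2": ["osd.2", "osd.3"],
                },
                "rack rack2": {
                    "host node3": ["osd.4", "osd.5"],
                },
            },
        },
    },
}
```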

3. Principle of Crush algorithm

  1. Storage structure of Ceph

    Ceph first builds a pool to store objects. The pool can be likened to a warehouse. Saving a new object is like putting a package into the warehouse.

    To improve efficiency, a Pool is divided into Placement Groups (PGs). This is like a warehouse with different racks: all the racks together form the warehouse, and all the PGs together form the Pool.

  2. Assigning objects to PGs

    How is an object assigned to a PG? Assume the Pool is named rbd and has 256 PGs, numbered 0x0, 0x1, 0x2, ..., 0xFF. How should objects be distributed across them? This can be done with a hash.

    Suppose we have two object names, bar and foo, and hash the object names:

    HASH('bar') = 0x3E0A4162

    HASH('foo') = 0x7FE391A0

    The hash yields what looks like a random hexadecimal value, but for the same object name the result is always the same. Since we pre-allocated 256 PGs, the value still has to be folded into the range 0x0 to 0xFF by taking it modulo the PG count (0x100):

    0x3E0A4162 % 0x100 ===> 0x62

    0x7FE391A0 % 0x100 ===> 0xA0

    In Ceph there are many pools, and each Pool contains several PGs. If two pools contain PGs with the same number, how are they told apart? Ceph assigns an ID to each pool: for example, the first pool gets ID 0 and the next pool gets ID 1. The full PG name is the pool ID plus the PG number, so the bar object is stored in PG 0.62 (pool 0, PG 0x62).
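    The same calculation can be sketched in a few lines of Python. Ceph actually uses its own rjenkins hash, so the numeric values differ from the ones above; only the principle is shown:

```python
import zlib

PG_NUM = 0x100   # 256 PGs in this example

# zlib.crc32 is only a stand-in for Ceph's hash (Ceph uses rjenkins),
# so the resulting PG numbers differ from the text; the idea is the same.
for name in ("bar", "foo"):
    h = zlib.crc32(name.encode())
    print(f"HASH('{name}') = {h:#010x}  ->  PG {h % PG_NUM:#04x}")
```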

  3. Assigning PGs to OSDs

    Ceph uses a disk or partition as an OSD node. At the logical level, objects are stored in PGs, so PGs now have to be connected to OSD nodes. This is what the CRUSH algorithm does: compute the PG -> OSD mapping.

    Combining the above, there are two main steps:

    pool_id + HASH(object_name) % pg_num ===> pg_id

    CRUSH(pg_id) ===> OSD (object storage device location)
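    Put together, placement is a pure function of the object name, the pool, and the cluster map. A conceptual sketch of the two steps (crush() is only a placeholder here; the real selection is described in the next section):

```python
import zlib
from typing import List, Tuple

def crush(pg: Tuple[int, int], osd_ids: List[int], replicas: int = 3) -> List[int]:
    """Placeholder for the weighted straw selection sketched below;
    here it simply returns the first `replicas` OSDs."""
    return osd_ids[:replicas]

def locate(object_name: str, pool_id: int, pg_num: int, osd_ids: List[int]) -> List[int]:
    # Step 1: object name -> PG (hash, then modulo the PG count).
    pg_id = zlib.crc32(object_name.encode()) % pg_num
    # Step 2: PG -> ordered OSD set via CRUSH.
    return crush((pool_id, pg_id), osd_ids)

print(locate("bar", pool_id=0, pg_num=256, osd_ids=[0, 1, 2, 3, 4]))
```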

  4. Why Crush

    Why not simply map a PG to an OSD with HASH(pg_id) % osd_num, instead of CRUSH(pg_id)? That approach has several problems:

    1) If an OSD fails, osd_num changes and every remainder changes with it, so almost all existing data may have to be reshuffled. A good storage architecture should minimize data migration when a failure occurs, which is exactly what CRUSH does.

    2) If an OSD is added, osd_num increases, which again reshuffles the data. CRUSH, by contrast, spreads data evenly onto the newly added machine without reshuffling the rest (the modulo problem is illustrated after this list).

    3) If multiple copies are stored, the mapping must yield multiple OSDs as output. A plain hash yields only one result, but Ceph's CRUSH algorithm can yield several.
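    The first two problems are easy to demonstrate: with a plain modulo mapping, adding a single OSD changes the placement of the vast majority of PGs, which is exactly the reshuffling CRUSH is designed to avoid. A purely illustrative experiment (not Ceph code):

```python
def naive_map(pg_ids, osd_num):
    """Map each PG to an OSD with a plain modulo - the approach CRUSH avoids."""
    return {pg: pg % osd_num for pg in pg_ids}

pgs = range(256)
before = naive_map(pgs, 10)   # 10 OSDs
after = naive_map(pgs, 11)    # one OSD added
moved = sum(1 for pg in pgs if before[pg] != after[pg])
print(f"{moved}/256 PGs would move to a different OSD")  # the vast majority
```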

  5. How to implement Crush algorithm

    OSD nodes have different capacities, for example 4 TB or 800 GB. Each OSD node can be given a weight based on its capacity: for example, a 4 TB OSD gets weight 4.0 and an 800 GB OSD gets weight 0.8.

    How are PGs mapped onto OSDs with different weights? CRUSH uses the straw method: every OSD "draws a straw", and the OSD with the longest straw wins, where the straw length is scaled by the OSD's weight. If data always went to the OSD with the largest capacity, that node would quickly fill up, so a weighted pseudo-random algorithm is needed instead.

    Main steps:

    • Compute a hash: CRUSH_HASH(pg_id, osd_id, r) ===> draw

      With r as a constant, pg_id and osd_id are fed into the hash together to produce a hash value, the draw.

    • Apply the OSD weight: (draw & 0xFFFF) * osd_weight ===> osd_straw

      Combining the hash value with the OSD node's weight gives each OSD node its straw length; a larger weight yields a larger value.

    • Traverse all OSDs and select the largest straw: high_draw

    In short, CRUSH randomly selects an OSD node, and OSD nodes with larger weights have a higher probability of being selected. To keep the choice random, each OSD node's weight is multiplied by a pseudo-random number (the hash value), and the node with the largest product is selected. Over a large enough number of selections the influence of the random number averages out, and the deciding factor becomes the OSD weight: the greater the weight, the greater the probability of being selected, which yields an effective data distribution.

    The random number CRUSH uses is obtained from a hash, which guarantees the same output for the same input. So CRUSH is not truly random; it is a pseudo-random algorithm.

    So far this calculates only one OSD node, but Ceph clusters keep multiple copies. How is one PG mapped to multiple OSD nodes?

    Increase the constant r by 1 and repeat the calculation. If the resulting OSD differs from the ones already selected, keep it; if it is the same, increase r again and recalculate, until three different OSDs have been selected.

    For example, with r = 0 the (draw & 0xFFFF) * weight calculation picks the OSD with the largest straw as the first (primary) OSD; with r = 1 the same calculation, with a new draw, picks the second OSD; and with r = 2 it yields the third OSD. A minimal sketch of this selection follows.
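    Ceph's real implementation lives in the CRUSH C code (the straw/straw2 bucket types); the sketch below only mirrors the steps listed above, with a generic hash standing in for CRUSH_HASH and made-up OSD weights:

```python
import hashlib

def crush_hash(pg_id: int, osd_id: int, r: int) -> int:
    """Stand-in for CRUSH_HASH(pg_id, osd_id, r): deterministic but pseudo-random."""
    data = f"{pg_id}-{osd_id}-{r}".encode()
    return int.from_bytes(hashlib.md5(data).digest()[:4], "big")

def select_osds(pg_id: int, osd_weights: dict, replicas: int = 3) -> list:
    """Straw-style selection: every OSD draws a straw scaled by its weight and
    the longest straw wins; r is bumped until `replicas` distinct OSDs are found."""
    chosen = []
    r = 0
    while len(chosen) < replicas:
        straws = {
            osd: (crush_hash(pg_id, osd, r) & 0xFFFF) * weight
            for osd, weight in osd_weights.items()
        }
        winner = max(straws, key=straws.get)
        if winner not in chosen:   # keep only winners not selected yet
            chosen.append(winner)
        r += 1
    return chosen

# Made-up weights: e.g. a 4 TB OSD gets weight 4.0, an 800 GB OSD gets 0.8.
weights = {0: 4.0, 1: 4.0, 2: 0.8, 3: 2.0, 4: 1.0}
print(select_osds(pg_id=0x62, osd_weights=weights))
```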

4. IO Flowchart

Steps:

  1. The client connects to the Monitor and obtains the cluster map.
  2. Because the newly added primary OSD1 has no PG data yet, it actively reports to the Monitor and asks OSD2 to act as the temporary primary.
  3. The temporary primary OSD2 fully synchronizes its data to the new primary OSD1.
  4. For client I/O, the client connects directly to the temporary primary OSD2 to read and write.
  5. OSD2 receives the read/write I/O and simultaneously writes to the other two replica nodes.
  6. It then waits until OSD2 itself and the other two replicas have written successfully.
  7. Once all three copies are written, OSD2 replies to the client, and the client's I/O is complete.
  8. When OSD1's data is fully synchronized, the temporary primary OSD2 relinquishes the primary role.
  9. OSD1 becomes the primary node and OSD2 becomes a replica.

5. Ceph communication mechanism

There are three different implementations of the network communication framework:

  • Simple thread mode
    • Features: For each network connection, two threads are created, one for receiving and one for sending.
    • Disadvantages: A large number of connections creates a large number of threads, which consumes CPU resources and hurts performance.
  • I/O multiplexing mode of Async events
    • Features: This is currently the most widely used mode; newer versions use the Async asynchronous mode by default.
  • XIO, implemented on top of Accelio, an open-source network communication library
    • Features: This mode depends on the stability of the third-party Accelio library and is currently experimental.

The content of a message is divided into three parts (a rough sketch of this layout follows the list):

  • Header // Message header; the envelope of the message
  • User data // Actual data to be sent
    • Payload // Saves metadata
    • Middle // Reserved field
    • Data // Reads and writes data
  • Footer // The end of the message
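A rough picture of that layout (field names follow the list above; the types are illustrative, not Ceph's actual wire format):

```python
from dataclasses import dataclass

@dataclass
class CephMessage:
    """Illustrative layout only, not Ceph's real on-wire encoding."""
    header: bytes   # message type and routing info: the "envelope"
    payload: bytes  # metadata describing the operation
    middle: bytes   # reserved field
    data: bytes     # bulk read/write data
    footer: bytes   # end-of-message marker
```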

Steps:

  • The Accepter listens for new connections; when one arrives it calls SimpleMessenger::add_accept_pipe() to create a Pipe, which is added to SimpleMessenger::pipes, to process the request.

  • Pipe is used to read and send messages. This class has two main components, Pipe::Reader and Pipe::Writer, which handle message reading and sending.

  • Messenger is the publisher of the message, and each Dispatcher subclass is the subscriber of the message. Once Messenger receives the message, it reads it through Pipe and forwards it to the Dispatcher for processing.

  • Dispatcher is the subscriber base class. A subscribing back end inherits from this class and, during initialization, registers itself in Messenger::dispatchers via Messenger::add_dispatcher_tail()/add_dispatcher_head(). When a message arrives, the registered Dispatcher is notified to process it.

  • The DispatchQueue class caches incoming messages and then wakes up its DispatchQueue::dispatch_thread, which finds the registered back-end Dispatcher to process the messages (see the sketch after this list).
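The publish/subscribe relationship can be summarised in a small sketch; the class and method names echo the Ceph ones above, but the code itself is purely illustrative:

```python
class Dispatcher:
    """Subscriber base class: back ends override ms_dispatch()."""
    def ms_dispatch(self, message) -> bool:
        raise NotImplementedError

class Messenger:
    """Publisher: received messages are forwarded to registered dispatchers."""
    def __init__(self):
        self.dispatchers = []

    def add_dispatcher_tail(self, dispatcher: Dispatcher):
        self.dispatchers.append(dispatcher)

    def deliver(self, message):
        # In Ceph the DispatchQueue thread does this; here we call directly.
        for d in self.dispatchers:
            if d.ms_dispatch(message):
                break

class OSDService(Dispatcher):          # hypothetical subscriber
    def ms_dispatch(self, message) -> bool:
        print("handling", message)
        return True

msgr = Messenger()
msgr.add_dispatcher_tail(OSDService())
msgr.deliver("osd_op")
```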

6. Ceph RBD block storage I/O flow chart

OSD write process:

  1. The application uses librbd to create a block device and writes data to it (see the example after this list).
  2. The librados interface is invoked locally on the client, and the request is mapped layer by layer through pool, RBD image, object, and PG. At the PG layer the client learns which three OSD nodes store the data; among these three OSDs one is the primary and the others are replicas.
  3. The client establishes a SOCKET connection with the primary OSD node and sends the data to be written to it; the primary OSD then forwards the data to the other replica OSD nodes.
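For reference, step 1 looks roughly like this with the Ceph Python bindings (python3-rados and python3-rbd); the pool name "rbd", the image name "myimage", and the config path are assumptions:

```python
import rados
import rbd

# Connect to the cluster using the local ceph.conf and open the "rbd" pool.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")

try:
    # Create a 1 GiB image and write some data to it via librbd.
    rbd.RBD().create(ioctx, "myimage", 1 * 1024**3)
    image = rbd.Image(ioctx, "myimage")
    try:
        image.write(b"hello ceph", 0)   # librbd maps this to object/PG/OSD
    finally:
        image.close()
finally:
    ioctx.close()
    cluster.shutdown()
```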

7. Ceph heartbeat and fault detection mechanism

Question:

How to reduce the fault detection time and the load caused by heartbeat packets?

  1. If the heartbeat frequency is too high, too many heartbeat packets affect system performance.
  2. If the heartbeat frequency is too low, the time for discovering faulty nodes is prolonged, which affects system availability.

A fault detection strategy should provide:

Timeliness: The cluster can sense node anomalies such as outages or network outages within an acceptable time range.

Appropriate pressure: both on the nodes and on the network.

Tolerance of network jitter: the network may occasionally be delayed.

Diffusion mechanism: metadata changes caused by node status changes need to spread to the whole cluster through some mechanism.

The OSD node listens on public, cluster, front, and back ports:

  • Public port: Listens for connections from Monitor and Client.
  • Cluster port: listens for connections from OSD peers.
  • Front port: uses the NIC that clients use to connect to the cluster; OSDs also use it for heartbeats between one another.
  • Back port: uses the cluster-internal NIC; heartbeats between OSDs also run over it (see the ceph.conf snippet after this list).
  • Hbclient: the Messenger that sends ping heartbeats.
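Which NIC carries the front and back traffic is controlled by the public/cluster network settings in ceph.conf; for example (the subnets are placeholders):

```ini
[global]
# front traffic: clients, monitors, and the front heartbeats
public network = 10.0.0.0/24
# back traffic: OSD replication and the back heartbeats
cluster network = 192.168.0.0/24
```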

Heartbeat detection between Ceph OSD nodes

  • OSD nodes in a PG heartbeat with each other and send PING and PONG information to each other.

  • Check every 6s (a random time is actually added to this to avoid spikes).

  • If no heartbeat reply is detected within 20s, the peer OSD is added to the failure queue (the corresponding configuration options are shown below).
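Both intervals correspond to configuration options; the values below are the long-standing defaults, but verify them against your Ceph version:

```ini
[osd]
# seconds between heartbeat pings to peer OSDs (jitter is added)
osd heartbeat interval = 6
# seconds without a reply before the peer is considered failed
osd heartbeat grace = 20
```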


This article was created and shared by Mirson. For further communication, please join QQ group 19310171 or visit www.softart.cn