1. Introduction

The CRUSH algorithm determines how to store and retrieve data by computing where it should be stored. CRUSH allows Ceph clients to communicate directly with OSDs rather than going through a central server or broker. By using an algorithm to place and locate data, Ceph avoids single points of failure, performance bottlenecks, and the physical limits of scaling.

CRUSH requires a map of the cluster, and it uses this CRUSH Map to distribute data pseudo-randomly and as evenly as possible across the OSD nodes in the whole cluster. The CRUSH Map contains a list of OSDs, a list of “buckets” that aggregate devices into physical locations, and a list of rules that tell CRUSH how to replicate data in a storage pool.

It is also possible to manage the CRUSH Map entirely by hand by setting the following option in the configuration file:

osd crush update on start = false
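For context, a minimal ceph.conf sketch with this option is shown below; placing it in the [osd] section is an assumption for illustration, as it can also be set in [global]:

[osd]
# do not let OSDs update their own CRUSH location on startup
osd crush update on start = false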

2. Operate CRUSH Map

2.1 Extracting CRUSH Map

$ ceph osd getcrushmap -o {compiled-crushmap-filename}
$ ceph osd getcrushmap -o /tmp/crush
$ crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename}
$ crushtool -d /tmp/crush -o /tmp/decompiled_crush

2.2 Injecting CRUSH Map

$ crushtool -c {decompiled-crushmap-filename} -o {compiled-crushmap-filename}
$ crushtool -c /tmp/decompiled_crush -o /tmp/crush_new
$ ceph osd setcrushmap -i {compiled-crushmap-filename}
$ ceph osd setcrushmap -i /tmp/crush_new
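Before injecting a new map into a live cluster, it can help to sanity-check the compiled map with crushtool's test mode. A minimal sketch, assuming rule 0 and two replicas (both values are placeholders):

$ crushtool -i /tmp/crush_new --test --show-utilization --rule 0 --num-rep 2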

3. CRUSH Map parameters

A CRUSH Map consists of four sections:

  • Devices: a list of object storage devices, that is, the storage corresponding to each ceph-osd daemon. Every OSD defined in the Ceph configuration file should have one device entry.
  • Bucket types: define the types of buckets used in the CRUSH hierarchy. Buckets aggregate storage locations level by level (for example row, rack, chassis, host) and carry weights.
  • Bucket instances: once the bucket types are defined, you must declare bucket instances for your hosts and for any other failure domains you plan to use.
  • Rules: the methods by which buckets are selected.

3.1 CRUSH Map device

To map PGs to OSDs, the CRUSH Map needs a list of OSDs (that is, the names of the OSD daemons defined in the configuration file), so they appear first in the CRUSH Map. To declare a device in the CRUSH Map, add a new line under the device list and enter device followed by a unique numeric ID, followed by the name of the corresponding ceph-osd daemon instance.

# devices
device {num} {osd.name}

# Example:
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3

3.2 Bucket type of CRUSH Map

The second list in the CRUSH Map defines bucket types, which organize the hierarchy of nodes and leaves. Node (non-leaf) buckets generally represent physical locations in the hierarchy; a node aggregates other nodes or leaves. Leaf buckets represent ceph-osd daemons and their corresponding storage media.

To add a bucket type to the CRUSH Map, add a new line below the list of existing bucket types: type, followed by a unique numeric ID and a bucket type name. By convention, type 0 names the leaf bucket, but you can give it any name you like (e.g. osd, disk, drive, storage):

# types
type {num} {bucket-name}

# Example:
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

3.3 Bucket hierarchy of CRUSH Map

The CRUSH algorithm distributes data objects among storage devices according to each device's weight, aiming for an approximately uniform probability distribution. CRUSH distributes objects and their replicas according to the cluster map you define. The CRUSH Map represents the available storage devices and the logical units that contain them.

To map PGs to OSDs across failure domains, the CRUSH Map defines a series of hierarchical bucket types (i.e. under # types in the existing CRUSH Map). The purpose of building a bucket hierarchy is to isolate leaf nodes by failure domain, such as host, chassis, rack, power distribution unit, pod, row, room, and data center. Apart from the OSDs, which are the leaf nodes, the hierarchy is arbitrary and can be defined as needed.

To declare a bucket instance, you must specify its type, a unique name (string), a unique negative integer ID (optional), a weight reflecting the total capacity/capability of its items, a bucket algorithm (usually straw), and a hash (usually 0, meaning rjenkins1). A bucket may contain one or more items, which can be node buckets or leaves, and each item can carry a weight reflecting its relative share.

You can declare a node bucket with the following syntax:

[bucket-type] [bucket-name] {
        id [a unique negative numeric ID]
        weight [the relative capacity/capability of the item(s)]
        alg [the bucket type: uniform | list | tree | straw ]
        hash [the hash type: 0 by default]
        item [item-name] weight [weight]
}

# For example, we can define two host buckets and one rack bucket; the rack bucket
# contains the two host buckets, and the OSDs are declared as items in the host buckets.
host node1 {
        id -1
        alg straw
        hash 0
        item osd.0 weight 1.00
        item osd.1 weight 1.00
}

host node2 {
        id -2
        alg straw
        hash 0
        item osd.2 weight 1.00
        item osd.3 weight 1.00
}

rack rack1 {
        id -3
        alg straw
        hash 0
        item node1 weight 2.00
        item node2 weight 2.00
}

3.3.1 Adjusting the Weight of buckets

Ceph uses double-precision floating-point values to represent bucket weights. Weight is not the same as device capacity; it is recommended to use 1.00 as the relative weight of a 1TB storage device, in which case 0.50 represents 500GB and 3.00 represents 3TB. The weight of a higher-level bucket is the sum of the weights of all the leaf buckets beneath it.

Bucket weight is one-dimensional, but you can also calculate item weights to reflect storage device performance. For example, if you have many 1TB drives, some with relatively low data transfer rates and others with relatively high data transfer rates, you should give them different weights even though they have the same capacity (for example, a weight of 0.80 for a low-throughput disk and 1.20 for a high-throughput one).
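As an illustration of the previous paragraph, a hypothetical host bucket mixing slow and fast 1TB disks could be declared like this (the host name, OSD ids, and weights are made up for the example):

host node3 {
        id -4
        alg straw
        hash 0
        # slow 1TB disk
        item osd.4 weight 0.80
        # fast 1TB disk
        item osd.5 weight 1.20
}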

3.4 Rules of CRUSH Map

The CRUSH Map supports the notion of “CRUSH rules” to determine how data is distributed within a storage pool. A CRUSH rule defines a placement and replication strategy, or distribution policy, that dictates how CRUSH places copies of objects. For large clusters you might create many storage pools, each with its own CRUSH ruleset and rules. The default CRUSH Map has one rule for each default storage pool, with one ruleset assigned to each.

Note: In most cases you do not need to change the default rules. The ruleset of a newly created storage pool is 0 by default.
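If a pool should use a ruleset other than 0, it can be assigned explicitly. A sketch for releases that still use the ruleset concept ({pool-name} and {ruleset-number} are placeholders):

$ ceph osd pool set {pool-name} crush_ruleset {ruleset-number}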

The rule format is as follows:

rule <rulename> {
 
        ruleset <ruleset>
        type [ replicated | erasure ]
        min_size <min-size>
        max_size <max-size>
        step take <bucket-name>
        step [choose|chooseleaf] [firstn|indep] <N> type <bucket-type>
        step emit
}

Parameter Description:

  • Ruleset: identifies the ruleset that the rule belongs to. It takes effect once the ruleset is assigned to a storage pool.
  • Type: rule type. Currently only replicated and erasure are supported; the default is replicated.
  • Min_size: if a storage pool makes fewer replicas than this number, CRUSH will not select this rule.
  • Max_size: if a storage pool makes more replicas than this number, CRUSH will not select this rule.
  • Step take {bucket-name}: takes the named bucket as the starting point and begins iterating down the tree.
  • Step choose firstn {num} type {bucket-type}: selects {num} buckets of the given type, where the number is usually the number of replicas in the storage pool (the pool size). If {num} == 0, choose pool-num-replicas buckets (all available); if {num} > 0 && < pool-num-replicas, choose that many buckets; if {num} < 0, it means pool-num-replicas - {num} buckets are chosen.
  • Step chooseleaf firstn {num} type {bucket-type}: selects a set of buckets of type {bucket-type} and then picks a leaf node from the subtree of each bucket. The number of buckets is usually the number of replicas in the storage pool (the pool size). If {num} == 0, choose pool-num-replicas buckets (all available); if {num} > 0 && < pool-num-replicas, choose that many buckets; if {num} < 0, it means pool-num-replicas - {num} buckets are chosen.
  • Step emit: outputs the current values and empties the stack. Usually used at the end of a rule, but it can also be used when the same rule applies to different trees.
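Putting these pieces together, a typical replicated rule looks roughly like the sketch below, which places each replica on a different host; the rule name and the bucket name default mirror common defaults and are illustrative rather than anything defined earlier in this article:

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}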

4. Primary affinity

When a Ceph client reads or writes data, it always contacts the primary OSD in the acting set (for example, osd.2 is the primary in [2, 3, 4]). Sometimes an OSD is a poor choice for primary compared with the others (for example, it has a slow disk or controller). To prevent performance bottlenecks (especially on read operations) while still making full use of the hardware, you can adjust an OSD's primary affinity so that CRUSH is less likely to use it as the primary in an acting set.

ceph osd primary-affinity <osd-id> <weight>

The default primary affinity is 1 (that is, the OSD may act as a primary). The value ranges from 0 to 1, where 0 means the OSD will not be used as a primary and 1 means it may be used as a primary.

If the weight is less than 1, CRUSH is less likely to select that OSD as the primary.
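For example, to make osd.2 less likely to be chosen as primary (the OSD id and the weight 0.5 are arbitrary values for illustration):

$ ceph osd primary-affinity osd.2 0.5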

5. Add or move an OSD node

To add or move the CRUSH Map entry for an OSD in an online cluster, run the ceph osd crush set command.

ceph osd crush set {id-or-name} {weight} {bucket-type}={bucket-name} [{bucket-type}={bucket-name} ...]
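For instance, the following sketch places osd.0 with weight 1.0 under a host and rack; the location names reuse the illustrative hierarchy from section 3.3 plus an assumed root bucket named default:

$ ceph osd crush set osd.0 1.0 root=default rack=rack1 host=node1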

6. Adjust the OSD CRUSH weight

To adjust the CRUSH weight of an OSD node in an online cluster, run the following command:

ceph osd crush reweight {name} {weight}
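For example, to change the CRUSH weight of osd.0 to 2.0 (both the OSD id and the new weight are arbitrary values here):

$ ceph osd crush reweight osd.0 2.0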

7. Remove OSD

To remove an OSD from the CRUSH Map of an online cluster, run the following command:

# Delete osd.0 from the CRUSH Map
$ ceph osd crush rm osd.0

8. Add a bucket

To create a bucket in the CRUSH Map of an online cluster, run the ceph osd crush add-bucket command:

ceph osd crush add-bucket {bucket-name} {bucket-type}
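For example, to create a new rack bucket with the hypothetical name rack2:

$ ceph osd crush add-bucket rack2 rack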

9. Move a bucket

To move a bucket to a different location in the CRUSH Map, run the following command:

ceph osd crush move {bucket-name} {bucket-type}={bucket-name} [{bucket-type}={bucket-name} ...]
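For example, to move the hypothetical rack2 bucket under a root bucket named default (both names are assumptions):

$ ceph osd crush move rack2 root=default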
10. Remove a bucket

To remove a bucket from the CRUSH Map hierarchy, run the following command:

ceph osd crush remove {bucket-name}

Note: The bucket must be empty when removed from the CRUSH hierarchy.
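For example, once the hypothetical rack2 bucket used above is empty:

$ ceph osd crush remove rack2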