I consider myself very lucky to have had the opportunity to work with another large distributed storage solution.

It means I can compare HDFS and Ceph, two almost completely different storage systems, along with their strengths, weaknesses, and the scenarios each suits.

From an SRE's perspective, I think distributed storage, especially open-source distributed storage, mainly solves the following problems for commercial companies:

  • Scalability, to meet the massive data storage requirements brought by business growth.
  • Lower cost than commercial storage, greatly reducing spend.
  • Stability, controllability, and ease of operation.

The overall goal: easy to use, cheap, and stable. But the reality, it seems, is not so rosy…

Based on these three fundamental values, this article analyzes my experience operating and maintaining Ceph, and compares it with centralized distributed storage systems such as HDFS.

Scalability

Ceph claims to be infinitely scalable because it is based on the CRUSH algorithm and has no central node. Ceph can indeed be scaled out indefinitely, but the process of doing so is not pleasant.

A new object in Ceph is first hashed into a predefined placement-group (PG) layer; each PG is then mapped, via a hash over all the OSDs on all the physical machines in the cluster, to specific OSDs, and finally lands on physical disks.

In other words, all Ceph objects are pre-hashed into a fixed number of buckets (PGs), and each PG is then placed onto specific machines and disks according to the crushmap, the cluster's overall physical topology.
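
To make the two-stage mapping concrete, here is a minimal sketch in Python. It is only an illustration of the idea: real Ceph uses its own rjenkins-based hashing and the CRUSH algorithm over the crushmap, not the plain modulo and seeded random choice used here, and the names (`PG_NUM`, `osd.N`) are just placeholders.

```python
import hashlib
import random

PG_NUM = 1024                              # fixed number of buckets (PGs)
REPLICAS = 3
OSDS = [f"osd.{i}" for i in range(100)]    # hypothetical cluster of 100 OSDs

def object_to_pg(obj_name: str) -> int:
    """Stage 1: hash the object name into one of a fixed number of PGs."""
    h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
    return h % PG_NUM

def pg_to_osds(pg_id: int) -> list:
    """Stage 2: map the PG onto concrete OSDs. A stand-in for CRUSH:
    a deterministic pseudo-random choice seeded by the PG id, so the result
    changes whenever the OSD list (i.e. the crushmap) changes."""
    rng = random.Random(pg_id)
    return rng.sample(OSDS, REPLICAS)

pg = object_to_pg("my-object")
print(pg, pg_to_osds(pg))
```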

How does that affect capacity expansion?

Expansion granularity

My definition of capacity expansion granularity is: how many machines can be added at one time.

In practice, Ceph expansion is constrained by the “fault-tolerant domain”: you can only safely expand by one fault-tolerant domain at a time.

A fault-tolerant domain is the replica isolation level, i.e. replicas of the same data are stored on different disks/machines/racks/server rooms. The concept exists in many storage solutions, including HDFS.

Why is Ceph affected by this? Because Ceph has no centralized metadata node, its data placement strategy is affected.

By data placement strategy I mean deciding where each replica of a piece of data goes: which machine, and which hard disk.

A centralized system such as HDFS records the location of every file and of every data block under it.

These locations do not change very often: only when a file is created, when the Balancer rebalances, or when a disk fails and the central node re-places the data that sat on the damaged hardware.

Ceph, being decentralized, recomputes the locations of the PGs that hold the data whenever the crushmap changes.

When new machines and disks arrive, new locations have to be calculated for the affected PGs. Technologies based on consistent hashing face the same problem during expansion.

Therefore, Ceph expansion requires PG adjustment, and it is this adjustment that makes Ceph subject to the “fault-tolerant domain” restriction.

For example: take a PG with 3 replicas, in a Ceph cluster configured so that a PG can serve normally only if it has at least 2 complete replicas.

If the data pool's fault-tolerant domain is Host and two machines are added at the same time, some PGs may map two of their three replicas onto the two new machines.

Both of those replicas are brand new and do not yet hold complete, up-to-date data. The single remaining replica on the old machines cannot satisfy the requirement of at least two complete replicas, so the PG cannot serve normal reads and writes.

All objects in that PG then stop serving external requests.

As the admin, you can lower the data pool's min_size to 1. But with that setting, even under normal circumstances a single disk failure can lose data, so it is generally not done.
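
To make the failure mode concrete, here is a tiny model in Python of the rule just described. It is an assumption-laden sketch: it only encodes "a PG serves I/O if at least min_size of its replicas are complete, and replicas on freshly added hosts are not yet complete", and the host names are hypothetical. For reference, min_size is a per-pool setting, adjustable with `ceph osd pool set <pool> min_size <n>`.

```python
def pg_active(replica_hosts, new_hosts, min_size=2):
    """A PG serves I/O only if at least min_size of its replicas are complete;
    replicas that landed on freshly added hosts are not yet complete."""
    complete = [h for h in replica_hosts if h not in new_hosts]
    return len(complete) >= min_size

# Two machines added at once: a PG that mapped two of its three replicas
# onto them keeps only one complete replica and stops serving.
print(pg_active(["host1", "host9", "host10"], new_hosts={"host9", "host10"}))  # False
```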

So if you expand only one machine at a time, is it safe?

That guarantees every PG keeps at least 2 complete replicas on the old machines. Even then, there is an extreme case: if a disk on an old machine fails during the expansion, a PG can still be reduced to a single complete replica.

Although such PGs may be temporarily unserviceable, data durability is fine. This is also why the service availability of the big domestic A/T clouds (Alibaba and Tencent) is not especially high, only three or four nines, nothing like their durability figures.

I’m not sure if the object storage in these two big clouds is using Ceph, but any object storage implemented based on similar decentralized technologies such as CRUSH or consistent hashing will probably have some data temporarily unserviceable.

Let’s set aside the most extreme case and assume that no disk fails while a “fault-tolerant domain” worth of machines is being added. Is there then a way to increase the expansion granularity?

The idea is to choose a larger “fault-tolerant domain”, such as Rack, when you first plan the Ceph cluster.

It can be a real rack, or just a logical one. You can then expand by one logical “fault-tolerant domain” at a time, i.e. an entire rack, or at least several machines, instead of being limited to a single machine.
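
As a rough sketch of why a rack-level failure domain helps, here is a toy placement function in Python (rack and host names are hypothetical; real Ceph expresses this with a CRUSH rule whose failure domain is rack, e.g. `step chooseleaf firstn 0 type rack`):

```python
import random

# Hypothetical topology: hosts grouped into (possibly logical) racks.
RACKS = {
    "rack1": ["host1", "host2", "host3"],
    "rack2": ["host4", "host5", "host6"],
    "rack3": ["host7", "host8", "host9"],
}

def place_replicas(pg_id: int, replicas: int = 3) -> list:
    """Pick one host from each of `replicas` distinct racks, mimicking a
    CRUSH rule whose failure domain is 'rack' instead of 'host'."""
    rng = random.Random(pg_id)
    chosen_racks = rng.sample(sorted(RACKS), replicas)
    return [rng.choice(RACKS[r]) for r in chosen_racks]

print(place_replicas(42))
```

Since no rack ever holds more than one replica of a PG, adding a whole rack's worth of new machines at once can touch at most one replica per PG, so the "at least two complete replicas" requirement from earlier is still met.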

Tip: I haven’t explained here why small granularity is a bad thing. At many companies, daily data growth can exceed the storage capacity of a single machine.

That leads to the awkward situation where expansion cannot keep up with the write rate, and it does a lot of damage to clusters that were not designed for rapid expansion from the start.

Crushmap changes during capacity expansion

Every crushmap change makes Ceph re-hash the physical locations of PGs. If a disk fails halfway through an expansion, the crushmap changes again and another round of recomputation and migration begins.

If you are unlucky, the expansion of a single machine can drag on for a long time before the cluster returns to a stable state.

The Ceph rebalancing caused by this crushmap change is a headache for a large storage cluster at almost any time, not just during capacity expansion.

When a cluster is newly built the disks are relatively new, so the failure rate is not high. But in large storage clusters that have been running for 2-3 years, disk failures are genuinely routine; in a cluster of around a thousand machines, losing 2-3 disks a day is nothing unusual.

Frequent crushmap changes keep Ceph internally unstable, and the impact is significant. What follows is an overall drop in I/O (disk I/O gets exhausted by repeated rebalancing) and even some data becoming temporarily unavailable.

So overall, the expansion of Ceph is a bit unpleasant. Ceph does offer unlimited scalability, but the process is neither smooth nor completely manageable.

The crushmap design achieves good decentralization, but it also opens a hole through which cluster instability creeps in.

By comparison, HDFS, with its centralized metadata, has almost no limit on how much you can add at once. Relocating and rebalancing old data is handled by a separate Balancer job, and it is efficient.

It pairs full nodes with empty ones and moves just enough data off the old nodes to fill the new machines. Centralized metadata becomes an advantage when it comes to expansion and rebalancing.

Adjusting the number of PGs once the cluster has grown

As described in the Ceph write path above, the minimum placement unit for Ceph objects is the PG, and PGs in turn are placed on hard disks. In theory, the more PGs the better.

More PGs mean better randomness of the data shards, which better masks the large per-disk capacity deviations caused by pseudo-random placement.

In reality, though, more PGs is not always better, because the count is constrained by hardware: CPU, memory, and network. So when planning the number of PGs we do not raise it blindly; the community recommendation is around 200 PGs per OSD.

Suppose we have 10 machines with 10 disks each, and enough single-replica PGs that each disk stores about 100 PGs.

That was healthy at the time, but as the cluster grows to 1000 machines, each disk holds only about one PG, which magnifies the pseudo-random imbalance. The admin is then faced with adjusting the PG count, and that creates problems.
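
The arithmetic behind that, as a quick Python check (the ~10,000 total PGs is only what the example above implies: 100 PGs per disk × 100 disks, single replica):

```python
def pgs_per_disk(pg_num, replicas, machines, disks_per_machine):
    """Average number of PG replicas that land on each disk (OSD)."""
    return pg_num * replicas / (machines * disks_per_machine)

print(pgs_per_disk(10_000, 1, 10, 10))     # ~100 PGs per disk: healthy
print(pgs_per_disk(10_000, 1, 1000, 10))   # ~1 PG per disk: badly imbalanced
```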

Changing the PG count essentially puts the whole cluster into a seriously abnormal state. Almost 50% of the objects involved in a PG adjustment have to be physically relocated, which can seriously degrade quality of service.
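
A rough way to see where the "almost 50%" comes from: with modulo-style bucketing (a simplification of Ceph's stable hashing, used purely for illustration), doubling the PG count sends about half of all objects to a different PG, and those PGs are then remapped onto different OSDs. A toy simulation:

```python
import hashlib

def pg_of(name: str, pg_num: int) -> int:
    # Hash the object name into one of pg_num buckets (simplified).
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % pg_num

objects = [f"obj-{i}" for i in range(100_000)]
moved = sum(pg_of(o, 1024) != pg_of(o, 2048) for o in objects)
print(f"{moved / len(objects):.1%} of objects land in a different PG")  # ~50%
```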

Adjusting PGs is not a routine event, but as a large storage cluster evolves, it is a big test it will inevitably have to pass.

Cheaper than commercial storage

When we compare to commercial storage, we generally compare to hardware and software storage solution vendors like EMC and IBM, or cloud solutions like Aliyun and AWS.

Building your own server room is of course cheaper in terms of hardware unit price, but you have to consider the total cost, including:

  • Hardware cost
  • The cost of your own maintenance staff
  • Service quality that only gradually improves from mediocre

I will not touch the fuzzy question of personnel cost; this article only looks at what is interesting about Ceph on the hardware-cost side.

To be fair, if you build your own server room, the hardware is undoubtedly cheaper, so what is special about Ceph here? The problem is the cluster's reliably usable capacity.

Reliably usable capacity: the utilization level beyond which the cluster can no longer provide external service, or can no longer keep that service highly available.

For example, do our phones' flash storage and our computers' hard drives still work at 99% full? Of course, because they are local storage. With cloud solutions there is no such problem either.

Commercial storage solutions, such as EMC's Isilon distributed file system, can be filled to 98-99% and still serve.

HDFS also serves well externally at anything below 95% utilization; beyond that, Hadoop jobs running on HDFS start to fail because local writes no longer go through.

Ceph does not perform well in this area. As a rule of thumb, an unstable state is possible after the overall cluster utilization reaches 70%.

Why is that? The problem is the tradeoff that decentralization brings.

Ceph is a decentralized distributed solution in which object metadata is spread across the physical machines, so all objects are allocated to disks “pseudo-randomly”.

Pseudo-random placement cannot guarantee a perfectly uniform distribution across disks, nor can it reduce the probability of many large objects landing on the same disk at the same time (my understanding is that the extra PG layer, and more PGs per disk, reduce the per-disk variance). So some disks will always sit above the mean utilization.
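
A quick simulation of that variance (toy numbers, equal-sized PGs placed uniformly at random; the 95% figure in the comment is only used as an illustrative full threshold, roughly matching Ceph's default full ratio):

```python
import random
from statistics import mean

random.seed(0)
DISKS = 1000
PG_COUNT = DISKS * 100          # ~100 equal-sized PGs per disk on average

fill = [0] * DISKS
for _ in range(PG_COUNT):       # place each PG on a uniformly random disk
    fill[random.randrange(DISKS)] += 1

avg, fullest = mean(fill), max(fill)
print(f"average {avg:.0f} PGs/disk, fullest disk {fullest} "
      f"({fullest / avg:.0%} of average)")
# The fullest disk typically carries roughly 130% of the average, so at
# ~70% mean utilization the worst disks are already approaching a ~95%
# full threshold.
```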

When the overall cluster usage is not high, there is no problem. After 70% utilization, administrators need to step in.

Because with a large variance, some disk is much more likely to hit 95%. The admin then starts lowering the reweight of the disks whose utilization is too high.

However, if more disks fill up before that reweight adjustment finishes, the admin is forced to lower the reweight of yet another over-full disk before Ceph has reached a stable state.

That is yet another crushmap change, which pushes Ceph further and further from a stable state. And if expansion is also running late at this point, it gets even worse.
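
For what it's worth, the knob being turned in this loop is the per-OSD reweight; Ceph itself ships `ceph osd reweight-by-utilization` for this. Below is only a hypothetical helper sketch that prints suggested manual reweight commands; the JSON field names of `ceph osd df --format json` ("nodes", "utilization", "reweight") are assumed from memory and should be checked against your version.

```python
# Hypothetical helper: print suggested manual reweights for over-full OSDs.
import json
import subprocess

THRESHOLD = 85.0  # % utilization above which we nudge the reweight down

out = subprocess.run(["ceph", "osd", "df", "--format", "json"],
                     capture_output=True, text=True, check=True).stdout
for osd in json.loads(out)["nodes"]:
    if osd["utilization"] > THRESHOLD:
        new_weight = round(osd["reweight"] * 0.95, 4)  # drop by 5%
        print(f"ceph osd reweight {osd['id']} {new_weight}")
```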

In addition, the intermediate states of earlier crushmaps leave some PGs half-migrated, and these “incomplete” PG copies are not deleted immediately, which puts further pressure on the already tight disk space.

Some readers may wonder why Ceph becomes unavailable when one disk is full. That is indeed Ceph's design: because it cannot guarantee that new objects will land on empty disks rather than the full one, Ceph chooses to deny service once a disk is full.

After asking colleagues and industry peers, I learned that almost every Ceph cluster starts preparing to expand once it reaches about 50% utilization.

That is actually quite uneconomical, since a large amount of machine storage has to sit idle. And the bigger the cluster grows, the bigger this idle margin becomes, meaning more money and electricity wasted.

By contrast, in many traditional centralized distributed storage systems the master node can direct writes to relatively idle machines, so a few full disks never make the whole cluster unwritable.

That’s why you can write to 95% overall and still be usable.

I haven’t really accounted for the cost of this effect, but at least it seems a little imperfect.

For example, when I expected to hold 50 PB of storage, I needed 300 physical machines, but I also had to purchase another 200-300 physical machines in advance, which could not be put to use right away yet still had to sit there powered on.

So Ceph isn’t necessarily cheap, and decentralized distributed storage isn’t all that great.

But the downsides of centralization are hardly controversial either (single point of failure, central-node scalability limits, and so on), so there really is no silver bullet in distributed systems, only tradeoffs.

Ceph also offers the option of expanding by whole pools: once a pool is full it is not expanded further; a new pool is opened instead, new objects can only be written to the new pool, and the old pool's objects can still be read and deleted.

That may look like a great solution at first glance, but if you think about it, it is no different from HDFS federation, MySQL sharding (splitting databases and tables), or hashing at a front-end routing layer.

That is not “infinite expansion”, and it requires writing a routing layer in front.

Stable, controllable, and easy to operate

Whether operations are stable and easy basically comes down to the raw strength of the team. Familiarity with and experience in the open-source software really make a difference.

It is also affected by the quality of the open-source community's documentation. Ceph's community is pretty good, and after Red Hat acquired and took the lead on Ceph, they reorganized the documentation for the Red Hat version of Ceph, which I think reads more logically.

Building your own operations documentation inside the company is also critical. A novice is likely to make many mistakes that lead to incidents, but as a company you want to step in each pit once and never a second time.

That poses challenges for how a company manages its accumulated technology, its technical documentation, and the risk of losing core people.

I once ran into a thorny problem operating Ceph: when the cluster reached about 80%, individual disks kept filling up, and the administrator had to step in and lower the reweight of the over-full disks.

But before their usage came down, more disks filled up and the administrator had to step in and adjust reweights again. Ceph never returned to a stable state, and the administrator had to keep watching the cluster.

That means a huge operations investment and is very damaging to the morale of the operations staff, so situations like this have to be avoided.

Should we then raise capacity warnings earlier and kick off procurement sooner?

But that brings us back to the problem of wasted resources. In addition, Ceph objects carry no last_access_time metadata, so dividing objects into hot and cold requires extra development.

As the cluster grows, cleaning up garbage data and archiving cold data also become challenges.
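
Because the cluster itself records no access times, hot/cold classification has to be bolted on. One minimal sketch using the python-rados bindings stamps a last-access xattr on the read path; the pool name, object name, and xattr key are hypothetical, and a real deployment would more likely mine gateway access logs rather than add a write to every read.

```python
import time
import rados  # python-rados bindings

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("mypool")  # hypothetical pool name

def read_with_access_stamp(obj_name: str, length: int = 4 * 1024 * 1024) -> bytes:
    """Read an object and stamp the access time on it as an xattr."""
    data = ioctx.read(obj_name, length)
    ioctx.set_xattr(obj_name, "user.last_access", str(int(time.time())).encode())
    return data

print(len(read_with_access_stamp("my-object")))
```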

Concluding thoughts

Ceph does have unlimited capacity, but it requires good initial planning and the expansion process is not perfect.

Where centralization caps expansion at the physical limits of a single master node, decentralization provides the theoretical basis for unlimited expansion; in actual expansions, however, service quality is severely constrained.

Ceph wastes a certain amount of hardware, and that overhead should be factored into cost accounting.

Ceph's decentralized design sacrifices metadata such as last_access_time, which puts pressure on future data governance and requires a stronger team for operations and further development.

Accumulating operations experience and building up the operations team are the core of mastering open-source distributed storage. As the adversary grows stronger over time, the operations team has to keep getting better, so that the “relations of production” keep pace with productivity.

Technology itself is not absolutely good or bad; different technologies solve different problems. But within a given scenario, there are better and worse technologies.

Because within a scenario you have a position and a priority ordering of the problems to solve, and from there you can certainly choose the technology that suits you best.