Introduction: During the epidemic, Tencent Medical has provided timely and accurate epidemic information services to the people of China, and Tencent Cloud Kafka serves as a key component in Tencent's medical big data architecture. Facing a surge of data storage requirements in a short period of time, how do you respond and scale quickly enough to keep the service running stably? Starting from the design of the disk layer on the Kafka cluster's physical machines, this article explains how to choose the right disk solution for different service scenarios. (Editor: Middleware Q Sister)

Healthcare Information Scenario

The Tencent Medical use case is a typical log analysis system. As the message middleware, Kafka handles data aggregation and traffic peak shaving, as shown below:

In the medical case, Kafka carries peak data throughput on the order of GB/s along with heavy data storage pressure. This places high demands on the underlying hard disks: throughput, storage capacity, rapid expansion and shrinkage, and so on.

Therefore, the following factors need to be considered comprehensively in the design of the hard disk scheme:

  • Bulk storage
  • High throughput I/O capability
  • Rapid capacity expansion and shrinkage
  • Data security
  • Low redundancy storage

Common hard disk configurations for Kafka clusters include single-disk read/write, multi-directory read/write, disk arrays (RAID0 and RAID10), and logical volumes (LVM). The following sections analyze the advantages and disadvantages of each scheme so readers can choose the one that fits their needs.

1. Disk solution overview

The hard disk storage designs discussed here all use mature, industry-standard techniques; there is no special innovation. The choice is made from the perspective of Apache Kafka itself, weighing which solution best fits the user's business needs.

2. Select hard disk media

The industrial hard drive market is dominated by mechanical hard drives and solid-state drives (SSDs). For very large storage capacities, SSD price remains the Achilles' heel, and for high-I/O applications such as Kafka, SSD failure rates and limited service life are a real concern. Mechanical hard disks, with their low price and large capacity, therefore become the only practical choice.

Mechanical hard disks then leave two problems to solve: how to improve the disk's I/O capability, and how to keep the service stable when disk failures become a routine occurrence. Let's look at these two aspects.

3. Improve disk I/O capability

The following indicators are used to measure disk performance:

  • IOPS: the number of read/write operations per second. The underlying drive type of a storage device largely determines its IOPS.
  • Throughput: the amount of data read/written per second, expressed in MB/s.
  • Latency: the time between when an I/O request is issued and when its completion is acknowledged, typically measured in milliseconds.

Kafka itself already makes the most of the disk at the application level through sequential reads and writes, the page cache, zero copy, and similar techniques. However, once the disk can no longer keep up, the Kafka cluster's ability to serve traffic drops sharply. Given how Kafka is used and how it operates, the most important performance metric is throughput.
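Sequential throughput is therefore the number worth measuring before committing to a scheme. Below is a minimal sketch using fio (assuming fio is installed and the disk under test is mounted at /data; the block size, file size, and runtime are illustrative values, not recommendations):

# Rough sequential-write throughput test against the disk mounted at /data.
# 1 MiB blocks with direct I/O roughly approximate Kafka's large sequential appends.
fio --name=seq-write-test \
    --directory=/data \
    --rw=write --bs=1M --size=4G \
    --numjobs=1 --direct=1 --ioengine=libaio \
    --runtime=60 --time_based \
    --group_reporting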

In a self-built cluster, mounting hard disks on independent hosts is the most common approach. Starting from reading and writing a single hard disk, the ways to improve disk throughput are as follows:

  • Single-disk read/write
  • Kafka multi-directory read/write
  • RAID disk array
  • Logical Volume Manager (LVM) striping

Solution 1: Read/write data from a single hard disk

Single-node, single-disk deployment is the most common approach in self-built clusters and the one most often used in practice, as shown in the diagram below:

The figure above shows a Kafka cluster of three nodes, each with a single SATA or SSD disk storing Kafka data. The scheme is conceptually simple and quick to set up, which makes it a good fit for small self-built clusters.

With this solution, when the disk proves insufficient, the most direct and effective fix is vertical scaling: improve the I/O capability of the single disk, for example by moving from a 5400 RPM disk to 7200 RPM, 10000 RPM, or even faster. If even a faster mechanical disk cannot keep up, replace it with a large-capacity SSD.

The disadvantages are just as obvious. A single SSD has a throughput ceiling, and once Kafka's traffic grows past it, the SSD can no longer cope; the throughput of a single disk caps the throughput of the Kafka cluster. SSDs also cost several times as much as SATA disks. Taking current prices on Tencent Cloud as an example, an SSD costs three times as much as a high-performance cloud disk.

Therefore, this solution is not a long-term option when the cluster size continues to grow.

Solution 2: Kafka multi-directory read/write

What if the pressure on the cluster keeps growing and a single hard disk can no longer meet demand, while SSDs are ruled out on cost grounds? Starting with version 0.8, Apache Kafka officially supports reading from and writing to multiple directories: the log.dir property becomes log.dirs. Here's the official explanation:

A comma-separated list of one or more directories in which Kafka data is stored. Each new partition that is created will be placed in the directory which currently has the fewest partitions


To put it simply, you can configure multiple log directories separated by commas. In real projects this brings a major benefit: it lets a broker use the read/write capability of multiple hard disks. Add a configuration like the following to the server.properties file:

log.dirs=/data,/data1,/data2

After adding this configuration, see the following figure:

As shown above, suppose you have a topicA with 9 partitions and a single replica. The nine partitions are distributed evenly across nodes 1, 2, and 3. Assume partitions 0, 1, and 2 land on node 1: Kafka places their data directories in /data, /data1, and /data2 respectively, i.e. /data/topicA-0, /data1/topicA-1, and /data2/topicA-2. Why the distribution is even is not covered in detail here; interested readers can consult the relevant material.
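For reference, a topic matching this example could be created as follows (a sketch assuming a recent Kafka distribution with a broker reachable at localhost:9092; 0.8-era releases used --zookeeper instead of --bootstrap-server):

# Create topicA with 9 partitions and a single replica; each broker then
# spreads its share of the partitions across the directories in log.dirs.
bin/kafka-topics.sh --create \
    --topic topicA \
    --partitions 9 \
    --replication-factor 1 \
    --bootstrap-server localhost:9092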

When Kafka writes to these three partitions, it can draw on the I/O capability of all three disks, which makes this a genuinely useful solution in practice. However, there are cases the multi-directory scheme cannot handle.

Consider this situation: because of differing service characteristics, some services have a mix of hot and cold data, so some partitions are large and others small. A single partition's traffic may reach the I/O bottleneck of its disk, and we are back to the problem of the single-disk solution.

Therefore, the multi-directory solution has limitations when the service exhibits this kind of skew.

Solution 3: RAID disk array

RAID is a disk array: many independent hard disks are combined into one large-capacity disk group, and the combined effect of the multiple disks improves the I/O capability of the whole. Typical levels include RAID0, RAID1, RAID5, and RAID10. For reasons of space, RAID itself is not covered in depth here; interested readers can consult the relevant material. A brief comparison of the advantages and disadvantages of these levels:

  • RAID0 — Advantage: parallel I/O improves I/O capability. Disadvantage: single copy; lost data cannot be recovered.
  • RAID1 — Advantage: a mirrored second copy improves data safety. Disadvantage: every write goes to both disks, so I/O capability is the same as a single disk.
  • RAID10 — Advantage: parallel I/O and mirrored copies at the same time. Disadvantage: requires more disks, so the cost is relatively high.
  • RAID5 — Advantage: a compromise relative to RAID10, offering much of its benefit without fully mirrored data. Disadvantage: parity calculation adds write overhead, and rebuilding after a failure is slow.

1. RAID10 scheme

The RAID solution is meant to remove the I/O bottleneck of a single hard disk. Having compared the levels above, we take RAID10 as an example to explain what RAID does and which RAID level to choose. Take a look at the picture below:

In the preceding figure, a single machine uses four disks to build RAID10: each pair of disks forms a virtual RAID1 disk, the two RAID1 disks are then striped into a RAID0 volume, and the resulting volume is mounted to the /data directory. In theory, if a single disk delivers 100 MB/s of throughput, a four-disk RAID10 array can deliver 200 MB/s, while every piece of data is stored in two copies at the bottom layer.
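For illustration, a four-disk RAID10 array like the one in the figure could be assembled roughly as follows (a sketch assuming the mdadm tool, four unused disks /dev/sdb through /dev/sde, and an ext4 filesystem; the device names and filesystem choice are assumptions):

# Build a RAID10 array from four disks.
mdadm --create /dev/md0 --level=10 --raid-devices=4 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Create a filesystem on the array and mount it where Kafka stores its data.
mkfs.ext4 /dev/md0
mkdir -p /data
mount /dev/md0 /data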

In this case, even if one of the underlying disks fails, the system automatically reads and writes the mirror copy on the other disk, and service continues as normal. The RAID0 layer parallelizes I/O and raises overall I/O capability: in theory, striping across three disks would triple read/write speed in the same amount of time. Bus bandwidth and other factors mean the actual gain is certainly lower than the theoretical value, but there is no doubt that transferring large volumes of data in parallel is significantly faster than doing it serially.

It is worth noting that RAID0's total I/O capability is not higher than that of the multi-directory solution. What RAID0 offers is a single logical disk with higher I/O capability and larger capacity than any one physical disk, and those two characteristics are exactly what resolves the hot/cold-data and partition-skew problems.

2. Can you use RAID0 directly?

Careful readers may have spotted a problem. Suppose we have a topicB with 1 partition and 2 replicas, placed on nodes 1 and 2. When a single messageA is produced, four copies of it end up stored in the cluster: two copies (the RAID1 mirror) on node 1 and two more on node 2, as shown below:

Since nodes 1 and 2 already hold Kafka replicas, why keep an extra copy on the disk as well? Isn't that wasteful? Purely by the numbers, yes, it is.

So imagine we go with RAID0 directly. When a data disk fails, the Kafka cluster migrates the affected partitions to other available machines, and if one of them happened to be a leader, a leader switch takes place. So far, so good.

However, if the affected partitions carry a large volume of data, moving them onto other machines raises the load there and can easily affect the partitions already served by those machines. That is a significant hidden risk to cluster stability.

In a large cluster of thousands of machines, disk failures are routine, so partition migration and leader switching become relatively frequent. That still may not look like a huge problem, since the data stays accessible and nothing is lost.

However, a client that is sensitive to leader switches will quickly notice the server-side fluctuation. As a service provider, we want to offer users a stable service; if the situation above keeps occurring, users may perceive the service as unstable, which hurts the vendor's reputation.

Thanks to industrial progress, the price of mechanical hard disks keeps falling, and large-capacity mechanical disks are now genuinely cheap. So, when balancing cost against stability, it is reasonable to spend a bit more to guarantee the stability of the service.

With RAID10, when one disk fails the system keeps running. If the replacement window is 24 hours, all that is required for normal, stable operation is that the other disk in the same RAID1 pair does not also fail within those 24 hours.
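For reference, replacing a failed member of the array could look roughly like this (a sketch assuming an mdadm-managed array /dev/md0, a failed disk /dev/sdc, and a fresh replacement /dev/sdf; the device names are assumptions):

# Mark the failed disk, remove it from the array, and add the replacement.
mdadm --manage /dev/md0 --fail /dev/sdc
mdadm --manage /dev/md0 --remove /dev/sdc
mdadm --manage /dev/md0 --add /dev/sdf

# Watch the rebuild progress until the mirror is fully restored.
cat /proc/mdstat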

Here’s a tip to share, starting with this image:

Why are the disks in the picture different colors? When building RAID1, the two disks in a pair should ideally not come from the same batch. Disks from the same batch share the same production process and may have similar lifespans, which to some extent raises the probability of both failing at the same time.

So far, it seems perfect. Let’s add another factor:

Assume the planned single-node capacity was 8 TB, but as the business develops the node now needs 10 TB. What do you do?

RAID0 cannot be expanded dynamically, so it looks as if the whole set of disks has to be replaced. Let's look at the LVM solution instead.

Solution 4: Striped LVM logical volumes

The striping principle of LVM logical volumes is similar to RAID0: data is read and written in stripes, so both offer parallel read/write capability, and in our tests the parallel read/write performance of the two schemes was comparable. The advantage of LVM over RAID10 is that it can expand the disks dynamically. Expanding a striped LVM volume relies on lvextend, with one condition: the stripe set must take an equal amount of space from every disk, so lvextend on a striped volume fails if the disks cannot each provide the same capacity.

For example, suppose a striped volume spans three 1 TB disks that each have 200 GB of free space left, and two new 1 TB disks are added. To expand, the striped volume must take an equal amount of space from every disk, so each disk can contribute at most 200 GB, leaving 800 GB on each new disk unused.
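As a concrete illustration, a striped logical volume over three disks could be created like this (a sketch assuming the LVM2 tools, three unused disks /dev/vdb through /dev/vdd, and ext4; the device names, volume names, and the 64 KiB stripe size are assumptions to be tuned per workload):

# Register the disks as physical volumes and put them in one volume group.
pvcreate /dev/vdb /dev/vdc /dev/vdd
vgcreate vg_kafka /dev/vdb /dev/vdc /dev/vdd

# Create one logical volume striped across all three disks
# (-i 3 = three stripes, -I 64 = 64 KiB stripe size).
lvcreate -n lv_kafka -i 3 -I 64 -l 100%FREE vg_kafka

# Filesystem and mount point for Kafka data.
mkfs.ext4 /dev/vg_kafka/lv_kafka
mkdir -p /data
mount /dev/vg_kafka/lv_kafka /data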

As the example shows, LVM's dynamic expansion is of limited practical use when the Kafka cluster runs on physical machines with locally mounted physical disks.

Now change the scene to the cloud. When we build the cluster from virtual machines and cloud disks bought on the cloud, LVM's strengths come to the fore. See the following figure:

As shown in the figure, three cloud disks are attached to each CVM, striped with LVM into one logical disk, and mounted to the /data directory. Cloud disks keep multiple copies of data and can be expanded online, and these two features pair very well with LVM. Why is that?

Because cloud disks already keep multiple copies at the bottom layer, there is no need for RAID1 to guard against data loss and service fluctuation. And because a single cloud disk can be expanded online, LVM's dynamic expansion finally comes into its own. See the diagram below:

As an example, suppose the cluster initially only needs 600 GB of capacity per machine. We can buy six 100 GB cloud disks per machine, stripe them with LVM, and mount the volume to /data, gaining both the parallel writes of striping and the required 600 GB. After the business grows for a while, it turns out 600 GB is not enough and each broker needs 1.2 TB. We then expand the cloud disks online through the console to a total of 1.2 TB per broker and grow /data with lvextend.
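The expansion step could look roughly like this (a sketch assuming the volume group and logical volume names from the previous example, an ext4 filesystem, and cloud disks already enlarged in the console; with XFS, xfs_growfs would replace resize2fs):

# Tell LVM about the new size of each enlarged cloud disk.
pvresize /dev/vdb
pvresize /dev/vdc
pvresize /dev/vdd

# Grow the striped logical volume into the new space,
# then grow the mounted filesystem online.
lvextend -l +100%FREE /dev/vg_kafka/lv_kafka
resize2fs /dev/vg_kafka/lv_kafka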

On Tencent Cloud, a single cloud disk can be as large as 16 TB and a single CVM can attach up to 20 cloud disks. Following the approach above, the theoretical capacity of one machine is 16 × 20 = 320 TB, which is plenty of capacity for a single machine.

Of course, these are theoretical values, and practice will certainly fall short of the ideal, but it gives a sense of the headroom. Note also that the striped I/O capability you actually achieve, and the number of disks worth attaching, depend on the stripe size (I/O size) chosen when the striped LVM volume is created. The appropriate stripe size depends on the size and volume of individual records; there is no universally recommended value, and it has to be evaluated against the user's own business characteristics.

As discussed above, for Kafka deployed in the cloud, LVM is a better solution than RAID10.

Conclusion

This article has analyzed the application scenarios, advantages, and disadvantages of several common schemes, and it turns out there is no perfect, universal solution. For self-built clusters with simple service scenarios, a disk array is not necessary: the single-disk and multi-directory read/write solutions already solve most problems. RAID0 and RAID10 suit complex, large-scale physical machine clusters. For clusters built on CVM hosts in the cloud, the LVM solution is the better option.

To sum up, running an Apache Kafka cluster well means weighing the hard disk solution against business characteristics, cost, data reliability, available resources, and the environment in which you operate.

About the author

Xu Wenqiang, senior R&D engineer on the Tencent Cloud middleware message queue team and a core developer of Tencent Cloud CKafka, with years of experience in distributed system development. He is mainly responsible for the customized development and optimization of Tencent Cloud CKafka, focusing on Kafka performance analysis and optimization in public cloud multi-tenant and large-scale cluster scenarios.
