Mechanical drive

Early computers did not have hard disks; the entire operating system lived on a 5.25-inch or 3.5-inch floppy disk. Later, mechanical hard disks attached to the motherboard were introduced, and by now they have largely replaced tapes, floppy disks, and optical discs.

The IOPS of a mechanical hard disk is on the order of 100, that is, roughly 100 random I/O operations per second.

The physical structure

As with the hard disk setup described earlier, the whole assembly consists of an interface, the corresponding control circuit, and the actual I/O device. The I/O device in this case is the mechanical hard disk itself.

As shown, a mechanical hard disk consists of platters, magnetic heads and a cantilever (actuator arm).

  1. Disk (platter): the surface on which data is actually stored. The platter is coated with a magnetic layer, and the data is recorded in that coating. In the middle of the platter is a spindle driven by a motor, and the spindle spins the platter. The key figure for the platter is its rotational speed, such as 5400, 7200, or even 10,000 RPM (revolutions per minute), which is the speed of the spindle motor. A 7200 RPM drive keeps spinning at 7200 revolutions per minute whenever the computer is powered on, which works out to 120 revolutions per second.
  2. Magnetic head: data is not transferred directly from the platter to the bus; it is read from the platter surface by the head, converted into electrical signals, and passed through the control circuit and interface onto the bus. A platter usually has two heads, one for each side, because both surfaces carry a magnetic coating and store data. A drive also contains not just one platter but a stack of platters mounted in parallel, and each platter has a head on its front and its back.
  3. Cantilever: the arm holds the head and, within its range of motion, positions the head over a specific track on the platter.

A platter is round and is divided into many concentric circles, each of which is a track. Each track has its own number. What the arm really controls is whether the head reads data from an inner track or an outer one.

A track is further divided into sectors, and the same track across all the stacked platters is called a cylinder. Reading a piece of data takes two steps.

  1. Rotate the platter to a certain position. From that position, the arm can reach a particular wedge of the whole platter surface. This wedge is shaped a bit like a slice of pizza and is commonly referred to as a geometric sector, meaning that, geometrically, all of the sectors in it can be reached by the arm.
  2. Move the arm to a particular track to find our actual sector within this geometric sector. Once it is found, the head drops down and reads the data directly from the sector beneath it.

These two components make up the time consumed by random data access on the hard disk:

  1. Average (rotational) latency: the time it takes to rotate the platter so that the right geometric sector lines up with the arm. For a random access, finding a geometric sector requires rotating the platter half a turn on average. For example, a 7200 RPM disk completes 240 half-turns per second, so this average latency is 1s / 240 = 4.17 ms.
  2. Average seek time: the time it takes the arm to position the head over the right track after the platter has rotated into place. The average seek time of today's HDDs is typically between 4 and 10 ms.

So it takes roughly 8-14 ms to randomly locate a single piece of data anywhere on the drive. A 7200 RPM hard disk therefore supports about 1s / 8 ms = 125 random I/O operations per second at best, or 1s / 14 ms = 70 IOPS at worst. The IOPS of an HDD is on the order of 100.
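
As a quick sanity check, the figures above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the 4-10 ms seek range is the assumption stated above.

    // Rough HDD IOPS estimate: average rotational latency is half a revolution,
    // access time = latency + seek, IOPS = 1 second / access time.
    public class HddIops {
        static double avgLatencyMs(int rpm) {
            double revPerSec = rpm / 60.0;       // 7200 RPM -> 120 revolutions per second
            return 1000.0 / revPerSec / 2.0;     // half a revolution, in milliseconds
        }

        static double iops(int rpm, double seekMs) {
            return 1000.0 / (avgLatencyMs(rpm) + seekMs);
        }

        public static void main(String[] args) {
            System.out.println(avgLatencyMs(7200));   // ~4.17 ms
            System.out.println(iops(7200, 4));        // ~122, close to the "125 IOPS" above
            System.out.println(iops(7200, 10));       // ~71, close to the "70 IOPS" above
        }
    }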

Correspondingly, if we do sequential reads and writes instead of random accesses, the way to maximize efficiency is to store sequential data on the same cylinder as much as possible. That way we only need one rotation and one seek to read or write data on multiple platters stacked in the same vertical position. If one cylinder cannot hold all the data, we keep the arm still and let the motor spin the platter so that an entire track can be read sequentially. As a result, the sequential read/write rate of an HDD is actually quite good and can reach about 200 MB/s.

Partial Stroking – Improve performance based on scenario

With only about 100 IOPS, it is difficult to serve the massive, highly concurrent requests of the Internet. So today's databases store their data on SSDs. But SSDs used to be expensive, and database data could only live on HDDs. Today, even the HDDs used in data centers are typically 7200 RPM drives, because when we need faster random access we simply use SSDs. Back then, however, SSDs were too expensive to be commercially practical, so hard disk manufacturers kept developing faster drives, and data centers often used 10,000 or even 15,000 RPM hard disks.

However, 10,000 or even 15,000 RPM drives are also more expensive. If you want to save cost and improve the price-performance ratio, you can use Partial Stroking or Short Stroking, that is, "shortening the stroke" of the head. Random access time is the sum of rotational latency and seek time, so shortening either of the two improves IOPS; short stroking shortens the seek time.

In general, the seek time of a hard disk is longer than the average rotational latency. The most extreme way to reduce seek time is to eliminate seeking altogether by putting all the data on a single track, for example always keeping the head on the outermost track. Then the seek time is essentially zero and the access time is just the average latency, so the IOPS becomes 1s / 4 ms = 250. Of course, a single track holds very little data, so in practice you use only the outer half or quarter of the tracks. The usable capacity of the disk drops to 1/2 or 1/4, but because the arm only has to travel a half or a quarter of the distance, the seek time also drops to roughly a half or a quarter, and the IOPS increases dramatically.

For example, a 7200 RPM hard disk normally has an average latency of 4.17 ms and a seek time of 9 ms, so its IOPS is 1s / (4.17 ms + 9 ms) = 75.9. If we use only 1/4 of the tracks, the IOPS becomes 1s / (4.17 ms + 9 ms / 4) = 155.8.
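
The same arithmetic reproduces the partial-stroking numbers; the 9 ms full-stroke seek is the figure used above, and dividing it by 4 is the simplifying assumption that using a quarter of the tracks quarters the seek time.

    public class PartialStroking {
        public static void main(String[] args) {
            double avgLatencyMs = 4.17;   // 7200 RPM, half a revolution
            double seekMs = 9.0;          // average seek over the full stroke
            System.out.println(1000 / (avgLatencyMs + seekMs));       // ~75.9 IOPS, whole disk
            System.out.println(1000 / (avgLatencyMs + seekMs / 4));   // ~155.8 IOPS, outer 1/4 of tracks
        }
    }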

This roughly doubles the IOPS, which is about the same as a 15,000 RPM hard disk, while leaving the drive with only a quarter of its original capacity. At the time, however, a 15,000 RPM drive of the same capacity cost more than four times as much as a 7200 RPM drive. So by formatting the hard disk in software so that only part of the tracks are visible to the system, we can greatly improve the price-performance ratio of the hardware.

Reference: haokan.baidu.com/v?vid=86506…

conclusion

The hardware of a mechanical hard disk mainly consists of platters, magnetic heads and an actuator arm. The position of our data on the disk is located by track, sector and cylinder. An actual access first rotates the platter so that the right "geometric sector" lines up with the arm; the arm then seeks to the right track and places the head over the sector we actually want to read.

Because of this mechanical structure, the time to access random data is the sum of the average rotational latency of the spinning platter and the seek time of the moving arm. From these two times we can calculate the IOPS of a mechanical hard disk.

The IOPS of a 7200 RPM mechanical disk is about 100. In the early days of the Internet we did not have SSDs to fall back on, so engineers came up with Partial Stroking, which trades storage capacity for higher IOPS by shortening the seek time. It is a typical example of optimizing in software after deeply understanding how the hardware works.

SSD

Even enterprise mechanical hard disks spinning at 10,000 RPM, or tuned with Short Stroking, cannot push IOPS much beyond roughly 1,000, which is no longer enough. SSDs, on the other hand, easily reach 10,000-20,000 IOPS, so they became the commercial choice. And if SATA SSDs are not fast enough, they can be replaced with PCI Express SSDs.

SSDs have no seek process like a mechanical hard disk, so their random reads and writes are much faster.

Mechanical drives, however, endure far more rewrites than SSDs: if we need to write and delete data frequently, mechanical drives are much more cost-effective. As a simple mental model, an SSD cell works a bit like a CPU cache cell: a capacitor combined with a voltmeter records one or more bits.

SLC, MLC, TLC, and QLC

SLC: one bit is recorded per cell; when the capacitor holds a charge the bit is 1, and when it is discharged the bit is 0. SSDs built this way are said to use SLC (single-level cell) flash: each storage cell holds exactly one bit of data.

The problem with this approach is similar to the problem with CPU caches: the number of cells that fit in the same area is limited. If you only use SLC, capacity cannot grow and the price cannot fall. Later, it became possible to store two, three or even four bits in one cell by using the voltmeter to distinguish multiple voltage levels.

QLC: four bits can represent 16 different values, 0000 through 1111. So if the capacitor can be charged to 15 distinguishable voltage levels, and the voltmeter can tell those 15 levels apart, then together with the empty (uncharged) state we get all 16 values from 0000 to 1111.

The problem with this approach: distinguishing 15 different voltages requires much more precision when charging and when reading, which makes both slower. As a result, QLC SSDs are several times slower to read and write than SLC ones.
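
The trade-off is easy to see in numbers: each extra bit per cell doubles the number of voltage states that must be told apart. A tiny sketch:

    public class CellLevels {
        public static void main(String[] args) {
            String[] names = {"SLC", "MLC", "TLC", "QLC"};
            for (int bits = 1; bits <= 4; bits++) {
                int states = 1 << bits;   // total distinguishable states per cell
                // one state is the empty capacitor; the rest are distinct charge voltages
                System.out.printf("%s: %d bit(s)/cell, %d states (%d charged voltage levels)%n",
                        names[bits - 1], bits, states, states - 1);
            }
        }
    }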

P/E erasure problem

The physical structure of an SSD, from the top down:

Like other I/O devices, it has an interface and a control circuit. Today's SSDs use SATA or PCI Express interfaces. Inside the control circuit sits the FTL, the flash translation layer, which is the core module of an SSD: how well an SSD performs depends largely on its FTL algorithm.

New high-capacity SSDs now use 3D packaging, stacking many dies on top of each other, just as a mechanical hard drive stacks many platters to fit more capacity into the same space.

A die contains multiple planes, and a plane typically stores on the order of gigabytes. A plane is divided into many blocks, each usually a few hundred KB to a few MB in size. Within a block there are many individual pages, much like pages in memory, and a page is usually 4 KB. The bottom two layers, blocks and pages, are the important ones.

For SSDs, writing data is called a Program operation. Writes cannot overwrite existing data in place, as they can on a mechanical hard disk; a location must be erased before it can be written again.

  • The basic unit of SSD reads and writes is a page.
  • SSD erasure, however, is done a whole block at a time.

The service life of an SSD is measured in erase cycles per block. Think of one plane of an SSD as a blank sheet of paper: writing data is like writing on it in pencil, and writing new data where something was already written requires erasing the old text with an eraser first. But if you rub the same spot too often, the paper wears through and can no longer be written on.

  • SLC chips can be erased about 100,000 times
  • MLC is around 10,000
  • TLC and QLC last only a few thousand cycles.

So when you buy an SSD, you will see big price differences for the same capacity, because the cell type and the lifespan are completely different.

SSD read and write life cycle

The three colors represent the different states of the pages in the SSD

  • White indicates a page that has never been written
  • Green indicates a page holding valid data
  • Red indicates a page whose data has already been deleted from the operating system's point of view

At first, every page of every block is white. As we write data, some pages turn green. Then, because we delete some files on the drive, some pages turn red. These red pages cannot simply be rewritten: the SSD cannot erase a single page, it must erase an entire block at once, so new data can only go into the white pages further along. The red holes scattered across the green area look just like disk fragmentation. If an entire block happens to turn red, we can erase the whole block; it becomes white again and can be written page by page. This actually happens fairly often: a block is not big, a few hundred KB to a few MB, so deleting a file of several megabytes whose data was stored contiguously can free whole blocks for erasure.

As more data accumulates on the drive, the red holes take up more and more space and we start running out of white pages to write into. At this point the SSD does something similar to "disk defragmentation" in Windows or "garbage collection" in Java: find the block with the most red holes, move its green data to another block, erase the whole block so it turns white, and write into it again. However, this defragmentation/garbage collection must not be done too aggressively or too often, because erase cycles are limited; if we defragment all the time, the SSD will wear out quickly.
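
A very rough sketch of that garbage-collection step: pick the block with the most red (invalid) pages, relocate its green (valid) pages, then erase it. The block size, page count and data structures here are purely illustrative.

    import java.util.*;

    // Toy model of SSD block garbage collection.
    class Block {
        static final int PAGES = 64;
        // page state: 0 = free (white), 1 = valid (green), 2 = invalid (red)
        int[] pages = new int[PAGES];
        int eraseCount = 0;

        long invalidPages() { return Arrays.stream(pages).filter(p -> p == 2).count(); }
    }

    public class SsdGc {
        // Move the valid pages of the most fragmented block elsewhere, then erase the block.
        static void collect(List<Block> blocks) {
            Block victim = Collections.max(blocks, Comparator.comparingLong(Block::invalidPages));
            for (int i = 0; i < Block.PAGES; i++) {
                if (victim.pages[i] == 1) copyPageToAnotherBlock(blocks, victim);
                victim.pages[i] = 0;        // after the erase, every page is free again
            }
            victim.eraseCount++;            // erases are what wear the block out
        }

        static void copyPageToAnotherBlock(List<Block> blocks, Block except) {
            for (Block b : blocks) {
                if (b == except) continue;
                for (int i = 0; i < Block.PAGES; i++) {
                    if (b.pages[i] == 0) { b.pages[i] = 1; return; }
                }
            }
        }
    }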

So is there no way to use an SSD well? There is a solution: reserved space (over-provisioning).

That is, the capacity of an SSD is never made fully usable. Rather than shortchange consumers, SSD manufacturers set aside extra space specifically for this "disk defragmentation": a drive sold as 240 GB actually contains 256 GB of flash. Through the controller, the extra space is used for shuffling data around so that the advertised 240 GB always remains writable. This extra 16 GB is called reserved space, and on SSDs it is typically about 7%-15% of the capacity.

SSDs are especially suitable for read-heavy, write-light workloads. In everyday use, the system disk is a good fit for an SSD. But using an SSD as a dedicated download disk, constantly downloading and backing up all kinds of audio and video data, is not a good idea, especially with today's QLC-cell SSDs, which survive only a few thousand erase cycles.

An SSD can only write one page at a time and cannot overwrite a page in place, and erasure can only be done a block at a time. We therefore need a mechanism like "disk defragmentation" or "garbage collection" to clean up the holes inside blocks, and SSDs also keep reserved space so they do not run out of writable pages.

Wear leveling, TRIM, and write amplification

Because an SSD is limited by erase cycles, and because the computer does not treat different parts of the SSD any differently while using it, the following situation can arise: the blocks that physically hold the operating system and commonly used software are rarely erased, while the blocks that hold, say, the code and data we work on every day are erased constantly. Over time the heavily erased part of the SSD wears out into bad blocks, and the usable space of the SSD shrinks.

FTL and wear leveling

Because of the situation above, some blocks fail prematurely while the erase cycles of other blocks go to waste, so a wear leveling strategy was developed to spread erasures evenly across all the blocks of an SSD. The core technique, as with virtual memory, is to add a layer of indirection: the FTL, the flash translation layer. Just as virtual pages are mapped to physical pages through a page table, the FTL stores a mapping from logical block addresses to physical block addresses.

The disk addresses the operating system accesses are logical addresses; only after translation by the FTL do they become actual physical addresses pointing at real blocks. The operating system itself never has to worry about block wear; it just reads and writes as if the drive were a mechanical hard disk. Every read/write request to the SSD goes through the FTL, and because the FTL knows which physical block backs each logical block, it can record how many times each physical block has been erased. If one physical block is being erased too often, the FTL can remap its logical block onto a less frequently erased physical block (the two blocks effectively swap mappings and the data is migrated to the colder block). The logical block does not change, so the operating system never needs to know.
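
A minimal sketch of the idea behind the FTL mapping and wear leveling. The data structures and the swap policy are illustrative only, not how any particular controller implements it.

    import java.util.HashMap;
    import java.util.Map;

    // Toy FTL: logical block address -> physical block address, plus per-block erase counters.
    public class Ftl {
        private final Map<Integer, Integer> logicalToPhysical = new HashMap<>();
        private final int[] eraseCount;

        public Ftl(int physicalBlocks) { eraseCount = new int[physicalBlocks]; }

        public int physicalFor(int logicalBlock) {
            return logicalToPhysical.computeIfAbsent(logicalBlock, l -> l);  // identity mapping at first
        }

        public void recordErase(int logicalBlock) { eraseCount[physicalFor(logicalBlock)]++; }

        // Wear leveling: if a "hot" block has been erased far more than a "cold" one,
        // swap their mappings (a real device would also migrate the data).
        public void maybeSwap(int hotLogical, int coldLogical, int threshold) {
            int hot = physicalFor(hotLogical), cold = physicalFor(coldLogical);
            if (eraseCount[hot] - eraseCount[cold] > threshold) {
                logicalToPhysical.put(hotLogical, cold);
                logicalToPhysical.put(coldLogical, hot);
                // the operating system keeps using the same logical addresses; nothing changes for it
            }
        }
    }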

This is a typical idea in the design of large systems: isolate each layer. The operating system does not need to think about the underlying hardware at all; it hands the management of the actual physical writes over to the FTL in the drive's control circuit.

TRIM command support

However, the fact that the operating system does not care what the underlying hardware is also creates a problem when the drive is an SSD: the block states seen by the operating system's logical layer and by the SSD's logical layer stop matching. When you delete a file in the operating system, nothing is actually deleted at the physical level; only the file's metadata in the file system is removed, so the inode can be reused, and the storage space it occupied is simply marked as writable again in the operating system.

So everyday file deletion is really just a logical deletion at the operating-system level. That is why, when we accidentally delete a file, we can often get the data back with recovery software, and why truly erasing data requires some kind of "file shredding" function.

This deletion logic is fine for a mechanical hard drive, because a mechanical drive can overwrite data in place: once the space is marked writable, a later write simply overwrites that location, and the old file never needs to be explicitly erased. SSDs are different.

Suppose we delete a recently downloaded file in the operating system, say the JDK installer openJDK.exe marked in yellow in the figure. The operating system removes the file's metadata from the corresponding inode. At this point, however, the SSD's logical-block layer knows nothing about what happened up in the operating system. At the FTL level openJDK.exe still occupies its space, and the corresponding physical pages are still considered in use.

Now if the SSD needs to do garbage collection, the physical pages holding openJDK.exe will still be copied to another block during the process (because the SSD does not know that the data in those pages is invalid). Only when the operating system later writes new data to that location does the SSD realize the old pages are useless and mark them as stale. So, from the SSD's point of view, file deletions in the operating system are invisible, and wear leveling ends up moving a lot of already-deleted data around. That causes plenty of unnecessary reads, writes and erases, which costs performance and shortens the SSD's life.

To address this, both operating systems and SSD controllers support the TRIM command, which lets the operating system notify the SSD, when a file is deleted, that the corresponding logical blocks can be marked as deleted. Essentially all current SSDs support TRIM.

Without this notification, the SSD has no idea that the data in a block has been logically deleted at the operating-system level, and it will not find out until the operating system writes to that location again. In the meantime, wear leveling keeps migrating that "valid" data, so deleted data is treated as valid and moved around for nothing, hurting both SSD performance and wear.

Write amplification

The very existence of the TRIM command points to a broader problem with SSDs: they tend to slow down as they fill up. As the drive gets fuller, there may not be enough blank space for each new write, so the drive has to garbage collect first: merge the pages of some blocks, erase them, and free up room. At that point the application or operating system may only be writing 4 KB or 4 MB of data, but after going through the FTL, 8 MB, 16 MB or even more data may actually get moved.

Write amplification = the amount of data actually written to the flash / the amount of data the system asked the FTL to write. The larger the write amplification, the worse the SSD's effective performance. To keep write amplification in check, garbage collection should be done periodically in the background, finishing the moving, erasing and freeing of blocks while the drive is relatively idle, rather than waiting until a real write arrives and doing all of that work on the critical path.
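
Expressed as a formula, with made-up numbers just for illustration:

    public class WriteAmplification {
        public static void main(String[] args) {
            double hostWriteMB = 4;     // what the application/operating system asked to write
            double flashWriteMB = 16;   // what the flash actually wrote after FTL migration
            double waf = flashWriteMB / hostWriteMB;
            System.out.println("write amplification factor = " + waf);  // 4.0: the bigger, the worse
        }
    }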

AeroSpike

So it really is hard to use SSDs well. Simply swapping HDDs for SSDs does not unlock their performance unless their characteristics are properly exploited. A KV database like AeroSpike is a good example of doing this right.

First, AeroSpike accesses the SSD directly, bypassing the operating system's file system and operating on the SSD's blocks and pages itself. For a KV database, the file system is just another layer of indirection that costs performance without providing anything it really needs.

Second, AeroSpike makes two optimizations when reading and writing data. When writing, it tries to write one large chunk rather than many small pieces, which makes the drive less prone to fragmentation and also makes it easier to benefit from sequential writes. A single write in AeroSpike is 128 KB, much larger than the 4 KB of a page.

When reading, on the other hand, AeroSpike can read as little as 512 bytes. SSDs have excellent random-read performance, and reads do not consume erase cycles the way writes do. Besides, much of what we read is the value of a key-value pair that is about to be sent over the network; if we always read large chunks at once, network bandwidth would become the bottleneck.

Because AeroSpike is a real-time KV database with strict response-time requirements, serious write amplification would noticeably lengthen write latency. So AeroSpike does a couple of things like this:

  1. Continuous disk defragmentation. AeroSpike uses what it calls a high-water-mark algorithm. The algorithm is very simple: once the fraction of fragmented (invalid) data in a physical block exceeds 50%, the block is compacted and then erased, so the drive always has enough free space to write into (a rough sketch follows this list).
  2. In AeroSpike's best practices, it is recommended to use only half of the SSD's capacity. In other words, we deliberately reserve 50% of the SSD's space so that write amplification stays as small as possible and does not affect database access latency.
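
A rough sketch of the high-water-mark idea from point 1. The 50% threshold comes from the text above; the block size and everything else here is made up for illustration.

    // Compact and erase any block whose share of invalid data crosses the watermark.
    public class HighWaterDefrag {
        static final double HIGH_WATER = 0.5;   // 50% fragmentation threshold

        static boolean needsDefrag(int invalidBytes, int blockBytes) {
            return (double) invalidBytes / blockBytes > HIGH_WATER;
        }

        public static void main(String[] args) {
            System.out.println(needsDefrag(70 * 1024, 128 * 1024));  // true: compact and erase this block
            System.out.println(needsDefrag(30 * 1024, 128 * 1024));  // false: leave it alone
        }
    }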

Thanks to these optimizations, when NoSQL databases were just emerging, AeroSpike's performance was far ahead of databases such as Cassandra and MongoDB, sometimes by an order of magnitude. This made AeroSpike the benchmark for high-performance KV databases at the time.

The problem

Why is AeroSpike less popular now than Redis?

The author replies:

Hello. Actually, as far as I know, AeroSpike is widely used both in China and abroad. I think the core reason it is less popular than Redis is that it was open-sourced late.

In addition, for most startups the data volume is small enough that memory used as a cache is plenty and Redis is enough, so AeroSpike is simply not needed for the time being.

DMA

DMA: even with a PCI Express interface, SSDs are still slow compared with CPUs. If the CPU issued the I/O instructions and then sat waiting for the I/O device to finish, it would spend most of its time waiting for no practical benefit. And much of what we do with an I/O device is simply moving data between memory and the device; in those cases the CPU is doing nothing but waiting. Especially when transferring large amounts of data, such as copying a big file, routing everything through the CPU is slow. So DMA, Direct Memory Access, was invented to reduce the time the CPU spends waiting.

DMA technology: put a separate chip on the motherboard so that data transfers between memory and I/O devices no longer go through the CPU but are driven directly by this DMA controller (DMAC). The chip is a coprocessor: it assists the CPU by doing the data transfers for it.

A DMAC is most valuable when the data to be transferred is very large and fast, or very small and slow. For example, when a gigabit network card or a hard disk transfers a large amount of data, having the CPU carry it all would keep the CPU far too busy, so we hand the job to the DMAC. When a transfer is slow, the DMAC can wait for the data to arrive and then signal the CPU to process it, instead of leaving the CPU waiting.

The DMAC is itself a special I/O device: like the CPU and the other I/O devices, it is connected to the bus to do the actual data transfers.

There are two kinds of devices on the bus, and the DMAC is both a master device and a slave device.

  • Master device: only a master device can initiate a data transfer; the CPU is a master device.
  • Slave device: a slave device (such as a hard disk) can only accept data transfers. So whenever data is moved by the CPU, whether the CPU is reading from the I/O device or writing to it, the CPU must act as the master device.

A slave device can also make a request to a master device, but what it sends is a control signal rather than the data itself. An I/O device can tell the CPU "I have data for you", but the actual data is pulled by the CPU; the I/O device does not push it to the CPU.

To the CPU, the DMAC is a slave device; to I/O devices such as hard disks, it acts as a master device.

The process of data transfer using DMAC:

  1. First, the CPU, still acting as a master device, makes a request to the DMAC. The request consists of writing to the DMAC's configuration registers.
  2. While configuring the DMAC, the CPU tells it the following information:
    1. The initial value of the source address and the address increment/decrement mode used during the transfer.
      1. Source address: where the data comes from. If we are writing data from memory to the hard disk, this is the address in memory of the data to be read; if we are reading from the hard disk into memory, it is the address of the hard disk's I/O interface.
      2. Address increment/decrement mode: whether the transfer walks from a low address toward a high address or the other way around.
    2. The initial value of the destination address and its increment/decrement mode.
    3. The length of the data to be transferred, that is, how much data to move.
  3. Once this information is set, the DMAC goes into an idle state.
  4. When, for example, data needs to be loaded from the hard disk into memory, the hard disk makes a data-transfer request to the DMAC. This request does not go over the bus but over a dedicated connection line.
  5. The DMAC responds to that request over another dedicated connection.
  6. The DMAC then issues a bus read request to the hard disk's interface, and the data is read from the hard disk into the DMA controller.
  7. The DMAC then issues a bus write request to memory and writes the data into memory.
  8. The DMAC repeats steps 6 and 7 until the amount of data configured in its registers has been transferred.
  9. When the transfer is complete, the DMAC returns to the idle state of step 3.

So during the whole transfer, the data is not carried by the CPU but by the DMAC chip. The CPU is still essential, though: it is the CPU that sets up what data is transferred, from where, and to where. That is why the DMAC is called a coprocessor.
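
To make steps 1-9 concrete, here is a toy software model of the DMAC configuration and transfer loop. Real DMA is done by hardware registers and bus transactions, so treat this purely as an illustration of who does what.

    // Toy model of a DMA transfer: the "CPU" only fills in the configuration;
    // the "DMAC" then moves the data without further CPU involvement.
    public class DmaModel {
        static class DmacConfig {
            int sourceAddress;        // where the data comes from
            int destinationAddress;   // where the data goes
            int step;                 // +1 or -1: address increment/decrement mode
            int length;               // how many units to transfer
        }

        // Steps 1-2: the CPU (master device) programs the DMAC's registers.
        static DmacConfig cpuConfigure(int src, int dst, int step, int len) {
            DmacConfig c = new DmacConfig();
            c.sourceAddress = src; c.destinationAddress = dst; c.step = step; c.length = len;
            return c;
        }

        // Steps 6-8: the DMAC reads from the source and writes to the destination, repeatedly.
        static void dmacTransfer(DmacConfig c, int[] source, int[] destination) {
            int src = c.sourceAddress, dst = c.destinationAddress;
            for (int i = 0; i < c.length; i++) {
                destination[dst] = source[src];
                src += c.step; dst += c.step;
            }
            // Step 9: the DMAC goes back to idle and signals completion to the CPU.
        }

        public static void main(String[] args) {
            int[] disk = {1, 2, 3, 4}, memory = new int[4];
            dmacTransfer(cpuConfigure(0, 0, +1, 4), disk, memory);
            System.out.println(java.util.Arrays.toString(memory));  // [1, 2, 3, 4]
        }
    }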

In the beginning computers had no DMAC and all data was carried by the CPU. As data-transfer demands grew, a standalone DMAC controller first appeared on the motherboard. Today there are more and more I/O devices with increasingly complex and varied transfer needs; displays, network cards and hard disks all have different requirements, so each of these devices now has its own DMAC chip.

Kafka

Kafka takes good advantage of DMA-based data transfer and achieves significant performance gains from it.

Kafka is a pipeline for processing real-time data. We often use it as a message queue, or to collect and persist massive amounts of logs. As a pipeline for real-time data and logs, its bottleneck naturally sits at the I/O level.

There are two common scenarios for massive data transfer in Kafka. One is receiving data from upstream over the network and writing it to the local disk so that the data is not lost. The other is reading data from the local disk and sending it out over the network.

Consider the latter case, reading data from disk and sending it to the network. The most intuitive approach is to read the data from disk into memory with a file read, and then send it out through a Socket.

    File.read(fileDesc, buf, len);
    Socket.send(socket, buf, len);

This looks like two steps, but it actually involves four data transfers: two of them are DMA transfers and the other two are CPU-driven copies.

  1. The first transfer reads the data from the hard disk into a buffer in the operating-system kernel. This transfer is carried out by DMA.
  2. The second transfer copies the data from the kernel buffer into the memory allocated by our application. This copy is done by the CPU.
  3. The third transfer writes the data from our application's memory into the operating system's Socket buffer. This copy is again done by the CPU.
  4. The last transfer moves the data from the Socket buffer into the network card's buffer. This transfer is again carried out by DMA.

This way of moving data is expensive: a simple send requires four transfers. Worse, two of the four, from the kernel read buffer to application memory and from application memory to the Socket buffer, just shuffle the same data around inside memory, which is inefficient.

In a scenario like Kafka's, most of the hardware resources end up being spent on moving data around, so we want to move it as little as possible. What Kafka does is cut the number of data transfers from four to two, with all of them done by DMA and none by the CPU.

    @Override
    public long transferFrom(FileChannel fileChannel, long position, long count) throws IOException {
        return fileChannel.transferTo(position, count, socketChannel);
    }

This is from the Kafka source code, and it ultimately calls the transferTo method of FileChannel in the Java NIO library. The data is never read into intermediate application memory; it is written straight to the target device through the Channel. Nor is it copied into a user-level Socket buffer: based on the socket descriptor, the data is written directly into the network card's buffer. So in this whole process we perform only two data transfers, and the extra stops in memory are eliminated.

  1. The first transfer uses DMA to read the data from the hard disk directly into the operating system kernel's read buffer.
  2. The second transfer, guided by the socket descriptor information, writes the data directly from that read buffer into the network card's buffer.

As a result, the same piece of data is transferred twice instead of four times, and none of it is carried by the CPU; everything goes through DMA. Because the data is never "copied" at the memory level, this approach is known as zero copy.
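
A self-contained sketch of this zero-copy path using the standard Java NIO API. The file name, host and port are made up for the example; the point is that the bytes never enter our application's heap.

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class ZeroCopySend {
        public static void main(String[] args) throws IOException {
            try (FileChannel file = FileChannel.open(Path.of("data.log"), StandardOpenOption.READ);
                 SocketChannel socket = SocketChannel.open(new InetSocketAddress("example.com", 9092))) {
                long position = 0, remaining = file.size();
                while (remaining > 0) {
                    // transferTo lets the kernel (and DMA) move the bytes; no user-space copy is made
                    long sent = file.transferTo(position, remaining, socket);
                    position += sent;
                    remaining -= sent;
                }
            }
        }
    }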

There is an article on IBM Developer Works whose author wrote a program to measure the performance gain of zero copy on the same hardware; I have put a link to the article here. Its conclusion is that, regardless of the size of the transfer, zero copy reduces the time needed to move the same data by about 65%, greatly improving the machine's throughput. If you want to learn more about zero copy, it is worth reading carefully.

conclusion

It would be particularly wasteful to keep the CPU busy doing all the data transfers. For one thing, moving data barely uses the CPU core's "computing" ability; for another, the CPU runs much faster than I/O devices. So we want to take this load off the CPU.

That is why engineers put a DMAC coprocessor chip on the motherboard. With it, the CPU simply tells the DMAC what data to transfer, where it comes from, and where it goes, and can then safely walk away; the actual transfer is handled by the DMAC. As the number of peripherals in modern computers grew, one general-purpose DMAC chip was no longer enough, so DMAC chips were added to the peripherals themselves, leaving the CPU even less involved in data movement.

In real system development, making good use of the DMA transfer mechanism can likewise greatly improve I/O throughput. Kafka is the prime example.

Traditionally, reading data from the hard disk and sending it out through the network card requires four transfers. Two of them, between the hardware devices and buffers in memory, cannot be avoided; but the other two were pure in-memory copies done entirely by the CPU.

In Kafka, calling FileChannel's transferTo method from Java NIO avoids copying the data into our application's memory: via DMA, the data goes from the kernel buffer straight to the network card's buffer. With this zero-copy approach, the time to transfer the same data drops to roughly one third of what it was, which is equivalent to about a three-fold increase in throughput.

This is why Kafka is now the standard solution for real-time data transmission pipelines.

Data integrity – single-bit flips

Single-bit flip: for example, a piece of data in memory whose binary representation is 00100100 becomes 00110100 after a single-bit flip.

Single-bit flips or errors in memory are not a particularly rare phenomenon. Whether caused by leakage due to manufacturing quality or by external radiation, there is always some probability of a single-bit error. At the memory level the software engineer cannot perceive the error, and it tends to strike at random.

Solutions:

Parity: treat N bits in memory as a group; a common choice is 8 bits, one byte. Then use one extra bit to record whether those eight bits contain an odd or an even number of 1s: if the count of 1s is odd, the extra bit is set to 1; if it is even, the extra bit is set to 0. This extra bit is called the check (parity) bit.
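
A minimal sketch of computing and checking such a parity bit for one byte, which also shows the weakness discussed next: two flips cancel out.

    public class ParityBit {
        // Parity bit as described above: 1 if the byte has an odd number of 1s, else 0.
        static int parity(int b) {
            return Integer.bitCount(b & 0xFF) % 2;
        }

        public static void main(String[] args) {
            int data = 0b00100100;             // two 1s -> stored parity bit is 0
            int stored = parity(data);
            int oneFlip = data ^ 0b00010000;   // single-bit flip: 00110100
            System.out.println(stored == parity(oneFlip));   // false: the error is detected
            int twoFlips = oneFlip ^ 0b00000001;
            System.out.println(stored == parity(twoFlips));  // true: an even number of flips slips through
        }
    }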

As shown, if a single-bit flip occurs in this byte, the parity recomputed from the data no longer matches the stored check bit, so we know the data is wrong. But this approach has its drawbacks.

  1. Parity can only detect an odd number of flipped bits; if an even number of bits flip, the check still passes and the error goes unnoticed.
  2. It can only detect errors, not correct them. Even if we discover that the data in memory is wrong, all we can do is abort the program; we cannot let it keep running correctly.

So a solution that remedies both problems appeared: ECC memory, whose full name is Error-Correcting Code memory. As the name implies, when an error occurs in memory, it can be corrected. And it can detect more kinds of errors as well.

It can not only catch errors but also correct them; such a scheme is called an error-correcting code. There is also an upgraded version called erasure codes, which not only correct errors but, when an error cannot be corrected, simply discard the bad data. ECC memory, network transmission, and even hard-disk RAID all rely on techniques related to error-correcting codes and erasure codes.

At real data volumes and loads, single-bit flips are actually a fairly high-probability fault if ECC is not used.


According to the teacher's description, the probability of a single-bit flip is not low, yet most PCs do not use ECC. Why do we seldom hear of bugs caused by this problem on PCs?

The author replies to You Ming:

Hello, there are two situations:

  1. The first is that the actual load on a PC is much lower than on a server. Most of the time your PC sits idle, with low CPU and memory usage and nothing much to compute, while servers often run at high load around the clock. A server may do more computation in a day than your PC does in a year, and a data center may run a thousand servers at once, so a problem a PC might hit once in its lifetime shows up in a data center all the time.

  2. The second is that a lot of the time you simply do not notice: the program crashes, the machine blue-screens and restarts, or some data comes out wrong, and you never find out which of those was caused by a single-bit flip.

Reference: Can a “single bit error” really crash an entire program or even a computer system? What error detection and correction mechanisms do modern computers have?

Data integrity – Hamming code

The Hamming code fixes the shortcoming of the error-detecting code above: it can not only detect that an error occurred but also locate which bit is wrong, which is why it is called an error-correcting code. Error correction requires more redundancy, but it tells you where the data went wrong and lets you repair it directly. ECC memory uses Hamming codes to correct errors.

The most basic Hamming code is called the 7-4 Hamming code. The "7" refers to the actual payload: seven data bits. The "4" means we store 4 extra bits used for error correction. The correction ability of an error-correcting code is limited, though: the 7-4 Hamming code can correct only a single erroneous bit.

A 4-bit check code can represent 16 different values. The check code computed from the data bits is fixed, so if a data bit is wrong, the recomputed check code must differ from the stored one, and the difference falls into one of the remaining 2^4 - 1 = 15 possible values.

Those 15 possible values are enough to tell apart all the bits that could have gone wrong: here there are 7 data bits plus 4 check bits, that is, 11 possible single-bit error positions, and 11 <= 15. Three check bits would not be enough: they give only 2^3 - 1 = 7 distinguishable mismatch patterns, while 7 data bits plus 3 check bits make 10 possible error positions, so we could not determine which bit is wrong. In general, the following inequality has to be satisfied.

    K + N + 1 <= 2^N

where the data has K bits and the check (parity) code has N bits.

With 7 data bits, K = 7, the minimum value of N is 4. Conversely, 4 parity bits can support up to 11 data bits. Below is the comparison table between data bits and parity bits.
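
A quick sketch that derives the minimum number of check bits N from the inequality above:

    public class ParityBitsNeeded {
        // Smallest N such that K + N + 1 <= 2^N.
        static int parityBits(int dataBits) {
            int n = 1;
            while (dataBits + n + 1 > (1 << n)) n++;
            return n;
        }

        public static void main(String[] args) {
            System.out.println(parityBits(4));   // 3  (the 4-3 example below)
            System.out.println(parityBits(7));   // 4  (the 7-4 Hamming code)
            System.out.println(parityBits(11));  // 4  (4 check bits cover up to 11 data bits)
        }
    }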

Error correction principle of Hamming code

To see how the encoding works, take a smaller example, a 4-3 Hamming code (4 data bits and 3 check bits). Call the four data bits d1, d2, d3 and d4, and the three check bits p1, p2 and p3.

From the 4 data bits, we leave one out each time and compute a corresponding check bit, using the parity method described earlier. For example, p1 is computed from d1, d2 and d4; p2 from d1, d3 and d4; and p3 from d2, d3 and d4, as in the table below:

Now if d1 is wrong, the recomputed p1 and p2 will both differ from the stored values; if d2 is wrong, p1 and p3 differ; if d3 is wrong, p2 and p3; and if d4 is wrong, p1, p2 and p3 all differ. In other words, whenever a data bit is wrong, at least two of the check bits come out inconsistent.

Conversely, if a check bit itself is wrong, say p1, then only p1's check fails; p2 and p3 still come out right, so exactly one check bit is inconsistent.

So the possible patterns of inconsistent check bits number 2^3 - 1 = 7, and they correspond exactly to the 7 possible single-bit errors (4 data bits plus 3 check bits).

Generation rule of Hamming code

First, determine how many bits the encoded data will have. For example, our 7-4 Hamming code has 11 bits in total.

We then number these 11 positions from left to right, 1 through 11, and also write out each position number in binary.

Next, find the positions whose numbers are integer powers of 2; in this 7-4 Hamming code they are 1, 2, 4 and 8. These positions hold our check bits, which we record as p1 to p4. Looking at their binary representations, they are the only four of the 11 position numbers with exactly one bit set to 1.

The remaining seven positions are our data bits d1 to d7.

Then, for each check bit, we again use parity, but not over all seven data bits. p1 covers positions 3, 5, 7, 9 and 11, that is, the positions whose binary representation has a 1 in the first bit counted from the right.

Likewise, p2 is computed from positions 3, 6, 7, 10 and 11, the positions whose second bit from the right is 1. p3 naturally covers the positions whose third bit from the right is 1, and p4 those whose fourth bit is 1. As shown in the figure:

At this point you can see that if any data bit is wrong, at least two or three of the corresponding check bits will fail to match, and from that combination we can work backwards to find exactly which data bit is wrong. If a check bit itself is wrong, only that one check fails, and we know it is the check bit that is broken.
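
A sketch of these generation rules for the 11-bit layout above, with check bits at positions 1, 2, 4 and 8. Even parity is assumed; the encoding and the syndrome-based correction follow directly from the grouping just described.

    public class Hamming11 {
        // Encode 7 data bits (d1..d7, values 0/1) into an 11-bit codeword; check bits sit at 1, 2, 4, 8.
        static int[] encode(int[] data) {
            int[] code = new int[12];                 // positions 1..11; index 0 unused
            int d = 0;
            for (int pos = 1; pos <= 11; pos++) {
                if (Integer.bitCount(pos) != 1) code[pos] = data[d++];   // non-power-of-two -> data bit
            }
            for (int p = 1; p <= 8; p <<= 1) {        // p = 1, 2, 4, 8
                int parity = 0;
                for (int pos = 1; pos <= 11; pos++) {
                    if (pos != p && (pos & p) != 0) parity ^= code[pos]; // positions this check bit covers
                }
                code[p] = parity;                      // even parity over the covered positions
            }
            return code;
        }

        // Returns the position of the single wrong bit (0 if all checks pass) and repairs it in place.
        static int correct(int[] code) {
            int syndrome = 0;
            for (int p = 1; p <= 8; p <<= 1) {
                int parity = 0;
                for (int pos = 1; pos <= 11; pos++) {
                    if ((pos & p) != 0) parity ^= code[pos];
                }
                if (parity != 0) syndrome |= p;        // the failing checks add up to the error position
            }
            if (syndrome != 0) code[syndrome] ^= 1;    // flip the bad bit back
            return syndrome;
        }

        public static void main(String[] args) {
            int[] code = encode(new int[]{1, 0, 1, 1, 0, 0, 1});
            code[5] ^= 1;                              // simulate a single-bit flip at position 5
            System.out.println(correct(code));         // prints 5 and repairs the codeword
        }
    }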

Hamming distance

Another way to understand what the Hamming code does: for two pieces of binary data, the number of bit positions in which they differ is called the Hamming distance. For example, the Hamming distance between 1001 and 0001 is 1, because they differ only in the leftmost bit; the Hamming distance between 1001 and 0000 is 2, because they differ in both the leftmost and the rightmost bit.
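
A one-liner makes this concrete: XOR the two values and count the differing bits.

    public class HammingDistance {
        static int distance(int a, int b) {
            return Integer.bitCount(a ^ b);   // number of differing bit positions
        }

        public static void main(String[] args) {
            System.out.println(distance(0b1001, 0b0001));  // 1
            System.out.println(distance(0b1001, 0b0000));  // 2
        }
    }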

So "single-bit correction" means that any received word whose Hamming distance from a valid codeword is 1 can be corrected back to that codeword.

And any two valid codewords that we might actually transmit are at a Hamming distance of at least 3 from each other.

Why can't the distance be 2? Because then a word with one wrong bit could sit at distance 1 from both correct codewords, and when we saw the wrong word we would not know which one to correct it to.

With a distance of at least 3, a single-bit error is always closer to its original codeword than to any other, so we can correct the received word back to the original value.

conclusion

The Hamming code looks simple, yet we still use this very technique in ECC memory today, and Hamming received a Turing Award in part for his work on error-correcting codes.

By adding several redundant check bits to the data, a Hamming code can not only detect errors but also correct the faulty bit as long as only a single bit has failed. When reasoning about Hamming codes, it is important to remember that not only the original data bits can go wrong; the newly added check bits can suffer single-bit flips too. That is why 3 check bits are not enough for 7 data bits and 4 are needed.

The actual encoding process is not complicated: we pair different check bits with different, overlapping groups of data bits, so that any single data-bit error produces a unique combination of failing check bits. When something goes wrong we can then work backwards, find the faulty data bit, and correct it; and when exactly one check bit fails, we know it is that check bit itself that is wrong.

Distributed computing

If we only have one computer in the data center, we run into three core problems:

  1. The choice between vertical scaling and horizontal scaling
  2. How to maintain high availability
  3. The consistency problem

Horizontal scaling

Suppose you rent a server from a cloud vendor to build a website of your own and serve users: one CPU core, 3.75 GB of memory, and a 10 GB SSD system disk. Such a server costs about $28 a month.

Then the traffic starts to pick up, the server begins to struggle, and you need to upgrade. You have two options:

  1. The first option is to upgrade the current server's hardware to 2 CPU cores and 7.5 GB of memory. This choice is called vertical scaling (scaling up).
  2. The second option is to rent a second server identical to the first, so that we have two servers, each with one CPU core and 3.75 GB of memory. This choice is called horizontal scaling (scaling out).

At this stage there seems to be no cost difference between the two options: a 2-core, 7.5 GB server costs $56.61 a month, while two 1-core, 3.75 GB servers cost $57, a difference of less than 1%. Vertical scaling and horizontal scaling look like two different options, but as traffic keeps growing there is really only one ending: you scale vertically for a while, and ultimately you must rely on horizontal scaling to support Internet services on the scale of Google, Facebook, Alibaba or Tencent.

The logic and appeal of vertical scaling are simple: it usually does not require us to change our programs, so there is no development cost. The reason we end up scaling horizontally is just as simple: we cannot keep scaling vertically forever. The largest server we can buy on Google Cloud right now has 96 CPU cores and 1.4 TB of memory; if traffic grows to the point where even a 96-core machine cannot cope, there is nothing bigger to move to, and we have to adopt horizontal scaling.

However, once horizontal scaling begins, we face adaptation work at the software level, and that is where distributed computing starts. We need to introduce components such as load balancers to distribute traffic, we need to split the application servers from the database servers (vertical, functional sharding), and we need message queues between different applications to run tasks asynchronously.

All of these software changes share the heart of distributed computing: getting multiple computers to cooperate on a task through message passing rather than shared memory. And because we will inevitably have to scale horizontally (and split things apart eventually), we should design the system around message passing rather than shared memory early on, even while the messages are still being delivered within a single server.

High availability – Single point of failure

Benefits of horizontal scaling:

  1. From a development perspective, it "forces" the system to support horizontal scaling as early as possible, so the architecture does not fall behind when real traffic starts growing rapidly.
  2. It is what guarantees the availability of the system.

With the vertical scaling from 1 core to 2 cores above, after scaling we still have only one server; if that server suffers even a small hardware failure, say the CPU dies, the whole system goes down and becomes unusable. With horizontal scaling, even if one server's CPU fails, we still have another server providing service. The load balancer's health checks can detect that the failed server has stopped responding and automatically switch all traffic to the second server, an operation known as failover, and the system keeps working.

System availability is the percentage of time our system is able to provide service. Whether it is a hardware or software failure, or the system being shut down for an upgrade, all of it costs availability. Availability is usually expressed as a percentage such as 99.99%: saying the system should be 99.99% available per month means the service must not be down for more than 4.32 minutes in a month.

Some availability loss is planned for, such as the downtime upgrades mentioned above; this is called planned downtime. Other losses are unplanned, such as a server's hard drive suddenly failing; this is called unplanned downtime.

Our systems can never reach 100% availability, especially against unplanned downtime: anything from a simple hardware failure to a machine-room power outage, a dug-up fiber cable, or natural disasters such as earthquakes, floods and tsunamis can make the system unusable. As engineers and architects, our job is to raise the system's availability as cheaply as possible.

Individual servers today are already quite reliable, typically 99.99% available. Suppose we have a small three-server system: one running Nginx as load balancer and reverse proxy, one running PHP-FPM as the web application server, and one running MySQL as the database server. If each server is 99.99% available, the overall availability is 99.99% × 99.99% × 99.99% = 99.97%. The numbers look close, but seen the other way around, we have gone from 0.01% unavailability to 0.03%, so the unavailable time has tripled. With 1,000 servers, the overall availability would be 99.99% ^ 1000 = 90.5%, which means the service is unavailable for more than a month of the year.
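
The availability arithmetic above is easy to reproduce; the 99.99% per-server figure is the assumption used in the text.

    public class Availability {
        public static void main(String[] args) {
            double perServer = 0.9999;                       // 99.99% per server
            System.out.println(Math.pow(perServer, 3));      // ~0.9997: three servers that must all work
            System.out.println(Math.pow(perServer, 1000));   // ~0.905: a chain of 1,000 servers
        }
    }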

In this setup, any single server failing makes the whole system unusable. This problem is called a single point of failure.

To fix single points of failure, the first step is to remove the single point. Horizontal scaling means having two servers provide the same function and distributing traffic between them with a load balancer; even if one goes down, the other can still serve. But two servers alone are not enough: single points lurk everywhere in a data center. Our two machines are virtual machines in the cloud, and if they are hosted on the same physical machine, that physical machine becomes a single point, so we need to place them on different physical machines. Even that is not enough: if the two physical machines sit in the same rack, the rack's switch becomes a single point. And even on different racks, there is still a chance the entire data center suffers an unexpected failure.

Beyond that, the data center hosting our cloud servers could shut all its machines down because of, say, a cooling failure. To survive that, we need to design and deploy an active-active setup across geographically separate sites. This is why, when you buy a server from a modern cloud provider, you can choose its region and availability zone; spreading servers across different zones or regions is an important way to avoid single points of failure.

Removing single points alone does not solve the availability problem, though. Say we use a load balancer to split traffic evenly across two servers: when one application server fails, we do still have a server that can serve, but the load balancer would keep sending half of the traffic to the dead server, so the service is only half available. To make the whole service fully available we need a failover mechanism, and to fail over we first need to be able to detect the failure.

In the case of our PHP-FPM web application, the load balancer usually polls a health-check URL exposed by the application, perhaps every 5 seconds. If the health check fails two or three times in a row, the load balancer automatically switches that server's traffic to the other server, giving us automatic failover. Automating failover matters in large systems: the more servers you have, the more often something fails. Automated failover not only reduces the operations manpower required but also shortens the time from detecting a fault to resolving it, which directly improves availability.

For example, you can expose a Heartbeat endpoint on your web application, check it every 20 seconds or so, and fail over when something goes wrong.
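
A toy sketch of the failover rule described above: mark a backend out of rotation after a few consecutive failed health checks. The URL, interval and failure threshold are placeholders, not a real load balancer's configuration.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    public class HealthChecker {
        private static final int MAX_FAILURES = 3;    // fail over after 3 consecutive failures
        private int consecutiveFailures = 0;
        private volatile boolean inRotation = true;

        void checkOnce(HttpClient client, String healthUrl) {
            try {
                HttpResponse<Void> resp = client.send(
                        HttpRequest.newBuilder(URI.create(healthUrl))
                                   .timeout(Duration.ofSeconds(2)).build(),
                        HttpResponse.BodyHandlers.discarding());
                if (resp.statusCode() == 200) { consecutiveFailures = 0; inRotation = true; return; }
            } catch (Exception ignored) { }
            if (++consecutiveFailures >= MAX_FAILURES) inRotation = false;  // stop sending traffic here
        }

        public static void main(String[] args) throws InterruptedException {
            HealthChecker checker = new HealthChecker();
            HttpClient client = HttpClient.newHttpClient();
            while (true) {                                                  // e.g. every 5-20 seconds
                checker.checkOnce(client, "http://10.0.0.2/heartbeat");     // placeholder backend address
                System.out.println("backend in rotation: " + checker.inRotation);
                Thread.sleep(5_000);
            }
        }
    }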

Thus, for a system that scales a single point out into two servers providing the same function and triggers automatic failover through health checks, availability becomes 100% - (100% - 99.99%) × (100% - 99.99%) = 99.999999%, cutting the unavailable time by a factor of 10,000. In practice, availability is never quite that ideal: on the hardware side, from the servers to the switches, from the network cables to the machine-room power supply, from the room's overall cooling to the external fiber lines, many things can go wrong. That is exactly why high availability has to be designed at the system level; once the machine count grows, every one of these steps makes a big difference.

conclusion

Horizontal scaling and availability are still not the whole story. Once the system contains many servers, and in particular once we scale out stateful components such as databases with identical functionality to ensure availability, we face a new challenge: consistency across partitions. That, however, is more of a software design problem.

We can scale service capacity vertically by upgrading hardware specifications, or increase capacity by adding more servers. But in the end we must take the path of horizontal scaling.

Partly because vertical scaling cannot go on forever, and partly because only horizontal scaling can guarantee high availability. Achieving high availability through horizontal scaling requires three things. First, understand how availability is calculated: broken server hardware is only one source of availability loss; power, cooling, switches and network cables in the machine room can all cause it, and external fiber cuts or natural disasters can take the whole system down. Second, when analyzing and designing the system, eliminate single points of failure as much as possible. Third, have an automated failover strategy for hardware failures. With all of these in place, we can finally breathe a sigh of relief and sleep soundly under massive load and traffic.