Summary: For a data storage system, ensuring that data is neither lost nor corrupted is the bottom line, and also the hardest part to get right. According to statistics, 93% of companies that lose their data center for 10 days go bankrupt within a year. So what measures can we take to make sure data is not lost?

Author: Chongzhi | Source: Ali Tech official account

I. Background

For a data storage system, ensuring that data is neither lost nor corrupted is the bottom line, and also the hardest part to get right. Imagine that your bank account balance is 10,000 and, because of a storage system anomaly or a bit flip, the record is lost or turned from 10,000 into 0. The impact is fatal. According to statistics, 93% of companies that lose their data center for 10 days go bankrupt within a year.

The industry uses the terms Data Integrity and Data Corruption to describe such problems; they cover not only data errors but also problems arising during data storage, transmission, and other processes. To ensure a consistent understanding, let us first define "data is not lost" and "data is not corrupted":

  • Data is not lost means the content still exists. Counterexamples: part or all of a 100 MB file is missing; or part or all of the file's metadata is missing, typically a field such as the file creation time.
  • Data is not corrupted means the content exists and is also correct. Counterexamples: the 100 MB file exists in full, but some or all of its bytes differ from the original data (for example, 10,000 is wrongly stored as 0); or part or all of the file's metadata is wrong. In a storage system, data is represented as 0s and 1s, so a data error is a bit flip: a bit goes from 0 to 1, or from 1 to 0.

Data Consistency is a related but stricter term: data loss or corruption can cause consistency problems, but data that is neither lost nor corrupted does not by itself guarantee consistency, because consistency also depends on business-logic design, such as the consistency requirement of database transaction ACID, which is usually logical correctness of the data. This article focuses on the causes of data faults and on how to design a storage system to prevent and control them; database transactions are not discussed in depth.

1 Common bit flips in disks, memory, and networks

A computer system, whether computing or storing, whether built from electronic or mechanical parts, works in binary 0s and 1s, so bit flips are always possible. The key to keeping data intact is therefore to guard against bit flips.

Disk bit flips. Both HDDs and SSDs contain storage media and data-read paths, and bit flips may occur at either the media level or the read path.

  • To detect bit flips at the media level, extra space is usually added to store check bits. For example, an HDD extends the 512-byte block to 520 bytes, adding an 8-byte DIF (Data Integrity Field) whose 2-byte Guard field stores a CRC16 value computed over the 512-byte block content (a sketch follows this list).
  • To detect bit flips in the read path, CRC is used to check the external cable and interface layer, while ECC is used to check the internal read/write components. Accordingly, S.M.A.R.T. information provides fields such as UltraDMA CRC Error Count, Soft ECC Correction, and Hardware ECC Recovered to count these errors.
  • Memory bit flips. Memory, as an electronic device, is vulnerable to interference such as signal crosstalk and cosmic rays, which can flip bits; ECC (Error Correction Code) memory was introduced for this reason.
  • Network bit flips. A network card is a transmission device; problems in the cable, interface, or internal components can also flip bits in transit, so network transmission usually adds a checksum to detect them.
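To make the DIF idea concrete, here is a minimal Python sketch of the 520-byte layout described above. It is an illustration only, not drive firmware: it assumes the standard T10 DIF field order (2-byte Guard CRC, 2-byte Application Tag, 4-byte Reference Tag carrying the low bits of the LBA) and the CRC-16/T10-DIF polynomial 0x8BB7.

import struct

def crc16_t10_dif(data):
    # Bitwise CRC-16/T10-DIF: polynomial 0x8BB7, init 0, no reflection.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def format_sector(block, lba):
    # Extend a 512-byte block to 520 bytes: data + 8-byte DIF.
    assert len(block) == 512
    guard = crc16_t10_dif(block)
    return block + struct.pack(">HHI", guard, 0, lba & 0xFFFFFFFF)

def verify_sector(sector, expected_lba):
    # Both the CRC guard and the reference tag (LBA) must match.
    block, dif = sector[:512], sector[512:]
    guard, _app, ref = struct.unpack(">HHI", dif)
    return guard == crc16_t10_dif(block) and ref == (expected_lba & 0xFFFFFFFF)

sector = format_sector(b"\x00" * 512, lba=12345)
assert verify_sector(sector, 12345)
flipped = bytes([sector[0] ^ 0x01]) + sector[1:]   # simulate a media bit flip
assert not verify_sector(flipped, 12345)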

These are the typical bit-flip scenarios. Disk and memory errors in particular are usually discovered only at access time; until then, the data is already wrong without anyone knowing, which is why the industry calls them "Silent Data Errors" (SDE).

2 Hidden CPU SDE

Beyond the silent data errors caused by bit flips in storage hardware, the CPU, the core computing component, has a similar problem. The CPU also contains registers, caches, transmission buses, and other electronic components; although detection mechanisms are built in, there is still a probability of error. CPU errors typically fall into three classes:

  • Hardware-detected errors. Through the CPU's internal hardware design, errors can be detected and in some cases corrected automatically, with little or no impact on the system.
  • User-visible errors. In some cases the hardware can detect errors but cannot correct them; these errors must be surfaced to the user, typically as crashes or downtime.
  • Silent data errors. These are neither detected by the hardware nor reported to the operating system; the wrong data is simply written to memory by the CPU, so there is no way to know an error occurred.

Silent data errors in the CPU are fatal, because from the program's point of view the CPU returned a result that was then stored and processed, while in reality the data was wrong, unknown to the business and without any warning. Using data-validation features and a CPU SDE checking tool, the team found several SDE errors; Google and Facebook have published papers describing the same problem, which is now an industry-wide challenge.

II. CPU SDE Fault Discovery Process

1 Finding the problem

Recently, two core modules developed by the team reported data-checksum anomalies on a particular server in one cluster. Since both core modules hit the problem at the same time, and software causes had been ruled out, the root-cause investigation shifted to hardware.

2 Analysis and locating the fault

The idea was to repeatedly compute the MD5 of a fixed piece of data and compare it against the known-correct MD5 value, to see whether the CPU ever returned a wrong result:

$ pwd
/dev/shm
$ cat t.py
import os
import sys
import hashlib

data = open("./data").read()
hl = hashlib.md5()
hl.update(data)
digest = hl.hexdigest()
print "digest is %s" % digest
if digest != "a75bca176bb398909c8a25b9cd4f61ea":
    print "error detected"
    sys.exit(-1)

Running this test on the failing machine showed that the MD5 value computed by the CPU was occasionally wrong, while the operating system reported no exception at all; the problem was thus identified as CPU-related.
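As an aside, one plausible way to narrow such a fault down to a specific core is to pin the same check to each core in turn. The sketch below (Python 3) is an illustration under that assumption, not the team's actual tooling: os.sched_setaffinity is Linux-only, the digest constant is the one from the script above, and the iteration count is arbitrary.

import hashlib
import os

DATA = open("./data", "rb").read()
GOOD = "a75bca176bb398909c8a25b9cd4f61ea"   # known-good digest from above

for core in sorted(os.sched_getaffinity(0)):
    os.sched_setaffinity(0, {core})          # run the loop body on one core
    bad = sum(1 for _ in range(10000)        # iteration count chosen arbitrarily
              if hashlib.md5(DATA).hexdigest() != GOOD)
    if bad:
        print("core %d: %d/10000 wrong digests" % (core, bad))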

3 Manufacturer confirmation

Communication with the CPU manufacturer confirmed that a hardware fault had occurred in one of the CPU's cores, causing this anomaly, and the following solutions were discussed:

  • Short-term solution. The manufacturer provides a tool for rapid online screening of similar faults. The team finishes testing the detection tool (mainly covering the typical CPU instruction set) and, once it is verified against the business, rolls it out quickly.
  • Long-term solution. The manufacturer provides tooling for online monitoring of all types of silent-data-error faults, and root causes and optimization measures are discussed in detail.

III. Designing a Storage System That Neither Loses nor Corrupts Data

1 Systematic thinking about not losing or corrupting data

To better control the risk of data loss and corruption in a data storage system, we analyze it along the dimensions shown in the figure above.

Failure mode

From the perspective of the fault source, failure modes divide into hardware faults and software faults.

Time of failure

Regarding when the failure occurs, there are two situations:

  • On the Fly. Data errors that occur during reads and writes, such as CPU computation, memory access, network transfer, or disk reads.
  • At Rest. Bit flips caused by component aging, environmental effects, cosmic rays, and other factors after the data has been written to the medium.

Fault data type

From the perspective of the affected data type, the victim may be metadata or data. Most systems contain both, down to the underlying hard disk that holds configuration metadata, so both are exposed to errors.

Fault detection method

At sufficient scale, data errors will occur sooner or later, with 100% certainty; there is no room for luck. The fault detection method is therefore critically important: the faster data errors are detected, the more opportunities there are to repair the data.

To detect faults, the industry provides the following typical checksum and error-correction algorithms:

  • XOR algorithm. By computing a check value with binary XOR (exclusive or), single-bit flip errors can be detected (a sketch follows this list).
  • CRC (Cyclic Redundancy Check) algorithm. From n data bits, a polynomial computation derives k check bits, enabling error detection and, in some schemes, limited correction. It is widely used to verify data in transmission and data stored on disk.
  • LDPC (Low-Density Parity Check) algorithm. A class of linear codes defined by a check matrix. To keep decoding feasible at long code lengths, the check matrix must be "sparse": the number of 1s in it must be far smaller than the number of 0s, and the longer the code, the lower the density. LDPC is used at scale in SSD storage.
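To make the XOR idea concrete, here is a minimal sketch of parity over a stripe of equal-sized blocks, the same principle RAID parity builds on. It is illustrative only: detection works by re-XORing everything to zero, and repair assumes exactly one block is known to be lost.

def xor_parity(blocks):
    # Bytewise XOR of all blocks; doubles as check value and repair source.
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

blocks = [b"\x10\x20", b"\x0f\x0f", b"\xaa\x55"]
parity = xor_parity(blocks)

# Detection: data plus parity must XOR to all zeros; anything else
# means some bit flipped somewhere in the stripe.
assert xor_parity(blocks + [parity]) == b"\x00\x00"

# Repair: XOR of the surviving blocks and parity rebuilds the lost one.
assert xor_parity([blocks[0], blocks[2], parity]) == blocks[1]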

Beyond these basic algorithms, each layer can add detection tied to its own business logic, verifying correctness according to its data structures and algorithms.

Fault repair solutions

Typical data error repair solutions fall into the following two categories:

  • Repair from data redundancy. For example, repair based on redundant encoding, such as replicas or erasure codes.
  • Repair from data backup. Data is saved at points in time, as incremental or full backups. When a data error is detected, the data can be restored to a point in time, though not to the very latest state.

2 Typical hardware data error and repair scenarios

Memory data error

  • Data error mode. A single-bit error, i.e., a bit flip, occurs.
  • Moment of failure. Usually when memory is accessed, typically during memory reads, writes, and copies.
  • Fault data type. The data hit by the flipped bit may be business metadata or business data.
  • Fault detection method. Use ECC memory, which detects errors with a Hamming-code-like scheme of data bits plus check bits (see the sketch after this list).
  • Fault repair solution. ECC typically fixes single-bit errors automatically, without application awareness; multi-bit errors can be detected but not fixed, in which case the operating system handles the resulting memory MCE (Machine Check Exception) error.
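ECC DIMMs do this correction in hardware; as a software illustration of the Hamming idea mentioned above, here is a minimal Hamming(7,4) encoder/decoder that corrects any single flipped bit. It is a teaching sketch, not the SEC-DED code real modules use.

def hamming74_encode(d):
    # Place 4 data bits at positions 3,5,6,7 and parity bits at 1,2,4.
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4          # covers positions 2,3,6,7
    p4 = d2 ^ d3 ^ d4          # covers positions 4,5,6,7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(c):
    # Recompute parities; the syndrome is the 1-based error position.
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:               # 0 means no error detected
        c[syndrome - 1] ^= 1   # flip the bad bit back
    return [c[2], c[4], c[5], c[6]]

code = hamming74_encode([1, 0, 1, 1])
code[5] ^= 1                   # simulate a single-bit memory error
assert hamming74_decode(code) == [1, 0, 1, 1]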

Network card data error

  • Data error mode. Components inside the network card malfunction or the network cable is loose, producing data errors.
  • Moment of failure. While data is transmitted over the network, typically as the NIC receives and sends packets.
  • Fault data type. The data hit by the flipped bit may be business metadata or business data.
  • Fault detection method. Each layer of the network protocol stack adds a checksum to its packets and detects errors by verifying it (a sketch of one such checksum follows this list).
  • Fault repair solution. Network protocols usually discard a corrupted packet and retransmit it.
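As one concrete instance of per-layer checksums, here is a sketch of the 16-bit ones'-complement checksum defined in RFC 1071, which IP/TCP/UDP headers use. Real stacks compute this in the kernel or offload it to the NIC; the payload here is made up.

def internet_checksum(data):
    # RFC 1071: ones'-complement sum of 16-bit words, then complemented.
    if len(data) % 2:
        data += b"\x00"                           # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return ~total & 0xFFFF

packet = b"hello, network"
c = internet_checksum(packet)
# Receiver side: summing the data plus its checksum must yield zero.
assert internet_checksum(packet + bytes([c >> 8, c & 0xFF])) == 0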

Disk data error

  • Data error mode. Besides the disk interface transmission, which is protected by CRC, the disk storage media itself can suffer bit-flip errors.
  • Moment of failure. Data already stored on the media suffers a silent data error that is only noticed when the data is read.
  • Fault data type. The data hit by the flipped bit may be business metadata or business data.
  • Fault detection method. When data is stored on the media, each unit (such as a 512-byte sector) additionally stores CRC check data and LBA (Logical Block Addressing) information. This makes it possible to check whether the data is wrong and whether it was written to the wrong LBA (typically a firmware error). Background data scans surface these errors.
  • Fault repair solution. The business software layer can maintain redundancy across blocks, such as replicas or erasure codes, so that correct redundant data can repair the error (a sketch follows this list).
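A sketch of that repair path, under simplifying assumptions: three in-memory replicas, CRC32 as the per-replica check, and a clear majority of healthy copies. A real system scrubs across disks and machines and also supports erasure-coded repair.

import zlib

def scrub(replicas):
    # Compare replica checksums; rewrite any minority copy from a
    # replica whose checksum matches the majority.
    digests = [zlib.crc32(r) for r in replicas]
    good = max(set(digests), key=digests.count)
    healthy = next(r for r, d in zip(replicas, digests) if d == good)
    return [r if d == good else healthy for r, d in zip(replicas, digests)]

replicas = [b"account=10000"] * 3
replicas[1] = b"account=00000"      # silent corruption on one disk
assert scrub(replicas) == [b"account=10000"] * 3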

CPU data error

  • Data error mode. The hardest case is a CPU silent data error, for example a CRC computation that returns a wrong value while the operating system reports no exception.
  • Moment of failure. When the CPU is performing data computation.
  • Fault data type. The data hit by the error may be business metadata or business data.
  • Fault detection method. Perform redundant computation and comparison: repeat the processing on different CPU cores within a single machine, and across CPUs of different machines in a distributed system, to achieve end-to-end detection (E2E Detect), as shown in the figure below; a code sketch follows this list.

  • Fault repair solution. The upper-level business should keep the original data and restore from it once an anomaly is detected. For example, a machine that stores original data in append-write mode can restore it after a later computation or processing error. In addition, a recycle-bin mechanism can be designed so that even data deleted at the software level is retained for a period, leaving a chance to repair the damage after a software bug is found.
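In the spirit of the end-to-end detection described above, here is a sketch that runs the same digest on two different cores and compares the results; a mismatch flags a suspected CPU silent data error. It assumes Linux (for core pinning) and arbitrary core ids, and is an illustration rather than the team's production mechanism.

import hashlib
import os
from multiprocessing import Pool

def digest_on_core(args):
    core, data = args
    os.sched_setaffinity(0, {core})    # pin this worker to one core
    return hashlib.md5(data).hexdigest()

def e2e_checked_digest(data, cores=(0, 1)):
    # Redundantly compute on different cores; disagreement means one
    # core silently returned a wrong result.
    with Pool(len(cores)) as pool:
        results = pool.map(digest_on_core, [(c, data) for c in cores])
    if len(set(results)) != 1:
        raise RuntimeError("suspected CPU SDE, digests differ: %r" % results)
    return results[0]

if __name__ == "__main__":
    print(e2e_checked_digest(b"some data"))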

3 Typical software data error and repair scenarios

Impact analysis of data errors caused by software bugs

Software bugs are very serious, especially in software for distributed systems. If the business layer keeps no backup and there is only one copy of the data, a software bug that deletes data is a disaster. Business-layer software usually divides into data and metadata, which are affected to different degrees; in general, errors in metadata have the greater impact.

Data error detection for software bugs

Business software handles two types of data, distinguished by retention time:

  • Incremental data. Data newly written, changed, or deleted by the business, usually handled by the foreground logic of the business software.
  • Stock data. Data stored earlier that must be touched again because of data migration or space compaction in the business software; background logic is designed to handle it.

Error detection should therefore be targeted, along the following dimensions:

  • Incremental data processing detection. The foreground logic of the business software saves a data-update log, and the detection logic verifies whether the foreground logic has bugs by checking against that update log (sketched after this list).
  • Stock data processing detection. The background logic of the business software saves a data-change log, and the detection logic verifies whether the background logic has bugs by checking against that change log.
  • Full data detection. Because storage media can corrupt data silently, data errors can occur even when no software modifies the data, so full-scan logic must be designed to detect errors proactively.
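The sketch below illustrates the update-log idea for incremental detection, with everything simplified: an in-memory dict as the store, an in-memory list as the (normally persisted and checksummed) update log, and CRC32 as the check.

import zlib

LOG = []   # stands in for the persisted foreground update log

def logged_write(store, key, value):
    # Foreground write path: apply the change and record it in the log.
    store[key] = value
    LOG.append({"key": key, "crc": zlib.crc32(value)})

def detect(store):
    # Independent checker: verify stored data against the update log.
    # A mismatch means a software bug or silent media corruption.
    latest = {rec["key"]: rec["crc"] for rec in LOG}   # last write wins
    return [k for k, crc in latest.items()
            if k not in store or zlib.crc32(store[k]) != crc]

store = {}
logged_write(store, "obj1", b"hello")
store["obj1"] = b"hellp"            # simulate an undetected corruption
assert detect(store) == ["obj1"]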

For the design of data error detection, the following key points need to be considered:

  • Decouple the error detection module. It should be an independent, separately designed module, unrelated to the foreground and background data-processing logic, so that it can detect their bugs accurately.
  • Complete logging in the data-processing logic. Both foreground and background logic must ensure the log is complete (no data change goes unrecorded), correct (log entries carry their own checks, such as CRC), and durable (the log is not lost on power failure or other exceptions).
  • Improve detection efficiency. Besides finding bugs, the most important purpose of data error detection is to support data recovery, and faster detection gives recovery a better chance. Detection efficiency is measured as files checked per unit time (metadata detection) and bytes checked per unit time (data detection), e.g., files per day and bytes per day.
  • Metadata detection takes precedence over data detection. Given the importance of metadata, it should be checked first and quickly, so the impact of data errors can be better contained.
  • Make reasonable use of checks in the data redundancy layer. For example, full-data detection can reuse the distributed background scrub that the data redundancy layer (replicas, erasure codes) already runs: when a data error is detected on one disk, it can be compared and corrected using the redundancy at that layer.

Data repair for software bugs

If a software bug has been detected and actual data errors have occurred, data repair must be performed, with the following key points to consider:

  • Built-in data redundancy. Business software should be designed with a redundant data layer and built on redundant storage, typically distributed multi-replica or erasure-coding technology, so that data at rest (especially after storage media or firmware data errors) can be repaired at this layer.
  • Data backup. Even though business software sits on a data redundancy layer, a single file seen at the business layer is still lost if a software bug mistakenly deletes it. The business software layer should therefore design proper data backup, such as checkpoints, multi-versioning, snapshots, and backups.
  • Backup retention period. Keeping backup data long-term has a cost, so a reasonable retention period must be set to balance protection against cost.
  • Time consistency of backups. Business software may back up data at multiple layers, so each layer should expose a backup interface, with backup times set uniformly by the business layer.
  • Data recovery efficiency. Recovery tools should restore data as fast as possible, measured as files recovered per day and bytes recovered per day.
  • Unified design of error detection and backup recovery. Assuming the data backup retention period is Tb, the error detection time is Td, and the data recovery time is Tr, then Tb > Td + Tr must be guaranteed. For example, if full-data detection can take up to 60 days, backups must be retained longer than 60 days plus the recovery time.

IV. Summary

Keeping data intact requires systematic design that both guards against hardware data errors and withstands software bugs. It should therefore be built comprehensively across data redundancy, backup, error detection, data recovery, and other dimensions, as shown in the figure below.

As a data storage system, the business has done the following work based on the ideas above:

  • Distributed data redundancy configuration. Replicas or erasure codes are configured at the distributed layer, so that when a data error occurs it can be corrected from the correct data.
  • Append-write with multi-version support at the business layer for data backup. Historical backup data is kept in a recycle bin for use when a restore is needed.
  • End-to-end CRC checks. Checks run at every layer of the service: on the single machine, along the distributed computing links, on the network links, and on the disk-write path, with important metadata checked with particular care.
  • Foreground and background data-update logs. The business correctly records the request information of every data change to a log, and the log is persisted.
  • Incremental, stock, and full data detection mechanisms. Incremental data is scan-checked against the logs typically within 1 day; stock data, against the logs typically within 3 days; and full data, driven by parameter configuration, typically within 60 days.
  • Data recovery mechanism and organization. The business has a dedicated data recovery team that drills the correctness of data recovery for each version, and uses batch processing to improve recovery efficiency.

Although the business has built the prevention mechanisms above, there is still much room for improvement regarding CPU silent data errors. In theory, the probability of neither losing nor corrupting data can only approach 100%. Achieving that requires not just technical optimization but also painstaking accountability and management, and constant reverence for the data.

This article is original content from Aliyun and may not be reproduced without permission.