Introduction

Replication, partitioning, consensus, and transactions are all important techniques in distributed system design. This article discusses the reliability, scalability, and maintainability of distributed systems and describes the problems these techniques solve.

Reliability

Reliability is the ability of a system to keep working correctly under any circumstances. A system would be completely reliable if it kept working correctly in the face of any fault. In reality there are many kinds of faults, and some are hard to prevent in advance, so it is important to understand which faults can occur and how to recover from them quickly when they do. Faults generally fall into three categories: hardware faults, software faults, and human error.

Hardware faults

A server may stop working properly if any component, such as a hard disk or power supply, is damaged. Such faults are usually unavoidable, but we can make recovery fast once they occur. Whether in hardware or in software, the basic solution is redundancy. In hardware, a single machine can carry redundant components, so that when one fails it can be quickly replaced by a healthy one; this kind of redundancy does not help with failures at the data center level. In software, fast recovery can be achieved through replication: when a server's hardware fails, traffic is redirected to another copy at the software level (this still relies on redundant hardware, but the approach is more flexible). In addition to replication, user data can be partitioned to reduce the impact of a single server failure, so that the failure affects only a subset of users. Introducing replication raises a new problem, namely how to keep data consistent across multiple copies. This is the consensus problem, and algorithms such as Paxos and Raft were designed to solve it.
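The combination of partitioning and replication described above can be illustrated with a minimal sketch. The server names, the three-partition layout, and the helper functions `partition_for` and `route` are all hypothetical, chosen only for illustration; a real system would also need failure detection and a consensus protocol to keep the copies consistent.

```python
import hashlib

# Hypothetical layout: each partition is stored on two servers (two copies).
PARTITIONS = {
    0: ["server-a", "server-b"],
    1: ["server-b", "server-c"],
    2: ["server-c", "server-a"],
}


def partition_for(user_id):
    """Map a user to a partition with a stable hash of the user id."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % len(PARTITIONS)


def route(user_id, failed):
    """Return the first healthy server that holds the user's partition."""
    for server in PARTITIONS[partition_for(user_id)]:
        if server not in failed:
            return server
    raise RuntimeError("all copies of the partition are unavailable")
```

With no failures, `route("alice", set())` returns the first server listed for that user's partition. If that server is in `failed`, the request goes to the other copy, and users in partitions that do not live on the failed server are not affected at all.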

Software faults

Software faults generally refer to bugs, both in the system itself and in the services it depends on. They are also unavoidable, so measures for fast recovery are needed here as well. There are usually three methods:

  1. Adjust existing software configuration parameters to work around the problem.
  2. Restart the software or the dependent services to clear the fault.
  3. Fix the bug directly and upgrade to the new version.

When the problem is not fatal, method 1 or 2 is generally used for recovery. When the problem is serious and there is no known workaround, method 3 is generally used. Method 3 is also relatively risky, because fixing a bug may introduce new ones.

Human error

Both the software and the servers it runs on are managed by people. People make mistakes, and sometimes run a wrong command that stops the system from working properly. The most serious mistake is probably deleting the data on a server. Replication is commonly used to limit the damage, because other copies of the data remain available.

Scalability

The workload of a system is usually not static. When the workload grows, the same performance can often be maintained by adding machines, and how many machines must be added is determined by the scalability of the system: the better the scalability, the fewer machines are needed. The ideal is linear scalability, where an N-fold increase in workload requires only N times as many machines to maintain the same performance. The worst case is no scalability at all, where after an N-fold increase in workload no number of machines can maintain the same performance.
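As a rough illustration of the difference between linear and less-than-linear scalability, the arithmetic can be sketched as follows. The `efficiency` parameter is an assumption introduced here, not something from the text above: 1.0 models linear scalability, and smaller values model a system where each added machine contributes less.

```python
import math


def machines_needed(current_machines, workload_factor, efficiency=1.0):
    """Estimate machines required to keep the same performance.

    efficiency is a hypothetical knob in (0, 1]: 1.0 means linear
    scalability, lower values mean added machines are less effective.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return math.ceil(current_machines * workload_factor / efficiency)
```

For a system running on 10 machines whose workload triples, `machines_needed(10, 3)` gives 30 under linear scalability, while `machines_needed(10, 3, efficiency=0.5)` gives 60. As the efficiency approaches zero the required number of machines grows without bound, which corresponds to the case of no scalability.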

Load means different things for different systems. For infrastructure systems, reads and writes per second is a common measure, while business systems usually have their own metrics, such as transactions created per second. Performance metrics likewise differ between systems: batch systems emphasize throughput (the number of tasks completed per second), while online systems emphasize response time.
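The two performance metrics mentioned above could be computed along these lines. This is only a sketch: the function names are made up for illustration, and summarizing response time by mean, median, and maximum is one possible choice among several.

```python
import statistics


def throughput(completed_tasks, elapsed_seconds):
    """Tasks completed per second, the usual metric for batch systems."""
    return completed_tasks / elapsed_seconds


def response_time_summary(durations_ms):
    """Summarize per-request response times, the usual metric for online systems."""
    ordered = sorted(durations_ms)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "max": ordered[-1],
    }
```

A batch job that completes 1200 tasks in 60 seconds has a throughput of 20 tasks per second. For the response times 10, 20, 30, 40, and 500 ms, the mean is 120 ms but the median is only 30 ms, which shows why a single slow request can make the average misleading for an online system.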

Once a system’s workload and performance metrics are clear, we can discuss how to scale it. Scaling is usually done in one of two ways: vertical scaling, which replaces existing machines with more powerful ones, or horizontal scaling, which uses more machines.

The advantage of vertical scaling is that it has no impact on the business. The disadvantage is that more powerful machines are very expensive: you normally get what you pay for, but at the high end ten times the price may buy only twice the capability. In practice there is also always some amount of data that will not fit on a single machine, and at that point vertical scaling is no longer an option.

Horizontal scaling usually needs cooperation from the software. For a stateless system, it is usually enough to deploy the system on the new machines. For a stateful system, which generally means a storage system, the data is usually divided into partitions, so that a new machine can take over data and the corresponding workload from old machines by migrating partitions. The advantage of horizontal scaling is that it uses relatively cheap servers, which saves cost. However, it requires a lot of work at the software level, including partition management, migration, and load balancing.
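The idea of scaling out a stateful system by migrating partitions can be sketched as follows. This is a simplified illustration, assuming a fixed set of partitions and a plain dictionary that maps each machine to the partitions it holds; the function name `add_machine` and the greedy balancing rule are hypothetical, and a real system would also have to copy the data and redirect traffic.

```python
def add_machine(assignment, new_machine):
    """Move partitions from the busiest machines to new_machine.

    assignment maps machine name -> list of partition ids and is
    modified in place. Returns the list of moves that were made as
    (partition, source, destination) tuples.
    """
    assignment[new_machine] = []
    total = sum(len(parts) for parts in assignment.values())
    target = total // len(assignment)  # fair share for the new machine
    moves = []
    while len(assignment[new_machine]) < target:
        # Take one partition from whichever old machine holds the most.
        source = max(
            (m for m in assignment if m != new_machine),
            key=lambda m: len(assignment[m]),
        )
        partition = assignment[source].pop()
        assignment[new_machine].append(partition)
        moves.append((partition, source, new_machine))
    return moves
```

Starting from two machines with six partitions each, adding a third machine moves four partitions in total and leaves every machine with four. Only the migrated partitions are touched; the other eight stay where they are, which keeps the cost of scaling out proportional to the share of data the new machine takes over.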

Maintainability

Maintainability determines whether a system can keep being developed over the long term. A system with poor maintainability causes a great deal of inconvenience for its operations and development staff. For operations staff, maintainability means whether the system supports common operational practices, has good documentation, and so on. Developers fall mainly into two groups: business developers, who build on the system, and kernel developers, who work on the system itself. For business development, maintainability means whether the system offers good interfaces that are convenient to use. A transaction, for example, is an interface that the underlying system provides to the business: it guarantees that the statements executed within it have ACID properties, so the business only needs to focus on its own logic and does not need to care about how the lower layer is implemented. For kernel development, maintainability refers to the code quality of the system, mainly how readable the code is and how easy it is to modify, which largely depends on the design skills of the kernel developers.
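To make the point about transactions concrete, here is a small sketch using Python's built-in sqlite3 module. The `accounts` table and the `transfer` function are made up for illustration. The business code only states what should happen inside the transaction; the database takes care of applying both updates together or neither of them.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move amount from src to dst inside a single transaction."""
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if an exception is raised inside it.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )


# Hypothetical demo data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)]
)
conn.commit()
```

Calling `transfer(conn, "alice", "bob", 30)` leaves alice with 70 and bob with 30. A later `transfer(conn, "alice", "bob", 500)` raises `ValueError`, and the debit that was already applied inside the transaction is rolled back, so the balances stay at 70 and 30.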

Conclusion

Techniques such as replication, partitioning, consensus, and transactions are commonly used in distributed system design to achieve good reliability, scalability, and maintainability. Understanding the problems they are meant to solve, and gaining insight into the possible implementations behind each technique, helps in evaluating the design of a system. This is valuable both when choosing among competing products and when studying the principles of a system in depth.


Reference

  • Martin Kleppmann, Designing Data-Intensive Applications