Abstract: Distributed systems are a perennially hot topic, but in fact, unless it is truly necessary, I strongly recommend staying out of the distributed field: many problems that are hard in a distributed setting are simple in a centralized one. Engineers should not rush to, say, re-architect their products as microservices just because microservices are fashionable; always judge carefully whether there is a real need, and never adopt technology for technology's sake. This article is about the skills a qualified architect in the distributed field needs when distribution genuinely cannot be avoided (because of traffic, storage, or the number of developers).


To repeat my criteria for architects: the most important thing for an architect is not drawing a few boxes and connecting them with lines (that is the basic requirement), but controlling technical risk, and that obviously cannot be learned by flipping through a few architecture PowerPoint decks.

communication

Since we are dealing with a distributed system, mastering inter-system communication technology is unavoidable.

First, you need to master the basics: network protocols (TCP/UDP, etc.), I/O models (blocking I/O, non-blocking I/O, asynchronous I/O), and network-adapter features such as multi-queue. At the application level, you need to understand things like connection reuse, serialization/deserialization, RPC, and load balancing.
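As a concrete taste of the application-level points above, here is a minimal sketch of length-prefixed framing, the usual foundation for multiplexing many serialized requests over one reused connection. The class and method names are my own, not from any particular framework:

```java
import java.nio.ByteBuffer;

// Minimal length-prefixed framing: each message is [4-byte length][payload].
// This is what lets many serialized requests share one reused connection,
// because the receiver always knows where one message ends and the next begins.
public class Framing {

    // Prefix the payload with its length in a 4-byte big-endian header.
    public static byte[] encode(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(payload.length);
        buf.put(payload);
        return buf.array();
    }

    // Inverse of encode: read the length header, then exactly that many bytes.
    public static byte[] decode(byte[] frame) {
        ByteBuffer buf = ByteBuffer.wrap(frame);
        int len = buf.getInt();
        byte[] payload = new byte[len];
        buf.get(payload);
        return payload;
    }
}
```

Real frameworks add a protocol magic number, a request id for matching responses on a shared connection, and a serialization-type field, but the length prefix is the load-bearing part.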

With the basics in hand, you can write a simple communication module for a distributed system, but that is far from enough. Entering the distributed field usually implies non-trivial scale, which means the communication layer must support a large number of connections and high concurrency with low resource consumption.

A large number of connections usually takes one of two forms:

1. A large number of clients connect to a server

In today's non-blocking-I/O world, writing a server that supports a large number of clients is not that hard, but with many long-lived connections there is one point that deserves special attention: what happens when the server goes down. I have seen several cases where inexperienced teams, once the client population grew, had the server crushed by a storm of connection attempts the moment it restarted (of course, the server's backlog queue should also be sized somewhat larger). The usual remedy is to have clients sleep for a random time before reconnecting, and to avoid fixed, synchronized reconnection intervals.
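The "sleep a random time before reconnecting" advice is typically implemented as exponential backoff with jitter, so a restarted server is not hit by a thundering herd of simultaneous reconnects. A minimal sketch, with illustrative constants:

```java
import java.util.concurrent.ThreadLocalRandom;

// Exponential backoff with full jitter for client reconnection: the delay
// ceiling doubles with each failed attempt, and the actual sleep is a random
// value below that ceiling, which spreads reconnects out over time.
// BASE_MS and CAP_MS are illustrative, not recommendations.
public class ReconnectBackoff {
    static final long BASE_MS = 100;    // ceiling for the first retry
    static final long CAP_MS = 30_000;  // never wait longer than 30s

    // Returns a randomized delay in [0, min(CAP_MS, BASE_MS * 2^attempt)].
    public static long nextDelayMs(int attempt) {
        long ceiling = Math.min(CAP_MS, BASE_MS << Math.min(attempt, 20));
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```

A client would call `nextDelayMs(attempt)` before each reconnection attempt and reset `attempt` to zero once a connection succeeds.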

2. One client connects to a large number of servers

There are also scenarios where a single client needs to connect to a large number of servers. Here it is likewise important not to establish all connections simultaneously, but to build them in batches that stay within capacity.
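Building connections "in batches" can be as simple as partitioning the server list and dialing one batch at a time, waiting (or pausing) between batches. A sketch of the partitioning step, with illustrative names:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of batched connection setup: instead of dialing every server at
// once, split the server list into fixed-size batches; the caller connects
// one batch, waits for it to complete, optionally pauses, then continues.
public class BatchedDialer {

    public static List<List<String>> batches(List<String> servers, int batchSize) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < servers.size(); i += batchSize) {
            out.add(new ArrayList<>(
                servers.subList(i, Math.min(i + batchSize, servers.size()))));
        }
        return out;
    }
}
```

The batch size bounds how many connection handshakes are in flight at once, which keeps file descriptors, memory, and CPU within capacity during startup.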

Besides connection establishment, note that the same applies to sending requests concurrently: you must apply flow limiting, otherwise a few slow peers can easily cause memory to blow up.
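One common way to implement that flow limiting is to cap the number of in-flight requests with a semaphore, so a slow downstream produces backpressure on callers instead of unbounded buffering. A minimal sketch (names are illustrative):

```java
import java.util.concurrent.Semaphore;

// Request-level flow limiting: a Semaphore caps in-flight requests, so a
// slow peer blocks new sends (backpressure) rather than letting queued
// requests pile up in memory until the process dies.
public class BoundedSender {
    private final Semaphore inFlight;

    public BoundedSender(int maxInFlight) {
        this.inFlight = new Semaphore(maxInFlight);
    }

    // Blocks the caller once maxInFlight requests are outstanding.
    public void send(Runnable request) throws InterruptedException {
        inFlight.acquire();
        try {
            request.run(); // stand-in for an async write + completion callback
        } finally {
            inFlight.release();
        }
    }

    public int availableSlots() {
        return inFlight.availablePermits();
    }
}
```

With genuinely asynchronous sends, the `release()` would move into the response/failure callback rather than a `finally` block, but the bounding idea is the same.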

These issues must be taken into account both as technical risks and in design and code implementation; otherwise, once the scale goes up, they are genuinely hard to fix.

For high concurrency, the key is to master CAS, common lock-free algorithms, read/write locks, and thread-related knowledge (such as thread interaction and thread pools). In the non-blocking-I/O case, the most important thing in both design and implementation is to minimize the time spent on the I/O threads.

Low resource consumption is something non-blocking I/O itself largely takes care of.

scalability

Scalability must be considered in the design of any basically large distributed system: for every point on the architecture diagram, if request volume or data volume keeps growing, can it be handled simply by adding machines? Of course, this does not mean designing for infinite scale. Architects who have grown a system from relatively small to very large have an obvious advantage here, and they are increasingly scarce.

Scalability problems revolve around two scenarios:

1. Stateless scenario

Stateless scenarios are easier: as volume grows, you just add machines. The only problem to solve is node discovery, which can be handled with load balancing, either hardware or software.
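The software load-balancing half of that story can be very simple: given whatever set of nodes the discovery mechanism currently reports alive, pick one, for example round-robin. A sketch with illustrative names:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Minimal software load balancer for stateless nodes: round-robin over the
// nodes a discovery mechanism currently reports as alive. The atomic counter
// makes picks safe when called from many threads.
public class RoundRobinBalancer {
    private final AtomicLong counter = new AtomicLong();

    public String pick(List<String> aliveNodes) {
        if (aliveNodes.isEmpty()) {
            throw new IllegalStateException("no nodes alive");
        }
        int idx = (int) Math.floorMod(counter.getAndIncrement(), (long) aliveNodes.size());
        return aliveNodes.get(idx);
    }
}
```

Because the node list is passed in on every call, the balancer automatically follows the discovery mechanism as nodes join and leave; production balancers add weighting and health-aware strategies on top.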

Stateless systems usually push most of their state into the database. When volume reaches a certain stage, servitization (a shared service layer) has to be introduced to relieve the pressure of too many database connections.

2. Stateful scenario

So-called state is really data, and Sharding is the usual way to make it scalable. Sharding has many implementations; the common ones are as follows:

2.1 Rule-based Sharding

Shard the state data according to some fixed rule; splitting databases and tables is typically done this way. It supports scaling out, but usually also brings complicated management and state-data migration, and can even make some business functions hard to implement, such as global joins and cross-shard transactions.
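The "rule" in rule-based Sharding is often nothing more than hash-mod routing from a business key to one of N physical tables. A sketch with an illustrative table-naming scheme:

```java
// Rule-based Sharding as used for database/table splitting: a fixed routing
// rule maps a key to one of N physical tables. Simple and fast, but changing
// N means migrating data, and queries spanning shards (global joins,
// cross-shard transactions) become hard -- exactly the costs noted above.
public class ShardRouter {
    private final int tableCount;

    public ShardRouter(int tableCount) { this.tableCount = tableCount; }

    // Route by user id, e.g. userId 12345 with 8 tables -> "orders_1".
    // floorMod keeps the result non-negative even for negative keys.
    public String tableFor(long userId) {
        return "orders_" + Math.floorMod(userId, tableCount);
    }
}
```

Every query path that touches this data must now carry the sharding key (`userId` here); queries without it have to fan out to all shards, which is why the rule is usually chosen around the dominant access pattern.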

2.2 Consistent Hash

Consistent hashing makes adding machines cheaper and balances the pressure better; distributed caches often use it. Otherwise it has basically the same problems as rule-based Sharding.
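The reason adding a machine is cheaper under consistent hashing is that each node owns many points (virtual nodes) on a hash ring, and a membership change only remaps the keys between the changed node and its ring neighbours. A compact sketch:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.TreeMap;

// Consistent-hash ring with virtual nodes: each physical node is hashed onto
// the ring many times, so adding or removing one node only remaps the keys
// that fall in its ring segments instead of reshuffling everything.
public class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public HashRing(int virtualNodes) { this.virtualNodes = virtualNodes; }

    public void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) ring.put(hash(node + "#" + i), node);
    }

    public void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) ring.remove(hash(node + "#" + i));
    }

    // Walk clockwise to the first virtual node at or after the key's hash.
    public String nodeFor(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("no nodes");
        Long h = ring.ceilingKey(hash(key));
        return ring.get(h != null ? h : ring.firstKey());
    }

    // First 8 bytes of MD5 as a well-distributed 64-bit ring position.
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

This is the scheme behind most distributed-cache clients; the virtual-node count trades memory for evenness of the load split.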

2.3 Auto Sharding

The advantage of Auto Sharding is that you basically do not have to care about data migration: as volume grows, you just add machines. However, Auto Sharding places relatively high demands on how the system is used and usually imposes some restrictions; HBase is an example of such a scheme.

2.4 Replication

Replication is common when reads far outnumber writes. Implementations come in eventually consistent and globally consistent flavors: eventual consistency is mostly achieved through messaging mechanisms, and global consistency through systems such as Zookeeper/etcd. Achieving both global consistency and high write throughput is difficult.
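The "eventual consistency through a message mechanism" pattern can be shown in a few lines: writes hit the primary and are also appended to a message queue, and a read replica applies the messages asynchronously. Reads from the replica may briefly lag, but once the queue drains the replica converges. A toy single-process sketch (in practice the queue would be a real MQ and the consumer a long-running process):

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Eventual consistency via messaging, in miniature: the primary is updated
// synchronously and a change message is queued; the replica applies queued
// messages later. Between the write and the drain, the replica is stale --
// that window is exactly what "eventual" means.
public class EventuallyConsistentPair {
    final Map<String, String> primary = new HashMap<>();
    final Map<String, String> replica = new HashMap<>();
    final Queue<String[]> log = new ArrayDeque<>(); // [key, value] messages

    public void write(String key, String value) {
        primary.put(key, value);
        log.add(new String[] { key, value }); // would be an MQ in practice
    }

    // Replica-side consumer; real systems run this continuously.
    public void drain() {
        String[] msg;
        while ((msg = log.poll()) != null) replica.put(msg[0], msg[1]);
    }

    public String readReplica(String key) { return replica.get(key); }
}
```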

Even today, scalability under Sharding remains a big challenge and is very hard to do well.

What is written above is mostly just the direction of the solutions; once it comes down to the details, it is easy to tell whether someone is an architect who has actually solved many large-scale problems.

stability

A distributed system must consider how to handle the failure of any point in the system (once the machine count reaches a certain size, a few machines failing every day is normal). This again divides into stateless and stateful:

1. Stateless scenario

Stateless scenarios are usually easy to handle: a node-discovery mechanism plus heartbeats or other health checks is enough. Experience shows that pure layer-4 checks are not sufficient for business traffic; layer-7 checks are usually required. Of course, layer-7 checking must itself cope with large scale.
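The heartbeat half of that mechanism boils down to tracking when each node last reported in and treating anything silent past a timeout as dead. A sketch with the clock passed in explicitly so the logic stays testable (names and timeout are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Heartbeat-based failure detection for stateless nodes: each node reports
// in periodically; a node that has not reported within the timeout is
// considered dead and can be removed from load-balancer rotation.
public class HeartbeatDetector {
    private final long timeoutMs;
    private final Map<String, Long> lastSeen = new HashMap<>();

    public HeartbeatDetector(long timeoutMs) { this.timeoutMs = timeoutMs; }

    public void heartbeat(String node, long nowMs) {
        lastSeen.put(node, nowMs);
    }

    // A node never heard from is not alive; otherwise alive iff within timeout.
    public boolean isAlive(String node, long nowMs) {
        Long seen = lastSeen.get(node);
        return seen != null && nowMs - seen <= timeoutMs;
    }
}
```

A layer-7 check would replace the passive heartbeat with an active request against a business-level URL, which is what catches the "process is up but the application is broken" cases that layer-4 checks miss.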

2. Stateful scenario

Stateful scenarios are more troublesome. If the data-consistency requirements are not high, a basic primary/standby scheme will do, though getting primary/standby right is itself far from easy, and there are many variants. And do not get too comfortable just because a standby exists: with HBase, for example, when one machine goes down it takes a certain amount of time for another machine to take over, which affects availability.

In globally consistent scenarios, when one machine fails there usually has to be an election mechanism to decide which of the remaining machines takes over; Paxos-based implementations are common.
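To show only the shape of the problem (this is deliberately a toy, nothing like real Paxos): if every node can see which peers are alive and they all apply the same deterministic rule, they agree on a leader without further messages. Real elections must additionally survive network partitions, disagreeing membership views, and split votes, which is exactly what Paxos-family protocols handle:

```java
import java.util.Set;
import java.util.TreeSet;

// Toy deterministic election: among the nodes believed alive, everyone
// agrees the smallest node id is the leader. This only works if all nodes
// share the same membership view -- the hard part that real consensus
// protocols (Paxos, ZAB, Raft) exist to solve.
public class SmallestIdElection {
    public static String elect(Set<String> aliveNodes) {
        if (aliveNodes.isEmpty()) return null; // no quorum, no leader
        return new TreeSet<>(aliveNodes).first();
    }
}
```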

maintainability

Maintainability is easily overlooked, but it is actually a very important part of a distributed system: how to set up the whole system environment, deployment, supporting maintenance tools, monitoring points, alarm points, problem localization, problem-handling strategies, and so on.

From the skills above, you can see why finding a qualified architect in the distributed field is so difficult. And these are only the general technical points of the field; what is usually needed is an architect for a specific distributed domain, such as distributed file systems or distributed caches. A domain-specific architect needs domain knowledge on top of all these points, which makes it even harder.

With the development of the Internet, many technologies in the distributed field have matured. Eight or nine years ago, how to design the scalability of a large website was still a hotly debated topic; by now the basic approaches are clear, many good systems have been open-sourced, and much experience that used to be hard-won has settled into reusable form. With the support of all kinds of good open-source products, building a distributed system will be much easier in the future, and the cloud will accelerate this process.

P.S. In the process of writing this article, I found that judging the depth of a technical person's background is not hard: ask them to write or talk about every technique they think they know, and see how deep they can go and how long they can sustain it. Writing "thick" or speaking at length is not easy. I do not deny that writing or speaking clearly and succinctly is also hard, but one must first go thick, and only then thin.

By Ali Bixuan