1. Questions

  • What is a distributed system, and what is a microservice?

  • Why go distributed?

  • What is the core theoretical basis of distributed systems: nodes, networks, time, order, consistency?

  • What are the design patterns of distributed systems?

  • What are the main types of distributed systems?

  • How do we implement a distributed system?


2. Key words

Nodes, time, consistency, CAP, ACID, BASE, P2P, machine scaling, network changes, load balancing, rate limiting, authentication, service discovery, service orchestration, degradation, circuit breaking, idempotence, database and table splitting, sharding, automated operations and maintenance, fault tolerance, full-stack monitoring, fault recovery, performance tuning


3. Summary of the full text

With the development of the mobile Internet and the popularity of intelligent terminals, computer systems have long since transitioned from single machines to multi-machine collaboration. Computers now exist in the form of clusters, and huge, complex application services are built under the guidance of distributed theory.

This paper tries to introduce an outline of the distributed knowledge system based on MSA (microservice architecture), covering basic distributed theory, architectural design patterns, engineering applications, deployment and operations, and industry solutions. The aim is a three-dimensional understanding of the evolution from SOA to MSA: to grasp the distributed nature of microservices from both concept and tool usage, and to learn hands-on how to build a full microservice architecture.


4. Basic theory

4.1 Evolution from SOA to MSA

SOA: service-oriented architecture

When a business grows to a certain scale, its services need to be decoupled: a single large system is logically split into different subsystems that communicate through service interfaces. In this service-oriented design pattern, services are ultimately integrated through a service bus, and much of the time they still share a database, so a single point of failure at the bus level can bring the whole system down and, one step further, drag down the shared database. Hence the emergence of more independent designs.

MSA: microservice architecture

A microservice is a truly independent service, with its logic isolated from the service entry point down to the data persistence layer and no service bus in between. At the same time, this increases the difficulty of building and managing the whole distributed system: services must be orchestrated and managed, so with the rise of microservices, a complete ecosystem of supporting technology stacks also needs to integrate seamlessly to support the microservice governance concept.

4.2 Nodes and Networks

Node

The traditional node was a single physical machine, with all services and the database deployed together on it. With the development of virtualization, a single physical machine could be divided into multiple VMs to maximize resource utilization, and the node became a service on a single VM. As container technology has matured in recent years, services have become fully containerized, so a node is now just a lightweight container service. In general, a node is a collection of logical computing resources that provides a unit of service.

Network

The foundation of distributed architecture is the network. Whether on a LAN or the public network, computers cannot work together without it, but the network also brings a series of problems: messages are delivered out of order, and message loss and delay happen all the time. We distinguish three network working models:

Synchronous network
  • Nodes execute synchronously

  • Message latency is bounded

  • Global locks are efficient

Semi-synchronous network
  • The scope of locks is extended

Asynchronous network
  • Nodes execute independently

  • Message latency is unbounded

  • There are no global locks

  • Some algorithms are infeasible

At the transport layer of the network, two protocols with distinct characteristics are commonly used:

TCP protocol
  • TCP is the first choice, although other protocols can be faster

  • TCP solves message duplication and out-of-order delivery

UDP protocol
  • A constant, best-effort data stream

  • Losing a packet is not fatal

4.3 Time and sequence

Time

In the slow flow of the physical world, time marches on by itself. For serial transactions, it is easy to follow the steps of time; we invented clocks to tell us what happened in the past, and clocks keep the world in order. But in the distributed world, dealing with time can be a pain.

In the distributed world, we must coordinate the happens-before relationships between different nodes, yet each node keeps its own independent notion of time. So we created the Network Time Protocol (NTP) to try to standardize time across nodes, but NTP itself does not perform well enough, so we constructed the logical clock instead, later improved into the vector clock:

NTP has several shortcomings that prevent it from fully solving the coordination of concurrent tasks in a distributed setting:

  • Time is not synchronized between nodes

  • Hardware clocks drift

  • Threads may sleep

  • The operating system may hibernate

  • Hardware may go dormant

Logical clock
  • First, define the order of events

  • On receiving a message: t' = max(t, t_msg) + 1

Vector clock
  • On receiving a message, per component: t_i' = max(t_i, t_msg_i), after which the receiving node increments its own component (a minimal sketch follows)
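To make these update rules concrete, here is a minimal Java sketch of a Lamport logical clock and a vector clock; the class and method names are illustrative rather than from any particular library.

```java
import java.util.Arrays;

// Minimal sketch of the two clock update rules above; names are illustrative.
class LamportClock {
    private long t = 0;

    long tick() { return ++t; }                    // local event

    long onReceive(long tMsg) {                    // t' = max(t, t_msg) + 1
        t = Math.max(t, tMsg) + 1;
        return t;
    }
}

class VectorClock {
    private final long[] t;   // one component per node
    private final int self;   // index of the local node

    VectorClock(int nodes, int self) {
        this.t = new long[nodes];
        this.self = self;
    }

    void tick() { t[self]++; }                     // local event

    void onReceive(long[] tMsg) {                  // t_i' = max(t_i, t_msg_i), then bump own entry
        for (int i = 0; i < t.length; i++) {
            t[i] = Math.max(t[i], tMsg[i]);
        }
        t[self]++;
    }

    @Override public String toString() { return Arrays.toString(t); }
}
```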
Atomic clocks

Order

With tools to measure time, solving the order problem comes naturally. Since the whole theory of distribution is built on how to negotiate consistency between different nodes, and order is the basic concept underlying consistency theory, it is worth spending effort on the scales and tools we use to measure time.

4.4 Consistency Theory

Speaking of consistency theory, we should first look at a comparison of how consistency affects system construction:

(The original figure compares the trade-offs among transactions, performance, errors, and latency under different consistency algorithms.)

Strong consistency: ACID

ACID is the set of principles that guarantees transactions in the face of network latency and message loss. These four principles are familiar enough to need little explanation:

  • Atomicity: all operations in a transaction either complete or none of them do; a transaction never ends somewhere in the middle.

  • Consistency: the integrity of the database is not compromised before or after a transaction.

  • Isolation: the database allows multiple concurrent transactions to read, write, and modify data at the same time; isolation prevents the data inconsistency that interleaved execution of concurrent transactions would cause.

  • Durability: after a transaction completes, changes to the data are permanent and are not lost even if the system fails.
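As a small, hedged illustration of atomicity, here is a JDBC sketch in which two updates either commit together or roll back together; the connection URL and table layout are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch: transfer between two accounts, all-or-nothing (atomicity).
// URL, credentials and table layout are placeholders.
public class TransferDemo {
    public static void transfer(String url, long from, long to, long cents) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url)) {
            conn.setAutoCommit(false);              // start an explicit transaction
            try (PreparedStatement debit = conn.prepareStatement(
                     "UPDATE account SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = conn.prepareStatement(
                     "UPDATE account SET balance = balance + ? WHERE id = ?")) {
                debit.setLong(1, cents);  debit.setLong(2, from);  debit.executeUpdate();
                credit.setLong(1, cents); credit.setLong(2, to);   credit.executeUpdate();
                conn.commit();                      // both updates become durable together
            } catch (SQLException e) {
                conn.rollback();                    // neither update is applied
                throw e;
            }
        }
    }
}
```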

Distributed consistency: CAP

In the distributed environment, we cannot guarantee that the network always connects and delivers messages, so three important results were developed: CAP, FLP, and DLS:

  • CAP: it is impossible for a distributed computing system to guarantee Consistency, Availability, and Partition tolerance all at the same time.

  • FLP: in an asynchronous environment, if there is no upper bound on network latency between nodes, then as long as even one faulty node exists, no deterministic algorithm can reach consensus in bounded time.

  • DLS: (1) protocols running in a partially synchronous network model (i.e., network latency is bounded, but we don't know by how much) can tolerate up to 1/3 arbitrary (in other words, Byzantine) faults; (2) deterministic protocols in an asynchronous model (no upper bound on network latency) cannot be fault-tolerant (although that paper does not mention that randomized algorithms can tolerate up to 1/3 faults); (3) protocols in the synchronous model (network latency guaranteed to be less than a known bound d) can, surprisingly, achieve 100% fault tolerance, although there are restrictions on behavior when more than 1/2 of the nodes fail.

Weak consistency: BASE

In most cases we do not actually require strong consistency; some businesses can tolerate consistency arriving with a certain delay. Therefore, to trade consistency for efficiency, the eventual-consistency theory BASE was developed: Basically Available, Soft state, Eventual consistency.

  • Basically Available: a distributed system is allowed to lose part of its availability in the event of a failure, i.e. the core remains available.

  • Soft state: the system is allowed to have intermediate states that do not affect its overall availability. Generally, a piece of data in distributed storage has at least three replicas, and the soft state shows up as the delay in replica synchronization between different nodes.

  • Eventual consistency: all data replicas in the system reach the same state after a certain period of time. Weak consistency is the opposite of strong consistency, and eventual consistency is a special case of weak consistency.

Consistency algorithms

The core of distributed architecture is the realization of, and compromise around, consistency, so designing a set of algorithms that let different nodes communicate and eventually converge on consistent data is very important. Guaranteeing that different nodes agree on the same replica in an unreliable network environment is very difficult, and a great deal of research has been devoted to the topic.

First we need to understand the general premise of consistency, CALM:

The full name of the CALM principle is Consistency As Logical Monotonicity. It describes the relationship between monotonic logic and consistency in a distributed system; for details see "Consistency as Logical Monotonicity".

  • In a distributed system, monotonic logic can guarantee "eventual consistency" without depending on scheduling by a central node

  • A distributed system can achieve eventual "consistency" if all of its non-monotonic logic is scheduled through a central node

Then we focus on the key data structures of distributed systems, CRDTs (Conflict-free Replicated Data Types):

Having understood some of the laws and principles of distribution, we have to start considering how to implement solutions. The premise of a consistency algorithm is its data structure; in other words, the foundation of every algorithm is a data structure. Well-designed data structures and exquisite algorithms can effectively solve real problems, and after our predecessors' continuous exploration, we know that the data structures widely used in distributed systems are CRDTs.

See "Talking about CRDT" and "A Comprehensive Study of Convergent and Commutative Replicated Data Types".

  • State-based: the CRDT data of each node is merged directly, and all nodes eventually converge to the same state; the order in which data is merged does not affect the final result.

  • Operation-based: every data operation is broadcast to the other nodes; as long as a node learns all operations on the data (the order in which the operations are received can be arbitrary), it can merge to the same state.
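As a concrete taste of the state-based flavor, here is a minimal sketch of a G-Counter (grow-only counter), one of the simplest CRDTs: each node increments only its own slot, and merging takes the element-wise maximum, so merges commute and the merge order does not affect the result. All names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// State-based G-Counter sketch: a grow-only distributed counter.
class GCounter {
    private final String nodeId;
    private final Map<String, Long> counts = new HashMap<>();

    GCounter(String nodeId) { this.nodeId = nodeId; }

    void increment() {                              // each node only bumps its own slot
        counts.merge(nodeId, 1L, Long::sum);
    }

    long value() {                                  // total = sum over all slots
        return counts.values().stream().mapToLong(Long::longValue).sum();
    }

    void merge(GCounter other) {                    // element-wise max: order-independent
        other.counts.forEach((node, c) -> counts.merge(node, c, Math::max));
    }
}
```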

After the data structures, HATs (Highly Available Transactions) and ZAB (ZooKeeper Atomic Broadcast) are important protocols for distributed systems.

See "Highly Available Transactions" and "ZAB Protocol Analysis".

The last thing to study is the mainstream consistency algorithms in the industry:

To be honest, I have not yet fully understood these algorithms in detail. Consistency algorithms are the core and essence of distributed systems, progress in this area drives architectural innovation, and different application scenarios call for different algorithms.

  • Paxos: "The Elegant Paxos Algorithm"

  • Raft: "The Raft Consistency Algorithm"

  • Gossip: a visual demonstration of the Gossip protocol (see the toy sketch below)
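Without attempting Paxos or Raft here, a toy round-based push-gossip simulation conveys the flavor of epidemic dissemination: each round, every node pushes its newest state to one random peer until the whole cluster has converged. Everything below is illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy push-gossip: each round, every node pushes its (version, value) to one random peer.
class GossipDemo {
    static class Node {
        long version; String value;
        void merge(long v, String val) { if (v > version) { version = v; value = val; } }
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        List<Node> cluster = new ArrayList<>();
        for (int i = 0; i < 16; i++) cluster.add(new Node());
        cluster.get(0).merge(1, "new-config");      // one node learns an update

        int rounds = 0;
        while (cluster.stream().anyMatch(n -> n.version == 0)) {
            for (Node n : cluster) {                // every node pushes to a random peer
                Node peer = cluster.get(rnd.nextInt(cluster.size()));
                peer.merge(n.version, n.value);
            }
            rounds++;
        }
        System.out.println("converged after " + rounds + " rounds");
    }
}
```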

In this section we covered the core theoretical basis of distributed systems and how different nodes reach data consistency. Next, we will survey today's mainstream distributed systems by scenario.


5. Scenario classification

5.1 File System

The storage capacity of a single computer has always been limited, so with the emergence of networks, multiple computers were made to store files cooperatively. The earliest distributed file systems were actually called network file systems, and the first file servers were developed in the 1970s; in 1976, Digital Equipment Corporation designed the File Access Listener (FAL). The modern distributed file system stems from Google's famous paper "The Google File System", which laid its foundation. For today's mainstream systems, refer to the Distributed File System Comparison. Several common file systems are listed below:

  • HDFS

  • FastDFS

  • Ceph

  • MooseFS

5.2 Databases

Databases are, of course, also file systems, but managing data adds advanced features such as transactions, retrieval, and erasure, so the complexity increases: both data consistency and performance must be taken into account. Traditional relational databases, trying to balance transactional characteristics with performance, are limited in how far they can be distributed; non-relational databases shed strongly consistent transactions in favor of eventual consistency and thus developed by leaps and bounds. NoSQL (Not Only SQL) has also produced database types with many different architectures, including KV stores, column stores, document stores, and so on.

  • Column store: HBase

  • Document store: Elasticsearch, MongoDB

  • KV store: Redis

  • Relational: Spanner

5.3 Computation

Distributed computing systems are built on top of distributed storage. They take full advantage of the data redundancy and disaster recovery of distributed systems and of efficient multi-replica data access, then compute in parallel: tasks that would take a long time to compute are divided into many subtasks processed in parallel, improving computing efficiency. Distributed computing systems can be divided into offline computing, real-time computing, and stream computing.

  • Offline: Hadoop

  • Real-time: Spark

  • Streaming: Storm, Flink/Blink

5.4 Caching

Caching as a performance booster is everywhere, from CPU cache architectures to distributed application storage. A distributed cache system provides random access to hot data and greatly improves access time, but the problem becomes how to guarantee data consistency, for which distributed locks are introduced. The mainstream distributed cache system today is basically Redis.

  • Persistent: Redis

  • Non-persistent: Memcache

5.5 Messaging

A distributed message queue system is a powerful tool for taming the complexity that asynchronous, multi-threaded, high-concurrency scenarios bring. In the past we had to design business code carefully so that concurrent threads would not deadlock while competing for resources. Message queues instead buffer asynchronous tasks in a deferred-consumption mode and then digest them one by one (see the producer sketch after this list).

  • Kafka

  • RabbitMQ

  • RocketMQ

  • ActiveMQ
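As a minimal illustration of deferred consumption, here is a hedged producer sketch using Kafka's standard Java client; the broker address and topic name are placeholders for your own environment.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch: enqueue an asynchronous task instead of executing it inline.
// Broker address and topic name are placeholders.
public class TaskProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The consumer digests these one by one at its own pace.
            producer.send(new ProducerRecord<>("order-tasks", "order-42", "create"));
        }
    }
}
```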

5.6 Monitoring

As distributed systems have grown from single machines to clusters, their complexity has also greatly increased, so monitoring of the whole system is essential.

  • Zookeeper

5.7 Applications

The core module of a distributed system is how the application handles business logic. Applications call each other directly and communicate over specific protocols, some based on RPC and some on the general-purpose HTTP protocol.

  • HSF

  • Dubbo

5.8 Logging

Errors are common in distributed systems, and we need to design with fault tolerance in mind, treating failure as a universal phenomenon. When something goes wrong, quick recovery and troubleshooting are very important. Distributed log collection, storage, and retrieval give us powerful tools for locating the problematic link in a request chain.

  • Log collection: Flume

  • Log storage: ElasticSearch/Solr, SLS

  • Log locating: Zipkin

5.9 Ledgers (blockchain)

As mentioned above, distributed systems exist because single-machine performance is limited: hardware cannot be stacked endlessly, and stacking hardware on a single machine eventually hits the bottleneck of the performance-growth curve. So we use multiple computers to do the same job, but such distributed systems always need a centralized node to monitor or schedule the system's resources, even if that central node is itself composed of multiple nodes. Blockchain, however, is a truly decentralized distributed system: nodes communicate with each other over P2P network protocols, there is no central node of real significance, and the nodes coordinate the generation of new blocks according to their computing power and stake.

  • Bitcoin

  • Ethereum


6. Design patterns

Above we listed the scenario categories of distributed systems and the functional role each one plays. In this section we go a step further and summarize how to approach the structural design of a distributed system: different designs aim at different things, and different scenarios need different combinations of design patterns to reduce the cost of trial and error. The following issues need to be considered when designing distributed systems.

6.1 Availability

Availability is the proportion of time a system is up and running, usually measured as a percentage of uptime. It can be affected by system errors, infrastructure problems, malicious attacks, and system load. Distributed systems typically provide users with service level agreements (SLAs), so applications must be designed to maximize availability.

  • Health check: the system implements full-link health checks, with external tools periodically accessing it through public endpoints

  • Queue-based load leveling: use queues as buffers between requests and services to smooth out intermittent heavy loads

  • Throttling: limit the resources consumed by an application instance, a tenant, or the entire service

6.2 Data Management

Data management is a key element of distributed systems and affects most quality attributes. Data is often hosted in different locations and on multiple servers for performance, scalability, or availability reasons, which can present a number of challenges. For example, data consistency must be maintained and data often needs to be synchronized across different locations.

  • Caching: load data from the storage layer into the cache as needed

  • CQRS: Command Query Responsibility Segregation

  • Event sourcing: record the complete series of domain events in append-only mode

  • Index tables: create indexes on the fields referenced by frequent queries

  • Materialized views: generate one or more pre-populated views of the data

  • Sharding: split data into horizontal partitions or shards

6.3 Design and Implementation

Good design includes factors such as consistency in component design and deployment, maintainability that simplifies management and development, and reusability that allows components and subsystems to be used in other applications and other scenarios. Decisions made during the design and implementation phases have a huge impact on distributed systems and quality of service and total cost of ownership.

  • Proxy: reverse proxy

  • Adapter: implement an adapter layer between modern applications and legacy systems

  • Front-end/back-end separation: back-end services provide interfaces for front-end applications to invoke

  • Compute resource consolidation: combine multiple related tasks or operations into a single unit

  • Configuration separation: move configuration information out of the application deployment package into a configuration center

  • Gateway aggregation: use a gateway to aggregate multiple individual requests into a single request

  • Gateway offloading: offload shared or dedicated service functionality to a gateway proxy

  • Gateway routing: route requests to multiple services through a single endpoint

  • Leader election: coordinate the actions of a distributed system by electing one instance as the leader responsible for managing the other instances

  • Pipes and filters: break complex tasks down into a series of separate, reusable components

  • Sidecar: deploy an application's monitoring components into a separate process or container to provide isolation and encapsulation

  • Static content hosting: deploy static content to a CDN to speed up access

6.4 Messaging

Distributed systems require messaging middleware to connect components and services, ideally in a loosely coupled fashion to maximize scalability. Asynchronous messaging is widely used and provides many benefits, but also presents challenges such as message ordering and idempotency.

  • Competing consumers: multiple consumers consume concurrently

  • Priority queue: messages are classified by priority, and those with higher priority are consumed first

6.5 Management and Monitoring

Distributed systems run in remote data centers without full control over the infrastructure, making administration and monitoring more difficult than standalone deployment. Applications must expose runtime information that administrators can use to manage and monitor systems and support changing business needs and customizations without stopping or redeploying applications.

6.6 Performance and Expansion

Performance represents the responsiveness of the system when performing any operation within a given time interval, while scalability is the system's ability to handle increased load without loss of performance, or to increase available resources easily. Distributed systems often experience varying peaks of load and activity, especially in multi-tenant scenarios, which are nearly impossible to predict, so applications should be able to scale out within limits to meet peak demand and scale back in when demand decreases. Scalability concerns not only compute instances but also other elements such as data stores, message queues, and so on.

6.7 Resilience

Resilience is the ability of a system to gracefully handle failures and recover from them. Distributed systems are typically multi-tenant, use shared platform services, compete for resources and bandwidth, communicate over the Internet, and run on commodity hardware, all of which increases the likelihood of both transient and more permanent failures. To remain resilient, faults must be detected quickly and recovered from efficiently.

  • Bulkhead (isolation): isolate the elements of an application into pools so that if one fails, the others continue to run

  • Circuit breaker: handle faults that may take a variable amount of time to repair when connecting to a remote service or resource

  • Compensating transactions: undo the work performed by a series of steps that together define an eventually consistent operation

  • Health check: the system implements full-link health checks, with external tools periodically accessing it through public endpoints

  • Retry: let applications handle expected temporary failures when connecting to services or network resources by transparently retrying previously failed operations

6.8 Security

Security is the ability of a system to prevent malicious or accidental behavior outside of its intended use, and to prevent disclosure or loss of information. Distributed systems run on the Internet outside trusted local boundaries, are usually open to the public, and can serve untrusted users. You must protect your application from malicious attacks, restrict access only to approved users, and protect sensitive data.

  • Federated identity: Delegate authentication to an external identity provider

  • Gatekeeper: protect applications and services with a dedicated host instance that acts as a broker between clients and the application or service, validating and sanitizing requests and passing requests and data between them

  • Valet key: Uses a token or key that provides a client with limited direct access to a particular resource or service.


7. Engineering applications

Above we introduced the core theory of distributed systems, some of the challenges they face, and the compromise-driven ways of thinking used to solve problems; we listed the categories of mainstream distributed systems and summarized a methodology for building them. Next, from an engineering point of view, we describe the contents and steps involved in building a distributed system.

7.1 Resource Scheduling

One cannot make bricks without straw: all of our software systems are built on hardware servers. From deploying software directly on physical machines at the beginning, through virtual machines, and finally to containers on cloud resources, the use of hardware resources has become intensively managed. This section covers the scope of responsibility that used to belong to the traditional operations role; in a DevOps environment, the integration of development and operations is what lets us use resources flexibly and efficiently.

Elastic scaling

In the past, if a software system needed more machine resources as its user base grew, the traditional way was to apply to operations staff for machines and then deploy the software onto the cluster. The whole process relied on the personal experience of operations personnel, which was inefficient and error-prone. Microservice distribution does not require adding physical machines by hand: with the support of containerization, we only need to apply for cloud resources and execute container scripts.

  • Application scaling: when users surge, the service scales out, and scales back in automatically after the peak

  • Going offline: for obsolete applications, the cloud platform reclaims the container host resources

  • Machine replacement: faulty machines can have their container host resources replaced; services start automatically and switch over seamlessly

Network management

Beyond computing resources, the other most important thing is network resources. In the current cloud context, we hardly ever touch physical bandwidth resources directly; instead the cloud platform manages bandwidth uniformly. What we need is maximum utilization and effective management of network resources.

  • Domain name application: apply for matching domain name resources and standardize multiple sets of domain-name mapping rules

  • Domain name change: a unified platform for domain-name change management

  • Load management: access policy settings for multi-machine applications

  • Security: basic access authentication for external traffic, blocking illegal requests

  • Unified access: a unified platform for access permission applications and unified login management

Fault snapshot

When the system fails, our top priority is system recovery, but preserving the scene of the incident is also very important; the resource scheduling platform needs a unified mechanism to preserve the failure scene.

Scene preservation

Store snapshots of resource state such as memory distribution and thread counts, e.g. hooks that capture a Java dump

Debug access

Bytecode instrumentation requires no intrusion into service code and can be used to debug against live logs in the production environment

7.2 Traffic Scheduling

After we build a distributed system, the first thing put to the test is the gateway, and then we need to pay close attention to the system's traffic situation, that is, how traffic is managed. What we pursue is knowing the traffic limit the system can accommodate, leaving resources for the highest-quality traffic, and blocking illegal or malicious traffic at the door. This saves cost and ensures the system does not collapse.

Load balancing

Load balancing is our general design for how services absorb traffic and is usually divided into hard load balancing at the physical layer and soft load balancing at the software layer. Load balancing solutions are mature in the industry and are usually optimized for specific services in different environments. The following load balancing solutions are commonly used (a sketch of two classic strategies follows the list):

  • switches

  • F5

  • LVS/ALI-LVS

  • Nginx/Tengine

  • VIPServer/ConfigServer
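Behind these products sit a handful of simple algorithms. Here is a minimal sketch of two classic soft load-balancing strategies, round-robin and weighted random; server names and weights are illustrative.

```java
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

// Two classic soft load-balancing strategies; the server list is illustrative.
class LoadBalancer {
    private final List<String> servers;
    private final int[] weights;
    private final AtomicLong counter = new AtomicLong();
    private final Random rnd = new Random();

    LoadBalancer(List<String> servers, int[] weights) {
        this.servers = servers; this.weights = weights;
    }

    String roundRobin() {                           // rotate through servers evenly
        int i = (int) (counter.getAndIncrement() % servers.size());
        return servers.get(i);
    }

    String weightedRandom() {                       // pick proportionally to weight
        int total = 0;
        for (int w : weights) total += w;
        int r = rnd.nextInt(total);
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r < 0) return servers.get(i);
        }
        return servers.get(0);                      // unreachable fallback
    }
}
```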

Gateway design

The gateway is the first line of load balancing, because it is the first place traffic hits in a centralized cluster; if the gateway cannot withstand the pressure, the whole system becomes unavailable.

  • High performance: the first consideration in gateway design is high-performance traffic forwarding; a single gateway node can usually handle millions of concurrent connections

  • Distributed: for traffic load balancing and disaster recovery, the gateway itself must also be distributed

  • Service filtering: the gateway is designed with simple rules to exclude most malicious traffic

Traffic management
  • Request verification: request authentication can block and clean up a large share of illegal requests

  • Data caching: most stateless requests have data hotspots, so a CDN can absorb a considerable portion of the traffic

Flow control

For the remaining real traffic, we use different algorithms to divert and shape requests (a token-bucket sketch follows this list):

  • Traffic distribution: counter, queue, leaky bucket, token bucket, dynamic flow control

  • Traffic limiting: when traffic surges, we usually need rate-limiting measures to prevent the system from avalanching. We estimate the system's maximum capacity and set an upper limit; once traffic rises past the threshold, the extra traffic is kept out of the system, sacrificing some traffic to preserve availability. Granularities include QPS, thread count, and RT thresholds; a common rate-limiting tool is Sentinel.
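Here is a minimal token-bucket sketch, just the idea rather than Sentinel's implementation: tokens refill at a fixed rate up to a burst capacity, and a request is admitted only if it can take a token.

```java
// Minimal token-bucket limiter; not Sentinel's implementation, just the idea.
class TokenBucket {
    private final long capacity;                    // maximum burst size
    private final double refillPerNano;
    private double tokens;
    private long lastRefill = System.nanoTime();

    TokenBucket(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
    }

    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1) { tokens -= 1; return true; }   // admit the request
        return false;                                    // shed the request
    }
}
```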

7.3 Service Scheduling

As the saying goes, to forge iron one must be strong oneself: with traffic scheduling managed, what remains is the robustness of the service itself. Services in a distributed system fail all the time; failure itself is even considered part of a distributed service.

The registry

The gateway, described in the traffic scheduling section, is the hub of traffic; the registry is the home of services.

  • Status: the state of each application service, which can be checked through the registry to see whether the service is available

  • Life cycle: the different states of an application service constitute its life cycle

Version management
  • Cluster version: not only do individual services need version numbers; a cluster composed of different services also needs an overall (large) version number

Version rollback

In the case of a deployment exception, rollback can be managed based on the cluster's large version number

Service orchestration

Service orchestration is defined as controlling the interaction of resources through sequences of messages. The resources involved in the interaction are all peers, with no centralized control. In a microservice environment with many services, we need a general coordinator to negotiate the dependencies and call relationships between services, and K8S is our best choice.

  • K8S

  • Spring Cloud

  • HSF

  • ZK + Dubbo

Service control

Previously we addressed the robustness and efficiency of the network; this section describes how to make the services themselves more robust.

Discovery

In the resource management section we introduced how, after container host resources are applied for from the cloud platform, application services can be started through automated scripts. After starting, a service needs to discover the registry and register its service information, and the service gateway needs to be informed, i.e. gateway access. The registry monitors the different states of services, performs health checks, and flags unavailable services (a toy registry sketch follows this list).

  • Gateway access

  • Health check
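A toy registry sketch shows the mechanics: a service registers, renews its lease with heartbeats, and is treated as unavailable once the lease expires. The names and the TTL are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy service registry: register + heartbeat lease; names and TTL are illustrative.
class Registry {
    private static final long TTL_MILLIS = 10_000;
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    void register(String instance)  { lastHeartbeat.put(instance, System.currentTimeMillis()); }
    void heartbeat(String instance) { lastHeartbeat.put(instance, System.currentTimeMillis()); }

    boolean isAvailable(String instance) {          // health check: lease still valid?
        Long last = lastHeartbeat.get(instance);
        return last != null && System.currentTimeMillis() - last < TTL_MILLIS;
    }
}
```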

Degradation

When users surge, the first thing we do is act on the traffic side, i.e. rate limiting. But when we find that the system still responds slowly after limiting traffic, which may cause more problems, we also need to act on the service itself. Service degradation means turning off functions that are not currently core, or relaxing precision where it is not critical, and then doing some manual remediation afterwards.

  • Relax consistency constraints

  • Turn off non-core services

  • Simplify functionality

Circuit breaking

Even after all the above operations we may still feel uneasy, so we need to protect ourselves one step further. A circuit breaker is self-protection against overload, just like an electrical switch that trips. For example, when our service keeps querying the database and a business problem causes the queries to fail, the circuit must break so the database is not dragged down by the application, returning a friendly message that tells the service not to keep calling blindly. A breaker has three states (sketched below):

  • Closed

  • Half-open

  • Open

  • Circuit-breaking tool: Hystrix
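Below is a minimal sketch of this three-state machine, not Hystrix itself: the breaker opens after consecutive failures, waits out a cool-down period, and then lets one trial call through in the half-open state.

```java
// Minimal circuit-breaker state machine: CLOSED -> OPEN -> HALF_OPEN; not Hystrix itself.
class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0;
    private final int failureThreshold = 5;
    private final long coolDownMillis = 30_000;

    synchronized boolean allowRequest() {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt >= coolDownMillis) {
                state = State.HALF_OPEN;            // let one trial call through
                return true;
            }
            return false;                           // fail fast while open
        }
        return true;
    }

    synchronized void onSuccess() {                 // trial succeeded: close again
        failures = 0;
        state = State.CLOSED;
    }

    synchronized void onFailure() {
        failures++;
        if (state == State.HALF_OPEN || failures >= failureThreshold) {
            state = State.OPEN;                     // trip the breaker
            openedAt = System.currentTimeMillis();
        }
    }
}
```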

Idempotence

We know that an idempotent operation is one in which any number of executions has the same effect as a single execution. To achieve this, it is necessary to assign a global ID to each operation, so that multiple requests from the same client can be recognized as the same operation and dirty data avoided.

  • Global consistency ID

  • Snowflake (sketched below)
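Here is a hedged sketch of the Snowflake bit layout (41-bit timestamp, 10-bit worker ID, 12-bit sequence within a millisecond); the custom epoch is a placeholder.

```java
// Snowflake-style 64-bit ID: 41-bit timestamp | 10-bit worker | 12-bit sequence.
// The custom epoch below is a placeholder.
class SnowflakeId {
    private static final long EPOCH = 1_500_000_000_000L;   // placeholder custom epoch
    private final long workerId;                            // 0..1023
    private long lastMillis = -1;
    private long sequence = 0;

    SnowflakeId(long workerId) { this.workerId = workerId & 0x3FF; }

    synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastMillis) {
            sequence = (sequence + 1) & 0xFFF;              // 4096 ids per millisecond
            if (sequence == 0) {                            // sequence exhausted: spin to next ms
                while (now <= lastMillis) now = System.currentTimeMillis();
            }
        } else {
            sequence = 0;
        }
        lastMillis = now;
        return ((now - EPOCH) << 22) | (workerId << 12) | sequence;
    }
}
```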

7.4 Data Scheduling

The biggest challenge of data storage is managing data redundancy: with more redundancy, efficiency drops and resources are wasted; with fewer replicas, disaster recovery cannot be achieved.

State transition

Separate state into global storage and convert requests into stateless traffic. For example, we typically move login information into a global Redis middleware rather than duplicating user login data across multiple applications.

Database and table splitting

Horizontal data scaling

Sharding and partitioning

Multi-replica redundancy

7.5 Automatic O&M

We have described the DevOps trend from resource application management onwards; the true integration of development and operations requires the coordination of different middleware.

Configuration center

A global configuration center manages configuration centrally per environment, reducing the confusion of scattered configuration

  • Switch (feature switches)

  • Diamond

Deployment strategy

Distributed deployment of microservices is the norm. How to deploy our services in a way that better supports business development and stays robust is the first thing to consider. The following deployment strategies suit different businesses and different stages:

  • Downtime deployment

  • Rolling deployment

  • Blue-green deployment

  • Gray (canary) release

  • A/B testing

Job scheduling

Task scheduling is an essential part of the system. The traditional way is to configure crond scheduled tasks on a Linux machine, or to implement scheduling directly in business code; nowadays mature middleware takes its place (a minimal Spring example follows the list).

  • SchedulerX

  • Spring Scheduled Task
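For the Spring option, a minimal @Scheduled sketch; the rate and the task body are illustrative.

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;

// Minimal Spring scheduled task; the rate and task body are illustrative.
@Configuration
@EnableScheduling
public class CleanupJob {

    @Scheduled(fixedRate = 60_000)          // run every 60 seconds
    public void purgeExpiredSessions() {
        // ... business cleanup logic goes here ...
    }
}
```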

Application management

A significant portion of operations work is spent restarting applications, taking them offline, and clearing logs.

  • Application restart

  • Application offlining

  • Log cleanup

7.6 Fault-tolerant processing

Now that we know that failures are commonplace in distributed systems, planning for failure is essential. We usually respond in active and passive ways. The active way: when an error occurs, we try a few more times; the retry may succeed, and if it does, the error is avoided. The passive way: the error has already happened, and in order to fix it we do exactly what is needed to minimize the negative impact.

Retry design

The key to retry design is choosing the retry interval and the number of retries: beyond a certain number of attempts, or after a certain time, retrying becomes meaningless. The open-source spring-retry project is a good implementation of a retry plan (a plain-Java sketch of the idea follows).
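Here is a plain-Java sketch of the idea behind spring-retry, bounded attempts plus exponential backoff; the limits and the call in the usage comment are illustrative.

```java
import java.util.concurrent.Callable;

// Bounded retry with exponential backoff; attempt/delay limits are illustrative.
class Retry {
    static <T> T withBackoff(Callable<T> call, int maxAttempts, long baseDelayMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;                                        // expected transient failure
                if (attempt < maxAttempts) {
                    Thread.sleep(baseDelayMillis << (attempt - 1));  // 100ms, 200ms, 400ms...
                }
            }
        }
        throw last;                                              // retries exhausted
    }
}

// usage (illustrative call): Retry.withBackoff(() -> httpClient.fetch(url), 3, 100);
```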

Transaction compensation

Transaction compensation fits our concept of eventual consistency. A compensating transaction does not necessarily return the system's data to the state it was in when the original operation began; instead, it compensates for the work completed by the steps that succeeded before the operation failed. The order of the steps in a compensating transaction is not necessarily the exact opposite of the order of the original steps. For example, one data store may be more sensitive to inconsistency than another, so the step that undoes changes to that store should perhaps run first. Holding a short timeout-based lock on each resource needed to complete the operation and acquiring those resources in advance helps increase the likelihood of overall success; work should be performed only after all resources have been acquired, and all operations must finish before the locks expire.
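A minimal saga-style sketch of this idea: each step registers its compensation, and on failure the compensations of the completed steps are run (here in reverse order, though as noted above other orders are possible). All names are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Saga-style compensation sketch: run steps, and on failure undo completed work.
class CompensatingTransaction {
    private final Deque<Runnable> compensations = new ArrayDeque<>();

    void step(Runnable action, Runnable compensation) {
        action.run();                       // do the work
        compensations.push(compensation);   // remember how to undo it
    }

    void compensate() {                     // here: reverse order; other orders are possible
        while (!compensations.isEmpty()) compensations.pop().run();
    }

    public static void main(String[] args) {
        CompensatingTransaction tx = new CompensatingTransaction();
        try {
            tx.step(() -> System.out.println("reserve stock"),
                    () -> System.out.println("release stock"));
            tx.step(() -> { throw new RuntimeException("payment failed"); },
                    () -> System.out.println("refund payment"));
        } catch (RuntimeException e) {
            tx.compensate();                // eventual consistency via compensation
        }
    }
}
```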

7.7 Full-stack Monitoring

Because a distributed system is made up of many machines working together, and because the network cannot be guaranteed to be fully available, we need to build a system in which every link can be monitored, from the bottom layer up through every level of the business, so that faults can be repaired in time and further problems avoided.

Base layer

The base layer is the monitoring of container resources, including the load on each hardware metric

  • CPU, I/O, memory, threads, throughput

Middleware layer

Distributed systems connect to a large number of middleware platforms, and the health of the middleware itself also needs monitoring

Application layer
  • Performance monitoring: at the application layer, monitor the real-time metrics (QPS, RT) of each application service and its upstream and downstream dependencies

  • Business monitoring: besides monitoring the application itself, business monitoring is another link in keeping the system normal; by designing reasonable business rules, alarms can be raised for abnormal situations

Link monitoring
  • zipkin/eagleeye

  • sls

  • goc

  • Alimonitor

7.8 Fault Recovery

When a fault occurs, the first thing to do is to rectify it immediately and ensure that system services are available; in this case we usually perform rollback operations.

Application rollback

Before rolling back an application, the failure scene needs to be saved for troubleshooting.

Baseline rollback

After the application service is rolled back, the code baseline also needs to revert to the previous version.

Version rollback

An overall rollback requires service orchestration, rolling the cluster back by its large version number.

7.9 Performance Tuning

Performance optimization is a big topic in distributed systems, touching an extremely wide range of aspects; it deserves a series of its own and will not be expanded here. The process of service governance is itself also a process of performance optimization.

Distributed locks

Caching is a great tool for solving performance problems; ideally, every request returns as quickly as possible without extra computation. A distributed cache, however, must solve data consistency, and for this we introduce the concept of a distributed lock. How the distributed lock is handled determines the efficiency of fetching cached data (a minimal Redis-based sketch follows).
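A minimal sketch with the Jedis client: acquire with SET NX PX plus a unique token, and release with a Lua script so that only the lock's owner can delete the key. The key name and TTL are illustrative, and production systems usually use a library such as Redisson instead.

```java
import java.util.Collections;
import java.util.UUID;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

// Minimal Redis lock sketch with Jedis; key and TTL are illustrative.
// Production systems usually use a library such as Redisson instead.
class RedisLock {
    private static final String UNLOCK_SCRIPT =
        "if redis.call('get', KEYS[1]) == ARGV[1] " +
        "then return redis.call('del', KEYS[1]) else return 0 end";

    private final Jedis jedis;
    private final String key;
    private final String token = UUID.randomUUID().toString();  // proves ownership

    RedisLock(Jedis jedis, String key) { this.jedis = jedis; this.key = key; }

    boolean tryLock(long ttlMillis) {
        // SET key token NX PX ttl: atomic "create if absent with expiry"
        return "OK".equals(jedis.set(key, token, SetParams.setParams().nx().px(ttlMillis)));
    }

    void unlock() {
        // Lua keeps get+del atomic so we never delete someone else's lock
        jedis.eval(UNLOCK_SCRIPT, Collections.singletonList(key),
                   Collections.singletonList(token));
    }
}
```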

High concurrency

The multi-threaded programming model improves system throughput, but also adds business complexity.

Asynchrony

Event-driven asynchronous programming is a newer programming model that removes the complex business handling of multi-threading and improves the system's responsiveness.


8. Summary

To conclude: if possible, try a single-node approach first rather than a distributed system. Distributed systems come with failing operations; to deal with catastrophic failures we use backups, and to improve reliability we introduce redundancy. A distributed system is essentially a collection of machines working together, and all we have to do is figure out how to make the machines work as expected. Such a complex system requires understanding every link and integrating every piece of middleware, which is a very large project. Fortunately, in the microservice context, most of the basic work has already been done for us. In engineering practice, the distributed architecture described above can basically be built with the distributed three-piece set: Docker + K8S + Spring Cloud.

The distribution diagram of the core technologies of distributed architecture is as follows:

The middleware used across the distributed technology stack:

Finally, one diagram summarizes the knowledge system of distributed systems.