System design and architecture is a very broad topic that touches almost every area of technical theory. As a backend engineer who has been writing code for nearly 10 years, let me share my view.

System design and architecture are closely tied to the business type of the system. Traditional business systems mainly focus on domain modeling, whereas high-concurrency, high-availability, and data-consistency systems differ significantly from business systems in design. So, for each type of system, I will briefly introduce some of the difficulties faced in design and their solutions.

Background: The domain model is the key to general business system design

The key to business system design is defining the system's models and the relationships between them, which mainly means defining the domain model. Once the models are settled, the relationships between them become clear as well.

For model design, refer to the classic book “Domain-Driven Design”, which gives a clearer understanding of concepts such as domain definition, layering, and the anemic model.

Domain model systems within a single application also need to pay attention to domain layering. As a developer, have you seen, and had to refactor, a lot of code layered in the controller-service-DAO style? Refactoring it often makes people want to vomit blood.

Here is a suggested layering for a better domain design:

Interface layer

It is mainly responsible for interacting and communicating with external systems, for example through Dubbo services, RESTful APIs, or RMI. This layer mainly includes Facade, DTO, and Assembler.

Application layer

The main component of this layer is the Service, but it is important to note that this layer of services is not simply a wrapper around the DAO layer. In a domain-driven design architecture, the Service layer is a “thin” layer that does not implement any internal logic. It is simply responsible for coordinating and forwarding, delegating business actions to the lower domain layer.

Domain layer

The Domain layer is the core of a domain model system. It maintains the object-oriented domain model, and almost all business logic is implemented here. It contains important domain components such as Entity, ValueObject, Domain Event, and Repository.

Infrastructure layer

It supports the Interface, Application, and Domain layers. All platform-specific and framework-specific implementations are provided in Infrastructure, so that these implementations do not “contaminate” the three layers above, especially the Domain layer and its domain model. One of the most common facilities in Infrastructure is the concrete implementation of object persistence.
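The layering above can be sketched in a few lines. This is a minimal illustration in Python, not a reference implementation: the names `Product`, `InMemoryProductRepository`, and `OrderApplicationService` are hypothetical, and the in-memory repository merely stands in for a real persistence implementation in the Infrastructure layer.

```python
from dataclasses import dataclass


class InsufficientStock(Exception):
    """Domain error raised when an order cannot be fulfilled."""


@dataclass
class Product:
    """Domain layer entity: the business rule lives with the data it guards."""
    sku: str
    stock: int

    def reserve(self, quantity: int) -> None:
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        if quantity > self.stock:
            raise InsufficientStock(self.sku)
        self.stock -= quantity


class InMemoryProductRepository:
    """Infrastructure layer: hypothetical stand-in for real persistence."""

    def __init__(self) -> None:
        self._rows = {}

    def get(self, sku: str) -> Product:
        return self._rows[sku]

    def save(self, product: Product) -> None:
        self._rows[product.sku] = product


class OrderApplicationService:
    """Application layer: thin, it only coordinates and delegates."""

    def __init__(self, products: InMemoryProductRepository) -> None:
        self._products = products

    def place_order(self, sku: str, quantity: int) -> None:
        product = self._products.get(sku)
        product.reserve(quantity)  # the decision is made in the domain layer
        self._products.save(product)
```

Note that the application service contains no stock rule of its own. If the rule changes, only `Product.reserve` changes.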

High concurrency system design

How often are you asked this in an interview: how would you redesign your system if its traffic increased N times? High concurrency can be addressed at several levels, such as:

Code level

  • Lock optimization (for example, using lock-free data structures), mainly related to the AQS-based locks in the java.util.concurrent package.
  • Cache design in front of the database (to reduce contention from concurrent database access). This brings the problem of inconsistency between cache and DB data. In practice, high-concurrency systems and data-consistency systems adopt opposite strategies here.
  • Merging data updates at the application layer, so that each container sends only one DB update request at a time.
  • Others include trading space for time with a Bloom filter, reducing processing time through asynchrony, executing concurrently with multiple threads, and so on.
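As an illustration of the Bloom filter item above, here is a minimal sketch in Python. The class name, the default sizes, and the use of salted SHA-256 as the hash family are all assumptions chosen for brevity. A production system would size the bit array from the expected item count and target false-positive rate.

```python
import hashlib


class BloomFilter:
    """Answers "definitely not present" or "possibly present" in fixed memory."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False is always correct; True may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

A typical use is to check the filter before querying the cache or database, so that lookups for keys that were never stored are rejected without touching the DB.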

Database level

  • Choosing different types of storage for different storage requirements, from the early RDBMS, to NoSQL (KV storage, document databases, full-text indexing engines, etc.), to the more recent NewSQL (TiDB, Google Spanner/F1), and so on.
  • Table schema design, including the choice of field types and the differences between them.
  • Index design: pay attention to how clustered indexes work and to covering indexes. The leftmost-prefix matching rule is basic common sense. More advanced topics include using indexes to eliminate sorting and the difference between B+ trees and B-trees.
  • Finally, the usual measures: splitting databases and tables, read/write separation, data sharding, splitting hot data, and so on. High-concurrency systems often split hot data into buckets, and there is a lot to go into here, such as how to initialize the buckets, the routing rules, and how to merge the data at the end. A classic approach is to split one bucket into one primary bucket plus N sub-buckets.

Architecture Design Level

  • Split the distributed system into services
  • Stateless services that support elastic horizontal scaling, both out and in
  • Fail fast at the service logic level
  • Preload hot data along the call chain
  • Multilevel cache design
  • Capacity planning in advance, and so on

High availability system design

For highly available systems, we usually talk about the number of nines, such as 99.999%.
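To make the "nines" concrete, the arithmetic below converts an availability target into the downtime it allows per year. The function name is illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Maximum yearly downtime, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes per year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime per year, which is why such targets usually rule out manual recovery.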

High availability system design can also be analyzed from several angles:

Code level: focus on distributed transaction issues. CAP theory is also a regular interview topic.

Software level: the application is stateless. All deployed instances of a module are identical, and a request produces the same result on any instance. An instance does not store context information. It processes each request using only the parameters the request carries. The goal is fast scaling and service redundancy. A common example is the session problem.

Load balancing problem

Once multiple copies of the software are deployed, how do you keep the load spread evenly across them? How do you select which machine to call? That is the load balancing problem.

In a narrow sense, load balancing can be divided into the following types:

  1. Hardware load balancing: such as F5
  2. Software load balancing: such as LVS, Nginx, HAProxy, DNS, etc.
  3. Load balancing algorithms in code: such as Random, RoundRobin, ConsistentHash, weighted round-robin, and so on
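Of the algorithms listed, ConsistentHash is the least obvious, so here is a minimal sketch in Python. The class name, the use of SHA-256, and the default of 100 virtual nodes per server are assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ConsistentHashRing:
    def __init__(self, nodes, virtual_nodes: int = 100) -> None:
        self._virtual_nodes = virtual_nodes
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Several virtual nodes per server smooth out the key distribution.
        for i in range(self._virtual_nodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def get_node(self, key: str) -> str:
        if not self._ring:
            raise LookupError("no nodes available")
        # First virtual node clockwise from the key's hash, wrapping around.
        index = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[index % len(self._ring)][1]
```

The property that matters for load balancing is that removing a server only remaps the keys that server owned. Keys on the other servers stay where they were, which keeps caches warm during scaling or failure.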

Load balancing in a broad sense can be understood as the capability of load balancing. For example, a load balancing system needs the following four capabilities:

  1. Automatic fault detection
  2. Automatic removal of faulty services (service circuit breaking)
  3. Automatic request retry
  4. Automatic service discovery

Idempotent design problems

As described above, load balancing in the broad sense includes an automatic retry mechanism, so the business logic must be designed to be idempotent.

This can be considered at two levels:

Request level

Because requests will be retried, they must be idempotent: executing a request repeatedly must produce exactly the same result as executing it once. Idempotence at the request level depends on idempotence at the data modification level. At the data access level, read requests are naturally idempotent, returning the same result no matter how many times the query runs, while write requests must be made idempotent. In essence this is a distributed transaction problem, which is discussed in more detail below.

Business level

Without idempotence at the business level, very serious problems can occur, such as rewards issued multiple times and duplicate orders. Idempotence at the business level is essentially a distributed locking problem, as described later. How do you ensure no duplicate orders? With mechanisms such as tokens. How do you make sure goods are not oversold? With techniques such as optimistic locking. How MQ consumers guarantee idempotence is another common interview question.
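As one illustration of the optimistic locking mentioned above, the sketch below guards a stock row with a version column. It uses Python's built-in `sqlite3` module and an in-memory database only so that it runs as is. The `stock` table, its columns, and the `deduct_stock` function are hypothetical.

```python
import sqlite3


def deduct_stock(conn: sqlite3.Connection, item_id: int, retries: int = 3) -> bool:
    """Try to sell one unit; return False when sold out or contention persists."""
    for _ in range(retries):
        row = conn.execute(
            "SELECT quantity, version FROM stock WHERE id = ?", (item_id,)
        ).fetchone()
        if row is None:
            return False  # unknown item
        quantity, version = row
        if quantity <= 0:
            return False  # sold out
        # The update only applies if nobody changed the row since we read it.
        cursor = conn.execute(
            "UPDATE stock SET quantity = quantity - 1, version = version + 1 "
            "WHERE id = ? AND version = ? AND quantity > 0",
            (item_id, version),
        )
        conn.commit()
        if cursor.rowcount == 1:
            return True
    return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock (id INTEGER PRIMARY KEY, quantity INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO stock VALUES (1, 2, 0)")
conn.commit()

# Three purchase attempts against two units in stock.
results = [deduct_stock(conn, 1) for _ in range(3)]
```

With two units in stock, the first two attempts succeed and the third is refused, so the quantity never goes negative. The `quantity > 0` condition in the `UPDATE` is a second safeguard that does not depend on the version check.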

Distributed locks

Idempotent design at the business level is essentially a distributed locking problem. What is a distributed lock? It is a lock on a globally unique resource in a distributed environment. It serializes requests, in effect acting as a mutex, and so solves the idempotence problem at the business layer.

The common solution is the Redis-based SETNX approach, but as an engineer you should be clear that it has problems of its own: a single point of failure, leases that expire by timeout without being renewed, asynchronous master/slave replication, and so on. Going further, by CAP theory an AP system is fundamentally unable to meet a CP requirement. Not even RedLock can.

So how should a distributed lock be designed? Strong consistency and high availability of the lock service itself are the most basic requirements. Others include automatic renewal, an automatic release mechanism, a highly abstracted and simple way to integrate, and being visual and manageable.
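The lease, renewal, and release requirements above can be sketched as follows. This is a single-process simulation in Python, not a distributed lock: `InMemoryLockStore` is a hypothetical stand-in for a strongly consistent store, and the injected `clock` callable exists only to make expiry easy to demonstrate.

```python
import uuid


class InMemoryLockStore:
    """Hypothetical stand-in for a strongly consistent lock service."""

    def __init__(self, clock) -> None:
        self._clock = clock  # callable returning the current time in seconds
        self._locks = {}     # lock name -> (owner_token, expires_at)

    def acquire(self, name: str, ttl: float):
        """Return an owner token on success, or None if the lock is held."""
        now = self._clock()
        holder = self._locks.get(name)
        if holder is not None and holder[1] > now:
            return None
        token = uuid.uuid4().hex
        self._locks[name] = (token, now + ttl)  # the lease releases itself
        return token

    def renew(self, name: str, token: str, ttl: float) -> bool:
        """Extend the lease, but only for the current, unexpired owner."""
        now = self._clock()
        holder = self._locks.get(name)
        if holder is None or holder[0] != token or holder[1] <= now:
            return False
        self._locks[name] = (token, now + ttl)
        return True

    def release(self, name: str, token: str) -> bool:
        """Compare-and-delete: a stale owner cannot release another's lock."""
        holder = self._locks.get(name)
        if holder is None or holder[0] != token:
            return False
        del self._locks[name]
        return True
```

The owner token is what makes release and renewal safe. A client whose lease has already expired cannot delete or extend a lock that has since been granted to someone else.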

Reliable solutions based on the storage layer are as follows:

ZooKeeper

CP / ZAB / N+1 availability: implemented with ephemeral nodes and the Watch mechanism.

etcd

CP or AP / Raft / N+1 availability: RESTful API; KV storage with strong consistency, high availability, and reliable data through persistence. The client uses TTL mode and must renew the lock by heartbeat, with a CAS on its UUID.

Service circuit breaking

After a move to microservices, the system is deployed in a distributed manner and the services communicate with each other through RPC. The failure probability of the whole system grows with its scale, and a small failure can be conducted along the call chain and amplified into a bigger one. When a service on a non-critical path degrades, the caller should shield itself from the impact as much as possible.
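A minimal sketch of this idea in Python is shown below. The class name, the thresholds, and the injectable `clock` are assumptions, and the sketch is single-threaded. A real breaker would also need locking and usually limits the half-open state to a single trial request.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the downstream service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("downstream call skipped")
            # Timeout elapsed: half-open, let this call through as a trial.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The caller can catch `CircuitOpenError` and return a default value or cached result, which is how a non-critical dependency is prevented from dragging down the main path.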

Service degradation

Service degradation applies when the overall load of a service exceeds the preset upper limit, or when incoming traffic is expected to exceed the threshold. To keep important or basic services running normally, some requests are rejected, or some non-critical, non-urgent services or tasks are delayed or suspended.

The main means are as follows:

Service layer degradation

  1. Reject some requests (traffic limiting): for example, queue incoming requests and reject those that have waited too long, or reject non-core requests based on request headers. There are also general algorithms such as token bucket and leaky bucket.
  2. Shut down some services: for example, the reverse (refund) service is shut down at midnight when the Double 11 sales promotion starts.
  3. Hierarchical degradation: for example, each layer degrades on its own. From the gateway down to the services and the DB, the downstream request volume is gradually reduced by interception and business rules, reflecting the fact that processing capacity decreases from top to bottom.
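The token bucket algorithm mentioned in item 1 can be sketched as follows in Python. The class and parameter names are illustrative, and the `clock` argument is injectable only so that the behavior can be demonstrated without sleeping.

```python
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: float, clock=time.monotonic) -> None:
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self._clock = clock
        self._tokens = capacity   # start full, so an initial burst is allowed
        self._last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._last_refill
        # Refill lazily, based on the time since the last call.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last_refill = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # the caller should reject or degrade this request
```

Compared with a fixed request-per-second counter, the bucket tolerates short bursts up to `capacity` while still bounding the long-run rate at `rate`.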

Data layer degradation

For example, when traffic is heavy, update requests are only buffered in MQ and read requests are served from the cache. When traffic drops, the buffered updates are applied to bring the data up to date (generally, if degradation is already done at the data access layer, there is no need to do it at the data layer).

Flexible availability strategy

For example, traffic limiting tools that cap the maximum traffic, or that limit flow based on CPU load, must switch on automatically rather than rely on manual operation.

Availability issues caused by releases

Releases also affect high availability. I have even seen cases where a release took the production system offline directly (a bank’s internal system). Serious Internet companies, however, mainly use these kinds of release: gray release, blue-green release, canary release, and so on.

Data consistency system design

In general, financial and accounting systems have very strict requirements in this area. The following mainly covers transaction consistency, consistency algorithms, and related topics.

Transaction consistency problem

At the DB level, data consistency is generally achieved through rigid transactions, implemented mainly with WAL (write-ahead logging). All changes to data files must be logged first so that they can be recovered after a crash. Traditional database transactions are based on this mechanism (REDO replays committed transactions, and UNDO rolls back uncommitted ones).

Besides this, another approach backs up data through shadow data blocks: before a data block is modified, its prior state is recorded as a backup. If a rollback is needed, the backup blocks are used to overwrite the modified ones.

Beyond that, there is the XA model, which is based on two-phase commit.

However, Internet systems today are widely deployed in a distributed manner, where traditional rigid transactions cannot be applied, so flexible transactions have become the mainstream solution for distributed transactions. The main modes are as follows:

TCC mode (or two-phase mode)

During the try phase, resources are pre-deducted (but not locked, improving availability), and data is committed or rolled back during the Confirm or Cancel phase. Typically, you need to introduce a coordinator, or transaction manager.
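The Try, Confirm, and Cancel phases can be sketched as below. This is a toy, in-process illustration in Python. `AccountParticipant` and `run_tcc` are hypothetical names, and a real coordinator would also persist the transaction state and retry Confirm or Cancel after failures.

```python
class AccountParticipant:
    """Hypothetical participant holding a balance and per-transaction reservations."""

    def __init__(self, balance: int) -> None:
        self.balance = balance
        self.reserved = {}  # transaction id -> reserved amount

    def try_reserve(self, tx_id: str, amount: int) -> bool:
        if amount > self.balance:
            return False
        self.balance -= amount  # pre-deduct, so others cannot spend it
        self.reserved[tx_id] = amount
        return True

    def confirm(self, tx_id: str) -> None:
        self.reserved.pop(tx_id, None)  # idempotent: repeated confirms are no-ops

    def cancel(self, tx_id: str) -> None:
        self.balance += self.reserved.pop(tx_id, 0)  # idempotent refund


def run_tcc(tx_id: str, steps) -> bool:
    """Coordinator: steps is a list of (participant, amount) pairs."""
    tried = []
    for participant, amount in steps:
        if not participant.try_reserve(tx_id, amount):
            for done in tried:
                done.cancel(tx_id)  # roll back every successful Try
            return False
        tried.append(participant)
    for participant in tried:
        participant.confirm(tx_id)
    return True
```

Because the coordinator may retry, `confirm` and `cancel` are written to be idempotent, which ties back to the idempotent design discussed earlier.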

SAGA mode

Each participant in a business process commits a local transaction, and when one of the participants fails, compensates the previously successful participants, supporting forward or backward compensation.
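Backward compensation can be sketched in a few lines of Python. The `run_saga` function and the step names are hypothetical, and each step is modeled as a pair of plain callables rather than a call to a remote participant.

```python
def run_saga(steps) -> bool:
    """steps: list of (action, compensation) callables taking no arguments."""
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            # Backward recovery: undo the finished steps in reverse order.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True


log = []


def fail_payment():
    raise RuntimeError("payment declined")


ok = run_saga([
    (lambda: log.append("create_order"), lambda: log.append("cancel_order")),
    (lambda: log.append("reserve_stock"), lambda: log.append("release_stock")),
    (fail_payment, lambda: log.append("refund_payment")),
])
```

Here the third step fails, so the stock is released and the order is cancelled, in that order. The failed step's own compensation is not run because that step never committed.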

Transaction messages for MQ

A half message is sent first, and after the local transaction is processed a commit or rollback message is sent. MQ also periodically asks the producer whether a pending half message should be committed or rolled back, so that the transaction eventually reaches consistency. In effect, the compensation action is delegated to RocketMQ.

Segmented transactions (asynchronous assurance)

This is a retry mechanism built on reliable messages, a local transaction message table, and a message queue. It is currently the mainstream scheme at some large companies, where it is commonly called a segmented transaction.
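The local transaction message table can be sketched as follows, using Python's built-in `sqlite3` module so that the example runs as is. The `orders` and `outbox` tables, the payload format, and the `publish` callback standing in for the message queue client are all assumptions.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, sent INTEGER)"
    )


def place_order(conn: sqlite3.Connection, order_id: str) -> None:
    # One local transaction: the business row and the message commit together.
    with conn:
        conn.execute("INSERT INTO orders (id) VALUES (?)", (order_id,))
        conn.execute(
            "INSERT INTO outbox (payload, sent) VALUES (?, 0)",
            (f"order_created:{order_id}",),
        )


def relay_pending(conn: sqlite3.Connection, publish) -> int:
    """Publish unsent messages in order; return how many were delivered."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    delivered = 0
    for message_id, payload in rows:
        try:
            publish(payload)
        except Exception:
            break  # leave it unsent; the next poll retries
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (message_id,))
        delivered += 1
    return delivered
```

A scheduled job would call `relay_pending` repeatedly. Since a message can be published and then fail to be marked as sent, delivery is at-least-once, so consumers still need to be idempotent.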

Flexible transactions are basically built on eventual consistency, so they must include compensating actions, and a soft (intermediate) state is generally shown to the user before eventual consistency is reached.

Note that not every system is suited to a data consistency framework. Take data that users can modify at any time by request, such as a merchant settings back-office system where merchants can change data whenever they like. If consistency is involved here and a consistency framework is introduced, then until the compensating actions bring the data to eventual consistency, a resource lock blocks the user's subsequent requests, resulting in a poor experience. In such cases, other means are needed to ensure data consistency, such as data reconciliation.

Consistency algorithms

From the early Paxos algorithm to the later ZAB protocol derived from it (reference: A Simple Totally Ordered Broadcast Protocol), these algorithms are what make a reliable distributed lock solution possible. The Raft algorithm (In Search of an Understandable Consensus Algorithm) is also something you need to understand when designing distributed systems.

Finally

This article has introduced some of the difficulties that different kinds of system design face. Nearly every point here is an industry solution that our predecessors arrived at by continually exploring while solving one problem after another. For an engineer, learning these technical points is only a matter of time. The ability and the spirit to find, face, and solve problems are what is most worth learning, and they are also essential for a system designer or an architect.

— — — — — — — — — — — — — — — — — — — — — —

The author | yong jian

Edit | orange

New retail product | alibaba tao technology