
Title: What happens when a user makes a successful purchase on an e-commerce site?

Suppose a user successfully completes a purchase on an e-commerce site, and the question specifies microservices. Then what we are dealing with is an e-commerce system built on a microservice architecture.

Designing an e-commerce system is not easy

Think it through simply. Since it is an e-commerce system, users have to buy things, so there must be a user module. What they buy is goods, so there must be a commodity module; we can skip the shopping cart for now. Goods have inventory, which we will keep together with the commodity module for the moment; warehousing and logistics can stand on their own, or we can just treat the goods as virtual, since the question doesn't say they can't be ^_^. A successful purchase must produce an order, so add an order module. An order must be paid for, because nobody hands you goods for free, so there must be a payment module.

Simple and crude, four modules

User module, commodity module (inventory), order module, payment module

Okay, we got a couple of modules, plus an order flow chart

  • Wait, the title says microservices, and microservices raise the question of how to split services

DDD domain driven design

So far we have only been sorting out modules. Since we are building services, the modules have to be split into services. How should they be split? Seemingly we could just follow the module breakdown above, but I am no professional at this, and I hear many people now use DDD (domain-driven design) to guide service decomposition.

Referring to the official DDD architecture sketch, the overall architecture is divided into four layers: Infrastructure, Domain, Application, and Interfaces.

Microservices with DDD

However, in domain design the code layering is not what matters most; what matters is how to divide the domains and demarcate their boundaries. Microservices are well suited to dividing modules along business lines, so microservices + DDD is a natural fit. Personally, I think we should first carve out the large business modules from the microservice perspective, each microservice being an independently deployable module that performs its own function, and then, in actual microservice development, use DDD thinking to divide each service's own domain.

Key to DDD implementation

The first point is to establish a ubiquitous language, and express all aggregates, entities, and value objects in it.

The second point is the key “modeling.”

  • First comes "strategic modeling": review the whole project from a macro perspective, divide it into "bounded contexts", and form a "context map" from a god's-eye view.

  • Then comes "tactical modeling": within the "bounded contexts" that strategic modeling has divided, group "aggregates," "entities," and "value objects" into modules.
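As a rough illustration of tactical modeling (all names here are invented for this sketch, not taken from any DDD framework), an order aggregate with an entity and a value object might look like this in plain Java:

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Value object: immutable, no identity, compared purely by value.
final class Money {
    final BigDecimal amount;
    Money(BigDecimal amount) { this.amount = amount; }
    Money add(Money other) { return new Money(amount.add(other.amount)); }
    @Override public boolean equals(Object o) {
        return o instanceof Money && amount.compareTo(((Money) o).amount) == 0;
    }
    @Override public int hashCode() { return amount.stripTrailingZeros().hashCode(); }
}

// Entity: has identity (the sku) inside the aggregate.
final class OrderLine {
    final String sku;
    final int quantity;
    final Money unitPrice;
    OrderLine(String sku, int quantity, Money unitPrice) {
        this.sku = sku; this.quantity = quantity; this.unitPrice = unitPrice;
    }
    Money subtotal() {
        return new Money(unitPrice.amount.multiply(BigDecimal.valueOf(quantity)));
    }
}

// Aggregate root: the only entry point for changing the order's lines,
// so the invariants (e.g. positive quantity) are enforced in one place.
final class Order {
    final String orderId;
    private final List<OrderLine> lines = new ArrayList<>();
    Order(String orderId) { this.orderId = orderId; }
    void addLine(String sku, int qty, Money price) {
        if (qty <= 0) throw new IllegalArgumentException("quantity must be positive");
        lines.add(new OrderLine(sku, qty, price));
    }
    Money total() {
        Money sum = new Money(BigDecimal.ZERO);
        for (OrderLine l : lines) sum = sum.add(l.subtotal());
        return sum;
    }
}
```

The point is only the shape: the value object is immutable and identity-free, the entity has identity within the aggregate, and all modification goes through the aggregate root.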

Build a context map of our e-commerce system

First, let's determine the core of our strategy and what our purpose is. As an e-commerce system, the core is to sell more goods and obtain more orders and profit, so sales can be regarded as our core domain. That settles a clear core domain.

Getting back to our theme, we don’t have a shopping cart this time, and we don’t have individual member sales prices, so remove some context and create a mapping.

Domain-driven design looks simple but is actually hard to put into practice, because every step needs the participation or guidance of domain experts to produce a context map that matches reality. Compared with the data-driven development pattern we may be more used to, it takes more effort, but it gives better overall control of the project: the domain view is more abstract than the data view, it is more of a top-level design, and it sees further into the fast-changing Internet landscape.

We split microservices into 5 domains, namely sales domain, commodity domain, user domain, order domain and payment domain.

Perfect. Now you can start developing ^ _ ^

  • Wait, wait, wait; no code yet, diagram first. Draw the sequence diagram.

Sequence diagram

A simple ordering process, covering several areas

Perfect. Now you can develop microservices

  • Wait, the technology stack for microservices hasn’t been chosen yet

Selecting the microservice technology stack

The services are split and the sequence diagram is done, so we can start our microservice journey. The current mainstream options include Alibaba's well-known Dubbo, the Spring Cloud "full family bucket," and Sina's Motan. I am familiar with Dubbo and Spring Cloud and have used both. Which one should I choose?

Having used both before, here is a quick, rough summary. Dubbo dates from long ago, when microservices were not as popular as today and much of the theory was not yet complete; it is more of an RPC integration framework, while Spring Cloud is more of a microservice architecture ecosystem. Compared with Dubbo, Spring Cloud is a full suite of microservice solutions, functionally a "super Dubbo." A common comparison: building microservices on Dubbo is like assembling a computer yourself, with a lot of freedom; Spring Cloud is more like buying a branded machine.

I prefer Spring Cloud because it is quick and easy. Ok, decided: the technology stack is Spring Cloud. A happy decision.

  • Wait, before deciding so hastily, shouldn't the pros and cons of microservices be clear first?

Microservices: pros and cons

Having chosen microservices, we have to know their advantages and disadvantages, especially the disadvantages. Introducing microservices means introducing a complex system, and the challenges a complex system brings must be understood in advance.

Pros:

1. Strong modular boundaries

We know that in software architecture and design, modularity is very important. At first we modularized with classes; later we modularized with components or class libraries, which could be reused across projects and shared with other teams. Microservices raise modularity one level above components: modularity in the form of services. Each team builds and maintains its own independent service with a clear boundary; once one team finishes a service, other teams call it directly, with no need to share it as a jar or as source the way components are shared. So the boundaries of microservices are comparatively clear.

2. It can be independently deployed
3. Technological diversity

Cons (or challenges):

1. Distributed complexity

A monolithic application is a single application, and a person familiar with its architecture can have a good grasp of the whole thing. In a distributed system, however, there may be dozens of services involved, at some big companies even hundreds, and services implement business by communicating with each other. The whole system becomes very complex, and an ordinary developer or a single team cannot fully understand how it works. This is the complexity of distribution.

2. Final consistency

Microservice data is managed in a decentralized way: each team has its own data sources and data copies. For example, team A has order data and team B also has order data; when team A revises its order data, the change must be synchronized to team B. This raises the data consistency problem: if consistency is not solved well, the data may diverge, which is unacceptable for the business.

3. Operation and maintenance complexity

Operations used to manage machines plus a monolithic application. A distributed system is different: it has many services, and services must coordinate with each other, so resource and capacity planning, monitoring, and the reliability and stability of the whole system all become very challenging.

Only by clearly understanding the challenges microservices pose, and knowing their trade-offs, can you truly meet those challenges; most importantly, you will know where the pits are so that you don't step into them.

Perfect. Now that you understand the benefits and challenges of microservices, you’re ready to start developing

  • Wait, microservices haven’t done logical layering yet

How do microservices do logical layering

Our microservices currently include several services: orders, goods, users. If the client wants a "my orders" page, and we assume the client is a PC, it would need to request three interfaces, talking to the order, goods, and user services separately, fetching data three times and then integrating the three results for display. Remember that the PC calls backend services over the public Internet, which undoubtedly adds network overhead and makes the PC side more complex. If instead we add a layer in the middle as an aggregation service layer, the network overhead drops, because the data is combined internally over the intranet, and the PC side's job becomes much simpler.

The “PC aggregation service” in the figure is also a microservice, but it belongs to the middle layer of the aggregation service. We will logically divide microservices into two layers:

Microservices base service layer

Basic services are generally the foundational supporting services of an Internet platform; for an e-commerce site, the order service, goods service, user service, and so on. These are comparatively basic and atomic; they sink down to become company infrastructure, connecting to storage below and providing business capability above. Some companies call them basic services, middle-tier services, or public services; Netflix calls them mid-tier services. Let's call them basic services.

Microservices aggregation service layer

There are already basic services providing business capability, so why do we need aggregation services? Because we have different access ends, such as APP, H5, PC, etc. They seem to request roughly the same data, but in fact there are many differences: the PC needs to display more information, the APP needs the information trimmed down, and so on. Lower-level services should be more general; basic services should output relatively uniform, well-abstracted services. But for different APP and PC access we need different adaptations, so we need a layer that does the aggregation and tailoring. For example, when a product's details are displayed, the PC may show more information while the APP cuts some of it out. If the basic service opened its interface directly to the PC and the APP, it would have to do all these adaptations itself, which works against keeping the basic service abstract. So we add the aggregation service layer on top of the basic layer, which can tailor appropriately for PC and APP.
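A minimal sketch of the idea (the interfaces and names below are invented for illustration; a real system would use Spring Cloud Feign/REST clients): the aggregation service makes the three intranet calls itself and tailors the result per device.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-ins for real clients of the three basic services.
interface OrderClient   { Map<String, Object> orderDetail(String orderId); }
interface ProductClient { Map<String, Object> productDetail(String sku); }
interface UserClient    { Map<String, Object> userProfile(String userId); }

// Aggregation-layer service: one public-Internet call from the client
// becomes three intranet calls inside this layer.
class MyOrderAggregator {
    private final OrderClient orders;
    private final ProductClient products;
    private final UserClient users;

    MyOrderAggregator(OrderClient o, ProductClient p, UserClient u) {
        this.orders = o; this.products = p; this.users = u;
    }

    /** device is "PC" or "APP"; the APP gets a trimmed-down view. */
    Map<String, Object> myOrder(String orderId, String device) {
        Map<String, Object> order = orders.orderDetail(orderId);
        Map<String, Object> view = new LinkedHashMap<>();
        view.put("orderId", orderId);
        view.put("product", products.productDetail((String) order.get("sku")));
        if ("PC".equals(device)) {
            // The PC shows richer information, e.g. the buyer's profile.
            view.put("user", users.userProfile((String) order.get("userId")));
        }
        return view;
    }
}
```

The basic services stay generic; the device-specific trimming lives entirely in this layer.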

Then we add another service to our micro service, which belongs to aggregation service.

Ok, now you can happily coding…

  • Wait, something's off. A monolithic application plus local transactions would be fine, but this is distributed; I'm afraid we have to consider distributed transactions.

Distributed transaction

Let's sort out the relationship between the create-order and deduct-inventory steps

As you can see, because of microservices we have distributed the services, and each database may sit on a different physical machine. A single database's ACID guarantees no longer fit this situation, and guaranteeing ACID across the cluster is nearly impossible; even where efficiency and performance could be achieved, it would be hard to add new partitions. If we keep chasing cluster-wide ACID, the system becomes very bad. At this point we need a new theory suited to clusters: CAP.

The CAP theorem

The CAP theorem concerns three properties:

  • Consistency (C): whether all replicas of the data in a distributed system hold the same value at the same moment (equivalently, all nodes access the same, latest copy of the data)
  • Availability (A): whether the cluster can still serve clients' read/write requests after some nodes fail (high availability in the face of updates)
  • Partition tolerance (P): in practical terms, a partition is a time limit on communication; if the system cannot reach data consistency within the limit, a partition has occurred, and it must choose between C and A for the current operation

Simply put, a distributed system can satisfy at most two of the three properties. And since distribution makes partitions inevitable, we cannot avoid partition failures 100%, so we are left choosing between consistency and availability.

In distributed systems, we tend to pursue availability, which is more important than consistency. So how to achieve high availability, here is another theory, which is the BASE theory, which further expands CAP theory.

BASE theory

BASE stands for:

  • Basically Available
  • Soft state
  • Eventually consistent

BASE theory is the result of balancing consistency and availability in CAP. Its core idea: when strong consistency cannot be achieved, each application can adopt an appropriate way, according to its own business characteristics, to achieve eventual consistency of the system.

Ok, that was a lot of theory, and programmers get impatient. Let's hurry up and look at concrete distributed-transaction solutions so we can get on with coding…

Come on, discuss the technical solution:

Several schemes are on the table. We are not here to explain the mechanisms and principles of distributed transactions in depth; this is mainly about technology selection.

One is XA two-phase commit, which many traditional companies use, but it is not suitable for an Internet microservice distributed system: it locks resources for a long time and hurts performance badly.

Another is Alibaba's GTS, which is not open source; Fescar has been open-sourced, but we haven't researched it enough yet, so perhaps after the next round of research. It is excluded for now.

That leaves two: TCC and MQ message transactions

MQ message transaction -RocketMQ

RocketMQ officially announced support for distributed transactions in version 4.3. If you select RocketMQ for distributed transactions, be sure to use version 4.3 or later.

As an asynchronous assured transaction, the two transaction branches are asynchronously decoupled via MQ. The design of RocketMQ's transactional messages also draws on two-phase commit theory; the overall interaction is shown in the following figure:

At this point, we can basically assume that once the MQ sender's local transaction completes, the MQ subscriber is guaranteed to receive the message. With that, we can modify the order/destocking steps:

There is an asynchronous transformation involved here. Let’s take a look at the steps in the synchronous process

  1. View product details (or shopping cart)
  2. Calculate commodity prices and current inventory of goods (generate order details)
  3. Commodity destocking (invokes the commodity inventory service)
  4. Order confirmation (generate valid order)

After the order is created, an event “orderCreate” is published to the message queue, which is then forwarded by MQ to the service subscribing to the message. Since it is based on message transactions, we can assume that the goods module subscribing to the message is 100% likely to receive the message.

After receiving the orderCreate message, the goods service performs the inventory deduction. Note ⚠️: some irresistible factor may cause the deduction to fail. Either way, success or failure, the goods service sends a "storeReduce" message with the deduction result to the message queue, and the order service subscribes to that result.

There are two possibilities when the order service receives the message:

  1. If the inventory reduction is successful, change the order status to “Confirm order”, and the order is successful
  2. If the inventory reduction fails, change the order status to “invalid order”, and the order fails

This mode makes order confirmation asynchronous, which is ideal for high concurrency. But keep in mind that it requires some changes to the front-end user experience, and that process must involve the product team.
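The order-service side of this flow can be modeled as a tiny state machine (plain Java here for illustration; in practice the transition would be driven by the RocketMQ consumer handling the "storeReduce" result message):

```java
// Order states in the MQ-driven flow described above.
enum OrderStatus { CREATED, CONFIRMED, INVALID }

class OrderStateMachine {
    private OrderStatus status = OrderStatus.CREATED;

    /** Handle the "storeReduce" result message from the goods service. */
    void onStockReduceResult(boolean success) {
        if (status != OrderStatus.CREATED) {
            return; // idempotent: duplicate result deliveries are ignored
        }
        status = success ? OrderStatus.CONFIRMED : OrderStatus.INVALID;
    }

    OrderStatus status() { return status; }
}
```

Since MQ consumers may receive a message more than once, the handler deliberately ignores results once the order has left the CREATED state.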

  • Perfect, eventual consistency can be handled with MQ distributed transactions

  • Wait, take a look at the risks of MQ message transaction schemes

The MQ approach above does complete A and B, but with eventual consistency rather than strict consistency: we sacrifice strict consistency in exchange for performance, which suits high-concurrency scenarios like big promotions very well. But if B never manages to execute successfully, consistency is broken, and further remedies must be considered later; the more detailed the remedies, the more complex the system becomes.

TCC scheme

TCC is a two-phase model at the service level. Every business service must implement try, confirm, and cancel methods, which roughly correspond to Lock, Commit, and Rollback in SQL transactions.

1). Try phase: try is only a preliminary operation. Its main responsibility is to complete the checking of all services and reserve the business resources

2). Confirm phase: confirm continues after the checks of the try phase have passed, and it must be idempotent. If confirm fails, the transaction coordinator keeps retrying it until it succeeds

3). Cancel phase: cancel releases the resources reserved in the try phase when the try fails. Like confirm, it must be idempotent and may likewise be retried continuously

Now let’s see, how does our process of ordering and reducing inventory add TCC

In the try phase, the inventory service reserves N units of stock for this order, and the order service generates an "unconfirmed" order; the two reservations are created together. In the confirm phase, the resources reserved in try are used. Under the TCC transaction mechanism, if the try phase managed to reserve its resources normally, then confirm must be able to complete the commit.

If the try phase fails, the cancel interface is invoked to release the resources reserved during try.
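A bare-bones sketch of the inventory participant (names and bookkeeping are invented for this example; real TCC frameworks handle much of this for you): tryReserve checks and reserves stock, confirm consumes the reservation, cancel releases it, and both confirm and cancel are idempotent so the coordinator's retries are harmless.

```java
import java.util.HashMap;
import java.util.Map;

// TCC participant for inventory: try reserves, confirm commits, cancel releases.
class TccInventory {
    private int available;
    private final Map<String, Integer> reserved = new HashMap<>(); // txId -> qty
    private final Map<String, String> finished = new HashMap<>();  // txId -> outcome

    TccInventory(int initialStock) { this.available = initialStock; }

    /** Try phase: check the business rule and reserve the resource. */
    boolean tryReserve(String txId, int qty) {
        if (available < qty) return false;
        available -= qty;
        reserved.put(txId, qty);
        return true;
    }

    /** Confirm phase: must be idempotent, since the coordinator retries it. */
    void confirm(String txId) {
        if ("confirm".equals(finished.get(txId))) return; // already done
        reserved.remove(txId);                            // reservation is consumed
        finished.put(txId, "confirm");
    }

    /** Cancel phase: release the reservation; also idempotent. */
    void cancel(String txId) {
        if (finished.containsKey(txId)) return;           // already settled
        Integer qty = reserved.remove(txId);
        if (qty != null) available += qty;
        finished.put(txId, "cancel");
    }

    int available() { return available; }
}
```

The `finished` map plays the role of the local transaction record the article mentions later; a real implementation would persist it (table, ZK, Redis, etc.) rather than keep it in memory.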

Perfect, let's introduce TCC into our system ^ _ ^

  • Hold on, there’s a question

  • Some students may ask: what if one party fails during confirm or cancel, perhaps with an exception? That is the job of the TCC transaction coordinator. When confirm or cancel does not return, a timer keeps retrying it, which is one reason we emphasize that the confirm and cancel interfaces must be idempotent

  • TCC also keeps a local message table recording each transaction, including the main transaction and its sub-transactions; the completion of each is recorded in this table (it needn't literally be a table; it could be ZK, Redis, etc.). A timer then checks the table.

  • Others will ask how the transaction context is passed around. This is handled by the TCC framework, generally through implicitly passed parameters: when the main transaction is created, sub-transactions are called with implicit arguments, and the sub-transactions, including try, confirm, and cancel, are recorded in the transaction table.

For an open-source TCC framework, Mengyun's TCC is recommended here; there are others too, whichever you like.

Perfect, the order process is developed and QA can access ^ _ ^

  • Wait, have you done anything to protect the microservices?

Circuit breaking, rate limiting, isolation, degradation

Dependency relationships in distributed microservices are complex: a single front-end request may turn into many back-end requests. If a back-end service becomes unstable or slow and there are no good rate-limiting and circuit-breaking measures, user experience suffers, and in serious cases an avalanche effect can take the whole site down. For Alibaba during Double 11 and similar events, going without a good set of rate-limiting and circuit-breaking measures is unimaginable; the site simply could not support such enormous concurrency.

Before 2012, Netflix did not have well-designed rate limiting and fault tolerance, and it was troubled by system stability; several times the website collapsed for lack of good circuit-breaking measures. Netflix then built Hystrix, and with it made a big leap in system stability; there have been no large-scale avalanche accidents since.

The following uses Hystrix as an example to illustrate rate limiting and circuit breaking.

A few concepts:

Circuit breaking, isolation, rate limiting, and degradation are the most important concepts and patterns in distributed fault tolerance.

Circuit breaking

Think of the fuses in your house: when an over-powered circuit causes a problem, the fuse blows to protect you and stop the problem from magnifying.

isolation

We know computing resources are limited: CPU, memory, queues, and thread pools all have caps. Without isolation, one service call might consume a large amount of a resource and starve other services of it; the knock-on effect is that a potential problem in one service makes other services inaccessible.

Rate limiting

When heavy traffic floods a service, we need rate-limiting measures: for example, allowing only a certain number of requests through to a resource within a given period. Anything beyond what the system can handle needs rate-limiting protection.
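As a hand-rolled illustration of the idea (not Hystrix's actual algorithm, which uses semaphores/thread pools and rolling metric windows), a fixed-window counter limiter:

```java
// Fixed-window rate limiter: allow at most `limit` requests per window.
class FixedWindowLimiter {
    private final int limit;
    private final long windowMillis;
    private long windowStart;
    private int count;

    FixedWindowLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    /** nowMillis is passed in so the logic is deterministic and testable. */
    synchronized boolean tryAcquire(long nowMillis) {
        if (nowMillis - windowStart >= windowMillis) {
            windowStart = nowMillis; // a new window begins: reset the counter
            count = 0;
        }
        if (count < limit) {
            count++;
            return true;
        }
        return false;                // over the limit: reject or degrade
    }
}
```

Production limiters usually prefer sliding windows or token buckets, since a fixed window allows up to 2x the limit across a window boundary; this sketch only shows the core mechanism.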

Degradation

If the system cannot provide enough capacity, a degradation capability is needed to keep it from deteriorating further, while offering users a friendly, flexible fallback, such as telling them the feature is temporarily unavailable and to please try again later.

hystrix

Hystrix encapsulates all of the above, circuit breaking, isolation, rate limiting, and degradation, in a single component. Here's a diagram of Hystrix's internal design and invocation process:

  • The general workflow is as follows:
  1. Construct a HystrixCommand object to encapsulate the request, configuring the parameters needed for execution in the constructor
  2. Execute the command; Hystrix provides several execution methods, the most common being synchronous and asynchronous
  3. Check whether the circuit breaker is open; if so, go directly to the fallback method
  4. Check whether the thread pool/queue/semaphore is full; if so, go directly to the fallback method
  5. Execute the run method, typically HystrixCommand.run(), to make the actual business call; on timeout, failure, or an unexpected exception, go directly to the fallback method
  6. Every step in the process reports to Metrics, which computes the circuit breaker's monitoring metrics
  7. The fallback method may itself have a backup fallback
  8. Finally, return the response
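The open-circuit check in steps 3-5 can be sketched in a few lines. This is a toy breaker for illustration only, not Hystrix's real rolling-window implementation: it opens after a number of consecutive failures and then routes every call straight to the fallback.

```java
import java.util.function.Supplier;

// Toy circuit breaker: opens after `threshold` consecutive failures,
// then sends every subsequent call straight to the fallback.
class ToyCircuitBreaker<T> {
    private final int threshold;
    private int consecutiveFailures;

    ToyCircuitBreaker(int threshold) { this.threshold = threshold; }

    T execute(Supplier<T> run, Supplier<T> fallback) {
        if (consecutiveFailures >= threshold) {
            return fallback.get();       // circuit open: skip the real call
        }
        try {
            T result = run.get();
            consecutiveFailures = 0;     // success closes the failure streak
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            return fallback.get();       // failure: degrade to the fallback
        }
    }
}
```

A real breaker also has a half-open state that periodically lets a trial request through (the "recovery attempt ratio" mentioned below); that is omitted here for brevity.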

Perfect, let's add Hystrix to our system, so a sudden traffic peak won't simply crash it

  • Wait, what about the rate-limit thresholds, error-count circuit breaking, timeout circuit breaking, and recovery-attempt ratio that Hystrix needs configured?

This will depend on the metrics of your system and the scale of your deployment. There is also a capacity design issue, which we will discuss in more detail when we bring the system online.

That raises a problem: the rate-limit values, error-count thresholds, and similar numbers are all written in configuration files, say in properties or yml files. If one day we suddenly need to lower the rate limit (perhaps the system is under some pressure), we would have to pull down the code, change it here and there, then package, release, and restart. That whole process takes, if not hours, at least ten minutes or more.

It would be better to put these configuration items into a centralized configuration center

Centralized configuration center

Writing your own configuration center is rather troublesome, so go shopping instead: there are SpringCloud-Config, Baidu's Disconf, Alibaba's Diamond, and Ctrip's Apollo.

The configuration center can be understood simply as a service module. Developers or operations staff configure it through its interface, and the microservices connect to it to pick up parameter changes in real time. There are generally two update modes:

  • In Pull mode, the service periodically pulls data from the configuration center
  • In push mode, the service keeps a long connection to the configuration center; once a configuration changes, the configuration center pushes the changed parameters to the corresponding microservices

Both the pull and push modes have advantages and disadvantages.

  • Pull usually uses a timer to fetch data. Even if one pull fails due to network jitter, the next timer tick gets the latest configuration.

  • Push avoids the pull timer's delay and achieves near-real-time updates, but an update may be lost when the network jitters.
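A client combining the two modes might look roughly like this (a sketch of the idea only; these are not the APIs of Apollo, Disconf, or any real product). Pushes update the cache immediately; a periodic full pull repairs anything a lost push missed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Config client: push updates arrive via onPush; a timer calls pullAll
// periodically so that a push lost to network jitter is repaired.
class ConfigClient {
    interface ConfigServer { Map<String, String> fetchAll(); }

    private final ConfigServer server;
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    ConfigClient(ConfigServer server) { this.server = server; }

    /** Push mode: the config center pushes one changed key in real time. */
    void onPush(String key, String value) { cache.put(key, value); }

    /** Pull mode: periodic full sync guarantees eventual convergence. */
    void pullAll() { cache.putAll(server.fetchAll()); }

    String get(String key, String fallback) {
        return cache.getOrDefault(key, fallback);
    }
}
```

A production client (as the Apollo discussion below notes) would also persist the cache to local disk so it survives both server outages and client restarts.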

Ctrip’s Apollo

What is distinctive about Ctrip's Apollo is that it integrates both pull and push, combining their advantages. When developers or operations staff make a change in the configuration center, the server pushes it to Apollo's clients in real time; but since a push may fail due to network jitter, the client also pulls from the Apollo server at regular intervals. Even if a push is lost, the client will still actively pull and synchronize within a bounded time, guaranteeing that the configuration eventually reaches the service. This is a very distinctive piece of Apollo's high-availability design.

Apollo also provides further high-availability guarantees: the client caches data in memory and syncs it to local disk, so even if the Apollo server goes down, and even if the client service restarts, it can still read the data back from local disk and keep serving. In this respect, Apollo's configuration center has thought high availability through.

With the configuration center in place, we can move the Hystrix thresholds, MySQL usernames and passwords, various service switches, and so on into it.

Perfect, development is basically complete. It's really just a few modules and a simple order-and-purchase flow. But when we handed the system to operations, they shouted: logs! How can you do microservices without call-chain logging?

Call chain monitor & log

Indeed, microservices are a very complex distributed system, and without a set of call chains to monitor, it is difficult to locate problems with dependencies between services.

The figure below shows the "entropy" of Alibaba's microservices as rendered by its EagleEye (hawk-eye) system.

Among today's major Internet companies, Alibaba has the well-known EagleEye system, and Dianping has the famous call-chain monitoring system CAT. Call-chain monitoring was actually first proposed by Google: in 2010 it published a paper named after its internal call-chain system, Dapper, explaining Google's experience with and the principles behind call chains. The general principle is as follows:

Here we can use the ELK stack to record and display call-chain monitoring logs, storing each call as one row record

Connecting traceId and parentSpanId chains the records into an overall link, from which errors, call latency, and call counts can be analyzed
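Storing one call per record, the trace tree can be rebuilt by linking spanId to parentSpanId. The field names below follow Dapper's terminology; the record layout is invented for this sketch, not any specific tool's schema.

```java
import java.util.ArrayList;
import java.util.List;

// One row per call, as in the ELK approach described above.
class Span {
    final String traceId, spanId, parentSpanId, service;
    final long durationMillis;
    Span(String traceId, String spanId, String parentSpanId,
         String service, long durationMillis) {
        this.traceId = traceId; this.spanId = spanId;
        this.parentSpanId = parentSpanId; this.service = service;
        this.durationMillis = durationMillis;
    }
}

class TraceAnalyzer {
    /** Children of a given span within one trace: the edges of the call tree. */
    static List<Span> childrenOf(List<Span> rows, String traceId, String spanId) {
        List<Span> out = new ArrayList<>();
        for (Span s : rows) {
            if (s.traceId.equals(traceId) && spanId.equals(s.parentSpanId)) {
                out.add(s);
            }
        }
        return out;
    }
}
```

Walking the tree from the root span (the one with no parentSpanId) yields the overall link, and the per-span durations give the latency breakdown.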

The mainstream call-chain choices at present are Zipkin, Pinpoint, CAT, and Skywalking, each with its own emphasis. Worth mentioning is Skywalking, a newer call-chain tool: it does open-source call-chain analysis based on bytecode injection, so integration requires no code intrusion; it supports a variety of plug-ins; its UI is among the most powerful of these tools, and pleasing to the eye as well; and it has joined the Apache incubator.

We chose Skywalking as the call-chain tool

Why adopt Skywalking? In underlying principle these products are all similar, but the details of implementation and use differ a lot.

  • First, in implementation, Skywalking is basically non-intrusive to code, using a Java probe and bytecode enhancement; CAT uses code-level instrumentation; Zipkin intercepts requests; Pinpoint also uses a Java probe and bytecode enhancement.
  • Second, in analysis granularity, Skywalking is method-level, Zipkin is interface-level, and the other two are also method-level.
  • For data storage, Skywalking can use ES (famous from logging systems) among other backends; Zipkin can also use ES; Pinpoint uses HBase; CAT uses MySQL or HDFS, which is relatively complex. Our company has a better supply of people familiar with ES, and choosing a storage solution you know well is also key to technology selection.
  • There is also the performance impact. According to some performance reports online, which may not be 100% rigorous but have reference value, the Skywalking probe has the smallest effect on throughput of the four, and some load tests of Skywalking roughly confirm this.

Perfect, the microservices are packaged; upload them to the server and run ^ _ ^

  • Wait, the microservices are packaged, but what remains is uploading jar or war packages to servers one by one and starting them with scripts. That was fine for a monolithic application, but now there are dozens or hundreds of microservice applications. Won't operations be terrified?

I heard that Docker + Kubernetes works better with microservices

docker + kubernetes

Only a few services, so maybe no need for container deployment… but one look says that won't fly: no CI/CD, no gray release… no container orchestration…

We’ll talk about that next time. Let’s deploy the service first

Deploy to production and estimate capacity

Before a service goes live, we must assess or estimate how many users and how much traffic it will see, because that determines how many machine resources to allocate. How should we estimate it? A programmer sitting at home cannot conjure the numbers alone.

Evaluate traffic

  1. Ask operations: if the product is already online, there are existing users and traffic data; even with some deviation, it stays in a controllable range.
  2. Ask product: determine what form the product takes, for example group buying or flash sales, since each needs different handling.

Evaluate the average QPS

It is generally assumed that most requests happen in the daytime, so a day is counted as roughly 40,000 seconds. Average QPS = total daily visits / 40,000.

Evaluate peak QPS

If possible, pull up past traffic charts. Daily peaks differ by business: some businesses peak around 10 a.m., some in the evening with family-leisure traffic. In short, estimate the daily peak according to the business; for e-commerce-type services, the peak is generally about 5 times the daily average. Big promotions may run higher and should be communicated with operations in advance. And some activities, like seckilling, cannot be estimated this way at all; seckilling is a separate consideration, with coping strategies completely different from ordinary ordering.

Evaluate the single-machine limit QPS

Before going live, we load-test each service on each machine together with the testers, generally pushing one service on one machine to its limit and optimizing step by step. Consider a question: suppose an order machine's maximum QPS is 1000 and our peak is 5000. How many machines are needed to hold it? The answer is at least 6, keeping fault tolerance of no fewer than 1 machine.
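The arithmetic above can be written out directly. Note that the 40,000-second divisor, the 5x peak factor, and the +1 fault-tolerance machine are this article's rules of thumb, not universal constants:

```java
// Capacity estimate: average QPS, peak QPS, and machines needed,
// using the rules of thumb from the text.
class CapacityPlan {
    static long averageQps(long dailyVisits) {
        return dailyVisits / 40_000;          // ~40,000 "daytime" seconds per day
    }
    static long peakQps(long dailyVisits) {
        return averageQps(dailyVisits) * 5;   // peak ≈ 5x average for e-commerce
    }
    static long machinesNeeded(long peakQps, long singleMachineLimitQps) {
        // ceil(peak / per-machine limit), plus 1 machine of fault tolerance
        long n = (peakQps + singleMachineLimitQps - 1) / singleMachineLimitQps;
        return n + 1;
    }
}
```

With the text's numbers (peak 5000, single-machine limit 1000), this yields 6 machines, matching the answer above.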


It looks like a fairly simple microservice system is roughly done, but a lot still seems to be missing. Let's count:

  1. Where are the monitoring systems? (infrastructure monitoring, system monitoring, application monitoring, business monitoring)
  2. Where is the gateway
  3. What happened to uniform exception handling
  4. Where is the API documentation
  5. Where did containerization go
  6. Where is the service choreography
  7. .