This article introduces microservice architecture and its related components: what they are and why they are used. It focuses on presenting the big picture succinctly, so it does not go into details such as how each component is used.

To understand microservices, you first need to understand what is not a microservice. The usual counterpoint is the monolithic application, in which all functionality is packaged into a single application unit. Moving from a monolith to microservices is not an overnight switch; it is a gradual process of evolution. This article uses an online supermarket application as an example to illustrate that process.

Initial requirements

A few years ago, Xiao Ming and Xiao Pi started an online supermarket together. Xiao Ming was in charge of program development, and Xiao Pi handled everything else. At that time the Internet was not yet well developed, and online supermarkets were still a blue ocean: as long as the features worked, you could make money. So their requirements were very simple: a website on the public Internet where users could browse and purchase goods, plus an admin backend for managing goods, users, and order data.

Let’s sort out the list of features:

  • Website
      • User registration and login
      • Product display
      • Order placement
  • Admin backend
      • User management
      • Product management
      • Order management

Because the requirements were simple, Xiao Ming whipped up the website in no time. For security reasons, the admin backend was not built into the same application as the website, so with a bit more effort Xiao Ming finished the admin site as well. The overall architecture diagram is as follows:

Xiao Ming found a cloud provider, deployed the applications, and the website went online. After launch it won rave reviews and was loved by homebody shoppers of every kind. Xiao Ming and Xiao Pi happily sat back and collected the money.

As the business grows…

The good times did not last long. Within days, all kinds of competing online supermarkets sprang up, putting heavy pressure on Xiao Ming and Xiao Pi.

Under competitive pressure, Xiao Ming and Xiao Pi decided to adopt some marketing strategies:

  • Run promotions, such as New Year's Day discounts, Spring Festival buy-two-get-one-free deals, Valentine's Day dog-food coupons, and so on.
  • Expand channels with mobile marketing. In addition to the website, develop a mobile app, a WeChat mini program, and so on.
  • Targeted marketing: use historical data to analyze users and provide personalized services.

These activities needed development support, so Xiao Ming recruited his classmate Xiao Hong to join the team. Xiao Hong took charge of data analysis and mobile development, while Xiao Ming developed the promotion-related features.

Because the development schedule was tight, Xiao Ming and Xiao Hong did not plan the overall architecture carefully. They simply decided on the spot to put promotion management and data analysis in the admin backend, and to build the WeChat mini program and the mobile app as separate applications. After several all-nighters, the new features and apps were mostly finished. The architecture diagram now looks like this:

This stage has many obvious problems:

  • The website and the mobile apps duplicate a lot of code for the same business logic.
  • Data is sometimes shared through the database and sometimes passed through interface calls, and the call relationships are a mess.
  • To provide interfaces to other applications, a single application gradually balloons, accumulating logic that was never its own. Application boundaries blur, and it becomes unclear which feature belongs where.
  • The admin backend was originally designed with a lower service-level guarantee; adding data analysis and promotion management to it created performance bottlenecks that affected other applications.
  • Database table structures are depended on by multiple applications and cannot be refactored or optimized.
  • All applications operate on the same database, which becomes a performance bottleneck, especially while data analysis jobs are running, when database performance deteriorates dramatically.
  • Development, testing, deployment, and maintenance grow increasingly difficult. Even a small feature change requires releasing the whole application. A release sometimes accidentally carries untested code along with it, or changing one feature unexpectedly breaks another. To limit the impact of possible failures and business downtime, all applications are released at 3 or 4 a.m., and to verify that everything still works, someone has to watch the next day's daytime traffic peak…
  • Buck-passing appears in the team. There are often long debates about which application a piece of common functionality should live in, and it usually ends with everyone implementing it separately, or with it being dumped somewhere and nobody maintaining it.

Despite these problems, the results of this phase should not be denied: the system was built quickly in response to business change. However, pressing, onerous tasks tend to produce partial, short-sighted thinking and compromise decisions. In this architecture everyone focused only on their own corner, with no overall, long-term design. Left unchecked, building the system becomes harder and harder, and it may fall into a cycle of being repeatedly torn down and rebuilt.

It’s time for a change

Fortunately, Xiao Ming and Xiao Hong are ambitious young people. Once they recognized the problem, they freed up some energy from the daily grind of business requirements and began to sort out the overall architecture and prepare for a transformation.

To carry out such a transformation you first need enough time and resources. If the people driving requirements (business staff, project managers, bosses, etc.) care only about the schedule and leave you no spare capacity, you probably won't get anywhere…

The most important thing in programming is abstraction, and the microservice transformation is essentially an abstraction process. Xiao Ming and Xiao Hong sorted out the business logic of the online supermarket, abstracted the common business capabilities, and built several shared services:

  • User service
  • Product service
  • Promotion service
  • Order service
  • Data analysis service

Each application backend now only needs to fetch the data it requires from these services, which eliminates a lot of redundant code and leaves each application with only a thin control layer and its frontend. The architecture at this stage is as follows:

This phase only separates out the services; the database is still shared, so some of the drawbacks of a siloed ("smokestack") architecture remain:

  1. The database becomes a performance bottleneck and has the risk of a single point of failure.
  2. Data management tends to be chaotic. Even with a good modular design at the beginning, over time there will always be instances where one service pulls data directly from another service’s database.
  3. Database table structures may be depended on by multiple services and are difficult to adjust.

If the shared-database pattern is kept, the architecture becomes more and more rigid and loses the point of a microservice architecture. So Xiao Ming and Xiao Hong pressed on and split the database as well. All persistence layers are now isolated from one another, each the responsibility of its own service. In addition, to improve the real-time behaviour of the system, a message queue was introduced. The architecture is now as follows:

After the services are fully separated, they can adopt heterogeneous technologies. For example, the data analysis service can use a data warehouse as its persistence layer for efficient statistical computation, while the product and promotion services, which are accessed frequently, add caching.

Another way to abstract out common logic is to package it as a shared framework library. This avoids the performance cost of remote service invocation, but the management cost is very high, and it is hard to keep all applications on a consistent version.

Splitting the database also brings challenges: cross-database cascading queries, the granularity of data accessed through services, and so on. But these problems can be solved with sound design. Overall, splitting the database is worth it.

Beyond the technology, a microservice architecture has another benefit: it gives the system a clearer division of labour and responsibility, with each person focused on providing better service to others. In the monolithic era, common business functions often had no clear owner. In the end, either everyone reimplemented them separately, or a random person (usually the most capable or enthusiastic one) implemented them inside the application he happened to own. In the latter case, besides his own application, that person ends up responsible for providing these common functions to everyone else, functions that nobody officially owns but that he maintains simply because he is more capable or enthusiastic ("the able do more work"). As a result, people become reluctant to provide common functionality at all. Over time, team members grow isolated and stop caring about the overall architecture.

From this point of view, adopting a microservice architecture also requires organizational adjustment, so a microservice transformation needs the support of management.

After the transformation, Xiao Ming and Xiao Hong each had a clearly defined set of responsibilities. The two were satisfied: everything was as beautiful as Maxwell's equations.

However…

There is no silver bullet

Spring arrived, everything came back to life, and it was time for the annual shopping carnival. Watching the daily order count climb steadily, Xiao Pi, Xiao Ming, and Xiao Hong grinned. Unfortunately, it didn't last long: all of a sudden, boom, the system went down.

In the past, troubleshooting a monolithic application usually meant looking at the logs and examining the error messages and call stack. In a microservice architecture the application is scattered across multiple services, which makes faults much harder to locate. Xiao Ming checked the logs machine by machine and invoked the services one by one by hand. After more than ten minutes of searching, he finally found the fault: the promotion service had stopped responding under a flood of requests, and every service that called it, directly or indirectly, went down with it. In a microservice architecture, a single failing service can trigger an avalanche that brings down the entire system. In fact, before the festival Xiao Ming and Xiao Hong had done a capacity assessment, and the server resources were expected to be sufficient for the festival traffic, so something else must have gone wrong. But the situation was urgent, and every passing minute was money lost, so Xiao Ming skipped the root-cause analysis, immediately spun up several virtual machines in the cloud, and deployed new promotion service instances. After a few minutes the system more or less returned to normal. The outage is estimated to have cost hundreds of thousands in sales, and all three hearts were bleeding…

Afterwards, Xiao Ming simply wrote a log analysis tool (the logs were so large that a text editor could barely open them, and even when it did, no human could scan them by eye), went through the promotion service's access logs, and found that during the failure the product service had, due to a bug in certain code paths, fired a huge number of requests at the promotion service. The problem itself was not complicated; with a flick of the finger, Xiao Ming fixed this bug worth several hundred thousand.

The problem was solved, but there is no guarantee that similar problems will not happen again. A microservice architecture may look perfect on paper, but like a magnificent palace built out of toy blocks, it cannot withstand the slightest tremor. While microservices solve the old problems, they also introduce new ones:

  • The application is scattered across multiple services, which makes faults hard to locate.
  • Stability declines. With more services, the probability that some service fails goes up, and a single service failure can bring down the entire system. In fact, at high traffic volumes, failures will always happen.
  • With so many services, deployment and operations become a heavy workload.
  • Development: how do the services stay compatible with one another as they evolve continuously?
  • Testing: once services are split, almost every feature spans multiple services. What used to be testing a single program becomes testing calls between services, which is much more complex.

Xiao Ming and Xiao Hong reflected on the painful lesson and resolved to tackle these problems. Dealing with failures generally proceeds along two lines: minimizing the probability of failure, and reducing the impact when failures do occur.

Monitor – Detect signs of failure

In highly concurrent distributed scenarios, failures often erupt suddenly and cascade like an avalanche. A sound monitoring system must therefore be established to catch the signs of failure as early as possible.

There are many components in a microservice architecture, and each needs to be monitored on different indicators. A Redis cache, for example, is typically monitored for memory usage and network traffic; a database for connection counts and disk space; a business service for concurrency, response latency, and error rate. It is therefore impractical to build one giant, all-encompassing monitoring system for every component, and such a system would scale poorly. The common practice is for each component to expose an interface that reports its current status (a metrics interface), with the output in a consistent format. A metrics collector is then deployed to periodically fetch and store component state from these interfaces and offer query services. Finally, a UI queries the collector, draws monitoring dashboards, and raises alerts based on thresholds.
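
As a concrete illustration, here is a minimal sketch of such a metrics interface in Python, using only the standard library. The metric name, label, and port are hypothetical; a real service would normally use an official Prometheus client library instead of hand-formatting the output.

```python
# A minimal sketch of a service exposing a metrics interface in the
# Prometheus text format. Metric name, label, and port are illustrative.
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

REQUEST_TOTAL = 0          # incremented by the business handlers (not shown)
LOCK = threading.Lock()

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        with LOCK:
            body = (
                "# HELP request_total Total requests handled by this service\n"
                "# TYPE request_total counter\n"
                f'request_total{{service="promotion-service"}} {REQUEST_TOTAL}\n'
            ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # The metrics collector (e.g. Prometheus) scrapes this endpoint periodically.
    HTTPServer(("0.0.0.0", 9100), MetricsHandler).serve_forever()
```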

Most components don’t need to be developed yourself; there are open source components available on the web. Ming has downloaded RedisExporter and MySQLExporter, which provide an interface to the Redis cache and MySQL database, respectively. Microservices implement customized indicator interfaces according to the business logic of each service. Then Xiaoming adopted Prometheus as the indicator collector, and Grafana was configured with monitoring interface and email alarm. Thus a set of micro-service monitoring system is built:

Locate the problem – Link tracing

In a microservice architecture, a single user request often involves multiple internal service invocations. To make problems easier to locate, you need to record, for each user request, how many internal service calls were made and how they relate to one another. This is called link tracing.

Let's look at a link-tracing example from the Istio documentation to see the effect:

Image from Istio document

As the figure shows, this is a user request to the ProductPage page. During the request, the ProductPage service calls the interfaces of the Details and Reviews services in turn, and the Reviews service in turn calls the Ratings interface. The record of the whole trace forms a tree:

To implement link tracing, each service invocation records at least four pieces of data in the HTTP headers:

  • traceId: identifies the call chain of one user request; calls with the same traceId belong to the same chain.
  • spanId: identifies a single service invocation, i.e., a node in the trace.
  • parentId: the spanId of the parent node.
  • requestTime & responseTime: the request and response timestamps.

In addition, you need a component to collect and store these call records, and a UI component to display the call chains.

The above is only a brief explanation. For the theoretical foundations of link tracing, see Google's Dapper paper.

Having understood the theory, Xiao Ming chose Zipkin, an open-source implementation of Dapper. Then, with a snap of his fingers, he wrote an HTTP request interceptor that generates these fields and injects them into the headers of every outgoing HTTP request, while asynchronously sending a call record to Zipkin's log collector. As a side note, such an interceptor can be implemented inside the microservice code, or with a network proxy component (although then every microservice needs an extra proxy layer).
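
The sketch below illustrates the idea of such an interceptor in Python: it propagates the trace context on an outbound HTTP call using the B3 header names that Zipkin understands. The reporting step is only stubbed out with a print; an actual interceptor would ship the record asynchronously to the Zipkin collector.

```python
# A minimal sketch of trace-context propagation for one outbound HTTP call.
# Header names follow Zipkin's B3 convention; the reporting step is stubbed.
import time
import uuid
import urllib.request

def traced_request(url, incoming_headers=None):
    incoming = incoming_headers or {}
    trace_id = incoming.get("X-B3-TraceId", uuid.uuid4().hex)  # same user request -> same traceId
    parent_id = incoming.get("X-B3-SpanId")                    # the caller's span becomes our parent
    span_id = uuid.uuid4().hex[:16]                            # a new span for this call

    headers = {"X-B3-TraceId": trace_id, "X-B3-SpanId": span_id}
    if parent_id:
        headers["X-B3-ParentSpanId"] = parent_id

    request_time = time.time()
    try:
        return urllib.request.urlopen(urllib.request.Request(url, headers=headers), timeout=2)
    finally:
        # A real interceptor would send this record asynchronously to Zipkin's collector.
        print({"traceId": trace_id, "spanId": span_id, "parentId": parent_id,
               "requestTime": request_time, "responseTime": time.time()})
```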

Link tracing can only locate which service failed; it cannot provide specific error information. Finding the specific error is the job of the log analysis component.

Problem analysis – Log analysis

Log analysis components were already in wide use before microservices became popular. Even in a monolithic architecture, as traffic grows or the number of servers increases, log files swell to a size that is hard to open in a text editor and, worse, end up spread across many servers. To troubleshoot a problem, you would have to log in to each server, fetch the log files, and slowly search them for the information you need.

Therefore, as applications scale up we need a "search engine" for logs, so that the desired log entries can be found accurately. We also need a component on the data-source side to collect logs, and a UI component to display the results:

Xiao Ming did some research and adopted the well-known ELK log analysis stack. ELK stands for its three components: Elasticsearch, Logstash, and Kibana.

  • Elasticsearch: A search engine that also stores logs.
  • Logstash: Log collector. It receives log input, preprocesses the log, and outputs it to Elasticsearch.
  • Kibana: UI component that uses Elasticsearch’s API to find data and present it to the user.

Finally, there is the small matter of getting logs into Logstash. One option is to call Logstash's interface directly whenever a log line is written, but that means changing the application code again… So Xiao Ming chose another approach: logs continue to be written to files, and an agent deployed alongside each service scans the log files and ships them to Logstash.

Gateway – Permission control and service governance

After the split into microservices there are many services and many interfaces, and the call relationships become messy. In the middle of development it is easy to forget which service a piece of data should come from, or to call the wrong one, perhaps a service that should never be called, or a supposedly read-only function that actually modifies data…

To handle these situations, microservice invocation needs a gatekeeper: the gateway. A gateway layer sits between callers and callees and checks permissions on every call. The gateway can also serve as a platform for publishing service interface documentation.
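
A minimal sketch of the permission check a gateway performs, assuming a hypothetical service-token header and route table; a real gateway would also forward the request and handle the broader service-governance concerns.

```python
# A minimal sketch of a gateway's permission check and routing decision.
# The token header, token values, and route table are all hypothetical.
ROUTES = {"/api/promotions": "http://promotion-service:8080"}
ALLOWED_TOKENS = {"product-service-token", "order-service-token"}

def gateway_handle(path, headers):
    # 1. Verify the caller's identity before anything else.
    if headers.get("X-Service-Token") not in ALLOWED_TOKENS:
        return 403, "forbidden"
    # 2. Look up which downstream service owns this path.
    upstream = ROUTES.get(path)
    if upstream is None:
        return 404, "unknown route"
    # 3. A real gateway would proxy the request to the upstream service here.
    return 200, f"would forward to {upstream}{path}"

print(gateway_handle("/api/promotions", {"X-Service-Token": "order-service-token"}))
```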

One question when using a gateway is the granularity at which to apply it. The coarsest-grained option is a single gateway for the whole microservice system: external callers go through the gateway, while calls between microservices are made directly. The finest-grained option is that every call, whether from inside the system or from outside, must pass through the gateway. A compromise is to divide the microservices into zones by business domain: calls within a zone are made directly, and calls between zones go through the gateway.

Since the online supermarket does not have that many services, Xiao Ming adopted the coarsest-grained scheme:

Service registration and discovery – Dynamic scaling

The previous components are all designed to reduce the likelihood of failure. However, failures will always happen, so another area that needs to be studied is how to reduce their impact.

The simplest (and most commonly used) fault-tolerance strategy is redundancy. A service is typically deployed as multiple instances, so that they share the load to improve performance and the other instances can keep responding even if one of them fails.

One question with redundancy is: how many instances? There is no single answer that holds over time; the number needed depends on the service's function and on the time period. On an ordinary day, for example, four instances might be enough, while during a promotion, with traffic surging, it might take forty. So the number of instances is not a fixed value but something to adjust in real time as needed.

In general, the operation of adding an instance is as follows:

  1. Deploy the new instance
  2. Register the new instance with the load balancer or DNS

There are only two steps, but if registering with load balancers or DNS is manual, it’s not easy. Imagine having to manually enter 40 IP addresses after adding 40 instances…

The solution is automatic service registration and discovery. First a service discovery service is deployed, which provides the address information of all registered services (DNS is itself a kind of service discovery). Each application service then registers itself with the discovery service automatically at startup, and afterwards keeps a local copy of the other services' address lists, synchronized in real time (or periodically). The discovery service also checks the health of application services periodically and removes unhealthy instance addresses. This way, adding an instance only requires deploying it, and taking one offline only requires stopping it; service discovery notices additions and removals automatically.
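
To make the mechanism concrete, here is a minimal in-memory sketch of the registry side: instances register and send heartbeats, and instances whose heartbeat has expired are dropped from discovery results. The service names, addresses, and the 30-second TTL are illustrative.

```python
# A minimal in-memory sketch of a service registry with heartbeat expiry.
# Names, addresses, and the TTL are illustrative only.
import time
import threading

class ServiceRegistry:
    TTL = 30  # seconds without a heartbeat before an instance is considered dead

    def __init__(self):
        self._lock = threading.Lock()
        self._instances = {}  # (service, address) -> time of last heartbeat

    def register(self, service, address):
        self.heartbeat(service, address)   # registering is just the first heartbeat

    def heartbeat(self, service, address):
        with self._lock:
            self._instances[(service, address)] = time.time()

    def discover(self, service):
        now = time.time()
        with self._lock:
            # Drop instances whose heartbeat is older than the TTL.
            self._instances = {key: t for key, t in self._instances.items()
                               if now - t < self.TTL}
            return [addr for (svc, addr) in self._instances if svc == service]

registry = ServiceRegistry()
registry.register("promotion-service", "10.0.0.11:8080")
registry.register("promotion-service", "10.0.0.12:8080")
print(registry.discover("promotion-service"))  # both healthy instances
```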

Service discovery also works together with client-side load balancing. Because each application keeps the service address list locally, it can apply its own load-balancing policy when calling other services. You can even attach metadata (such as the service version) at registration time and have the client-side load balancer route traffic based on it, enabling A/B testing, blue-green deployment, and so on.

There are many components to choose from for service discovery, such as Zookeeper, Eureka, Consul, and etcd. However, Xiao Ming, feeling rather confident in his skills and wanting to show off, wrote his own on top of Redis…

Circuit breaking, service degradation, and rate limiting

Circuit breaking

When a service stops responding for whatever reason, its callers usually wait until they time out or receive an error. If the call chain is long, requests pile up and the whole chain ties up resources waiting for downstream responses. So when calls to a service fail repeatedly, the circuit should be broken: mark the service as down and return an error immediately, then probe it again and restore the connection once the service recovers.

Image from Microservices Design
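
A minimal sketch of this behaviour: after a number of consecutive failures the circuit "opens" and calls fail fast; after a cool-down period one trial call is let through to probe whether the service has recovered. The thresholds are illustrative.

```python
# A minimal circuit-breaker sketch: fail fast while the downstream service is
# marked as down, and probe it again after a cool-down. Thresholds are illustrative.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed (normal operation)

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None        # half-open: allow one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()   # too many failures: open the circuit
            raise
        self.failures = 0                # success closes the circuit again
        return result
```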

Service degradation

When a downstream service stops working and it is not a core service, the upstream service should degrade gracefully so that the core flow is not interrupted. For example, the supermarket's ordering page also recommends products to buy together; if the recommendation module goes down, the ordering flow must not go down with it. It is enough to switch the recommendation feature off temporarily.
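
A sketch of what that degradation might look like in code, with a hypothetical recommendation client: if the recommendation call fails, fall back to an empty list so the order flow itself never breaks.

```python
# A minimal degradation sketch: the recommendation feature is non-core, so a
# failure there falls back to an empty result instead of failing the order.
# `recommend_client` is a hypothetical client for the recommendation service.
def get_recommendations(cart_items, recommend_client):
    try:
        return recommend_client.recommend(cart_items)
    except Exception:
        # Degrade gracefully: the order page simply shows no recommendations.
        return []
```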

Rate limiting

After a service fails, upstream services and users habitually retry. As a result, once the service comes back up it is immediately flooded with traffic and may well rise from the dead only to be killed again. Services therefore need to protect themselves by limiting traffic. There are many rate-limiting strategies; the simplest is to drop the excess requests once the number of requests per unit time exceeds a threshold. Rate limiting can also be partitioned by caller, rejecting only requests from the service that is generating the excessive load. For example, if both the product service and the order service call the promotion service and the product service floods it because of a bug, the promotion service can throttle only the product service's requests while responding to the order service normally.
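
A minimal token-bucket sketch of that partitioned rate limiting: each calling service gets its own bucket, so a misbehaving product service cannot starve the order service. The rates, capacities, and caller names are illustrative.

```python
# A minimal token-bucket sketch of per-caller ("partitioned") rate limiting.
# Rates, capacities, and caller names are illustrative.
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.updated = time.time()

    def allow(self):
        now = time.time()
        # Refill tokens in proportion to the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per caller: a flood from product-service cannot starve order-service.
buckets = {"product-service": TokenBucket(rate=50, capacity=100),
           "order-service": TokenBucket(rate=50, capacity=100)}

def handle_request(caller):
    if caller not in buckets or not buckets[caller].allow():
        return 429, "too many requests"
    return 200, "ok"
```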

Testing

In the microservices architecture, testing is divided into three levels:

  1. End-to-end tests: cover the whole system, generally driven through the user interface.
  2. Service tests: test against service interfaces.
  3. Unit tests: test individual units of code.

From top to bottom, the three kinds of tests become easier to implement but less effective. End-to-end tests take the most time and effort, but give us the most confidence in the system. Unit tests are the easiest to implement and the most efficient to run, but cannot by themselves guarantee that the system is problem-free.

Because end-to-end tests are expensive to implement, they usually cover only the core features. When an end-to-end test fails, it needs to be broken down into unit tests: analyze the cause of the failure, then write unit tests that reproduce the problem, so the same bug can be caught faster in the future.

The difficulty with service tests is that a service often depends on other services. This problem can be solved with a mock server:
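
A minimal sketch of the mock-server idea: during a service test, a dependency (here a hypothetical promotion endpoint) is replaced by a stub HTTP server that returns a canned response, so the service under test can be exercised in isolation.

```python
# A minimal mock-server sketch for a service test. The endpoint and the
# canned payload are hypothetical stand-ins for the real promotion service.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class MockPromotionService(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"discount": 0.9}).encode()  # canned response
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):   # keep test output quiet
        pass

def test_order_service_reads_discount():
    server = HTTPServer(("127.0.0.1", 0), MockPromotionService)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/promotions"
        data = json.loads(urllib.request.urlopen(url).read())
        # A real test would exercise the service under test, configured to call `url`.
        assert data["discount"] == 0.9
    finally:
        server.shutdown()

test_order_service_reads_discount()
```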

Unit testing is familiar. We write a lot of unit tests (including regression tests) to cover as much code as possible.

Microservices Framework

The metrics interface, trace-header injection, log shipping, service registration and discovery, routing rules, and functions like circuit breaking and rate limiting all require glue code in each application service, and it would be extremely time-consuming for every service to implement this itself. Following the DRY principle, Xiao Ming developed a microservice framework that pulls the component-integration code and other common code into a shared library, and all application services are developed on top of this framework.

A lot of custom functionality can be implemented in such a framework. You can even inject call-stack information into the trace data for code-level link tracing, or expose thread-pool and connection-pool status for real-time monitoring of low-level service health.

There is one serious problem with a unified microservice framework: upgrading it is expensive. Every framework upgrade requires all application services to upgrade along with it. Compatibility schemes usually allow old and new versions to run in parallel for a while, but if there are many application services the upgrade window can drag on for a very long time, and there are always stable application services that rarely change, whose owners may refuse to upgrade at all… A unified microservice framework therefore needs sound versioning methods and development-management practices.

Alternative route – Service Mesh

Another way to abstract out the common code is to push it into a reverse-proxy component. Each service additionally deploys this proxy, and all inbound and outbound traffic is processed and forwarded through it. This component is called a sidecar.

The sidecar does not add extra network cost. It is deployed on the same host as the microservice node and shares the same virtual network interface, so communication between the sidecar and the service is effectively just a memory copy.

Image from: Pattern: Service Mesh

The sidecar is responsible only for network communication; another component is still needed to centrally manage the configuration of all sidecars. In a service mesh, the part responsible for network communication is called the data plane, and the part responsible for configuration management is called the control plane. Together, the data plane and the control plane form the basic architecture of a service mesh.

Image from: Pattern: Service Mesh

Compared with a microservice framework, a service mesh is less intrusive and easier to upgrade and maintain. It is often criticized for its performance: even though loopback traffic does not generate real network requests, there is still the extra cost of memory copies, and some centralized traffic handling can also hurt performance.

The end is also the beginning

Microservices are not the end point of architectural evolution; there are further directions such as serverless and FaaS. Meanwhile, some people, chanting that what has long been divided must unite again, are rediscovering the monolith…

Either way, the microservice transformation is finished for now. Xiao Ming contentedly stroked his increasingly shiny head and decided to take this weekend off to meet Xiao Hong for a cup of coffee.