The service registry | a Consul failure analysis and optimization

Background

The iQIYI technology and product team provides high-quality video services to hundreds of millions of users. To keep up with rapid business iteration and innovation and to support massive request volumes, many teams have, on their own initiative, rebuilt their business systems on a microservice architecture.

During the move to microservices, each business team chose an open source framework according to its own needs, such as Apache Dubbo or Spring Cloud, and some teams developed frameworks of their own. To monitor their microservice applications, many teams also maintain their own monitoring systems and other infrastructure.

As this practice deepened, several problems gradually surfaced:

· Some infrastructure is built redundantly, which wastes resources and makes stability hard to guarantee;

· Inconsistent technical architectures and SDKs make it difficult to spread best practices quickly across teams;

· Inconsistent technical architectures have introduced a large number of self-maintained gateways into east-west traffic, which lengthens the call path and hurts both troubleshooting efficiency and response latency.

To solve these problems, the iQIYI middleware team gathered the needs and pain points that business teams had encountered in their microservice practice and launched the iQIYI microservice standard architecture. In building the standard architecture we followed these principles:

Unified architecture: In any given technical area there are usually several competing implementations, but adopting too many frameworks leads to high maintenance costs and a lack of specialized support. When selecting technologies for the microservice standard architecture, we weighed the actual state of each open source project against the mainstream solutions in the industry and unified the choice in each area. In principle, each area has no more than one selected technology.
Extensibility: The microservice standard architecture contains many kinds of development SDK. To meet the differing requirements of each development team, every SDK must be extensible. If the open source version cannot meet internal requirements, the iQIYI middleware team maintains a unified, internally customized version.
High availability: One goal of the standard architecture is to gradually consolidate the infrastructure that individual teams maintain (registries, monitoring systems and so on) into a common internal platform. Reviewing the technical architecture of these platforms and maintaining their availability is also one of our important tasks. In addition, we have built a service maturity system, SMMI, to regularly assess the maturity of core systems and basic services.
Technology evolution: Open source software has its own life cycle, so the state of community maintenance must be fully considered for each project. For example, for circuit breaking the standard architecture mainly promotes Sentinel rather than Hystrix, which is no longer actively maintained. The standard architecture is also not fixed: we provide a standardized process for adopting new technologies so that our technical system can keep iterating.
Internal open source: The standard architecture is built through an internal open source collaboration model. Beyond the basic services department, business teams are encouraged to take part in maintaining these basic services, so that together we build a microservice technology system that meets business needs and is competitive within the industry. This in turn helps the standard architecture to spread and improve.

iQIYI microservice standard architecture

The following figure shows the overall iQIYI microservice standard architecture:

The standard architecture consists of the following main parts:

Unified microservice development SDK: the core development frameworks Dubbo and Spring Cloud, the circuit-breaking and rate-limiting framework Sentinel, and so on;
Unified microservice infrastructure, including:

Registry: Nacos/Consul;

Service gateway: a gateway platform built on Kong through secondary development, providing authentication, rate limiting and other functions;

Configuration center: a platform built on Ctrip Apollo through secondary development;

Metrics monitoring: a hosted Prometheus cluster service;

Link monitoring: a full-link tracing platform (customized from SkyWalking);

Chaos engineering: secondary development on top of ChaosBlade, providing various failure drills.

Unified microservice platform: QDAS (QIYI Distributed Application Service), which provides microservice application life cycle management, service governance, a service market and other functions.

Building the standard architecture ecosystem

The following sections cover the core parts of the microservice standard architecture: the microservice SDK, the registry, the monitoring system, circuit breaking and rate limiting, the API gateway, and the microservice platform.

Open Source SDK customization

Based on the needs of the business teams, the iQIYI middleware team made the following main extensions to the Dubbo SDK:

Infrastructure adaptation: registry, monitoring system, internal container platform, etc.
Availability enhancements: including mechanisms for isolating unhealthy instances and routing to the nearest area;
Security enhancements: support for authentication mechanisms for inter-service calls;
Serialization: Added support for protobuf serialization.

This work has been described in detail in other public articles and is not expanded on here. Interested readers can refer to "Apache Dubbo’s iQiyi journey".

Registry Evolution

The registry is one of the most important pieces of infrastructure for microservice applications. Previously, iQIYI had no unified choice of registry: the registries running in production included ZooKeeper, Eureka, Consul and others. In practice, ZooKeeper and Eureka are not considered the best fit for a microservice registry. Taking ZooKeeper as an example, its main disadvantages are:

It cannot scale horizontally;
As a strongly consistent system, it can become unavailable during a network partition.

After researching the solutions available in the industry, we chose Nacos as our next-generation microservice registry. The accompanying figure gives an overview of Nacos. The main reasons for selecting Nacos are as follows:

High performance and horizontal scalability;
Suitable for both traditional microservice architectures and cloud native environments, including support for integrating with the Istio control plane;
It provides a Nacos-Sync component that synchronizes data with other registries and also makes registry migration easier.

Nacos high availability deployment

When deploying the Nacos service, we designed the deployment architecture for high availability. At present, our Nacos service is one large cluster whose instances are distributed across different availability zones (AZs). Each availability zone has its own VIP, and the final intranet domain name is bound to all of these VIPs. The underlying MySQL is also deployed across multiple data centers. This architecture prevents the failure of a single Nacos instance or a single data center from making the entire Nacos service unavailable. The possible failure scenarios and how each is handled are as follows:

Single Nacos instance failure: the instance is automatically removed from the VIP by the health check capability of the load balancer cluster;
Single VIP cluster failure: handled by the client retry mechanism;
Single AZ failure: handled by the client retry mechanism;
MySQL cluster failure: service registration and discovery are not affected, because MySQL is not involved in that process;
Whole Nacos service failure: handled by client-side fallback mechanisms such as service instance caching.
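
The client-side half of these scenarios (retrying against another VIP, then falling back to a cached instance list) can be sketched roughly as follows. This is a minimal illustration, not the actual Nacos or Dubbo client: the class names, the `lookup` callback, and the endpoint names `vip-az1` and `vip-az2` are all assumptions made for the example.

```python
from typing import Callable, Dict, List


class RegistryUnavailable(Exception):
    """Raised when a registry endpoint cannot be reached (hypothetical name)."""


class FailoverDiscoveryClient:
    """Tries each registry endpoint in turn and keeps a local snapshot as fallback."""

    def __init__(self, endpoints: List[str],
                 lookup: Callable[[str, str], List[str]]):
        self.endpoints = endpoints            # assumed: one VIP per availability zone
        self.lookup = lookup                  # (endpoint, service) -> instance list
        self.cache: Dict[str, List[str]] = {}

    def discover(self, service: str) -> List[str]:
        for endpoint in self.endpoints:
            try:
                instances = self.lookup(endpoint, service)
            except RegistryUnavailable:
                continue                      # retry against the next VIP / AZ
            self.cache[service] = instances   # refresh the snapshot on success
            return instances
        # Every endpoint failed: fall back to the last known instance list.
        if service in self.cache:
            return self.cache[service]
        raise RegistryUnavailable(f"no endpoint reachable and no cache for {service}")


# Simulated registry: endpoints listed in `down` refuse to answer.
down = set()


def fake_lookup(endpoint: str, service: str) -> List[str]:
    if endpoint in down:
        raise RegistryUnavailable(endpoint)
    return ["10.0.0.1:20880", "10.0.0.2:20880"]


client = FailoverDiscoveryClient(["vip-az1", "vip-az2"], fake_lookup)
down.add("vip-az1")
first = client.discover("demo-service")    # answered by vip-az2 after one retry
down.add("vip-az2")
second = client.discover("demo-service")   # whole registry down: served from cache
```

The cache is only refreshed on a successful lookup, so a total registry outage leaves consumers with the last healthy view instead of an empty list.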

Smooth registry migration scheme

Next, we briefly describe how to use Nacos-Sync to migrate registries smoothly.

First, deploy a Nacos-Sync service to synchronize data from the old registry to Nacos. Nacos-Sync supports clustered deployment, writes to the new registry idempotently when multiple instances are deployed, and natively supports Dubbo's registry data format.
After verifying that the data is correct, upgrade the consumer side first so that it discovers services from the Nacos registry. At this point the service data it discovers has been synchronized from the old registry by Nacos-Sync.
Then upgrade the provider side so that it registers services with Nacos.
Finally, decommission the Nacos-Sync service and the old registry. The migration is then complete.

The main advantages of this scheme are:

· Service providers and consumers are upgraded completely independently, each on its own schedule;

· Each application involved in the migration needs to be upgraded only once.
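
The migration steps above can be illustrated with a toy simulation. It is only a sketch under simplifying assumptions: registries are modeled as plain dictionaries, `sync` is a hypothetical stand-in for Nacos-Sync, and deregistration from the old registry is not modeled.

```python
from typing import Dict, Set

Registry = Dict[str, Set[str]]  # service name -> set of provider addresses


def sync(old: Registry, new: Registry) -> None:
    """One-way copy from the old registry to the new one.

    Set semantics make the write idempotent: running it repeatedly, or from
    several sync instances, produces the same result.
    """
    for service, providers in old.items():
        new.setdefault(service, set()).update(providers)


old_registry: Registry = {"order-service": {"10.0.0.1:20880", "10.0.0.2:20880"}}
nacos: Registry = {}

# Step 1: synchronize old -> Nacos; two runs stand in for two sync instances.
sync(old_registry, nacos)
sync(old_registry, nacos)
after_sync = {service: set(providers) for service, providers in nacos.items()}

# Step 2: upgraded consumers read from Nacos and still see every provider.
seen_by_consumer = set(nacos["order-service"])

# Step 3: a provider is upgraded and registers with Nacos directly.
nacos["order-service"].add("10.0.0.3:20880")

# Step 4: the old registry and the sync service can be retired; Nacos now
# holds both the synchronized providers and the directly registered one.
final_view = set(nacos["order-service"])
```

Because consumers switch only after the data has been synchronized, they never observe a registry that is missing providers, which is what allows the two sides to upgrade independently.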

Monitoring system construction

Service monitoring is a topic of great concern to every business team. A complete microservice monitoring system generally consists of the following three parts:

Metrics monitoring: golden signals such as QPS, response latency and error rate; business-defined custom metrics; JVM metrics for Java applications; and metrics of the underlying environment such as CPU and memory utilization.
Log monitoring: for example, the number of error logs. AI techniques can also be used for statistical analysis of log patterns.
Link monitoring: Because microservice invocation relationships are complex, call chain tracing is also needed to help business teams analyze dependencies between applications and monitor the core metrics of individual call relationships.

Metrics monitoring

For metrics monitoring, we built a fairly complete, standardized monitoring and alerting solution around Prometheus. Several problems had to be solved:

The first problem is computing the metrics. To reduce intrusion into business code, we carried out secondary development on top of the SkyWalking Agent so that it automatically intercepts calls made through Spring MVC, Dubbo and other mainstream frameworks and records the number of calls, the processing time, errors and so on.
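
The idea of intercepting calls and recording counts, durations and errors can be sketched with a Python decorator. This is only an analogue for illustration: the SkyWalking Agent instruments Java bytecode rather than wrapping functions by hand, and the names `STATS`, `instrumented` and `get_video` are hypothetical.

```python
import time
from collections import defaultdict
from functools import wraps

# endpoint name -> accumulated call count, error count and processing time
STATS = defaultdict(lambda: {"calls": 0, "errors": 0, "seconds": 0.0})


def instrumented(name):
    """Wrap a handler so every call is counted and timed without touching its body."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                STATS[name]["errors"] += 1
                raise                          # the caller still sees the error
            finally:
                STATS[name]["calls"] += 1
                STATS[name]["seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@instrumented("GET /videos")
def get_video(video_id):
    if video_id < 0:
        raise ValueError("invalid id")
    return {"id": video_id}


get_video(1)
try:
    get_video(-1)
except ValueError:
    pass
```

QPS, average latency and error rate can then be derived from these three counters over a time window.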

The second problem is metrics collection. Prometheus collects metrics in pull mode, and in microservice scenarios it generally relies on its service discovery mechanism to find targets. Prometheus integrates Consul, K8S and other service discovery methods by default but does not support the Nacos registry directly. We adapted the open source Nacos Adapter so that Prometheus can discover from Nacos the application instances it needs to scrape.
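
One common way to bridge an unsupported registry to Prometheus is to emit target groups in the JSON format that its file-based and HTTP service discovery accept: a list of objects with `targets` and `labels`. The sketch below shows such a conversion. It is not the adapter described above; the shape of the input (instances as dictionaries with `ip`, `port` and `healthy` keys) and the `service` label name are assumptions.

```python
import json
from typing import Dict, List


def to_target_groups(services: Dict[str, List[dict]]) -> List[dict]:
    """Convert registry instances into Prometheus target groups."""
    groups = []
    for service, instances in sorted(services.items()):
        targets = [
            f'{inst["ip"]}:{inst["port"]}'
            for inst in instances
            if inst.get("healthy", True)       # unhealthy instances are not scraped
        ]
        if targets:
            groups.append({"targets": targets, "labels": {"service": service}})
    return groups


# Hypothetical registry snapshot used only for this example.
services = {
    "video-api": [
        {"ip": "10.0.0.1", "port": 9090, "healthy": True},
        {"ip": "10.0.0.2", "port": 9090, "healthy": False},
    ],
    "user-api": [{"ip": "10.0.1.1", "port": 9090, "healthy": True}],
}

print(json.dumps(to_target_groups(services), indent=2))
```

Prometheus can consume output like this from a file through `file_sd_configs` or from an endpoint through `http_sd_configs`.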

Metrics are viewed in Grafana. We provide a set of generic dashboard templates, and business teams can extend them as needed.

For alerting, the alert policy is set in Prometheus, and Alertmanager sends the resulting alerts through an adapter to the internal alarm monitoring platform.

Dashboard viewing, alert policy settings and alert subscription are all accessed through the internal full-link monitoring platform, where users can view and operate on them directly.

The following figure shows the service monitoring interface:

Link tracing

The basic principle of link tracing follows Google's Dapper paper. The application generates call chain data through an instrumentation agent. The data is aggregated into Kafka through log collection or direct network reporting and then processed by our real-time analysis program. The results fall roughly into three categories: raw call chain data is stored in ES + HBase, real-time monitoring data for call relationships is stored in Druid, and topology relationships are stored in a graph database.

Link tracing provides the following functions:

Call dependency analysis: shows dependencies at several levels of granularity, between services and between interfaces, supports middleware such as MySQL and Redis, and gives developers an intuitive view of upstream and downstream dependencies;
Inter-service call relationship metrics: monitors core metrics such as QPS, response latency and error rate, and reports values from both the client and the server perspective for each call relationship, which makes problems easier to locate;
Program exception analysis: records and analyzes the exception types and stack traces found in the call chain data, and shows the exception types of an application and how often each occurs per minute;
Log association: associates the call chain with service logs to give detailed information about program execution.
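
As a rough illustration of how call dependencies and per-relationship metrics can be derived from call chain data, the sketch below aggregates spans into caller-to-callee edges. The span fields (`span_id`, `parent_id`, `service`, `duration_ms`, `error`) are an assumed, simplified schema, not the platform's actual data model.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def edge_stats(spans: List[dict]) -> Dict[Tuple[str, str], dict]:
    """Aggregate spans into (caller, callee) edges with basic metrics."""
    by_id = {span["span_id"]: span for span in spans}
    stats = defaultdict(lambda: {"calls": 0, "errors": 0, "total_ms": 0})
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        if parent is None or parent["service"] == span["service"]:
            continue  # root span or in-process child: not a cross-service call
        edge = stats[(parent["service"], span["service"])]
        edge["calls"] += 1
        edge["errors"] += 1 if span["error"] else 0
        edge["total_ms"] += span["duration_ms"]
    return dict(stats)


# One hypothetical trace: gateway -> video-api (twice), video-api -> mysql.
spans = [
    {"span_id": 1, "parent_id": None, "service": "gateway", "duration_ms": 120, "error": False},
    {"span_id": 2, "parent_id": 1, "service": "video-api", "duration_ms": 80, "error": False},
    {"span_id": 3, "parent_id": 2, "service": "mysql", "duration_ms": 30, "error": True},
    {"span_id": 4, "parent_id": 1, "service": "video-api", "duration_ms": 60, "error": False},
]

edges = edge_stats(spans)
```

The keys of the result form the dependency topology, and the values give the call count, error count and total latency for each relationship.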

Circuit breaking and rate limiting scheme

A microservice architecture involves many upstream and downstream dependencies and a great deal of network communication, all of which put the application itself at risk: an upstream system may send burst traffic or hotspot parameters, and a downstream service may become unavailable, slow down, or return more errors. A system without its own protection can trigger an avalanche effect. To handle these scenarios we mainly adopted the Sentinel framework. Sentinel's core principle is that users define resources (a resource can be a local interface or a remote dependency) and attach rules to them, such as rate-limiting rules. Whenever a resource is accessed, the Sentinel component checks whether the rules are satisfied and throws a specific exception if they are not. Users can catch these exceptions to fail fast or to degrade the business logic. Sentinel also provides a console for managing rule parameters and viewing real-time monitoring.
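
The resource, rule and exception model described above can be sketched in a few lines. This is a conceptual illustration in Python, not Sentinel's API: `FlowRule`, `entry`, `BlockedError` and the fixed one-second window are all simplifications chosen for the example.

```python
import time
from typing import Callable, Dict


class BlockedError(Exception):
    """Raised when a rule rejects access to a resource (hypothetical name)."""


class FlowRule:
    """Allows at most `max_per_second` passes per fixed one-second window."""

    def __init__(self, max_per_second: int,
                 clock: Callable[[], float] = time.monotonic):
        self.max_per_second = max_per_second
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def try_pass(self) -> bool:
        now = self.clock()
        if now - self.window_start >= 1.0:     # start a new window
            self.window_start, self.count = now, 0
        if self.count >= self.max_per_second:
            return False
        self.count += 1
        return True


RULES: Dict[str, FlowRule] = {}                # resource name -> rule


def entry(resource: str) -> None:
    """Check the rule attached to a resource before accessing it."""
    rule = RULES.get(resource)
    if rule is not None and not rule.try_pass():
        raise BlockedError(resource)


def query_video(video_id: int) -> str:
    try:
        entry("queryVideo")
    except BlockedError:
        return "degraded"                      # fail fast with a fallback value
    return f"video-{video_id}"


# A fake clock keeps the example deterministic.
fake_now = [0.0]
RULES["queryVideo"] = FlowRule(2, clock=lambda: fake_now[0])
results = [query_video(i) for i in range(3)]   # third call exceeds the limit
fake_now[0] = 1.5                              # next window: traffic passes again
results.append(query_video(3))
```

Business code only deals with the resource name and the exception; the rule itself can be changed without touching `query_video`.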

To meet the needs of the various business teams at iQIYI, we made several extensions to the Sentinel framework. One example is the complex-parameter rate limiting we implemented. Sentinel has a built-in hotspot parameter limiting function, but it supports only simple parameter types (such as String and int). Real rate-limiting scenarios can be more complicated. For example, in the following figure, traffic may need to be limited based on the ID attribute of the first parameter, which native Sentinel does not support. For this case we provide an abstract interface through which users supply their own implementation to extract from the parameters the value that should be rate limited.
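
The extraction interface can be pictured as follows. This sketch is written in Python for brevity and does not reproduce the actual extension: `ParamExtractor`, `FirstArgIdExtractor`, `HotParamLimiter` and `VideoRequest` are hypothetical names, and window resetting is left out to keep the example short.

```python
from abc import ABC, abstractmethod
from collections import Counter
from dataclasses import dataclass
from typing import Any, Hashable, Tuple


class ParamExtractor(ABC):
    """User-implemented hook that picks the value to limit on from the call arguments."""

    @abstractmethod
    def extract(self, args: Tuple[Any, ...]) -> Hashable:
        ...


@dataclass
class VideoRequest:
    id: int
    quality: str


class FirstArgIdExtractor(ParamExtractor):
    """Limits on the `id` attribute of the first argument."""

    def extract(self, args: Tuple[Any, ...]) -> Hashable:
        return args[0].id


class HotParamLimiter:
    """Allows at most `max_per_key` passes for each extracted value."""

    def __init__(self, extractor: ParamExtractor, max_per_key: int):
        self.extractor = extractor
        self.max_per_key = max_per_key
        self.counts = Counter()

    def try_pass(self, *args: Any) -> bool:
        key = self.extractor.extract(args)
        if self.counts[key] >= self.max_per_key:
            return False
        self.counts[key] += 1
        return True


limiter = HotParamLimiter(FirstArgIdExtractor(), max_per_key=2)
decisions = [
    limiter.try_pass(VideoRequest(id=video_id, quality="hd"))
    for video_id in (7, 7, 7, 8)
]
```

Here the third request for video 7 is rejected while the request for video 8 still passes, because the limit is tracked per extracted ID rather than per resource.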

To deliver rule parameters dynamically, we adapted Sentinel to the internal configuration center. Changes made on the Sentinel Dashboard are ultimately saved to the configuration center, and a business system that includes the configuration center SDK can adjust the parameters dynamically without restarting the application.

On the QDAS management platform, we also use K8S to provide hosted Sentinel Dashboard instances, which saves business teams the cost of deployment and maintenance.

API gateway

The iQIYI API gateway is built on the open source project Kong. It aims to give developers a stable, convenient, high-performance and extensible service entry point together with one-stop management of API configuration and life cycle, which is of great value for microservice governance.

In the control-flow architecture of the API gateway, the API gateway module of the microservice platform is implemented by integrating internal systems and exposing them as services. It gives developers all the entry configuration and management functions they need, with no code intrusion and no manual steps such as work order applications, so that an API is usable as soon as it is created. The API gateway supports common functions such as authentication, rate limiting, and access control.

The structure is shown in the figure below:

The specific functions and implementation principles of the API gateway have been introduced in several public articles. Interested readers can refer to "One-stop portal service | iQIYI microservice platform API gateway in practice".

QDAS

A complete microservice architecture also needs a microservice governance platform. QDAS is an application-centered, one-stop platform that supports the development, deployment, operation and maintenance of microservice applications through functional plug-ins. It is compatible with both the traditional Dubbo/Spring Cloud microservice frameworks and the Istio service mesh architecture.

The QDAS platform mainly supports the following features:

1 Maintenance of basic application information

2 Traditional microservice governance

(1) Instance list and instance offline management, integrated with the Nacos registry;

(2) Monitoring of core metrics in Grafana;

(3) Sentinel Dashboard hosting;

(4) Interface authentication and traffic quota management based on Sentinel (under development);

3 Application life cycle management

Supports application deployment and version upgrades on various platforms (containers and virtual machines).

4 Service market

Interface contract management, including an interface description view based on Swagger UI.

Chaos engineering

Netflix was the first to systematically propose the concept of chaos engineering, whose purpose is to identify risks as early as possible and strengthen weak points in a targeted way. We have long paid attention to our own failure drills and, with the help of internal tools and external open source projects, gradually evolved our own fault injection platform, Xiaoluhuizhuan. With this platform, users can arrange their own drill plans to test the robustness of their services.

At present, the Xiaoluhuizhuan platform supports fault injection for servers, containers (Docker), databases, middleware, network equipment, K8S clusters and more. It displays the associated monitoring, logs and alarms in real time during a drill and automatically generates a drill report afterwards.

In addition, the platform's scheduled drill capability lets users run drills periodically with little effort.

Future plans

Future plans for the microservice standard architecture fall into the following areas:

Technology trends: cloud native and Service Mesh are a clear direction in the evolution of microservice technology. Introducing Service Mesh and providing a smooth transition path for each business will be a focus of our future work.

Service governance: we will further extend the QDAS platform into a control plane for both Service Mesh and traditional microservices.

Developer support: we plan to launch project scaffolding and online debugging services to make it easier for developers to build projects and troubleshoot problems in production.