Preface

In a microservice architecture, the service registry is the most fundamental component, and its stability directly affects the stability of the entire service system. This article introduces how iQiyi's microservice platform built its service registry on Consul and integrated it with the internal container platform and API gateway, and focuses on a Consul failure we encountered: the analysis and resolution process, as well as the architectural optimizations and adjustments made in response.

Consul is a popular tool for discovering and configuring services in distributed systems. Compared with other service registration and discovery solutions, Consul is more of a "one-stop shop" and easier to use. Its main application scenarios are service discovery, service isolation, and service configuration.

Consul portal: https://www.consul.io/

01.

Registry background and use of Consul

From the perspective of the microservice platform, we want to provide a unified service registry: any business or team only needs to agree on a service name to use this piece of infrastructure, with multi-DC deployment and failover supported out of the box. Because of Consul's good design for scalability and multi-DC support, we chose Consul and adopted its recommended architecture: each DC has its own Consul Servers and Consul Agents, and the DCs are peered with one another in WAN mode. The structure is shown in the figure below.

Note: the figure shows only four DCs; in the actual production environment there are more than ten, reflecting the company's data center construction and third-party cloud access.
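To make the "agree on a service name and use the infrastructure" workflow concrete, here is a minimal sketch of registering a service instance with the local Consul Agent using the official Go client (github.com/hashicorp/consul/api). The service name, port, and health-check URL are placeholders, not values from our platform.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul Agent (DefaultConfig targets 127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register this instance under the agreed-upon service name.
	// "demo-service", the port, and the health-check URL are placeholders.
	reg := &api.AgentServiceRegistration{
		ID:   "demo-service-1",
		Name: "demo-service",
		Port: 8080,
		Check: &api.AgentServiceCheck{
			HTTP:     "http://127.0.0.1:8080/health",
			Interval: "10s",
			Timeout:  "1s",
		},
	}
	if err := client.Agent().ServiceRegister(reg); err != nil {
		log.Fatal(err)
	}
	log.Println("registered demo-service with the local Consul Agent")
}
```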

02.

Integration with QAE container application platform

iQiyi's internal container application platform, QAE, is integrated with Consul. Because QAE was originally built on the Mesos/Marathon stack, it has no Pod (container group) concept and therefore cannot cleanly inject a sidecar into containers. We therefore chose the third-party registration pattern, in which QAE synchronizes registration information to Consul in real time, as shown in the figure below. A known weakness of this pattern is that health status can diverge: if Consul deems a node or service instance unhealthy but QAE does not detect a problem, QAE will not restart or reschedule the instance, so the service may be left with no healthy instance in Consul.
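Because the containers themselves do not run an agent or sidecar, the synchronization conceptually amounts to writing catalog entries on the containers' behalf. The sketch below shows what one such write could look like with the Go client's catalog API; the node, address, and service names are hypothetical, and the real QAE sync also handles updates and deregistration.

```go
package qaesync

import "github.com/hashicorp/consul/api"

// syncInstance sketches third-party (out-of-band) registration: the platform
// writes an instance into Consul's catalog on behalf of the container,
// instead of the container registering itself through a local agent.
// Node, address, and service names are hypothetical.
func syncInstance(client *api.Client) error {
	reg := &api.CatalogRegistration{
		Node:           "qae-node-01", // container host, hypothetical
		Address:        "10.0.0.12",   // host address, hypothetical
		Datacenter:     "dc1",
		SkipNodeUpdate: true,
		Service: &api.AgentService{
			ID:      "demo-service-1",
			Service: "demo-service",
			Address: "10.0.0.12",
			Port:    8080,
		},
	}
	_, err := client.Catalog().Register(reg, nil)
	return err
}
```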

The relationship between QAE applications and services is shown as follows:

Each QAE application represents a group of containers. The mapping between applications and services is loosely coupled; an application is associated with the corresponding Consul DC based on the DC in which it resides.

03.

Integration with API gateway

The microservice platform's API gateway is one of the heaviest users of the service registry. Gateway clusters are deployed per region and carrier; each gateway cluster is mapped to a Consul DC and queries Consul for the latest service instances, as shown in the following figure.

Here we use Consul's PreparedQuery feature: for every service, it returns instances from the local DC first; if the local DC has none, it queries other DCs from near to far based on inter-DC RTT.
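As an illustration, the sketch below defines and executes such a prepared query with the Go client: it prefers healthy instances in the local DC and fails over to the nearest DCs by RTT. The query and service names and the NearestN value are placeholders, and the failover field name differs between client versions.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Define a prepared query that prefers local-DC instances and, if none
	// are healthy, fails over to the 3 nearest DCs by RTT. Names are placeholders.
	def := &api.PreparedQueryDefinition{
		Name: "demo-service",
		Service: api.ServiceQuery{
			Service:     "demo-service",
			OnlyPassing: true,
			// Failover is named QueryFailoverOptions in newer client versions.
			Failover: api.QueryDatacenterOptions{NearestN: 3},
		},
	}
	id, _, err := client.PreparedQuery().Create(def, nil)
	if err != nil {
		log.Fatal(err)
	}

	// The gateway executes the query by ID (or name) to get service instances.
	resp, _, err := client.PreparedQuery().Execute(id, nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, node := range resp.Nodes {
		log.Printf("%s:%d (dc=%s)", node.Service.Address, node.Service.Port, node.Node.Datacenter)
	}
}
```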

Failure, analysis, and optimization

01.

Consul failure

Consul was launched at the end of 2016 and had been running stably for more than three years until we recently hit a failure. We received alarms that several Consul Servers in one DC were not responding to requests and that a large number of Consul Agents could not connect to the Servers. The symptoms on the Servers were as follows:

1. The Raft protocol could not elect a leader and kept holding elections;

2. HTTP and DNS query interfaces timed out in large numbers; some were observed to take tens of seconds to return (normally they return within milliseconds);

3. The goroutine count rose quickly and linearly, memory rose in step, and the process eventually hit OOM. The logs showed nothing clearly wrong, but the monitoring metrics showed an abnormal increase in PreparedQuery execution time, as shown in the figure below.

During the failure, the API gateway's service queries timed out and failed. We switched the affected gateway clusters over to another DC and restarted the Consul processes to recover.

02.

Failure analysis

Log investigation showed that network jitter between DCs (increased RTT accompanied by packet loss) occurred before the failure and lasted about one minute. Our preliminary analysis was that the inter-DC network jitter caused normally received PreparedQuery requests to back up inside the Server without returning promptly; as time went on, goroutine and memory usage increased until the Server became abnormal.

Following this idea, we tried to reproduce the problem in a test environment: 4 DCs in total, a PreparedQuery QPS of 1.5K per Server, with each PreparedQuery triggering 3 cross-DC queries. We then used tc-netem to simulate an increase in inter-DC RTT and obtained the following results:

1. When the inter-DC RTT increased from the normal 2ms to 800ms, the Consul Server's goroutine count and memory did grow linearly, and so did the PreparedQuery execution time, as shown in the figure below;

2. Although goroutines and memory kept growing, other functions of the Consul Server were not affected before the OOM: the Raft protocol worked normally, and data query requests for the local DC were answered properly;

3. When the inter-DC RTT was restored to 2ms, the Consul Server lost its leader, and Raft kept voting without managing to recover.

The above steps reproduce the failure reliably, which gave the analysis a direction. First, it was essentially confirmed that the goroutine and memory growth was caused by the backlog of PreparedQuery requests. The initial cause of the backlog was network blockage, but why the backlog persisted after the network recovered was unknown; at that point the whole process was in an abnormal state. So why did Consul fall over after the network recovered? And Raft only involves intra-DC communication, so why did it become abnormal? These were the questions that puzzled us most.

We first focused on Raft. By searching community issues we found issue #6852 (see reference 2), which described a raft deadlock under high load and network jitter in the version we were running, very similar to our situation. However, after updating the raft library and the Consul code as suggested in the issue, the failure could still be reproduced in the test environment.

We then added logging to the raft library to see the details of what raft was doing. This time we found that it took 10 seconds from the moment a raft member entered the Candidate state to the moment it asked its peers to vote for it, and the only code in between was a one-line metrics update, as shown in the figure below.

We suspected that the metrics call was blocking and making the whole system run abnormally, and we later found the relevant optimization in the release history of armon/go-metrics: in lower versions, its Prometheus implementation used a global sync.Mutex, whereas v0.3.3 uses sync.Map, with each metric as a key in the map. The global lock is acquired only when a key is first created, so there is no lock contention between updates of different metrics, and updates of the same metric use sync/atomic to guarantee atomicity and overall efficiency. After updating the corresponding dependency, the Consul Server recovered on its own after network jitter in the reproduction environment.
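The difference between the two versions can be illustrated with a simplified sketch (this is not the actual armon/go-metrics code): the old approach serializes every metric update behind one global mutex, while the new approach keeps one entry per metric in a sync.Map and updates an existing metric with an atomic store.

```go
package metricsdemo

import (
	"math"
	"sync"
	"sync/atomic"
)

// Old approach: one global mutex guards the whole metrics map, so every
// metric update from every goroutine contends on the same lock.
type globalLockSink struct {
	mu     sync.Mutex
	gauges map[string]float64
}

func newGlobalLockSink() *globalLockSink {
	return &globalLockSink{gauges: make(map[string]float64)}
}

func (s *globalLockSink) SetGauge(key string, val float64) {
	s.mu.Lock()
	s.gauges[key] = val
	s.mu.Unlock()
}

// v0.3.3-style approach: a sync.Map holds one entry per metric key. Locking is
// only involved when a key is first created; updating an existing metric is an
// atomic store on that metric's own value, so different metrics never contend.
type shardedSink struct {
	gauges sync.Map // metric key -> *uint64 holding float64 bits
}

func (s *shardedSink) SetGauge(key string, val float64) {
	p, _ := s.gauges.LoadOrStore(key, new(uint64))
	atomic.StoreUint64(p.(*uint64), math.Float64bits(val))
}
```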

This confirmed that the metrics code was blocking and causing the system-wide exception. But one question remained: the PreparedQuery QPS per Server in the reproduction environment was 1.5K, while in a stable network the same Server handled 2.8K QPS under load testing without problems. In other words, the original code met the performance requirements under normal conditions, and the performance problem only appeared during the failure.

The subsequent troubleshooting got stuck until, after much trial and error, we noticed an interesting phenomenon: the binary compiled with Go 1.9 (the version also used in production) reproduced the failure, while the same code compiled with Go 1.14 did not. On closer inspection, we found the following two records in Go's release history:

From the code and user feedback we learned that in Go 1.9~1.13, when a large number of goroutines contend for a single sync.Mutex, performance degrades dramatically, which explains our problem well. Consul's code depends on standard library features introduced in Go 1.9, so we could not compile Consul with an older Go; instead, we removed the sync.Mutex optimization from Go 1.14, as shown below, compiled Consul with this patched Go, and were indeed able to reproduce the failure again.

Starvation mode was added alongside normal mode in Go 1.9 to avoid the long-tail latency of lock waiters. In normal mode, a newly arriving goroutine has a good chance of winning the lock directly, which avoids a goroutine switch and is efficient overall. In starvation mode, new goroutines do not compete for the lock directly; they queue at the tail of the wait queue and sleep until woken, and the lock is handed to waiters in FIFO order, which increases the cost of goroutine scheduling and switching. Go 1.14 addressed the performance issue: in starvation mode, when a goroutine unlocks, it hands its CPU time directly to the next goroutine waiting for the lock, which speeds up execution of the locked section of code.
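The effect can be approximated with a small, self-contained program: many goroutines repeatedly take the same mutex around a very short critical section, roughly what more than 150K backlogged goroutines updating the same metric looked like. The goroutine and iteration counts below are arbitrary; the point is to compare the elapsed time when the program is built with Go 1.9~1.13 versus Go 1.14.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const workers = 10000 // arbitrary; stands in for backlogged goroutines
	const iters = 100     // lock/unlock operations per goroutine

	var mu sync.Mutex
	var counter int
	var wg sync.WaitGroup

	start := time.Now()
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < iters; j++ {
				// Very short critical section, like a single metric update.
				mu.Lock()
				counter++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	// Compare the elapsed time when built with Go 1.9~1.13 vs Go 1.14+.
	fmt.Printf("done: %d increments in %v\n", counter, time.Since(start))
}
```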

The cause of the failure is now clear. First, network jitter caused a large number of PreparedQuery requests to pile up in the Server, which also drove up goroutine and memory usage. After the network recovered, the backlogged PreparedQueries continued to execute; in our scenario the backlog exceeded 150K goroutines. While executing, these goroutines update metrics and therefore acquire the global sync.Mutex, which switched into starvation mode and performed poorly: a large amount of time was spent waiting for the sync.Mutex, and requests blocked until they timed out. Besides the backlogged goroutines, newly arriving PreparedQueries also blocked on the same lock, so the sync.Mutex stayed in starvation mode and could not recover on its own. Meanwhile, the raft code depends on timers, timeouts, and the timely delivery and processing of messages between nodes; these timeouts are on the order of seconds and milliseconds, but the metrics code blocked for so long that the timing logic could no longer work properly.

We then applied these fixes to the production environment, upgrading to Go 1.14, armon/go-metrics v0.3.3, and hashicorp/raft v1.1.2, which brought Consul back to a stable state. In addition, we sorted out and improved our monitoring metrics. The core monitoring covers the following dimensions (a minimal sketch of the process-level sampling follows the list):

1. Process: CPU, memory, goroutine count, connection count

2. Raft: member state changes, commit rate, commit latency, heartbeat, and synchronization delay

3. RPC: connection count and cross-DC request count

4. Write load: registration and deregistration rates

5. Read load: Catalog/Health/PreparedQuery request counts and execution latency
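As an example of the process dimension, the kind of sampling we rely on can be sketched with Go's runtime package; the interval and the log-based output below are placeholders for the actual monitoring pipeline.

```go
package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	// Periodically sample process-level metrics; in practice these would be
	// exported to a monitoring system rather than logged.
	for range time.Tick(10 * time.Second) {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		log.Printf("goroutines=%d heap_alloc=%dMB",
			runtime.NumGoroutine(), ms.HeapAlloc/1024/1024)
	}
}
```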

03.

Redundant registration

In light of the symptoms seen during Consul's outage, we reviewed the architecture of the service registry.

In the Consul architecture, if the Consul Servers of one DC fail, disaster recovery has to rely on the other DCs. However, many services that are not on a critical path or have low SLA requirements are deployed in only one DC; for them, a Consul failure in that DC makes the whole service unavailable.

If services deployed in only one DC were redundantly registered in other DCs, they would remain discoverable through those DCs when the local Consul fails. We therefore modified the QAE registration relationship table: for a service deployed in only a single DC, the system automatically registers a copy in other DCs as well, as shown in the figure below.

QAE's redundant registration amounts to duplicating the data at a layer above Consul, since Consul itself does not synchronize service registration data across DCs. For the same reason, services registered directly through Consul Agents have no good way to register redundantly and must rely on the service itself being deployed in multiple DCs.
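Conceptually, the redundant registration just repeats the same catalog write in each additional DC. The sketch below shows the idea with the Go client; the DC names are placeholders, and the real QAE logic also propagates updates and deregistrations to the redundant copies.

```go
package redundancy

import "github.com/hashicorp/consul/api"

// registerRedundantly writes a copy of a single-DC service's catalog
// registration into each of the given extra DCs (placeholder names such as
// "dc2", "dc3"), so the service stays discoverable if its home DC's Consul fails.
func registerRedundantly(client *api.Client, reg *api.CatalogRegistration, dcs []string) error {
	for _, dc := range dcs {
		dup := *reg
		dup.Datacenter = dc // target DC for this redundant copy
		if _, err := client.Catalog().Register(&dup, nil); err != nil {
			return err
		}
	}
	return nil
}
```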

04.

Protecting the API gateway

Currently, the API gateway relies on a local cache of Consul PreparedQuery results to operate. The current interaction mode has two problems:

1. The gateway cache is lazily loaded: data for a service is queried from Consul only when the service is first used;

2. A PreparedQuery may internally involve multiple cross-DC queries, which is time-consuming and complex. Since each gateway node builds its cache separately and the cache has a TTL, the same PreparedQuery is executed many times, and query QPS grows linearly with the size of the gateway cluster.

To improve the stability and efficiency of the gateway's Consul queries, we chose to deploy a dedicated Consul cluster for each gateway cluster, as shown in the figure below.

In the figure, the red Consul clusters are the original public Consul clusters, and the green Consul cluster is deployed specifically for the gateway and works only within a single DC. We developed a component called gateway-consul-sync, which periodically reads PreparedQuery results from the public Consul cluster and writes them into the green Consul cluster; the gateway then queries the green Consul directly for its data (a sketch of the sync loop follows the list of advantages below). This transformation brings the following advantages:

1. From the perspective of supporting the gateways, the load on the public cluster previously grew linearly with the number of gateway nodes; after the change it grows linearly with the number of services, and each service triggers only one PreparedQuery execution per sync cycle, so the overall load is reduced;

2. The green Consul in the figure is used only by the gateway. All of the data it serves is local, and PreparedQuery execution involves no cross-DC queries, so it is less complex and unaffected by cross-DC network issues;

3. If the public cluster fails, gateway-consul-sync stops working, but the green Consul keeps returning the data synchronized before the failure and the gateway continues to work;

4. The gateway queries Consul through the same interface and data format before and after the change, so if the green Consul cluster in the figure fails, the gateway can switch back to the public Consul cluster as a fallback.
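The sync loop mentioned above can be sketched as follows: on a fixed interval, it executes each PreparedQuery against the public cluster and writes the returned instances into the gateway-dedicated cluster through the catalog API. The cluster addresses, service name, and interval are placeholders, and the real gateway-consul-sync also removes stale instances and handles incremental updates and errors.

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

// syncOnce executes one prepared query against the public cluster and writes
// the returned instances into the gateway-dedicated cluster.
func syncOnce(public, dedicated *api.Client, queryName string) error {
	resp, _, err := public.PreparedQuery().Execute(queryName, nil)
	if err != nil {
		return err
	}
	for _, entry := range resp.Nodes {
		reg := &api.CatalogRegistration{
			Node:           entry.Node.Node,
			Address:        entry.Node.Address,
			Service:        entry.Service,
			SkipNodeUpdate: true,
		}
		if _, err := dedicated.Catalog().Register(reg, nil); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	publicCfg := api.DefaultConfig()
	publicCfg.Address = "public-consul:8500" // placeholder address

	dedicatedCfg := api.DefaultConfig()
	dedicatedCfg.Address = "gateway-consul:8500" // placeholder address

	public, err := api.NewClient(publicCfg)
	if err != nil {
		log.Fatal(err)
	}
	dedicated, err := api.NewClient(dedicatedCfg)
	if err != nil {
		log.Fatal(err)
	}

	// Re-sync each prepared query on a fixed interval (placeholder: 30s).
	for range time.Tick(30 * time.Second) {
		if err := syncOnce(public, dedicated, "demo-service"); err != nil {
			log.Printf("sync failed: %v", err)
		}
	}
}
```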

Summary and outlook

For a unified service registry, stability and reliability are always our primary goals. On one hand this means ensuring the stability of the registry itself; on the other hand, we improve the stability of the overall technical system through multi-dimensional redundancy in deployment, data, components, and so on.

We now have a set of monitoring metrics that help us evaluate the overall capacity and saturation of the system. As more and more services are added, we will also improve per-service monitoring so that, when the system load changes unexpectedly, the specific services and nodes involved can be located quickly.

References

1. iQiyi microservice API gateway: mp.weixin.qq.com/s/joaYcdmee…

2. Consul raft deadlock: github.com/hashicorp/c…

3. Consul Prometheus update: github.com/hashicorp/c…

4. Go sync.Mutex performance issue: github.com/golang/go/i…

5. Go sync.Mutex update: go-review.googlesource.com/c/go/+/2061…
