Miao Hongtao joined Qunar in 2010 and is now in charge of operation and maintenance management in the Technical Support Department. He has led the team through the planning and construction of the DNS system, load balancing system, O&M automation system, and distributed storage system.


1. Background

When I joined Qunar in 2010, our IDC was still small, with only a few hundred servers. As Qunar grew, the business demand for servers kept increasing, and our IDC footprint grew with it. Placing all servers in one IDC makes management easier, but the risk is concentrated: a fault in a single equipment room, in power supply, networking, or air conditioning, can affect services, and the network is the most fragile part. A typical IDC hosts many tenants, and Internet-facing services are frequently disturbed by attacks. In addition, each IDC connects to its carriers in different ways, so jitter on the carrier side, inside the IDC, or on the links between them can destabilize our services. To spread this risk, we implemented a multi-machine-room deployment, built Qunar's backbone network, and solved traffic interconnection and link redundancy among the machine rooms.

The multi-room deployment solves the capacity limit of a single IDC and provides multiple egresses. However, when a single machine room fails, some services are still affected. The original handling process was: an operations engineer receives the alarm, quickly opens a laptop, connects to the VPN to get onto the intranet, checks the monitoring impact, makes a judgment, and then switches traffic. Even for an operations veteran making fast decisions, the whole process takes more than 10 minutes, and 10 minutes is a significant loss for an e-commerce company like ours. It also means operations engineers must carry their laptops everywhere, ready to respond at any time.

To solve these problems, Qunar independently developed a self-healing system covering both the downlink (users accessing us) and the uplink (us accessing third parties) of the machine-room network.


2. User access – downlink

Users access Qunar’s products in two main ways, client and PC, but both rely on the DNS system. Since we have several machine rooms, we deploy a load balancing system in each of them.

Let me briefly introduce Qunar’s load balancing system. It is mainly built on Nginx; the early model used heartbeat + Nginx, as shown below:

As the business grew, a single Nginx instance became the bottleneck, so we introduced ECMP mode, which made horizontal scaling of Nginx very simple.

Later, we also introduced the OpenResty Enterprise edition. As you know, Nginx must be reloaded every time it loads a new configuration: the old workers are shut down and new workers are started, and during this window some user requests may fail. OpenResty supports hot loading of configuration, so the whole process is invisible to users. Nginx configuration changes are also serial: the next change must wait for the current one to finish, so as the number of clusters grows, a configuration change takes longer and longer. OpenResty Edge solves this by hot-loading upstreams and other configuration. It also implements partition management, which turns our serial work into independent parallel work and greatly increases OPS productivity.

These load balancing improvements eliminated single-node failures within a cluster, but the single-machine-room failure point still existed. For user access, we handle it through DNS resolution. Qunar has open-sourced its DNSDB system; you can learn about it here:

GitHub: github.com/qunarcorp/o…

Through DNSDB, Qunar’s operations team can switch hundreds of domain names with one click. DNSDB can be operated not only through the Web but also through an API, and it is the API that makes automatic switching possible. We monitor nationwide links from each machine room:

The chart above shows the nationwide network monitoring for each machine room. From a single machine room we can probe more than 100 nodes across the country within 30 seconds. Combined with the watcher monitoring platform, when a monitoring threshold alarms, the DNSDB API is triggered to complete the DNS switchover. The switch is triggered only when 2 of 4 consecutive monitoring points alarm; this is an empirical value that effectively avoids false positives in the monitoring system. If monitoring then shows OK for 20 consecutive minutes, DNSDB restores the original configuration. In this way we completely solved the single-machine-room network failure. The automatic recovery mechanism also avoids the extra bandwidth cost of traffic that stays switched away for too long because of human negligence. Qunar’s DNS system also provides views, EDNS, and other features, which give OPS strong support for adjusting bandwidth and improving user access quality.
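The alarm-and-recover loop described above can be sketched roughly as follows. The window size, recovery interval, and the callback wiring to the DNSDB API are assumptions for illustration only; the real watcher/DNSDB integration is more involved.

```python
# Sketch of the alarm-driven DNS failover loop: switch when 2 of 4
# consecutive probe points alarm, restore after a sustained OK streak
# (e.g. 20 minutes of 30-second probes). All names are hypothetical.
from collections import deque

WINDOW = 4           # examine 4 consecutive monitoring points
ALARM_THRESHOLD = 2  # switch when 2 of them alarm
RECOVER_POINTS = 40  # ~20 minutes of OK points at 30s intervals

class RoomFailover:
    def __init__(self, switch_dns, restore_dns):
        self.points = deque(maxlen=WINDOW)
        self.ok_streak = 0
        self.switched = False
        self.switch_dns = switch_dns    # callback that calls the DNSDB API
        self.restore_dns = restore_dns  # callback that restores the original records

    def feed(self, point_ok: bool):
        """Feed one 30-second probe result for this machine room."""
        self.points.append(point_ok)
        if not self.switched:
            alarms = sum(1 for ok in self.points if not ok)
            if len(self.points) == WINDOW and alarms >= ALARM_THRESHOLD:
                self.switch_dns()       # move domains to a healthy room
                self.switched = True
                self.ok_streak = 0
        else:
            self.ok_streak = self.ok_streak + 1 if point_ok else 0
            if self.ok_streak >= RECOVER_POINTS:
                self.restore_dns()      # fall back to the original config
                self.switched = False
```

In the real system the two callbacks would hit the DNSDB API endpoints; here they are injected so the trigger logic stands on its own.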


3. Accessing third-party services – uplink

The DNS system handles the problem of users accessing Qunar, but our business also needs to access third-party partner sites. When a single machine room has a problem, especially between the IDC and its connected carrier, that outbound traffic cannot simply be switched to another egress.

To address this, OPS and Qunar’s TC group implemented an IDC proxy mode. The IDC proxy solution consists of three parts: the proxy module, the qconfig configuration part, and the monitoring part. The structure is as follows:

For the proxy module we chose Squid. At first we chose ATS, which outperforms Squid and can use multiple cores, but we found a problem during testing: when a site resolves to multiple IP addresses and the first one fails, ATS does not retry the second IP and simply returns a 502 gateway error, while Squid does retry the next IP after the first one fails. So we finally chose Squid as our proxy server. To get the most out of the hardware, we run Squid in multi-worker mode. In one of our test environments, a 32-core server with hyper-threading enabled, Squid reached 50K QPS under pressure testing with the CPU fully loaded. In the proxy cluster we also use ECMP mode, which makes scaling the whole cluster very easy.
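The retry behavior that made us pick Squid can be illustrated in a few lines: when a name resolves to several IPs and the first fails, try the next instead of returning 502. The connect function is injected so the logic can be shown without a live network; all names here are hypothetical.

```python
# Illustrates Squid-style fallback across a domain's resolved IPs.
# ATS, by contrast, would give up after the first failed address.
import socket

def resolve_all(host, port):
    """Return every (ip, port) pair the name resolves to."""
    return [info[4] for info in socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)]

def fetch_with_fallback(addrs, connect):
    """Try each resolved address in order; raise only if all fail."""
    last_err = None
    for addr in addrs:
        try:
            return connect(addr)   # first address that answers wins
        except OSError as err:
            last_err = err         # where ATS would already return 502
    raise last_err
```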

With the proxy module in place, we also wanted the proxy to switch automatically when a single machine room fails, transparently to the business. Since most of our services are Java services, we asked the TC team to help develop an HTTP module. After introducing this module, a service can configure the proxy through our qconfig.

On the service side, the following can be set through qconfig:

  • Switch: whether to use the proxy at all is controlled through a switch.
  • Whitelist: only the configured domain names go through the proxy.
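The service-side decision above amounts to a simple conjunction: the proxy is used only when the switch is on and the target domain is whitelisted. The field names in this sketch are assumptions, not qconfig's actual schema.

```python
# Minimal sketch of the service-side proxy decision: switch AND whitelist.
# "proxy_switch" and "proxy_whitelist" are hypothetical config keys.
def should_use_proxy(qconfig: dict, domain: str) -> bool:
    if not qconfig.get("proxy_switch", False):
        return False                      # switch off: always go direct
    return domain in qconfig.get("proxy_whitelist", ())
```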

On the O&M side, there are more configuration options than on the service side:

  • Blacklist: a blacklisted domain name is not allowed to go through the O&M proxy. For example, we blacklist our own domain names so that we don’t waste proxy resources.
  • Proxy rules: rules can be set by domain name, domain name + machine room, machine room, or as a default. Through these rules, OPS can control the traffic.
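One plausible reading of the O&M-side rules is a most-specific-match lookup: blacklist first, then domain + room, then domain, then room, then the default. The key shapes and the precedence order in this sketch are assumptions about the rules listed above, not the actual implementation.

```python
# Sketch of the O&M-side rule lookup: blacklist wins outright, then the
# most specific matching rule applies. Keys and precedence are assumed.
def pick_proxy(rules: dict, blacklist: set, domain: str, room: str):
    if domain in blacklist:
        return None                       # never proxy blacklisted domains
    for key in ((domain, room), domain, room, "default"):
        if key in rules:
            return rules[key]             # e.g. which egress room to use
    return None
```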

With the proxy module and qconfig in place, we still need monitoring, and here we reuse the monitoring system introduced earlier. When a single machine room fails, our program calls qconfig to modify the proxy rules, quickly switching traffic to other machine rooms. The whole process is transparent to the business, and service continuity is maintained.

With these two systems, the operations team is finally less nervous on days off. I hope to lead the team to keep improving our operations systems so that more of them can self-heal, leaving everyone more time to study more advanced technologies and more holiday time with their families.