Troubleshooting is a practical problem that every system must face. However, as a system grows more complex, faults become harder to detect, locate, and fix. Didi now serves nearly 400 million passengers and more than 17 million drivers in more than 400 cities, with more than 10 business lines providing services. This rapid business growth poses a serious challenge to system stability. InfoQ spoke with Zhang Yunliu, a senior operations engineer at Didi, to learn more about the company's work on troubleshooting and stability building. Zhang Yunliu will also speak on related topics at CNUTCon on September 10; stay tuned.

InfoQ: Can you give us an overview of what Didi Chuxing has done to improve the efficiency of troubleshooting?

InfoQ: What is your general process when your monitoring systems detect a problem?

InfoQ: Monitoring is a prerequisite for failure avoidance. Can you talk about your monitoring architecture and technology stack?

InfoQ: How has your monitoring evolved over successive iterations, from simple system monitoring to independent monitoring?

InfoQ: Based on Didi Chuxing's operations and maintenance experience so far, what types of failures have you encountered?

InfoQ: What technical points will you focus on sharing with attendees at CNUTCon?