Welcome to Tencent Cloud+ Community, where you can find more of Tencent's large-scale technical practice.

Author: Feng Weiyuan, senior engineer and head of Redis operation and maintenance at Tencent Cloud. He has 6 years of DBA experience covering SQL optimization, instance tuning, database architecture, operation and maintenance of massive database clusters, and the construction and management of operation platforms. He has provided database services for QQ, Qzone, QQ Music, Micro Cloud, Tencent Cloud, and other businesses.

Since its launch in 2015, Tencent Cloud Redis has grown to serve tens of thousands of customers. As the sole person in charge of its operation and maintenance, how did the author tackle the three challenges he faced?

  • Consistent management of meta-information
  • Efficient operation and maintenance of 10,000 devices
  • How to achieve intelligent scheduling

Understanding Tencent Cloud Redis

Tencent Cloud Redis builds on years of technical experience with Tencent's internal distributed cache in QQ, QQ Music, Qzone, Micro Cloud, and other businesses to offer customers a highly available and reliable Redis service platform. The business has grown rapidly: it now runs on tens of thousands of devices, and QPS has reached 100 million.

Tencent Cloud Redis currently comes in a master/slave version, a cluster version, and a new-generation version. It is largely compatible with the Redis protocol and supports strings, lists, sets, sorted sets, hashes, and other data types, so customers can build for a wide range of business scenarios. It supports active/standby hot backup and provides a full set of database services, including automatic disaster-recovery switchover, data backup, fault migration, instance monitoring, online capacity expansion, and data file recovery.

Operational problems

While operating Redis, we ran into a variety of problems, which can be summarized as follows:

  1. Environment: network and TCP parameter settings.

  2. Design: stalls caused by persistence and by page-table copying.

  3. Developers: slow queries, connection storms, lack of flow control, and so on.

  4. End users: for example, e-commerce flash sales (seckill), where access spikes push processing capacity to its limit.

In general, these all come down to a mismatch between the demand for and the supply of resources while the service is running.

Three big challenges

In addressing these operational challenges, we have climbed three mountains:

Challenge 1: Meta-information consistency management

Inconsistent meta-information is behind some of the failures most often encountered in daily operation and maintenance. The four most basic types of meta-information are cluster, device, instance, and configuration. We follow three principles when dealing with this kind of problem.

  • “Complete”: all meta-information is sorted out and accounted for;
  • “Accurate”: it stays consistent with the actual state of the live network;
  • “One”: a single entry point with a unified API for reading and modifying data, so that every metadata change can be audited.

First, all meta-information is sorted out, and the features common to the various kinds are extracted and classified into models. Next, the attributes and methods of the model objects are abstracted and the data structures are defined. Finally, the methods for synchronizing and consuming data are set, and API interfaces are exposed externally. The result is a basic DB-CMDB subsystem: a unified meta-information management system for the database layer.
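The idea of a single entry point whose changes can be audited can be sketched in a few lines of Python. This is only an illustration, not the actual DB-CMDB: the names MetaObject, MetaStore, and audit_log, the in-memory storage, and the sample instance "redis-001" with its attributes are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class MetaObject:
    """One record of meta-information (hypothetical model)."""
    kind: str                      # cluster, device, instance, or configuration
    key: str                       # unique name within its kind
    attrs: Dict[str, Any] = field(default_factory=dict)


class MetaStore:
    """Single entry point for reads and writes; every change is logged."""

    KINDS = {"cluster", "device", "instance", "configuration"}

    def __init__(self) -> None:
        self._objects: Dict[Tuple[str, str], MetaObject] = {}
        self.audit_log: List[Dict[str, Any]] = []

    def get(self, kind: str, key: str) -> Optional[MetaObject]:
        return self._objects.get((kind, key))

    def update(self, kind: str, key: str, operator: str, **changes: Any) -> MetaObject:
        if kind not in self.KINDS:
            raise ValueError(f"unknown meta-information kind: {kind}")
        obj = self._objects.setdefault((kind, key), MetaObject(kind, key))
        for name, new in changes.items():
            old = obj.attrs.get(name)
            if old != new:
                obj.attrs[name] = new
                # Record who changed what, so the change can be audited later.
                self.audit_log.append({
                    "operator": operator, "kind": kind, "key": key,
                    "attr": name, "old": old, "new": new,
                })
        return obj


# Hypothetical usage: register an instance, then resize it.
store = MetaStore()
store.update("instance", "redis-001", operator="dba", device="10.0.0.1", memory_gb=8)
store.update("instance", "redis-001", operator="dba", memory_gb=16)
```

Because every write goes through update, the audit log always matches the stored state; a real system would also need persistence and synchronization with the live network, which this sketch leaves out.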

By design, this general framework can manage the information of different database types, and it lays a foundation for automating operation and maintenance.

Challenge 2: Operating 10,000 devices

When the service first launched, the overall scale was small and much of the operation and maintenance work could be done by hand. But once the number of customers exploded, how could one or two DBAs manually operate 10,000 devices, or cope with the performance impact of 100 million QPS?

To cope with operation at this scale, we built an “operation platform” to carry our operation and maintenance logic.

  • Platform: atomic operations, with tools hosted on the platform
  • Process: tools are chained into reusable processes
  • Visualization: every operation and maintenance action is visualized, simple and clear

We start by writing scripts as tools and hosting them on the platform. Each tool is an atomic operation with only two outcomes: success or failure. Tools are strung together into processes, and each tool can be reused by multiple processes, so most operation and maintenance work, including bringing machines online and offline, Redis migration, and scaling, can be carried out through processes. All of these operations are displayed visually, simple and clear.
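The relationship between atomic tools and processes can be illustrated with a minimal Python sketch. It is not the platform's code: run_tool, run_process, and the three sample tools (check_memory, copy_data, switch_route) are hypothetical names, and the tools only modify a dictionary instead of touching real machines.

```python
from typing import Callable, Dict, List, Tuple

# An atomic tool takes a shared context and reports success (True) or failure (False).
Tool = Callable[[Dict], bool]


def run_tool(tool: Tool, ctx: Dict) -> bool:
    """Run one tool; an exception counts as failure, so only two states exist."""
    try:
        return bool(tool(ctx))
    except Exception:
        return False


def run_process(steps: List[Tool], ctx: Dict) -> Tuple[bool, List[str]]:
    """Run tools in order, stop at the first failure, and report completed steps."""
    done: List[str] = []
    for step in steps:
        if not run_tool(step, ctx):
            return False, done
        done.append(step.__name__)
    return True, done


# Hypothetical tools; real ones would wrap hosted scripts.
def check_memory(ctx: Dict) -> bool:
    return ctx["free_gb"] >= ctx["need_gb"]


def copy_data(ctx: Dict) -> bool:
    ctx["copied"] = True
    return True


def switch_route(ctx: Dict) -> bool:
    ctx["route"] = ctx["target"]
    return True


# Two processes that reuse the same tools.
migrate = [check_memory, copy_data, switch_route]
scale_up = [check_memory, switch_route]
```

Calling run_process(migrate, {"free_gb": 16, "need_gb": 8, "target": "node-b"}) runs all three steps, while a context with too little free memory stops at the first tool and leaves the rest untouched.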

At present, the Tencent Cloud Redis operation platform has hundreds of scenario-based workflows that are called thousands of times a day and cover most operation and maintenance scenarios. Accidents caused by changes have decreased, the service is more stable and reliable, and the efficiency of scenario-based operation and maintenance has increased by 300%. With a platform-based, visual, process-oriented “operation platform”, the whole team can also collaborate better and accumulate and pass on its experience.

Challenge 3: How to achieve intelligent scheduling

Processes that have to be started by hand are only semi-automated. How do we automate the whole operation?

  • Automatic scheduling system
  • Decision system

Automatic scheduling system: tasks can be triggered in two ways. The first is by time: for example, a service is restarted at 3:00 p.m. every Wednesday. The second is by event: we register each alarm as an event in the scheduling system. Once the scheduling system captures an event, it can call a task or process on the job platform to do the work, which closes the operation and maintenance loop.
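The two kinds of trigger can be sketched as a small Python class. This is a simplified illustration under assumptions, not the scheduling system itself: the Scheduler class, its method names, the hourly granularity, and the "memory_high" event are all invented for the example, and jobs are plain functions that return a label instead of calling a real platform.

```python
from collections import defaultdict
from datetime import datetime
from typing import Callable, Dict, List, Tuple

Job = Callable[[], str]  # a job stands in for a task or process on the platform


class Scheduler:
    """Hypothetical scheduler with time-based and event-based triggers."""

    def __init__(self) -> None:
        self._timed: List[Tuple[int, int, Job]] = []          # (weekday, hour, job)
        self._events: Dict[str, List[Job]] = defaultdict(list)

    def every_week(self, weekday: int, hour: int, job: Job) -> None:
        """Register a job for a weekday (Monday is 0) and an hour of the day."""
        self._timed.append((weekday, hour, job))

    def on_event(self, name: str, job: Job) -> None:
        """Register a job to run when the named alarm event is captured."""
        self._events[name].append(job)

    def tick(self, now: datetime) -> List[str]:
        """Run the timed jobs whose weekday and hour match `now`."""
        return [job() for wd, hr, job in self._timed
                if now.weekday() == wd and now.hour == hr]

    def emit(self, name: str) -> List[str]:
        """Run the jobs registered for an event; unknown events do nothing."""
        return [job() for job in self._events.get(name, [])]


sched = Scheduler()
sched.every_week(2, 15, lambda: "restart_service")   # Wednesday, 3:00 p.m.
sched.on_event("memory_high", lambda: "scale_up")    # alarm registered as an event
```

A production scheduler would also need to avoid running the same timed job twice within the hour and to record the outcome of each job; both are omitted here.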

Decision system: before handling an issue, we still need to gather various kinds of information. How do we make decisions based on that information? The decision system first initiates a decision request, which may involve decision trees, AI-based decisions, and so on. Based on the result, it determines which operation process to invoke, or whether to invoke one at all.
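A hand-written decision tree is the simplest form such a decision could take. The sketch below is only an assumption about what one might look like: the metric names, the 0.9 threshold, and the process names "scale_up", "migrate", and "add_replica" are hypothetical and do not come from the production system.

```python
from typing import Dict, Optional


def decide(metrics: Dict[str, float]) -> Optional[str]:
    """Return the name of the process to invoke, or None to take no action."""
    if metrics["memory_used_ratio"] >= 0.9:
        # Memory is nearly full: expand in place if the host has room,
        # otherwise move the instance to another host.
        if metrics["host_free_gb"] >= metrics["instance_gb"]:
            return "scale_up"
        return "migrate"
    if metrics["qps"] > metrics["qps_limit"]:
        # Memory is fine but traffic exceeds the limit.
        return "add_replica"
    return None
```

For example, an instance at 95% memory on a host with 32 GB free and an 8 GB footprint yields "scale_up", while a healthy instance yields None and no process is invoked.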

Operations summary

Measuring operation and maintenance maturity

Operation and maintenance maturity at Tencent Cloud is measured in stages. It starts from fairly primitive manual methods and moves to standardized tools. Next come visualization, processes, and a platform, and then an automatic scheduling platform triggered by time and events, which closes the loop on fully automated operation and maintenance. After that comes intelligence: machine learning and deep learning can help us make better decisions, such as automatic database tuning and intelligent analysis to separate hot and cold data. Finally, we can bring more value to the business through business profiling, data analysis, and cost optimization.

Summary: Technology supports business, technology drives business, technology leads business.

Comprehensive requirements for DBAs in the cloud era

In the cloud era, DBAs should hold their own capabilities to higher and more comprehensive standards. We not only need to keep the system efficient and stable and help users; we also need to build up our own expertise and influence in product direction, the details of architecture design, component source code, and following or even leading the community.

We are operations, development, and product all at once. This is also a trend in the cloud era: the integration of service and DO!

Published by Tencent Cloud+ Community with the author's authorization. Original link: https://cloud.tencent.com/developer/article/1140691?fromSource=waitui
