Author’s brief introduction

Tang Zhou

Head of operations for Tencent WeChat Pay

Tang Zhou works in the Operations Center of the Technical Architecture Department, WeChat Business Group, Tencent. He joined Tencent in 2009 and has been responsible for operating Paipai, online shopping, Yixun, top-up, group buying and other businesses. He transferred to WeChat in 2014 and is responsible for the operation and maintenance of the WeChat Pay business.

1. Background

This chart is a screenshot from the 2016 Internet Trends Report released by the "Internet Queen" Mary Meeker. The chart on the left shows how often users pay each month: WeChat Pay ranks first, at over 50 payments per month, with US debit-card payment frequency in second place.

The number of monthly payments through WeChat Pay exceeds the combined total of US debit and credit cards, which shows that the era of mobile payment has fully arrived. It also shows that China's mobile payment leads the world in products, users and payment frequency.

Many other Internet products, such as search, e-commerce and instant messaging, were born abroad and then localized and developed in China, but in mobile payment China has consistently been ahead of other countries.

The chart on the right is from Chinese New Year 2016, when the payment amount through WeChat Pay reached 8 billion yuan. The amount alone may not be intuitive, so let's look at the technical data behind it.

According to monitoring data, payments peaked at 150,000 per second, and red envelopes were opened at about 500,000 per second. This is an enormous business volume, and I personally believe it makes this the largest payment and settlement system in the world.

Let's look at the environment behind this business. Traditional payment systems mostly run directly on minicomputers plus commercial databases, whereas we take the most Internet-style approach: commodity PC servers plus MySQL. The failure rate of PC hardware is relatively high, about 2% in our experience. We also use the open-source version of the database software with a small amount of customization, without buying any special technical support; faults and optimizations are handled by our own team.

The scale is roughly as follows: more than 1,000 servers, more than 1,000 database instances, and more than 500 database-backed businesses, yet only 3 DBAs, so the pressure and challenge are considerable.

In this context, we strive for a few goals:

The first is high performance. Our daily payment volume is on the order of tens of thousands per second, which translates to millions of TPS at the database level. Achieving this performance in such an environment is not easy.

The second is high reliability. The business is essentially financial in character, so there can be no mistakes in data consistency. Quite a few scenarios can lead to inconsistency, such as red envelopes being opened twice, duplicate payments, or inaccurate payment status.

The third is high availability. Offline businesses such as shops and hospitals are extremely sensitive to payment availability. If a WeChat message or Weibo post fails to send, the client can retry or defer processing, users rarely notice, and most failures can simply be tolerated. But the moment a payment fails, or the order status seen by the user and the merchant becomes inconsistent, we receive complaints almost immediately. Fault complaints here are completely different from other businesses.

Finally, security. A payment platform holds a great deal of sensitive data, and ensuring it cannot be illegally queried or tampered with by internal or external personnel is another big challenge for us.

Summing up the challenges

A platform of PC hardware plus open-source software, handling millions of payments per minute and tens of millions of yuan in flow; rapid business growth, with very frequent change and iteration, a big difference from traditional banks; and any hardware or software failure or operational mistake can mean losses at the million-yuan level. So we DBAs are under great pressure, and we truly cannot afford a breakdown.

We still hope our DBAs never have to turn to books like these. Against this background, how do we keep DBAs from running away, and how do we improve the quality of data operations? That is our topic today!

2. DBCMDB

I personally believe the DBCMDB is the cornerstone of database operations management. Everyone should be familiar with the concept of a CMDB; I think it can be divided into three levels:

  1. The basic CMDB, which stores basic IDC configuration such as hardware, IP, and physical location information;

  2. The operations CMDB, which most systems-operations engineers have used; it stores programs, ports, tasks, services and other information, and is the key configuration for logical-layer deployment and scaling;

  3. Database management should likewise have a separate DBCMDB, which stores the business-relevant details of database instances, ports, schemas, tables and so on.

Only on the basis of this information can DB operations management be carried out.

Why build a DBCMDB?

A start-up may not have these troubles, but they are certain to appear as the business grows. There are many businesses and the configuration is very complex, like the thousands of servers mentioned earlier, with all kinds of deployments: a business may be split into multiple groups, and a machine may run multiple instances for optimization.

An instance contains many schemas, and businesses shard their tables differently: some tables are split into 1,000 sub-tables, some into 1,024, and the naming rules differ too. Such a complex configuration cannot be managed by hand. Without it, even the most basic table-structure change becomes difficult: you first have to find the exact machine, instance, schema and table.
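To make the sharding problem concrete, here is a minimal sketch of routing a key to its physical sub-table, given the shard count recorded in a DBCMDB. The zero-padded naming rule and function name are illustrative assumptions, not our actual convention:

```go
package main

import "fmt"

// shardTable returns the physical table name for a logical table,
// given the shard count recorded in the DBCMDB.
func shardTable(logical string, key uint64, shards uint64) string {
	var idx uint64
	if shards&(shards-1) == 0 { // power of two (e.g. 1024): cheap bit mask
		idx = key & (shards - 1)
	} else { // e.g. 1000: plain modulo
		idx = key % shards
	}
	return fmt.Sprintf("%s_%04d", logical, idx)
}

func main() {
	fmt.Println(shardTable("order", 123456789, 1000)) // modulo-based naming
	fmt.Println(shardTable("user", 123456789, 1024))  // mask-based naming
}
```

Without the shard count and naming rule stored centrally, every tool (and every human) would have to guess which of these rules a given table uses.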

In addition, deployments are adjusted, whether proactively or because of faults, and business assignments change, so what a given machine is used for changes dynamically. In this situation, how do we ensure that changes, monitoring and backups stay in sync after an adjustment?

How to build a DBCMDB?

First, store the basic core configuration, such as IP, port, service, responsible person, number of sub-tables, master/slave relationships, and so on. On top of this basic information, also provide friendly, complete web management tools.
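As a rough sketch, the core configuration above could be modeled like this; the record and field names here are assumptions for illustration, not our actual schema:

```go
package main

import "fmt"

// DBInstance is one DBCMDB record, mirroring the core
// configuration listed above.
type DBInstance struct {
	Business string // business/service name
	IP       string
	Port     int
	Role     string // "master" or "slave"
	Owner    string // responsible person
	Shards   int    // number of sub-tables
}

type DBCMDB struct {
	instances []DBInstance
}

// MastersOf returns the master instances serving one business;
// change, monitoring and backup tools all resolve targets this way.
func (c *DBCMDB) MastersOf(business string) []DBInstance {
	var out []DBInstance
	for _, in := range c.instances {
		if in.Business == business && in.Role == "master" {
			out = append(out, in)
		}
	}
	return out
}

func main() {
	cmdb := &DBCMDB{instances: []DBInstance{
		{Business: "pay_order", IP: "10.0.0.1", Port: 3306, Role: "master", Owner: "dba_a", Shards: 1000},
		{Business: "pay_order", IP: "10.0.0.2", Port: 3306, Role: "slave", Owner: "dba_a", Shards: 1000},
	}}
	fmt.Println(len(cmdb.MastersOf("pay_order")), "master(s)")
}
```

The point of the design is that every downstream tool queries by business name rather than hard-coding IPs, so a deployment change only needs one update here.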

Here are some simple screenshots of our system: a rough relational model on the left, the physical deployment tree of one of our businesses in the middle, and a business's schemas and tables on the right.

What exactly can the DBCMDB do?

With the DBCMDB in place, we can build a whole suite of database operations tools and systems on top of its configuration data; the capabilities I can share with you are also built on it, such as automated deployment tools, changes, and monitoring. The DB connection configuration of online business systems is also obtained from here.

DB changes

There are many kinds of changes, such as switchovers, dead machines, or capacity expansion. Mobile payment is an Internet product iterating rapidly, with dozens or hundreds of changes every day. This would be unthinkable in the traditional IT industry: a bank might change once a week or so, while we change constantly every day. How to achieve bank-like stability and availability amid such rapid iteration and constant change is a big challenge for us.

Early DB changes

In the early days our manpower was very tight, and developers connected directly to the database to make changes; I believe many start-ups are in the same situation. The main problems were as follows.

One was insufficient professionalism. There were no database development standards, or the standards could not be enforced; each developer changed the database according to his own ideas and habits, some usage was unprofessional, often applying textbook database-design theory directly to products, and management was extremely difficult.

Second, changes frequently caused failures. For example, adding an index to a table holding massive data could take down the online system; or the SQL was not rigorous, and data was often deleted or modified by mistake, a real headache.

Mid-period DB changes

Later we built a database requirements system, and all changes could only be implemented after DBA approval. With DBAs in the loop, professionalism and stability improved markedly, but after a requirement was raised the DBA still had to spend a lot of time and energy evaluating it, and the communication cost was very high. At this stage the DBA still executed changes manually, so efficiency was low; many changes were made in the early hours, and our DBAs worked very hard.

DB changes now

Now we have turned this into a DB change system. First, all changes are submitted as self-service tickets, and the ticketing process is built on the DBCMDB described above. The DBA then evaluates whether the SQL makes sense and produces an execution plan.

Let me briefly introduce some capabilities of our execution plans. Because our business volume is now very large, changes during daytime business peaks seriously affect DB performance: when DB access latency rises by tens of milliseconds, a pile of alarms and complaints follows.

So many changes are executed during the business trough. We have now basically achieved unattended database changes: nobody has to stay up in the early morning. For other changes we can precisely control the grayscale, concurrency and execution interval. With these methods we basically guarantee that changes do not hurt database performance, and both the quality and the efficiency of changes have improved.

3. DB monitoring

Early monitoring

Monitoring is the DBA's eyes; without monitoring data we have no way to know the health of the database. At the beginning we used Nagios and Zabbix. They can indeed solve many problems: the monitoring coverage is fairly comprehensive, and basically every function you can think of is there.

But problems remained. For example, with thousands of instances, configuration is constantly added and removed; with these tools it is hard to detect omissions and easy to break the configuration. If we make a change today but forget to add monitoring, we are completely blind to that DB's performance problems from then on.

In addition, we lacked the ability to follow up on events after an alarm: it was unclear whether anyone was handling the alarm, or what the progress was. We once had a fairly serious incident because of this.

A batch of our DBs was growing very fast, and an alarm fired as the disks neared full. A DBA quickly dealt with it but missed one group. He didn't notice, nobody else noticed, and the monitoring system didn't catch it either, because it does not keep re-raising an alarm continuously. Two or three days later the disks simply overflowed.

In addition, I personally feel these foreign open-source products are not smooth enough for Chinese users in terms of user experience and UI.

DB monitoring now

We have now built the entire database monitoring system ourselves, from the agent at the bottom through storage to alarming. The redesign is also based on the DBCMDB described above, and management is much better: after a deployment adjustment we only need to update the DBCMDB and the monitoring system adapts automatically, so there is basically no need to keep modifying monitoring configuration by hand.

Alarm policies are also optimized. Policy templates are configured and applied at the service level rather than per IP address. After monitoring data is collected, a series of lookups finds the alarm policies that apply to a given DB, which greatly simplifies policy configuration and management.

There is also the complete event-handling process mentioned above: for incidents like the one just described, alarms are now followed up continuously, and all alarms are graded by severity.

After an alarm fires, the event system tracks it to ensure the anomaly is handled in time. For example, for a five-star alarm we require the DBA to respond within 5 minutes and finish handling it within 30 minutes; if handling exceeds the time limit, the alarm is automatically escalated and more people are notified.

4. Data security

Overall security

Any company's data security is the life of the enterprise; the impact of data leakage or loss on any enterprise or business is immeasurable.

First, at the host level, our databases can only be logged into from dedicated jump servers, and unauthorized personnel cannot log in to the servers at all. All DBA operations are recorded and audited, and any risky operation triggers a warning.

Then there is monitoring. Attacks on MySQL generally start with login attempts: the attacker scans and guesses passwords, and we can detect this effectively at that stage. We specifically monitor illegal connections: if a login uses a wrong account or password, or falls outside the authorized scope, we alarm immediately and can then analyze whether a business really misconfigured something or someone is attacking.

For authorization, we basically grant the minimum permissions on demand, with very fine-grained control; we'll come back to that later. Key business data also needs protection, for example the account-balance field: if an insider wants to add money to himself, can a single direct UPDATE statement turn one hundred into one million?

Technically we must prevent insiders from doing such things, which requires tamper-proofing and verification capabilities. There are also security scans and development standards to detect and avoid high-risk vulnerabilities such as SQL injection.
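One common way to get the tamper-proofing described above, and a sketch only, not necessarily WeChat Pay's actual mechanism, is to store a keyed MAC next to the sensitive field, so a raw UPDATE that changes the amount alone fails verification:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Demo key only; in practice the key would come from a secrets
// service the database layer cannot read, never from source code.
var key = []byte("demo-only-secret")

// sign computes an HMAC over the account and its balance (in cents),
// to be stored alongside the balance column.
func sign(account string, balanceCents int64) string {
	m := hmac.New(sha256.New, key)
	fmt.Fprintf(m, "%s|%d", account, balanceCents)
	return hex.EncodeToString(m.Sum(nil))
}

// verify recomputes the MAC and compares in constant time; a balance
// edited by a direct UPDATE no longer matches its stored MAC.
func verify(account string, balanceCents int64, mac string) bool {
	return hmac.Equal([]byte(sign(account, balanceCents)), []byte(mac))
}

func main() {
	mac := sign("acct-1", 10000) // 100.00 yuan
	fmt.Println(verify("acct-1", 10000, mac))     // unmodified: passes
	fmt.Println(verify("acct-1", 100000000, mac)) // "one hundred to one million": fails
}
```

The insider would need both UPDATE access and the signing key to forge a balance, which is exactly the separation of powers the talk is asking for.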

Then there are backups. Database leaks mostly come from the inside: someone pulls a backup onto a hard disk and takes it along when leaving the company, and the impact of such a dump getting out is huge. How do we prevent this?

It is hard to guard against, so you have to harden it: all of our backups are now encrypted. Backup storage is usually accessible to many people, and encrypting the backups prevents others, including business staff, from walking away with usable data. Beyond that, management and practice impose norms so that people do not even attempt these things; technology is only one means.
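A minimal sketch of backup encryption with AES-256-GCM follows; the cipher choice and function names are assumptions (the talk does not say which scheme is used), and key management is out of scope:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
	"io"
)

// encrypt seals a backup with AES-256-GCM, prepending the random
// nonce to the ciphertext. Without the key, the stored blob is useless.
func encrypt(key, plain []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plain, nil), nil
}

// decrypt splits off the nonce and opens the sealed backup; GCM also
// authenticates, so a tampered backup fails to decrypt.
func decrypt(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	if len(sealed) < n {
		return nil, errors.New("ciphertext too short")
	}
	return gcm.Open(nil, sealed[:n], sealed[n:], nil)
}

func main() {
	key := make([]byte, 32) // 32 bytes selects AES-256
	io.ReadFull(rand.Reader, key)
	sealed, _ := encrypt(key, []byte("-- MySQL dump ..."))
	plain, _ := decrypt(key, sealed)
	fmt.Println(string(plain) == "-- MySQL dump ...")
}
```

Because GCM is authenticated, this also gives integrity for free: a backup modified in storage is rejected rather than silently restored.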

Authorization system

Looking at authorization: it used to be that development raised a requirement and DBAs granted permissions manually, IP by IP. The main problems were low efficiency, manual operations on every launch or scale-out, permissions prone to omissions and errors, and passwords issued during scale-out differing from the original ones.

Now we are building an authorization system to manage this. The idea is as follows: permissions are managed at the granularity of business module and business DB, and the system automatically resolves the CMDB records into IPs to grant, so deployment adjustments no longer require manual permission changes. It is also linked with the release system, so abandoned permissions are reclaimed automatically, which basically solves the problems discussed earlier.
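The re-resolution step can be sketched as a set diff: compare the grants currently on the DB with the module's current IP list from the CMDB, then grant the new IPs and revoke the stale ones. Names here are illustrative assumptions:

```go
package main

import "fmt"

// diffGrants compares the grants that exist today with the grants the
// CMDB says a business module should have, returning the IPs to grant
// and the stale IPs to revoke.
func diffGrants(current, desired map[string]bool) (grant, revoke []string) {
	for ip := range desired {
		if !current[ip] {
			grant = append(grant, ip) // new machine after scale-out
		}
	}
	for ip := range current {
		if !desired[ip] {
			revoke = append(revoke, ip) // machine no longer in the module
		}
	}
	return
}

func main() {
	current := map[string]bool{"10.0.0.1": true, "10.0.0.9": true} // grants on the DB today
	desired := map[string]bool{"10.0.0.1": true, "10.0.0.2": true} // module IPs per the CMDB
	g, r := diffGrants(current, desired)
	fmt.Println("grant:", g, "revoke:", r)
}
```

Running this diff on every deployment adjustment is what turns authorization from a one-way accumulation of grants into something that also reclaims.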

5. High availability

Database high availability

High availability is usually done at two levels: the database layer and the business layer. Let's look at the database first. This is our master/slave automatic switching tool. We run an HAProxy in front of each MySQL to do TCP forwarding, forwarding write requests to the real master, and we run an agent on each DB server to monitor whether the master is alive.

If all agents determine that the master is dead, the switch decision is made through etcd storage, and the agents are responsible for adjusting the master/slave relationship and the TCP forwarding settings. The tool is very focused and only does fast automatic switching. It is fully compatible with our existing deployment and operations methods and can be upgraded and rolled back easily. It also supports three-machine read/write and three-datacenter disaster recovery, and is stable and reliable, effectively solving single-machine hardware failures.
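The decision rule above, switch only when every agent agrees the master is dead, can be sketched as pure logic (the real coordination through etcd is omitted; this function and its name are illustrative):

```go
package main

import "fmt"

// decideSwitch returns true only when every agent reports the master
// unreachable. If even one agent still sees the master, the problem
// may be a partial network partition, so no failover is triggered.
func decideSwitch(agentSeesMasterAlive []bool) bool {
	if len(agentSeesMasterAlive) == 0 {
		return false // no evidence, no switch
	}
	for _, alive := range agentSeesMasterAlive {
		if alive {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(decideSwitch([]bool{false, false, false})) // unanimous: switch
	fmt.Println(decideSwitch([]bool{false, true, false}))  // disagreement: hold
}
```

Requiring unanimity trades a slightly slower failover for protection against split-brain, which for a payment database is the right side of the trade.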

In addition, another of our teams is working on the PhxSQL component. It is more radical than the previous solution, inserting a powerful, strongly consistent queue into SQL synchronization to solve the data-consistency problem, so its main features are strong consistency and fast switching.

Service high availability

As for handling DB failures at the business layer: guarding against DB failure in the data layer alone is not enough. We require that the business layer survive no matter what happens to the DB, because a DB has many surprises that the DBA cannot control.

So our R&D team came up with an ID-hopping solution, which only suits scenarios where IDs are generated internally and need not be continuous; this is how our order database is handled today. First the database is divided into many groups, say dozens, and every group can serve reads and writes.

If one group of DBs fails, the business layer reports the failed read/write state upward, and the upper layer takes that state into account when assigning IDs, automatically skipping the failed groups. The benefit is that the business keeps running almost immediately: new payment orders are completely unaffected, preserving the availability of payments and of reads and writes on old orders.
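The ID-hopping idea can be sketched as an allocator that simply refuses to hand out IDs landing on a failed group; the `id % groups` placement rule and all names are illustrative assumptions:

```go
package main

import "fmt"

// Allocator hands out order IDs spread across DB groups by id % groups.
// Because internally generated order IDs need not be continuous, IDs
// that would land on a failed group are simply skipped.
type Allocator struct {
	next   uint64
	groups int
	failed map[int]bool
}

func (a *Allocator) Fail(group int)    { a.failed[group] = true }
func (a *Allocator) Recover(group int) { delete(a.failed, group) }

// NextID returns the next ID whose group is healthy.
func (a *Allocator) NextID() uint64 {
	for {
		id := a.next
		a.next++
		if !a.failed[int(id%uint64(a.groups))] {
			return id
		}
	}
}

func main() {
	a := &Allocator{groups: 4, failed: map[int]bool{}}
	a.Fail(1) // group 1's DB is down
	for i := 0; i < 6; i++ {
		fmt.Print(a.NextID(), " ") // prints: 0 2 3 4 6 7
	}
	fmt.Println()
}
```

New orders flow into the healthy groups with zero coordination, and once the failed group recovers, it simply rejoins the rotation.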

6. Golang

A brief introduction to Golang, which I personally think is very good and recommend to you. Here are some of its key features:

  • First, it came out of Google, developed by a group of masters;

  • The syntax is concise, clear and efficient, performance is very strong, and it is cross-platform: the same code can basically run on multiple platforms;

  • Most of our earlier development languages are products of the single-core era, relying on the operating system's multi-process or multi-thread capabilities to use multiple cores, whereas Golang was designed and implemented with multi-core as a premise and is well optimized for high concurrency and distribution;

  • There is also automatic garbage collection, similar to Java. The largest C++ development teams in China are probably at Tencent, where developer skill levels vary and memory leaks are hard to eliminate, so you can still see programs that dodge leaks by restarting periodically;

With GC, development becomes much easier: there is little need to manage memory allocation and reclamation. It is also strongly typed and statically compiled, with advanced types built in, and the native binaries are extremely easy to deploy, which is what I like best: no dependencies at all. It could be called an Internet language, since networking, encryption, compression and database libraries are largely built in, so the language itself covers this work.
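As a tiny illustration of the multi-core point above: goroutines are cheap, and channels plus a WaitGroup coordinate them without managing OS threads by hand (the task here, summing squares, is of course just a placeholder):

```go
package main

import (
	"fmt"
	"sync"
)

// sumSquares computes 1² + 2² + ... + n², running each term in its
// own goroutine and collecting results over a channel.
func sumSquares(n int) int {
	results := make(chan int, n) // buffered so senders never block
	var wg sync.WaitGroup
	for i := 1; i <= n; i++ {
		wg.Add(1)
		go func(k int) { // one lightweight goroutine per task
			defer wg.Done()
			results <- k * k
		}(i)
	}
	wg.Wait()
	close(results)
	sum := 0
	for r := range results {
		sum += r
	}
	return sum
}

func main() {
	fmt.Println(sumSquares(100)) // prints 338350
}
```

The same pattern, fan out goroutines, collect on a channel, is what makes Go comfortable for the distributed back-end work the talk describes.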

So my personal summary of it is: C/C++-level performance and stability, plus the ease of deployment, ease of use and high-level language features of PHP, Java and the like.

As for use cases, it suits most distributed back-end businesses, anywhere Java could be used, though in my personal view its performance and ease of use are better than Java's.

Another particular scenario: suppose you are a software vendor and want to keep your code from being decompiled. That is still hard to achieve with PHP, Java and the like, since quite a few tools can decompile them back to source.

C/C++ makes that easy, but development efficiency is relatively low, the demands on developers are high, and hiring is not easy. Golang has clear advantages here: simple, high-performance, and resistant to decompilation.

Golang is also developing rapidly, and quite a lot of industry software uses it, such as Docker, etcd and NSQ, and new projects increasingly lean toward it. My team has also made many attempts: external network quality monitoring, DB monitoring and DB fast switching were all developed in Golang.

Q & A

Question: You just said Golang's cross-platform support is better than Java's. How exactly is that achieved?

Tang Zhou: The two implement cross-platform differently. Java compiles to intermediate bytecode that is interpreted and executed by the Java virtual machine, and the JVM implementation differs from platform to platform. Golang compiles directly to machine code for the target platform. Personally, I find Golang more convenient for deployment.

Question: You mentioned a problem just now: often after a business creates some account permissions, many businesses keep using them. Do you have any good experience or tools in this respect?

Tang Zhou: This is a question of how deep permission management goes. Whether it is MySQL or network policy, many companies and businesses only grant and never reclaim. Our approach is that authorization is not at IP granularity but at business granularity; when the deployment is adjusted, permissions are recalculated and reclaimed.

Question: Are the accounts cleared when a business goes offline or migrates?

Tang Zhou: After the deployment system makes an adjustment, it invokes the authorization system's interface; when an offline action is executed, the corresponding permissions are cleared. All these capabilities and tools have to be built by yourselves; there are no general-purpose tools available at present, but the idea can serve as a reference.

Question: What IDE do you use for Golang development? And how do programs built on Windows run in other environments?

Tang Zhou: The Golang language itself is relatively simple, and code organization and directory structure are clear, so you can develop with any text editor, though efficiency is lower that way. Quite a few IDEs now support Golang development and are easy to find online: Vim, Eclipse, Visual Studio; there is also a dedicated Golang IDE called LiteIDE, which is quite easy to use. I mainly use it, and it works on both Windows and Linux.

For the same code, Golang can compile binary executables for different platforms, even cross-compiling, basically with a single command (for example, `GOOS=linux GOARCH=amd64 go build`).


More senior BAT operations engineers will gather at GOPS · Shenzhen.

Tencent Games Blue Whale product director, Tencent SNG operations experts, Alibaba games operations director, Baidu agile coach... a lineup and set of talks you can't miss, taking you to the forefront of operations technology.

GOPS 2017 · Shenzhen

GOPS started in Shenzhen in 2016, when tickets sold out weeks early; a year later it returns, carrying the expectations of operations engineers.

  • Venue: Sentosa Hotel (Jade branch), Nanshan District, Shenzhen

  • Conference time: April 21-22, 2017