
Dare to say no to the scripts of the past

Preface

QQ Speed is a racing game that launched in 2008 and is still going strong ten years on. I am honored to operate and maintain such a product. The job brings both pressure and motivation, and the pressure is where the motivation comes from. Since I took over operations for QQ Speed, much of my energy has gone into scaling capacity up and down, which brings us to today’s topic: reworking QQ Speed’s capacity expansion.

Capacity expansion in review

QQ Speed has four major event periods a year: Spring Festival, May Day, summer vacation, and National Day. Each event runs at a scale of millions, but because of cost and resource constraints we cannot keep our machines at peak-event scale for long. Scaling capacity up and down has therefore become an important part of operating QQ Speed. Preparing for an event takes the operations team a fairly long time and usually breaks down into the following stages:

1) Resource application. The budget is usually submitted to the SA team one month in advance. After evaluation, they confirm the quantity and delivery dates of the virtual machines, Docker machines, and physical machines they can provide. Two weeks are generally reserved: one week for the expansion itself and one for observation.

2) Notify the supporting teams. Because of the scale of the event, we need support from neighboring teams. First, calculate the TGW requirement and email Jiaping so that they allocate enough dedicated VIPs. Second, notify the network platform team, citing the traffic data from the previous event, to make sure there is enough bandwidth during the event. Third, notify Anping to make sure every VIP is under Zeus protection. Finally, notify Jiping so that there are enough resources to keep recharging working normally. Thanks to the major-event assurance platform, this part of the work has been greatly simplified.

3) Resources in place, prepare for expansion. First, add the new machines to TCM and make sure their bin files are consistent with those on the live network. Second, apply for TGW: using the VIP group provided by Jiaping, request TGW ports for the new GAMSVR machines.

4) Expand capacity. Once the machines are ready, expansion can begin. QQ Speed now has a complete expansion template on the standard cloud, and that template is the subject of this article, so read on.

Defects of existing capacity expansion templates

After two separate expansions, flaws in the existing template were discovered.

First, multi-module expansion is not supported.

Second, no more than 30 machines can be expanded in one run.

Finally, the expansion scripts have a steep learning curve. Newcomers are slow to get started, which in turn raises the risk of each expansion.

Think about it: to expand 300 machines, we need to call the template at least 10 times. At a conservative estimate of 30 minutes per run, that adds up to roughly five hours. Never mind the rest of the operations work; expansion alone would wear people out.
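The arithmetic behind that estimate can be sketched in a few lines of Python. The 30-machine limit and the 30-minute figure come from the text; treating 30 minutes as the cost of each run, running the calls back to back, and the function names are all assumptions made for illustration.

```python
import math

MAX_HOSTS_PER_RUN = 30  # template limit stated in the article
MINUTES_PER_RUN = 30    # conservative estimate; assumed to apply to each run


def template_runs(total_hosts: int, limit: int = MAX_HOSTS_PER_RUN) -> int:
    """Number of template calls needed when each call takes at most `limit` hosts."""
    return math.ceil(total_hosts / limit)


def total_minutes(total_hosts: int) -> int:
    """Total minutes if the calls are executed one after another."""
    return template_runs(total_hosts) * MINUTES_PER_RUN


print(template_runs(300), total_minutes(300) / 60)  # prints: 10 5.0
```

The estimate covers a single module only. Because the template cannot expand several modules in one run, each additional module adds its own series of calls.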

Start with the expansion script

QQ Speed’s existing expansion scripts have been handed down from one generation of maintainers to the next, and after each round of rework and optimization they have become very stable. Today we look only at expansion and analyze it in depth. As we all know, our self-developed services are managed by TCM, so capacity expansion comes down to preparing the right files without making mistakes. Each business has its own ecosystem for scaling up and down. QQ Speed’s expansion currently maintains six initial files:

Back-end: dxother.txt, dx2other.txt, wtother.txt

Front-end: dxfront.txt, dx2front.txt, wtfront.txt

The problem, then, is maintaining these six files. The back-end files generally do not need to change, so in practice it comes down to maintaining the three front-end files. Let’s first go through the existing expansion scripts:

1) config.sh: the core script. It handles backup, input parameter conversion, calling the expansion script, generating VIP information, and updating the anti-cheat list.

2) kuorong.sh: the expansion script. It calculates instance IDs and updates the front-end files.

3) vip.sh: calls the CC interface to update VIP information.

4) get_ip_la.sh: updates the anti-cheat list.

The calling relationship between them can be represented as follows:

[Figure: call relationships among the expansion scripts (image unavailable)]

The existing template takes three input parameters: region ID, module name, and IP list. As mentioned earlier, one run can expand only a single module, and runs cannot be executed in parallel because that would corrupt the files. In addition, because the number of machine IPs per run is limited, even a single module has to be expanded over several calls.

The rework begins

Given these problems, we need a new approach that avoids this kind of repetitive labor. Both the config.sh script and the standard cloud interface support only single-module expansion. So how can multi-module expansion be implemented?

First, let’s solve the problem of batch-modifying host names and batch-moving hosts between modules in the standard cloud template. With single-module expansion there is only one module and one region name, so modifying host names and moving hosts is straightforward. But what about multi-module expansion? We do not know in advance which modules will be expanded, so we cannot solve the problem by defining more variable names. That is only a piecemeal tactic that treats the symptom, not the root cause. Besides, the number of parameters could easily get out of hand.

Since the standard cloud could not do this, we had to think differently. Based on the Blue Whale interfaces, I developed a small app that wraps several interfaces (the interfaces could of course also be developed through the heart cloud).

Interface 1: modifies host names

Interface 2: moves hosts to a module

Interface 3: refreshes the VIP/vport

All three interfaces can operate on multiple modules and multiple hosts at once, and they return very quickly, which saves a great deal of time. More atomic operations can be added over time. Because the interfaces are completely independent of the business, they are more flexible and achieve a degree of decoupling.

With these app interfaces available, I can concentrate on the initial parameter conversion. As long as the parameters are passed in the form that the app interfaces and the config.sh script require, the correct result comes back.

Next we solve the parameter handling. Parameters are no longer passed in through the standard cloud. Instead we pass in a file in which each line has the format region|module name|IP, and we write a wrapper script, auto_diliation_wrapper.py, which works as follows:

init_parms function: converts the input parameters to JSON format.

Input:

    dx|gsvrd1|ip1
    dx|gsvrd2|ip2
    dx|gsvrd1|ip3
    wt|gsvrd1|ip4
    wt|gsvrd2|ip5
    wt|chatsvrd|ip6

Output:

    {
      "dx":  {"gsvrd1": ["ip1", "ip3"], "gsvrd2": ["ip2"]},
      "wt":  {"gsvrd1": ["ip4"], "gsvrd2": ["ip5"], "chatsvrd": ["ip6"]},
      "dx2": {}
    }
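The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the author’s script: the function name `init_params`, the `known_regions` parameter (used to pre-seed empty regions such as dx2), and the error handling are assumptions.

```python
import json


def init_params(lines, known_regions=()):
    """Group 'region|module|ip' records into {region: {module: [ip, ...]}}."""
    grouped = {region: {} for region in known_regions}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines in the input file
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected 'region|module|ip', got: {raw!r}")
        region, module, ip = parts
        grouped.setdefault(region, {}).setdefault(module, []).append(ip)
    return grouped


records = [
    "dx|gsvrd1|ip1", "dx|gsvrd2|ip2", "dx|gsvrd1|ip3",
    "wt|gsvrd1|ip4", "wt|gsvrd2|ip5", "wt|chatsvrd|ip6",
]
params = init_params(records, known_regions=("dx", "dx2", "wt"))
print(json.dumps(params, indent=2))
```

Grouping by region first and module second matches the order in which the later steps consume the data: one pass per region, and within it one call per module with the full IP list.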

auto_dilatation function:

1) Move the IPs to the specified CC module (_chg_host_moudle function).

2) Change the host names, for example to spee-dx-gavrd1 (_chg_host_name function).

3) Refresh the machines’ VIP/vport (_update_vip_vport function).

4) Call config.sh dx gsvrd1 ip1 ip2 ip3 … ipn to generate the configuration.
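The sequence of steps can be sketched as a loop over the grouped parameters. This is only an outline under stated assumptions: the four callables stand in for the app interfaces and the config.sh call, and their names and signatures, as well as the `sh config.sh …` command line, are hypothetical. The real interfaces are not shown in the article.

```python
from typing import Callable, Dict, List

Params = Dict[str, Dict[str, List[str]]]


def auto_dilatation(
    params: Params,
    chg_host_module: Callable[[str, str, List[str]], None],
    chg_host_name: Callable[[str, str, List[str]], None],
    update_vip_vport: Callable[[List[str]], None],
    run_config: Callable[[List[str]], None],
) -> None:
    """Run the four expansion steps for every (region, module) group."""
    for region, modules in params.items():
        for module, ips in modules.items():
            if not ips:
                continue  # nothing to expand for this module
            chg_host_module(region, module, ips)  # 1) move hosts to the CC module
            chg_host_name(region, module, ips)    # 2) rename the hosts
            update_vip_vport(ips)                 # 3) refresh VIP/vport
            # 4) assumed command line for generating the configuration
            run_config(["sh", "config.sh", region, module, *ips])


# Dry run with recording stubs instead of real interfaces.
calls = []
auto_dilatation(
    {"dx": {"gsvrd1": ["ip1", "ip3"]}, "dx2": {}},
    chg_host_module=lambda r, m, ips: calls.append(("module", r, m, tuple(ips))),
    chg_host_name=lambda r, m, ips: calls.append(("name", r, m, tuple(ips))),
    update_vip_vport=lambda ips: calls.append(("vip", tuple(ips))),
    run_config=lambda argv: calls.append(("config", tuple(argv))),
)
for call in calls:
    print(call)
```

Passing the interfaces in as callables keeps the loop independent of any particular platform and makes a dry run like the one above possible before touching live machines.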

At this point, the rework of the core multi-module expansion function is complete. Let’s review the solutions:

1. Simplify the input and stop calling the template repeatedly. To do this, we define a wrapper script that converts the input and makes the script calls.

2. Bypass the limitations of the standard cloud and build custom interfaces that meet the business requirements.

After the revamp, we were less reliant on the standard cloud and more flexible, and now our expansion looks like this:

[Figure: the expansion flow after the rework (image unavailable)]

Incremental capacity expansion and reduction

Everything above is a wrapper around the existing base scripts. QQ Speed’s configuration is currently regenerated in full each time, so two versions of a file cannot be meaningfully compared, and operations staff have to check the result carefully after every expansion. The incremental update script integrates all operations for capacity expansion, reduction, and change, so operations staff only need to maintain the gen.sh script. As shown in the figure (drawn by Nosea):

[Figure: the incremental update script (image unavailable)]

The idea behind the incremental update script is as follows. The base script is responsible for backup and rollback, change checking, exception handling, and the change operations themselves (Nosea on our team has written a stable base script of this kind). The business side only needs to convert its input and make a few configuration changes to produce its own script as the base script requires. To sum up, the new version of the script has the following features:

  • Incremental configuration, so that changes can be diffed;
  • Complete notification mechanism, so that operations staff can see the configuration differences after any change;
  • Complete backup and rollback mechanism: back up before operating, and roll back immediately on error;
  • Atomic operation: the script cannot be split up, and it returns immediately on any error;
  • Complete logging: all key information is recorded, so errors can be located through the log;
  • Strict checking: every change must be verified afterwards to make sure it is correct.

A working example of the core script is as follows:

[Figure: a working example of the core script (image unavailable)]

After the transformation, our capacity expansion, reduction and change can be realized through this set of scripts, and the capacity expansion and reduction template can become simpler.
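The back-up, change, check, and roll-back cycle described in this section can be sketched in Python as follows. This is not the team’s base script, which the article does not show. The function `apply_change`, the `.bak` suffix, and the sample file contents are assumptions made for illustration.

```python
import difflib
import shutil
import tempfile
from pathlib import Path


def apply_change(config: Path, new_lines, check) -> str:
    """Back up `config`, write `new_lines`, verify with `check`.

    Returns a unified diff of the change. If writing or checking fails,
    the backup is restored and the exception is re-raised.
    """
    new_lines = list(new_lines)
    backup = config.with_name(config.name + ".bak")
    shutil.copy2(config, backup)  # back up before operating
    old_lines = config.read_text().splitlines()
    try:
        config.write_text("\n".join(new_lines) + "\n")
        if not check(config):  # strict check after the change
            raise ValueError("post-change check failed")
    except Exception:
        shutil.copy2(backup, config)  # roll back immediately on error
        raise
    diff = difflib.unified_diff(
        old_lines, new_lines, fromfile="before", tofile="after", lineterm=""
    )
    return "\n".join(diff)


with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "front.txt"      # hypothetical file name and content
    cfg.write_text("gsvrd1 ip1\n")
    print(apply_change(cfg, ["gsvrd1 ip1", "gsvrd1 ip3"],
                       check=lambda path: "ip3" in path.read_text()))
```

Returning the diff to the caller is what makes the notification step possible: the same text can be logged and sent to operations staff after every change.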

Building an ecosystem for scaling up and down

With the base script in place, we will rely on standard operation and maintenance to rebuild the earlier “ecosystem” of tools for scaling up and down. Thanks to the maturity of D+, the image-based expansion template is already complete. Image-based expansion skips transferring the business package and some of the initialization work, which saves most of the time.

Follow-up work

The reworked scripts were used for this year’s National Day expansion, but some small problems remain. The follow-up directions are:

1) Continuous testing of the tools. Keep verifying their correctness through repeated expansion and reduction to make sure they are reliable.

2) Effect evaluation. In particular, measure how much time D+ image-based expansion saves compared with traditional expansion.

3) Documentation and version management.

Thank you

At this point our rework is basically complete. The many problems we ran into along the way were solved one by one. Thanks to Nosea for the help throughout. I hope I can do an even better job of operating QQ Speed.


Published by the Tencent Cloud community with the author’s authorization. Please cite the source when reproducing this article. Original link: https://cloud.tencent.com/community/article/402129