The author

Zhang Yu joined Tencent in 2015 and works on operations and maintenance for Tencent Advertising. In 2020 he began leading the Tencent Advertising technology teams' migration onto the company's TKEx-TEG platform and, starting from the business's daily pain points combined with Tencent Cloud's native capabilities, refined Tencent Advertising's own containerization solution.

Project background

Tencent Advertising carries the advertising traffic of all of Tencent and also receives requests from external alliance traffic. With traffic growing across every scenario, how to allocate resources quickly, and even schedule them automatically, after a traffic surge has become a problem the advertising team must solve. In particular, this year's stripe-based disaster-recovery rework of the overall advertising architecture (delivery and serving) depends even more heavily on allocating resources on demand and by region. Within advertising, the serving system carries the entire ad-serving function, and its stability directly determines the revenue of all of Tencent Advertising. The architecture diagram is as follows:

Business features:

  • Large request volume: nearly 100 billion requests per day on average, and the serving machines account for more than 60% of all machines owned by AMS. Even a small fluctuation in overall performance translates into changes on a large number of machines.
  • Complex link topology and high performance pressure: the full serving link involves 40+ modules, and within a window of only 100 to 200 milliseconds (requirements vary by traffic type) a request must pass through all of them and compute the best ad.
  • Compute-intensive: core binding and filtering are used extensively to handle the pressure of retrieving across millions of ad orders.

Choosing the on-cloud solution

In 2020, Tencent Advertising moved to the cloud at scale, mainly onto AMD-based SA2 CVM instances, and completed compatibility work and debugging for the network, the company's public components, advertising's own components, and so on. On this basis, TKE nodes built on CVM were also brought into use, with tuning, Docker-based transformation, elastic scaling, and extensive adoption of various PaaS services to take full advantage of the cloud's more advanced capabilities. Advertising uses the following TKE architecture:

  • Early resource preparation (upper left): apply for CVM, CLB, and other resources from Tencent's internal cloud console, and at the same time apply for the subnet segments needed by the master, nodes, and pods (subnets are zone-specific, e.g. Shenzhen Guangming, so take care that nodes and pods in a zone are allocated from consistent segments). The CVMs and CLBs are both imported into TKEx-TEG, and when FIP mode is selected, the resulting pods each get their own EIP from the assigned subnet.
  • Image repository and image usage (upper right): the advertising operations side provides the base image (mirrors.XXXXX.com/XXXXX/XXXXXXX-base:latest); business images are built FROM this base image, with the code pulled from Git and the image built through a Blue Shield pipeline.
  • Container usage (lower half): business images are pulled and started as containers through the TKEx-TEG platform, and then exposed externally through CLB, Polaris, and other services.

Containerization process (difficulties and solutions)

Difficulty 1: Generality
  (1) Facing 84 advertising technical teams, how to adapt to every business
  (2) Image management: keeping the base environment transparent to business teams
  (3) Tencent Advertising container configuration specifications

Difficulty 2: CPU-intensive retrieval
  (1) Ad order volume: millions
  (2) Core binding: CPU core-binding isolation between applications
  (3) Hyperthreading: turning hyperthreading off

Difficulty 3: High availability for stateful service upgrades
  (1) Keeping ad order resources continuously available during container upgrades
  (2) Continuous high availability during iteration and destroy-and-rebuild

Generality

1. Introduction to the advertising base image

The advertising operations side provides a set of base images covering most application scenarios. The foundation is XXXXXXX-base:latest, which integrates the environment configurations, company-wide agents, and business agents that advertising services originally ran on physical machines. On top of this base image, several business-environment images are provided. The image list is as follows:

mirrors.XXXXX.com/XXXXX/XXXXXXX-base:latest

mirrors.XXXXX.com/XXXXX/XXXXXXX-nodejs:latest

mirrors.XXXXX.com/XXXXX/XXXXXXX-konajdk:latest

mirrors.XXXXX.com/XXXXX/XXXXXXX-python:3

mirrors.XXXXX.com/XXXXX/XXXXXXX-python:2

mirrors.XXXXX.com/XXXXX/XXXXXXX-tnginx:latest

The images are used as follows:

In the advertising base image, systemd is not used because of permission restrictions, so the startup script runs as PID 1. The base image has built-in startup scripts for the company-wide agents and the advertising-specific agents, and each business can choose whether to invoke them from its own startup script when its image starts, roughly as sketched below.
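As a rough illustration only (the script path /data/scripts/start.sh and the container name are hypothetical placeholders, not the actual advertising configuration), the container spec points its entrypoint at the business startup script, which runs as PID 1 and decides whether to call the built-in agent scripts:

      containers:
      - name: ad-service
        image: mirrors.XXXXX.com/XXXXX/XXXXXXX-base:latest
        # The business startup script runs as PID 1 instead of systemd;
        # inside it, the built-in agent startup scripts can be invoked or skipped.
        command: ["/bin/bash", "/data/scripts/start.sh"]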

2. Containerized CI/CD

The CD part used to rely heavily on other internal platforms, which are no longer usable after moving to TKE; meanwhile, the continuous-integration capability on TKEx-TEG is still weak for building fully automated pipelines and requires manual involvement. The CI/CD solution adopted for advertising is therefore Tencent's internal continuous integration and continuous deployment platform: Blue Shield.

The entire release process is pipelined; apart from review and approval, no human involvement is needed, which reduces the impact of human error.

Stage 1: supports manual triggering, automatic Git triggering, timed triggering, and remote triggering.

  • Manual trigger: self-explanatory; the pipeline is started with a manual click.
  • Automatic trigger: when a merge happens in Git, the pipeline is triggered automatically, which suits agile business iteration.
  • Timed trigger: the whole pipeline is triggered at a fixed time each day; this suits large modules co-developed by an OTeam, which iterate once per agreed period, with every participant confirming that iteration's changes.
  • Remote trigger: relies on other external platforms, for example the advertising release-review mechanism on its own platform (Leflow); once the release review completes, the whole pipeline can be triggered remotely.

Stages 2 & 3: continuous integration, pulling the code from Git and performing custom compilation.

Blue Shield provides default CI images for compilation. Teams that do not need binary compilation can use the defaults (e.g. PHP, Java, NodeJS). However, Blade is widely used by Tencent Advertising backend services, so mirrors.XXXXXX.com/XXXXXX/tlinux2.2-XXXXXX-landun-ci:latest is usually used as the build image; it is provided by the advertising effectiveness team and integrates the environments and configuration that Tencent Advertising needs during continuous integration. After compilation, a plugin builds the image from the Dockerfile in the Git repository, pushes it to the image repository, and keeps a copy in the cloud.

Stage 4: release to an online grayscale set and observe how the data performs under grayscale traffic. The cluster name, namespace, and workload name identify the workload to iterate, the image tag is updated, and the operation is authenticated inside TKEx-TEG with a token.

Stage 5: after confirming that stage 4 looks good, start the full rollout; each rollout is reviewed and confirmed.

Stages 6 & 7: statistics collection.

In addition, Blue Shield has a group-robot notification feature: the process information to announce can be customized and pushed to an enterprise WeChat group, so that everyone can confirm and review it.

3. Tencent Advertising Container Configuration Specifications

The parent machines used inside advertising are all Tencent Cloud Star Lake (Xingxinghai) AMD (SA2) instances with 90 hyperthreaded CPU cores, 192 GB of memory, and 3 TB of high-performance cloud disk. This is currently the largest configuration Tencent Cloud offers for this model (SA3 is already being tested, and its top model will be even larger).

  • When businesses request pods, a single pod should not be too large (for example, more than 32 cores). TKE's default affinity tries to schedule containers onto the most idle nodes, so once a cluster is well used (say 2/3 full), fragmentation makes it almost impossible to add pods above 32 cores. For services that can scale horizontally, it is therefore recommended to split existing high-core workloads into more low-core pods (halve the cores per pod, double the number of pods).
  • When creating a workload, mount an emptyDir temporary volume for the log directory, so that logs are not lost during upgrades and later troubleshooting is easier. (Destroy-and-rebuild still deletes all files in that directory.)

If the workload is already online, the mount can be added by modifying the YAML:

        volumeMounts:
        - mountPath: /data/log/adid_service
          name: adid-log
      volumes:
      - emptyDir: {}
        name: adid-log
  • Striping is used heavily in Tencent Advertising: instead of limiting a service to coarse regions such as Shanghai, Shenzhen, or Tianjin, deployments are distinguished at a finer granularity, for example per machine room (Shanghai-Nanhui). This gives disaster recovery (most network failures are scoped to a single machine room, so traffic can quickly be cut over to another set) and also reduces latency: the two Shanghai machine rooms are far enough apart to add about 3 ms, and with large packets this cross-room cost is amplified, producing gaps of around 10 ms inside advertising.

Therefore, when striping on TKE, advertising selects the machine room via labels. Tencent Cloud labels the CVMs of each machine room (availability zone) by default, so these labels can be used directly.

Existing workloads can also enforce this scheduling by modifying the YAML:

      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: failure-domain.beta.kubernetes.io/zone
                  operator: In
                  values:
                  - "370004"
  • For advertising backend services, 4-16 core containers are recommended (most frontend platforms use 1 core), so that emergency scale-out is still possible when cluster utilization is high. If you also want affinity to keep pods apart, the following anti-affinity configuration can be used (the values entry is the workload name):
        affinity:
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchExpressions:
                  - key: k8s-app
                    operator: In
                    values:
                    - proxy-tj
                topologyKey: kubernetes.io/hostname
              weight: 100

4. HPA settings

There are two ways to cope with surges in business traffic after containerization.

  • Set the container's request and limit. The request can be understood as the resources guaranteed 100% to the business, while the portion up to the limit is oversubscribed, i.e. shared resources from the buffer pool. The advantage is that each service configures the request it normally needs, and when traffic surges, the headroom up to the limit absorbs the load beyond the request.

Note: oversubscription is not a cure-all; two problems are obvious:

  • 1) If the remaining resources on the current node fall short of the limit, pods may be evicted and rescheduled onto other nodes. 2) If core binding is needed, Guaranteed QoS is required, which forces request and limit to be set to the same values.

  • Set up automatic scaling (HPA). Thresholds can be set according to each service's own performance bottleneck, so that scale-out happens automatically.

Most services are bottlenecked by CPU, so the common approach is to scale on CPU utilization relative to the request, as sketched below.
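A minimal sketch of what this could look like (the workload name sunfish-proxy, the resource sizes, and the 70% threshold are illustrative assumptions, not the actual advertising configuration). The first fragment shows request/limit on a container; the second is an HPA that scales the workload on CPU utilization measured against the request:

      # Container resources: the request is guaranteed, the gap up to the limit is oversubscribed.
      # (If core binding is required, request and limit must instead be set equal for Guaranteed QoS.)
      resources:
        requests:
          cpu: "8"
          memory: 16Gi
        limits:
          cpu: "16"
          memory: 32Gi

      # HPA scaling the workload on CPU utilization relative to the request.
      apiVersion: autoscaling/v2beta2
      kind: HorizontalPodAutoscaler
      metadata:
        name: sunfish-proxy-hpa
      spec:
        scaleTargetRef:
          apiVersion: apps/v1
          kind: Deployment
          name: sunfish-proxy
        minReplicas: 4
        maxReplicas: 20
        metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70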

Retrieval across millions of ad orders

1. Advertising core retrieval module

Advertising uses the concept of sets (site groups) per traffic source: every traffic source gets its own set so that the impact and latency requirements of different traffic stay isolated. In 2020 we split each module out into sets and moved them onto CVM in the cloud; on that basis, in 2021 we containerized the core module, Sunfish. This module performs highly CPU-intensive retrieval, so it cannot use hyperthreading (hyperthread scheduling increases latency), and its processes are core-bound internally (to reduce CPU scheduling across processes).

2. Core binding in containers

This is one of the most distinctive requirements of advertising, and the biggest difference between TKE and CVM/physical machines.

On CVM or physical machines, the virtualization layer exposes correct CPU core information in /proc/cpuinfo, so the original core-binding logic read the core numbers and details from /proc/cpuinfo and bound each process accordingly.

In a container this breaks. The reason is that /proc/cpuinfo numbers the cores according to the container's own core count, and this ordering does not match the actual CPU sequence on the container's parent machine. The real sequence must be read from /sys/fs/cgroup/cpuset/cpuset.cpus, as the two examples below show:

CPU numbering shown by /proc/cpuinfo:

Real CPU numbering shown by /sys/fs/cgroup/cpuset/cpuset.cpus:

As the two figures above show, /proc/cpuinfo merely numbers the cores allocated to the container from zero; it does not correspond to the real core sequence of the parent machine. So if you bind to core 15 during core binding, you are actually binding to core 15 of the parent machine, which is not allocated to this container at all.

So the real core numbers on the parent machine have to be derived from /sys/fs/cgroup/cpuset/cpuset.cpus before binding, as in the second figure above. The following commands can be added to the startup script to convert the real core numbers into a comma-separated format that is convenient for binding:

      # Read the real core numbers allocated to this container from the cgroup.
      cpuset_cpus=$(cat /sys/fs/cgroup/cpuset/cpuset.cpus)
      cpu_info=$(echo ${cpuset_cpus} | tr "," "\n")
      # Expand ranges such as "8-15" into "8,9,10,11,12,13,14,15".
      for cpu_core in ${cpu_info}; do
          echo ${cpu_core} | grep "-" > /dev/null 2>&1
          if [ $? -eq 0 ]; then
              first_cpu=$(echo ${cpu_core} | awk -F"-" '{print $1}')
              last_cpu=$(echo ${cpu_core} | awk -F"-" '{print $2}')
              cpu_modify=$(seq -s "," ${first_cpu} ${last_cpu})
              cpuset_cpus=$(echo ${cpuset_cpus} | sed "s/${first_cpu}-${last_cpu}/${cpu_modify}/g")
          fi
      done
      # Export the expanded list so business processes can pick cores from it when binding.
      echo "export cpuset_cpus=${cpuset_cpus}" >> /etc/profile

After running source /etc/profile, the environment variable is available in the following format:



Note: core binding depends on Guaranteed QoS (i.e. request and limit must be set to the same values).

3. Turn off hyperthreading

Hyperthreading is enabled in most scenarios, but it needs to be turned off for compute-intensive ones. The solution here is to choose to disable hyperthreading when applying for the CVM.

Then taint and label the hyperthreading-disabled parent machines so that ordinary workloads are not scheduled onto them. When these resources are needed, add the toleration and node label in the YAML and the workload lands on the hyperthreading-disabled machines, as sketched below.
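A minimal sketch of what that pod spec could contain (the label/taint key ht-disabled and its value are hypothetical placeholders; the keys actually used internally may differ):

      spec:
        # Only land on nodes explicitly labeled as hyperthreading-disabled.
        nodeSelector:
          ht-disabled: "true"
        # Tolerate the taint that keeps ordinary workloads off these nodes.
        tolerations:
        - key: "ht-disabled"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"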

The corresponding configuration when applying for resources on Yunti:

High availability in stateful service upgrades

Upgrading a stateless container is the simplest case: once the business port is available, the container is available.

Starting a stateful service is more complicated, because its state must be prepared in the startup script. For advertising, this mainly involves pushing and loading the ad order resources.

1. Advertising resources are continuously available during the container upgrade process

The main difference between a container upgrade and one on a physical machine is that the original container is destroyed and a new container is pulled from the new image to provide service; the original container's disk, processes, and resources are all destroyed.

However, the ad order resources here number in the millions. If the files had to be pulled again on every upgrade, startup would become very slow, so we added a temporary mount directory to the container.


This mount lets the container keep the files in that directory across an upgrade without pulling them again. Note that emptyDir only survives the upgrade scenario; after a destroy-and-rebuild the files still have to be pulled again. Existing services can enable this by modifying the YAML directly:

        volumeMounts:
        - mountPath: /data/example/
          name: example-bf
      volumes:
      - emptyDir: {}
        name: example-bf

2. High availability of services during the upgrade process

During business iteration, two issues can actually cause a loss of service:

  • If the workload is associated with a load balancer, the container enters the RUNNING state as soon as the first line of the startup script executes and is then added to the load balancer. But the business process is not ready at that point, especially for stateful services that must finish restoring their state before they can serve. Adding a pod that cannot yet serve therefore causes business errors.
  • During an upgrade (other than the Deployment rolling-update mode), the original container is destroyed first and the new container is then pulled up. The problem is that at upgrade time the pod is removed from the load balancer and immediately destroyed. If the upstream calls via L5, it cannot learn quickly enough that the pod has been culled and keeps sending requests downstream to a container that has already been destroyed, so the whole business path returns errors.

So our main lines of thinking here are:

  • How to bind the business's readiness to the container's state.
  • During upgrade or destroy-and-rebuild, can we run a post-script to do some logic before destruction? The simplest example is sleeping for a while.

Here we introduce two concepts of business upgrading:

  • Readiness probe
  • Post-script (preStop hook)

    1) Readiness probe: select port readiness detection when creating the workload, so that the pod only joins the associated load balancer once the business port is actually up.

    Existing workloads can likewise modify the YAML:

      readinessProbe:
        failureThreshold: 1
        periodSeconds: 3
        successThreshold: 1
        tcpSocket:
          port: 8080
        timeoutSeconds: 2

A status such as "unhealthy" here simply means the pod is waiting for the business port to become available after the container has started.

2) Post-script

The core function of the post-script is to run a series of business-defined actions in the window between removal from the associated load balancer and destruction of the container.

The order of execution is: submit destroy-and-rebuild/upgrade/scale-in → remove from Polaris/L5/CLB/Service → execute the post-script → destroy the container.

The simplest use is: when the upstream calls via L5, sleeping 60 s after removal from L5 gives the upstream time to notice that the pod has been culled before it is destroyed. Two examples follow: a plain sleep, and a custom stop script.

        lifecycle:
          preStop:
            exec:
              command:
              - sleep
              - "60"

        lifecycle:
          preStop:
            exec:
              command:
              - /data/scripts/stop.sh

From long-term experience, the main issue lies with L5. For services with heavy traffic, a 60 s sleep is enough; for services with a small request volume where no errors at all are acceptable, 90 s is needed.

Results

Comparison of CVM and TKE resource usage and latency

Here we compare CPU usage and latency between a CVM and a TKE container of the same configuration under normal conditions; as the figures show, there is little difference, and latency is unchanged.

CVM:





TKE:



Overall benefits

Conclusion

Unlike other talks that introduce cloud native from the infrastructure level, this article looks at the advantages and usage of cloud native for large online services from the business perspective, combining Tencent Advertising's own characteristics and strategies to land containerization in high-concurrency, automated scenarios. For a business team, all work ultimately comes down to quality, efficiency, and cost, and cloud native improves all three. We hope to share more from Tencent Advertising in the future.


Container Service (Tencent Kubernetes Engine, TKE) is Tencent Cloud's one-stop, Kubernetes-based cloud-native PaaS platform. It provides enterprise-grade services integrating container cluster scheduling, Helm application orchestration, Docker image management, Istio service governance, automated DevOps, and a complete monitoring and operations system.

