The author

Zhang Yu joined Tencent in 2015 and has been engaged in operation and maintenance for Tencent Advertising. In 2020 he began leading the Tencent Advertising technology team onto the company's TKEX-TEG platform, improving Tencent Advertising's own containerization solution by combining the daily pain points of the business with Tencent Cloud's cloud native capabilities.

Project background

Tencent Advertising carries all of Tencent's advertising traffic and also accepts requests from external alliances. With traffic continuously growing, how to quickly allocate resources, or even schedule them automatically, after a traffic surge has become a problem the advertising team must consider. In particular, this year's set-based (striping) disaster recovery optimization of the overall advertising architecture relies even more heavily on capabilities such as on-demand and region-based resource allocation. Within advertising, the ad-serving system carries the entire ad delivery function, and its stability directly determines the revenue of all Tencent advertising. The following is the architecture diagram:

Business Characteristics:

  • The number of requests is large, with daily requests approaching the 100-billion level, and the machines involved account for more than 60% of AMS's own machines. Even a small fluctuation in overall performance involves a large number of machine changes.

  • The link topology is complex and the performance pressure is great. The whole serving link involves 40+ sub-modules, all of which must be traversed to compute the best ad within a very short window of 100 to 200 milliseconds (depending on the traffic's requirements).

  • Computation-intensive, with extensive use of CPU core binding and hyperthreading disabling, to handle the pressure of retrieving among millions of ad orders.

Selection of the cloud migration solution

In 2020, Tencent Advertising moved to the cloud on a large scale, mainly using AMD-based SA2 CVM cloud hosts, and completed compatibility and debugging of the network, the company's public components, advertising components and other components. On this basis, cloud native on CVM nodes also began to be tuned and used in the business: elastic scaling, Docker-based transformation, and extensive use of various PaaS services, giving full play to the advanced capabilities of the cloud. Here is the TKE architecture for advertising:

  • Early resource preparation (upper left): apply for CVM, CLB and other resources from Tencent's internal cloud portal, and at the same time apply for subnets for the master, nodes, and pods. (Subnets are region-specific, for example Shenzhen Guangming, so the subnet segments allocated to nodes and pods within a region need to be kept consistent.) Both CVM and CLB are imported into TKEX-TEG, and when the FIP mode is selected, the resulting pods obtain their EIPs from the allocated subnet.

  • Image repository and image usage (upper right): the advertising operations side provides the base image (mirrors.XXXXX.com/XXXXX/XXXXXXX-base:latest); the business builds FROM this base image, and after the Git pull the business image is built through Blue Shield.

  • Container usage mode (lower part): the TKEX-TEG platform pulls the service image to start containers, and then services such as CLB and Polaris are used to expose them externally.

Containerization process (difficulties and solutions)

Difficulty 1: Generality
  (1) How to adapt to all businesses across 84 advertising technical teams
  (2) Image management: keeping the basic environment transparent to the business teams
  (3) Tencent Advertising container configuration specifications

Difficulty 2: CPU-intensive retrieval
  (1) Number of ad orders: millions
  (2) Core binding: CPU binding and core isolation between applications
  (3) Core disabling: disabling hyperthreading

Difficulty 3: High availability in stateful service upgrades
  (1) Continuous availability of advertising resources during container upgrades
  (2) Continuous high availability during iteration, destruction and reconstruction

Generality

1. Introduction of the advertising base image

The ad O&M side provides a base image that covers most application scenarios. The image is based on XXXXXXX-base:latest and integrates the various environment configurations, basic agents, and business agents from the original physical machines. On top of this base image, multiple service environment images are provided. The image list is as follows:

mirrors.XXXXX.com/XXXXX/XXXXXXX-base:latest
mirrors.XXXXX.com/XXXXX/XXXXXXX-nodejs:latest
mirrors.XXXXX.com/XXXXX/XXXXXXX-konajdk:latest
mirrors.XXXXX.com/XXXXX/XXXXXXX-python:3
mirrors.XXXXX.com/XXXXX/XXXXXXX-python:2
mirrors.XXXXX.com/XXXXX/XXXXXXX-tnginx:latest

The image usage is as follows:

In the advertising base image, systemd is not used because of permission restrictions, so a startup script serves as PID 1. A startup script for the common Tencent general agents and the advertising-specific agents is built into the base image; during the startup of a service image, each business can choose whether or not to invoke it in its own startup script.
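As an illustration, a minimal sketch of such a startup script might look like the following; the agent script path and business binary are hypothetical placeholders, not the actual paths in the advertising base image.

        #!/bin/bash
        # start.sh -- runs as PID 1 of the container (illustrative sketch only;
        # the agent path and business binary below are hypothetical placeholders).

        # Optionally start the built-in general / advertising agents; each business
        # decides in its own startup script whether to invoke this.
        if [ -x /usr/local/agents/start_agents.sh ]; then
            /usr/local/agents/start_agents.sh &
        fi

        # Start the business process; the script stays in the foreground as PID 1,
        # so the container lives exactly as long as the business process does.
        /data/app/bin/ad_service --config /data/app/conf/service.conf &
        wait $!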

2. Container CI/CD

The CD solutions that were heavily relied on on other platforms cannot be reused with TKE, and the continuous integration part of TKEX-TEG is weak in automated pipelines and requires manual involvement. Therefore, the CI/CD scheme adopted for advertising is Tencent's internal continuous integration and continuous deployment platform: Blue Shield.

The entire pipeline-based release process is implemented here; apart from the reviews, no manual intervention is required, which reduces the impact of human factors.

Stage 1: manual trigger, automatic Git trigger, scheduled trigger, and remote trigger are all used.

  • Manual trigger: easy to understand; the pipeline is started by a manual click.

  • Automatic trigger: when a merge happens in Git, the pipeline is triggered automatically, which is suitable for agile business iterations.

  • Scheduled trigger: the whole pipeline is started at a fixed time every day, which is suitable for large modules co-developed by an Oteam; the iteration runs on a fixed schedule, and all participants confirm the changes included in that iteration.

  • Remote trigger: depends on other external platforms, for example the advertising review mechanism on its own platform (Leflow); after the release review is completed, the whole pipeline can be triggered remotely.

Stage 2 & Stage 3: continuous integration, with custom compilation after pulling from Git.

Blue Shield provides default CI images for compilation; services that do not need binary compilation can choose the default (such as PHP, Java, NodeJS, etc.). Since Tencent Advertising uses Blade extensively in its backend services, mirrors.XXXXXX.com/XXXXXX/tlinux2.2-XXXXXX-landun-ci:latest is often used as the build image. This image is provided by the Tencent Advertising effectiveness team and integrates the various environments and configurations Tencent Advertising needs during continuous integration. After compilation, the image is built by the image plug-in based on the Dockerfile in the Git repository and pushed to the repository, while a copy is also kept on the Zhiyun ("weaving cloud") platform.

Stage 4: gray-set release, used to observe the data performance under gray traffic. The workload is iterated by updating its image tag, addressed by cluster name, namespace, and workload name, and authenticated with an internal TKEX-TEG token.
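The actual iteration goes through TKEX-TEG's internal API, but conceptually it is equivalent to updating the image tag of the gray workload. A hedged kubectl sketch is shown below; the cluster context, namespace, workload, container name and tag are placeholders, not the pipeline's real calls.

        # Conceptual equivalent of the gray-release step; the real pipeline calls the
        # TKEX-TEG API with an internal token instead of kubectl.
        kubectl --context gray-cluster -n ad-serving \
            set image deployment/ad-service ad-service=mirrors.XXXXX.com/XXXXX/ad-service:v1.2.3-gray

        # Watch the gray set roll out before deciding on the full release (Stage 5).
        kubectl --context gray-cluster -n ad-serving rollout status deployment/ad-service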

Stage 5: after confirming that Stage 4 shows no problems, start the full rollout, which is reviewed and confirmed each time.

Stage 6 & Stage 7: statistics.

In addition, Blue Shield provides a robot group-notification function, which can push customized process information to an enterprise WeChat group for everyone to confirm and review.

3. Tencent advertising container configuration specifications

The host machines used inside advertising are Tencent Cloud Star Sea AMD (SA2) machines with 90 hyperthreaded CPU cores and 192 GB of memory, using 3 TB high-speed cloud disks. This is currently the largest model Tencent Cloud can provide for daily use (SA3 has started testing, and its top configuration will be larger).

  • Therefore, when using it, businesses are advised not to make a single pod's core count too large (for example, more than 32 cores). Because TKE's default affinity settings try to schedule containers onto the most idle nodes, fragmentation appears in the middle and later stages of cluster usage (for example, when the cluster is already 2/3 used), and pods with more than 32 cores can then hardly be scaled out. Businesses that can scale horizontally are therefore advised to split the original high-core service into more low-core pods (halve the cores per pod and double the total number of pods).

  • When creating a workload, mount an emptyDir temporary directory on the log directory, so that the data in the directory is not lost during upgrades and subsequent troubleshooting is easier. (Destroying and rebuilding will still delete all files in that directory.)

If the workload is already online, you can also modify the YAML to add the directory mount:

        volumeMounts:
        - mountPath: /data/log/adid_service
          name: adid-log
      volumes:
      - emptyDir: {}
        name: adid-log
  • Within Tencent Advertising, striping is used heavily, meaning the service is not only split across cities such as Shanghai, Shenzhen and Tianjin; the striping can be finer-grained, for example by equipment room such as Shanghai-Nanhui, to achieve disaster recovery (DR) at the equipment-room level. Most network faults happen at the equipment-room level, so traffic can be quickly switched to another route. Also, because of the distance between the two Shanghai equipment rooms, the extra latency of cross-equipment-room deployment is magnified when transferring large packets, and a gap of about 10 ms shows up in advertising.

Therefore, when using TKE striping for advertising, we specify the equipment room via labels. Tencent Cloud labels the CVMs in each equipment room by default, and these labels can be used directly.

You can also modify the workload YAML to enforce scheduling:

      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: failure-domain.beta.kubernetes.io/zone
                  operator: In
                  values:
                  - "370004"
  • It is recommended to use 4 to 16 cores for internal backend services, and 1 core for most front-end platforms, so that emergency scale-out is still possible when cluster usage is high. In addition, if you want to use anti-affinity to keep the pods of a workload apart from each other, you can use the following configuration (values is the specific workload name):
        affinity:
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchExpressions:
                  - key: k8s-app
                    operator: In
                    values:
                    - proxy-tj
                topologyKey: kubernetes.io/hostname
              weight: 100

4. HPA settings

When using containerization, there are two ways to deal with surges in business traffic.

  • Set the container's request and limit. The request can be understood as the resources that are 100% guaranteed to the business, while the limit draws on oversold resources shared from the buffer pool. In this way, each business configures the request it normally needs, and when traffic surges, the limit resources absorb the performance pressure beyond the request.

Note: overselling is not a panacea here; it has two obvious problems:

  • If the remaining resources on the current node are less than the configured limit, the pod will be automatically migrated to another node.
  • If the core-binding function is needed, it depends on the QoS class, so request and limit must be configured identically.

  • Set up automatic scaling, where thresholds can be configured according to each service's performance bottleneck to finally achieve automatic scale-out.

CPU is the performance bottleneck for most services, so the common approach is to configure scale-out based on CPU request utilization.
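As an illustration of the two approaches above, here is a minimal sketch; the workload name, namespace, resource sizes and threshold are placeholder assumptions, not the advertising team's actual values.

        # (1) Container request/limit fragment: the limit above the request allows
        #     bursting into oversold resources. For core binding, set them equal
        #     instead, which yields the Guaranteed QoS class (see the note above).
        resources:
          requests:
            cpu: "8"
            memory: 16Gi
          limits:
            cpu: "16"
            memory: 16Gi

        # (2) HPA that scales out on CPU utilization measured against the request.
        apiVersion: autoscaling/v2
        kind: HorizontalPodAutoscaler
        metadata:
          name: ad-service-hpa          # placeholder
          namespace: ad-serving         # placeholder
        spec:
          scaleTargetRef:
            apiVersion: apps/v1
            kind: Deployment
            name: ad-service            # placeholder workload
          minReplicas: 4
          maxReplicas: 40
          metrics:
          - type: Resource
            resource:
              name: cpu
              target:
                type: Utilization
                averageUtilization: 70  # scale out above 70% of requested CPU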

Retrieval over millions of ad orders

1. Core advertising retrieval module

For each type of traffic, advertising has the concept of a site set; each traffic type is divided into different sets to isolate the impact of each traffic type and their different latency requirements. During the CVM migration, a set was split out for each module and moved to the cloud on CVM; on this basis, the core retrieval module Sunfish was containerized and moved to the cloud. This module is characterized by CPU-intensive retrieval, so it cannot use hyperthreading (which increases scheduling latency), and its internal programs are core-bound (to reduce CPU scheduling across multiple processes).

2. CPU core binding in containers

This is one of the biggest characteristics of advertising workloads, and also the biggest difference between TKE and CVM/physical machines.

In the CVM/physical machine scenario, the virtualization layer exposes correct CPU core information, so the number and identity of CPU cores are read from /proc/cpuinfo and each program is bound to cores accordingly.

However, the CPU information seen inside a container is misleading: /proc/cpuinfo numbers the cores according to the number of cores allocated to the container, but this numbering is not the container's actual CPU sequence on the host machine. The real sequence must be read from /sys/fs/cgroup/cpuset/cpuset.cpus, as the two examples below show:

CPU serial numbers shown by /proc/cpuinfo (not real):

CPU serial numbers shown by /sys/fs/cgroup/cpuset/cpuset.cpus (real):

In /proc/cpuinfo, the cores are simply numbered sequentially according to how many cores were allocated to the container; however, CPU 15 of the host machine, for example, was not actually allocated to this container.

So we need to obtain the container's real CPU sequence on the host machine from /sys/fs/cgroup/cpuset/cpuset.cpus in order to bind cores correctly, as shown in the second figure above. In addition, the following commands can be added to the startup script to convert the real CPU list into a format convenient for binding.

cpuset_cpus=$(cat /sys/fs/cgroup/cpuset/cpuset.cpus)
cpu_info=$(echo ${cpuset_cpus} | tr "," "\n")
for cpu_core in ${cpu_info}; do
    echo ${cpu_core} | grep "-" > /dev/null 2>&1
    if [ $? -eq 0 ]; then
        first_cpu=$(echo ${cpu_core} | awk -F"-" '{print $1}')
        last_cpu=$(echo ${cpu_core} | awk -F"-" '{print $2}')
        cpu_modify=$(seq -s "," ${first_cpu} ${last_cpu})
        cpuset_cpus=$(echo ${cpuset_cpus} | sed "s/${first_cpu}-${last_cpu}/${cpu_modify}/g")
    fi
done
echo "export cpuset_cpus=${cpuset_cpus}" >> /etc/profile

Running source /etc/profile then makes the environment variable available in the format shown below. Note: core binding depends on the QoS configuration (that is, request and limit must be set to identical values).
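For illustration, a hedged sketch of how a startup script might consume this variable to pin a process is shown below; the business binary and the use of taskset are assumptions for the example, not necessarily how the advertising services bind cores internally.

        # Load the expanded CPU list exported by the snippet above.
        source /etc/profile

        # Example: pin the whole process to the container's real host CPUs.
        # (Hypothetical binary; programs can also parse ${cpuset_cpus} themselves
        # and bind individual threads for finer-grained core isolation.)
        taskset -c ${cpuset_cpus} /data/app/bin/ad_retrieval --config /data/app/conf/retrieval.conf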

3. Disable hyperthreading

Hyperthreading is enabled in most scenarios, but it needs to be disabled in computation-intensive scenarios. The solution here is to choose to disable hyperthreading when applying for the CVM.

Then taint and label the hyperthreading-disabled host machines, so that ordinary workloads are not scheduled onto them. When the hyperthreading-disabled resources are needed, add the toleration and set the label in the YAML, and the corresponding resources can be obtained.
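A minimal sketch of the scheduling side is shown below; the taint and label keys/values are placeholders for illustration, not the actual keys used on the advertising clusters.

        # Pod spec fragment for a workload that needs hyperthreading-disabled nodes.
        spec:
          nodeSelector:
            ad.example.com/ht-disabled: "true"    # placeholder node label
          tolerations:
          - key: "ad.example.com/ht-disabled"     # placeholder taint key
            operator: "Equal"
            value: "true"
            effect: "NoSchedule"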

Yunti resource application configuration for disabling hyperthreading:

High availability in stateful service upgrades

Stateless containers are the easiest to upgrade: the availability of the service port is equivalent to the availability of the container.

However, the startup of stateful services is more complicated: the state needs to be prepared in the startup script. For advertising, this mainly involves pushing and loading the advertising order resources.

1. Continuous availability of advertising resources during container upgrades

The biggest difference between a container upgrade and a physical machine is that the original container is destroyed and a new container is pulled from the new image to provide the service; the disks, processes, and resources of the original container are all destroyed.

However, the advertising order resources are at the million level. If the files had to be pulled again on every upgrade, startup would be slow, so we added temporary mount directories to the containers.


In this mode, the files in the directory above are retained during a container upgrade and do not need to be pulled again. Note that emptyDir is only retained in the upgrade scenario; destroying and rebuilding still wipes the directory and the files must be pulled again. Here is how to modify the YAML directly for an existing service:

        volumeMounts:
        - mountPath: /data/example/
          name: example-bf
      volumes:
      - emptyDir: {}
        name: example-bf

2. High availability of the service during upgrades

During business iteration, there are two problems that actually cause the service to be lossy.

  • If the workload is associated with a load balancer, the container enters the running/available state after the first line of the startup script executes, and it is then added to the associated load balancer. But at that moment the business process is not ready yet; stateful services in particular must finish preparing their state before they can serve. At this point, requests fail because an unavailable instance has been added.

  • During the upgrade process (except for the Deployment-style rolling upgrade), the original container is destroyed first and then the new container is pulled. The problem is that during the upgrade, the instance is removed from the associated load balancer and then immediately enters the destruction phase. If the upstream L5 callers cannot learn quickly enough that the pods have been removed, they keep sending requests downstream to a container that has already been destroyed, and the whole business reports errors.

So the main ideas here are:

  • How to bind the state of the business to the state of the container.
  • During the upgrade/destroy-rebuild process, can we run a post-script that performs some logical processing before destruction; the simplest is to sleep for a period of time.

Here we introduce two concepts for business upgrades:

  • Readiness probe

  • Post-script

    1) Readiness probe: select the port readiness probe when creating the workload, so that the instance is added to the associated load balancer only after the service port is up.

    The YAML can also be modified in the workload:

        readinessProbe:
          failureThreshold: 1
          periodSeconds: 3
          successThreshold: 1
          tcpSocket:
            port: 8080
          timeoutSeconds: 2

A similar "unhealthy" state appears, which is the period after the container starts while it waits for the business port to become available.

2) Post-script

The core function of a post-script is to perform a series of business-defined actions between being removed from the associated load balancer and the container being destroyed.

The order of execution is: submit the destroy-rebuild/upgrade/scale-in operation → remove from Polaris/L5/CLB/Service → execute the post-script → destroy the container.

The simplest use is, when upstream callers use L5, to sleep 60s after removal from L5, so that the upstream has time to learn that the pods have been removed before they are destroyed.

        lifecycle:
          preStop:
            exec:
              command:
              - sleep
              - "60"
Or execute a custom stop script before destruction:

        lifecycle:
          preStop:
            exec:
              command:
              - /data/scripts/stop.sh

From long-term experience, the main problem lies with L5. If the service traffic is large, sleep 60s is enough; if the service has only a small number of requests, sleep 90s is needed before it is safe to assume no errors will be reported.

Results

Comparison of CVM and TKE resource usage and latency

Here we compare the CPU usage and latency between CVM and TKE containers running on host machines with the same configuration; there is basically no significant difference in either CPU usage or latency.

CVM: TKE:

Overall benefits

Conclusion

Unlike other shares that introduce cloud native from the bottom up, this article mainly introduces the advantages and usage of cloud native for large-scale online services from a business perspective, combining Tencent Advertising's own characteristics and strategies to realize its container practice in scenarios such as high concurrency and automation. For a business team, all work ultimately comes down to quality, efficiency and cost, and cloud native can improve all three. We hope to bring more sharing from Tencent Advertising in the future.


Container Service (TKE) is a one-stop, Kubernetes-based cloud native PaaS platform provided by Tencent Cloud. It offers enterprise-grade services that integrate container cluster scheduling, Helm application orchestration, Docker image management, Istio service governance, automated DevOps, and a full set of monitoring and operation systems.