The author

Li Huibo, senior business operation and maintenance engineer at Tencent, currently works in the Technical Operation and Quality Center of the TEG Cloud Architecture Platform Department and is responsible for video transcoding operation and maintenance for WeChat and QQ social services.

Abstract

With the rise and rapid development of short video, demand for video transcoding keeps growing: low bit rate with high clarity; 4K, Ultra HD, HD, and SD outputs adapted to different terminals and network environments to improve the user experience; and diversified needs such as watermarks, logos, cropping, and screenshots. The platform must also respond quickly to diverse resource demands with elastic scaling. With the advancement of the company's self-developed cloud project, stable supply and a diversity of hardware provide more choices to meet the fast, stable, burst-tolerant resource demands of transcoding for services such as WeChat Moments, Channels (video accounts), advertising, and Official Accounts.

Service scenario

Media Transcoding Service (MTS) is a quasi-real-time (and offline) video processing service for VOD scenarios. It provides minute-level HD compression, screenshots, watermarking, simple clipping, and other basic video processing functions for businesses, and can also integrate deeper functions such as customized picture-quality enhancement and quality evaluation.

Business developers define a batch-processing template; when a content producer uploads content, transcoding is triggered to output compressed videos in multiple specifications plus a video cover, after which the result can be pushed and published.
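As an illustration only, such a batch-processing template might look like the sketch below; the format and every field name are hypothetical assumptions, since the actual template format is internal to MTS:

# Hypothetical MTS batch-processing template (illustrative only)
template: social-video-default
outputs:
- name: hd
  resolution: 1280x720
  video_bitrate_kbps: 1800
- name: sd
  resolution: 854x480
  video_bitrate_kbps: 800
cover:
  type: screenshot          # grab a frame to use as the video cover
  offset_seconds: 1
watermark:
  image: logo.png           # hypothetical watermark asset
  position: top-right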

Background

The WeChat side and the short video platform carry a large volume of video files, and these videos are generally processed on the transcoding platform according to business requirements in order to reduce bit rate, cut costs, and reduce playback lag caused by the network. On the earliest transcoding platform, each business maintained its own independent cluster. There were many clusters, resources could not be shared between them, and the capacity of a single cluster was small, so businesses with large request volumes had to deploy multiple sets of clusters.

This brought great challenges to operation and called for a platform with a higher capacity ceiling, more flexible resource scheduling, and more convenient operation. With the advancement of the company's self-developed cloud project and TKE containerization, the transcoding platform needed to connect quickly with TKE resources and use the company's massive computing resources to meet business demand for video transcoding.

Construction ideas and planning

Video access and transcoding frequently face traffic bursts from multiple services at once. On the premise of ensuring service quality, utilization and operation efficiency must be improved.

Platform capability building: raise the capacity ceiling of a single cluster, isolate frequency control between services, and adjust resource scheduling flexibly

Resource management construction: fully mine idle, fragmented resources around the TKE container platform, and raise CPU utilization with HPA by elastically scaling out and in across staggered peak and off-peak hours. At the same time, the video access service has high traffic but low CPU usage, while the transcoding service has low traffic but high CPU usage; mixing the two scenarios makes full use of physical machine resources and prevents a pure-traffic cluster from running at low load

Operation system construction: adapt to business scenarios, improve the change and on/off-shelf processes, monitor processes and clear alarms, and establish a platform for stability assurance

Platform Capacity Building

Infrastructure upgrade

Old transcoding platform architecture:

  1. As a master/slave structure, its disaster recovery capability is relatively weak and master performance is limited; a single cluster manages only about 8,000 workers

  2. At the resource scheduling level, workers with different core counts cannot be differentiated properly, so some run at high load while others sit at low load

  3. Frequency control cannot be applied per service, so a burst from a single service affects the entire cluster

New Transcoding platform architecture:

  1. The Master/Access modules are merged into Sched. The Sched scheduling module is deployed in distributed mode, and a failed node is removed automatically
  2. Workers and Sched maintain a heartbeat through which each worker reports its status and CPU core count, so the scheduler can allocate tasks according to worker load and keep workers of different CPU specifications deployed in the same cluster balanced (a hypothetical heartbeat payload is sketched after this list)
  3. A single cluster manages 30,000+ workers, enough to absorb the surge in business during holidays
  4. With services merged into one large public cluster, an individual service can be frequency-controlled, reducing interference between services
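To make the heartbeat concrete, here is a minimal sketch of the kind of payload a worker might report to Sched; all field names and values are illustrative assumptions, since the real protocol is internal to the platform:

# Hypothetical worker -> Sched heartbeat report (illustrative only)
worker_id: worker-10-0-0-1    # hypothetical identifier
version: "2.3.1"              # lets Sched route a small business to a specific worker version
cpu_cores: 16                 # spec reported so task allocation can be weighted by capacity
running_tasks: 5
load: 0.42                    # current load; Sched balances this score across specifications

Sched can then weight task assignment by cpu_cores and current load, which is how workers of different specifications stay evenly utilized.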

With this architecture upgrade, the platform is no longer limited by single-cluster capacity, demand at daily and holiday peaks can be met quickly, and merging businesses into one large cluster staggers their peaks and improves resource utilization.

Access service SVPAPI upgraded to DevOps 2.0

Riding the tailwind of the business's move to TKE, the short video platform's access service SVPAPI completed a standardization upgrade. Key improvements include:

  1. The original multiple change systems, monitoring systems, and basic resource management systems are integrated into one unified entrance on the intelligent R&D platform, covering R&D testing, routine version releases, elastic resource scaling, service monitoring and alarms, and service log retrieval and analysis. Operating TKEStack only through the CI/CD process, which shields direct access, is also more secure
  2. The module is hosted on Seven Color Stone, supporting one-minute service switching, flexible traffic scheduling, and service flow control during holidays
  3. In the second half of this year, the access service plans to use the intelligent R&D platform to monitor cluster traffic levels, combine TKEStack's HPA capability driven by traffic, and put unattended resource expansion into practice

Resource management construction

With the platform capability in place, the next step is to classify and schedule resources of different container specifications in a balanced way. The main difficulties are as follows:

1. Diverse business scenarios: the TKE clusters involve many different performance specifications, and everything from 6-core to 40-core containers must be usable

2. Resource management and operation must be considered: Dockerfile image building, adaptation to the configuration of different TApp clusters, container on/off-shelf operations, operation and maintenance change specifications, and so on

First, sort out the container configuration under the different TKE clusters:

- name: MTS_DOCKER_CPU_CORE_NUM
  value: "16"
- name: MTS_DOCKER_MEM_SIZE
  value: "16"
# Computing-power platform affinity setting: when node load exceeds 70%, the pod is evicted
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: max_vm_steal
          operator: Lt
          values:
          - "70"
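For context, here is a minimal sketch of where such settings live in a TApp spec; the TApp name, labels, image, replica count, and resource figures are illustrative assumptions, not the production configuration:

apiVersion: apps.tkestack.io/v1
kind: TApp
metadata:
  name: mts-worker-c16              # hypothetical name for a 16-core worker pool
spec:
  replicas: 100
  selector:
    matchLabels:
      app: mts-worker               # hypothetical label
  template:
    metadata:
      labels:
        app: mts-worker
    spec:
      # the nodeAffinity block shown above goes here, under "affinity"
      containers:
      - name: worker
        image: mts-worker:latest    # hypothetical image
        env:                        # the env entries shown above
        - name: MTS_DOCKER_CPU_CORE_NUM
          value: "16"
        - name: MTS_DOCKER_MEM_SIZE
          value: "16"
        resources:
          requests:
            cpu: "16"               # assuming the env values map to CPU cores / GiB of memory
            memory: 16Gi
          limits:
            cpu: "16"
            memory: 16Gi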

Resource Scheduling Balancing

Transcoding is an asynchronous task: each task takes a different amount of time to process and is stateful while running. Tasks therefore cannot be load-balanced through Polaris alone, and the platform side has to design its own scheduling strategies:

  1. Allocate tasks evenly based on the performance of workers with different CPU specifications

  2. Schedule according to worker version, supporting quick version iteration for small businesses

For containers of different specifications, scheduling is balanced by score and by version.

With this scheduling capability, task load across different CPU specifications stays balanced; for example, C6 and C12 workers end up similarly utilized, so large containers do not waste resources.

Operation system construction

How are the worker resources expanded on TKE brought into the corresponding transcoding cluster? A resource management layer is added here. Previously, removing a specified worker from a cluster had to be done manually, so a professional OSS system was developed on the platform side: the cluster's sched/worker/task state is presented as pages for easy operation, and the on/off-shelf APIs are encapsulated. TKE itself has no direct coupling with the transcoding platform; to keep them decoupled, the O&M side developed the on/off-shelf integration with TKE and the surrounding workflow, synchronizing TKE's scaled-out and scaled-in resources by calling the OSS API. The specific logic is shown in the figure below:

TKE supports the Polaris service. The corresponding TApp is associated with a Polaris service name, and the Polaris service is managed as the metadata for the IPs scaled in and out of the different transcoding clusters, ensuring resource consistency between the TKE side and the transcoding side.
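A sketch of the kind of metadata record this implies; every field name and value here is a hypothetical illustration, since the real schema is internal to the OSS system:

# Hypothetical metadata record linking a Polaris service, a TApp, and a transcoding cluster
polaris_service: Production/mts-worker-c16    # Polaris service name bound to the TApp
tapp: mts-worker-c16
transcode_cluster: sched-cluster-01           # which Sched cluster these instances join
instances:                                    # kept in sync as TKE scales out and in
- ip: 10.0.0.1
  status: online
- ip: 10.0.0.2
  status: offline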

Process monitoring

The transcoding platform manages tens of thousands of workers, and the process status inside containers cannot be tracked in time while tasks run or a new version is released; batch scanning takes too long, so abnormal process states are not discovered quickly. Process monitoring and alarming for transcoding containers was therefore built on the group's internal process-monitoring platform. Abnormal workers can be removed after WeChat Work robot and phone alarm notifications, improving service quality.

Resource utilization optimization

Transcoding currently serves mostly in-house social services, with obvious holiday effects and large resource demand. Most tasks are quasi-real-time and sensitive to transcoding latency, so besides guaranteeing speed, a 30%~50% buffer is normally reserved; business load is basically low in the early morning, so part of the resources sits idle then. TKE supports automatic scaling based on system metrics, and its billing is also based on a day's actual usage. Here we configure elastic scaling on CPU utilization metrics: scale in at off-peak and automatically scale out at peak, reducing resource occupation and cost.

Elastic scaling

Based on the actual load, the number of nodes is scaled in during the early-morning off-peak period

At peak, the Workloads' CPU usage-to-request ratio reaches 75% or above, improving CPU utilization while keeping the service stable
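A minimal sketch of a CPU-based HPA of the kind described here; the names, replica bounds, and the choice of TApp as the scale target are illustrative assumptions:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: mts-worker-hpa              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps.tkestack.io/v1
    kind: TApp                      # assuming the worker pool is a TApp exposing a scale subresource
    name: mts-worker-c16            # hypothetical workload
  minReplicas: 50                   # hypothetical floor kept through the early-morning low peak
  maxReplicas: 300                  # hypothetical ceiling for holiday bursts
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75      # matches the 75% usage-to-request target above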

Results summary

At present, the transcoding platform has been consolidated from scattered small clusters into large clusters in three regions, and with improved operation capability and resource utilization it continues to raise its cloud-native maturity. As of May 2021:

  1. Businesses connected so far include WeChat Moments, Channels (video accounts), C2C, Official Accounts, Kan Yi Kan (Top Stories), advertising, Qzone, and other internal video services, with 100 million+ videos transcoded every day
  2. Daily CPU usage stays around 70%, and capacity is automatically and elastically scaled according to load, significantly improving service maturity

About us

For more cloud-native cases and knowledge, follow the WeChat public account of the same name, [Tencent Cloud Native]~

Benefits:

① Reply [Manual] in the public account backend to get the "Tencent Cloud Native Roadmap Manual" & "Tencent Cloud Native Best Practices"~

② Reply [Series] in the public account backend to get "15 series of 100+ super-practical cloud-native collections", including Kubernetes cost reduction and efficiency improvement, K8s performance optimization practices, best practices, and other series.