Author: Chai Kebin (Netease Media Technology Team)

As one of the earliest content and information platforms in China, NetEase Media has faced many infrastructure challenges as business volume grows and business iteration accelerates. To provide fast, stable, and secure infrastructure and basic services, NetEase Media began a containerization upgrade of its infrastructure in early 2019. This article describes in detail the work, problems, solutions, and future plans of the media infrastructure containerization upgrade, in the hope of providing a useful reference for readers undertaking similar upgrades.

**1. Problems encountered and upgrade goals**

As the media business continues to grow, the infrastructure faces many challenges, including the following:

  • Low resource utilization: before containerization, overall resource utilization was about 20%, with clear peaks and troughs; trough capacity could not be used effectively

  • Complex processes and many manual operations: services had to go through an approval process to apply for resources, which took a long time; scaling up during traffic peaks was slow, and after the peak, O&M had to take resources offline manually

  • Frequent data crawling: some core media interfaces were scraped from external networks, causing service instability and sudden traffic spikes

  • Manual scaling consumes manpower: it is difficult to set a proper resource quota for a service; capacity had to be expanded manually at service peaks and reduced manually afterwards

  • Unclear service dependency topology: when a service was to be migrated or taken offline, the provider had to find all of its consumers through various manual methods and notify them; these methods often missed consumers, so consumers failed after the provider went offline

To cope with the above challenges, the media team decided to upgrade the infrastructure as a whole, with the following goals:

  • Improve overall resource utilization: raise overall resource utilization to 50%–60% and make full use of trough resources to optimize cost

  • Improve efficiency: eliminate tedious approval processes so that businesses can obtain resources from the resource pool at any time; monitor resource usage and optimize allocations for businesses with low utilization

  • Security hardening: build gateway services to provide unified traffic access, and add circuit breaking/degradation, security, and auditing capabilities at the gateway

  • Stability: services can be migrated quickly when resources fail, capacity can be expanded quickly when traffic surges, and capacity is reduced automatically after traffic recovers

  • Quickly locate service dependencies: quickly and accurately locate service dependencies, view the overall service dependency topology, and view the URL call topology of core interfaces

**2. Infrastructure evolution**

The evolution of the media architecture has gone through four stages:

  • Physical machine stage: media services were initially deployed on physical machines

  • Virtualization stage: to save costs, the entire media business moved to a proprietary cloud

  • Containerization stage: to further save costs, improve efficiency, guarantee stability, and strengthen security, the entire media business moved to a container cloud, with resources pooled and available to businesses on demand

  • Container cloud upgrade stage: after the migration to the container cloud, basic resource management became more convenient and provides a solid foundation for flexible, fine-grained resource management

This article mainly covers the work of the containerization stage.

**3. Allocating resource pools**

With an understanding of the business architecture and requirements, we planned three resource pools:

  • Container resource pool: all stateless services are containerized and run in the container resource pool

  • Cloud host resource pool: some services rely on Rsync, IP binding, or IP hashing, or are stateful, and have not had time for architecture transformation; these use the cloud host resource pool

  • Dedicated server resource pool: some resource-intensive applications (such as recommendation and algorithm workloads) do not need container transformation or cloud host migration and retain dedicated servers

The container resource pool and the cloud host resource pool reside in one large VPC, while the physical servers reside outside the VPC; the DGW gateway and NLB are used for communication between them.

The container resource pool is divided into the following seven categories:

  • APP: the resource pool in which all business applications are deployed

  • Redis: Redis is a sensitive service, so it has a separate resource pool with its own configuration

  • PUSH: the PUSH service occupies more than 70% of its resources during the day but very little at night; an independent pool prevents PUSH from affecting other services during daytime peaks

  • Rec: the recommendation resource pool; recommendation services require an independent pool and should not share resources with other services

  • Kafka: the Kafka Operator resource pool, in which all Kafka services are deployed

  • ES: the ES Operator resource pool, in which all ES services are deployed

  • GW: the gateway resource pool, in which all container gateways are deployed

**4. Calling services inside and outside the container**

We planned a container resource pool, a virtual machine resource pool, and a physical machine resource pool. But how do services inside and outside the container call each other? After discussion and design, we adopted the following approaches:

  • Out-of-container calls to out-of-container services: we use Media's own service framework, NDSF, for service invocation, with Consul as the service registry

  • In-container calls to in-container services: calls inside the container are made directly via the ServiceName

  • In-container calls to out-of-container services: these also use the NDSF service framework, with Consul as the service registry

  • Out-of-container calls to in-container services: these pass uniformly through the entry gateway
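As a minimal sketch of how a service might register itself with Consul under the NDSF model: the payload below follows Consul's standard agent service-register API, but the service name, address, and health-check URL are hypothetical, chosen only for illustration.

```python
import json

def build_registration(name, address, port, health_url, interval="10s"):
    """Build the JSON body for Consul's PUT /v1/agent/service/register endpoint."""
    return {
        "Name": name,                      # logical service name used for discovery
        "ID": f"{name}-{address}-{port}",  # unique instance ID
        "Address": address,
        "Port": port,
        "Check": {                         # HTTP health check Consul runs periodically
            "HTTP": health_url,
            "Interval": interval,
        },
    }

payload = build_registration("news-api", "10.0.1.23", 8080,
                             "http://10.0.1.23:8080/health")
print(json.dumps(payload, indent=2))
```

The framework would then PUT this payload to the local Consul agent, and consumers resolve `news-api` through Consul's DNS or HTTP catalog interface.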

**5. Beta environment / production environment**

To completely isolate the production environment from the test environment, we created a Beta environment, which is isolated from production at the level of network devices, cabinets, and machines. The Beta environment has its own VPC, container cloud resource pool, and cloud host resource pool.

By default, the Beta environment cannot access the production environment. However, some businesses need Beta to access production, so we set up a whitelist security group.

**6. IP resource planning**

We created a container resource pool and a cloud host resource pool. The container resource pool's network uses OVS and sits alongside the virtual machines in the same VPC. We allocated one large network segment and divided several subnets within it:

  • Container public network segment: this segment allows access to the public network

  • Container internal network segment: this segment does not allow access to the public network; most container instances reside here

  • Cloud host public network segment: this segment allows access to the public network

  • Cloud host internal network segment: this segment does not allow access to the public network; most virtual machine instances reside here

  • Gateway network segment: the subnet used by the container gateway

  • Redis network segment: the subnet used by Redis

  • Kafka network segment: the subnet used by Kafka

  • ES network segment: the subnet used by ES

  • Management network segment: the subnet used by the Kubernetes management and control components
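The subnet carving above can be illustrated with Python's standard `ipaddress` module. The /14 parent segment and /18 subnet sizes below are hypothetical, chosen only to show the idea; real segment sizes depend on each pool's instance count.

```python
import ipaddress

# Hypothetical parent segment for the whole VPC.
vpc = ipaddress.ip_network("10.160.0.0/14")
# Carve equal-sized /18 subnets out of the /14 (yields 16 candidates).
subnets = list(vpc.subnets(new_prefix=18))

plan = {
    "container-public":   subnets[0],  # allowed to reach the public network
    "container-internal": subnets[1],  # most container instances
    "cloud-public":       subnets[2],
    "cloud-internal":     subnets[3],  # most virtual machine instances
    "gateway":            subnets[4],
    "redis":              subnets[5],
    "kafka":              subnets[6],
    "es":                 subnets[7],
    "management":         subnets[8],  # Kubernetes control components
}
for name, net in plan.items():
    print(f"{name:20s} {net} ({net.num_addresses} addresses)")
```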

**7. Technology selection**

For the proprietary cloud environment, we chose Cloud 1.0, provided by NetEase Hangzhou Research Institute. Cloud 1.0 is built on OpenStack and provides virtual machines, RDS, NCR, LB, and other basic resources as well as common middleware; its network uses OVS. Cloud 1.0 is the cloud environment we had been using, and it is stable and reliable.

For the container cloud environment, we used Qingzhou ("Light Boat"), a PaaS platform provided by NetEase Hangzhou Research Institute and built on community Kubernetes. Qingzhou stays close to the community, but greatly improves capability, stability, and usability on top of it, meeting the requirements of a production environment. We use the following Qingzhou components:

  • NCS: the container service, which provides basic, reliable container instance services

  • NSF: the container ServiceMesh, based on Envoy and Istio; on top of the community open source versions it supports the Dubbo and Thrift protocols, hot upgrades, lazy loading, service governance capabilities, and more. When we use NSF, only caller-side interception is enabled

  • Colocation: a platform for mixing offline and online services to improve overall CPU utilization

  • Container gateway: the single access point from outside the container pool to services inside it, providing security, auditing, circuit breaking/degradation, monitoring, and more

  • Redis Operator: the containerized Redis PaaS service

  • Kafka Operator: the containerized Kafka PaaS service

  • HPA: a capability Kubernetes supports by default; we use HPA to achieve elastic scaling based on CPU load, QPS, task backlog, and other metrics
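The HPA behaviour mentioned above follows Kubernetes' standard proportional scaling rule, sketched below. The tolerance and replica bounds are illustrative defaults, not our production values.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=50, tolerance=0.1):
    """Kubernetes' HPA rule: scale replicas in proportion to the ratio of the
    observed metric (average CPU, QPS, task backlog...) to its target value."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:      # within tolerance: keep current size
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# 4 replicas averaging 90% CPU against a 50% target -> scale out to 8.
print(desired_replicas(4, current_metric=90, target_metric=50))
```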

**8. Current status**

After about a year of effort, the media team has established the container resource pool, cloud host resource pool, and physical machine resource pool. The container resource pool runs tens of thousands of Pods and thousands of services. The core business has been migrated to the container environment, and 80% of physical machine resources have entered the container resource pool.

**9. ServiceMesh**

As the foundation of microservices, the service scheduling and governance framework plays a very important role. There are many open source frameworks available, such as Dubbo and Spring Cloud. Before containerization, the media team already had its own service invocation and governance framework, which we named NDSF.

The functional architecture diagram of NDSF is shown below:

The NDSF framework mainly includes three parts:

  • SDK: provides client SDKs for databases and middleware, which encapsulate sensitive information such as user passwords and access addresses, as well as call statistics for core metrics

  • Framework: provides basic capabilities for microservice startup, including service online/offline, service status viewing, service topology reporting, and service health monitoring

  • Service invocation: the service invocation framework, covering service registration, service discovery, load balancing, and service governance capabilities

As the basic framework of the media, NDSF is used by virtually all services. For service invocation and governance, NDSF uses Consul as the service registry. The invocation framework is based on HTTP/2.0 and supports two invocation modes: Jar package invocation and Proxy invocation.

  • Jar package invocation: for Java microservices, service invocation, registration, and governance are handled by importing the NDSF Jar package into the project

  • Proxy invocation: for non-Java microservices, a Proxy deployed on each machine handles service invocation, registration, and governance

At present, NDSF covers more than 80% of the media's business. In the process of onboarding and using it, many problems arose, mainly the following:

  • Framework upgrades require the business side's involvement: the business side must redeploy its application along with the upgraded Jar package, which makes upgrades very difficult

  • Jar package conflicts easily occur during onboarding or upgrades, requiring Jar package fixes and increasing onboarding cost and time

  • The business side must modify its service invocation code to access NDSF, which also makes onboarding difficult

To make the framework transparent to businesses and require no code changes, we decided to use ServiceMesh as the framework for overall service governance. With ServiceMesh, we wanted to achieve the following goals:

  • Transparent to services: service code does not need to be modified and does not need to import any Jar package

  • Language independent: all languages used by the business side are supported

  • Multi-protocol support: supports the common Dubbo, Thrift, HTTP, and gRPC protocols

  • Dynamic: service governance, discovery, and registration take effect dynamically, and version upgrades are transparent to services

  • Complete service governance: supports timeout retry, circuit breaking/degradation, black/white lists, and rate limiting

For the ServiceMesh implementation, we chose NSF as the underlying support. NSF is enhanced on the basis of Istio and Envoy: it supports Dubbo and Thrift, and makes relatively large improvements in dynamic upgrade, dynamic interception, and fallback strategies.

Since some media services use Dubbo, the ServiceMesh needs to support the Dubbo protocol. NSF adds Dubbo support on top of Istio and Envoy. The main flow is as follows:

  • The business side invokes the destination IP directly using Dubbo

  • iptables intercepts traffic on a designated port (currently port 10000) and redirects it to Envoy

  • The ZooKeeper registry is retained, and the Galley component is extended to pull Dubbo service registration information and service dependencies from ZooKeeper

  • The Galley component reports ServiceEntry resources to Pilot through MCP; an extended field carries the Dubbo service's dependencies

  • When synchronizing the xDS configuration, Pilot delivers only the configuration required according to the ServiceEntry dependencies

The Dubbo protocol is supported as an extension on top of VirtualService.
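The dependency-scoped push in the last two steps can be modelled as follows. This is a simplified illustration, not actual Istio/NSF code, and the service names are made up.

```python
def scoped_configs(all_entries, dependencies, consumer):
    """Return only the ServiceEntry configs that `consumer` depends on,
    so the control plane avoids pushing the full mesh config to every Envoy."""
    wanted = set(dependencies.get(consumer, []))
    return [entry for entry in all_entries if entry["host"] in wanted]

# Registry state pulled from ZooKeeper by the extended Galley (illustrative).
entries = [
    {"host": "com.example.UserService",  "port": 20880},
    {"host": "com.example.OrderService", "port": 20880},
    {"host": "com.example.PayService",   "port": 20880},
]
# Dependency info carried in the ServiceEntry's extended field (illustrative).
deps = {"news-frontend": ["com.example.UserService", "com.example.OrderService"]}

pushed = scoped_configs(entries, deps, "news-frontend")
print([e["host"] for e in pushed])
```

The payoff is that each sidecar receives configuration proportional to its own dependencies rather than to the size of the whole mesh.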

The following table lists the capabilities currently supported by ServiceMesh

**10. Online/offline colocation**

One of the main goals of containerization is to improve overall CPU utilization and save resources. At present, overall media resource utilization is about 20%, and trough resources cannot be used effectively. We hope to reach 50%–60% overall utilization without affecting overall service stability. After analysis, resource usage falls into the following three categories:

  • Very low utilization, around 3%–5%: these applications are sensitive and cannot tolerate delays; Redis clusters are typical

  • Obvious peaks and troughs in online resources: during troughs, a large amount of computing capacity sits idle

  • Moderate utilization, usually around 15%–20%: these services are not very sensitive; stateless services are typical

We analyzed current resource usage and divided the overall services into online and offline services. The main characteristics of the two types are as follows:

We imposed the following constraints on offline/online colocation:

  • Offline services must not affect online services

  • Online service availability is guaranteed first

  • When online resource usage is low, offline services use the idle resources so that overall usage stays at a certain watermark

Based on the online/offline dimension, container resource pools are divided into three types:

  • Online resource pool: only online services can be scheduled here

  • Offline resource pool: only offline services can be scheduled here

  • Colocation resource pool: both online and offline colocation services can be scheduled here

The service types deployed in containers are divided into four kinds:

  • Job: an offline task, scheduled to the offline resource pool

  • Service: an online task, scheduled to the online resource pool

  • Colocation-job: an offline task that allows mixing, scheduled to the colocation resource pool

  • Colocation-service: an online service that allows mixing, scheduled to the colocation resource pool
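The routing of the four workload types to the three pools can be summed up in a small lookup; this is a sketch of the scheduling rule, not actual scheduler configuration.

```python
# Which resource pools may host each workload type.
ELIGIBLE_POOLS = {
    "job":                ["offline"],     # pure offline tasks
    "service":            ["online"],      # pure online services
    "colocation-job":     ["colocation"],  # offline tasks that allow mixing
    "colocation-service": ["colocation"],  # online services that allow mixing
}

def pools_for(workload_type):
    """Return the resource pools a workload of this type may be scheduled to."""
    return ELIGIBLE_POOLS[workload_type]

print(pools_for("colocation-job"))
```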

We used the Zeus colocation system provided by the Qingzhou team as the base platform for media colocation. Zeus maximizes CPU utilization while keeping the online business stable, and supports flexible colocation strategies and elastic scaling.

The main architecture of Zeus is as follows:

The colocation system mainly includes six components; the control plane has four:

  • Zeus-webhook-server: an admission control plug-in based on Kubernetes Dynamic Admission Control; it intercepts user requests to set default values and check field validity

  • Zeus-manager: a controller developed in Kubernetes' operator pattern; it implements CRD resource lifecycle management and rescheduling of offline tasks

  • Zeus-scheduler-extender: an extended scheduler based on the Kubernetes scheduler's HTTP Extender mechanism; it implements dynamic scheduling of offline businesses

  • Zeus-exporter: similar to the kube-state-metrics component, a statistical aggregator of resource data pushed to Prometheus

The data plane has two components:

  • Zeus-monitor-agent: an agent that collects node monitoring data, running on every node in the colocation resource pool; it periodically writes node load status to the NodeProfile custom resource and provides a local RESTful API for Zeus-isolation-agent to query

  • Zeus-isolation-agent: the component responsible for online/offline resource isolation, running on every node of the colocation resource pool; it reads monitoring data from Zeus-monitor-agent and periodically enforces isolation rules
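The dynamic scheduling done by Zeus-scheduler-extender can be sketched as a filter step: admit an offline pod on a node only if the node's projected utilization stays under the pool's target. This is a simplified model, not actual Zeus code, and the node data is illustrative.

```python
def filter_nodes(nodes, pod_cpu_request, target_util):
    """Scheduler-extender style filter: keep nodes whose projected CPU
    utilization stays at or below the pool's target after placing the pod."""
    feasible = []
    for n in nodes:
        projected = (n["cpu_used"] + pod_cpu_request) / n["cpu_allocatable"]
        if projected <= target_util:
            feasible.append(n["name"])
    return feasible

nodes = [
    {"name": "node-a", "cpu_allocatable": 55, "cpu_used": 10},  # lightly loaded
    {"name": "node-b", "cpu_allocatable": 55, "cpu_used": 40},  # heavily loaded
]
print(filter_nodes(nodes, pod_cpu_request=8, target_util=0.5))
```

In the real system the load figures come from the NodeProfile resource that Zeus-monitor-agent maintains, so the filter reacts to actual node load rather than static requests.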

Based on the resource characteristics of the businesses, we divide the container colocation resource pool into three categories:

  • APP: online applications with high and potentially spiking resource usage

  • NCR: online applications with low but sensitive resource usage; offline usage cannot be too high here, to prevent jitter

  • PUSH: online applications with obvious peaks and troughs, where trough utilization needs to be improved

To reasonably estimate the CPU available to offline tasks: if the online service load is high and node utilization is high, fewer resources are allocated to offline services, or the node does not colocate offline services at all; if the online load and node utilization are low, more resources can be allocated to offline services. Based on this analysis, we define the following formula:

Capacity = max(node allocatable resources × target node utilization − online service usage, minimum guaranteed cores)

The node target utilization is dynamically set based on the sensitivity of each resource pool

For example, if a colocation node has a CPU capacity of 56 cores with 55 allocatable, online services on the node use 10 cores, the target CPU utilization of the node is 50%, and the minimum resource guarantee is 2 cores, then the CPU cores available for offline tasks on the node are:

Capacity = max(55 × 50% − 10, 2) = 17.5 cores
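The capacity formula can be checked directly in code; plugging in the worked example's figures (55 allocatable cores, 50% target utilization, 10 cores used by online services, 2-core minimum guarantee) gives 17.5 cores.

```python
def offline_capacity(allocatable, target_util, online_usage, min_guarantee):
    """Capacity = max(allocatable x target utilization - online usage,
    minimum guaranteed cores)."""
    return max(allocatable * target_util - online_usage, min_guarantee)

# Example figures from the text.
print(offline_capacity(55, 0.5, 10, 2))  # 17.5
```

When online usage rises toward the target (say, 27 of 27.5 cores), the result collapses to the 2-core minimum guarantee, which is the floor that keeps some offline capacity alive.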

After online/offline mixing, the overall utilization rate of resources has been greatly improved, saving about 30% of resources

We monitor predefined mixed resource pools. The overall resource usage trend is as follows:

**11. Pitfalls encountered**

  • A capacity assessment of the business is a must

    Assess the capacity of services entering containers in advance to prevent failures caused by increased pressure on the container cluster. The assessment covers network bandwidth, PPS, QPS, CPU computing power, memory size, disk I/O, service characteristics, LB bandwidth, and gateway bandwidth, to ensure the container cluster can support the new service's traffic

  • Do not over-label nodes

    Kubernetes distinguishes resource pools through labels on nodes, and a business chooses a resource pool at deployment time. If too many labels are applied, management becomes difficult and reasonable use of resources is obstructed. Resource pools should be chosen according to the business's deployment characteristics. At present, the main media resource pools are APP, gateway, Redis, Kafka, PUSH, recommendation, and ES; each has independent characteristics and must not affect the others

  • Create a Service only when needed

    Create a Service resource only if the workload actually needs one. Once a Service resource exists, its Endpoints change on every release, scale-up, scale-down, and update, and iptables is fully refreshed. If a Service has a large number of Pods, a full refresh is slow and has an unnecessary impact on the entire network

  • Container IP management should not be too fine-grained

    When users access MySQL databases, IP whitelists are added per database to ensure data security. IP addresses can be fixed on cloud hosts and physical machines, but in container scenarios IP addresses are randomly assigned within a large network segment. Creating a dedicated IPPool for each cluster greatly increases management complexity and wastes IP resources, because each cluster must reserve a certain number of IPs as backup. We suggest that, within a container cluster, the MySQL whitelist either not be restricted or be restricted only to the container cluster's network segment, which is much easier to manage

**12. Future plans**

At present, the media's core business runs in containers, and basic resource management has been pushed down into the platform. Going forward, we will mainly do the following on top of Kubernetes:

  • Fine-grained resource management: analyzing container cluster usage, we found a very large gap between the resources applications request and what they actually use, so some businesses hold many resources while others cannot obtain what they need from the pool; next, we will quantify the gap between each cluster's requested and actual usage and narrow it

  • Automatic resource recommendation: based on a cluster's historical resource usage, provide applications with a reasonable reference resource quota

  • Resource elasticity: to cope with rapid traffic growth, container elasticity needs to improve; currently we use HPA for elasticity, and we will also introduce VPA

  • Promoting ServiceMesh: pushing service governance down into the platform lets the business side focus on the business itself; we now support ServiceMesh, and next we will roll this capability out to businesses

  • Capacity management: monitor the network, computing, memory, and disk capacity of container clusters to accurately evaluate capacity for new services entering containers