This article is published by NetEase Cloud.


Author: Feng Changjian

Abstract: The NetEase Cloud Container Platform hopes to provide a complete solution and a closed-loop user experience for teams that have adopted a microservice architecture. To this end, the container service team has taken the lead in dogfooding since 2016, to see whether the container cloud platform can support the microservice architecture of the container service itself. It has been an interesting attempt.

Once you decide to adopt a microservice architecture, a host of practical problems lies ahead: technology selection, business boundaries, high availability, fault tolerance, inter-service communication, service discovery and governance, clustering, configuration management, data consistency, Conway’s law, distributed call tracing, CI/CD, service testing, scheduling and deployment, and so on. There is no simple, ready-made solution. There are thousands of ways to practice microservice architecture; we explored and practiced one of them, and we hope it gives you a thread to pull on. This article is the first in the series “Microservice Practice on the NetEase Container Cloud Platform”.

Docker container technology has moved past its initial noisy period and is gradually being adopted by companies and technical teams. Although the view that “the image is the standard for application delivery, and the container is the standard runtime environment” is by now widely accepted, quite a few people are still confused: with container technology as the standard, how should applications be built, how can they be run online at scale, and how can containers liberate productivity and improve the efficiency and quality of software delivery? The answer lies in the application architecture.

Microservice architecture was not invented alongside Docker container technology, but container technology has allowed it to flourish. Containers provide a consistent means of distribution and a consistent runtime environment, and only an application architecture that has been decomposed into microservices can bring out the full value of containers. Microservice architecture, in turn, introduces great complexity, and only containerized applications with large-scale container orchestration and scheduling can keep operations efficiency from declining. Container technology and microservice architecture complement each other.

The NetEase Container Cloud Platform, formerly the NetEase Automatic Application Deployment Platform (OMAD), uses the infrastructure provided by the IaaS cloud to manage the entire application lifecycle, including build and deployment integration. When container technology, represented by Docker, came into the public eye in 2014, we were pleasantly surprised to find that it was the most important piece of the puzzle in the evolution of the automated deployment platform from a tool into a platform. Previously, users had to initialize hosts themselves before building and deploying applications with the platform. After containers were introduced, throughout the entire delivery process from feature development to testing to one-click deployment, users no longer need to care about host initialization, inter-host communication, instance scheduling, or a series of other non-application concerns. This is good news for DevOps believers.

We began exploring best practices for container technology in 2015. From the product forms of “fat containers” and container clusters, to the definition of stateful and stateless services, and on to newer compute scenarios such as high-performance computing, we have kept thinking about and enriching the application scenarios of container technology. However the product form has been adjusted, the core concept of the container cloud platform has always been the “service”: through this abstraction it provides a high-performance container cluster management solution for microservices, supporting elastic horizontal scaling and vertical scaling, gray-release upgrades, service discovery, service orchestration, error recovery, performance monitoring, and other functions, to meet users’ needs for efficient application delivery and rapid response to business changes. The NetEase Cloud Container Platform expects to provide a complete solution and a closed-loop user experience to teams implementing a microservice architecture. For this reason, our container service team has taken the lead in dogfooding since 2016: on one hand to test whether the container cloud platform can support the microservice architecture of the container service itself, and on the other so that the experience gained from microservitization can feed back into the product design of the container cloud platform. It is a very interesting attempt, and it is also our original motivation for sharing this practice of microservice architecture on the container cloud platform.

Before discussing the microservice architecture practice of the container service, it is worth giving a general introduction to the NetEase Cloud container service. At present, the NetEase Cloud container service team manages more than 30 microservices in DevOps fashion and performs more than 400 builds and deployments per week. Logically, the NetEase Cloud container service architecture consists of four layers, from bottom to top: the infrastructure layer, the Docker container engine layer, the Kubernetes (hereinafter K8S) container orchestration layer, and the DevOps and automation tool layer:

The overall service architecture of the container cloud platform is as follows:

Setting aside the specifics of the container service business, its features can be divided into the following types (examples in parentheses):

  • End-user facing (OpenAPI service gateway) vs. internal-service facing (bare metal service)
  • Synchronous communication (user center) vs. asynchronous communication (build service)
  • Strong data consistency required (etcd synchronization service) vs. eventual consistency sufficient (resource recycling service)
  • Throughput sensitive (logging service) vs. latency sensitive (real-time service)
  • CPU intensive (signature authentication center) vs. network I/O intensive (image registry)
  • Online services (web services) vs. offline services (image check)
  • Batch tasks (billing log push) vs. scheduled tasks (distributed timed tasks)
  • Long connections (WebSocket service gateway) vs. short connections (hook service)

Once you decide to adopt a microservice architecture, a host of practical problems lies ahead: technology selection, business boundaries, high availability, fault tolerance, inter-service communication, service discovery and governance, clustering, configuration management, data consistency, Conway’s law, distributed call tracing, CI/CD, microservice testing, scheduling and deployment, and so on. There is no simple, ready-made solution.

Since Java is the primary programming language of our container service, weighing Spring Cloud against K8S was a natural exercise. Both are excellent microservice development and runtime frameworks. From the perspective of the application lifecycle, K8S covers a wider range, particularly resource management, application orchestration, deployment, and scheduling, none of which Spring Cloud addresses. In terms of functions, the two overlap to a certain extent, for example in service discovery, load balancing, configuration management, and cluster fault tolerance, but their approaches to these problems are completely different. Spring Cloud is aimed purely at developers, who must address every aspect of the microservice architecture at the code level. K8S is aimed at DevOps and offers a generic solution that tries to solve microservice problems at the platform level, shielding developers from the complexity. Take service discovery as a simple example. Spring Cloud provides the traditional registry-based solution with Eureka: developers must operate the Eureka server and modify both service callers and service providers to talk to the registry, caring about every detail of Eureka-based discovery. K8S, by contrast, provides a decentralized solution: it abstracts the Service and solves service exposure and discovery through DNS + ClusterIP + iptables, completely non-intrusively to both the service provider and the service caller.
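As a minimal sketch of that non-intrusiveness, the caller below reaches the user center simply by its K8S Service name over cluster DNS; no registry client or registration code is involved. The service name, endpoint path, and use of Spring’s RestTemplate are illustrative assumptions, not our production code.

```java
import org.springframework.web.client.RestTemplate;

public class UserClient {

    private final RestTemplate restTemplate = new RestTemplate();

    public String getUser(String userId) {
        // "user-center" is resolved by the cluster DNS to the Service's
        // ClusterIP; iptables rules then route the request to a healthy Pod.
        // The calling code is identical to calling any plain HTTP endpoint.
        return restTemplate.getForObject(
                "http://user-center/api/v1/users/" + userId, String.class);
    }
}
```

The same code runs unchanged as Pods come and go, which is exactly what “non-intrusive” means here.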

In technology selection we have our own considerations and lean toward the more stable solution; after all, stability is the lifeblood of cloud computing. We are not “K8S fundamentalists”: for the aspects of the microservice architecture mentioned above where K8S has a mature answer, such as service discovery, load balancing, high availability, cluster fault tolerance, and scheduling and deployment, we chose the K8S implementation; for others, such as synchronous inter-service communication, we chose the Spring Cloud solution; and for some, such as fault isolation and circuit breaking, we combined the advantages of both. Of course, there are also mature third-party solutions and self-developed systems in the mix, for example for configuration management, log collection, distributed call tracing, and flow control.

The biggest improvement K8S brought to our microservice management is in scheduling and deployment efficiency. In our situation, different services are deployed in different data centers and clusters (joint-debugging, testing, pre-release, and production environments, etc.) and have different software and hardware requirements (memory, SSD, security hardening, overseas access acceleration, etc.), which were difficult to satisfy with traditional automation tools. K8S manages Node hosts through Labels: as long as we specify the Pod’s Label, the K8S scheduler automatically deploys the service onto Node hosts that satisfy the Pod-to-Node Label matching rules, which is simple and efficient. The built-in rolling upgrade strategy, combined with liveness and readiness probes and lifecycle hooks, enables continuous service updates and rollbacks; by configuring the related parameters, blue-green and canary deployments can also be achieved. In terms of cluster fault tolerance, K8S maintains the number of service replicas through the replication controller: whether a service instance fails (abnormal process exit, OOM-killed, etc.) or a Node host fails (system, hardware, or network faults), the number of replicas is always restored to the specified count.
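To make the Label-matching and replica mechanics concrete, here is a minimal sketch of such a Deployment built with the fabric8 Kubernetes Java client. The client choice, the service name, the image, the disktype=ssd Node Label, and the probe endpoint are all our assumptions for illustration; any K8S client or a plain manifest expresses the same thing.

```java
import io.fabric8.kubernetes.api.model.apps.Deployment;
import io.fabric8.kubernetes.api.model.apps.DeploymentBuilder;

public class UserServiceDeployment {

    public static Deployment build() {
        return new DeploymentBuilder()
            .withNewMetadata()
                .withName("user-service")
            .endMetadata()
            .withNewSpec()
                // K8S keeps three replicas alive even after process
                // crashes or Node host failures.
                .withReplicas(3)
                .withNewSelector()
                    .addToMatchLabels("app", "user-service")
                .endSelector()
                .withNewTemplate()
                    .withNewMetadata()
                        .addToLabels("app", "user-service")
                    .endMetadata()
                    .withNewSpec()
                        // Pod-to-Node Label matching: schedule only onto
                        // hosts labeled disktype=ssd.
                        .addToNodeSelector("disktype", "ssd")
                        .addNewContainer()
                            .withName("user-service")
                            .withImage("hub.example.com/user-service:1.0.0")
                            // The liveness probe lets K8S restart unhealthy
                            // instances and gate rolling upgrades.
                            .withNewLivenessProbe()
                                .withNewHttpGet()
                                    .withPath("/health")
                                    .withNewPort(8080)
                                .endHttpGet()
                                .withInitialDelaySeconds(30)
                            .endLivenessProbe()
                        .endContainer()
                    .endSpec()
                .endTemplate()
            .endSpec()
            .build();
    }
}
```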

Docker creatively solves the consistency of the application and its runtime environment through layered images, but in general a service’s configuration differs between environments. As a result, an image built in the development environment cannot be used in the test environment, and an image verified by QA in the test environment cannot be deployed to production, forcing the Docker image to be rebuilt for each environment. The solution is to extract the configuration and inject it, for example as environment variables when the container starts; K8S also provides ConfigMap for this purpose. However, this approach has the drawback that configuration changes cannot take effect in real time, so we adopted the Disconf unified configuration center. With configuration centrally hosted, a container image built in the development environment can be submitted directly to the test environment for testing, and after QA verification it can be promoted to the rehearsal, pre-release, and production environments. On one hand this avoids repeated application packaging and Docker image builds; on the other, it truly realizes the consistency of online and offline applications.
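The injection side is simple to sketch: the application reads its configuration from the environment at startup, so the same image works everywhere. The property name and Spring-based wiring below are illustrative assumptions; in our setup the values are actually hosted in Disconf, whose client API is not shown here.

```java
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

@Component
public class DataSourceSettings {

    // Resolved from the DB_URL environment variable injected when the
    // container starts; the value after the colon is a local-dev default.
    // Because the value comes from the environment, the image itself
    // carries no environment-specific configuration.
    @Value("${DB_URL:jdbc:mysql://localhost:3306/dev}")
    private String dbUrl;

    public String getDbUrl() {
        return dbUrl;
    }
}
```

What this simple scheme cannot do is pick up changes at runtime, which is exactly the gap a unified configuration center fills.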

Spring Cloud Hystrix plays an important role in our service governance; we redeveloped it to provide more flexible fault isolation, degradation, and circuit breaking to meet the specific business needs of services such as the API gateway. In-process fault isolation is only one side of service governance. On a host running mixed applications, the applications must also be isolated from one another to avoid resource contention affecting service SLAs; for example, an out-of-control offline application must never occupy so much CPU that the online applications on the same host suffer. We use K8S to limit the resource quotas of the container runtime (mainly CPU and memory limits) to achieve fault and exception isolation between processes. K8S cluster fault tolerance, high availability, and process isolation, combined with Hystrix’s fault isolation and circuit breakers, go a long way toward living up to the “Design for Failure” philosophy.
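A minimal sketch of the Hystrix side, assuming a hypothetical user lookup: the command runs the remote call on an isolated thread pool and falls back when the call fails or the circuit is open. The class, group name, and fallback value are illustrative, not our redeveloped version.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class GetUserNameCommand extends HystrixCommand<String> {

    private final long userId;

    public GetUserNameCommand(long userId) {
        // Commands in the same group share a thread pool, so a slow
        // dependency exhausts only its own pool, not the whole process.
        super(HystrixCommandGroupKey.Factory.asKey("UserService"));
        this.userId = userId;
    }

    @Override
    protected String run() throws Exception {
        // The real remote call; a timeout or exception here counts
        // against the circuit breaker's failure threshold.
        return callUserService(userId);
    }

    @Override
    protected String getFallback() {
        // Degraded result returned when run() fails or the circuit is open.
        return "anonymous";
    }

    private String callUserService(long userId) throws Exception {
        throw new UnsupportedOperationException("placeholder for the HTTP call");
    }
}
```

Calling new GetUserNameCommand(42L).execute() then runs the lookup with isolation, timeout, and circuit breaking applied.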

How services are split directly affects the benefits reaped from a microservice architecture. The difficulty of splitting usually lies in unclear business boundaries, hard-to-transform legacy systems, data consistency, Conway’s law, and so on. In our experience, the first two problems share the same solution idea: 1) split out only those businesses whose boundaries are certain and which can stand on their own; 2) a service split is essentially a split of the data model, and while the upper application layer can withstand being split, the underlying data model often cannot. For businesses with fuzzy boundaries, even when a split is needed, split only the application, not the database.

As an example, here are the steps we took to smoothly split the user service out of the main project:

  1. Separate the user-related UserService and UserDAO from the main project, and add a UserController and UserDTO to form the user service, exposing an HTTP RESTful API.
  2. In the main project, replace the original UserService class with a UserFacade class that invokes the user service’s API through Spring Cloud Feign annotations.
  3. Switch the places in the main project that depended on the UserService interface to depend on the UserFacade interface instead, completing a smooth transition.

Through the above three steps, the user service stands alone as a microservice, with little increase in the complexity of the overall system code. A sketch of the Feign client from step 2 follows.
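This is a minimal sketch, assuming a UserDTO with only an id and a name; the service name, endpoint path, and the org.springframework.cloud.openfeign package (org.springframework.cloud.netflix.feign on older Spring Cloud releases) are illustrative assumptions.

```java
import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;

// Drop-in replacement for the old in-process UserService: same method
// shape, but each call becomes an HTTP request to the user service.
// The url attribute targets the K8S Service DNS name directly, so no
// client-side registry is required.
@FeignClient(name = "user-service", url = "http://user-service")
public interface UserFacade {

    @GetMapping("/api/v1/users/{id}")
    UserDTO getUser(@PathVariable("id") long id);
}

// Hypothetical transfer object returned by the user service's REST API.
class UserDTO {
    private long id;
    private String name;

    public long getId() { return id; }
    public void setId(long id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}
```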

Data consistency problems are common in distributed systems and are magnified under a microservice architecture, which from another angle illustrates the importance of splitting services properly. Most of the data consistency scenarios we encounter only require eventual consistency, and “scheduled-task retry + idempotence” is the Swiss army knife for such problems. We developed a business-independent “distributed scheduled task + reliable event” framework: any operation whose data must eventually become consistent is defined as an event, covering business scenarios such as user instance initialization and rebuilding, resource recycling, and log indexing. Take user initialization as an example. After a user registers, initialization must run; it is a time-consuming asynchronous process that includes tenant, network, and quota initialization. The process is defined as an initTenant event. The initTenant event and its context are stored in a reliable event table, and a distributed scheduled task triggers its execution; if execution fails, the scheduled task system triggers it again. For scenarios with higher real-time requirements, the event handling can be triggered first and the event persisted to the reliable event table afterwards. Every event handler must guarantee idempotent execution. There are many ways to implement idempotence, from Boolean status flags, to UUID deduplication, to version-number-based CAS; we will not expand on them here.
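The skeleton below is a minimal sketch of such an idempotent handler, assuming a hypothetical EventStore DAO over the reliable event table and a status column holding NEW, RUNNING, or DONE; all names are illustrative.

```java
// Hypothetical DAO over the reliable event table; the compare-and-set is
// typically a single UPDATE ... WHERE status = ? guarded by the database.
interface EventStore {
    boolean compareAndSetStatus(long eventId, String expected, String next);
}

class Event {
    private final long id;
    Event(long id) { this.id = id; }
    long getId() { return id; }
}

public class InitTenantHandler {

    private final EventStore store;

    public InitTenantHandler(EventStore store) {
        this.store = store;
    }

    // Invoked by the distributed scheduled task; safe to call repeatedly.
    public void handle(Event event) {
        // Status-based CAS: only one runner flips NEW -> RUNNING, so a
        // retry of an already-processed event becomes a no-op.
        if (!store.compareAndSetStatus(event.getId(), "NEW", "RUNNING")) {
            return;
        }
        try {
            initTenant(event); // tenant, network, and quota initialization
            store.compareAndSetStatus(event.getId(), "RUNNING", "DONE");
        } catch (Exception e) {
            // Put the event back so the next timer tick retries it.
            store.compareAndSetStatus(event.getId(), "RUNNING", "NEW");
        }
    }

    private void initTenant(Event event) {
        // Business logic elided in this sketch.
    }
}
```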

From our practical experience, when the business boundary conflicts with the organizational structure, we prefer the service split boundary that is more consistent with the organizational structure. This is a pragmatic application of Conway’s law, which states that a system’s architecture mirrors the communication structure of the organization that builds it: the organizational structure implicitly constrains the shape of the software architecture. Violating Conway’s law makes blind spots in system design very likely, and we have run into this situation both between teams and within a team.

This article is the first in the series “Microservice Practice on the NetEase Container Cloud Platform”. It has introduced the relationship between container technology and microservice architecture and the motivation for building the container cloud platform, and it has briefly described the practical experience of the NetEase Cloud container service based on Kubernetes and Spring Cloud. Limited by space, some points of the microservice architecture were only touched on, such as inter-service communication, service discovery and governance, and configuration management; other topics left unmentioned, such as distributed call tracing, CI/CD, and microservice testing, will be covered in future articles in this series. There are thousands of ways to practice microservice architecture; we explored and practiced one of them, and we hope it gives you a thread to pull on.


