Serverless is one of the hottest topics in the industry today, and cloud platforms and Internet companies alike are actively building Serverless products. This article shares practical experience from bringing Meituan's Serverless product to production, covering technology selection, detailed system design, stability optimization, the construction of the surrounding product ecosystem, and adoption within Meituan. Although every company's background differs, there are always ideas and methods worth borrowing, and we hope this article offers some inspiration or help.

1 Background

The term Serverless was coined in 2012 and became widely known in 2014 with the rise of Amazon's AWS Lambda computing service. Serverless is often literally translated as "no server": Serverless computing lets users build and run applications without worrying about servers. With Serverless computing, the application still runs on servers, but all server administration is handled by the Serverless platform. Applying for machines, releasing code, handling machine failures, expanding and shrinking instances, and machine-room disaster recovery are all completed automatically by the platform, so business developers only need to think about implementing their business logic.

Looking back at the evolution of the computing industry, infrastructure has moved from physical machines to virtual machines and then from virtual machines to containers, while service architecture has moved from traditional monolithic applications to SOA and then to microservices. If we view the overall technology trend along these two main lines of infrastructure and service architecture, we find that both are evolving from large to small, from heavyweight to lightweight. The essence of this evolution is to solve two problems: resource cost and R&D efficiency. Serverless is no exception; it too is designed to solve both:

  • Resource utilization: Serverless products support rapid, elastic scaling to improve resource utilization. During traffic peaks, the computing power and capacity of a service expand automatically to handle more user requests; when traffic falls, the resources in use shrink accordingly to avoid waste.
  • R&D and operations efficiency: on a Serverless platform, developers generally only need to fill in a code path or upload a code package, and the platform takes care of building and deploying. Developers no longer face machines directly: whether a machine is healthy, and whether capacity must be expanded or shrunk for traffic peaks and troughs, are concerns the Serverless product handles for them. This frees developers from the grind of operations, shifts them from DevOps toward NoOps, and lets them focus on implementing business logic.

Although AWS launched Lambda, the first Serverless product, back in 2014, the adoption of Serverless technology in China long remained tepid. In the past two or three years, however, driven by containers, Kubernetes, and cloud-native technology, Serverless has developed rapidly, and major domestic Internet companies are actively building Serverless products and exploring how to bring the technology to production. Against this background, Meituan also began building its Serverless platform in early 2019, under the internal project name Nest.

The Nest platform has now been under construction for two years. Looking back at the overall process, it has gone through three main stages:

  • Quick validation, landing an MVP version: through technology selection, product and architecture design, and development iterations, we quickly implemented the basic capabilities of a Serverless product, such as building, releasing, elastic scaling, function triggering, and function execution. After launch, we promoted pilot access for selected businesses to help validate and polish the product.
  • Optimizing core technology to ensure business stability: after the early pilot validation, we soon found several problems with the product's stability, including the stability of elastic scaling, cold start speed, system and business availability, and container stability. For each of these we made targeted optimizations and improvements.
  • Improving the technology ecosystem to realize business benefits: once the core technical points were optimized, the product gradually matured and stabilized, but it still faced ecosystem problems, such as a lack of development tools, missing integrations with upstream and downstream products, and insufficient platform openness, all of which hindered the product's promotion and use. We therefore kept improving the product's technology ecosystem, removing barriers to business access and use, and realizing the product's business benefits.

2 Quick Validation: Landing the MVP Version

2.1 Technology selection

In building the Nest platform, the first problem to solve was technology selection. Nest mainly involved three key choices: the evolution route, the infrastructure, and the development language.

2.1.1 Evolution Route

Initially, Serverless mainly comprised Function as a Service (FaaS) and Backend as a Service (BaaS). In recent years, it has expanded to include application-oriented Serverless services.

  • FaaS: function services that run in stateless computing containers. Functions are usually event-driven, have a short lifetime (possibly a single invocation), and are managed entirely by a third party. Related FaaS products in the industry include AWS Lambda and Alibaba Cloud Function Compute.
  • BaaS: Back-end services built on the cloud service ecosystem. BaaS products in the industry include AWS S3 and DynamoDB.

  • Application-oriented Serverless services: for example, Knative provides comprehensive service hosting capabilities from code package to image build, deployment, and elastic scaling. Public cloud products include Google Cloud Run (based on Knative) and Alibaba Cloud's SAE (Serverless Application Engine).

Within Meituan, BaaS products are in effect the internal middleware and underlying services, which have become very rich and mature after years of development. The evolution of Serverless at Meituan therefore mainly concerned two directions: FaaS function computing services and application-oriented Serverless services. Which should come first? Our main consideration was that FaaS function computing was more mature and better established in the industry than application-oriented Serverless services. We therefore settled on the evolution route of "build the FaaS function computing service first, then the application-oriented Serverless service."

2.1.2 Infrastructure

Elastic scaling is an essential capability of a Serverless platform, so Serverless inevitably involves scheduling and managing the underlying resources. This is why many open-source Serverless products (such as OpenFaaS, Fission, Nuclio, and Knative) are implemented on top of Kubernetes: this choice takes full advantage of Kubernetes' infrastructure management capabilities. Meituan's internal infrastructure product is Hulk. Although Hulk is packaged on top of Kubernetes, for historical reasons related to the difficulty of adoption it does not use Kubernetes natively, and it adopts a "rich container" model at the container layer.

Against this historical background, we faced two options for the infrastructure: use Hulk as Nest's infrastructure (non-native Kubernetes), or use native Kubernetes. We judged that native Kubernetes is the mainstream trend in the industry, and that using it would let us take full advantage of Kubernetes' native capabilities and reduce duplicated development. We therefore adopted native Kubernetes as our infrastructure.

2.1.3 Development Language

Golang is the dominant language in the cloud-native space and, in particular, in the Kubernetes ecosystem. At Meituan, however, Java is the most widely used language and has a better internal ecosystem than Golang, so we chose Java. When Nest development began, the Kubernetes community's Java client was not yet mature, but as the project progressed the client improved steadily and is now fully adequate. Along the way, we also contributed several Pull Requests back to the community.

2.2 Architecture Design

Based on the choices above of evolution route, infrastructure, and development language, we designed the architecture of the Nest product.

In the overall architecture, traffic is triggered into the Nest platform by event sources (such as Nginx, application gateways, scheduled tasks, message queues, and RPC calls). The platform routes each request to a specific function instance according to the characteristics of the traffic and triggers the function's execution. The code inside the function can call any of the company's BaaS services, complete its execution, and return the result.

In terms of technical implementation, the Nest platform uses Kubernetes as its foundation and borrows some of Knative's excellent designs where appropriate. Its architecture consists of the following core parts:

  • Event gateway: connects traffic from external event sources and routes it to function instances. The gateway also keeps statistics on each function's inbound and outbound traffic, providing the data on which the elastic scaling module bases its scaling decisions.
  • Elastic scaling: elastically scales function instances. It calculates the target number of instances for a function from the function's traffic data and the configured per-instance threshold, then adjusts the instance count through Kubernetes' resource control capabilities.
  • Controller: implements the control logic for Nest's Kubernetes CRDs (Custom Resource Definitions); see the sketch below.
  • Function instance: a running instance of a function. When the event gateway routes traffic to it, the corresponding function code logic executes inside the instance.
  • Governance platform: a user-facing platform responsible for building, versioning, and releasing functions, and for managing function metadata.
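The controller follows the standard Kubernetes reconcile pattern: watch the declared (desired) state in custom resources and continuously drive the actual state toward it. Below is a minimal sketch of that loop in Java, the platform's development language; the NestFunction record and FunctionStore interface are hypothetical stand-ins for Nest's real CRD types and Kubernetes client code.

```java
import java.util.Objects;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical view of a Nest CRD object: desired vs. observed state.
record NestFunction(String name, int desiredReplicas, int observedReplicas) {}

// Hypothetical client; a real controller would use the Kubernetes Java client.
interface FunctionStore {
    Iterable<NestFunction> listFunctions();
    void scaleTo(String name, int replicas);
}

public class FunctionController {
    private final FunctionStore store;
    private final ScheduledExecutorService loop = Executors.newSingleThreadScheduledExecutor();

    public FunctionController(FunctionStore store) {
        this.store = Objects.requireNonNull(store);
    }

    /** Periodically reconcile every function: observe, compare, act. */
    public void start() {
        loop.scheduleWithFixedDelay(() -> {
            for (NestFunction fn : store.listFunctions()) {
                if (fn.observedReplicas() != fn.desiredReplicas()) {
                    // Drive actual state toward desired state.
                    store.scaleTo(fn.name(), fn.desiredReplicas());
                }
            }
        }, 0, 1, TimeUnit.SECONDS);
    }
}
```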

2.3 Process Design

How does Nest's CI/CD flow differ from the traditional model? To illustrate, let's look at the overall life cycle of a function on the Nest platform, which consists of four phases: build, version, deployment, and scaling.

  • Build: the developed code and configuration are built to generate an image or executable.
  • Versioning: the generated image or executable, together with the release configuration, forms an immutable version.
  • Deployment: releasing a version, which completes the deployment.
  • Scaling: elastically scaling a function's instances based on its traffic and load.

Of these four phases, Nest differs fundamentally from traditional CI/CD in deployment and scaling. Traditional deployment is machine-aware, typically distributing a code package to specific machines, whereas Serverless is meant to shield machines from users (at deployment time, the number of function instances may still be zero). In addition, the traditional model generally lacks dynamic scaling, while a Serverless platform expands and shrinks capacity dynamically according to business traffic. Since elastic scaling is covered in detail in a later chapter, here we discuss only the deployment design.

The core question of deployment is: how do you shield machines from users? To solve this, we abstracted away the machine and introduced the concept of a group, composed of three pieces of information: SET (the unit-architecture identifier, attached to machines), swimlane (the test-environment isolation identifier, attached to machines), and region (Shanghai, Beijing, and so on). A user deployment only operates on the appropriate group, never on specific machines. The Nest platform manages machine resources on the user's behalf, and each deployment initializes the corresponding machine instances in real time based on the group information.
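To illustrate the grouping abstraction, a group can be modeled as a simple value object that deployments target instead of machines. This is a minimal sketch under assumed names (FunctionGroup, InstanceProvisioner, and Deployer are all hypothetical):

```java
import java.util.List;

// A group abstracts machines away: deployments target (SET, swimlane, region).
record FunctionGroup(String set, String swimlane, String region) {}

interface InstanceProvisioner {
    // Hypothetical: asks the platform for instances matching the group labels.
    List<String> provision(FunctionGroup group, int count);
}

class Deployer {
    private final InstanceProvisioner provisioner;

    Deployer(InstanceProvisioner provisioner) {
        this.provisioner = provisioner;
    }

    /** Deploy a version to a group; the user never names a machine. */
    void deploy(String functionVersion, FunctionGroup group, int initialInstances) {
        List<String> instances = provisioner.provision(group, initialInstances);
        instances.forEach(i -> System.out.printf("rolling %s out to %s%n", functionVersion, i));
    }
}
```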

2.4 Function Triggering

Function execution is triggered by events. To trigger a function, the following four processes must be implemented:

  • Traffic import: register the event gateway with the event source and import traffic to the gateway. For an MQ event source, for example, MQ traffic is imported by registering an MQ consumer group.
  • Traffic adaptation: the event gateway adapts the incoming traffic from each event source.
  • Function discovery: the process of obtaining function metadata (instance information, configuration, and so on), similar to service discovery in microservices. Event traffic received by the gateway must be sent to specific function instances, which requires function discovery. In essence, discovery here retrieves information stored in Kubernetes built-in resources or CRD resources.
  • Function routing: the process of routing event traffic to a specific function instance. To support traditional routing logic (SET, swimlane, and region routing) as well as version routing, we use multi-layer routing: the first layer routes to the group (SET, swimlane, region), the second layer routes to a specific version, and a load balancer then selects an instance within that version (see the sketch below). With version routing, we also easily support canary and blue-green releases.
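Here is a minimal sketch of that two-layer routing in Java. The group key encoding, the types, and the round-robin balancer are illustrative assumptions, not Nest's actual implementation:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

record Instance(String host, int port) {}

class FunctionRouter {
    // Layer 1: group key (set|swimlane|region) -> version; Layer 2: version -> instances.
    private final Map<String, String> groupToVersion;
    private final Map<String, List<Instance>> versionToInstances;
    private final AtomicInteger counter = new AtomicInteger();

    FunctionRouter(Map<String, String> groupToVersion,
                   Map<String, List<Instance>> versionToInstances) {
        this.groupToVersion = groupToVersion;
        this.versionToInstances = versionToInstances;
    }

    Instance route(String set, String swimlane, String region) {
        // First layer: route to the group, which resolves to a deployed version.
        String version = groupToVersion.get(set + "|" + swimlane + "|" + region);
        if (version == null) throw new IllegalStateException("no version for group");
        // Second layer: pick an instance of that version via round-robin.
        List<Instance> instances = versionToInstances.get(version);
        return instances.get(Math.floorMod(counter.getAndIncrement(), instances.size()));
    }
}
```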

2.5 Function Execution

Unlike a traditional service, which is an executable program, a function is a snippet of code that cannot be executed by itself. So once traffic has been triggered to a function instance, how does the function actually execute?

The first problem of function execution is the runtime environment. Since the Nest platform is built on Kubernetes, every function runs in a Kubernetes Pod: inside the Pod is a container, inside the container is a runtime, and the runtime is the entry point for the function's traffic, ultimately triggering the function's execution. Everything seemed to be going smoothly, but we still ran into difficulties in implementation, the biggest being letting developers seamlessly use the company's components inside their functions, such as OCTO (service framework), Cellar (cache system), and DB.

In Meituan's technology system, years of accumulated technology make it difficult to run the company's business logic in a bare container (one without any other dependencies), because many environment and service-governance capabilities have been deposited into the company's containers, such as the service-governance Agent, instance environment configuration, and network configuration.

So, to let businesses seamlessly use the company's components inside functions, we reused the company's container system to reduce the cost of writing functions. However, nobody at the company had tried this route before: Nest was the company's first platform built on native Kubernetes, and a "first mover" inevitably hits a few potholes. We had no choice but to cut paths through the mountains and bridge the rivers as we advanced, solving the problems one by one. The most critical work was opening up the container startup to technical systems such as CMDB, so that a container running a function is no different from a machine a developer would normally apply for.
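To make the Pod → container → runtime → function chain concrete, here is a minimal sketch of a runtime entry point built on the JDK's built-in HTTP server; the FunctionHandler contract, the port, and the /invoke path are hypothetical stand-ins for whatever the real runtime exposes.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// Hypothetical contract that the user's function code implements.
interface FunctionHandler {
    byte[] handle(byte[] payload) throws Exception;
}

public class NestRuntime {
    public static void serve(FunctionHandler handler) throws Exception {
        // The runtime is the traffic entrance inside the container:
        // it receives gateway requests and triggers the user function.
        HttpServer server = HttpServer.create(new InetSocketAddress(9000), 0);
        server.createContext("/invoke", exchange -> {
            byte[] result;
            int status;
            try (InputStream in = exchange.getRequestBody()) {
                result = handler.handle(in.readAllBytes());
                status = 200;
            } catch (Exception e) {
                result = e.toString().getBytes();
                status = 500;
            }
            exchange.sendResponseHeaders(status, result.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(result);
            }
        });
        server.start();
    }
}
```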

2.6 Elastic Scaling

Elastic scaling has three core questions: when to scale, by how much, and how fast? In other words: scaling timing, the scaling algorithm, and scaling speed.

  • Scaling timing: the expected number of instances is calculated in real time from traffic Metrics, which come from the event gateway; the gateway mainly tracks each function's concurrency, and the elastic scaling component proactively fetches these Metrics from the gateway once per second.
  • Scaling algorithm: concurrency ÷ per-instance threshold = expected number of instances (see the sketch after this list). Based on the collected Metrics and the thresholds configured by the business, the component computes the expected instance count and sets it through the Kubernetes interface. Although the algorithm looks simple, it is very stable and robust.
  • Scaling speed: depends on the cold start time, which is covered in the next section.
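A minimal sketch of that scaling decision, with the expected count clamped to the configured minimum and maximum instance numbers (all names are assumed for illustration):

```java
public final class ScalingAlgorithm {
    private ScalingAlgorithm() {}

    /**
     * expected = ceil(concurrency / perInstanceThreshold),
     * clamped to [minInstances, maxInstances].
     */
    public static int expectedInstances(double concurrency,
                                        double perInstanceThreshold,
                                        int minInstances,
                                        int maxInstances) {
        int expected = (int) Math.ceil(concurrency / perInstanceThreshold);
        return Math.max(minInstances, Math.min(maxInstances, expected));
    }

    public static void main(String[] args) {
        // 350 concurrent requests, threshold 100 per instance -> 4 instances.
        System.out.println(expectedInstances(350, 100, 0, 50));
    }
}
```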

Beyond the basic scaling capability, we also support scaling to zero and configurable maximum and minimum instance counts (the minimum instances being reserved instances). To implement scaling to zero, we added an activator module inside the event gateway: when a function has no instances, its request traffic is cached inside the activator, whose traffic Metrics immediately drive the elastic scaling component to expand; once the new instances have started, the activator replays the cached requests against them to trigger function execution.
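The activator idea can be sketched as follows, assuming hypothetical Scaler and Forwarder interfaces in place of the real scaling and forwarding components:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class Activator {
    interface Scaler { void scaleFromZero(String function); }        // assumed
    interface Forwarder { void send(String function, byte[] request); } // assumed

    private final BlockingQueue<byte[]> pending = new LinkedBlockingQueue<>();
    private final Scaler scaler;
    private final Forwarder forwarder;
    private volatile boolean hasInstances = false;

    Activator(Scaler scaler, Forwarder forwarder) {
        this.scaler = scaler;
        this.forwarder = forwarder;
    }

    /** Called by the gateway when a request arrives for a scaled-to-zero function. */
    void onRequest(String function, byte[] request) throws InterruptedException {
        if (hasInstances) {
            forwarder.send(function, request);
            return;
        }
        pending.put(request);           // buffer while there are no instances
        scaler.scaleFromZero(function); // drive an immediate expansion
    }

    /** Called once a newly expanded instance reports ready. */
    void onInstanceReady(String function) {
        hasInstances = true;
        for (byte[] req; (req = pending.poll()) != null; ) {
            forwarder.send(function, req); // replay buffered requests
        }
    }
}
```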

3 Optimize Core Technologies to Ensure Business Stability

3.1 Elastic Scaling Optimization

The scaling timing, scaling algorithm, and scaling speed described above form an ideal model; the scaling speed in particular falls short, since current technology cannot achieve millisecond-level expansion. In actual online scenarios, elastic scaling may therefore fail to meet expectations: instances may scale too frequently, or capacity may be expanded too late, making the business unstable.

  • To solve frequent scaling, we maintain a sliding window of statistics in the elastic scaling component, smooth the metric by taking its mean, and mitigate the problem further by delaying shrinkage while expanding in real time. We also added a scaling strategy based on the QPS metric, since QPS is more stable than concurrency.
  • To solve late expansion, we expand in advance: triggering expansion once 70% of the per-instance threshold is reached alleviates the problem well (see the sketch below). We also support multi-metric hybrid scaling (concurrency, QPS, CPU, memory) and scheduled scaling to meet diverse business requirements.
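A minimal sketch of the smoothing and advance-expansion ideas, assuming one-second concurrency samples and the 0.7 threshold usage factor from the case below:

```java
import java.util.ArrayDeque;
import java.util.Deque;

class SmoothedScaler {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int windowSize;        // e.g. 10 one-second samples
    private final double thresholdUsage; // e.g. 0.7 -> expand at 70% of threshold

    SmoothedScaler(int windowSize, double thresholdUsage) {
        this.windowSize = windowSize;
        this.thresholdUsage = thresholdUsage;
    }

    /** Feed one concurrency sample and get the smoothed expected instance count. */
    int onSample(double concurrency, double perInstanceThreshold) {
        window.addLast(concurrency);
        if (window.size() > windowSize) window.removeFirst(); // slide the window
        double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        // Scaling against a reduced effective threshold expands ahead of time.
        double effectiveThreshold = perInstanceThreshold * thresholdUsage;
        return (int) Math.ceil(mean / effectiveThreshold);
    }
}
```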

The following figure shows a real online elastic scaling case (minimum instances configured as 4, per-instance threshold 100, threshold usage factor 0.7). The upper part is the service's requests per second; the lower part is the decision graph for scaling instances. The service coped with the traffic peak perfectly, with a success rate of 100%.

3.2 Cold Start Optimization

Cold start covers the portion of the function invocation chain that includes resource scheduling, image/code download, container start, runtime initialization, user code initialization, and so on. Once the cold start completes and the function instance is ready, subsequent requests are executed directly by the function. Cold start is critical in the Serverless space: the time it takes determines the speed of elastic scaling.

As the martial arts saying goes, "no defense is unbreakable; only speed is unbeatable," and it applies to the Serverless field as well. Imagine that if an instance could be pulled up fast enough, say at the millisecond level, almost every function could be scaled down to zero and expanded only when traffic arrives; for businesses with pronounced traffic peaks and troughs, this would save enormous machine resource costs. Of course, reality falls far short of that ideal: millisecond-level cold starts are nearly impossible today. Still, the shorter the cold start, the lower the cost, and an extremely short cold start also greatly benefits the availability and stability of scaling.

Cold start optimization was a gradual process; we went through three main stages: image startup optimization, resource pool optimization, and core path optimization.

  • Image startup optimization: we made targeted optimizations to the image-related stages of startup (container start and runtime initialization), focusing on key points such as container IO rate limiting, the startup time of certain special Agents, and the copying of startup and disk data. These optimizations cut the platform-side startup time from 42s to about 12s.

  • Resource pool optimization: with image startup down to 12s we had almost reached the bottleneck, with little room for further gains. So we asked: can we bypass image startup altogether? In the end we adopted the simple idea of trading space for time with a resource pool scheme: cache a number of already started instances, and when expansion is needed, take instances directly from the pool, bypassing the image-based container start entirely. The effect was striking, cutting the platform-side startup time from 12s to 3s (see the sketch after this list). Note that the resource pool itself is also managed by a Kubernetes Deployment, so instances removed from the pool are replenished automatically and immediately.

  • Core path optimization: building on the resource pool, we pushed further and optimized the two time-consuming startup steps of code download and decompression, adopting high-performance compression algorithms (LZ4 and Zstd) together with parallel download and decompression, to very good effect. We also support sinking common logic (middleware, dependency packages, and so on) out of the function package; with preloading, this optimized the end-to-end function startup time to 2s, meaning it takes only 2s to expand a function instance (including function startup). Excluding the initial startup time of the function code itself, the platform-side time is at the millisecond level.
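The "space for time" resource pool can be sketched as below; the Supplier stands in for the Kubernetes Deployment that actually creates and replenishes warm instances:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Supplier;

class WarmInstancePool<T> {
    private final BlockingQueue<T> pool = new LinkedBlockingQueue<>();
    private final Supplier<T> provisioner; // stand-in for the K8s Deployment
    private final int targetSize;

    WarmInstancePool(Supplier<T> provisioner, int targetSize) {
        this.provisioner = provisioner;
        this.targetSize = targetSize;
        for (int i = 0; i < targetSize; i++) pool.add(provisioner.get()); // pre-warm
    }

    /** Take a started instance, skipping the image-based container start. */
    T acquire() throws InterruptedException {
        T instance = pool.take();
        // Replenish asynchronously so the pool stays at its target size.
        new Thread(() -> pool.add(provisioner.get())).start();
        return instance;
    }
}
```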

3.3 High Availability Guarantee

High availability usually means the high availability of the platform itself, but for Nest it also covers the business functions hosted on the platform. Nest's high availability therefore has to be guaranteed on both the platform side and the business-function side.

3.3.1 Platform High Availability

For the platform's own high availability, Nest provides comprehensive guarantees at the architecture layer, the service layer, the monitoring and operations layer, and the business-perspective layer.

  • Architecture layer: we adopt a master-slave architecture for stateful services such as the elastic scaling module, with a slave node taking over immediately if the master fails. We also implemented multiple layers of architectural isolation. Horizontal (geographical) isolation: the two Kubernetes clusters in Beijing and Shanghai are strongly isolated, and the service clusters (event gateway, elastic scaling) in the two places are weakly isolated (the Shanghai elastic scaling component is only responsible for scaling businesses on the Shanghai Kubernetes cluster, while the event gateway, which must serve calls across both places, needs access to both Kubernetes clusters). Vertical (business-line) isolation: business lines are strongly isolated at the service layer, with different lines using different cluster services; at the Kubernetes layer, their resources are weakly isolated by namespace.

  • Service layer: mainly the event gateway service. Since all function traffic passes through the event gateway, its availability is particularly important; at this layer we support rate limiting and asynchronous processing to ensure the service's stability.
  • Monitoring and operations layer: mainly improving system monitoring and alerting, sorting out the core links, and pushing the parties we depend on toward better governance. We also review SOPs regularly and run fault-injection drills on the fault drill platform to uncover hidden problems in the system.
  • Business-perspective layer: we built an online, continuously running inspection service that simulates user function request traffic to detect in real time whether the system's core links are healthy.

3.3.2 Business High Availability

For the high availability of business functions, Nest provides guarantees at the service layer and the platform layer.

  • Service layer: supports degradation and rate limiting for business functions. When a back-end function fails, the configured fallback result can be returned as a degraded response; for abnormal function traffic, the platform can throttle the traffic to prevent back-end function instances from being overwhelmed.

  • Platform layer: supports instance keep-alive, multi-layer disaster recovery, and rich monitoring and alerting. When a function instance becomes abnormal, the platform automatically isolates it and immediately expands a new instance. The platform supports deployment across multiple regions, and function instances within a region can be scattered across machine rooms; when a host, machine room, or region fails, new instances are immediately created on an available host, machine room, or region. The platform also automatically monitors functions on metrics such as latency, success rate, instance scaling, and request count, and when these metrics fall short of expectations it automatically raises alarms to notify the function's developers and administrators.

3.4 Container Stability Optimization

As mentioned earlier, Serverless differs from the traditional CI/CD process: the traditional model provisions machines first and then deploys the program, while Serverless scales instances elastically in real time according to traffic peaks and troughs, and newly expanded instances process business traffic immediately. This may sound innocuous, but in a rich-container ecosystem it caused problems: we found that the load on newly expanded machines was very high, causing some business requests to fail and hurting business availability.

The cause: after a container starts, the operations tools begin upgrading Agents and modifying configurations, which consumes a lot of CPU. Inside the same rich container, this naturally steals resources from the function process and makes the user process unstable. Moreover, function instances are generally allocated far fewer resources than traditional service machines, which exacerbates the problem. Referring to industry practice, we cooperated with the container facility team to implement lightweight containers: all operations Agents were moved into a Sidecar container, while the business process runs alone in an App container. This container isolation mechanism guarantees business stability. At the same time, we also drove a container-trimming plan to remove unnecessary Agents.
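The resulting Pod shape can be sketched with the Kubernetes Java client's model classes: an App container holding only the business process and a Sidecar container holding the operations Agents, each with its own resource quota. The images and quotas here are illustrative assumptions, not Nest's actual spec.

```java
import io.kubernetes.client.custom.Quantity;
import io.kubernetes.client.openapi.models.*;

public class PodShape {
    static V1Pod functionPod() {
        V1Container app = new V1Container()
                .name("app")                      // business process only
                .image("function-runtime:latest") // assumed image name
                .resources(new V1ResourceRequirements()
                        .putLimitsItem("cpu", Quantity.fromString("1"))
                        .putLimitsItem("memory", Quantity.fromString("1Gi")));

        V1Container sidecar = new V1Container()
                .name("ops-sidecar")              // all operations Agents live here
                .image("ops-agents:latest")       // assumed image name
                .resources(new V1ResourceRequirements()
                        .putLimitsItem("cpu", Quantity.fromString("200m"))
                        .putLimitsItem("memory", Quantity.fromString("256Mi")));

        return new V1Pod()
                .metadata(new V1ObjectMeta().name("nest-function"))
                .spec(new V1PodSpec()
                        .addContainersItem(app)
                        .addContainersItem(sidecar));
    }
}
```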

4 Improve the Ecosystem and Realize the Benefits

Serverless is a systems engineering effort: technically it involves Kubernetes, containers, operating systems, the JVM, runtimes, and more, and its platform capabilities touch every aspect of the CI/CD process.

To give users the best possible development experience, we provide development tools such as a CLI (Command Line Interface) and a WebIDE. To solve interaction problems with existing upstream and downstream technology products, we integrated Nest with the company's existing technology ecosystem, making it easy for developers to use. To make integration convenient for downstream platforms, we opened up the platform's APIs, enabling Nest to empower each of them. And to address the low resource utilization of low-frequency business functions caused by heavy containers and high system overhead, we support merged deployment of functions, which doubles resource utilization.

4.1 Providing R&D Tools

Development tools lower the cost of using the platform and help developers move through the CI/CD process quickly. Nest currently provides a CLI tool that helps developers quickly create applications and perform local build, local test, debugging, and remote release operations. Nest also offers a WebIDE, a one-stop online environment for modifying, building, publishing, and testing code.

4.2 Integrating the Technology Ecosystem

Supporting these development tools alone was not enough. As the product was promoted, we soon found developers had new demands: for example, they could not operate on functions from the release Pipeline or from the offline service instance orchestration platform, which became an obstacle to adoption. We therefore integrated Nest with the company's mature upstream and downstream technology systems, removing these worries for users.

4.3 Opening Platform Capabilities

Nest has many downstream solution platforms, such as SSR (Server Side Rendering) and service orchestration platforms, which further liberate productivity by integrating with Nest's OpenAPI. For example, a user can quickly create, publish, and host an SSR project or an orchestration program from 0 to 1 without applying for, managing, or operating machine resources.

Nest not only opens up the platform's APIs but also gives users the ability to customize resource pools. With this capability, developers can build their own resource pool, customize their machine environment, and even sink common logic into it to further optimize cold starts.

4.4 Supporting Merged Deployment

Merged deployment means deploying multiple functions inside a single machine instance. There are two main motivations behind it:

  • Current containers are heavy and their system overhead is high, so the resource utilization of business processes is low (especially for low-frequency businesses).
  • When the cold start time cannot meet a business's latency requirements, we use reserved instances to meet them.

Given these two motivations, we decided to support merged deployment, deploying some low-frequency functions into the same machine instance to raise the resource utilization of the business processes inside reserved instances.

In implementation, we referred to Kubernetes' design and built a Sandbox-based merged deployment system (each Sandbox being one function resource): a Pod is analogous to a Kubernetes Node, a Sandbox is analogous to a Kubernetes Pod, and the Nest Sidecar is analogous to the Kubelet. To implement Sandbox-specific deployment, scheduling, and related capabilities, we also customized several Kubernetes resources (SandboxDeployment, SandboxReplicaSet, SandboxEndpoints, and so on) to support dynamically plugging functions into, and unplugging them from, specific Pod instances.

Under merged deployment, isolation between functions is an unavoidable problem. To minimize interference between functions merged into the same instance, we adopt different runtime strategies according to the characteristics of Node.js and Java: Node.js functions are isolated in separate processes, while Java functions are isolated through class loading (see the sketch below). The main reason for the split is that a Java process occupies far more memory than a Node.js process.
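A minimal sketch of classloader isolation for Java functions: each function's code is loaded by its own URLClassLoader, so two functions merged into the same JVM cannot see each other's classes or static state. The jar paths and class names are hypothetical.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;

class FunctionIsolation {
    /**
     * Load one function's entry class in its own classloader.
     * Each function merged into the same JVM gets a separate loader,
     * so their classes (and static state) are invisible to each other.
     */
    static Object loadFunction(Path functionJar, String entryClass) throws Exception {
        URLClassLoader loader = new URLClassLoader(
                new URL[]{functionJar.toUri().toURL()},
                FunctionIsolation.class.getClassLoader()); // shared parent for JDK classes
        Class<?> clazz = Class.forName(entryClass, true, loader);
        return clazz.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical jars; each function is isolated in its own loader.
        Object fnA = loadFunction(Path.of("/functions/a.jar"), "com.example.AHandler");
        Object fnB = loadFunction(Path.of("/functions/b.jar"), "com.example.BHandler");
        System.out.println(fnA.getClass().getClassLoader() != fnB.getClass().getClassLoader());
    }
}
```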

5 Landing Scenarios and Benefits

Nest is currently very popular with Meituan's front-end Node.js teams, the technology stack where it is most widely adopted. Nest has achieved large-scale adoption across Meituan's front end, covering almost all business lines and carrying a large amount of core B-side and C-side traffic.

5.1 Landing Scenarios

The landing scenarios include front-end scenarios such as Backend For Frontend (BFF), Client Side Rendering (CSR), and Server Side Rendering (SSR), as well as back-office management platforms, scheduled tasks, and data processing.

  • BFF scenario: the BFF layer mainly aggregates data for front-end pages. With the Serverless model, front-end developers no longer need to handle the operations work they are less familiar with, and BFF evolves easily into the SFF (Serverless For Frontend) model.
  • CSR/SSR scenario: CSR and SSR are client-side and server-side rendering. With the Serverless platform, more front-end businesses can try SSR to render the first screen quickly, without having to think about operations.
  • Back-office management platform scenario: the company has many back-office Web services. Although heavier than functions, they can be hosted directly on the Serverless platform and fully enjoy its publishing and operations efficiency.
  • Scheduled task scenario: the company has many periodic tasks, such as pulling data every few seconds, clearing logs at midnight, or aggregating data and generating reports every hour. The Serverless platform integrates directly with the task scheduling system, so writing the task's processing logic and configuring a timer trigger on the platform completes the access, with no machine resources to manage.
  • Data processing scenario: with an MQ Topic connected to the Serverless platform as an event source, the platform automatically subscribes to the Topic's messages; when messages arrive for consumption, the function is triggered, much like the scheduled task scenario. Users only write the data processing logic and configure the MQ trigger on the platform to complete the consumer-side access (see the sketch below), with no machine resources to manage at all.
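To show how little the user writes in this scenario, here is a hypothetical handler sketch; the MessageHandler contract is an assumed stand-in for Nest's real trigger interface, which the source does not specify.

```java
// Hypothetical contract: the platform's MQ trigger invokes this per message,
// so the user ships only business logic -- no consumer wiring, no machines.
interface MessageHandler {
    void onMessage(String topic, byte[] body) throws Exception;
}

class OrderReportHandler implements MessageHandler {
    @Override
    public void onMessage(String topic, byte[] body) {
        // Pure data-processing logic; scaling and consumption are the platform's job.
        String record = new String(body);
        System.out.printf("aggregate order record from %s: %s%n", topic, record);
    }
}
```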

5.2 Landing Benefits

The benefits of Serverless are evident, especially in the front-end field, where the sheer volume of business access speaks for itself. The benefits can be seen from two aspects:

  • Cost reduction: Serverless's elastic scaling raises the resource utilization of high-frequency businesses to 40%-50%, while merged deployment greatly reduces the running cost of low-frequency business functions.
  • Efficiency improvement: overall R&D efficiency increased by about 40%.
      • In code development: complete tooling such as the CLI and WebIDE helps developers generate code scaffolding, focus on writing business logic, and finish local testing quickly; business services also get log viewing and online monitoring at no extra cost.
      • In releasing: the cloud-native model means services never apply for machines, and publishing and rollback are second-level experiences; combined with the event gateway, the platform natively supports traffic cutover, canary releases, and more.
      • In daily operations: services no longer need to worry about machine failures, insufficient resources, or machine-room disaster recovery; when a business process becomes abnormal, Nest automatically isolates the abnormal instance and quickly replaces it, reducing the impact on the business.

6 Future Planning

  • Scenario-based solutions: many different scenarios access Serverless, such as SSR, back-office management, and BFF, and each has its own project templates and scenario configurations (scaling configuration, trigger configuration, and so on); configurations also differ across languages. All of this quietly raises the cost of use for businesses and hinders new business access. We are therefore considering building the platform around scenarios: platform capabilities will be strongly associated with scenarios, and the platform will deeply precipitate the basic configurations and resources of each scenario, so that in any given scenario a business only needs a simple configuration to start using Serverless.
  • Serverless for traditional microservices: that is, the application-oriented Serverless services mentioned in the evolution route. Java is the most widely used development language at Meituan, and the company has many traditional microservice projects for which migrating to the function model would be unrealistic. Imagine if these projects could directly enjoy the technical dividends of Serverless without any transformation; the business value is self-evident. Serverless for traditional microservices is therefore an important direction for expanding our business in the future. On the implementation path, we will consider combining the service governance system (such as ServiceMesh) with Serverless: the governance components will provide scaling-metric support for Serverless and enable precise traffic distribution during scaling.
  • Cold start optimization: although function cold starts have been optimized well, and the platform-side startup time in particular now has very limited room for improvement, the startup time of the business code itself remains very prominent, especially for traditional Java microservices, which often take minutes to start. Our subsequent cold start work will therefore focus on the startup time of the business itself, striving to reduce it greatly; concretely, we will consider technologies such as AppCDS and GraalVM.
  • Other plans:
      • Enrich and improve R&D tools, such as IDE plug-ins, to improve R&D efficiency.
      • Open up the upstream and downstream technology ecosystem, integrating deeply into the company's existing technology systems to reduce the friction caused by upstream and downstream platforms.
      • Lightweight containers: lighter containers bring better startup times and higher resource utilization, so container lightweighting is a goal Serverless pursues relentlessly. Concretely, the container facility team plans to move some in-container Agents to DaemonSet deployment, sinking them to the host machine to free up the container's payload for the business.

About the Authors

  • Yin Qi, Hua Shen, Fei Fei, Zhi Yang, Yi Kun, and others, from the Application Middleware team of the Infrastructure Department.
  • Jia Wen, Kai Xin, Ya Hui, and others, from the Big Front End team of the Financial Technology Platform.

Recruitment Information

The Meituan infrastructure team is looking for senior technical experts based in Beijing and Shanghai. We are committed to building a unified, high-concurrency, high-performance distributed infrastructure platform for Meituan, covering the main infrastructure areas of databases, distributed monitoring, service governance, high-performance communication, message-oriented middleware, basic storage, containerization, and cluster scheduling. If you are interested, please send your resume to [email protected].
