Design and Implementation of Shepherd, an API Gateway Service Handling Ten Billion Calls a Day

In a microservices architecture, splitting services multiplies the number of APIs, and using an API gateway to manage them has become a trend. Against this background, Meituan built Shepherd, its unified API gateway service. Shepherd was developed entirely in-house to fit Meituan’s business. It replaces the traditional Web-layer gateway application, so business developers can expose functions and data externally through configuration alone. This article covers the background of Meituan’s unified API gateway, its key technical design and implementation, and the future plans for the gateway. We hope it offers some help or inspiration.

I. Background

1.1 What is an API Gateway?

An API gateway is an architectural pattern that emerged alongside microservices. A large all-in-one business system is split into many microservices that are maintained and deployed independently. The split multiplies the number of APIs and makes them harder to manage, so publishing and managing APIs through an API gateway has become a trend. Generally speaking, an API gateway is the traffic entry point between external requests and internal services. It implements general, non-business functions for external requests, such as protocol conversion, authentication, flow control, parameter validation, and monitoring.

1.2 Why did we build the Shepherd API gateway?

Before Shepherd, a Meituan business developer who wanted to expose an internal service as an external HTTP API usually had to build a Web application to handle basic authentication, rate limiting, logging and monitoring, parameter validation, and protocol conversion. The developer also had to maintain that code and keep upgrading the underlying components, so development efficiency was relatively low. In addition, every Web application needed its own machines, configuration, and databases, so resource utilization was poor.

Because no ready-made solution existed, some business lines within Meituan developed their own business-specific API gateways. Elsewhere in the industry, companies such as Amazon, Alibaba, and Tencent also have mature API gateway solutions.

The Shepherd API gateway project was therefore officially launched. Our goal is to provide Meituan with a high-performance, highly available, and scalable unified API gateway solution that lets business developers expose functions and data externally through configuration.

1.3 What are the benefits of using Shepherd?

From a business developer’s perspective, what are the benefits of adopting the Shepherd API gateway? In short, there are three.

Improved R&D efficiency
- Business developers can expose a service interface quickly, just by configuring it.
- Shepherd provides non-business basic capabilities, such as authentication, rate limiting, and circuit breaking, in one place.
- Business developers can extend the gateway’s capabilities by developing custom components.
Lower communication costs
- Once an API is configured, the front-end/back-end interaction document and the client SDK are generated automatically, which eases integration and joint debugging between front-end and back-end developers.
Better resource utilization
- APIs are fully managed on a Serverless-style architecture, so business developers do not need to worry about machine resources.

II. Technical design and implementation

2.1 Overall Architecture

Let’s start by looking at the overall architecture of the Shepherd API gateway, as shown below:

The control plane of the Shepherd API gateway consists of the Shepherd management platform and the Shepherd monitoring center. The management platform handles API lifecycle management and configuration delivery; the monitoring center collects monitoring data for API requests and raises alarms.

The configuration center handles the exchange of information between the control plane and the data plane. It is built on Lion, Meituan’s unified configuration service.

The data plane of the Shepherd API gateway is the Shepherd server. A complete API request may originate from a mobile application, a Web application, a partner, or an internal system. It passes through the Nginx load-balancing system and arrives at the server. The server integrates a series of basic functional components and business-specific custom components, calls the back-end RPC service, HTTP service, function service, or service orchestration service through a generic invocation, and finally returns the response.

These three main modules are described in detail below.

2.1.1 Control plane

With the control plane of the API gateway, business developers can easily manage the full lifecycle of an API, as shown in the following figure:

A business developer starts by creating an API, entering its parameters, and generating the DSL script, and can then test the API using the documentation and Mock functions. Once testing is complete, the Shepherd management platform provides safeguards such as release approval, grayscale (canary) rollout, and version rollback to keep releases stable. While the API is running, the platform monitors call failures, records request logs, and raises an alarm as soon as an exception is found. Finally, when an API that is no longer used is taken offline, all the resources it occupied are reclaimed, and the API waits to be enabled again.

Business developers manage the whole lifecycle themselves through configuration and workflows. Getting started takes less than 10 minutes, which greatly improves R&D efficiency.

2.1.2 Configuration center

The configuration center of the API gateway stores each API’s configuration, described in a custom domain-specific language (DSL). It delivers configuration changes, such as the API’s routes, rules, and components, to the data plane of the API gateway.

The configuration center combines Lion, the unified configuration management service, with a local cache, so configuration can be changed and released dynamically without downtime. The configuration of an API is shown below:

API configuration details:

Name, Group: the name of the API and the group it belongs to.
Request: the domain name, path, and parameters of the request.
Response: result assembly, exception handling, and the headers and cookies of the response.
Filters, FilterConfigs: the functional components used by the API and their configuration.
Invokers: the request rules and orchestration information for the back-end services (RPC/HTTP/Function).
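
The combination of a configuration service and a local cache can be illustrated with a small Java sketch. It is not Shepherd’s code and does not use Lion’s API: the class names, the reduced set of fields, and the version number used to ignore stale pushes are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, heavily reduced view of one API's configuration (not the real DSL). */
final class ApiConfig {
    final String name;
    final String group;
    final String path;           // part of the "Request" section
    final List<String> filters;  // functional components, in execution order
    final int version;           // assumed: increases with every published change

    ApiConfig(String name, String group, String path, List<String> filters, int version) {
        this.name = name;
        this.group = group;
        this.path = path;
        this.filters = Collections.unmodifiableList(filters);
        this.version = version;
    }
}

/** Local cache that a configuration-service listener would update on each change. */
final class LocalConfigCache {
    private final Map<String, ApiConfig> byName = new ConcurrentHashMap<>();

    /** Called when a new or changed API configuration is pushed to this node. */
    void onConfigChanged(ApiConfig updated) {
        // Keep the higher version so a delayed, stale push cannot overwrite a newer one.
        byName.merge(updated.name, updated,
                (current, incoming) -> incoming.version >= current.version ? incoming : current);
    }

    /** Called when an API is taken offline. */
    void onConfigRemoved(String name) {
        byName.remove(name);
    }

    /** Request threads read from memory only; no remote call sits on the request path. */
    ApiConfig get(String name) {
        return byName.get(name);
    }
}
```

Because each update swaps a single map entry atomically, requests that are already in flight keep the configuration object they started with, which is one way to change configuration without restarting the server.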

2.1.3 Data plane

API routing

After picking up an API configuration, the data plane of the API gateway builds an in-memory routing table from request paths to API configurations. HTTP request paths often contain path variables. For performance reasons, Shepherd does not use regular-expression matching; it stores routes in two data structures instead, as shown below:

The first is a map for paths that contain no path variables. The key is the complete domain name and path, and the value is the corresponding API configuration.

The second is a prefix tree (trie) for paths that contain path variables. Matching proceeds segment by segment: the exact-match node is tried first and pushed onto a stack. If the search fails further down, the node at the top of the stack is popped and the variable node at the same level is pushed instead. If that also fails, the search keeps backtracking until a matching route is found or every alternative has been exhausted.
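
The two structures can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not Shepherd’s implementation: the class and method names are hypothetical, the API configuration is reduced to a string, and recursion plays the role of the explicit stack described above.

```java
import java.util.HashMap;
import java.util.Map;

final class ApiRouter {
    /** Trie node: literal children keyed by segment, plus at most one path-variable child. */
    private static final class Node {
        final Map<String, Node> literals = new HashMap<>();
        Node variable; // child for a segment such as {id}
        String api;    // non-null if a route ends at this node
    }

    private final Map<String, String> exact = new HashMap<>(); // routes without path variables
    private final Node root = new Node();

    void register(String path, String api) {
        if (!path.contains("{")) {
            exact.put(path, api);
            return;
        }
        Node node = root;
        for (String segment : segments(path)) {
            if (segment.startsWith("{") && segment.endsWith("}")) {
                if (node.variable == null) {
                    node.variable = new Node();
                }
                node = node.variable;
            } else {
                node = node.literals.computeIfAbsent(segment, key -> new Node());
            }
        }
        node.api = api;
    }

    /** Returns the matching API, or null if no route matches. */
    String lookup(String path) {
        String hit = exact.get(path);
        return hit != null ? hit : match(root, segments(path), 0);
    }

    private static String match(Node node, String[] segs, int index) {
        if (index == segs.length) {
            return node.api;
        }
        Node literal = node.literals.get(segs[index]);
        if (literal != null) {
            String found = match(literal, segs, index + 1);
            if (found != null) {
                return found;
            }
        }
        // Backtrack: the literal branch failed, so try the variable node at the same level.
        return node.variable == null ? null : match(node.variable, segs, index + 1);
    }

    private static String[] segments(String path) {
        String trimmed = path.startsWith("/") ? path.substring(1) : path;
        return trimmed.isEmpty() ? new String[0] : trimmed.split("/");
    }
}
```

With the routes `/orders/vip/{id}/refund` and `/orders/{type}/{id}` registered, a lookup of `/orders/vip/42` first follows the literal `vip` branch, fails because that route needs a further `refund` segment, and then backtracks to the variable branch. Each lookup costs a handful of hash probes rather than a scan over regular expressions.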

Functional components

Once a request hits an API’s request path and enters the server, it is processed by the series of functional components configured in the DSL. The gateway integrates a rich set of functional components, including link tracing, real-time monitoring, access logging, parameter validation, authentication, rate limiting, circuit breaking and degradation, and grayscale traffic splitting, as shown in the following figure.
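
A common way to run a configured sequence of components is a filter chain. The Java sketch below shows the idea only; the `GatewayFilter` interface, the two example filters, and the string responses are hypothetical and are not Shepherd’s component API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical component contract: inspect the request and pass it on or answer it. */
interface GatewayFilter {
    /** Returns null to continue with the next filter, or a response to stop the chain. */
    String apply(Map<String, String> request);
}

final class FilterChain {
    private final List<GatewayFilter> filters = new ArrayList<>();

    FilterChain add(GatewayFilter filter) {
        filters.add(filter);
        return this;
    }

    /** Runs the filters in configured order; the back-end call is reduced to a fixed string. */
    String handle(Map<String, String> request) {
        for (GatewayFilter filter : filters) {
            String earlyResponse = filter.apply(request);
            if (earlyResponse != null) {
                return earlyResponse;
            }
        }
        return "200 OK";
    }

    /** Example filter: parameter validation. */
    static GatewayFilter requireParam(String name) {
        return request -> request.containsKey(name) ? null : "400 missing parameter: " + name;
    }

    /** Example filter: a crude fixed quota standing in for rate limiting. */
    static GatewayFilter quota(int maxRequests) {
        AtomicInteger seen = new AtomicInteger();
        return request -> seen.incrementAndGet() > maxRequests ? "429 too many requests" : null;
    }
}
```

A chain built as `new FilterChain().add(FilterChain.requireParam("token")).add(FilterChain.quota(1))` rejects a request without a token before the quota filter ever sees it, which is why the order of components in the configuration matters.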

Protocol conversion and service invocation

The final step of an API call is protocol conversion and service invocation. The gateway reads the HTTP request parameters and the local Context parameters, assembles the back-end service parameters, converts the HTTP request into the back-end service’s protocol, invokes the back-end service, and converts its result into the HTTP response.

The figure above uses a call to a back-end RPC service as an example. JsonPath expressions extract values from the different parts of the HTTP request, and those values replace the corresponding parts of the RPC request parameters to generate the service parameter DSL. Finally, the call is completed through an RPC generic invocation.
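
The parameter mapping step can be sketched in Java. To stay self-contained, the sketch does not use a real JsonPath library: it supports only plain dotted paths such as `$.query.userId` against nested maps, and the template format and class name are assumptions rather than Shepherd’s DSL.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ParamMapper {
    /** Resolves a dotted path such as "$.query.userId" against nested maps; null if absent. */
    static Object resolve(Map<String, Object> root, String path) {
        if (!path.startsWith("$.")) {
            throw new IllegalArgumentException("unsupported path: " + path);
        }
        Object current = root;
        for (String key : path.substring(2).split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    /**
     * Builds back-end call arguments from a template. Template values that start with "$."
     * are looked up in the HTTP request; all other values are passed through as constants.
     */
    static Map<String, Object> buildArgs(Map<String, String> template,
                                         Map<String, Object> httpRequest) {
        Map<String, Object> args = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : template.entrySet()) {
            String value = entry.getValue();
            args.put(entry.getKey(), value.startsWith("$.") ? resolve(httpRequest, value) : value);
        }
        return args;
    }
}
```

Given a request with `query.userId = "42"` and the template entry `userId -> $.query.userId`, `buildArgs` yields `userId = "42"`; the resulting map is what a generic RPC invocation would receive as its argument.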

2.2 High availability design

As a fundamental component of the access layer, the Shepherd API gateway’s availability has always been a major concern for business developers. Next, let’s look at Shepherd’s practices in high availability design.

2.2.1 Eliminating potential performance risks

To prevent failures, a highly available system must first eliminate potential performance risks and ensure high performance.

Shepherd processes API requests fully asynchronously. The Jetty IO thread submits each request asynchronously to the business-processing thread pool, and back-end services are called through the asynchronous mode of the RPC or HTTP framework. This frees the threads that would otherwise be held while waiting on the network, so the number of threads is no longer the gateway’s bottleneck. The figure below shows the server-side request thread processing logic with the Jetty container.
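
Independently of the figure, the asynchronous hand-off can be sketched with `CompletableFuture`. This is a minimal sketch, not Shepherd’s code: the class name, the pool size, and the `backend` function (a stand-in for an asynchronous RPC or HTTP client) are assumptions.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

final class AsyncGatewaySketch {
    // Business-processing pool; daemon threads so the sketch never keeps the JVM alive.
    private final ExecutorService businessPool = Executors.newFixedThreadPool(4, runnable -> {
        Thread thread = new Thread(runnable, "business");
        thread.setDaemon(true);
        return thread;
    });

    // Stand-in for an asynchronous back-end client that returns a future immediately.
    private final Function<String, CompletableFuture<String>> backend;

    AsyncGatewaySketch(Function<String, CompletableFuture<String>> backend) {
        this.backend = backend;
    }

    /** Called on the IO thread; returns a future at once instead of blocking that thread. */
    CompletableFuture<String> handle(String request) {
        return CompletableFuture
                .supplyAsync(() -> "validated:" + request, businessPool) // components run here
                .thenCompose(backend)                  // no thread waits for the network
                .thenApply(body -> "HTTP 200 " + body) // convert the result to an HTTP response
                .exceptionally(error -> "HTTP 502 backend error");
    }
}
```

No thread is parked while the back-end call is in flight; the continuation runs when the back-end future completes, so a small pool can serve many concurrent requests.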

When we load-tested the end-to-end QPS of a single gateway node through its domain name, we found many timeout errors once QPS exceeded 2,000, even though the gateway server’s load and performance had plenty of headroom. Investigation showed that other Web applications in the company had the same problem. A joint investigation with the Oceanus team found that keep-alive (long) connections between Nginx and Web applications were not enabled and could not be configured. The Oceanus team urgently scheduled, developed, and launched the keep-alive feature, which raised Shepherd’s end-to-end QPS to over 10,000.

In addition, we added API request warm-up to the Shepherd server, so the gateway reaches optimal performance immediately after startup and latency spikes are reduced. We then used CPU hotspot analysis during load tests to find bottlenecks, reduced local log printing on the main path, and made request logging asynchronous and remote. Shepherd’s end-to-end QPS increased again by more than 30%.

After Shepherd had been running stably in production for a year, we optimized performance again and upgraded the network framework: the Jetty container was completely replaced with the Netty network framework. Performance improved by more than 10%, and Shepherd’s end-to-end QPS rose to more than 15,000. The following figure shows the server request thread processing logic with the Netty framework:

2.2.2 Service Isolation

Cluster isolation

Drawing on the experience of mature company components such as caching and task scheduling, Shepherd was designed from the start with cluster isolation by business line, and it also supports independent deployment for critical services, as shown below:

Request isolation

At the service-node level, Shepherd supports isolating requests into fast and slow thread pools. This isolation mainly targets APIs that use synchronous, blocking components, such as SSO authentication or custom authentication, which may block the shared business thread pool for a long time.

Fast/slow isolation works by collecting statistics on the processing time of API requests. API requests whose processing time exceeds the tolerance threshold are moved to the slow thread pool so that they do not affect other, normal API calls.
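
The selection logic can be sketched in Java. The article does not say how the processing-time statistics are computed, so this sketch assumes an exponentially weighted moving average per API; the class name, pool sizes, and smoothing factor are likewise hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class FastSlowIsolation {
    private static final double ALPHA = 0.2; // weight of the newest sample (assumed value)

    private final long thresholdMillis;
    private final Map<String, Double> averageMillis = new ConcurrentHashMap<>();
    final ExecutorService fastPool = newDaemonPool(8);
    final ExecutorService slowPool = newDaemonPool(2);

    FastSlowIsolation(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    /** Records how long one request of the given API took to process. */
    void record(String api, long elapsedMillis) {
        averageMillis.merge(api, (double) elapsedMillis,
                (average, sample) -> (1 - ALPHA) * average + ALPHA * sample);
    }

    boolean isSlow(String api) {
        return averageMillis.getOrDefault(api, 0.0) > thresholdMillis;
    }

    /** APIs whose recent average exceeds the threshold are served by the slow pool. */
    ExecutorService poolFor(String api) {
        return isSlow(api) ? slowPool : fastPool;
    }

    private static ExecutorService newDaemonPool(int size) {
        return Executors.newFixedThreadPool(size, runnable -> {
            Thread thread = new Thread(runnable);
            thread.setDaemon(true);
            return thread;
        });
    }
}
```

Because the average decays, an API that becomes fast again drifts back below the threshold and returns to the fast pool without manual intervention.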

Shepherd also lets business developers configure custom thread pools for isolation. The thread isolation model is shown below:

2.2.3 Stability guarantee

Shepherd provides several standard stability safeguards to protect the availability of both the gateway itself and the back-end services, as shown below:

Traffic control: rate limiting by user-defined UUID, by App, by IP, and by cluster.
Request caching: caching can be enabled for idempotent, frequently queried requests whose data is not time-sensitive.
Timeout management: each API sets a timeout for processing requests. Requests that time out fail fast so that they do not hold resources.
Circuit breaking and degradation: the gateway monitors request statistics in real time and returns a default value once the failure threshold is reached.
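
Of the safeguards listed above, circuit breaking is the least obvious to implement, so here is a minimal Java sketch. It is a simplification under stated assumptions: it counts consecutive failures rather than the real-time request statistics the article mentions, and the class name, threshold, and cool-down parameters are hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class CircuitBreaker<T> {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // injectable so the sketch can be tested without sleeping
    private int consecutiveFailures;
    private long openedAt = -1;       // -1 means the breaker is closed

    CircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Calls the back end, or returns the default value while the breaker is open. */
    T call(Supplier<T> backend, T defaultValue) {
        if (isOpen()) {
            return defaultValue;
        }
        try {
            T result = backend.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return defaultValue;
        }
    }

    private synchronized boolean isOpen() {
        if (openedAt < 0) {
            return false;
        }
        if (clock.getAsLong() - openedAt < openMillis) {
            return true;
        }
        openedAt = -1;                              // cool-down elapsed: allow a trial request
        consecutiveFailures = failureThreshold - 1; // a failed trial reopens the breaker at once
        return false;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
    }

    private synchronized void recordFailure() {
        if (++consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong();
        }
    }
}
```

In production the clock would be `System::currentTimeMillis`. While the breaker is open the back end is not called at all, which is what gives a struggling service room to recover.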

2.2.4 Request Security

Request security is a critical capability of an API gateway. Shepherd integrates a rich set of security-related system components, including basic request signing, SSO single sign-on, UAC/UPM access control based on SSO authentication, Passport user authentication, EPassport merchant authentication, merchant entitlement authentication, and anti-crawling. Business developers only need simple configuration to use them.

2.2.5 Grayscale capability

As the entry point for requests, an API gateway often has to support grayscale (canary) verification of request traffic.

Grayscale scenarios

Shepherd supports grayscale release of the API logic itself, of downstream services, or of both at the same time, as shown below:

For grayscale release of the API logic itself, traffic is diverted to different API versions. For grayscale release of a downstream service, traffic is tagged and routed to the designated downstream grayscale unit.

Grayscale strategies

Shepherd supports several grayscale strategies: traffic can be selected by proportion or by specific conditions.
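
Both kinds of strategy fit in a few lines of Java. The sketch below is illustrative only: the `X-Gray` header, the bucketing by user ID hash, and the version labels are assumptions, not Shepherd’s actual rules.

```java
import java.util.Map;

final class GrayStrategy {
    /** Stable bucket in [0, 100) so the same user always lands on the same side. */
    static int bucket(String userId) {
        return Math.floorMod(userId.hashCode(), 100);
    }

    /** Picks the API version that should serve this request. */
    static String chooseVersion(String userId, Map<String, String> headers, int grayPercent) {
        // Condition-based: an explicit marker on the request forces the grayscale version.
        if ("1".equals(headers.get("X-Gray"))) {
            return "gray";
        }
        // Proportion-based: roughly grayPercent percent of users get the grayscale version.
        return bucket(userId) < grayPercent ? "gray" : "stable";
    }
}
```

Hashing a stable identifier instead of drawing a random number per request keeps each user’s experience consistent while the percentage is raised step by step.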

2.2.6 Monitoring and alarms

Multi-layer monitoring

Shepherd provides all-round, multi-layer monitoring that covers business metrics, machine metrics, and JVM metrics around the clock (7×24), as shown in the following table:

No.	Monitoring module	Main function
1	Raptor (unified monitoring)	Reports request invocation information and system metrics in real time; monitors the application layer (JVM) and the system layer (CPU, IO, network)
2	Mtrace (link tracing)	Handles full-link context propagation and full-link tracing and monitoring
3	Logscan (log monitoring)	Monitors local logs for abnormal keywords, such as 5XX status codes and null pointer exceptions
4	Remote log center	Receives API request logs, Debug logs, and component logs reported by the gateway
5	Health check Scanner	Runs heartbeat checks on gateway nodes and status checks on APIs to detect abnormal nodes and APIs promptly

Multidimensional alarms

A comprehensive monitoring system needs a matching alarm mechanism. The main alarm capabilities are:

No.	Alarm type	Trigger
1	Rate-limit alarm	An API’s requests reach the rate-limiting threshold
2	Request failure alarm	Authentication failure, request timeout, or a back-end service exception
3	Component exception alarm	A custom component takes too long to process or has a high failure rate
4	API exception alarm	An API fails to publish or an API check finds an abnormality
5	Health check failure alarm	An API heartbeat check fails or a gateway node becomes unreachable

2.2.7 Fault self-healing

The Shepherd server is connected to the elastic scaling module, which can scale the cluster out or in quickly based on metrics such as CPU. It also supports quickly removing a faulty node, as well as finer-grained removal of a faulty component.

2.2.8 Migratability

For Web services that already expose APIs externally, business developers may consider migrating them to the Shepherd API gateway to reduce operation and maintenance costs and improve subsequent R&D efficiency.

Non-core APIs can be migrated directly using Oceanus’ grayscale release capability. For core APIs, however, that grayscale release works at machine level: its granularity is coarse, and it is not flexible enough to support the grayscale verification process.

Solution

Shepherd provides business developers with a grayscale SDK. A Web service that integrates the SDK can identify grayscale traffic and forward it to the Shepherd API gateway for verification.

The APIs under grayscale and the grayscale percentage can be adjusted dynamically on the Shepherd management platform and take effect in real time. Business developers can also customize the grayscale strategy through an SPI. After grayscale verification passes, the API is migrated to the Shepherd API gateway, which keeps the migration process stable.

Grayscale process

Before grayscale: create an API group on the Shepherd management platform and set its domain name to the domain name currently in use. On Oceanus, the original domain name rules remain unchanged.

During grayscale: turn on the grayscale function on the Shepherd management platform. The grayscale SDK forwards grayscale traffic to the gateway service for verification.

After grayscale: once grayscale traffic has verified that the API configuration on Shepherd behaves as expected, perform the migration.

2.3 Ease-of-use design

The Shepherd API gateway is powerful but complex, so ease of use is critical for business developers. This section focuses on two solutions: automatic DSL generation and more efficient API operations.

2.3.1 Automatically Generating DSL

When business developers use the gateway management platform, we try to reduce the burden of writing DSL through graphical page configuration. However, the DSL for service parameter conversion still had to be written by hand. The typical process for producing a service parameter DSL was:

Add a dependency on the service’s interface package.
Obtain the class definitions of the service parameters.
Write a test case to generate a JSON template.
Fill in the parameter mapping rules.
Finally, enter the result into the management platform by hand and publish the API.

The whole process is cumbersome and error-prone. When dozens or hundreds of APIs need to be entered, doing it by hand is very inefficient.

Solution

Can the service parameter DSL be generated automatically? Yes. A business developer only needs to enter the API document information at the gateway, followed by the service’s Appkey, service name, and method name. The Shepherd management platform then obtains the JSON Schema of the service parameters from the console of the newly released version of the service framework. The JSON Schema defines the types and structure of the service parameters, so the management platform can automatically generate JSON Mock data for them and then automatically replace the values of parameters whose names match those in the API documentation. This scheme is transparent to the business and standardized. Business teams only need to upgrade to the latest version of the service framework to use it, which greatly improves R&D efficiency, and it has been well received by business developers.
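
The two automatic steps, generating Mock data from a schema and then replacing values by parameter name, can be sketched in Java. This is a rough sketch under stated assumptions: the schema is a simplified subset of JSON Schema held in nested maps (only `type` and `properties`; arrays and references are not modeled), and the class name and mapping format are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ServiceParamDslSketch {
    /** Builds mock data from a simplified schema of the form {"type": ..., "properties": {...}}. */
    @SuppressWarnings("unchecked")
    static Object mock(Map<String, Object> schema) {
        Object type = schema.get("type");
        if ("object".equals(type)) {
            Map<String, Object> out = new LinkedHashMap<>();
            Map<String, Object> props = (Map<String, Object>) schema.get("properties");
            if (props != null) {
                for (Map.Entry<String, Object> prop : props.entrySet()) {
                    out.put(prop.getKey(), mock((Map<String, Object>) prop.getValue()));
                }
            }
            return out;
        }
        if ("integer".equals(type) || "number".equals(type)) {
            return 0;
        }
        if ("boolean".equals(type)) {
            return Boolean.FALSE;
        }
        return ""; // strings, and anything this sketch does not model
    }

    /** Replaces leaf values whose parameter name appears in the API document's mapping. */
    @SuppressWarnings("unchecked")
    static void bind(Map<String, Object> mock, Map<String, String> apiDocMapping) {
        for (Map.Entry<String, Object> entry : mock.entrySet()) {
            if (entry.getValue() instanceof Map) {
                bind((Map<String, Object>) entry.getValue(), apiDocMapping);
            } else if (apiDocMapping.containsKey(entry.getKey())) {
                entry.setValue(apiDocMapping.get(entry.getKey()));
            }
        }
    }
}
```

For a schema with an integer `userId` and a nested `page.size`, `mock` produces placeholder values, and `bind` with the mapping `userId -> $.query.userId` turns the placeholder into a mapping expression, leaving unmatched parameters at their mock values for the developer to review.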

2.3.2 Improving API operation efficiency

Quick API creation

The core capabilities of the API gateway are built on API configuration, but powerful functionality brings high complexity. Many business developers complained that API configuration was tedious and costly to learn. Quick API creation was introduced in response, allowing business developers to create an API with very little information. Quick creation currently supports four API types: back-end RPC service APIs, back-end HTTP service APIs, SSO CallBack APIs, and Nest APIs. More types will be added for different business scenarios.

Batch operations

Business developers manage a large number of service groups on the API gateway, and each service group can contain up to 200 API configurations. Many APIs share identical configuration, such as component configuration, error code configuration, and cross-domain configuration, and applying the same setting to each API one at a time is highly repetitive. Shepherd therefore supports batch operations: after selecting multiple APIs, the [Batch Operation] function updates their configuration in one step, which reduces the cost of repetitive configuration.

API Import and Export

Shepherd can export and import APIs between different R&D environments. After testing offline, business developers only need to use the import and export function to bring the configuration into the online production environment, avoiding repeated configuration.

2.4 Extensibility design

Besides strong basic capabilities, a well-designed infrastructure component needs good extensibility. Shepherd’s extensibility is mainly reflected in its support for custom components and service orchestration.

2.4.1 Custom components

Shepherd’s rich set of system components for authentication, rate limiting, monitoring, and similar capabilities meets most business requirements. Some businesses still have special requirements, however, such as custom validation or custom result processing. By supporting the loading of custom components, Shepherd lets businesses add their own custom logic.

The following figure shows an example implementation of a custom component. The getName method returns the name under which the custom component was registered. The invoke method implements the component’s business logic, and its outcome can be to continue execution, redirect to another page, return a result directly, or throw an exception.
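
Since the figure is not reproduced here, the shape of such a component can be sketched in Java. Only the method names getName and invoke come from the article; the parameter and return types, the `ComponentResult` class, and the example component are assumptions made for illustration.

```java
import java.util.Map;

/** Hypothetical outcome type covering the non-exception cases listed in the article. */
final class ComponentResult {
    enum Action { CONTINUE, REDIRECT, RETURN }

    final Action action;
    final String payload; // redirect target or response body; null for CONTINUE

    private ComponentResult(Action action, String payload) {
        this.action = action;
        this.payload = payload;
    }

    static ComponentResult proceed() { return new ComponentResult(Action.CONTINUE, null); }
    static ComponentResult redirect(String url) { return new ComponentResult(Action.REDIRECT, url); }
    static ComponentResult respond(String body) { return new ComponentResult(Action.RETURN, body); }
}

/** Assumed contract: the real signatures in Shepherd may differ. */
interface CustomComponent {
    String getName();
    ComponentResult invoke(Map<String, String> request);
}

/** Example component with one branch per outcome. */
final class TenantCheckComponent implements CustomComponent {
    @Override
    public String getName() {
        return "tenant-check"; // the name registered for this component
    }

    @Override
    public ComponentResult invoke(Map<String, String> request) {
        String tenant = request.get("tenant");
        if (tenant == null) {
            return ComponentResult.redirect("/login");           // page jump
        }
        if (tenant.isEmpty()) {
            throw new IllegalArgumentException("empty tenant");   // throw an exception
        }
        if ("blocked".equals(tenant)) {
            return ComponentResult.respond("403 tenant blocked"); // return a result directly
        }
        return ComponentResult.proceed();                         // continue execution
    }
}
```

Keeping the contract this small is what makes components easy to load dynamically: the gateway only needs the name to look a component up and a single method to call.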

Through custom components, Shepherd already supports important businesses such as Meituan Youxuan, food delivery, catering, and ride-hailing, and more than 200 custom components have been integrated.

2.4.2 Service Orchestration

Typically, an API configured on the gateway corresponds to a single back-end RPC or HTTP service. If the caller needs to aggregate and orchestrate several back-end services, it must issue as many HTTP requests as there are back-end services. This causes problems: the caller makes too many HTTP requests, efficiency is low, and too much aggregation logic ends up on the calling side.

This is where service orchestration comes in. Service orchestration composes existing services and processes the data they return. It is mainly used in data aggregation scenarios, where a single HTTP request needs several back-end calls (RPC or HTTP) to obtain a complete result.
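
A data aggregation scenario of this kind can be sketched with `CompletableFuture`. The example is hypothetical throughout: the three services, their signatures, and the result fields are invented for illustration and say nothing about how Shepherd actually orchestrates calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

final class OrderPageOrchestration {
    // Stand-ins for asynchronous back-end clients (RPC or HTTP).
    private final Function<String, CompletableFuture<String>> orderService;    // orderId -> userId
    private final Function<String, CompletableFuture<String>> userService;     // userId -> user name
    private final Function<String, CompletableFuture<String>> shippingService; // orderId -> status

    OrderPageOrchestration(Function<String, CompletableFuture<String>> orderService,
                           Function<String, CompletableFuture<String>> userService,
                           Function<String, CompletableFuture<String>> shippingService) {
        this.orderService = orderService;
        this.userService = userService;
        this.shippingService = shippingService;
    }

    /** One gateway request fans out to three back-end calls and merges their results. */
    CompletableFuture<Map<String, String>> orderPage(String orderId) {
        // Dependent call: the user lookup needs the user ID returned by the order service.
        CompletableFuture<String> userName = orderService.apply(orderId).thenCompose(userService);
        // Independent call: shipping status is fetched in parallel with the chain above.
        CompletableFuture<String> shipping = shippingService.apply(orderId);
        return userName.thenCombine(shipping, (name, status) -> {
            Map<String, String> page = new LinkedHashMap<>();
            page.put("orderId", orderId);
            page.put("userName", name);
            page.put("shipping", status);
            return page;
        });
    }
}
```

The caller issues a single request and receives the merged result, so the aggregation logic and the extra round trips move from the client into the orchestration layer.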

Preliminary research showed that the company already had a mature service orchestration framework: Pirate, a component developed by the customer service team (see “Pirate Middleware: Best Practices for Meituan Service Experience Platform to Connect Business Data”), which is also a public service within Meituan.

So we worked with the Pirate team to design Shepherd’s service orchestration solution. Pirate provides service orchestration as an independently deployed service, and Shepherd calls Pirate via RPC. This decouples Shepherd from Pirate and prevents service orchestration from affecting other services on the cluster, while the one extra RPC call does not add significant latency. The capability is also transparent and convenient to use: a business developer configures a service orchestration API on the management platform, and the configuration center delivers it to the Shepherd server and the Pirate service at the same time. The overall interaction architecture is shown below:

III. Future planning

At present, more than 18,000 APIs are connected to the Shepherd API gateway, more than 90 clusters are running online, and the total number of calls per day exceeds 10 billion. As the scale of business on the Shepherd API gateway keeps growing, the requirements on its availability, ease of use, and extensibility will rise. Shepherd’s priorities for the coming year include cloud-native architecture evolution, static website hosting, and a component market.

3.1 Cloud-native architecture evolution

The cloud-native architecture evolution of the Shepherd API gateway has three goals: simplify the steps for onboarding to the gateway and improve R&D efficiency for business developers; reduce the size of the server War package to improve security and stability; and adopt Serverless elasticity to reduce costs and improve resource utilization.

To achieve these goals, we plan to migrate the gateway service as a whole to Nest, the company’s Serverless service (see “Exploration and Practice of Nest of Meituan-based Serverless Platform”), and to extract Shepherd’s core functions into an SDK that is integrated into each business’s gateway cluster. Business developers can then include only the custom components they need, which dramatically reduces the size of the server War package.

3.2 Static website hosting

The goal of static website hosting on the Shepherd API gateway is a general-purpose solution that gives developers convenient, stable, and highly scalable static website hosting.

The main functions it will offer business developers include hosting static website resources (storage and access); managing the application lifecycle, including custom domain configuration and authentication and authorization; and CI/CD integration.

3.3 Component Market

The Shepherd API gateway component market aims to build a win-win ecosystem in which business developers develop custom components that other business teams can use.

We want business developers to take part in developing custom components, complete their usage documentation, and turn them into public components open to every business developer using Shepherd, so that nobody has to reinvent the wheel.

About the authors

Chong Ze, Zhi Yang, and Li Min are all from the infrastructure team of Meituan’s Basic Technology Department.

This article was produced by the Meituan technical team, and the copyright belongs to Meituan. You are welcome to reprint or use its content for non-commercial purposes such as sharing and communication, with the attribution “Content reprinted from Meituan Technical team”. The article may not be reproduced or used commercially without permission. For any commercial use, please send an email to [email protected] for authorization.