Author | Deng Xuexiang (Xiang Yi)    Source | Serverless official account

Autonavi began building Serverless in FY21. One year later, the peak traffic of Autonavi's Serverless business has exceeded 100,000 QPS. The platform has gone from 0 to 1 and traffic from zero to 100,000 QPS, making Autonavi the largest BU in Alibaba Group in terms of Serverless adoption. What was the process like? What problems did we encounter? Why did Autonavi choose Serverless/Faas? How did we build it, and what is the technical solution? How is it going so far, and what are the follow-up plans? This article shares a brief overview.

1. Why Autonavi adopted Serverless

Why does Autonavi want Serverless? The background is that Autonavi launched a client-to-cloud project in FY21, whose main purpose is to improve the efficiency of client development iterations.

In the past, all client business logic lived on the device, so any change in product requirements had to ship in a client release. Each release had to pass various testing and gray-release processes and resolve problems such as client crashes, so the current cadence is one version per month.

After the client moves to the cloud, some of the volatile business logic lives in the cloud. New product requirements developed in the cloud no longer wait for the monthly release, which speeds up requirement iteration and brings us another step closer to the ideal of R&D and product moving at the same cadence. (We say "another" because Autonavi had already done some optimization in that direction, and we hope end-cloud integration can become one of the most effective technical levers.)

1.1 Objective: client development mode with end-cloud integration

Although the development mode has changed from device-only development to cloud plus device development, the developers should still be the same people who were responsible for the business before. As we all know, however, server-side and client-side development are clearly different. Client-side development targets a single machine, while server-side development usually targets a cluster and has to consider coordination, load balancing, failover, degradation, and other complex issues of distributed systems. The risk of this transition is considerable if the traditional server-side model is used.

Faas solves this problem well. Building on the Autonavi client's existing Xbus framework (a framework for registering and invoking local services on the client), we added the xbus-Cloud component so that developing in the cloud feels like developing on the device. The goal is one set of code running in two places: the same business code can run on both the client and the server.
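The "one set of code, two places" idea can be pictured with a minimal Go sketch. It assumes a hypothetical service registry in the spirit of Xbus: business code is registered under a name, and the registry decides whether a cloud or a local implementation answers. The names `Registry`, `Handler`, `RegisterLocal`, `RegisterCloud`, and `Call` are illustrative only and are not the actual Xbus or xbus-Cloud API.

```go
package main

import (
	"errors"
	"fmt"
)

// Handler is a hypothetical service signature shared by the device and cloud sides.
type Handler func(req string) (string, error)

// Registry maps service names to handlers. It only illustrates the idea of a
// service bus; it is not the real Xbus API.
type Registry struct {
	local map[string]Handler
	cloud map[string]Handler
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{local: map[string]Handler{}, cloud: map[string]Handler{}}
}

// RegisterLocal registers an on-device implementation.
func (r *Registry) RegisterLocal(name string, h Handler) { r.local[name] = h }

// RegisterCloud registers a cloud-side implementation.
func (r *Registry) RegisterCloud(name string, h Handler) { r.cloud[name] = h }

// Call prefers the cloud handler when one is registered and otherwise uses the local one.
func (r *Registry) Call(name, req string) (string, error) {
	if h, ok := r.cloud[name]; ok {
		return h(req)
	}
	if h, ok := r.local[name]; ok {
		return h(req)
	}
	return "", errors.New("service not registered: " + name)
}

// weatherTip is the single piece of business code that could be registered on either side.
func weatherTip(city string) (string, error) {
	return "weather tip for " + city, nil
}

func main() {
	r := NewRegistry()
	r.RegisterLocal("weather.tip", weatherTip)
	out, err := r.Call("weather.tip", "Beijing")
	fmt.Println(out, err)
}
```

The caller never names a transport, so moving a service to the cloud in this sketch is a registration change rather than a change to business code.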

The Autonavi client has three main platforms: iOS, Android, and in-car systems (Linux-like operating systems). There are two main languages: C++ and Node.js. Traditional map functions, such as map display, navigation route display, and navigation voice broadcast, are developed in C++ because they must run across all three platforms. Map application features built on top of navigation, such as pre-trip and post-trip cards and destination recommendations, are mainly developed with Node.js.

In FY20, the Taobao front-end team developed a Node.js Faas Runtime. For the Node.js part of the Autonavi client-to-cloud project, we adopted that existing Runtime to access the group's Faas platform and moved part of the Node.js business to the cloud. It supported Autonavi's National Day travel festival business well during the 2020 National Day holiday.

There was no existing solution for C++ Faas, so we decided to build on top of the group's infrastructure and create a new C++ Faas base platform to help the Autonavi client move to the cloud.

1.1.1 Key to end-cloud integration best practice: interface abstraction between client and Faas

When logic is moved from the client to the Faas server, or when new requirements are developed on the Faas server, the key to success or failure is the definition of the interface protocol between the client and Faas, which is also the definition of the Faas API. A good API definition not only helps the maintainability of the system but is also important for the later iterative development of the business. For guidance on good API definitions, see the article by Gu Pu: Thoughts on the Best Practices of API Design.

Ideally, the client is a browser that parses the result data returned by Faas. Once a browser protocol is defined it rarely changes, as Internet Explorer and Chrome show. Of course, our browser is a little more complicated: it is a map browser. Whether the interface between the client and Faas is well defined can be verified in subsequent iterations of product requirements. If some of those iterations can be completed on Faas alone, without any change on the client side, the interface abstraction is successful.
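To make the "client as browser" idea concrete, here is a minimal Go sketch under stated assumptions: the Faas returns a versioned JSON envelope of display cards, and the client renders the card types it knows and skips the rest. The `Response` and `Card` types, the field names, and the card types are hypothetical and do not describe Autonavi's actual protocol.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Card is a hypothetical display unit returned by a Faas function.
type Card struct {
	Type  string `json:"type"`
	Title string `json:"title"`
	Body  string `json:"body"`
}

// Response is a hypothetical envelope; Version leaves room for protocol evolution.
type Response struct {
	Version int    `json:"version"`
	Cards   []Card `json:"cards"`
}

// samplePayload stands in for data returned by a Faas function.
const samplePayload = `{"version":1,"cards":[
 {"type":"tip","title":"Width limit","body":"2.2 m ahead"},
 {"type":"video","title":"Promo","body":"not supported by this client"}]}`

// Render plays the role of the client: it draws the card types it knows
// and ignores unknown ones so that older clients keep working.
func Render(payload []byte) ([]string, error) {
	var resp Response
	if err := json.Unmarshal(payload, &resp); err != nil {
		return nil, err
	}
	var lines []string
	for _, c := range resp.Cards {
		switch c.Type {
		case "text", "tip":
			lines = append(lines, c.Title+": "+c.Body)
		default:
			// Unknown card types are skipped on purpose.
		}
	}
	return lines, nil
}

func main() {
	lines, err := Render([]byte(samplePayload))
	if err != nil {
		fmt.Println("render error:", err)
		return
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

In this sketch a new tip can be shipped by changing only what the Faas returns, which is the test of a good interface abstraction described above.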

1.2 BFF layer development and efficiency improvement

When Autonavi is mentioned, the first thing that comes to mind is its tool attribute: Autonavi is a navigation tool. (That description is no longer very accurate. Over the past few years Autonavi has been transforming from a tool into a platform with the aim of being an all-in-one Autonavi; its transaction business keeps growing, and businesses such as taxi hailing, tickets, and hotels are developing rapidly.)

For Autonavi, the large number of read-only scenarios is a technical characteristic of the business compared with other parts of the group, such as e-commerce. Among these read-only scenarios, most requirements belong to the BFF (Backend For Frontend) layer. Why? The core functions of navigation, such as routing, traffic, and ETA, are relatively stable. The main work there is to keep optimizing the algorithms so that Autonavi's traffic information is more accurate and the calculated routes are better. These core functions are relatively stable in interface and function, while front-end requirements change often, for example adding a tip about a width-limit bollard on a route.

Faas is particularly suitable for BFF layer development: a Faas function calls the relatively stable back-end Baas services, encapsulates the data and call logic, and can be developed and released quickly. In the industry, the BFF scenario (also known as the SFF scenario, Service For Frontend) is the most common scenario for Faas.
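A BFF function of this kind can be sketched in Go as a concurrent fan-out to stable back-end services followed by a merge into a view model for the front end. This is only an illustration: `Backend`, `TripSummary`, and the stub services are hypothetical stand-ins, not Autonavi's routing, traffic, or ETA APIs.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Backend is a hypothetical stand-in for a stable Baas service.
type Backend func(routeID string) (string, error)

// TripSummary is the view model shaped for the front end.
type TripSummary struct {
	Route   string
	Traffic string
	ETA     string
}

// Aggregate calls the three backends concurrently and merges the results.
// Each goroutine writes to its own slot, so no extra locking is needed.
func Aggregate(routeID string, route, traffic, eta Backend) (TripSummary, error) {
	var (
		wg      sync.WaitGroup
		results [3]string
		errs    [3]error
	)
	for i, b := range []Backend{route, traffic, eta} {
		wg.Add(1)
		go func(i int, b Backend) {
			defer wg.Done()
			results[i], errs[i] = b(routeID)
		}(i, b)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return TripSummary{}, err
		}
	}
	return TripSummary{Route: results[0], Traffic: results[1], ETA: results[2]}, nil
}

// stub returns a Backend that always answers with a fixed value.
func stub(v string) Backend {
	return func(string) (string, error) { return v, nil }
}

// failing returns a Backend that always reports an error.
func failing() Backend {
	return func(string) (string, error) { return "", errors.New("backend unavailable") }
}

func main() {
	s, err := Aggregate("r-1", stub("G6 expressway"), stub("light"), stub("42 min"))
	fmt.Println(s, err)
}
```

The back-end interfaces stay fixed while the shape of `TripSummary` can change with front-end requirements, which is why this layer benefits from fast release cycles.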

1.3 Serverless is the high-level language of the cloud age

In FY21, Autonavi became the first BU in the group to move fully onto the cloud. Being fully on the cloud is not the end state of the cloud era, though. So far the work has mainly been a full move to Pouch containers and onto the cloud, with standardized containers. In terms of scale and resource utilization we can fully enjoy the dividends of the cloud, but the business development model is basically the same as before: we still write code the way one writes a large distributed system. The R&D model has not yet enjoyed the benefits of the cloud; it is as if we were writing cloud services in assembly language. Serverless and cloud native can be understood as the high-level language of the cloud era, truly making the cloud the computer: developers focus on business development and do not need to deal with the complexities of large distributed systems.

1.4 Go-Faas complements the Go ecosystem

As mentioned earlier, we developed the C++ Faas Runtime on top of the work of the Alibaba Cloud FC (Function Compute) team. We also developed Go-Faas. Why? Some background: the peak QPS of the Go part of the Autonavi server side has exceeded one million. Autonavi has completed Go clients for Alibaba's middleware, co-built with the group's middleware department. Observability and the automated test system are basically in place, so the Go ecosystem is basically complete. With Go-Faas, we can write Baas services in Go and Faas services in Go, choosing the implementation that fits each business scenario. Go-Faas is mainly applied to the BFF scenarios mentioned above.

2. How: technical solution, building on the group's existing infrastructure

2.1 Overall technical architecture

The sections above explain why we did this. Next we describe how we did it and what the specific technical solution is.

Following the idea of building on the group's existing infrastructure and middleware, we cooperated with the CSE team and the Alibaba Cloud FC (Function Compute) team to develop the C++ Faas Runtime and the Go Faas Runtime. The overall technical architecture, integrated with the group's systems, is shown in the figure below. It is divided into three parts: development, running, and operation and maintenance.

2.1.1 Running state

Let's start with the running state. Traffic comes in through our gateway, is sent to the FC API Server, and is forwarded to the C++/Go Faas Runtime, which executes the user function. The architecture of the Runtime is described in detail in section 2.2.

The Dapr sidecar is deployed alongside the Runtime container. It collects and reports logs and is used to invoke group middleware.

Dapr is still in the pilot stage, so for now middleware is invoked mainly through brokers and middleware proxies, for example proxies for HSF, Tair, Metaq, and Diamond.

Finally, the Autoscaling module scales function instances automatically. Several scheduling strategies are available, including scheduling based on the number of concurrent requests and scheduling based on the CPU usage of function instances. A number of reserved instances can also be configured in advance to avoid cold-start problems after scaling in to 0.
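As a rough illustration of concurrency-based scaling with reserved instances, the following Go sketch computes a desired instance count from the number of in-flight requests. The `Policy` fields and the rounding rule are assumptions made for this example; the source does not describe the actual algorithm used by FC or CSE.

```go
package main

import "fmt"

// Policy holds hypothetical scaling parameters.
type Policy struct {
	TargetConcurrency int // desired in-flight requests per instance
	Reserved          int // instances kept warm to avoid cold starts
	Max               int // upper bound on instances
}

// DesiredInstances returns the instance count for the observed number of
// in-flight requests: ceil(inFlight / target), clamped to [Reserved, Max].
func DesiredInstances(p Policy, inFlight int) int {
	if p.TargetConcurrency <= 0 {
		return p.Reserved
	}
	n := (inFlight + p.TargetConcurrency - 1) / p.TargetConcurrency
	if n < p.Reserved {
		n = p.Reserved
	}
	if n > p.Max {
		n = p.Max
	}
	return n
}

func main() {
	p := Policy{TargetConcurrency: 10, Reserved: 2, Max: 50}
	for _, load := range []int{0, 95, 10000} {
		fmt.Printf("in-flight=%d -> instances=%d\n", load, DesiredInstances(p, load))
	}
}
```

With `Reserved` set above zero the count never drops to 0, which corresponds to the reserved-instance setting described above; a CPU-based strategy would swap the input signal but keep the same clamping.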

At the bottom, FC relies on the group's ASI capability. ASI can be understood simply as the group's K8s plus Sigma (the group's scheduling system). FC calls ASI to deploy function instances and scale them elastically. The smallest unit of deployment is the pod shown in the figure above; a pod contains the Runtime container and the sidecar set containers.

2.1.2 Development state

The running state determines how a function runs, while the development state is concerned with the development experience: how easily developers can develop, debug, deploy, and test a function.

One difficulty with C++ Faas is that the C++ Faas Runtime has dependency libraries, and C++ dependency management is not as convenient as Java's, so installing these libraries is troublesome. The Faas scaffolding solves this problem: one scaffolding command generates a C++ Faas sample project and installs the required dependency packages. For local debugging, we developed a C++ Faas Runtime Boot module. The entry point for starting the function Runtime is in the Boot module, which integrates the Runtime with the user's Faas function, so you can single-step debug through the Runtime.

In collaboration with the Aone team, function publishing is integrated into the Aone environment, making it very easy to publish Go or C++ Faas on Aone. The one-click generation of the example code base is also integrated on Aone.

Compiling C++ and Go Faas depends on the corresponding build environment. Aone supports custom build images, so we uploaded our build images to the group's public image registry. The function's code repository specifies which build image to use, and that image has the Faas dependency libraries and SDK installed.

2.1.3 Operational state

Finally, operation and maintenance monitoring. The Runtime integrates EagleEye and Sunfire log output internally and writes these logs itself. The Agent in the sidecar collects them for the EagleEye or Sunfire monitoring platform (on FC they are collected through SLS). The group's existing monitoring platforms can then be used to monitor Faas, and alerts can be connected to the group's GOC alarm platform.
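One common way for a runtime to produce such logs is to wrap every function invocation in a timing filter that emits one record per call. The Go sketch below illustrates that pattern only; `Filter`, `Record`, and `Timing` are hypothetical names, and the record does not follow the real EagleEye or Sunfire log formats.

```go
package main

import (
	"fmt"
	"time"
)

// Handler is a hypothetical function-invocation signature.
type Handler func(req string) (string, error)

// Filter wraps a Handler with extra behavior, in the style of middleware.
type Filter func(next Handler) Handler

// Record is one hypothetical monitoring entry; real log formats differ.
type Record struct {
	Name string
	Cost time.Duration
	OK   bool
}

// Timing returns a Filter that measures each call and hands a Record to sink.
func Timing(name string, sink func(Record)) Filter {
	return func(next Handler) Handler {
		return func(req string) (string, error) {
			start := time.Now()
			out, err := next(req)
			sink(Record{Name: name, Cost: time.Since(start), OK: err == nil})
			return out, err
		}
	}
}

// Chain applies filters so that the first filter is the outermost wrapper.
func Chain(h Handler, filters ...Filter) Handler {
	for i := len(filters) - 1; i >= 0; i-- {
		h = filters[i](h)
	}
	return h
}

func main() {
	userFunc := func(req string) (string, error) { return "ok:" + req, nil }
	logLine := func(r Record) {
		// A sidecar agent could collect lines like this one.
		fmt.Printf("func=%s ok=%t cost_recorded=%t\n", r.Name, r.OK, r.Cost >= 0)
	}
	h := Chain(userFunc, Timing("navi-tips", logLine))
	out, err := h("r-1")
	fmt.Println(out, err)
}
```

Because the filter sits outside the user function, the user code needs no monitoring calls of its own.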

2.2 C++/Go Faas Runtime architecture

The Runtime is part of the overall architecture that integrates with Aone, FC/CSE, and ASI. The following is a detailed description of the architecture of Runtime and how it is designed and implemented.

At the top is the user's Faas code, which depends only on the Faas SDK. Users write their own Faas by implementing the Function interface in the SDK. To call an external system, they use the Http Client in the SDK; to call external middleware, they use the Diamond/Tair/HSF/Metaq clients in the SDK. These SDK interfaces hide the complexity of the underlying implementation, so users do not need to care how the calls are eventually carried out or how the Runtime is implemented.
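The split between user code and SDK interfaces might look like the following Go sketch. Everything here is an assumption for illustration: the real Faas SDK's `Function` interface and middleware clients are not shown in the source, so `Function`, `KVClient`, `memKV`, and `GreetFunc` are hypothetical.

```go
package main

import (
	"context"
	"fmt"
)

// Function is a hypothetical version of the interface a user Faas implements.
type Function interface {
	Invoke(ctx context.Context, req []byte) ([]byte, error)
}

// KVClient is a hypothetical middleware interface exposed by the SDK,
// standing in for a cache-style client.
type KVClient interface {
	Get(ctx context.Context, key string) (string, error)
}

// memKV is an in-memory stand-in for the implementation a runtime would inject.
type memKV map[string]string

func (m memKV) Get(_ context.Context, key string) (string, error) {
	v, ok := m[key]
	if !ok {
		return "", fmt.Errorf("key %q not found", key)
	}
	return v, nil
}

// GreetFunc is user code: it depends only on the SDK interfaces above.
type GreetFunc struct{ KV KVClient }

// Invoke looks up the requested key and builds a greeting from the value.
func (f GreetFunc) Invoke(ctx context.Context, req []byte) ([]byte, error) {
	name, err := f.KV.Get(ctx, string(req))
	if err != nil {
		return nil, err
	}
	return []byte("hello, " + name), nil
}

// run invokes fn once with a background context, as a runtime might.
func run(fn Function, req string) (string, error) {
	out, err := fn.Invoke(context.Background(), []byte(req))
	return string(out), err
}

func main() {
	fn := GreetFunc{KV: memKV{"u1": "Xiang Yi"}}
	fmt.Println(run(fn, "u1"))
}
```

Since `GreetFunc` sees only the `KVClient` interface, the implementation behind it can change without touching user code.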

The SDK layer contains the Function definition mentioned above and the interface definitions for the various middleware calls. The SDK code is provided to Faas users. The SDK is relatively light: it consists mainly of interface definitions and contains no concrete implementation. The Runtime implements the middleware calls in two ways.

Below that is the overall architecture of the Runtime. The Starter is the Runtime's startup module. The Runtime itself acts as a server, and the Starter launches it according to the configuration in the Function Config module.

The next layer is the Service layer, which implements the middleware invocation interfaces defined in the SDK in two ways: RSocket and Dapr. The RSocket path calls middleware through an RSocket broker. Dapr (Distributed Application Runtime) is integrated into the Runtime, so middleware can also be invoked through Dapr. During the early pilot stage of Dapr, if a middleware call through Dapr fails, the call is degraded to the RSocket path.
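The degrade-on-failure behavior can be sketched in Go as a wrapper that tries a primary invoker and falls back to a secondary one. The `Invoker` interface and the stub transports are hypothetical; this is neither the Dapr SDK nor an RSocket client, only the fallback pattern.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Invoker is a hypothetical transport-agnostic middleware call.
type Invoker interface {
	Invoke(service string, payload []byte) ([]byte, error)
}

// stubInvoker simulates one transport; it is not a real Dapr or RSocket client.
type stubInvoker struct {
	name string
	fail bool
}

func (s stubInvoker) Invoke(service string, _ []byte) ([]byte, error) {
	if s.fail {
		return nil, errors.New(s.name + ": call failed")
	}
	return []byte(s.name + " -> " + service), nil
}

// FallbackInvoker tries Primary first and degrades to Secondary on error.
type FallbackInvoker struct {
	Primary   Invoker
	Secondary Invoker
	degraded  int64 // number of degraded calls, updated atomically
}

// Invoke returns the primary result when it succeeds, otherwise the secondary result.
func (f *FallbackInvoker) Invoke(service string, payload []byte) ([]byte, error) {
	out, err := f.Primary.Invoke(service, payload)
	if err == nil {
		return out, nil
	}
	atomic.AddInt64(&f.degraded, 1)
	return f.Secondary.Invoke(service, payload)
}

// Degraded reports how many calls fell back to the secondary path.
func (f *FallbackInvoker) Degraded() int64 { return atomic.LoadInt64(&f.degraded) }

func main() {
	inv := &FallbackInvoker{
		Primary:   stubInvoker{name: "dapr", fail: true},
		Secondary: stubInvoker{name: "rsocket"},
	}
	out, err := inv.Invoke("tair.get", nil)
	fmt.Println(string(out), err, inv.Degraded())
}
```

Counting degraded calls gives a signal for deciding when the primary path is stable enough to stop relying on the fallback.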

Further down is the RSocket protocol layer, which encapsulates the metadata protocols used for RSocket calls. Dapr calls are made over gRPC.

The bottom layer is the integration with RSocket and Dapr.

RSocket calls also involve broker selection. The Upstream module manages the broker cluster, including broker registration and deregistration, keepalive checks, and so on. The LoadBalance module handles load balancing for broker selection, along with event management, connection management, and reconnection.
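Broker selection with health awareness can be sketched in Go as round-robin over brokers that passed their last keepalive check. The `BrokerPool` type and its methods are hypothetical, and round-robin is an assumed strategy since the source does not name the algorithm; deregistration and reconnection are omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// broker is one hypothetical broker endpoint with its last known health.
type broker struct {
	addr    string
	healthy bool
}

// BrokerPool keeps registered brokers and picks healthy ones in round-robin order.
type BrokerPool struct {
	mu      sync.Mutex
	brokers []*broker
	next    int
}

// Register adds a broker and assumes it is healthy until a check says otherwise.
func (p *BrokerPool) Register(addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.brokers = append(p.brokers, &broker{addr: addr, healthy: true})
}

// SetHealthy records the result of a keepalive check for addr.
func (p *BrokerPool) SetHealthy(addr string, ok bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, b := range p.brokers {
		if b.addr == addr {
			b.healthy = ok
		}
	}
}

// Pick returns the next healthy broker, skipping unhealthy ones.
func (p *BrokerPool) Pick() (string, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for i := 0; i < len(p.brokers); i++ {
		b := p.brokers[p.next]
		p.next = (p.next + 1) % len(p.brokers)
		if b.healthy {
			return b.addr, nil
		}
	}
	return "", errors.New("no healthy broker available")
}

func main() {
	p := &BrokerPool{}
	for _, a := range []string{"broker-a", "broker-b", "broker-c"} {
		p.Register(a)
	}
	p.SetHealthy("broker-b", false)
	for i := 0; i < 4; i++ {
		addr, err := p.Pick()
		fmt.Println(addr, err)
	}
}
```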

Finally, the Metrics module in the Runtime integrates EagleEye tracing. It uses the filter mode to intercept and time each Faas call, writes EagleEye logs, and prints Sunfire logs for the sidecar to collect. Here is an actual Sunfire monitoring interface:

2.2.1 Dapr

The Dapr architecture is shown below; see the official Dapr documentation for details.

The Runtime originally called middleware through RSocket, where the RSocket broker had a centralization problem. To decentralize outgoing traffic, we introduced the Dapr architecture in cooperation with the group's middleware team. Dapr is integrated only at the Runtime level, so user Faas code is unaware of it and does not need to care whether a given middleware call goes through RSocket or Dapr. When the Runtime switches middleware calls over to Dapr, the user Faas does not need to be modified.

3. How: how services access Serverless

As mentioned above, access is unified on Aone. We provide access documentation for C++ Faas and Go Faas, together with an example code repository that covers various scenarios, including code that calls the group's middleware. C++ Faas/Go Faas access is open to the whole group, and some BUs other than Autonavi already use C++/Go Faas in their own businesses. Node.js Faas can be accessed using the Runtime and templates provided by Taobao, and Java Faas using the Runtime and templates provided by Alibaba Cloud FC.

3.1 Access specifications: the three axes of stability (monitoring, gray release, rollback)

New technology naturally raises stability concerns. Our answer is Alibaba Group's three axes of stability: every change must be monitorable, gray-releasable, and rollbackable. We set up a Faas end-to-end support group that brings together all upstream and downstream business parties and base platforms, so that online alarms are handled according to the group's 1-5-10 requirement: respond and start troubleshooting within 1 minute, handle the problem within 5 minutes, and recover within 10 minutes.
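Gray release and rollback can be illustrated with a small Go sketch that routes a stable percentage of users to the new version. Hashing the user ID into 100 buckets is an assumed technique for this example, and `InGray` is a hypothetical helper; the source does not describe how Autonavi or Aone implements gray release.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// InGray reports whether a user falls inside the gray-release percentage.
// The same user always lands in the same bucket, so raising the percentage
// only adds users and setting it to 0 acts as a rollback.
func InGray(userID string, percent uint32) bool {
	if percent >= 100 {
		return true
	}
	h := fnv.New32a()
	h.Write([]byte(userID)) // hash.Hash's Write never returns an error
	return h.Sum32()%100 < percent
}

func main() {
	const users = 1000
	for _, pct := range []uint32{0, 10, 50, 100} {
		hits := 0
		for i := 0; i < users; i++ {
			if InGray(fmt.Sprintf("user-%d", i), pct) {
				hits++
			}
		}
		fmt.Printf("percent=%d gray users=%d of %d\n", pct, hits, users)
	}
}
```

The per-percentage counts printed by `main` are the kind of figure a monitoring dashboard would watch while the percentage is raised step by step.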

To standardize the access process and avoid online faults caused by mistakes, we drew up Faas access specifications and a checklist to help business parties adopt Faas quickly.

Monitoring, gray release, and rollback are mandatory. Beyond that, it is better still if the business side can degrade gracefully. Our C++ client-to-cloud business has supported degradation since the start of the pilot phase: if a Faas call fails, the call automatically degrades to a local call. Client functionality is basically unaffected, although response latency increases somewhat. The client's built-in version of the function may also be slightly older than the server's, but the function is forward compatible, so client use is basically unaffected.
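The degrade-to-local behavior can be sketched in Go as a wrapper that gives the cloud call a time budget and falls back to the on-device implementation on any failure. The actual client is written in C++, so this is only an illustration of the pattern; `Call`, `WithLocalFallback`, and the simulated cloud function are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Call is a hypothetical signature shared by the cloud and local implementations.
type Call func(ctx context.Context, req string) (string, error)

// WithLocalFallback tries the cloud call within timeout and degrades to the
// local call if the cloud call fails or times out.
func WithLocalFallback(cloud, local Call, timeout time.Duration) Call {
	return func(ctx context.Context, req string) (string, error) {
		cctx, cancel := context.WithTimeout(ctx, timeout)
		defer cancel()
		if out, err := cloud(cctx, req); err == nil {
			return out, nil
		}
		return local(ctx, req)
	}
}

// slowCloud simulates a Faas call that answers after delay and honors cancellation.
func slowCloud(delay time.Duration) Call {
	return func(ctx context.Context, req string) (string, error) {
		select {
		case <-time.After(delay):
			return "cloud:" + req, nil
		case <-ctx.Done():
			return "", errors.New("faas call aborted: " + ctx.Err().Error())
		}
	}
}

// localImpl is the on-device version of the same function.
func localImpl(_ context.Context, req string) (string, error) {
	return "local:" + req, nil
}

// demo runs one call with the given cloud delay and timeout, both in milliseconds.
func demo(delayMs, timeoutMs int) string {
	call := WithLocalFallback(
		slowCloud(time.Duration(delayMs)*time.Millisecond),
		localImpl,
		time.Duration(timeoutMs)*time.Millisecond,
	)
	out, _ := call(context.Background(), "tips")
	return out
}

func main() {
	fmt.Println(demo(1, 200))  // cloud answers within the budget
	fmt.Println(demo(200, 10)) // cloud exceeds the budget, local call is used
}
```

The extra latency mentioned above shows up here as the timeout spent waiting before the local call starts.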

4. Now: our current situation

4.1 Basic platform construction

  • The Go/C++ Faas Runtime has been developed, and the interconnection with fc-ginkgo /CSE and Aone has been completed. The stable version 1.0 has been released.
  • A lot of stability work has been done: graceful shutdown, performance optimization, and compiler optimization. Working with the compiler optimization team of Alibaba Cloud's basic software department, we optimized the compilation of C++ Faas, which improved performance significantly.
  • C++/Go Faas integration with EagleEye and Sunfire monitoring is complete, so functions are observable.
  • The pooling function is complete and provides second-level elasticity. After a runtime image pool was added to CSE, the time to bring up a new instance dropped from minutes to seconds.
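The effect of pooling can be pictured with a small Go sketch in which pre-started instances wait in a buffered channel and a cold start happens only when the pool is empty. This is an assumed simplification: the source describes a runtime image pool in CSE, not this data structure, and `WarmPool` and `Instance` are hypothetical names.

```go
package main

import "fmt"

// Instance is a hypothetical function instance.
type Instance struct {
	ID   int
	Warm bool // true if the instance was started ahead of time
}

// WarmPool hands out pre-started instances. It is meant for use from a single
// goroutine; the ID counter is not synchronized.
type WarmPool struct {
	warm   chan *Instance
	nextID int
}

// NewWarmPool pre-starts size instances.
func NewWarmPool(size int) *WarmPool {
	p := &WarmPool{warm: make(chan *Instance, size)}
	for i := 0; i < size; i++ {
		p.nextID++
		p.warm <- &Instance{ID: p.nextID, Warm: true}
	}
	return p
}

// Acquire returns a warm instance when one is available and otherwise
// falls back to a (slower) cold start.
func (p *WarmPool) Acquire() *Instance {
	select {
	case inst := <-p.warm:
		return inst
	default:
		p.nextID++
		return &Instance{ID: p.nextID, Warm: false}
	}
}

func main() {
	p := NewWarmPool(2)
	for i := 0; i < 3; i++ {
		inst := p.Acquire()
		fmt.Printf("instance %d warm=%t\n", inst.ID, inst.Warm)
	}
}
```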

4.2 Autonavi Serverless Business Implementation

C++ Faas, Go Faas and node.js Faas have been widely used in autonavi. A few examples:

In the figure above, the first two screenshots show businesses developed with C++ Faas: long-distance weather and search along the route. The last two show businesses developed with Go-Faas: navigation tips and footprint maps.

Autonavi is the largest BU of Serverless applications in Alibaba Group. The Serverless applications that have been deployed have a daily peak value of more than 100,000 QPS.

4.3 Main benefits

What are the benefits of Autonavi’s implementation of the largest Serverless application in the group?

The first and most important benefit is development efficiency. Our end-cloud integration component, built on Serverless, helps the client move to the cloud, removes the dependency on client version releases, and improves client development and iteration efficiency. The BFF layer built on Serverless improves development and iteration efficiency for BFF scenarios.

The second benefit is operation and maintenance efficiency. With Serverless automatic elastic scaling, Autonavi can handle travel peaks more easily. For example, for the annual travel peaks of National Day, May Day, Qingming Festival, and Spring Festival, neither operations nor business developers need to scale out before the holiday and scale in afterwards. Autonavi's peaks also differ from e-commerce flash sales: traffic does not spike within a second, so the second-level elasticity provided by our current pooling technology fully meets the needs of Autonavi's business.

The third benefit is lower cost. Autonavi's traffic is high during the day and low at night, with a large gap between peak and trough and a clear time-of-day pattern. Serverless automatically scales in at night when traffic is low, which greatly reduces server resource costs.

5. Next — Follow-up plan

  • Optimize the use of the group-internal FC Function Compute: together with the FC team, keep improving its performance, stability, and user experience. Using the group's rich high-traffic business scenarios, we will keep polishing the C++/Go Faas Runtime and eventually offer it on the public cloud, to benefit more enterprises in the wave of digital transformation.
  • Roll out Dapr to solve the centralization of outgoing traffic, and gradually switch some C++/Go Faas to calling group middleware through Dapr.
  • Faas chaos engineering, fault drills, and escape capability building. Faas will also take part in the BU's fault drills in the new fiscal year, and the problems found during the drills will be solved one by one.
  • Access edge computing. In the scenario of end-to-end cloud integration, Faas + edge computing can provide lower latency and better user experience.

All of this is a long road. In addition, in FY22 our department will pilot and roll out cloud native. As every engineer knows, there is a long way from technology selection and prototype to actual business adoption.

If you are interested in Serverless, cloud native, or Go application development and want to build something together, please join us (name – technical direction – from Serverless). We offer large-scale production scenarios and a simple, open technical atmosphere. Referrals and self-recommendations are welcome!

This article is adapted from the slides presented by Xiang Yi, senior technical expert at Alibaba, at the "Ali Cloud Serverless Developer Meetup in Shanghai". To get the slides, follow the Serverless official account and reply "PPT" in the chat. Live replay: developer.aliyun.com/live/246653