Technology matures through large-scale practice. In the Java world, Alibaba continually feeds its own practice back into the microservice technology stack. In the Node.js world, Alibaba is driving an unprecedented wave of front-end change, feeding its practice back into the Serverless technology stack and gradually extending it to other languages and to back-end BaaS.

The Serverless Cloud R&D platform is an integrated cloud R&D platform initiated by the Front-end Committee of Alibaba Group. It is built on Function Compute (FC) and serves as the R&D entry point for the whole Node Serverless system, handling R&D, delivery, and operations for Taobao, Fliggy, ICBU, Kaola, AutoNavi, Digital Media and Entertainment, and others. Thousands of front-end and client engineers across the Group now use the platform for business development, covering large-scale scenarios such as marketing and shopping guides, middle- and back-office applications, and industry front-office applications.

Looking at the overall figures for this year’s Double 11, the peak traffic carried by Node Serverless in the Taobao business group (Tao) alone grew from 2K QPS last year to 30K QPS this year, about 15 times last year’s peak. Across the Group as a whole, the peak rose from nearly 5.8K QPS to 50K QPS.

In terms of solutions, we customized capabilities for more scenarios, including the rollout of the Kaola Dart solution and model-driven solutions for shopping guides. In terms of operations, we streamlined both the big-promotion and the day-to-day processes, so that developers spend at least 50% less effort even while handling higher QPS. In terms of R&D experience, we built a solution system that lowers the R&D threshold, lets front-end engineers get started quickly, and improved R&D efficiency by 39%. On the underlying Serverless base, we adapted multiple Serverless platforms and support real-time switching between them to cope with the uncertainty of relying on a single platform.
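The multi-platform adaptation described above can be pictured as an adapter layer with a switchable active target. The following is only a sketch under stated assumptions: the `ServerlessPlatform` interface, the `PlatformRouter` class, and the platform names are hypothetical and are not taken from the platform itself.

```typescript
// Hypothetical adapter contract: each Serverless platform is wrapped
// behind the same minimal interface.
interface ServerlessPlatform {
  name: string;
  invoke(fn: string, payload: unknown): string;
}

// Hypothetical router: forwards invocations to the active platform and
// can be pointed at another platform at runtime.
class PlatformRouter {
  private platforms: Map<string, ServerlessPlatform>;
  private active: ServerlessPlatform;

  constructor(platforms: Map<string, ServerlessPlatform>, initial: string) {
    this.platforms = platforms;
    this.active = this.lookup(initial);
  }

  private lookup(name: string): ServerlessPlatform {
    const platform = this.platforms.get(name);
    if (!platform) throw new Error(`unknown platform: ${name}`);
    return platform;
  }

  switchTo(name: string): void {
    this.active = this.lookup(name);
  }

  invoke(fn: string, payload: unknown): string {
    return this.active.invoke(fn, payload);
  }
}

// Fake platforms that only echo which platform handled the call.
const fake = (name: string): ServerlessPlatform => ({
  name,
  invoke: (fn) => `${name}:${fn}`,
});

const router = new PlatformRouter(
  new Map([
    ["platform-a", fake("platform-a")],
    ["platform-b", fake("platform-b")],
  ]),
  "platform-a",
);

const before = router.invoke("render", {}); // handled by platform-a
router.switchTo("platform-b");              // switch without redeploying callers
const after = router.invoke("render", {});  // handled by platform-b
```

Callers only ever see `router.invoke`, so a switch affects every call made after it without any change on the caller side.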

This article describes the capabilities that the Serverless Cloud R&D platform provides so that tenant businesses can be developed quickly and delivered safely.

The nature of R&D

Teams may be paying high labor costs for “collaboration between people and service reliability”, but the essence of R&D is delivering “business functions”.

Today, traditional “front-end developers” are gradually becoming application developers, and the transition is not easy. Besides low-level operational questions such as what truly on-demand billing and elasticity mean, they also have to think about R&D efficiency. This is why more efficient collaboration models are emerging, why organizational relationships are changing, and why even the whole production relationship between front end and back end is shifting. When we talk about “cloud integration” today, we essentially mean thinking about problems from the user’s perspective and solving business problems in a more efficient way.

In today’s software development, cost control is becoming ever stricter, and output per unit of time is gradually becoming the yardstick for whether a team is efficient.

So, starting from the nature of R&D, here are the problems the Serverless Cloud R&D platform sets out to solve:

  • Make business development lighter, so that developers focus on business logic;
  • Make business development faster, improving the efficiency of product and engineering teams;
  • Make the infrastructure thicker and more stable.

Architecture diagram of Serverless Cloud R&D platform

The platform is built around a few core ideas: Serverless solution customization, which enriches the cloud-integrated developer marketplace and gives developers more choices; a closed loop of cloud-integrated R&D that delivers business faster; and low-cost access to basic BaaS services and business BaaS.

Serverless R&D platform

Serverless business solutions

We define a solution as a set of capabilities that addresses a horizontal or vertical domain across the creation, development, delivery, and operations phases. The main reason we needed solution customization was that developers in different business units had different customization requirements for cloud-integrated scenarios.

We surveyed several business units, including AE, Kaola, and Tao. Initially, the customization capability of the Serverless Cloud R&D platform was weak and could not meet business demands well. The platform needed a degree of open customization, for example Tao’s low-code customization of the R&D panel, or Kaola’s requirement to enter function-level capital-loss risk levels and application risk levels.

However, opening up capabilities touches the creation, development, delivery, and operations stages. How much customization and openness each stage offers has to be weighed by the platform against the collected requirements and its own control requirements. As the saying goes, “people thrive when they move, trees die when they are moved”: a platform has to adapt to survive. After surveying multiple tenants and structuring several key capabilities, we arrived at the open solution customization capability of the Serverless Cloud R&D platform.

The figure above shows the customizable nodes we structured and the scenarios we surveyed.

Based on the structured information in the figure above, we define metadata for each solution. For example, the metadata for the integrated middle- and back-office solution is:

{
  "name": "ice-faas",
  "display_name": "Web integration",
  "description": "Traditional Web integration solution, solve the middle and background development requirements (ICE, React, etc.), and support the development of the middle and background front-end page and FaaS",
  "owner": "*",
  "generator": { "id": 30 },
  "depServer": [],
  "page": {},
  "widget": {},
  "baas": {},
  "ide_plugin": ["midway-helper"],
  "checkConfig": { "cf": true, "cr": true, "fone": true },
  "flow": { "id": 1 },
  "ops": {
    "resource": [{ "type": "faas" }, { "type": "assets" }]
  }
}
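For readers who prefer types, the metadata above can be described with a TypeScript interface. The field names are copied from the JSON example; the types, the absence of optional fields, and the `validateSolution` helper are assumptions inferred from that single example, not the platform’s actual schema.

```typescript
// Shape inferred from the one JSON example above; not an official schema.
interface SolutionMetadata {
  name: string;
  display_name: string;
  description: string;
  owner: string;
  generator: { id: number };
  depServer: unknown[];
  page: Record<string, unknown>;
  widget: Record<string, unknown>;
  baas: Record<string, unknown>;
  ide_plugin: string[];
  checkConfig: Record<string, boolean>;
  flow: { id: number };
  ops: { resource: { type: string }[] };
}

// Hypothetical helper: returns a list of problems, empty when the
// metadata passes these two illustrative checks.
function validateSolution(meta: SolutionMetadata): string[] {
  const problems: string[] = [];
  if (meta.name === "" || meta.name.trim() !== meta.name) {
    problems.push("name must be non-empty with no surrounding whitespace");
  }
  if (meta.ops.resource.length === 0) {
    problems.push("ops.resource must list at least one resource type");
  }
  return problems;
}

const iceFaas: SolutionMetadata = {
  name: "ice-faas",
  display_name: "Web integration",
  description: "Traditional Web integration solution",
  owner: "*",
  generator: { id: 30 },
  depServer: [],
  page: {},
  widget: {},
  baas: {},
  ide_plugin: ["midway-helper"],
  checkConfig: { cf: true, cr: true, fone: true },
  flow: { id: 1 },
  ops: { resource: [{ type: "faas" }, { type: "assets" }] },
};

const problems = validateSolution(iceFaas); // empty for this example
```

A check like the whitespace rule on `name` is the kind of thing a typed schema makes cheap to enforce before a solution is registered.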

So far, the Serverless Cloud R&D platform has accumulated 14 solutions through co-construction: 5 general-purpose solutions and 9 solutions customized for different tenants.

Three typical solutions follow.

Integrated solution

The integrated application solution is built on Midway Hooks. With Serverless + Hooks + “zero” API calls, developers only need to focus on business logic during development and can deliver applications efficiently.

Integrated applications have several advantages:

  • Easy to develop: front end and back end live in the same repository and are developed together seamlessly
  • Easy to deploy: front end and back end are released and deployed together
  • Easy to maintain: back-end code is deployed as Serverless functions, so operations are simple

During development, we also provide a number of features to help developers speed up development.

“Zero API calls”
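The idea behind “zero API calls” is that the front end imports a back-end function and calls it like a local function, while tooling turns the call into a request. The sketch below imitates that with a hand-written stub. `createClient`, `Transport`, and the in-memory transport are hypothetical and do not reproduce the real Midway Hooks API or its build step.

```typescript
// Back-end side: plain functions, written as ordinary code.
const serverFunctions = {
  async getItem(id: number) {
    return { id, title: `item-${id}` };
  },
};
type Api = typeof serverFunctions;

// A transport carries a function name plus JSON-encoded arguments.
// A real setup would send an HTTP request; this one stays in memory.
type Transport = (fn: string, argsJson: string) => Promise<string>;

const inMemoryTransport: Transport = async (fn, argsJson) => {
  const handlers = serverFunctions as Record<string, (...a: any[]) => Promise<unknown>>;
  const handler = handlers[fn];
  if (!handler) throw new Error(`no such function: ${fn}`);
  return JSON.stringify(await handler(...JSON.parse(argsJson)));
};

// Hypothetical client factory: each property access yields a stub that
// serializes its arguments and sends them through the transport.
function createClient<T extends object>(transport: Transport): T {
  return new Proxy({} as T, {
    get: (_target, prop) =>
      async (...args: unknown[]) =>
        JSON.parse(await transport(String(prop), JSON.stringify(args))),
  });
}

// Front-end side: the call reads like a local, fully typed function call.
const api = createClient<Api>(inMemoryTransport);
const pendingItem = api.getItem(7); // Promise<{ id: number; title: string }>
```

The front-end code never builds a URL or parses a response by hand, and the types of `getItem` flow from the back-end definition to the caller.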

Hooks support

Inside Alibaba, we provide two integrated solutions: one for middle- and back-office applications and one for page-building modules. The middle- and back-office solution has been adopted by 300+ internal applications, supporting the middle- and back-office needs of each BU quickly and efficiently.

Tao model-driven solution

Model-driven development is a pattern distilled from Taobao’s shopping-guide business, where a large share of requirements come down to recalling items and completing their data. A configuration panel combines models, data sources, and plug-in configuration to generate business logic code for the business to consume.

The core of the panel is the process canvas on the right, where we aim to solve this class of business problem with a fixed flow that follows a predefined path. In the cloud marketplace’s outsourced light-application development mode, internal engineers produce the materials, and outsourced engineers develop modules, select business fields, and connect the flow. This saves internal engineers much of the cost of wiring flows and jointly debugging modules, and improves overall efficiency by about 10% compared with the traditional development mode. It is also a new collaboration model, with more room for improvement as the material library grows.

Data source (recall) --> Model (completion) --> Extension logic (plug-in)
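The fixed flow above can be read as three composed stages. The sketch below shows one way to express that composition; the `Item` shape, the stage types, and `buildFlow` are hypothetical names used only for illustration and are not the code the panel generates.

```typescript
// Hypothetical item shape for a shopping-guide feed.
interface Item {
  id: number;
  title?: string;
  tags: string[];
}

// The three stage kinds named in the flow above.
type DataSource = () => Item[];           // recall: fetch candidate items
type Model = (items: Item[]) => Item[];   // completion: fill in missing fields
type Plugin = (items: Item[]) => Item[];  // extension logic: custom rules

// Compose the fixed flow: data source -> model -> plug-ins, in order.
function buildFlow(source: DataSource, model: Model, plugins: Plugin[]): () => Item[] {
  return () => plugins.reduce((items, plugin) => plugin(items), model(source()));
}

// Example stages with in-memory data.
const recall: DataSource = () => [
  { id: 1, tags: [] },
  { id: 2, tags: [] },
];
const complete: Model = (items) =>
  items.map((item) => ({ ...item, title: `Item ${item.id}` }));
const tagFirst: Plugin = (items) =>
  items.map((item, index) =>
    index === 0 ? { ...item, tags: [...item.tags, "featured"] } : item,
  );

const runFlow = buildFlow(recall, complete, [tagFirst]);
const feed = runFlow();
```

Because the order of stages is fixed, a configuration panel only has to choose which data source, model, and plug-ins fill each slot.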

The model-driven solution solves this business problem well for Taobao, but many scenarios need more flexible template customization. Going forward, we will make the template configuration of the model-driven solution more flexible, build a better mechanism for accumulating materials on flow nodes, support Web IDE plug-ins, and roll the solution out to more business scenarios, so that each scenario can more easily build its own set of go-to tools.

Kaola Dart integrated solution

Kaola has been experimenting with Flutter since March 2020. Some client and front-end engineers have taken part in Flutter development and are fairly familiar with Dart. The Dart integrated solution was therefore initially intended to help client engineers develop more efficiently. Kaola had already been using the Serverless solution on the Node.js runtime, and for client engineers Dart is friendlier than JavaScript.

With the help of the Function Compute (FC) R&D team, Kaola quickly rebuilt the Active Tab of the Kaola App on a preliminary test version of the Dart runtime and rolled it out in a gray release at the end of September. In mid to late October, work began on connecting the Dart runtime to the DEF platform, so that the DEF Serverless panel will eventually offer the pure Dart function solution. At present, the basic FC-side process has been verified end to end, and the pure Dart solution will be launched soon.

In addition to the Dart AST generation service, Kaola will bring more business scenarios onto the Dart Serverless solution, such as dynamic delivery of App data models, dynamic configuration of business logic, experiments with dynamic Flutter, and cross-app page building.

Beyond the three solutions above, the ICBU team developed the EaaS micro-application solution, and the Tmall industry team developed a native mini-program integrated solution for the light-store scenario. These are not covered here.

Function stability guarantee

At the beginning, we focused on how to implement business logic in Node: how to organize data, how to call Java binary packages, how to integrate with the Aladdin link, and how to fix online bugs quickly. Now that so many businesses run online, our focus has shifted from how to fulfill business requirements to how to fulfill them efficiently and stably.

Online stability is, in essence, the governance of problems. It can be divided into four stages: preventing problems, discovering problems, locating problems, and solving problems.

To prevent problems, we reduce the probability that they occur, limit their impact, enforce release gates, and prepare contingency plans. To discover problems, we aim for full-link monitoring and a reasonable, effective alarm distribution mechanism. To locate problems, we shorten the time it takes: on top of the alarm metadata, automated analysis correlates context to achieve semi-automatic localization, or at least supplies more logical context to shorten manual localization. To solve problems, the fix must be effective, safe, and fast.

Stability measures for big promotions

In big-promotion scenarios, consumer-facing (C-end) scenarios need special protection. The stability measures below have been through several big-promotion stress tests, and the larger the promotion, the more demanding the whole stability effort becomes.

Stability was ensured, but previously we completed the release process by following the documents above. The process was extremely lengthy and ended up condensed into an operations manual. Its content could not be linked to applications and was scattered across documents, so the whole process was long and tedious.

Release process -> operations manual integration

The Serverless R&D platform therefore aims to standardize the whole process: strong/weak dependency review -> contingency plan configuration -> monitoring alarm subscription -> single-link stress test -> operations manual generation. The release process of every function is recorded, so the process is traceable and the documentation accumulates. Planning, stress testing, monitoring, and other steps are semi-automated to shorten release time. We define each process node as an SOP unit, so that SOP processes can be assembled freely according to business characteristics.
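One way to picture SOP units that can be assembled freely is as small named steps sharing a context that records what was done. The sketch below is illustrative only: `SopUnit`, `ReleaseContext`, and `runSop` are hypothetical names, and each unit just logs its completion instead of doing real work.

```typescript
// Context shared by all SOP units while a function goes through release.
interface ReleaseContext {
  functionName: string;
  log: string[]; // record of completed steps, kept for traceability
}

// Hypothetical SOP unit: a named step that updates the context.
interface SopUnit {
  name: string;
  run(ctx: ReleaseContext): void;
}

// Helper that builds a unit which only records that it ran.
const unit = (name: string): SopUnit => ({
  name,
  run: (ctx) => {
    ctx.log.push(`${name}: done for ${ctx.functionName}`);
  },
});

// The five steps named in the text, each as an independent unit.
const dependencyReview = unit("dependency review");
const planConfig = unit("contingency plan configuration");
const alarmSubscription = unit("monitoring alarm subscription");
const stressTest = unit("single-link stress test");
const manualGeneration = unit("operations manual generation");

// Run an assembled SOP in order and return the trace of what happened.
function runSop(functionName: string, units: SopUnit[]): string[] {
  const ctx: ReleaseContext = { functionName, log: [] };
  for (const u of units) u.run(ctx);
  return ctx.log;
}

// Different businesses can assemble different processes from the same units.
const fullSop = [dependencyReview, planConfig, alarmSubscription, stressTest, manualGeneration];
const lightSop = [alarmSubscription, manualGeneration];

const fullTrace = runSop("detail-page", fullSop);
const lightTrace = runSop("internal-tool", lightSop);
```

The returned trace doubles as the raw material for the generated operations manual, since it lists every step that was actually executed for a function.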

Release SOP process

The semi-automated process produces the operations manual and keeps a persistent record linking each function to its manual. It is combined with automatic rate limiting, downstream dependency analysis, and contingency plan generation. For example, by replaying traffic recorded in the pre-release environment, the platform can automatically analyze a function’s strong and weak downstream dependencies and record the owner of each strong dependency, so that the right person can be found immediately when troubleshooting an online problem. Depending on tenant requirements, the platform can also help users deploy across multiple data centers and units to achieve multi-site active-active deployment. All of this makes big promotions a little easier for the business.

Tao business operations manual

Expert emergency response

To address the pain point of slow localization of online problems, the platform also provides an emergency response system. When a function’s success rate drops and triggers an alarm, the platform automatically pulls data for the function and its downstream services, performs error analysis, and quickly produces an error report that is pushed to the function’s developers. The report guides developers back to the R&D platform for damage-control operations such as traffic switching and plan execution. For example, if the function strongly depends on downstream service A and the function’s own success rate has dropped, the report indicates that the owner of service A needs to be contacted.
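The report logic described above can be sketched as a small function. Everything here is an assumption made for illustration: the `Downstream` and `ErrorReport` shapes, the 99% threshold, and the rule that only degraded strong dependencies are listed as suspects are not taken from the platform.

```typescript
interface Downstream {
  service: string;
  strong: boolean;     // strong dependency: its failure fails the function
  owner: string;       // person responsible for the downstream service
  successRate: number; // between 0 and 1, measured over the alarm window
}

interface ErrorReport {
  functionName: string;
  suspects: string[]; // degraded strong dependencies
  contacts: string[]; // owners to contact first
}

// Hypothetical alarm threshold; the real alarm rules are not given in the text.
const SUCCESS_THRESHOLD = 0.99;

// Returns null when no alarm fires, otherwise a report pointing at the
// strong dependencies whose own success rate has also dropped.
function buildReport(
  functionName: string,
  functionSuccessRate: number,
  deps: Downstream[],
): ErrorReport | null {
  if (functionSuccessRate >= SUCCESS_THRESHOLD) return null;
  const degraded = deps.filter((d) => d.strong && d.successRate < SUCCESS_THRESHOLD);
  return {
    functionName,
    suspects: degraded.map((d) => d.service),
    contacts: degraded.map((d) => d.owner),
  };
}

const report = buildReport("detail-page", 0.9, [
  { service: "service-a", strong: true, owner: "owner-a", successRate: 0.8 },
  { service: "service-b", strong: false, owner: "owner-b", successRate: 0.7 },
  { service: "service-c", strong: true, owner: "owner-c", successRate: 0.999 },
]);
```

In this example only `service-a` is both a strong dependency and degraded, so its owner is the single contact in the report; weak dependencies are left out because their failure should not fail the function.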

Tenant operations

Each tenant on the platform has a tenant administrator who is responsible for the stability of the tenant’s functions, including unitized deployment rules, big-promotion control, self-built gateway configuration, container quotas, and tenant-private solutions. The platform provides a set of operations tools for these tasks.

Tenant dashboard

The tenant dashboard helps the administrator observe the service quality of the tenant’s functions and the usage of container quotas. It provides blacklists of functions ranked by error rate and RT, and pushes a weekly governance report to help the administrator operate and maintain the tenant’s functions.
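A blacklist of this kind amounts to ranking functions by a metric and keeping the worst few. The sketch below shows that ranking under assumptions: the `FunctionStats` shape and the top-N rule are hypothetical, since the text does not say how the platform computes its lists.

```typescript
// Hypothetical per-function statistics for one reporting period.
interface FunctionStats {
  name: string;
  requests: number;
  errors: number;
  avgRtMs: number; // average response time in milliseconds
}

function errorRate(s: FunctionStats): number {
  return s.requests === 0 ? 0 : s.errors / s.requests;
}

// Top-N worst functions by a given metric (higher is worse).
function blacklist(
  stats: FunctionStats[],
  metric: (s: FunctionStats) => number,
  n: number,
): string[] {
  return [...stats]
    .sort((a, b) => metric(b) - metric(a))
    .slice(0, n)
    .map((s) => s.name);
}

const weekly: FunctionStats[] = [
  { name: "a", requests: 1000, errors: 50, avgRtMs: 120 },
  { name: "b", requests: 200, errors: 40, avgRtMs: 80 },
  { name: "c", requests: 500, errors: 0, avgRtMs: 300 },
];

const worstByErrorRate = blacklist(weekly, errorRate, 2);
const worstByRt = blacklist(weekly, (s) => s.avgRtMs, 1);
```

Ranking by rate rather than by raw error count puts `b` ahead of `a` here, even though `a` has more errors in absolute terms.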

Function inventory

The function inventory helps the administrator examine the online running status of each function, including the online version, number of containers, runtime version, gray-release status, and unit deployment status, and even check whether the function’s deployment is balanced.
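A balance check can be as simple as comparing container counts across units. The sketch below uses one possible rule, chosen purely for illustration: the `isBalanced` function and its 0.8 ratio are assumptions, not the platform’s definition of balance.

```typescript
// Container count per deployment unit, keyed by a hypothetical unit name.
type UnitContainers = Record<string, number>;

// Hypothetical rule: a deployment is balanced when the smallest unit has
// at least `minRatio` of the containers of the largest unit.
function isBalanced(units: UnitContainers, minRatio = 0.8): boolean {
  const counts = Object.values(units);
  if (counts.length === 0) return true; // nothing deployed, nothing to skew
  const max = Math.max(...counts);
  const min = Math.min(...counts);
  return max === 0 || min / max >= minRatio;
}

const even = isBalanced({ "unit-a": 10, "unit-b": 9 });   // ratio 0.9
const skewed = isBalanced({ "unit-a": 10, "unit-b": 2 }); // ratio 0.2
```

A stricter or looser `minRatio` can be passed per function when some skew between units is expected.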

Big-promotion control

The platform also provides operations control for big promotions. With one click, the administrator can switch the tenant’s functions that take part in a big promotion into big-promotion mode and apply additional configuration, such as capacity settings, Broker rate limiting, and unified gateway-side monitoring and contingency plans, to ensure stability during the promotion.

Some thoughts

The Serverless Cloud R&D platform will keep evolving to improve the efficiency of users’ forward and reverse workflows. L1 is to let users get started at low cost; L2 is to let users carry out R&D at low cost, so that front-end engineers can move further toward application development.

Here are some findings based on time statistics for the forward development workflow:

  • Producing the technical design takes a long time, accounting for 5% of the overall R&D cycle. The main reasons are that service materials are hard to find, service availability is hard to evaluate, and domain models have not been sufficiently accumulated;
  • FaaS development accounts for 25%~30% of the cycle. Model-driven visual orchestration improves efficiency when the materials are fully prepared, but it has not yet been applied at scale;
  • Joint debugging takes a long time, about 20% of the cycle, and relies too heavily on the pre-release environment. According to our statistics, completing a project takes 50 deployments;
  • Stress-testing costs remain, and the cost of getting familiar with the platform is too high.

There are also some findings on the reverse workflow of monitoring and operations:

  • Alarm distribution is imprecise. Because it is currently impossible to tell whether an alarm stems from the underlying framework or from the business layer above it, the architecture team and business engineers often both have to step in;
  • Locating problems is inefficient. A failure-rate alarm, for example, may be caused by the underlying architecture, a downstream service, the data center, or the function itself, and often requires checking several platforms one by one;
  • There is no statistical or overall view of service quality;
  • There is no standardized process that can troubleshoot and resolve 80% of online problems, so resolution depends on each user’s own ability to locate and fix problems.

Conclusion

After more than half a year of transformation, the Serverless Cloud R&D platform has evolved from a simple platform for the engineering pipeline into a full-lifecycle platform covering R&D, release, and operations. The next problem to solve will be lowering the threshold for users.

We hope our practice and exploration of Serverless can offer some inspiration to other companies in the industry, remove some obstacles along the way, and make application development easier.