Author | Liu Yun (Zhijian)  Source | Alibaba Cloud Native official account

Over the past year, Alibaba has made steady progress in its exploration of Service Mesh. This is not only because it believes Service Mesh will be a key component of future cloud computing infrastructure, but also because it wants to use this technology trend to pay off the technical debt accumulated in the past (“technical debt” is not a derogatory term; it is an inherent product of technological development) and, building on current technical thinking and best practices, to create technology that delivers new value and new experiences.

When we explore and practice a new technology, most of the time is spent in a “boring” period. The daily work is not explaining how the new technology will shape the future, but first dealing with the constraints imposed by technical debt, pragmatically creating new value and new experiences for the business, and advancing the new technology hand in hand with the business for a win-win outcome. This article summarizes the insights and achievements from building Service Mesh at Alibaba over the past year.

Realizing incremental business value is essential to growth

Service Mesh is a new platform-based technology, and it must realize incremental value as it develops. From a technical point of view, the value is easy to understand: the volatile parts of the SDK, a product of framework thinking, are sunk into the Sidecar of the Service Mesh, so that middleware technology can evolve and upgrade rapidly without the business noticing. The harder challenge is the value behind this: using platform and systems thinking, instead of the “siloed” framework thinking of the past, to find better solutions to the problems of distributed application architecture.

From a business perspective, the key to adopting a new technology is whether it solves current pain points: whether it significantly reduces machine cost, whether it significantly improves stability, and whether it makes operations and R&D more efficient. These benefits are collectively called business value, that is, the benefits as seen from the business side. Developing Service Mesh must therefore come back to delivering (incremental) business value, and the new technology must be refined around that delivery; otherwise it is hard to achieve long-term results. For a team working on a new technology such as Service Mesh, continuing to harvest results at each milestone is also important for morale. Feeling “needed” reinforces the team's sense that its work has value.

Over the past year, our development strategy shifted from “scale up first, then realize business value” to “realize business value first, then scale up”. In the scale-first phase, the rollout of Service Mesh was challenged on three main points. First, the incremental business value was insufficient: Service Mesh merely moved capabilities that already existed in the Java SDK. Second, the resource cost could not be ignored. Third, the technology was not mature enough: people could not yet see problem-location and troubleshooting methods delivered as usable tools. While these three challenges went unanswered, it was hard to push Service Mesh to large-scale adoption in core applications, even with a top-down push at the company level. In the end, the strategy had to be adjusted to realizing value first.

On the way to realizing that value, some business teams moved from raising the three challenges above to actively thinking about how to use Service Mesh as an opportunity to significantly upgrade the traffic management capability of their business units. This change in thinking quickly allowed those teams to pin down their business pain points and build new solutions with Service Mesh. In the end, the relationship between the two teams changed from a customer-supplier (“Party A and Party B”) relationship to that of comrades-in-arms working together for a win-win outcome.

Looking back on the past year, we can draw two lessons:

  • Whatever the new technology, it is adopted more readily once incremental business value has been realized. However advanced a technology is, it remains only a vision until its incremental value is realized; people do not readily pay for a vision, and rolling out a technology still has to respect the laws of the market. It is also natural that a new technology takes time to mature. If no incremental business value is realized while it matures, no business will be willing to act as a pure guinea pig.

  • Developing a basic technology cannot rely on the basic technology team alone. A business team that actively seeks to solve its business pain points becomes a powerful “ripening agent” for the new technology. The basic technology team lacks a first-hand feel for the business, and full involvement of the business team makes up for this. The chemistry between the two leads to a win-win outcome and raises the cooperation to the level of “comrades-in-arms”. The basic technology team needs to pay particular attention to working with business teams, so that it does not end up building behind closed doors.

Non-intrusion is the key but not the end

As technology evolves, we want to realize value, as far as possible, without imposing any transformation cost on the business, which is one good reason why Istio has used iptables for traffic hijacking since its launch. Alibaba has been well aware of the value of non-intrusive solutions throughout its exploration. The early internal rollout also adopted a non-intrusive solution, and over the past year that solution was extended to support transparent passthrough of traffic.

At the beginning of last year, Alibaba's technical plan for rolling out Service Mesh did not aim for 100% compatibility. For historical reasons, Dubbo's serialization protocols include Hessian2, Java, and other niche options. Because Hessian2 is the dominant protocol, Service Mesh supported only that one. During the rollout, an application that needed to call another application over an unsupported serialization protocol could not be meshed. Moreover, the overall capability building of Service Mesh relies on breakthroughs at individual technical points, which either deliver value directly or lay the groundwork for large-scale deployment by extending to more scenarios; support for large-scale operations, for example, falls into the latter category. In addition, with only Hessian2 supported, the worst case is that value can be delivered only on links serialized with Hessian2, and every application that cannot be meshed shortens the links on which value can be delivered.

To address this, the Service Mesh needs to “support” all RPC serialization protocols. The improved solution is illustrated in the figure below, where application A is meshed. Note that the Sidecar (Envoy) adds transparent passthrough on top of its existing capability, gathering only the necessary statistics for protocols other than Hessian2. As background, the RPC SDK used by service A is still full-featured and retains its service governance capability. In other words, packets that the SDK has already routed can pass through the Sidecar transparently, which keeps services connected.
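The passthrough behaviour described above can be sketched roughly as follows. This is a toy model rather than Envoy's actual implementation: the `RpcPacket` and `Sidecar` classes, the `mesh_routes` table, and the counter names are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RpcPacket:
    serialization: str  # e.g. "hessian2" or "java"
    destination: str    # address already chosen by the SDK's own routing
    payload: bytes      # serialized body


class Sidecar:
    """Toy sidecar: governs Hessian2 traffic, passes everything else through."""

    def __init__(self, mesh_routes):
        # Hypothetical table: SDK-chosen address -> address chosen by the mesh.
        self.mesh_routes = mesh_routes
        self.stats = Counter()

    def handle(self, packet):
        if packet.serialization == "hessian2":
            # Supported protocol: the mesh may override the routing decision.
            self.stats["governed"] += 1
            target = self.mesh_routes.get(packet.destination, packet.destination)
        else:
            # Unsupported protocol: keep the SDK's decision, only count the call.
            self.stats["passthrough"] += 1
            target = packet.destination
        # The payload is forwarded untouched in both branches.
        return target, packet.payload
```

In this sketch connectivity never depends on the Sidecar understanding the payload, which is the property that allows an application using a niche serialization protocol to be meshed.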

In the long run, the non-intrusive solution is definitely not the final state for Dubbo, an RPC protocol with its own service governance capabilities. The reason is that the Dubbo SDK needs to be able to sense whether it should work in Service Mesh mode and, in that mode, delegate responsibilities such as service governance to the Sidecar, saving the memory and CPU overhead of the SDK.

For this reason, over the past year Alibaba has also put considerable effort into building the final-state cloud native solution, so that Service Mesh works well with Dubbo 3.0. The following example illustrates the Service Mesh solution under Dubbo 3.0.

One of the major changes in the final-state Dubbo 3.0 SDK is that it is friendlier to Service Mesh; the overall design treats the cloud native wave as a major technology trend. The main changes related to Service Mesh are:

  • The protocol header adopts the Triple protocol, which is based on gRPC. Everything the Sidecar needs to read or change is placed in the protocol header, so the Sidecar never has to deserialize and re-serialize the message body and is completely indifferent to the serialization protocol the body uses.

  • Disaster recovery when the Service Mesh fails. The Dubbo 3.0 SDK has a Thin mode and a Fat mode, for the Service Mesh mode and the traditional non-Service Mesh mode respectively. In Thin mode the CPU and memory overhead of the SDK is minimized, freeing those resources for the Sidecar. In Fat mode the SDK has comprehensive routing governance capability, and when the Service Mesh fails the SDK takes over service invocation routing.

  • In Service Mesh mode, service registration and de-registration are performed by the Sidecar. In other words, when the SDK works in Service Mesh mode it is completely unaware of the backend registry, so the Service Mesh hides as much of the underlying infrastructure detail as possible.

  • iptables is not used for traffic hijacking; the SDK communicates with the Sidecar directly through local inter-process communication (the TCP/IP loopback interface or a Unix Domain Socket). The value of traffic hijacking is that the SDK's traffic can be managed by the Service Mesh without any SDK upgrade. Since adopting Dubbo 3.0 already requires an SDK upgrade, iptables is dropped to avoid introducing new stability and performance problems.
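The first change in the list, routing on headers while leaving the body untouched, can be sketched as below. This is a simplified illustration, not the Triple wire format: the `Frame` type, the header names `x-service` and `x-version`, and the route table layout are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    headers: dict  # routing metadata the sidecar is allowed to read or change
    body: bytes    # serialized payload, treated as opaque bytes


def choose_cluster(frame, route_table, default_cluster):
    """Pick an upstream cluster using header values only."""
    key = (frame.headers.get("x-service"), frame.headers.get("x-version"))
    return route_table.get(key, default_cluster)


def forward(frame, route_table, default_cluster):
    """Return the chosen cluster and the body, which is never decoded."""
    cluster = choose_cluster(frame, route_table, default_cluster)
    return cluster, frame.body
```

Because `forward` only looks at `frame.headers`, the same code path serves Hessian2, Java serialization, or any other body encoding, which is the point of moving routing information into the header.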

It is worth emphasizing that the SDK can switch back to Fat mode and take over service invocation routing when the Service Mesh fails only on the premise that the Service Mesh and the SDK have basically equivalent capabilities, enough to meet the basic requirements of disaster recovery scenarios. In the long run, service governance in the Service Mesh is bound to evolve faster than in the SDK, so any function that matters for disaster recovery also needs to be implemented in the SDK. Of course, the Service Mesh itself should be designed to guarantee stability systematically, so that the SDK's fallback capability only needs to cover a single machine rather than an entire application cluster.
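The Thin-to-Fat fallback can be pictured with a small sketch. It is not the Dubbo 3.0 SDK API: the `MeshAwareClient` class and the two injected call functions are hypothetical, and a Sidecar failure is modelled simply as a `ConnectionError`.

```python
class MeshAwareClient:
    """Toy client that starts in Thin mode and falls back to Fat mode."""

    def __init__(self, call_via_sidecar, call_direct):
        self.call_via_sidecar = call_via_sidecar  # Thin path: sidecar routes
        self.call_direct = call_direct            # Fat path: SDK routes itself
        self.mode = "thin"

    def invoke(self, request):
        if self.mode == "thin":
            try:
                return self.call_via_sidecar(request)
            except ConnectionError:
                # Sidecar unreachable: this one process switches to Fat mode.
                self.mode = "fat"
        return self.call_direct(request)
```

The switch happens per client process, in line with the idea that the SDK's fallback covers a single machine rather than a whole application cluster. Switching back to Thin mode once the Sidecar recovers is deliberately left out of the sketch.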

Finally, as mentioned earlier, the evolution of Service Mesh needs the participation of the business side, so that solving business pain points drives the adoption of the new technology. Solving a business problem requires some degree of business transformation, which further shows that a non-intrusive solution cannot carry the delivery of business value from start to finish. In other words, a business preparing to adopt Service Mesh should not treat whether adoption requires business transformation as the core consideration. Business transformation in itself is never the problem. The question is whether the transformation solves business pain points and lifts the technology to a higher level, laying a solid foundation for future business development; that is what should be weighed when a business adopts Service Mesh. Of course, in our experience, testing the waters of Service Mesh with a non-intrusive solution is a highly recommended practice. For peers in the industry: if your organization is exploring Service Mesh or cloud native technologies and needs old and new applications to interoperate while evolving incrementally toward the new technology, a non-intrusive solution is a good choice for the transition.

Incremental business value being realized

Over the past year, Service Mesh has found two major points of incremental business value within Alibaba. As the construction work is completed in the coming months, 100,000 applications will be launched at scale.

The first point of incremental business value: the region-based and multi-tenant routing governance capabilities of the internationalization middle platform are sunk into the Service Mesh, achieving unified traffic routing governance and application-level data center disaster recovery. In the past, Java applications in the internationalization middle platform used annotations to specify which routing policy to use. Whenever a routing policy had to change, the code had to be modified and the application released again, which made the whole process quite cumbersome. In addition, disaster recovery for the internationalization middle platform was possible only at data center level, meaning the traffic of an entire data center had to be switched.

After the introduction of Service Mesh, the ability to specify routing policy through annotations in Java applications was removed entirely and moved into the Service Mesh as configuration. Changing an application's routing policy now only requires dynamically delivering a new YAML file, completely decoupled from the application. Furthermore, because routing policies are defined per application, traffic can be switched between data centers at application granularity, which makes disaster recovery more agile and reduces the risk of switching traffic.
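A rough sketch of the difference this makes: once the policy is data rather than an annotation compiled into the application, changing it is a matter of applying a new document. The `RoutePolicyStore` class and the policy fields (`app`, `default`, `tenants`) are hypothetical, and a plain dictionary stands in for the delivered YAML file.

```python
class RoutePolicyStore:
    """Toy holder for per-application routing policies delivered as config."""

    def __init__(self):
        self._policies = {}

    def apply(self, policy):
        # Applying a policy replaces the previous one for that application;
        # no application code changes and no release is involved.
        self._policies[policy["app"]] = policy

    def resolve(self, app, tenant):
        """Return the data center that should serve this app and tenant."""
        policy = self._policies[app]
        return policy.get("tenants", {}).get(tenant, policy["default"])


store = RoutePolicyStore()
store.apply({"app": "checkout", "default": "dc-a", "tenants": {"tenant-x": "dc-b"}})

# Application-level disaster recovery: move only "checkout" out of dc-a.
store.apply({"app": "checkout", "default": "dc-b", "tenants": {"tenant-x": "dc-b"}})
```

Since each policy is keyed by application, the second `apply` call moves only `checkout`; other applications in the same data center keep their own policies.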

The business side was very proactive in thinking about how to make the most of this rare technology upgrade opportunity while exploring business value together with the Service Mesh team. The components that used to provide service governance as a single point are now integrated into the Sidecar of the Service Mesh in a distributed form, which not only removes the previous operations burden but also takes overall business stability a big step forward.

The second point of incremental business value is applying the flexible traffic management capability of Service Mesh to development environment management in the New Retail business group, dynamically creating mutually independent development environments as developers need them. To support development, Alibaba's internal practice is to set up a daily environment that is completely independent of the production environment and to deploy online applications in both environments, so that each application can be developed and debugged. Any application in the daily environment may change because of development needs. To keep applications from affecting each other, a baseline environment is further established within the daily environment. Each application is developed in its own isolated development environment derived from the baseline, rather than being developed and integration-tested directly in the baseline environment. With tens of thousands of applications and developers and thousands of application changes every day, guaranteeing the development environments that developers need every day is both challenging and valuable.

In the past, development environment isolation was designed with framework thinking: different kinds of traffic (for example RPC, messages, cache, and database) each had to plug into the same isolation framework at the protocol level, which made evolution and maintenance quite difficult. Capabilities built mainly around the Java language do not work in multi-language scenarios. In addition, some isolation scenarios are quite difficult to implement without a platform technology such as Service Mesh.

The value of Service Mesh is that it is built for traffic governance, and dynamic, flexible traffic isolation and routing is one of its core capabilities. We extended VirtualService and DestinationRule in Istio and abstracted TrafficLabel as a new CRD; delivering YAML files dynamically labels traffic and application machines. Envoy routes according to the traffic labels and machine labels, which makes it flexible and fast to build development environments that also support multiple languages well. The following figure shows the application deployment and traffic topology of v1.1 and v1.2 under Service Mesh.

In the preceding figure, YAML files are delivered to label specific traffic and applications. Envoy directs traffic to machines carrying the same label, and when no machine carries the corresponding label, a fallback mechanism routes the traffic back to the baseline environment. For example, when application B in development environment 1 calls application C, the traffic is directed to the baseline environment because no machine labelled tag1 can be found for application C.
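The label matching and fallback rule can be sketched in a few lines. This is an illustration of the rule as described, not Envoy's or the TrafficLabel CRD's actual behaviour: the `Instance` type, the `baseline` label value, and the addresses are assumptions.

```python
from dataclasses import dataclass

BASELINE = "baseline"  # hypothetical label carried by baseline machines


@dataclass(frozen=True)
class Instance:
    app: str
    label: str
    address: str


def pick_instances(app, traffic_label, instances):
    """Prefer machines with the traffic's label, else fall back to baseline."""
    matching = [i for i in instances if i.app == app and i.label == traffic_label]
    if matching:
        return matching
    return [i for i in instances if i.app == app and i.label == BASELINE]


instances = [
    Instance("B", BASELINE, "10.0.0.2"),
    Instance("B", "tag1", "10.0.1.2"),
    Instance("C", BASELINE, "10.0.0.3"),
]

# Traffic labelled tag1 reaches B's tag1 machine, but C has none,
# so the call from B to C lands on C's baseline machine.
to_b = pick_instances("B", "tag1", instances)
to_c = pick_instances("C", "tag1", instances)
```

With this rule a development environment only needs machines for the applications that actually changed; every other call falls back to the shared baseline.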

Predictably, building this capability in Service Mesh opens the door to future exploration of Test in Production. In the future, traffic-isolated environments based on Service Mesh will help save the machine cost of building independent development environments and offer a new approach to exploring the next generation of safe production environments. Of course, we still have a long way to go on this road.

At the time of writing, Service Mesh has been deployed on tens of thousands of application instances across Alibaba Group. The capability building of the data plane, control plane, and operations platform has reached a level that supports large-scale application.

The software life cycle theory that cannot be ignored

The author takes this opportunity to share his understanding of software lifecycle theory, in the hope that it helps readers better understand how new technologies develop and seize the opportunities offered by the rare wave of cloud native technology.

Looking at software development statically is very likely to end with bloated, error-prone software that has become a burden. The reason is a failure to understand that software has a life cycle. Like a person, software goes through four phases: formation, growth, maturity, and decline (as shown below). The vertical axis in the figure represents the adaptability of the software to new requirements, that is, how friendly the software is to implementing new requirements; behind it lies whether the relationships between concepts are clear and whether they match intuition and common sense. In essence, it refers to the design quality of the software. The straight lines in the figure only indicate a trend; in reality the trend is more of a fluctuating curve.

The sign that software has entered maturity is how well its functionality fits its original positioning and usage scenarios. Decline sets in when business needs evolve and new scenarios appear, and the conceptual abstraction of the software (also called its “architecture” or “overall design”) is unfriendly to implementing the requirements of the new scenarios, so that newly developed code ends up stuck on like a “plaster patch”. The side effect of long-term decline is that software quality keeps deteriorating and developers' coding experience keeps getting worse.

Another way of looking at the software life cycle is that software engineers' understanding of the requirements deepens over time, and the initial software design can hardly meet the long-term needs of a business that grows more complex by the day. In other words, software decline is inevitable, and technical debt is a natural product of software development.

The key to getting out of decline is to start a new software life cycle, and the most direct way is to “pay down technical debt”, which means refactoring, or solving the problems with new ideas and new technology. At small scale, continuously refactoring to pay down technical debt is what really builds an engineer's ability: in the process, the engineer has to abstract concepts from a personal understanding of the business (or requirements). A good grasp of software design is acquired from this “small” work, and only engineers with good software design skills can master the design of large software systems.

Software lifecycle theory tells us that good software is not software that never changes, but software that can withstand change. Of course, every change needs to be backed by engineering capability: software quality is guaranteed by a combination of means such as unit tests, integration tests, and system tests. When these are missing, it is hard to have confidence in a change, and the tendency is to leave things unchanged.

Platform-based technologies such as Service Mesh need to take software lifecycle theory especially seriously. Platform thinking is about solving common problems and finding a balance between the common and the customized. When a platform technology cannot adapt to the needs of technical or business development and evolve quickly, it naturally becomes a drag on business development rather than a help.

Conclusion

In the coming year, we will continue to explore the value of Service Mesh. In addition to realizing the value of RPC traffic management, we will complete the meshing of RocketMQ and other middleware, further extending the traffic management capability of Service Mesh and realizing more incremental business value.

We are committed to first solving Alibaba's own basic technology problems with Service Mesh, because Alibaba's own business scale needs Service Mesh all the more. In the future, we will also exchange ideas with more peers in the industry, and we look forward to sharing the experience we have gained and entering the cloud native era hand in hand with customers.

If you would like to get in touch or want to join the Alibaba Service Mesh team, please write to [email protected].