Author: Zhi Jian

Software evolves iteratively. To some extent, we worry less about software being imperfect than about it being too slow to iterate. In distributed software, how to verify a new version quickly and safely has always been a topic of concern and exploration, and the emergence of the service mesh pushes that exploration to a new level. The concept of “swimlanes” is not new in distributed software, but this time we build it on service mesh technology and take full advantage of the flexible traffic governance that cloud native technology naturally offers. This article shares the full-link traffic marking and routing capability that we have built up at Alibaba Cloud, which brings a new experience to service mesh technology and delivers new value from it.

Concepts and Scenarios

Figure 1 uses the Bookinfo sample application provided by Istio to illustrate the key concepts in the usage scenarios. The purple rounded boxes represent Envoy proxies. All the swimlanes in the figure are of the same nature; the different names only distinguish the scenarios or users they serve.

• Baseline: All services of a business are deployed to this environment. Baselines can come from a real production environment, or they can be built for development work that is completely separate from the production environment.

• Traffic lane (swimlane): A soft environment isolated from the baseline environment. Machines (that is, pods in Kubernetes) are added to a swimlane by labeling them. Naturally, the machines in a swimlane remain connected to the machines in the baseline at the network level.

• Traffic fallback: A swimlane does not need to deploy the same set of services as the baseline environment. If a service that the call chain depends on does not exist in the swimlane, traffic must fall back to the baseline environment and flow back into the swimlane when necessary. For example, the Dev1 swimlane in Figure 1 does not have the Reviews service that the ProductPage service relies on, so traffic falls back to the Reviews service in the baseline (shown by the dark blue line). The Reviews service in the baseline then sends traffic back to the Ratings service in the Dev1 swimlane.

• Traffic label passthrough: Every sidecar must automatically add the traffic label carried by an inbound request to each outbound request forked from it, so that the label is passed along the full link and traffic can be routed by it. Otherwise, traffic cannot move back and forth between the swimlane and the baseline.

• Entry service: The first service that traffic reaches when it enters a swimlane. In Figure 1, a service whose box has a triangle on the left side of its border is an entry service.

Figure 1
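The fallback behavior described above can be sketched in a few lines of Python. This is only an illustration: the DEPLOYED table and the function names are hypothetical, and the table contains only the three services that the Dev1 example mentions.

```python
# Hypothetical deployment table: the lanes in which each service has instances.
# It mirrors the Dev1 example: ProductPage and Ratings are deployed in dev1,
# Reviews exists only in the baseline.
DEPLOYED = {
    "productpage": {"baseline", "dev1"},
    "reviews": {"baseline"},
    "ratings": {"baseline", "dev1"},
}


def pick_lane(service: str, traffic_label: str) -> str:
    """Use the labeled lane if the service is deployed there, else fall back."""
    return traffic_label if traffic_label in DEPLOYED[service] else "baseline"


def trace_chain(chain: list[str], traffic_label: str) -> list[tuple[str, str]]:
    """Resolve every hop of a call chain for traffic carrying one label."""
    return [(service, pick_lane(service, traffic_label)) for service in chain]


if __name__ == "__main__":
    for service, lane in trace_chain(["productpage", "reviews", "ratings"], "dev1"):
        print(f"{service} -> {lane}")
```

Because every hop is resolved against the same label, traffic can leave the swimlane at Reviews and still return to it at Ratings. That only works if the label survives each hop, which is why label passthrough is a prerequisite.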

Swimlane technology can be used in the following scenarios:

• Daily development of a single service, or joint debugging across multiple services. The developer creates a swimlane, deploys the service with new functionality to it, and defines rules based on traffic characteristics to direct test traffic into the swimlane for verification. Because a swimlane only needs the new versions of the services under test, there is no need to set up a full-link test environment. In this scenario, pay attention to the data that test traffic persists to storage, and clean up the dirty data left behind during development and joint debugging.

• Full-link canary (grayscale) release. When a major feature involving multiple services goes live, swimlanes allow more comprehensive functional verification in a full-link canary mode. After the full-link functionality passes acceptance, the new versions of the services are released to the baseline.

• Extra protection for critical business. For businesses such as retail (for example, POS checkout), a software failure could trigger a serious public backlash. In this case, swimlanes can isolate the business traffic to provide an extra layer of protection.

Technical implementation

Traffic marking schemes and implementation

Depending on where traffic is marked, swimlane technology can be used in three different scenarios. Note that although the scenarios differ, the technical implementation is exactly the same as far as the service mesh is concerned; the scenarios are listed only to help readers understand.

Figure 2 illustrates scenario 1. In this scenario, there is a first-tier gateway, call it an API gateway (for example, Nginx), in front of the service mesh's Ingress gateway. Typically, an API gateway can mark traffic by adding headers, based on traffic characteristics, before it forwards incoming requests. In the figure, the HTTP header x-asm-traffic-lane: dev1 is added to specific traffic, indicating that it should be routed to the Dev1 swimlane. In this scenario, no traffic marking is needed inside the service mesh.

Figure 2

Figure 3 illustrates scenario 2. In this scenario, client traffic goes directly to the Ingress gateway of the service mesh. After the Ingress gateway identifies the traffic using Istio's native VirtualService matching rules, it adds an HTTP header named x-asm-traffic-lane to the forwarded request and routes the traffic to the corresponding swimlane.

Figure 3

Figure 4 illustrates scenario 3. In essence, this scenario is exactly the same as scenario 2: Istio's native VirtualService matching rules identify the traffic and add an HTTP header named x-asm-traffic-lane. The only difference is that the Envoy doing the marking is the Ingress gateway in scenario 2 and a sidecar in scenario 3.

Figure 4.

Once the traffic is marked, every Envoy in the service mesh carries out full-link label propagation and routing based on the traffic label and the configuration delivered by the control plane.

Traffic label passthrough

Figure 5 illustrates the traffic between a service and its sidecar Envoy in the service mesh.

Figure 5

From the Envoy's perspective, the traffic consists of the forwarding of both inbound and outbound flows. I1 is inbound traffic, which the Envoy forwards to the local Svc A as I2. O1 is outbound traffic (issued because Svc A needs to call another service to process the inbound request), which the Envoy receives and forwards to the external callee as O2. Inbound and outbound here refer only to requests, not to the responses that correspond to them. Naturally, one inbound request can lead to multiple outbound requests (“forks”), depending on the business logic of Svc A.

The core problem of swimlane technology is how to make every outbound request forked from a labeled inbound request carry the same label. Our solution combines it with distributed tracing (for example, OpenTelemetry). Distributed tracing uses a traceId to uniquely identify a call chain tree: after the root request is assigned a globally unique traceId, every new call forked from it must carry the same traceId in an HTTP header. In other words, the service developer must write code to ensure that this header is propagated into subsequent service calls (for example, by calling the OpenTelemetry SDK to propagate the header). The prerequisite for using swimlane technology is therefore that every service uses distributed tracing, a prerequisite that is easy to meet because tracing is one of the best practices of microservice architecture. Going back to Figure 5, when Svc A receives and processes an I2 request and needs to make an O1 call, ensuring that the traceId header in I2 is propagated to the O1 request is a detail that the developer of Svc A must pay special attention to.
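As a minimal sketch of what that propagation means inside a service, the Python below copies the trace header from an inbound request onto an outbound one using only the standard library. The helper names and the ratings URL are hypothetical, and x-request-id is assumed as the trace header; a real service would more likely rely on its tracing SDK to do this.

```python
import urllib.request

# Assumed trace header; the actual name depends on the tracing system in use.
TRACE_HEADER = "x-request-id"


def propagated_headers(incoming: dict[str, str]) -> dict[str, str]:
    """Return the trace header of the inbound request (I2), if it has one.

    Header names are compared case-insensitively, as HTTP requires.
    """
    for name, value in incoming.items():
        if name.lower() == TRACE_HEADER:
            return {TRACE_HEADER: value}
    return {}


def build_outbound_request(url: str, incoming: dict[str, str]) -> urllib.request.Request:
    """Build the outbound request (O1) so that it carries the same trace header."""
    return urllib.request.Request(url, headers=propagated_headers(incoming))


if __name__ == "__main__":
    inbound = {"X-Request-Id": "abc-123", "accept": "*/*"}
    # Hypothetical downstream URL; the request is built here but not sent.
    request = build_outbound_request("http://ratings:9080/ratings/0", inbound)
    print(request.header_items())
```

Note that the service forwards only the trace header. It does not need to know about the traffic label at all; restoring the label on the outbound request is left to the sidecar.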

Once every service request in the service mesh carries a traceId, implementing full-link traffic label passthrough in Envoy is simple. It basically takes the following steps:

• Envoy maintains an internal mapping table that records the mapping between traceId and traffic label. For example, in Figure 5 the traffic label is carried in the x-asm-traffic-lane HTTP header: x-asm-traffic-lane: dev1 means the traffic label is dev1, and x-asm-traffic-lane: canary means the traffic label is canary.

• When request I1 enters the Envoy, the Envoy adds a record to the mapping table based on the traceId and the traffic label carried in the request.

• For each O1 request it receives, the Envoy looks up the traffic label in the mapping table by the traceId in the request, adds the label to the O2 request, and forwards it.
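The three steps above can be sketched as a small Python class. It models the idea only and is not Envoy's actual implementation: the class and method names are hypothetical, header names are assumed to be lowercased already, and a real proxy would also need to evict entries once a request completes.

```python
class LaneLabelTable:
    """Toy model of the per-sidecar mapping from traceId to traffic label."""

    def __init__(self, trace_header: str = "x-request-id",
                 label_header: str = "x-asm-traffic-lane") -> None:
        self.trace_header = trace_header
        self.label_header = label_header
        self._labels: dict[str, str] = {}

    def on_inbound(self, headers: dict[str, str]) -> None:
        """Step 2: record traceId -> label when an inbound request (I1) arrives."""
        trace_id = headers.get(self.trace_header)
        label = headers.get(self.label_header)
        if trace_id and label:
            self._labels[trace_id] = label

    def on_outbound(self, headers: dict[str, str]) -> dict[str, str]:
        """Step 3: look up the label for an outbound request (O1) by its traceId
        and return the headers to forward (O2) with the label attached."""
        forwarded = dict(headers)
        label = self._labels.get(headers.get(self.trace_header, ""))
        if label is not None:
            forwarded[self.label_header] = label
        return forwarded


if __name__ == "__main__":
    table = LaneLabelTable()
    table.on_inbound({"x-request-id": "t-1", "x-asm-traffic-lane": "dev1"})
    print(table.on_outbound({"x-request-id": "t-1"}))
```

An outbound request whose traceId was never recorded is forwarded unchanged, so unlabeled traffic is unaffected.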

The advantage of traceId-based labeling in the service mesh is that both marking traffic and passing the label along are completely decoupled from the services. This capability moves down into the service mesh, which is good at traffic governance, so the flexibility of traffic scheduling can be unlocked further.

Traffic label and traceId definition

We added TrafficLabel, a new CRD, alongside Istio's existing CRDs. The reason for adding a new CRD rather than extending VirtualService directly is that VirtualService was designed around the application dimension from the start. When a business is complex enough that many applications need to be placed in a swimlane, the VirtualService of every application would have to be changed, and the resulting timeliness and operability can become a problem. Another way to extend VirtualService would be to let it carry global rules, but that requires a rule merge mechanism, which is also problematic in practice. The Istio community has discussed the need to merge multiple VirtualServices at length. Currently, merging is supported only on gateways, not on sidecars, because of concerns about failures caused by differing merge order.

Figure 6 illustrates how a TrafficLabel CR can be used to define a globally effective traffic labeling method in the istio-system root namespace. It defines x-asm-traffic-lane as the HTTP request header that stores the traffic label (for example, dev1, dev2, or canary), and obtains the traceId from x-request-id. Users can set the traceId header according to the tracing system they have chosen. Here it is taken from the x-request-id header because Envoy generates x-request-id as a globally unique identifier. Using x-request-id as the mapping key also means that we can use the Bookinfo sample provided by the Istio open source community directly to demonstrate the effect of swimlanes, because all the services in Bookinfo propagate the x-request-id header from inbound requests to outbound requests.

apiVersion: istio.alibabacloud.com/v1beta1
kind: TrafficLabel
metadata:
  name: global-traffic-label
spec:
  rules:
  - labels:
      - name: x-asm-traffic-lane
    protocols: "http"
    traceIdHeader: x-request-id
  hosts:
    - "*"

Figure 6.

Routing by traffic label

To support routing by traffic label, Istio's VirtualService needs to be extended so that the destination field can specify the traffic destination with a variable such as $x-asm-traffic-lane, as shown in Figure 7. In other words, traffic carrying the header x-asm-traffic-lane: dev2 goes to dev2, which is a subset named dev2 defined by the DestinationRule shown in Figure 8. Note that the name x-asm-traffic-lane in the VirtualService in Figure 7 is the same as the name defined in the TrafficLabel in Figure 6.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews
  http:
  - match:
    - headers:
        end-user:
          exact: dev2
    route:
    - destination:
        host: reviews
        subset: dev2
      fallback:
        case: noinstances|notavailable
        target:
          host: reviews
          subset: baseline
      headers:
        request:
          set:
            x-asm-traffic-lane: dev2
  - route:
    - destination:
        host: reviews
        subset: $x-asm-traffic-lane
      fallback:
        case: noinstances|notavailable
        target:
          host: reviews
          subset: baseline
  - route:
    - destination:
        host: reviews
        subset: baseline

Figure 7.

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  subsets:
  - labels:
      version: v2
    name: baseline
  - labels:
      version: v3
    name: dev2

Figure 8.

As the DestinationRule definition in Figure 8 shows, only dev2 is defined besides the baseline, and Figure 7 is the VirtualService definition for that case. The corresponding usage scenario is the baseline and Dev2 swimlanes in Figure 1.

Product implementation
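To make the rule order in Figure 7 concrete, the Python sketch below resolves the destination subset for a request to the reviews service. It is an interpretation rather than the mesh's actual evaluation logic: the function name and the available parameter (the subsets that currently have usable instances) are hypothetical, and the fallback condition noinstances|notavailable is simplified to "the subset is not in available".

```python
# Subsets defined by the DestinationRule in Figure 8 (subset name -> version label).
SUBSETS = {"baseline": "v2", "dev2": "v3"}


def resolve_reviews_subset(headers: dict[str, str],
                           available: set[str]) -> tuple[str, dict[str, str]]:
    """Return the chosen subset and the request headers after routing.

    `available` is a hypothetical input: the subsets that have usable instances.
    """
    headers = dict(headers)
    if headers.get("end-user") == "dev2":
        # First rule: match on end-user, mark the traffic, and target dev2.
        # Assumption: the header is set even if the route falls back.
        headers["x-asm-traffic-lane"] = "dev2"
        target = "dev2"
    else:
        # Second rule: the subset is named by the $x-asm-traffic-lane variable.
        target = headers.get("x-asm-traffic-lane", "")
    if target in SUBSETS and target in available:
        return target, headers
    # Fallback target of the first two rules, and the final catch-all rule.
    return "baseline", headers


if __name__ == "__main__":
    print(resolve_reviews_subset({"end-user": "dev2"}, {"baseline", "dev2"}))
    print(resolve_reviews_subset({"x-asm-traffic-lane": "dev2"}, {"baseline"}))
```

The first call enters the swimlane at reviews; the second shows already-labeled traffic falling back to the baseline when the dev2 subset has no instances.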

Ease of use is in the spotlight in the cloud native era, and we understand what that means. For this reason, when designing the product interactions, we tried to set aside what we already know, to think and optimize from the user's point of view, and to strike a balance between functionality and ease of use.

We assume that before using swimlanes, the user has already built a baseline environment that contains all services. In K8s, a baseline environment is typically deployed in a dedicated namespace so that the services in it are easier to operate and manage. Creating a swimlane only requires a name. The rest of this section starts by creating a swimlane called dev2.

Once a swimlane is created, you need to publish services to it. Because a service being published already exists in the baseline environment and its K8s Service resource has already been created, publishing it to the swimlane actually creates a Deployment under the corresponding Service, which intuitively means creating another version of the existing service. As you would expect, the publishing action includes confirming the baseline version, the number of instances, and the container image address.

After a service is published to a swimlane, check the swimlane's service list to make sure that all services have started properly. At this point, no traffic enters the swimlane yet. You need to configure traffic diversion rules to divert baseline traffic into the swimlane.

Traffic diversion rules can be configured based on HTTP headers, URIs, and cookies to select precisely the test traffic that should enter the swimlane. The rule in this example directs HTTP traffic whose end-user header is dev2 to the dev2 swimlane. When you configure rules, make sure that you specify the entry service correctly.

After the traffic diversion rule takes effect, you can log in to the web page as user dev2 to see the effect of the services in the dev2 swimlane. The following two figures show what the page looks like for the pure baseline and for the dev2 swimlane, respectively. Because the ProductPage and Details services are not deployed in the dev2 swimlane, requests to them fall back to the baseline. As a result, the content of The Comedy of Errors and Book Details is exactly the same in the two figures.

After a service is published to a swimlane, the swimlane's service list shows a traffic comparison between each service and its baseline version, which helps developers better understand how the swimlane services are running.

In addition, the service topology diagram clearly shows the calls between services in the dev2 swimlane (lane-dev2 in the figure).

Summary and outlook

The service mesh-based swimlane technology we explored lets developers create an isolated environment for development testing or for protecting critical business in seconds, and minimizes the “blast radius” with precise traffic diversion rules. It delivers the new experience and new value of cloud native service mesh technology.

Next, we will go further in a scenario-driven way and connect the swimlane and canary release features so that users can use them intuitively. At the functional level, we will extend the protocols supported in swimlanes, such as RocketMQ and Dubbo 3.0, and enrich the application scenarios of swimlane technology to maximize its value.

Finally, we will continue to build a modern service governance platform for microservice architecture under the concept of Service Mesh as Infra, and work with industry partners to accelerate the development and adoption of this new technology.

About the author: Li Yun (alias: Zhi Jian) is the technical lead for service mesh hybrid cloud products at Alibaba Cloud. Since 2018, he has led a team in Alibaba Group that develops and builds service mesh technology, and he has spoken about cloud native and service mesh technology at QCon many times.
