Author | Zeng Yuxing

Review & Proofread: Zeng Yuxing (Yu Zeng)

Editing & Typesetting: Wen Yan

Background

In a microservice architecture, building a complete test environment to verify new business functions before they go live is time-consuming, and it becomes harder as the number of microservices grows. Such an environment is expensive, and it must be maintained separately and continuously so that functionality can be verified efficiently before each new release. As the business grows larger and more complex, multiple environments are often needed. This is a cost and efficiency challenge that is common across the industry and hard to solve. If a new version could be verified within the production system itself before going live, the savings in labor and money would be considerable.

Beyond functional verification during development, introducing grayscale release in production helps control the risk and blast radius of rolling out a new version. In a grayscale release, production traffic with certain characteristics, or a certain proportion of it, is routed to the service version under verification, so that the running state of the new version can be checked against expectations.

Alibaba Cloud ASM Pro (see the related links at the end of the article) builds a full-link grayscale solution on top of Service Mesh that helps with both scenarios above.

ASM Pro product functional architecture diagram:

The core capabilities are the extensions shown in the figure above: traffic marking, routing by label, and traffic fallback. They are described in detail below.

Scenarios

Common scenarios for full-link grayscale release are as follows:

Taking Bookinfo as an example, inbound traffic carries the tag of the desired group. The Sidecar reads the desired tag from the request context (header or context) and routes the traffic to the corresponding tag group. If that tag group does not exist, the traffic falls back to the Base group by default; the fallback policy is configurable. The implementation details are described below.

Tagging of inbound traffic: request traffic is generally tagged at the gateway layer, for example by a tagging plug-in. If the userID falls in a certain range, for instance, the request is tagged as grayscale traffic. Because gateway choices and implementations vary widely in practice, gateway implementation is not discussed in this article.

Below we focus on how to propagate traffic labels across the full link and achieve full-link grayscale with ASM Pro.

Implementation principle

Inbound refers to the traffic of requests sent to the App, and Outbound refers to the traffic of requests initiated by the App.

The figure above shows the typical traffic path of a business application once the mesh is enabled: the App receives an external request P1 and then calls the interface of another service that it depends on. The traffic path of the request is P1 -> P2 -> P3 -> P4, where P2 is the Sidecar forwarding P1 and P4 is the Sidecar forwarding P3. To achieve full-link grayscale, P3 and P4 need to obtain the traffic label carried by P1 so that the request can be routed to the back-end service instances for that label, and P3 and P4 must carry the same label themselves. The key question is how to pass the label along without the application being aware of it, so that the label is transmitted transparently across the whole link. This is the key technology of full-link grayscale. ASM Pro implements it on top of the traceId used by distributed tracing technologies (for example, OpenTracing and OpenTelemetry).

In distributed tracing, a traceId uniquely identifies a complete call chain. Every fan-out call an application makes on that chain carries the traceId of the originating request, propagated by the distributed tracing SDK. The ASM Pro full-link grayscale solution builds on this widely adopted practice in distributed application architectures.
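What the tracing SDK does for each fan-out call can be sketched roughly as follows. This is a minimal illustration, not the code of any particular SDK: the function name outbound_headers is hypothetical, and treating x-request-id and x-b3-traceid as the complete set of trace headers is an assumption based on the headers mentioned in this article.

```python
from typing import Dict, Mapping, Optional

# Trace headers to propagate. x-request-id and x-b3-traceid appear in this
# article; treating exactly these two as the full set is an assumption.
TRACE_HEADERS = ("x-request-id", "x-b3-traceid")


def outbound_headers(inbound: Mapping[str, str],
                     extra: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build headers for a fan-out call, carrying over the inbound trace headers."""
    headers = {name.lower(): value for name, value in inbound.items()
               if name.lower() in TRACE_HEADERS}
    headers.update(extra or {})
    return headers


inbound = {"X-Request-Id": "trace-1", "Cookie": "session=abc"}
fanout = outbound_headers(inbound, {"accept": "application/json"})
```

Only the trace header is copied from the inbound request; everything else on the outbound call is chosen by the application.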

In the figure above, the inbound and outbound traffic seen by the Sidecar are completely independent; the Sidecar cannot tell how they correspond, or whether one inbound request triggered several outbound requests. In other words, the Sidecar does not know whether requests P1 and P3 in the figure are related.

In the ASM Pro full-link grayscale solution, requests P1 and P3 are associated through the traceId; specifically, the Sidecar relies on the x-request-id trace header. The Sidecar maintains a mapping table that records the mapping between traceIds and labels. When the Sidecar receives request P1, it stores the traceId and label of the request in this table. When it receives request P3, it looks up the label for that traceId in the table and adds it to request P4, which enables full-link marking and label-based routing. The following diagram shows a rough example of how this works.
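The mapping table can be sketched in a few lines. This is a simplified model under stated assumptions, not the Sidecar's actual implementation: the class LabelTable and the label header name x-asm-traffic-label are hypothetical, and a real proxy would also need to expire old entries.

```python
from typing import Dict

TRACE_HEADER = "x-request-id"          # trace header named in the text
LABEL_HEADER = "x-asm-traffic-label"   # hypothetical header carrying the label


class LabelTable:
    """Hypothetical per-Sidecar table mapping traceId -> traffic label."""

    def __init__(self) -> None:
        self._labels: Dict[str, str] = {}

    def on_inbound(self, headers: Dict[str, str]) -> None:
        """P1: remember the label carried by an inbound request."""
        trace_id = headers.get(TRACE_HEADER)
        label = headers.get(LABEL_HEADER)
        if trace_id and label:
            self._labels[trace_id] = label

    def on_outbound(self, headers: Dict[str, str]) -> Dict[str, str]:
        """P3 -> P4: copy the request headers and attach the label for its traceId."""
        forwarded = dict(headers)
        label = self._labels.get(headers.get(TRACE_HEADER, ""))
        if label is not None:
            forwarded[LABEL_HEADER] = label
        return forwarded


table = LabelTable()
table.on_inbound({"x-request-id": "trace-1", "x-asm-traffic-label": "gray"})
p4 = table.on_outbound({"x-request-id": "trace-1"})       # label is attached
other = table.on_outbound({"x-request-id": "trace-9"})    # unknown trace: unchanged
```

The application only has to forward x-request-id; the label itself never passes through application code.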

In other words, ASM Pro’s full-link grayscale capability requires the application to use distributed tracing. Applications that do not use it will need some retrofitting. For Java applications, a Java Agent can propagate the traceId between inbound and outbound requests via AOP, without changes to the application code.

Implementing traffic marking

ASM Pro introduces a new TrafficLabel CRD that defines where the Sidecar obtains the traffic label to be passed through. The YAML below defines the source of the traffic label and specifies that the label is stored in OpenTracing (specifically, the x-trace header). The traffic label is named trafficLabel, and its value is taken from $getContext(x-request-id) first and, failing that, from $(localLabel) of the local environment.

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: TrafficLabel
    metadata:
      name: default
    spec:
      rules:
      - labels:
          - name: trafficLabel
            valueFrom:
            - $getContext(x-request-id)   # with Alibaba Cloud ARMS, use x-b3-traceid
            - $(localLabel)
        attachTo:
        - opentracing
        # protocols the rule applies to
        protocols: "*"

The CR definition consists of two parts: obtaining the label and storing it.

  • Obtaining logic: the traffic label is first read from the field defined in the protocol context or header. If none is found there, the Sidecar uses the traceId to look up its local map, which records the mapping from traceId to traffic label. If the map contains an entry, the traffic is marked with that label; if not, the traffic is marked with localLabel, the label associated with the local deployment environment. The name of this label is ASM_TRAFFIC_TAG, and in an actual deployment it can be set through the CI/CD system.

  • Storage logic: attachTo specifies the field of the protocol context in which the label is stored. For example, HTTP uses a header field and Dubbo uses the RPC context. The specific field is configurable.
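The lookup order described above (request context, then the local traceId map, then localLabel) can be sketched as a single function. This is an illustrative model only: resolve_label, the label header name x-asm-traffic-label, and the representation of the workload's labels as a dict are assumptions, not ASM Pro APIs.

```python
from typing import Mapping

LABEL_HEADER = "x-asm-traffic-label"   # hypothetical field holding the label
TRACE_HEADER = "x-request-id"          # trace header named in the text
LOCAL_LABEL_NAME = "ASM_TRAFFIC_TAG"   # label name given in the text


def resolve_label(headers: Mapping[str, str],
                  trace_map: Mapping[str, str],
                  workload_labels: Mapping[str, str]) -> str:
    """Return the traffic label for a request, trying each source in order."""
    # 1. A label already present in the protocol context / header wins.
    label = headers.get(LABEL_HEADER)
    if label:
        return label
    # 2. Otherwise look up the label recorded earlier for the same traceId.
    trace_id = headers.get(TRACE_HEADER)
    if trace_id is not None and trace_id in trace_map:
        return trace_map[trace_id]
    # 3. Otherwise fall back to the label of the local deployment (localLabel).
    return workload_labels.get(LOCAL_LABEL_NAME, "")


trace_map = {"trace-1": "gray"}
workload = {"ASM_TRAFFIC_TAG": "dev-x"}
from_header = resolve_label({"x-asm-traffic-label": "blue"}, trace_map, workload)
from_map = resolve_label({"x-request-id": "trace-1"}, trace_map, workload)
from_local = resolve_label({"x-request-id": "trace-2"}, trace_map, workload)
```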

TrafficLabel tells us how to mark traffic and pass the label along, but that alone does not achieve full-link grayscale. We also need to route on the traffic label, that is, “routing by label”, together with fallback logic so that traffic can degrade gracefully when the route destination does not exist.

Route by traffic label

This implementation extends Istio’s VirtualService and DestinationRule.

Define a Subset in DestinationRule

Each user-defined subset corresponds to one value of trafficLabel.

    apiVersion: networking.istio.io/v1alpha3
    kind: DestinationRule
    metadata:
      name: myapp
    spec:
      host: myapp/*
      subsets:
      - name: myproject          # project environment
        labels:
          env: abc
      - name: isolation          # isolation environment
        labels:
          env: xxx               # machine group
      - name: testing-trunk      # trunk environment
        labels:
          env: yyy
      - name: testing            # daily environment
        labels:
          env: zzz
    ---
    apiVersion: networking.istio.io/v1alpha3
    kind: ServiceEntry
    metadata:
      name: myapp
    spec:
      hosts:
      - myapp/*
      ports:
      - number: 12200
        name: http
        protocol: HTTP
      endpoints:
      - address: 0.0.0.0
        labels:
          env: abc
      - address: 1.1.1.1
        labels:
          env: xxx
      - address: 2.2.2.2
        labels:
          env: zzz
      - address: 3.3.3.3
        labels:
          env: yyy

A subset can be specified in two ways:

  • labels matches the nodes (endpoints) of the application that carry specific labels.
  • ServiceEntry specifies the IP addresses that belong to a subset. Unlike labels, the IP addresses are given directly in the configuration rather than obtained from a registry (K8s or another type). This suits Mock environments, where nodes are not registered with a service registry.
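How a subset's labels select endpoints can be illustrated with the addresses and env values from the example above. This is a sketch of the matching idea only; the data structures and the function endpoints_for are hypothetical and do not reflect how the mesh stores this configuration.

```python
from typing import Dict, List

# Subset name -> labels it selects (from the DestinationRule example).
SUBSETS: Dict[str, Dict[str, str]] = {
    "myproject": {"env": "abc"},
    "isolation": {"env": "xxx"},
    "testing-trunk": {"env": "yyy"},
    "testing": {"env": "zzz"},
}

# Statically configured endpoints (from the ServiceEntry example).
ENDPOINTS: List[Dict] = [
    {"address": "0.0.0.0", "labels": {"env": "abc"}},
    {"address": "1.1.1.1", "labels": {"env": "xxx"}},
    {"address": "2.2.2.2", "labels": {"env": "zzz"}},
    {"address": "3.3.3.3", "labels": {"env": "yyy"}},
]


def endpoints_for(subset: str) -> List[str]:
    """Return the addresses whose labels contain every label of the subset."""
    wanted = SUBSETS[subset]
    return [ep["address"] for ep in ENDPOINTS
            if all(ep["labels"].get(k) == v for k, v in wanted.items())]


trunk = endpoints_for("testing-trunk")
daily = endpoints_for("testing")
```

Whether the endpoints come from a registry or from a ServiceEntry, the selection step is the same label comparison.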

Routing based on subset in VirtualService

1) Global default configuration

  • The route section can list multiple destinations in order; traffic is distributed among them in proportion to their weight values.
  • A fallback policy can be specified for each destination. case specifies when the fallback is triggered and can be noinstances (no service instances) or noavailabled (service instances exist but the service is unavailable). target specifies the environment to fall back to. If no fallback is specified, the request is forced onto the current destination.
  • For routing by label, we extend VirtualService so that subset accepts the placeholder $trafficLabel. $trafficLabel means that the target environment is taken from the traffic label of the request, as defined in the TrafficLabel CR.

The global default mode corresponds to swimlanes: traffic stays within a single environment, and an environment-level fallback policy is specified.

The following is an example:

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: default-route
    spec:
      hosts:
      - "*/*"
      http:
      - name: default-route
        route:
        - destination:
            subset: $trafficLabel
          weight: 100
          fallback:
            case: noinstances
            target: testing-trunk
        - destination:
            host: "*/*"
            subset: testing-trunk    # trunk environment
          weight: 0
          fallback:
            case: noavailabled
            target: testing
        - destination:
            subset: testing          # daily environment
          weight: 0
          fallback:
            case: noavailabled
            target: mock
        - destination:
            host: "*/*"
            subset: mock             # Mock center
          weight: 0
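The fallback chain in this example ($trafficLabel, then testing-trunk, then testing, then mock) can be traced with a small model. This is a sketch under assumptions rather than the mesh's routing code: the function pick_subset and the status map are hypothetical, a subset missing from the map is treated as having no instances, and a fallback is assumed to trigger only when the observed state equals the configured case.

```python
from typing import Dict, Optional, Tuple

# Destination -> (fallback case, fallback target), mirroring the example route.
ROUTE: Dict[str, Optional[Tuple[str, str]]] = {
    "$trafficLabel": ("noinstances", "testing-trunk"),
    "testing-trunk": ("noavailabled", "testing"),
    "testing": ("noavailabled", "mock"),
    "mock": None,  # last destination, no fallback configured
}


def pick_subset(traffic_label: str, status: Dict[str, str]) -> str:
    """Walk the fallback chain and return the subset that receives the request.

    status maps a subset name to "ok", "noinstances" or "noavailabled";
    subsets missing from the map are treated as "noinstances" (an assumption).
    """
    key, subset = "$trafficLabel", traffic_label
    while True:
        rule = ROUTE.get(key)
        state = status.get(subset, "noinstances")
        if state == "ok" or rule is None:
            return subset
        case, target = rule
        if state != case:
            # No fallback configured for this condition: stay on the destination.
            return subset
        key = subset = target


own_env = pick_subset("dev-x", {"dev-x": "ok"})
to_trunk = pick_subset("dev-x", {"testing-trunk": "ok"})
to_daily = pick_subset("dev-x", {"testing-trunk": "noavailabled", "testing": "ok"})
```

A request labelled dev-x stays in its own environment when that environment runs the service, and otherwise moves down the chain one destination at a time.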

2) Customization of personal development environment

  • Traffic goes to the daily environment first and falls back to the trunk environment when the daily environment has no service instances.

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: projectx-route
    spec:
      hosts:
      - myapp/*
      http:
      - name: dev-x-route
        match:
          trafficLabel:
          - exact: dev-x             # dev environment: x
        route:
        - destination:
            host: myapp/*
            subset: testing          # daily environment
          weight: 100
          fallback:
            case: noinstances
            target: testing-trunk
        - destination:
            host: myapp/*
            subset: testing-trunk    # trunk environment
          weight: 0

3) Support weight configuration

For traffic that carries the trunk environment label and originates from the local dev-x environment, 80% goes to the trunk environment and 20% goes to the daily environment. When the trunk environment has no available service, the traffic goes to the daily environment.

sourceLabels specifies the label of the local workload.

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: dev-x-route
    spec:
      hosts:
      - myapp/*
      http:
      - name: dev-x-route
        match:
          trafficLabel:
          - exact: testing-trunk     # trunk environment label
          sourceLabels:
          - exact: dev-x             # traffic from a certain project environment
        route:
        - destination:
            host: myapp/*
            subset: testing-trunk    # 80% of traffic goes to the trunk environment
          weight: 80
          fallback:
            case: noavailabled
            target: testing
        - destination:
            host: myapp/*
            subset: testing          # 20% of traffic goes to the daily environment
          weight: 20
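The match-then-split behaviour described for this rule can be sketched as follows. It is an illustration only: choose_subset is a hypothetical function, the match values (trunk label testing-trunk, source label dev-x) follow the description in the text, and per-request random selection is an assumed way of realising the 80/20 weights.

```python
import random
from typing import Optional


def choose_subset(traffic_label: str, source_label: str,
                  rng: random.Random) -> Optional[str]:
    """Return the target subset for a request, or None if the rule does not match."""
    # The rule applies only to trunk-labelled traffic sent from the dev-x workload.
    if traffic_label != "testing-trunk" or source_label != "dev-x":
        return None
    # Split matching traffic 80/20 between the trunk and daily environments.
    return rng.choices(["testing-trunk", "testing"], weights=[80, 20], k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
picks = [choose_subset("testing-trunk", "dev-x", rng) for _ in range(10_000)]
trunk_share = picks.count("testing-trunk") / len(picks)
```

Over many requests trunk_share settles close to 0.8; any single request lands on exactly one of the two subsets.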

Route by (environment) label

This solution relies on the deployed service carrying a related label (in this example, ASM_TRAFFIC_TAG: xxx), which is usually the environment identifier. The label can be regarded as metadata of the service deployment, and setting it depends on integration with the upstream deployment system, that is, the CI/CD system.

  • In K8s scenarios, the corresponding environment/group label can be added automatically when the service is deployed; K8s itself serves as the metadata management center.
  • In non-K8s scenarios, this can be integrated through the service registry or metadata management service that the microservices already use.

Note: ASM Pro developed its own ServiceDirectory component (see the ASM Pro product functional architecture diagram), which connects to multiple registries and obtains deployment metadata dynamically.

Application Scenario Extension

Below is a typical use of traffic marking and routing by label for governing development environments. Each developer’s Dev X environment only needs to deploy the services whose versions have changed. To test jointly with other developers, configure fallback so that service requests fall back to the corresponding development environment, for example from B in the Dev Y environment to C in the Dev X environment.

Similarly, the Dev X environment can be treated as the online grayscale version environment, which addresses full-link grayscale release in production.

Conclusion

The “traffic marking” and “routing by label” capabilities introduced in this article form a general solution that addresses test environment governance, online full-link grayscale release, and related problems. Because it is based on service mesh technology, it is independent of the development language. It also works across layer 7 protocols and currently supports HTTP/gRPC and Dubbo.

Other vendors also offer full-link grayscale solutions. Compared with them, the ASM Pro solution has the following advantages:

  • It supports multiple languages and protocols.
  • The TrafficLabel CRD is unified, simple, and flexible to configure, and supports multiple configuration levels (global, namespace, and pod).
  • Routing supports fallback for degradation.

The “traffic marking” and “routing by label” capabilities can also be used in other related scenarios:

  • Performance stress testing before major promotions. In online stress testing, a common way to isolate test data from real production data is to use shadow message queues, caches, and databases. This requires traffic marking so that labels can distinguish test traffic from production traffic. It also requires the Sidecar to support middleware such as Redis and RocketMQ.
  • Unitized routing. In common unitized routing scenarios, the target unit may have to be derived from metadata in the request traffic, such as the uid. Here the TrafficLabel definition can be extended with a function that obtains a “unit label” to mark the traffic, which is then routed to the corresponding service unit by that label.
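As a rough illustration of the unitized routing idea, a “unit label” function might map a uid to a unit as below. Everything here is hypothetical: the function unit_label, the unit names, and the modulo rule are placeholders for whatever sharding rule a real deployment uses, and nothing in this sketch is part of the TrafficLabel CRD.

```python
from typing import Sequence

UNITS = ("unit-a", "unit-b", "unit-c")  # hypothetical service units


def unit_label(uid: int, units: Sequence[str] = UNITS) -> str:
    """Map a user id to a unit label with a simple modulo rule (an assumption)."""
    return units[uid % len(units)]


label = unit_label(7)  # 7 % 3 == 1, so this selects "unit-b"
```

The resulting label would then play the same role as any other traffic label: it is attached to the request and matched against a subset.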

Related links:

1) Alibaba Cloud ASM Pro: servicemesh.console.aliyun.com/