Author: Zhang Haili, R&D Director of the Cloud Platform at UISEE Technology

Industry background

UISEE Technology (UISEE) is a leading autonomous driving company in China, committed to providing AI driving services across all industries and scenarios and to delivering "AI drivers" that enable new ecosystems for mobility and logistics. Because our business must run "truly unmanned" (i.e., with no safety operator in or alongside the vehicle) in every scenario, we pay particular attention to the high availability and observability of the "Cloud Brain" in practice.

Imagine a factory with dozens of unmanned logistics tractors. To operate safely in a "truly unmanned" environment, our policy is that a vehicle stops if its connection to the cloud is lost for too long (generally a matter of seconds). If the cloud fails during operation and lacks high availability, every vehicle stops running. The impact on business operations is obviously enormous, so the stability and high availability of the cloud platform are critical.

Why KubeSphere

Our relationship with KubeSphere could be described as "drawn in by its looks, won over by its talent." When KubeSphere 2.0 was released, we happened to notice the product in community news and were immediately attracted by its clean, refreshing interface. We began small-scale trials on our private cloud with 2.0, and started using it to manage our public cloud environment after 2.1 was released.

KubeSphere 3.0 was a very important milestone release that brought Kubernetes multi-cluster management and further enhanced monitoring and alerting, and 3.1 continues to build on these capabilities. As a result, we have begun applying KubeSphere on a larger scale to manage our own and hosted clusters (and the workloads that run on them), and we are exploring how to integrate our existing DevOps environment with it. The ultimate goal is to make KubeSphere the core of our internal unified portal and centralized management plane for cloud native applications, services and platforms.

Because KubeSphere provides such excellent governance capabilities, we have more time to improve the usability of the cloud platform from a business perspective. In this post we share two related practices: one we worked on early (high availability) and one we are working on now (observability).

High availability practice: an Operator that provides hot-standby capability

On the "high availability" side, the problem we want to solve is how to restore the cloud service to a stable state as quickly as possible when it fails.

The "high availability" requirements of restricted-area L4 driverless scenarios

"High availability" is usually expressed as a choice of how many nines of uptime, but the concrete problems and challenges differ by business scenario. As listed in the figure above, for our restricted-area L4 driverless scenario, the main constraints on building a high availability solution are: the variety of customer private clouds that comes with a to-B business, differing tolerances for the recovery process, and the historical burden of customer-specific customization. Faced with these constraints, we chose a relatively "simple and crude" approach, trying to sidestep the common problems of high cross-cloud availability costs and the high risk of integrating additional high availability capabilities into each service.

A high availability approach: hot-standby switching through an Operator

As shown in the figure above, the idea behind this solution is straightforward: monitor the status of the service Pods and switch between the primary and standby Pods when an exception occurs. Viewed through the Controller's observe-analyze-act loop, it does the following:

  • Monitor: watch for Pod/Deployment/StatefulSet/Service changes (in the specified Namespaces); any detected change triggers the Reconcile process (the two operations below)
  • Judge: iterate over all services, get each Deployment/StatefulSet, and compare the total number of replicas recorded in its status with the number available. If any replica is unavailable, iterate over the Pods in the DP/STS and identify the unhealthy ones by container status and restart count. When all Pods under a DP/STS are unhealthy, the service as a whole is considered unhealthy
  • Switch: two sets of DP/STS are deployed for the same Service. If all the Pods the Service currently points to are unhealthy (that is, the whole service is unhealthy) and the other DP/STS is healthy, switch the Service over to the other DP/STS

For the Operator development framework we chose Kubebuilder, the Operator SDK from the official Kubernetes community. Technically, it provides a good encapsulation of client-go, the core library usually needed to write a Controller, which lets developers focus on business logic. From the community-support point of view, its development is also quite stable. We also recommend the Chinese translation of the Kubebuilder documentation maintained by the cloud native community.
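
To make the "judge + switch" logic above more concrete, here is a minimal sketch written against controller-runtime, the library Kubebuilder scaffolds. The FailoverReconciler type, the primary/standby Deployment naming and the "role" label on the Service selector are illustrative assumptions, not our production code:

```go
// Sketch only: switch a Service between a primary and a standby Deployment
// when every replica of the primary becomes unavailable.
package controllers

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type FailoverReconciler struct {
	client.Client
}

// healthy is the "judge" step: a Deployment counts as healthy
// while at least one replica is still available in its status.
func (r *FailoverReconciler) healthy(ctx context.Context, ns, name string) (bool, error) {
	var dp appsv1.Deployment
	if err := r.Get(ctx, types.NamespacedName{Namespace: ns, Name: name}, &dp); err != nil {
		return false, client.IgnoreNotFound(err)
	}
	return dp.Status.AvailableReplicas > 0, nil
}

func (r *FailoverReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	primaryOK, err := r.healthy(ctx, req.Namespace, req.Name+"-primary") // assumed naming convention
	if err != nil {
		return ctrl.Result{}, err
	}
	standbyOK, err := r.healthy(ctx, req.Namespace, req.Name+"-standby")
	if err != nil {
		return ctrl.Result{}, err
	}

	// The "switch" step: if the primary is completely down but the standby is
	// healthy, repoint the Service selector at the standby Deployment's Pods.
	if !primaryOK && standbyOK {
		var svc corev1.Service
		if err := r.Get(ctx, req.NamespacedName, &svc); err != nil {
			return ctrl.Result{}, client.IgnoreNotFound(err)
		}
		if svc.Spec.Selector == nil {
			svc.Spec.Selector = map[string]string{}
		}
		svc.Spec.Selector["role"] = "standby" // assumed label convention
		if err := r.Update(ctx, &svc); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}

func (r *FailoverReconciler) SetupWithManager(mgr ctrl.Manager) error {
	// The "monitor" step: watch Services; the real controller also watches
	// Deployment/StatefulSet/Pod changes so failures trigger Reconcile quickly.
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Service{}).
		Complete(r)
}
```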

An obstacle halfway through: testing is the long tail of landing high availability

High availability testing is especially important because of the feature's nature, but conventional testing methods may not apply (an interesting "paradox": testing is meant to surface problems, while high availability is meant to keep problems from surfacing). So after finishing the development of the Operator, we actually spent more time on testing. We focused on three kinds of tests:

  • End-to-end BDD testing: for basic functional verification we use Godog, a Cucumber BDD testing framework for Go that supports Cucumber Gherkin syntax. BDD also lets the business side feed in requirements directly (see the sketch after this list)
  • Chaos testing of the runtime environment: we used ChaosBlade to run chaos tests against the Kubernetes physical nodes, verifying high availability behaviour when the infrastructure fails
  • Chaos testing at the business level: here we use Chaos Mesh to inject faults at the Pod level into the primary and standby services. We chose Chaos Mesh partly because its functional coverage at this level is comprehensive, and partly because its Dashboard makes it convenient to manage the various test cases
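
For a flavour of the BDD layer, below is a minimal Godog sketch. The Gherkin step wording and the in-memory state are made-up placeholders rather than our actual test suite, which drives a real cluster through the Kubernetes API:

```go
// Sketch only: Godog step definitions for a failover scenario such as
//   Given the service is backed by two deployments
//   When all pods of the primary deployment become unhealthy
//   Then the service should switch to the standby deployment
package failover

import (
	"errors"
	"testing"

	"github.com/cucumber/godog"
)

var serving string // which Deployment the Service currently points to (stub state)

func serviceBackedByTwoDeployments() error {
	serving = "primary"
	return nil
}

func primaryBecomesUnhealthy() error {
	// The real step would break the primary's Pods via the Kubernetes API
	// and wait for the Operator to react; here we just simulate the outcome.
	serving = "standby"
	return nil
}

func serviceShouldSwitchToStandby() error {
	if serving != "standby" {
		return errors.New("service still points at the primary deployment")
	}
	return nil
}

func InitializeScenario(ctx *godog.ScenarioContext) {
	ctx.Step(`^the service is backed by two deployments$`, serviceBackedByTwoDeployments)
	ctx.Step(`^all pods of the primary deployment become unhealthy$`, primaryBecomesUnhealthy)
	ctx.Step(`^the service should switch to the standby deployment$`, serviceShouldSwitchToStandby)
}

func TestFailoverFeatures(t *testing.T) {
	suite := godog.TestSuite{
		ScenarioInitializer: InitializeScenario,
		Options:             &godog.Options{Format: "pretty", Paths: []string{"features"}},
	}
	if suite.Run() != 0 {
		t.Fatal("failover feature tests failed")
	}
}
```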

To summarize, we learned a few things from this early high availability work: first, you need to be familiar with the core mechanisms of the Kubernetes Controller and the client-go library; second, a handy Operator development framework greatly improves development efficiency; last but not least, the high availability work turned out to be "20% development + 80% all-round testing", and doing that testing well and thoroughly is very, very important.

Observability practice: cloud-to-vehicle SkyWalking integration

On the "observability" side, the problem we want to solve is how to locate the root cause as quickly as possible after a service failure has been recovered, so that the underlying problem can actually be eliminated as early as possible.

The "observability" requirements of the vehicle-cloud integrated architecture for driverless vehicles

The "vehicle-cloud integration" architecture is at the core of unmanned driving. From the cloud's perspective, one big challenge is that the service link is very long, far longer than a traditional pure-cloud Internet service link. A problem at any point on this ultra-long link can lead to a fault, which can trigger alarms or even take vehicles offline abnormally. We therefore want to collect every last bit of information along the link in order to locate problems. At the same time, log data alone is not enough: because the link is so long and is distributed across vehicle and cloud, logs by themselves make it hard to quickly narrow down the segment where the problem occurred and dig into it.

To give a more concrete sense of how long the link is, an abstract "vehicle-cloud integration" architecture diagram is shown below; interested readers can count the "7 x 2" call link for themselves.

Achieving full-link tracing across vehicle and cloud with SkyWalking

Apache SkyWalking is an excellent and active observability platform project in the community, providing the Logging, Metrics and Tracing triad of observability. For the basic concepts of tracing systems, please refer to the concepts and terminology document translated by Wu Sheng. To make the follow-up content easier to grasp, and to fill in details there was no room to expand on in the talk, a few key points are briefly summarized here for reference:

  • Trace: a Trace represents a potentially distributed, potentially parallel data/execution path through a system. A Trace can be viewed as a directed acyclic graph (DAG) made up of multiple Spans.
  • Span: the most important concept to pay attention to when instrumenting a service. A Span represents a logical unit of work in the system, with a start time and a duration. Logical causal relationships between Spans are established by nesting or ordering them. In SkyWalking, Spans are divided into:
    • LocalSpan: the Span type created when a method is invoked within a service
    • EntrySpan: the Span type created when a request enters a service (for example, when handling a call to this service's interface from another service)
    • ExitSpan: the Span type created when a request leaves a service (for example, when calling another service's interface)
    • In SkyWalking, creating an ExitSpan is equivalent to creating a parent Span. In an HTTP request, for example, the ExitSpan's context is encoded and put into the request Header; after the other service receives the request, it creates an EntrySpan and decodes the context from the Header to figure out who its parent is. In this way the ExitSpan and EntrySpan are chained together (see the Go sketch after this list).
    • SkyWalking makes no distinction between the ChildOf and FollowsFrom Span relationship types
  • TraceSegment: a SkyWalking concept that sits between a Trace and a Span. It is a segment of a Trace that can contain multiple Spans and records the execution within a single thread. A Trace consists of one or more TraceSegments, and a TraceSegment in turn consists of one or more Spans.
  • SpanContext: represents the state passed to child Spans across process boundaries. Within a single Go service it is carried by context.Context.
  • Baggage: a collection of key-value pairs stored in the SpanContext. It is transmitted globally across all Spans on a trace link, including their corresponding SpanContexts; Baggage follows the Trace.
    • In SkyWalking, context data is passed in a header item named sw8, whose value contains eight fields separated by "-" (Trace ID, Parent Span ID, and so on)
    • SkyWalking also offers an extension header item named sw8-correlation, which can carry custom information
  • Compared with Jaeger/Zipkin: although all of them are OpenTracing-style implementations, the ExitSpan and EntrySpan concepts are unique to SkyWalking. Their advantages are:
    • Semantically explicit ExitSpans and EntrySpans make the code logic clearer
    • That clarity matters because creating Spans by hand is error-prone, especially when you are unfamiliar with the service links; understanding OpenTracing is the foundation, and so is understanding the call links between your services
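
To make the EntrySpan/ExitSpan chaining above more tangible, here is a minimal sketch using the SkyWalking Go agent library go2sky. The service name, addresses and endpoints are illustrative, and the extractor/injector closure signatures differ slightly between go2sky versions, so treat this as a sketch rather than copy-paste code:

```go
// Sketch only: manual EntrySpan/ExitSpan instrumentation with go2sky.
package main

import (
	"net/http"

	"github.com/SkyAPM/go2sky"
	"github.com/SkyAPM/go2sky/reporter"
)

func main() {
	// Report traces to a SkyWalking OAP backend (address is an assumption).
	rep, err := reporter.NewGRPCReporter("skywalking-oap:11800")
	if err != nil {
		panic(err)
	}
	defer rep.Close()

	tracer, err := go2sky.NewTracer("demo-service", go2sky.WithReporter(rep))
	if err != nil {
		panic(err)
	}

	http.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		// EntrySpan: a request enters this service; the extractor reads the
		// incoming sw8 header so the parent context can be decoded.
		span, ctx, err := tracer.CreateEntrySpan(r.Context(), "/hello", func(key string) (string, error) {
			return r.Header.Get(key), nil
		})
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer span.End()

		// ExitSpan: this service calls a downstream service; the injector
		// writes the current context into the outgoing request headers,
		// which is what chains this ExitSpan to the downstream EntrySpan.
		req, _ := http.NewRequestWithContext(ctx, http.MethodGet, "http://downstream:8080/api", nil)
		exit, err := tracer.CreateExitSpan(ctx, "/api", "downstream:8080", func(key, value string) error {
			req.Header.Set(key, value)
			return nil
		})
		if err == nil {
			_, _ = http.DefaultClient.Do(req)
			exit.End()
		}
		_, _ = w.Write([]byte("ok"))
	})

	_ = http.ListenAndServe(":8080", nil)
}
```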

SkyWalking's plug-in architecture is what allows us to instrument a sprawling microservices architecture. There are agents and plug-ins for Java, Python, Go, Node.js and more, with plug-in support for HTTP frameworks, SQL, NoSQL, MQ, RPC, etc. (Java has the broadest coverage with 50+ plug-ins; other languages may not be as comprehensive). Following the design of the official Go and Python plug-ins, we further extended and wrote some plug-ins of our own, such as:

  • Go · GORM: GORM supports registering plug-ins for database operations, so the plug-in only needs to create an ExitSpan around each operation
  • Go · gRPC: the context is written into metadata (the gRPC analogue of HTTP headers) using gRPC interceptors
  • Go · MQTT: no suitable middleware hook was found, so the instrumentation functions were written directly and are called manually when messages are published and received (see the sketch after this list)
  • Python · MQTT: writes the context data from the Carrier (see OpenTracing Baggage) into the Payload
  • Python · Socket: because the Socket layer is so low-level, an official-style custom Socket plug-in would also record HTTP requests and MQTT messages, producing far too much output; instead, two customized functions are called manually where the business needs them
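
Since MQTT 3.x messages have no headers to inject into, our manual MQTT instrumentation carries the trace context inside the message payload. The sketch below illustrates that idea with go2sky and the Eclipse Paho Go client; the envelope layout, topic names and peer address are illustrative assumptions, and the go2sky signatures may vary by version:

```go
// Sketch only: carry the SkyWalking context inside the MQTT payload.
package tracemqtt

import (
	"context"
	"encoding/json"

	"github.com/SkyAPM/go2sky"
	mqtt "github.com/eclipse/paho.mqtt.golang"
)

// envelope is an assumed payload layout: the trace context travels in Headers.
type envelope struct {
	Headers map[string]string `json:"headers"`
	Body    string            `json:"body"`
}

// Publish creates an ExitSpan and embeds the trace context in the payload.
func Publish(ctx context.Context, tracer *go2sky.Tracer, c mqtt.Client, topic, body string) error {
	env := envelope{Headers: map[string]string{}, Body: body}
	span, err := tracer.CreateExitSpan(ctx, "MQTT/PUBLISH/"+topic, "broker:1883", func(key, value string) error {
		env.Headers[key] = value // the injector hands us the sw8 context
		return nil
	})
	if err != nil {
		return err
	}
	defer span.End()

	payload, err := json.Marshal(env)
	if err != nil {
		return err
	}
	token := c.Publish(topic, 0, false, payload)
	token.Wait()
	return token.Error()
}

// Handler creates an EntrySpan on receive, restoring the parent context
// from the payload so the link stays connected across the broker.
func Handler(tracer *go2sky.Tracer) mqtt.MessageHandler {
	return func(_ mqtt.Client, msg mqtt.Message) {
		var env envelope
		if err := json.Unmarshal(msg.Payload(), &env); err != nil {
			return
		}
		span, _, err := tracer.CreateEntrySpan(context.Background(), "MQTT/RECEIVE/"+msg.Topic(), func(key string) (string, error) {
			return env.Headers[key], nil
		})
		if err != nil {
			return
		}
		defer span.End()
		// ... handle env.Body here ...
	}
}
```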

The future is bright, but the road is winding: some of the pits we stepped into

Because a microservices architecture usually involves many languages, kinds of middleware and business demands, you inevitably run into all sorts of subjective and objective pits while rolling out full-link tracing. Here are some common scenarios.

Case 1: the trace link breaks at the Kong gateway

Not finding an official plug-in is one of the most common integration problems. For example, SkyWalking's Kong plug-in had not yet been officially released when we adopted SkyWalking (it was only released this May). For business reasons, Kong is fronted by a custom permission plug-in used for API and resource authorization; that plug-in calls the permission service's interface to perform authorization, and those calls should also be part of the call chain. Our solution was therefore to instrument the permission plug-in directly; the resulting link is shown in the figure below.

Case 2: Cross-thread / cross-process link connection problems

For the cross-thread case, a hint: you can use the capture() and continued() functions to take a Snapshot of the Context and carry it across threads. For the cross-process case, one pit we hit is a Python version problem: when a new process is started inside a Python service, the original process's SkyWalking agent cannot be used in the new process, and you need to start another agent there. This works on Python 3.9, while Python 3.5 reports that the agent "can only be started once".

Case 3: The official Python Redis plug-in breaks the link for Pub/Sub

This case is a classic example of an official plug-in not covering a real business scenario. The official Python agent ships a Redis plug-in, and at first we assumed that installing it would connect all Redis operations into the trace. In practice, however, the link is broken for Pub/Sub operations.

Looking at the code, the plug-in creates an ExitSpan for every Redis operation. But our scenario uses Pub/Sub, so both the publish side and the subscribe side create ExitSpans, and the link never connects. We eventually modified the plug-in to solve the problem. If you hit a similar situation, pay attention to what the official plug-in was actually designed to cover.

Case 4: Trace integration problems with the MQTT broker's data bridges

Normally the trace link through an MQTT Broker is Publisher => Subscriber. But there are also scenarios in which the Broker receives a message and, through its rules engine, calls the notification center's interface; when the rules engine makes that call, there is no way to put the trace information into a header.

This is a typical case of advanced middleware capabilities not being covered by plug-ins; usually you have to adapt to the actual situation. In this case, we agreed on a parameter name and put the context into the request body. After receiving the request, the notification center extracts the link Context from the body, finally connecting the link as shown in the figure below.
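
As a rough illustration of this workaround on the receiving side (the field name, handler and types are assumptions, not the actual notification-center code), the trace context is read from an agreed field in the request body instead of the sw8 header and used to create the EntrySpan:

```go
// Sketch only: restore the trace context from an agreed body field when the
// caller (here, the MQTT broker's rules engine) cannot set HTTP headers.
package notify

import (
	"encoding/json"
	"net/http"

	"github.com/SkyAPM/go2sky"
)

// notifyRequest is an assumed body layout; "sw8" is the agreed parameter name.
type notifyRequest struct {
	SW8     string `json:"sw8"`
	Message string `json:"message"`
}

func NotifyHandler(tracer *go2sky.Tracer) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var body notifyRequest
		if err := json.NewDecoder(r.Body).Decode(&body); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}

		// The extractor returns the context carried in the body rather than
		// in the headers, which reconnects the link across the broker.
		span, _, err := tracer.CreateEntrySpan(r.Context(), "/notify", func(key string) (string, error) {
			return body.SW8, nil
		})
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer span.End()

		// ... deliver the notification described by body.Message ...
		w.WriteHeader(http.StatusOK)
	}
}
```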

Finally, a few observations to wrap up: first, rely on a mature and continuously evolving tool/platform; second, trust it, and grow and improve together with it; last, hitting obstacles along the way is not terrible. As a great man said, "a single spark can start a prairie fire": with a clear goal and persistent effort, there is always a chance to solve the problem.

Thanks again to the KubeSphere team for contributing such an excellent cloud native product to the open source community in China and around the world. We hope to participate in the community as much as we can and grow together with it!
