In daily work we often need to combine related functions from several systems to complete one business scenario. In microservices this is usually solved with distributed transactions, or with the kind of choreography discussed here. This is the introductory article of the series and mainly describes what the author has tried in practical work. Later articles will cover some of the internal principles and more interesting production practices.

1. Background

In the operation and maintenance (O&M) platform I took over, the previous design put all the code for a given business scenario into one large controller. To accommodate the quirks of various earlier platforms and scenarios, the code was full of if/else branches and hard-coded values, so failures regularly needed manual intervention and troubleshooting. Scalability and robustness were almost nil. To better understand what we are trying to do in Part 2, here are a few concepts.

1.1 Task orchestration in O&M

In traditional O&M development, task scheduling is a very common system. In O&M systems before K8S, it was usually implemented on a master-worker architecture. There are also many open source products, such as StackStorm, Tower, and Jenkins. Used within the same business, these frameworks have no essential difference: we write a plug-in that implements the required interface and hand it to the system to run as a task. The task's data and status are maintained by the plug-in itself. The system is only responsible for scheduling the task, not for managing its status and data.
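The plug-in model described above can be sketched roughly as follows. This is a minimal illustration, not the API of StackStorm, Tower, or Jenkins: `TaskPlugin`, `RestartService`, and `Scheduler` are hypothetical names chosen for this example.

```python
from abc import ABC, abstractmethod


class TaskPlugin(ABC):
    """Hypothetical plug-in interface: the plug-in owns its own data and status."""

    @abstractmethod
    def run(self):
        ...

    @abstractmethod
    def status(self):
        ...


class RestartService(TaskPlugin):
    def __init__(self, service):
        self.service = service
        self._status = "pending"  # state lives in the plug-in, not in the scheduler

    def run(self):
        # A real plug-in would call the target platform here.
        self._status = "succeeded"

    def status(self):
        return self._status


class Scheduler:
    """Only dispatches tasks; it keeps no task data or status of its own."""

    def __init__(self):
        self._queue = []

    def submit(self, task):
        self._queue.append(task)

    def run_all(self):
        ran = 0
        while self._queue:
            self._queue.pop(0).run()
            ran += 1
        return ran
```

Because the scheduler stores nothing about a task, asking "what state is this job in?" has to go through the plug-in, which is exactly the limitation the next section discusses.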

1.2 Status during O&M

Final-state-oriented operation is an O&M model that industry leaders have promoted in recent years. You describe the final state; the system makes a decision based on the current state, waits for the feedback, and then decides again, forming a feedback loop that runs until the target state is reached.

But a lot of business logic in everyday work is stateful. For example, in a capacity expansion scenario, if you don't know the status of the current Pod or Server, you essentially can't decide what to do next. If you do not know the status of each task in a long workflow, there is no way to know which step can currently be retried.
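The observe, decide, act loop of the final-state model can be sketched as below. This is an illustrative sketch only: the replica counts, the `max_step` batching limit, and the function names are assumptions made for the example, not part of any real controller.

```python
def reconcile(current, desired, max_step=2):
    """Make one decision: move at most `max_step` replicas toward `desired`."""
    delta = desired - current
    step = max(-max_step, min(max_step, delta))  # clamp the change per round
    return current + step


def converge(current, desired, max_rounds=10):
    """Repeat observe -> decide -> act until the target state is reached."""
    history = [current]
    for _ in range(max_rounds):
        if current == desired:
            break
        current = reconcile(current, desired)
        history.append(current)
    return history
```

Every decision depends on `current`, which is why the system must keep track of state: without it the loop has nothing to compare against the target.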

1.3 Vision of intelligent decision making

I remember that at my previous company we all listened to the boss's talk on AIOps: based on artificial intelligence and expert experience, the system makes automatic decisions when faults occur, so that losses are stopped quickly. Knowledge graphs, root cause analysis, and intelligent robots are intimidating just to think about. But as a programmer, if a program I wrote found a problem before I did, I would suspect it was a bug.

Compared with intelligent O&M, the author is more optimistic about event-driven O&M. By perceiving events and automating the handling process for each event based on expert experience, controllability, stability, and determinism can be guaranteed.
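One way to picture event-driven O&M is a fixed table that maps each event type to a handler written from expert experience. The sketch below is a hypothetical illustration: the event types, field names, and handlers are invented for the example.

```python
RULES = {}  # event type -> handler, standing in for codified expert experience


def on(event_type):
    """Register a handler for one event type."""
    def register(handler):
        RULES[event_type] = handler
        return handler
    return register


@on("disk_full")
def clean_logs(event):
    return f"clean logs on {event['host']}"


@on("pod_crash_loop")
def roll_back(event):
    return f"roll back {event['app']}"


def handle(event):
    handler = RULES.get(event["type"])
    if handler is None:
        return "escalate to on-call"  # unknown events go to a human
    return handler(event)
```

The same event always produces the same action, and anything outside the table is escalated rather than guessed at, which is where the controllability and determinism come from.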

2. Solutions

In fact there is no finished plan yet, mainly some thoughts on how to put this into practice, and the investigation was short because time did not allow more. After a rough selection, we began to design the system around our business scenarios. Here we first introduce the selection and the architecture.

2.1 Selection

According to our analysis of O&M scenarios, what we need is a stateful, programmable, distributed task scheduling framework that supports workflows, fault tolerance, and scaling out as needed. At present, task scheduling in cloud native seems to be an unpopular direction, so the author looked at the frameworks used to solve distributed transaction scenarios in business systems. Finally, we chose Uber's Cadence framework. However, Cadence's author seems to be strongly opposed to DSLs, and DSL-based scheduling is not implemented by default.

The main reason we did not make a more usual selection such as Airflow is related to the author's environment. Most of the company's current base services are either adapted from open source or homegrown, so the integrations that open source tools ship by default are of little value here; the author has to write a custom Provider anyway. Secondly, the author's current platform uses two languages, Java and Go, so for ease of integration we had to choose a framework with cross-language support.

2.2 System Architecture

Based on the open source Cadence, we directly designed V0.1 of our upper business layer. To implement the two core functions mentioned above, choreography and decision making, we designed six business modules with the following functions:

Module | Description
--- | ---
Atomic components | Encapsulate various atomic tasks and provide them to Workflows and the DSL for task choreography
DSL choreography | Provides the basic Workflow decisions and supports DSL-based choreography
Event | Listens for or accepts events sent by external systems and triggers the corresponding decision module
Decision making | Implements basic decision functions such as business hierarchy and machine batching. A decision triggers the corresponding Workflow or atomic components
Service catalog | Exposes atomic operations and workflows for external use through the service catalog
Control module | Manages the results of decisions, prevents multiple decision modules from delivering the same job, and provides unified control

As you can see, Cadence solves many of the low-level distributed-systems problems for us; we only need to build the upper-level business modules. Once the business functions are written, I will take some time to study the corresponding implementation in Cadence and then share it with you.

2.3 Workflow

The operation of the system is divided into two major stages: choreography and runtime

Choreography stage

1. Platform R&D is responsible for packaging the various platform functions into atomic components and connecting them to the system.
2. According to the business scenario, O&M experts combine the lower-level atomic tasks and construct the corresponding DSL process. The resulting workflow is used as a decision branch by the decision module, and the corresponding mutual exclusion strategy is set at the same time.
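The two choreography steps above can be sketched as a registry of atomic components plus a declarative workflow that refers to them by name. This is a hypothetical illustration of the idea, not the system's actual DSL: the component names, the `mutex` field, and the dictionary layout are all assumptions.

```python
def drain_traffic(ctx):
    ctx["log"].append(f"drain {ctx['host']}")


def restart_service(ctx):
    ctx["log"].append(f"restart {ctx['host']}")


def restore_traffic(ctx):
    ctx["log"].append(f"restore {ctx['host']}")


# Step 1: platform R&D registers atomic components under stable names.
ATOMS = {f.__name__: f for f in (drain_traffic, restart_service, restore_traffic)}

# Step 2: an O&M expert composes them into a workflow definition.
SAFE_RESTART = {
    "name": "safe_restart",
    "mutex": "host",  # assumed field: only one workflow per host at a time
    "steps": ["drain_traffic", "restart_service", "restore_traffic"],
}


def validate(dsl):
    """Reject a definition that refers to components nobody registered."""
    missing = [s for s in dsl["steps"] if s not in ATOMS]
    if missing:
        raise ValueError(f"unknown atomic components: {missing}")


def run(dsl, ctx):
    """Run the steps in order; a real engine would also persist progress."""
    validate(dsl)
    for step in dsl["steps"]:
        ATOMS[step](ctx)
    return ctx["log"]
```

Validating at choreography time means a typo in a step name is caught before the workflow ever reaches production.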

Runtime stage

1. An event is generated when the status of an O&M object changes.
2. After receiving the event, the decision module makes a decision according to the event type and decision branch, and generates a decision result.
3. The control module is then called to confirm whether the decision result can be delivered to the production environment.
4. The control module determines whether the decision can proceed according to the work tasks currently running.
5. The workflow is sent to Cadence, and Cadence orchestrates the workflow and atomic tasks.
6. Atomic operations eventually trigger a state change of the O&M object, and the corresponding operation is performed again until the target state is reached.
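The runtime steps above can be sketched as a small pipeline from event to decision to control check to submission. This is only an illustration under assumptions: the event types, the decision table, and the `submitted` list (standing in for handing a workflow to the engine) are hypothetical, and no Cadence API is used.

```python
class ControlModule:
    """Rejects a decision if a job is already running on the same target."""

    def __init__(self):
        self.running = set()

    def admit(self, decision):
        if decision["target"] in self.running:
            return False
        self.running.add(decision["target"])
        return True

    def finish(self, decision):
        self.running.discard(decision["target"])


# Decision branches: event type -> workflow name (assumed values).
DECISION_BRANCHES = {"node_unhealthy": "replace_node", "load_high": "scale_out"}


def decide(event):
    workflow = DECISION_BRANCHES.get(event["type"])
    if workflow is None:
        return None
    return {"workflow": workflow, "target": event["target"]}


def on_event(event, control, submitted):
    decision = decide(event)                 # step 2: make a decision
    if decision is None:
        return "ignored"
    if not control.admit(decision):          # steps 3-4: control check
        return "rejected"
    submitted.append(decision)               # step 5: stand-in for the engine
    return "submitted"
```

Keeping the mutual exclusion check in one control module, rather than in each decision branch, is what prevents two decisions from operating on the same object at once.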

3. Summary

I have been busy recently. I spent one day of the weekend at home reading the Service Catalog code, and wanted to write a summary in the evening because I had not written an article for a long time, but after thinking for several hours I had written almost nothing. I will share some of the code-level design in a later article when the code is finished, and some interesting things in Service Manager tomorrow. If you found this useful, please help share it.

Cloud native learning notes: www.yuque.com/baxiaoshi/t… WeChat: Baxiaoshi2020. WeChat public account: graphic source code.
