“This is day 20 of my participation in the November post-update challenge. For event details, see: The Last Post-Update Challenge of 2021.”

Hello, I’m looking at the mountains.

The Phoenix Project, subtitled “a legendary story of IT operations”, is a rather magical book. Through storytelling, it shows an IT team (development, testing, operations) struggling with low efficiency and slow delivery, and then applying the three-step method within the team to speed up system delivery, improve efficiency, and set the team on the road to transformation. What’s more, the book is commendable in that, by analogy with manufacturing workflows, it makes the hidden problems in a technical team’s working process intuitively visible.

A word of caution for developers: read the book with a calm, detached mindset, because the story is told from an operations perspective and some plot points may be frustrating. If you are looking for specific DevOps tools, this book is not for you. It introduces no specific tools and instead describes the advantages and practices of DevOps in simple terms.

First, the concepts:

  • Value stream: An organized series of delivery activities performed by an organization based on customer needs. Or, a set of activities that involve both information flow and material flow in order to design, produce, and provide a product or service to a customer.
  • Technology Value Stream: The process required to transform business ideas into technology-driven services that deliver value to customers. The inputs to the process are requirements, which are developed by the development department, tested as a whole, deployed to production to run properly, and served to customers to generate value.
  • Lead time: starts when the requirement is confirmed (development receives it) and ends when the work is done. It equals wait time plus processing time.
  • Processing time: starts when work on the task actually begins and ends when the work is done.
  • Wait time: starts when the requirement is confirmed (development receives it) and ends when processing actually begins.
  • Work in progress (WIP): incomplete work in the value stream, including work sitting in queues. Partially completed work gradually becomes stale and loses value over time.
  • Constraint point: the bottleneck in the value stream, which sets the upper limit on the flow rate of the entire value stream.
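
The three time measures above are related: lead time is wait time plus processing time. Here is a minimal Python sketch of that relationship. It assumes a hypothetical work item that records when the requirement was confirmed, when work actually started, and when it finished; the `WorkItem` class and its field names are illustrative and do not come from the book.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class WorkItem:
    """Hypothetical record of one piece of work in the value stream."""
    accepted_at: datetime   # requirement confirmed, development receives it
    started_at: datetime    # someone actually begins working on it
    finished_at: datetime   # work is done

    @property
    def wait_time(self) -> timedelta:
        return self.started_at - self.accepted_at

    @property
    def processing_time(self) -> timedelta:
        return self.finished_at - self.started_at

    @property
    def lead_time(self) -> timedelta:
        # Lead time = wait time + processing time
        return self.finished_at - self.accepted_at


item = WorkItem(
    accepted_at=datetime(2021, 11, 1, 9, 0),
    started_at=datetime(2021, 11, 4, 9, 0),
    finished_at=datetime(2021, 11, 5, 9, 0),
)
# Days spent waiting, processing, and in total
print(item.wait_time.days, item.processing_time.days, item.lead_time.days)
```

In this made-up example the item waits three days for one day of actual work, so most of the lead time is waiting rather than processing.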

Step 1: The flow principle

The first step, the flow principle, is to open up the technology value stream so that work flows quickly from left to right, from development to operations. Accelerating the flow of the technology value stream shortens the lead time needed to meet customer needs, improves the quality of work and productivity, and makes the enterprise more competitive. Practices include continuous build, integration, test, and deployment; creating environments on demand; limiting work in progress; and building systems and organizations that can safely implement change.

We speed up flow by continuously making work more visible, reducing batch sizes and wait intervals, and building in quality so that defects are not passed downstream.

Make work visible

In manufacturing, piles of raw materials or semi-finished products and backlogs of orders are obvious: wherever the blockage occurs is the constraint point. In the technology value stream, however, many problems are hidden, and there is no obvious way to see where work is blocked or constrained. Moreover, because information is invisible or incomplete, a problem may be passed on to the next stage, may only surface after going live, or may prevent delivery altogether. We therefore need to visualize work as much as possible to identify where it flows, where it queues, and where it stalls.

In general, you can use a kanban board (kanban is a Japanese term) or a sprint planning board as the tool.

One caveat in visual management is to focus on global goals, not local ones. The global goals are to raise system quality and improve development efficiency, while local goals include the development completion rate, the number of defects found in testing, system availability, and so on. This is not to say that local goals are unimportant; they need to be optimized in other ways. What we need now is to improve overall efficiency, and once we get lost in the details we cannot see the forest for the trees and lose our grasp of the whole picture. This is what Gene Kim means by “not allowing local optimizations to degrade overall performance.”

Limit WIP

Once work is visualized, you can start looking for problems and improving quality.

The first step is to limit parallel tasks. Why? Because parallel tasks cost time in task switching. A common rule of thumb says that if you do two tasks at once, you spend 20% of your time switching between them: clearing your head, getting back into the zone, restoring your working environment, and so on. With three tasks, 40% of your time goes to switching. The more parallel tasks, the more time is lost to switching and the more labor is wasted. When parallel tasks are reduced, less time is wasted, more time goes into actual work, and overall delivery efficiency improves accordingly. If you use a kanban board, you can limit work in progress (parallel tasks) per column or per work center by marking the maximum number of cards on each column.
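
As a rough illustration of a per-column WIP limit, here is a minimal Python sketch. The `KanbanColumn` class, the `WipLimitExceeded` exception, and the `switching_overhead` helper are hypothetical names made up for this example; the helper simply encodes the 20%-per-extra-task rule of thumb quoted above and is not a measured figure.

```python
class WipLimitExceeded(Exception):
    """Raised when a column already holds its maximum number of cards."""


class KanbanColumn:
    def __init__(self, name: str, wip_limit: int) -> None:
        self.name = name
        self.wip_limit = wip_limit
        self.cards = []  # titles of the cards currently in this column

    def add(self, card: str) -> None:
        # Refuse new work once the column is full
        if len(self.cards) >= self.wip_limit:
            raise WipLimitExceeded(
                f"{self.name} is full ({self.wip_limit} cards); finish something first"
            )
        self.cards.append(card)

    def remove(self, card: str) -> None:
        self.cards.remove(card)


def switching_overhead(parallel_tasks: int) -> float:
    """Rule of thumb from the text: each extra parallel task costs about 20%."""
    return min(0.2 * max(parallel_tasks - 1, 0), 1.0)


doing = KanbanColumn("Doing", wip_limit=2)
doing.add("fix login bug")
doing.add("add export API")
try:
    doing.add("refactor billing")  # third card exceeds the limit
except WipLimitExceeded as exc:
    print(exc)
```

Rejecting the third card outright is one possible policy; a team might instead only flag the column as over its limit and discuss it at the next stand-up.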

Reduce batch size

This is the small-steps approach that agile advocates: deliver early, try early, fail early, and correct early. Problems are exposed as soon as possible instead of piling up into one big knot at final integration that can no longer be untangled.

Continuous deployment must be mentioned here. Many teams use a tool such as Jenkins, where submitting code triggers a Jenkins workflow that compiles, tests, deploys, and publishes it. All we have to do is submit code in small batches. Each batch is compiled and tested; if there is a problem, it is found as early as possible, and if there is none, the change can be released to the production environment and presented to the customer as early as possible.
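
The fail-fast idea behind such a pipeline can be sketched in a few lines of Python. This is not how Jenkins is configured; it is only a toy model in which each stage is a hypothetical function returning success or failure, and `run_pipeline` is a name invented for this example.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns a log of "name: ok" / "name: FAILED" entries.
    """
    log = []
    for name, step in stages:
        if step():
            log.append(f"{name}: ok")
        else:
            log.append(f"{name}: FAILED")
            break  # fail fast: later stages never see a broken build
    return log


# Hypothetical stage functions; a real setup would call the compiler, test runner, etc.
stages = [
    ("compile", lambda: True),
    ("test", lambda: False),   # simulate a failing test
    ("deploy", lambda: True),
]
result = run_pipeline(stages)
print(result)
```

Because the simulated test stage fails, the deploy stage never runs. With small batches, the change that broke the test is easy to locate, since only a few lines went in.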

Reduce the number of hand-overs

In a traditional IT team, code has to pass through many departments between the completion of development and deployment to production. Each department has its own KPIs and schedule, and it takes time for departments to communicate and approve work orders, which lengthens delivery time. In addition, each department has its own blind spots about a given feature: what development takes for granted, operations may not know at all. Information silos can mean that known defects are not communicated downstream in time, leading to all kinds of rework.

To reduce handoffs, automate most of the work, or restructure teams so that they depend less on other teams.

Continually identify and improve constraint points

Constraint points are bottlenecks: wherever a constraint exists in the team, the delivery workflow is throttled there. As work upstream is optimized, it piles up in front of the constraint, while the roles downstream of the constraint sit idle because tasks have not yet arrived. To improve overall efficiency, constraint points must be found and broadened to increase throughput. Any optimization made anywhere other than at the constraint is an illusion.

Generally, follow these steps to broaden the constraints:

  1. Identify the constraint point. The role with the longest task queue is the constraint.
  2. Decide how to make the most of the constraint that was found.
  3. Arrange all other work around the decision made in step 2.
  4. Broaden the constraint, that is, increase its capacity.
  5. Once the constraint has been broadened, a new constraint will appear elsewhere in the workflow; repeat the steps above.
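
Step 1 and the loop in step 5 can be sketched as follows. This is a minimal illustration under the assumption stated in step 1, that the role with the longest task queue is the constraint; the `find_constraint` function and the queue contents are made up for this example.

```python
def find_constraint(queues: dict) -> str:
    """Return the role with the longest task queue (step 1 above)."""
    return max(queues, key=lambda role: len(queues[role]))


queues = {
    "development": ["T7"],
    "testing": ["T3", "T4", "T5", "T6"],
    "operations": ["T1", "T2"],
}
first = find_constraint(queues)
print(first)

# Suppose testing is broadened (for example, by automating tests) and its
# queue drains into operations; the constraint then moves, so the steps repeat.
queues["operations"].extend(queues["testing"])
queues["testing"].clear()
second = find_constraint(queues)
print(second)
```

Queue length is only a proxy. In practice a team might also look at how long items wait in each queue, but the principle of re-checking for the constraint after every improvement stays the same.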

Eliminate hardships and waste from the value stream

To improve delivery efficiency, we need to cut costs as well as increase output. Waste is the use of any material or resource beyond what customers need and are willing to pay for. It takes the following forms:

  • Work in progress: incomplete work in the value stream, including work sitting in queues, such as unconfirmed requirements or changes awaiting review. Partially completed work gradually becomes stale and loses value over time.
  • Extra work: work performed during delivery that adds no value for the customer, such as reviewing documents that are never used downstream or that add nothing to the output.
  • Extra features: functionality built during delivery that neither the organization nor the customer needs, that is, gold plating. Gold-plated features add complexity and increase the effort of functional testing and management.
  • Task switching: assigning people to multiple projects or value streams, so that every switch costs extra effort and time in the value stream.
  • Waiting: delays caused by competition for resources, such as waiting for other departments to cooperate. Waiting increases cycle time and delays delivery to the customer.
  • Motion: the effort required to move information or data between work centers. For example, when people who need to communicate frequently are not in the same place, getting them together is wasteful. Handoffs also create motion waste because they require additional communication.
  • Defects: errors, omissions, or ambiguities in information, materials, or products that take work to resolve. The longer the gap between creating a defect and detecting it, the harder it is to fix.
  • Non-standard or manual work: work that depends on non-standard or manual operations performed by others, such as deploying a system by hand.
  • Heroics: some people and teams are forced into unreasonable efforts in order for the organization to achieve its goals.

Only by systematically addressing the wastes listed above can we reduce or eliminate these burdens and achieve the goal of fast flow.

Step 2: Feedback principle

The first step enables work to flow from left to right in the value stream; the second step creates a mechanism for fast, continuous feedback from right to left at every stage. By amplifying feedback loops, this approach prevents problems from recurring, shortens the time needed to detect them, and enables rapid recovery from failures. Our goals are to control quality at the source and embed the relevant knowledge in the process, to create safer systems of work that detect and resolve problems before they turn into catastrophic failures, and ultimately to establish a safe and reliable system of work.

Generally speaking, the best time to discover and correct a problem is the moment it occurs, and a problem can only be solved once it has been discovered. By establishing high-quality feedback mechanisms throughout the workflow and the organization, we can fix problems while they are still small and cheap to fix, eliminate them before disaster strikes, and create an organizational learning environment.

Work safely in complex systems

An important feature of a complex system is that no single person can see it as a whole. Its components are usually tightly coupled and closely interrelated, so the behavior of the system cannot be explained by the behavior of its components alone. Failures are inevitable in complex systems, so we need to design a safe system of work in which engineers can work without fear, that is, experiment freely, and in which errors are detected quickly, before a disaster occurs. The following four measures make a complex system safer:

  • Manage complex work so that problems in design and operations are revealed
  • Swarm on problems and solve them together, building new knowledge quickly
  • Apply new local knowledge globally throughout the organization
  • Have leaders continually develop people who have these capabilities

Identify problems in time

In order to find problems in time, there are generally two approaches: passive waiting and active trial and error.

Usually, we set up a monitoring system with metrics along multiple dimensions. When the system fails, the relevant people receive an alert and start locating and solving the problem based on the alert information. This is passive waiting: we wait for the failure, and its timing is out of our control. It may happen during working hours, but it seems more likely to happen while we are asleep at night, resting on the weekend, traveling on vacation, or even in the middle of our own wedding ceremony. Still, we cannot do without this approach: the monitoring system built for passive waiting is the foundation of active trial and error.
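
The passive approach boils down to comparing metrics against thresholds and alerting when one is exceeded. Here is a minimal Python sketch of that check; the metric names, the threshold values, and the `check_metrics` function are all assumptions made for illustration, and a real monitoring system would collect metrics continuously and route alerts to the people on call.

```python
from typing import Dict, List

# Hypothetical thresholds; real values depend on the system being monitored.
THRESHOLDS = {
    "error_rate": 0.05,      # fraction of failed requests
    "p99_latency_ms": 800,   # 99th percentile latency in milliseconds
    "cpu_usage": 0.90,       # fraction of CPU in use
}


def check_metrics(metrics: Dict[str, float]) -> List[str]:
    """Return one alert message per metric that exceeds its threshold."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        # Metrics missing from the sample are skipped rather than alerted on
        if value is not None and value > limit:
            alerts.append(f"ALERT {name}={value} exceeds {limit}")
    return alerts


sample = {"error_rate": 0.12, "p99_latency_ms": 430, "cpu_usage": 0.95}
for alert in check_metrics(sample):
    print(alert)
```

In this sample, the error rate and CPU usage are over their limits while latency is not, so two alerts are raised.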

Active trial and error means continuously verifying the design and its assumptions within a safe system of work. The two key words are active and safe: it would be a joke if we crashed the production system during verification. The goal is to feed information into the system from as many dimensions as possible, earlier, faster, and at the lowest cost, and to identify cause and effect as clearly as possible. The more hypotheses you can rule out, the faster you can locate and solve problems. This process also serves as training and is a good opportunity for learning and innovation.

Work together to overcome problems and gain new knowledge

This follows on from “identify problems in time”: once we find a problem, we need to solve it, and we need to mobilize everyone involved to brainstorm and solve it together. When something goes wrong, the worst thing to do is to work around the problem or say “I don’t have enough time”. We must solve the problem, even if that means stopping production altogether.

As for why everyone involved should be involved in solving the problem, there are several reasons:

  • Involving the relevant people in locating and handling problems gives them a deeper understanding of the system and turns the unavoidable early stage of ignorance into a learning process.
  • It prevents problems from being passed downstream, where the cost and effort of repair grow exponentially and technical debt accumulates.
  • It prevents new work from starting; new features built before the problem is solved are likely to introduce new errors.
  • If the problem is not fixed, it will recur and be more expensive to fix.

Ensure quality at source

This applies mostly to QA and development, and is a bit like the national policy of “whoever pollutes, cleans up”. In daily work, everyone in the value stream should identify and solve problems within their own area of control, so that quality control, safety responsibility, and decision making sit where the work is done rather than depending on approval from distant senior management.

For example, developers can use automated tests during development, independently of the test team, so that they can test their code quickly whenever they need to. Once the code passes the full automated test suite, it can be deployed to production. In this way, you take responsibility both for yourself and for others.

Optimize for downstream work

This follows the lean principle that our most important customer is the next step downstream. Optimizing our work for downstream requires us to empathize with them so that we can better identify design issues that hinder fast, smooth flow. For example, development should optimize its work for operations by attending to architecture, performance, stability, testability, configurability, security, and similar properties. This optimization work is as important as delivering functionality to customers.

Step 3: Keep learning and experimenting

The first step establishes a left-to-right workflow, the second step establishes a right-to-left feedback mechanism, and the third step establishes a culture of continuous learning and experimentation, in which improvements in individual skills become assets of the team and the organization.

At the heart of this step is a high-trust culture, one that emphasizes that everyone is a continuous learner who takes risks in daily work. We improve our work and develop our products in a safe way, and we learn from both successes and failures to identify valuable ideas and discard worthless ones. Individual efforts drive the evolution of the whole, helping the entire team try out and practice new technologies and methods.

This includes creating a culture of innovation, risk taking (as opposed to fear or blind obedience), and high trust (as opposed to low trust and command and control); allocating at least 20% of development and IT operations time to non-functional requirements; and constantly encouraging improvement.

Establish a learning organization and a safety culture

In complex systems, it is not practical to accurately predict the outcome. That said, no matter how careful we are, glitches will always happen.

The Westrum model proposes three types of organizational culture:

  • Pathological organizations are characterized by large amounts of fear and threat, and they tend to hide failures.
  • Bureaucratic organizations are characterized by strict rules and procedures, with each department looking after its own turf. In such organizations, failures are handled through a system of judgment that results in either punishment or leniency.
  • Generative organizations actively seek out and share information. In such organizations, responsibility is shared across the whole team, and people actively reflect on failures to find the root cause.

The third step calls for a generative organization. When a failure occurs, the team focuses on how to design a safer system that prevents the accident from recurring, rather than on who is to blame. As Etsy engineer Bethany Macri said, “By removing blame, you remove fear; by removing fear, you enable honesty; and honesty enables prevention.”

Institutionalize routine improvements

In the technology value stream, teams get bogged down implementing ad hoc workarounds to stave off catastrophic failures, leaving no time for valuable work. The workaround habit in turn lets problems and technical debt accumulate. We therefore need to set aside time in our daily work to improve that daily work: paying off technical debt, fixing bugs, refactoring, optimizing code, and so on. This requires the team to reserve a period of time between development cycles for team members to solve such problems. The effect of one thing is always the cause of another: solving everyday problems helps us identify and remove potential risks and frees up energy for more meaningful work.

Turn local discovery into global optimization

Turning local discovery into global optimization means that those who get ahead first help the rest of the team catch up. When a single team or individual has acquired unique knowledge or experience, that tacit knowledge (knowledge that is hard to pass on through documents or conversation) should be converted into explicit knowledge, building a global knowledge base and forming collective wisdom. When others do similar work, they can quickly find their predecessors’ experience simply by searching the knowledge base.

Inject resiliency patterns into your daily work

The goal is to make the team and the system more resilient and less fragile. To fight fragility, you need to know where it is. Based on experience, we can improve the resilience of a system by shortening deployment lead times, improving test coverage, shortening test execution time, and decoupling the system. We can verify resilience through fault drills, such as randomly unplugging network cables, cutting power, and killing processes (for example, Netflix’s Chaos Monkey). We can also find the bottlenecks and upper limits of the system through load testing, both of a single interface and of the full request path.
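
The idea of a fault drill can be illustrated with a toy model: stop one replica at random and check that the service still answers. The sketch below is only an in-memory simulation under that assumption. The `Cluster` class and its methods are hypothetical, and it does not use or imitate the actual Chaos Monkey tool.

```python
import random


class Cluster:
    """Toy model of a service backed by several replicas (hypothetical)."""

    def __init__(self, replicas: int) -> None:
        self.alive = {f"replica-{i}": True for i in range(replicas)}

    def kill_random(self, rng: random.Random) -> str:
        """Fault injection: stop one running replica chosen at random."""
        running = [name for name, up in self.alive.items() if up]
        victim = rng.choice(running)
        self.alive[victim] = False
        return victim

    def handle_request(self) -> str:
        """Serve from any running replica; fail only if none is left."""
        for name, up in self.alive.items():
            if up:
                return f"served by {name}"
        raise RuntimeError("no replica available")


rng = random.Random(42)  # fixed seed so the drill is repeatable
cluster = Cluster(replicas=3)
killed = cluster.kill_random(rng)
print(f"killed {killed}; request {cluster.handle_request()}")
```

If the request still succeeds after a replica is killed, the redundancy works as designed; if it raises an error, the drill has found a weak point before a real outage did.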

Leaders reinforce the culture of learning

This one is for leaders. Good leadership is not about making all the right decisions, but about creating the conditions for the team to discover greatness in their daily work. Leaders do not take part in front-line work themselves, and front-line workers cannot see the larger organizational context and lack the authority to make changes outside their own area. The relationship between the two is therefore complementary, and they must respect each other.

Finally

The three-step method is the set of basic principles underpinning DevOps, from which the behaviors and patterns of DevOps are derived. I’m sure many teams have already started down the DevOps path, which typically passes through four stages:

  1. Dev only, no Ops: developers do everything themselves.
  2. Dev and Ops exist but are independent of each other, with Ops doing all the work other than writing code.
  3. Dev+Ops: Ops builds some automation tools to improve efficiency, but mainly for its own use, not for development.
  4. DevOps: development, upstream, willingly uses the systems or platforms provided by operations, downstream, and completes the corresponding work automatically through self-service APIs.

Take a look at where you are. If you haven’t reached stage 4 yet, follow the three steps toward DevOps one at a time.

Hello, I’m looking at the mountains. I swim in code and enjoy life along the way. If this article helped you, please like, bookmark, and follow. You are also welcome to follow the public account “Mountain Hut” and discover a different world.