Summary: In technical work, the product/foundational technology development role and the SRE role are often distinguished by whether the work centers on writing code. As a result, moving from product development to SRE is often seen as “moving away from coding” or “moving away from advancing the product or underlying technology.”

Preface

In technical work, the product/foundational technology development role and the SRE role are often distinguished by whether the work centers on writing code. As a result, moving from product development to SRE is often seen as “moving away from coding” or “moving away from advancing the product or underlying technology.”

Drawing on my past experience in both technology R&D and stability assurance, I will share my understanding of SRE and discuss how the “product/foundational technology R&D” role and the “stability assurance” role can collaborate to better serve the business.

SRE overview

The earliest discussion of SRE comes from Google’s book Site Reliability Engineering: How Google Runs Production Systems. In it, key members of Google’s SRE team explain how they take a holistic view of the software life cycle, and why doing so helps Google build, deploy, monitor, and operate some of the largest software systems in the world.

Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems.

Here is a vivid description of the work of SRE:

SRE is “what happens when a software engineer is tasked with what used to be called operations.”

The goal of SRE is to build extensible and highly available software systems that solve infrastructure and operations-related problems through a software engineering approach.

The Google SRE book also describes the day-to-day working state of SRE: at most 50% of an SRE’s time is spent on operational work, and the remaining 50% or more is spent on software engineering that ensures the stability and scalability of the infrastructure.

Based on the above description, my understanding of SRE is as follows:

  • Responsibilities: Ensure the stability and scalability of the infrastructure.
  • Core: Solve problems.
  • Methods: Accumulate experience with problems through operational work, and improve problem-solving efficiency through coding and other means.

Software Lifecycle

The Google SRE book gives a vivid description of software engineering from a life-cycle perspective:

Software engineering is sometimes like raising a child: the labor before the birth is painful and difficult, but the labor after the birth, raising the child to adulthood, is where most of the effort is actually spent.

Between 40% and 90% of the total cost of a software system is incurred on maintenance after it has been developed and built.

The amount of time and effort spent designing and building a software system during the project life cycle is usually less than the amount spent maintaining and managing the system after it has gone live. In order to better maintain reliable system operation, two types of roles need to be considered:

  • One focuses on designing and building software systems.
  • The other focuses on managing the entire software system life cycle, from design and deployment, through continuous improvement, to a smooth decommissioning.

The first role corresponds to product/foundational technology R&D, and the second corresponds to SRE. The two roles share a common goal: to achieve the project objectives and, by cooperating, to serve the business well.

Stability guarantee value

Engineers who deal with customer issues directly feel the impact of stability most keenly:

  • From the scope of impact and the urgency that customers report when a problem occurs, they feel the anxiety that instability causes customers.
  • From customer feedback after a problem is handled, they feel the customer’s gratitude or anger about how stability was safeguarded.
  • From changes in revenue and customer base after an incident, they feel the effect of stability on business revenue.
  • From delays to the product roadmap, they feel the effect of stability on product iteration.

The value of stability assurance is thus highlighted:

  • Protect the customer’s product experience and meet the customer’s reliability requirements.
  • Accelerate business iteration: once the business’s stability needs are met, the team can focus on shipping the features customers need more quickly.

How SRE guarantees stability

Stability problems usually have these characteristics:

  • They are often caused by people, and preventing them relies on expert experience.
  • They result from a combination of factors.
  • They are inevitable.
  • A 100% guarantee is not necessary.

Online stability problems are very often caused by improper human operations, concentrated in two activities: releases and online operations and maintenance. Both are high-frequency operations, and for complex systems both depend heavily on expert experience.

Stability problems are usually systemic: they are caused not by a defect in a single functional component but by a chain of factors. For example, missing monitoring and alerting means the problem is not noticed in time, missing logs make it hard to locate the problem quickly, and the lack of a good troubleshooting process means resolution depends on individual ability. Poor coordination and communication then lengthen the handling time and increase the impact on customers.

Problems are inevitable. They can be caused by traffic surges, server, network, or storage failures, inputs that were never covered, and so on.

The business has an external SLA that promises customers a certain level of stability and pays compensation as agreed when that level is not met. At the same time, problems are inevitable. Once the internal SLO is already being met, continuing to improve stability brings higher implementation costs and smaller incremental returns for the business.
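
To make the trade-off between an availability target and its cost more concrete, here is a minimal sketch that converts a target into the downtime it allows over a 30-day window. The class name ErrorBudget, the 30-day window, and the 99.5% and 99.9% figures are illustrative assumptions, not values from any real agreement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Downtime allowance implied by an availability target over a window."""

    slo: float             # availability target as a fraction, e.g. 0.999
    window_minutes: float  # length of the measurement window in minutes

    @property
    def allowed_minutes(self) -> float:
        # Whatever the target does not cover is the downtime budget.
        return (1.0 - self.slo) * self.window_minutes

    def remaining_minutes(self, downtime_minutes: float) -> float:
        # A negative result means the target has already been missed.
        return self.allowed_minutes - downtime_minutes


WINDOW_30_DAYS = 30 * 24 * 60  # 43,200 minutes

# Hypothetical targets: a looser external SLA and a stricter internal SLO.
external_sla = ErrorBudget(slo=0.995, window_minutes=WINDOW_30_DAYS)
internal_slo = ErrorBudget(slo=0.999, window_minutes=WINDOW_30_DAYS)

for target in (0.99, 0.999, 0.9999):
    budget = ErrorBudget(slo=target, window_minutes=WINDOW_30_DAYS)
    print(f"{target:.2%} -> {budget.allowed_minutes:.1f} minutes of downtime per 30 days")
```

Each additional nine cuts the allowed downtime by a factor of ten (roughly 432, 43.2, and 4.32 minutes per 30 days), which is one way to see why pushing stability beyond the agreed target costs more and more while giving the business less and less in return.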

SRE requires a deep understanding of these characteristics, systematic design and implementation of solutions, and a sustained focus on the most important problems. An overall solution for reference is as follows:

When putting this into practice, you can start with the following three areas:

  • Controllability
  • Observability
  • Stability assurance best practices

Controllability has three main dimensions:

  • Release management focuses on stability problems that people introduce through releases. It includes reviewing important changes before a release and managing change actions during a release.
  • Operation management focuses on stability problems that people introduce through command-line (“black screen”) operations. It includes a unified entry point for cluster operations, permission management for cluster operations, auditing of cluster operations, and so on.
  • Design review focuses on applying stability assurance best practices at the design stage of a software system. It includes reviews of cluster plans and of important feature designs.
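
As an illustration of the operation management dimension, the following minimal sketch routes every cluster operation through one entry point that checks permissions and writes an audit record. All names here (PERMISSIONS, AUDIT_LOG, audited_operation, restart_service) are hypothetical, and the in-memory stores stand in for whatever permission and audit systems a real platform would use.

```python
import functools
from datetime import datetime, timezone

# Hypothetical in-memory stores; a real system would persist these elsewhere.
PERMISSIONS = {"alice": {"restart_service"}, "bob": set()}
AUDIT_LOG = []


def audited_operation(name):
    """Wrap a cluster operation with a permission check and an audit record."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(operator, *args, **kwargs):
            allowed = name in PERMISSIONS.get(operator, set())
            # Record every attempt, including the ones that are rejected.
            AUDIT_LOG.append({
                "time": datetime.now(timezone.utc).isoformat(),
                "operator": operator,
                "operation": name,
                "args": args,
                "allowed": allowed,
            })
            if not allowed:
                raise PermissionError(f"{operator} may not run {name}")
            return func(operator, *args, **kwargs)

        return wrapper

    return decorator


@audited_operation("restart_service")
def restart_service(operator, cluster, service):
    # Placeholder for the real action against the cluster.
    return f"{service} restarted on {cluster} by {operator}"
```

With this in place, restart_service("alice", "cluster-a", "api") succeeds, the same call made by "bob" raises PermissionError, and both attempts appear in AUDIT_LOG. Logging before the permission check is a deliberate choice in this sketch, so that rejected attempts can also be reviewed later.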

Observability includes the following important dimensions:

  • Monitoring focuses on perceiving the running state of the software system. It includes building and maintaining the systems that collect and visualize monitoring data.
  • Logging focuses on the ability to troubleshoot the software system. It includes building and maintaining systems for log collection, storage, query, and analysis.
  • Inspection focuses on actively probing whether the software system’s functions work normally. It includes building the inspection service and developing and maintaining common inspection logic.
  • Alerting focuses on making sure anomalies reach the right people in time. It includes building the alerting system, managing alert configuration and notification channels, and analyzing alerts.
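
The inspection and alerting dimensions can be sketched together as a small probe runner: it executes named checks and raises one alert when a check has failed several times in a row. The Inspector class, its failure_threshold default, and the alert message format are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Inspector:
    """Runs named health checks and alerts after consecutive failures."""

    checks: Dict[str, Callable[[], bool]]
    failure_threshold: int = 3  # hypothetical default
    alerts: List[str] = field(default_factory=list)
    _streaks: Dict[str, int] = field(default_factory=dict)

    def run_once(self) -> None:
        for name, check in self.checks.items():
            try:
                healthy = bool(check())
            except Exception:
                healthy = False  # a crashing probe counts as a failure
            if healthy:
                self._streaks[name] = 0
                continue
            self._streaks[name] = self._streaks.get(name, 0) + 1
            # Alert exactly once per failure streak to avoid repeated noise.
            if self._streaks[name] == self.failure_threshold:
                self.alerts.append(
                    f"{name} failed {self.failure_threshold} times in a row"
                )
```

A scheduler would call run_once() at a fixed interval. Requiring consecutive failures before alerting trades a slightly later alert for fewer false alarms, which is one of the configuration decisions that alert analysis is meant to inform.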

Stability assurance best practices distill awareness, processes, norms, and tools from past incidents and industry practice, build them in at the start of system design, and apply them throughout the system’s life cycle. One way to solidify best practices is through templates:

  • Project quality acceptance criteria
  • Project safety production standards
  • Checklist before project release
  • Project TechReview template
  • Project Kick-off template
  • Project Management Specification
  • etc.

An example:

To make them easier to understand, check items can be grouped into categories, which helps with communication and with assessing project stability:

Once best practices have been standardized in documents, tools or services can be provided to apply them at low cost, turning stability assurance best practices into infrastructure. SRE needs to keep iterating on stability-related methodologies and practices, combining top-down design with bottom-up feedback, to provide reasonable and reliable stability assurance.
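
As a rough illustration of turning a documented checklist into a tool, the sketch below models check items grouped by category and computes which required items still block a release. The CheckItem fields, the category names, and both helper functions are hypothetical and are not taken from any actual template.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CheckItem:
    category: str     # e.g. "observability" or "controllability" (illustrative)
    description: str
    required: bool    # required items block the release when they fail
    passed: bool


def release_blockers(items: List[CheckItem]) -> List[CheckItem]:
    """Return required items that have not passed; an empty list means go."""
    return [item for item in items if item.required and not item.passed]


def summarize_by_category(items: List[CheckItem]) -> Dict[str, Tuple[int, int]]:
    """Map each category to (passed, total) for a quick stability assessment."""
    summary: Dict[str, Tuple[int, int]] = {}
    for item in items:
        passed, total = summary.get(item.category, (0, 0))
        summary[item.category] = (passed + int(item.passed), total + 1)
    return summary


# Hypothetical pre-release checklist.
checklist = [
    CheckItem("observability", "Core metrics have dashboards", True, True),
    CheckItem("observability", "Alerts configured for key errors", True, False),
    CheckItem("controllability", "Rollback plan reviewed", True, True),
    CheckItem("controllability", "Gray release steps documented", False, False),
]

for blocker in release_blockers(checklist):
    print(f"BLOCKED [{blocker.category}] {blocker.description}")
```

Keeping the checklist as data rather than as prose lets the same items drive a release gate, a per-category summary, and a report for review meetings.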

Win-win: serving the business hand in hand

  • Product/foundational technology development: focuses on designing and building software systems.
  • SRE: focuses on managing the entire software system life cycle, from design and deployment, through continuous improvement, to a smooth decommissioning.

These two roles cooperate and serve each other. They share the same goal: to meet the business needs and better serve the business.

SRE usually supports many projects of many types horizontally, so it develops a more comprehensive understanding of the practical problems that arise online. Building on that, it can distill theory into best practices, tools, or services that support R&D, and it can also turn its stability assurance solutions into offerings that serve more customers and create greater value. Product/foundational technology development has a deeper understanding of business requirements and of functional and technical details. On the one hand, it delivers business value directly; on the other hand, its practice surfaces concrete requirements for stability assurance, so that it can further guarantee stability together with SRE.

Both types of roles need to work side by side toward a common goal, develop together with the business, and achieve a win-win situation.

Summary

Because of the nature of its work, SRE serves a large number of businesses horizontally, and through that practice builds a deep understanding of the problem domain of stability assurance and a strong appreciation of its importance. Vertically, it uses technical means to consolidate and apply stability assurance best practices. At the same time, SRE shares a forward-looking vision with R&D and the business, integrating technology and management to create value.

The above is my personal understanding of SRE and stability assurance, with a focus on solving problems and creating greater value.

This article is the original content of Aliyun and shall not be reproduced without permission