Reading notes from the Geek Time SRE Field Manual

Portal: time.geekbang.org/column/intr…

Background of SRE

Internet companies large and small are embracing microservices, containers, and other distributed technologies and products in order to deliver user value more efficiently, and are also adopting practices like DevOps. This brings a challenge: with so many advanced technologies and practices in play, it becomes hard to guarantee the stability of such a complex architecture. What should we do?

For example, a microservice architecture has more complex upstream and downstream call relationships, longer call chains, and more service metrics to monitor. Combined with the frequent changes and releases that DevOps encourages, stability has to be guaranteed globally.

The answer is SRE!

In recent years the industry has paid more and more attention to SRE, and there is near consensus that Google SRE is the current best practice in the stability field. In other words, SRE has become a byword for stability.

What exactly is SRE

Common takes: SRE is a job title, one that demands full-stack skills, so we need an all-knowing SRE to solve our stability problems. The ops engineer says: SRE is an upgraded version of traditional operations; do monitoring well, find problems fast, locate root causes fast, and automate operations. The infrastructure engineer says: SRE should strengthen capacity planning and, like Google, achieve fully automated elastic scaling. Each of these views sounds reasonable, but none is complete. In fact:

SRE is a systematic approach; we can only understand it properly from a global perspective.

In terms of the division of responsibilities, an SRE system cannot be built by a single role or department working alone; it requires efficient cross-team collaboration:

(Figure: SRE stability guarantee planning diagram)

Many of the pieces involved are familiar: capacity assessment, failure drills, service degradation, rate limiting, circuit breaking, monitoring and alerting, and so on:

  • Scaling out and in, for example, necessarily intersects with the operations team; if the scaling is elastic and driven by monitoring data, the monitoring team has to be involved as well;

  • SRE may also rely on a DevOps platform for basic capabilities such as continuous delivery, configuration changes, and gray (canary) releases, which is where the development and engineering-productivity teams come in.

The fundamental purpose of SRE is to improve stability

Looking at how the industry commonly measures stability, there are two key indicators:

  • MTBF (Mean Time Between Failures): the average time between failures

    • Objective: increase MTBF, i.e. reduce the number of failures
  • MTTR (Mean Time To Repair): the average time to repair a fault

    • Objective: reduce MTTR, i.e. improve troubleshooting efficiency and shorten the duration of each failure

MTTR

It can be subdivided into four indicators: MTTI (Mean Time To Identify, discovering the problem), MTTK (Mean Time To Know, locating the root cause), MTTF (Mean Time To Fix), and MTTV (Mean Time To Verify), so that MTTR = MTTI + MTTK + MTTF + MTTV.

  • Pre-MTBF stage (failure-free stage): get the architecture design right and build in design-for-failure service governance mechanisms such as rate limiting, degradation, and circuit breaking, so that faults can be isolated quickly.

  • Post-MTBF stage: before the next MTBF stage begins, hold a fault postmortem, summarize the lessons learned, identify shortcomings, and implement improvements.

  • MTTI stage: rely on the monitoring system to discover problems in time; for highly complex, high-volume systems, rely on AIOps capabilities to improve alert accuracy and respond precisely.

Following the idea of beginning with the end in mind, the goal of SRE is to “increase MTBF and reduce MTTR”. From this perspective we see once again that SRE must be built as a systematic project.

Understand system availability

Since the goal of SRE is to “increase MTBF and decrease MTTR”, i.e. to improve system stability, we need a basic understanding of system availability. There are two ways to measure it:

  • Time dimension: Availability = Uptime / (Uptime + Downtime)
  • Request dimension: Availability = Successful Requests / Total Requests
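
To make the two calculations concrete, here is a minimal Python sketch; the 28-day window, downtime, and request counts are illustrative numbers, not figures from the column.

```python
# Two ways to measure availability, matching the formulas above.

def availability_by_time(uptime_s: float, downtime_s: float) -> float:
    """Time dimension: Availability = Uptime / (Uptime + Downtime)."""
    return uptime_s / (uptime_s + downtime_s)

def availability_by_request(successful: int, total: int) -> float:
    """Request dimension: Availability = Successful Requests / Total Requests."""
    return successful / total

# Illustrative: ~20 minutes of downtime in a 28-day window is roughly 99.95%.
window_s = 28 * 24 * 3600
downtime_s = 20 * 60
print(f"{availability_by_time(window_s - downtime_s, downtime_s):.4%}")  # 99.9504%

# Illustrative: 24,999,000 successes out of 25,000,000 requests.
print(f"{availability_by_request(24_999_000, 25_000_000):.4%}")  # 99.9960%
```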

Factors to consider in setting system stability goals

Whether we measure availability by time or by requests, three elements are involved: the metric being measured, the target, and the impact duration / statistical window. Both calculations ultimately settle on “how many nines”. How many nines a given system actually needs in order to meet our stability requirements is worth discussing:

  • Cost factor

    The more nines, the more stable the system, but the higher the corresponding cost and effort.

    Think about ROI first; how far to go depends on how much cost pressure the company can bear.

  • Business tolerance

    For core businesses or core applications: the higher the success rate, the better

    For non-core businesses or applications: a looser target is acceptable, as long as revenue and user experience are not significantly affected

  • Current system stability

    Setting a reasonable standard is more important than setting a higher standard

    Move toward a higher standard step by step. This is also easier to put into practice: set the goal too high, miss it, and you hurt the team’s confidence and motivation

How does SRE measure availability

Since “system availability” can be calculated either by duration or by the proportion of successful requests, SRE practice usually chooses the second option:

Availability = Successful Requests / Total Requests

In other words, stability is measured by the proportion of successful requests. Deciding what counts as a successful request and setting the target proportion is precisely how SRE establishes its stability yardstick: the process of defining SLIs and SLOs.

SLI & SLO concept

  • SLI (Service Level Indicator): the service level indicator we choose to measure our stability.

  • SLO (Service Level Objective): the stability goal we set for that indicator, such as “how many nines”.

SLI is the metric we want to monitor, and SLO is the target of this metric.

There are many system health indicators; which ones are suitable as SLIs?

  • Principle 1: choose indicators that can mark whether the subject being measured is stable. Indicators that do not belong to the subject itself, or that cannot mark its stability, should be excluded.

  • Principle 2: for user-facing business systems such as e-commerce, choose indicators that are strongly related to user experience or that users can clearly perceive.

A quick method for identifying SLIs: VALET (Volume, Availability, Latency, Errors, Tickets)

Calculate availability through SLO

Once the SLIs are defined, the SLOs can be set. Once the SLOs are set, the actual availability must be calculated in practice to judge whether each SLO is met.

System availability: Availability = Successful Requests / Total Requests

The first method is to calculate directly from a definition of what makes a request successful:

Successful = (status code is not 5xx) & (latency <= 80ms)

Problems:

  1. Judging success request by request is too rigid and prone to false negatives.

    (For example, we usually set a percentile target for latency, such as 90% of requests completing within 200ms, which a per-request rule cannot express.)

  2. The tolerances for the status-code success rate and the latency success rate are usually different, so this method is not precise enough.

Because it is easy to judge, this method is usually used for service commitments made to third parties, i.e. the Service Level Agreement (SLA).

The second method is the SLO method

For example:

SLO1: 99.95% success rate for status codes
SLO2: 90% of requests with latency <= 80ms
SLO3: 99% of requests with latency <= 200ms

Use the formula: Availability = SLO1 & SLO2 & SLO3

Only when all three SLOs are met does the stability of the entire system meet the standard; if any one SLO is missed, the standard is not met. This ties the reasonableness of each SLO setting directly to the final availability figure.
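
A minimal sketch of this multi-SLO check in Python, assuming per-request logs that record a status code and a latency; the Request type and function names are illustrative, with thresholds taken from SLO1–SLO3 above.

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # request latency in milliseconds

def slo1_status(reqs, target=0.9995) -> bool:
    """SLO1: 99.95% of requests return a non-5xx status code."""
    ok = sum(1 for r in reqs if r.status < 500)
    return ok / len(reqs) >= target

def slo_latency(reqs, limit_ms: float, target: float) -> bool:
    """SLO2/SLO3: a given share of requests complete within limit_ms."""
    fast = sum(1 for r in reqs if r.latency_ms <= limit_ms)
    return fast / len(reqs) >= target

def availability_met(reqs) -> bool:
    """Availability = SLO1 & SLO2 & SLO3: all must hold at once."""
    return (slo1_status(reqs)
            and slo_latency(reqs, limit_ms=80, target=0.90)
            and slo_latency(reqs, limit_ms=200, target=0.99))

# 100 requests: a single 5xx drops the status success rate to 99%, below 99.95%.
sample = [Request(200, 50)] * 98 + [Request(200, 150), Request(502, 30)]
print(availability_met(sample))  # False
```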

Error budgets: a consensus mechanism for achieving stability goals

Let’s follow the SLO thread and see what we can do with SLOs, that is, how to apply them.

To land an SLO, first convert it into an Error Budget.

An error budget works much like the points system on a driver’s license: its biggest effect is reminding you how many chances you have left to make mistakes. The warning effect of an error budget is also more intuitive than a raw success-rate statistic, and its psychological impact is stronger.

(Figure: deriving the error budget backward from the SLO)
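
A sketch of that reverse derivation: the error budget is simply the allowed failure fraction times the expected request volume for the statistical window. The 4-week window and 50,000,000-request volume are assumptions chosen to line up with the 25,000 budget used in the fault-grading example below.

```python
def error_budget(total_requests: int, slo: float) -> int:
    """Error Budget = the number of requests allowed to fail in the window."""
    return round(total_requests * (1 - slo))

# Illustrative: 50,000,000 requests expected in a 4-week window, SLO 99.95%.
print(error_budget(50_000_000, 0.9995))  # 25000 failures allowed this period
```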

How to apply Error Budget?

Stability burn-down chart

When you and your team members can see how many chances to fail remain, your respect for the production system grows considerably. And once a certain share of the error budget has been consumed, say 80% or 90%, it is time to raise early warnings, control changes, or concentrate on resolving the problems that affect stability.

Fault grading

Besides how long a problem lasts, a more actionable way to decide whether it counts as a fault, and how severe it is, is to look at the percentage of the error budget the problem consumes.

Suppose the error budget corresponding to a module’s request-success SLO is 25,000 requests. If a single problem produces more than 5,000 error requests, i.e. consumes more than 20% of the error budget at once, we classify it as a P2 fault. Likewise, above 30% we call it P1, above 50% P0, and so on. A sketch of both uses follows.
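
A minimal sketch combining the burn-down warning and the fault grading, with the 80%/90% warning thresholds and the P2/P1/P0 boundaries from the text; the function names are illustrative.

```python
def should_warn(errors_so_far: int, budget: int, threshold: float = 0.8) -> bool:
    """Start controlling changes once e.g. 80%-90% of the budget is spent."""
    return errors_so_far / budget >= threshold

def grade_incident(incident_errors: int, budget: int) -> str:
    """Grade a single incident by the share of the budget it burned at once."""
    share = incident_errors / budget
    if share > 0.5:
        return "P0"
    if share > 0.3:
        return "P1"
    if share > 0.2:
        return "P2"
    return "below fault threshold"

print(should_warn(21_000, 25_000))    # True: 84% of the budget is gone
print(grade_incident(6_000, 25_000))  # P2: one incident burned 24% of the budget
```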

Stability consensus mechanism

Based on how much of the budget remains, corresponding actions are agreed in advance to keep us from missing the stability goal, i.e. the SLO.

First, while the remaining budget is still sufficient, tolerance for problems should be higher.

If the budget is ample and a single problem does not consume much of it, the problem should be tolerated rather than escalated or given a high-priority response.

Second, when the remaining budget is being consumed too quickly or is nearly exhausted, SRE has the authority to halt and reject any online changes.

Teams then work with SRE on the stability problems until they are resolved, and return to the normal change rhythm once the next cycle brings a fresh error budget.

(Actions to ensure stability cannot be carried out by one party alone; they require the agreement and cooperation of many parties.)

Be sure to push surrounding teams and stakeholders to reach this consensus, from the top down.

Alerting based on the error budget

  1. Merge alerts of the same or similar types before sending them
  2. Alert based on the error budget, i.e. focus only on alerts that actually threaten stability; see sre.google/workbook/al… (a sketch follows this list)
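
The linked workbook chapter builds alerts on the budget burn rate: how many times faster than “exactly on budget” errors are arriving. A minimal single-window sketch, assuming a 30-day window; a 14.4x rate sustained for one hour spends 2% of a 30-day budget, which is a commonly cited paging threshold.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving.
    A burn rate of 1 exhausts the budget exactly at the end of the window."""
    return error_rate / (1 - slo)

def should_page(error_rate_1h: float, slo: float, threshold: float = 14.4) -> bool:
    """Page when the 1-hour burn rate crosses the threshold."""
    return burn_rate(error_rate_1h, slo) >= threshold

# Example: SLO 99.9%; a 2% error rate over the last hour is a 20x burn rate.
print(should_page(0.02, 0.999))  # True
```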

Case study: what else to consider when landing an SLO

A system’s service dependencies are very complex, so first sort out the core and non-core links. When setting the system’s SLOs, the general principle is to set the SLO of the core link first, and then decompose it along the core link.

Core link: identify the core applications and their strong/weak dependency relationships

  • Identify all the applications involved using techniques such as full-link tracing, yielding a topology of the call relationships

  • Which applications are core is determined by business scenarios and characteristics, and basically has to be analyzed and discussed one by one; this can take considerable manual effort.

What are the principles for setting up an SLO?

  1. The SLO for core applications should be stricter, while the SLO for non-core applications can be relaxed
  2. Where core applications strongly depend on each other, their SLOs must be consistent

For example, the Buy (order-placing) application depends on the Coupon (promotion) application. If we require the order-success SLO to be 99.95%, then a Coupon SLO of only 99.9% does not meet the requirement.

  3. Where a core application weakly depends on a non-core application, there should be degradation, circuit breaking, rate limiting, and other service governance measures
  4. Error Budget policy: the core applications should share a common Error Budget

That is, if any core application’s error budget is exhausted and its SLO is missed, then in principle changes across the whole link should be suspended.

(If the error budget of a single application is exhausted, stop changes to it, and resume only after the problem is completely resolved.)

How do I verify the SLO of a core link?

Capacity stress testing

The main purpose of capacity stress testing is to verify the Volume element of the SLO, i.e. whether the capacity target can be achieved. For a typical business system we express capacity as QPS or TPS, and derive the capacity metric from the test.

Another purpose of capacity stress testing is to verify whether the rate-limiting and degradation strategies described above actually work in extreme-load scenarios. A rough load-generation sketch follows.
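
A minimal load-generation sketch for deriving a rough QPS figure; the endpoint URL, request count, and concurrency are hypothetical, and a real capacity test would use a dedicated tool and production-like traffic.

```python
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/health"  # hypothetical endpoint
REQUESTS = 1000
CONCURRENCY = 50

def hit(_: int) -> bool:
    """Issue one request; count non-5xx responses as success."""
    try:
        with urllib.request.urlopen(URL, timeout=2) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as e:
        return e.code < 500
    except OSError:
        return False

start = time.monotonic()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(hit, range(REQUESTS)))
elapsed = time.monotonic() - start

print(f"QPS ~= {REQUESTS / elapsed:.0f}, "
      f"success rate = {sum(results) / REQUESTS:.2%}")
```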

Why do we do capacity stress testing?

In a distributed system, dependency problems cannot be exposed by analysis or architecture diagrams alone: the business changes daily, and the call relationships and call volumes between applications shift over time. We therefore need means such as capacity stress testing to simulate load, verify behavior, and expose the dependency problems.

Chaos Engineering

Chaos engineering is a complex, systematic undertaking: creating faults online inevitably has some impact on the live business, and if a simulated fault causes more real damage than estimated, we must be able to isolate it quickly and restore normal service quickly.

Chaos engineering is the advanced stage of building an SRE stability system; consider it only once the more basic, essential parts of the SRE system are solid: service governance, capacity stress testing, distributed tracing, monitoring and alerting, operations automation, and so on.

When should system validation be done?

  1. Mind the error budget: avoid periods when the error budget is already running low. Under normal business conditions we are under enough pressure to meet the SLO without adding new risks to system stability.
  2. Assess the impact of the fault simulation: if the potential business impact is large, adopt fine-grained, step-by-step approaches to avoid unpredictable losses (for example, run drills in the early morning when business volume is relatively low)