SRE: Google Operations Decrypted

From slash-and-burn to a standardized process

[Stage 1] My first internship was at a small start-up with only a few people at the time, and the whole development process was very rough:

a. Compile and package after developing locally
b. Connect to the server over FTP and upload the build artifact to the cloud host

[Stage 2] My first internship project at ByteDance:

a. Test environment = local development environment
b. Code could go live at any time, with no code review required
c. Online problems on a page were basically discovered along the path: user -> operations -> product -> R&D/testing

For example, in one incident the front-end build was injected into the container as a data block through a cloud-server feature, and that feature was unavailable for a while, so the whole page returned 504. Because there was no monitoring, signals such as a rising 5XX error rate or a falling page PV were not noticed right away. In the end the problem was discovered the slow way: operations reported it to product, and product passed it on to R&D.

[Stage 3] Established a standard process for development and release, and started to build awareness of service stability:

Set up local, offline, and pre-release environments, and separated offline data from online data
Code review, with gated release permissions
Configured monitoring and alerts

The definition of SRE

The book is titled SRE: Google Operations Decrypted. So what is SRE?

SRE stands for Site Reliability Engineering. The book gives no clear definition. As I understand it, SRE is a set of methodologies: an approach to operations that Google software engineers arrived at while running production systems. (For example, early monitoring was done with probing scripts. It gradually evolved into a new model built on time-series data, with a rich language for manipulating that data, known as the Borgmon monitoring system.)

When I joined Google in 2003, my mission was to lead a “production environment maintenance team” of seven software engineers... That team of 7 has since grown into an SRE team of more than 1000 people within the company, but the guiding ideas and working model of the SRE team still largely follow my original thinking.

-- Chapter 1, Introduction

Responsibilities of the SRE team

Optimization

Availability improvements
Latency optimization
Performance optimization
Efficiency optimization

Change management

The business R&D team is judged largely by business iteration, such as how many requirements ship or how many UVs a page gains, while the SRE team is judged by service stability. Business iteration usually hurts service reliability, because most online problems are caused by new releases. This creates a conflict between the business development team and the SRE team: should we release more, or less?

To resolve this conflict, the concept of an error budget was introduced. The error budget can be thought of as a balance to spend. For example, the whole process can be:

1. Product management sets the service's availability SLO for the month, say 99.9%, so the error budget is 0.1%

2. Measure the actual availability through the monitoring system, say 99.95%

3. If actual availability > SLO, some error budget remains and a new version can be released

-- P31, The purpose of using an error budget
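
The three steps above can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: availability is measured as successful requests / total requests, and the function names `error_budget_remaining` and `can_release` are hypothetical, not taken from the book.

```python
def error_budget_remaining(slo: float, good_requests: int, total_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    slo is a fraction such as 0.999; it must be strictly between 0 and 1,
    because an SLO of 100% leaves no budget to spend.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - slo) * total_requests
    actual_failures = total_requests - good_requests
    return 1.0 - actual_failures / allowed_failures


def can_release(slo: float, good_requests: int, total_requests: int) -> bool:
    """A release is allowed only while some error budget is left."""
    return error_budget_remaining(slo, good_requests, total_requests) > 0


# SLO 99.9%, measured availability 99.95%: half of the budget is still unspent.
remaining = error_budget_remaining(0.999, 999_500, 1_000_000)
print(f"budget remaining: {remaining:.0%}, release allowed: {remaining > 0}")
```

Returning the remaining fraction rather than a plain yes/no lets a team see how fast the budget is being spent during the month.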

Monitoring

Monitoring and alerting
Emergency response

Resources

Capacity planning and management

Basic concept

Without a detailed understanding of which behaviors of a service matter, and without measuring whether those behaviors are correct, a system cannot be operated properly, let alone reliably. So whether for an external service or an internal API, we need to set a service level target for users and work to meet it.

-- P34, Chapter 4, Service Level Objectives

SLI: Service Level Indicator
Request latency: average, PCT50, PCT95, PCT99
Error rate
QPS
Resource utilization (CPU, memory, disk)
Throughput
Availability = machine uptime / total time, or successful requests / total requests

PCT99: sort a set of values from smallest to largest; PCT99 is the value at the 99% position.

The point of PCT99 is that some requests to a service are fast and some are slow, so taking the average directly can hide long-tail latency. In practice, latency tends to rise with QPS: a service may see an evening peak and low traffic the rest of the day, and the many fast off-peak requests pull the average down.
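
To make the difference between the average and PCT99 concrete, here is a minimal sketch. It uses the nearest-rank definition of a percentile, which is one common convention and an assumption on my part, and the latency numbers are made up for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the value at position ceil(pct% * n) in sorted order."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 98 fast requests and 2 very slow ones.
latencies = [20] * 98 + [2000] * 2

mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)

print(f"mean={mean} ms, PCT50={p50} ms, PCT99={p99} ms")
```

With this data the mean is 59.6 ms and PCT50 is 20 ms, both of which look healthy, while PCT99 is 2000 ms and exposes the slow tail.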

Actual case:

[Figure: interface QPS]

[Figure: interface PCT99]

SLO: Service Level Objective (a target set on an SLI), for example:
Availability greater than 99%
Average latency less than 100 ms

Setting an SLO

1. Setting an SLO depends on the business

2. It should not be 100%

a. It is theoretically impossible
b. Users see no meaningful difference between 99.99% and 100%

Availability level	Allowed unavailability window
	per year	per quarter	per month	per week	per day	per hour
90%	36.5 days	9 days	3 days	16.8 hours	2.4 hours	6 minutes
95%	18.25 days	4.5 days	1.5 days	8.4 hours	1.2 hours	3 minutes
99%	3.65 days	21.6 hours	7.2 hours	1.68 hours	14.4 minutes	36 seconds
99.5%	1.83 days	10.8 hours	3.6 hours	50.4 minutes	7.20 minutes	18 seconds
99.9%	8.76 hours	2.16 hours	43.2 minutes	10.1 minutes	1.44 minutes	3.6 seconds
99.95%	4.38 hours	1.08 hours	21.6 minutes	5.04 minutes	43.2 seconds	1.8 seconds
99.99%	52.6 minutes	12.96 minutes	4.32 minutes	60.5 seconds	8.64 seconds	0.36 seconds
99.999%	5.26 minutes	1.30 minutes	25.9 seconds	6.05 seconds	0.87 seconds	0.04 seconds
3. The target should be just high enough to meet the need, not far above it: set too low it is meaningless, set too high it wastes machine resources
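
The unavailability windows in the table are plain arithmetic: (100% - availability) multiplied by the length of the period. The sketch below assumes a 365-day year, a 90-day quarter, and a 30-day month, which is what the table's figures imply. The names `PERIOD_SECONDS` and `allowed_downtime_seconds` are my own.

```python
# Period lengths in seconds (assumed: 365-day year, 90-day quarter, 30-day month).
PERIOD_SECONDS = {
    "year": 365 * 24 * 3600,
    "quarter": 90 * 24 * 3600,
    "month": 30 * 24 * 3600,
    "week": 7 * 24 * 3600,
    "day": 24 * 3600,
    "hour": 3600,
}


def allowed_downtime_seconds(availability_percent: float, period: str) -> float:
    """Seconds of downtime permitted in one period at the given availability."""
    return (100 - availability_percent) / 100 * PERIOD_SECONDS[period]


# 99.9% availability allows about 43.2 minutes of downtime per 30-day month.
print(round(allowed_downtime_seconds(99.9, "month") / 60, 1), "minutes per month")
```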

SLA: Service Level Agreement, an agreement between a service and its users that describes the consequences of meeting or missing the SLO (business-related)
- Alibaba Cloud ECS SLA
- AWS SLA

Monitoring

Implementation

Collect, process, summarize, and display real-time quantitative data about a system.

Taking OpenTSDB as an example, the data consists of time-series data points plus tags.

For example, if the metric name is app.home.page and the time range is 2020.11.12 00:00:00 to 2020.11.12 16:00:00, the query result is:

{
    "dps": {
        "1605110400": 1.1,
        "1605110430": 1.9,
        "1605110460": 1.0,
        "1605110490": 1.3,
        "1605110520": 1.4
    },
    "tags": {
        "method": "home"
    }
}

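As a rough illustration of how such a result can be consumed, the sketch below parses the sample with Python's standard library. The structure is copied from the sample above rather than from the OpenTSDB API reference, so treat the field layout as an assumption. The keys of dps are Unix timestamps in seconds.

```python
import json
from datetime import datetime, timezone

# The sample query result from above, as a JSON string.
raw = """
{
    "dps": {
        "1605110400": 1.1,
        "1605110430": 1.9,
        "1605110460": 1.0,
        "1605110490": 1.3,
        "1605110520": 1.4
    },
    "tags": {"method": "home"}
}
"""

result = json.loads(raw)

# Turn the dps mapping into (timestamp, value) pairs ordered by time.
points = sorted((int(ts), value) for ts, value in result["dps"].items())

for ts, value in points:
    print(datetime.fromtimestamp(ts, tz=timezone.utc).isoformat(), value)

interval = points[1][0] - points[0][0]                # seconds between samples
average = sum(v for _, v in points) / len(points)     # mean over the window
print(f"tags={result['tags']}, interval={interval}s, average={average:.2f}")
```

In this sample the data points are 30 seconds apart, and the tags identify which series the points belong to.
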
Purpose

Analyze long-term trends, such as database capacity
Compare data, for example whether latency or error rate rose after a release
Alert: set reasonable alerts and reduce the false-alarm rate

Alerting

Definition: evaluate the monitoring data within the alert time window (at the alert's evaluation frequency) to produce a true/false result.
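
The definition above can be sketched as a small function that looks only at the points inside the window and returns true or false. The rule shape (mean over the window compared with a threshold) and the names `AlertRule` and `evaluate` are assumptions for illustration, not any particular alerting system's API.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    window_seconds: int   # how far back the rule looks
    threshold: float      # fire when the window mean exceeds this value


def evaluate(rule: AlertRule, points, now: int) -> bool:
    """points is a list of (unix_timestamp, value); returns True if the alert fires."""
    window = [v for ts, v in points if now - rule.window_seconds < ts <= now]
    if not window:
        return False  # no data in the window; this sketch treats that as "not firing"
    return sum(window) / len(window) > rule.threshold


# Hypothetical 5XX error-rate samples, one every 30 seconds.
error_rate = [(0, 0.01), (30, 0.02), (60, 0.08), (90, 0.10)]
rule = AlertRule(name="5xx_rate_high", window_seconds=60, threshold=0.05)

print(evaluate(rule, error_rate, now=30))  # window mean 0.015 -> False
print(evaluate(rule, error_rate, now=90))  # window mean 0.09  -> True
```

In a real system this function would run once per evaluation interval, which is the "alert operating frequency" in the definition.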

Alert rule guidelines:

Alert only on what you really care about, with an appropriate threshold. A threshold that is too lax makes the alert meaningless; one that is too strict fires so often that people become desensitized and miss real problems.
Alerts should be actionable: receiving one should call for some action right away. If that action is mechanical, automate it as a pipeline.
Aggregate repeated alerts.
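
One simple way to aggregate repeated alerts is to group firings of the same rule that fall in the same time bucket and send a single notification with a count. This is a sketch of that one approach; the 5-minute bucket and the function name `aggregate_alerts` are assumptions.

```python
from collections import defaultdict


def aggregate_alerts(alerts, bucket_seconds=300):
    """Group (unix_timestamp, rule_name) firings by rule name and time bucket."""
    groups = defaultdict(list)
    for ts, name in alerts:
        groups[(ts // bucket_seconds, name)].append(ts)
    return [
        {"name": name, "first_seen": min(stamps), "count": len(stamps)}
        for (_, name), stamps in sorted(groups.items())
    ]


# Hypothetical firings: the same rule fires three times within five minutes.
firings = [
    (10, "5xx_rate_high"),
    (40, "5xx_rate_high"),
    (70, "5xx_rate_high"),
    (80, "pv_drop"),
    (400, "5xx_rate_high"),
]

for summary in aggregate_alerts(firings):
    print(summary)
```

Here five raw firings collapse into three notifications, and the repeated rule shows up once per bucket with its count.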

Links

OpenTSDB doc: opentsdb.net/docs/build/…

Author: CAI Yupeng
