From slash-and-burn to normalization

  1. [Stage 1] My first internship was at a small start-up with only a few people at the time, and the whole development process was very rough:
    a. Compile and package locally after development
    b. Connect to the server over FTP and upload the build artifact to the cloud host
  2. [Stage 2] My first internship project at ByteDance:
    a. Test environment = local development environment
    b. Changes could go online at any time, and code did not need review
    c. Problems on a live page were basically discovered along the chain: user -> operations -> product -> R&D/testing

For example, in one incident the front-end build was injected into the container as a data block through a cloud-server feature. That feature was unavailable for a period of time, so the entire page returned 504. Because there was no monitoring, signals such as a rising 5XX error rate or a drop in page PV were not noticed right away. In the end the problem was discovered like this: operations reported it to product, and product passed it on to R&D.

  3. [Stage 3] Established a standard development and release process and began to build awareness of service stability:
  • Built local, offline and pre-release environments, and separated offline data from online data
  • Code review, and restricted release permissions
  • Configured monitoring and alerting
  • ...

The definition of SRE

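The book in question is "SRE: Google Operations Decrypted", the Chinese edition of "Site Reliability Engineering: How Google Runs Production Systems". So what is SRE?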

SRE stands for Site Reliability Engineering. The book gives no single clear definition. In my personal understanding, SRE is a set of methodologies: an approach to operations that Google software engineers arrived at while operating and maintaining their systems. (For example, early monitoring was done with detection scripts. It gradually evolved into a new model based on time-series data, with a rich language for manipulating that data. This is the Borgmon monitoring system.)

When I joined Google in 2003, my mission was to lead a "production environment maintenance team" of seven software engineers... That team of 7 has since grown into an SRE team of more than 1000 people, but the guiding ideas and working model of the SRE team still largely follow my original thinking.

— Chapter One Introduction

Responsibilities of the SRE team

  • Optimization
  1. Availability improvements
  2. Latency optimization
  3. Performance optimization
  4. Efficiency optimization
  • Change management

The performance of a business R&D team is largely measured by business iteration, such as how many requirements ship or how much page UV grows, while the performance of an SRE team depends on service stability. Business iteration usually affects service reliability, because most online problems are caused by releasing a new version. This creates a conflict between the business R&D team and the SRE team: should we release more, or less?

To resolve this conflict, the concept of an error budget was introduced. The error budget can be understood as a remaining balance of allowed unreliability. For example, the whole process can be:

1. Product management sets the service availability SLO for this month, say 99.9%, so the error budget is 0.1%

2. The monitoring system measures the actual service availability, say 99.95%

3. If actual availability > SLO, there is error budget left and a new version can be released

— P31 The purpose of using error budgets
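The release check described in these steps can be sketched in a few lines of Python. This is only an illustration, not the book's implementation: the function names `remaining_failures` and `can_release` and the request counts are assumptions made for the example, and availability is taken to be successful requests / total requests.

```python
def remaining_failures(slo: float, total: int, failed: int) -> float:
    """Failed requests the error budget still allows in this period.

    slo is a fraction (0.999 means 99.9%); a negative result means
    the budget is already overspent.
    """
    allowed_failures = (1.0 - slo) * total
    return allowed_failures - failed


def can_release(slo: float, total: int, failed: int) -> bool:
    """A new version may ship only while actual availability is above the SLO."""
    availability = (total - failed) / total
    return availability > slo


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 500 of them failed (99.95% available).
    print(remaining_failures(0.999, 1_000_000, 500))
    print(can_release(0.999, 1_000_000, 500))
```

With an SLO of 99.9% and one million requests, roughly 1000 failures are allowed, so a month with 500 failures still has about half of its budget left and releases can continue.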

  • Monitoring
  1. Monitoring and alerting
  2. Emergency handling
  • Resources
  1. Capacity planning and management

Basic concept

Without a detailed understanding of which behaviors of a service matter, and without measuring whether those behaviors are correct, a system cannot be operated properly, let alone reliably. Therefore, whether for an external service or an internal API, we need to set a service quality target for users and work to meet it.

— P34 Chapter 4 Service Level Objectives

  • SLI: Service Level Indicator
    • Request latency
      • Average
      • PCT50
      • PCT95
      • PCT99
    • Error rate
    • QPS
    • Resource utilization (CPU, memory, disk)
    • Throughput
    • Availability = machine uptime / total time, or successful requests / total requests

PCT99: sort a group of numbers from smallest to largest; PCT99 is the value at the 99% position.

The point of PCT99 is that some requests to a service are fast and some are slow, and taking the average directly can mask long-tail latency. In practice, latency tends to rise with QPS. For example, a service may peak in the evening and be quiet the rest of the time, so the average stays low and hides the slow requests at peak.
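To make the gap between the average and PCT99 concrete, here is a small Python sketch. The latency values are made up for illustration, and `percentile` is a hypothetical helper using the nearest-rank method, which is one common definition among several.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: sort ascending, take the value at the pct% position."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in ms: 95 fast requests and 5 slow ones (the long tail).
latencies = [20] * 95 + [900] * 5

mean_ms = statistics.mean(latencies)   # pulled only slightly upward by the tail
p50_ms = percentile(latencies, 50)     # the typical request
p99_ms = percentile(latencies, 99)     # the slow tail that the mean hides

print(mean_ms, p50_ms, p99_ms)
```

Here the mean is 64 ms and the median is 20 ms, both of which look healthy, while PCT99 is 900 ms and exposes the slow requests.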

Actual case:

[Figure: interface QPS]

[Figure: interface PCT99]

  • SLO: Service Level Objective, a target value for an SLI, for example:
    • Availability is greater than 99%
    • The average latency is less than 100ms

Setting an SLO

  1. How an SLO is set depends on the business
  2. It should not be 100%
    a. It is theoretically impossible
    b. For users there is no noticeable difference between 99.99% and 100%

Availability level   Allowed unavailability window
                     per year       per quarter     per month       per week        per day         per hour
90%                  36.5 days      9 days          3 days          16.8 hours      2.4 hours       6 minutes
95%                  18.25 days     4.5 days        1.5 days        8.4 hours       1.2 hours       3 minutes
99%                  3.65 days      21.6 hours      7.2 hours       1.68 hours      14.4 minutes    36 seconds
99.5%                1.83 days      10.8 hours      3.6 hours       50.4 minutes    7.20 minutes    18 seconds
99.9%                8.76 hours     2.16 hours      43.2 minutes    10.1 minutes    1.44 minutes    3.6 seconds
99.95%               4.38 hours     1.08 hours      21.6 minutes    5.04 minutes    43.2 seconds    1.8 seconds
99.99%               52.6 minutes   12.96 minutes   4.32 minutes    60.5 seconds    8.64 seconds    0.36 seconds
99.999%              5.26 minutes   1.30 minutes    25.9 seconds    6.05 seconds    0.87 seconds    0.04 seconds
  3. The target should be met but not exceeded by much: set too low it is meaningless, set too high it wastes machine resources
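The allowed unavailability for a given availability level is simply (1 - availability) multiplied by the length of the window. The sketch below assumes the window lengths that the table's figures imply (a 365-day year, a 90-day quarter and a 30-day month). The names `WINDOWS` and `allowed_downtime` are made up for this example.

```python
from datetime import timedelta

# Window lengths assumed to match the table: 365-day year, 90-day quarter, 30-day month.
WINDOWS = {
    "year": timedelta(days=365),
    "quarter": timedelta(days=90),
    "month": timedelta(days=30),
    "week": timedelta(days=7),
    "day": timedelta(days=1),
    "hour": timedelta(hours=1),
}


def allowed_downtime(availability_pct: float, window: str) -> timedelta:
    """Maximum unavailability in the window that still meets the availability target."""
    unavailable_fraction = (100.0 - availability_pct) / 100.0
    return WINDOWS[window] * unavailable_fraction


if __name__ == "__main__":
    for name in WINDOWS:
        print(name, allowed_downtime(99.9, name))
```

For 99.9% this gives about 43.2 minutes per month and 1.44 minutes per day, so every extra nine cuts the allowance by a factor of ten.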
  • SLA: Service Level Agreement, an agreement with the users of a service that describes the consequences of meeting or missing the SLO (business-related)
    • Alibaba Cloud ECS SLA
    • AWS SLA

Monitoring

Implementation

Collect, process, summarize, and display real-time quantitative data about a system.

Taking OpenTSDB as an example, the data model is: time-series data + tags

For example, if the metric name is app.home.page and the time range is 2020.11.12 00:00:00 to 2020.11.12 16:00:00, the query result is:

{
    "dps": {
        "1605110400": 1.1,
        "1605110430": 1.9,
        "1605110460": 1.0,
        "1605110490": 1.3,
        "1605110520": 1.4
    },
    "tags": {
        "method": "home"
    }
}
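A result of this shape is easy to work with once the `dps` map is turned into ordered (time, value) pairs. The sketch below parses the sample with Python's standard library only. `parse_points` is a hypothetical helper, and treating the timestamps as UTC+8 is an assumption based on the time range quoted in the text.

```python
import json
from datetime import datetime, timedelta, timezone

# The sample query result as a JSON string.
raw = """
{
    "dps": {
        "1605110400": 1.1,
        "1605110430": 1.9,
        "1605110460": 1.0,
        "1605110490": 1.3,
        "1605110520": 1.4
    },
    "tags": {"method": "home"}
}
"""

# Assumption: the time range in the text is UTC+8, which is what these epoch seconds map to.
UTC8 = timezone(timedelta(hours=8))


def parse_points(result: dict):
    """Turn the "dps" map into a time-ordered list of (datetime, value) pairs."""
    points = [
        (datetime.fromtimestamp(int(ts), tz=UTC8), value)
        for ts, value in result["dps"].items()
    ]
    return sorted(points)


result = json.loads(raw)
points = parse_points(result)
values = [value for _, value in points]

for when, value in points:
    print(when.isoformat(), value)
print("max:", max(values))
```

Each key in `dps` is a Unix timestamp in seconds and the points here are 30 seconds apart. The tags identify which series the points belong to.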

Purpose

  • Analyze long-term trends, such as database capacity
  • Compare data, for example whether latency or error rate is higher after a release
  • Alerting: set reasonable alerts and reduce the false-alarm rate

Alerting

Definition: evaluate the monitoring data within the alert time window (at the alert's evaluation frequency) and produce a true/false result.
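As a rough sketch of this definition, an alert rule can be modeled as a function that takes the points inside a time window and returns true or false. The name `evaluate_alert`, the use of the average as the aggregation, and the sample error rates are all assumptions for illustration. Real alerting systems offer many more options.

```python
from typing import Callable, Dict


def evaluate_alert(
    dps: Dict[int, float],
    now: int,
    window_seconds: int,
    condition: Callable[[float], bool],
) -> bool:
    """Return True if the average of the points in (now - window, now] meets the condition."""
    in_window = [v for ts, v in dps.items() if now - window_seconds < ts <= now]
    if not in_window:
        return False  # no data: treated as "not firing" in this sketch
    return condition(sum(in_window) / len(in_window))


# Hypothetical error-rate samples (percent), one every 30 seconds.
error_rate = {
    1605110400: 0.2,
    1605110430: 0.3,
    1605110460: 4.0,
    1605110490: 6.0,
    1605110520: 5.0,
}

# Rule: fire when the average error rate over the last 90 seconds exceeds 1%.
firing = evaluate_alert(
    error_rate, now=1605110520, window_seconds=90, condition=lambda avg: avg > 1.0
)
print(firing)
```

Running the same rule at each evaluation interval gives the alert's true/false state over time.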

Alert rule guidelines:

  • Set alerts, with the right threshold, on the things you really care about. If the threshold is too low, alerts fire so often that they become meaningless; if it is too high, the alert becomes insensitive and real problems are missed.
  • Alerts should be actionable: receiving one should require some action to be taken immediately. If that action is mechanical, it should be automated into a pipeline.
  • Aggregate repeated alerts.
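Aggregating repeated alerts can be as simple as grouping firings of the same rule that fall within a fixed interval of the first one, and sending one notification per group. The sketch below shows one possible approach under that assumption. `aggregate_alerts`, the rule names and the 300-second interval are hypothetical.

```python
from typing import Iterable, List, Tuple


def aggregate_alerts(
    alerts: Iterable[Tuple[int, str]], interval: int
) -> List[Tuple[int, str, int]]:
    """Collapse repeats of the same rule within `interval` seconds of its first firing.

    Takes (timestamp, rule_name) pairs and returns (first_timestamp, rule_name, count)
    tuples, i.e. one notification per group.
    """
    groups = []      # each entry is [first_ts, rule, count]
    open_group = {}  # rule name -> index of its most recent group
    for ts, rule in sorted(alerts):
        idx = open_group.get(rule)
        if idx is not None and ts - groups[idx][0] < interval:
            groups[idx][2] += 1
        else:
            open_group[rule] = len(groups)
            groups.append([ts, rule, 1])
    return [tuple(group) for group in groups]


# Hypothetical firings: the "5xx" rule repeats three times within 5 minutes, then fires again later.
fired = [(0, "5xx"), (30, "5xx"), (40, "latency"), (60, "5xx"), (400, "5xx")]
notifications = aggregate_alerts(fired, interval=300)
print(notifications)
```

Five firings become three notifications, and the count tells the recipient how many repeats were folded into each one.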

Reference links

OpenTSDB doc: opentsdb.net/docs/build/…

Author: CAI Yupeng