A Tour of Site Reliability Engineering

After being practiced and promoted by Google, SRE has been adopted by many Internet companies. If you want to practice SRE and become a site reliability engineer, what should you know? This article introduces SRE-related technologies and provides many useful resources; readers interested in this direction can use it as a technical development roadmap. Translated from: A Journey To The Site Reliability Engineering [1]

Many organizations have adopted Site Reliability Engineering (SRE) practices in place of traditional operations and maintenance. A recent job search on LinkedIn shows more than 190,000 SRE positions open worldwide.

If you’re not familiar with SRE, here is how Google describes it:

SRE is what happens when you ask a software engineer to design an operations team.

SRE is defined by seven important principles:

Operations is a software problem.

Manage by Service Level Objectives (SLOs).

Work to minimize toil.

Automate this year’s job away.

Move fast by reducing the cost of failure.

Share ownership with developers.

Use the same tooling, regardless of function or job title.

SRE is a great career path for anyone with a background in operations support, systems administration, infrastructure, or DevOps engineering.

In this article, I’ll provide resources to help you get started as an SRE.

Mastering the Art of Service Level Objectives (SLOs)

For a smooth journey, it is necessary to start by understanding the concepts of **Service Level Indicators (SLIs) and Service Level Objectives (SLOs)**.

SLI: a quantifiable measure of service reliability
SLO: a reliability target set for an SLI

There are many resources available on SLIs and SLOs, but I recommend The Art of SLOs workshop [2] for gaining an in-depth understanding of these concepts.

If you are part of an organization that is trying to adopt SRE practices, I recommend running this workshop for aspiring SREs within the organization.

The workshop introduces how to measure and manage service reliability through **SLOs and error budgets** in a data-driven, objective, and user-centric manner.
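To make the relationship between an SLI, an SLO, and an error budget concrete, here is a minimal Python sketch. It assumes a request-based availability SLI (good events divided by total events); the function name, the field names, and the traffic numbers are hypothetical and are not taken from the workshop.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float              # measured fraction of good events
    slo: float              # target fraction of good events
    budget_events: float    # number of bad events the SLO tolerates in this window
    budget_consumed: float  # fraction of the error budget already spent


def error_budget_report(good_events: int, total_events: int, slo: float) -> SloReport:
    """Compute an availability SLI and the share of the error budget consumed."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    sli = good_events / total_events
    # The error budget is whatever the SLO leaves over: (1 - SLO) of all events.
    budget_events = (1.0 - slo) * total_events
    bad_events = total_events - good_events
    return SloReport(sli, slo, budget_events, bad_events / budget_events)


# Hypothetical window: one million requests, 500 of them failed, 99.9% target.
report = error_budget_report(good_events=999_500, total_events=1_000_000, slo=0.999)
print(f"SLI={report.sli:.4%}, error budget consumed={report.budget_consumed:.0%}")
```

With a 99.9% target, one million requests leave room for about 1,000 failures, so 500 failures use roughly half of the budget for that window.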

The workshop guides us in choosing the right SLIs and provides practical experience in defining SLIs and SLOs through case studies.

Keep an open mind and a fresh perspective as you learn. I’ve seen many people assume that SLIs and SLOs are the same as the infrastructure monitoring they already do with their APM tools, but they are not!

Cloud Expertise

According to a Gartner report [3], more than 75% of enterprises have a cloud-first strategy.

Source – https://www.gartner.com/en/information-technology/insights/cloud-strategy

Therefore, familiarity with cloud services such as AWS, GCP, and Azure is essential.

Many organizations are actively using cloud technologies to modernize their applications, and SREs are asked to play an important role in this transformation.

Many online learning sites, such as Udemy, Pluralsight, Coursera, and A Cloud Guru, can help us build this knowledge.

Infrastructure as Code (IaC)

As organizations migrate workloads to the cloud, the need to manage infrastructure efficiently and dynamically becomes even more acute. Therefore, an SRE should be proficient with IaC tools such as:

Terraform
Ansible
Chef

Although every cloud provider offers its own SDK and CLI for managing its services, there are still many benefits to using IaC tools.

The following is quoted from Quickly Deploy Applications Using Terraform With Kubernetes on GCP [4]:

Terraform’s ability to show the difference between the current state and the expected state means that once we edit the Terraform configuration file, we can see the changes that will be made.

Terraform is not only responsible for initial deployment, but also for maintenance. We can easily create, update, and delete tracked resources using commands.

Cleaning up everything Terraform builds is pretty easy. If we use scripts, we also have to write a cleanup script. With Terraform, you can simply use the “terraform destroy” command.

Terraform can check the order of actions declared in a configuration file. This means that if we want to run a Kubernetes-based service or deployment, Terraform will still create the cluster first, even if we incorrectly declare the order of operations.
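Two of the ideas quoted above, previewing the difference between current and desired state and applying resources in dependency order, can be illustrated with a small Python sketch. This is a toy model and not how Terraform is implemented; the resource names, the `plan` and `apply_order` helpers, and the dictionaries are all hypothetical. It relies on `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter


def plan(current: dict, desired: dict) -> dict:
    """Diff two resource maps into create/update/delete actions, like a dry run."""
    return {
        "create": sorted(desired.keys() - current.keys()),
        "update": sorted(k for k in desired.keys() & current.keys()
                         if desired[k] != current[k]),
        "delete": sorted(current.keys() - desired.keys()),
    }


def apply_order(depends_on: dict) -> list:
    """Order resources so that each one comes after everything it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical configuration: the service is declared before the cluster it needs.
depends_on = {
    "k8s_service": {"k8s_deployment"},
    "k8s_deployment": {"gke_cluster"},
    "gke_cluster": set(),
}
current = {"gke_cluster": {"nodes": 3}, "old_bucket": {}}
desired = {
    "gke_cluster": {"nodes": 5},
    "k8s_deployment": {"replicas": 2},
    "k8s_service": {"port": 80},
}

print(plan(current, desired))   # what would change, before anything is applied
print(apply_order(depends_on))  # the cluster comes first regardless of declaration order
```

In Terraform itself, the preview corresponds to running `terraform plan` before `terraform apply`.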

You can check out the following links for more information on this topic.

learn.hashicorp.com/terraform
www.ansible.com/resources/g…

Containers & Container Orchestration Platforms

Because SREs play a key role in application deployment, it is important to understand containers and container orchestration platforms.

Many organizations use Docker and Kubernetes platforms for service deployment, and there are plenty of resources available online on this topic.

Here are some links to get started:

www.docker.com/101-tutoria…
kubernetes.io/training/

Continuous Integration & Continuous Deployment (CI/CD)

SREs need to automate as much work as possible, and giving an application a proper CI/CD pipeline is an important part of fast delivery. Many organizations use platforms such as:

GitLab
GitHub
Azure DevOps
Jenkins

Therefore, expertise in building CI/CD pipelines is an essential skill. Many of these platforms offer free services, so you can teach yourself without spending a penny.

Here are some learning resources:

about.gitlab.com/learn/
lab.github.com/
azure.microsoft.com/en-us/overv…

Release Strategies

Source – https://sre.google/workbook/canarying-releases/

As part of the SRE role, we need to constantly deploy new features for users. While doing so, we also need to make sure that deployments do not burn through the error budget, so familiarize yourself with release strategies such as:

Canary release [5]
Blue-green deployment [6]

Familiarity with feature flag [7] development strategies is an advantage. If you use a container orchestration platform like Kubernetes, you can use Kubernetes manifest files to describe these strategies [8].
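As a rough illustration of the feature flag idea, the following Python sketch enables a flag for a fixed percentage of users. It assumes a simple hash-based bucketing scheme; the flag name, the user IDs, and both helper functions are hypothetical and are not tied to any particular feature flag product.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Enable the flag for roughly rollout_percent of users, stably per user."""
    return bucket(flag, user_id) < rollout_percent


# Hypothetical rollout: expose a new code path to about 20% of users.
users = [f"user-{i}" for i in range(1000)]
enabled = sum(is_enabled("new-checkout", user, 20) for user in users)
print(f"{enabled} of {len(users)} users see the new code path")
```

Because the bucket depends only on the flag name and the user ID, a user who sees the feature at 20% keeps seeing it as the percentage grows, and setting the percentage to 0 turns the feature off without a redeploy.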

The canary release process is described in depth in Google’s SRE Workbook [9].
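The core decision in a canary release is whether the canary behaves at least as well as the version it is meant to replace. The Python sketch below shows one simple way to frame that decision by comparing error rates; it is an assumption-laden illustration, not the procedure from the workbook, and the function name, the thresholds, and the request counts are hypothetical.

```python
def canary_verdict(baseline_errors: int, baseline_total: int,
                   canary_errors: int, canary_total: int,
                   max_ratio: float = 1.5, min_requests: int = 100) -> str:
    """Return 'wait', 'promote', or 'rollback' based on relative error rates."""
    if canary_total < min_requests:
        return "wait"  # not enough canary traffic to judge yet
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    # Allow the canary some slack over the baseline, with a small absolute floor
    # so that a baseline with zero errors does not make every error fatal.
    allowed_rate = max(baseline_rate * max_ratio, 0.001)
    return "promote" if canary_rate <= allowed_rate else "rollback"


# Hypothetical numbers: the baseline serves most traffic, the canary a small slice.
print(canary_verdict(50, 10_000, 1, 50))    # too little canary traffic so far
print(canary_verdict(50, 10_000, 3, 500))   # canary error rate close to baseline
print(canary_verdict(50, 10_000, 10, 500))  # canary error rate clearly worse
```

Keeping the canary's share of traffic small limits how much of the error budget a bad release can consume before it is rolled back.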

Incident Response & Blameless Postmortems

Being on call is another important SRE responsibility, so SREs need a very good understanding of the incident response process.

The PagerDuty Incident Response course [10] covers topics such as:

What is an incident?
Incident severity levels
The various roles in incident management
Incident call etiquette

It is important to document the response process, because people handle emergencies better when they know what to do once an incident occurs.

PagerDuty also has another course on how to foster a blameless culture in SRE teams [11], which provides some very detailed templates for running blameless postmortems.

These two courses are highly recommended.

Security

Because SRE is responsible for the entire application, it is always good to have a basic understanding of application security.

It is strongly recommended that you become familiar with the following concepts:

OWASP Top 10 [12]
Application Threat Modelling [13]

For automated deployment, SREs need to manage various service credentials, so you should be familiar with credential management tools such as HashiCorp Vault [14] or cloud-native secret management solutions such as Azure Key Vault, Google Secret Manager, etc.

Documentation

SREs need to ensure that all important documents are updated regularly and easy to follow, so they should focus on producing high-quality documents such as:

Operational runbooks
Release and rollback documents

Google offers free technical writing courses [15]. I recommend studying them and applying the principles in your daily work; you can also sign up for a class if you have the time.

I’ve also written about technical writing best practices for engineers in “Best Practices When Documenting Your Code for Software Engineers” [16].

Disaster Recovery Testing/Chaos Engineering

To test the robustness of the platform, SREs are also responsible for performing disaster recovery testing. Google runs disaster recovery tests as part of keeping its services robust, and “Weathering the Unexpected” [17] is a detailed article about Google’s DiRT (Disaster Recovery Testing) program.

More recently, the concept of Chaos Engineering popularized by Netflix has become very popular. I have also written about it in “Why Every Software Developer Needs to Learn Chaos Engineering” [18].
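The basic mechanic behind a chaos experiment, deliberately injecting failures and checking that the system still copes, can be sketched in a few lines of Python. This is a toy, in-process illustration rather than how Netflix's tooling works; `with_chaos`, `call_with_retry`, `fetch_profile`, and the 30% failure rate are all hypothetical.

```python
import random


class InjectedFault(Exception):
    """Raised instead of calling the wrapped function, to simulate an outage."""


def with_chaos(func, failure_rate: float, rng: random.Random):
    """Wrap func so that a fraction of calls fail with InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts: int):
    """Call func up to `attempts` times, retrying only on injected faults."""
    for attempt in range(attempts):
        try:
            return func()
        except InjectedFault:
            if attempt == attempts - 1:
                raise  # out of retries: let the failure surface


def fetch_profile():
    return {"user": "demo"}  # stand-in for a real downstream call


# Experiment: does a three-attempt retry policy absorb a 30% failure rate?
flaky = with_chaos(fetch_profile, failure_rate=0.3, rng=random.Random(42))
successes = 0
for _ in range(1000):
    try:
        call_with_retry(flaky, attempts=3)
        successes += 1
    except InjectedFault:
        pass
print(f"{successes}/1000 calls survived a 30% injected failure rate")
```

The point of such an experiment is to state a hypothesis up front (here, that retries hide most injected failures) and then check it against what actually happens.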

Non-Abstract Large System Design (NALSD)

For large, complex, distributed systems, Google has designed a process [19] that helps SREs develop the ability to evaluate, design, and measure large systems.

The NALSD process includes problem statements, requirements gathering, and iterative system design to help assess the tolerance of large-scale systems to different failure modes.
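Much of this kind of design work comes down to turning requirements into concrete numbers. The Python sketch below shows a back-of-the-envelope capacity estimate for a message queue; it only illustrates the style of reasoning and is not taken from Google's material. Every figure (traffic, message size, retention, disk size, spare count) and every function name is a hypothetical assumption.

```python
import math


def machines_for_load(peak_qps: int, qps_per_machine: int) -> int:
    """Machines needed to serve peak traffic."""
    return math.ceil(peak_qps / qps_per_machine)


def machines_for_storage(messages_per_day: int, bytes_per_message: int,
                         retention_days: int, replicas: int,
                         disk_bytes_per_machine: int) -> int:
    """Machines needed to hold all retained, replicated messages."""
    total_bytes = messages_per_day * bytes_per_message * retention_days * replicas
    return math.ceil(total_bytes / disk_bytes_per_machine)


def fleet_size(load_machines: int, storage_machines: int, spare_machines: int) -> int:
    """Size for the tighter constraint, plus spares to tolerate machine failures."""
    return max(load_machines, storage_machines) + spare_machines


# Hypothetical requirements for a distributed message queue.
load = machines_for_load(peak_qps=50_000, qps_per_machine=4_000)
storage = machines_for_storage(
    messages_per_day=1_000_000_000,
    bytes_per_message=1_000,
    retention_days=7,
    replicas=3,
    disk_bytes_per_machine=2_000_000_000_000,
)
print(f"load: {load}, storage: {storage}, fleet: {fleet_size(load, storage, 2)}")
```

Changing a single requirement, such as the retention period, and rerunning the numbers shows which resource becomes the bottleneck, which is the kind of iteration the process encourages.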

Google also provides a workshop that walks through the system design of a distributed message queue (such as Pub/Sub) and explains how to apply the NALSD principles to it.

I personally learned a lot from it.

Community

To learn from others and keep up with the latest developments in the industry, I suggest joining the following online communities:

www.reddit.com/r/sre/
LinkedIn – School of SRE – www.linkedin.com/groups/1249…

Conclusion

Overall, the SRE engineering process is very interesting and is being adopted by many organizations.

References:
[1] A Journey To The Site Reliability Engineering: deshpandetanmay.medium.com/a-journey-t…
[2] The Art of SLOs: sre.google/resources/p…
[3] The Latest Cloud Computing Technology and Security: www.gartner.com/en/informat…
[4] Quickly Deploy Applications Using Terraform With Kubernetes on GCP: medium.com/google-clou…
[5] Canary Release: martinfowler.com/bliki/Canar…
[6] Blue Green Deployment: martinfowler.com/bliki/BlueG…
[7] Feature Toggles: martinfowler.com/articles/fe…
[8] Kubernetes Deployment: kubernetes.io/docs/concep…
[9] Canarying Releases: sre.google/workbook/ca…
[10] PagerDuty Incident Response: response.pagerduty.com/
[11] PagerDuty Postmortems: postmortems.pagerduty.com/culture/bla…
[12] OWASP Top 10: owasp.org/www-project…
[13] Application Threat Modelling: deshpandetanmay.medium.com/threat-mode…
[14] Vault: www.vaultproject.io/
[15] Technical Writing Courses for Engineers: developers.google.com/tech-writin…
[16] Best Practices When Documenting Your Code for Software Engineers: betterprogramming.pub/best-practi…
[17] Weathering the Unexpected: queue.acm.org/detail.cfm?…
[18] Why Every Software Developer Needs to Learn Chaos Engineering: betterprogramming.pub/why-every-s…
[19] Introducing Non-abstract Large System Design: sre.google/workbook/no…

Hello, my name is Yu Fan. I used to work in R&D at Motorola and now do technical work at Mavenir. I have always been interested in communications, networking, back-end architecture, cloud native, DevOps, CI/CD, blockchain, AI, and other technologies. My official WeChat account is DeepNoMind.