A friend and I recently talked about where my career is heading, and DevOps and SRE came up a lot. I confused the two concepts myself when I first encountered them a few years ago, and there was no one to answer my questions. Both are all the rage now, but I noticed my friend had some misconceptions about the two positions. So I shared my thoughts and compiled them into this article.

The most common misconceptions:

  • DevOps is a new concept and very advanced
  • SRE is an advanced form of DevOps
  • Operations engineers can easily become DevOps engineers

Let me explain them all to you.


DevOps and SRE definitions

DevOps is literally a combination of Dev and Ops. The strict definition (via DevOps – Wikipedia):

DevOps (a portmanteau of Development and Operations) is a culture, movement, or practice that values communication and cooperation between “software developers” and “IT Operations technicians”.

SRE, Site Reliability Engineering, was first proposed by Google and developed further in its engineering practice. Google also published a book of the same name, “Site Reliability Engineering,” which spread the idea among Internet engineers.

The definition of SRE (via Site Reliability Engineering – Wikipedia):

Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies that to operations whose goals are to create ultra-scalable and highly reliable software systems.

In my own words:

Site reliability engineers are software engineers who are committed to building highly scalable, highly available systems and to putting those principles into practice.

By definition, DevOps is a culture, a movement, and a practice, while SRE is a position with strict hiring requirements. A culture is a soft definition that leaves plenty of room for concepts to be invented around it, whereas the definition of SRE is precise and leaves less room for imagination (it may also be that the bar for SRE is high 😄). According to Google, SRE engineers practice the DevOps culture. That is true, but in China DevOps is increasingly being split out into a standalone DevOps engineer role, so in this article I will focus on comparing DevOps engineers with SRE engineers.

Background and history of both

The needs of the Internet industry gave rise to DevOps. In the most traditional software companies there is only Dev and no Ops, and whatever Ops exists may be little more than technical support staff. Development follows the waterfall model: requirements analysis, system design, development, testing, delivery, and operation. A traditional software release is a heavyweight operation, and once the software is released, Dev is almost never directly involved in running it. People born in the 1980s may remember that QQ shipped one major version a year: QQ 2000/2003/2004 and so on. In that era Ops had no direct, frequent contact with Dev, and some purely offline businesses had no Ops at all.

After the Internet wave, software evolved from traditional desktop software to web and mobile applications. The core business logic, such as transactions and social interactions, no longer runs on the user’s desktop but on back-end servers. This gives Internet companies a great deal of room for manoeuvre: business logic can be changed at any time, which enables rapid, iterative business change. Even so, Dev and Ops remained sharply divided: Ops did not care how the code worked, and Dev did not know how the code ran on the server.

While the industry was still basking in the glow of weekly releases, Flickr shocked it in 2009 by introducing the concept of 10+ deploys per day. Flickr put forward a few core ideas:

  • The business is growing fast, so embrace change and move in small steps
  • The goal of Ops is not to keep the site stable and fast, but to enable the business to move fast
  • Tighter Dev/Ops collaboration built on automation tools: code version control, monitoring
  • Effective communication: IRC/IM robots (Flickr was already using today’s ChatBot tricks 10 years ago)
  • A communication culture of trust, transparency, efficiency, and mutual help

The slides are on SlideShare: 10+ Deploys Per Day: Dev and Ops Cooperation at Flickr

It’s hard to believe that the DevOps concepts touted today by training companies and some of the biggest names in the industry were already laid out in a 2009 slide deck. Classics never go out of date; they still shine with wisdom under the dust. Some people equate DevOps with operations automation, but that only scratches the surface. The goal of DevOps is to speed up the delivery of business systems, and to provide the tools, processes, and services that make this possible. Some individuals and training companies embellish DevOps and read extra meanings into it.

Now for the history of SRE. SRE became widely known a little later, although it started earlier: in 2003, Google’s Ben Treynor recruited a team of software engineers to make Google’s production services more stable, robust, and reliable. Unlike small and medium-sized companies, Google serves more than a billion users, and even brief service unavailability can be fatal. Google was ahead of its time here, and SRE was born.

This position is aimed at very large clusters; small teams have no need for it (and probably could not hire a real SRE anyway 😊). After years of exploration inside Google, the SRE team began posting what they had learned online, and published the book in 2016.

Different functions

DevOps is a culture, so it does not prescribe a specific job function. Many companies nevertheless split the DevOps function out and call the people who do it DevOps engineers. Let’s look at what DevOps engineers care about: the DevOps culture is about delivery speed, so DevOps engineers naturally care about the entire software/service life cycle.

A simple formula: speed = total / time. In engineering terms: delivery speed = ((features * engineering quality) / delivery time) * delivery risk.
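To make the formula concrete, here is a minimal Python sketch. The function name `delivery_speed`, the units (features per day), and the reading of "delivery risk" as a multiplier between 0 and 1 are assumptions made for illustration; the formula itself does not specify them.

```python
def delivery_speed(features, engineering_quality, delivery_time_days, risk_factor):
    """Delivery speed = ((features * engineering quality) / delivery time) * delivery risk.

    Assumed units (not given by the formula itself): features is a count,
    engineering_quality and risk_factor are multipliers in [0, 1],
    and delivery_time_days is a positive number of days.
    """
    if delivery_time_days <= 0:
        raise ValueError("delivery_time_days must be positive")
    return (features * engineering_quality) / delivery_time_days * risk_factor


# Same feature count in both cases; only quality, time, and risk differ.
before = delivery_speed(features=10, engineering_quality=0.8, delivery_time_days=10, risk_factor=0.5)
after = delivery_speed(features=10, engineering_quality=0.9, delivery_time_days=5, risk_factor=0.9)
print(f"before: {before:.2f} features/day, after: {after:.2f} features/day")
```

With the feature count held constant, the result changes only through quality, time, and risk, which are the factors a DevOps engineer can influence.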

Features are left to product and project managers, which leaves DevOps engineers with the remaining factors: engineering quality, delivery time, and delivery risk. DevOps engineers perform the following functions:

  • Manage the full application lifecycle (requirements, design, development, QA, release, operation)
  • Focus on improving the efficiency of the whole process: find bottlenecks and remove them
  • Design and develop automated O&M platforms (standardization, automation, platformization)
  • Support O&M systems, including virtualization, resource management, monitoring, and networking

The SRE keywords are “high scalability” and “high availability”. High scalability means that when the number of users grows rapidly, the application and its supporting services (server resources, network systems, database resources) can be expanded by adding instances, without changing the system architecture or upgrading the machines. High availability means that when any link in the application architecture becomes unavailable, such as an application service, a gateway, or a database, the whole system can recover within a predictable period of time. Since this is “high” availability, that period is generally expected to be on the order of minutes. SRE functions can be summarized as follows:

  • Selection, design, development, capacity planning, tuning, and troubleshooting for applications, middleware, infrastructure, etc.
  • Bring availability and scalability considerations to business systems, and take part in their design and implementation
  • Locate, handle, and manage faults, and optimize the components that caused them
  • Improve the resource utilization of each component
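The minute-level recovery expectation for high availability can be turned into numbers. The sketch below converts an availability target into the downtime it allows over a period; the function name and the 30-day period are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Return the downtime (in minutes) permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows {minutes:.2f} minutes of downtime")
```

At 99.99%, a 30-day period leaves only a little over four minutes of downtime in total, so recovery has to be fast and largely automated.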

Different job content

Different functions lead to different day-to-day work for the two positions. I list the work of DevOps engineers and SRE engineers below:

  • DevOps
    • Define the application lifecycle management process and drive process change
    • Develop and manage the development platform used by development and QA engineers
    • Develop and manage the release system
    • Develop, select, and manage monitoring and alerting systems
    • Develop and manage the permission system
    • Develop, select, and manage the CMDB
    • Change management
    • Failure management
  • SRE
    • Change management
    • Failure management
    • Define SLA service standards
    • Develop, select, and manage all kinds of middleware
    • Develop and manage distributed monitoring systems
    • Develop and manage distributed tracing systems
    • Develop and manage performance monitoring and profiling systems (DTrace, flame graphs)
    • Develop and select performance tuning tools, and train people to use them

It is an interesting comparison: both DevOps and SRE care about the application life cycle, especially changes and failures within it. But DevOps mainly serves the development pipeline. A DevOps team typically provides a tool chain that includes development tools, version management tools, CI (continuous integration) tools, CD (continuous delivery) tools, alerting tools, and troubleshooting tools. An SRE team, on the other hand, focuses on change, failure, performance, and capacity issues, which are tied to specific businesses. Its tool chain includes capacity measurement tools, logging tools, call-chain tracing tools, metrics tools for performance measurement, and monitoring and alerting tools.

The relationship between DevOps and SRE

DevOps is first a culture and only then a job, whereas SRE has clearly been a position from the beginning. Many people confuse DevOps with SRE because the two seem to share the same tooling and automation requirements. Some developers even reduce this kind of operations work to: servers + tools + automation. That is like the blind men and the elephant: they touch one part and mistake it for the whole.

In terms of skills, both require strong operations skills. In terms of career ceiling, DevOps may lack SRE’s expertise in some areas: computer architecture; high-throughput, high-concurrency optimization; scalable system design; complex system design; and troubleshooting of business systems. Both require soft skills, but SRE work is more complex, more challenging, and more demanding:

  • Ability to analyze and solve problems
  • Determination to overcome difficulties
  • Enthusiasm for challenges
  • Ability to learn and to connect what has been learned

DevOps is universal. Every modern Internet company needs DevOps, but not every team needs high availability and high scalability, and those teams do not need SRE. DevOps engineers have the opportunity to become SRE engineers once they have mastered the relevant skills. I believe a qualified SRE engineer, given the choice, would not switch to being a DevOps engineer.

In terms of professional background, both DevOps and SRE engineers need an R&D background: the former to develop tool chains, the latter because the job demands strong architectural design experience. Operations engineers who want to move into DevOps or SRE need to build up the corresponding technical skills first. After all, you can’t call yourself a DevOps/SRE engineer just because you have set up Jenkins + Kubernetes.

So, what about those common misconceptions? I hope they are all clear to you by now. Finally, I attach skill maps for the two kinds of engineers. I hope readers who are interested in becoming either one will work hard at it.

Appendix: Skill points

DevOps:

  • Ops skills
    • Linux basics
      • Basic command-line operations
      • Linux Filesystem Hierarchy Standard (FHS)
      • Linux systems (differences, history, standards, development)
    • Scripting
      • Bash / Python
    • Basic services
      • DHCP / NTP / DNS / SSH / iptables / LDAP / CMDB
    • Automation tools
      • Fabric / Saltstack / Chef / Ansible
    • Basic monitoring tools
      • Zabbix / Nagios / Cacti
    • Virtualization
      • KVM / Xen / vSphere management, Docker
      • Container orchestration: Mesos / Kubernetes
    • Services
      • Load balancing: Nginx / F5 / HAProxy / LVS
      • Operations (start, stop, restart, scale out)
  • Dev
    • Languages
      • Python
      • Go (optional)
      • Java (understand deployment)
    • Process and theory
      • Application life cycle
      • 12 Factor
      • Microservices: concepts, deployment, life cycle
      • CI (continuous integration): Jenkins / Pipeline / Git repo webhooks
      • CD (continuous delivery): release systems
    • Infrastructure
      • Git Repo / Gitlab / Github
      • Log collection: Logstash / Flume
      • Configuration file management (applications, middleware, etc.)
      • Package and dependency management: Nexus / JFrog / PyPI
      • Development environment management systems for Dev/QA
      • Online permission assignment system
      • Monitoring and alerting system
      • Tool development based on automation tools such as Fabric / Saltstack / Chef / Ansible

SRE:

  • Language and engineering implementation
    • Deep understanding of a development language (Java, for example)
      • The development frameworks the business uses
      • Concurrency, multithreading, and locking
      • Understanding of resource models: network, memory, CPU
      • Troubleshooting ability (analyze bottlenecks, know the relevant tools, reconstruct what happened, and propose solutions)
    • Common business design scenarios and pitfalls (e.g., business modeling, N+1 queries, remote calls, poorly designed DB schemas)
    • MySQL/Mongo OLTP query optimization
    • Multiple concurrency models and the scalable designs associated with them
  • Troubleshooting tools
    • Capacity management
    • Call-chain tracing
    • Metrics tools
    • Logging systems
  • Operations architecture skills
    • Proficiency in Linux; understanding of the Linux load model and resource model
    • Familiarity with common middleware (MySQL, Nginx, Redis, Mongo, ZooKeeper, etc.) and the ability to tune it
    • Linux network tuning; network I/O models and how they are implemented in the language
    • Resource orchestration systems (Mesos / Kubernetes)
  • Theory
    • Capacity planning
    • Familiarity with distributed systems theory (Paxos / Raft / BigTable / MapReduce / Spanner, etc.) and the ability to choose a suitable solution for a given scenario
    • Performance models (e.g. understanding Pxx percentiles, Metrics, Dapper)
    • Resource models (e.g. queuing theory, load scenarios, avalanche failures)
    • Resource orchestration systems (Mesos / Kubernetes)
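
As an illustration of the "Pxx" item in the theory list above: P99 latency is the value that 99% of requests stay at or below. Below is a minimal sketch using the nearest-rank method. The function name and the latency samples are hypothetical, made up for this example.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # Rank is 1-based; p * n / 100 is exact when the result is a whole number.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with two slow outliers.
latencies_ms = [12, 15, 11, 250, 14, 13, 16, 18, 17, 900]

for p in (50, 90, 99):
    print(f"P{p} = {percentile(latencies_ms, p)} ms")
```

In this sample the median looks healthy while P90 and P99 expose the slow outliers, which is one reason tail percentiles are often preferred over averages when discussing service levels.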

Ref

  • DevOps – Wikipedia, the free encyclopedia
  • Site reliability engineering – Wikipedia
  • StuQ Skill map
  • The Twelve-Factor App (Simplified Chinese)
  • Google – Site Reliability Engineering
  • What’s the Difference Between DevOps and SRE? – YouTube

