DevOps is a term that is increasingly familiar to developers, but few people have thought carefully about what DevOps is and why it matters to Internet companies. With that in mind, this article walks through the origins, principles, and practices of DevOps to help you figure out what it really is.

The origins of DevOps can be traced back to 2008, when it was mentioned in the Agile Infrastructure topic group at an Agile conference.

Definition of DevOps

As DevOps has evolved over the years, its definition has changed. Let’s take a look at three wiki definitions of DevOps.

  1. DevOps 2017-2020

    DevOps is a software engineering culture and set of practices designed to unify software development and software operations. A key feature of the DevOps movement is its strong advocacy of automation and monitoring at every step of building software, from integration, testing, and release to deployment and infrastructure management. The goal of DevOps is to shorten development cycles, increase deployment frequency, and deliver more reliable releases, in line with business goals.

  2. DevOps 2021

    DevOps is a set of practices that integrates software development and software operations activities. The goal is to shorten the software development lifecycle and deliver high-quality software using continuous delivery.

    It also notes:

    DevOps and Agile software development are complementary, and many aspects of DevOps derive from agile methodologies.

  3. DevOps as defined in the Chinese wiki

    DevOps (a combination of Development and Operations) is a culture, movement, or practice that emphasizes collaboration between software developers (Dev) and IT operations staff (Ops). By automating the “software delivery” and “architecture change” processes, it makes building, testing, and releasing software faster, more frequent, and more reliable.

Looking at what these three definitions have in common, you can see that no matter how the wording changes, the goal of DevOps stays the same: to shorten the software development lifecycle and deliver high-quality software using continuous delivery. Since continuous delivery already covers build, test, and release activities, I prefer this wording because it keeps the definition short.

In addition, the Chinese wiki, after some translation or localization, describes DevOps as a “culture, movement, or practice”, and it emphasizes communication and cooperation between development and operations. Combining the latest English wiki definition with the Chinese wiki definition therefore helps us understand DevOps better; which of the three it really is remains for the reader to judge.

Background on DevOps

DevOps is popular and widely discussed because of the background against which it developed. The main reasons can be summarized as follows:

  1. Demand for agile-mode work, that is, exploratory work, is increasing;

    • Software development has moved from the traditional waterfall model to agile development, and agile development now faces higher demands. Innovative applications have emerged constantly in recent years, and they are developed in small, fast steps with rapid trial and error. This exploratory work requires operations to be able to release many times a day, and it requires enterprises to complete the transition from steady mode to agile mode.
  2. Software development accounts for a growing share of business activity;

    • Businesses have moved from mild or moderate dependence on software to today’s heavy dependence.
  3. There is a need to eliminate waste.

    • Software development is becoming more and more important within the enterprise, and like other business activities it contains a lot of waste, so management needs to identify and eliminate that waste.
    • Waste in software development is either unnecessary or necessary. Unnecessary waste includes unused features, software bugs, waiting for testing, and waiting for approval. Necessary waste includes work item handovers, testing, and project management.

The above explains the rise of DevOps mainly from the enterprise’s perspective, which is the underlying cause. The more visible drivers include the development of container technology and of microservice architecture. These technical innovations created good conditions for DevOps to solve the problems enterprises face.

DevOps principles and practices

Following the Golden Circle rule, we now know what DevOps is and why it emerged; the remaining question is how to practice it.

The principles of DevOps are the general guidelines, and the practices are the specific ways of implementing them. DevOps is a dynamic process, so for each practice you adopt you should be able to see which principle it applies. When a practice violates a principle, question whether the practice makes sense.

DevOps principles

There are three principles of DevOps:

  1. Flow principle: speed up the flow of work from development through operations to delivery to the customer;
  2. Feedback principle: build a safe and reliable system of work;
  3. Principle of continuous learning and experimentation: adopt a scientific way of working, and treat improving and innovating the organization as part of the work.

Flow principle

  1. Insist on doing less

    • Apply the MVP (minimum viable product) principle at the start of product development.
    • Remove features, rather than only adding them, during product iteration.
  2. Keep decomposing problems

    • Break large changes or requirements into a series of small changes that can be resolved quickly.
  3. Visualize work

    • Use a sprint kanban board to visualize the work.
  4. Limit the number of tasks in progress

    • Reduce lead time and testers’ waiting time.
    • The more tasks in progress, the less accurate the estimates.
  5. Reduce the number of handovers

    • Reduce unnecessary communication and waiting.
  6. Continuously identify and improve constraints

    • Identify the main constraints on flow, such as environment setup and requirements documentation.
    • QA, development, operations, and product teams continuously improve productivity.
    • Set aside 20% of development time for non-functional requirements to reduce technical debt.
  7. Eliminate hardship and waste in the value stream (a major cause of delivery delays)

    • Partially done work: work that is not completely finished.
    • Extra work: documentation that is never used, duplicated interface documentation, etc.
    • Extra features: features that users don’t actually need.
    • Task switching: assigning people to multiple projects or unrelated tasks.
    • Waiting, motion, defects, and non-standard or manual work.

Feedback principle

  1. Work safely in complex systems

    • Manage complex work so that design and operational problems are exposed;
    • Solve problems together and build new knowledge quickly;
    • Apply new local knowledge globally throughout the organization;
    • Leaders need to continually develop people with these capabilities.
  2. Identify problems in time

    • Fast, frequent, high-quality information flow: the operation of every process is measured and monitored.
    • Establish fast feedback and feedforward loops at each stage of the technology value stream (product management, development, QA, security, operations), including automated build, integration, and test processes.
    • A comprehensive telemetry system.
  3. Ensure quality at the source

    • Excessive review and approval processes, in which decisions are made far from where the work is performed, make processes less effective and weaken the feedback between cause and effect.
    • Make developers responsible for system quality, give them rapid feedback, and speed up their learning.
  4. Optimize work for internal customers

    • The non-functional requirements of operations (such as architecture, performance, stability, testability, configurability, and security) are just as important as user-facing features.

Principle of continuous learning and experimentation

  1. Establish a learning organization and a safety culture
  2. Institutionalize daily improvement
  3. Turn local discoveries into global improvements
  4. Inject resilience patterns into daily work
    • Reducing deployment lead time, improving test coverage, shortening test execution time, and even decoupling the architecture where necessary are all ways to introduce this kind of tension into the system.
  5. Leaders reinforce the learning culture
    • Leaders help front-line workers identify and solve problems in their daily work.

DevOps practices

Each group of DevOps principles has corresponding practices: technical practices of flow, technical practices of feedback, and technical practices of continuous learning and experimentation. Before applying them, design the organizational structure carefully so that it supports these practices.

Design organizational structure

  • Use Conway’s Law to design the team structure.
    • Conway’s Law: the architecture of a software system tends to mirror the structure of the team that builds it.
    • The software architecture should let small teams work independently and stay decoupled from each other, avoiding unnecessary communication and coordination.
  • The harm of excessive functional orientation (optimizing for cost).
    • The people doing the work often don’t understand how their work relates to the goals of the value stream (“I’m configuring this server because someone else told me to”).
    • If each functional team in operations serves multiple value streams (i.e., multiple development teams) at the same time, the problem is compounded because every team’s time is at a premium.
  • Build market-oriented teams.
    • Embed engineers and their expertise (such as operations, QA, and information security) in each service team, or provide the team with a self-service platform that can provision a production-like environment, run automated tests, or deploy.
    • This enables each service team to deliver value to customers independently, without having to submit work orders to other departments such as IT operations, QA, or information security.
  • Make functional orientation effective.
    • Quick response.
    • A culture of high trust.
  • Integrate testing, operations, and information security into daily work.
    • Ensuring quality, availability, and security is not the responsibility of one department, but part of everyone’s daily work.
  • Make your team members generalists.
    • Train full-stack engineers.
    • Give engineers the opportunity to learn the skills needed to build and run the systems they are responsible for.
  • A loosely coupled architecture improves productivity and security.
  • Keep teams small (the “two-pizza rule”).

To make functional orientation effective, operations needs to shift from the traditional centralized model to providing operations as a service.

Integrate operations into project development

  • Create shared services (production environment, deployment pipeline, automated test tools, production monitoring console, O&M service platform, etc.) to improve development productivity.
  • O&M engineers join the development team.
    • Make the product team self-sufficient and fully responsible for service delivery and support.
    • Embed O&M engineers in project development teams (interviewing and hiring of O&M engineers are still handled by the centralized O&M team).
  • Assign an O&M liaison (a designated O&M engineer) to each project team.
    • The centralized O&M team manages all environments. The designated O&M engineer needs to understand the new product’s features, why it is being built, and how it works, as well as its operability, scalability, monitoring capabilities, architecture patterns, infrastructure requirements, and feature release plans.
  • Invite O&M liaisons to development team meetings, daily stand-ups, and retrospectives.
  • Use kanban boards to display O&M work.

Technical practices of flow

This section covers the following:

  • The foundation for running the deployment pipeline.
  • Fast and reliable automated testing.
  • Continuous code integration.
  • Automated and low-risk releases.
  • An architecture that reduces release risk.

The foundation for running the deployment pipeline

  • Automate environment setup (development, test, production).
    • Use technologies such as shell scripts, IaC (Puppet, Ansible, Terraform), Docker, K8S, and OpenShift.
  • Put everything under version control.
    • Application code;
    • Database code;
    • O&M configuration code;
    • Scripts for automated and manual tests;
    • Scripts that support code packaging, deployment, database migration, and application configuration;
    • Project documents (requirements documents, deployment process, release notes, etc.);
    • Scripts for configuring firewalls and servers.
  • Extend the definition of done.
    • Development work is considered done only when it runs as expected in a production-like environment.

Fast and reliable automated testing

  • Build, test, and integrate continuously.
    • Code branches are continuously integrated into the trunk, and every integration must pass unit, integration, and acceptance tests.
    • Common tools: Jenkins, TFS, TeamCity, GitLab CI.
    • What continuous integration depends on: automated test tools; a culture in which failures must be fixed immediately; code that is continuously merged into the trunk rather than kept on long-lived feature branches.
  • Build a fast and reliable automated test suite.
    • Unit tests: JUnit, Mockito, PowerMock.
    • Unit test metric: test coverage.
    • Acceptance tests: automated API tests, automated GUI tests.
    • Parallel testing: run security tests, performance tests, unit tests, and other automated tests in parallel.
    • Test-driven development: TDD, ATDD.
  • Keep the deployment pipeline green at all times.
    • When the deployment pipeline fails, everyone either fixes the problem immediately or rolls back the code immediately, and further commits are rejected until the pipeline is green again.

Continuous code integration

  • Integrate code continuously.
    • The longer developers work in isolation on their own branches, the harder it becomes to merge their changes into the trunk.
  • Develop in small batches.
  • Use trunk-based development.
    • Commit code to the trunk frequently (via merge requests).

Automated and low-risk releases

  • Automate the deployment steps: build, test, deploy. Related processes include:
    • Packaging and building the code;
    • Uploading the Docker image;
    • Creating a preconfigured K8S service;
    • Automated unit tests and smoke tests;
    • Automated database migration;
    • Automated configuration.
  • Automated self-service deployment of applications
    • Developers focus on writing code, press the deploy button, watch the metrics to confirm that the code works in production, and get error messages so they can fix failures quickly.
    • Control deployment risk through code review, automated testing, and automated deployment, so that developers can deploy when necessary and testers and project managers can deploy to certain environments.
  • Decouple deployment from release
    • Deployment means installing a specific version of the software in a specific environment.
    • Release means making product features available to all or some customers.
  • Environment-based release patterns
    • Blue-green deployment
    • Gray (canary) release
  • Application-based release patterns
    • Feature toggles. Benefits: easy rollback, graceful performance degradation, and shielding of service dependencies.
    • Dark launches: a potentially risky new feature is deployed and invoked invisibly to users, and only the test results are recorded.
  • Continuous delivery practices
    • Continuous delivery means that all developers work on the trunk in small batches, or on short-lived feature branches that are merged into the trunk regularly, while always keeping the trunk releasable and able to release on demand during normal working hours. Developers get quick feedback when any regression is introduced (defects, performance issues, security issues, availability issues, etc.). As soon as such problems are identified, they are fixed to keep the trunk in a deployable state.
  • Continuous deployment practices
    • Continuous deployment means that, on top of continuous delivery, developers or operations regularly deploy good builds to production. This usually means at least one production deployment per person per day, or even an automated deployment triggered every time a developer commits a code change.
  • Most teams adopt continuous delivery practices.
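
The application-based release patterns above (feature toggles, also called feature switches, and dark launches) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `FeatureToggles` class, the `checkout` function, and the toggle names `new_checkout` and `new_checkout_dark` are all hypothetical, and a real team would more likely use a configuration service or a toggle library.

```python
import hashlib


class FeatureToggles:
    """Hypothetical in-memory feature toggles with percentage rollout."""

    def __init__(self):
        self._rollout = {}  # feature name -> percentage of users (0..100)

    def set_rollout(self, feature, percent):
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout[feature] = percent

    def is_enabled(self, feature, user_id):
        percent = self._rollout.get(feature, 0)  # unknown features are off
        digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100  # stable bucket in 0..99 per user
        return bucket < percent


def checkout(user_id, toggles, old_impl, new_impl, dark_log):
    """Release: serve the new code path only to users whose toggle is on."""
    if toggles.is_enabled("new_checkout", user_id):
        return new_impl(user_id)
    result = old_impl(user_id)
    # Dark launch: run the new path invisibly and only record the outcome.
    if toggles.is_enabled("new_checkout_dark", user_id):
        try:
            dark_log.append((user_id, new_impl(user_id) == result))
        except Exception as exc:  # a dark failure must never reach the user
            dark_log.append((user_id, repr(exc)))
    return result


toggles = FeatureToggles()
toggles.set_rollout("new_checkout_dark", 100)  # deployed, but not released
log = []
print(checkout("user-1", toggles, lambda u: "old", lambda u: "new", log))
print(log)
```

Setting the rollout of `new_checkout` back to 0 is the rollback: no redeployment is needed, which is the point of decoupling deployment from release.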

An architecture that reduces release risk

  • Loosely coupled architecture
  • Service-oriented architecture
  • Safely evolve the enterprise architecture
    • Strangler application pattern: wrap existing functionality behind an API, implement new functionality in the new architecture, and version the API.
  • Cloud-native architecture
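
A minimal sketch of the strangler application pattern mentioned above, assuming a simple in-process router. The `StranglerFacade` class and the route names are hypothetical; in practice this role is usually played by an API gateway or reverse proxy.

```python
class StranglerFacade:
    """Hypothetical facade: callers use one API while routes migrate one by one."""

    def __init__(self, legacy_handler):
        self._legacy = legacy_handler  # callable(route, payload)
        self._migrated = {}            # route -> new handler, callable(payload)

    def migrate(self, route, handler):
        """Register a route that is now served by the new architecture."""
        self._migrated[route] = handler

    def handle(self, route, payload):
        handler = self._migrated.get(route)
        if handler is not None:
            return handler(payload)
        return self._legacy(route, payload)  # everything else stays on legacy


def legacy_system(route, payload):
    return f"legacy:{route}"


api = StranglerFacade(legacy_system)
api.migrate("v2/orders", lambda payload: "new:orders")

print(api.handle("v1/orders", {}))  # still served by the legacy system
print(api.handle("v2/orders", {}))  # served by the new implementation
```

Because callers only ever talk to the facade, the legacy system can be retired gradually, one route at a time.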

Technical practices of feedback

This section covers the following:

  • Establish a telemetry system
  • Intelligent alarms
  • Use feedback to deploy safely
  • Apply A/B tests
  • Establish review and collaboration processes

Establish telemetry system

  • What is telemetry?
    • Telemetry encompasses monitoring and enables real-time, high-speed, and more fine-grained monitoring of the network.
    • Compared with traditional network monitoring, telemetry actively pushes data to a collector (push mode), which makes monitoring more timely, faster, and more accurate.
  • Three dimensions of telemetry
    • Tracing, metrics, and logging.
  • Observability
    • The degree to which a system’s internal state can be inferred from its external outputs (telemetry data).
    • Makes it possible to identify, predict, and solve problems.
  • Centralized monitoring system (options include Prometheus and SkyWalking)
    • Data is collected at the business logic, application, and environment layers.
    • An event router stores and forwards events and metrics.
  • Application log telemetry (ELK, audit logs, metrics)
  • List of major application events:
    • Results of authentication/authorization (including logout);
    • Access to systems and data;
    • System and application changes (especially privilege changes);
    • Changes to data, such as adding, modifying, or deleting data;
    • Invalid input (possible malicious injection, threats, etc.);
    • Resources (memory, disk, CPU, bandwidth, or any other resource with hard/soft limits);
    • Health and availability;
    • Startup and shutdown;
    • Failures and errors;
    • Circuit breaker trips;
    • Latency;
    • Backup success/failure.
  • Build production telemetry into daily development work.
  • Use telemetry to guide problem resolution.
  • Build self-service visual telemetry (information radiators)
    • Grafana
    • SkyWalking
    • Kibana
  • Find and fill telemetry blind spots (establish full and complete telemetry)
    • Business level: order volume, number of users, churn rate, ad impressions and clicks, etc.
    • Application level: transaction processing events, application failures, and so on.
    • Infrastructure level: server throughput, CPU load, disk usage, etc.
    • Client software level: application errors and crashes, client transaction processing events, etc.
    • Deployment pipeline level: pipeline status and deployment frequency.
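
As a small illustration of application log telemetry, the sketch below emits two of the major application events listed above (an authentication result and a circuit breaker trip) as structured JSON lines using only Python’s standard `logging` module. The event names, field names, and the `fields` key passed through `extra` are assumptions made for this example, not a standard schema; a log shipper feeding something like ELK could consume lines in this shape.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        event = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "fields" is our own convention, attached via the `extra` argument.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event, sort_keys=True)


def build_logger(stream, name="app.telemetry"):
    logger = logging.getLogger(name)
    logger.handlers.clear()   # avoid duplicate handlers on repeated setup
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()  # stands in for stdout or a log file
log = build_logger(stream)
log.info("login", extra={"fields": {
    "event": "auth.result", "user": "alice", "success": True}})
log.warning("breaker open", extra={"fields": {
    "event": "circuit_breaker.trip", "service": "payments"}})
print(stream.getvalue())
```

Structured events like these are easier to search and aggregate than free-form text, which is what makes self-service dashboards practical.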

Intelligent alarms

  • Resolve alarm fatigue
    • Full and complete telemetry brings the problem of alarm fatigue, so smarter alarms are needed.
  • Use statistical analysis instead of static thresholds to set alarms
    • Use the mean and standard deviation (for normally distributed data): alert when a measurement deviates from the mean by many standard deviations.
  • Use alarms that prevent faults, not only alarms that fire after a fault has occurred
    • Ask which indicators can predict a failure.
  • Anomaly detection techniques
    • Smoothing: use a moving average, which transforms the data by averaging each point with the other points in its sliding window.
    • Tools that support advanced anomaly detection: Prometheus, Grafana.
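
The two statistical ideas above, alerting on standard deviations rather than a static threshold and smoothing with a moving average, can be sketched as follows. This is a minimal example, assuming roughly normally distributed data; the function names, the trailing window, and the threshold of 3 standard deviations are choices made for illustration.

```python
from statistics import mean, stdev


def moving_average(values, window):
    """Smooth a series by averaging each point with the points before it
    in a trailing window of at most `window` samples."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def is_anomalous(history, value, threshold=3.0):
    """Alert when `value` is more than `threshold` standard deviations
    away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical response times in milliseconds: mean 100, sample stdev 2.
history = [100, 102, 98, 101, 99, 100, 103, 97]
print(is_anomalous(history, 104))  # 2 standard deviations away: no alert
print(is_anomalous(history, 110))  # 5 standard deviations away: alert
print(moving_average([1, 2, 3, 4], window=2))
```

Compared with a fixed limit such as “alert above 105 ms”, the statistical rule adapts as the normal level of the metric changes, which reduces false alarms.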

Use feedback to deploy safely

  • Deploy safely through telemetry: problems can be detected immediately after deployment.
  • Everyone in the value stream (developers, development managers, architects, operations teams, etc.) shares downstream responsibility for operational incidents.
    • Share on-call duty and resolve production problems together.
  • Let developers follow the impact of their work on operations.
    • This makes the applications they build easier to deploy and improves the well-being of O&M staff.
  • Let the development team manage production services.
    • A service is managed first by the development team and later handed over to the centralized operations team.
    • O&M engineers move from production support to a consulting role, or join teams to help prepare for deployment and establish service release guidelines (including support for effective monitoring, reliable deployment, and an architecture that supports fast, frequent deployments).
    • Assign SREs to the team. An SRE is a software development engineer responsible for operations work; SREs are scarce, so they are assigned only to the most important teams.

Apply A/B tests

  • Integrate A/B testing into features
    • Each user is randomly shown one of two versions of a page.
  • Integrate A/B testing into releases
    • Use feature toggles.
  • Integrate A/B testing into feature planning
    • The aim is not only to deploy and release software quickly, but also to keep improving through experimentation and to proactively achieve business goals and customer satisfaction.
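
A rough sketch of how users might be split between two page versions and how the results might be compared. Note the swapped-in technique: instead of a true random draw, it hashes the user ID so that the same user always sees the same version, which is a common way to approximate random assignment. The function names and the experiment name are hypothetical, and a real experiment would also need a statistical significance test before drawing conclusions.

```python
import hashlib


def assign_variant(experiment, user_id, variants=("A", "B")):
    """Deterministically assign a user to a variant (stable across visits)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[index]


def conversion_rates(events):
    """events: iterable of (variant, converted) pairs -> rate per variant."""
    totals, wins = {}, {}
    for variant, converted in events:
        totals[variant] = totals.get(variant, 0) + 1
        wins[variant] = wins.get(variant, 0) + (1 if converted else 0)
    return {variant: wins[variant] / totals[variant] for variant in totals}


variant = assign_variant("signup-page", "user-42")
print(variant in ("A", "B"))
print(conversion_rates([("A", True), ("A", False), ("B", True)]))
```

Keying the hash on both the experiment name and the user ID keeps assignments independent across experiments.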

Establish review and collaboration processes

  • Prevent “over-controlling change”
    • Counterfactual thinking tends to attribute accidents to a lack of approval procedures.
  • Establish peer review and shorten the approval process
    • High-performing DevOps organizations rely more on peer review and less on external change approval (layers of approval).
  • Code review
    • Everyone’s code must be peer-reviewed before it is merged into the trunk;
    • Everyone should keep an eye on the commits of other team members;
    • Define what counts as a high-risk change, to decide whether review by a domain expert is needed;
    • Break large changes into smaller batches.
  • Use pair programming to improve code changes
    • One study showed that paired programmers were 15 percent slower than two programmers working independently, while the amount of “error-free” code increased from 70 percent to 85 percent.
    • The cost of testing and debugging a program is often many times higher than the cost of writing the initial code.
  • Evaluate the effectiveness of merge requests
    • Effectiveness is judged independently of the results the change produced in production.
    • Essential elements of an effective merge request: it must give enough detail on why the change was made, how it was made, and any identified risks and countermeasures.

Technical practices of continuous learning and experimentation

This section covers the following:

  • Incorporate learning into your daily routine
  • Translate local experience into global improvement
  • Set aside time for organizational learning and improvement

Incorporate learning into your daily routine

  • A just culture and a learning culture
    • Human error is often not the root cause of a problem; the cause may be unavoidable design problems in complex systems.
    • Do not “name, blame, and shame” the people involved in a failure; the goal is to maximize opportunities for organizational learning.
    • Look at mistakes, errors, slips, and lapses from the perspective of learning.
    • Related practice 1: hold blameless post-mortems and judge fairly, so that engineers are willing to take responsibility and eager to help others avoid the same mistake; make the results of post-mortem meetings widely available.
    • Related practice 2: inject controlled, deliberate failures (Chaos Monkey) into the production environment to rehearse for the problems that will inevitably occur.
  • Lower the tolerance for incidents and look for ever-weaker failure signals
    • As organizational capability improves, the number of incidents drops sharply, so the threshold for what counts as a problem should be lowered as well.
    • In complex systems, amplifying weak failure signals is important for preventing catastrophic failures.
  • Redefine failure
    • High-performing DevOps organizations deploy changes 30 times as often as average performers; even at half the failure rate, that clearly means more failures in total.
    • Encourage innovation and accept the risks it entails.
  • Create game days
    • Help the team simulate and rehearse incidents so that it gains hands-on experience.
    • Expose latent defects in the system.
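
The claim under “Redefine failure” is easy to check with a quick calculation. The baseline figures below (100 deployments and a 10% failure rate for an average team) are assumptions chosen only to make the arithmetic concrete; the factors of 30 and one half come from the text.

```python
def expected_failures(deployments, failure_percent):
    """Expected number of failed changes for a given failure rate in percent."""
    return deployments * failure_percent / 100


# Assumed baseline for an average team (illustrative numbers only).
average = expected_failures(deployments=100, failure_percent=10)

# High performer: 30 times as many changes at half the failure rate.
high_performer = expected_failures(deployments=100 * 30, failure_percent=10 / 2)

print(average, high_performer, high_performer / average)
```

Whatever baseline you pick, the ratio is 30 × 0.5 = 15, so a high-performing team should expect about 15 times as many failures in absolute terms and needs a culture that treats them as learning opportunities.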

Translate local experience into global improvement

  • [ChatOps] Use chatbots to accumulate organizational knowledge
    • Automation tools are integrated into the chat, for example @bot deploy owl to Production;
    • The bot posts the results back to the chat room, so everyone can see what is happening;
    • New engineers can also see the team’s daily work and how it is carried out;
    • People are more likely to ask for help when they see others helping each other;
    • Use topic groups to build organizational learning and accumulate knowledge quickly;
    • A culture of transparency and collaboration is strengthened.
  • Translate standards, processes, and specifications into a form that is easy to carry out
    • ArchOps makes engineers builders, not bricklayers;
    • Convert manual operations into code that can be executed automatically;
    • Express compliance in code.
  • Use automated tests to record and spread knowledge
    • Automated interface tests show users how to use the system;
    • Unit tests show callers how a method’s API is used.
  • Include non-functional operational requirements in project development
    • Adequate telemetry for every application and environment;
    • The ability to accurately track dependencies;
    • Services that are resilient and degrade gracefully;
    • Forward and backward compatibility between versions;
    • The ability to archive data to manage the size of production data sets;
    • The ability to easily search and understand log information across services;
    • The ability to trace user requests through multiple services;
    • Simple, centralized runtime configuration using feature toggles or other methods.
  • Integrate reusable O&M user stories into development
    • Repetitive O&M work is implemented in code.
  • Consider O&M factors when selecting technology
    • The choice must not slow down the flow of work;
    • Consider, for example, how to choose between TiDB and MySQL.
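
To make the ChatOps idea concrete, here is a minimal sketch of how a bot might parse a chat message such as “@bot deploy owl to Production” and reply in the channel. Everything here is hypothetical: the command grammar, the `handle_message` function, the allowed environments, and the `deployer` callback, which stands in for whatever actually triggers the deployment pipeline.

```python
import re

# Hypothetical grammar: "@bot deploy <service> to <environment>"
COMMAND = re.compile(
    r"^@bot\s+deploy\s+(?P<service>[\w-]+)\s+to\s+(?P<env>[\w-]+)$",
    re.IGNORECASE,
)


def handle_message(text, deployer, allowed_envs=("staging", "production")):
    """Return the bot's reply, or None if the message is not a command."""
    match = COMMAND.match(text.strip())
    if match is None:
        return None
    service = match.group("service")
    env = match.group("env").lower()
    if env not in allowed_envs:
        return f"Unknown environment '{env}'."
    deployer(service, env)  # e.g. trigger the deployment pipeline
    return f"Deploying {service} to {env}."


calls = []
reply = handle_message(
    "@bot deploy owl to Production",
    deployer=lambda service, env: calls.append((service, env)),
)
print(reply)
print(calls)
```

Because the command and the bot’s reply both appear in the shared chat room, the deployment is visible to the whole team, which is where the knowledge-sharing benefit comes from.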

Set aside time for organizational learning and improvement

  • Institutionalize technical debt repayment
    • Hold regular “spring cleaning” sessions.
    • Development and operations optimize non-functional requirements across the entire value stream.
    • Value: gives front-line staff the ability to keep identifying and solving problems.
  • Let everyone learn
    • Development skills are increasingly needed by all engineers, not just developers.
    • More and more technology value streams are adopting DevOps principles and patterns.
    • [Weekly study culture] Hold a weekly study period in which each colleague both learns independently and teaches others.
  • In-house consultants and coaches
    • Establish an internal coaching and consulting organization to help expertise spread within the organization.

Key practices

DevOps practice covers a lot of ground, but these are the highlights:

  • Practices of the flow principle
    • The foundation of the deployment pipeline (everything is versioned; done means running as expected in a production-like environment)
    • Fast and reliable automated testing (tests run automatically; always keep the pipeline green)
    • Continuous code integration (small-batch development)
    • Automated and low-risk releases (self-service, decoupling deployment from release, continuous delivery)
    • An architecture that reduces release risk (cloud-native architecture)
  • Practices of the feedback principle
    • Establish a telemetry system (tracing, metrics, logging)
    • Intelligent alarms (statistical analysis; alarms that prevent faults)
    • Use feedback to deploy safely (find problems immediately after deployment and share responsibility)
    • Apply A/B testing (integrate A/B testing into feature planning; use feature toggles)
    • Establish review and collaboration processes (peer review, shorter approval processes, pair programming)
  • Practices of the principle of continuous learning and experimentation
    • Integrate learning into daily work (look at incidents from the perspective of learning and look for ever-weaker failure signals)
    • Translate local experience into global improvement (ChatOps, making specifications easy to carry out, non-functional operational requirements)
    • Set aside time for organizational learning and improvement (regular repayment of technical debt, teaching and learning, internal coaching)

Conclusion

DevOps and technology develop hand in hand, and DevOps opens up more learning paths and career directions for technologists. To conclude this article, I would like to borrow a quote from a DevOps leader.

For all professionals who love innovation and change, a bright and dynamic future lies ahead.


This article is adapted from slides shared by the author. The original article and the slides are available at: github.com/lcomplete/T…