How can we improve R&D efficiency? Do you rely on separate local development and test environments, or on full end-to-end testing? This series describes the history and evolution of Lyft’s development environment and offers lessons on building an efficient development environment for large-scale microservices. This is the fourth article in a four-part series. Original article: Scaling productivity on microservices at Lyft (Part 4): Gating Deploys with Automated Acceptance Tests [1]

This article, the fourth and final in a series, focuses on how we’re expanding our development practices as Lyft deals with a growing number of developers and services.

  • Part 1: History of development and test environments
  • Part 2: Optimizing for fast local development
  • Part 3: Extending the service mesh in the pre-release environment with overrides
  • Part 4: Gating deploys with automated acceptance tests (this article)

In previous articles, we described how context propagation can be leveraged to allow multiple engineers to conduct end-to-end testing in a shared pre-release environment. Now let’s look at the other part, automated end-to-end testing, where we’ll show you how to build a scalable solution that will give engineers more confidence before deploying to production.

Rethink end-to-end testing

Part 1 of this series covered many of the challenges we ran into running integration tests in CI. The explosion in the number of services and engineers made it hard to scale the remote development environment (Onebox) where the tests ran, and the tests themselves took a long time to run. Each service’s integration test suite also became unwieldy, taking more than an hour to complete with a very low signal-to-noise ratio. Engineers distrusted failing tests and often ignored them or wasted debugging time on them, which made things worse.

Among the thousands of integration tests covering more than 900 services, there is a small set of truly valuable end-to-end scenarios that we believe are critical to maintain: for example, that users can log in, request a ride, and pay the fare. Failures in these scenarios are internally classified as SEV0s (highest-severity incidents): they prevent passengers from getting where they need to go, or drivers from earning an income, and must be fixed at all costs.

Acceptance tests

When we looked at the highest-value end-to-end tests we wanted to keep, even a cursory glance showed that they look a lot like acceptance tests: they describe how users interact with the Lyft platform without needing to know the details of the internal implementation.

With this in mind, we decided to move from a distributed model, in which each service defines its own integration test suite, to a small centralized acceptance test suite. This has two advantages. Technically, putting the scenarios in one place helps us eliminate duplication and share test code between related services. Organizationally, a single owner is responsible for coordinating the overall health of the tests (which are still written and modified by many people) and for keeping them well isolated so the suite does not sprawl out of control.

Another key decision was when to run these tests. We wanted to change the habit of running end-to-end tests as part of the inner development loop (described in Part 2), where developers ran them repeatedly during development in place of strategies such as unit tests or calling specific service endpoints directly. Instead, we wanted CI to stay fast and to encourage people to get comfortable deferring end-to-end testing until later in the process. For these reasons, we chose to run acceptance tests after deployment to the pre-release environment, as one of the gates on the production deploy.

Building a framework

Engine

First, we needed an engine that provides a simple interface for exercising Lyft’s API like a real user. Fortunately, we had already built something similar to generate traffic in both the pre-release and production environments (see the pre-release environments section in Part 1). The library consists of several key concepts (sketched in the example after this list):

  • Actions: interact with the Lyft API. For example, the RequestRide action calls the Lyft API with the required origin and destination and begins the search for a driver.
  • Behaviors: the "brain" that decides what to do next based on certain probabilities. For example, after requesting a ride, should the client cancel it or wait for the driver?
  • Clients: represent devices interacting with the platform, typically a phone running the Lyft app; they store state and coordinate Actions and Behaviors.
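
As a rough illustration of how these three concepts fit together, here is a minimal sketch in Python. The class and method names (SimClient, RequestRide, request_ride, and so on) and the api wrapper are assumptions made for illustration, not Lyft’s actual library:

import random
from abc import ABC, abstractmethod
from typing import Optional


class Action(ABC):
    """A single interaction with the Lyft API (e.g. requesting a ride)."""

    @abstractmethod
    def run(self, client: "SimClient") -> None:
        ...


class RequestRide(Action):
    def __init__(self, origin: str, destination: str):
        self.origin = origin
        self.destination = destination

    def run(self, client: "SimClient") -> None:
        # Call the public API the way a real app would, then record
        # the resulting ride in the client's state.
        ride = client.api.request_ride(self.origin, self.destination)
        client.state["ride_id"] = ride["id"]


class CancelRide(Action):
    def run(self, client: "SimClient") -> None:
        client.api.cancel_ride(client.state.pop("ride_id"))


class Behavior(ABC):
    """The 'brain': probabilistically decides which Action to run next."""

    @abstractmethod
    def next_action(self, client: "SimClient") -> Optional[Action]:
        ...


class PassengerBehavior(Behavior):
    def next_action(self, client: "SimClient") -> Optional[Action]:
        if "ride_id" not in client.state:
            return RequestRide(origin="home", destination="work")
        # After requesting a ride, occasionally cancel; otherwise keep waiting.
        return CancelRide() if random.random() < 0.1 else None


class SimClient:
    """One simulated device, typically a phone running the Lyft app."""

    def __init__(self, api, behavior: Behavior):
        self.api = api            # hypothetical thin wrapper around the public Lyft API
        self.behavior = behavior
        self.state: dict = {}

    def tick(self) -> None:
        action = self.behavior.next_action(self)
        if action is not None:
            action.run(self)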

The combination of these three simple concepts has been the foundation of our automated-testing strategy in the pre-release and production environments for the past five years and has served us well. However, reusing it for acceptance testing required addressing one important difference: the probabilistic nature of behaviors. Because behaviors were built for load testing, where randomness is very helpful for shaking out unexpected bugs, they were designed to work like a fuzzer [2], which is not suitable for deterministically testing a specific flow in an acceptance test.

So we closed the gap by updating the library to allow clients to follow a series of steps as an alternative to behaviors. A step can be any of the following (a minimal step-runner sketch follows the list):

  • Actions: as described above, simply perform an API call.
  • Conditions: block execution of the next step until an expression becomes true, with an optional timeout. For example, a driver client may wait until it reaches the pickup point, at which point the PickedUp action notifies Lyft that the passenger has been picked up.
  • Assertions: ensure that client state looks as expected, for example that a price quote has completed before requesting a ride.
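
To make the step semantics concrete, the sketch below shows one way the three step types could be modeled and executed, with a Condition polling client state until its expression holds or a timeout expires. The names (StepRunner, client.perform, and the step dataclasses) are assumptions for illustration, not the actual library:

import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ActionStep:
    action: str                          # name of an Action to perform, e.g. "RequestRide"


@dataclass
class ConditionStep:
    expression: Callable[[dict], bool]   # e.g. lambda s: s.get("ride_status") == "completed"
    timeout_s: float = 60.0


@dataclass
class AssertionStep:
    expression: Callable[[dict], bool]   # must hold immediately, or the test fails


class StepRunner:
    """Drives one client through a deterministic list of steps."""

    def __init__(self, client):
        self.client = client

    def run(self, steps) -> None:
        for step in steps:
            if isinstance(step, ActionStep):
                self.client.perform(step.action)            # plain API call
            elif isinstance(step, ConditionStep):
                self._wait_for(step.expression, step.timeout_s)
            elif isinstance(step, AssertionStep):
                assert step.expression(self.client.state), "client state not as expected"

    def _wait_for(self, expression, timeout_s: float, poll_s: float = 1.0) -> None:
        # Block the next step until the expression is true or the timeout expires.
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if expression(self.client.state):
                return
            time.sleep(poll_s)
        raise TimeoutError("condition not met before timeout")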

Defining tests

With the Actions, Conditions, and Assertions building blocks in place, it was time to decide on a format for defining tests in the new centralized suite. Previously, integration tests were written in code, but we decided to switch to defining acceptance tests in a custom configuration syntax. There are pros and cons, but we found that defining tests this way keeps them simple and consistent, so that more people can read and write them and they are easier to maintain. The configuration exposes a limited interface, and most of the logic lives in the library mentioned earlier, where it can be shared with other acceptance and load test runners.

Putting all this together, here is a sample test scenario:

# test_scenarios/standard_ride.yaml
description: A standard Lyft ride between 1 driver and 1 passenger
clients:
  - role: passenger
    steps:
      - type: Action
        action: Login
      - type: Action
        action: SetDestination
      - type: Assertion
        assertions:
          - ["price_quote", "between", 10, 20]
      - type: Action
        action: RequestRide
      - type: Condition
        conditions:
          - ["ride_status", "equals", "completed"]
      - type: Action
        action: TipDriver
  - role: driver
    steps: 
      - type: Action
        action: Login
      - type: Action
        action: EnterDriverMode
      - type: Condition
        conditions:
          - ["ride_request", "exists", true]
      - type: Action
        action: AcceptRideRequest
      - type: Action
        action: PickUpPassenger
      - type: Condition
        conditions:
          - ["location", "equals", "destination"]
      - type: Action
        action: DropOffPassenger
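
For illustration, a runner might consume such a scenario file roughly as follows. The loader and the make_client/make_runner hooks are hypothetical, and converting the raw step dictionaries into typed steps is elided for brevity:

from concurrent.futures import ThreadPoolExecutor

import yaml  # PyYAML


def load_scenario(path: str) -> dict:
    """Parse a scenario file such as test_scenarios/standard_ride.yaml."""
    with open(path) as f:
        return yaml.safe_load(f)


def run_scenario(scenario: dict, make_client, make_runner) -> None:
    """Run every client's steps concurrently, since a real ride involves a
    passenger and a driver acting at the same time."""

    def run_client(client_spec: dict) -> None:
        client = make_client(role=client_spec["role"])
        runner = make_runner(client)
        runner.run(client_spec["steps"])   # parsing of step dicts elided

    with ThreadPoolExecutor(max_workers=len(scenario["clients"])) as pool:
        futures = [pool.submit(run_client, c) for c in scenario["clients"]]
        for future in futures:
            future.result()   # re-raise any assertion or timeout failure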

It is worth noting that if you were starting from scratch, it might be better to build on an existing testing framework such as Cucumber/Gherkin [3]. In our case, it was much easier to extend our existing traffic generation tooling than to adopt those frameworks.

Gating deploys

Moving end-to-end testing from pre-merge (on every PR) to post-merge (at deploy time) greatly increased developer productivity. A typical PR contains an average of four commits (each of which runs the test suite), while typically only one or two PRs are deployed at a time, so developers are roughly 10x less likely to be blocked by a flaky test. Furthermore, we expect that without the false sense of security provided by end-to-end tests on PRs, developers will invest more in unit tests and in safer release practices such as feature flags (and, as discussed in Part 3, overrides can now be configured on a per-request basis).

To achieve this, we extended the concept of a deploy gate in our internal deployment system. A deployment stage can be blocked by one or more gates, each specifying a condition that must be met before the deploy is allowed to proceed to the next stage. A typical example is the bake time included in every pre-release deployment: this gate ensures that the newly deployed version runs for a specified length of time, giving the continuous simulated traffic a chance to trigger alarms if anything is wrong.

Each acceptance test gate is added as a dependency of the service under test; once the pre-release environment has been deployed, the corresponding gate kicks off a test run and reports success or failure. To avoid slowing developers down, the goal is for acceptance tests to complete in less time than the default bake time (10 minutes).
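
As a minimal sketch of the idea (the gate interface below is an assumption, not Lyft’s internal deploy system), the gate runs the relevant scenarios against the pre-release environment and fails closed if any scenario fails or the bake-time budget is exceeded:

import time
from typing import Callable, Iterable


def acceptance_test_gate(
    scenarios: Iterable[str],
    run_scenario: Callable[[str], bool],
    bake_time_s: float = 600.0,   # default bake time: 10 minutes
) -> bool:
    """Decide whether a pre-release deploy may proceed to production.

    `run_scenario` is assumed to execute one acceptance scenario against the
    pre-release environment and return True on success.
    """
    deadline = time.monotonic() + bake_time_s
    for name in scenarios:
        if time.monotonic() > deadline:
            return False   # slower than the bake-time budget: block the deploy
        if not run_scenario(name):
            return False   # any failing scenario blocks the deploy
    return True

The deploy pipeline would then block promotion to production until the gate reports success, in the same way it does for the bake-time gate.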

Putting it into practice

What to test?

Arguably, one of the hardest parts of the transition was deciding what qualifies as an acceptance test and applying those criteria across a large number of integration tests. After sifting through hundreds of integration tests and discussing them with service owners, we settled on the following criteria:

  • Acceptance tests should only represent critical business flows and should describe end-to-end interactions with the Lyft platform from a user perspective.
  • We obviously can’t test every scenario, so acceptance tests must be business-critical. As the most expensive tests in the suite, we cannot afford to cover edge cases or business flows whose brief interruption would not cause significant damage to the business (i.e. would not be a SEV0).

While these criteria may seem obvious from the standpoint of the test pyramid [4], they were harder to apply than expected. Developers were uncomfortable removing integration tests, regardless of whether those tests were well understood or had ever caught a bug. To ease the transition, we worked with each team to evaluate every test against the criteria above. About 95% of the tests were either redundant or could be rewritten as unit tests with mocks. The remaining tests, with redundancy removed, were consolidated into roughly 40 acceptance test scenarios that replaced all of the integration tests.

Results

It’s been about six months since we replaced the integration tests in CI with acceptance tests in the pre-release environment. The number of scenarios has remained relatively stable, we have expanded coverage to the transportation and bike & scooter products, and we run several thousand tests per week. The main benefits we have seen are:

  • Most PRs pass unit tests and are ready to merge within 10 minutes (previously 30 minutes when end-to-end tests were included).
  • Thousands of integration tests were removed from services, eliminating the time spent maintaining and debugging them.
  • Acceptance tests are faster and more reliable to iterate on: preparing a local environment that targets the pre-release environment takes less than a minute (using the local development workflow mentioned earlier, compared to Onebox’s initial setup time of about an hour).
  • Since the removal of end-to-end testing from PR, there has been no significant increase in the number of bugs leaking into production.
  • Acceptance tests catch problems weekly before they leak into production.
  • We haven’t seen the kind of additional investment in unit testing that we’d like to see. Further investigation is needed to understand why, and whether more can/should be done to change this.

Future work

We’re excited about the productivity gains we’ve seen from these changes so far, but still look forward to many improvements ahead.

Pretest isolation

Currently, running tests immediately after deploying changes to the pre-release environment means that other pre-release users can be disrupted if something breaks. We want to build on the pre-release override work discussed in the third article in this series by running acceptance tests against the new version of a service before exposing it to other users. This will add extra latency to deploys, so we need to assess whether the benefits outweigh the costs.

Test coverage

Given that the main goal of these tests is to improve reliability, we want to do more than just maintain them and drive more direct improvements. We know there are gaps in today’s tests, which were built from the earlier integration tests and from discussions with service owners about which business flows are important enough to cover. To close the gaps and improve reliability, we need to make sure that the most common API calls made by real iOS and Android clients all show up in these tests. One idea is to analyze the delta between real and simulated traffic flowing through our systems, perhaps through further investment in distributed tracing tooling.

Test scenario health

Initially, the platform team curated each acceptance test by hand and closely monitored its stability. As we continue to scale out to more lines of business, we want more fine-grained observability for each operation (API call) so that failures can be routed automatically to the appropriate team. This does not mean decentralizing ownership (we think it is important to keep a central owner for the broader tests and platform); it simply means alerting product teams to failures in their services more quickly and minimizing manual effort.

Conclusion

In this series, we’ve taken a closer look at how Lyft has evolved its development and testing practices over the years in pursuit of continuous improvements in developer productivity. We covered the history of Lyft’s development environments (Part 1), the shift to a local-first development workflow (Part 2), isolating services under test with Envoy overrides in the pre-release environment (Part 3), and replacing heavier PR-triggered integration tests with a set of acceptance tests run during deployment (this article).

While this approach may not work in every environment, it has been very successful in shortening the feedback loop for developers and greatly simplifying the infrastructure that supports our test environments, helping developers get their code out faster.

References:
[1] Scaling productivity on microservices at Lyft (Part 4): Gating Deploys with Automated Acceptance Tests: eng.lyft.com/scaling-pro…
[2] Fuzzing: en.wikipedia.org/wiki/Fuzzin…
[3] Gherkin: cucumber.io/docs/gherki…
[4] The Practical Test Pyramid: martinfowler.com/articles/pr…
