
How can we improve R&D efficiency? Do you rely on isolated local development and test environments, or on full end-to-end testing? This series describes the history and evolution of Lyft's development environment and helps us think about how to build an efficient development environment for large-scale microservices. This is the first in a four-part series. Original article: Productivity on Microservices at Lyft (Part 1)[1]

In late 2018, the Lyft engineering team finished splitting the original PHP monolith into Python and Go microservices, and over the following years microservices largely succeeded in helping teams run and ship their services independently. The separation of concerns that microservices bring allowed us to experiment and deliver features more quickly (hundreds of deploys per day), and provided enough flexibility to adopt different programming languages where appropriate, apply more or less stringent requirements depending on how critical a service is, and so on. However, as the number of engineers, services, and tests grew, the development tooling struggled to keep up with the explosion of microservices, dragging down productivity.

This four-part series looks at the development environments used as Lyft's engineering team grew from about 100 engineers and a handful of services to more than 1,000 engineers and hundreds of services. We'll discuss the scaling challenges that led us to abandon those environments, and the shift from a testing approach based primarily on a large number of integration tests, often close to end-to-end, to a local-first approach centered on testing components in isolation.

  • Part 1: History of development and test environments (this article)
  • Part 2: Optimizing for fast local development
  • Part 3: Extending the service mesh into the staging environment with override mechanisms
  • Part 4: Gating deployments on automated acceptance tests

History of development and test environments

Our first major investment in an integrated development environment began in 2015, when we had about 100 engineers and almost all development work was focused on a single PHP monolith, with a few microservices starting to appear for certain use cases (such as driver logins).

Expecting the number of engineers and services to keep growing, we decided to adopt containerization. We planned to build a Docker-based container orchestration environment (Docker was still in its infancy at the time) that would first serve developer testing and later expand to production, where multi-tenant workloads would benefit from lower costs and faster scaling.

Use Devbox for local development

Devbox, Lyft's out-of-the-box development environment, was released in early 2016 and quickly adopted by most engineers. Devbox manages a local virtual machine on the user's behalf, so engineers don't have to install or update dependencies, configure runit[2] to start services, add shared folders, and so on. Once the VM is running, a single command and a few minutes are all it takes to pull the latest images, create and initialize databases, and launch the Envoy proxy sidecar[3] along with all other required dependencies before you can start sending requests.

This was a great upgrade compared to the previous approach, in which we manually provisioned an EC2 instance for each developer and the services they were responsible for, which made setup and keeping things up to date tedious. For the first time, we had a consistent, repeatable, and simple way to develop across multiple services.

Use Onebox for remote development

Soon there was a need for a long-lived environment that could be shared with other engineers or teams (such as the design team), so we built Onebox. Onebox is essentially Devbox on an EC2 instance, and it had a number of advantages that enticed users away from Devbox. We deployed it on r3.4xlarge instances with 16 vCPUs and 122 GB of RAM, much more powerful than the MacBook Pros engineers carry around. Onebox could run more services and download container images faster (because it ran inside AWS), not to mention that it avoided VirtualBox making your laptop's fan sound like a jet engine.

We now had two development environments, each capable of running multiple services.

Integration testing

In addition to unit tests, Onebox's cloud infrastructure was also ideal for running integration tests in CI. A service simply defines its required dependencies in a manifest.yaml file, and a temporary Onebox launches those services and runs the tests on each pull request. Many services, especially aggregation services close to the mobile clients, built large integration test suites to guard against unusual failure modes, and almost every incident postmortem ended with new integration tests being added. With such flexible and powerful testing capabilities available, unit testing took a back seat.

An example of a service defining integration tests to run in CI:

name: api
type: service
groups:
  - name: integration
    members:
      - driver_onboarding
      - users
tests:
  - name: integration
    group: integration
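To make the mechanism concrete, here is a minimal sketch of how a CI runner might resolve such a manifest into the set of services to launch before executing a test group. The function and field names are illustrative assumptions; the article does not show Lyft's actual tooling.

```python
# The manifest from the example above, as it would look after YAML parsing.
MANIFEST = {
    "name": "api",
    "type": "service",
    "groups": [
        {"name": "integration", "members": ["driver_onboarding", "users"]},
    ],
    "tests": [
        {"name": "integration", "group": "integration"},
    ],
}

def services_for_test(manifest, test_name):
    """Return the service under test plus the members of the
    dependency group referenced by the named test."""
    test = next(t for t in manifest["tests"] if t["name"] == test_name)
    group = next(g for g in manifest["groups"] if g["name"] == test["group"])
    return [manifest["name"], *group["members"]]

print(services_for_test(MANIFEST, "integration"))
# ['api', 'driver_onboarding', 'users']
```

A temporary Onebox would then start each of the returned services before the test group runs.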
Staging Environment

Lyft's staging environment is virtually identical to production (except that it uses fewer resources and contains no production data), and all services are deployed to it through the same process as production. Although not a development environment, staging is worth discussing because it plays an increasingly important role in end-to-end testing.

Shortly after Devbox and Onebox launched, in early 2017 we also tackled another growing class of problems: load testing. Events that cause spikes in ride traffic (like New Year's Eve and Halloween) exposed bottlenecks in our systems and often led to outages. To address this, we built a framework to simulate large-scale traffic. The framework coordinates tens of thousands of simulated users with different configurations (for example, a driver in Los Angeles who frequently cancels) and treats Lyft as a black box.
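As a rough illustration of the idea, the sketch below shows one way a configurable simulated user could be modeled. This is not Lyft's actual framework; the class, parameters, and endpoint paths are hypothetical, chosen only to mirror the "driver in Los Angeles who frequently cancels" example.

```python
import random

class SimulatedDriver:
    """A black-box simulated user: it only issues public API calls,
    with behaviour controlled by a configurable profile."""

    def __init__(self, region, cancel_rate, rng=None):
        self.region = region            # e.g. "los_angeles"
        self.cancel_rate = cancel_rate  # probability of cancelling a ride
        self.rng = rng or random.Random()

    def next_action(self):
        """Pick the next (hypothetical) API call this driver makes."""
        if self.rng.random() < self.cancel_rate:
            return "POST /rides/cancel"
        return "POST /rides/accept"

# A driver in Los Angeles who frequently cancels, as in the example above.
flaky = SimulatedDriver("los_angeles", cancel_rate=0.8, rng=random.Random(7))
print([flaky.next_action() for _ in range(5)])
```

Coordinating tens of thousands of such users with varied profiles produces realistic, continuous traffic against the system under test.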

As a by-product of running the simulation framework in staging, we realized that the generated traffic was also valuable for general end-to-end testing. Constantly exercising common endpoints in staging provides a good signal for actual deployments: for example, if a deployment breaks the drop-off endpoint, the deploy's author sees error logs and alerts almost immediately. The simulation also continuously generates up-to-date data on users, vehicles, payments, and so on, reducing the setup work of manual testing during development. Thanks to the load-testing effort, the staging environment became more realistic and useful than ever, and it became common for teams to deploy PR branches there to get consistent feedback on realistic data.

The new problem

Fast forward to 2020, four years after Devbox and Onebox introduced containerized development environments: despite our best efforts, the "Lyft-in-a-box" style of environment was struggling to keep up. The number of engineers using these environments had increased tenfold, and there were now hundreds of microservices supporting a more complex business. While development was still reasonably efficient for services with few dependencies, most development happened on services with large dependency trees, making it very slow to start an environment or run tests in CI.

While these environments and tests were very powerful and convenient, they reached a point where they did more harm than good. We had built a system optimized for testing a small number of services, and when the number of services grew from 5 to 50, from 50 to 100, and beyond, we never reevaluated our strategy. Not only did this require maintaining and scaling a large number of services, it also reduced developer productivity by forcing developers to constantly reason about the entire system rather than a single component.

Let's look at these problems in more detail:

Scalability problem

Scaling the Onebox environment became impractical due to the sheer resources involved and its divergence from production. For example, it is not feasible to run the same observability tooling across hundreds of environments, so when something goes wrong it is hard to pinpoint the exact cause (which of the 70 running services might have the problem?), and people tend to hit the "reset" button a few times before giving up and testing in staging instead.

The staging environment, on the other hand, is easier to scale and more faithful to production. It provides the same logging, tracing, and metrics capabilities to aid debugging. The main disadvantages of deploying to a shared staging environment are: (1) experimental changes may break the environment for everyone else, (2) only one change to a service can be tested effectively at a time, and (3) building and deploying takes more time (minutes) than syncing code and hot reloading.

Maintenance difficulties

Because of the scalability challenges described above, maintaining and optimizing these environments took so much time that they fell behind technologically. Production and staging had moved to Kubernetes for container orchestration and switched to smaller single-process container images, while development still used heavier multi-process images bundled with sidecars and other infrastructure components (metrics, logging, and so on), making images slower to build and download.

Every week there were changes that caused problems affecting only the development environment, not staging or production. Since most developers needed to run most services, a problem with one service could have a big impact. Some teams exacerbated the problem by moving all of their end-to-end testing to staging, leaving their services increasingly broken in the development environment.

Problem ownership is unclear

In the development environment, ownership of problems was unclear. Who should fix a particular broken service: the person who started the Onebox, the service's owner, or the developer infrastructure team? In practice, it usually fell to the developer infrastructure team, yet they were poorly positioned to diagnose and resolve application-level problems (for example, a configuration change that crashes the application at startup).

Bloated tests

Cumbersome integration test suites became a drain on productivity. Hour-long test suites were ubiquitous, running on complex sharding infrastructure and compensating for unstable environments with automatic retries. There were two main drivers of this problem: dependency inflation and the tests themselves. Because dependencies are transitive, a service's dependency set can grow without the service owner noticing, inflating test time. The test suites themselves also grew steadily: we added tests whenever problems arose, but rarely removed any, on the assumption that existing tests were pulling their weight.
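The dependency-inflation effect can be sketched in a few lines: each service declares only its direct dependencies, but CI has to start the transitive closure. The graph below is hypothetical, loosely assembled from the service names that appear in the test examples later in this article.

```python
# Hypothetical direct-dependency graph (illustrative, not Lyft's real one).
DIRECT_DEPS = {
    "api": ["users", "driver_onboarding"],
    "users": ["redis", "experimentation", "fraud"],
    "fraud": ["dynamodb"],
    "driver_onboarding": ["messaging", "dmv_checks", "vehicles", "payments"],
    "messaging": ["email", "experimentation"],
}

def transitive_deps(service, graph):
    """Depth-first walk collecting every service reachable from `service`."""
    seen = set()
    stack = [service]
    while stack:
        for dep in graph.get(stack.pop(), []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

print(len(DIRECT_DEPS["api"]))                   # 2 direct dependencies
print(len(transitive_deps("api", DIRECT_DEPS)))  # 11 services started in CI
```

The service owner sees two declared dependencies, but CI pays for eleven, and any service added anywhere downstream silently grows that number.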

So why did we have to wait hours to merge a PR? To catch bugs before they reach production, of course! In practice, however, this theory did not hold up. Analyzing the integration tests of some of our most actively developed services, we found that 80% or more of the tests were either unnecessary (for example, outdated, or duplicates of existing unit tests) or could be rewritten to run much faster without external dependencies. When tests failed, most failures were false positives that could take hours to debug, while the issues the remaining tests caught would usually have been caught in staging or canary before causing production problems anyway.

# 2013 (monolith), duration: 1 minute
def test_driver_approval():
    """
    Requires:
    - api
    """
    user = get_user()
    approve_driver(user)
    assert user.is_approved

# ------------------------------------------------------------------ #

# 2015 (mostly monolithic, a few services), duration: 3 minutes
def test_driver_approval():
    """
    Requires:
    - api (monolith)
    - users
        - mongodb
    - driver_onboarding
        - mongodb
        - redis
    """
    user = user_service.create_user()
    user = driver_onboarding_service.approve_driver(user)
    assert user.is_approved

# ------------------------------------------------------------------ #

# 2018 (post-decomp, microservices), duration: 20 minutes
def test_driver_approval__california():
    """
    Requires:
    - users
        - redis
        - experimentation
        - fraud
            - dynamodb
        - messaging
            - mongodb
    - driver_onboarding
        - messaging
            - email
        - experimentation
        - dmv_checks
        - vehicles
        - payments
    """
    user = user_service.create_user()
    user = driver_onboarding_service.approve_driver(user)
    assert user.is_approved

def test_driver_approval__newyork():
    ...

def test_driver_approval__montreal():
    ...

As we continued to split out new microservices, integration testing became increasingly unwieldy.

The change process

After starting to move our development environment to Kubernetes about a year ago, a shift in engineering resources became the catalyst for scaling back and rethinking our direction. Maintaining the infrastructure to support on-demand environments had become too expensive and was getting worse over time. Addressing this required a more radical change in how we develop and test microservices: it was time to replace Devbox, Onebox, and CI integration testing with alternatives that could be sustained for a system made up of hundreds of microservices.

Taking a closer look at how developers use the existing environment, we identified three key workflows (shown in purple in the figure below) that are important for maintenance and require investment:

  1. Local development: for any given service, it should be simple and fast to run unit tests or start a web server and send requests.
  2. Manual end-to-end testing: testing how a specific change behaves in the larger system is a key workflow many engineers rely on. We want to extend staging so that developers can test in isolation more easily and safely.
  3. Automated end-to-end testing: despite our over-reliance on this type of testing, we cannot keep shipping hundreds of changes a day without the confidence it provides. We will keep a small number of high-value tests as acceptance tests that run on deployment to production.

Future articles in this series will delve into all three areas, and we’ll discuss the issues, how to deal with them, and what we’ve learned.

References:
[1] Productivity on Microservices at Lyft (Part 1): eng.lyft.com/scaling-pro…
[2] runit - a UNIX init scheme with service supervision: smarden.org/runit/
[3] Envoy Proxy: www.envoyproxy.io/

Hello, my name is Yu Fan. I used to do R&D at Motorola and now do technical work at Mavenir. I have long been interested in communications, networking, back-end architecture, cloud native, DevOps, CI/CD, blockchain, AI, and related technologies.

The official WeChat account is DeepNoMind.