Today we’re excited to announce the open source release of Clutch, Lyft’s extensible UI and API platform for infrastructure tooling. Clutch enables engineering teams to build, run, and maintain user-friendly workflows that incorporate domain-specific safety mechanisms and access controls. Clutch ships with capabilities for managing platforms such as AWS, Envoy, and Kubernetes, with an emphasis on extensibility so that it can host features for any component in the stack.

The dynamic nature of cloud computing significantly reduces the cost of adopting new infrastructure. The Cloud Native Computing Foundation (CNCF) landscape tracks more than 300 open source projects and more than 1,000 commercial products. While organizations are quick to adopt these projects and vendors, each new technology comes with its own set of configurations, tools, logs, and metrics. Most organizations fail to account for the significant up-front and ongoing investment in tooling required to let developers make changes quickly and safely across the stack. So, while new infrastructure is becoming easier to adopt, the ever-growing number of components is difficult to manage, especially as the complexity of the platform and the size of the engineering team grow. Clutch addresses this challenge by allowing infrastructure teams to provide an intuitive and safe infrastructure management interface to their entire engineering organization.

Clutch is the result of a year-long development cycle aimed at the shortcomings in developer experience and tooling at Lyft. Clutch is made up of two core components. The Go back end is designed as an extensible infrastructure control plane that stitches disparate systems together behind a single Protobuf-driven API with common authorization, observability, and audit logging. The React front end is a pluggable, workflow-oriented UI that lets users and developers add new functionality behind a single pane of glass with little code, little JavaScript knowledge, and little maintenance effort.

1 Design and architecture

In terms of design and architecture, Clutch occupies a place in the developer tooling space that differs from other solutions. At the start of the project, we analyzed the existing tools in depth before deciding to build our own. The main objectives of the tool are:

• Reduce mean time to repair (MTTR). When responding to infrastructure alarms, on-call engineers spend too much time reading runbooks and operating complex tools.

• Eliminate accidental outages. During maintenance tasks, severe outages can occur when users following a runbook miss a warning or remove the wrong resources (for example, resources that they believe are unused but that are actually serving a lot of traffic).

• Enforce fine-grained permissions and audit all activity in a common format. Some permissions are too broad because the vendor’s access controls do not support fine-grained control. In addition, when we collected audit logs from a variety of tools for security purposes, it was hard to turn that data into actionable insights on how to improve our tooling.

• Provide a platform that greatly simplifies future tool development. At Lyft’s size, it is hard to succeed at scale without accepting contributions from outside the team; we don’t have the resources to build every feature Lyft needs, let alone support it.

We first looked at the existing vendor UIs and found them inadequate: vendor tools are slow (and in some cases dangerous) to use because they are not specialized. They require unnecessary steps to perform common tasks and present more information than is necessary. Beyond simple access controls, they often offer little protection, so operators can perform actions that seem harmless but actually degrade the system. Operators may also be unfamiliar with a tool, which delays remediation. Engineers are ideally on call only once every four to six weeks, so it is easy to forget how to use a tool, especially given the variety of tasks that must be performed without a specialized interface.

The consequence of relying on vendor tools is high cognitive load due to fragmentation and information clutter.

Clutch, by contrast, is a vendor-neutral tool that integrates disparate systems into a clear, consistent user experience and provides dedicated functionality for performing common tasks with few clicks and little training.

Next we turned to the open source community and found that open source infrastructure management tools are usually limited to a single system, are not designed for broad customization, and do not address cognitive load and fragmentation. In addition, while there are other front-end frameworks for building consoles, none includes an integrated back-end framework with authentication, authorization, auditing, observability, API schemas, and a rich plug-in model. One popular continuous delivery platform addresses the same primary issues as Clutch (for example, reduced MTTR and a user-friendly UI) but requires a significant investment in running microservices and in migrating applications to an architecture different from our own. Clutch back-end features are straightforward to develop, and every API endpoint gets the integrated core features listed above for free. Clutch is also deployed as a single binary that requires very little operational investment.

Finally, we wanted a platform that we could invest in, and it needed to be easy for other internal teams to understand and build on. Clutch provides an integrated and guided development model that makes feature development a straightforward process. In addition to first-class back-end functionality, Clutch’s front end provides unique abstractions for state management and multi-step forms, making front-end development easier for infrastructure teams without significant JavaScript experience.

2 Features

“Control plane” model

Envoy proxy was created at Lyft. Today it is one of the most popular proxies, deployed at many large Internet companies and advancing the standards for cloud networking. Our team has learned a lot from working with the larger community to maintain Envoy. One of the most popular topics among Envoy users is control plane development: in particular, how to systematically integrate the various components so that Envoy can route and report on network traffic efficiently. Clutch resembles an Envoy control plane in that it integrates different infrastructure systems behind a unified API.

Clutch adopts many of Envoy proxy’s core patterns, which emerged from years of work on network control planes.

Like Envoy, Clutch is configuration-driven, schema-driven, and built on a modular, extension-based architecture that makes it suitable for a variety of use cases without sacrificing maintainability. Extending Clutch does not require forking or rewriting it: custom code can be compiled into your application from custom public or private external repositories. These same patterns have allowed large and small organizations with unique technology stacks to converge on a single proxy solution, and we hope they will let similarly diverse organizations converge on an infrastructure control plane such as Clutch.
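The configuration-driven extension model described above can be illustrated with a small sketch. This is not Clutch's actual API: the `Registry`, `Factory`, `Entry`, and `ec2Service` names and the `example.ec2` component name are hypothetical. It only shows the general pattern in which compiled-in extensions register factories and configuration decides which ones are instantiated.

```go
package main

import "fmt"

// Component is anything the platform can host (hypothetical interface).
type Component interface {
	Describe() string
}

// Factory builds a component from its configuration.
type Factory func(cfg map[string]string) (Component, error)

// Registry maps component names to factories. Extensions from external
// repositories would add their own entries at compile time.
type Registry map[string]Factory

// Entry is one configured component: which factory to use and its settings.
type Entry struct {
	Name   string
	Config map[string]string
}

// Build instantiates only the components named in the configuration.
func (r Registry) Build(entries []Entry) ([]Component, error) {
	var out []Component
	for _, e := range entries {
		f, ok := r[e.Name]
		if !ok {
			return nil, fmt.Errorf("no component registered as %q", e.Name)
		}
		c, err := f(e.Config)
		if err != nil {
			return nil, fmt.Errorf("building %q: %w", e.Name, err)
		}
		out = append(out, c)
	}
	return out, nil
}

// ec2Service stands in for an extension; the name is illustrative only.
type ec2Service struct{ region string }

func (s ec2Service) Describe() string { return "ec2 service in " + s.region }

func newEC2Service(cfg map[string]string) (Component, error) {
	region, ok := cfg["region"]
	if !ok {
		return nil, fmt.Errorf("missing required key %q", "region")
	}
	return ec2Service{region: region}, nil
}

func main() {
	reg := Registry{"example.ec2": newEC2Service}
	comps, err := reg.Build([]Entry{
		{Name: "example.ec2", Config: map[string]string{"region": "us-east-1"}},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range comps {
		fmt.Println(c.Describe())
	}
}
```

Because the registry is an ordinary map, a private extension only has to add its factory under a new name; the core code does not change.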

3 Safety and Security

Clutch comes with authentication and authorization components. An OpenID Connect (OIDC) authentication flow provides single sign-on, role-based access control (RBAC) governs authorization, and all operations are audited automatically, with the ability to send audit events to additional sinks such as a Slackbot.

Clutch also has features that reduce the risk of accidents. The guardrails and heuristics typically documented in runbooks can be implemented programmatically. For example, we never allow users to shrink a cluster by more than 50% at one time, as doing so has caused unexpected outages during normal maintenance. In the near future, we plan to capture CPU and other usage data that can be displayed alongside cluster information, and even to limit how far a cluster can be downsized if we determine that the change could lead to downtime. Implementing guardrails and heuristics directly in the tool avoids relying solely on training and runbooks to prevent accidents.

4 Deployment and onboarding

Clutch ships as a single binary containing both the front end and the back end, which makes deployment easy. Many changes can be made through configuration rather than by compiling a new binary.

Other systems that provide infrastructure lifecycle tooling require a complex set of microservices or a migration to an opinionated way of managing and deploying applications. Clutch aims to complement existing systems, not replace them.

5 Frameworks and Components

Clutch is powered by a Go back end and a React front end. It provides a fully functional framework for both back-end and front-end development. All of Clutch’s components are composable, so teams can use only part of the framework or customize it fully.

Such a component and workflow architecture allows developers with limited front-end experience to replace bulky tools or command-line scripts with a clean and easy-to-use step-by-step UI in less than an hour of development.

Clutch’s front-end package provides components that make it easy to build a step-by-step workflow for a consistent and continuous user experience, including:

• DataLayout: a workflow-local state management layer that handles user input and data from API calls.

• Wizard: a UI plug-in for displaying step-by-step forms to the user, customizing elements, and displaying rich information in a consistent manner with minimal code.

The back end of Clutch relies heavily on code generated from Protobuf API definitions. The Protobuf tooling also generates a front-end client, which keeps the back end and front end in sync as the API evolves. The back-end components include:

• Modules: Implementations of the code-generated API stubs

• Services: Used to interact with external data sources

• Middleware: Used to inspect request and response data and to apply auditing, authorization, and so on

• Resolvers: A generic interface for finding resources based on free-form text search or structured queries

The resolver is a Clutch abstraction that we hope will have a major impact on how functionality can be shared across organizations. Resolvers are easily extended with custom resource-location code, allowing operators to locate resources (such as Kubernetes pods or EC2 instances) by the names that are customary within the organization rather than by canonical identifiers. For example, if developers refer to an application as “myService-Staging”, it is easy to add code that interprets such a query as the structured pattern “${application_name}-${environment}”. In addition, the front end automatically generates user input forms from the back-end definition.
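To make the free-form lookup concrete, the sketch below turns a query such as “myService-Staging” into structured fields. The `Query` type and `parseQuery` function are hypothetical, and two conventions are assumed purely for illustration: the environment is the text after the last hyphen, and it is normalized to lower case.

```go
package main

import (
	"fmt"
	"strings"
)

// Query is a hypothetical structured form of a free-text search.
type Query struct {
	ApplicationName string
	Environment     string
}

// parseQuery interprets text as "<application_name>-<environment>".
// It splits at the last hyphen so application names may contain hyphens.
func parseQuery(text string) (Query, error) {
	i := strings.LastIndex(text, "-")
	if i <= 0 || i == len(text)-1 {
		return Query{}, fmt.Errorf("cannot parse %q as <application_name>-<environment>", text)
	}
	return Query{
		ApplicationName: text[:i],
		Environment:     strings.ToLower(text[i+1:]),
	}, nil
}

func main() {
	q, err := parseQuery("myService-Staging")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("application=%s environment=%s\n", q.ApplicationName, q.Environment)
}
```

A real resolver would then use the structured fields to look up matching resources; that lookup is omitted here.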

On the front end, this takes a single line of code:

<Resolver type="clutch.aws.ec2.v1.Instance" />

The rendered form is as follows:

Search dimensions added on the back end are automatically reflected in the form rendered on the front end.

6 Clutch at Lyft

Before Clutch, Lyft engineers relied on a hodgepodge of command-line tools, web interfaces, and runbooks to perform simple tasks. Resolving the most common alerts at Lyft required consulting as many as six different sources of information: the alert, other service dashboards, runbooks, other documentation sources, vendor consoles or scripts, and configuration settings. As Lyft expanded in terms of teams, products, and stacks, we realized that the tools hadn’t kept pace. We had no way to solve these problems with the existing frameworks, which led to the first iteration of Clutch.

Over the past year, Clutch has seen an incredible rate of internal adoption in terms of both usage and development. Clutch has reduced the risk of thousands of infrastructure management operations, each of which had the potential for an accident or delay that could have led to a loss of trust.

At the time of this writing, seven internal engineering teams plan to add new features by the end of 2020, at least half of which will be open source. Engineers (including our amazing interns) are able to develop meaningful features with limited guidance. Most importantly, we can finally see a path to delivering our internal platform through a single unified management interface, making the Lyft infrastructure feel like a product that meets customer needs rather than a patchwork collection of systems and tools.

We received a lot of positive feedback internally, such as:

“I’m glad it exists, otherwise I would still be waiting for the cloud provider’s console tab to load.”

More details on Lyft Clutch can be found in the Lyft Case Study article.

7 Roadmap

Over the course of building Clutch, the product has evolved, and our internal and external roadmap now covers the entire developer experience at Lyft. Our long-term vision is to build a context-aware developer portal that not only provides developers with a set of tools, but also surfaces the most valuable tools and information as soon as users log in.

Upcoming features include:

• Envoy UI, providing users with a real-time dashboard to monitor the network performance and configuration of their distributed applications.

• Chaos testing, integrated with Envoy to allow predetermined fault injection and squeeze testing with automatic shutdown conditions.

• Auto fix, to automatically respond to alerts with the appropriate Clutch operation.

• Security enhancements, including performance upgrades, inspection modes, and two-stage approvals.

• Additional infrastructure lifecycle management capabilities, looking at the status of the cluster to find outliers and performing long-running maintenance tasks.

• Service health dashboards, which use configurable reporting mechanisms to provide developers with feedback on their service status (such as code coverage, costs, and active emergencies).

• Generic configuration management, which allows users to manage complex configurations through a guided UI or otherwise reflect changes in the infrastructure as code declarations.

• A topological map that associates users with the services they own and shows them the relevant data and tools on the landing page.

Authors: Daniel Hochman & Derek Schaller

Translator: Time