Author: Ma Shima

Source: https://madao.me/goodbye-microservices/

The copyright of this article belongs to the author

————————————————

Translated from "Goodbye Microservices: From 100s of Problem Children to 1 Superstar" by Alexandra Noonan

This article describes how Segment's architecture went from a monolith, to microservices, to 140+ microservices, and back to a monolith. The translation is rough; if anything is missing or wrong, please don't hesitate to point it out.

Note: the "destinations" mentioned below are the partner data platforms that events are forwarded to (e.g. Google Analytics, Optimizely).

Unless you've been living in the Stone Age, you know that "microservices" are the most popular architecture around. We at Segment adopted this architecture back in 2015. It served us well in some ways, but it soon became clear that it was making our lives harder in others.

In short, the main pitch for microservices is: improved modularity, reduced testing burden, better functional composition, environmental isolation, and development team autonomy (because the internal logic of each service is self-contained and independent).

Monolithic apps sit at the other end of the spectrum: huge and difficult to test, and they can only be scaled as a whole (if you want to improve the performance of one service, you have to scale the entire application).

In early 2017 we hit a tipping point: the complexity of the microservice tree was causing our development productivity to plummet and our defect rate to skyrocket, as every development team found itself mired in complexity with every change it tried to ship.

We ended up with three full-time engineers spending most of their time just keeping the microservice system alive. At that point we realized something had to change. This article describes how we took a step back and settled on an approach that aligned with our team's needs and with the product.


Why did microservices ever work?

Segment's customer data infrastructure ingests hundreds of thousands of events per second and forwards each of them to partner APIs, which we refer to as server-side "destinations".

There are hundreds of such destinations, for example Google Analytics, Optimizely, or a custom webhook.

A few years ago, when the product first launched, the architecture was simple: a single message queue that received events and forwarded them.

Here, an event is a JSON object generated by a web or mobile application, as in the following example:

{
  "type": "identify",
  "traits": {
    "name": "Alex Noonan",
    "email": "[email protected]",
    "company": "Segment",
    "title": "Software Engineer"
  },
  "userId": "97980cfea0067"
}

Events are consumed from the queue, and the customer's settings determine which destinations the event should be sent to. The event is then forwarded to each of those destinations' APIs, one after another.

This is useful because, instead of building dozens of integrations themselves, developers only need to send their events to a single endpoint: Segment's API.
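
As a rough sketch of that original design (the names and queue API below are hypothetical, not Segment's actual code), the consumer side of that single queue might look like this:

// Hypothetical sketch of the original single-queue consumer.
// `queue`, `loadSettings`, and `destinations` are illustrative names.
async function consume(queue, destinations) {
  while (true) {
    const event = await queue.pop();                    // next event from the shared queue
    const settings = await loadSettings(event.userId);  // which destinations this customer enabled
    for (const name of settings.enabledDestinations) {
      try {
        await destinations[name].send(event);           // call the partner API
      } catch (err) {
        await queue.push(event);                        // failed events go back onto the same queue
                                                        // (simplified: only retryable failures should)
      }
    }
  }
}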

If a request fails, we sometimes retry the event later. Some failures are safe to retry; others are not. A retryable error is one the destination might accept later without any change to the event: for example, a 50x error, rate limiting, or a request timeout. A non-retryable error is a request we are sure the destination will never accept, for example one with invalid credentials or a missing required field.
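
A minimal sketch of such a classification, based only on the examples above (illustrative, not Segment's actual code):

// Illustrative classification of destination errors:
// 50x errors, rate limiting and timeouts are retryable;
// invalid auth or a missing required field is not.
function isRetryable(err) {
  if (err.timeout) return true;                            // request timed out
  if (err.status === 429) return true;                     // rate limited
  if (err.status >= 500 && err.status < 600) return true;  // 50x server error
  return false;                                            // e.g. invalid auth (401/403) or bad payload (400)
}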

At that point, a single queue holding new events interleaved with retries across all destinations led to an obvious problem: head-of-line blocking.

This meant that if one destination became slow or started failing, the queue would flood with its retry requests and the whole pipeline would slow down.

Imagine destination X hits a temporary problem and every request to it times out. Not only does this create a large backlog of requests that have yet to reach destination X, but every failed event is put back on the queue to be retried.

Even though our systems scaled elastically with load, the sudden increase in queue depth would outpace our ability to scale, resulting in delays for the newest events.

Delivery times for every destination would increase just because destination X had a brief outage. Customers rely on us for timely delivery, so we can't afford slowdowns anywhere in the pipeline.

To solve this problem, our team introduced a separate queue for each destination.

This new architecture consists of an additional router process that receives an inbound event and distributes a copy of the event to each selected destination.

Now, if one destination has trouble with timeouts, only its own queue backs up and no other destination is affected. This "microservice-style" separation isolates destinations from one another, which is critical when one destination misbehaves, as they often do.
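
A sketch of what that router step might look like with one queue per destination (the queue API and names here are assumptions for illustration):

// Hypothetical router: fan an inbound event out to per-destination queues.
// `queues` maps a destination name to its own queue; all names are illustrative.
async function route(event, settings, queues) {
  for (const destination of settings.enabledDestinations) {
    // each destination gets its own copy of the event in its own queue,
    // so a slow or failing destination only backs up its own queue
    await queues[destination].push({ ...event });
  }
}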

The case of individual repos

Each destination API has a different request format and requires custom code to transform events to match the format.

A simple example: once again, destination X has an endpoint for updating a birthday, and it expects the field to be named dob, while the Segment event carries it as birthday. The conversion code would look like this:

const traits = {};
traits.dob = segmentEvent.birthday;

Many modern destination endpoints have adopted Segment's request format, so conversions can be this simple. However, they can also be quite complex, depending on the structure of the destination's API.

Initially, when the destinations were being split into separate services, all of the code lived in a single repo. A huge source of frustration was that a single failing test caused the tests for the whole project to fail, and we often spent a lot of time just getting an unrelated broken test back to a passing state.

To fix this, we split each destination's code into its own repo, so that a failing test affected only its own destination. It felt like a very natural transition.

Isolating each destination in its own repo made the tests easy to run, and that isolation let the development teams move quickly when building and maintaining each destination.


Scaling microservices and repos

As time went on, we added more than 50 new destinations, which meant 50 new repos.

To reduce the burden of developing and maintaining these codebases, we created shared libraries for common transforms and functionality, such as handling HTTP requests, so the code stayed more consistent across destinations.

For example, if we want the user's name from an event, the call event.name() works the same from any destination's code.

The shared library checks the event for a name property. If that doesn't exist, it checks for a first name, trying the first_name and firstName variants; it does the same for the last name, and combines the two to form the full name.

Identify.prototype.name = function() {
  var name = this.proxy('traits.name');
  if (typeof name === 'string') {
    return trim(name)
  }
  var firstName = this.firstName();
  var lastName = this.lastName();
  if (firstName && lastName) {
    return trim(firstName + ' ' + lastName)
  }
};

The shared libraries allowed us to build new destinations quickly; their familiarity gave us consistent implementations and reduced maintenance headaches.

Nevertheless, a new problem began to emerge and spread: testing and deploying a change to shared library code affected every destination.

Maintaining it began to require a lot of time and effort. Modifying or improving the shared libraries meant testing and deploying dozens of services, which carried huge risk. When time was tight, engineers would roll the updated version of a library out to only a single destination.

Subsequently, the versions of these shared libraries began to diverge across the destination codebases. The initial benefit of reduced customization between destinations started to reverse as we made bespoke changes for each one.

Eventually, all of the microservices were using different versions of the shared libraries. We could have built tooling to automate rolling out the latest changes, but by this point it wasn't just developer productivity that was suffering: we were running into other problems with the microservice architecture as well.

An additional problem was that each service had a distinct load pattern. Some services handled only a handful of events per day, while others handled thousands of requests per second.

For destinations that handled only a few events, an operator had to manually scale the service whenever there was an unexpected spike in load. (Editor's note: there are solutions for this, but the author is emphasizing the complexity and cost.)

When we did implement auto-scaling, each service required a distinct blend of CPU and memory resources, which made tuning our auto-scaling configuration more of an art (or, frankly, guesswork) than a science.

The number of destinations kept growing rapidly, with the team adding about three per month, which meant more repos, more queues, and more services.

The operational cost of our microservice architecture was growing linearly as well. So we decided to take a step back and rethink the whole pipeline.


Ditching microservices and queues

The first item on the list was consolidating the now 140+ services into a single service; the cost of managing all of them had become a huge technical liability for the team. On-call engineers were losing sleep because they kept having to jump online to deal with sudden spikes in traffic.

However, moving to a single-service architecture was a big challenge at the time. With a separate queue per destination, every worker process would have to check every queue for work, and adding that layer of complexity to the destination service made us uncomfortable.

This was the main inspiration for Centrifuge, which replaces all of the individual queues and is responsible for delivering events to the single monolithic service.

Note: "Centrifuge" is an event delivery system built by Segment (see the link in the original post).

Move to a single Repo

So we started merging all of the destination code into a single repo, which meant merging all of the dependencies and tests into a single repo as well. We knew it was going to get messy.

Across the destinations there were 120 unique dependencies, each pinned to a specific version per destination. As we moved each destination over, we checked whether its code worked with the latest dependencies, and we made sure every destination ran correctly on the latest versions.

With these changes we no longer had to track different dependency versions. Every destination used the same version, which significantly reduced the complexity of the codebase. Maintaining destinations became faster and less risky.

We also needed the tests to run easily and quickly. One of the conclusions we had reached earlier was that the main deterrent to modifying the shared libraries was having to run all of the tests.

Fortunately, the destination tests all had a similar structure: basic unit tests verifying that our custom transform logic was correct and that the HTTP responses from the destination were as expected.
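
For example, a unit test for the dob transform shown earlier might look roughly like this (the mocha/assert usage is standard, but the module and function names are assumptions, not Segment's actual tests):

// Hypothetical unit test for a destination's transform logic (mocha/assert style).
const assert = require('assert');
const { transform } = require('./destination-x'); // hypothetical destination module

describe('destination X', function () {
  it('maps traits.birthday to traits.dob', function () {
    const payload = transform({ traits: { birthday: '1990-01-01' } });
    assert.strictEqual(payload.traits.dob, '1990-01-01');
  });
});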

Recall that our original motivation for splitting each destination's codebase into its own repo was to isolate test failures.

However, that turned out to be a false advantage. Tests that made HTTP requests were still failing with some regularity, and with destinations separated into their own repos, there was little incentive to clean up those failing tests.

That poor hygiene led to a frustrating downward spiral: small changes that should have taken only a few hours often ended up taking days, or even a week.


Build a resilient test suite

Failing HTTP requests to destinations were the primary cause of test failures, and unrelated issues such as expired credentials should not be able to fail our tests.

We also found that some destinations' tests were much slower than others. Some took up to five minutes to run, and the full test suite took around an hour to complete.

To solve this, we built "Traffic Recorder", a tool based on yakbak for recording and saving test traffic.

Whenever a test runs for the first time, its requests and the corresponding responses are recorded to a file; subsequent test runs replay the saved responses instead.

These recorded responses are checked into the repo so that the tests stay consistent. Our tests no longer depend on live HTTP requests over the network, which paved the way for the single repo.
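
yakbak works by putting a small recording proxy in front of a destination's API: the first run forwards the request and saves the response as a "tape", and later runs replay the tape. A minimal sketch of that idea (the host and directory below are placeholders; this is not Segment's Traffic Recorder itself):

// Minimal yakbak recording proxy: the first request is forwarded to the real
// endpoint and saved as a "tape"; later test runs replay the saved response.
// The host and tape directory are placeholder values.
const http = require('http');
const yakbak = require('yakbak');

http.createServer(yakbak('https://api.example-destination.com', {
  dirname: __dirname + '/tapes'   // recorded responses get checked into the repo
})).listen(8080);

// Tests then point their HTTP requests at http://localhost:8080
// instead of the real destination API.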

I remember that after integrating Traffic Recorder for the first time, we ran the entire suite and the tests for all 140+ destinations finished in milliseconds. It used to take minutes to test a single destination. It felt like magic.


Why the monolith works

Once the code for every destination lived in a single repo, it could be merged into a single service. With every destination in one service, developer productivity improved substantially. Instead of deploying 140+ services because a shared library changed, an engineer can redeploy the one service within a minute.

The speedup was real: during the period we ran the microservice architecture, we made 32 improvements to our shared libraries; since moving to the monolith we have made 46, more in the past six months than in all of 2016.

The change also greatly benefited our operations engineers. With every destination living in one service, scaling to meet demand became much easier, and the large worker pool can easily absorb spikes in load, so we no longer worry about the small services that used to be caught out by sudden traffic.


The bad

While there are huge benefits to switching to monolithic applications, here are the downsides:

1. Failure isolation is hard. With everything running in a single application, if a bug in one destination causes the service to crash, it takes every other destination down with it (because it is all one service).

We have comprehensive automated testing, but tests only get you so far. We are now working on a much more robust way to keep one destination's failure from affecting the entire monolith; one possible approach is sketched after this list.

2. In-memory caching is less effective. Previously, with one service per destination, our low-traffic destinations ran only a handful of processes, which meant their in-memory caches stayed hot.

Now that the cache is spread across 3000+ processes, the hit rate is much lower. In the end, we accepted this loss in light of the operational benefits.

3. Updating the version of a shared library may break several destinations at once. When we consolidated the project we resolved the earlier dependency problems, which means every destination now uses the latest version of the shared libraries.

But the next shared-library update may require changing some destination code as well. In our view this is worth it, because the improved automated test suite lets us find problems with new dependency versions much faster.
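
As a sketch of the kind of containment mentioned in point 1 above (an assumption on our part, not a description of what Segment actually built), each destination's handler can be wrapped so that an unexpected error is reported instead of crashing the shared process:

// Illustrative failure containment: catch a destination's unexpected errors
// so one misbehaving destination cannot take the whole monolith down.
async function sendSafely(destination, event) {
  try {
    await destination.send(event);
  } catch (err) {
    // contain the failure: report it and let the other destinations carry on
    console.error('destination ' + destination.name + ' failed:', err);
  }
}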


Conclusion

Our initial microservice architecture worked for a time; it solved our immediate performance problems by isolating destinations from one another.

However, we were not set up to handle the explosion in the number of services. We lacked the proper tooling for testing and deploying microservices when bulk updates were needed, and as a result our development efficiency declined.

Moving to a monolith freed us from the operational problems while significantly improving developer productivity. We did not make the transition lightly; we had to be sure it would work.

1. We needed a solid test suite before putting everything into one repo. Without it, we would likely have ended up splitting things apart again. Constantly failing tests had hurt our productivity in the past, and we did not want that to happen again.

2. We accepted the trade-offs inherent in a monolithic architecture and made sure the outcome justified them. We are comfortable with the sacrifices we made.

There are many factors to consider when choosing between a monolith and microservices. In some parts of our infrastructure, microservices work well; but for our server-side destinations, this architecture was a perfect example of how it can hurt productivity and performance.

But at the end of the day, our solution was a monolith.

End


