Three years ago, at age 22, I got my bachelor’s degree in Computer Science (CS) and joined LinkedIn. In my last academic year, a recruiter found me through my LinkedIn profile and introduced me to SRE positions. I didn’t know what SRE was, but I decided to give it a try. I passed the interview and landed the first job of my life. I knew I should be happy to work at a company like LinkedIn, but what exactly was SRE, and how would I perform there?

While SRE has been around for years, a lot of people are still unfamiliar with the role, just as I was when I first graduated. At LinkedIn, we define SRE through three core principles:

  • Site operation and security: We need to ensure that the site is performing as expected and that user data is secure.

  • Give developers ownership: Teamwork is needed to ensure that LinkedIn’s code is reliable and that the system is scalable.

  • Operations is an engineering problem: People tend to think of operations as manual work, but LinkedIn is committed to automating day-to-day operations.

All of these definitions and core principles are fine, but what did SRE mean to me? I soon found a few things that surprised me.

First of all, in terms of “site operation and security”, how do we ensure that the site is always up and running? We have engineers on call in shifts who can solve production problems and provide 24/7 coverage. I would soon have to take my place in my team’s on-call rotation: if there was a problem at 3 a.m., I would get a phone call and be expected to fix it quickly. Having never been in a situation like that before, I was very hesitant about going on call. LinkedIn also has a number of custom tools that I had never used, and the amount of knowledge required to operate them was daunting.

To give developers ownership, I had to build good relationships with my teammates and with developers. When I looked around at the first team I joined, I felt like an outlier.

There were very few people my age on the SRE team, very few with CS degrees, no one as inexperienced as I was, and no other female engineers. My peers who graduated with me did not become SREs; most of them became developers. It got me thinking about where I belonged, why I was taking on a role I had never played before, and how different I was from everyone around me. Then something clicked: “operations is an engineering problem” means that I can solve engineering problems by writing code, and I enjoy writing code.

After working at LinkedIn for a few months, I started to settle into the SRE role. My team is responsible for the mobile applications and the desktop home page, so we deal with a lot of user traffic. I buckled down to learn all the custom tools and began to feel comfortable with them. To my surprise, I turned out to be very effective when I was on call. During my first on-call shift, I diagnosed a problem and fixed it. The VP of SRE praised me at the time, and I still have the screenshot of that chat.

I grew more and more confident. Even knowing I was less experienced than my colleagues, I was able to build good working relationships with them. I kept writing automation and helped deploy a brand new mobile API (Voyager) into production. It was a complete overhaul of our mobile application and the strongest evidence yet that I could do the SRE job. I could show the new app to my parents and friends: “Look, I did that!” I really started to feel like I was getting my bearings, until the incident.

After about a year at LinkedIn, a developer on the Voyager team asked me to deploy the application to production. At the time, this was completely routine. As an SRE, I knew our deployment tools so well that I could easily help developers with it. When I deployed the code to production, we found that the new code broke the profile page in the mobile application. Since viewing other people’s profiles is a core use case for LinkedIn, I wanted to fix the problem as quickly as possible. I issued a customized rollback command to get the bad code out of production. After the rollback succeeded, I checked Voyager’s health check, and everything looked fine.

There is a saying in SRE that “every day is Monday for operations”. It means that our systems are always in a state of flux, and our teams need to be on call to solve problems that can crop up at any time. What happened next that day is a good example: suddenly, no one could access LinkedIn. Further investigation revealed that no one could load any page under Linkedin.com. We soon found that the traffic layer was down. The traffic layer is responsible for routing requests made by browsers or mobile applications to servers on the back end. Because it was down, it could not route traffic or complete any requests. While the issue affected the services my team was responsible for, we had no control over the traffic layer, so we stepped back and let that team debug the problem.

After the traffic team had been debugging the problem for more than 20 minutes, I noticed that Voyager was behaving very strangely. Its health check would report healthy, then flip back to unhealthy seconds later. A service should normally stay in one of those states, not switch back and forth between them. I logged into the Voyager hosts and found that Voyager was completely overloaded and unresponsive, and that it was affecting the entire traffic layer.

How could an API that serves data only to mobile applications affect the entire site? There is an implicit trust contract between the traffic layer and other LinkedIn services: if a service reports that it is healthy, the traffic layer will hand it a connection and expect the service to give that connection back within a reasonable amount of time. Voyager, however, kept saying it was healthy when it wasn’t, so the traffic layer kept giving it connections that it could never reclaim, and Voyager ended up hoarding connections from the entire pool. All of the traffic layer’s eggs were in Voyager’s basket, and Voyager couldn’t give them back, leaving the traffic layer useless.
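To make that failure mode concrete, here is a minimal sketch of the idea, with invented names rather than LinkedIn’s actual traffic-layer code: a proxy that trusts health checks keeps leasing connections to any backend that claims to be healthy, and a single backend that never returns them starves every other service.

```python
class ConnectionPool:
    """Toy model of the traffic layer's shared connection pool."""

    def __init__(self, size):
        self.free = size
        self.leased = {}  # backend name -> connections currently held

    def lease(self, backend):
        """Hand a connection to a backend that claims to be healthy."""
        if self.free == 0:
            raise RuntimeError("pool exhausted: no connections left for any service")
        self.free -= 1
        self.leased[backend] = self.leased.get(backend, 0) + 1

    def release(self, backend):
        """A well-behaved backend returns its connection promptly."""
        if self.leased.get(backend, 0) > 0:
            self.leased[backend] -= 1
            self.free += 1


pool = ConnectionPool(100)

# Voyager keeps passing the health check, so it keeps getting connections,
# but it is too overloaded to ever call release().
for _ in range(100):
    pool.lease("voyager")

# The next request for ANY service now fails, even though that service is fine.
try:
    pool.lease("homepage")
except RuntimeError as err:
    print(err)  # pool exhausted: no connections left for any service
```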

We knew we had to restart Voyager to return all of the connections to the traffic layer. After the restart command was issued, the deployment tool confirmed that the restart had succeeded, but in fact the tool had failed to restart the service. Since we couldn’t trust the deployment tool, we had to manually log in to each Voyager host and kill the service. Finally, the traffic layer recovered; our remedy had worked.

Hundreds of LinkedIn engineers began asking the question, “Why did this happen?” This incident was the worst I’ve seen in my three and a half years at LinkedIn. For an hour and 12 minutes, no one could reach the site through Linkedin.com. After several hours of investigation, we found that the root cause of the problem was me.

Earlier that day, when I issued the rollback command, my first priority was to get the buggy code out of production as quickly as possible. To do that, I modified the rollback command to make it roll back faster. Typically, deployments happen in 10% batches: if you have 100 Voyager hosts, only 10 are taken out of service and rolled back at a time; once the first batch finishes, the next 10 go, and so on. I rewrote the command to use 50% batches, which meant half of the hosts would be down at once. The remaining half couldn’t handle all the traffic, became overloaded and completely unresponsive, and turned into the catalyst for a perfect storm that took down the rest of the site.
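As a rough illustration (a hypothetical sketch, not LinkedIn’s actual deployment tool), the difference between the two batch sizes comes down to how many hosts are still serving traffic while a batch is being rolled back:

```python
import math

def rollback_in_batches(hosts, batch_fraction):
    """Roll hosts back in batches; hosts in the current batch are out of rotation."""
    batch_size = max(1, math.ceil(len(hosts) * batch_fraction))
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        still_serving = len(hosts) - len(batch)
        print(f"rolling back {len(batch)} hosts, {still_serving} still serving traffic")
        # ...deploy the previous version to each host in `batch` here...

hosts = [f"voyager-{i:03d}" for i in range(100)]

rollback_in_batches(hosts, 0.10)  # the default: 10 hosts out at a time, 90 serving
rollback_in_batches(hosts, 0.50)  # my override: 50 hosts out at a time, only 50 serving
```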

I made a big mistake issuing that rollback command. I was nervous about having put broken code into production, and I let that pressure influence my decisions. If I hadn’t modified the rollback command, the outage would have lasted five minutes at most. It actually takes a lot of factors to bring down an entire website, and unfortunately several other factors contributed to this one.

First, we had a tool that was supposed to catch problems in code before deployment, but because it had recently been returning unreliable results, the developers decided to bypass it and push the code directly to production. Then, once the code was in production, our deployment tool reported that it had successfully restarted Voyager when it had not. All in all, instead of helping us that day, our tools dug a hole for us.

Voyager, as I mentioned, had only just been introduced at LinkedIn. It used a new third-party framework that wasn’t in use in many other parts of LinkedIn, and it turned out that two serious bugs in that framework made the problem worse. First, the health check mechanism did not work properly once the application became overloaded; that’s why Voyager kept reporting that it was healthy when it wasn’t. Second, when the application was overloaded, stop and start commands did not actually execute, yet the framework reported that they had succeeded; that’s why our tool claimed the restart worked when it hadn’t. The incident exposed issues we had never considered before, issues that may not even have existed before but that arise as the technology stack grows larger and more complex.
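The first of those bugs is easier to see with a sketch. The names and threshold below are invented, not the framework’s actual code, but they contrast a check that blindly reports healthy with one that inspects its own load and fails fast, so the traffic layer can pull the instance out of rotation before it starts hoarding connections.

```python
import queue

REQUEST_QUEUE = queue.Queue(maxsize=1000)  # pending work for this instance
OVERLOAD_THRESHOLD = 0.9                   # fraction of capacity we tolerate

def health_check_broken():
    # Effectively what Voyager did once the framework's check broke under load:
    # keep answering "healthy" no matter how overloaded the instance is.
    return {"status": "healthy"}

def health_check_overload_aware():
    # Check our own saturation first and fail fast, so the traffic layer can
    # take this instance out of rotation instead of handing it more connections.
    saturation = REQUEST_QUEUE.qsize() / REQUEST_QUEUE.maxsize
    if saturation >= OVERLOAD_THRESHOLD:
        return {"status": "unhealthy", "reason": f"queue {saturation:.0%} full"}
    return {"status": "healthy"}
```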

Finally, if the failure hadn’t been misdiagnosed earlier in the day, the outage might have been much shorter. For example, if my team hadn’t stepped back and left the traffic team to diagnose the problem on their own, things might have gone differently.

The fact that I was responsible for the outage was hard for me to accept, and rebuilding my confidence felt like hitting my head against a wall. Fortunately, LinkedIn’s culture focuses on problems, not people. Everyone understands that if a single person can bring down the website, there must be many other problems underneath. So instead of blaming me, our engineering organization made changes to prevent this from ever happening again.

The first was a site-wide change freeze: no code could be deployed unless it was a critical fix. After months of engineering effort, our site became far more resilient. We also did a complete re-evaluation of our tooling, because not only had it failed to help us that day, it had actively caused us trouble.

We ended up declaring a code yellow for both tool systems. A code yellow is our internal way of saying, “Something has gone wrong and we need to be careful.” A team that declares a code yellow focuses all of its effort on fixing problems rather than building new features. It’s an open and honest way of dealing with problems, rather than sweeping them under the carpet. We have since built a new deployment system that is easier to use and more reliable.

Of course, this experience also changed me personally. At first, I was deeply disappointed in myself. After taking down the site, I didn’t know how to face my colleagues, or whether they would ever understand. But the team supported me, and I learned how to stay calm when incidents happen. Before the outage, I would have felt flustered and stressed in the face of an event like this. Now I realize that it is much better to think twice than to act rashly and risk an even bigger outage. If I had paused to catch my breath before issuing the rollback command, I might have thought twice about using such a large batch size. Since the incident, I have learned how to stay calm under pressure.

Since the incident, I’ve also found more of a community at work, especially other women in SRE, to talk to about my doubts and concerns. That grew into what is now the Women in SRE (WiSRE) group at LinkedIn. Because of these women, and because I’ve found my own place in the field, I finally feel like I truly belong in SRE.

Finally, I realized that sometimes breaking things pays off. This incident led to a lot of technical changes that made LinkedIn more reliable. I’ve taken that idea to heart and joined another SRE team at LinkedIn, Waterbear, which deliberately injects failures into applications to see how they react and then uses that information to make them more resilient. It’s exciting to have turned the lowest point of my career into a passion for resilience.

Original English article: https://venturebeat.com/2018/10/13/what-i-learned-by-bringing-down-linkedin-com/