How to use Go to build a real-time distributed mobility platform at hundred-million scale

Preface

Grab is the largest mobility platform in Southeast Asia, serving 39 cities in seven countries, with 36 million app downloads. As business volume kept growing, Grab switched from Rails and NodeJS to Go to remove system performance bottlenecks and build a highly available architecture. In addition to all backend services, the streaming data system, which processes trillions of data items every day, is also built in Go. This post is based on the Gopher China 2017 talk by Gao Hao, a senior engineer at Grab. Here is a summary of his speech.

Grab is a Southeast Asian mobility platform that solves transportation problems for people in the region. Our main market is Southeast Asia; we have no plans to enter the Chinese market, and we have a strategic partnership with Didi. Founded in 2011, Grab now operates in 39 cities across seven Southeast Asian countries, with 710,000 drivers and 36 million app downloads. We are a very young startup with R&D centers in Singapore, Beijing, Seattle, Vietnam and Indonesia. I work at the Seattle R&D center as the tech lead of Grab's big data R&D team.

Grab’s former technology stack

Figure 1

Like other startups, we built our first version with mature, well-known tools. We started with Rails and NodeJS. From day one, all of our services ran on Amazon's cloud, and we also used Travis CI and MySQL.

Grab’s current technology stack

Figure 2

As business volume kept growing, the previous stack gradually could not carry the scale of our business, so we made a major change. The current technology stack is shown in Figure 2. We still run our services on Amazon's cloud, and we do a lot of work on big data and machine learning. The most striking item, and the one most relevant to this conference, is Go on the far right. Grab's backend services are now built with Go.

Why the switch to Go?

First of all, our services were under heavy performance pressure and went through a dark period, breaking down several times a day. We thought about which direction to move in to get a high-performance architecture overall, and finally chose Go, for several reasons. First, Go's language specification is very simple and easy to pick up; our existing programmers could switch to Go quickly, and the learning curve is not steep. Second, Go has a very good toolchain and a built-in testing framework, which help you write consistent, well-formed code. Third, deployment is very convenient: once a service is packaged it is easy to deploy, whereas before we had to deploy a pile of things. Finally, Go's performance is very good. In our experience, after moving to Go the number of cloud instances we needed dropped by 90%, and response latency dropped by 80% on average.

What does Grab do with Go?

We write a lot of things in Go, and all of the backend services I just mentioned are now built in Go. There are about 50 microservices, and the number is growing. We now have 300 Gopher engineers and aim to recruit 800 this year. If you are familiar with ride-hailing services, you will know that they have to process a lot of real-time information, which is what supports dispatching and decision-making. That is the job of the streaming data system, which is also written entirely in Go. There are also an API gateway, an RPC and RESTful framework, and so on. We also wrote our own ORM. Why? As we all know, a common bottleneck in backend services is the data system. With our own ORM in between, the business logic of a service does not need to care about the choice of, or migration between, specific database technologies, and this architecture gave us a great deal of freedom as we migrated to microservices.

We also built a CI system in Go, along with a very important machine learning platform. Machine learning is a hot topic right now, and it matters a great deal in our business. Every day hundreds of thousands of passengers want a ride, and we have to decide how to assign each of them a nearby driver sensibly. That depends on the distance between driver and passenger, on driver and passenger ratings, and on other factors; these are problems machines can help us solve. The next thing we are building is a serverless platform, similar to AWS Lambda. You do not need to care about the machines behind a function, the operation and maintenance of the service, the runtime environment or the deployment; all you care about is your own business logic. Lambda supports only a few languages, so we came up with the idea of building our own system for Go. This platform helps us iterate on product features faster and makes developers more efficient.

Grab’s Go practice

Let me share some of the lessons we have learned from two and a half years with Go. There are several important points: first, how to manage your code; second, distributed tracing, which is a hot topic in the field of microservices; then, how to test Go; then, how to manage the quality of your code; and finally, the bugs and problems we have run into with Go.

Code management

Our approach to code management may differ from that of many companies: the Go code for all of our backend services is stored in a single repository. This has several advantages. First, versions stay consistent. When services reference a shared library from separate repositories, service A and service B can end up on different versions after the library is upgraded. If everything is in one repository, this problem does not exist, because every version is visible to everyone. Second, code sharing and reuse are maximized, because you can see the code written for other services. You may have seen programmers in different teams reinvent the wheel: a company can end up with a dozen versions of a time-formatting function because you did not know someone else had already written one, so you wrote another. With everything in one repo, it is all visible.

Another benefit is that dependency management is very simple. If a shared library lives in a different repo and you want to update it, you have to maintain separate dependency management for it; with a single code organization, you can always build against the shared library directly. Code changes can also be made atomically. For example, when we started writing Go we had no experience and were unfamiliar with many best practices. After a while we would discover a nice pattern and want to apply it to many shared library functions. With separate repos we would have to change the shared library, then change the calling code in service A, then the calling code in service B, and any of these steps could cause a big problem. Because our code is all in one repo, it can all change together. This makes large-scale refactoring and codebase-wide updates possible. Teams can also cooperate better, because communication barriers are removed. Team boundaries become more flexible too: you may have experienced moving a service from one team to another, and in a single repo that is just a folder change. Code ownership is naturally transparent as well: if the dispatch team has a dozen services and they all sit under the dispatch team's folder, the ownership of the code is very clear.

That is how we manage our code. This approach has many benefits, but it also has drawbacks. One of them is that the CI systems on the market are not optimized for this kind of code organization, so using them is very inefficient. That is why we took the time to build our own CI. All in all, this approach has its benefits, but you need a strong enough toolchain to realize them.

Distributed Tracing

Figure 3

Figure 3 shows the relationships between the services in our system in February this year, and you can see the chaos. When a startup's architecture evolves from a single application into large-scale microservices, the picture gets messy. What used to be functions inside a single application became services, and each service accumulated more and more dependencies. Distributed tracing helps you quickly discover and diagnose service problems when they occur. There are several scenarios: a request takes three seconds to complete, and you need to find where most of the time goes; how to locate a single point of failure; how to find cyclic dependencies; how to locate fan-in and fan-out. These are the issues distributed tracing can help solve.

It works as follows. When a request enters your system, the API gateway generates a globally unique traceID and injects it into the request header. A spanID is generated for each node the request passes through; the span is indexed by traceID plus spanID, and other metadata is recorded with it. The tracing information is passed along with the request as it flows through the system. When the request ends, all diagnostic information is aggregated using the traceID as the key.
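The steps above can be sketched in Go roughly as follows. This is a minimal illustration, not Grab's gateway: the header name `X-Trace-Id`, the ID lengths, and the names `withTrace`, `startSpan` and `handleOnce` are all assumptions made for the example.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// traceHeader is a hypothetical header name; the talk does not say which one is used.
const traceHeader = "X-Trace-Id"

type ctxKey struct{}

// newID returns n random bytes, hex-encoded.
func newID(n int) string {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// Span records one node on a request's path, indexed by TraceID plus SpanID.
type Span struct {
	TraceID, SpanID, Name string
}

// withTrace is gateway-style middleware: it reuses an incoming trace ID or
// generates a new one, then stores it in the request's Context.
func withTrace(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(traceHeader)
		if id == "" {
			id = newID(16)
		}
		ctx := context.WithValue(r.Context(), ctxKey{}, id)
		w.Header().Set(traceHeader, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// startSpan creates a span for the current step using the trace ID carried in ctx.
func startSpan(ctx context.Context, name string) Span {
	id, _ := ctx.Value(ctxKey{}).(string)
	return Span{TraceID: id, SpanID: newID(8), Name: name}
}

// handleOnce sends one in-memory request through the middleware and returns
// the spans the handler recorded.
func handleOnce(incomingTraceID string) []Span {
	var spans []Span
	h := withTrace(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		spans = append(spans, startSpan(r.Context(), "auth"))
		spans = append(spans, startSpan(r.Context(), "billing"))
	}))
	req := httptest.NewRequest(http.MethodGet, "/ride", nil)
	if incomingTraceID != "" {
		req.Header.Set(traceHeader, incomingTraceID)
	}
	h.ServeHTTP(httptest.NewRecorder(), req)
	return spans
}

func main() {
	for _, s := range handleOnce("") {
		fmt.Printf("trace=%s span=%s name=%s\n", s.TraceID, s.SpanID, s.Name)
	}
}
```

Every span of one request shares the same trace ID, which is what makes aggregation by traceID possible once the request has finished.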

Figure 4.

Figure 4 shows a basic trace. As you can see, the first thing that happens to a request from a client may be an auth check, then a request for the user's billing information, and then a read of other resources. After these three nodes have done their work, the results are aggregated and the request flows out of the system. That is the basic workflow.

Figure 5

Figure 5 is a real picture from our system. This is a request for a driver's rating: after it enters, it calls another backend service, which in turn fetches the driver's rating. With a picture like this, you can quickly narrow things down when diagnosing performance problems or system bottlenecks. You can also control the granularity: we only trace these calls, but you could add spans for other work, such as time calculations. You can also analyze against the contracts between services. For example, if the contract between service A and service B says B must respond within 100 milliseconds and B actually took 120 milliseconds, the picture lets you go to the maintainer of service B and ask why it is slow. This enables fine-grained monitoring.
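The aggregation and contract check described above can be sketched like this. It is only an illustration under assumptions: `SpanRecord`, `groupByTrace`, `overBudget` and the millisecond budgets are hypothetical names and values, not part of any tracing product.

```go
package main

import "fmt"

// SpanRecord is a hypothetical finished span as a collector might store it.
type SpanRecord struct {
	TraceID    string
	Service    string
	DurationMs int
}

// groupByTrace aggregates spans using the trace ID as the key.
func groupByTrace(spans []SpanRecord) map[string][]SpanRecord {
	out := make(map[string][]SpanRecord)
	for _, s := range spans {
		out[s.TraceID] = append(out[s.TraceID], s)
	}
	return out
}

// overBudget returns the services in one trace that took longer than the
// latency budget agreed for them. Services without a budget are ignored.
func overBudget(trace []SpanRecord, budgetsMs map[string]int) []string {
	var slow []string
	for _, s := range trace {
		if b, ok := budgetsMs[s.Service]; ok && s.DurationMs > b {
			slow = append(slow, s.Service)
		}
	}
	return slow
}

func main() {
	spans := []SpanRecord{
		{TraceID: "t1", Service: "A", DurationMs: 40},
		{TraceID: "t1", Service: "B", DurationMs: 120},
		{TraceID: "t2", Service: "B", DurationMs: 80},
	}
	budgets := map[string]int{"B": 100} // B must answer within 100 ms
	for id, trace := range groupByTrace(spans) {
		fmt.Println(id, "over budget:", overBudget(trace, budgets))
	}
}
```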

Figure 6.

In Go, this information is passed along through the Context. So my advice is to use Context extensively. If you are just starting out with Go and have not used Context much, take the time to adopt it now, because retrofitting Context into a large codebase takes a lot of effort. Context provides very useful helper functions, which I listed in the slides; you can also learn the Context pattern from the Go blog at https://blog.golang.org/context.
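As a small illustration of what Context gives you beyond carrying values, the sketch below passes a deadline down to a downstream call. `fetchRating` and `ratingWithBudget` are hypothetical names, and the sleep stands in for a real network call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetchRating is a hypothetical downstream call. It finishes after `work`
// unless the Context is cancelled or its deadline passes first.
func fetchRating(ctx context.Context, work time.Duration) (float64, error) {
	select {
	case <-time.After(work):
		return 4.8, nil
	case <-ctx.Done():
		return 0, ctx.Err()
	}
}

// ratingWithBudget calls fetchRating with a time budget and reports whether
// the deadline was exceeded.
func ratingWithBudget(budgetMs, workMs int) (rating float64, timedOut bool) {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(budgetMs)*time.Millisecond)
	defer cancel() // always release the timer held by the Context

	r, err := fetchRating(ctx, time.Duration(workMs)*time.Millisecond)
	return r, errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(ratingWithBudget(100, 10)) // the call fits inside the budget
	fmt.Println(ratingWithBudget(20, 200)) // the call is cut off by the deadline
}
```

Because the Context is the first parameter, the same value can also carry the trace ID, so adding tracing later does not change any function signatures.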

I also want to mention OpenTracing, which is not a framework but a standard, with libraries that let different systems be traced and aggregated in the same way. Ideally you would not have to do anything except turn on the OpenTracing switch, which is really great; whoever masters the standard masters the world. Ideally, the databases we use would support all of this as well, so that we could have very transparent monitoring of every aspect of every service.

Go Testing

There are two types of tests we write most often at Grab. The first is unit testing and the other is end-to-end testing.

Let's start with unit testing. The code in Figure 7 should be familiar to everyone: table-driven testing. At its most basic, you define a list of test scenarios, each stating the input to the function under test, the expected output and the expected error, and then loop over the scenarios to test them. This scales very well: when you want to test another scenario, you only need to add an entry to the struct slice that defines the scenarios. We also all use the Potency package in our unit tests; it provides handy helper functions and is part of our internal conventions.

Figure 7.
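The table-driven pattern described above can be sketched as follows. This is not the code from Figure 7; the `fare` function and its pricing rule are made up for the example. To keep the sketch runnable as a single program, the loop collects failure messages; in a `_test.go` file the same loop would sit inside `func TestFare(t *testing.T)` and call `t.Errorf`.

```go
package main

import (
	"errors"
	"fmt"
)

var errNegativeDistance = errors.New("negative distance")

// fare is a hypothetical function under test:
// 300 cents base price plus 1 cent per 10 metres.
func fare(metres int) (int, error) {
	if metres < 0 {
		return 0, errNegativeDistance
	}
	return 300 + metres/10, nil
}

// fareCases is the table: each entry names a scenario and states the input,
// the expected output and the expected error.
var fareCases = []struct {
	name    string
	metres  int
	want    int
	wantErr error
}{
	{"zero distance", 0, 300, nil},
	{"short trip", 1500, 450, nil},
	{"negative distance", -1, 0, errNegativeDistance},
}

// runFareCases loops over the table and returns one message per failing scenario.
func runFareCases() []string {
	var failures []string
	for _, tc := range fareCases {
		got, err := fare(tc.metres)
		if got != tc.want || !errors.Is(err, tc.wantErr) {
			failures = append(failures, fmt.Sprintf("%s: got (%d, %v), want (%d, %v)",
				tc.name, got, err, tc.want, tc.wantErr))
		}
	}
	return failures
}

func main() {
	if failures := runFareCases(); len(failures) > 0 {
		fmt.Println("FAIL", failures)
		return
	}
	fmt.Println("all scenarios passed")
}
```

Adding a scenario means adding one line to `fareCases`; the loop does not change.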

Unit testing brings up another topic. When we started writing Go we were young and naive, and that led to a lot of bad habits and bad practices. One of the things in Figure 7 is that the database is a global singleton. I do not want my tests to write to a database. For one thing, the CI database itself is unstable, so when a test fails I cannot tell whether the database is wrong or my program is wrong. Alongside global singletons, we also wrote a lot of code that defines functions as variables and then simply replaces those function variables in tests elsewhere. Your code structure gets ugly, and this is a bad practice.

Figure 8.

After continued hard work, we now have somewhat better practices, as shown in Figure 8. There are two changes: the database is hidden behind an interface, and it is injected into the Server as a dependency. The advantage is that each test can build its own customized Server, so tests do not affect each other, the code is much easier to write, and the architecture is much cleaner.
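A minimal sketch of this pattern, assuming a hypothetical `RatingStore` interface and `Server` type (this is not the code from Figure 8):

```go
package main

import (
	"errors"
	"fmt"
)

// RatingStore is the narrow interface the server needs from its database.
type RatingStore interface {
	DriverRating(driverID string) (float64, error)
}

// Server receives its store as a dependency instead of reaching for a global.
type Server struct {
	store RatingStore
}

// NewServer injects the store; production code would pass a real database client.
func NewServer(s RatingStore) *Server { return &Server{store: s} }

// RatingLabel holds the business logic we want to unit test.
func (s *Server) RatingLabel(driverID string) (string, error) {
	r, err := s.store.DriverRating(driverID)
	if err != nil {
		return "", fmt.Errorf("load rating for %s: %w", driverID, err)
	}
	if r >= 4.5 {
		return "top", nil
	}
	return "standard", nil
}

var errNotFound = errors.New("driver not found")

// fakeStore is an in-memory stand-in for tests; no real database is touched.
type fakeStore map[string]float64

func (f fakeStore) DriverRating(id string) (float64, error) {
	r, ok := f[id]
	if !ok {
		return 0, errNotFound
	}
	return r, nil
}

func main() {
	// Each test can build its own Server with its own data.
	srv := NewServer(fakeStore{"d1": 4.9, "d2": 3.7})
	fmt.Println(srv.RatingLabel("d1"))
	fmt.Println(srv.RatingLabel("d2"))
	fmt.Println(srv.RatingLabel("missing"))
}
```

Because every test constructs its own `Server`, a failure points at the code under test rather than at a shared or flaky database.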

I do not have enough time to go into end-to-end testing, so I will only mention it briefly. We have separate Staging and Production clusters, and end-to-end testing takes place in Staging. The important thing about end-to-end testing is that you need a real environment: the database has to be real, and everything else has to be real. We write end-to-end tests in two ways, with Postman and with Go's own testing framework.

Code quality management

We have talked a lot about testing, and testing is about finding problems in code. Those problems exist because the code was not written well, so the most important issue of all is how to manage the quality of your code. Our view is that code review is very important, but its importance is often overlooked. I believe no company lists code review as a KPI. When a colleague's code is not well written, how do you tell them? When I see a piece of messy code, I worry that if my words are too harsh they will hurt the relationship, and if they are too vague they will hurt quality. So the importance of code review is often lost among trivial matters.

Figure 9.

To solve this problem, we use good tools to make code review more efficient and to reduce the friction it causes between people. We mainly use three tools: Phabricator, Jenkins and Slackbot. Phabricator is Facebook's open source code review tool. How do these three tools work together? When I finish writing code, the first thing I want to do is deploy it; we are engineers and we want to create value. But whether the code is good enough to deploy is another question. The tools are there to make sure that every piece of code is at least readable. How? When an engineer has written code, they run a command that first runs a linter, which catches basic problems such as unchecked edge conditions. We use GoLinter, and we are working on our own. What does the linter check for? For example, if a call returns an error, it will say that the error cannot be ignored and must be handled explicitly. Your code may be readable but still full of bugs, so next we run the tests: the command runs the tests of every package you changed locally, to see whether your code introduces any other bugs. The test run also produces a coverage report. If your test coverage is not what we require, your code will not be submitted for review. Because test coverage is an important measure of code quality, we try to make sure the code that reaches review is as good as possible.

Figure 10.

After the code is submitted, it goes into CI, which runs larger-scale tests, including simple integration tests. Our CI is driven by a set of scripts written in Go. The code goes up, a request comes in saying what to test, unit tests and end-to-end tests are run, and five minutes later the CI succeeds and the code passes. A little bot sends you a message saying congratulations, your code passed its tests. Phabricator clearly shows you which parts of your code are covered by tests and which are not. For example, the error-checking branch of a code block may be marked as not covered, while blue marks covered code. This improves review efficiency: if the tests are reasonable and valid, reviewers do not need to spend much energy hunting for bugs in covered code and can focus on the uncovered code, which is more likely to have problems. That is the success case.

Figure 11.

If your CI build fails, three things happen: Slackbot sends the author a message that the build did not pass; Phabricator automatically rejects the Diff; and the Phabricator code review page shows which tests failed. I think you have all had the experience that one day some code changes without your knowing, and whenever someone changes your shared library code, you want to get a message about what is going on. When you see this screen, your Diff is not up to standard, and you should spend some time fixing it, retesting it and resubmitting it. All of this happens on the server side. We also track the failure rate of your tests: if it stays high for a long time, you get an email telling you that your code quality has been poor lately and needs to improve, and you are notified of a freeze period.

Another thing that happens during a CI run is a code coverage check. As you can see, this is something we wrote in Go and integrated into CI. The red text is code coverage that does not meet our standard, and when coverage does not meet the standard, your Diff is also rejected. Through these tools, we hope to improve engineers' attitude towards code, because for engineers and programmers the most important thing is code; if you cannot write code well, talking about architecture is pointless. We want to improve engineers' coding rigor through this whole series of measures.
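A coverage gate of this kind could look roughly like the sketch below. It is not Grab's CI code. It assumes the report text comes from `go tool cover -func=coverage.out`, whose last line starts with `total:` and ends with a percentage; the function names and the threshold values are made up for the example.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// totalCoverage extracts the percentage from the "total:" line of a
// `go tool cover -func` report.
func totalCoverage(report string) (float64, error) {
	for _, line := range strings.Split(report, "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[0] == "total:" {
			pct := strings.TrimSuffix(fields[len(fields)-1], "%")
			return strconv.ParseFloat(pct, 64)
		}
	}
	return 0, errors.New("no total line in coverage report")
}

// coverageOK reports whether a Diff passes the gate. A report that cannot be
// parsed is treated as a failure.
func coverageOK(report string, minPct float64) bool {
	got, err := totalCoverage(report)
	return err == nil && got >= minPct
}

func main() {
	// Sample report text; the package path is a placeholder.
	report := "example.com/svc/fare.go:12:\tfare\t\t100.0%\n" +
		"example.com/svc/fare.go:20:\trefund\t\t50.0%\n" +
		"total:\t\t\t\t(statements)\t72.5%\n"
	fmt.Println("passes 70% gate:", coverageOK(report, 70))
	fmt.Println("passes 80% gate:", coverageOK(report, 80))
}
```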

The problems we’ve had with Go

Figure 12

We ran into two common problems in Go: one is the nil pointer and the other is DNS resolution. First, the nil pointer. The code on the right should look familiar, so see whether you can spot the problem. We have a struct A with a method of its own called Test, and a function getA that returns nil. We call getA and then call Test on the result, and the strange thing is that the message printed inside Test still appears. Because getA returns a pointer type, you can call Test on the nil pointer it returns. So if you have a function like getA with a lot of branches, one of which returns nil in some rare edge case, you have to be careful. As the saying goes, make the zero value useful; if you want to handle nil, you may need to add a check like this line. This kind of code has appeared in many repositories. If you are interested, you can watch this video (https://www.youtube.com/watch?v=PAAkCSZUG1c&t=6m25s), which is about using and handling Go's zero values sensibly. This is something that troubled us for a long time.
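The behavior described above can be reproduced with a small program. This is a reconstruction from the description, not the code in Figure 12; the `Name` method, `rawName` and the `panics` helper are additions for illustration.

```go
package main

import "fmt"

type A struct{ name string }

// Test never touches a field of a, so it runs even when a is nil.
func (a *A) Test() string { return "Test called" }

// Name guards against a nil receiver before reading a field.
func (a *A) Name() string {
	if a == nil {
		return "<nil A>"
	}
	return a.name
}

// rawName reads the field without any check and panics for a nil pointer.
func rawName(a *A) string { return a.name }

// getA returns nil on its edge-case branch.
func getA(ok bool) *A {
	if !ok {
		return nil
	}
	return &A{name: "a"}
}

// panics reports whether calling f panicked.
func panics(f func()) (p bool) {
	defer func() {
		if recover() != nil {
			p = true
		}
	}()
	f()
	return false
}

func main() {
	a := getA(false)
	fmt.Println(a == nil)                            // a is a nil *A
	fmt.Println(a.Test())                            // still legal: method call on a nil pointer
	fmt.Println(a.Name())                            // safe because the receiver is checked
	fmt.Println(panics(func() { _ = rawName(a) })) // unchecked field access panics
}
```

Calling a method on a nil pointer is legal in Go; the panic only happens when the method dereferences the receiver, which may be far from the branch that returned nil.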

Figure 13

We also had a problem with DNS resolution: requests to an AWS ELB were not evenly distributed across its machines. There is an RFC that defines how, when DNS returns a list of IP addresses, the resolver chooses which one to hand to your application. In the C library implementation, if your external interfaces are IPv4 only, the list is not sorted, and the order returned by DNS is respected. But Go's library implements only a subset of this, so Go sorts the DNS results by the same rules every time. As you can imagine, a client in the same subnet always picks the first IP address in the sorted list, so that address naturally receives more requests. As good citizens of the Go community we reported this and got an immediate reply: it will be fixed in Go 1.9.

Figure 14

What else is Grab doing with Golang?

Figure 15

With all that said, what else are we doing with Go? We are building data processing that integrates information about hundreds of millions of passengers, drivers and traffic conditions, and we are building the serverless function platform and the machine learning platform.
