Wang Xiaobo / Chief Architect, Tongcheng Travel

Focuses on high-concurrency Internet architecture design, distributed e-commerce transaction platforms, big data analytics platforms, high-availability system design, and research into foundational cloud technologies, with in-depth hands-on experience with Docker and other container technologies.

Preface

As a company's systems keep expanding and the number of services keeps growing, programmers often have to worry about many things beyond the code, such as resource requests, environment configuration, performance, and security, which lowers development efficiency. At the ECUG 10th anniversary conference, Wang Xiaobo, chief architect of Tongcheng Travel, shared his experience of moving from a traditional architecture to microservices, and then from microservices to Serverless, with the aim of letting programmers work more purely and happily by focusing on code.


Many people say that OTAs (online travel agencies) are not real Internet companies, but bringing travel fully onto the Internet is genuinely hard. If you buy a power bank and a hat, whether on Taobao or JD.com, a single system serves you. But at Tongcheng, if you book a hotel or buy a ticket, there may be five or six systems working for you. That is because the data, the focus, and the user behavior are completely different. As a result, thousands of programmers in the company seem to be doing the same thing every day. How do you change that? Make programming a little happier and a little purer.

Background of Tongcheng's Serverless practice

How far is it from a line of SQL to a service?

Many of you do business-level development every day: pulling data out of a database or another data store, or writing data in, and exposing that as a service. How far is that from being a service? Really far. Put another way, when you want to develop a feature, when can it go live? To run a SQL statement, we first need to know where the DB is and how it performs, and then we need to apply for a project to hold the code. After the project is approved, you build the development environment; then you apply for production resources; then you apply for deployment; and after deployment, you apply for operations support. Once all that is done, you still have to think about what happens if one day the load is too much, and how to scale out when traffic arrives. Looking at all the problems this raises, how far is a single line of SQL from a service? Very far.

Meanwhile, product managers are full of ideas. Maybe one gets bumped on the bus in the morning, comes up with an idea, arrives at the office, and asks a programmer to implement it and have it live that afternoon. The programmer says it cannot go live this afternoon and gives all kinds of explanations: he needs to apply for a server, or for various other resources. All of this comes back to the same question: how far is it from a line of SQL to a service? If it takes only an hour, then an eight-hour day lets the product manager try out eight ideas, and four hours of overtime in the evening allows four more. When the product manager is happy, the programmer will surely be happier.

What makes programmers unhappy

  • Environment, framework, dependencies

The first is the environment, the framework, and the dependencies, all of which are hard problems.

Start with the environment, which is already painful on the way from development to launch. Things may work fine on the developer's local machine, but problems appear once the code reaches the testers or production. The easy explanation is that our ops colleagues are too weak and the environment they provided does not match the local development environment, so the code does not run. The real problem is that ops ends up taking the blame. I think environment consistency is something programmers themselves should take care of. But not every programmer is a good programmer, and not every programmer is a full-stack engineer. He may be able to write code without understanding operations or how to make the code run well in its environment. How many full-stack engineers are there in China? How long does it take to train one? That is where the difficulty lies.

Another difficulty is dependencies, a major headache during development. I am a Java programmer myself. Sometimes I write one or two thousand lines of code in just one or two files, build a package, and it comes out at 60 MB with a pile of dependencies. Go is better in this respect, of course, but does it really solve the problem? For example, we have development teams in Suzhou and Beijing, with nearly 1,000 programmers in Suzhou alone. With that many programmers, how do we make them more efficient? You can tackle it through management or through the development process. But isn't it better to solve dependencies with technology, so that they do not arise in the first place? And where a dependency does exist, call it through an interface, which is far more comfortable.

  • Deployment, O&M, and capacity expansion

Next come deployment, operations, and capacity expansion. As I just said, developers may not know much about operations, and operations staff may not know much about development, and that is where problems come from. Tongcheng used to push DevOps, and today each of our developers has his own operations workbench for managing every machine he deploys to, which works. But is it really a good thing? In practice, Tongcheng found that asking a group of developers to do operations did not go well: there were production failures every day, because operations and development are two different ways of thinking. So later, when we did DevOps, we first turned all the original operations staff into product managers and project managers for operations, and had a group of developers build the operations platform. Once the platform was built, developers could use it, and that is how DevOps was rolled out. Does that sound like it solved the problem? It did not, because developers still need to know what operations work involves, and some developers simply do not want to know. For example, one morning the product manager wants to sell boxed lunches along with tickets. Such a feature might take 50 lines of code. Why should the developer have to understand operations for that?

When a company gets big, it has to do cost accounting. The difficulty with an OTA is that it looks like one company but is really a collection of special-forces teams: many business units brought together, each independent of the others. They are all excellent, but when it is time to charge them for resources, none of them has any money. For example, many servers are deployed for the Spring Festival traffic peak, but after the Spring Festival those servers sit idle and waste resources. So the question is how to make my code consume no billable resources when there is no traffic, and scale out by itself when there is a lot. It is a tall order, but for business units fighting on the front line, why not?

  • Debugging, performance, security, too complicated

This is probably the biggest pain. At Tongcheng we have a dedicated performance testing team, a performance testing platform, and full-link performance tests. For example, National Day and the Spring Festival travel rush are the two tourism peaks each year, so we run full-link load tests ahead of them. But how do you ensure that every piece of junk code every programmer writes can handle high concurrency? We had a real case in development: a hotel flash sale, one yuan for a night in a five-star hotel. It is easy to build, and two programmers built it, but it went down as soon as it went live, because the estimate assumed that not many people would compete for hotel rooms, and far more did than expected. To build a solid system within two days, you need a good queuing system. Who builds that queuing system is another question: is there a shared team for it?

What is the other problem? Suppose the product manager then sets another challenge: a flash sale is no fun if users only learn whether they got the item or not, so why not show each user's position in the queue in real time during the sale? Now everyone's position has to be stored, and it cannot be faked, so how do you get the performance? Some people say just find a senior programmer to do it, but we have so many programmers that the product manager may well hand it to a fresh graduate. When a programmer with about a year of experience builds a system with an unreliable product manager, and it finally goes live and causes a real loss, who is responsible? People will say the business unit's technology is bad, the company's technology is bad, and the company's architects are bad, which is very sad, because I, the architect, did not even know the system existed. So we thought we could build a platform to make all this easier.

Serverless vs. traditional architecture

This is Tongcheng's oldest architecture, essentially written by the founders when Tongcheng started out. Looking at it as a system from 10 years ago, this architecture was fine. But if you were still using it in 2017, people would think OTA technology is really weak. In fact, architectures like this still exist in Tongcheng's systems today, and some new systems look like this too. Why? Say one of our business colleagues thinks a trip to the middle of a desert somewhere in Australia would be fun, the product manager likes the idea, and a project is set up. You give it three programmers and tell them to write a microservice system. Impossible. They will be doing well to write an architecture like the one above, perhaps without any service layer at all, with everything piled straight into the application. This kind of business is very common in tourism; we call it trial business or innovation business. A business unit incubates many such projects, and maybe one of them becomes a major project.

But such a structure runs into problems easily: it crashes as soon as traffic arrives, and it can even stay down for a day, so it needs to be transformed.

Tongcheng started on microservices two years ago. Our microservice architecture differs from others: in addition to the basic service containers, there is an intermediate Gateway dedicated to wrapping non-microservices. That is because Tongcheng has many innovation services built on the traditional architecture, and these services still need to call and be called by others. So how can these non-microservices be called by microservices? It is also worth asking whether the microservice is a false proposition. In practice it is, because no one can tell you mathematically how small a service has to be to count as a microservice. And a service that runs for a long time is bound to get bigger. For example, say you have an order interface today and publish it as a microservice; orders can be a core service. If your product iterates, then a month or two later there will be a second service called “New Order service”. Three months later there will definitely be yet another, even newer order service. After six months, the oldest one will be marked “no longer in use, but not taken offline”. You will notice a pattern: in every company it is easy to put a system online, but to take one offline you have to contact every department and burn all kinds of incense before it finally goes offline.

As a result, microservices keep growing. The original purpose of our microservices was to simplify the business: make services smaller, better, and more stable, deployable independently, and easy to orchestrate. But look again three to five months or a year and a half later, and the service is a plump Yang Guifei, no longer a slender lady. Orchestration no longer fits because there is so much inside. That is the problem with microservices. Our solution was to add another Gateway, put the bloated services behind it, and keep exposing several basic microservice interfaces, with routing policies that choose which version of an interface to use. Even after this, the operations and maintenance costs are very high.
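
The version-routing idea can be sketched roughly as follows. This is only an illustration, not Tongcheng's Gateway: the class `VersionedGateway`, its methods, the `order` interface, and the `client` field used by the policy are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

Handler = Callable[[dict], dict]
Policy = Callable[[dict], str]


@dataclass
class VersionedGateway:
    """Hypothetical gateway: several versions of one interface sit behind a
    single name, and a routing policy picks the version for each request."""
    handlers: Dict[str, Dict[str, Handler]] = field(default_factory=dict)
    policies: Dict[str, Policy] = field(default_factory=dict)

    def register(self, name: str, version: str, handler: Handler) -> None:
        self.handlers.setdefault(name, {})[version] = handler

    def set_policy(self, name: str, policy: Policy) -> None:
        self.policies[name] = policy

    def call(self, name: str, request: dict) -> dict:
        version = self.policies[name](request)   # policy decides the version
        return self.handlers[name][version](request)


gateway = VersionedGateway()
gateway.register("order", "v1", lambda req: {"version": "v1", "order_id": req["order_id"]})
gateway.register("order", "v2", lambda req: {"version": "v2", "order_id": req["order_id"]})
# Assumed policy: app clients get the new order service, everyone else the old one.
gateway.set_policy("order", lambda req: "v2" if req.get("client") == "app" else "v1")
```

Callers only ever ask for `order`; which of the accumulated versions answers is a routing decision, which is also why every extra version adds to the maintenance burden described above.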

Almost all of our applications, including databases, are now deployed in Docker. However, as the number of services grows, operating and reclaiming each service becomes a big problem. Services cannot keep multiplying indefinitely. Moreover, we found that traffic to about 30% of services almost disappears over time; they may be called once a day or once a month. These services could almost be taken offline, but not quite, because a little traffic still comes in. They pile up and consume a lot of server resources, which is extremely wasteful.

Can a script be a microservice?

Let's make a comparison. Developers write code that in some cases is pseudo-object-oriented: it is no different from procedural code, it is often cumbersome, and it comes in one lump after another. That is why developers talk so much about refactoring, because the code is so convoluted. Now look at operations. Ops engineers write code too, and much of it is scripts. For them there is no such thing as refactoring a script. A script is quick to write and disposable: if it does not work, you throw it away and write another.

In the systems described earlier, a lot of the work is quick implementation. Can that work be turned into a script, and can the script be turned into a microservice? For example, looking up data in a database table and exposing it as a service may be a single SQL statement. What happens if you do not allow it to be a script and insist on a microservice, which has its own full process to go through? The developer will not go through that process. He will take some existing service, add the interface to it, and pollute that service. So, given that we were already doing microservices, we wanted to turn things that change quickly and are lightweight into scripts.

What Tongcheng's Serverless implements

I started on Serverless in April or May 2016. No one today can say exactly what a Serverless architecture is. I have excerpted one sentence here: “If your PaaS can efficiently start instances in 20 ms that run for half a second, then call it serverless.” When we built ours, though, we did not think about how many milliseconds startup would take. Our principle is that a lightweight service can become a microservice from a single script, and that its invocation and deployment can scale dynamically. Ideally it consumes almost no resources when there is no traffic, and it is easy to put online. The problem we want Serverless to solve is that there are many very light services that need to go online and offline quickly without incurring large resource costs.

This is an architecture diagram we drew. We divide it into three layers: the scheduling layer, the computing layer, and the base layer. The scheduling layer may look unremarkable, but it actually does a lot. Think about how the services scaled out below it can be attached and serving within a second. Dynamic, hot-reloading load balancing is therefore very important, and it applies at every stage, because what is published on the platform might be a web page. If it faces high external concurrency, for example the product manager wants a big traffic-acquisition campaign today and needs a landing page, and the programmer builds it on the platform without much care, the concurrency arrives and the page is likely to go down. And we must consider not only external traffic coming in from the front end but also internal traffic: for example, when a script depends on several internal systems, how to trip a circuit breaker when one of them fails. With a microservice architecture you would build a circuit-breaker mechanism, but someone writing a simple script like this rarely thinks about circuit breaking.
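
To make the circuit-breaker point concrete, here is a minimal sketch of the generic pattern that a scheduling layer could apply on a script's behalf. It is not the platform's actual implementation; the class name, the `max_failures` and `reset_after` parameters, and their default values are assumptions for illustration.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the breaker is open."""


class CircuitBreaker:
    """Generic circuit breaker (illustrative; thresholds are assumed values)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_after = reset_after      # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None               # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency is failing; call skipped")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open if the trial call failed.
            if self.failures >= self.max_failures or self.opened_at is not None:
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A script author would never write this; the point of putting it in the platform is that a call such as `breaker.call(fetch_member_level, member_id)` fails fast while a dependency is down instead of piling up waiting requests.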

The second layer is the computing layer, which mainly handles our scheduling work: how to provision resources quickly. There are two ways to build Serverless. One is to use Docker containers directly, so that every scale-out adds a container and they pile up one by one. But I find that inelegant, and scaling containers directly is too crude; after all, a container itself takes up resources. Besides, the code we deploy is by definition one script, or one or two scripts, so the amount of code is tiny, and allocating something as heavy as a container to it wastes resources. So we started with simple dynamic languages instead. We first implemented dynamic languages such as Lua and Node.js, whose advantage is that they run in something like a virtual machine that can be isolated. Each script then executes as a separate, runnable container of its own, which is in turn placed inside a Docker container. Currently we support only dynamic languages; we are trying to support Go, but it is hard and has not succeeded yet. The two figures above show the call relationships between our scripts.

This is a resource utilization chart for our Serverless architecture. Basically, we deploy Docker containers on physical machines, then deploy the scripts one by one inside those containers, isolated from each other. In our experiments, a minimal deployment ran 100,000 applications on four physical machines. Of course you cannot do that in production: if 100,000 script services all had traffic, the traffic alone would kill the machines. With Serverless like this, many light applications can be taken offline at any time, which solves the problem of services that have little traffic but must stay alive.

But here comes the question: are programmers now happy when they program? I don't think so. What I hate most is writing one line of code to fetch one piece of data from the database and then having to find the right driver, ask for the DB address and password, and pull in all kinds of class libraries. It bores me to death. So we also started building an SDK to make coding on the platform more pleasant. In the example of pulling data from the database shown below, you do not need to care where the server is when programming; you just focus on the code. A library such as Server.DB is managed by a configuration center for the whole platform. Every DB can be configured into it manually, and once configured, the platform generates the corresponding object, so the developer only has to type its name.
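
As a rough illustration of this idea, the sketch below shows a DB object whose connection details come from a configuration center rather than from the script. It is not the platform's SDK: `ConfigCenter`, `DB`, the logical name `member`, and the table layout are hypothetical, and an in-memory SQLite database stands in for a real one.

```python
import sqlite3


class ConfigCenter:
    """Stand-in for the platform's configuration center (hypothetical)."""

    def __init__(self):
        self._dsn = {}

    def register_db(self, name, dsn):
        # Done once by whoever administers the platform, not by script authors.
        self._dsn[name] = dsn

    def dsn_for(self, name):
        return self._dsn[name]


class DB:
    """Hypothetical SDK object: scripts refer to databases by logical name."""

    def __init__(self, config):
        self._config = config
        self._connections = {}

    def query(self, name, sql, params=()):
        conn = self._connections.get(name)
        if conn is None:
            # Address and credentials are resolved here, never in the script.
            conn = sqlite3.connect(self._config.dsn_for(name))
            self._connections[name] = conn
        return conn.execute(sql, params).fetchall()


# Platform side (assumed setup): register a database and seed some data.
config = ConfigCenter()
config.register_db("member", ":memory:")
db = DB(config)
db.query("member", "CREATE TABLE member (id INTEGER, level TEXT)")
db.query("member", "INSERT INTO member VALUES (?, ?)", (1, "gold"))


# Script side: the whole "service" is one query against a logical name.
def get_member_level(member_id):
    rows = db.query("member", "SELECT level FROM member WHERE id = ?", (member_id,))
    return rows[0][0] if rows else None
```

The script never sees a driver, host, or password, so moving it between development, pre-release, and production becomes a configuration change rather than a code change.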

Besides operating on DBs, code on this platform can also call other microservices. For example, a landing page that needs to know a member's level has to call the membership service. The membership service is very heavy, so it is not on this platform; it is certainly deployed as an independent microservice. The platform can still depend on it and call it directly through the scheduling layer.

After doing all this, we found it was still not Serverless enough. As I said, a big problem remains: it works locally, but not online or in the test environment. It is still about the environment, and you cannot focus only on the code. So we built a Web IDE. Developers no longer need to care about the local development environment or configure its dependencies. With microservices, especially when many large teams work together, the interdependencies are often hard to simulate locally. With the Web IDE, every developer can open a web page and write code, and while the code is being written, everything is under our control.

The Web IDE is quite powerful and can do things like version merging. You can create a new project in it, or join an existing project, and it offers plenty of scaffolding and roles. There is also a store where programmers can publish common components for others to use. Take anti-crawling: the anti-crawling team has already built an anti-crawling system, so if your script serves web pages externally and needs anti-crawling, you just tick the option in the configuration. Likewise, if you need queuing, you just set the number of concurrent requests in the configuration, and other systems do the rest for you.
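
One way to picture a configuration-driven concurrency limit is a wrapper that the platform applies around a script's handler. The sketch below is an assumption about how such a feature could work, not a description of the Web IDE's store: the decorator `with_platform_features` and the `max_concurrency` key are hypothetical.

```python
import threading
from functools import wraps


def with_platform_features(config):
    """Hypothetical wrapper applied by the platform from a script's config."""
    limit = config.get("max_concurrency")
    semaphore = threading.BoundedSemaphore(limit) if limit else None

    def decorate(handler):
        @wraps(handler)
        def wrapped(*args, **kwargs):
            if semaphore is None:
                return handler(*args, **kwargs)
            # At most `limit` calls run at once; extra callers wait here,
            # which is a simple form of queuing.
            with semaphore:
                return handler(*args, **kwargs)
        return wrapped

    return decorate


# The script author only sets a number in the configuration.
@with_platform_features({"max_concurrency": 2})
def landing_page(user_id):
    return f"hello {user_id}"
```

The handler itself stays a plain function; the queuing behavior lives entirely in the shared wrapper, so it can be maintained by one team and reused by every script.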

Serverless use cases at Tongcheng

Efficiency in development, release, and operations

With a platform like this, we found that in the early stage of development, where any project or standalone feature used to begin with a list of things to apply for, we saved 90% of the time. Next is locating data. Where the DB is seems like a small matter, but it is very annoying across development and test environments: where is the development DB, where is the pre-release DB, and if you pick the wrong one you are finished. With Serverless, we saved 80% of the time on these things. The saving is smaller when it comes to writing code, because a lot of code still has to be written; the SDKs we provide help here, saving up to 40% of the time. Operations saves 90% of the time. When applications are deployed as small scripts and the platform provides everything, even the development tools, operations becomes much easier, and monitoring can be set up automatically.

Web applications

All of Tongcheng's sites and web pages now run on the platform just described. Because of this platform, a front-end revolution took place inside Tongcheng at the end of 2016. In the past, front-end engineers had to ask back-end engineers for an interface after finishing a page. Now, because the platform runs Node.js, front-end engineers can finish the job themselves and reach all kinds of databases directly. A requirement raised in the morning really can be live by evening, so all the front-end engineers like using this platform. Moreover, it is not only a web platform; it also exposes many APIs. Many colleagues have started building their own tools, for example connecting Atom to the platform API and writing their code in Atom.

Lightweight services

There are also quick services for simple businesses, such as the actual code shown above, which really is just one script.

Supporting function integration

Then there is real-time price calculation. Many people say OTAs have no elegant technology, but there is plenty. Take the most traditional business, hotels. The hotel business is very traditional, but it is hard and complex to do. Think about the price of something you buy on Alibaba or JD.com: at the very least, the morning price stays fixed until noon. Hotel prices are different. Why? If you stay three nights in a row, or your stay includes a weekend, or you book three days ahead, or a week ahead, the price differs for each set of search criteria. A week-long stay is cheaper per night, perhaps 280 yuan for a 300-yuan room, and booking two weeks ahead may be cheaper still. But booking for tomorrow or tonight is expensive, and a stay that runs through the weekend costs 30% more, so the price varies with the conditions of each search. At the moment, many hotel searches return static data that is refreshed only every 5 minutes or every half hour. What we do is real-time calculation, and real-time calculation is really a capacity-expansion problem, because the calculation takes too long. Of course, Sanya can now be searched quickly, because there is a real-time expansion mechanism. For our hotels there are 480 compute nodes, which can expand to 1,000 nodes and shrink back.
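
To show why every search needs its own calculation, here is a toy pricing function. The 30% weekend surcharge and the 280-for-300 long-stay rate echo the figures mentioned in the talk, but treating them as fixed rules is an assumption, as are the 7-night and 14-day thresholds, the 5% advance-booking discount, and the function name `quote_stay`.

```python
from datetime import date, timedelta


def quote_stay(base_rate, check_in, nights, booked_on):
    """Toy price quote; every rule here is an illustrative assumption."""
    total = 0.0
    for offset in range(nights):
        night = check_in + timedelta(days=offset)
        rate = base_rate
        if night.weekday() in (4, 5):      # Friday and Saturday nights
            rate *= 1.30                   # weekend nights cost 30% more
        total += rate
    if nights >= 7:
        total *= 280 / 300                 # long stay: 280 yuan per 300-yuan night
    if (check_in - booked_on).days >= 14:
        total *= 0.95                      # assumed discount for booking two weeks ahead
    return round(total, 2)
```

Even in this toy version the result depends on the check-in date, the length of stay, and the booking date together, so a price cannot simply be read from a table that was computed once in the morning; it has to be recomputed per query, which is what drives the need for elastic compute.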

What’s next on Serverless

Of course, we still have a lot to do. Serverless is only the beginning, and it may not even be the right path. In short, Serverless can help us get some things done.

Q&A

Q1: What if the server where the code is stored or another server has problems?

Wang Xiaobo: Actually, developers in China have this problem as well. We communicated with them on Facebook a couple of days ago. In China we require that code be checked in on the same day, but no one reads it.

Q2: What about writing code on a web page when, in the middle of writing, the network suddenly goes down?

Wang Xiaobo: Not this one. I think your way is better than the domestic way.

Q3: I have also been working on my company's gateway recently. Could you share your experience with gateways?

Roy: Ours is fairly simple. We have several gateways. For front-end traffic, for example, we put things in a gateway. What you are describing is the service gateway, which at Tongcheng is only responsible for service distribution. For service-to-service calls we simply call it a gateway, and it has nothing else. This gateway sacrifices some features: many of the things services usually do are gone once you go through it. It is very crude. We do not do circuit breaking there; it routes you to a specific place, and the rest is up to the services themselves.