This article is based on the talk “Best Practices for ByteDance’s 100-Million-DAU Client Releases”, given by Gao Lei of ByteDance’s release engineering team at the 2021 GOPS Conference.

First, let me introduce myself. I am Gao Lei, an engineer on ByteDance’s release engineering team. I have worked in software development for more than 10 years, at both traditional software companies and startups. For the last 6 or 7 years I have focused on DevOps and built up some understanding and experience of my own.

My topic today is “Best Practices for ByteDance’s 100-Million-DAU Client Releases”. I hope it gives you a picture of how ByteDance handles client releases. I will cover four aspects: 1. the characteristics and difficulties of mobile releases; 2. an introduction to ByteDance’s mobile release system; 3. a summary of our mobile release practice; 4.

1. Characteristics and difficulties of mobile releases

Let’s start with the characteristics and difficulties of mobile releases. Releasing belongs to CD (continuous deployment), which is part of DevOps. Most of us meet DevOps on the server side. Today’s topic leans toward the client side, but most of the content and ideas carry over. Had I been able to attend in person, I would have polled the audience to see what proportion work on clients. Since I can’t, let me briefly explain how the server and client release processes differ. It is not complicated, but it is worth stating as background for the rest of the talk.

From building to producing the package, the two differ little, apart from slightly different underlying packaging tools. The essential difference lies in what happens after the package is produced. On the server side, updating means sending the binary to our own servers, so the whole process is under our control: you can roll out a new version whenever you like, and if the live version has a problem you can roll back with one click. The client side is not so fortunate. Once a package is built, we usually put the new version on a server, and for official packages we upload it to the app stores for hosting, but we do not know when users will update. If that version has a problem, there is no quick way to return to the previous one. You have to go through the whole release process again and ship a new version, so the cost of stopping the loss is relatively high.

This is the biggest difference between the two: client upgrades depend on the terminal (the user’s device), not on the platform, and this shapes the other aspects of the release cycle:

  • Deployment medium. The server is deployed to a controlled server environment. The client is deployed to complex and changing terminal devices: the operating system may be Android or iOS, and Android alone comes from many manufacturers, such as Xiaomi, Huawei and Vivo.
  • Version concept. On the server the notion of a version is weak, and we generally do not use a version number to identify a particular release. On the client, we talk in terms of version x.y.z, and the version number is a key piece of release information.
  • Release cycle. Server releases usually follow no fixed schedule. A new feature can go out any day, and several releases a day is normal, which fits the DevOps philosophy of continuous delivery. On mobile, the preparation cycle is longer. The mainstream rhythm today is probably one major version every 1 to 2 weeks, and apps that update less often may ship only once a month.
  • People involved. On the server side, once testing has passed, the release phase is basically handled by RD. Client versions are precious and each one counts, so the release phase is designed as a long chain in which different roles complete each stage. Many teams at ByteDance now have a QABM role, a QA member responsible for release management. The division of labor keeps getting finer and more complex.
  • Stop-loss efficiency. As mentioned, the inability to roll back quickly is a “hard wound”, and arguably the main cause of the differences between client and server releases.
  • Multi-version parallelism. Under normal conditions the server keeps only the latest version, with versions running in parallel only briefly during rollout or grayscale. On the client, every version shipped in the past stays out there permanently and can no longer be changed.

With all that said, the difference between the two should now be clearer. You may be thinking that client releases are real trouble. That is right, and because they are troublesome, every release carries some risk.

Let me show some typical production incidents on this slide. These are all real incidents that happened to us in production, including the use of inappropriate material that led to an app store rejection. They come in all kinds. It is easy to see that releasing in mobile scenarios carries many risks, such as security problems, data problems and testing problems. Any small oversight can cost the company dearly, in money or in users. These are all problems a release system must be designed to solve.

Looking at these incidents together with the characteristics of client releases leads to several questions:

  1. How to build an efficient release pipeline?
  2. How to ensure process security?
  3. How to achieve a better and faster rollout effect?
  4. How to ensure the reliability of release data?

2. ByteDance’s mobile release system

How does ByteDance solve these problems? I will walk through ByteDance’s mobile release system and try to answer these four questions.

First, the history of our mobile release platform, which falls roughly into three stages.

The first stage was before 2017. The business was growing fast and company-level platform building lagged behind. Each business, driven by its own release needs, set up a small Jenkins cluster and chained together its packaging and testing tasks. Each business had to maintain such a platform itself, which cost too much. Many tasks were built over and over, and resources were not shared well. As the number of businesses grew, release problems became more and more prominent, and a shared middle platform was clearly needed.

In 2017 the first release platform, version 1.0, appeared. Internally we called it “Rocket”, which obviously stands for speed. This stage was a real improvement on what came before: a dedicated build team took care of the packaging environment and deployment, Jenkins had someone to maintain it, and there were reusable pipelines. This stage lasted until about 2019, when some hidden problems surfaced. The first was the Jenkins packaging cluster. Limited by Jenkins’ single-master architecture, the build team was overwhelmed by the sheer number of Jenkins tasks, and rebooting machines more than 10 times a day was not uncommon. The second was the pipeline itself. It pursued flexible configuration too far: its “atoms” were not really atomic capabilities but more like a loose assortment of scripts, with no conventions agreed at the system level. Users had too many details to care about, the linkage between platform capabilities was opaque, the tools were disjointed, and configuring a pipeline was very costly. The process design was also unreasonable in areas such as security, so some gates existed in name only.

So in 2019 we began building the second-generation release system. In this stage we did several things:

  • First, at the build level, we abandoned secondary development on top of Jenkins and moved entirely to a self-developed distributed scheduling cluster with multi-master, multi-active operation, automatic recovery and priority-based task scheduling. Overall availability improved greatly.
  • In pipeline design, we abandoned the fully scripted approach and extracted many common atomic capabilities from the system, which greatly reduced users’ configuration cost and improved the interactive experience.
  • We also invested in security. Security checks now start at the requirements stage and basically cover the whole CI/CD process.
  • On the data side, we made the artifact repository the data foundation of releases, shifting from a macro, version-centered view to a micro, artifact-centered view, which made the platform architecture clearer.
  • Finally, we explored a variety of approaches to the grayscale stage. The goal is to raise our problem feedback rate and expose potentially serious problems as early as possible.

This diagram shows the main architecture of our current release system. The platform currently supports releases for Toutiao, Douyin, Xigua Video, the novel business, Feishu and other businesses, and the supported terminal scenarios have expanded from Android and iOS apps to Mac and Windows. Along the way we have also accumulated many platforms and capabilities, covering requirements specification, R&D packaging, testing, release, and post-launch monitoring and feedback. Our ultimate goal is a one-stop, general-purpose mobile R&D platform.

Now back to the problems raised earlier. I will introduce the platform in four sections, covering the pipeline, security, rollout and the artifact repository, and look at ByteDance’s solutions to these pain points.

First, the ByteDance release pipeline, which addresses how to build an efficient pipeline. Pipelines are practically the de facto standard in DevOps, and we are no exception. In ByteDance’s scenarios the pipeline solves two main problems.

The first is custom task orchestration across many scenarios. We once tried to devise a single best practice that would satisfy every business scenario, and it did not work. ByteDance’s businesses are at different stages of development. Mature businesses such as Toutiao and Douyin differ from one another in team division of labor and maturity, and emerging businesses differ greatly in access control, test processes and admission standards. For example, a small business that has just launched may have no grayscale stage at all. Forcing a full “family bucket” (all-in-one) template on it and making it go through a complex grayscale process would be a drag on the business, not an efficiency gain.

The second is breaking down the platform’s complexity so that it can be measured. The whole mobile DevOps process is very long. Without a pipeline you would have to manage the relationships among dozens of atomic capabilities yourself, whereas a pipeline lets them be managed programmatically. It also makes it easier for the platform to measure where the bottleneck is, down to a specific capability. If a business’s releases have been slow, analyzing the execution time of each atom on the pipeline quickly shows which stage has the problem, whether automated testing is inefficient or the grayscale rollout is slow. This can be automated and even turned into reports that give a quantitative analysis.

From our own experience, the key is the granularity of atomic capabilities:

  • If the granularity is too coarse, the atom’s internals are too complex to see through to the root of a problem, which means it should be split further;
  • If it is too fine, the atom is not worth measuring and can be merged into another atomic capability.

Here we summarize our two principles for your reference:

  1. Independent functional positioning:
  • Independence covers three dimensions: execution independence, permission independence and data independence;
  • Execution independence means an atomic capability can run on its own and perform its specific function once given the basic data it depends on;
  • Permission independence means an atomic capability has its own permissions, which are not directly controlled by other atomic capabilities;
  • Data independence means an atomic capability has its own data path and communicates directly with the pipeline framework.
  2. Independently measurable and improvable: measurement of the whole pipeline ultimately comes down to individual atomic capabilities. If an atomic capability cannot be measured by some metric, it is an invalid atomic capability that does not qualify to exist on its own, and it needs to be merged or split until it can be measured.
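As a rough illustration of these two principles, here is a minimal Python sketch, not ByteDance’s actual implementation. The names `Atom`, `Pipeline`, `execute` and `slowest` are hypothetical. Each atom runs only on the data the framework hands it (execution and data independence), and the framework times every atom so that the slowest stage can be located.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Atom:
    """One atomic capability: a named step that runs on the data it is given."""
    name: str
    run: Callable[[Dict], Dict]  # receives pipeline data, returns its own output


@dataclass
class Pipeline:
    atoms: List[Atom]
    clock: Callable[[], float] = time.perf_counter  # injectable, e.g. for tests
    timings: Dict[str, float] = field(default_factory=dict)

    def execute(self, data: Dict) -> Dict:
        """Run every atom in order and record how long each one took."""
        for atom in self.atoms:
            start = self.clock()
            # Data flows through the framework, not directly between atoms.
            data = {**data, **atom.run(data)}
            self.timings[atom.name] = self.clock() - start
        return data

    def slowest(self) -> str:
        """Name of the atom with the longest recorded execution time."""
        return max(self.timings, key=self.timings.get)


# Hypothetical three-stage pipeline; the lambdas stand in for real work.
pipeline = Pipeline(atoms=[
    Atom("build", lambda d: {"package": f"{d['app']}.apk"}),
    Atom("auto_test", lambda d: {"tests_passed": True}),
    Atom("gray_rollout", lambda d: {"gray_users": 1000}),
])
result = pipeline.execute({"app": "demo"})
print(result["package"], pipeline.timings)
```

Because timing is recorded per atom, an atom that is too coarse shows up as one large opaque number, which is the signal to split it further.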

With pipelines covered, let’s look at how to secure releases. A few years ago, people treated security as something to deal with only when forced to, something nobody wanted to touch until the problems were thoroughly exposed. The industry has since reached some consensus: security should not act as a firefighter or as an after-the-fact remedy, but should be a required participant that permeates the entire DevOps process.

At ByteDance, over the past two years we have gradually built security into the whole mobile release process. Starting at the requirements stage we carry out security and compliance assessments. In the CI and CD stages we run static and dynamic scans respectively. Before the final release to the stores we run a review against the case base we have accumulated, to avoid crossing security red lines.

When the platform first launched, it found a lot of security vulnerabilities, including network problems and privacy compliance problems. That looked like a good result, but finding vulnerabilities is only the starting point. What matters more is working them off and fixing the problems. From the platform’s point of view, once a high-risk vulnerability appears, the business should be barred from releasing. From the business’s point of view, what matters most is shipping the version on time. In practice the platform would give a temporary green light (possibly hard-coded), the two sides would then discuss it in a group, and finally agree on a fix by a deadline. “Security is important” became just a slogan. So we need a mechanism to solve this, and that mechanism is not primarily technical. What matters more is a top-down consensus within the company.

Three parties are generally involved: the security team, the platform team and the business team. We need to define each team’s position on security issues:

  • First, the security team is responsible for grading security problems, providing remediation plans and helping to land them. For example, on our current platform the security team provides black-box and white-box scanning capabilities and the corresponding rule base;
  • Second, the platform team is responsible for enforcing the process in the system. The platform cannot ignore problems, nor can it be too rigid, and crude “one size fits all” rules will not work. A more reasonable approach is to provide flexible gating and configuration capabilities, so that each business can configure the gate level to suit its own situation. On specific issues we follow the principle of “fix incremental problems as they arise, and remediate existing problems within a set deadline”;
  • Finally, the business team needs to pay more attention to security problems and give feedback on them, cooperate actively in carrying out remediation, and push remediation forward in stages according to priority.
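The configurable gate and the “incremental versus existing” principle can be sketched as follows. This is a minimal sketch under assumptions, not the platform’s real logic: `Severity`, `Finding`, `GatePolicy` and `blocking_findings` are hypothetical names, and the rule that an existing problem with no agreed deadline blocks the release is an assumption made for the example.

```python
from dataclasses import dataclass
from datetime import date
from enum import IntEnum
from typing import List, Optional


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Finding:
    rule_id: str
    severity: Severity
    is_new: bool                         # introduced by this version (incremental)
    fix_deadline: Optional[date] = None  # agreed deadline for an existing problem


@dataclass
class GatePolicy:
    block_at: Severity  # gate level chosen by the business team


def blocking_findings(findings: List[Finding], policy: GatePolicy,
                      today: date) -> List[Finding]:
    """Return the findings that should stop this release."""
    blockers = []
    for f in findings:
        if f.severity < policy.block_at:
            continue  # below the configured gate level
        if f.is_new:
            blockers.append(f)  # incremental problems must be fixed right away
        elif f.fix_deadline is None or today > f.fix_deadline:
            # Assumption: existing problems pass only while inside their deadline.
            blockers.append(f)
    return blockers
```

A release would go ahead only when `blocking_findings(...)` returns an empty list. Making the temporary green light an explicit deadline on the finding, rather than a hard-coded bypass, keeps the gate from existing in name only.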

The next pain point is how to improve the rollout effect. When comparing server and client at the start, we noted that this stage is where mobile and server differ most. We hope to find more problems through the grayscale stage before the official release. Clearly, the more installs a new version gets, the more problem feedback we receive and the better the grayscale works. So we tried two ideas.

The first idea is internal. The company has tens of thousands of employees, and internal RD, PM and QA staff are far more professional and far more sensitive to failures than the average user. If we can make good use of them, we have in effect a reserve resource pool tens of thousands strong. So in early 2018 we started trying this inside the company and launched a mini program called “Byte Closed” for internal closed beta testing. Our operations team regularly works with business teams and invites them to set up activities on the platform that lead people to download a new version, report the problems they hit while trying it, and receive certain incentives. So far the ROI has been very positive. More than 7,000 people take part in the activities in a typical week, and the number of problems reported averages in the dozens, of which P0 to P2 problems account for more than a quarter. Strictly accounted, our costs, mainly operations staff and incentives, are clearly lower than the losses these problems would cause if they leaked into production. So the closed beta activities are worth it. They do place high demands on internal operations, though, so we need to keep providing operational guidance, lower the threshold for participation, and keep the path from activity to feedback smooth.

The second idea is external. Although the company has many internal users, they are still a small fraction compared with hundreds of millions of external users, so we focus mainly on those ordinary users. The question then becomes how to use algorithms to find the right target groups accurately among hundreds of millions of users.

To use a model, you first have to prepare data, and many data dimensions are available. The first is the user’s in-app information: usage habits, content preferences, active periods, whether the user is old or new, and so on. The second is basic user information: gender, age, city and so on, which also helps us make judgments. Beyond user information, we can also bring in attributes of the version itself. For example, if this release launches a live-streaming feature, it should first reach users who are usually active in live streaming. This achieves a deep two-way match between version information and users, and improves the accuracy of the algorithm.

In the implementation stage we tune the model to each business. For apps like Toutiao and Douyin, we have enough data to train dedicated models. For some small businesses whose data is too small to train on separately, we provide a set of generic models. Overall the goal is clear: to improve conversion, measured by CTR and CVR. We have made some progress so far: our model does about 10 points better on average than manual blind selection. The ceiling here is very high and there is still a lot of room for improvement, so we need to keep working on it.

One last pain point: how to ensure the validity of release data? As you may recall from the beginning of this talk, the root cause of the production incidents caused by link configuration errors was that our upgrade system had no trusted data source. Why can’t we use links directly? Because a link only represents a way to access an artifact. It does not represent the upgrade package itself. If the link is tampered with or overwritten, things go wrong. An artifact repository avoids this: it guarantees that the data passed downstream is trustworthy and has been fully tested.

So in my view the artifact repository is the core data foundation of DevOps. If CI (continuous integration) is responsible for code, then CD (continuous deployment) is responsible for artifacts. In the release process, everything done after the package is built can be regarded as done to the artifact. Security checks, functional tests and user stories all describe some property of the artifact. If we find that an artifact has a problem, we can blacklist it so that it does not affect later rollout.

Let me share a case. We have an internal live-streaming team that works across business lines: Toutiao, Douyin, Xigua Video and other businesses all use its live-streaming plug-in. Before the artifact repository existed, they stored packages in a Feishu online document and recorded each package’s status in great detail: who built it, when, why, what the notes were, what features it contained. This was complicated and inefficient. With the artifact repository, a package’s configuration information and all the events and actions before and after it are gathered in one place, so users can easily find the package they need by filtering on tags. We also support a subscription model. For example, if I only care about official packages, I can subscribe to the “official package” tag, and whenever such a package appears the system notifies me, provided I have access to the artifact.

Now we can answer the question. The validity of release data is guaranteed by quality control in the artifact repository. As the data foundation of our platform, it needs to be decoupled from the process and remain independent, so that it can adapt to all kinds of complex scenarios.
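To make the tag lookup, subscription and blacklist ideas concrete, here is a minimal in-memory sketch. It is only an illustration under assumptions: `Artifact`, `ArtifactRepo` and their methods are hypothetical names, the URLs are placeholders, and access control (“provided I have access”) is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Artifact:
    artifact_id: str
    url: str
    tags: Set[str] = field(default_factory=set)
    blacklisted: bool = False


class ArtifactRepo:
    """Single trusted source for packages; downstream reads IDs, not raw links."""

    def __init__(self) -> None:
        self._artifacts: Dict[str, Artifact] = {}
        self._subscribers: Dict[str, List[Callable[[Artifact], None]]] = {}

    def subscribe(self, tag: str, callback: Callable[[Artifact], None]) -> None:
        """Register a callback to be called when an artifact with `tag` appears."""
        self._subscribers.setdefault(tag, []).append(callback)

    def publish(self, artifact: Artifact) -> None:
        """Store the artifact and notify subscribers of each of its tags."""
        self._artifacts[artifact.artifact_id] = artifact
        for tag in sorted(artifact.tags):
            for callback in self._subscribers.get(tag, []):
                callback(artifact)

    def blacklist(self, artifact_id: str) -> None:
        """Mark a problematic artifact so later rollout cannot pick it up."""
        self._artifacts[artifact_id].blacklisted = True

    def find(self, *tags: str) -> List[Artifact]:
        """Return non-blacklisted artifacts that carry all the given tags."""
        wanted = set(tags)
        return [a for a in self._artifacts.values()
                if not a.blacklisted and wanted <= a.tags]


repo = ArtifactRepo()
notified: List[str] = []
repo.subscribe("official", lambda a: notified.append(a.artifact_id))
repo.publish(Artifact("live-1.0", "https://example.invalid/live-1.0",
                      {"official", "live"}))
repo.publish(Artifact("live-1.1-beta", "https://example.invalid/live-1.1-beta",
                      {"gray", "live"}))
```

In this sketch, rollout code would call `find(...)` instead of reading a hand-configured link, so a blacklisted or overwritten package never reaches users through it.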

3. Summary of ByteDance’s mobile release practice

In this third part I will summarize some of ByteDance’s practices in the release system.

Let’s look at a set of figures: we now run about 100,000 builds per week and more than 700 grayscale builds per week, and grayscale releases reach more than ten million actual users per week.

I have also summarized several lessons about iteration for your reference.

First, on best practice. Our platform’s whole iteration history has been one of taking a typical business as a template and optimizing continuously. Continuous iteration, not getting everything right in one step, is the best state.

Second, cherish incidents. Incidents are valuable experience, so we should get the most out of each one and produce a thorough case study, following the “5W” principle. One thing needs saying here: as the platform side, do not blame the business side. If the business side did not follow the rules, ask yourself: did I leave the business room to make the mistake? Did I give enough guidance and support? Could I leave no room for error and eliminate every possibility of a mistake?

Third, requirements. Requirements are endless, so what should we do? I think the platform side needs to be open-minded. We can build the general requirements, and the platform can try to close the loop on them. For personalized requirements, do not force it and do not take on everything. We can set the rules, give others the chance to take part, and grow the ecosystem together.

The final point is that platform value needs to be measured. It is a cliche, but what cannot be measured cannot be improved. If you want to improve, you have to be measurable.

Next, I’d like to share some trends in mobile releases that we have seen internally over our years of working on the platform.

The first is that releases are becoming more frequent. The rhythm has gone from once a month to once every two weeks, and mainstream businesses now iterate weekly. Before long it may evolve to half-weekly or even daily releases. The trend is fairly clear.

Second, security gets more and more attention. This should be the normal state of affairs. I covered it above and will not repeat it here.

Third, precise testing. Traditional testing relies on pre-written test cases. The biggest problem with these is that nobody maintains them later: when the UI changes a little, the test case is essentially unusable, so cases must be updated constantly, and maintenance becomes costly. Precise testing based on AI technology may become mainstream in the future. Instead of maintaining a large body of cases, the system would generate test scenarios automatically in real time from the current scenario. We are making some internal attempts here as well.

The last is continuous grayscale. As releases inevitably become more frequent, the boundary between grayscale and official versions will blur more and more. A version may be in grayscale today and official tomorrow, and the next grayscale may start two days later. Continuous grayscale is something I am currently paying attention to.

From the platform’s perspective there are a few other things to mention: speed, efficiency, security and cost. You can refer to each of these points, but I won’t go into detail. Let me just touch on cost reduction. Why do we need it? First, we store a large number of packages, which creates storage demand. Second, bandwidth, which is perhaps the more obvious one in this scenario: for an app with hundreds of millions of users, as iteration cycles shorten, packages are updated more often and the resulting bandwidth cost is high. The annual CDN cost is estimated to be in the hundreds of millions. That is real money, so we try to cut it. There are many ways to do so, but that is not today’s topic, so I won’t go through them.

To conclude: we covered a lot today, including pipelines, security capabilities and rollout. Look closely and you will see they all pursue balance. For the pipeline, atomic capabilities should be neither too coarse nor too fine. For security, we must balance security against business ROI. For rollout, we must balance speed, effect and user experience. In summary, throughout the iterations of the release platform, we have been pursuing balance. To maximize the business benefit of a platform over a given period, there must be trade-offs, and you cannot have everything.

Finally, the future direction of our platform. First, we will extend the concept of release, moving gradually from the current “small release” system to a “large release” system. What is a large release system? Besides today’s normal upgrade packages and hotfix packages, we may also bring configuration resources and static resources into the release system, forming one unified system for the business. Second, we will keep optimizing rollout, improving the algorithm models and bringing in more data dimensions to enrich them. Third, we will measure more carefully and close the loop so that the data is actually acted on, rather than just putting up a dashboard that looks nice but is not useful. In the end, the data has to be consumed and lead to improvement.

One last introduction: MARS, the Volcano Engine Application Development Kit. The capabilities mentioned in today’s talk will be opened up to everyone through MARS in the future. If you are interested, scan the QR code in the lower right corner to apply for a free trial. If you need a private deployment, leave a message on the official account and someone will help you.

That is the end of my talk today. I look forward to more exchanges with you in the future. Thank you.