There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.

– C.A.R. Hoare

Here we go!

Grayscale release and A/B testing

Many of you will remember the examples of the Facebook and Renren home pages. One of the big weapons Facebook uses here is grayscale release (a gradual rollout to a growing share of users) combined with A/B testing. Like an Aegis shield (for lack of a better analogy), this weapon has repeatedly pulled Facebook back from the brink when something was about to go wrong. As I said in the last post:

Even an aircraft carrier like Facebook sails blind in the startup world: for many product changes, no one really knows where they will lead. So the rule here is "everything must be tested." After a grayscale release, the data dashboard plus A/B testing works like the radar or sonar on an aircraft carrier, verifying the heading and the route.

So let's take a look at Facebook's radar system.

Facebook developed this release and testing system on its web servers (PHP) in 2007-2008. Code-named Gatekeeper (first mentioned in Boz's article), it is essentially a set of switches defined on an admin page, each of which controls whether a given feature is on or off. The properties of these switches are pre-cached in memory, so reading a switch is cheap. At the call site, the check is a single conditional around the new code.

The main logic is an if statement that checks whether the switch is on for the current user: if so, run the experimental code; otherwise, run the old code. Very straightforward and simple, right? Facebook has since added enhancements that allow much finer segmentation and control, for example 1% of users in the US, or men under the age of 25 in Japan. A switch can be conditioned on time, country, join date, number of friends, whether the user is a Facebook employee, gender, age, and so on.
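To make the idea concrete, here is a minimal sketch of such a check. It is written in Python purely for illustration (the real system ran in PHP), and every name in it (Gate, gk_check, the gate names, the hashing scheme) is a hypothetical stand-in, not Facebook's actual API.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class User:
    user_id: int
    country: str
    age: int
    gender: str
    is_employee: bool = False


@dataclass
class Gate:
    """One switch, as it might be defined on an admin page (hypothetical fields)."""
    name: str
    percent: float = 0.0                  # share of matching users, 0 to 100
    countries: Optional[Set[str]] = None  # None means any country
    max_age: Optional[int] = None         # exclusive upper bound on age
    gender: Optional[str] = None
    employees_only: bool = False


def bucket(gate_name: str, user_id: int) -> float:
    """Map (gate, user) to a stable value in [0, 100)."""
    digest = hashlib.sha256(f"{gate_name}:{user_id}".encode()).hexdigest()
    return (int(digest[:8], 16) % 10000) / 100.0


def gate_passes(gate: Gate, user: User) -> bool:
    """Apply the segmentation rules first, then the percentage."""
    if gate.employees_only and not user.is_employee:
        return False
    if gate.countries is not None and user.country not in gate.countries:
        return False
    if gate.max_age is not None and user.age >= gate.max_age:
        return False
    if gate.gender is not None and user.gender != gate.gender:
        return False
    return bucket(gate.name, user.user_id) < gate.percent


# In-memory table standing in for the pre-cached switch properties.
GATES: Dict[str, Gate] = {
    "new_profile": Gate("new_profile", percent=1.0, countries={"US"}),
    "jp_young_men": Gate("jp_young_men", percent=100.0,
                         countries={"JP"}, max_age=25, gender="male"),
}


def gk_check(name: str, user: User) -> bool:
    """Unknown gates are treated as off."""
    gate = GATES.get(name)
    return gate is not None and gate_passes(gate, user)


def render_profile(user: User) -> str:
    if gk_check("jp_young_men", user):
        return "experimental profile"  # new code path
    return "old profile"               # existing code path
```

Hashing the gate name together with the user ID is one common way to keep a user's assignment stable across requests while keeping different gates independent of each other; whether Gatekeeper does it this way is an assumption.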

This makes it much easier to run A/B tests on batches of users. Gatekeeper, or GK for short, is an important piece of infrastructure owned by Facebook's Internal Tools group. As the number of Facebook users grew, the number of daily checks against each GK rose sharply. At the same time, Facebook's features and the corresponding GK entries kept multiplying, which later made scaling GK a big challenge.

In the mobile era, the iOS and Android core teams also built a fairly powerful mobile grayscale release and A/B testing tool, code-named Airlock. One of our Chinese engineers took part in its development. Of course, there are other tools too, such as Twitter's open-source Clutch IO. The Airlock system on mobile is a little more complicated:

1. First, Airlock fetches all GK values from the Facebook server after the user logs in or opens the Facebook app on the phone, and then caches them locally;

2. The iOS or Android code then uses the same if logic to check whether the feature is enabled for the current user: if so, it runs the test code; otherwise it runs the old code;

3. The app then synchronizes with the server at intervals (Facebook uses one hour); of course, the app can also force a fetch of the latest server values at any time;

4. Most importantly, the mobile client records the current user's GK values in its logs, so that once the logs are uploaded, the server can match each user's GK values to the corresponding actions.

In short, grayscale release and A/B testing on mobile essentially means adding a library to the client code that synchronizes all switch values with the server and records them in the logs, so that later analysis of user behavior knows which combination of switches each user was exposed to.
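The four steps above can be sketched as a small client-side library. This is an illustration in Python rather than the actual iOS or Android code; GateClient, its methods, and the shape of the log records are all assumptions, and only the one-hour interval comes from the description above.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

SYNC_INTERVAL_SECONDS = 3600  # one hour, the interval cited above


class GateClient:
    """Hypothetical client library: cache, periodic sync, exposure logging."""

    def __init__(self,
                 fetch: Callable[[], Dict[str, bool]],
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch        # stands in for the request to the server
        self._clock = clock        # injectable so the sync logic can be tested
        self._cache: Dict[str, bool] = {}
        self._last_sync: Optional[float] = None
        # (timestamp, gate name, value) records, uploaded with the other logs
        self.exposure_log: List[Tuple[float, str, bool]] = []

    def sync(self, force: bool = False) -> None:
        """Refresh the local cache if it is stale, or immediately if forced."""
        now = self._clock()
        stale = (self._last_sync is None
                 or now - self._last_sync >= SYNC_INTERVAL_SECONDS)
        if force or stale:
            self._cache = dict(self._fetch())  # copy, so the cache is a snapshot
            self._last_sync = now

    def is_enabled(self, name: str) -> bool:
        """Read a switch from the cache and log what the user was exposed to."""
        self.sync()
        value = self._cache.get(name, False)  # unknown switches default to off
        self.exposure_log.append((self._clock(), name, value))
        return value
```

A call site would then look like `if client.is_enabled("voice_message"): ...`, mirroring the server-side if logic. Logging at read time is one design choice among several; it records exposure only for switches the user's session actually touched.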

Case 1: Evolution of Facebook's iOS app

Here's how the Facebook app interface has evolved on iOS. It is well known that Zuck had a good personal relationship with Steve Jobs. Zuck regarded Jobs as a role model, and often went to Jobs's house for dinner and for tips on running a company. So, starting with the first iPhone SDK, Facebook had a native iOS app available on the App Store:

You can see that iOS was still skeuomorphic, and Facebook was using the nine-grid home screen that was so popular in those days. Notifications appeared at the bottom of the home screen. At that time Facebook did not have a Like button; only comments were allowed. This original version of the Facebook app was built single-handedly by one star engineer, who later open-sourced its commonly used components as the Three20 library.

Then Facebook went through a major UI change:

The biggest change was replacing the nine-grid with a left-drawer navigation menu, opened by the famous "hamburger" button in the upper left corner.

Then in 2013, the Facebook app was due for another major redesign, from the left drawer to a tab bar:

The main goal of this version was to make it easier for users to switch to features other than News Feed. However, this raised further questions: how many buttons should go in the tab bar at the bottom, and what should go in each position?

It was already 2013, and the company's engineering team was still smarting from the failed redesign of the Facebook web home page the previous year, so it decided to be more conservative and pragmatic with the iOS app. Airlock played a huge role in this redesign. After the Facebook iOS core team wrote the tab bar code, it did not immediately release it to all users. Instead, it began four months of grayscale release and A/B testing.

Many variations of the tab bar were tested. Five tabs or four? Requests or Messages? Should Notifications or Groups be exposed at the top level? The function and style of the button in the upper right corner were tested as well: contacts or Messages? An icon or a text label? At one point the release was delayed because new test combinations kept emerging and the data comparing several combinations was ambiguous. The entire iOS app redesign project was led by Mick Johnson, one of the PMs with the strongest execution I have seen at Facebook.

He looked at the data for the various combinations and, given Facebook's big push for Messenger at the time, settled on the final set of tabs and their order. This combination showed the best overall performance across the tests: it let users browse News Feed effectively, grow their number of friends (a very important metric for the Facebook data team), send and receive messages easily, and view new messages. Groups, events, and other secondary features were tucked under "More".

Case 2: The Voice Message release process

Here is a breakdown of the whole Facebook grayscale release and A/B testing process, using the launch of the Voice Message feature in Facebook Messenger, which I was responsible for:

As before, in iOS Messenger, once the user logs in (and every hour thereafter), the client contacts the server to retrieve all Gatekeeper values and caches them locally. Important new Messenger features are placed behind a Gatekeeper (GK), which controls whether each feature is on or off on web, iOS, and Android according to server-side settings. The feature is then opened progressively to all users by widening the range (percentage of users) for which the GK is enabled.

The release process for a feature breaks down as follows:

1. Preparation stage: all the code is written, included in the app, and submitted to the App Store for approval, but the GK is off. After the app ships, first observe how the app containing the feature X code performs with the GK off; it should behave the same as the previous version, with no serious instability.

2. Employee test phase: first turn the GK on for 20% to 50% of employees (it stays off for ordinary users) and watch how the new feature performs among employees. This usually takes days, even weeks. Employees who have the feature turned on form the experimental group; those who still have it turned off form the control group. The core metrics of the two groups are compared. For the Facebook app, these are typically session length (how long the app is used), News Feed engagement (number of likes or comments), ad display time, and revenue metrics. For Messenger, they are the average number of messages sent and the time it takes to send and receive a message. Performance metrics are examined too: app startup time, time to open core features, and power consumption. Generally, for important features, the expected changes in these metrics are predicted before release, and any change that deviates from expectations is investigated.

3. Small launch phase: once a new feature has proven effective, stable, and harmless among employees, Facebook turns the GK on for 1% to 5% of its users. There are two main considerations:

A) Stress-test the system: for features that need new backend services (such as voice messaging, VoIP calls, video chat, or money transfers), check whether the servers can handle the user traffic. Typical load-test steps are 5%, 10%, 30%, and 50%; if things hold at 50%, there are generally no problems beyond that;

B) See what users and the media say about the feature. Typically, PMs collect coverage from TechCrunch, VentureBeat, and Wired and post it to their internal groups. They also search Twitter for users' discussion of the feature.

4. Full release stage: once the servers are confirmed to handle all the traffic, the GK is opened to 95% to 98% of users, while a control group of at least 2% is kept for comparison. At this stage, the most important thing is to watch the monitoring data on the data dashboard and confirm that, at a minimum, the key metrics have not declined.

5. Closing phase: engineers start cleaning up the code, removing the GK check and the old code path, so that the next release or the one after it contains only the new feature X code. This step usually takes one to several months, and the final release of the GK-free code is handled cautiously, because once it reaches users, feature X is no longer controlled by Facebook's servers. If a nasty bug suddenly appears in the app, Facebook can do nothing but ship a new version, wait for App Store approval, and ask users to upgrade immediately.
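One detail the stages above rely on is that widening the percentage only ever adds users, and that a small control group stays off even at full release. A minimal sketch of how that could work with deterministic bucketing follows; it is in Python for illustration only, and the function names, the hashing scheme, and the choice of a 2% holdout are assumptions rather than a description of Gatekeeper's internals.

```python
import hashlib

HOLDOUT_PERCENT = 2.0  # control group that stays off even at full release


def bucket(feature: str, user_id: int) -> float:
    """Map (feature, user) to a stable value in [0, 100)."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return (int(digest[:8], 16) % 10000) / 100.0


def is_enabled(feature: str, user_id: int, percent: float,
               is_employee: bool = False,
               employees_only: bool = False) -> bool:
    """Decide whether a user is in the experimental group at this stage.

    percent is the rollout setting for the current stage, for example
    0 in preparation, 20 to 50 for employees, 1 to 5 for the small launch.
    """
    if employees_only and not is_employee:
        return False
    # Cap the rollout so the top buckets always remain the control group.
    effective = min(percent, 100.0 - HOLDOUT_PERCENT)
    return bucket(feature, user_id) < effective


def group(feature: str, user_id: int, percent: float) -> str:
    """Label used when comparing the metrics of the two groups."""
    return "experiment" if is_enabled(feature, user_id, percent) else "control"
```

Because each user's bucket never changes, everyone enabled at 5% is still enabled at 50%, so users do not flip back and forth as the rollout widens, and setting the percentage back to 0 rolls the feature back for everyone.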

In summary, the characteristics of grayscale release and A/B testing, and the points to watch, are as follows:

1. Front-end code is shipped ahead of launch; multiple versions of the code may coexist;

2. The mobile or front-end (MVC) code must fetch control information from the server at some frequency to decide which variant to display;

3. The release can be controlled from the back end: for example, what proportion of users, or which regions, to release to. Even after opening to 100% of users, the server can still remotely control (or roll back) the feature as long as the GK check has not been removed from the code. If a serious bug appears, the feature can be rolled back completely.

4. Note: because of A/B testing, two users A and B sitting side by side who have both just installed the latest Facebook app may still see different features (even different UI layouts) when they open it. And the same user may see different features today and tomorrow without having updated the app.

In fact, WeChat adopted this approach for testing and data collection several years ago (key members of the WeChat team visited Facebook in 2012, and some of our FB engineers had in-depth product and technology discussions with their engineers and Zhang Xiaolong that year). Take, for example, the shortcut menu in the upper right corner of WeChat:

The options above and their order are controlled on the server side and can be changed at any time. They may also differ from person to person (for example, my WeChat was downloaded and registered in North America, so my list has fewer entries; JD.com shopping appeared for me nearly a year later than for my friends in China; and I have never seen an ad in Moments).

5. Officially, Apple does not endorse server-side remote control of app client logic, because it gets in the way of App Store review. But Apple has been turning a blind eye.

6. As iOS programmers know, Objective-C is a highly dynamic language in which every method call is dispatched at runtime, so the implementation behind any call can be replaced at runtime (notably through the rather bizarre method swizzling mechanism). This means users can hack your GK switches locally once they have downloaded your app. Many times, after a new version of Facebook was released, iPhone hackers dug into the app, found a bunch of GK values, turned them on by hand, and tried new features that had not yet been opened. Sometimes they wrote about it on their blogs or tipped off TechCrunch for coverage. Later, Facebook used compile flags to strip important and strategic features out of the production build entirely.

Finally: don't reinvent the wheel

The last and most important point: as the above shows, grayscale release and A/B testing involve a large amount of development work, both in the product itself and in the release system. In other words, grayscale release plus A/B testing slows down product iteration on the one hand and greatly improves iteration stability on the other. It should be used with caution (or, more bluntly, avoided altogether) by early-stage startups (founded within the last six months and with a small user base).

However, more mature startups should adopt A/B testing as early as possible, especially where competition between near-identical products is fierce (a Chinese specialty). For example: Nice and In, Qunar and Ctrip (before the merger). Among Chinese startups offering A/B testing, I think AppAdhoc does it well; founder Wang Ye has a strong technical background and previously worked at Google in Mountain View. Company website: Call tech AppAdhoc. I have seen the interface of their product and how it is used, and it is very similar to FB's in-house system. Personally, I also find their UI closer to the Bootstrap style everyone is used to than to FB's own blue style: Call tech AppAdhoc (if your company is interested, I can introduce you directly :D).

— the END —


Author: Qin Chao, public id: QC_Empire

— Do have the faith in what you love