First, let me introduce myself. My name is Tu Qiang. I joined Tencent in 2005, before mobile, hybrid, and so on existed, and at that time I mainly worked on the PC version of QQ. Later, while in charge of the PC QQ UI engine, I made some attempts to integrate a browser kernel into the PC client and did some framework work for mixed H5 and Native development. After that I joined Tencent's QQ Membership team, responsible for QQ Membership technology on mobile, and I also took on a rather difficult task: maintaining all the H5 hybrid frameworks in mobile QQ, that is, the technical work around the WebView component.

Today the mainstream hybrid approach is still H5 + Native. H5 matters a great deal for current mobile development, but its problems inside a Native app are obvious. Page loading is slow: users can be left staring at a spinning loading indicator for a long time before anything appears, which is not what product managers want to see. In addition, every time an H5 page opens it involves network round trips and file downloads; these operations consume users' data traffic, and users are unhappy when that consumption is large.

To tackle these two problems, page open time and user traffic, the QQ Membership team built its own technical frameworks, including Sonic for page loading, which is described below.

Technology selection and framework design always follow from the shape of the business, so let me briefly describe the QQ Membership business. More than 70% of the business in mobile QQ is built with H5, for example the main membership stores: the game distribution center, the membership privilege center, and the personalization store that I am now responsible for. The characteristics of these stores are obvious: they are not UGC-generated pages but content configured by product managers in the back end, such as the emoticons and themes you can see on the page.

These pages are fairly traditional. Initially, a traditional H5 page would be optimized with a static/dynamic split to improve speed and experience: areas such as the banner at the top and the item area below contain data that the product manager can change at any time, so after the static page loads, we issue a CGI request to fetch that data from the dataServer and splice it into the page.

The process is roughly as follows: the user clicks, the WebView launches, and the WebView loads the HTML file from the CDN. Once the page has loaded, it fetches the JSON data; to speed this up, localStorage may be used as a cache. This is a very traditional and relatively simple static page loading flow.
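As a rough sketch of that traditional flow (the endpoint, cache key, and render function below are hypothetical, not our actual code), the page paints from a localStorage copy first and then refreshes from the CGI data:

```typescript
// Hypothetical sketch of the traditional static/dynamic split:
// the static HTML is already on screen, and banner/item data is
// fetched from a CGI endpoint, with localStorage as a warm cache.
const CACHE_KEY = 'mall_home_data';          // hypothetical cache key
const CGI_URL = '/cgi-bin/mall/home_data';   // hypothetical data endpoint

function render(data: unknown): void {
  // In the real page this would splice the banner/item DOM; omitted here.
  console.log('render with', data);
}

// 1. Paint immediately from the last cached copy, if any.
const cached = localStorage.getItem(CACHE_KEY);
if (cached) {
  render(JSON.parse(cached));
}

// 2. Fetch fresh data, re-render, and refresh the cache.
fetch(CGI_URL)
  .then((res) => res.json())
  .then((data) => {
    localStorage.setItem(CACHE_KEY, JSON.stringify(data));
    render(data);
  });
```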

However, this scheme has problems. For example, while the WebView is launching, the network sits idle, which wastes time. According to our team's internal statistics, launching a WebView on an Android device takes just under 1 second (mobile QQ is a multi-process architecture and the WebView lives in another process, so launching a WebView involves loading the browser kernel in addition to starting the process).

Secondly, the static pages published to the CDN contain no item data, so when users first see the page downloaded from the CDN, the banner and item areas are blank, which also hurts the user experience badly.

Another problem is that the page has to refresh the current DOM after loading, that is, pull the JSON and then splice it into the DOM structure. We found that on some of the low-end Android devices our QQ users have, this step can also be very time-consuming.

Faced with these problems, we boldly adopted some technical measures, which we call the static direct-out + offline pre-push mode. First of all, we run the WebView launch and the network request in parallel. None of our network requests are initiated from the WebView kernel; instead, while the WebView is loading, we establish our own HTTP connection through a native channel and fetch the page from the CDN and from what we call the offlineServer, which is the offline package caching strategy you may have heard of.

We keep an offlineCache on the native side. When we initiate the HTTP request, we first check whether the offlineCache already holds the HTML for the current page. This cache is isolated from the WebView cache and is not affected by the WebView cache policy; it is completely under our control.

If the offlineCache has no copy, files are synchronized via the offlineServer and updates are downloaded from the CDN. The HTML stored on the CDN already has all the data, such as banners and items, baked into the static page, so as soon as the WebView gets the HTML it can display the whole page directly without executing any JS, and the user can interact with it immediately.
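A minimal sketch of that cache-first lookup order, written in TypeScript purely for illustration (the real logic lives in the native layer; the cache and downloader interfaces below are assumptions):

```typescript
// Illustrative only: the native offlineCache-first load path.
interface OfflineCache {
  get(url: string): Promise<string | null>;     // cached pre-rendered HTML, if any
  put(url: string, html: string): Promise<void>;
}

async function loadPageHtml(
  url: string,
  cache: OfflineCache,
  fetchFromCdn: (url: string) => Promise<string>,
): Promise<string> {
  // 1. Offline cache first: fully controlled by us, independent of the WebView cache.
  const cached = await cache.get(url);
  if (cached !== null) {
    return cached;                // hand straight to the WebView, no network needed
  }
  // 2. Otherwise pull the pre-rendered HTML from the CDN and cache it for next time.
  const html = await fetchFromCdn(url);
  await cache.put(url, html);
  return html;
}
```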

First of all, this solution saves the WebView launch time: network transfer can proceed in parallel during that window. In addition, if a local offlineCache exists, there is no network request at all, which is effectively loading a purely local page. Most of the time, though, to be safe, we still load the page and then refresh the data to guard against inconsistencies.

This mechanism worked well once it went live. However, implementing this H5 loading mode raised some problems: for example, the banner images and item data configured by the product manager may exist in different data versions.

The product manager always configures the latest data on the dataServer, but the data baked into the pages on the CDN may still be a previous version, and worse, the HTML generated by the offline package server, the offlineServer, may be yet another version. Add to that the user's local cache, which may be out of sync with the server (the classic cache refresh problem), and there is potentially a fourth copy of the data.

So when the grayscale trial of this system started, the product manager soon came to us with a complaint: when you open the page you see one set of data, but after about a second the page refreshes and shows different content, and this happens every time you enter the page.

How do we quickly bring all four copies of the data into line? We built a small automated build system for static direct-out. As soon as the product manager configures data on the dataServer's management side, we immediately kick off our build system, which we call vNues.

This system is built on Node.js. It generates the latest version of the HTML in real time from the code files, UI image assets, and other data, then publishes it to the CDN and synchronizes it to the offlineServer, which solves the inconsistency between the CDN files and the latest data.
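A minimal Node.js-flavoured sketch of what such a build step does (the template engine, data fetch, and publish hooks here are assumptions, not the actual vNues implementation):

```typescript
// Sketch of the automated static direct-out build: pull the latest data,
// bake it into the page template, then publish the result to both the CDN
// and the offlineServer in one step so the copies cannot drift apart.
import { promises as fs } from 'fs';

async function buildAndPublish(
  fetchLatestData: () => Promise<Record<string, unknown>>, // from dataServer (assumed)
  renderHtml: (data: Record<string, unknown>) => string,   // template engine (assumed)
  publishToCdn: (html: string) => Promise<void>,           // assumed publish hooks
  syncToOfflineServer: (html: string) => Promise<void>,
): Promise<void> {
  const data = await fetchLatestData();          // latest banner/item configuration
  const html = renderHtml(data);                 // bake the data into static HTML
  await fs.writeFile('./dist/index.html', html); // local build artifact
  await publishToCdn(html);                      // CDN copy
  await syncToOfflineServer(html);               // offline-package copy
}
```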

But the offline package cache lives on the user's phone, so how do we update the offline cache on the user's phone as quickly as possible? You might imagine doing something like this: every time a user logs in, the QQ client downloads the latest files from the offlineServer. But that solution runs into a huge traffic problem.

QQ now has hundreds of millions of daily active users, with a login peak of around 100,000 per second. Even releasing a single 100KB offline package update would often require hundreds of GB of bandwidth, which is unacceptable both in cost and in engineering terms.

The offlineServer is split into two parts: flow control and offline computation. When all the resources of a page need to be packaged into an offline package, the offline computation part not only packages all resources but also keeps every previous historical version, and generates a diff between each historical version and the latest version, that is, the delta for each offline package.
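The packaging step can be pictured roughly as below (a hedged sketch; `binaryDiff` stands in for whatever diff algorithm the real offlineServer uses):

```typescript
// Sketch: for every historical offline-package version, pre-compute a patch
// that upgrades it to the latest version, so clients only ever download a
// small diff instead of the full package.
type Version = string;

function buildDiffTable(
  latest: Buffer,
  history: Map<Version, Buffer>,
  binaryDiff: (oldPkg: Buffer, newPkg: Buffer) => Buffer, // assumed diff helper
): Map<Version, Buffer> {
  const diffs = new Map<Version, Buffer>();
  for (const [version, oldPkg] of history) {
    diffs.set(version, binaryDiff(oldPkg, latest)); // patch: version -> latest
  }
  return diffs;
}
```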

This scheme is again driven by our business shape: each time the product manager updates the page data, the change is not large, generally in the range of a few KB to 10-odd KB, so there is no need to ask users to download the full package on every offline package update.

Every time a QQ user logs in, the client asks the offline flow-control server whether there is a newer package available for download. If the bandwidth calculated by the flow-control server is within the acceptable cost (the quota is currently set tentatively at 10GB to 20GB), in other words if the CDN bandwidth can hold up, the latest diff is delivered to the client. In this way the client is refreshed as soon as an offline package update exists, at minimal traffic cost.
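In spirit, the flow-control check made at login looks something like the sketch below (the bandwidth accounting and field names are illustrative assumptions):

```typescript
// Sketch: only hand out a diff if the pre-push bandwidth budget still has room.
interface FlowControlState {
  usedBandwidthGB: number;   // bandwidth consumed so far in the current window
  budgetGB: number;          // e.g. a 10-20 GB budget for offline pre-push
}

function shouldPushDiff(state: FlowControlState, diffSizeGB: number): boolean {
  return state.usedBandwidthGB + diffSizeGB <= state.budgetGB;
}

// On login: the client reports its current offline-package version, the server
// looks up the pre-computed diff for that version and pushes it only when the
// budget allows; otherwise the client simply keeps its current package.
```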

Through this system, we spend only a dozen or so GB of bandwidth to keep the offline package coverage of all the BG's H5 businesses at around 80% to 90%. One very counterintuitive thing we found in this work: offline package pre-push was expected to consume a lot of bandwidth, but it only does so occasionally. If you keep pushing continuously, year after year, the bandwidth consumption is actually very small, because the system stays in a state of delivering only diffs.

After this work was done, we collected live network data, and the contrast between static direct-out and traditional pages is very clear. The page time in the figure below is roughly 500 milliseconds to 1 second less than for traditional pages, because a static direct-out page requires no JS execution at all, only WebView rendering.

Another interesting phenomenon concerns the value of the offline package. As you can see, a traditional page saves more than 700 milliseconds of network time by using the offline package, whereas the static direct-out mode saves only about 300 milliseconds with it. This is because in static direct-out, the externally referenced CSS and JS have already been rendered into the HTML, so no additional network requests are needed; its own network time is already reduced, and at that point the benefit of the offline package starts to diminish.

You might wonder why static direct-out still takes more than 800 milliseconds in the offline package case. Shouldn't loading from a local cache take essentially zero network time?

The network time we measure runs from the WebView loadUrl call to the first line of the page appearing, so it actually includes part of the page load, the startup of the WebView kernel, and the loading of the network and rendering components, which is why the figure is relatively high.

There was definitely still room for optimization, but just as our client team was about to optimize this network time, our business shape changed. Previously the product manager configured what the page displayed and every user saw the same content; now the product manager wanted every user entering the mall home page to see completely different content.

For example, the home page content on the left is randomly recommended, while the content on the right is recommended according to the user's preferences and behavior, computed through machine learning in the background and matched against the items.

Since each user sees different content when they come in, the static direct-out model no longer works: we cannot pre-generate every user's page in the background and push it to the CDN. But there is a very simple solution for this model.

Instead of storing the HTML on the CDN, we dynamically assemble the entire HTML file on a backend Node.js server, pulling the data from the dataServer.

This pattern meets the product requirement but introduces new problems. The WebView has to request the HTML from Node.js, and Node.js has to assemble the page in the background; the network and backend computation time in the middle turned out to be larger than we expected. During this time nothing on the page can be rendered: users entering our mall home page saw a blank screen, which the product manager could not accept, and users would not pay for it either.

In addition, the WebView's own cache is almost useless in this mode, because with backend direct-out the CSS/JS has all been processed on the backend and it is hard for the WebView to cache a purely static HTML. To solve these problems, we introduced a dynamic caching mechanism.

Similarly, we do not let the WebView access our Node.js server directly. We add an intermediate layer called sonicBridge, similar to the offlineCache mentioned earlier, which first downloads the complete HTML from the Node.js server, hands it to the WebView, and at the same time caches the downloaded content fully on the local device.

Where we used to cache the same HTML for every user across the whole network, each user's device now caches its own copy, the personalized page that was generated for that user.

When the user enters the page for the second time, sonicBridge preferentially submits the locally cached page to the WebView, so the user sees content without waiting for a network request. This improves perceived speed, but it introduces another problem.

In fact, the user may see different content each time they open the page: Node.js returns the latest data every time, so we have to make the WebView reload to pull the new data in. The resulting experience is: you open the locally cached HTML and already see content, then the page completely reloads. On some low-end devices the WebView reload takes a long time, and the user can clearly see the entire H5 page go blank for a while before the new content appears.

Drawing on the local DOM refresh experience from the static direct-out mode described earlier, we can reduce both the amount of data transferred over the network and the amount of data submitted to the page. The first thing to do is reduce the network transfer so the refresh does not arrive too late.

We changed the protocol for the Node.js HTML assembly so that when sonicBridge makes its second request, the Node.js server no longer returns the entire HTML but only what we call the "data" payload.

After getting this data, we rely on an agreement with the H5 page: the native side calls a fixed refresh function on the page and passes the data to it. The page then refreshes its DOM nodes locally, so even when the page needs refreshing, the whole page is not reloaded.
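On the page side, that contract can be as simple as exposing a fixed refresh entry point that the native layer invokes with the new data. The function name, payload shape, and use of element IDs below are hypothetical, just to illustrate the idea:

```typescript
// Hypothetical page-side half of the native <-> H5 refresh agreement:
// native calls window.sonicRefresh(payload) instead of reloading the WebView,
// and the page patches only the DOM regions that actually changed.
interface SonicRefreshPayload {
  [sectionId: string]: string;   // e.g. { "sonicdiff-banner": "<div>…</div>" }
}

declare global {
  interface Window {
    sonicRefresh: (payload: SonicRefreshPayload) => void;
  }
}

window.sonicRefresh = (payload) => {
  for (const [sectionId, html] of Object.entries(payload)) {
    const node = document.getElementById(sectionId);
    if (node) {
      node.innerHTML = html;     // local DOM refresh, no full reload
    }
  }
};

export {};
```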

In terms of the data flow, the first sonic request for a page returns the full HTML along with what we call a template-tag ID. This template-tag marks the hash of the static part of the page and is used for cache control. The returned HTML also contains tags such as sonicdiff-banner, which identify the refresh ID of each region.

On the second load, instead of the entire HTML you saw earlier, the response is about 37KB of data. This is actually a JSON file, but it defines the DOM structure corresponding to each region such as sonicdiff-banner. To save code on the H5 side, we put the DOM node markup directly into the JSON, so the page only needs to match the IDs and refresh.
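Roughly, the second-load response can be pictured as an object keyed by those sonicdiff IDs; the field names and values below are illustrative, not the real wire format:

```typescript
// Illustrative shape of a second-load sonic response: no full HTML, just a
// template hash for cache control plus the DOM fragments that map onto the
// sonicdiff-* regions of the cached page.
const sonicDataResponse = {
  templateTag: 'a3f9c2',                  // hash of the cached page template (example value)
  data: {
    'sonicdiff-banner': '<div class="banner">new banner markup</div>',
    'sonicdiff-items': '<ul class="items">new item list markup</ul>',
  },
};
```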

The 37KB of transferred data is hard to avoid, and we observed that the amount of refreshed data varies from business to business. Can we reduce the amount of data submitted to the page for refreshing even further? After all, the product manager does not change much data at a time.

In sonicCache, in addition to caching the full HTML and the template, we also extract the data into a dataCache.

The template is the skeleton of the page from the first visit, with all the variable data stripped out according to the sonicdiff ID information. The client only needs to merge the template with the returned data to obtain the complete HTML.

After each dataCache is stored, we also diff the data. For example, if this request returned 37KB of data and the previous one was also 37KB, we determine how much of it has really changed and then submit only the difference for the HTML refresh. In most scenarios our page only needs to process about 9KB of data to refresh itself.
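A hedged sketch of that comparison, assuming the data is keyed by sonicdiff IDs as in the example above:

```typescript
// Sketch: diff the newly returned data against the cached dataCache and pass
// only the sections that actually changed to the page refresh function.
function diffSections(
  cached: Record<string, string>,
  latest: Record<string, string>,
): Record<string, string> {
  const changed: Record<string, string> = {};
  for (const [sectionId, html] of Object.entries(latest)) {
    if (cached[sectionId] !== html) {
      changed[sectionId] = html;   // only this fragment needs to reach the page
    }
  }
  return changed;
}
```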

With the cache, users can open the local copy very quickly; transferring differential data shortens the time users wait for the refresh; and finally, submitting only the data diff greatly reduces the range of the page refresh.

The overall flow of sonic mode is as follows. It may look complicated, but the basic principle is to cache the requested HTML template and data through the Bridge.

You might ask whether all the earlier static direct-out effort, the offlineServer and the offline pre-push strategy, is now useless. In fact, the offline caching mechanism described earlier is still used for both dynamic and static pages, because our business pages still share a large amount of common JS, such as the JS API wrappers provided by QQ, and some common CSS; these are still pre-pushed through the offline package strategy and downloaded on each login.

After this model was completed, the effect was quite noticeable. The first load performs about the same as an ordinary HTTP load, but when the user opens the page a second time it usually takes only about 1 second to see the page, and that second already includes the client launch process and the WebView overhead. At the same time, loading speed is no longer affected by the user's network environment: whether on 2G or 4G, the loading speed is almost the same.

It also brings another benefit: if the user's network is poor, for example it keeps jittering or fails to connect, our page can still open even when the user is currently offline, because we have a local cache.

I have not covered the template-update scenario here. A template update means the extracted template itself may change on our server, in which case the loading flow differs from what was described above and the time cost is relatively high. But we found that most visits are still in the data-refresh state, that is, the second-open path.

When optimizing H5 speed, it is natural to wonder whether we should use a persistent connection to avoid the time spent on connect, DNS, and the server handshake. The QQ client, for example, maintains a persistent connection with its backend servers. Would requesting the HTML file from the backend over this connection and handing it to the WebView be faster than making a fresh connection each time? We would only need to set up a reverse proxy from the QQ message backend to reach our Node.js server. This path can be made to work, but we believe this mode is not suitable for every scenario.

It is true that some apps load pages over a persistent connection channel, but it is difficult for mobile QQ, because the persistent connection channel between the mobile QQ client and the server is a very traditional CS architecture: it sends socket packets, and each request packet must wait for its response before the next request can proceed.

This request-response mechanism means a waiting period on every round trip, and the socket packet constraints limit the size of each transmitted packet, so 30-odd KB of data may be split into five or six packets. Although the persistent connection saves the connect time, the back-and-forth with the server increases the overall time.

In addition, the data returned from the Node.js server is streamed over HTTP, so the WebView does not need to wait for the entire HTML to finish downloading before rendering: as soon as the first bytes arrive, it can start document parsing and DOM construction.
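That streaming behaviour is just ordinary chunked HTTP from Node.js, roughly as sketched below (the handler and markup are placeholders, not our actual server):

```typescript
// Sketch: stream the HTML as it is assembled so the WebView can begin parsing
// and building the DOM from the first bytes, instead of waiting for the whole
// document to arrive.
import * as http from 'http';

http.createServer((_req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' });
  res.write('<!DOCTYPE html><html><head><title>mall</title></head><body>'); // head flushes early
  // ...assemble the data-dependent body here, writing chunks as they are ready...
  res.write('<div id="sonicdiff-banner">banner markup</div>');
  res.end('</body></html>');
}).listen(8080);
```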

If we used the persistent connection, we would most likely have to go through client-side encryption, decryption, and packet assembly, and could not hand over the HTML until it had been downloaded in full, which we believe would slow things down.

After this introduction, you probably have a general impression of QQHybrid: 1. we did some work on the WebScope front end; 2. our native-layer developers built the bridge; 3. our backend colleagues did a great deal of work such as automated build and offlineServer push. This part of the architecture looks as follows:

Next let's look at the page traffic section on the right side of the architecture diagram. We counted the traffic distribution across each business, and as the figure below shows, most of the traffic is clearly consumed by image resources. When we did this analysis, though, we also wondered whether it was just our business characteristics that made image traffic dominate: is the same true for other H5 businesses in mobile QQ?

During the 2016 Spring Festival we had a chance to run an activity that involved almost every business in mobile QQ: the Spring Festival red envelopes. You may still remember frantically tapping the screen to grab red envelopes during the 2016 Spring Festival. A nationwide carnival like this creates huge traffic pressure behind the scenes: around 300,000 gift packages were sent to users every second, the web pages guiding users were opened about 100,000 times per second, and the peak traffic estimated at the time was more than 1TB.

We analyzed the image traffic within this, and it did account for close to half. Some of it had been delivered to users' phones in advance through offline package pre-push, but the image traffic on the live network still exceeded 200GB during the activity.

Traffic is not a problem that can be solved simply by buying more from the operators. During the Spring Festival activity we nearly hit 200GB of traffic under a single domain name, which the CDN architecture at the time could hardly bear.

We all felt there was a lot of potential here: if image traffic could be saved, bandwidth costs would drop, and users would get a better experience in terms of data consumption, phone battery, and so on. So our team started investigating new image formats.

Everyone is familiar with WebP, and Android supports it well. The QQ team internally developed an image format called SharpP, which saves about 10% in file size compared with WebP. Below is a comparison on existing images sampled from our CDN servers.

SharpP wins on file size, but what about decoding speed? Unfortunately, SharpP does decode more slowly than WebP, and even than JPG. Fortunately, our business images are not very large, and spending a few extra tens of milliseconds on the page is acceptable; we judged this a better trade than the time spent waiting on the network.

So we planned to roll out the SharpP format across mobile QQ's H5 businesses, but promoting a new image format carries a large adoption cost. First of all, most image links are hard-coded in the page, and the page has no way of knowing whether the mobile client can decode SharpP.

Should H5 pages prepare different HTML for different mobile QQ versions? Or should image resources be published to the CDN with two links in different formats, with the H5 page choosing between them based on the terminal version? Either way, the development cost is clearly unacceptable.

Besides the image format problem, we found that users on different device models also waste traffic. Our UI is usually designed for the iPhone 6 screen, so images default to a width of 750px. Phones with smaller screens, such as 640px and 480px models, download the same 750px image and scale it down at render time.

This wastes a large amount of bandwidth, so we wondered whether the CDN could deliver images at different resolutions according to the screen size of the user's phone.

This screen-adaptation strategy faces the same adoption cost as a private image format, because the CDN does not know anything about the phone either. Our solution is structured as follows:

  • At the bottom layer is what we call the CDN source station, where we deploy the image format conversion tool. The business side does not need to care whether a JPG gets converted into SharpP or WebP; it only needs to publish the image to the CDN source station, which automatically converts it into the corresponding formats and screen resolutions.

  • Above that are the CDN nodes that users' phones connect to, with servers deployed across the country for acceleration and file caching.

  • We worked with the browser team to put the SharpP decoder into the browser kernel, so the upper-layer business does not have to care whether the current browser supports WebP or SharpP.

When a page opens, the WebView automatically sends the terminal's screen size and the image formats it supports to the CDN node. The CDN node then obtains the latest image from the source station, which may have generated the corresponding variants offline or in real time.

Apart from integrating the SharpP decoder library, the WebView-layer changes are relatively simple. For example:

  • Add extra fields to the request headers, such as "Pixel/750" in the User-Agent, which becomes 480 on a 480px device;

  • Add a SharpP entry to the Accept header: image/sharpP

The source station stores 3x3 variants of each image: when a business image is published to the source station, 9 versions are generated. Based on the WebView's request, the CDN node asks the CDN source station for the corresponding variant when it goes back to the source. To the business and to the WebView the requested links stay the same, so no mobile QQ H5 page needs a single line of front-end change to enjoy adaptive image formats and sizes and the resulting traffic savings.
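Conceptually, the selection at the source station works like the sketch below. The header parsing, file naming, and snapping logic are assumptions for illustration; the real conversion pipeline and storage layout are not described in this talk:

```typescript
// Sketch: pick one of the 3 formats x 3 widths pre-generated at the source
// station, based on the Accept header and the Pixel/<width> hint in the UA.
type Format = 'sharpP' | 'webp' | 'jpg';
const WIDTHS = [750, 640, 480];

function pickVariant(accept: string, userAgent: string, baseName: string): string {
  const format: Format = accept.includes('image/sharpP')
    ? 'sharpP'
    : accept.includes('image/webp') ? 'webp' : 'jpg';

  const pixelMatch = /Pixel\/(\d+)/.exec(userAgent);
  const wanted = pixelMatch ? Number(pixelMatch[1]) : 750;
  // Snap to the closest pre-generated width rather than resizing on the fly.
  const width = WIDTHS.reduce(
    (best, w) => (Math.abs(w - wanted) < Math.abs(best - wanted) ? w : best),
    WIDTHS[0],
  );

  return `${baseName}_${width}.${format}`;   // e.g. banner_480.webp (naming is hypothetical)
}
```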

Here is a more visual version of the process: a field is added to Accept, and the corresponding image is returned:

The technology itself is not complicated, and personally I don't think there is any deep technical threshold; it is more about connecting the whole chain from the client and the Web through to the CDN backend. But there were some pitfalls: during the grayscale rollout we received a lot of complaints from iOS users that images were not displayed on the page.

This came as a big surprise, because at the time the feature was deployed only on Android, not on iOS. We checked the CDN code and found no problem. So why were SharpP images being sent to iOS users?

Later analysis found that operators in different regions of China run their own CDN cache services. When an Android user first requested a SharpP image, the operator's server fetched the SharpP response for that link from our CDN. When other iOS users in the same region made requests within the cache's validity period, the carrier saw the same URL and returned the cached SharpP image directly to the iOS users.

The problem was that we had not reviewed the whole architecture and fell into this trap. HTTP has a standard convention for exactly this caching problem: when the CDN distributes content, the Vary field specifies which request fields the cache should key on, in this case Accept and User-Agent. After we added the Vary field, the problem was basically solved.
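On the response side the fix amounts to a single header. A minimal Express-style sketch, assuming the header is set at the origin (in practice it can equally be configured on the CDN itself):

```typescript
// Sketch: tell intermediate caches (including carrier-side CDN caches) that
// the response body depends on Accept and User-Agent, so a SharpP response
// cached for an Android user is never replayed to an iOS user.
import express from 'express';

const app = express();

app.use((_req, res, next) => {
  res.setHeader('Vary', 'Accept, User-Agent');
  next();
});

// ...image-serving routes would follow here...
```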

This case gave us extra inspiration. On our live network, the Pixel field takes only three values: 480px, 640px, and 750px. We discussed internally whether the actual screen width could be written into the User-Agent directly, so that when Android devices appear with new screen resolutions, we could adapt better in the background and generate appropriately sized images for every model.

If we did that, however, the back-to-source overhead for the carriers and for our own CDN cache would be significant. An image would have to be cached per resolution, for example 498px, and if an intermediate operator had no cache for that size it would go back to our service for the source. N screen sizes would thus put N times the back-to-source pressure on our CDN.

Without further ado, the final results are also fairly clear. Below is the data from our Android grayscale rollout: our H5 image traffic dropped by more than 20GB from 40+GB. For Tencent, 20+GB of bandwidth is not a huge cost, but in the Spring Festival activity scenario it nearly doubles the available business headroom. As an added bonus, users wait less time to see the images on the page, and traffic on the user side is roughly cut in half.

Having addressed page loading speed and traffic consumption, we also began to consider the stability of H5 under rapid iteration. I'm sure every front-end developer has seen a page code change break some other, unrelated feature. In hybrid development, Native typically provides many JS APIs to the pages, and a small client change can affect these APIs and break H5 features across the whole network.

Besides functional stability, there is another big question: we release front-end pages every day, so does page performance regress? We spent a lot of effort getting page load time down to 1 second; will some front-end change, such as pulling in more external JS/CSS dependencies, degrade the whole page? We built some tools to address these problems.

This is what we internally call rapid automation. We turn all the front-end test cases into automated tests, then run the full test case set against every page on the network every day to check whether everything still works.

We also run Web Performance Test (WPT) monitoring. Here we mainly watch the traffic consumed each time a page is opened, so we use the tooling to analyze whether some images loaded by the page could be converted to SharpP but are still served as JPG. With this monitoring in place, H5 developers outside our team are also encouraged to optimize their pages.

The front-end community talks a lot about optimizations such as reducing the number of requests, and for us these have become something like military rules that we monitor in testing. I won't go into the client-side optimizations in detail, but we do monitor the WebView startup time on the client.

We also have a stricter front-end release process. All code written and tested in the test environment must pass QTA and WPT verification before it can be released to the production environment; if the automated test success rate is below 95%, the release is not allowed.
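The release gate itself boils down to a simple threshold check, roughly as below (the report shape is an assumption; the real gate is wired into the QTA/WPT pipeline):

```typescript
// Sketch: block a release unless the automated test run is healthy enough.
interface TestReport {
  total: number;    // number of automated cases executed
  passed: number;   // number of cases that passed
}

function canRelease(report: TestReport, threshold = 0.95): boolean {
  return report.total > 0 && report.passed / report.total >= threshold;
}
```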

After release to production, we also have a comprehensive scoring and monitoring system on the external network, whose primary metric is speed. We split page-open speed into client time, network time, and page time, and monitor each separately.

We output a monitoring report like the one below every day to watch daily speed changes. Here we care not only about performance across the whole network, but even more about the experience of slower users, for example the recent proportion of users whose page-open time exceeds 5 seconds.

Beyond that, H5 pages often run into problems such as JS errors breaking the page, or loading so slowly that users stare at a white screen for a long time; we have systematic monitoring for these as well.

In addition, we built a debug platform, with many debugging capabilities deployed in advance to all mobile QQ clients. We can use remote commands to check a user's DNS resolution, which server they hit, whether they are being hijacked by the carrier, and so on.

That is basically the whole QQHybrid architecture. Besides the performance optimizations, we also adjusted the CDN architecture and built an operations monitoring tool. I think it is this operational monitoring system that allows our whole H5 and Hybrid team to make bold changes to pages and release new features while staying stable and reliable.

Throughout the process we also came to feel that hybrid architecture is no longer what it used to be understood as, merely cooperation between the client and the front end: backend technology plays a major role in the whole architecture as well. For the CDN transformation we sought support from the operations team, and the test and development teams were also involved through QTA and WPT. It is fair to say the whole system was built by every role fighting side by side.
