[Global Software Conference] A Huawei front-end engineer shares: intelligent practices of the Huawei Cloud official website

Abstract: At the 7th Global Software Conference, a Huawei software development engineer talked with developers about the intelligent practices behind the Huawei Cloud official website. The talk covered the content operation process (content production, content analysis, content quality inspection, content distribution, content consumption and user feedback) and the business pain points encountered along the way.

This article is shared from the Huawei Cloud community post “Five key measures of intelligent practice on the Huawei Cloud official website [Global Software Conference technology sharing]”, original author: technology torchbearer.

The Internet generates a huge amount of content every moment. According to a report from Ruia, every 60 seconds the Internet in China sees 4.2 million voice messages, 8.3 million shared videos, 4.16 million search queries and 1.65 million Weibo visits.

In the face of so much content, how should we do a good job of website content operation?

At the 7th Global Software Conference, a Huawei software development engineer walked developers through the intelligent practices of the Huawei Cloud official website, covering the content operation process (content production, content analysis, content quality inspection, content distribution, content consumption and user feedback) and the business pain points met in that process.

The talk also covered how Huawei Cloud uses AI algorithms and models to automate work, reduce labor costs, and improve both content quality and the efficiency of content distribution.

How to judge content quality, and what is the key to efficient content delivery?

In the digital age, traffic is the key to operating website content, and traffic growth rests on high-quality content and the good experience that efficient content distribution brings. A negative example is Indian media misusing a photo of Putin in reports on sexual assault incidents. A positive example is how news, e-commerce and video websites use recommendation and search to distribute content.

So how does the Huawei Cloud official website, as a content website, do it?

First, let us look at the content life cycle and content operation process of Huawei Cloud. Content operation on the Huawei Cloud official website is divided into six stages: content production, content analysis, content quality inspection, content distribution, content consumption and user feedback. The pages, documents, audio, video and pictures of the official website are first parsed and understood. After the content passes review, operations staff distribute it to the live site. After end users consume the content on the Huawei Cloud official website, they feed their opinions back through internal and external platforms.

In the process of content operation, our pain points include the following:

A large amount of multimedia content (audio, video, pictures, etc.) needs in-depth semantic analysis before its quality can be judged and it can be distributed effectively, which costs time and labor;
Content is released in large volumes and updated frequently, so quality inspection consumes a lot of manpower and is inefficient;
The traditional operation-configuration model cannot meet the personalized needs of complex customer groups, which tends to reduce user interest and cause user churn;
The access experience of end users cannot be effectively collected, analyzed and closed out, which hinders rapid improvement of the product experience.

To solve these problems, we apply intelligent solutions to the business pain points at each stage:

In content analysis, OCR, ASR, NLP and other technologies automatically extract structured information from content to reduce labor costs;
In content review, NLP technology and the Huawei Cloud Moderation service are used for machine review;
In content distribution, the structured information of content (TDK, labels, categories, etc.), together with intelligent recommendation, intelligent search and related technologies, improves the efficiency and accuracy of distribution and the user experience;
In user feedback, NLP technology is used for sentiment analysis and classification of user voices, so that feedback is handled promptly, closed out, and continuously turned into product improvement suggestions.

The following is a detailed introduction to the relevant practices of Huawei cloud intelligent operation.

Key measures of intelligent operation practice of official website

First, I will introduce the overall architecture of intelligent operation for the Huawei Cloud official website. The architecture is relatively simple and consists of several key layers.

At the bottom is the basic service layer. All our business is built on Huawei Cloud services, including the AI-related OCR, ASRC, NLP, RES and ModelArts, the big-data services DLI and MRS, and basic SQL and NoSQL storage services. Above the basic service layer is the core data layer, which holds user profiles, behavior data, item information and other data. The middle layer is feature engineering and algorithm models; the models mainly cover NLP, intelligent recommendation and intelligent search. On top of that we build service components to support different business scenarios, including profile and label components, strategy management and ranking components, and A/B testing and log collection components. The application scenarios at the top are mainly personalization (a different page for every user), recommendation, search, public opinion monitoring and intelligent Q&A.

I will highlight some of the key initiatives for smart practices.

Key measure 1: content analysis

In the content parsing stage, we use Huawei Cloud OCR and ASR to extract text from pictures, audio and video, which prepares for automatic content review in the next step. We also use NLP to extract structured information from the text, such as keywords, summaries, labels, categories and topics, for use in search engine optimization and in model training for the content distribution stage.

Key measure 2: content quality control

After text extraction and semantic understanding, we inspect content quality automatically, covering text error correction, content review and normative checks. Text error correction offers pinyin-based correction, N-gram substring-based correction and language-model-based correction, so the business needs to update keywords and corpora regularly and retrain the model periodically.
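To make the N-gram idea concrete, here is a minimal sketch, not the production implementation. It assumes a hypothetical confusion set (`confusion`) that maps a suspicious token to candidate replacements, and it scores each candidate with bigram counts from a small business corpus. English tokens are used only for readability.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count adjacent token pairs, with sentence boundary markers."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts


def correct(sentence, bigrams, confusion):
    """Replace a token only when a candidate fits its neighbors strictly better."""
    tokens = sentence.split()
    for i, token in enumerate(tokens):
        prev_tok = tokens[i - 1] if i > 0 else "<s>"
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
        # The original token comes first, so it wins ties.
        candidates = [token] + confusion.get(token, [])
        tokens[i] = max(
            candidates,
            key=lambda c: bigrams[(prev_tok, c)] + bigrams[(c, next_tok)],
        )
    return " ".join(tokens)


corpus = ["elastic cloud server", "cloud server pricing"]  # toy corpus
confusion = {"sever": ["server"]}                          # hypothetical confusion set
print(correct("elastic cloud sever", train_bigrams(corpus), confusion))
```

Updating `corpus` and `confusion` corresponds to the regular keyword and corpus maintenance mentioned above; a real system would use smoothed probabilities instead of raw counts.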

Content review is connected to the Huawei Cloud Moderation service, which can review text, images and video; the business only needs to update the sensitive-word lexicon regularly. In addition, normative checks cover 404 dead links, TDK (title, description, keywords) information, currency units and so on, and are implemented mainly with a crawler service and a rule engine.
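As an illustration of what one normative rule might look like, the sketch below checks the TDK fields of a crawled page using only the Python standard library. The function name `check_tdk` and the 60-character title limit are assumptions made for this example, not rules taken from the talk.

```python
from html.parser import HTMLParser


class TDKParser(HTMLParser):
    """Collects the <title> text and the description/keywords meta tags."""

    def __init__(self):
        super().__init__()
        self.tdk = {"title": "", "description": "", "keywords": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") in ("description", "keywords"):
            self.tdk[attrs["name"]] = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.tdk["title"] += data.strip()


def check_tdk(html, max_title_len=60):
    """Return the list of rule violations for one page (hypothetical rules)."""
    parser = TDKParser()
    parser.feed(html)
    problems = [f"missing {field}" for field, value in parser.tdk.items() if not value]
    if len(parser.tdk["title"]) > max_title_len:
        problems.append("title too long")
    return problems


page = ('<html><head><title>Elastic Cloud Server</title>'
        '<meta name="description" content="ECS overview"></head></html>')
print(check_tdk(page))
```

In a rule-engine setup, each check like this would be one rule applied to every page the crawler fetches, alongside rules for dead links and currency units.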

Key measure 3: Content distribution - intelligent recommendation

In the content distribution stage, we mainly introduced intelligent recommendation and intelligent search. Intelligent recommendation predicts users’ interests from user profiles, item profiles and user behavior, so that content finds the right users, recommendations are accurate, and the conversion rate improves.

The system architecture of Huawei Cloud intelligent recommendation is as follows: starting from offline data in OBS, DLI offline processing extracts user profiles, item profiles and user behavior information, and is also used for feature engineering and for training the recall and ranking models. The trained models are then published to the ModelArts platform, which provides online inference.

We also support real-time recommendation. The business uploads user and item information through a DIS channel to update user and item profiles in real time. The DIS channel also ingests real-time behavior, which updates the user’s interest labels and recalls a real-time candidate set. Finally, when the user visits an official website page, the page calls the ModelArts interface, which returns the ranked recommendations.

Key measure 3: Content distribution - recommendation algorithms

Recommendation algorithms in the industry are relatively mature, and we adopted commonly used recall and ranking algorithms. Recall uses collaborative filtering and interest matching, while ranking mainly uses LR (logistic regression) and DeepFM. LR is simple, efficient and cheap to compute, but it cannot model interactions between features. DeepFM combines low-order and high-order feature interactions, and tends to be more accurate as more features become available.
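The collaborative filtering recall step can be sketched as follows. This is a generic item-based variant on implicit click data, shown only as an assumption of how such a recall might work; the product names and the `recall` helper are illustrative, and the talk does not say which collaborative filtering variant is used.

```python
from collections import defaultdict
from math import sqrt


def item_similarities(user_items):
    """Cosine similarity between items, from implicit feedback (user -> set of items)."""
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)
    sims = defaultdict(dict)
    items = sorted(item_users)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            common = len(item_users[a] & item_users[b])
            if common:
                score = common / sqrt(len(item_users[a]) * len(item_users[b]))
                sims[a][b] = score
                sims[b][a] = score
    return sims


def recall(user, user_items, sims, k=3):
    """Return up to k unseen items most similar to what the user already clicked."""
    seen = user_items[user]
    scores = defaultdict(float)
    for item in seen:
        for other, score in sims.get(item, {}).items():
            if other not in seen:
                scores[other] += score
    # Sort by descending score, then by name so ties are deterministic.
    return sorted(scores, key=lambda it: (-scores[it], it))[:k]


clicks = {  # toy click log: user -> clicked product pages
    "u1": {"ecs", "obs"},
    "u2": {"ecs", "obs", "rds"},
    "u3": {"ecs", "rds"},
    "u4": {"obs"},
}
print(recall("u4", clicks, item_similarities(clicks)))
```

The recalled candidates would then be passed to the ranking model (LR or DeepFM) for final ordering.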

Intelligent recommendation has brought clear improvements to the business: content distribution time dropped from hours to minutes, and content push coverage reached 90%+.

In addition, the click-through rate of the official website products and activity recommendations, the conversion rate of registration and purchase, and the click-through rate of the community homepage blog recommendations all improved.

For intelligent recommendation in content distribution, we also summarized several lessons:

For business scenarios with little data, first launch an algorithm with a simple, highly interpretable model, then optimize and verify its effect quickly with A/B tests.
Make full use of users’ recent behavior and search behavior: recent behavior represents a user’s real-time interest, and search generally represents the user’s content needs, so both help improve business indicators.
No recommendation algorithm is universal. Choose the algorithm according to the scenario, the user and business characteristics, and the results of data analysis.

Key measure 4: content distribution - intelligent search

Another key measure of intelligent distribution is intelligent search. According to usage statistics and heat-map analysis, users pay most attention to the structured card sections and the articles at the top of the search results, and less and less attention further down. Our search optimization therefore focuses on three aspects: 1. intelligent card recall; 2. search recall optimization; 3. search ranking optimization.

Intelligent card recall

For intelligent card recall, we mainly use the FastText model to predict the card category for the user’s search term (a text classification task). The input layer takes the vectors of the words that make up the query, and the output layer is a Softmax layer that outputs the predicted cards and their probabilities.

We also optimized the structure of the hidden layer. The original structure averages the stacked word vectors; this is fast to compute but loses information, so we changed the hidden layer to a fully connected layer over the concatenated embeddings.
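The difference between the two hidden layers can be illustrated with a small forward-pass sketch in plain Python. The dimensions, the random weights, the zero padding and the ReLU activation are all assumptions made for this example; only the two ideas (averaging versus concatenation followed by a fully connected layer) come from the text. One kind of information that averaging discards is word order, which the concatenated form keeps.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def hidden_average(word_vecs):
    """Original FastText-style hidden layer: element-wise mean of the word vectors."""
    return [sum(col) / len(word_vecs) for col in zip(*word_vecs)]


def hidden_concat_fc(word_vecs, max_len, weights, bias):
    """Modified hidden layer: pad or truncate to max_len, concatenate, then apply
    a fully connected layer (the ReLU is an assumption, not from the article)."""
    dim = len(word_vecs[0])
    padded = word_vecs[:max_len] + [[0.0] * dim] * (max_len - len(word_vecs))
    flat = [v for vec in padded for v in vec]
    return [max(0.0, h) for h in linear(flat, weights, bias)]


def random_matrix(rows, cols, rng):
    """Untrained placeholder weights for the sketch."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
query = [[0.2, 0.7], [0.9, 0.1], [0.4, 0.4]]            # toy 2-d embeddings, 3-word query
fc_w, fc_b = random_matrix(4, 3 * 2, rng), [0.0] * 4    # hidden size 4, max_len 3
out_w, out_b = random_matrix(3, 4, rng), [0.0] * 3      # 3 hypothetical card categories

hidden = hidden_concat_fc(query, 3, fc_w, fc_b)
card_probs = softmax(linear(hidden, out_w, out_b))
print(card_probs)
```

The trade-off is that the concatenated layer has more parameters and needs a fixed maximum query length, in exchange for preserving position information.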

Recall optimization based on deep semantic model RNN-attention-DSSM

We use the RNN-attention-DSSM model to optimize search recall. Traditional ES queries recall documents by keyword matching, so documents whose keywords do not match the query but whose meaning does cannot be recalled. The DSSM model uses a DNN to represent Query and Doc as low-dimensional semantic vectors, measures the distance between the two vectors with cosine similarity, and is trained as a semantic similarity model. RNN-attention-DSSM optimizes DSSM further by capturing the context features of sentences through RNN and attention mechanisms.

The RNN-attention-DSSM model is structured as follows: the top layer is a typical DSSM layer, which computes semantic similarity from the vector distances between the query and the positive and negative documents and applies Softmax. The training goal is to maximize the probability of the positive document given the query. The bottom left is a typical GRU network, and the right is a typical self-attention model.

Our training data are as follows: positive samples are the Docs clicked for a Query, negative samples are randomly drawn from the Docs not clicked for that Query, and the ratio of positive to negative samples is 1:4. Query is the user’s query, and Doc is the document title plus the book name.
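The DSSM training objective described above can be written down in a few lines. The sketch below takes already-encoded vectors (the GRU and attention encoders are omitted) and computes the loss for one query with one positive and four negative documents, matching the 1:4 ratio. The smoothing factor `gamma` follows the original DSSM formulation and its value here is an assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dssm_loss(query_vec, pos_vec, neg_vecs, gamma=10.0):
    """Negative log-probability of the clicked doc under a softmax over
    cosine similarities of the positive and the sampled negative docs."""
    sims = [cosine(query_vec, pos_vec)] + [cosine(query_vec, n) for n in neg_vecs]
    logits = [gamma * s for s in sims]
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return -(logits[0] - log_z)


query = [1.0, 0.0]
clicked = [0.9, 0.1]            # semantically close to the query
unclicked = [[0.0, 1.0]] * 4    # four sampled negatives (1:4 ratio)
print(dssm_loss(query, clicked, unclicked))
```

Minimizing this loss over all training queries pulls clicked documents toward the query in the semantic space and pushes unclicked ones away, which is what allows recall without keyword overlap.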

Ranking optimization based on the learning-to-rank algorithm RankNet

We also use the RankNet model to re-rank the search recall results and place highly relevant docs first, improving the accuracy of search results and the user experience. RankNet is a pairwise method: it does not care about the exact relevance value between a given Doc and the Query, but turns the problem of ranking all Docs into the problem of ordering any two Docs. There are three cases: doc_i is more relevant than doc_j, doc_j is more relevant than doc_i, or the two are equally relevant, labeled 1, -1 and 0 respectively.

As shown in the figure above, the RankNet process is as follows: on the left, features are extracted from the user’s query and the recalled articles, a DNN computes a score for each document, and the difference between document scores is computed pair by pair. A Sigmoid function then maps this difference into (0, 1).

On the far right is the click annotation data. We currently use the number of clicks on each document: click counts are compared pair by pair, giving -1 if smaller, 0 if equal and 1 if larger. The comparison value is then linearly rescaled to 0, 0.5 or 1. The goal of training is to bring the value produced by the model as close as possible to the pairwise comparison value of the labeled data, using the cross-entropy loss function.
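The pairwise labeling and loss just described can be sketched as below. The DNN scorer is left out: `score_i` and `score_j` stand for its outputs for two documents. The loss is the standard RankNet cross-entropy, written in a numerically stable form; function names are illustrative.

```python
import math


def pair_label(clicks_i, clicks_j):
    """Compare click counts: 1 if doc i has more clicks, -1 if fewer, 0 if equal."""
    if clicks_i > clicks_j:
        return 1
    if clicks_i < clicks_j:
        return -1
    return 0


def ranknet_loss(score_i, score_j, label):
    """Cross entropy between sigmoid(score_i - score_j) and the rescaled label."""
    target = (1 + label) / 2          # maps -1, 0, 1 to 0, 0.5, 1
    diff = score_i - score_j
    # log(1 + exp(-diff)) computed without overflow for large |diff|
    softplus = math.log1p(math.exp(-abs(diff))) + max(-diff, 0.0)
    return (1 - target) * diff + softplus


label = pair_label(120, 30)                 # doc i was clicked more often
print(ranknet_loss(2.0, 0.5, label))        # model agrees: small loss
print(ranknet_loss(0.5, 2.0, label))        # model disagrees: larger loss
```

Summing this loss over all document pairs recalled for a query gives the training objective; at serving time only the per-document scores are needed to sort the results.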

Our intelligent search has also brought good results. Both intelligent card recall and ranking optimization improved the search click-through rate for TOP1000 and TOP5000.

Next, we plan to further improve the offline metrics of the ranking model, enriching the feature set through business understanding and feature selection to find more relevance-related features. Second, we will distinguish long queries from short queries and train a separate model for short queries to improve ranking accuracy for them. Finally, we will use NLU to mine users’ search intent further and address unclear search intent.

Key measure 5: experience closed loop - sentiment analysis and user-voice classification

Analyzing and improving user experience is an important way to keep improving the product experience. We mainly use NLP technology to analyze user sentiment and to classify and dispatch experience problems. The logical view is as follows:

After internal and external user voices are ingested, the data is deduplicated, cleaned and stored in the database, and NLP capabilities are then used for sentiment analysis and user-voice classification. For negative voices, a public opinion alert is issued promptly, and product experience problems and requirements are tracked to closure through bug tickets and requirement tickets respectively. We also have an operation management platform for public opinion configuration, tracking of key public opinion, sentiment feedback and dashboard data presentation. The model used here is also relatively simple: the bottom layer is a pre-trained BERT model, followed downstream by a classification model.

Finally, our effect data are as follows:

1. The accuracy rate of negative sentiment analysis reached 95%+;

2. The workload of sentiment analysis is greatly reduced, and less manpower is needed;

3. The efficiency of negative emotion processing is improved from hour level to minute level;

4. Based on the classification of experience problems, cloud services were pushed to close out 50+ effective improvement suggestions.

My lessons learned are as follows: 1. Category definitions should be as clear and distinguishable as possible to reduce ambiguity; 2. Annotate the corpus in small, frequent batches and inspect quality by sampling; a batch whose accuracy is below 95% is sent back for re-annotation.
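The sampling quality check in the second point could look like the sketch below. The function name, the sample size of 50 and the `review` callback (a stand-in for a human reviewer returning the correct label) are assumptions for illustration; only the 95% threshold comes from the text. The batch is assumed to be non-empty.

```python
import random


def needs_reannotation(batch, review, sample_size=50, threshold=0.95, seed=0):
    """Sample annotated items and compare them with a reviewer's labels.

    batch:  non-empty list of (text, label) pairs from one annotation batch
    review: callable returning the reviewer's label for a text (hypothetical)
    Returns True when sampled accuracy falls below the threshold.
    """
    rng = random.Random(seed)
    sample = rng.sample(batch, min(sample_size, len(batch)))
    correct = sum(1 for text, label in sample if review(text) == label)
    return correct / len(sample) < threshold


batch = [(f"comment {i}", "negative") for i in range(200)]
print(needs_reannotation(batch, review=lambda text: "negative"))
```

Keeping batches small means a failed check sends back only a small amount of work, which is the point of the "small batch, high frequency" advice.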

Engineering practice summary

Our engineering practice is relatively simple: on the Huawei Cloud ModelArts one-stop development platform we built data processing, model training, model management and deployment capabilities, and with DGC scheduled jobs we built continuous model training and release.

In order to make content operation smarter, we are also working on:

Optimize the accuracy of text classification and information extraction based on the pre-training capability of the Huawei Cloud NLP Pangu model;
Use AI algorithms to generate article content intelligently from Huawei Cloud product keywords and new features;
Build association relationships among Huawei Cloud content based on deep semantic mining and structured information, establish unified life cycle management of content, and build a knowledge graph on those relationships to support intelligent recommendation and search;
Score articles with multi-task learning on page visuals, information content and semantic depth to improve content quality.

Benefits

Now that you know the key measures behind the intelligent practices of the Huawei Cloud official website, do you have takeaways or questions to share? Please leave your questions or thoughts in the comments section of the original article. We will select 3 comments, invite experts for one-on-one discussion with their authors, and send a developer gift package.

This time, two Huawei experts presented a website high-availability guarantee scheme and front-end low-code practices. They also answered developers’ concerns, such as how best to ensure high availability for your site and how to choose a low-code platform.

Finally, the technology-sharing slides presented by Huawei front-end R&D engineer Guo Xiao at this Global Software Conference are attached. Click “Five Key Measures of Huawei’s Cloud Intelligent Practice” at the end of the article to download and view them.
