Author: Mo Epilepsy

1. Business background

The biggest problem facing the Xianyu livestreaming service after its launch was growth. By comparing short-term and long-term viewers, Xianyu BI found that the two groups showed clear, stage-dependent differences in interest. Building on its understanding of livestreams, anchors and users, the business hoped to deliver high-quality head (top-tier) livestreams precisely to users according to their interests, amplifying the Matthew effect among head anchors in order to raise livestream conversion and viewing time.

2. Goals

In brief, two outcomes are required:

  • Launch the precise placement platform within three weeks and build up the infrastructure of a basic operations platform;
  • On the business side, make sure the average conversion UV of head live rooms reaches a set target and the conversion rate improves significantly.

Can the business goals be met simply by using an algorithm model to recommend high-quality livestreams? In reality, you can't make bricks without straw. Because livestreaming had been online only briefly, and the number of broadcasts and views was limited, there were not enough samples to train a model that directly captures users' interest in livestreams. The platform also does not control what anchors broadcast, so the content cannot be structured. It was therefore necessary to combine operations' experience in the livestreaming domain with BI analysis and algorithms: based on an understanding of users, livestreams and live rooms, deliver live rooms to interested crowds and build up platform capability along the way.

3. Implementation scheme

The first step is to understand people, including consumer-side (C-end) users and anchors; the second step is to understand livestreams. The results are eventually associated with page resource slots in the form of interest crowds and anchors, forming a preliminary match between people (users), goods (livestreams) and field (resource slots).



Understanding users relies on user feature data: basic profile features of Xianyu users; records of item-related behavior such as searching, browsing, publishing and trading; interaction features; and user interest tags. These features have no strict real-time requirement. Most are produced by offline computation, and features from different data sources are then normalized offline as well.



All user features are synchronized to the crowd selection platform, where crowds can be selected by intersecting and subtracting feature sets (intersection and difference), then previewed and exported.
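Crowd selection by intersection and difference can be illustrated with plain set operations. This is a minimal sketch, not the platform's actual implementation: the feature names, user IDs and the `select_crowd` helper are all hypothetical.

```python
from typing import Dict, Sequence, Set

# Hypothetical feature index: feature name -> set of user IDs that have it.
FEATURE_USERS: Dict[str, Set[str]] = {
    "interest:sneakers": {"u1", "u2", "u3"},
    "active_last_30d": {"u2", "u3", "u4"},
    "already_follows_anchor": {"u3"},
}


def select_crowd(include: Sequence[str], exclude: Sequence[str] = ()) -> Set[str]:
    """Intersect every `include` feature, then subtract every `exclude` feature."""
    if not include:
        raise ValueError("at least one feature to include is required")
    crowd = set(FEATURE_USERS[include[0]])  # copy so the index is not mutated
    for name in include[1:]:
        crowd &= FEATURE_USERS[name]        # intersection
    for name in exclude:
        crowd -= FEATURE_USERS[name]        # difference
    return crowd


crowd = select_crowd(
    include=["interest:sneakers", "active_last_30d"],
    exclude=["already_follows_anchor"],
)
preview = sorted(crowd)[:10]  # a small preview before exporting the crowd
```

A real platform would run these operations over far larger user sets, typically in the data warehouse rather than in memory.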



Overall platform design



The selected crowd data is stored offline in a mapping table from userId to crowd ID. Joining it with the placement configuration yields the <user, resource slot, anchor> association. This relational data is then synchronized to the graph database Igraph, where the algorithm queries a user's associated livestreams at online recommendation time, so that recommendation and exposure follow interest. Because the overall exposure traffic quota is limited, the algorithm uses its model to choose the better live rooms among those currently online within the limited PV quota.
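The join described above can be sketched in a few lines. This is an illustration under stated assumptions: the `Placement` fields, the sample IDs and the in-memory lookup are hypothetical, and the lookup only stands in for a graph database query; it does not use any Igraph API.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set, Tuple


class Placement(NamedTuple):
    """One row of the (hypothetical) placement configuration."""
    crowd_id: str
    slot_id: str    # resource slot on a page
    anchor_id: str


def build_relations(
    user_crowds: List[Tuple[str, str]],  # (user_id, crowd_id) mapping rows
    placements: List[Placement],
) -> Set[Tuple[str, str, str]]:
    """Join the user-crowd mapping with placements on crowd_id."""
    by_crowd: Dict[str, List[Placement]] = defaultdict(list)
    for p in placements:
        by_crowd[p.crowd_id].append(p)
    relations = set()
    for user_id, crowd_id in user_crowds:
        for p in by_crowd.get(crowd_id, []):
            relations.add((user_id, p.slot_id, p.anchor_id))
    return relations


def candidates(relations: Set[Tuple[str, str, str]], user_id: str, slot_id: str) -> List[str]:
    """Anchors associated with a user for one slot (stand-in for the online query)."""
    return sorted(a for u, s, a in relations if u == user_id and s == slot_id)


user_crowds = [("u1", "c_sneakers"), ("u2", "c_sneakers"), ("u2", "c_cameras")]
placements = [
    Placement("c_sneakers", "home_feed", "anchor_A"),
    Placement("c_cameras", "home_feed", "anchor_B"),
]
relations = build_relations(user_crowds, placements)
u2_anchors = candidates(relations, "u2", "home_feed")
```

The candidate list is only the input to ranking: choosing which live rooms actually get exposure within the PV quota remains the algorithm's job.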



How are user understanding and live room placement implemented?

User understanding

Producing conventional user features is not difficult. User interest tags, however, had to be built from scratch for Xianyu users to fill a gap in this capability. An interest tag is essentially found by measuring the correlation between the behavioral text generated by a user's historical behavior and the phrases that define a domain tag. The behavioral text covers the items and posts shown in the figure, and more data is being added gradually.



Operations staff sort out the keywords and phrases for each domain as input, and these are matched to users with a high degree of correlation to produce domain tag features. Producing interest tags requires solving three problems: storage, retrieval and relevance calculation.



Interest tag output (Scheme 1)



As shown in the figure, scheme 1 is the initial scheme, and the overall process is as follows:

  • Keyword structuring: BI colleagues process the behavioral text details, including data source normalization, deduplication and word segmentation with a UDF, and compute scores from keyword frequency and preset weights. The output is structured detail records of user behavior text, each containing a user ID, an entity ID, a keyword list and the corresponding list of keyword scores;
  • DSL-based tagging rules: the industry interest key phrases entered by operations staff are segmented and converted into DSL statements that the database can execute;
  • Interested user dump: execute the DSL to retrieve the structured behavior text that matches the input keywords, deduplicate the users, and associate them with the interest tag;
  • Crowd selection: intersect and subtract user interest tags and other feature data to export the final crowd. This step runs on the second-party crowd selection platform.
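The first three steps can be sketched roughly as follows. Everything here is an assumption made for illustration: the source only says scores come from keyword frequency and preset weights, so the product `frequency * weight`, the behavior weights, the `min_score` threshold and all names are hypothetical, and the executable DSL is reduced to a simple keyword-set match.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Set


class BehaviorRecord(NamedTuple):
    """Structured behavior text: one record per (user, entity)."""
    user_id: str
    entity_id: str
    keywords: List[str]
    scores: List[float]  # scores[i] belongs to keywords[i]


# Hypothetical preset weights per behavior type.
BEHAVIOR_WEIGHT = {"search": 1.0, "browse": 0.5, "publish": 2.0, "trade": 3.0}


def structure(user_id: str, entity_id: str, behavior: str, tokens: List[str]) -> BehaviorRecord:
    """Turn already-segmented tokens into keywords scored as frequency * weight (assumed formula)."""
    freq = Counter(tokens)
    weight = BEHAVIOR_WEIGHT[behavior]
    keywords = sorted(freq)
    return BehaviorRecord(user_id, entity_id, keywords, [freq[k] * weight for k in keywords])


def interested_users(records: Iterable[BehaviorRecord], tag_keywords: Set[str], min_score: float) -> Set[str]:
    """Users with at least one record whose matched keywords score >= min_score.

    Returning a set deduplicates users who match through several records.
    """
    users = set()
    for r in records:
        matched = sum(s for k, s in zip(r.keywords, r.scores) if k in tag_keywords)
        if matched >= min_score:
            users.add(r.user_id)
    return users


records = [
    structure("u1", "item_1", "search", ["sneakers", "sneakers", "limited"]),
    structure("u2", "item_2", "browse", ["sneakers", "camera"]),
    structure("u1", "item_3", "trade", ["sneakers"]),
]
sneaker_fans = interested_users(records, {"sneakers", "jordan"}, min_score=1.0)
```

In this sample, `u1` matches through two records but appears once in the result, which is the deduplication step; `u2` falls below the threshold.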

The scheme as a whole is feasible and flexible. The offline part can keep improving and enriching the structured behavior text, while the engineering side focuses on visualizing the DSL and refining the end-to-end data flow. In practice, however, this scheme was hard to land, mainly for the following reasons:

  • The schedule is short: every function in the link must go online within 2 to 3 weeks to support business verification, which makes this scheme almost impossible to deliver in time;
  • The storage cost is huge: an estimated 30 PB of online storage would be needed, which is impossible to request for a business whose value has not yet been verified.


Some readers may quickly notice that the process from text structuring to retrieving users with specific interests looks like exactly the kind of scenario a search engine can handle. The biggest obstacle is still budget: building a search engine is not cheap, and dumping a large amount of data out of a search engine has serious performance problems. BI colleagues would also be unable to optimize the process end to end.



Search engine basic flow



The online scheme is the ideal: it lets operations staff associate interest tags and select crowds using their own industry experience. Given the objective constraints above, we finally chose to associate users with interest tags offline, bring a first batch of interest tags online quickly, and then move toward the online solution step by step. Thanks to the BI colleagues' all-round abilities, they built an "offline search engine" and produced some user interest tags in advance. The overall scheme (scheme 2) looks like this:

  • Process unstructured text offline, and obtain structured text through deduplication, word segmentation and algorithmic processing (the same step as in scheme 1);
  • Organize the keyword phrases associated with each domain tag;
  • Use offline computation to retrieve the users who match the keyword phrases.


The biggest disadvantage of scheme 2 is that it is less general than scheme 1: producing each interest tag requires BI development, and freshness is limited to T+1. It also has advantages, such as low offline storage cost and the ability to use customized, complex UDFs in offline computation. For more detail on the offline part, see the data team's introduction to the implementation of its interest labeling system.



Interest tag output (Scheme 2)

Placement implementation

Placement is divided into offline and online parts. The placement configuration maintained by operations staff is stored in an RDB (relational database) and needs to be synchronized to the data warehouse. Offline computation then associates users with the anchors they are interested in, forming the <user, list of anchors of interest> relationship. The associated data is synchronized to the online graph database and provided to the algorithm for recommending among anchors of interest. The whole data link has to flow automatically and as quickly as possible:

  • The online configuration cannot be synchronized offline in real time. For now the synchronization is scheduled once per hour, which meets the timeliness requirement;
  • Offline tasks are triggered by their upstream task dependencies, which basically meets the near-real-time requirement. Each full update of the "user-anchor interest relationship" table adds a new partition, together with a done partition carrying the same timestamp;
  • Offline data is synchronized to the online graph database through a data exchange component. The done partitions of the offline table are checked periodically; when a new done partition appears, the full data of the matching partition is loaded into the database through a synchronization message mechanism.
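The done-partition handshake in the last two items can be sketched as a small polling step. This is a sketch under assumptions, not the data exchange component itself: partition names are assumed to be sortable timestamp strings, and `list_done`, `load_partition` and `write_full` are hypothetical callbacks standing in for the warehouse and the graph database.

```python
from typing import Callable, Dict, Iterable, List, Optional

Rows = Dict[str, List[str]]  # user_id -> list of anchors of interest


def latest_done_partition(done: Iterable[str], last_synced: Optional[str]) -> Optional[str]:
    """Newest done partition that is newer than the last synced one, else None."""
    newest = max(done, default=None)
    if newest is None or (last_synced is not None and newest <= last_synced):
        return None
    return newest


def sync_once(
    list_done: Callable[[], Iterable[str]],
    load_partition: Callable[[str], Rows],
    write_full: Callable[[Rows], None],
    last_synced: Optional[str],
) -> Optional[str]:
    """One polling round: fully replace online data if a new done partition exists."""
    partition = latest_done_partition(list_done(), last_synced)
    if partition is None:
        return last_synced             # nothing new; keep serving current data
    write_full(load_partition(partition))
    return partition


# In-memory stand-ins for the offline table and the online store.
offline: Dict[str, Rows] = {
    "2020010110": {"u1": ["anchor_A"]},
    "2020010111": {"u1": ["anchor_A", "anchor_B"]},
}
done_partitions = ["2020010110"]
online: Rows = {}


def write_full(rows: Rows) -> None:
    online.clear()
    online.update(rows)


first = sync_once(lambda: done_partitions, offline.__getitem__, write_full, None)
second = sync_once(lambda: done_partitions, offline.__getitem__, write_full, first)
done_partitions.append("2020010111")  # the offline job finished a new full update
third = sync_once(lambda: done_partitions, offline.__getitem__, write_full, second)
```

Writing the done marker only after the data partition is complete is what keeps the online side from loading a half-written partition.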

4. Homepage results

In less than three weeks, the end-to-end platform was built and launched, and operations staff can complete crowd selection and placement configuration within minutes. After a pilot placement of head livestreams for some domains on the home page, the effect was clear:

  • For every head live room, click UV far exceeded the target;
  • Compared with the overall baseline, the PV and UV click conversion rates improved significantly in most domains, with the largest gain being several-fold.

5. Outlook

Because the project schedule was short, we implemented a minimal set of interest-based livestream placement features to support quick verification and obtain good feedback and results. Building on this prototype, its capabilities will be improved and enriched step by step:

  • Beyond integrating BI interest tags, keep extending the ability to integrate interest tags and feature data from other dimensions, while enabling operations staff to produce general-purpose interest tags and other features themselves;
  • Support richer resource slot placement, along with multi-dimensional A/B schemes and general report analysis across multiple metrics, so that more businesses can try, get feedback and adjust quickly;
  • Consolidate and abstract the core links so that they are not limited to livestreaming but become a platform that supports more community and non-community businesses. At the same time, building on the understanding of user interests, better support content understanding and content structuring, so that users and the content they are interested in can be operated at low cost.
