Reply with the keyword “Flink” in the official account mangodata to get Flink learning materials and videos.

Production practice | Flink + live streaming (2): how to build a real-time public portrait dimension table?

Each article in this series comes from a real production requirement and aims to help readers solve problems they meet in production. This article walks through the whole process of building the real-time portrait dimension table for broadcast rooms. If it helps you, likes and shares are welcome ~

The technical architecture

Recall the “technical architecture” diagram from the previous article.

The technical architecture

The overall architecture is easy to follow: data flows from the data sources, through data processing, and finally to the data sinks.

The part that tends to raise questions is the construction of the three dimension tables: the “anchor user portrait dimension table, audience user portrait dimension table, and broadcast room portrait dimension table”.

As before, we start from the following questions and answer them by analyzing the scenario, which leads into how the three dimension tables above are built.

Questions

  • WHAT: What is a live real-time public portrait dimension table? What is an offline public portrait dimension table? How do they differ?
  • WHY: Why are the three public portrait dimension tables in the architecture diagram split into real-time and offline ones? Why build a real-time public portrait dimension table at all; can an offline one not meet the demand?
  • HOW: How do we build a real-time public portrait dimension table that meets the requirements of live real-time data?
  • WHO: Which components are needed to build the live public portrait dimension tables, and why were they chosen?

WHAT: Real-time & offline public portrait dimension table?

Concept

Briefly, a “real-time & offline public portrait dimension table” stores the inherent attributes of an entity (such as a user's age). As I understand it, the two terms are high-level abstractions, and the “anchor user portrait dimension table, audience user portrait dimension table, and broadcast room portrait dimension table” introduced in this article are concrete implementations of them.

Articles by other experts explain “real-time public portrait dimension table” and “offline public portrait dimension table” in more depth. Here I only describe my own understanding, formed while building live real-time data ~

The difference

The difference is largely visible in the names. The biggest differences between a real-time and an offline public portrait dimension table are the timeliness with which the data is built and the scenarios in which it is applied.

Offline public portrait dimension table

Features:

  • Scenario: offline scenarios with weak timeliness requirements; provides portrait dimension filling or tagging for indicators
  • Build: generally built offline in a T + 1 manner
  • Application: consumers use offline T + 1 data
  • Example: the user portrait dimension table in the data warehouse, which provides portrait data to the application layer, for example to report UV broken down by age group in addition to total UV

Real-time public portrait dimension table

Features:

  • Scenario: real-time scenarios with strong timeliness requirements; provides portrait dimension filling or tagging for indicators
  • Build: built in real time, usually with a delay of seconds
  • Application: consumers use the data as it is built, so it must be queryable in real time (with second-level delay)
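To make “dimension filling” concrete, here is a minimal sketch in plain Java. An in-memory map stands in for the portrait dimension table, and every name in it (`DimensionFillSketch`, `uvByAgeGroup`, the `"unknown"` bucket) is an illustrative assumption rather than part of the production system described in this article.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class DimensionFillSketch {

    // Breaks UV down by age group by looking each viewer up in a portrait dimension table.
    // Viewers without a portrait entry fall into the assumed "unknown" bucket.
    static Map<String, Integer> uvByAgeGroup(List<String> viewerIds, Map<String, String> ageGroupByUser) {
        Set<String> distinctViewers = new HashSet<>(viewerIds); // UV counts each viewer once
        Map<String, Integer> uv = new TreeMap<>();
        for (String viewerId : distinctViewers) {
            String ageGroup = ageGroupByUser.getOrDefault(viewerId, "unknown");
            uv.merge(ageGroup, 1, Integer::sum);
        }
        return uv;
    }

    public static void main(String[] args) {
        Map<String, String> portrait = new HashMap<>();
        portrait.put("u1", "18-23");
        portrait.put("u2", "24-30");
        portrait.put("u3", "18-23");
        List<String> viewers = Arrays.asList("u1", "u2", "u3", "u4", "u1");
        System.out.println(uvByAgeGroup(viewers, portrait)); // prints {18-23=2, 24-30=1, unknown=1}
    }
}
```

Whether the table behind the lookup was built offline or in real time does not change this step; it only changes how fresh the looked-up attributes are.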

WHY: Build a real-time public portrait dimension table?

Why are the three public portrait dimension tables in the architecture diagram split into real-time and offline ones? Why build a real-time public portrait dimension table; can an offline one not meet the demand?

In fact, these questions can be answered based on the construction of our live broadcast real-time data and the application scenarios.

According to the technical architecture diagram in the previous part, the public dimension table to be constructed for live real-time data is divided into the following three categories:

  • “Broadcast room portrait dimension table”: the broadcast category, broadcasting client, title, broadcast address, and other information of a broadcast room
  • “Anchor portrait dimension table”: the anchor's name, category, gender, age group, etc.
  • “Audience portrait dimension table”: the viewer's gender, age, etc.

Broadcast room portrait dimension table

Conclusion first: “The broadcast room portrait consists of inherent attributes of the broadcast room, and the broadcast room portrait dimension table must be built in real time.”

Most live broadcasts last on the order of hours. When a broadcast starts, audience interaction begins and production-side and consumption-side indicators start to be produced. When it ends, the interaction between anchor and audience ends and those indicators stop being produced. The value of the broadcast room portrait as a dimension table for other indicators therefore disappears quickly, which is why the application scenario of the broadcast room portrait (title, broadcast address) is characterized by “strong timeliness”. For building and applying live production and consumption indicators, the broadcast room portrait dimension table must support real-time construction and real-time lookup.

Anchor & audience user portrait dimension tables

Conclusion: “These portraits are inherent attributes of users, not of broadcast rooms, and are not strongly correlated with broadcast rooms. The anchor and audience user portrait dimension tables can be built offline.”

Whether a broadcast room is open or closed, and whatever production and consumption happens during a broadcast, the portraits of the anchor and the audience stay essentially unchanged. (For example, once a user's age group has been determined as 18-23, it will in most cases not change even if the user hosts 10 broadcasts or watches 10 broadcasts.) Therefore, for building and applying live production and consumption indicators, it is enough for the anchor and audience user portrait dimension tables to be built offline at T + 1 and served for real-time lookup.

Note:

Anchor and audience user portraits need to be inferred with machine learning from users' production and consumption behavior and other information. Many scenarios, such as real-time personalized recommendation, do build such portraits in real time. The live real-time data described in this article, however, has weak timeliness requirements for these two kinds of portraits, so they are built offline.

HOW to build it? With what?

Broadcast room life cycle & data flow

The whole life cycle of the broadcast room is shown in the figure.

The life cycle
  • 1. The anchor creates a broadcast room, and the room goes live;
  • 2. Viewers enter the broadcast room and interact with the anchor;
  • 3. Finally, the anchor closes the broadcast room, which ends its life cycle.

Broadcast room portrait dimension table – real time

Construction of the real-time portrait dimension table. The “red” text in the figure above marks where the real-time portrait dimension table is built and applied.

Real-time data flow of the broadcast room portrait

  • 1. When the anchor starts broadcasting, the broadcast room produces its portrait information, which can be written to the real-time broadcast room portrait dimension table immediately. Production-side real-time indicators can be built at the same time, with their dimensions filled from the “real-time broadcast room portrait dimension table + offline anchor & audience portrait dimension tables”;
  • 2. When viewers enter the broadcast room and interact with the anchor, they produce a series of consumption behaviors from which consumption-side real-time indicators can be built. Their dimensions are filled from the same “real-time broadcast room portrait dimension table + offline anchor & audience portrait dimension tables”;
  • 3. When the anchor ends the broadcast, the broadcast room's portrait can be deleted from the real-time broadcast room portrait dimension table.
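The three steps above can be sketched as a small event handler. This is only an illustration in plain Java: a `HashMap` stands in for the real-time dimension table, and the names `RoomLifecycleSketch`, `EventType`, `onEvent`, and `lookup` are hypothetical. It deletes the entry when the broadcast ends, as step 3 describes.

```java
import java.util.HashMap;
import java.util.Map;

class RoomLifecycleSketch {

    enum EventType { START, INTERACT, END }

    // In-memory stand-in for the real-time dimension table: roomId -> portrait fields.
    private final Map<String, Map<String, String>> roomDim = new HashMap<>();

    void onEvent(EventType type, String roomId, Map<String, String> portrait) {
        switch (type) {
            case START:
                // Step 1: the room goes live, so its portrait is written immediately.
                roomDim.put(roomId, new HashMap<>(portrait));
                break;
            case END:
                // Step 3: the broadcast is over, so the portrait is removed.
                roomDim.remove(roomId);
                break;
            default:
                // Step 2: interaction events only read the table (see lookup).
                break;
        }
    }

    // Returns the portrait of a live room, or null if the room is not in the table.
    Map<String, String> lookup(String roomId) {
        return roomDim.get(roomId);
    }

    public static void main(String[] args) {
        RoomLifecycleSketch table = new RoomLifecycleSketch();
        Map<String, String> portrait = new HashMap<>();
        portrait.put("title", "demo room");
        table.onEvent(EventType.START, "room-1", portrait);
        System.out.println(table.lookup("room-1"));
        table.onEvent(EventType.END, "room-1", null);
        System.out.println(table.lookup("room-1"));
    }
}
```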

Component selection

From the analysis above, the requirements for building the real-time broadcast room portrait dimension table are:

  • Real-time portrait: must support real-time construction and real-time access;
  • Real-time portrait: the consumers are real-time indicator jobs, so request response time must be low;
  • Public portrait: must serve lookups from many high-traffic production-side and consumption-side real-time jobs, that is, provide a high-QPS portrait data service;
  • Public portrait: high stability.

Component selection therefore naturally falls into the cache category. After comparing candidate solutions, we chose Redis as the storage engine for the real-time dimension table.

A Redis hash is used as the storage structure of the dimension table. The storage design of the broadcast room portrait dimensions is shown in the figure below.

Dimension storage

Flink real-time dimension table construction code example

import com.google.common.collect.ImmutableMap;
import lombok.Data;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// SourceFactory and RedisConfig are project-specific helpers that are not shown in this article.
public class LiveStreamRealtimeDimBuilderJob {

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<byte[]> source = SourceFactory.getSourceDataStream();

        source.process(new ProcessFunction<byte[], String>() {
            @Override
            public void processElement(byte[] bytes, Context context, Collector<String> collector) throws Exception {
                CommonModel c = CommonModel.parseFrom(bytes);
                if (c.isStartLiveStream()) {
                    // Broadcast start: write the room portrait as a hash keyed by the room id
                    RedisConfig
                            .get()
                            .hmset(c.getLiveStreamId()
                                    , ImmutableMap.<String, String>builder()
                                            .put("type", c.getType())
                                            .put("client", c.getClient())
                                            .put("title", c.getTitle())
                                            .put("address", c.getAddress())
                                            .build()
                            );
                    // Keep the entry for at most 30 days while the room is live
                    RedisConfig
                            .get()
                            .expire(c.getLiveStreamId(), 30 * 24 * 60 * 60);
                } else if (c.isEndLiveStream()) {
                    // Broadcast end: shorten the expiry to 2 days
                    RedisConfig
                            .get()
                            .expire(c.getLiveStreamId(), 2 * 24 * 60 * 60);
                }
            }
        });

        env.execute();
    }

    @Data
    public static class CommonModel {
        private String liveStreamId; // Broadcast room id
        private String type;         // Broadcast room type
        private String client;       // Client used to start the broadcast
        private String title;        // Broadcast room title
        private String address;      // Broadcast room address

        public static CommonModel parseFrom(byte[] bytes) {
            // Placeholder: implement according to business logic
            return null;
        }

        public boolean isStartLiveStream() {
            // Placeholder: implement according to business logic
            return false;
        }

        public boolean isEndLiveStream() {
            // Placeholder: implement according to business logic
            return false;
        }
    }
}
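The job above sets a 30-day expiry when a broadcast starts and shortens it to 2 days when the broadcast ends, rather than deleting the key right away. The following plain-Java sketch imitates that behavior with an in-memory map and an explicit clock, so the effect on lookups is easy to see. `ExpirySketch` and its methods are hypothetical names; the only Redis behavior assumed is that `EXPIRE` replaces the remaining time to live of an existing key and does nothing for a missing key.

```java
import java.util.HashMap;
import java.util.Map;

class ExpirySketch {

    static final long START_TTL_SECONDS = 30L * 24 * 60 * 60; // same value as in the job above
    static final long END_TTL_SECONDS = 2L * 24 * 60 * 60;    // same value as in the job above

    // roomId -> epoch second at which the entry expires (in-memory stand-in for Redis TTLs).
    private final Map<String, Long> expireAt = new HashMap<>();

    void onStart(String roomId, long nowSeconds) {
        expireAt.put(roomId, nowSeconds + START_TTL_SECONDS);
    }

    // Mirrors EXPIRE: only an existing, unexpired key gets a new deadline, replacing the old one.
    void onEnd(String roomId, long nowSeconds) {
        if (isAlive(roomId, nowSeconds)) {
            expireAt.put(roomId, nowSeconds + END_TTL_SECONDS);
        }
    }

    boolean isAlive(String roomId, long nowSeconds) {
        Long deadline = expireAt.get(roomId);
        return deadline != null && nowSeconds < deadline;
    }

    public static void main(String[] args) {
        ExpirySketch store = new ExpirySketch();
        store.onStart("room-1", 0);
        store.onEnd("room-1", 3600); // the broadcast ends after one hour
        System.out.println(store.isAlive("room-1", 3600 + END_TTL_SECONDS - 1));
        System.out.println(store.isAlive("room-1", 3600 + END_TTL_SECONDS));
    }
}
```

One plausible reason for keeping the entry for a while after the broadcast ends is that late-arriving events can still be enriched, although the article does not state the motivation.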

Anchor & audience user portrait dimension tables – offline

Construction of the offline portrait dimension tables. They mainly contain user portrait information such as gender and age for anchors and viewers. The “blue” text in the following figure marks where the offline portrait dimension tables are applied.

The life cycle

Anchor and audience portrait data flow

When real-time data is produced on the production side and consumption side of a broadcast room, the anchor and audience portraits are used to fill the portrait dimensions.

Storage component

The storage component chosen for the offline portrait dimension tables is the same as for the real-time one: Redis. Portrait information is likewise stored as a Redis hash.

Portrait data is built and synchronized in a T + 1 manner: once the anchor and audience user portraits have been built, they are synchronized to the Redis cache.
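As a rough illustration of the synchronization step, the sketch below converts rows from an offline export into per-user hash fields that could then be written to Redis. The row format `userId,gender,ageGroup`, the field names `gender` and `age_group`, and the class `OfflinePortraitSyncSketch` are all assumptions made for the example; the article does not describe the actual export format or key design.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OfflinePortraitSyncSketch {

    // Converts exported rows of the assumed form "userId,gender,ageGroup" into userId -> hash fields.
    static Map<String, Map<String, String>> toHashes(String[] rows) {
        Map<String, Map<String, String>> hashes = new LinkedHashMap<>();
        for (String row : rows) {
            String[] cols = row.split(",", -1);
            if (cols.length != 3 || cols[0].isEmpty()) {
                continue; // skip malformed rows instead of failing the whole sync
            }
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("gender", cols[1]);
            fields.put("age_group", cols[2]);
            hashes.put(cols[0], fields);
        }
        return hashes;
    }

    public static void main(String[] args) {
        String[] rows = {"u1,F,18-23", "malformed row", "u2,M,24-30"};
        // Each entry would become one hash write (user id as key) in the T + 1 sync job.
        toHashes(rows).forEach((userId, fields) -> System.out.println(userId + " -> " + fields));
    }
}
```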

Conclusion

This article continues from the previous one and mainly introduces how the real-time broadcast room portrait dimension table is built. It raises several questions about the construction, and the three sections that follow grew out of them.

The first section briefly introduces the concepts of real-time and offline public portrait dimension tables.

The second section explains, from the perspective of data application scenarios, why a real-time public portrait dimension table is needed.

The third section introduces the construction process and the detailed technical solution of the real-time portrait dimension table.

The last section summarizes the article.

If you have also built a real-time portrait dimension table or have similar needs, feel free to leave a comment or a link to your own article so we can exchange ideas ~