
takeaway


Based on the experience of academia and industry, iQiyi has designed and explored a deep semantic representation learning framework suitable for a variety of business scenarios. It has been put online for recall, ranking, de-duplication, diversity control, semantic matching, clustering and other scenarios in recommendation, search, live broadcasting and other businesses, improving the richness and diversity of video recommendations as well as users' viewing and search experience.

This article introduces the core design ideas and practical experience behind iQiyi's deep semantic representation framework.

background

British linguist J.R. Firth said in 1957: "You shall know a word by the company it keeps." Based on this idea, Hinton first proposed the distributed representation in 1986: words with similar contexts often have similar semantics, and "distributed" means that a word's semantics are spread across the components of its vector. This method maps words into a continuous real-valued vector space in which similar words occupy nearby positions; a typical representative work is the Neural Network Language Model (NNLM)[1] from 2003. In 2013, Google proposed the word2vec[2] algorithm for learning word embeddings (word vectors), which made distributed representations truly recognized by academia and industry and opened a new era of embeddings in NLP.

In the information-flow era where everything is embedded, embeddings can map text, images, videos, audio, users and other entities from high-dimensional, sparse, discrete representations to low-dimensional, dense, continuous semantic representations that place similar entities close together. They can be used to measure the semantic relevance between different entities, serve as semantic features for deep models, or be trained offline, and they are widely applied in recommendation, search and other business scenarios, such as recall, ranking, de-duplication and diversity control in recommendation, and semantic recall, semantic relevance matching and related search in search.

Compared with traditional embedding models, deep semantic representation deeply integrates rich side information (e.g., multi-modal information, knowledge graphs, meta information) with deep models (e.g., Transformer[3], graph convolutional networks[4]) to learn entity embeddings with good generalization and semantic expressiveness. It provides rich semantic features for downstream business models, solves the cold-start problem to a certain extent, and has become a sharp tool for improving the performance of search and recommendation systems.

iQiyi has designed and explored this deep semantic representation learning framework, applicable to its various business scenarios, and has successfully launched it in multiple recommendation business lines and in search. It has been applied to seven kinds of scenarios (recall, ranking, de-duplication, diversity, clustering, semantic matching, etc.) across 15 businesses including short & mini video recommendation, graphic information feeds, search recall and live broadcasting, with multiple A/B experiments completed and full-traffic launches. In short & mini video and graphic feed recommendation scenarios, per-capita consumption time increased by more than five minutes, and the accuracy of search semantic relevance improved by more than 6% over the baseline single feature.

Challenges:

The traditional embedding learning model mainly constructs its training set from node sequences, either observed directly or generated by random walks over a graph structure. Each node in a sequence is encoded as an independent ID, and shallow networks (e.g., item2vec[6], node2vec[7]) are then used to learn node embeddings. Such models can only obtain shallow semantic representations for nodes that appear in the training corpus; they cannot infer embeddings for new nodes, so they cannot solve the cold-start problem and generalize poorly. When applied to iQiyi's business scenarios, the traditional embedding learning model faces the following problems:

1. Diverse types and relationships of embedding entities

The traditional embedding model often treats every item in a sequence as a node of the same type, with a single type of relationship between nodes. The user behavior data of each iQiyi business line, however, often contains multiple types of nodes, for example text (at word, sentence, paragraph and document level), images, graphic posts, videos (e.g., long, short and mini videos), users (e.g., uploaders, actors, directors, characters), circles (e.g., Paopao communities), queries, and so on. Different types of nodes have different relationships: relationships between nodes in a user behavior sequence include click, favorite, reservation, search and follow, while relationships between nodes in the video graph include directing, screenwriting, collaboration and acting, etc.

2. Rich Side information

The traditional embedding model usually adopts a shallow network (such as a 3-layer DNN or LSTM) with weak feature-extraction ability. In addition, when an item is represented by an independent ID, the rich side information and multi-modal information of the item are not taken into account, and only a superficial semantic representation of the item can be learned. Items in each iQiyi business have rich multi-modal information (such as text, image, video and audio) and various kinds of meta information (such as video type, subject matter and actor attributes). Effectively and fully using this rich side information and fusing the multi-modal features is essential for understanding the deeper semantics of items.

3. Various business scenarios

Embeddings can be used for recall, ranking, de-duplication, diversity control and user-profile modeling in recommendation; for semantic recall, ranking, video clustering and related search in search; and as semantic features for various downstream tasks. Different business scenarios often require different types of embeddings.

· Recommendation recall scenarios:

1) Recall based on behavior-based embedding models skews toward popular content and performs well;

2) Recall based on content-based embedding models skews toward relevance, which is more helpful for related-recommendation scenarios and the cold start of new content;

3) Embedding models based on both behavior and content sit between the former two and can balance relevance and effectiveness.

· Ranking scenario:

The latter two types of embedding models are often used, because real-time embedding features of unseen nodes can be obtained from the trained model and the node's content.

· Diversity control:

Embedding models based on the raw content representation usually perform better for de-duplication and diversity scattering.

Deep semantic representation learning:

Building on the traditional embedding learning model, we introduce the rich side information of nodes (multi-modal information and meta information) and the heterogeneity of node and edge types, effectively fuse the multi-modal features, and replace shallow models with deep models that have stronger feature-extraction ability, so that deep semantic representations of nodes can be learned.

According to iQiyi's business scenarios and data characteristics, we designed a deep semantic representation learning framework (Figure 1) that meets the existing business scenarios. The framework consists of four layers: data layer, feature layer, policy layer and application layer. The deep semantic representation models in the feature layer and policy layer are introduced in detail below.

  • Data layer: mainly collects users' behavior data to construct node sequences and graphs and build the training data for the embedding models.
  • Feature layer: it is mainly used to extract and fuse various modal features (text, image, audio, video, etc.) as the initial semantic representation of input in the deep semantic representation model;
  • Policy layer: provide rich deep semantic representation models and evaluation methods to meet different business scenarios;
  • Application layer: it provides embedding feature, neighbor and correlation computing services for various scenes of downstream business lines.

Figure 1. The deep semantic representation learning framework

Feature extraction and fusion:

In the field of natural language processing (NLP), pre-trained language models (such as BERT[8]) can make full use of massive unlabeled corpora to learn the latent semantics of text, and have set new state-of-the-art results on various NLP tasks. iQiyi's business scenarios cover search, recommendation, advertising and intelligent creation for video and graphic content; in addition to text (titles, descriptions, etc.), a deep understanding of image, video, audio and other modalities is also required.

1.1. Multi-modal feature extraction

Drawing on the idea of pre-trained language models, we try to learn general pre-trained semantic representations of text at different granularities (query, sentence, paragraph, document), as well as of images, audio and video, using large-scale unlabeled video and graphic corpora, providing initial semantic representations for the subsequent deep semantic representation models.

  • Text semantic features: According to text length, text semantic features can be extracted at four levels:
  1. Word level, such as user search queries, usually 2 to 6 characters long.
  2. Sentence-level, such as video & comic titles and descriptions, biographies, artist profiles, etc.
  3. Paragraph-level, such as descriptions of movies and TV series and snippets of scripts, etc.
  4. Document-level, such as long texts such as plays and novels.

Because existing pre-trained language models are limited in handling long text, different schemes are needed for the different levels of text. On the one hand, topic-granularity semantic features are learned by combining a topic model[10] with ALBERT[9]; on the other hand, based on ALBERT, token-level semantics are composed into paragraph- and document-level semantic features with methods such as WME[11] and CPTW[12].
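To make the composition step concrete, here is a minimal sketch of pooling ALBERT token outputs into sentence- and document-level vectors. It uses a public HuggingFace checkpoint as a stand-in for the internal Chinese model and plain averaging in place of WME/CPTW-style weighting, so it only illustrates the idea:

```python
# Minimal sketch: pool ALBERT token embeddings into sentence/document vectors.
# "albert-base-v2" is a public placeholder checkpoint, not the production model.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "albert-base-v2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)

def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pool token embeddings of one sentence (truncated to 512 tokens)."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state      # [1, seq_len, dim]
    mask = inputs["attention_mask"].unsqueeze(-1)         # [1, seq_len, 1]
    return (hidden * mask).sum(1) / mask.sum(1)           # [1, dim]

def document_embedding(sentences: list[str]) -> torch.Tensor:
    """Compose a long text from its sentence embeddings (simple average here;
    WME/CPTW-style term weighting would replace the plain mean)."""
    return torch.cat([sentence_embedding(s) for s in sentences]).mean(0)
```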

  • Image semantic features: For video cover images, video frames, film and TV stills, artist photos and cartoon images, a state-of-the-art ImageNet pre-trained classification model (e.g., EfficientNet[13]) is used to extract basic semantic representations, and self-supervised representation learning (e.g., Selfie[14]) is adopted to learn better image representations.
  • Audio and video semantic features: For the audio track of a video, the Vggish[15] model pre-trained on the YouTube AudioSet data is used to extract a 128-dimensional semantic feature vector from the audio waveform as the audio representation. For the semantic modeling of video content, we choose a simple and efficient method commonly used in industry: only the key-frame sequence is used to represent the video content, and the video-level semantic feature is obtained by fusing the image-level semantic features of the key frames.
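As a rough illustration of that key-frame fusion step, the sketch below pools per-frame image embeddings (e.g., from EfficientNet) into a single video-level vector. The function name and the optional frame weights are illustrative, not iQiyi's internal API:

```python
# Minimal sketch: represent a video by pooling image-level features of its key frames.
import numpy as np

def video_embedding(frame_features, weights=None):
    """frame_features: [num_keyframes, dim] array of image-level embeddings.
    weights: optional per-frame importance weights. Returns one video-level vector."""
    frame_features = np.asarray(frame_features, dtype=np.float32)
    if weights is None:
        pooled = frame_features.mean(axis=0)                 # plain average pooling
    else:
        weights = np.asarray(weights, dtype=np.float32)
        weights = weights / weights.sum()
        pooled = (frame_features * weights[:, None]).sum(axis=0)
    return pooled / (np.linalg.norm(pooled) + 1e-8)          # L2-normalize for cosine use
```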

1.2. Multi-modal feature fusion

· Fusion timing: mainly includes early fusion, late fusion and hybrid fusion. As the name suggests, early fusion merges multiple features first (e.g., by concatenation) and then trains them through a shared feature learning module; late fusion transforms each feature through its own feature learning module before fusing them. Hybrid fusion combines the two, can learn richer feature crosses, and usually works best.

· Fusion method: efficient and reasonable fusion of the information from each modality can greatly improve semantic understanding of a video.

At present, multi-modal fusion methods fall into three main categories (a small sketch follows the list):

  1. Direct methods: the most straightforward approach fuses multi-modal features through element-wise product/sum or concatenation, but it cannot effectively capture the complex associations between modalities.
  2. Pooling methods: the main idea is to fuse the features of each modality via bilinear pooling. Typical representative works include MFB[16] and MFH[17].
  3. Attention-based methods: drawing on ideas from Visual Question Answering (VQA), the attention mechanism lets the model focus on the relevant parts of an image or video according to the text representation and capture the correlations between modalities. Typical works include BAN (Bilinear Attention Networks)[18].
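The sketch below (assumed PyTorch modules with illustrative dimensions, not the production models) contrasts the three families: direct concatenation, MFB-style low-rank bilinear pooling, and text-guided attention over visual regions:

```python
# Hedged sketch of the three fusion families; all sizes and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatFusion(nn.Module):                       # 1) direct: concatenation + projection
    def __init__(self, dims, out_dim):
        super().__init__()
        self.proj = nn.Linear(sum(dims), out_dim)
    def forward(self, feats):                        # feats: list of [B, d_i] tensors
        return self.proj(torch.cat(feats, dim=-1))

class MFBFusion(nn.Module):                          # 2) MFB-style low-rank bilinear pooling
    def __init__(self, d_x, d_y, factor=5, out_dim=256):
        super().__init__()
        self.px = nn.Linear(d_x, factor * out_dim)
        self.py = nn.Linear(d_y, factor * out_dim)
        self.factor, self.out_dim = factor, out_dim
    def forward(self, x, y):                         # x: [B, d_x], y: [B, d_y]
        joint = self.px(x) * self.py(y)               # low-rank bilinear interaction
        joint = joint.view(-1, self.out_dim, self.factor).sum(-1)   # sum pooling over factors
        return F.normalize(joint, dim=-1)

class AttentionFusion(nn.Module):                    # 3) text attends over visual regions
    def __init__(self, d_txt, d_img, d_att=128):
        super().__init__()
        self.q, self.k = nn.Linear(d_txt, d_att), nn.Linear(d_img, d_att)
    def forward(self, txt, img_regions):             # txt: [B, d_txt], img_regions: [B, R, d_img]
        scores = torch.einsum("bd,brd->br", self.q(txt), self.k(img_regions))
        alpha = scores.softmax(dim=-1)                # attention weights over regions
        return (alpha.unsqueeze(-1) * img_regions).sum(1)
```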

Deep semantic representation model:

The application of a pre-trained model usually has two steps: 1) pre-training on a large amount of unsupervised corpus to learn a general semantic representation; 2) fine-tuning on a specific task with a small amount of annotated corpus on top of the general representation. Similarly, on the basis of the general pre-trained semantic representations of text, image, audio and video, we try to introduce rich side information and heterogeneous node and edge types into specific tasks (such as recall and semantic matching), and fine-tune with deeper models that have stronger feature-extraction ability, so as to learn semantic features that meet different business scenarios.

According to the modeling methods, deep semantic representation models can be roughly divided into the following categories:

1. Deep semantic model based on content

As the name implies, a content-based deep semantic model takes the content of a single node (metadata, multi-modal information, etc.) as input and is trained with manually annotated data as the supervisory signal, independent of any user behavior data. Such a model can derive a node's semantic representation directly from its content, so there is no cold-start problem, but a large amount of manually annotated data is often needed for training.

1.1. Image Embedding model based on ImageNet classification

This kind of model extracts a pure content representation of an image or video from the middle or last layer of a state-of-the-art ImageNet pre-trained classification model and fine-tunes it with self-supervised representation learning; the result is used as the semantic representation of the image or video. It works well in two scenarios: de-duplication (Figure 2) and diversity control in the post-ranking stage of recommendation.

 

Figure 2. Example of de-duplication based on ImageNet classification model and self-supervised learning method
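A minimal sketch of this extraction step is shown below; it uses a torchvision ResNet-50 as a stand-in for EfficientNet and an illustrative cosine threshold for de-duplication (not the production model or threshold):

```python
# Minimal sketch: content embedding from the penultimate layer of an ImageNet classifier.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

backbone = models.resnet50(pretrained=True)
backbone.fc = torch.nn.Identity()            # drop the classification head, keep 2048-d features
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

@torch.no_grad()
def image_embedding(path: str) -> torch.Tensor:
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    feat = backbone(img).squeeze(0)
    return feat / feat.norm()                # unit vector, so cosine(a, b) = a @ b

def is_duplicate(a: torch.Tensor, b: torch.Tensor, threshold: float = 0.95) -> bool:
    return float(a @ b) > threshold          # threshold is an illustrative value
```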

1.2. Task-based embedding model

This class of models is usually trained on large amounts of tagged supervised data for a particular task, and the middle or last layer of the model is extracted as the text or video representation. For example, the embedding model based on the tag classification task (Figure 3) takes metadata and text, image, video and audio features as input, is trained on large-scale labeled data, and predicts the type tags and content tags of a video. The representation extracted from the model's fusion layer is often used as the topic-granularity semantic representation of the video; it effectively solves the cold-start problem and is widely used in recommendation recall, ranking and diversity-control scenarios.

 

Figure 3. Embedding model based on the type-tag classification task
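The following is a hedged sketch of how such a tag-classification embedding model might be organized: multi-modal inputs pass through a fusion layer trained against type and content tags, and the fusion-layer output is reused as the topic-level video embedding. All module names and dimensions are illustrative, not the production architecture:

```python
# Hedged sketch of the Figure 3 idea: train on tags, reuse the fusion layer as embedding.
import torch
import torch.nn as nn

class TagEmbeddingModel(nn.Module):
    def __init__(self, d_text, d_image, d_video, d_meta, d_fusion=256,
                 n_type_tags=50, n_content_tags=5000):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(d_text + d_image + d_video + d_meta, d_fusion), nn.ReLU(),
            nn.Linear(d_fusion, d_fusion),
        )
        self.type_head = nn.Linear(d_fusion, n_type_tags)        # e.g. genre/type tags
        self.content_head = nn.Linear(d_fusion, n_content_tags)  # e.g. content tags

    def forward(self, text, image, video, meta):
        z = self.fusion(torch.cat([text, image, video, meta], dim=-1))
        return z, self.type_head(z), self.content_head(z)

# Training optimizes (multi-label) cross-entropy on the two tag heads; at serving
# time only z, the fusion-layer output, is kept as the video's semantic embedding.
```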

2. Deep semantic model based on matching

This kind of model is a deep semantic model combining content and behavior. It fuses modal information such as text, images, video and audio, and uses the user's click, watch or search behavior as the supervisory signal to construct positive and negative samples for training; the trained model maps an input x (a video, a user, etc.) to its semantic representation e = f(x). Such models do not capture long-distance dependencies or structural similarities between nodes, but the modeling is relatively simple and the trained model can be used directly for inference, so the cold-start problem is effectively solved and they work well for recall and ranking scenarios.

The matching-based deep semantic model is mainly implemented with a Siamese (twin-tower) or multi-tower structure. Popular methods in industry include DSSM (Deep Structured Semantic Model)[5] and CDML[20]. DSSM was originally used to model the semantic relevance of text in search, while CDML models the semantic relevance of videos based on audio and video-frame features, and argued that late fusion of multi-modal features works better. For the semantic modeling of videos, we introduce pre-trained semantic representations of the cover-image and video modalities on top of DSSM's text input to improve the semantic representation of videos. Similarly, CDML is extended with pre-trained semantic representations of the text and cover-image modalities to enrich node information. Meanwhile, considering that CDML only adopts late fusion, which limits feature interaction and lacks diversity, we adopt hybrid fusion to combine the modalities and learn richer multi-modal feature crosses (Figure 4).

 

Figure 4. CDML model structure based on hybrid fusion
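As a simplified illustration of the two-tower matching setup (not the actual DSSM/CDML implementation), the sketch below trains a query/user tower and an item tower in the same space with in-batch negatives over cosine similarities:

```python
# Hedged two-tower (Siamese) matching sketch; all sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    def __init__(self, d_in, d_out=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(), nn.Linear(256, d_out))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)       # unit-length embedding

class TwoTowerMatcher(nn.Module):
    def __init__(self, d_query, d_item, temperature=0.05):
        super().__init__()
        self.q_tower, self.i_tower = Tower(d_query), Tower(d_item)
        self.tau = temperature

    def forward(self, query_feats, item_feats):       # aligned positive pairs per row
        q, v = self.q_tower(query_feats), self.i_tower(item_feats)   # [B, 128] each
        logits = q @ v.t() / self.tau                  # [B, B] cosine similarities
        labels = torch.arange(q.size(0), device=q.device)  # diagonal entries are positives
        return F.cross_entropy(logits, labels)         # in-batch sampled-softmax loss
```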

3. Deep semantic model based on sequence

This kind of model is a behavior-based deep semantic model: it replaces traditional shallow networks (e.g., skip-gram, LSTM) with deep networks that have strong feature-extraction ability (e.g., Transformer) to learn deep semantic representations of nodes. Given a user's behavior sequence, the user's behavioral preference is modeled with a sequential neural network, and the next item the user may click is predicted from the representation of the model's last hidden layer. Such models can capture long-distance dependencies between nodes and usually yield good recall in recommendation scenarios, but they suffer from the cold-start problem.

There are three main methods for sequence modeling:

1) MDPs (Markov Decision Processes): the probability of clicking the next item is computed from state-transition probabilities, and the current state depends only on the previous state. The model is relatively simple and suits short sequences and sparse-data scenarios;

2) CNN-based: a CNN captures the short-distance dependencies between items in a sequence (e.g., Caser[21]) and is easy to parallelize;

3) RNN-based: can capture long-distance dependencies and suits long sequences and data-rich scenarios, but the model is more complex and hard to parallelize (e.g., GRU4Rec[22]).

Currently popular sequence modeling methods are mainly RNN-based. To solve the problems that RNNs are hard to parallelize and inefficient, we use the Transformer (Figure 5), which has stronger feature-extraction ability and is easy to parallelize, to replace the RNN for sequence modeling. Typical works include SASRec[23] and BERT4Rec[24]. SASRec uses a unidirectional Transformer decoder (right half of Figure 5, N=2) to model the click probability of the next item from the preceding context, whereas BERT4Rec borrows BERT's masking idea and uses a bidirectional Transformer encoder (left half, N=2) to predict the click probability of masked items from their context. In addition, since BERT assumes that masked items are independent of each other, the correlations between masked items are ignored; to improve sequence modeling, we model the correlation between the bidirectional context and masked items by borrowing the auto-regressive permutation-language-model idea of XLNet[25].

 

Figure 5. Transformer network structure
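A hedged SASRec-style sketch of the sequence model is given below: a unidirectional Transformer encoder with a causal mask predicts the next clicked item, and the last hidden state can serve as the user's behavioral embedding. Hyper-parameters are illustrative, not the production configuration:

```python
# Hedged SASRec-style sketch: causal Transformer over the user's item sequence.
import torch
import torch.nn as nn

class SeqRecModel(nn.Module):
    def __init__(self, n_items, d_model=128, n_heads=2, n_layers=2, max_len=50):
        super().__init__()
        self.item_emb = nn.Embedding(n_items + 1, d_model, padding_idx=0)  # item ids start at 1
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=256,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, item_ids):                        # item_ids: [B, L]
        B, L = item_ids.shape
        pos = torch.arange(L, device=item_ids.device).unsqueeze(0)
        h = self.item_emb(item_ids) + self.pos_emb(pos)
        causal = torch.triu(torch.ones(L, L, device=item_ids.device), 1).bool()
        h = self.encoder(h, mask=causal)                # left-to-right attention only
        logits = h @ self.item_emb.weight.t()           # score every item at each step
        return logits, h[:, -1]                         # next-item logits, user embedding
```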

4. Graph-based deep semantic model

Graph embedding models (also known as network embedding) project the nodes of a graph into a low-dimensional continuous space while preserving the network structure and its inherent properties. On top of a homogeneous or heterogeneous graph (with different node or edge types), the deep graph embedding model introduces the nodes' rich side information and multi-modal features and adopts a network with stronger feature-extraction ability to learn deep semantic representations of nodes. Compared with the previous deep semantic models, this approach is more complex, but it can make full use of the rich graph structure to model high-order dependencies between nodes.

4.1. Introduce rich side information and multi-modal information

The traditional graph embedding method mainly generates sequence data from the graph structure with some node-sequence sampling strategy and learns node embeddings with the skip-gram method, as shown in Figure 6. Typical works include DeepWalk, LINE and node2vec; their main difference lies in the sampling strategies used for sequence generation. Traditional graph embedding models regard all nodes as IDs, can only cover high-frequency nodes in the training set, and cannot obtain embeddings for new nodes.

 

Figure 6. Basic principle of the traditional graph embedding method
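A minimal DeepWalk-style sketch of Figure 6 (toy graph, illustrative parameters) samples random walks and feeds them to gensim's skip-gram Word2Vec:

```python
# Minimal DeepWalk-style sketch: random walks + skip-gram (gensim Word2Vec).
import random
from gensim.models import Word2Vec

def random_walks(adj, num_walks=10, walk_len=20):
    """adj: dict mapping node id -> list of neighbor ids."""
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            while len(walk) < walk_len and adj[walk[-1]]:
                walk.append(random.choice(adj[walk[-1]]))
            walks.append([str(n) for n in walk])          # Word2Vec expects string tokens
    return walks

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}        # toy graph
model = Word2Vec(random_walks(adj), vector_size=64, window=5, sg=1,  # sg=1 -> skip-gram
                 negative=5, min_count=0, epochs=5)
node_vec = model.wv["2"]                                   # embedding of node 2
```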

To solve the cold-start problem for new nodes, on the one hand multi-modal information of nodes can be introduced into the traditional graph embedding model, and on the other hand the rich meta information of nodes (such as category, uploader, etc.) can be fully utilized. On top of the graph structure, attribute information of nodes is introduced via attributed network embedding to enrich the semantic representation of nodes, so that nodes with similar topology and similar attributes have more similar semantics; for cold start, a new node's embedding can be obtained directly from its attribute embedding. EGES[26] and ANRL[27] are two typical works. EGES introduces attribute information into the input of the skip-gram model, while ANRL combines skip-gram with an autoencoder (AE), uses only attribute features as the node representation, and replaces the decoder of the traditional AE with a neighbor-enhancement decoder that makes a node more similar to its context nodes (rather than to itself). EGES and ANRL are mainly used for item embedding in e-commerce, where attribute information is rich. In video recommendation, however, except for a small number of long videos (movies and TV series) and actors with rich attributes, most short and mini videos have scarce attributes, so these methods cannot be reused directly. To solve this problem, we propose multimodal ANRL, shown in Figure 7: the attribute features of a node and the pre-trained semantic representations of its modalities (text, cover image and video) are used together as the node representation and fed to the model. For a new node, its embedding can be obtained by inference directly from the trained model and the node's own content (i.e., attributes and multi-modal features). Nearest-neighbor examples based on multimodal ANRL embeddings are shown in Figure 8. In addition, a knowledge graph can also be regarded as a kind of rich side information and used to introduce external prior knowledge for learning even better deep semantic representations.

Figure 7. Structure of the multimodal ANRL model

Figure 8. Example of multimodal ANRL nearest neighbor results

(The first video on the left is the seed video, and the other videos are neighbor videos)
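The following simplified sketch captures the multimodal ANRL idea under stated assumptions: the encoder consumes a node's attribute and multi-modal features, the decoder reconstructs the average of the neighbors' input features (neighbor enhancement), and a new node's embedding is just a forward pass. ANRL's skip-gram context loss is omitted for brevity, and all names are illustrative:

```python
# Simplified, hedged sketch of the multimodal ANRL idea (Figure 7).
import torch
import torch.nn as nn

class MultimodalANRL(nn.Module):
    def __init__(self, d_attr, d_modal, d_emb=128):
        super().__init__()
        d_in = d_attr + d_modal
        self.encoder = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(), nn.Linear(256, d_emb))
        self.decoder = nn.Sequential(nn.Linear(d_emb, 256), nn.ReLU(), nn.Linear(256, d_in))

    def forward(self, attrs, modal_feats):
        x = torch.cat([attrs, modal_feats], dim=-1)
        z = self.encoder(x)                  # node embedding (works for unseen nodes too)
        return z, self.decoder(z)

def neighbor_enhancement_loss(model, attrs, modal_feats, neighbor_target):
    """neighbor_target: pre-computed mean of each node's neighbors' input features,
    so the decoder reconstructs the neighborhood rather than the node itself."""
    _, recon = model(attrs, modal_feats)
    return ((recon - neighbor_target) ** 2).mean()
```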

4.2. More advanced feature extractors

The traditional graph embedding model usually generates sequence data from the graph and learns node embeddings with the simple skip-gram model. Because the model is so simple, its feature-extraction ability is weak and it can only capture local neighbor information (usually first- or second-order). GNNs (Graph Neural Networks) and GCNs (Graph Convolutional Networks) can work directly on the graph structure and the nodes' multi-modal features (textual, visual, etc.), performing convolutions over a node's neighborhood subgraph through multiple graph-convolution layers with stronger feature-extraction ability, and then generating the node's deep semantic representation. Drawing on industry experience, we reproduced a variety of GCN models such as PinSAGE[28] and Cluster-GCN[29]. In addition, we also use ProNE[30], a very fast and scalable graph embedding algorithm, for large-scale graph data. ProNE first converts graph embedding into a sparse-matrix factorization problem and efficiently obtains feature vectors carrying first-order neighbor information as the initial node embeddings; it then applies a frequency-domain filter to propagate and fuse higher-order neighbor information, so that both low-order and high-order neighbor information are integrated into the node's final deep semantic representation. More importantly, embeddings produced by common network embedding algorithms (such as node2vec) can be used as the initial embeddings for ProNE, and its spectral propagation then improves their effect by about 10% on average.
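To illustrate the enhancement step only, here is a rough sketch that smooths an existing embedding matrix (e.g., from node2vec or the factorization step) by propagation over the normalized adjacency matrix. ProNE itself uses a Chebyshev-expanded spectral filter, so this is an approximation of the propagation idea, not the algorithm:

```python
# Rough sketch: mix higher-order neighbor information into existing node embeddings.
import numpy as np
import scipy.sparse as sp

def propagate(adj, emb, steps=3, alpha=0.5):
    """adj: [N, N] scipy sparse adjacency matrix; emb: [N, d] initial node embeddings."""
    deg = np.asarray(adj.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    norm_adj = d_inv_sqrt @ adj @ d_inv_sqrt             # D^-1/2 A D^-1/2
    out = emb.copy()
    for _ in range(steps):
        out = (1 - alpha) * emb + alpha * (norm_adj @ out)   # mix in k-hop neighborhoods
    return out / (np.linalg.norm(out, axis=1, keepdims=True) + 1e-12)
```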

4.3. Modeling multivariate heterogeneous graphs

Existing methods are mainly based on homogeneous graphs with a single type of node and edge. However, most real-world graphs contain multiple types of nodes and edges, and different node types often have different attributes and multi-modal characteristics. For example, in the search scenario the simplest heterogeneous graph is the user's search-click bipartite graph, which has two node types, query and video, and the video side has rich attributes and multi-modal features. Recommendation scenarios also contain a large number of heterogeneous graphs, such as user-video, video-circle-content tag, and actor-role-work graphs.

Traditional graph embedding algorithms such as node2vec and metapath2vec ignore the edge types in the graph and the features of nodes; although metapath2vec can be used for representation learning over heterogeneous nodes, nodes are still treated as IDs and their rich features are ignored. The deep semantic model for Heterogeneous Information Network Embedding (HINE) simultaneously introduces the modal features of nodes and the diversity of node and edge types in the graph, and models different types of nodes and edges separately. Here, multivariate means that the graph has multiple types of edges.

We first made a preliminary attempt to learn deep semantic representations of heterogeneous graphs in the semantic relevance task of the search scenario. Semantic relevance plays an important role in search and can be used for semantic recall and semantic relevance matching. In order to measure the semantic relevance between a query and a video title, and to learn deep semantic representations of queries and videos in the same space, we combined representation-based and interaction-based ideas on top of the search query-click heterogeneous graph to learn semantic relevance embeddings of queries and video titles; the model structure is shown in Figure 9. The encoder on the left models the deep semantic representation of the query or video title and learns explicit textual semantic relevance; the decoder introduces behavioral relevance constraints to model implicit semantic relevance, such as <query: Li Jing Jing, title: Happy to in-laws>, where the former is one of the main actors of the latter. The right side models the multi-granularity interaction semantics between the query and video titles. Compared with the baseline, semantic relevance accuracy increases by more than 6%. Table 1 shows some examples of query-title semantic relevance. Besides click edges, we are also trying to introduce edge types such as favorites, comments and likes, video types (long, short and mini videos, albums, playlists, etc.), and the cover-image and video modal features on the video side for more fine-grained modeling.

 

Figure 9. Structure of the query-title semantic relevance embedding model for search

The model is also being migrated to recommendation scenarios to learn semantic relevance between users and videos, circles, tags, etc. in the same space. In addition, we recently introduced GATNE-I[31], Alibaba's work on representation learning for attributed multiplex heterogeneous networks, and the HGT (Heterogeneous Graph Transformer)[32] network with strong feature-extraction ability, together with the multi-modal features of nodes, to try to learn better deep semantic representations of nodes.

Query | Similar videos (top 5 only)

Query: How to use sea pole fish wheel
  • Score: 0.9448, title: Installation method of drunk sea pole
  • Score: 0.9415, title: How to install sea pole and fishing wheel
  • Score: 0.9365, title: Use method of sea pole
  • Score: 0.9350, title: Sea rod installation and use teaching; sea rod installation video; usage method; sea rod fishing method and skills
  • Score: 0.9345, title: Sea rod fishing white silver carp line set installation method

Query: Judges look down on contestants
  • Score: 0.8894, title: Feng Xiaogang spot debunking the identity of the contestant: Don't pretend!
  • Score: 0.8807, title: In a talent show, the garbage judge made harsh comments, and the contestant angrily left the show
  • Score: 0.8713, title: The folk singer's sincere efforts to sing were despised by the judges, so he boldly attacked the judges.
  • Score: 0.8632, title:
  • Score: 0.8593, title: When the man impersonated Li Yugang on stage, the judge scolded the contestant after his wife came on stage.

Query: Wang Dali
  • Score: 0.7873, title: Wang Taili
  • Score: 0.7695, title: This Chopsticks Brothers Wang Taili is really sad, have such a daughter-in-law and mother-in-law, tragic
  • Score: 0.7625, title: Chopsticks Brothers hate each other! Wang Taili ridicule: I treat you as brothers, you treat me as chopsticks
  • Score: 0.7536, title: Wang Taili drunk on the show, unexpectedly to Alan Tam so, too great
  • Score: 0.7469, title: Super visit Chopsticks Brothers north floating past first exposure

Query: The Sound of Diamonds
  • Score: 0.9235, title: "Diamonds"
  • Score: 0.9118, title: "Diamonds"
  • Score: 0.9118, title: This little girl was thrilled to hear her three children sing Rihanna's hit song "Diamonds".
  • Score: 0.9100, title: The Voice of Germany champion Jouline, a 13-year-old girl, amazed the world with her heavenly voice!
  • Score: 0.9053, title: Jouline vs Besim vs Lisa Battle - Diamonds, The Voice Kids 2018 (Germany)

Table 1. Search examples of query-title embedding semantic relevance
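Once query and title embeddings live in the same space, results in the style of Table 1 can be produced by cosine similarity and a top-5 cut, as in the minimal sketch below (function names are illustrative, not the production retrieval service):

```python
# Minimal sketch: score candidate titles against a query embedding and keep the top 5.
import numpy as np

def top_k_similar(query_vec, title_vecs, titles, k=5):
    """query_vec: [d]; title_vecs: [N, d]; both assumed L2-normalized."""
    scores = title_vecs @ query_vec                  # cosine similarity for unit vectors
    best = np.argsort(-scores)[:k]
    return [(float(scores[i]), titles[i]) for i in best]
```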

Subsequent optimization:

1. General pre-trained semantic representation of video: due to time/performance constraints and the lack of pre-training data for video semantic representation, video semantic features are currently obtained simply by fusing the image-level features of the key-frame sequence. In the future, based on a large amount of video captioning data, we will learn a video pre-training semantic model (e.g., UniViLM[35]) following BERT's idea to extract deep semantic representations of videos.

2. Deep semantic representation learning with knowledge-graph priors: video titles and descriptions often contain entities (for example, the title "Marvel Civil War hero: Iron Man tailors uniforms for his teammates, the team looks silly" contains the entities "Marvel" and "Iron Man"). Introducing the knowledge graph's entities into the text representation, together with prior knowledge such as the relations between entities (e.g., between "Iron Man" and "The Avengers"), can further improve the semantic representation. In the future, we will try to introduce knowledge graphs into the NLP pre-trained language model and into recommendation scenarios, respectively to improve the semantic representation of text (e.g., KEPLER[33]) and to discover users' deeper interests, so as to improve the accuracy, diversity and interpretability of recommendations (e.g., KGCN[34]).

3. Covering more businesses: deep semantic representation is currently used mainly in intelligent video distribution, covering iQiyi's recommendation and search services for long, short and mini videos, live broadcasts, graphic posts and comics. In the future, we will continue to add support for intelligent production scenarios, providing deep semantic features for more business scenarios.

 

reference

[1] Yoshua Bengio, et al. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3:1137-1155, 2003.

[2] Tomas Mikolov, et al. Efficient Estimation of Word Representations in Vector Space. In International Conference on Learning Representations (ICLR), 2013.

[3] Ashish Vaswani, et al. Attention Is All You Need. In International conference on Neural Information Processing Systems (NeurIPS), 2017.

[4] Thomas N. Kipf. et al. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations (ICLR), 2017.

[5] Po-Sen Huang, et al. Learning Deep Structured Semantic Models for Web Search using Clickthrough Data. In ACM International Conference on Information and Knowledge Management (CIKM), 2013.

[6] Oren Barkan, et al. Item2Vec: Neural Item Embedding for Collaborative Filtering. arXiv preprint, arXiv:1603.04259v3, 2017.

[7] Aditya Grover, et al. node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016.

[8] Jacob Devlin, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint, arXiv:1810.04805v2, 2019.

[9] Zhenzhong Lan, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations (ICLR), 2020.

[10] David M. Blei, et al. Latent Dirichlet Allocation. The Journal of Machine learning Research, 3:993-1022, 2003.

[11] Lingfei Wu, et al. Word Mover's Embedding: From Word2Vec to Document Embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.

[12] Casper Hansen, et al. Contextually Propagated Term Weights for Document Representation. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2019.

[13] Mingxing Tan, et al. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

[14] Trieu H. Trinh, et al. Selfie: Self-supervised Pretraining for Image Embedding. arXiv preprint, arXiv:1906.02940, 2019.

[15] Shawn Hershey, et al. CNN Architectures for Large-Scale Audio Classification. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017.

[16] Zhou Yu, et al. Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering. In International Conference on Computer Vision (ICCV), 2017.

[17] Zhou Yu, et al. Beyond Bilinear: Generalized Multimodal Factorized High-order Pooling for Visual Question Answering. IEEE Transactions On Neural Networks And Learning Systems, 26:2275-2290, 2015.

[18] Jin-Hwa Kim, et al. Bilinear Attention Networks. In International conference on Neural Information Processing Systems (NeurIPS), 2018.

[19] Francois Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. arXiv preprint, arXiv: 1610.02357, 2017.

[20] Joonseok Lee, et al. Collaborative Deep Metric Learning for Video Understanding. In Proceedings of the 24th  ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2018.

[21] Jiaxi Tang, et al. Personalized Top-N Sequential Recommendation via Convolutional Sequence Embedding. In ACM International Conference on Web Search and Data Mining (WSDM), 2018.

[22] Balazs Hidasi, et al. Session-based Recommendations with Recurrent Neural Networks. In International Conference on Learning Representations (ICLR), 2016.

[23] Wang-Cheng Kang, et al. Self-Attentive Sequential Recommendation. In IEEE International Conference on Data Mining (ICDM), 2018.

[24] Fei Sun, et al. BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer. In ACM International Conference on Information and Knowledge Management (CIKM), 2019.

[25] Zhilin Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In International conference on Neural Information Processing Systems (NeurIPS), 2019.

[26] Jizhe Wang, et al. Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2018.

[27] Zhen Zhang, et al. ANRL: Attributed Network Representation Learning via Deep Neural Networks. In Proceedings of the 27th International Joint Conference on artificial Intelligence (IJCAI), 2018.

[28] Rex Ying, et al. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2018.

[29] Wei-Lin Chiang, et al. Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2019.

[30] Jie Zhang, et al. ProNE: Fast and Scalable Network Representation Learning. In Proceedings of the 28th International Joint Conference on artificial Intelligence (IJCAI), 2019.

[31] Yukuo Cen, et al. Representation Learning for Attributed Multiplex Heterogeneous Network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2019.

[32] Ziniu Hu, et al. Heterogeneous Graph Transformer. In Proceedings of the World Wide Web Conference (WWW), 2020.

[33] Xiaozhi Wang, et al. KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation. arXiv preprint, arXiv: 1911.06136, 2020.

[34] Hongwei Wang, et al. Knowledge Graph Convolutional Networks for Recommender Systems. In Proceedings of the World Wide Web Conference (WWW), 2019.

[35] Huaishao Luo, et al. UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation. arXiv preprint, arXiv: 2002.06353, 2020.
