Recently, the first graph neural network competition jointly organized by the KDD Cup and OGB (Open Graph Benchmark) concluded. Competing against more than 500 top enterprises, universities, and laboratories worldwide, including DeepMind, Microsoft, Ant Financial, and UCLA, Baidu's PaddlePaddle graph learning framework PGL fought its way through the field and finished with two first places and one second place across the three tracks.

▲ Professor Jure Leskovec of Stanford University, an organizer of the competition, announces the winning teams

The KDD Cup is an annual event organized by ACM SIGKDD and is known as the "World Cup of big data". It is currently the highest-level, most influential, and largest international competition in the field of data mining. This year, the KDD Cup and OGB jointly held the first OGB-LSC (OGB Large-Scale Challenge), providing super-large-scale graph data from the real world for three graph learning tasks: node classification, edge prediction, and graph regression.

This competition was a "closed-book exam": teams could submit model results only twice during the entire competition cycle, an extreme test of each team's model generalization ability. Thanks to Baidu's sustained investment in graph neural networks, PGL won first place on the large-scale node classification track, first place on the large-scale graph relation prediction track, and second place on the chemical molecular graph property prediction track.

PGL homepage: …

The PGL code is fully open source; usage, feedback, and contributions are welcome.

PGL links: Bilibili graph neural network seven-lecture series: PGL graph learning introductory tutorial: PGL competition report & code:

Large-scale Node Classification Track Champion: A Unified Message Passing Model Based on Heterogeneous Relationships

The OGB-LSC node classification dataset is derived from MAG (Microsoft Academic Graph), a super-large-scale real-world academic citation network. The OGB organizers extracted more than 240 million entities (papers, authors, and so on) and constructed a large-scale heterogeneous graph containing 1.6 billion edges. Entrants were required to mine information from this heterogeneous graph to predict the subject area of each arXiv paper (153 topics in total, such as cs.LG Machine Learning and q-bio.BM Biomolecules).

At present, there are two main graph learning approaches to node classification: label propagation algorithms, and graph neural networks that aggregate multi-hop neighbor features to predict the label of the center node. Both approaches have limitations, however, and neither makes full use of the label information carried by graph nodes.
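For intuition, the first approach can be sketched in a few lines of NumPy (a toy illustration, not PGL's implementation): known labels are clamped in place while unlabelled nodes repeatedly average the label distributions of their neighbors.

```python
import numpy as np

def label_propagation(adj, labels, mask, num_iters=10):
    """Spread one-hot labels from labelled nodes to their neighbours
    over a row-normalised adjacency matrix, clamping known labels."""
    deg = adj.sum(axis=1, keepdims=True)
    norm_adj = adj / np.maximum(deg, 1)  # each node averages its neighbours
    y = labels.copy().astype(float)
    for _ in range(num_iters):
        y = norm_adj @ y
        y[mask] = labels[mask]  # re-clamp the known labels each step
    return y.argmax(axis=1)

# Tiny 4-node path graph: nodes 0 and 3 are labelled (classes 0 and 1).
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
labels = np.zeros((4, 2))
labels[0, 0] = 1.0
labels[3, 1] = 1.0
mask = np.array([True, False, False, True])
pred = label_propagation(adj, labels, mask)
```

Here the two unlabelled middle nodes each adopt the class of the labelled endpoint they sit closer to.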

To address this, PGL proposed the Unified Message Passing model (UniMP), which uses a "label masking" training strategy so that the model performs label propagation and feature aggregation simultaneously during both training and prediction, successfully unifying the two approaches above within a single message-passing model. UniMP significantly improves semi-supervised node classification; the related paper has been accepted at IJCAI 2021, and the model has become a mainstream strong baseline for current node classification tasks.
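The label-masking idea can be sketched as follows (a conceptual NumPy illustration with assumed names, not the actual UniMP code): at each training step a random subset of training labels is hidden, the remaining visible labels are embedded and added to the node features, and the model is asked to recover the hidden ones, so labels and features flow through the same message-passing network.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_input_features(feats, labels, train_mask, label_emb, mask_rate=0.5):
    """UniMP-style label masking (conceptual sketch): hide a random
    subset of training labels; embed the visible ones into the node
    features; the model must re-predict the hidden subset."""
    num_nodes = feats.shape[0]
    hide = train_mask & (rng.random(num_nodes) < mask_rate)
    visible = train_mask & ~hide
    x = feats.copy()
    x[visible] += label_emb[labels[visible]]  # inject visible label info
    # The training loss would be computed on the nodes in `hide`.
    return x, hide

# Toy setup: 6 nodes, 4-dim features, 2 classes, first 4 nodes labelled.
feats = np.zeros((6, 4))
labels = np.array([0, 1, 0, 1, 0, 1])
train_mask = np.array([True, True, True, True, False, False])
label_emb = rng.normal(size=(2, 4))  # hypothetical learnable label table
x, hide = build_input_features(feats, labels, train_mask, label_emb)
```

At inference time all training labels can be made visible, which is how the same mechanism serves prediction as well as training.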

▲ UniMP: labels and features (the blue energy in the figure) are propagated under a unified message-passing mechanism

For this large-scale heterogeneous graph, PGL further introduced a sampling method based on heterogeneous relations and an attention fusion mechanism, upgrading UniMP to R-UniMP, and implemented distributed training and prediction of the large-scale graph neural network model on top of PaddlePaddle's distributed computing framework. The result was nearly 10 percentage points higher validation accuracy than the official baseline. PGL ultimately took the title ahead of domestic and international technology companies and academic institutions including DeepMind, Microsoft, Ant Financial, and Tsinghua University.

Large-scale Graph Relation Prediction Track Champion: A 20-layer NOTE-RPS Knowledge Graph Embedding Model

The edge prediction task is relation prediction in a large-scale knowledge graph. In a knowledge graph, factual knowledge about the world is represented by triples linking different entities (for example, Yao Ming –born_in–> Shanghai). These large knowledge graphs are incomplete, however, and lack much information about the relationships between entities.

Automatically inferring the missing triples with machine learning can significantly reduce the cost of manual curation and thus yield a more comprehensive knowledge graph. This competition used the Wikidata knowledge graph, which includes nearly 90 million entities and 500 million triples, making it the largest knowledge graph dataset for this task to date.

Knowledge representation models such as TransE and RotatE continue to emerge in the industry. Building on its large-scale knowledge representation library PGL-KE, PGL proposed the Normalized Orthogonal Embedding (NOTE) model as an upgrade of the existing algorithms; it can model relations along multiple dimensions while maintaining numerical stability in large-scale scenarios.
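As background for such embedding models, here is how the classic TransE scoring mentioned above works (a toy NumPy sketch; NOTE's orthogonal transform is not reproduced here): a triple (h, r, t) is plausible when the tail embedding lies close to head + relation.

```python
import numpy as np

def transe_score(head, rel, tail):
    """TransE scores a triple (h, r, t) by how well t ≈ h + r;
    a higher (less negative) score means a more plausible triple."""
    return -np.linalg.norm(head + rel - tail, axis=-1)

rng = np.random.default_rng(0)
dim = 32
# Hypothetical embeddings for "Yao Ming" and the relation "born_in".
yao, born_in = rng.normal(size=(2, dim))
# A near-perfect tail for the true triple, versus a random distractor.
shanghai = yao + born_in + 0.01 * rng.normal(size=dim)
beijing = rng.normal(size=dim)
```

Link prediction then amounts to ranking all candidate tails by this score and returning the best ones.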

▲ NOTE: the normalized orthogonal transformation knowledge graph embedding model

On top of this, PGL proposed the Relation-based Post Smoothing (RPS) graph neural network algorithm, which post-processes the trained NOTE model with a 20-layer RPS model, arguably the deepest graph neural network yet applied in the knowledge graph field. This large-scale knowledge representation scheme based on NOTE+RPS improved on the official baseline by 12 percentage points, ultimately taking first place ahead of teams including Alibaba, Harbin Institute of Technology, and the University of Science and Technology of China, and moving knowledge graphs a large step closer to practical application.
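The general idea of post-processing smoothing, stripped of RPS's relation-specific details (which are not spelled out here), can be sketched as: after training finishes, repeatedly mix each entity's embedding with the mean of its neighbors', with no further gradient updates. This is a generic NumPy illustration, not the actual RPS algorithm.

```python
import numpy as np

def post_smooth(emb, edges, num_layers=20, alpha=0.5):
    """Post-training smoothing sketch: each layer blends every entity's
    embedding with the mean of its in-neighbours' embeddings."""
    src, dst = edges
    for _ in range(num_layers):
        agg = np.zeros_like(emb)
        cnt = np.zeros((emb.shape[0], 1))
        np.add.at(agg, dst, emb[src])  # sum incoming neighbour embeddings
        np.add.at(cnt, dst, 1.0)
        neigh = agg / np.maximum(cnt, 1.0)
        # Only blend nodes that actually have neighbours.
        emb = np.where(cnt > 0, (1 - alpha) * emb + alpha * neigh, emb)
    return emb

# Two mutually linked entities: smoothing pulls their embeddings together.
emb0 = np.array([[0.0], [1.0]])
edges = (np.array([0, 1]), np.array([1, 0]))
out = post_smooth(emb0, edges)
```

Because no training is involved, such a step is cheap even at the scale of hundreds of millions of entities, which is presumably what makes a 20-layer stack feasible.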

Chemical Molecular Graph Property Prediction Track Runner-up: Constructing a Self-Supervised Pre-Training Auxiliary Task Using Molecular 3D Conformations

Molecular property prediction is widely recognized as one of the most critical tasks in computational drug and material discovery. Methods based on DFT quantum-mechanical calculations are accurate but require far too much time to predict the properties of many molecules. To exploit the powerful expressive ability of graph neural networks for molecular property prediction, PGL and the PaddleHelix biocomputing framework jointly proposed the LiteGEM model, which uses the 3D conformation of molecules to construct a self-supervised pre-training auxiliary task that improves molecular property prediction, ultimately winning second place.
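One way such a 3D-conformation auxiliary task can be set up (a hypothetical sketch, not the LiteGEM code) is to derive pairwise inter-atomic distances from the conformation and pre-train the network to predict them from the molecular graph alone, so geometric knowledge transfers to the downstream property task.

```python
import numpy as np

def conformation_targets(coords):
    """Self-supervised targets from a molecule's 3D conformation
    (conceptual sketch): the matrix of pairwise inter-atomic distances,
    which a GNN can be pre-trained to regress from the 2D graph."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Toy 3-atom molecule with coordinates in angstroms.
coords = np.array([[0.00, 0.00, 0.0],
                   [0.96, 0.00, 0.0],
                   [-0.24, 0.93, 0.0]])
dists = conformation_targets(coords)
```

The resulting distance matrix is symmetric with a zero diagonal, and each entry becomes one regression target during pre-training.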

Industrial deployment: PGL receives a major upgrade to support even larger-scale industrial applications

Beyond its sweep of the KDD Cup, PGL has remained continuously committed to graph neural network algorithm innovation and larger-scale industrial application.

Recently, PGL received a major upgrade with the launch of a trillion-edge-scale distributed graph engine, on which all of the winning KDD Cup solutions were built. The original motivation for developing the distributed graph engine was precisely the hope that graph learning algorithms could reach larger-scale industrial application. To date, Baidu has deployed dozens of PGL-based applications in scenarios such as search, feed recommendation, financial risk control, smart maps, and knowledge graphs.

▲ The trillion-edge graph engine launched at the WaveSummit 2021 Deep Learning Developer Summit

PGL also cooperates with a number of external institutions. After investigating a large number of open-source solutions, NetEase Cloud Music chose PGL, which is friendlier to large-scale graph training, as the basic graph neural network framework for its music recommendation. PGL likewise contributes to the OpenKS knowledge computing engine, a major project of the "New Generation of Artificial Intelligence" initiative under Science and Technology Innovation 2030.

Because graph neural networks model complex data conveniently and with powerful expressive ability, PGL is also exploring their combination with multiple disciplines, including building a big-data epidemic prediction system and cooperating with PaddleHelix to predict compound properties, achieving SOTA results on multiple compound prediction leaderboards.

▲ The PaddlePaddle graph learning framework PGL

As one of the general-purpose artificial intelligence algorithms, graph learning is bound to become a new foundational capability of the intelligent era, empowering all walks of life and boosting the intelligent economy. The current stage is just the beginning of the graph learning boom; deeper technical advances and larger industrial opportunities lie ahead. Rooted in the field of graph learning, PGL will continue to empower the intelligent upgrading of industry, starting now.