Author: Meng Hui, senior researcher at Tencent Cloud. He graduated from the Department of Control Science and Engineering at the University of Chinese Academy of Sciences and has extensive experience in machine learning and data mining. Since joining the Tencent Cloud AI Semantics product group, he has been mainly responsible for the research, development, and application of knowledge-graph-related products.


1. Knowledge graph foundation

The knowledge graph was first proposed by Google in 2012. It uses semantic retrieval to collect and process information from multilingual data sources (such as Freebase and Wikipedia) in order to improve search quality and the search experience. In fact, as early as 2006 Tim Berners-Lee had proposed Linked Data, a method of creating semantic associations among web data. Further back, the Semantic Link Network had already been studied systematically, with the aim of creating a self-organized semantic interconnection method for expressing knowledge to support intelligent applications. For systematic theory and methods, see The Knowledge Grid, published by H. Zhuge in 2004.



You might be wondering: what technology stack do you need to quickly build a knowledge graph? Data collection, data cleaning, knowledge extraction, knowledge fusion, and graph storage are the most basic building blocks. Drawing on a large body of prior references and practice, the author summarizes the technical process of constructing a knowledge graph as follows:
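As a rough, illustrative sketch of that process (the stage names follow the list above; the function bodies are placeholders, not Tencent Cloud's actual implementation), the construction pipeline can be read as a chain of stages:

```python
# Illustrative skeleton of the construction pipeline described above.
# Each stage is a placeholder; real systems plug in crawlers, cleaners,
# extraction models, entity-alignment logic, and a graph database here.

def collect(sources):            # data collection
    return [doc for src in sources for doc in src]

def clean(docs):                 # data cleaning
    return [d.strip() for d in docs if d and d.strip()]

def extract(docs):               # knowledge extraction -> candidate triples
    return [("subject", "predicate", "object") for _ in docs]

def fuse(triples):               # knowledge fusion (dedup / entity alignment)
    return sorted(set(triples))

def store(triples, graph_db):    # graph storage
    graph_db.extend(triples)

graph_db = []
store(fuse(extract(clean(collect([["raw text ..."]])))), graph_db)
```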



Let's go back to the origin and explore the essence of the knowledge graph. In essence, a knowledge graph is a semantic network composed of nodes and edges: nodes represent entities or concepts in the physical world, and edges represent the relationships between entities. As Google put it, "things, not strings": not meaningless strings, but the objects, the things, hidden behind the strings. For example, the singer and actor Andy Lau is an entity in the sense above; his birthday, wife, height, and film works are attributes of Andy Lau. The film Infernal Affairs was directed by Andrew Lau and produced in Hong Kong, China.
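To make this concrete, here is a minimal sketch of how the Andy Lau example can be represented as (subject, predicate, object) triples; the predicate names are illustrative:

```python
# A knowledge graph can be stored as a set of (subject, predicate, object)
# triples. Values below follow the Andy Lau example in the text.
triples = [
    ("Andy Lau", "occupation", "singer"),
    ("Andy Lau", "occupation", "actor"),
    ("Infernal Affairs", "directed_by", "Andrew Lau"),
    ("Infernal Affairs", "produced_in", "Hong Kong, China"),
]

# Nodes are entities/attribute values; edges are the predicates linking them.
def neighbors(graph, entity):
    """Return all (predicate, object) edges leaving an entity node."""
    return [(p, o) for s, p, o in graph if s == entity]

print(neighbors(triples, "Infernal Affairs"))
# [('directed_by', 'Andrew Lau'), ('produced_in', 'Hong Kong, China')]
```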



From a macro point of view, knowledge graphs have been widely applied in personalized recommendation, address resolution, search engines, intelligent question answering, and education. The Tencent Cloud knowledge graph team has also explored different scenarios; for example, related-entity recommendation based on the knowledge graph is used in short-video recommendation, and knowledge-graph-based question answering is used in intelligent Q&A. The team has also developed a mini program built around these business scenarios, integrating graph visualization, knowledge quizzes, and more; interested readers can scan the QR code to try it.



2. Mastering attribute extraction from 0 to 1

According to the technical framework above, converting unstructured data into structured data that can easily be stored in a graph database generally requires knowledge extraction, which includes entity extraction, relation extraction, attribute extraction, and concept extraction. In general, entity extraction, attribute extraction, and concept extraction can be framed as sequence labeling tasks, while relation extraction can be framed as a classification task. The Tencent Cloud knowledge graph team has developed a one-stop knowledge extraction algorithm framework named Merak (Tianxuan, after the second star of the Big Dipper as described in the astronomy annals of the Book of Jin). For attribute extraction and concept extraction tasks, the Merak framework provides multiple algorithm models, such as BERT (Bidirectional Encoder Representations from Transformers) and BiLSTM+CRF. Overall, the Merak framework has the following technical advantages:

  • Merak provides a one-stop algorithm solution. Through simple configuration, users can automatically generate each module of a project (data processing, model training, model deployment), which greatly improves the efficiency of knowledge graph production (a hypothetical configuration sketch follows this list).
  • Merak abstracts the model layer to make models easy to understand and assemble, enhancing the simplicity, flexibility, and versatility of the framework; users can also build on it for secondary development.
  • Merak supports the current mainstream algorithm models in the field of knowledge extraction, including BERT, BiLSTM+CRF, Attention-CNN, and so on.
  • Merak supports CPU and multi-GPU distributed training, and provides a high-quality Chinese BERT pre-trained model for Tencent Cloud customers to download and use.
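Merak's configuration format is internal and not public; purely as a hypothetical illustration of what "automatic generation of each module through simple configuration" might look like, a task could be described declaratively, for example:

```python
# Hypothetical configuration (NOT Merak's real schema) illustrating the idea
# of driving data processing, training, and deployment from one declarative
# spec. All keys, paths, and label names are invented for illustration.
task_config = {
    "task": "attribute_extraction",
    "model": "bert",                      # or "bilstm_crf", "attention_cnn"
    "pretrained_model": "bert-base-chinese",
    "data": {
        "train": "data/train.txt",
        "dev": "data/dev.txt",
        "labels": ["O", "B-BIRTH_DATE", "I-BIRTH_DATE",
                   "B-BIRTH_PLACE", "I-BIRTH_PLACE"],
    },
    "training": {"epochs": 5, "batch_size": 32, "devices": "gpu:0,1"},
    "deployment": {"export_format": "saved_model"},
}
```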

In terms of experimental results, Merak performs well on entity extraction, relation extraction (including multi-instance learning), attribute extraction, and other tasks, and reaches industry-leading levels in both training time and prediction accuracy.



It should be noted that constructing a knowledge graph is a process of trade-offs. In particular, an overly broad domain knowledge graph may not fit the business, especially for tasks that require fine-grained knowledge such as Q&A and task-oriented dialogue. Besides being very expensive to build, some tasks (such as reasoning over the knowledge graph) become difficult to use well because of data noise.

Next, the author introduces how to use the Merak framework to implement an attribute extraction task from 0 to 1.

Before starting the attribute extraction task, the extraction targets need to be clearly defined. Taking person attribute extraction as an example, gender, educational background, place of birth, date of birth, native place, and school of graduation all fall into the category of attributes. Second, it should be clear why attribute extraction can be framed as a sequence labeling task. Sequence labeling is one of the four key task families in natural language processing, and its development can be roughly divided into three stages: statistical learning methods (HMM, CRF), deep neural networks (BiLSTM+CNNs+CRF), and the post-deep-learning era represented by Transformer, BERT, and related models.
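Concretely, framing attribute extraction as sequence labeling means assigning a BIO tag to each token and then decoding attribute spans from the tags; the label names below are illustrative:

```python
# Attribute extraction as sequence labeling: each token gets a BIO tag.
# Tag names are illustrative; real label sets come from the task definition.
tokens = ["Andy", "Lau", "was", "born", "in", "Hong", "Kong", "in", "1961"]
labels = ["O", "O", "O", "O", "O",
          "B-BIRTH_PLACE", "I-BIRTH_PLACE", "O", "B-BIRTH_DATE"]

def decode(tokens, labels):
    """Recover (attribute, value) pairs from a BIO tag sequence."""
    spans, cur = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            cur = [lab[2:], [tok]]
            spans.append(cur)
        elif lab.startswith("I-") and cur and cur[0] == lab[2:]:
            cur[1].append(tok)
        else:
            cur = None
    return [(attr, " ".join(toks)) for attr, toks in spans]

print(decode(tokens, labels))
# [('BIRTH_PLACE', 'Hong Kong'), ('BIRTH_DATE', '1961')]
```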


As is well known, BERT achieved strong results on 11 benchmark NLP tasks, so it is discussed here as an example. After the initial model selection, training samples need to be prepared. The Merak framework provides sample sets for person attribute extraction, covering date of birth, place of birth, school of graduation, and so on, as shown in the upper part of the figure below.



As mentioned above, the Merak framework provides a variety of attribute extraction algorithm modules, such as BERT, BiLSTM+CRF, and other classical algorithms. In the figure above, the construction of BERT's input vectors is shown on the left. The overall process has two steps: first, model pre-training, which consists of (1) a masked language model that predicts missing words from context, and (2) next-sentence prediction, a binary classification of whether one sentence follows another; then fine-tuning on this basis. On the right is the sequence labeling method based on the BiLSTM+CNNs+CRF model.
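As a minimal sketch of the fine-tuning step, here is a token-classification setup using the open-source Hugging Face transformers library (not Merak itself); the label set and the single dummy training step are purely illustrative:

```python
# Minimal BERT token-classification fine-tuning sketch with the open-source
# Hugging Face transformers library (not Merak). Labels are illustrative.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

labels = ["O", "B-BIRTH_PLACE", "I-BIRTH_PLACE", "B-BIRTH_DATE", "I-BIRTH_DATE"]
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(labels))

# One toy training step; a real run iterates over batched, label-aligned data.
enc = tokenizer("刘德华1961年出生于香港", return_tensors="pt")
tags = torch.zeros_like(enc["input_ids"])   # all-"O" dummy labels
loss = model(**enc, labels=tags).loss       # token-level cross-entropy
loss.backward()
torch.optim.AdamW(model.parameters(), lr=2e-5).step()
```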

So what are the differences between fine-tuning BERT and the BiLSTM+CNNs+CRF / BiLSTM+CRF approaches? The author's brief analysis is as follows:

  • The accuracy of the BERT fine-tuning method is higher than that of the BiLSTM+CRF method.
  • The BERT method has far more parameters (300 million+), and therefore requires more computing resources, i.e., higher cost.
  • BiLSTM+CRF is an end-to-end network architecture that can achieve good results without any pre-training (a compact sketch follows this list).
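For comparison, here is a compact BiLSTM+CRF tagger sketch using the third-party pytorch-crf package for the CRF layer; the dimensions and random data are illustrative:

```python
# Compact BiLSTM+CRF tagger sketch (dimensions illustrative).
# CRF layer from the third-party package: pip install pytorch-crf
import torch
import torch.nn as nn
from torchcrf import CRF

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=128, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden // 2, bidirectional=True,
                            batch_first=True)
        self.fc = nn.Linear(hidden, num_tags)       # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, x, tags):
        out, _ = self.lstm(self.emb(x))
        return -self.crf(self.fc(out), tags)        # negative log-likelihood

    def decode(self, x):
        out, _ = self.lstm(self.emb(x))
        return self.crf.decode(self.fc(out))        # Viterbi best tag paths

model = BiLSTMCRF(vocab_size=5000, num_tags=5)
x = torch.randint(0, 5000, (2, 10))   # batch of 2 sentences, length 10
tags = torch.randint(0, 5, (2, 10))
print(model.loss(x, tags), model.decode(x))
```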

Next, download the project code and start training the person attribute extraction model. Pay special attention to downloading the Chinese pre-trained model in advance (the base version of the Chinese pre-trained model is used here) and placing the training samples in the ./People_Attribute_extraction folder. For other preparations, see the quick-start section below.



After model training was completed, the author compared the results of the different methods on the person attribute extraction samples; the results are shown in the figure below. The experiments found that the BERT + fully connected layer method performed best, with an F1 value of about 0.985.
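The F1 values here are entity-level scores; as a sketch (not the author's actual evaluation script), such scores are commonly computed with the open-source seqeval package:

```python
# Entity-level F1 for sequence labeling, computed with the open-source
# seqeval package (pip install seqeval). Toy data, not the author's results.
from seqeval.metrics import f1_score, classification_report

y_true = [["O", "B-BIRTH_PLACE", "I-BIRTH_PLACE", "O", "B-BIRTH_DATE"]]
y_pred = [["O", "B-BIRTH_PLACE", "I-BIRTH_PLACE", "O", "O"]]

print(f1_score(y_true, y_pred))   # ~0.667: one of the two true spans recovered
print(classification_report(y_true, y_pred))
```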



As an industry side note, Nvidia recently trained what it calls the industry's largest Transformer-based language model using 512 high-performance V100 GPUs, with 8.3 billion parameters, far larger than the pre-trained models announced by Google:

  • BERTBASE (L=12, H=768, A=12, Total Parameters=110M)
  • BERTLARGE (L=24, H=1024, A=16, Total Parameters=340M)
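A rough back-of-the-envelope check of these parameter counts, ignoring biases, LayerNorm, position/segment embeddings, and the pooler:

```python
# Rough parameter count for BERT-base (L=12, H=768), ignoring biases,
# LayerNorm, and the pooler. V is the WordPiece vocabulary size.
V, L, H = 30522, 12, 768
embeddings = V * H                    # token embeddings (~23.4M)
per_layer = 4 * H * H + 8 * H * H     # self-attention (4H^2) + FFN (8H^2)
total = embeddings + L * per_layer
print(f"{total / 1e6:.0f}M")          # ~108M, close to the quoted 110M
```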

This concludes the walkthrough of training a person attribute extraction model. Concept extraction, relation extraction, and entity extraction follow similar procedures, and interested readers can try them on their own.

3. Tencent Cloud Encyclopedia knowledge graph

Before introducing the Tencent Cloud Encyclopedia knowledge graph, the author first analyzes the differences and connections between general knowledge graphs and domain knowledge graphs, as shown in the figure below:

Comparative analysis of general knowledge graph and domain knowledge graph

As the figure shows, the two differ greatly in knowledge representation, knowledge acquisition, and knowledge application in real scenarios. Moreover, constructing a knowledge graph requires balancing multiple factors, of which graph quality, construction cost, and graph updating are the most important. Graph quality and construction cost often constrain each other, requiring a task-specific balance.

The Tencent Cloud Encyclopedia knowledge graph (a cloud product built jointly by the Tencent Cloud knowledge graph team and the Tencent AI Lab TopBase team) belongs to the category of general knowledge graphs. Although its knowledge granularity is relatively coarse, its coverage is broad: it currently spans 51 domains (mainly music, film and television, and encyclopedia content), with 221 types, 4,320 attributes, over 97 million entities, and nearly 1 billion triples, and it supports both full and incremental updates. The detailed domain division is shown in the figure below:



Here, the author surveys the entity and triple counts of Chinese encyclopedia knowledge graphs in the industry; the results are as follows:

| Name | Entities | Triples |
| --- | --- | --- |
| CN-DBpedia | 16.89 million+ | 220 million+ |
| zhishi.me | 17.28 million+ | 120 million+ |
| Tencent Cloud Encyclopedia knowledge graph | 97 million+ | 1 billion+ |

The main data sources for the Tencent Cloud Encyclopedia knowledge graph are Tencent Entertainment, Chinese Wikipedia, encyclopedia sites, Chinese news, Douban, and so on, so the graph is especially rich in entities and triples in the technology, music, sports, and film domains. The detailed construction process is as follows:

At present, the Tencent Cloud Encyclopedia knowledge graph has been gradually rolled out (in gray release) to Tencent Tingting, Tencent Dingdang, the Tencent Xiaowei robot, WeChat Search, and other products, and has accumulated rich practical experience in related-entity recommendation and encyclopedia question answering.


Having said all this, what specific interfaces does the Tencent Cloud Encyclopedia knowledge graph provide, and how can users access it? It currently provides entity query, relation query, and triple query. Note that triple query involves the grammar of TQL (Tencent graph Query Language). Several sample calls are shown in the figure below; for details and complete examples, see the API documentation on the official website.




At present, the Tencent Cloud Encyclopedia knowledge graph API is in a free internal beta stage; interested readers can apply for access through the following process:



Here, the author recommends calling the Encyclopedia knowledge graph API through the SDK toolkit provided by Tencent Cloud, as in the following example:
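The sketch below follows the general calling convention of the Tencent Cloud Python SDK; the product module (tkgdq) and the request/method names are assumptions recalled from the SDK of that period, so consult the API documentation for the authoritative names, and replace the key placeholders with real credentials:

```python
# Sketch of calling the knowledge graph API via the Tencent Cloud Python SDK
# (pip install tencentcloud-sdk-python). The product module and method names
# below are assumptions; check the official API documentation.
from tencentcloud.common import credential
from tencentcloud.tkgdq.v20190411 import tkgdq_client, models

cred = credential.Credential("YOUR_SECRET_ID", "YOUR_SECRET_KEY")  # placeholders
client = tkgdq_client.TkgdqClient(cred, "ap-guangzhou")

req = models.DescribeEntityRequest()
req.EntityName = "刘德华"          # entity query for "Andy Lau"
resp = client.DescribeEntity(req)
print(resp.to_json_string())
```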



This brings us to the end. Through this article, the author has introduced:

  • The industry development status of knowledge graphs
  • Key technical points of knowledge extraction
  • Applications of the Tencent Cloud Encyclopedia knowledge graph

Due to the limited time available, the treatment here is necessarily brief; apologies for any omissions.

4. References

  • Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. 2017
  • Improving Language Understanding by Generative Pre-Training. Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever. 2018
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2018
  • XLNet: Generalized Autoregressive Pretraining for Language Understanding. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. 2019
  • Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. 2019
Note: a large body of literature was consulted in writing this article; the author thanks all of the original authors.

The original link: https://cloud.tencent.com/developer/article/1494570