
This article belongs to the Greedy NLP boot camp study notes series. Overall: Module 1 covers language models, Module 2 covers machine learning, and Module 3, covered here, is information extraction.

Summary of information extraction

Summary

Unstructured data includes images, text, video, and audio; information must be extracted and processed from it before a model can use it for computation.

Information Extraction (IE)

Entities: things that exist in the real world.

· Medical field: protein, disease, medicine…

Relation Extraction

located in, works at, is part of

Example:

This hotel is my favorite Hinton Property in NYC! It is located right on 42nd Street near Times Square in New York, it is close to all subways, Broadway shows, and next to great restaurants like Junior’s Cheesecake, Virgil’s BBQ and many others. NER extracts the entities and labels each with its type; relation extraction then links them:

Note: in “It is located right on 42nd Street near Times Square in New York”, what does “it” refer to? Finding that “it” means this hotel, typically with a classification method, is coreference resolution.

Entity disambiguation (is “Apple” the company or the fruit?) and entity unification (NYC and New York are the same entity) will be discussed later.

The instructor poses a question: how do search engines and question answering systems differ?

Search engine: all relevant documents are presented, and the user filters them.

Question answering system: the system gives the answer directly, without making the user choose. The level of sophistication varies: from returning documents, to returning sentences, to returning keywords or the answer itself. The latter relies on the support of a knowledge graph.

Introduction to Named Entity Recognition (NER)

NER refers to identifying entities with specific meaning in text, mainly person names, place names, organization names, proper nouns, and so on.

Case study: chatbot

First identify the intent behind the question and classify it (what is being asked: a what, a where, a when, and so on), then produce the answer based on rules; whatever the rules cannot handle is passed to other solutions.

Industry practice: rules are preferred. (Intent recognition can be seen as a classification problem; try it on smart-speaker dialogues.)

Case study: extraction from news

General-purpose tool: HanLP

The drawback of open-source tools is that they cannot do domain-specific entity recognition, e.g. for healthcare or finance; they only cover generic entities: people, times, events, and so on.

Building a named entity recognition classifier

  • 1. Define entity categories: the business defines the categories
  • 2. Prepare training data
  • 3. Train the NER model

The following is an example of training data. The first column is the sentence number, the second column is the word, the third column is the word’s part-of-speech tag, and the fourth column marks whether the word is an entity we care about. Words we do not care about are labeled O; otherwise the label is B-type, e.g. B-geo for a place and B-per for a person. When several words combine into a single entity, the first is labeled B-org and the following I-org tags indicate that they continue the same entity.

It looks strange the first time you see it. Is Chinese training data formatted the same way?
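As a rough illustration only (not the course’s actual dataset), the rows might look like the snippet below and can be grouped back into sentences like this; the column names and file layout are assumptions.

```python
import csv
from collections import defaultdict

# Hypothetical CoNLL/CSV-style NER training data: sentence id, word, POS tag, BIO label.
sample = """Sentence #,Word,POS,Tag
Sentence: 1,Thousands,NNS,O
Sentence: 1,of,IN,O
Sentence: 1,demonstrators,NNS,O
Sentence: 1,marched,VBD,O
Sentence: 1,through,IN,O
Sentence: 1,London,NNP,B-geo
"""

def load_sentences(lines):
    """Group (word, POS, tag) triples by sentence id."""
    sentences = defaultdict(list)
    for row in csv.DictReader(lines):
        sentences[row["Sentence #"]].append((row["Word"], row["POS"], row["Tag"]))
    return list(sentences.values())

sents = load_sentences(sample.splitlines())
print(sents[0])  # [('Thousands', 'NNS', 'O'), ..., ('London', 'NNP', 'B-geo')]
```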

Overview of NER method

· Non-sequence models: logistic regression, SVM…

· Sequence models: HMM, CRF (both of these require feature extraction), LSTM-CRF (no manual feature extraction required)

Rule-based approach

Use an existing dictionary or lexicon and match directly (rules written by hand).

Majority Voting

For each word, count how often it is labeled with each entity type in the training data, and record the most frequent type for that word.

No learning is needed, just counting.
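A minimal sketch of this majority-voting baseline, assuming the training data has already been flattened into (word, tag) pairs; the names and data are illustrative.

```python
from collections import Counter, defaultdict

def train_majority_voter(tagged_words):
    """tagged_words: iterable of (word, tag) pairs from the training set."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    # For each word, keep the tag it was most often labeled with.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def predict(voter, sentence, default="O"):
    """Tag unseen words with the default label O."""
    return [(w, voter.get(w.lower(), default)) for w in sentence]

train = [("London", "B-geo"), ("John", "B-per"), ("went", "O"), ("to", "O"), ("London", "B-geo")]
voter = train_majority_voter(train)
print(predict(voter, ["John", "went", "to", "London"]))
# [('John', 'B-per'), ('went', 'O'), ('to', 'O'), ('London', 'B-geo')]
```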

Supervised learning – Feature extraction

Example: The professor Colin proposed a model for NER in 1999.

English offers more features to extract than Chinese does.

  • 1. Bag of words, including bigrams and trigrams
  • 2. Part of speech
  • 3. Prefixes and suffixes (not applicable to Chinese)
  • 4. Characteristics of the current word itself

Which features to extract still has to be found by experiment; for Chinese, pinyin also counts as a feature.
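To make this concrete, here is a sketch of hand-crafted features for one word in a sentence (context words, part of speech, prefixes/suffixes, properties of the word itself); the exact feature set is an assumption, in the style used by CRF toolkits.

```python
def word_features(sentence, pos_tags, i):
    """Features for the i-th word; sentence and pos_tags are parallel lists."""
    word = sentence[i]
    feats = {
        "word.lower": word.lower(),
        "word.prefix2": word[:2],        # prefixes/suffixes (not useful for Chinese)
        "word.suffix2": word[-2:],
        "word.istitle": word.istitle(),  # characteristics of the word itself
        "word.isdigit": word.isdigit(),
        "pos": pos_tags[i],              # part of speech
    }
    # Context words, a crude stand-in for bigram/trigram features.
    feats["prev.word"] = sentence[i - 1].lower() if i > 0 else "<BOS>"
    feats["next.word"] = sentence[i + 1].lower() if i < len(sentence) - 1 else "<EOS>"
    return feats

sent = ["Colin", "proposed", "a", "model", "in", "1999", "."]
pos  = ["NNP", "VBD", "DT", "NN", "IN", "CD", "."]
print(word_features(sent, pos, 0))
```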

Feature encoding

The most common feature types are:

1. Continuous features: they can be normalized, or transformed toward a Gaussian distribution; besides being used directly, they can also be discretized into bins and treated as categorical.

2. Unordered categorical features: use one-hot encoding to convert each into a numerical vector.

3. Ordinal features: ordered categories whose intervals cannot be quantified. For example, grades A, B, C, D have a clear order, but the gap between them cannot be measured without knowing the score range behind each grade.
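A small sketch of these encodings with scikit-learn (assumed to be available): standardizing a continuous feature, binning it into discrete buckets, and one-hot encoding an unordered categorical feature.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer, OneHotEncoder

ages = np.array([[23.0], [35.0], [47.0], [61.0]])          # continuous feature
cities = np.array([["NYC"], ["Boston"], ["NYC"], ["LA"]])  # unordered categorical feature

# (1) Continuous: rescale to zero mean and unit variance.
print(StandardScaler().fit_transform(ages).ravel())

# (2) Continuous treated as discrete: bin into a few ordinal buckets.
print(KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform").fit_transform(ages).ravel())

# (3) Unordered categorical: one-hot vectors.
print(OneHotEncoder().fit_transform(cities).toarray())

# Ordinal features (e.g. grades A > B > C > D) can be mapped to ranks by hand,
# keeping in mind that the gaps between ranks are not meaningful quantities.
grade_rank = {"A": 3, "B": 2, "C": 1, "D": 0}
```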

Ontological Relation

The term “ontology” is abstract and hard to understand. An ontology usually contains two elements: entities and the relations between them.

The diagram the instructor drew of such a system is itself a description of the information structure. Take the medical domain: drug – acts on -> pathogen.

IS-A (the hypernym relation) represents a subordinate, is-a-kind-of relationship. It is a very common relation.

Rule-based method, with IS-A extraction as an example

First, rules are defined manually; they are then matched against the corpus, and the matched results are saved to a database. Suppose we want to extract fruits:

The richer the rule set we define, the more information we extract from the text, but noise (false matches) can appear, as in the car example in the screenshot above.

We can add restrictions when defining the rule set to improve precision and return only the desired results.
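As a toy illustration of such a rule set (Hearst-style patterns written as regular expressions), the patterns and sentences below are invented; the last sentence plays the role of the noisy car example that the fruit rules ignore.

```python
import re

# Hand-written IS-A patterns; the captured group is the hyponym candidate, "fruit" the class.
PATTERNS = [
    r"fruits such as ([A-Za-z]+)",
    r"([A-Za-z]+) and other fruits",
    r"fruits including ([A-Za-z]+)",
]

text = ("People enjoy fruits such as apples. "
        "Bananas and other fruits are rich in potassium. "
        "He sells cars such as Toyotas.")   # noisy sentence: not matched by the fruit patterns

hits = set()
for pattern in PATTERNS:
    for match in re.finditer(pattern, text):
        hits.add((match.group(1).lower(), "IS-A", "fruit"))

print(hits)   # {('apples', 'IS-A', 'fruit'), ('bananas', 'IS-A', 'fruit')}
```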

The rule-based approach as a whole has the following advantages: 1. it is more precise; 2. no training is required. Disadvantages: 1. low recall; 2. high labor cost; 3. …

Supervised learning-based methods

General steps: 1. Define the relation types, e.g. the relation between a disease and a symptom. 2. Define the entity types, e.g. disease, symptom, drug. 3. Label the data: mark the entities, then mark the relations between them.

Similar to NER, relation extraction can be turned into a multi-class classification problem: the goal is to classify the relation between two entities into one of the defined types.

Feature engineering for relation extraction usually uses the following kinds of features:

  • 1. Bag of words: bigrams
  • 2. Part-of-speech features: noun, verb
  • 3. Entity types: ORG, PER
  • 4. Position: where the entity appears (in the title, at the beginning of a sentence, etc.)
  • 5. Syntactic parse features (dependency features are the dependencies between words; syntax trees require a lot of linguistic knowledge; the two are not the same thing)

After feature extraction, the next step is classification. One option is to first run a binary filter (is there a relation at all?) and only then classify the relation type. The classifier can be an SVM, a neural network, GBDT, etc.; test and choose according to your own situation.
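A minimal sketch of relation classification with scikit-learn, where the feature is simply a bag of words over the text between the two entities plus their entity types; this is a simplification of the feature list above, and the data is invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Each sample: the text between the two entities, prefixed with their entity types.
X = [
    "PER ORG works as an engineer at",
    "PER ORG is employed by",
    "DIS SYM patients often present with",
    "DIS SYM is characterized by",
]
y = ["work_at", "work_at", "has_symptom", "has_symptom"]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(X, y)

print(model.predict(["PER ORG was hired as a researcher at"]))  # likely 'work_at'
```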

The Bootstrap method

This algorithm has little to do with machine learning, but it is a classic, because the two approaches above (rule-based and supervised) both require tedious manual rule writing and entity labeling.

Bootstrap is originally a sampling method: a resample relates to the sample as the sample relates to the population.

Example: suppose we want to find the relation between authors and books, starting from 3 known records (seed tuples).

Step 1: Generate rules

(Generating a rule means scanning the text for the entities of a seed tuple; when both appear, the text between the entities is taken out as a rule.)

This eventually forms a rule library.

Use the rule library to scan the text, obtain new tuples (three new ones here), add them to the previous seed tuples, and repeat the process.
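A much-simplified sketch of this loop for the book–author relation; the corpus, the seed, and the way patterns are built from the text between the entities are all illustrative.

```python
import re

corpus = [
    "The Old Man and the Sea was written by Ernest Hemingway.",
    "War and Peace was written by Leo Tolstoy.",
    "Dream of the Red Chamber was written by Cao Xueqin.",
]
seeds = {("The Old Man and the Sea", "Ernest Hemingway")}  # (book, author)

def learn_patterns(seeds, corpus):
    """The text between a known book and author becomes a pattern."""
    patterns = set()
    for book, author in seeds:
        for sent in corpus:
            if book in sent and author in sent:
                middle = sent[sent.index(book) + len(book): sent.index(author)]
                patterns.add("(.+)" + re.escape(middle) + "(.+?)\\.")
    return patterns

def apply_patterns(patterns, corpus):
    """Scan the corpus with each pattern to propose new (book, author) tuples."""
    found = set()
    for pattern in patterns:
        for sent in corpus:
            m = re.match(pattern, sent)
            if m:
                found.add((m.group(1), m.group(2)))
    return found

for _ in range(2):                     # iterate: newly found tuples become new seeds
    patterns = learn_patterns(seeds, corpus)
    seeds |= apply_patterns(patterns, corpus)

print(seeds)                           # all three (book, author) pairs
```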

Bootstrap: pros and cons

Advantages: an automatic method that needs little manual intervention. Disadvantages: the extracted rules may have low precision (error accumulation, a common issue with iterative algorithms).

Snowball

This algorithm improves on Bootstrap in order to prevent error accumulation.

On top of Bootstrap, it adds filtering steps, and rule matching changes from exact equality to computing a degree of similarity.

Snowball: the algorithm in detail

As with Bootstrap, we start from seed tuples and scan the text. Rules are then extracted, here represented as 5-tuples (so that similarity can be computed).

Note the composition of the 5-tuple: left, entity 1, middle, entity 2, right, where left/middle/right are each converted to a vector (with something like TF-IDF), so every rule can be compared through its vectors.

Similarity formula:
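The formula appeared as an image in the original notes; a plausible reconstruction, following the matching function of the original Snowball paper, is:

```latex
\mathrm{Match}(P_i, P_j) =
  \mu_L \,\langle L_i, L_j \rangle
+ \mu_M \,\langle M_i, M_j \rangle
+ \mu_R \,\langle R_i, R_j \rangle
```

Here L, M, R are the left/middle/right context vectors of the two 5-tuples, and the μ are the weights described next.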

μ are the weights of the similarity calculation. The similarity is expressed directly as inner products and does not need to be divided by the vector lengths, because L, M and R are normalized, i.e. ||L|| = 1.

1. Generating templates

In the screenshot above, the left side shows the computed vectors: a single word gets weight 1, two words get about 0.75 each, and three words about 0.57 each, because the squared weights must sum to 1 (the vectors are normalized).

The right side shows how rules are combined into templates (patterns): rules with high similarity are merged, using a clustering approach.

The example the instructor gave compares similarities one by one and merges rules that are similar enough; then the rules in each group are averaged into a single rule (the centroid).
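A rough sketch of that single-pass merging, assuming each rule’s left/middle/right parts are already normalized vectors; the threshold, weights, and helper names are all made up.

```python
import numpy as np

def match(rule_a, rule_b, mu=(0.2, 0.6, 0.2)):
    """Weighted sum of inner products of the (left, middle, right) vectors."""
    return sum(m * float(np.dot(a, b)) for m, a, b in zip(mu, rule_a, rule_b))

def cluster_rules(rules, threshold=0.7):
    """Single-pass clustering: join the first group whose centroid is similar enough, else start a new one."""
    groups = []
    for rule in rules:
        for group in groups:
            centroid = [np.mean([r[k] for r in group], axis=0) for k in range(3)]
            if match(rule, centroid) >= threshold:
                group.append(rule)
                break
        else:
            groups.append([rule])
    # Average each group into a single template rule (its centroid).
    return [[np.mean([r[k] for r in g], axis=0) for k in range(3)] for g in groups]

r1 = (np.array([1.0, 0.0]), np.array([0.7, 0.7]), np.array([0.0, 1.0]))
r2 = (np.array([1.0, 0.0]), np.array([0.6, 0.8]), np.array([0.0, 1.0]))
print(len(cluster_rules([r1, r2])))  # 1: the two similar rules are merged into one template
```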

2. Generating tuples

Once we have the rule (template) library, we can generate tuples.

First, scan the text and use NER to find entity pairs whose types match those in a template; each candidate is represented as a 5-tuple in the same way. Then compute its similarity to the templates: if it is greater than 0.7, we have found a correct tuple and add it to the table.
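Continuing the sketch above, a candidate found by NER can be compared against all templates with the same match() function and kept only when its best score clears the 0.7 threshold; the names are again illustrative.

```python
def extract_tuples(candidates, templates, threshold=0.7):
    """candidates: list of (entity1, entity2, (L, M, R) context vectors)."""
    accepted = []
    for e1, e2, ctx in candidates:
        best = max(match(ctx, template) for template in templates)
        if best > threshold:
            accepted.append((e1, e2, best))   # keep the score for later tuple evaluation
    return accepted
```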

3. Template evaluation

Next, the templates are evaluated: the seed tuples produced in previous iterations are treated as ground truth. Each rule (template) is applied to the text, the tuples it produces are compared against that data, and a confidence score is computed to decide whether the template should be discarded (e.g. if it scores below 0.5).
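A tiny sketch of that check, using the template-confidence definition from the original Snowball paper (positive matches over positive plus negative matches against the known tuples); this exact formula is an assumption, not taken from the course slides.

```python
def template_confidence(extracted_pairs, known_pairs):
    """extracted_pairs: (entity1, entity2) pairs this template produced.
    known_pairs: trusted seed tuples, used as ground truth."""
    known = dict(known_pairs)
    positive = sum(1 for e1, e2 in extracted_pairs if known.get(e1) == e2)
    negative = sum(1 for e1, e2 in extracted_pairs if e1 in known and known[e1] != e2)
    return positive / (positive + negative) if positive + negative else 0.0

# A template scoring below 0.5 would be discarded.
```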

Tuple evaluation and filtering

The higher the score of each matching rule, the more reliable the tuple; and the more rules that match it, the more reliable it is.

Confidence score formula:
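The formula was an image in the original notes; a plausible reconstruction, matching the description above and the tuple confidence used in the original Snowball paper, is:

```latex
\mathrm{Conf}(T) = 1 - \prod_{i=1}^{|P|} \bigl( 1 - \mathrm{Conf}(P_i)\,\mathrm{Match}(C_i, P_i) \bigr)
```

Here the P_i are the templates that matched the candidate tuple T, C_i is the candidate’s context, and Match is the similarity defined earlier; higher-confidence templates and more matching templates both push Conf(T) toward 1.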

We can also set a confidence threshold for tuples, e.g. 0.7, and filter on it.

Summary

The instructor spoke highly of Snowball: although it is not an AI algorithm, it solves Bootstrap’s problem, and many of its ideas are worth borrowing.

In information extraction, neural network applications are still rare; rule-based extraction remains dominant.

Entity Disambiguation

The nature of entity ambiguity is that one word can carry several meanings, and means different things in different contexts. Examples: Apple (the company or the fruit), Xiaomi (the company or the grain), and people sharing the same name.

Given an entity library, how do we effectively determine which specific entity a mention in the question refers to?

We compute the similarity between the entity mention in the question and each object in the entity library:

First, take the context strings on both sides of “apple” and convert them into a vector (TF-IDF, etc.). Then compute the similarity against the objects in the entity library. Fancier techniques can also be used.
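A small sketch of this context-similarity idea with scikit-learn TF-IDF; the entity library descriptions and the sentence are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Entity library: each candidate meaning of "apple" has a short textual description.
library = {
    "Apple Inc.":    "technology company iphone mac computer consumer electronics",
    "apple (fruit)": "fruit tree sweet juice orchard vitamin eat",
}
mention_context = "I ate an apple and an orange with lots of vitamin C"

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(list(library.values()) + [mention_context])

# Compare the mention's context against every library entry.
scores = cosine_similarity(vectors[-1], vectors[:-1]).ravel()
print(max(zip(library, scores), key=lambda kv: kv[1]))  # expected: the fruit sense wins
```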

Entity Resolution

Several entity mentions may actually describe the same entity. (This is a binary classification problem.)

For example, given two entity strings str1 and str2, check whether they refer to the same entity.

Algorithm 1: express the similarity of the two strings by edit distance.

Algorithm 2: Rule based.

Algorithm 3: Supervised learning

First, TF-IDF or similar methods turn the two strings into two vectors (feature extraction). Then there are two options:

1. Concatenate the two vectors and feed them into a model for binary classification.

2. Compute the cosine similarity first, then obtain the classification result from a logistic regression model (a sketch follows below).
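A sketch of option 2, assuming character n-gram TF-IDF features and a tiny hand-labeled training set; the data, features, and use of logistic regression on the similarity score are all illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

pairs = [("New York City", "NYC New York"), ("Barack Obama", "Barack H. Obama"),
         ("Apple Inc.", "Banana Republic"), ("Microsoft", "Mitsubishi Motors")]
labels = [1, 1, 0, 0]  # 1 = same entity, 0 = different entities

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectorizer.fit([s for pair in pairs for s in pair])

def pair_similarity(a, b):
    va, vb = vectorizer.transform([a]), vectorizer.transform([b])
    return cosine_similarity(va, vb)[0, 0]

X = np.array([[pair_similarity(a, b)] for a, b in pairs])  # single feature: cosine similarity
clf = LogisticRegression().fit(X, labels)

prob_same = clf.predict_proba([[pair_similarity("N.Y.C.", "New York City")]])[0, 1]
print(prob_same)  # probability that the two strings refer to the same entity
```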

Algorithm 4: graph-based entity unification, using a relation graph.

Again we compute similarity, this time from both the entities’ individual features and their relational (graph) features.

Co-reference Resolution

Algorithm:

1. Point to the nearest preceding entity (inaccurate).

2. Supervised learning:

2.1 Collect data and mark the mentions, e.g. which are person names and which are the pronouns that refer to them.

2.2 Label the data: as in the sentences above, mark which entity each pronoun refers to, forming training samples.

2.3 Extract keywords around each mention, build feature vectors, and feed them into the model (a sketch follows below).
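A very rough sketch of the pairwise setup implied by 2.1–2.3: each (pronoun, candidate antecedent) pair becomes a small feature vector with a binary label for whether they corefer; the features and data are invented.

```python
from sklearn.linear_model import LogisticRegression

def pair_features(pronoun, candidate, distance):
    """Toy features for a (pronoun, candidate antecedent) pair."""
    return [
        distance,                                                               # words between them
        1 if pronoun.lower() in ("he", "she") and candidate.istitle() else 0,   # person pronoun + capitalized name
        1 if pronoun.lower() == "it" and not candidate.istitle() else 0,        # "it" + common noun
    ]

# Hand-labeled pairs: ((pronoun, candidate, distance), 1 if coreferent else 0)
train = [(("he", "John", 3), 1), (("he", "hotel", 5), 0),
         (("it", "hotel", 4), 1), (("it", "John", 6), 0)]
X = [pair_features(*p) for p, _ in train]
y = [label for _, label in train]

clf = LogisticRegression().fit(X, y)
print(clf.predict([pair_features("it", "restaurant", 2)]))  # 1 would mean "coreferent"
```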

There is not much innovation in these algorithms, and their accuracy is not high; coreference resolution remains one of the unsolved problems.