0. Introduction

What exactly is feature engineering? As the name implies, it is an engineering activity: extracting, as thoroughly as possible, features from raw data for use by algorithms and models.

I've written the following quick introductions to AI basics. This article covers feature engineering part 3: text feature processing.

Already released:

AI Basics: An Easy Introduction to Python

AI Basics: An Easy Introduction to NumPy

AI Basics: An Easy Introduction to Pandas

AI Basics: An Easy Introduction to SciPy (Scientific Computing Library)

AI Basics: An Easy Introduction to Data Visualization (Matplotlib and Seaborn)

AI Basics: Feature Engineering – Categorical Features

AI Basics: Feature Engineering – Numerical Feature Processing

More installments to follow.

References:

[1] Book address: www.oreilly.com/library/vie…

[2] Translation reference: github.com/apachecn

[3] Translated by @kkejili: github.com/kkejili

Code revision and reorganization: Huang Haiguang. The original text was converted into Jupyter Notebook format, and some code was added or modified; everything runs successfully. All datasets have been uploaded to Baidu Cloud.

The code can be downloaded at Github:

Github.com/fengdu78/Da…

Baidu Cloud link for the datasets:

Link: pan.baidu.com/s/1uDXt5jWU… Extraction code: 8P5D

Text data: flattening, filtering, and chunking

If you were to design an algorithm to analyze the following paragraphs, what would you do?

Emma knocked on the door. No answer. She knocked again and waited. There was a large maple tree next to the house. Emma looked up the tree and saw a giant raven perched at the treetop. Under the afternoon sun, the raven gleamed magnificently. Its beak was hard and pointed, its claws sharp and strong. It looked regal and imposing. It reigned the tree it stood on. The raven was looking straight at Emma with its beady black eyes. Emma felt slightly intimidated. She took a step back from the door and tentatively said, "hello?"

This paragraph contains a lot of information. We know that it involves a person named Emma and a raven, that there is a house and a tree, and that Emma was trying to enter the house when she saw the raven. The magnificent raven notices Emma, who is a little scared but makes an attempt at communication.

So which parts of this information are salient features that we should extract? First, it seems like a good idea to extract the names of the main characters, Emma and the raven. Next, it may also be good to note the setting of a house, a door, and a tree. What about the descriptions of the raven? What about Emma's actions: knocking on the door, stepping back, saying hello?

This chapter introduces the basics of text feature engineering. We start with bag-of-words, the simplest text feature, which is based on word counts. A closely related transformation is tf-idf, which is essentially a feature scaling technique; it is fully discussed in the next chapter. This chapter first talks about text feature extraction, and then discusses how to filter and clean those features.

Bag-of-X: turning natural text into flat vectors

Whether we are building machine learning models or engineering features, the results should be easy to understand. Simple things are easy to try, and interpretable features and models are easier to debug than complex ones. Simple and interpretable features do not always lead to the most accurate model, but it is a good idea to start simple and add complexity only when absolutely necessary.

For text data, we can start with word counts, also called bag-of-words (BOW). A word count makes no special effort to find interesting entities like "Emma" or the raven. But those two words are mentioned repeatedly in the paragraph, so their counts are higher than those of random words like "hello". For simple document classification tasks like this, word counts are usually appropriate. They can also be used for information retrieval, where the goal is to retrieve the set of documents relevant to an input text query. Both tasks are well served by word-level features, because the presence or absence of certain words is a strong indicator of the topic content of a document.

Bag of words

In the bag-of-words representation, a text document is transformed into a vector. (A vector is just a set of n numbers.) The vector contains an entry for every possible word in the vocabulary, holding that word's count. If the word "aardvark" appears three times in the document, the entry of the feature vector at the position corresponding to that word is 3. If a word in the vocabulary does not appear in the document, its count is zero. For example, the sentence "This is a puppy and it is very cute" has the BOW representation shown in Figure 3-1.

Figure 3-1: Turning words into vectors
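As a quick, minimal sketch (using scikit-learn's CountVectorizer; the custom token pattern is only there so that one-letter words like "a" are kept), the example sentence becomes:

from sklearn.feature_extraction.text import CountVectorizer

sentence = ["This is a puppy and it is very cute"]
cv = CountVectorizer(token_pattern=r'(?u)\b\w+\b')   # keep single-character tokens
bow = cv.fit_transform(sentence)
print(cv.get_feature_names())   # vocabulary, in alphabetical order
print(bow.toarray())            # counts; "is" appears twice, so its entry is 2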

BOW converts a text document into a flat vector. It is "flat" because it does not contain any of the original textual structure. The original text is a sequence of words, but the bag-of-words vector has no sequence; it simply remembers how many times each word appears in the text. Nor does it represent any hierarchy of concepts. For example, the concept of "animal" includes "dog", "cat", "crow", and so on, but in a bag-of-words representation these words are all just distinct elements of the vector.

Figure 3-2: Two equivalent word vectors; the order of the words in the vector does not matter, as long as it is consistent for all documents in the dataset.

What matters is the geometry of the data in the feature space. In a bag of words vector, each word becomes a dimension of the vector. If there are n words in the vocabulary, the document becomes a point in n-dimensional space. It’s hard to imagine the geometry of anything other than two or three dimensions, so we have to use our imagination. Figure 3-3 shows our example sentences in the feature space corresponding to the “puppy” and “cute” dimensions.

Figure 3-3: Illustration of a document in feature space

Figure 3-4: Three-dimensional feature space

Figure 3-3 and Figure 3-4 depict data vectors in feature space. The axes represent individual words, which are the features under the bag-of-words representation, and the points in the space represent data points (text documents). Sometimes it is also useful to look at feature vectors in data space. A feature vector contains the value of the feature in each data point; the axes represent individual data points and the points represent feature vectors. Figure 3-5 shows an example. With bag-of-words features for text documents, a feature is a word, and a feature vector contains the count of that word in each document. In this way, a word is represented as a "bag of documents". As we will see in Chapter 4, these bag-of-documents vectors come from the matrix transpose of the bag-of-words vectors.
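A minimal sketch of this change of viewpoint, using a tiny made-up corpus (the sentences are only illustrative): transposing the document-term matrix turns rows that describe documents into rows that describe words.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["it is a puppy", "it is a cat", "it is a kitten"]   # toy corpus
cv = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
doc_term = pd.DataFrame(cv.fit_transform(docs).toarray(),
                        columns=cv.get_feature_names())
print(doc_term)     # feature space: one row per document, one column per word
print(doc_term.T)   # data space: one row per word, described by its counts in each document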

Bag-of-N-gram

Bag-of-n-grams, or bag-of-ngrams, is a natural extension of BOW. An n-gram is a sequence of n tokens. A word is basically a 1-gram, also known as a unigram. After tokenization, the counting mechanism can count individual words or count overlapping sequences as n-grams. For example, the sentence "Emma knocked on the door" generates the 2-grams "Emma knocked", "knocked on", "on the", and "the door". n-grams preserve more of the original sequence structure of the text, so a bag-of-ngrams representation can be more informative. But there is a price. Theoretically, with k unique words there could be k² unique 2-grams (also called bigrams). In practice there are not nearly that many, because not every word can follow every other word. Still, there are usually far more distinct n-grams (n > 1) than words. This means a larger, sparser feature space, and higher costs for computing, storing, and modeling n-grams. The larger n is, the richer the information, and the higher the cost.
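A tiny sketch in plain Python (no particular library assumed) that generates these bigrams:

tokens = "Emma knocked on the door".split()
bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
print(bigrams)   # ['Emma knocked', 'knocked on', 'on the', 'the door']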

To illustrate how the number of n-grams grows with n, let's compute n-grams on the Yelp reviews dataset. We use pandas and scikit-learn's CountVectorizer transformer to compute the n-grams of the first 10,000 reviews.

import pandas as pd
import json
from sklearn.feature_extraction.text import CountVectorizer
# Load the first 10,000 reviews
f = open('data/yelp_academic_dataset_review.json')
js = []
for i in range(10000):
    js.append(json.loads(f.readline()))
f.close()
review_df = pd.DataFrame(js)

Note: all datasets have been uploaded to Baidu Cloud (link above).

review_df.head()
business_id date review_id stars text type user_id votes
0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 My wife took me here on my birthday for breakf… review rLtl8ZkDX5vH5nAx9C3q5Q {‘funny’: 0, ‘useful’: 5, ‘cool’: 2}
1 ZRJwVLyzEJq1VAihDhYiow 2011-07-27 IjZ33sJrzXqU-0X6U8NwyA 5 I have no idea why some people give bad review… review 0a2KyEL0d3Yb1V6aivbIuQ {‘funny’: 0, ‘useful’: 0, ‘cool’: 0}
2 6oRAC4uyJCsJl1X0WZpVSA 2012-06-14 IESLBzqUCLdSzSqm0eCSxQ 4 love the gyro plate. Rice is so good and I als… review 0hT2KtfLiobPvh6cDC8JQg {‘funny’: 0, ‘useful’: 1, ‘cool’: 0}
3 _1QQZuf4zZOyFCvXc0o6Vg 2010-05-27 G-WvGaISbqqaMHlNnByodA 5 Rosie, Dakota, and I LOVE Chaparral Dog Park!! . review uZetl9T0NcROGOyFfughhg {‘funny’: 0, ‘useful’: 2, ‘cool’: 1}
4 6ozycU1RpktNG2-1BroVtw 2012-01-05 1uJFq2r5QfJG_6ExMRCaGw 5 General Manager Scott Petello is a good egg!!! . review vYmM4KTsC8ZfQBg-j5MWkw {‘funny’: 0, ‘useful’: 0, ‘cool’: 0}
# Create feature transformers for unigrams, bigrams, and trigrams.
# The default ignores single-character words, which is useful in practice because it trims
# uninformative words. But we explicitly include them in this example for illustration purposes.
bow_converter = CountVectorizer(token_pattern='(?u)\\b\\w+\\b')
bigram_converter = CountVectorizer(
    ngram_range=(2, 2), token_pattern='(?u)\\b\\w+\\b')
trigram_converter = CountVectorizer(
    ngram_range=(3, 3), token_pattern='(?u)\\b\\w+\\b')
# Fit the transformers and look at vocabulary size
bow_converter.fit(review_df['text'])
words = bow_converter.get_feature_names()
bigram_converter.fit(review_df['text'])
bigram = bigram_converter.get_feature_names()
trigram_converter.fit(review_df['text'])
trigram = trigram_converter.get_feature_names()
print(len(words), len(bigram), len(trigram))
29222 368943 881620
# Sneak a peek at the n-grams themselves
words[:10]

['0', '00', '000', '007', '00a', '00am', '00pm', '01', '02', '03']

bigram[-10:]

['zuzu was', 'zuzus room', 'zweigel wine', 'zwiebel kräuter', 'zy world', 'zzed in', 'the eclairs napoleons', 'école lenôtre', 'ém all', 'oc cham']

trigram[:10]

['0 0 eye', '0 20 less', '0 39 oz', '0 39 pizza', '0 5 i', '0 50 to', '0 6 can', '0 75 oysters', '0 75 that', '0 75 to']

Figure 3-6 Number of unique n-gram in the first 10,000 reviews of the Yelp dataset

Filtering and cleaning features

How do we cleanly separate the signal from the noise? Through filtering, techniques that use raw tokenization and counting to generate lists of simple words or n-grams become more usable. Phrase detection, which we discuss below, can be seen as a particular bigram filter. Here are a few ways to perform filtering.

Stop words

Classification and retrieval usually do not require a deep understanding of the text. For example, in the sentence "Emma knocked on the door", the words "on" and "the" do not carry much information. Pronouns, articles, and prepositions add little value most of the time. The popular Python NLP package NLTK contains stop-word lists, defined by linguists, for many languages. (You will need to install NLTK and run nltk.download() to get all the goodies.) Lists of stop words are also available online. For example, here are some sample words from the English stop-word list:

Sample words from the NLTK stopword list: a, about, above, am, an, been, didn't, couldn't, i'd, i'll, itself, let's, myself, our, they, through, when's, whom...

Notice that the list contains apostrophes and that the words are not capitalized. To use it as is, the tokenization process must not remove apostrophes, and the words need to be converted to lowercase.
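A minimal sketch of using the NLTK list in exactly that way (assuming the stopwords corpus has already been downloaded):

import nltk
# nltk.download('stopwords')   # run once to fetch the corpus
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = ["emma", "knocked", "on", "the", "door"]   # already lowercased, apostrophes kept
print([t for t in tokens if t not in stop_words])   # ['emma', 'knocked', 'door']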

Frequency-based filtering

A stop-word list is one way of removing common words that make for empty features. There are other, more statistical ways to get at the concept of "common words." In collocation extraction we will see methods that rely on manual definitions as well as methods that use statistics. The same idea applies to word filtering: we can also use frequency statistics.

High-frequency words

Frequency statistics are useful for filtering out corpus-specific common words as well as general-purpose stop words. For example, the phrase "New York Times", along with the individual words, appears frequently in the New York Times articles dataset. The word "house" appears frequently in the phrase "House of Commons" in the Hansard corpus of Canadian parliamentary debates, a popular dataset for statistical machine translation because it contains both an English and a French version of all documents. These words are meaningful in general language, but not within the corpus. A manually defined stop-word list catches the general stop words, but not the corpus-specific ones.

Table 3-1 lists the 40 most frequent words in the Yelp reviews dataset. Here, frequency is taken to be the number of documents (reviews) a word appears in, not its count within a document. As we can see, the list covers many stop words. It also contains some surprises: "s" and "t" are on the list because we used the apostrophe as a tokenization delimiter, so words like "Mary's" and "didn't" were split into "Mary s" and "didn t". The words "good", "food", and "great" each appear in about a third of the reviews, but we might want to keep them because they are very useful for sentiment analysis or business categorization.

The most frequent words can reveal parsing problems and highlight normally useful words that happen to appear too many times in this corpus. For example, the most frequent word in the New York Times corpus is "times". In practice, it helps to combine frequency-based filtering with a stop-word list. There is also the thorny question of where to place the cutoff. Unfortunately there is no universal answer; in most cases the cutoff has to be determined manually, and it may need to be re-examined when the dataset changes.
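As a sketch of how such a cutoff might be applied in practice (the 90% threshold below is an arbitrary illustration, not a recommendation), scikit-learn's CountVectorizer can drop words that appear in too large a fraction of documents via max_df:

from sklearn.feature_extraction.text import CountVectorizer

# review_df is the DataFrame of Yelp reviews loaded earlier in the n-gram example.
cv = CountVectorizer(max_df=0.9)   # ignore words that appear in more than 90% of reviews
cv.fit(review_df['text'])
print(len(cv.stop_words_), 'words were dropped as corpus-specific high-frequency words')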

Rare words

Depending on the task, you may also need to filter out rare words. For statistical models, a word that appears in only one or two documents is more like noise than useful information. For example, suppose the task is to classify businesses based on their Yelp reviews, and a single review contains the word "gobbledygook". Based on this one word, how could we tell whether the business is a restaurant, a beauty salon, or a bar? Even if we knew that in this case the business happened to be a bar, treating "gobbledygook" as a bar indicator would likely be a mistake for other reviews that contain it.

Not only are rare words unreliable, they also incur computational overhead. The set of 1.6 million Yelp reviews contains 357,481 unique words (tokenized by spaces and punctuation), of which 189,915 appear in only one review and 41,162 in only two. More than 60% of the vocabulary occurs rarely. This is a so-called heavy-tailed distribution, which is very common in real-world data. The training time of many statistical machine learning models scales linearly with the number of features, and some models are quadratic or worse. Rare words incur a large computation and storage cost for very little additional gain.

Rare words can easily be identified and trimmed based on word counts. Alternatively, their counts can be aggregated into a special garbage bin, which can then serve as an extra feature. Figure 3-7 illustrates this on a short document that contains a few frequent words and two rare words, "gobbledygook" and "zylophant". The frequent words keep their own counts and can be further filtered by a stop-word list or other frequency-based methods; the rare words lose their identity and are grouped into the garbage-bin feature. Since the rare words are not known until the whole corpus has been counted, collecting them into the garbage bin has to be done as a post-processing step.
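A small sketch of both ideas, pruning by document frequency with min_df and collapsing the pruned words into a single garbage-bin count (the rare_word_bin helper is hypothetical, written only for illustration):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["gobbledygook is rare", "zylophant is rare too", "common words are common"]
cv = CountVectorizer(min_df=2)          # keep only words appearing in at least 2 documents
X = cv.fit_transform(docs)
kept = set(cv.get_feature_names())      # {'is', 'rare'} for this toy corpus

# Hypothetical garbage-bin feature: total count of pruned (rare) words per document.
def rare_word_bin(doc):
    return sum(1 for tok in doc.lower().split() if tok not in kept)

print([rare_word_bin(d) for d in docs])   # [1, 2, 4] for this toy corpus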

Since this book is about feature engineering, we focus on features. But the concept of rarity also applies to data points. If a text document is very short, it probably does not contain useful information and should not be used when training a model.

Caution must be exercised when applying this rule, though. The Wikipedia dump contains many incomplete stubs that can probably be safely filtered out. Tweets, on the other hand, are inherently short and call for other featurization and modeling tricks.

Stemming

One problem with simple parsing is that different variants of the same word get counted as separate words. For example, "flower" and "flowers" are technically different tokens, and so are "swimmer", "swimming", and "swim", even though their meanings are very close. It would be nice if all of these variants mapped to the same word.

Stemming is an NLP task that tries to cut words down to their basic linguistic stem form. There are different approaches: some are based on linguistic rules, others on observed statistics. A related class of algorithms, called lemmatization, combines part-of-speech tagging with linguistic rules.

The Porter stemmer is the most widely used free stemming tool for English. The original program was written in ANSI C, but many packages have since wrapped it to provide access from other languages. Although some efforts exist for other languages, most stemming tools focus on English.

The following is an example of running the Porter stemmer through the NLTK Python package. As we can see, it handles a large number of cases, including turning "sixties" and "sixty" into the same root, "sixti". But it is not perfect: the word "goes" is mapped to "goe", while "go" is mapped to itself.

import nltk
stemmer = nltk.stem.porter.PorterStemmer()

stemmer.stem('flowers')
'flower'

stemmer.stem('zeroes')
'zero'

stemmer.stem('stemmer')
'stemmer'

stemmer.stem('sixties')
'sixti'

stemmer.stem('sixty')
'sixti'

stemmer.stem('goes')
'goe'

stemmer.stem('go')
'go'

Stemming does have a computational cost. Whether the benefit ultimately outweighs the cost depends on the application.

Atoms of meaning: from words to N-grams to phrases

The concept of a bag of words is simple, but how does a computer know what a word is? A text document is represented digitally as a string, which is basically a sequence of characters. We may also encounter semi-structured text in the form of JSON blobs or HTML pages. But even with added tags and structure, the basic unit is still a string. How do we turn a string into a sequence of words? This involves the tasks of parsing and tokenization, which we discuss below.

Parsing and tokenization

Parsing is necessary when the string contains more than plain text. For example, if the raw data is a web page, an email, or some kind of log, then it contains additional structure. We need to decide how to handle markup, headers, footers, or the uninteresting parts of the log. If the document is a web page, the parser needs to deal with URLs. In the case of email, special fields like From, To, and Subject may need special handling; otherwise these headers end up as ordinary words in the final count, which may not be useful.

After light parsing, the plain-text portion of the document can go through tokenization. This turns the string, a sequence of characters, into a sequence of tokens. Each token can then be counted as a word. The tokenizer needs to know which characters indicate that one token has ended and another has begun. Space characters are usually good separators, as is punctuation. If the text contains tweets, however, the pound sign (#) should not be used as a delimiter (also known as a separator).

Sometimes the analysis needs to operate on sentences rather than entire documents. For example, n-grams are defined within a sentence and should not cross sentence boundaries. More sophisticated text featurization methods like word2vec also work with sentences or paragraphs. In these cases, you need to first parse the document into sentences, and then further tokenize each sentence into words.
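A minimal sketch with NLTK (assuming the punkt tokenizer models have been downloaded):

import nltk
# nltk.download('punkt')   # run once
text = "Emma knocked on the door. No answer. She knocked again."
sentences = nltk.sent_tokenize(text)                    # document -> sentences
words = [nltk.word_tokenize(s) for s in sentences]      # each sentence -> words
print(sentences)
print(words)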

String objects

String objects come in various encodings, such as ASCII or Unicode. Plain English text can be encoded in ASCII; most other languages require Unicode. If the document contains non-ASCII characters, make sure the tokenizer can handle that particular encoding. Otherwise the results will be incorrect.
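For example (a small sketch; the byte string is just an illustration of data as it might arrive from disk or the network):

raw_bytes = 'café crème'.encode('utf-8')   # non-ASCII text stored as UTF-8 bytes
text = raw_bytes.decode('utf-8')           # decode explicitly before tokenizing
print(text.split())                        # ['café', 'crème']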

Collocation extraction for phrase detection

A sequence of tokens immediately yields a word list and n-grams. Semantically speaking, however, we are more used to understanding phrases than n-grams. In computational natural language processing, the concept of a useful phrase is called a collocation. In the words of Manning and Schütze (1999: 141): "A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things."

Collocations are more meaningful than the sum of their parts. For example, "strong tea" means something beyond "great physical strength" plus "tea", so it is considered a collocation. The phrase "cute puppy", on the other hand, means exactly the sum of its parts: "cute" and "puppy". It is therefore not considered a collocation.

Collocations do not have to be consecutive sequences. The sentence "Emma knocked on the door" is generally considered to contain the collocation "knock door", so not every collocation is an n-gram. Conversely, not every n-gram is a meaningful collocation.

Because collocations are more than the sum of their parts, their meaning cannot be adequately captured by counting individual words. Bag-of-words falls short as a representation. Bag-of-ngrams is also problematic, because it captures too many meaningless sequences (consider "this is" in the bag-of-ngrams example) and not enough of the meaningful ones.

Collocations are useful as features. But how do we find and extract them from text? One way is to define them up front. If we tried really hard, we could probably find comprehensive lists of idioms in various languages and look for any matches in the text. It would be very expensive, but it would work. If the corpus is very domain specific and full of esoteric terminology, this may be the preferred method. But such a list requires a lot of manual curation and constant updating as the corpus changes, and it would probably not be practical for, say, tweets, blogs, and news articles.

Since the rise of statistical NLP over the past two decades, people have increasingly turned to statistical methods for finding phrases. Instead of building a fixed list of phrases and idioms, statistical collocation extraction relies on ever-evolving data to reveal the popular phrases of the day.

Frequency-based methods

A simple hack is to look at the most frequently occurring n-grams. The problem with this approach is that the most frequent ones are not necessarily the most useful. Table 3-2 shows the most frequent bigrams (n = 2), and as we can see, the top 10 by document count are very generic terms that do not carry much meaning.

Hypothesis testing for collocation extraction

Raw popularity counts are too crude a measure. We need smarter statistics to be able to pick out meaningful phrases easily. The key idea is to ask whether two words appear together more often than they would by chance. The statistical machinery for answering this question is called hypothesis testing.

Hypothesis testing is a way of boiling noisy data down to a "yes" or "no" answer. It involves modeling the data as samples drawn from random distributions. The randomness means that one can never be 100% sure of the answer; there is always a chance of anomalies. So answers come attached to probabilities. For example, the outcome of a hypothesis test might be "these two datasets come from the same distribution with 95% probability." For a gentle introduction to hypothesis testing, see the Khan Academy tutorial on hypothesis testing and p-values.

In the context of collocation extraction, many hypothesis tests have been proposed over the years. One of the most successful methods is based on the likelihood ratio test (Dunning, 1993). For a given pair of words, the method tests two hypotheses about the observed dataset. Hypothesis 1 (the null hypothesis) says that word 1 appears independently of word 2; in other words, seeing word 1 tells us nothing about whether we will see word 2. Hypothesis 2 (the alternative hypothesis) says that seeing word 1 changes the likelihood of seeing word 2. We take the alternative hypothesis to imply that the two words form a common phrase. So the likelihood ratio test for phrase detection (a.k.a. collocation extraction) asks the following question: are the observed word occurrences in a given text corpus more likely to have been generated by a model in which the two words occur independently of one another, or by a model in which the probabilities of the two words are entangled?

That is a mouthful. Let's work through a little math. (Math expresses things very precisely and concisely, but it does require a rather different parser than natural language.)

The likelihood function L(Data; H) represents the probability of observing the word frequencies in the dataset under the independent model or the non-independent model for the word pair. To compute this probability, we have to make another assumption about how the data is generated. The simplest data-generation model is the binomial model: for each word in the dataset, we flip a coin; if the coin comes up heads we insert our special word, otherwise we insert some other word. Under this strategy, the number of occurrences of the special word follows a binomial distribution, which is completely determined by the total number of words, the number of occurrences of the special word, and the probability of heads.

The algorithm for detecting common phrases with the likelihood ratio test proceeds as follows (a small sketch follows the list).

  1. Compute the occurrence probability of each individual word: p(w).
  2. Compute the conditional pairwise occurrence probability for each unique bigram: p(w2 | w1).
  3. Compute the log-likelihood ratio log λ for each unique bigram.
  4. Sort the bigrams by their likelihood ratio.
  5. Keep the bigrams with the smallest likelihood ratios as features.
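As a rough sketch of these steps in practice (not the exact procedure from the book): NLTK ships collocation utilities whose likelihood_ratio association measure implements this kind of test. The toy text below is only for illustration; on a real corpus you would feed in the full token stream.

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# nltk.download('punkt')   # run once if the tokenizer models are missing
text = ("Emma knocked on the door. She knocked again. "
        "The raven perched on the maple tree next to the door.")
tokens = [t.lower() for t in nltk.word_tokenize(text)]

finder = BigramCollocationFinder.from_words(tokens)
# finder.apply_freq_filter(2)   # on a real corpus, drop very rare candidate pairs first
measures = BigramAssocMeasures()
# NLTK's score grows as the independence hypothesis becomes less plausible, so the
# top-scoring pairs correspond to the smallest likelihood ratios described above.
print(finder.nbest(measures.likelihood_ratio, 5))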

Understanding the likelihood ratio test

The point is that what the test compares is not the probability parameters themselves, but the probability of observing the data under those parameters (and an assumed data-generation model). Likelihood is one of the key concepts of statistical learning, and it is definitely a brain-twister the first few times you see it. But once you work through the logic, it becomes intuitive.

There is another statistical approach based on pointwise mutual information, but it is sensitive to rare words, which are common in real-world text corpora. It is therefore not commonly used, and we will not demonstrate it here.

Note that all of the statistical methods for collocation extraction, whether they use raw frequency, hypothesis testing, or pointwise mutual information, operate by filtering a list of candidate phrases. The simplest and cheapest way to generate such a list is to count n-grams. It is possible to generate non-contiguous sequences, but they are expensive to compute. In practice, even for contiguous n-grams, people rarely go beyond bigrams or trigrams, because there are too many of them even after filtering. Other ways to generate longer phrases include chunking and combining it with part-of-speech tagging.

Chunking and part-of-speech tagging

Chunking is a bit more sophisticated than computing n-grams, in that it forms sequences of tokens based on parts of speech, using rule-based models.

For example, we may be most interested in finding all of the noun phrases in a problem where the entities, the subjects of the text, are the most interesting. To find them, we tag each word with its part of speech and then examine the token's neighborhood for part-of-speech groupings, or "chunks". The models that map words to parts of speech are generally language specific. Several open source Python libraries, such as NLTK, spaCy, and TextBlob, have multiple language models available.

To illustrate how several Python libraries make chunking with part-of-speech tagging fairly easy, let's use the Yelp reviews dataset again. We will use spaCy and TextBlob to tag parts of speech and find noun phrases.

import pandas as pd
import json
# Load the first 10 reviews
f = open('data/yelp_academic_dataset_review.json')
js = []
for i in range(10):
    js.append(json.loads(f.readline()))
f.close()
review_df = pd.DataFrame(js)
## First we'll walk through spaCy's functions
#!pip install -U spacy
#!python -m spacy download en
# To use spaCy, run the two commands above (uncommented) once to install it and download the English model.
import spacy
# preload the language model
nlp = spacy.load('en')
# We can create a Pandas Series of spaCy nlp variables
doc_df = review_df['text'].apply(nlp)
# spaCy gives you fine grained parts of speech using: (.pos_)
# and coarse grained parts of speech using: (.tag_)
for doc in doc_df[4]:
    print([doc.text, doc.pos_, doc.tag_])
['General', 'PROPN', 'NNP']
['Manager', 'PROPN', 'NNP']
['Scott', 'PROPN', 'NNP']
['Petello', 'PROPN', 'NNP']
['is', 'VERB', 'VBZ']
['a', 'DET', 'DT']
['good', 'ADJ', 'JJ']
['egg', 'NOUN', 'NN']
['!', 'PUNCT', '.']
['!', 'PUNCT', '.']
['!', 'PUNCT', '.']
['Not', 'ADV', 'RB']
['to', 'PART', 'TO']
['go', 'VERB', 'VB']
['into', 'ADP', 'IN']
['detail', 'NOUN', 'NN']
[',', 'PUNCT', ',']
['but', 'CCONJ', 'CC']
['let', 'VERB', 'VB']
['me', 'PRON', 'PRP']
['assure', 'VERB', 'VB']
['you', 'PRON', 'PRP']
['if', 'ADP', 'IN']
['you', 'PRON', 'PRP']
['have', 'VERB', 'VBP']
['any', 'DET', 'DT']
['issues', 'NOUN', 'NNS']
['(', 'PUNCT', '-LRB-']
['albeit', 'ADP', 'IN']
['rare', 'ADJ', 'JJ']
[')', 'PUNCT', '-RRB-']
['speak', 'VERB', 'VBP']
['with', 'ADP', 'IN']
['Scott', 'PROPN', 'NNP']
['and', 'CCONJ', 'CC']
['treat', 'VERB', 'VB']
['the', 'DET', 'DT']
['guy', 'NOUN', 'NN']
['with', 'ADP', 'IN']
['some', 'DET', 'DT']
['respect', 'NOUN', 'NN']
['as', 'ADP', 'IN']
['you', 'PRON', 'PRP']
['state', 'VERB', 'VBP']
['your', 'DET', 'PRP$']
['case', 'NOUN', 'NN']
['and', 'CCONJ', 'CC']
['I', 'PRON', 'PRP']
["'d", 'AUX', 'MD']
['be', 'VERB', 'VB']
['surprised', 'ADJ', 'JJ']
['if', 'ADP', 'IN']
['you', 'PRON', 'PRP']
['do', 'VERB', 'VBP']
["n't", 'ADV', 'RB']
['walk', 'VERB', 'VB']
['out', 'ADV', 'RB']
['totally', 'ADV', 'RB']
['satisfied', 'ADJ', 'JJ']
['as', 'ADP', 'IN']
['I', 'PRON', 'PRP']
['just', 'ADV', 'RB']
['did', 'VERB', 'VBD']
['.', 'PUNCT', '.']
['Like', 'INTJ', 'UH']
['I', 'PRON', 'PRP']
['always', 'ADV', 'RB']
['say', 'VERB', 'VBP']
['...', 'PUNCT', '.']
['"', 'PUNCT', "''"]
['Mistakes', 'NOUN', 'NNS']
['are', 'VERB', 'VBP']
['inevitable', 'ADJ', 'JJ']
[',', 'PUNCT', ',']
['it', 'PRON', 'PRP']
["'s", 'VERB', 'VBZ']
['how', 'ADV', 'WRB']
['we', 'PRON', 'PRP']
['recover', 'VERB', 'VBP']
['from', 'ADP', 'IN']
['them', 'PRON', 'PRP']
['that', 'DET', 'WDT']
['is', 'VERB', 'VBZ']
['important', 'ADJ', 'JJ']
['"', 'PUNCT', "''"]
['!', 'PUNCT', '.']
['!', 'PUNCT', '.']
['!', 'PUNCT', '.']
['\n\n', 'SPACE', '_SP']
['Thanks', 'NOUN', 'NNS']
['to', 'ADP', 'IN']
['Scott', 'PROPN', 'NNP']
['and', 'CCONJ', 'CC']
['his', 'DET', 'PRP$']
['awesome', 'ADJ', 'JJ']
['staff', 'NOUN', 'NN']
['.', 'PUNCT', '.']
['You', 'PRON', 'PRP']
["'ve", 'VERB', 'VB']
['got', 'VERB', 'VBN']
['a', 'DET', 'DT']
['customer', 'NOUN', 'NN']
['for', 'ADP', 'IN']
['life', 'NOUN', 'NN']
['!', 'PUNCT', '.']
['!', 'PUNCT', '.']
['...', 'PUNCT', '.']
[':', 'PUNCT', ':']
['^', 'PUNCT', 'LS']
[')', 'PUNCT', '-RRB-']
print([chunk for chunk in doc_df[4].noun_chunks])
[General Manager Scott Petello, a good egg, detail, me, you, you, any issues, Scott, the guy, some respect, you, your case, I, you, I, I, "Mistakes, it, we, them, Thanks, Scott, his awesome staff, You, a customer, life]
import nltk
nltk.download('averaged_perceptron_tagger')
True
## We can do the same feature transformations using Textblob
from textblob import TextBlob
# The default tagger in TextBlob uses the PatternTagger, which is fine for our example.
# You can also specify the NLTK tagger, which works better for incomplete sentences.
blob_df = review_df['text'].apply(TextBlob)
blob_df[4].tags
[('General', 'NNP'),
 ('Manager', 'NNP'),
 ('Scott', 'NNP'),
 ('Petello', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('good', 'JJ'),
 ('egg', 'NN'),
 ('Not', 'RB'),
 ('to', 'TO'),
 ('go', 'VB'),
 ('into', 'IN'),
 ('detail', 'NN'),
 ('but', 'CC'),
 ('let', 'VB'),
 ('me', 'PRP'),
 ('assure', 'VB'),
 ('you', 'PRP'),
 ('if', 'IN'),
 ('you', 'PRP'),
 ('have', 'VBP'),
 ('any', 'DT'),
 ('issues', 'NNS'),
 ('albeit', 'IN'),
 ('rare', 'NN'),
 ('speak', 'NN'),
 ('with', 'IN'),
 ('Scott', 'NNP'),
 ('and', 'CC'),
 ('treat', 'VB'),
 ('the', 'DT'),
 ('guy', 'NN'),
 ('with', 'IN'),
 ('some', 'DT'),
 ('respect', 'NN'),
 ('as', 'IN'),
 ('you', 'PRP'),
 ('state', 'NN'),
 ('your', 'PRP$'),
 ('case', 'NN'),
 ('and', 'CC'),
 ('I', 'PRP'),
 ("'d", 'MD'),
 ('be', 'VB'),
 ('surprised', 'VBN'),
 ('if', 'IN'),
 ('you', 'PRP'),
 ('do', 'VBP'),
 ("n't", 'RB'),
 ('walk', 'VB'),
 ('out', 'RP'),
 ('totally', 'RB'),
 ('satisfied', 'JJ'),
 ('as', 'IN'),
 ('I', 'PRP'),
 ('just', 'RB'),
 ('did', 'VBD'),
 ('Like', 'IN'),
 ('I', 'PRP'),
 ('always', 'RB'),
 ('say', 'VBP'),
 ('..', 'VBP'),
 ('Mistakes', 'NNS'),
 ('are', 'VBP'),
 ('inevitable', 'JJ'),
 ('it', 'PRP'),
 ("'s", 'VBZ'),
 ('how', 'WRB'),
 ('we', 'PRP'),
 ('recover', 'VBP'),
 ('from', 'IN'),
 ('them', 'PRP'),
 ('that', 'WDT'),
 ('is', 'VBZ'),
 ('important', 'JJ'),
 ('Thanks', 'NNS'),
 ('to', 'TO'),
 ('Scott', 'NNP'),
 ('and', 'CC'),
 ('his', 'PRP$'),
 ('awesome', 'JJ'),
 ('staff', 'NN'),
 ('You', 'PRP'),
 ("'ve", 'VBP'),
 ('got', 'VBN'),
 ('a', 'DT'),
 ('customer', 'NN'),
 ('for', 'IN'),
 ('life', 'NN'),
 ('^', 'NN')]
print([np for np in blob_df[4].noun_phrases])
['general manager', 'scott petello', 'good egg', 'scott', "n't walk", '... ..', 'mistakes', 'thanks', 'scott', 'awesome staff', '... . . ']

You can see that the noun phrases found by each library are a little different. spaCy includes common English function words such as "a" and "the", while TextBlob removes them. This reflects a difference in the rule engines that drive what each library considers a "noun phrase". You can also write your own part-of-speech relationships to define the chunks you are looking for. Natural Language Processing with Python offers an in-depth look at chunking in Python from the ground up.

Example

Constructing a text dataset

import numpy as np
import pandas as pd

corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!'
]
labels = ['weather', 'weather', 'animals', 'animals', 'weather', 'animals']
corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus,
                          'Category': labels})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df
Document Category
0 The sky is blue and beautiful. weather
1 Love this blue and beautiful sky! weather
2 The quick brown fox jumps over the lazy dog. animals
3 The brown fox is quick and the blue dog is lazy! animals
4 The sky is very blue and the sky is very beaut… weather
5 The dog is lazy but the brown fox is quick! animals

Basic preprocessing

import nltk
nltk.download('stopwords')  # run once to fetch the stop-word list
True
import re

# Word tokenizer and stop-word list
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')
print(stop_words)

def normalize_document(doc):
    # lowercase and remove special characters/whitespace
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
norm_corpus = normalize_corpus(corpus)
norm_corpus
#The sky is blue and beautiful.
array(['sky blue beautiful', 'love blue beautiful sky',
       'quick brown fox jumps lazy dog', 'brown fox quick blue dog lazy',
       'sky blue sky beautiful today', 'dog lazy brown fox quick'],
      dtype='<U30')

The bag-of-words model

from sklearn.feature_extraction.text import CountVectorizer
print (norm_corpus)
cv = CountVectorizer(min_df=0., max_df=1.)
cv.fit(norm_corpus)
print (cv.get_feature_names())
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix = cv_matrix.toarray()
cv_matrix
['sky blue beautiful' 'love blue beautiful sky'
 'quick brown fox jumps lazy dog' 'brown fox quick blue dog lazy'
 'sky blue sky beautiful today' 'dog lazy brown fox quick']
['beautiful', 'blue', 'brown', 'dog', 'fox', 'jumps', 'lazy', 'love', 'quick', 'sky', 'today']

array([[1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0],
       [0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0],
       [0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 1],
       [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0]], dtype=int64)
vocab = cv.get_feature_names()
pd.DataFrame(cv_matrix, columns=vocab)
beautiful blue brown dog fox jumps lazy love quick sky today
0 1 1 0 0 0 0 0 0 0 1 0
1 1 1 0 0 0 0 0 1 0 1 0
2 0 0 1 1 1 1 1 0 1 0 0
3 0 1 1 1 1 0 1 0 1 0 0
4 1 1 0 0 0 0 0 0 0 2 1
5 0 0 1 1 1 0 1 0 1 0 0

The n-gram model

bv = CountVectorizer(ngram_range=(2, 2))
bv_matrix = bv.fit_transform(norm_corpus)
bv_matrix = bv_matrix.toarray()
vocab = bv.get_feature_names()
pd.DataFrame(bv_matrix, columns=vocab)
beautiful sky beautiful today blue beautiful blue dog blue sky brown fox dog lazy fox jumps fox quick jumps lazy lazy brown lazy dog love blue quick blue quick brown sky beautiful sky blue
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0
2 0 0 0 0 0 1 0 1 0 1 0 1 0 0 1 0 0
3 0 0 0 1 0 1 1 0 1 0 0 0 0 1 0 0 0
4 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1
5 0 0 0 0 0 1 1 0 1 0 1 0 0 0 0 0 0

The TF-IDF model

from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()
vocab = tv.get_feature_names()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)
beautiful blue brown dog fox jumps lazy love quick sky today
0 0.60 0.52 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.60 0.00
1 0.46 0.39 0.00 0.00 0.00 0.00 0.00 0.66 0.00 0.46 0.00
2 0.00 0.00 0.38 0.38 0.38 0.54 0.38 0.00 0.38 0.00 0.00
3 0.00 0.36 0.42 0.42 0.42 0.00 0.42 0.00 0.42 0.00 0.00
4 0.36 0.31 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.72 0.52
5 0.00 0.00 0.45 0.45 0.45 0.00 0.45 0.00 0.45 0.00 0.00

Similarity features

from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df
0 1 2 3 4 5
0 1.000000 0.753128 0.000000 0.185447 0.807539 0.000000
1 0.753128 1.000000 0.000000 0.139665 0.608181 0.000000
2 0.000000 0.000000 1.000000 0.784362 0.000000 0.839987
3 0.185447 0.139665 0.784362 1.000000 0.109653 0.933779
4 0.807539 0.608181 0.000000 0.109653 1.000000 0.000000
5 0.000000 0.000000 0.839987 0.933779 0.000000 1.000000

Clustering features

from sklearn.cluster import KMeans

km = KMeans(n_clusters=2)
km.fit_transform(similarity_df)
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
pd.concat([corpus_df, cluster_labels], axis=1)
Document Category ClusterLabel
0 The sky is blue and beautiful. weather 1
1 Love this blue and beautiful sky! weather 1
2 The quick brown fox jumps over the lazy dog. animals 0
3 The brown fox is quick and the blue dog is lazy! animals 0
4 The sky is very blue and the sky is very beaut… weather 1
5 The dog is lazy but the brown fox is quick! animals 0

Topic model

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_topics=2, max_iter=100, random_state=42)
dt_matrix = lda.fit_transform(tv_matrix)
features = pd.DataFrame(dt_matrix, columns=['T1', 'T2'])
features
T1 T2
0 0.190548 0.809452
1 0.176804 0.823196
2 0.846184 0.153816
3 0.814863 0.185137
4 0.180516 0.819484
5 0.839172 0.160828

Topic-word weights

tt_matrix = lda.components_
for topic_weights in tt_matrix:
    topic = [(token, weight) for token, weight in zip(vocab, topic_weights)]
    topic = sorted(topic, key=lambda x: -x[1])
    topic = [item for item in topic if item[1] > 0.6]
    print(topic)
    print()
[('brown', 1.7273638692668465), ('dog', 1.7273638692668465), ('fox', 1.7273638692668465), ('lazy', 1.7273638692668465), ('quick', 1.7273638692668465), ('jumps', 1.0328325272484777), ('blue', 0.7731573162915626)]

[('sky', 2.264386643135622), ('beautiful', 1.9068269319456903), ('blue', 1.7996282104933266), ('love', 1.148127242397004), ('today', 1.0068251160429935)]

Word embedding model

from gensim.models import word2vec

wpt = nltk.WordPunctTokenizer()
tokenized_corpus = [wpt.tokenize(document) for document in norm_corpus]

# Set values for various parameters
feature_size = 10    # Word vector dimensionality  
window_context = 10          # Context window size                                                                                    
min_word_count = 1   # Minimum word count                        
sample = 1e-3   # Downsample setting for frequent words

w2v_model = word2vec.Word2Vec(tokenized_corpus, size=feature_size, 
                          window=window_context, min_count = min_word_count,
                          sample=sample)
w2v_model.wv['sky']
array([ 0.04776765, 0.04441591, 0.0468228 , 0.04031719, 0.04735648, 0.00321561, 0.03345697, 0.0451241 , 0.03330296, 0.03037446],
      dtype=float32)
def average_word_vectors(words, model, vocabulary, num_features):
    
    feature_vector = np.zeros((num_features,),dtype="float64")
    nwords = 0.
    
    for word in words:
        if word in vocabulary: 
            nwords = nwords + 1.
            feature_vector = np.add(feature_vector, model[word])
    
    if nwords:
        feature_vector = np.divide(feature_vector, nwords)
        
    return feature_vector
    
   
def averaged_word_vectorizer(corpus, model, num_features):
    vocabulary = set(model.wv.index2word)
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)
w2v_feature_array = averaged_word_vectorizer(corpus=tokenized_corpus, model=w2v_model,
                                             num_features=feature_size)
pd.DataFrame(w2v_feature_array) #lstm
0 1 2 3 4 5 6 7 8 9
0 0.024127 0.017077 0.026422 0.029402 0.001112 0.002655 0.004409 0.008712 0.009802 0.012457
1 0.029736 0.022432 0.018225 0.022848 0.010878 0.004652 0.012399 0.000851 0.003157 0.018775
2 0.002641 0.005180 0.005764 0.001435 0.018988 0.010134 0.001064 0.004431 0.002646 0.000689
3 0.000735 0.004496 0.008730 0.011999 0.018633 0.017814 0.000056 0.001533 0.001793 0.016195
4 0.030725 0.006659 0.023489 0.032785 0.000014 0.010089 0.016119 0.005245 0.009107 0.007049
5 0.001140 0.015327 0.001268 0.008928 0.012809 0.013047 0.001205 0.001266 0.003366 0.010224

Conclusion

The bag-of-words model is easy to understand and compute, and it is useful for classification and search tasks. But sometimes single words are too simple to encapsulate some of the information in the text. To fix that, we would like somewhat longer sequences. Bag-of-ngrams is a natural generalization of BOW: the concept is still easy to understand, and it is just as easy to compute as BOW.

Bag-of-ngrams, however, generates many more distinct n-grams. It increases the feature storage cost, as well as the computation cost of the model training and prediction stages. The number of data points stays the same, but the dimension of the feature space is now much larger, so the data is much sparser. The larger n is, the higher the storage and computation costs and the sparser the data. For these reasons, longer n-grams do not always improve model accuracy (or any other performance measure). People usually stop at n = 2 or 3; longer n-grams are rarely used.

One way to prevent sparsity and increased cost is to filter n-grams and keep the most meaningful phrases. This is the goal of collocation extraction. In theory, collocations (or phrases) can form discontinuous sequences of markers in text. In practice, however, finding discontinuous phrases is much more computationally expensive and does not yield much benefit. So collocation extraction usually starts with a list of candidates, and statistical methods are used to filter them.

All of these methods turn a sequence of text tokens into a disconnected set of counts. A set has much less structure than a sequence; the result is a flat feature vector.

In this chapter we described, in plain language, common text featurization techniques. These techniques turn a piece of natural language text, full of rich semantic structure, into a simple flat vector. We discussed some common filtering techniques for reducing the vector dimension. We also introduced n-grams and collocation extraction as ways of adding a little more structure back into the flat vector. The next chapter goes into detail about another common text featurization technique called tf-idf. Subsequent chapters will discuss more ways of adding structure back to a flat vector.

References

Dunning, Ted. 1993. "Accurate Methods for the Statistics of Surprise and Coincidence." ACM Journal of Computational Linguistics, special issue on using large corpora, 19:1 (61–74).

"Hypothesis Testing and p-values." Khan Academy, May 31, 2016, www.khanacademy.org/math/probab… .

Manning, Christopher D. and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts: MIT Press.

Sometimes people call it the document "vector." The vector extends from the origin and ends at the specified point. For our purposes, "vector" and "point" are the same thing.

Related resources

Book Address:

www.oreilly.com/library/vie…

The code can be downloaded at Github:

Github.com/fengdu78/Da…

Baidu Cloud link for the datasets:

Link: pan.baidu.com/s/1uDXt5jWU…
