Natural language processing (NLP) is one of the most interesting subfields of data science, and a growing number of data scientists are looking to build solutions around unstructured text data. Despite this, many applied data scientists (from both STEM and social science backgrounds) still lack NLP experience.

In this article, I’ll explore some basic NLP concepts and show how to implement them using the increasingly popular Python package spaCy. This article is aimed at NLP beginners, but it assumes the reader has a working knowledge of Python.

So, what is spaCy?

SpaCy is a relatively new package for “Industrial-Strength Natural Language Processing in Python”, created by Matt Honnibal at Explosion AI. It was designed with applied data scientists in mind, which means it doesn’t burden the user with deciding which algorithm to use for common tasks, and it’s incredibly fast (it’s implemented in Cython). If you’re familiar with the Python data science stack, spaCy is the NumPy of NLP: it’s reasonably low-level, but intuitive and performant.
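If you want to follow along, spaCy and the English model used throughout this post can be installed from the shell. A minimal setup sketch (my addition, not from the original post, assuming the older "en" model shortcut that the spacy.load("en") calls below rely on):

# Install the package, then download the English model:
pip install spacy
python -m spacy download en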

So, what can it do?

SpaCy provides a one-stop shop for tasks commonly used in any NLP project, including:

- Tokenization
- Lemmatization
- Part-of-speech tagging
- Entity recognition
- Sentence recognition
I’ll give you a high-level overview of these capabilities and show you how to access them using spaCy.

So let’s get started.

First, we load spaCy’s pipeline, which by convention is stored in a variable named nlp. Declaring this variable takes a couple of seconds, because spaCy loads its models and data up front to save time later. In effect, some of the heavy lifting is done ahead of time, making it cheaper for nlp to parse each piece of text. Note that here we use the English language model, but there is also a fully featured German model, and tokenization (discussed below) is implemented across several languages.

We call nlp on the sample text to create a Doc object. The Doc object is the container for NLP tasks on the text itself, on slices of the text (Span objects), and on individual elements of the text (Token objects). It is worth noting that Token and Span objects actually hold no data. Instead, they contain pointers to data in the Doc object and are evaluated lazily (i.e., upon request). Most of spaCy’s core functionality is accessed through methods on the Doc (n=33), Span (n=29), and Token (n=78) objects.

In[1]: import spacy
  ...: nlp = spacy.load("en")
  ...: doc = nlp("The big grey dog ate all of the chocolate, but fortunately he wasn't sick!")
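As a small illustration of the container relationship described above (this snippet is mine, not part of the original post), indexing a Doc yields Token objects while slicing yields Span objects:

# Indexing a Doc returns a Token; slicing it returns a Span.
token = doc[0]      # Token: "The"
span = doc[0:4]     # Span: "The big grey dog"
print(type(token))  # <class 'spacy.tokens.token.Token'>
print(type(span))   # <class 'spacy.tokens.span.Span'>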

Tokenization

Tokenization is a foundational step in many NLP tasks. Tokenizing text is the process of splitting a piece of text into words, symbols, punctuation marks, spaces, and other elements, thereby creating tokens. A naive way to do this is to simply split the string on spaces:

In[2]: doc.text.split()
  ...:
Out[2]: ['The', 'big', 'grey', 'dog', 'ate', 'all', 'of', 'the', 'chocolate,', 'but', 'fortunately', 'he', "wasn't", 'sick!']

On the face of it, splitting on whitespace works fine. Notice, however, that it disregards punctuation and doesn’t split the verb from its adverb (“wasn’t” stays in one piece instead of becoming “was” and “n’t”). In other words, it’s too naive to recognize the elements of the text that help us (and a machine) understand its structure and meaning. Let’s see how spaCy handles this:

In[3]: [token.orth_ for token in doc]
  ...:
Out[3]: ['The', 'big', 'grey', 'dog', 'ate', 'all', 'of', 'the', 'chocolate', ',', 'but', 'fortunately', 'he', 'was', "n't", ' ', 'sick', '!']

Here, we access each token’s .orth_ attribute, which returns a string representation of the token rather than a spaCy Token object. This may not always be desirable, but it’s worth noting. SpaCy recognizes punctuation and is able to split these punctuation tokens from word tokens. Many of spaCy’s token attributes offer both a string and an integer representation of the processed text: attributes with an underscore suffix return strings, and attributes without the suffix return integers. For example:

In[4]: [(token, token.orth_, token.orth) for token in doc]
  ...:
Out[4]: [(The, 'The', 517), (big, 'big', 742), (grey, 'grey', 4623), (dog, 'dog', 1175), (ate, 'ate', 3469), (all, 'all', 516), (of, 'of', 471), (the, 'the', 466), (chocolate, 'chocolate', 3593), (,, ',', 416), (but, 'but', 494), (fortunately, 'fortunately', 15520), (he, 'he', 514), (was, 'was', 491), (n't, "n't", 479), ( , ' ', 483), (sick, 'sick', 1698), (!, '!', 495)]

In[5]: [token.orth_ for token in doc if not token.is_punct | token.is_space]
  ...:
Out[5]: ['The', 'big', 'grey', 'dog', 'ate', 'all', 'of', 'the', 'chocolate', 'but', 'fortunately', 'he', 'was', "n't", 'sick']

Cool, right?

Lemmatization

A task related to tokenization is lemmatization. Lemmatization is the process of reducing a word to its base form, its mother word if you like. Different uses of a word often share the same root meaning: for example, practice, practiced, and practicing all essentially refer to the same thing. It is often desirable to standardize words with similar meanings to their base form. With spaCy, we access the base form of each word using the token’s .lemma_ attribute:

In[6]: practice = "practice practiced practicing"
  ...: nlp_practice = nlp(practice)
  ...: [word.lemma_ for word in nlp_practice]
  ...:
Out[6]: ['practice', 'practice', 'practice']

Why is this useful? An immediate use case is machine learning, specifically text classification. Lemmatizing the text before creating a “bag of words”, for example, avoids duplicated features, so the model can build a clearer picture of word-usage patterns across multiple documents.
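To make that concrete, here is a sketch of my own (not from the original post) that lemmatizes documents with spaCy before building a bag of words with scikit-learn’s CountVectorizer, so inflected forms collapse onto a single feature:

from sklearn.feature_extraction.text import CountVectorizer

# Lemmatize each document first, so "practiced" and "practicing"
# both count toward the single feature "practice".
docs = ["He practiced daily.", "She keeps practicing every week."]
lemmatized = [" ".join(tok.lemma_ for tok in nlp(d)) for d in docs]
bag_of_words = CountVectorizer().fit_transform(lemmatized)
print(bag_of_words.toarray())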

POS Tagging

Part-of-speech tagging is the process of assigning grammatical properties (such as noun, verb, adverb, adjective, etc.) to words. Words that share the same POS tag tend to follow similar syntactic structures, which is useful in rule-based processing.

For example, in a given description of an event we may wish to determine who owns what. We can do this by exploiting the possessive apostrophe, together with the syntactic structure of the text. SpaCy uses the popular Penn Treebank POS tag set. With spaCy, coarse-grained and fine-grained POS tags can be accessed via the .pos_ and .tag_ attributes, respectively. Here, I access the fine-grained POS tags:

In[7]: doc2 = nlp("Conor's dog's toy was hidden under the man's sofa in the woman's house")
  ...: pos_tags = [(i, i.tag_) for i in doc2]
  ...: pos_tags
  ...:
Out[7]: [(Conor, 'NNP'), ('s, 'POS'), (dog, 'NN'), ('s, 'POS'), (toy, 'NN'), (was, 'VBD'), (hidden, 'VBN'), (under, 'IN'), (the, 'DT'), (man, 'NN'), ('s, 'POS'), (sofa, 'NN'), (in, 'IN'), (the, 'DT'), (woman, 'NN'), ('s, 'POS'), (house, 'NN')]

We can see that the ’s tokens are tagged POS (possessive ending). We can exploit this tag to extract each owner and the thing they own:

In[8]: owners_possessions = []
  ...: for i in pos_tags:
  ...:     if i[1] == "POS":
  ...:         owner = i[0].nbor(-1)
  ...:         possession = i[0].nbor(1)
  ...:         owners_possessions.append((owner, possession))
  ...:
  ...: owners_possessions
  ...:
Out[8]: [(Conor, dog), (dog, toy), (man, sofa), (woman, house)]

This returns a list of owner–possession tuples. If you want to be super Pythonic about it, you can do it in a list comprehension (which, I think, is preferable!):

In[9]: [(i[0].nbor(-1), i[0].nbor(+1)) for i in pos_tags if i[1] == "POS"]
  ...:
Out[9]: [(Conor, dog), (dog, toy), (man, sofa), (woman, house)]

In this case, we use each token’s .nbor method, which returns the token’s neighboring tokens.
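One more note before moving on: to see the coarse-grained/fine-grained distinction mentioned earlier side by side, here is a short sketch (mine, not from the original post) printing both .pos_ and .tag_ for the same tokens:

# Contrast coarse-grained (.pos_) and fine-grained (.tag_) tags:
for token in nlp("Conor's dog was hidden"):
    print(token.orth_, token.pos_, token.tag_)
# e.g. "dog" carries the coarse universal tag NOUN
# and the fine-grained Penn Treebank tag NN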

Entity recognition

Entity recognition is the process of classifying named entities found in text into pre-defined categories, such as persons, places, organizations, dates, and so on. SpaCy uses a statistical model to classify a broad range of entities, including persons, events, works of art, and nationalities/religions (see the documentation for the full list).

For example, let’s take the first two sentences from Barack Obama’s Wikipedia entry. We will parse this text and then access the identified entities using the Doc object’s .ents method. With this method called on the Doc, we can access additional Token methods, specifically .label_ and .label:

In[10]: wiki_obama = """Barack Obama is an American politician who served as
   ...: the 44th President of the United States from 2009 to 2017. He is the first
   ...: African American to have served as president,
   ...: as well as the first born outside the contiguous United States."""
   ...:
   ...: nlp_obama = nlp(wiki_obama)
   ...: [(i, i.label_, i.label) for i in nlp_obama.ents]
   ...:
Out[10]: [(Barack Obama, 'PERSON', 346), (American, 'NORP', 347), (the United States, 'GPE', 350), (2009 to 2017, 'DATE', 356), (first, 'ORDINAL', 361), (African, 'NORP', 347), (American, 'NORP', 347), (first, 'ORDINAL', 361), (United States, 'GPE', 350)]

You can see which entities the model identified and how accurate they are in this case. PERSON is self-explanatory; NORP is a nationality or religious/political group; GPE identifies locations (cities, countries, etc.); DATE identifies a specific date or range of dates; and ORDINAL identifies a word or number indicating some kind of order.
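If you’d rather not memorize what each label means, spaCy provides a helper function, spacy.explain, which returns a short description for a given POS tag or entity label. A quick sketch of my own (the output strings are spaCy’s glosses):

# spacy.explain returns a human-readable description of a label:
print(spacy.explain("GPE"))   # 'Countries, cities, states'
print(spacy.explain("NORP"))  # 'Nationalities or religious or political groups'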

While we’re on the subject of Doc methods, it’s worth mentioning spaCy’s sentence identifier. It is not uncommon in NLP tasks to want to split a document into sentences. This is easy to do with spaCy by accessing the Doc’s .sents attribute:

In[11]: for ix, sent in enumerate(nlp_obama.sents, 1):
   ...:     print("Sentence number {}: {}".format(ix, sent))
   ...:
Sentence number 1: Barack Obama is an American politician who served as the 44th President of the United States from 2009 to 2017.
Sentence number 2: He is the first African American to have served as president, as well as the first born outside the contiguous United States.

That’s all for now. In future articles, I’ll show how to use spaCy in more complex data-mining and machine-learning tasks.



The original post was published on March 28, 2018

Jayesh Ahire
