Cleaning and preprocessing data is one of the most important steps in the life cycle of any data science project. Unstructured text is among the most complicated data to work with, and if you feed it into a model without preprocessing, one of two things happens: either you make a serious mistake, or your model won't behave as you expect. If you've ever wondered how modern voice assistants like Google Assistant, Alexa, and Siri understand, process, and respond to human language, a big part of the answer is natural language processing.

Natural language processing, or NLP, is a technique for the semantic analysis of data that draws on computer science and artificial intelligence. It is essentially the art of extracting meaningful information from raw text, and its purpose is to build a bridge between natural language and computers, which means analyzing and modeling large amounts of natural language data. By harnessing the power of NLP, real-world business problems such as document summarization, title generation, fraud detection, speech recognition and, importantly, neural machine translation can be solved.

Text preprocessing is a method used in NLP to clean up text and prepare it for modeling. Raw text is messy and contains various forms of noise such as emojis, punctuation, and numbers written as digits or special characters. We have to deal with these problems because machines don't understand raw text; they only work with numbers. Several Python libraries simplify text preprocessing, with straightforward syntax and a lot of flexibility. The first is NLTK, which stands for Natural Language Toolkit and is useful for tasks such as stemming, POS tagging, tokenization, lemmatization, and so on.
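To get a quick feel for NLTK, here is a minimal, self-contained sketch (the sample sentence is made up for illustration) that tokenizes a sentence and tags each word with its part of speech:

import nltk
from nltk import word_tokenize, pos_tag

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sample = "NLTK makes tokenizing text very easy."
tokens = word_tokenize(sample)   # split the sentence into word tokens
print(pos_tag(tokens))           # tag each token with its part of speech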

You are probably familiar with contractions; hardly a sentence goes by without one, because we tend to shorten words, writing didn't instead of did not. When we tokenize such words, they come out in broken pieces like 'did' and 'n't', which are hard to work with. To deal with such words, there is a library called contractions. BeautifulSoup is a library for web scraping, but since scraped data often arrives with HTML tags and URLs, it is also used here to strip them out. To convert numbers into words, we use the inflect library.
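Here is a small sketch (with made-up inputs) showing what each of these three libraries does:

import contractions
import inflect
from bs4 import BeautifulSoup

print(contractions.fix("didn't"))                               # expands to "did not"
print(inflect.engine().number_to_words(42))                     # converts the number to "forty-two"
print(BeautifulSoup("<p>hello</p>", "html.parser").get_text())  # strips the HTML tag, leaving "hello"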

Implementing text preprocessing

In the Python code below, we remove noise from the raw text data of a Twitter sentiment analysis dataset. After that, we move on to stop word removal, stemming, and lemmatization.

Import all dependencies.
! pip install contractions
import nltk
import contractions
import inflect
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer
from bs4 import BeautifulSoup
import re, string, unicodedata
Get rid of noise.

The first step is to remove noise from the data. In text, noise refers to anything that is not part of human language, such as special characters, round brackets, square brackets, extra whitespace, URLs, and punctuation.

Here’s the sample text we’re working on.

As you can see, the text contains several HTML tags and a URL; we need to remove them, and for that we use BeautifulSoup. The code snippet below removes both.

# to remove HTML tag
def html_remover(data):
  beauti = BeautifulSoup(data,'html.parser')
  return beauti.get_text()

# to remove URL
def url_remover(data):
  return re.sub(r'https?://\S+', '', data)

def web_associated(data):
  text = html_remover(data)
  text = url_remover(text)
  return text

new_data = web_associated(data)
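As a quick check, we can run web_associated on a small made-up string (a hypothetical stand-in, not the actual dataset sample):

# hypothetical stand-in text with an HTML tag and a URL
sample = "<p>Loving the new update!</p> Check it out at https://example.com/update (so cool)"
print(web_associated(sample))  # roughly: "Loving the new update! Check it out at  (so cool)"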

After removing HTML tags and URLs, there is still some punctuation and whitespace noise, as well as text inside round brackets; this also needs to be addressed.

def remove_round_brackets(data):
  return re.sub(r'\(.*?\)', '', data)

def remove_punc(data):
  trans = str.maketrans('', '', string.punctuation)
  return data.translate(trans)

def white_space(data):
  return ' '.join(data.split())

def complete_noise(data):
  new_data = remove_round_brackets(data)
  new_data = remove_punc(new_data)
  new_data = white_space(new_data)
  return new_data

new_data = complete_noise(new_data)

Now, as you can see, we’ve succeeded in removing all the noise from the text.
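Continuing with the hypothetical sample string from earlier, the cleaned-up result would look roughly like this:

print(complete_noise(web_associated(sample)))
# approximately: "Loving the new update Check it out at"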

Normalize the text.

Text normalization ends with tokenizing the text, so that our longer corpus is broken up into individual words, which NLTK's tokenizer can do. Before that, we need to lowercase every word in the corpus, convert numbers to words, and expand contractions.

def text_lower(data):
  return data.lower()

def contraction_replace(data):
  return contractions.fix(data)

def number_to_text(data):
  temp_str = data.split()
  engine = inflect.engine()
  words = []
  for i in temp_str:
    # if the word is a digit, convert it to
    # words, otherwise keep it as-is
    if i.isdigit():
      words.append(engine.number_to_words(i))
    else:
      words.append(i)
  return ' '.join(words)

def normalization(data):
  text = text_lower(data)
  text = number_to_text(text)
  text = contraction_replace(text)
  nltk.download('punkt')
  tokens = nltk.word_tokenize(text)
  return tokens

tokens = normalization(new_data)
print(tokens)

Now we are nearing the end of basic text preprocessing; only one important thing is left: stop words. Stop words carry little meaning for analysis and serve a mostly grammatical purpose. Therefore, to further reduce the dimensionality of the data, stop words should be removed from the corpus.

Finally, we have two options for reducing words to a base form: stemming or lemmatization. Stemming attempts to convert a word to its root form, most often by simply chopping off its ending. Lemmatization does a similar job, but in a more principled way: it converts the word to its dictionary form, so, for example, 'scenes' becomes 'scene'. One can choose between stemming and lemmatization depending on the task.

def stopword(data):
  nltk.download('stopwords')
  clean = []
  for i in data:
    if i not in stopwords.words('english'):
      clean.append(i)
  return clean

def stemming(data):
  stemmer = LancasterStemmer()
  stemmed = []
  for i in data:
    stem = stemmer.stem(i)
    stemmed.append(stem)
  return stemmed

def lemmatization(data):
  nltk.download('wordnet')
  lemma = WordNetLemmatizer()
  lemmas = []
  for i in data:
    lem = lemma.lemmatize(i, pos='v')
    lemmas.append(lem)
  return lemmas  

def final_process(data):
  stopwords_remove = stopword(data)
  stemmed = stemming(stopwords_remove)
  lemm = lemmatization(stopwords_remove)
  return stemmed, lemm
stem,lemmas = final_process(tokens)

Below we can see the words that have been stemmed and lemmatized.
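As a rough illustration, we can run the two functions above on a few made-up tokens to see how their results differ:

# hypothetical tokens for demonstration
sample_tokens = ['scenes', 'running', 'studies']
print(stemming(sample_tokens))       # the Lancaster stemmer chops word endings aggressively
print(lemmatization(sample_tokens))  # the WordNet lemmatizer maps words to valid dictionary forms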

Conclusion.

In this article, we discussed why preprocessing text is necessary for modeling. We started by learning how to remove HTML tags and URLs from the corpus; to remove noise effectively, we first had to inspect the corpus and see what kinds of noise it contained. We also observed the tradeoff between stemming and lemmatization, and lemmatized words should generally be preferred since they remain valid dictionary words.

References.

  • Link to the code above
  • BeautifulSoup
  • NLTK

The post Complete Tutorial on Text Preprocessing in NLP appeared first on Analytics India Magazine.