Before data analysis and visualization, the data has to be cleaned up, and much of the time that means working with text. This article summarizes some common text preprocessing methods.

Convert letters in text to lowercase

input_str = """ There are some people who think love is sex And marriage And six o'clock-kisses And children, And perhaps it is, Miss Lester. But do you know what I think? I think love is a touch and yet not a touch """
input_str = input_str.lower()
print(input_str)

The results are as follows:
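there are some people who think love is sex and marriage and six o'clock-kisses and children, and perhaps it is, miss lester. but do you know what i think? i think love is a touch and yet not a touch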

Remove or extract numbers from text

If the numbers in the text are irrelevant to the analysis, simply delete them.

import re

input_str = 'Hello Python123 666 Hi jupyter notebook 1111'
# replace every run of digits with the empty string
result = re.sub(r'\d+', '', input_str)
print(result)

The results are as follows:
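Hello Python  Hi jupyter notebook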

In other cases the numbers are exactly what we want: in scraped job postings the salary may appear as 15K, and in product listings the purchase count may read 8500+, so we need to extract the numbers instead.

input_str = 'Salary: 15K, 8500+ people paid, 30,000 + people paid'
# match integers, decimals, and simple scientific notation
result = re.findall(r'-?\d+\.?\d*e?-?\d*?', input_str)

print(result)

The results are as follows:
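['15', '8500', '30', '000']

Note that 30,000 is split at the comma into '30' and '000': this pattern does not understand thousands separators.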

Filter punctuation from text

import re

input_str = """This &is [an] example? Yip Ting Yun << 1"! "" . ; 11??? 【 】 >> 1 * yetingyun/p:? | {of} string. with.? punctuation!!!!"""
# drop every character that is neither a word character nor whitespace
s = re.sub(r'[^\w\s]', '', input_str)
print(s)

The results are as follows:

Regular expressions filter the punctuation out of the text. If whitespace should be removed as well, use r'[^\w]' instead. The principle is simple: \w matches a letter, digit, underscore, or (with the Unicode patterns that Python 3 uses for str by default) a Chinese character, and a leading ^ inside a character class [^...] negates it, so the pattern matches everything except those characters.
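A quick sketch of the difference between the two patterns (the sample string here is made up for illustration):

import re

s = 'Hello, world! 你好 2024'
print(re.sub(r'[^\w\s]', '', s))  # Hello world 你好 2024 (spaces kept)
print(re.sub(r'[^\w]', '', s))    # Helloworld你好2024 (spaces removed too)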

Remove useless whitespace at both ends

input_str = " \t yetingyun \t "
input_str = input_str.strip()
input_str

The results are as follows:
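'yetingyun'

If only one end needs trimming, str.lstrip() and str.rstrip() work the same way.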

Chinese word segmentation: filter stop words and single characters

# download the stop word data from GitHub: github.com/zhousishuo/stopwords
import jieba
import re

# read the text data used for testing
with open('comments.txt', encoding='utf-8') as f:
    data = f.read()

# keep only the Chinese characters
new_data = re.findall('[\u4e00-\u9fa5]+', data, re.S)
new_data = "/".join(new_data)

# word segmentation in exact mode
seg_list_exact = jieba.cut(new_data, cut_all=False)

# load the stop words into a set
with open('stop_words.txt', encoding='utf-8') as f:
    stop_words = set(f.read().split('\n'))

# filter out stop words and single characters in one list comprehension
result_list = [word for word in seg_list_exact if word not in stop_words and len(word) > 1]
result_list

The results are as follows:

First, read in the text data used for the test, which is product review data. This kind of data usually contains many meaningless words and symbols, so a regular expression filters out the useless symbols and extracts only the Chinese characters. The jieba library then segments the text into words; the stop words are loaded into a set, and stop words and single characters are filtered out in one list comprehension, which is efficient. Public stop word lists can be downloaded, and words can then be added according to the actual text being processed, so that the filtering works better.
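A common next step after this kind of filtering is a word-frequency count, for example to feed a word cloud. A minimal sketch, assuming the result_list produced above:

from collections import Counter

# count how often each remaining word occurs
word_counts = Counter(result_list)
print(word_counts.most_common(10))  # the ten most frequent words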

GitHub download for the stop word data: github.com/zhousishuo/stopwords

SnowNLP is a Python library that makes it easy to process Chinese text. It was inspired by TextBlob: since most natural language processing libraries target English, SnowNLP was written for Chinese instead. Unlike TextBlob it does not use NLTK; all algorithms are implemented in the library itself, and it ships with some trained dictionaries. Note that it works on Unicode text, so decode your input to Unicode first.
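SnowNLP is installed from PyPI in the usual way (assuming a standard pip setup):

pip install snownlp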

SnowNLP is very convenient for processing Chinese text data. Take POS tagging and keyword extraction as examples:

from snownlp import SnowNLP

# a sample Chinese sentence; roughly: "It's a beautiful day, this girl looks good."
word = u'今天天气好，这个妹子长得真好看'
s = SnowNLP(word)
print(s.words)       # word segmentation
print(list(s.tags))  # part-of-speech tagging

from snownlp import SnowNLP

# a Chinese paragraph: the standard definition of natural language processing
text = u'''自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。自然语言处理是一门融语言学、计算机科学、数学于一体的科学。因此，这一领域的研究将涉及自然语言，即人们日常使用的语言，所以它与语言学的研究有着密切的联系，但又有重要的区别。自然语言处理并不是一般地研究自然语言，而在于研制能有效地实现自然语言通信的计算机系统，特别是其中的软件系统。因而它是计算机科学的一部分。'''

s = SnowNLP(text)
print(s.keywords(limit=6))  # keyword extraction

Recommended reading:

github.com/isnowfy/sno…

docs.python.org/3/library/r…
