(Python libraries and versions used in this article: Python 3.6, Numpy 1.14, SciKit-Learn 0.19, Matplotlib 2.2, NLTK 3.3; the code below also uses Gensim)

Topic modeling uses NLP to uncover the hidden thematic structure of a text document. In practice, it works by identifying the most meaningful words in the document, the words that best represent each topic. By finding these key words, we can infer the hidden theme of a document and use it for topic classification and analysis.


1. Prepare the dataset

The dataset used here is stored in a TXT file, so we need to load the text from that file and then preprocess it. Since there are several preprocessing steps, we create a class that handles both loading and preprocessing; this also keeps the code cleaner and more reusable.

# Create a class to load the dataset and preprocess the data
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora

class DataSet:
    
    def __init__(self,txt_file_path):
        self.__txt_file=txt_file_path
    
    def __load_txt(self):  # Load text content from the TXT file, line by line
        with open(self.__txt_file, 'r') as file:
            content = file.readlines()  # read all lines at once
        return [line[:-1] for line in content]  # remove the trailing \n on each line
    
    def __tokenize(self, lines_list):  # Preprocessing step 1: tokenize each line of text
        tokenizer = RegexpTokenizer(r'\w+')
        # We use a regular-expression tokenizer instead of word_tokenize so that punctuation is excluded
        return [tokenizer.tokenize(line.lower()) for line in lines_list]
    
    def __remove_stops(self, lines_list):  # Preprocessing step 2: remove stop words from each line
        # Stop words only add noise to the analysis, so we delete them using a stop-word list
        stop_words_list = stopwords.words('english')  # get the list of English stop words
        return [[token for token in line if token not in stop_words_list]
                for line in lines_list]
        # lines_list is a list of lines, and each line is itself a list of tokens,
        # so it can be seen as a two-dimensional array, hence the two nested comprehensions
    
    def __word_stemm(self, lines_list):  # Preprocessing step 3: stem each token
        stemmer=SnowballStemmer('english')
        return [[stemmer.stem(word) for word in line] for line in lines_list]
    
    def prepare(self):
        # The function called externally to prepare the dataset:
        # first load the text from the TXT file, then tokenize, remove stop words, and stem
        stemmed_words = self.__word_stemm(self.__remove_stops(self.__tokenize(self.__load_txt())))
        # The modeling below needs a dictionary-based word matrix, so build the dictionary with corpora first
        dict_words = corpora.Dictionary(stemmed_words)
        matrix_words = [dict_words.doc2bow(text) for text in stemmed_words]
        return dict_words, matrix_words
    
    # The following functions are mainly used to test whether the functions above work properly
    def get_content(self):
        return self.__load_txt()
    
    def get_tokenize(self):
        return self.__tokenize(self.__load_txt())
    
    def get_remove_stops(self):
        return self.__remove_stops(self.__tokenize(self.__load_txt()))
    
    def get_word_stemm(self):
        return self.__word_stemm(self.__remove_stops(self.__tokenize(self.__load_txt())))
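To make the "dictionary-based word matrix" in prepare() concrete, here is a minimal sketch of what corpora.Dictionary and doc2bow produce. The toy documents are made up for illustration and are not from the article's dataset:

from gensim import corpora

docs = [['need', 'order', 'system'], ['order', 'order', 'work']]
dictionary = corpora.Dictionary(docs)  # maps each unique token to an integer id
print(dictionary.token2id)             # e.g. {'need': 0, 'order': 1, 'system': 2, 'work': 3}
print([dictionary.doc2bow(doc) for doc in docs])
# e.g. [[(0, 1), (1, 1), (2, 1)], [(1, 2), (3, 1)]] -- (token_id, count) pairs per document

Each document becomes a bag-of-words: a sparse list of (token id, count) pairs, which is exactly the input format LdaModel expects.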

Does the class work and produce the results we expect? You can test it with the following code:

# Verify that the DataSet class above runs properly
dataset = DataSet(r"E:\PyProjects\DataSet\FireAI\data_topic_modeling.txt")

# Test whether the __load_txt() function works
content=dataset.get_content()
print(len(content))
print(content[:3])

# Test whether the __tokenize() function works
tokenized=dataset.get_tokenize()
print(tokenized)

# Test the __remove_stops() function
removed=dataset.get_remove_stops()
print(removed)

# Test the __word_stemm() function
stemmed=dataset.get_word_stemm()
print(stemmed)

# Test whether the prepare() function works
_,prepared=dataset.prepare()
print(prepared)

The full output is fairly long; you can see it by running the source code on my GitHub.
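To see what the three preprocessing steps do to a single line, here is a self-contained sketch on a made-up sentence (it is an illustration, not a line from the article's dataset):

from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

line = "The systems are needed for ordering the work."
tokens = RegexpTokenizer(r'\w+').tokenize(line.lower())
# -> ['the', 'systems', 'are', 'needed', 'for', 'ordering', 'the', 'work']
stops = set(stopwords.words('english'))
kept = [t for t in tokens if t not in stops]
# -> ['systems', 'needed', 'ordering', 'work']
stemmed = [SnowballStemmer('english').stem(t) for t in kept]
print(stemmed)  # -> ['system', 'need', 'order', 'work']

Tokenizing with the regex removes punctuation, the stop-word filter drops words like "the" and "are" that carry no topical signal, and stemming collapses inflected forms ("systems", "ordering") onto their stems so that the model treats them as one word.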


2. Build the model and train it on the dataset

We use an LDA (Latent Dirichlet Allocation) model for the topic modeling, as follows:

# Get the dataset
dataset = DataSet(r"E:\PyProjects\DataSet\FireAI\data_topic_modeling.txt")
dict_words, matrix_words =dataset.prepare()

# Modeling with LdaModel
lda_model = models.ldamodel.LdaModel(matrix_words, num_topics=2,
                                     id2word=dict_words, passes=25)
# Here we assume the original documents cover two topics

The code above creates the LdaModel and trains it. Note that LdaModel lives in the gensim module, which must be installed first with pip install gensim.

During training, LdaModel learns a weight for each word within each topic. These word weights form an "importance equation" for each topic, and that equation is what the model uses to infer a document's topic.

Each topic's importance equation can be printed with the following code:

# View the N most important words in the model
print('Most important words to topics: ')
for item in lda_model.print_topics(num_topics=2, num_words=5):  # print only the 5 most important words per topic
    print('Topic: {}, words: {}'.format(item[0], item[1]))

Most important words to topics:
Topic: 0, words: 0.075*"need" + 0.053*"order" + 0.032*"system" + 0.032*"work"
Topic: 1, words: ...
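Once trained, the model can also be used to infer the topic mix of a new document. Here is a hedged sketch; new_doc is a made-up example, and it must be preprocessed the same way as the training data (tokenized, stop words removed, stemmed):

# Infer the topic distribution of a new, already-preprocessed document
new_doc = ['system', 'work', 'order']
bow = dict_words.doc2bow(new_doc)              # convert to the same bag-of-words format
print(lda_model.get_document_topics(bow))      # e.g. [(0, 0.81), (1, 0.19)] -- (topic_id, probability) pairs

The topic with the highest probability is the model's prediction for the document's hidden theme.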

Summary

1. Most machine learning projects require dataset handling, so it pays to wrap the processing steps in a dedicated class. Here, the text preprocessing is written into the class above, with each method representing one preprocessing step, which keeps the code clear and reusable.

2. We used LdaModel from the gensim module for the topic modeling. Gensim is a very useful NLP toolkit that is widely used for text content analysis.



Note: the code for this article has been uploaded to my GitHub; you are welcome to download it.

References:

1. Python Machine Learning Cookbook, by Prateek Joshi (translated into Chinese by Tao Junjie and Chen Xiaoli)