Recap of the previous article

The previous article introduced the naive Bayes algorithm, covering the following aspects:

  • The basic principle of the naive Bayes algorithm
  • Derivation of Bayes' theorem (the conditional probability formula)
  • Building, training, and testing a simple text classification algorithm
  • Laplace smoothing

The formula derivation is the most important part: solving the problem with conditional probability is the basic idea of naive Bayes, so it is essential to understand how Bayes' theorem is obtained and how to apply it. This is also the foundation for building the algorithm later.
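For reference, the heart of that derivation is Bayes' theorem applied to a document vector $W$ and a class $c_i$:

$$P(c_i \mid W) = \frac{P(W \mid c_i)\,P(c_i)}{P(W)}$$

Since $P(W)$ is the same for every class, only the numerator needs to be compared when classifying.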

Naive Bayes is widely used in real life, for example in text classification, spam filtering, credit evaluation, and phishing-website detection. For text classification in particular, naive Bayes is among the classifiers with the best learning efficiency and classification performance: because its principle is simple, the algorithm is relatively easy to build and has good interpretability.

However, naive Bayes assumes that all features are independent of each other and that every feature is equally important. In reality this assumption does not hold: first, adjacent words in a sentence are clearly not independent of each other; second, for an article, a few representative words largely determine its topic, so there is no need to read every word with equal weight. Appropriate feature selection is therefore needed for a naive Bayes classifier to reach higher classification performance.
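Concretely, the "naive" assumption mentioned here is that, given the class, the joint probability of the words factorizes into a product of per-word probabilities:

$$P(w_1, w_2, \dots, w_n \mid c_i) = \prod_{j=1}^{n} P(w_j \mid c_i)$$

This factorization is what makes the simple counting in the training step later possible, even though adjacent words in real text are clearly not independent.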

Background

In this article, naive Bayes is used to build a sentiment classifier that judges whether an unseen comment expresses positive or negative sentiment. The accuracy of the classifier is obtained by comparing the predicted results with the true labels.

Recently I happened to see clips of a movie on Douyin (TikTok): The Platform. It is set in a dystopian country of the future where prisoners are held in vertically stacked cells and watch, starving, as food descends from the top; those near the top eat their fill, while those at the bottom are driven aggressive by hunger. The film focuses on the dark, hungry side of human nature. Personally I think the movie is good, so I chose its short comments on Douban as the data for this article. Unfortunately there is not much data, but the point here is mainly the approach.

The Douban crawler is relatively easy, so the crawling part is not summarized in much detail. I used requests and BeautifulSoup. One thing worth noting is simulated login: without logging in you can only get the first 10 pages of short comments, while after simulated login you can get 24 pages in total. The "popular" and "latest" comment lists do not overlap, and the latest list yields another 100 comments, so sampling both gives more data.
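For readers who want a starting point, below is a minimal sketch of what such a crawler might look like with requests and BeautifulSoup. The URL pattern, the movie_id placeholder, the span.short selector, and the cookie handling are assumptions for illustration, not the code used for this article; a working crawler also needs valid login cookies and polite request pacing.

```python
# A minimal, hedged sketch of the crawling step (assumed URL/selector, not the original code)
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}   # pretend to be a browser
cookies = {}                              # assumed: fill in your logged-in Douban cookies here
movie_id = '...'                          # the film's Douban subject id (placeholder)

comments = []
for start in range(0, 480, 20):           # page through the short comments, 20 per page
    url = f'https://movie.douban.com/subject/{movie_id}/comments?start={start}&limit=20'
    resp = requests.get(url, headers=headers, cookies=cookies, timeout=10)
    soup = BeautifulSoup(resp.text, 'html.parser')
    # assumed selector: each short comment sits in a <span class="short"> element
    for span in soup.select('span.short'):
        comments.append(span.get_text(strip=True))
```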

The final data set has 580 samples and three attributes, as shown in the screenshot below:
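As a quick sketch of loading and inspecting the crawled data with pandas (the file name and column names here are assumptions for illustration, since the screenshot is not reproduced):

```python
import pandas as pd

# assumed file/column names for illustration only
data = pd.read_csv('douban_short_comments.csv')   # e.g. columns: name, rating, short
print(data.shape)    # expected: (580, 3)
print(data.head())
```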

Text preprocessing

In this small practical exercise of building a sentiment classifier, the algorithm part is not complicated; most of it was covered in the previous article, and most of the work is preprocessing the data set. A data set obtained from a public source may only need simple processing, because its author has usually resolved most of the problems, but a personally crawled data set has many more issues. The goal is to turn every short comment into a list-of-terms format, which is what the text preprocessing below does.

In the original data set, the rating column is composed of rating + recommendation index, which is not the format we need, so a custom function is used to divide it into five levels from 1 to 5. The rating level can be regarded as the sentiment class corresponding to the short comment.

```python
# Divide the score into five levels: 1-5
def rating(e):
    if '50' in e:
        return 5
    elif '40' in e:
        return 4
    elif '30' in e:
        return 3
    elif '20' in e:
        return 2
    elif '10' in e:
        return 1
    else:
        return 'none'

# Create a new column based on the rating function using map
data['new_rating'] = data['rating'].map(rating)
```

After grading, each short comment needs a sentiment label. Comments with a rating of 3 are deleted here, because their sentiment cannot be determined. Comments rated 4 or 5 are labeled 1, for positive sentiment; comments rated 1 or 2 are labeled 0, for negative sentiment.

```python
# Delete short comments with a rating of 3 (their sentiment is treated as neutral/undetermined)
data = data[data['new_rating'] != 3]

# Mark ratings of 4 and 5 as 1 (positive sentiment); ratings of 1 and 2 as 0 (negative sentiment)
data['sentiment'] = data['new_rating'].apply(lambda x: 1 if x > 3 else 0)
```

The pie chart below shows the proportion of the five rating levels. The proportion of 3s is relatively large, so deleting them costs the data set quite a few samples. The proportion of 4s and 5s is much higher than that of 1s and 2s, so positive sentiment takes up a large share of the data set. The movie may really be that good, but this also highlights the problem of class imbalance.
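The chart itself is only an image in the original article; a minimal sketch of how such a pie chart could be drawn from the new_rating column (before the 3-rated comments are dropped) with pandas and matplotlib is shown below. The original figure may have been produced differently.

```python
import matplotlib.pyplot as plt

# Proportion of each rating level (1-5) as a pie chart
rating_counts = data['new_rating'].value_counts().sort_index()
rating_counts.plot.pie(autopct='%.1f%%', figsize=(5, 5))
plt.ylabel('')
plt.title('Proportion of rating levels')
plt.show()
```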

The crawled comments may contain many symbols and English words or letters, which do not help sentiment analysis of Chinese text, so before word segmentation two custom functions remove the symbols and English letters from the comments. Digits are not removed here because the stop-word step below already deletes them. jieba's default precise mode is used for word segmentation; it cuts sentences accurately and is well suited to text analysis.

```python
import re
import jieba

# Delete the symbols in the short comments
punc = "~`!#$%^&*()_+-=|';\":/.?><~！@#￥%……&*（）——+-=“”：‘；、。，？《》{}"
def remove_fuhao(e):
    return re.sub(r"[%s]+" % punc, "", e)

# Delete the English letters in the short comments
def remove_letter(new_short):
    return re.sub(r'[a-zA-Z]+', '', new_short)

# Cut text using jieba (default precise mode)
def cut_word(text):
    text = jieba.cut(str(text))
    return ' '.join(text)

# apply creates the new column by chaining the three custom functions above
data['new_short'] = data['short'].apply(remove_fuhao).apply(remove_letter).apply(cut_word)
```

The comments certainly contain many irrelevant words such as "one", "this", "people", and so on, so a stop-word function filters these words out. Its main idea: split the comment into terms by whitespace, then iterate over the list of terms; if a term is not in the stop-word list, is longer than one character, is not a tab, and does not already appear in the output string outstr, it is appended to outstr.

The Chinese stop-word list can be obtained at the end of the article.

```python
# Read the stop-word list
def stopwordslist(filepath):
    stopwords = [line.strip() for line in open(filepath, 'r', encoding='utf-8').readlines()]
    return stopwords

# Delete the stop words in a short comment
def sentence_div(text):
    # Split the comment into terms by whitespace to form a list
    sentence = text.strip().split()
    # Load the stop-word list
    stopwords = stopwordslist(r'Chinese stop words list.txt')
    # Create an empty string
    outstr = ''
    # Iterate over each term of the comment
    for word in sentence:
        if word not in stopwords:          # the term is not a stop word
            if len(word) > 1:              # the term is longer than one character
                if word != '\t':           # the term is not a tab
                    if word not in outstr: # deduplicate: skip terms already in outstr
                        outstr += ' '      # separator
                        outstr += word     # add the term to outstr
    # Return the concatenated string
    return outstr

data['the_short'] = data['new_short'].apply(sentence_div)
```

Some comments may say a lot yet consist almost entirely of words that the stop-word function filters out; with so few words left they contribute little to sentiment analysis, so comments with fewer than 4 remaining terms are deleted here. Since many new columns were created by the custom functions and the DataFrame has become cluttered, the two columns needed for sentiment analysis (the processed comment and its label) are combined into a new DataFrame.

```python
import pandas as pd

# Mark comments that keep more than 3 terms with 1, the rest with 0, then drop the 0s
data['split'] = data['the_short'].apply(lambda x: 1 if len(x.split()) > 3 else 0)
data = data[~data['split'].isin([0])]

# Merge the two required columns into a new DataFrame
new_data1 = data.iloc[:, 3]
new_data2 = data.iloc[:, 5]
new_data = pd.DataFrame({'short': new_data2, 'sentiment': new_data1})
```

After preprocessing, only 280 samples remain in the data set; a screenshot is shown below. As mentioned above, positive comments far outnumber negative ones. To avoid a test set made up entirely of positive samples, the data set is split by random selection: the sample method of the random library picks 10% of the indexes at random as the test-set indexes, and the rest become the training-set indexes. The data set is then cut into two parts according to these two index lists and each part is saved separately.

```python
import random

def splitDataSet(new_data):
    # Randomly select 10% of the indexes as the test-set indexes
    test_index = random.sample(new_data.index.tolist(), int(len(new_data.index.tolist()) * 0.10))
    # The remaining indexes form the training set
    train_index = [i for i in new_data.index.tolist() if i not in test_index]
    # Slice the test set and training set separately
    # (loc is used because the index labels are no longer consecutive after the earlier filtering)
    test_data = new_data.loc[test_index]
    train_data = new_data.loc[train_index]
    # Save them as CSV files
    train_data.to_csv('bayes_train.csv', encoding='utf_8_sig', index=False)
    test_data.to_csv('bayes_test.csv', encoding='utf_8_sig', index=False)
```

Building classifiers

The code in this section largely repeats the code in the previous article, so the algorithm part below will not go into much detail about how it works. If you are new to naive Bayes or want to understand the principle, I recommend reading the previous article: Notes on Machine Learning (5) — Easy to See Through Naive Bayes. If you are already familiar with naive Bayes and only care about the source code and data, you can skip to the end of the article.

Word vector construction

The loadDataSet function converts the short comments into the required vector format: the terms of each comment form a list, and all of these lists are gathered into one list, forming the collection of documents. classVec is the list of sentiment labels corresponding to the comments.

```python
def loadDataSet(filename):
    data = pd.read_csv(filename)
    postingList = []
    # Split each comment into terms
    for sentence in data['short']:
        word = sentence.strip().split()   # split returns a list of terms
        postingList.append(word)          # add each term list to postingList
    # Vector of category labels
    classVec = data['sentiment'].values.tolist()
    return postingList, classVec
```

The createVocabList function returns the list of all unique words that appear in the texts; taking the set union removes duplicates.

```python
# Create a vocabulary list
def createVocabList(dataSet):
    # Create an empty set (no duplicates)
    vocabSet = set([])
    for document in dataSet:
        # Take the union of the two sets
        vocabSet = vocabSet | set(document)
    return list(vocabSet)
```

The setOfWords2Vec function vectorizes a comment. Its inputs are the vocabulary and a single comment; its output is the text vector, whose elements are 1 or 0 depending on whether the corresponding vocabulary word appears in the input text. The idea is to first create a vector the same length as the vocabulary with all elements set to 0, then iterate over the words of the input text and, whenever a word appears in the vocabulary, replace the 0 at the corresponding position with 1.

```python
# Vectorization function
def setOfWords2Vec(vocabList, inputSet):
    # Create a vector whose elements are all 0
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            # If the vocabulary contains the word, change the 0 at that position to 1
            returnVec[vocabList.index(word)] = 1
    return returnVec
```

The getMat function gathers all the processed term vectors into a term-vector matrix, which is convenient to call when testing the algorithm.

```python
# Gather the term vectors into a matrix
def getMat(inputSet):
    trainMat = []
    vocabList = createVocabList(inputSet)
    for Set in inputSet:
        returnVec = setOfWords2Vec(vocabList, Set)
        trainMat.append(returnVec)
    return trainMat
```

Training algorithm

The inputs of the trainNB function are the comment matrix trainMatrix and the vector trainCategory made up of the sentiment label of each comment. First, the probability of positive sentiment is simply the number of positive comments divided by the total number of comments. To calculate P(W|C1) and P(W|C0), the numerators and denominators are initialized first; then, while traversing the input texts, whenever a word appears in a document (of positive or negative sentiment), the count for that word (p1Num or p0Num) is incremented, and the total number of terms for that class (p1Denom or p0Denom) is increased accordingly.
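In formula form, what the loop below computes, with the ones/2.0 initialization acting as Laplace smoothing and logarithms taken to avoid underflow, is:

$$P(c_1) = \frac{\#\{\text{positive comments}\}}{N}, \qquad \log P(w_j \mid c) = \log \frac{1 + \sum_{i:\,y_i = c} x_{ij}}{2 + \sum_{i:\,y_i = c} \sum_{k} x_{ik}}$$

where $x_{ij}$ is 1 if word $j$ appears in comment $i$ and 0 otherwise.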

```python
import numpy as np

def trainNB(trainMatrix, trainCategory):
    # Number of training documents
    numTrainDocs = len(trainMatrix)
    # Number of entries per document (vocabulary size)
    numWords = len(trainMatrix[0])
    # Probability that a document is positive sentiment (class 1)
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # Initialize the numerators to ones (Laplace smoothing)
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    # Initialize the denominators to 2 (Laplace smoothing)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            # Accumulate the counts needed for the positive-class conditional probabilities,
            # i.e. P(w0|1), P(w1|1), P(w2|1)...
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            # Accumulate the counts needed for the negative-class conditional probabilities,
            # i.e. P(w0|0), P(w1|0), P(w2|0)...
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Take logarithms of the per-term probabilities to avoid underflow
    p1Vect = np.log(p1Num / p1Denom)
    p0Vect = np.log(p0Num / p0Denom)
    return p1Vect, p0Vect, pAbusive
```

Testing the algorithm

The classifyNB function judges the category. Its inputs are a test sample in vector format and the three return values of the training function trainNB. If p1 is greater than p0, the test sample is classified as positive sentiment and 1 is returned; otherwise it is classified as negative sentiment and 0 is returned.
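Equivalently, the comparison made below is between the two log posteriors (the shared term $\log P(W)$ can be ignored):

$$p_1 = \sum_j x_j \log P(w_j \mid 1) + \log P(1), \qquad p_0 = \sum_j x_j \log P(w_j \mid 0) + \log\bigl(1 - P(1)\bigr)$$

and the predicted label is 1 if $p_1 > p_0$, otherwise 0.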

```python
def classifyNB(ClassifyVec, p1V, p0V, pAb):
    # Sum the log conditional probabilities of the words present and add the log prior
    p1 = sum(ClassifyVec * p1V) + np.log(pAb)
    p0 = sum(ClassifyVec * p0V) + np.log(1.0 - pAb)
    print('p1:', p1)
    print('p0:', p0)
    if p1 > p0:
        return 1
    else:
        return 0
```

testNB is the test function. It calls the functions above to predict the test set, and the accuracy of the classifier is obtained by comparing the true labels with the predicted results.

```python
def testNB():
    # Load the training set
    train_postingList, train_classVec = loadDataSet('bayes_train4.csv')
    # Create the vocabulary
    vocabSet = createVocabList(train_postingList)
    # Gather the training term vectors into a matrix
    trainMat = getMat(train_postingList)
    # Train the algorithm
    p1V, p0V, pAb = trainNB(trainMat, train_classVec)
    # Load the test set
    test_postingList, test_classVec = loadDataSet('bayes_test4.csv')
    # Vectorize and classify each test comment
    predict = []
    for each_test in test_postingList:
        testVec = setOfWords2Vec(vocabSet, each_test)
        # Judge the category
        if classifyNB(testVec, p1V, p0V, pAb):
            print(each_test, "Positive emotions")
            predict.append(1)
        else:
            print(each_test, "Negative emotions")
            predict.append(0)
    # Compare the predictions with the true labels
    corr = 0.0
    for i in range(len(predict)):
        if predict[i] == test_classVec[i]:
            corr += 1
    print("The accuracy of the naive Bayes classifier is: " + str(round(corr / len(predict) * 100, 2)) + "%")
```

A screenshot of the final program run is shown below:

Since the training set and test set are divided by random selection, the accuracy of the naive Bayes classifier changes every time the program is run; the average over several runs can be taken as the accuracy of the model. Finally, a word cloud built from this data set is attached. I wonder whether the genre of this movie arouses your interest.
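The word cloud is likewise only shown as an image; a minimal sketch of generating one from the preprocessed comments with the wordcloud library is given below. The font path is an assumption: a Chinese font must be supplied, otherwise the characters render as empty boxes.

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join all preprocessed (space-separated) comments into one string
text = ' '.join(new_data['short'])

# font_path is an assumption: point it at any Chinese font available locally
wc = WordCloud(font_path='simhei.ttf', background_color='white',
               width=800, height=600).generate(text)

plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()
```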

Conclusion

When using naive Bayes for this kind of sentiment analysis or text classification, keep as much original data as possible: of the 580 original samples above, only 280 remain after text preprocessing. Only with sufficient data can the model be practical; too little data makes the accuracy fluctuate widely and the results more a matter of chance.

Follow the WeChat public account [sugar cat] and reply "hunger platform" in the background to obtain the source code and data for reference. Thank you for reading.