Sentiment Analysis

Sentiment analysis, also known as tendency analysis, opinion extraction, opinion mining, sentiment mining, or subjectivity analysis, refers to the process of analyzing, processing, summarizing, and reasoning about subjective texts that carry emotional color.

  • Mainstream approaches:

Sentiment-lexicon-based: extract sentiment words from the text against a pre-built sentiment dictionary, then compute the sentiment orientation of the text, i.e. quantify its emotional color based on semantics and dependency relations. The final classification quality depends on the completeness of the sentiment lexicon. It also requires a solid linguistic foundation: you need to know when a sentence usually reads as positive or negative.
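
A minimal, hypothetical sketch of the lexicon-based idea (the toy lexicon, the negation list, and the scoring rule below are illustrative assumptions, not the actual dictionary used):

```python
import jieba

# Hypothetical toy lexicon: word -> polarity score (a real sentiment
# dictionary would be loaded from a file).
SENTIMENT_LEXICON = {"喜欢": 1.0, "开心": 1.0, "讨厌": -1.0, "失望": -1.0}
NEGATIONS = {"不", "没", "没有"}  # simple negation words (assumption)

def lexicon_score(text):
    """Sum lexicon scores over segmented words, flipping the sign after a negation word."""
    score, negate = 0.0, False
    for w in jieba.cut(text):
        if w in NEGATIONS:
            negate = True
            continue
        if w in SENTIMENT_LEXICON:
            s = SENTIMENT_LEXICON[w]
            score += -s if negate else s
            negate = False
    return score  # > 0 positive, < 0 negative, 0 neutral/unknown

print(lexicon_score("我很失望"))  # negative score
```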

Machine learning-based: select sentiment words as feature words, build the text matrix, and classify with methods such as logistic regression, Naive Bayes, or support vector machines (SVM). The final classification quality depends on the choice of training text and correct sentiment labeling.
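
A rough sketch of that machine-learning pipeline with scikit-learn; the feature-word list and the `train_texts`/`train_labels` names are placeholders for the real corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical sentiment words chosen as the feature vocabulary; texts are
# assumed to be pre-segmented and joined with spaces.
feature_words = ["喜欢", "开心", "讨厌", "失望", "一般"]

clf = make_pipeline(
    CountVectorizer(vocabulary=feature_words),  # text matrix over the feature words
    LogisticRegression(max_iter=1000),          # could also be MultinomialNB or LinearSVC
)
# clf.fit(train_texts, train_labels)
# clf.predict(["这个 电影 我 很 喜欢"])
```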

The data sets

There are three data sets. At first I only worked with the two-class and four-class sets, but the four-class results were quite poor, so I used the nine-class data to check whether the problem lay in the data or in the model.

Word segmentation

Chinese word segmentation can be done with jieba, THULAC, or ICTCLAS.

Jieba: easy to use; you can add a custom dictionary or remove stop words. A custom dictionary can be added as needed, but it was not used here. Stop words, on the other hand, carry no clear emotional direction yet appear very frequently, so their TF-IDF scores can still be high; they have to be removed explicitly to avoid introducing noise.
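
For reference, a small jieba sketch with an optional custom dictionary and stop-word filtering; the file paths are assumptions:

```python
import jieba

# Optionally load a custom user dictionary (the path is an assumption).
# jieba.load_userdict("userdict.txt")

# Load a stop-word list, one word per line (the path is an assumption).
with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f}

def segment(text):
    """Cut a sentence with jieba and drop stop words and whitespace tokens."""
    return [w for w in jieba.cut(text) if w.strip() and w not in stopwords]

print(segment("这部电影真的非常好看，强烈推荐！"))
```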

Vector construction

TF-IDF is used to find the most representative words in the vocabulary.

Term Frequency-Inverse Document Frequency (TF-IDF) is a commonly used weighting technique in information retrieval and text mining. It is a statistical method for assessing how important a word is to a document within a collection or corpus. The importance of a word increases with the number of times it appears in the document, but decreases with how often it appears across the corpus. Various forms of TF-IDF weighting are often used by search engines to measure or rank how relevant a document is to a user query. Besides TF-IDF, web search engines also use link-analysis-based ranking methods to determine the order in which documents appear in search results.
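
A quick illustration with scikit-learn's TfidfVectorizer on pre-segmented toy texts (the corpus below is made up for demonstration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: texts are assumed to be pre-segmented and joined with spaces,
# so whitespace tokenization works for Chinese.
corpus = [
    "这部 电影 非常 好看",
    "剧情 无聊 非常 失望",
    "演员 演技 出色 电影 好看",
]

vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")  # keep single-character words
tfidf = vectorizer.fit_transform(corpus)        # sparse document-term matrix
print(vectorizer.get_feature_names_out())       # the learned vocabulary
print(tfidf.toarray().round(2))                 # TF-IDF weight of each word per document
```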

Model selection

Text preprocessing

  • Dimensionality reduction (stop word removal)
  • Sample imbalance (oversampling: directly duplicate negative samples, or SMOTE)

Here the samples are imbalanced, with noticeably fewer negative comments than positive ones. The solution is oversampling. The two approaches to text classification are word frequency and word vectors, which correspond to Naive Bayes and LSTM respectively.
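
A sketch of both oversampling options with imbalanced-learn, using synthetic data as a stand-in for the real TF-IDF matrix and labels:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE, RandomOverSampler
from sklearn.datasets import make_classification

# Synthetic stand-in for the real TF-IDF matrix X and labels y (90% / 10% split).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

# Option 1: duplicate minority (negative) samples directly.
X_dup, y_dup = RandomOverSampler(random_state=42).fit_resample(X, y)
# Option 2: SMOTE synthesizes new minority samples by interpolating between neighbours.
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
print("after SMOTE:", Counter(y_sm))
```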

Naive Bayes

The original data

With imbalanced samples, the recall for negative samples is clearly problematic, and so is the ROC curve: the false positive rate is near zero, but the true positive rate is low. After oversampling, the recall for negative samples is high and the true positive rate is high.

Oversampling

After oversampling, the recall and F1-score are high. The area under the ROC curve is also large, and the true positive rate is high even where the false positive rate is near zero.
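
For completeness, a toy-scale sketch of the Naive Bayes training and evaluation steps (the six example texts and labels below are made up; the real run uses the oversampled TF-IDF matrix):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score

# Toy stand-in for the (oversampled) training data.
texts = ["好看 推荐", "喜欢 精彩", "难看 失望", "无聊 讨厌", "好看 喜欢", "失望 难看"]
labels = np.array([1, 1, 0, 0, 1, 0])   # 1 = positive, 0 = negative

X = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(texts)
nb = MultinomialNB().fit(X, labels)

pred = nb.predict(X)
prob = nb.predict_proba(X)[:, 1]
print(classification_report(labels, pred))       # per-class precision, recall, F1-score
print("ROC AUC:", roc_auc_score(labels, prob))   # area under the ROC curve
```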

LSTM

If binary classification works this well, how about four classes? The four-class dataset covers: joy, anger, disgust, and depression. Intuitively, the latter three are not so easy to tell apart.

Microblog four-class data

Let's look at the length of the microblogs first, which mostly falls in the 50-100 character range. Here we exclude the many microblogs whose features are too indistinct.
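
A hedged sketch of the LSTM setup in Keras, padding sequences to 100 tokens to match the observed length distribution; the vocabulary size, embedding dimension, and layer sizes are assumptions:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

MAX_WORDS, MAX_LEN, NUM_CLASSES = 20000, 100, 4   # sizes are assumptions

# texts: segmented microblogs joined by spaces; labels: integer class ids
# (both are placeholders for the real four-class data).
tokenizer = Tokenizer(num_words=MAX_WORDS)
# tokenizer.fit_on_texts(texts)
# X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_LEN)

model = Sequential([
    Embedding(MAX_WORDS, 128),                     # word embeddings learned from scratch
    LSTM(64, dropout=0.2, recurrent_dropout=0.2),  # a single LSTM layer
    Dense(NUM_CLASSES, activation="softmax"),      # 4-way classification
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
# history = model.fit(X, labels, validation_split=0.2, epochs=10, batch_size=64)
```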

The training loss decreases gradually while the validation loss increases gradually. This indicates a problem: either the model is overfitting, or the dataset itself is flawed.
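
One standard thing to try for the overfitting case is early stopping on the validation loss (a sketch; the training call itself is assumed):

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop training once validation loss stops improving and keep the best weights.
early_stop = EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)
# history = model.fit(X, labels, validation_split=0.2, epochs=20,
#                     batch_size=64, callbacks=[early_stop])
```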

Apart from joy, the recall and precision of the other three categories were not very good.

Nine-class text classification

The training loss decreased while the validation loss increased. Either the model is at fault (overfitting) or the data itself is at fault. So I then found a dataset with nine categories.