This time we use HanLP for text classification and sentiment analysis. This is also the penultimate article in the PyHanLP user guide on interfaces and Python implementations; the remaining articles will cover an introduction, a summary of working tips, and several examples.

Text classification

In HanLP, text classification and sentiment analysis both use the same classifier: a naive Bayes classifier. This classifier may seem a bit plain, but the end results are quite good.

Because the underlying implementation uses the bag-of-words model, memory overhead can become very large when the text is large. This is not a problem, though: the author has built in a feature selection step that uses the chi-square test to filter features against a threshold, reducing memory overhead.
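To make the bag-of-words idea concrete, here is a minimal sketch (my own illustration, not HanLP's internal code) showing how a document is reduced to token counts, with word order discarded:

```python
from collections import Counter

def bag_of_words(tokens):
    """Reduce a token list to a token -> count map; word order is discarded."""
    return Counter(tokens)

doc = ["hotel", "room", "clean", "room", "big"]
bow = bag_of_words(doc)
# bow["room"] == 2. Each distinct token becomes one feature dimension,
# which is why vocabulary size drives memory usage.
```

Every distinct token across the whole corpus adds a dimension, so a large corpus quickly produces a huge feature space, hence the need for feature selection.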

The original author only gave an example of text classification. Here we modify that example to make it better suited to the classification task.

Corpus

In this article, corpus refers to a text classification corpus, corresponding to the IDataSet interface. A text classification corpus involves two concepts: document and category. A document belongs to exactly one category, while a category may contain multiple documents. An example is the Sogou text classification corpus mini version (a zip archive); please read the Sogou Lab data license agreement before downloading.

The data format

The root directory of the classification corpus must have the following structure:

classification root
├── Category A
│   ├── 1.txt
│   ├── 2.txt
│   └── 3.txt
├── Category B
│   ├── 1.txt
│   └── ...
└── ...

Files do not need to be named with numbers or carry a .txt suffix, but they must be plain text files.
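For illustration, here is a small Python sketch (the function name is my own, not part of HanLP) that reads a corpus laid out this way into a category-to-documents map:

```python
import os

def load_corpus(root):
    """Read a <root>/<category>/<file> layout into {category: [text, ...]}."""
    corpus = {}
    for category in sorted(os.listdir(root)):
        cat_dir = os.path.join(root, category)
        if not os.path.isdir(cat_dir):
            continue  # skip stray files at the root level
        corpus[category] = []
        for name in sorted(os.listdir(cat_dir)):
            with open(os.path.join(cat_dir, name), encoding="utf-8") as f:
                corpus[category].append(f.read())
    return corpus
```

Each subdirectory name becomes the category label of every document inside it, which is exactly the convention the structure above encodes.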

Tokenization

Currently, there are two implementations of the tokenizer interface in this system: BigramTokenizer and HanLPTokenizer. But is word segmentation actually necessary for text classification? The answer is no. We can simply take every pair of adjacent characters in the text as a "word" (the term is bigram). When there are enough of them, these bigrams reflect the topic of the article (see the 2016 Tsinghua University paper: Zhipeng Guo, Yu Zhao, Yabin Zheng, Xiance Si, Zhiyuan Liu, Maosong Sun. THUCTC: An Efficient Chinese Text Classifier. 2016). This corresponds to BigramTokenizer in the code. Of course, you can also use a traditional tokenizer such as HanLPTokenizer. Alternatively, users can implement their own tokenizer by implementing ITokenizer and register it via IDataSet#setTokenizer.
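A character-bigram tokenizer in the spirit of BigramTokenizer can be sketched in a few lines (this is an illustration of the idea, not HanLP's implementation):

```python
def char_bigrams(text):
    """Return every pair of adjacent characters as one 'word' (a bigram)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

char_bigrams("自然语言")  # ['自然', '然语', '语言']
```

No dictionary or segmentation model is needed; the bigrams alone carry enough topical signal for the classifier.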

Feature extraction

Feature extraction refers to selecting, from all words, those most helpful for classification. Ideally every word would contribute to the classification decision, but in practice, including all words makes training very slow, memory overhead very large, and the final model very big.

The system uses the chi-square test: features whose chi-square value falls below a threshold are removed, and the final number of features is capped at no more than one million.
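As an illustration of how such a per-feature chi-square score can be computed (a sketch of the statistic itself, not HanLP's code): for each (term, category) pair, build a 2x2 contingency table over documents and evaluate

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 term/category contingency table.
    n11: docs in the category containing the term
    n10: docs outside the category containing the term
    n01: docs in the category without the term
    n00: docs outside the category without the term
    """
    n = n11 + n10 + n01 + n00
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if den == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / den

# A term whose presence is independent of the category scores 0, so
# thresholding this value keeps only the informative features.
```

Features scoring below the threshold are discarded, which is what bounds the model to at most one million features.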

Prediction

classify returns the most likely category as a String; predict returns the scores (or probabilities) of all categories as a Map keyed by category name; categorize returns the scores of all categories as a double array, ordered lexicographically by category name; and label returns the index of the most likely category in that lexicographic order.
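The relationship between these four interfaces can be illustrated in plain Python (the category names and scores below are made up for illustration; they are not real HanLP output):

```python
# Suppose categorize(text) returned these scores, ordered
# lexicographically by category name:
categories = ["education", "military", "sports"]
scores = [0.1, 0.2, 0.7]

label = max(range(len(scores)), key=scores.__getitem__)  # label(text)    -> 2
best = categories[label]                                 # classify(text) -> "sports"
dist = dict(zip(categories, scores))                     # predict(text)  -> full map
```

In other words, classify is just the argmax over the same scores that categorize and predict expose in full.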

Thread safety

Similar to the rest of HanLP's design, the internal implementation of this system does not use any thread locks, yet every prediction interface is thread-safe (by design it stores no intermediate state; all intermediate results live on the parameter stack).

from pyhanlp import SafeJClass
import zipfile
import os
from pyhanlp.static import download, remove_file, HANLP_DATA_PATH

# Set the path manually; otherwise it will be looked up in the configuration file
HANLP_DATA_PATH = "/home/fonttian/Data/CNLP"

"" "to obtain test data path, in $root/data/textClassification/sogou - mini, the root directory, designated by the configuration file or equal to our front HANLP_DATA_PATH manually. "" "
DATA_FILES_PATH = "textClassification/sogou-mini"


def test_data_path():
    data_path = os.path.join(HANLP_DATA_PATH, DATA_FILES_PATH)
    if not os.path.isdir(data_path):
        os.makedirs(data_path)  # create intermediate directories as needed
    return data_path


def ensure_data(data_name, data_url):
    root_path = test_data_path()
    dest_path = os.path.join(root_path, data_name)
    if os.path.exists(dest_path):
        return dest_path
    if data_url.endswith('.zip'):
        dest_path += '.zip'
    download(data_url, dest_path)
    if data_url.endswith('.zip'):
        with zipfile.ZipFile(dest_path, "r") as archive:
            archive.extractall(root_path)
        remove_file(dest_path)
        dest_path = dest_path[:-len('.zip')]
    return dest_path


NaiveBayesClassifier = SafeJClass('com.hankcs.hanlp.classification.classifiers.NaiveBayesClassifier')
IOUtil = SafeJClass('com.hankcs.hanlp.corpus.io.IOUtil')
sogou_corpus_path = ensure_data('Sogou Text Classification Corpus Mini Version',
                                'http://hanlp.linrunsoft.com/release/corpus/sogou-text-classification-corpus-mini.zip')


def train_or_load_classifier(path):
    model_path = path + '.ser'
    if os.path.isfile(model_path):
        return NaiveBayesClassifier(IOUtil.readObjectFrom(model_path))
    classifier = NaiveBayesClassifier()
    classifier.train(sogou_corpus_path)
    model = classifier.getModel()
    IOUtil.saveObjectTo(model, model_path)
    return NaiveBayesClassifier(model)


def predict(classifier, text):
    print("%16s\t belongs to category\t [%s]" % (text, classifier.classify(text)))
    # To obtain the distribution over categories, use the predict interface:
    # print("%16s\t belongs to category\t [%s]" % (text, classifier.predict(text)))


if __name__ == '__main__':

    classifier = train_or_load_classifier(sogou_corpus_path)
    predict(classifier, "Cristiano Ronaldo wins Ballon d 'Or award for second consecutive year ahead of Messi Neymar)
    predict(classifier, "Britain's aircraft carrier has taken eight years to build and is still out of service and China is way ahead of it.")
    predict(classifier, "Further specialization is urgently needed in the postgraduate admissions model")
    predict(classifier, "If you really want to decompress with food, I suggest you eat oats.")
    predict(classifier, "Gm and some of its competitors are now considering a solution to their inventory problems.")
    
    
    print("\n Here we use the trained model to even test the new several news headlines from the Internet \n")
    predict(classifier, "This year, the pressure to take the national entrance exam has increased further. Maybe it is turning into the second national entrance exam.")
    predict(classifier, "Zhang Jike was shouted by Liu Guoliang: wake up! The Olympics have begun.")
    predict(classifier, "At last Ford understood! The new 1.5T deploys 184 HP, which is less than 110,000, putting the Civic to shame.")
"Cristiano Ronaldo wins golden Ball for the second year in a row ahead of Messi Neymar" belongs to the category of [sports] "Britain's aircraft carrier is still out of service after 8 years of construction, but China lags far behind" belongs to the category of [military] "Postgraduate examination model needs further specialization" belongs to the category of [education] "If you really want to use food decompression, the proposal can eat oats" belongs to a classification [health] the general part and its rival, is considering inventory problem "belongs to a classification [car] we're here again even using the trained model to test the new literally from the Internet to find some news title" one's deceased father grind pressure further increase this year, Perhaps the postgraduate entrance examination is becoming the second college entrance examination "belongs to the classification [education]" Zhang Jike was Liu Guoliang shouted wake up: wake up! The Olympic Games began." The new 1.5T puts out 184 horsepower, less than 110,000, and the Civic puts it to shame.Copy the code

Judging from the few headlines we added ourselves at the end, the classifier works quite well.

Sentiment analysis

Our implementation of sentiment analysis is highly similar to the text classification above; as mentioned earlier, they actually use the same classifier. In Python, the two are almost identical.

Because of this, as long as we have a corpus in the same format, we can use this classifier for any text classification task we need.

Corpus sources

Shallow sentiment analysis can be performed by training the text classification model on a sentiment-polarity corpus. Currently available public sentiment analysis corpora include the Chinese sentiment mining corpus ChnSentiCorp, published by Tan Songbo.

"" "to obtain test data path, in $root/data/textClassification/sogou - mini, the root directory, designated by the configuration file or equal to our front HANLP_DATA_PATH manually. ChnSentiCorp Reviews Hotel Sentiment Analysis """
DATA_FILES_PATH = "sentimentAnalysis/ChnSentiCorp"

if __name__ == '__main__':
    
    ChnSentiCorp_path = ensure_data('Sentiment Analysis of Hotel Reviews', \
        					'http://hanlp.linrunsoft.com/release/corpus/ChnSentiCorp.zip')
    # Thanks to the original author for the download link
    # In this example, if you want to use local data, set the DATA_FILES_PATH variable above
    classifier = train_or_load_classifier(ChnSentiCorp_path)
    predict(classifier, 'It is close to chuansha Highway, but the bus instruction is wrong, if it is "CAI Lu line", it will be very troublesome. Suggest another route. The rooms are simpler.')
    predict(classifier, "Business bed room, the room is very large, the bed is 2M wide, the overall feeling of economic benefits is good!")
    predict(classifier, "Standard rooms are so bad they're not even three stars and the facilities are so old. The hotel is advised to renovate the old standard rooms.")
    predict(classifier, "The service attitude was extremely poor. The receptionist seemed to be untrained and did not even know basic manners. He was able to serve several guests at the same time.")
    
    
    print("\n Here we use the trained model and test the 'new' text I wrote myself \n")
    predict(classifier, "Service attitude is very good, serious reception of us, attitude can!")
    predict(classifier, "It's a little unhygienic. It doesn't feel good.")
"Distance from chuansha road is close, but the bus indication is wrong, if it is" CAI Lu line ", it will be very troublesome. It is suggested to use other routes. The room is relatively simple." It is recommended that the hotel renovate the old standard room from scratch. The service attitude is extremely poor. The receptionist seems to be untrained and does not even know basic manners. Unexpectedly receive several guests at the same time "belong to classification [negative] we here reoccupy training good model to even test my own" new "text" service attitude is very good, serious reception we, attitude can!" It's a little unhygienic. It doesn't feel good. Be classified [negative]Copy the code

References

HanLP Text Classification and Sentiment Analysis – Wiki