Preface

This blog post continues the previous article and further introduces NLPIR2016, covering three topics: using NLPIR's new-word discovery function to automatically extract new words from text, solving the problem of wordcloud mixing Chinese and English, and combining NLPIR with wordcloud to generate a word cloud. This post took about an hour to write and roughly 12 minutes to read.

Getting new words with NLPIR2016

# Get new words; the second parameter controls the number of new words.
# The ranking is sorted by TF-IDF (Term Frequency - Inverse Document Frequency).
# strs1 = GetNewWords(text, c_int(10), [c_char_p, c_int, c_bool])
# print strs1

# Get new words from a txt file; the second parameter again controls the number
# of new words, ranked by TF-IDF.
# GetFileNewWords(text, c_int(10), [c_char_p, c_int, c_bool])
# print strs10
# This raised: WindowsError: exception: access violation reading 0x0000000000000000

# Get keywords; the second parameter controls the number of keywords,
# also ranked by TF-IDF.
# strs2 = GetKeyWords(text, c_int(10), [c_char_p, c_int, c_bool])

def getNewWordsByNLPIR(text, number):
    txt1 = GetNewWords(text, c_int(number), [c_char_p, c_int, c_bool])
    txt2 = txt1.split('#')
    txt3 = []
    txt4 = []
    txt5 = []
    for item2 in txt2:
        txt3.append(item2.encode('utf-8').split('/'))
        if txt3 != []:
            txt4.append(txt3)
            txt3 = []
    for i in txt4:
        for j in i:
            if j[0] != [] and j[0] != '':
                txt5.append(j[0])
    return txt5  # note: the return value is a plain Python list

# You can then add the new words to NLPIR's user dictionary with a simple loop:
# for item in NewWord:
#     AddUserWord(item)
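GetNewWords returns one '#'-separated string of `word/pos/weight` entries, so the parsing step can be written and tested independently of NLPIR itself. Below is a minimal sketch of such a parser; `parse_new_words` and the sample string are hypothetical, illustrating only the output format shown in this post:

```python
def parse_new_words(result):
    """Split NLPIR's 'word/pos/weight#word/pos/weight#...' result string
    into a list of (word, pos, weight) tuples, skipping empty entries."""
    entries = []
    for chunk in result.split('#'):
        parts = chunk.split('/')
        # keep only well-formed, non-empty entries
        if len(parts) == 3 and parts[0].strip():
            word, pos, weight = parts
            entries.append((word, pos, float(weight)))
    return entries

# hypothetical result string in the format NLPIR returns
sample = u'\u9f99/n_new/14.48#\u9f99\u578b/n_new/13.79'
print(parse_new_words(sample))
```

Parsing into tuples rather than bare words also keeps the TF-IDF weight around, which is handy if you later want to threshold which new words get added to the user dictionary.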

wordcloud cannot display Chinese

  1. I wrote this post with Python 2.7. When I combined NLPIR2016 with wordcloud, I found that wordcloud displayed only the English words in the text, not the Chinese ones. This is different from the earlier jieba article, where the problem could be solved by setting the font. So I checked the encoding at every stage of the program and confirmed that every step used UTF-8, and that the final input to wordcloud was wordcloud.generate(text.encode('utf-8')), yet the result was still wrong. Then I remembered reading that when processing text in Python 2.x it is best to fix the encoding at the moment the file is read, and it occurred to me that even after encode('utf-8') the in-memory encoding might still be wrong. So I tried using codecs to save the text to disk and read it back as UTF-8, and sure enough that worked. The workaround was obviously ugly, but it did confirm that the text fed into wordcloud had the wrong encoding, or at least was not usable UTF-8 (encode('utf-8') did not help). So I tried passing Python's internal unicode objects directly instead, and that succeeded.
  2. Causes and solutions
# Note that the text fed to wordcloud needs its encoding converted, because
# NLPIR2016 changes the default encoding while segmenting Chinese.
# unicode(strs, encoding='utf8') converts the text to a unicode object,
# whereas strs.encode('utf-8') does not.
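The distinction the comment draws can be reproduced without NLPIR: encoding produces a byte string, while decoding produces the text object that wordcloud needs in order to lay out CJK characters. A minimal sketch, written to run under Python 3, where `str` plays the role of Python 2's `unicode` and `bytes` the role of Python 2's `str`:

```python
# UTF-8 encoded bytes, the form in which segmented text typically comes back
raw = u'\u4e2d\u6587\u8bcd\u4e91'.encode('utf-8')

# Decoding the bytes yields a text object (Python 2: unicode, Python 3: str);
# this is the equivalent of the post's unicode(strs, encoding='utf8') call
decoded = raw.decode('utf-8')

print(type(raw).__name__)      # bytes (Python 2: str)
print(type(decoded).__name__)  # str (Python 2: unicode)
```

Calling encode('utf-8') on text that is already a byte string does not perform this conversion, which matches the symptom described above.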

A word cloud generation example

# -*- coding: utf-8 -*-
# Created on 2017/5/23
# email: [email protected]
# CSDN: http://blog.csdn.net/fontthrone

from os import path
from nlpir import *
from scipy.misc import imread
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
from ctypes import *
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

d = path.dirname(__file__)

text_path = 'txt/lztest.txt'  # path of the text to be analysed
stopwords_path = 'stopwords\stopwords1893.txt'  # stopword list
text = open(path.join(d, text_path)).read()
txt = seg(text)
kw_list = []
seg_list = []

# Get new words; the second parameter controls the number of new words,
# ranked by TF-IDF (Term Frequency - Inverse Document Frequency)
# strs1 = GetNewWords(text, c_int(10), [c_char_p, c_int, c_bool])
# print strs1
# print type(strs1)
# Sample output (words translated):
# Masashi Toyama/n_new/28.45#White King bloodline/n_new/19.83#Secret Ritual Mantra/n_new/19.19#
# Discipline Committee Chairman/n_new/15.30#dragon/n_new/14.48#dragon type/n_new/13.79#
# dragon bloodline/n_new/13.78#...
# <type 'str'>

def getNewWordsByNLPIR(text, number):
    txt1 = GetNewWords(text, c_int(number), [c_char_p, c_int, c_bool])
    txt2 = txt1.split('#')
    txt3 = []
    txt4 = []
    txt5 = []
    for item2 in txt2:
        txt3.append(item2.encode('utf-8').split('/'))
        if txt3 != []:
            txt4.append(txt3)
            txt3 = []
    for i in txt4:
        for j in i:
            if j[0] != [] and j[0] != '':
                txt5.append(j[0])
    return txt5

strs1 = getNewWordsByNLPIR(text, 10)
for i in strs1:
    print i

for t in txt:
    seg_list.append(t[0].encode('utf-8'))

def NLPIRclearText(seg_list):
    mywordlist = []
    liststr = "/ ".join(seg_list)
    f_stop = open(stopwords_path)
    try:
        f_stop_text = f_stop.read()
        f_stop_text = unicode(f_stop_text, 'utf-8')
    finally:
        f_stop.close()
    f_stop_seg_list = f_stop_text.split('\n')
    for myword in liststr.split('/'):
        if not (myword.strip() in f_stop_seg_list) and len(myword.strip()) > 1:
            mywordlist.append(myword)
    return ''.join(mywordlist)

strs = NLPIRclearText(seg_list)
# print strs

font_path =   # the original value was lost; it must point to a font that supports Chinese

wc = WordCloud(font_path=font_path,        # set the font
               background_color="white",   # background colour
               max_words=200,              # maximum number of words shown in the cloud
               max_font_size=100,          # maximum font size
               random_state=42,
               width=1000, height=860, margin=2,  # default image size; if a background image is used, the saved image follows it instead
               )

# NLPIR automatically changes the text encoding while segmenting Chinese,
# so the encoding must be converted back before feeding the text to wordcloud
wc.generate(unicode(strs, encoding='utf8'))

plt.imshow(wc)
plt.axis("off")
plt.show()
wc.to_file('test001.png')
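The cleanup step inside NLPIRclearText (drop stopwords and single characters, then join the survivors for wordcloud) can be isolated and tested without NLPIR. A minimal sketch, with the stopword list inlined rather than read from stopwords1893.txt, and `clear_text` as a hypothetical stand-in for the post's function:

```python
def clear_text(seg_list, stopwords):
    """Keep only segments that are not stopwords and are longer than one
    character, then join them with spaces for wordcloud.generate()."""
    kept = []
    for word in seg_list:
        w = word.strip()
        if w and w not in stopwords and len(w) > 1:
            kept.append(w)
    return ' '.join(kept)

# tiny inline example: segments with one stopword and one single character
segments = [u'\u6211\u4eec', u'\u7684', u'word', u'cloud']
stopwords = {u'\u7684', u'\u4e86'}
print(clear_text(segments, stopwords))
```

wordcloud treats whitespace as the word separator when it builds frequencies from raw text, so joining the cleaned segments with spaces gives it mixed Chinese and English tokens it can count directly.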