Illustrations by Leon movprint

Author: Wen Jianhua, of "Xiaowen's Data Journey", a data analysis enthusiast and pseudo-programmer who does not want to be a code monkey. Blog: zhihu.com/c_188462686

First, a brief introduction to jieba, a Chinese word segmentation package. jieba has three main segmentation modes:

  • Accurate mode: the default; it cuts the sentence as precisely as possible and is suitable for text analysis;
  • Full mode: extracts every string in the sentence that can form a word, but it cannot resolve ambiguity;
  • Search engine mode: builds on accurate mode and splits long words again, which suits search engine indexing.

Commonly used jieba functions:

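  • jieba.cut(text, cut_all=False): segments text; cut_all=True switches to full mode
  • jieba.load_userdict(file_name): loads a custom dictionary
  • jieba.add_word(seg, freq, flag): adds a word, with an optional frequency and part-of-speech tag
  • jieba.del_word(seg): removes a word from the dictionary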

Harry Potter is a series of fantasy novels by the British author J. K. Rowling. It follows the adventures of the main character, Harry Potter, during his seven years at Hogwarts School of Witchcraft and Wizardry. Below, we use the complicated relationships between the characters in Harry Potter as an example to practice with the jieba package.

    import numpy as np
    import pandas as pd
    import jieba, codecs
    import jieba.posseg as pseg
    from pyecharts import Bar, WordCloud  # pyecharts 0.x API

    # Load the character names, the stop words, and the book text
    renmings = pd.read_csv('people.txt', engine='python', encoding='utf-8', names=['renming'])['renming']
    stopwords = pd.read_csv('mystopwords.txt', engine='python', encoding='utf-8', names=['stopwords'])['stopwords'].tolist()
    book = open('Harry Potter.txt', encoding='utf-8').read()
    jieba.load_userdict('Harry Potter.txt')  # custom dictionary

    # Segmentation function
    def words_cut(book):
        words = list(jieba.cut(book))
        stopwords1 = [w for w in words if len(w) == 1]  # single-character tokens
        seg = set(words) - set(stopwords) - set(stopwords1)  # drop stop words and single characters
        result = [i for i in words if i in seg]
        return result

    # First segmentation
    bookwords = words_cut(book)
    renming = [i.split(' ')[0] for i in set(renmings)]  # keep only the name itself
    nameswords = [i for i in bookwords if i in set(renming)]  # keep only character names

    # Word frequencies
    bookwords_count = pd.Series(bookwords).value_counts().sort_values(ascending=False)
    nameswords_count = pd.Series(nameswords).value_counts().sort_values(ascending=False)
    bookwords_count[:100].index
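The filtering done inside `words_cut` can be tried without jieba on a list that is already tokenized. The following is a minimal sketch under stated assumptions: `filter_tokens`, the token list, and the stop words are hypothetical names and data made up for illustration, and `collections.Counter` stands in for the pandas `value_counts` call.

```python
from collections import Counter


def filter_tokens(tokens, stopwords):
    """Drop stop words and single-character tokens, keeping the original order."""
    stop = set(stopwords)
    return [t for t in tokens if len(t) > 1 and t not in stop]


# Hypothetical pre-segmented tokens and stop words, for illustration only.
tokens = ["Harry", "and", "Ron", "a", "Harry", "said", "Hermione", "Harry"]
stopwords = ["and", "said"]

kept = filter_tokens(tokens, stopwords)

# Frequency table, most frequent first (the role value_counts plays above).
counts = Counter(kept).most_common()
print(counts)
```

Building the stop-word set once and testing membership per token avoids the repeated set construction that a list comprehension over `set(...)` would otherwise incur.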

After the first segmentation, most words came out fine, but a few names were still segmented incorrectly: tokens such as 'Bree', 'Ron said', 'Voldemort', 'Snape', and 'Ground said' appeared, and 'Umbridge' and 'Hogwarts' were each split into two words.

    # Adjust the dictionary: add missing words and delete wrong ones
    jieba.add_word('Hogwarts', 100, 'n')
    jieba.add_word('Umbridge', 100, 'nr')
    jieba.add_word('Dumbledore', 100, 'nr')
    jieba.add_word('la tonks', 100, 'nr')
    jieba.add_word('Voldemort', 100, 'nr')
    jieba.del_word('Ron said')
    jieba.del_word('to say')
    jieba.del_word('snake')

    # Second segmentation
    bookwords = words_cut(book)
    nameswords = [i for i in bookwords if i in set(renming)]
    bookwords_count = pd.Series(bookwords).value_counts().sort_values(ascending=False)
    nameswords_count = pd.Series(nameswords).value_counts().sort_values(ascending=False)
    bookwords_count[:100].index

After the second segmentation, the mistakes from the first pass have been corrected, so we can move on to the statistical analysis.

    # Bar chart of the 15 most frequent words
    bar = Bar('', background_color='white', title_pos='center', title_text_size=20)
    x = bookwords_count[:15].index.tolist()
    y = bookwords_count[:15].values.tolist()
    bar.add('', x, y, xaxis_interval=0, xaxis_rotate=30, is_label_show=True)
    bar

Harry, Hermione, Ron, Dumbledore, wand, magic, Malfoy, Snape, and Sirius are among the 15 most frequent words in the book.

Stringing these words together gives a rough idea of what Harry Potter is about: accompanied by his friends Hermione and Ron, and with the help and training of the great wizard Dumbledore, Harry uses his wand and magic to knock out the big boss, Lord Voldemort. Of course, the actual story is far more wonderful than that.

    # Bar chart of the 20 most frequently mentioned characters
    bar = Bar('TOP20', background_color='white', title_pos='center', title_text_size=20)
    x = nameswords_count[:20].index.tolist()
    y = nameswords_count[:20].values.tolist()
    bar.add('', x, y, xaxis_interval=0, xaxis_rotate=30, is_label_show=True)
    bar

In terms of appearances across the whole book, Harry is unassailable as the main character, with over 13,000 more mentions than Hermione in second place. That is perfectly normal, given that the book is called Harry Potter, not Hermione Granger.

    # Word cloud of the whole book
    name = bookwords_count.index.tolist()
    value = bookwords_count.values.tolist()
    wc = WordCloud(background_color='white')
    wc.add("", name, value, word_size_range=[10, 200], shape='diamond')
    wc




    # Build the character relationship data
    names = {}          # name -> number of appearances
    relationships = {}  # name -> {other name -> number of shared lines}
    lineNames = []      # names found in each line of the book
    name_set = set(nameswords)
    with codecs.open('Harry Potter.txt', 'r', 'utf8') as f:
        n = 0
        for line in f.readlines():
            n += 1
            print('Processing line {}'.format(n))
            poss = pseg.cut(line)
            lineNames.append([])
            for w in poss:
                if w.word in name_set:
                    lineNames[-1].append(w.word)
                    if names.get(w.word) is None:
                        names[w.word] = 0
                        relationships[w.word] = {}
                    names[w.word] += 1

    # Two names that appear in the same line count as one co-occurrence
    for line in lineNames:
        for name1 in line:
            for name2 in line:
                if name1 == name2:
                    continue
                if relationships[name1].get(name2) is None:
                    relationships[name1][name2] = 1
                else:
                    relationships[name1][name2] = relationships[name1][name2] + 1

    # Node and edge tables; only edges with weight above 3 are kept
    node = pd.DataFrame(columns=['Id', 'Label', 'Weight'])
    edge = pd.DataFrame(columns=['Source', 'Target', 'Weight'])
    for name, times in names.items():
        node.loc[len(node)] = [name, name, times]
    for name, edges in relationships.items():
        for v, w in edges.items():
            if w > 3:
                edge.loc[len(edge)] = [name, v, w]
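The co-occurrence counting above can be checked on a small input without reading the book. This is a minimal sketch, assuming the names per line have already been extracted; `count_cooccurrences`, `to_edges`, and the sample lines are hypothetical and only mirror the nested loops and the weight threshold used in the code above.

```python
from collections import defaultdict


def count_cooccurrences(line_names):
    """For each ordered pair of distinct names, count how often both occur in one line."""
    relationships = defaultdict(lambda: defaultdict(int))
    for names_in_line in line_names:
        for name1 in names_in_line:
            for name2 in names_in_line:
                if name1 != name2:
                    relationships[name1][name2] += 1
    return relationships


def to_edges(relationships, min_weight):
    """Keep only the pairs whose count is strictly greater than min_weight."""
    return [
        (source, target, weight)
        for source, targets in relationships.items()
        for target, weight in targets.items()
        if weight > min_weight
    ]


# Hypothetical per-line name lists, for illustration only.
line_names = [["Harry", "Ron"], ["Harry", "Hermione", "Ron"], [], ["Harry"]]

rel = count_cooccurrences(line_names)
edges = to_edges(rel, min_weight=1)
print(dict(rel["Harry"]))
print(edges)
```

Because every pair is counted in both directions, each relationship produces two rows of equal weight, which is harmless when the graph is later treated as undirected.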

After processing, we found that the same character appeared under different names, so we merged those aliases, which left 88 nodes.

    # Merge the different names of the same character
    aliases = {'Harry': 'Harry Potter', 'Potter': 'Harry Potter', 'Albus': 'Dumbledore'}
    for old, new in aliases.items():
        node.loc[node['Id'] == old, 'Id'] = new
        node.loc[node['Label'] == old, 'Label'] = new
        edge.loc[edge['Source'] == old, 'Source'] = new
        edge.loc[edge['Target'] == old, 'Target'] = new

    # Sum the appearances of merged nodes; Id and Label stay as columns so they are written to the CSV
    node['Weight'] = node['Weight'].astype(int)
    nresult = node.groupby(['Id', 'Label'], as_index=False)['Weight'].sum().sort_values('Weight', ascending=False)
    eresult = edge.sort_values('Weight', ascending=False)
    nresult.to_csv('node.csv', index=False)
    eresult.to_csv('edge.csv', index=False)
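One side effect of renaming aliases is that the edge table can end up with several rows for the same Source and Target, and with rows whose Source and Target are now the same character. The sketch below shows one possible way to sum such rows and drop the self-loops. It is an optional extra step that the code above does not perform, and `ALIASES`, `canonical`, `merge_edges`, and the sample edges are all hypothetical.

```python
from collections import Counter

# Hypothetical alias table, for illustration only.
ALIASES = {"Harry": "Harry Potter", "Potter": "Harry Potter", "Albus": "Dumbledore"}


def canonical(name):
    """Map an alias to its canonical name; other names are returned unchanged."""
    return ALIASES.get(name, name)


def merge_edges(edges):
    """Rename both endpoints, sum the weights of duplicate edges, drop self-loops."""
    merged = Counter()
    for source, target, weight in edges:
        s, t = canonical(source), canonical(target)
        if s != t:  # an edge between two aliases of one character is a self-loop
            merged[(s, t)] += weight
    return merged


# Hypothetical (source, target, weight) rows, for illustration only.
edges = [
    ("Harry", "Ron", 5),
    ("Potter", "Ron", 4),
    ("Harry", "Potter", 7),
    ("Albus", "Harry", 6),
]

merged = merge_edges(edges)
for (source, target), weight in merged.most_common():
    print(source, target, weight)
```

Whether to pre-aggregate like this or leave the duplicate rows to the graph tool is a design choice; summing up front keeps edge.csv consistent with the merged node weights.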

With the node and edge files ready, we used Gephi to analyze the relationships between the characters in Harry Potter:

(Node size indicates how often a character appears; line thickness indicates the strength of the relationship between two characters.)
