Python Chinese Community

The spiritual tribe of Python Chinese developers around the world



Natural language processing (NLP) has recently become a major branch of machine learning, and it poses many challenges: how to segment words, recognize entities, extract the relationships between entities, display relationship networks, and so on.

I did a natural language analysis with Jieba + Word2Vec + NetworkX. The corpus is Jin Yong's novel Heaven Sword and Dragon Sabre. Many people have analyzed Jin Yong's wuxia novels before, so I hoped to bring something different. Here are a few screenshots:

Similarity diagram linking all characters.

The same relationships as above, displayed in a polycentric layout.

Network diagram centered on Zhang Wuji's different identities.

The main distinguishing points of this analysis are:

1. Word2Vec similarity scores are used as edge weights in the later social-network stage.

2. NetworkX is used for analysis and presentation.
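A minimal sketch of the first point, with made-up names and scores (in the real pipeline the weights come from the trained Word2Vec model):

```python
# Sketch: Word2Vec similarity scores become edge weights in the social
# network. The names and scores below are invented for illustration.
similarity = {
    ("Zhang Wuji", "Zhou Zhiruo"): 0.61,
    ("Zhang Wuji", "Xie Xun"): 0.55,
    ("Zhang Wuji", "Zhu Yuanzhang"): 0.12,
}

def to_weighted_edges(sim, threshold=0.3):
    """Keep only pairs similar enough to count as a relationship."""
    return [(a, b, w) for (a, b), w in sim.items() if w >= threshold]

edges = to_weighted_edges(similarity)
# Only the two high-similarity pairs survive the threshold.
```

Filtering by a threshold keeps the later network diagram readable; the cutoff value is a judgment call.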

Combining these two methods can significantly reduce the time spent reading articles in daily work. With machine learning, entity information can be extracted from articles semi-automatically from beginning to end, saving a great deal of time and cost. This has uses in many work scenarios; interested readers are welcome to contact me about collaboration.

Let's start with what Word2Vec + NetworkX can find.


I. Analysis results

Different attributes of the same entity (Zhang Wuji's many aliases)

Zhang Wuji, Wuji, Master Zhang, Brother Wuji, Young Master Zhang: the same Zhang Wuji has multiple identities, each identity is used by different people, and each shows a different similarity profile.

First look at the picture:

"Brother Wuji" is too intimate a form of address, and few people use it; its closest similarities are mostly with unfamiliar characters.

"Wuji" is what peers or elders call him once the relationship is close, for example Miss Zhou, Miss Yin, and so on.

"Zhang Wuji" is the common name that everyone can use, and it is the one most closely connected to his other aliases.

"Young Master Zhang" is a polite, respectful form of address, used for example by the Yellow-Shirted Maiden, the Prince of Ruyang, and so on.

"Master Zhang" is a respectful but distant title, sometimes even a hostile one, as used for example by Zhu Yuanzhang.

Note:

1. The graph is drawn by NetworkX based on Word2Vec results; the descriptions above are my own manual analysis.

2. Zhao Min does not appear in the network diagram above. Word2Vec found the similarity between Zhang Wuji and Zhao Min to be rather low, which I did not expect. Thinking back to reading the book, the two of them getting together did feel somewhat abrupt. Presumably, the two marry in the novel, but in the real world such a relationship would be more precarious.

II. Implementation process

Main steps:

Prepare the corpus

  1. The text file of the novel Heaven Sword and Dragon Sabre
  2. Custom segmentation dictionary (character names from the novel; about 180 are available online)
  3. Stop-word list

Prepare the tools

  1. Python: Pandas, NumPy, SciPy
  2. Jieba (Chinese word segmentation)
  3. Word2Vec (word-embedding tool, used to compute the similarity between words)
  4. NetworkX (network-graph tool, used to display complex network relationships)

Data preprocessing

  1. Convert the text file to UTF-8 (pandas)
  2. Split the text file into sentences (Jieba)
  3. For each sentence, run word segmentation and part-of-speech tagging, keeping mainly names (Jieba)
  4. Update the custom dictionary and re-segment (this cycle repeats several times until the result is satisfactory)
  5. A small amount of manual deletion (the misrecognition rate for segmented names is low, but errors do occur; for example, "Zhao Min smiled and said" can be mis-segmented as a person named "Zhao Minxiao". This part has to be done by hand; unless a better, or trainable, word-segmentation tool appears, the problem will remain.)
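The frequency-filtering part of this cleanup can be sketched in pure Python. The sentences and candidate names below are toy stand-ins for the novel's text:

```python
# Sketch of the name-filtering idea: count how often each candidate name
# appears in the raw sentences and drop rare ones, which are often
# segmentation errors. Toy data for illustration only.
sentences = ["张无忌笑道", "赵敏看着张无忌", "赵敏笑道"]
candidates = ["张无忌", "赵敏", "赵敏笑"]   # "赵敏笑" is a segmentation misfire

def check_nshow(name, sents):
    """Total number of occurrences of `name` across all sentences."""
    return sum(s.count(name) for s in sents)

kept = [c for c in candidates if check_nshow(c, sentences) >= 2]
# "赵敏笑" appears only once, so it is dropped.
```

Frequency filtering catches many segmentation misfires automatically, but as the note above says, a final manual pass is still needed.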

Train the Word2Vec model. The model can compute the similarity between any two characters.

  1. 300 dimensions
  2. Filter out words occurring fewer than 20 times
  3. Sliding window of 20
  4. Downsampling: 0.001
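The "similarity" Word2Vec reports is the cosine of the angle between two learned word vectors. A minimal sketch, with toy 3-dimensional vectors standing in for the real 300-dimensional ones:

```python
import numpy as np

# Word2Vec similarity = cosine similarity between word vectors:
# 1.0 means identical direction, values near 0 mean unrelated.
def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0, 1.0])   # toy vector for one character
b = np.array([0.9, 0.1, 1.0])   # a very similar character
c = np.array([-1.0, 0.5, 0.0])  # a dissimilar one
```

This is why the scores can be used directly as edge weights later: they are bounded and comparable across pairs.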

Generate the entity-relationship matrix.

  1. I couldn't find an existing library for this, so I wrote one myself.
  2. The matrix is N x N, where N is the number of names.
  3. The Word2Vec model above is used to populate the entity-relationship matrix.
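The matrix fill can be sketched as follows. Here `toy_similarity` is a placeholder standing in for `model.wv.similarity(a, b)`, and the names are made up:

```python
import numpy as np

# Sketch of the N x N entity-relationship matrix fill. The matrix is
# symmetric, so we fill the upper triangle and mirror it below.
names = ["ZhangWuji", "ZhaoMin", "XieXun"]

def toy_similarity(a, b):
    return 1.0 if a == b else 0.5   # placeholder score

n = len(names)
ER = np.zeros((n, n), dtype=np.float32)
for i in range(n):
    for j in range(i, n):
        ER[i, j] = ER[j, i] = toy_similarity(names[i], names[j])
```

Exploiting the symmetry halves the number of similarity lookups, which matters when N is in the hundreds.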

NetworkX generates the network diagram

  1. Each node is a character's name.
  2. Each edge is a line between two nodes, representing the relationship between the two characters.

III. Partial code implementation (space is limited; for the complete code, follow the public account "Programming Dog" and reply 0321)

Initialization

  

import numpy as np
import pandas as pd
import jieba
import jieba.posseg as posseg
%matplotlib inline

Data segmentation and cleaning

renming_file = "yttlj_renming.csv"
jieba.load_userdict(renming_file)

stop_words_file = "stopwordshagongdakuozhan.txt"
stop_words = pd.read_csv(stop_words_file, header=None, quoting=3, sep="\t")[0].values

corpus = "yttlj.txt"
yttlj = pd.read_csv(corpus, encoding="gb18030", header=None, names=["sentence"])

def cut_join(s):
    # Segment a sentence, then drop stop words and single-character tokens
    new_s = list(jieba.cut(s, cut_all=False))
    stop_words_extra = set([""])
    for seg in new_s:
        if len(seg) == 1:
            stop_words_extra.add(seg)
    new_s = set(new_s) - set(stop_words) - stop_words_extra
    result = ",".join(new_s)
    return result

def extract_name(s):
    # Part-of-speech tagging; collect word/flag pairs longer than one character
    new_s = posseg.cut(s)
    words = []
    flags = []
    for k, v in new_s:
        if len(k) > 1:
            words.append(k)
            flags.append(v)
    full_wf["word"].extend(words)
    full_wf["flag"].extend(flags)
    return len(words)

def check_nshow(x):
    nshow = yttlj["sentence"].str.count(x).sum()
    return nshow

# extract names & filter by frequency
full_wf = {"word": [], "flag": []}
possible_name = yttlj["sentence"].apply(extract_name)

df_wf = pd.DataFrame(full_wf)
df_wf_renming = df_wf[(df_wf.flag == "nr")].drop_duplicates()
df_wf_renming.to_csv("tmp_renming.csv", index=False)
df_wf_renming = pd.read_csv("tmp_renming.csv")
df_wf_renming.head()

df_wf_renming["nshow"] = df_wf_renming.word.apply(check_nshow)
df_wf_renming[df_wf_renming.nshow > 20].to_csv("tmp_filtered_renming.csv", index=False)
df_wf_renming[df_wf_renming.nshow > 20].shape

df_wf_renming = pd.read_csv("tmp_filtered_renming.csv")
my_renming = df_wf_renming.word.tolist()
external_renming = pd.read_csv(renming_file, header=None)[0].tolist()
combined_renming = set(my_renming) | set(external_renming)
pd.DataFrame(list(combined_renming)).to_csv("combined_renming.csv", header=None, index=False)

combined_renming_file = "combined_renming.csv"
jieba.load_userdict(combined_renming_file)

# tokenizing
yttlj["token"] = yttlj["sentence"].apply(cut_join)
yttlj["token"].to_csv("tmp_yttlj.csv", header=False, index=False)
sentences = yttlj["token"].str.split(",").tolist()

Word2Vec vectorization training

  

# Set values for various parameters
num_features = 300    # Word vector dimensionality
min_word_count = 20   # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 20          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

# Initialize and train the model (this will take some time)
from gensim.models import word2vec

model_file_name = 'yttlj_model.txt'
#sentences = w2v.LineSentence('cut_jttlj.csv')
model = word2vec.Word2Vec(sentences, workers=num_workers,
                          size=num_features, min_count=min_word_count,
                          window=context, sample=downsampling)
model.save(model_file_name)

Build the entity-relationship matrix

entity = pd.read_csv(combined_renming_file, header=None, index_col=None)
entity = entity.rename(columns={0: "Name"})
entity = entity.set_index(["Name"], drop=False)

ER = pd.DataFrame(np.zeros((entity.shape[0], entity.shape[0]), dtype=np.float32),
                  index=entity["Name"], columns=entity["Name"])
ER["tmp"] = entity.Name

def check_nshow(x):
    nshow = yttlj["sentence"].str.count(x).sum()
    return nshow

ER["nshow"] = ER["tmp"].apply(check_nshow)
ER = ER.drop(["tmp"], axis=1)

count = 0
for i in entity["Name"].tolist():
    count += 1
    if count % round(entity.shape[0] / 10) == 0:
        print("{0:.1f}% relationship has been checked".format(100 * count / entity.shape[0]))
    elif count == entity.shape[0]:
        print("{0:.1f}% relationship has been checked".format(100 * count / entity.shape[0]))
    for j in entity["Name"]:
        relation = 0
        try:
            relation = model.wv.similarity(i, j)
            ER.loc[i, j] = relation
            if i != j:
                ER.loc[j, i] = relation
        except:
            relation = 0

ER.to_hdf("ER.h5", "ER")

NetworkX displays the character network

  

import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pygraphviz
from networkx.drawing.nx_agraph import graphviz_layout
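The excerpt above stops at the imports. A minimal sketch of the graph-building step, with made-up edge weights (in the real pipeline they come from the ER matrix); `spring_layout` is used here so the example runs without pygraphviz:

```python
import networkx as nx

# Similarity scores (invented for illustration) become edge weights;
# pairs below a threshold are dropped to keep the diagram readable.
edges = [("ZhangWuji", "ZhaoMin", 0.4),
         ("ZhangWuji", "XieXun", 0.6),
         ("ZhaoMin", "ZhuYuanzhang", 0.1)]

G = nx.Graph()
for a, b, w in edges:
    if w > 0.3:
        G.add_edge(a, b, weight=w)

pos = nx.spring_layout(G, seed=1)   # graphviz_layout(G) needs pygraphviz
# nx.draw(G, pos, with_labels=True,
#         width=[5 * G[u][v]["weight"] for u, v in G.edges()])
```

Scaling the drawn line width by the edge weight is what makes stronger relationships stand out in the screenshots earlier in the article.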


About the author

Yong Wang, Python Chinese Community columnist; Xueqiu (Snowball) ID: Happy Dad. Currently interested in business analysis, Python, machine learning, and Kaggle. 17 years of project-management experience: 11 years as a project manager for contract delivery in telecommunications, and 6 years in manufacturing project management (PMO, change management, production transfer, liquidation, and asset disposal). MBA, PMI-PBA, PMP.


