Scraping Detective Conan danmaku with Python and sorting out the main plot with Gephi

Introduction: if you were born in the '90s or '00s, what was in your childhood?

Was it the Famicom, World of Warcraft, Fantasy Westward Journey, CrossFire, League of Legends, Plants vs. Zombies, Super Mario, Minesweeper and other games? Or playing marbles in the street, flipping picture cards, and hopscotch? Or a bunch of friends glued to the TV for Home With Kids or the endless summer reruns of My Fair Princess? Either way, I believe plenty of you could not do without comics!! As a kid, a whole afternoon in the bookstore reading comics was never enough.

Conan is this editor's favorite comic character.

Conan’s classic catchphrases:

1. Conan Edogawa, a detective

“Who on earth are you?” is the question the culprit or a bystander always asks after Conan exposes the culprit or helps solve the case. Conan then strikes a cool pose and, with the corners of his mouth slightly raised, says “Conan Edogawa, a detective.” It is a pity that Kogoro Mouri, after being hit by so many tranquilizer needles, has never asked “Who on earth are you?”

2. So that’s it!

Conan usually cannot solve a case easily; he runs into a bottleneck first. At that point someone's casual action or remark gives him a flash of insight, his glasses glint, and he says “So that's it!” Many viewers are still confused at this point, wondering how Conan has figured it out yet again.

3. So the culprit can only be…

In most cases Conan has worked out the trick but lacks conclusive evidence and cannot yet name the real culprit. After a “careful” investigation he always finds the key clue and evidence, deduces the culprit's true identity from them, and then says “So the culprit can only be…”.

4. There is only one truth

“There is only one truth” is one of the most famous lines in Detective Conan. It is used at the beginning of almost every episode, and Conan always says it with conviction. In real life, many fans imitate Conan and say “There is only one truth!”.

5. Ah-le-le

Apart from “There is only one truth”, Conan's most memorable catchphrase is “Ah-le-le”. Whenever Conan finds a clue at a crime scene, he deliberately says “Ah-le-le” in a childish voice to draw others' attention to it. In a recent episode, after turning back into Shinichi Kudo, he accidentally says “Ah-le-le” while solving the case, which shows how deeply the phrase has stuck with him.

OK ~ that's enough reminiscing, let's get down to business 😎👌

One, introduction to the crawl

Capturing network requests in the Chrome browser shows that the site's danmaku (bullet comments) are stored as XML documents, as shown below (3,000 real-time danmaku in total).

The URL is: comment.bilibili.com/183362119.x…

The number 183362119 is the video's unique ID; changing the number retrieves the danmaku file of another video. Open the video page for episode 1 and view its source code, as shown below.

It is easy to see that cid is the ID of each video, so it can be extracted with the re module.
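The extraction step can be sketched on its own. This is a minimal illustration, assuming the page source contains `"cid":<digits>` pairs; the sample string and the helper name `extract_cids` are made up for the example and are not taken from the real page.

```python
import re

# Hypothetical fragment of an episode page's source. The real page embeds
# many more fields, and its exact layout is an assumption here.
sample_source = (
    '{"aid":1,"cid":183362119,"title":"1"},'
    '{"aid":2,"cid":183362120,"title":"2"}'
)


def extract_cids(page_source):
    """Return every numeric value that follows a "cid": key, in page order."""
    return re.findall(r'"cid":(\d+)', page_source)


cids = extract_cids(sample_source)
print(cids)

# Each ID can then be substituted into the comment URL pattern used by the crawler.
urls = ['http://comment.bilibili.com/{}.xml'.format(cid) for cid in cids]
print(urls[0])
```

`re.findall` returns the captured group as strings, which is convenient because the IDs are only ever formatted back into a URL.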

The complete crawl code is as follows

import os
import re

import requests
from bs4 import BeautifulSoup as BS

# Folder that will hold one danmaku file per episode
path = 'C:/Users/dell/Desktop/Conan'
if not os.path.exists(path):
    os.makedirs(path)
os.chdir(path)


def gethtml(url, header):
    r = requests.get(url, headers=header)
    r.encoding = 'utf-8'
    return r.text


def crawl_comments(r_text):
    # r_text is the source of the episode page, already fetched by the caller
    pat = r'"cid":(\d+)'
    chapter_total = re.findall(pat, r_text)[1:-2]
    count = 1
    for chapter in chapter_total:
        url_base = 'http://comment.bilibili.com/{}.xml'.format(chapter)
        txt2 = gethtml(url_base, header)  # header is the module-level dict defined below
        soup = BS(txt2, 'lxml')
        all_d = soup.find_all('d')  # each <d> element holds one danmaku
        with open('{}.txt'.format(count), 'w', encoding='utf-8') as f:
            for d in all_d:
                f.write(d.get_text() + '\n')
        print('Episode {}: danmaku written'.format(count))
        count += 1


if __name__ == '__main__':
    url = 'https://www.bilibili.com/bangumi/play/ep321808'
    header = {'User-Agent': 'Opera/12.80 (Windows NT 5.1; U; en) Presto/2.10.289 Version/12.02'}
    r_text = gethtml(url, header)
    crawl_comments(r_text)

All the resulting danmaku files end up in the “Conan” folder on the desktop.
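The parsing step of the crawl can be tried offline. The sketch below uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it runs without extra packages; the `<i>` root element, the `p` attribute values, and the sample texts are assumptions made up for illustration, and only the fact that each comment sits in a `<d>` element comes from the crawl code.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a danmaku file. Only the <d> elements are taken
# from the article; the root tag and the attribute values are illustrative.
sample_xml = (
    '<i>'
    '<d p="12.5,1,25,16777215">There is only one truth</d>'
    '<d p="40.0,1,25,16777215">Ah-le-le</d>'
    '</i>'
)


def danmaku_texts(xml_text):
    """Return the text of every <d> element in document order."""
    root = ET.fromstring(xml_text)
    return [d.text or '' for d in root.iter('d')]


texts = danmaku_texts(sample_xml)
print(texts)
```

Writing one text per line, as the crawler does, keeps each comment as a separate "sentence" for the later co-occurrence step.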

Note: a total of 980 danmaku files were retrieved here. [After episode 941 the list jumps straight to episode 994 (premium members only). Although the series has been updated to episode 1032, there are not actually 1,032 episodes, as shown below.]

Two, danmaku visualization

I. Total number of discussions of the main characters

(1) Counting the totals

Note: role.txt is the file of main character names. Danmaku rarely use a character's full name and mostly use nicknames, so the file needs to include them; otherwise the counts may differ greatly from reality.

import os

import jieba
import pandas as pd

os.chdir('C:/Users/dell/Desktop')
jieba.load_userdict('role.txt')
role = [i.replace('\n', '') for i in open('role.txt', 'r', encoding='utf-8').readlines()]
txt_all = os.listdir('./Conan/')
txt_all.sort(key=lambda x: int(x.split('.')[0]))  # sort the files by episode number


def role_count():
    df = pd.DataFrame()
    count = 1
    for chapter in txt_all:
        names = {}
        with open('./Conan/{}'.format(chapter), 'r', encoding='utf-8') as f:
            for line in f.readlines():
                poss = jieba.cut(line)
                for word in poss:
                    if word in role:
                        if names.get(word) is None:
                            names[word] = 0
                        names[word] += 1
        # one column per episode, one row per character name
        df_new = pd.DataFrame.from_dict(names, orient='index', columns=['{}'.format(count)])
        df = pd.concat([df, df_new], axis=1)
        print('Episode {} counted'.format(count))
        count += 1
    # transpose so that rows are episodes; name the index column 'episode'
    # because the plotting code reads it back with set_index('episode')
    df.T.to_csv('role_count.csv', encoding='gb18030', index_label='episode')


role_count()
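Because viewers mostly use nicknames, one possible refinement is to map every nickname to a canonical character before counting. The sketch below is only an illustration of that idea: the alias table is hypothetical (the article's role.txt is not shown), and the tokens are supplied directly, whereas the article obtains them from `jieba.cut`.

```python
from collections import Counter

# Hypothetical alias table: nickname -> canonical character name.
# The entries are illustrative only and do not come from role.txt.
ALIASES = {
    'Conan': 'Conan',
    'Shinichi': 'Shinichi',
    'Kudo': 'Shinichi',
    'Haibara': 'Haibara',
    'Ai-chan': 'Haibara',
}


def count_roles(tokenized_lines, aliases=ALIASES):
    """Count mentions per canonical character across tokenized danmaku lines."""
    counts = Counter()
    for tokens in tokenized_lines:
        for token in tokens:
            name = aliases.get(token)
            if name is not None:
                counts[name] += 1
    return counts


# Tokens are given directly here; in the article jieba.cut produces them.
episode = [
    ['Ai-chan', 'is', 'so', 'cute'],
    ['Kudo', 'and', 'Shinichi', 'are', 'the', 'same', 'person'],
    ['Haibara', 'knows', 'Conan'],
]
result = count_roles(episode)
print(dict(result))
```

Note the difference from the listing above, which counts each entry of role.txt as its own column; merging aliases is a design choice that trades nickname-level detail for per-character totals.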

(2) Visualization

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['kaiti']  # a font that can render Chinese labels
plt.style.use('ggplot')
df = pd.read_csv('role_count.csv', encoding='gbk')
df = df.fillna(0).set_index('episode')
plt.figure(figsize=(10, 5))
# total mentions per character over all episodes, largest first
role_sum = df.sum().to_frame().sort_values(by=0, ascending=False)
g = sns.barplot(x=role_sum.index, y=role_sum[0], palette='Set3', alpha=0.8)
index = np.arange(len(role_sum))
for name, count in zip(index, role_sum[0]):
    # write the total above each bar
    g.text(name, count + 50, int(count), ha='center', va='bottom')
plt.title('Discussion counts of the main characters')
plt.ylabel('Number of discussions')
plt.show()

Although Conan is a grade-schooler, he sometimes turns back into Shinichi, and the show is more than an endless cycle of finding and catching culprits. Next, let's use the data to pick out some of the best episodes.

II. Statistics on the episodes in which Conan turns back into Shinichi

Since Shinichi also appears in flashbacks in some episodes, the discussion threshold is set to 250 to reduce bias, and the distribution is plotted as follows

The discussion results and episode titles are shown in the table below

Interested readers can note these down: apart from episode 235, all of them are episodes in which Conan turns back into Shinichi.

The relevant code is as follows:

df = pd.read_csv('role_count.csv', encoding='gbk')
df = df.fillna(0).set_index('episode')
# 'Shinichi' stands for the column name the character has in role.txt
xinyi = df[df['Shinichi'] >= 250]['Shinichi'].to_frame()
print(xinyi)  # episodes in which Shinichi is discussed at least 250 times
plt.figure(figsize=(10, 5))
plt.plot(df.index, df['Shinichi'], label='Shinichi', color='blue', alpha=0.6)
plt.annotate('episode: 50, discussions: 309', xy=(50, 309), xytext=(40, 330),
             arrowprops=dict(color='red', headwidth=8, headlength=8))
# The two remaining hard-coded labels are unreadable in the source listing.
# As a stand-in (an assumption), the highest peak is labeled from the data.
peak_ep, peak_n = df['Shinichi'].idxmax(), df['Shinichi'].max()
plt.annotate('episode: {}, discussions: {}'.format(peak_ep, int(peak_n)),
             xy=(peak_ep, peak_n), xytext=(peak_ep + 13, peak_n - 80),
             arrowprops=dict(color='red', headwidth=8, headlength=8))
plt.hlines(xmin=df.index.min(), xmax=df.index.max(), y=250, linestyles='--', colors='red')
plt.legend(loc='best', frameon=False)
plt.xlabel('Episode')
plt.ylabel('Number of discussions')
plt.title('Discussion counts for Shinichi Kudo')
plt.show()

The word cloud for episode 572, the most discussed episode, is drawn as follows (the high-frequency word “Shinichi” is removed so that it does not drown out other information):

As the figure shows, high-frequency words include plastic surgery, clothing, voice, love and so on. (It seems the culprit had plastic surgery to look like Shinichi before committing the crime, and the Shinichi and Ran romance plays a part too; worth a watch.)
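The article does not show its word cloud code. The frequency step behind such a figure can be sketched as follows, assuming the danmaku have already been segmented into tokens; the function name, the `min_length` filter, and the sample tokens are all illustrative assumptions.

```python
from collections import Counter


def word_frequencies(tokens, exclude=(), min_length=2):
    """Count tokens, skipping excluded words and very short tokens."""
    excluded = set(exclude)
    return Counter(
        t for t in tokens
        if t not in excluded and len(t) >= min_length
    )


# Made-up tokens standing in for one episode's segmented danmaku.
tokens = ['Shinichi', 'voice', 'Shinichi', 'love', 'voice',
          'Shinichi', 'a', 'clothing', 'voice']
freq = word_frequencies(tokens, exclude=['Shinichi'])
print(freq.most_common(1))
```

Dropping the dominant word before counting is what lets the remaining words share the visual space; a word-to-count mapping like `freq` is the kind of input word cloud libraries generally accept.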

III. Content analysis of the main-plot episodes

The main plot revolves around the members of the organization (Gin, Vodka and Vermouth); their distribution is as follows:

plt.figure(figsize=(10, 5))
# column names must match the names used in role.txt
names = ['Gin', 'Vodka', 'Vermouth']
colors = ['#090707', '#004e66', '#EC7357']
alphas = [0.8, 0.7, 0.6]
for name, color, alpha in zip(names, colors, alphas):
    plt.plot(df.index, df[name], label=name, color=color, alpha=alpha)
plt.legend(loc='best', frameon=False)
# label Vermouth's most discussed episode
plt.annotate('episode: {}, discussions: {}'.format(df['Vermouth'].idxmax(), int(df['Vermouth'].max())),
             xy=(df['Vermouth'].idxmax(), df['Vermouth'].max()),
             xytext=(df['Vermouth'].idxmax() + 30, df['Vermouth'].max()),
             arrowprops=dict(color='red', headwidth=8, headlength=8))
plt.xlabel('Episode')
plt.ylabel('Number of discussions')
plt.title('Discussion counts of the organization members')
plt.hlines(xmin=df.index.min(), xmax=df.index.max(), y=200, linestyles='--', colors='red')
plt.ylim(0, 400)
plt.show()
# Main-plot episodes: any member discussed at least 200 times.
# The column names in this line are garbled in the source; using all three members is an assumption.
mainline = set()
for name in names:
    mainline |= set(df[df[name] >= 200].index)
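The last step, collecting every episode in which at least one organization member passes the threshold, can be sketched without pandas. The nested dictionary layout and all counts are made up for illustration, except Vermouth's 379 discussions in episode 375, which the article reports.

```python
def episodes_over_threshold(counts_by_role, roles, threshold):
    """Return the sorted episodes where any listed role reaches the threshold."""
    selected = set()
    for role in roles:
        for episode, n in counts_by_role.get(role, {}).items():
            if n >= threshold:
                selected.add(episode)
    return sorted(selected)


# Illustrative counts: role -> {episode number: discussion count}.
counts_by_role = {
    'Gin': {176: 210, 345: 260, 375: 190},
    'Vodka': {176: 80, 345: 120},
    'Vermouth': {345: 300, 375: 379, 500: 150},
}
mainline = episodes_over_threshold(counts_by_role, ['Gin', 'Vodka', 'Vermouth'], 200)
print(mainline)
```

Using a set union means an episode that several members dominate is listed only once.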

The analysis above shows that the organization members' curves move largely in step. Vermouth (“Beijie”) is the most popular of the three, especially in episode 375 (from the “Confrontation with the Black Organization” series), where she is discussed 379 times. In addition, the episodes discussed more than 200 times were collected, with the following results:

A word cloud for episode 375, the most discussed episode, is drawn as follows (the high-frequency word “Beijie” is removed so that it does not drown out other information)

As the figure shows, words such as angel, Gin, godmother, heartache and sniper appear frequently. Judging from lower-frequency words such as “failure”, the operation of the “winery” (the fans' nickname for the organization) appears to have ended in failure.

Three, network analysis of character portraits

I. Merge TXT files

The goal is to capture how the audience describes a character as fully as possible. Since each episode has 3,000 danmaku, only the 20 episodes in which the chosen character is discussed most are selected for the analysis below, which keeps the computing cost down.

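import os
import pandas as pd

df = pd.read_csv('role_count.csv', encoding='gbk')
df = df.fillna(0).set_index('episode')
mergefiledir = 'C:/Users/dell/Desktop/Conan'
# The expression after "df." is cut off in the source listing. Taking the 20 most
# discussed episodes is an assumption based on the text above, and 'Haibara'
# stands for whichever column name role.txt uses for the character.
huiyuan_ep = list(df['Haibara'].sort_values(ascending=False).index[:20])
file = open('txt_all.txt', 'w', encoding='UTF-8')
count = 0
for filename in huiyuan_ep:
    filepath = mergefiledir + '/' + str(filename) + '.txt'
    for line in open(filepath, encoding='UTF-8'):
        file.writelines(line)
    file.write('\n')
    count += 1
    print('File {} merged'.format(count))
file.close()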
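The selection of the most discussed episodes can be illustrated on its own with the standard library. This is a sketch under the assumption that one character's per-episode counts are available as a dictionary; the helper name and the numbers are made up.

```python
import heapq


def top_episodes(counts, n=20):
    """Return the n episode numbers with the highest counts, highest first."""
    return heapq.nlargest(n, counts, key=counts.get)


# Made-up per-episode discussion counts for one character.
counts = {1: 12, 129: 340, 176: 95, 178: 410, 345: 280}
top3 = top_episodes(counts, n=3)
print(top3)
```

`heapq.nlargest` avoids sorting the whole mapping, although with fewer than a thousand episodes a plain sort would be just as practical.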

II. Character portrait visualization

Following the idea of a co-occurrence matrix, each time two specified words appear in the same sentence their count increases by 1. (Note: stopwords.txt is the stop-word file and role.txt is the character nickname file.)

import codecs
import csv
import jieba
linesName=[]
names={}
relationship={}
jieba.load_userdict('role.txt')
txt=[ line.strip() for line in open('stopwords.txt','r',encoding='utf-8')]
name_list=[ i.replace('\n','') for i in open('role.txt','r',encoding='utf-8').readlines()]
 
def base(path):
    with codecs.open(path,'r','UTF-8') as f:
        for line in f.readlines():
            line=line.strip()  # drop the line ending (\n or \r\n) so it is not counted as a word
            poss = jieba.cut(line)
            linesName.append([])
            for word in poss:  
                if word in txt:
                    continue
                linesName[-1].append(word)
                if names.get(word) is None:
                    names[word]=0
                    relationship[word]={}
                names[word]+=1
    return linesName,relationship
 
def relationships(linesName,relationship,name_list):          
    for line in linesName:
        for name1 in line:
            if name1 in name_list:
                for name2 in line:
                    if name1==name2:
                        continue
                    if relationship[name1].get(name2) is None:
                        relationship[name1][name2]=1
                    else:
                        relationship[name1][name2]+=1
    return relationship
 
def write_csv(relationship):
    csv_writer2=open('edges.csv','w',encoding='gb18030')
    writer=csv.writer(csv_writer2,delimiter=',',lineterminator='\n')
    writer.writerow(['Source','Target','Weight'])
    for name,edges in relationship.items():
        for k,v in edges.items():
            if v>10:
                writer.writerow([name,k,v])
    csv_writer2.close()    
 
if __name__=='__main__':
    linesName,relationship=base('txt_all.txt')
    data=relationships(linesName,relationship,name_list)
    write_csv(data) 
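The counting rule used by the listing can be checked on a tiny example. The sketch below assumes the lines are already tokenized and stop words are already removed; the function name and sample data are illustrative.

```python
from collections import defaultdict


def cooccurrence(tokenized_lines, names):
    """For each character name, count the other words sharing a line with it."""
    names = set(names)
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in tokenized_lines:
        for name in tokens:
            if name not in names:
                continue
            for other in tokens:
                if other != name:
                    graph[name][other] += 1
    return graph


lines = [
    ['Haibara', 'cute'],
    ['Haibara', 'cute', 'smart'],
    ['Conan', 'smart'],
]
graph = cooccurrence(lines, ['Haibara', 'Conan'])

# Flatten into (Source, Target, Weight) rows, the column layout written to edges.csv.
edges = [(s, t, w) for s, targets in graph.items() for t, w in targets.items()]
print(edges)
```

Only character names act as sources, while any remaining word can be a target, which is why the resulting graph reads as a set of descriptions attached to each character.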

Import the generated edges.csv file into Gephi to get the following character portrait

The thicker the line, the more pronounced that trait of the character. It is easy to see that viewers' comments on Ai-chan (Ai Haibara) are mainly “greasy”, “cute” and “heartache”.

The END:

Well, that's all for this hands-on Python project! Remember to give the editor a triple (like, coin, favorite) 🤞😘

Updates will keep coming every day, and your support is my biggest motivation 🍒!!

The Python source code is also shared via private message.
