This is the 7th day of my participation in the August More Text Challenge

Preface

  • A word cloud is a graphical display of the words that occur most frequently in a text, and it is a common text-mining technique. Many data analysis tools can produce one, including Matlab, SPSS, SAS, R and Python, and there are also many online pages that generate word clouds.
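
To make "words with high frequency" concrete, here is a minimal sketch of the counting step that a word cloud is built on. The helper name `top_words` and the sample tokens are made up for illustration and are not part of the original script.

```python
from collections import Counter


def top_words(tokens, n=3):
    """Return the n most frequent tokens together with their counts."""
    return Counter(tokens).most_common(n)


# Hypothetical, already-tokenized sample text (not real bullet-comment data)
sample_tokens = "peppa pig peppa grandpa pig peppa".split()
print(top_words(sample_tokens))
```

A word cloud then draws each word at a size that grows with its count.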

1. Website analysis and source data acquisition: bullet comments from the Bilibili (Station B) video "What is Peppa?"

  • Analysis
    1. First, the figure shows that this video has 882 bullet comments.
    2. Next, press F12 to open the browser's developer tools and locate the source data.
      • Note: Bilibili exposes a ready-made interface for bullet-comment data, so you only need to find the CID value of the corresponding video.
      • There are 882 bullet comments in total, and cid=72342029.
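
Since the CID is the only thing that changes, the interface address can be built from it. The sketch below assumes the URL pattern seen in this post (`http://comment.bilibili.com/<cid>.xml`) holds for other videos too, which is not guaranteed. The helper name `danmaku_url` is hypothetical.

```python
def danmaku_url(cid):
    """Build the bullet-comment XML address for a CID.

    Assumption: the pattern is inferred from the single URL used in this post.
    """
    return f"http://comment.bilibili.com/{cid}.xml"


print(danmaku_url(72342029))
```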

2. Data capture: fetch the data while bypassing the interface that requires cookies

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup  # parses HTML and XML documents
import numpy as np
import pandas as pd
import requests

# Fetch data

# Target URL
url = 'http://comment.bilibili.com/72342029.xml'
# Send a GET request to the target URL
html = requests.get(url).content
html_data = str(html, 'utf-8')
soup = BeautifulSoup(html_data, 'lxml')
# Each bullet comment is stored in a <d> tag
results = soup.find_all('d')

comments = [comment.text for comment in results]
comments_dict = {'comments': comments}

# Save the comments to a CSV file
df = pd.DataFrame(comments_dict)
df.to_csv('bilibili.csv', encoding='utf-8')

All 882 bullet comments were fetched successfully.
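
The extraction step can also be tried offline, without calling the interface. The sketch below uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, and a small hand-written sample. The only thing taken from the real response is that each comment is the text of a `<d>` element; the root element name and the sample comments are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only the <d> elements mirror the real response
SAMPLE_XML = """<i>
  <d>first comment</d>
  <d>second comment</d>
  <d></d>
</i>"""


def extract_comments(xml_text):
    """Return the text of every non-empty <d> element in the document."""
    root = ET.fromstring(xml_text)
    return [d.text for d in root.iter('d') if d.text]


print(extract_comments(SAMPLE_XML))
```

Skipping empty elements here avoids blank rows in the CSV later.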

3. Data visualization

from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import jieba

# Segment the bullet comments with jieba, then generate the word cloud
# Read the CSV with its header row so the column name is not treated as a comment
df = pd.read_csv('bilibili.csv')

# Join tokens with spaces so words from adjacent comments do not run together
words = []
for line in df['comments'].dropna().astype(str):
    words.extend(jieba.cut(line, cut_all=False))
text = ' '.join(words)

# background_image = plt.imread('peiqi_1.jpg')
# Load the mask image as an array
background_image = np.array(Image.open('peiqi_1.jpg'))
wc = WordCloud(
    background_color='white',
    mask=background_image,
    font_path=r'C:\Windows\Fonts\simhei.ttf',  # raw string keeps the backslashes literal
    max_words=2000,
    max_font_size=80,
    random_state=30,
)

wc.generate_from_text(text)
# Inspect the most frequent words to spot useless ones that should be removed
process_word = wc.process_text(text)
top_words = sorted(process_word.items(), key=lambda e: e[1], reverse=True)
print(top_words[:50])
# Recolor the cloud using the colors of the mask image
img_colors = ImageColorGenerator(background_image)
wc.recolor(color_func=img_colors)
plt.imshow(wc)
plt.axis('off')
wc.to_file('peggy.jpg')
print('Word cloud saved successfully!')
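
The script prints the top 50 words so that useless ones can be spotted, but it does not show how to remove them. One possible approach, sketched below, is to filter the frequency dictionary by hand. The helper name `drop_stopwords`, the sample counts and the stop-word set are all hypothetical.

```python
def drop_stopwords(freqs, stopwords):
    """Return a copy of freqs without stop words or blank tokens."""
    return {
        word: count
        for word, count in freqs.items()
        if word.strip() and word not in stopwords
    }


# Hypothetical frequencies in the shape returned by process_text (word -> count)
sample_freqs = {'peppa': 12, 'the': 30, 'ha': 25, 'grandpa': 7, ' ': 3}
cleaned = drop_stopwords(sample_freqs, stopwords={'the', 'ha'})
print(cleaned)
```

The cleaned dictionary could then be passed to `wc.generate_from_frequencies(cleaned)` in place of `generate_from_text`. Alternatively, the `stopwords` argument of `WordCloud` filters words during text processing.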