Hello, everyone, I am the talented brother.

Recently, the web drama “Yunnan Bug Valley” from the Ghost Blows Out the Light series went online. It picks up where “Longling Fan Caves” left off, the original “Iron Triangle” cast is back, and netizens say it is very good!

Today we will use Python to crawl all the comments on the currently released episodes (including the trailers) and run some statistics and visual analysis on the show. Let's follow the netizens and watch along!

This article walks through the crawler, the data processing, and the visualization in detail, with a bit of fun along the way!

1. Web page analysis

All the comments come from Tencent Video (after all, it is the only platform airing the show).

Open the “Yunnan Bug Valley” play page and press F12 to enter developer mode. Scroll down and click “view more comments” a few times, and you can find the real address of the comment request.

Let's collect several of these comment API addresses and compare them to find the pattern:

https://video.coral.qq.com/varticle/7313247714/comment/v2?callback=_varticle7313247714commentv2&orinum=10&oriorder=o&pageflag=1&cursor=6838089132036599025&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=132&_=1630752996851
https://video.coral.qq.com/varticle/7313247714/comment/v2?callback=_varticle7313247714commentv2&orinum=100&oriorder=o&pageflag=1&cursor=6838127093335586287&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=132&_=1630752996850
https://video.coral.qq.com/varticle/7313258351/comment/v2?callback=_varticle7313258351commentv2&orinum=10&oriorder=o&pageflag=1&cursor=6838101562707822837&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=132&_=1630753165406

Finally, we find that the address can be simplified to the following:

url = f'https://video.coral.qq.com/varticle/{comment_id}/comment/v2?'
params = {
    'orinum': 30,
    'cursor': cursor,
    'oriorder': 't'
}

The meanings of the four parameters are as follows:

  • orinum is the number of comments per request; the default is 10 and the maximum I found by testing is 30, so I set it to 30 here
  • cursor is the starting comment ID for each request; it can start at 0, and every subsequent request uses the ID of the last comment returned by the previous request (see the sketch after this list)
  • oriorder is the order of the requested comments ('t' means chronological; the default is hottest first)
  • Besides these three there is actually a fourth parameter, comment_id; every episode has its own id for fetching its comments, so we need to find out where to get that id
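To confirm these parameters behave as described, here is a minimal sketch of a single paginated request. The comment_id is one of the varticle ids captured above, and the response fields are the ones the crawler code below reads; treat this as a sketch rather than a guaranteed contract.

import requests

# Minimal sketch of one paginated request (comment_id taken from the captured URLs above)
comment_id = 7313247714
cursor = 0  # 0 fetches the first page; later pages reuse data['last'] from the previous response
url = f'https://video.coral.qq.com/varticle/{comment_id}/comment/v2?'
params = {'orinum': 30, 'cursor': cursor, 'oriorder': 't'}

data = requests.get(url, params=params).json()['data']
print(len(data['oriCommList']))  # number of comments returned (up to 30)
print(data['last'])              # cursor to pass in the next request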

As just mentioned, we need the comment ID for each episode, and it turns out we can get it by requesting each episode's page.

Note that we use requests to fetch each episode page directly; the HTML returned this way is not identical to what the browser renders, but the data we need is still in it.

For example, the list of episode ids would look like this:

And here is where each episode's comment ID is located:

Then we can extract both with re regular expressions.
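As a quick illustration of the kind of extraction used later (the exact patterns appear in section 2.3), here is a toy snippet; the HTML string is made up for the example:

import re

# Toy page-source fragment (made up); the real episode pages contain the same key-value pattern
html = '{"comment_id":"7313247714","vid":"abc123"}'
comment_id = re.findall(r'"comment_id":"(\d+)"', html)[0]
print(comment_id)  # 7313247714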

2. Crawler process

Based on the page analysis and our collection requirements, the whole process breaks down into the following parts (a small driver sketch tying them together follows the list):

  • Crawl the episode page data
  • Parse to get episode ids and episode review ids
  • Collect all episode reviews
  • Save data locally
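Putting the four steps together, a minimal driver might look like the sketch below, using the functions defined in the following subsections. The video_id value is a placeholder for the cover id in the show's play-page URL; the original post does not show this glue code.

if __name__ == '__main__':
    video_id = 'your_cover_id_here'       # placeholder: the cover id from https://v.qq.com/x/cover/<cover_id>.html
    data_df = get_comment_ids(video_id)   # steps 1-2: episode ids and episode comment ids
    get_comment_content(data_df)          # steps 3-4: collect every page of comments; save_csv() writes them locally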

2.1. Import the required libraries

import requests
import re
import pandas as pd
import os

2.2. Crawl the episode page data

# used to crawl episode page data
def get_html(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36",
    }

    r = requests.get(url, headers=headers)
    # Fix garbled encoding
    r.encoding = r.apparent_encoding
    text = r.text
    # Strip whitespace characters so the regexes below match across the whole page
    html = re.sub(r'\s', '', text)

    return html

2.3. Parse episode ids and episode review ids

# Pass in the show's cover id; used to crawl the episode ids and episode comment ids
def get_comment_ids(video_id):
    # Play page address
    url = f'https://v.qq.com/x/cover/{video_id}.html'
    html = get_html(url)
    data_list = eval(re.findall(r'"vip_ids":(\[.*?\])', html)[0])
    data_df = pd.DataFrame(data_list)

    comment_ids = []
    for tid in data_df.V:
        # Address of each episode
        url = f'https://v.qq.com/x/cover/{video_id}/{tid}.html'
        html = get_html(url)
        comment_id = eval(re.findall(r'"comment_id":"(\d+)"', html)[0])
        comment_ids.append(comment_id)

    data_df['comment_id'] = comment_ids
    data_df['show'] = range(1, len(comment_ids)+1)
    return data_df

2.4. Collect all episode reviews

# Get comments for all episodes
def get_comment_content(data_df):
    for i, comment_id in enumerate(data_df.comment_id):
        i = i + 1
        # Initial cursor
        cursor = 0
        num = 0
        while True:
            url = f'https://video.coral.qq.com/varticle/{comment_id}/comment/v2?'
            params = {
                'orinum': 30,
                'cursor': cursor,
                'oriorder': 't'
            }
            r = requests.get(url, params=params)
            data = r.json()
            data = data['data']
            if len(data['oriCommList']) == 0:
                break
            # Comment data
            data_content = pd.DataFrame(data['oriCommList'])
            data_content = data_content[['id', 'targetid', 'parent', 'time', 'userid', 'content', 'up']]
            # Commenter info
            userinfo = pd.DataFrame(data['userList']).T
            userinfo = userinfo[['userid', 'nick', 'head', 'gender', 'hwlevel']].reset_index(drop=True)
            # Merge comment information with commenter information
            data_content = data_content.merge(userinfo, how='left')
            # Unix timestamp to datetime, shifted +8 hours to Beijing time
            data_content.time = pd.to_datetime(data_content.time, unit='s') + pd.Timedelta(days=8/24)
            data_content['show'] = i
            data_content.id = data_content.id.astype('string')
            save_csv(data_content)
            # Next cursor
            cursor = data['last']
            num = num + 1
            pages = data['oritotal'] // 30 + 1
            print(f'Episode {i}: page {num}/{pages} of comments collected!')
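The loop above assumes every request succeeds. In practice the comment API may occasionally time out or fail, so a small retry helper could be dropped in wherever requests.get is called; this is purely an optional addition, not part of the original code.

import time
import requests

def get_json_with_retry(url, params, retries=3, delay=1):
    # Optional helper (not in the original post): retry a comment-API request with a short pause
    for _ in range(retries):
        try:
            r = requests.get(url, params=params, timeout=10)
            r.raise_for_status()
            return r.json()
        except requests.RequestException:
            time.sleep(delay)
    return None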

2.5. Save data to the local PC

# Save comment data locally
def save_csv(df):
    file_name = 'Comment data.csv'
    if os.path.exists(file_name):
        df.to_csv(file_name, mode='a', header=False,
                  index=None, encoding='utf_8_sig')
    else:
        df.to_csv(file_name, index=None, encoding='utf_8_sig')
    print('Data saved!')

3. Data statistics and visual display

The statistics and visualization methods used here follow the approach of my earlier posts on the topic.

3.1. Data preview
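Before previewing, the comment data saved earlier needs to be loaded back into a DataFrame. The original post skips this step, so here is a minimal sketch assuming the CSV written by save_csv:

import pandas as pd

df = pd.read_csv('Comment data.csv')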

Let's randomly sample 5 rows:

df.sample(5)

Then check the overall data info:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35758 entries, 0 to 35757
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   id        35758 non-null  int64
 1   targetid  35758 non-null  int64
 2   parent    35758 non-null  int64
 3   time      35758 non-null  object
 4   userid    35758 non-null  int64
 5   content   35735 non-null  object
 6   up        35758 non-null  int64
 7   nick      35758 non-null  object
 8   head      35758 non-null  object
 9   gender    35758 non-null  int64
 10  hwlevel   35758 non-null  int64
 11  show      35758 non-null  int64
dtypes: int64(8), object(4)
memory usage: 3.3+ MB

The talented brother (yours truly) also left a comment; let's see whether it was collected.

My userid is 1296690233. Looking it up, we find that my VIP level is actually 6.

df.query('userid==1296690233')

The head field should be the avatar URL; let's check whether it really is:

from skimage import io
# display avatar
img_url = df.query('userid==1296690233')['head'].iloc[0]
image = io.imread(img_url)
io.imshow(image)

Got it, got it!!

3.2. Number of comments per episode

The visualizations in this article use pandas_bokeh, which plots interactive Bokeh charts directly from Pandas objects.

import pandas as pd
import pandas_bokeh

pandas_bokeh.output_notebook()
pd.set_option('plotting.backend', 'pandas_bokeh')

Now the actual statistics and visualization can begin.

from bokeh.transform import linear_cmap
from bokeh.palettes import Spectral
from bokeh.io import curdoc
# curdoc().theme = 'caliber'

episode_comment_num = df.groupby('show')['id'].nunique().to_frame('Number of comments')
y = episode_comment_num['Number of comments']
mapper = linear_cmap(field_name='Number of comments', palette=Spectral[11], low=min(y), high=max(y))
episode_bar = episode_comment_num.plot_bokeh.bar(
    ylabel="Number of comments", 
    title="Number of comments per episode", 
    color=mapper,
    alpha=0.8,
    legend=False    
)

As we can see, episode 1 has by far the most comments, around 17,000, roughly half of all comments, followed by episode 7, which aired this week!
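As a quick sanity check on that claim, episode 1's share can be computed from the frame built above (a small sketch, assuming the episode_comment_num DataFrame from the previous block):

ep1_share = episode_comment_num.loc[1, 'Number of comments'] / episode_comment_num['Number of comments'].sum()
print(f'Episode 1 share of all comments: {ep1_share:.1%}')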

3.3. Number of comments by date

df['date'] = pd.to_datetime(df.time).dt.date
date_comment_num = df.groupby('date')['id'].nunique().to_frame('Number of comments')
date_comment_num.index = date_comment_num.index.astype('string')

y = date_comment_num['Number of comments']
mapper = linear_cmap(field_name='Number of comments', palette=Spectral[11], low=min(y), high=max(y))
date_bar = date_comment_num.plot_bokeh.bar(
    ylabel="Number of comments", 
    title="Number of comments by date", 
    color=mapper,
    alpha=0.8,
    legend=False    
)

The show premiered on August 30, when members could watch five episodes on the first day; as a member, I watched them all in one sitting. The number of comments in the first two days after the premiere is noticeably higher, and it also rises on the days when new episodes are released (1-3 per week).

3.4. Number of comments by hour

df['time'] = pd.to_datetime(df.time).dt.hour
date_comment_num = pd.pivot_table(df,
                                  values='id',
                                  index=['time'],
                                  columns=['date'],
                                  aggfunc='count'
                                 )
time_line = date_comment_num.plot_bokeh(kind="line",
                            legend="top_left",
                            title="Number of comments by hour"
                           )

From the hourly comment curve, we find that the count peaks at 8 o'clock on premiere day; after that it follows typical TV-watching behavior, with higher counts at noon, in the evening, and around midnight.

3.5. Distribution of commenter VIP levels

vip_comment_num = df.groupby('hwlevel').agg(**{'Number of users': ('userid', 'nunique'),
                                               'Number of comments': ('id', 'nunique')})
vip_comment_num['Comments per capita'] = round(vip_comment_num['Number of comments']/vip_comment_num['Number of users'], 2)
usernum_pie = vip_comment_num.plot_bokeh.pie(
    y="Number of users",
    colormap=Spectral[9],
    title="Commenter VIP level distribution"
)

Have to say, most commenters are VIP users; no wonder Tencent Video keeps rolling out things like early access for VIPs and then extra paid access on top of VIP...

Is there a difference in the number of comments per user across VIP levels?

y = vip_comment_num['Comments per capita']
mapper = linear_cmap(field_name='Comments per capita', palette=Spectral[11], low=min(y), high=max(y))
vipmean_bar = vip_comment_num.plot_bokeh.bar(
    y = 'Comments per capita',
    ylabel="Comments per capita", 
    title="Number of comments per VIP user", 
    color=mapper,
    alpha=0.8,
    legend=False    
)

Roughly speaking, the higher the VIP level, the more comments per user! But why?

3.6. Comment length

Most netizens leave short comments such as “666” or “looks great”; mine, for example, was just a few words asking for the next update. So how many characters does a typical comment have?

import numpy as np

df['Comment Length'] = df['content'].str.len()
df['Comment Length'] = df['Comment Length'].fillna(0).astype('int')

contentlen_hist = df.plot_bokeh.hist(
    y='Comment Length',
    ylabel="Number of comments", 
    bins=np.linspace(0, 100, 26),
    vertical_xlabel=True,
    hovertool=False,
    title="Comment length histogram",
    color='red',
    line_color="white",
    legend=False,
    # normed=100,
    )

Let's take a look at the longest comments:

(df.sort_values(by='Comment Length',ascending=False)
 [['show', 'content', 'Comment Length', 'nick', 'hwlevel']].head(3)
 .style.hide_index()
)

I mean, is that a plagiarized review, or is it really brilliant?

3.7. Number of likes on comments

Let’s take a look at the most liked ones

# pd.set_option('display.max_colwidth',1000)
(df.sort_values(by='up',ascending=False)
 [['show', 'content', 'up', 'nick', 'hwlevel']].head()
 .style.hide_index()
)

“Read the map and don't get lost!” Is that a meme? Over 8,000 likes!

3.8. Users with the most comments

user_comment_num = df.groupby('userid').agg(**{'Number of comments': ('id', 'nunique'),
                                               'VIP level': ('hwlevel', 'max')}
                                           ).reset_index()
user_comment_num.sort_values(by='Number of comments', ascending=False).head()
userid      Number of comments  VIP level
640014751   33                  5
1368145091  24                  1
1214181910  18                  3
1305770517  17                  2
1015445833  14                  5

33 comments, this user is impressive!! Let's see what they said:

df.query('userid==640014751')[['nick', 'show', 'time', 'content']].sort_values(by='time')

A bit boring: they were basically just spamming praise!! Let's look at the friend with the second-most comments.

df.query('userid==1368145091')[['nick', 'show', 'time', 'content']].sort_values(by='time')

I've got to say, these look normal. A bit chatty, though, haha!

What do these two users' avatars look like? If you're curious, have a look:

from skimage import io
# display avatar
img_url = df.query('userid==640014751')['head'].iloc[0]
image = io.imread(img_url)
io.imshow(image)

from skimage import io
# display avatar
img_url = df.query('userid==1368145091')['head'].iloc[0]
image = io.imread(img_url)
io.imshow(image)

Ahem, I won’t make a judgment, after all, my avatar and nickname are also very…

3.9. Comment word cloud

This part follows the approach of an earlier post; we will build a word cloud for all comments plus one for each of the three leads.

Let’s take a look at the number of mentions of our three main characters

df.fillna(' ', inplace=True)

hu = ['khufu', 'Hu Bayi', 'Pan Yueming', 'hu', 'pan']
yang = ['Zhang Yuqi', 'Shirley', '杨']
wang = ['Jiang Chao', 'fat']

df_hu = df[df['content'].str.contains('|'.join(hu))]
df_yang = df[df['content'].str.contains('|'.join(yang))]
df_wang = df[df['content'].str.contains('|'.join(wang))]

df_star = pd.DataFrame({'role': ['Hu Bayi', 'Shirley Yang', 'Fat Wang'],
                        'weight': [len(df_hu), len(df_yang), len(df_wang)]
                       })

y = df_star['weight']
mapper = linear_cmap(field_name='weight', palette=Spectral[11], low=min(y), high=max(y))
df_star_bar = df_star.plot_bokeh.bar(
    x='role',
    y='weight',
    ylabel="Mention weight", 
    title="Main character mention weights", 
    color=mapper,
    alpha=0.8,
    legend=False    
)

Wang Pangzi is the comic relief; you have to admit he gets plenty of screen time and plenty of mentions!!

The whole word cloud

Hu Bayi word cloud

Shirley Yang word cloud

Wang Pangzi word cloud

Word cloud core code

import os   
import stylecloud
from PIL import Image
import jieba
import jieba.analyse
import pandas as pd
from wordcloud import STOPWORDS

def ciYun(data, addWords, stopWords):
    print('Mapping... ')
    comment_data = data

    for addWord in addWords:
        jieba.add_word(addWord)

    comment_after_split = jieba.cut(str(comment_data), cut_all=False)
    words = ' '.join(comment_after_split)

    # Word cloud stop words
    stopwords = STOPWORDS.copy()
    for stopWord in stopWords:
        stopwords.add(stopWord)

    # Generate the word cloud with the chosen style parameters
    stylecloud.gen_stylecloud(
                              text=words,
                              size=800,
                              palette='tableau.BlueRed_6',   # Set the color scheme
                              icon_name='fas fa-mountain',   # paper-plane mountain thumbs-up male fa-cloud
                              custom_stopwords=stopwords,
                              font_path='FZZJ-YGYTKJW.TTF'
                              # bg = bg, 
                              # font_path=font_path,  # word cloud font (Chinese needs a local Chinese font)
                             )

    print('Word cloud generated ~')
    pic_path = os.getcwd()
    print(f'The word cloud image has been saved in {pic_path}')

data = df.content.to_list()
addWords = ['Teacher Pan', 'Insect Valley of Yunnan', 'Degree of reduction', 'khufu', 'Hu Bayi', 'Pan Yueming',
            'Zhang Yuqi', 'Shirley', 'Shirley Yang', 'General Yang', 'Fat Wang', 'fat']
# Add stop words
stoptxt = pd.read_table(r'stop.txt', encoding='utf-8', header=None)
stoptxt.drop_duplicates(inplace=True)
stopWords = stoptxt[0].to_list()
words = ['said', 'years', 'VIP', 'true', 'this is', 'no', 'dry', 'like']
stopWords.extend(words)

# Run ~
ciYun(data, addWords, stopWords)
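The post shows the per-character clouds above but not the calls that generate them. A plausible sketch reuses the df_hu / df_yang / df_wang subsets built in section 3.9; it assumes stylecloud's default output file name (stylecloud.png), so each image is renamed between calls.

import os

# Sketch (not shown in the original post): per-character word clouds from the section 3.9 subsets.
# ciYun() writes stylecloud.png each time, so rename the file after each call to keep all three.
for name, sub_df in [('hu_bayi', df_hu), ('shirley_yang', df_yang), ('wang_pangzi', df_wang)]:
    ciYun(sub_df.content.to_list(), addWords, stopWords)
    os.rename('stylecloud.png', f'{name}_wordcloud.png')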