This is the 24th day of my participation in the November Gengwen (More Text) Challenge. See details of the event: The Last Gengwen Challenge 2021.

Preface

Using Python to visualize the Douban comment data for the TV series Dajianghe (Like a Flowing River). Without further ado, let's get started.

Development tools

Python version: 3.6.4

Related modules:

requests module;

proxy2808 module;

pandas module;

pyecharts module;

and some modules that come with Python.

Environment setup

Install Python and add it to the environment variables, then use pip to install the required modules.
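A typical install command would look like the following (package names are assumed from the module list above; the proxy client's actual package name should be checked against the vendor's docs):

```shell
# Install the third-party modules used in this article
pip install requests pandas pyecharts beautifulsoup4
# The 2808proxy client (package name may differ; check the vendor's documentation)
pip install proxy2808
```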

Because Douban's anti-scraping measures are fairly strict, 2808PROXY is used to provide the proxy service.

Scraping is almost impossible without a proxy.

Web page analysis

Although there are more than 20,000 comments, Douban only exposes 500 of them, even when logged in.

This time, only the data under the "all comments" and "bad reviews" tabs were collected, about 900 entries in total.

Next comes fetching each user's registration time.

900+ users means 900+ requests.

Without a proxy, that would absolutely be game over.

Getting the data

Partial code for fetching comments and user information.

import time
import requests
import proxy2808
from bs4 import BeautifulSoup

USERNAME = 'Username'
PASSWORD = 'password'

headers = {
    'Cookie': 'Your Cookie value',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}


def get_comments(page, proxy_url_secured):
    """Fetch one page of comments."""
    # All comments (sorted by popularity)
    url = 'https://movie.douban.com/subject/26797690/comments?start=' + str(page) + '&limit=20&sort=new_score&status=P'
    # Good reviews
    # url = 'https://movie.douban.com/subject/26797690/comments?start=' + str(page) + '&limit=20&sort=new_score&status=P&percent_type=h'
    # Neutral reviews
    # url = 'https://movie.douban.com/subject/26797690/comments?start=' + str(page) + '&limit=20&sort=new_score&status=P&percent_type=m'
    # Bad reviews
    # url = 'https://movie.douban.com/subject/26797690/comments?start=' + str(page) + '&limit=20&sort=new_score&status=P&percent_type=l'
    # Fetch through the 2808proxy proxy
    response = requests.get(url=url, headers=headers, proxies={'http': proxy_url_secured, 'https': proxy_url_secured})
    soup = BeautifulSoup(response.text, 'html.parser')
    for div in soup.find_all(class_='comment-item'):
        # Extract the short comment text (each comment item contains a 'short' span)
        comment = div.find(class_='short').get_text(strip=True)
        print(comment)
        time.sleep(3)  # throttle requests to avoid triggering the anti-scraping checks

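Paging works by stepping the `start` parameter in increments of 20. A standalone sketch of how the page URLs for a given comment count could be built (the base URL and query parameters are taken from the request code above):

```python
BASE = 'https://movie.douban.com/subject/26797690/comments'

def comment_page_urls(total, per_page=20, percent_type=''):
    """Build Douban's paged comment URLs: start=0, 20, 40, ..."""
    urls = []
    for start in range(0, total, per_page):
        url = (BASE + '?start=' + str(start)
               + '&limit=' + str(per_page) + '&sort=new_score&status=P')
        if percent_type:  # h = good, m = neutral, l = bad reviews
            url += '&percent_type=' + percent_type
        urls.append(url)
    return urls

print(len(comment_page_urls(500)))  # 25 pages cover the 500 visible comments
```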

All the data under the "all comments" tab was obtained (500 entries).

The red box shows the user’s registration time.

If all the comments could be crawled, the shill accounts could be caught red-handed.

My hunch is that the water army (paid posters) consists mostly of newly registered users…

Douban, however, doesn't give us that chance.

The data under the "bad reviews" tab was obtained (482 entries).

Look at the registration times of the users who left bad reviews.

Compared with the registration times of the users who left good reviews, it's a little interesting:

the bad-review accounts were registered relatively late.
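That comparison boils down to a per-year tally of join dates. A minimal sketch, assuming join dates were scraped as 'YYYY-MM-DD' strings (the sample dates below are made up):

```python
from collections import Counter

def year_counts(join_dates):
    """Tally registrations per year from 'YYYY-MM-DD' join dates."""
    return Counter(d[:4] for d in join_dates)

# Hypothetical samples: good-review vs. bad-review users
good = ['2012-05-01', '2013-07-12', '2012-09-30']
bad = ['2018-11-02', '2018-12-01', '2017-06-15']
print(year_counts(good))  # Counter({'2012': 2, '2013': 1})
print(year_counts(bad))   # Counter({'2018': 2, '2017': 1})
```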

Sentiment analysis

The sentiment analysis of the comments uses Baidu's natural language processing service.

The web API is used as the example here.

For the specifics, see the documentation on the official website; this is only a brief introduction.

Log in to the Baidu AI development platform with your Baidu account and create a new natural language processing project.

Obtain the API Key and Secret Key.

Call the sentiment analysis endpoint to get the sentiment result.

Part of the code

import urllib.request
import pandas
import json
import time


def get_access_token():
    """Get an Access Token for the Baidu AI platform."""
    # Fill in your own API Key and Secret Key
    host = 'https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id=[API Key]&client_secret=[Secret Key]'
    request = urllib.request.Request(host)
    request.add_header('Content-Type', 'application/json; charset=UTF-8')
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    rdata = json.loads(content)
    return rdata['access_token']


def sentiment_classify(text, acc):
    """Get the sentiment polarity of the text (negative, neutral, or positive).

    text: str, the text to analyze
    acc: str, the access token
    """
    raw = {"text": text}
    data = json.dumps(raw).encode('utf-8')
    # Sentiment analysis endpoint
    host = "https://aip.baidubce.com/rpc/2.0/nlp/v1/sentiment_classify?charset=UTF-8&access_token=" + acc
    request = urllib.request.Request(url=host, data=data)
    request.add_header('Content-Type', 'application/json')
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    rdata = json.loads(content)
    return rdata

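The items in the response carry a sentiment field (0 = negative, 1 = neutral, 2 = positive, matching the labels used below). A small helper for tallying a batch of results — the sample responses here are made up and trimmed to the relevant field:

```python
from collections import Counter

LABELS = {0: 'negative', 1: 'neutral', 2: 'positive'}

def tally_sentiments(results):
    """Count sentiment labels across a batch of Baidu NLP responses.

    Each response's 'items' list carries a 'sentiment' field:
    0 = negative, 1 = neutral, 2 = positive.
    """
    counts = Counter()
    for rdata in results:
        for item in rdata.get('items', []):
            counts[LABELS.get(item.get('sentiment'), 'unknown')] += 1
    return counts

# Hypothetical sample responses
sample = [{'items': [{'sentiment': 2}]},
          {'items': [{'sentiment': 0}]},
          {'items': [{'sentiment': 2}]}]
print(tally_sentiments(sample))  # Counter({'positive': 2, 'negative': 1})
```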

The sentiment analysis results are as follows.

Overall, 5-star ratings tend to be positive (2).

Of course, there are also some negative (0) results.

But it’s still within acceptable limits.

Reviews with 1-star ratings tend to have a negative sentiment.

I’ve circled the positive one in red, so you can see for yourself.

After all, a machine's recognition accuracy is limited; 100% accuracy is nearly impossible.

Data visualization

Comment date distribution

As the series aired, the overall trend stayed steady.

But the bad reviews fluctuate somewhat toward the end.

Comment time distribution

Most of the comments were made at night, which is in line with expectations.
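The time-of-day distribution can be computed by bucketing comment timestamps by hour. A standalone sketch; the 'YYYY-MM-DD HH:MM:SS' format is an assumption about the scraped data, and the sample timestamps are made up:

```python
from collections import Counter
from datetime import datetime

def hour_distribution(timestamps):
    """Bucket 'YYYY-MM-DD HH:MM:SS' comment timestamps by hour of day."""
    hours = (datetime.strptime(t, '%Y-%m-%d %H:%M:%S').hour for t in timestamps)
    return Counter(hours)

# Hypothetical sample timestamps
sample = ['2018-12-20 22:10:00', '2018-12-20 23:05:00', '2018-12-21 08:30:00']
print(hour_distribution(sample))  # Counter({22: 1, 23: 1, 8: 1})
```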

Review ratings

Among all comments, 5-star ratings make up the majority.

Among bad reviews, 1- and 2-star ratings make up the majority.

Comment sentiment analysis

“2” means positive, “1” means neutral, and “0” means negative.

Among all comments, positive results make up the majority.

Since all comments are sorted by the number of likes,

it seems that, for the show as a whole, viewers are fairly approving.

Comment user registration time

Generating comment word clouds

Positive-review word cloud

Negative-review word cloud
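A word cloud ultimately boils down to word frequencies. A minimal sketch of the counting step, with made-up English sample words — for real Chinese comments the text would first be segmented with a tokenizer such as jieba, and the drawing would typically use pyecharts' WordCloud chart:

```python
from collections import Counter

def word_frequencies(comments):
    """Count word frequencies across comments (whitespace-split for simplicity)."""
    counts = Counter()
    for comment in comments:
        counts.update(comment.split())
    return counts

# Hypothetical sample comments
sample = ['great acting great story', 'slow plot', 'great era drama']
print(word_frequencies(sample).most_common(1))  # [('great', 3)]
```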