Python crawls microblogging comments (no duplicate data)

  • Preface
  • First, the overall idea
  • Second, get the weibo address
    • 1. Get the Ajax address
    • 2. Analyze the weibo address in the page
    • 3. Obtain the weibo address of the specified user
  • Third, get the main comments
  • Fourth, get sub-comments
    • 1. Parsing sub-comments
    • 2. Get sub-comments
  • Fifth, main function call
    • 1. Import related libraries
    • 2. Main function execution
    • 3. The results
  • Final words

This article is for learning and communication only. Do not use it for illegal purposes!!

Preface

Some time ago, the comments under a certain diary posted on Weibo became sharply polarized. Out of curiosity, I wanted to do a simple analysis of the comments and the users behind them, so I found some relevant code online, tweaked the cookies and a few other parameters, and ran it.

It ran without a single error!! I was thrilled, happier than ever.

Then I started analyzing the results, and after a while I realized things were not so simple: the data was full of duplicates!!

Yes, the duplication rate was outrageous. And so my journey of exploration began.

The code builds on earlier work by other authors; my changes mainly solve the duplicate-data problem. If anything here infringes, please contact me!

First, the overall idea

I have drawn a very simple flow chart:



As for why the main comments and the sub-comments are fetched separately: this is exactly the key to solving the duplication problem. Testing shows that if you simply modify the page parameter according to the usual pattern, or crawl the weibo.cn pages instead, then after a certain number of comments (a few hundred or so) you either get no more data or start getting duplicates.

Second, get the weibo address

The request URL for accessing a weibo user's profile page is:

https://weibo.com/xxxxxxxxx?is_search=0&visible=0&is_all=1&is_tag=0&profile_ftype=1&page=1

The page number can be controlled through the page parameter, but attentive readers will notice that, besides the HTML loaded directly, the page also issues two additional dynamic requests via Ajax:

start_ajax_url1 = 'https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=%s&is_all=1&page={0}&pagebar=0&pl_name=Pl_Official_MyProfileFeed__20&id=%s&script_uri=/%s&pre_page={0}'%(domain,page_id,user)
start_ajax_url2 = 'https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=%s&is_all=1&page={0}&pagebar=1&pl_name=Pl_Official_MyProfileFeed__20&id=%s&script_uri=/%s&pre_page={0}'%(domain,page_id,user)

In other words, each page of data consists of three parts: the HTML loaded directly, plus the two chunks loaded by the Ajax requests above (pagebar=0 and pagebar=1).

1. Get the Ajax address

From the main interface, get the corresponding Ajax request address:

def get_ajax_url(user):
    url = 'https://weibo.com/%s?page=1&is_all=1' % user
    res = requests.get(url, headers=headers, cookies=cookies)
    html = res.text
    # extract page_id and domain from the CONFIG block in the page source
    page_id = re.findall(r"CONFIG\['page_id'\]='(.*?)'", html)[0]
    domain = re.findall(r"CONFIG\['domain'\]='(.*?)'", html)[0]
    start_ajax_url1 = 'https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=%s&is_all=1&page={0}&pagebar=0&pl_name=Pl_Official_MyProfileFeed__20&id=%s&script_uri=/%s&pre_page={0}' % (domain, page_id, user)
    start_ajax_url2 = 'https://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=%s&is_all=1&page={0}&pagebar=1&pl_name=Pl_Official_MyProfileFeed__20&id=%s&script_uri=/%s&pre_page={0}' % (domain, page_id, user)
    return start_ajax_url1, start_ajax_url2
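
As a quick sanity check, the two URL templates can be generated like this (the user id below is a made-up placeholder; substitute the one you want to crawl):

ajax_url1, ajax_url2 = get_ajax_url('1234567890')  # hypothetical user id, for illustration only
print(ajax_url1.format(1))  # Ajax request (pagebar=0) for page 1
print(ajax_url2.format(1))  # Ajax request (pagebar=1) for page 1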

2. Analyze the weibo address in the page

After sending the request, parse the weibo post addresses out of the page (the parsing is the same whether the response came from the main page request or from an Ajax request):

def parse_home_url(url):
    res = requests.get(url, headers=headers, cookies=cookies)
    response = res.content.decode().replace("\\", "")
    every_id = re.compile(r'name=(\d+)', re.S).findall(response)  # get the ids needed for the comment pages
    home_url = []
    for id in every_id:
        base_url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={}&from=singleWeiBo'
        url = base_url.format(id)
        home_url.append(url)
    return home_url

3. Obtain the weibo address of the specified user

By integrating the above two functions, we get:

def get_home_url(user, page):
    start_url = 'https://weibo.com/%s?page={}&is_all=1' % user
    start_ajax_url1, start_ajax_url2 = get_ajax_url(user)
    all_url = []
    for i in range(page):
        home_url = parse_home_url(start_url.format(i + 1))         # posts in the directly loaded page
        ajax_url1 = parse_home_url(start_ajax_url1.format(i + 1))  # posts loaded by the first Ajax request
        ajax_url2 = parse_home_url(start_ajax_url2.format(i + 1))  # posts loaded by the second Ajax request
        all_url += home_url + ajax_url1 + ajax_url2
        print('page %d parsed complete' % (i + 1))
    return all_url

The parameters are the user id and the number of pages to crawl, and the result is a list of the comment-API address of every weibo post.
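
For example, to collect the comment-API addresses for the first three pages of a user's posts (again with a made-up placeholder id):

all_urls = get_home_url('1234567890', 3)  # hypothetical user id, for illustration only
print('Got %d post addresses' % len(all_urls))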

Third, get the main comments

A simple analysis of the request data shows that the interface for obtaining weibo comments is as follows:

https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4498052401861557&root_comment_max_id=185022621492535&root_comment_max_id_type=0&root_comment_ext_param=&page=1&from=singleWeiBo

A conspicuous page parameter appears, and the request still seems to work after the other parameters are removed. Your first reaction is probably to just write a loop over page and fetch away, emmmm, and that is exactly how you fall back into the horror of duplicate data. The root_comment_max_id parameter, it turns out, is also important. Further analysis shows that the data returned by each request actually contains the address of the next request; all we need to do is extract it and keep going:
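
As a minimal sketch of that idea (assuming headers and cookies are configured as in the main function later, and using parse_comment_info defined just below):

url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4498052401861557&from=singleWeiBo'
while url != ' ':  # parse_comment_info returns ' ' when there is no next page
    data_json = requests.get(url, headers=headers, cookies=cookies).json()
    comment_info, url = parse_comment_info(data_json)  # extract the comments and the next request address
    # ... store comment_info somewhere ...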



The code is as follows:

def parse_comment_info(data_json):
    html = etree.HTML(data_json['data']['html'])
    name = html.xpath("//div[@class='list_li S_line1 clearfix']/div[@class='WB_face W_fl']/a/img/@alt")  # commenter screen names
    info = html.xpath("//div[@node-type='replywrap']/div[@class='WB_text']/text()")  # comment text
    info = "".join(info).replace(" ", "").split("\n")
    info.pop(0)
    comment_time = html.xpath("//div[@class='WB_from S_txt2']/text()")  # comment time
    name_url = html.xpath("//div[@class='WB_face W_fl']/a/@href")  # commenter home pages
    name_url = ["https:" + i for i in name_url]
    ids = html.xpath("//div[@node-type='root_comment']/@comment_id")
    try:
        next_url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&from=singleWeiBo&' + html.xpath('/html/body/div/div/div[%d]/@action-data' % (len(name) + 1))[0] + '&__rnd=' + str(int(time.time() * 1000))
    except:
        try:
            next_url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&from=singleWeiBo&' + html.xpath('/html/body/div/div/a/@action-data')[0] + '&__rnd=' + str(int(time.time() * 1000))
        except:
            next_url = ' '
    comment_info_list = []
    for i in range(len(name)):
        item = {}
        item["id"] = ids[i]
        item["name"] = name[i]  # store the commenter's screen name
        item["comment_info"] = info[i][1:]  # store the comment text
        item["comment_time"] = comment_time[i]  # store the comment time
        item["comment_url"] = name_url[i]  # store the commenter's home page
        try:
            action_data = html.xpath("/html/body/div/div/div[%d]//a[@action-type='click_more_child_comment_big']/@action-data" % (i + 1))[0]
            child_url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&from=singleWeiBo&' + action_data
            item["child_url"] = child_url
        except:
            item["child_url"] = ' '
        comment_info_list.append(item)
    return comment_info_list, next_url

The function takes the parsed JSON data and returns the comment data together with the address of the next request. The data format is as follows:



child_url is the address of the corresponding sub-comments, and is used later to fetch them.
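
To make the structure concrete, each element of the returned list is a dict roughly shaped like this (all values below are made-up placeholders):

example_item = {
    'id': '1234567890123456',            # comment id
    'name': 'some_screen_name',          # commenter's screen name
    'comment_info': 'the comment text',  # comment content
    'comment_time': 'April 10 12:00',    # comment time
    'comment_url': 'https://weibo.com/u/1234567890',  # commenter's home page
    'child_url': 'https://weibo.com/aj/v6/comment/big?ajwvr=6&from=singleWeiBo&...',  # ' ' if there are no replies
}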

Fourth, get sub-comments

Sub-comments are obtained in the same way as the main comments. After all the main comments have been collected, we iterate over the results and, whenever child_url is not empty (i.e. the comment has replies), request its sub-comments, as in the sketch below.
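
In code, the idea looks roughly like this (get_childcomment is defined in the next subsection; the same logic appears in the main function later):

for item in comment_info_list:
    if item['child_url'] != ' ':  # ' ' means the comment has no replies
        item['child'] = get_childcomment(item['child_url'])
    else:
        item['child'] = []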

1. Parsing sub-comments

def parse_comment_info_child(data_json):
    html = etree.HTML(data_json['data']['html'])
    name = html.xpath("//div[@class='list_li S_line1 clearfix']/div/div[1]/a[1]/text()")  # commenter screen names
    info = html.xpath("//div[@class='list_li S_line1 clearfix']/div/div[1]/text()")  # comment text
    info = "".join(info).replace(" ", "").split("\n")
    info.pop(0)
    comment_time = html.xpath("//div[@class='WB_from S_txt2']/text()")  # comment time
    name_url = html.xpath("//div[@class='list_li S_line1 clearfix']/div/div[1]/a[1]/@href")  # commenter home pages
    name_url = ["https:" + i for i in name_url]
    ids = html.xpath("//div[@class='list_li S_line1 clearfix']/@comment_id")
    try:
        next_url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&from=singleWeiBo&' + html.xpath('/html/body/div[%d]/div/a/@action-data' % (len(name) + 1))[0] + '&__rnd=' + str(int(time.time() * 1000))
    except:
        next_url = ' '
    comment_info_list = []
    for i in range(len(name)):
        item = {}
        item["id"] = ids[i]
        item["name"] = name[i]  # store the commenter's screen name
        item["comment_info"] = info[i][1:]  # store the comment text
        item["comment_time"] = comment_time[i]  # store the comment time
        item["comment_url"] = name_url[i]  # store the commenter's home page
        comment_info_list.append(item)
    return comment_info_list, next_url

2. Get sub-comments

Call the previous function to get the corresponding sub-comments:

def get_childcomment(url_child):
    print('Start getting sub-comments... ')
    comment_info_list = []
    res = requests.get(url_child, headers=headers, cookies=cookies)
    data_json = res.json()
    count = data_json['data']['count']
    comment_info, next_url = parse_comment_info_child(data_json)
    comment_info_list.extend(comment_info)
    print('%d has been obtained' % len(comment_info_list))
    while len(comment_info_list) < count:
        if next_url == ' ':
            break
        res = requests.get(next_url, headers=headers, cookies=cookies)
        data_json = res.json()
        comment_info, next_url = parse_comment_info_child(data_json)
        comment_info_list.extend(comment_info)
        print('%d has been obtained' % len(comment_info_list))
    return comment_info_list

The argument is child_url, and the corresponding sub-comments are returned.

Fifth, main function call

1. Import related libraries

import re
import time
import json
import urllib
import requests
from lxml import etree

2. Main function execution

if "__main__" == __name__:
    # Set parameters
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0',
        'Accept': '*/*',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Content-Type': 'application/x-www-form-urlencoded',
        'X-Requested-With': 'XMLHttpRequest',
        'Connection': 'keep-alive',
    }
    cookies = {}  # weibo cookies (copy them from a logged-in request in your own browser)
    userid = ' '  # the weibo user id to crawl
    page = 1      # number of pages to crawl
    # Start crawling
    all_urls = get_home_url(userid, page)
    for index in range(len(all_urls)):
        url = all_urls[index]
        print('Start getting comments of weibo %d... ' % (index + 1))
        comment_info_list = []
        res = requests.get(url, headers=headers, cookies=cookies)
        data_json = res.json()
        count = data_json['data']['count']
        comment_info, next_url = parse_comment_info(data_json)
        comment_info_list.extend(comment_info)
        print('%d has been obtained' % len(comment_info_list))
        while True:
            if next_url == ' ':
                break
            res = requests.get(next_url, headers=headers, cookies=cookies)
            data_json = res.json()
            comment_info, next_url = parse_comment_info(data_json)
            comment_info_list.extend(comment_info)
            print('%d has been obtained' % len(comment_info_list))
        for i in range(len(comment_info_list)):
            child_url = comment_info_list[i]['child_url']
            if child_url != ' ':
                comment_info_list[i]['child'] = get_childcomment(child_url)
            else:
                comment_info_list[i]['child'] = []
        with open('Article %d weibo comments.txt' % (index + 1), 'w') as f:
            f.write(json.dumps(comment_info_list))

3. The results

Comment data for 10 weibo posts was obtained, as follows:

Final words

Of course, there are still plenty of shortcomings: the speed is not ideal and the structure is a bit messy. I considered adding multithreading and other tricks to speed things up, but since everything requires being logged in, doing so carries some risk for your account, so be careful~
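
If you do want to experiment with speeding things up, one option is a small thread pool just for the sub-comment requests. This is only a sketch of the idea, not something the script above does, and an aggressive pool size will quickly get the account restricted:

from concurrent.futures import ThreadPoolExecutor

child_urls = [item['child_url'] for item in comment_info_list if item['child_url'] != ' ']
with ThreadPoolExecutor(max_workers=2) as pool:  # keep the pool small to limit the risk
    child_results = list(pool.map(get_childcomment, child_urls))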

Finally, thank you for reading with such patience. Raise your cute little hands and give it a like~ (◕ܫ◕)