Hello, I’m Ding Xiaojie. When you are bored at work, you probably want to pull out your phone and see what interesting topics are trending on the Weibo hot search list, but it is not always convenient to open Weibo and browse. Today I will share an interesting crawler that periodically collects the Weibo hot search list and its hot comments. Let’s look at the specific implementation.

Page analysis

Hot search page

Hot list page: https://s.weibo.com/top/summary?cate=realtimehot

The hot list front page contains 50 entries. From this page we need the rank, heat, title, and the link to each detail page. First open the page and log in, then press F12 to open the developer tools and Ctrl + R to refresh the page, and find the first packet. Record your own Cookie and User-Agent from it. To locate elements, use the browser’s developer tools (Copy XPath in Chrome) to get the XPath expression of the target tag.
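Before writing the full crawler, a quick sanity check is useful (a minimal sketch; the Cookie string below is a placeholder you must replace with your own). It confirms that the copied headers actually return the hot-list markup, whose container id pl_top_realtimehot is used in the XPath expressions later on:

import requests

# Placeholder values; replace them with the Cookie and User-Agent copied from your own browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36',
    'Cookie': 'your Cookie'
}

resp = requests.get('https://s.weibo.com/top/summary?cate=realtimehot', headers=headers)
print(resp.status_code)                      # 200 means the request went through
print('pl_top_realtimehot' in resp.text)     # True means the hot-list table is in the response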

Details page

For the details page, we need the comment time, user name, number of forwards, number of comments, number of likes, and the comment content. The method is basically the same as for the hot search page, so let’s see how to implement it in code!

Acquisition code

Start by importing the required modules.

import requests
from time import sleep
import pandas as pd
import numpy as np
from lxml import etree
import re

Define global variables.

  • headers: request headers
  • all_df: DataFrame that stores the collected data

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36',
    'Cookie': 'your Cookie'
}
all_df = pd.DataFrame(columns=['top', 'heat', 'title', 'Comment time', 'User name', 'Forwarding times', 'Number of comments', 'Likes', 'Comment content'])
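If you would rather not hard-code the Cookie in the script, one optional tweak (my own suggestion, not part of the original code; WEIBO_COOKIE is a hypothetical variable name) is to read it from an environment variable:

import os

# Hypothetical environment variable; set it to the Cookie copied from your browser before running
headers['Cookie'] = os.environ.get('WEIBO_COOKIE', headers['Cookie'])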

The hot search list collection code: send a request with requests, extract each detail page link, and then hand it off to get_detail_page.

def get_hot_list(url):
    '''
    :param url: Weibo hot search page link
    :return: None
    '''
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    tr_list = tree.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr')
    for tr in tr_list:
        parse_url = tr.xpath('./td[2]/a/@href')[0]
        detail_url = 'https://s.weibo.com' + parse_url
        title = tr.xpath('./td[2]/a/text()')[0]
        try:
            rank = tr.xpath('./td[1]/text()')[0]
            hot = tr.xpath('./td[2]/span/text()')[0]
        except:
            # The pinned entry at the top has no rank or heat value
            rank = 'top'
            hot = 'top'
        get_detail_page(detail_url, title, rank, hot)

get_detail_page parses the required data from the detail page and saves it to the global variable all_df. For each hot search only the first three hot comments are collected; if there are not enough hot comments, the missing ones are skipped.

def get_detail_page(detail_url, title, rank, hot):
    '''
    Parse the required page data according to the detail page link and save it to the global variable all_df
    :param detail_url: detail page link
    :param title: title
    :param rank: rank
    :param hot: heat
    :return: None
    '''
    global all_df
    try:
        page_text = requests.get(url=detail_url, headers=headers).text
    except:
        return None
    tree = etree.HTML(page_text)
    result_df = pd.DataFrame(columns=np.array(all_df.columns))
    # Crawl the first 3 hot comments
    for i in range(1, 4):
        try:
            comment_time = tree.xpath(f'//*[@id="pl_feedlist_index"]/div[4]/div[{i}]/div[2]/div[1]/div[2]/p[1]/a/text()')[0]
            comment_time = re.sub(r'\s', ' ', comment_time)
            user_name = tree.xpath(f'//*[@id="pl_feedlist_index"]/div[4]/div[{i}]/div[2]/div[1]/div[2]/p[2]/@nick-name')[0]
            forward_count = tree.xpath(f'//*[@id="pl_feedlist_index"]/div[4]/div[{i}]/div[2]/div[2]/ul/li[1]/a/text()')[1]
            forward_count = forward_count.strip()
            comment_count = tree.xpath(f'//*[@id="pl_feedlist_index"]/div[4]/div[{i}]/div[2]/div[2]/ul/li[2]/a/text()')[0]
            comment_count = comment_count.strip()
            like_count = tree.xpath(f'//*[@id="pl_feedlist_index"]/div[4]/div[{i}]/div[2]/div[2]/ul/li[3]/a/button/span[2]/text()')[0]
            comment = tree.xpath(f'//*[@id="pl_feedlist_index"]/div[4]/div[{i}]/div[2]/div[1]/div[2]/p[2]//text()')
            comment = ' '.join(comment).strip()
            result_df.loc[len(result_df), :] = [rank, hot, title, comment_time, user_name, forward_count, comment_count, like_count, comment]
        except Exception as e:
            print(e)
            continue
    print(detail_url, title)
    # DataFrame.append was removed in newer pandas; pd.concat extends the frame the same way
    all_df = pd.concat([all_df, result_df], ignore_index=True)
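One optional adjustment (my own suggestion, not part of the original code, although sleep is already imported at the top): pause briefly between detail page requests so the 50 hot searches don’t produce 50 back-to-back requests. Inside the loop of get_hot_list, right after the call to get_detail_page, you could add:

        get_detail_page(detail_url, title, rank, hot)
        sleep(1)  # wait one second before requesting the next detail page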

The dispatch code: pass the URL of the hot search page to get_hot_list, then save the results.

if __name__ == '__main__':
    url = 'https://s.weibo.com/top/summary?cate=realtimehot'
    get_hot_list(url)
    all_df.to_excel('Working document.xlsx', index=False)

Possible errors during collection are ignored through exception handling so that the program keeps running; the overall impact is small.

(Output file: Working document.xlsx)

Set timing

Now that the collection code is complete, you can use the Windows Task Scheduler to run it automatically every hour.

Before doing this, we need to make two small changes to the code above: fill in your own Cookie, and adjust the final file save path (an absolute path is recommended). If the code lives in a Jupyter notebook, export it as a .py file to run.
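One way to do that export, if you use Jupyter, is the built-in nbconvert tool, for example jupyter nbconvert --to script your_notebook.ipynb (replace your_notebook.ipynb with your actual notebook name).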

Open Task Scheduler and choose [Create Task]; enter any name you like. Under [Triggers] >> [New], set the trigger time; under [Actions] >> [New], select the program to run; then confirm. When the time comes, the task will run automatically, or you can right-click it and run it manually.
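If you would rather not use Task Scheduler, a simple alternative (my own sketch, not from the original article) is to keep the script itself running and repeat the collection every hour with the already-imported sleep, replacing the dispatch code above:

if __name__ == '__main__':
    url = 'https://s.weibo.com/top/summary?cate=realtimehot'
    while True:
        all_df = pd.DataFrame(columns=all_df.columns)   # start each round with an empty frame
        get_hot_list(url)
        all_df.to_excel('Working document.xlsx', index=False)   # overwrites the file each round, like the original script
        sleep(3600)   # wait one hour before collecting again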

Running effect

That’s all for today. The overall difficulty is not high, and I hope you got something out of it. The code snippets in this article can be pieced together and run directly. If you have any questions, feel free to contact me on WeChat!