Preface

Now that you’ve clicked in, I’ll be honest: the title was just a hook, but the content is real. Today’s goal is to write a crawler that grabs all of the Weibo posts published by a target user. Without further ado, let’s get started.

Development tools

Python version: 3.6.4
Related modules:

argparse module;

DecryptLogin module;

lxml module;

tqdm module;

prettytable module;

pyecharts module;

jieba module;

wordcloud module;

And some of the modules that come with Python.

Environment setup

Install Python, add it to your PATH environment variable, and use pip to install the required modules.
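For example, the third-party modules listed above can be installed in one go (package names are assumed to match their PyPI names):

pip install DecryptLogin lxml tqdm prettytable pyecharts jieba wordcloud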

Note that the DecryptLogin module is updated frequently, so to make sure it works with all the related examples on this account, please keep it up to date. The upgrade command is as follows:

pip install DecryptLogin --upgrade

Introduction to the principle

Here is a brief description of the whole crawling process. The first step, of course, is a simulated login to Sina Weibo. As before, we rely on our previously open-sourced simulated-login package to log in to Weibo. Specifically, the code looks like this:

"" @staticmethod def login(username, password): lg = login.Login() _, session = lg.weibo(username, password, 'mobile') return session

Then the program asks the user to enter the ID of the target user they want to crawl. So how do you find a Weibo user ID? Take Liu Yifei’s Weibo as an example: open her home page and look at the link:

So Liu Yifei’s Weibo user ID is 3261134763.
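Presumably the argparse module listed earlier is what reads the login credentials from the command line; a minimal sketch of that part might look like this (the argument names are assumptions, not taken from the original project):

import argparse

# parse the Weibo account credentials from the command line
parser = argparse.ArgumentParser(description='Crawl all posts of a target Weibo user.')
parser.add_argument('--username', dest='username', help='Weibo account username', required=True)
parser.add_argument('--password', dest='password', help='Weibo account password', required=True)
args = parser.parse_args()

# the target user ID (e.g. 3261134763) is then entered interactively
user_id = input('Please enter the target user ID --> ')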

Using the Weibo user ID entered by the user, we take the session returned by the simulated login and request the following links:

url = f'https://weibo.cn/{user_id}'
res = self.session.get(url)
url = f'https://weibo.cn/{user_id}/info'
res = self.session.get(url)

The link would look something like this in the browser:

Obviously, here we can use XPath to extract some basic information about the target user:

# parse the user's home page for post/following/follower counts and the number of pages
selector = etree.HTML(res.content)
base_infos = selector.xpath("//div[@class='tip2']/*/text()")
num_wbs, num_followings, num_followers = int(base_infos[0][3: -1]), int(base_infos[1][3: -1]), int(base_infos[2][3: -1])
num_wb_pages = selector.xpath("//input[@name='mp']")
num_wb_pages = int(num_wb_pages[0].attrib['value']) if len(num_wb_pages) > 0 else 1
# parse the /info page for the user's nickname
selector = etree.HTML(res.content)
nickname = selector.xpath('//title/text()')[0][:-3]

I won’t go into what XPath is here; if you look at the page’s source code, these expressions are easy to write:

After extracting this information, print it out so the user can confirm that the account found with the entered user ID is really the one they want to crawl. Once the information is confirmed, the crawler starts downloading the user’s Weibo data:

tb = prettytable.PrettyTable()
tb.field_names = ['username', 'number of followings', 'number of followers', 'number of weibos', 'number of weibo pages']
tb.add_row([nickname, num_followings, num_followers, num_wbs, num_wb_pages])
print('The user info obtained is as follows:')
print(tb)
is_download = input("Crawl this user's weibos? (y/n, default: y) --> ")
if is_download == 'y' or is_download == 'yes' or not is_download:
    userinfos = {'user_id': user_id, 'num_wbs': num_wbs, 'num_wb_pages': num_wb_pages}
    self.__downloadWeibos(userinfos)

Extracting the user’s Weibo posts is, again, basically a matter of XPath. To view a user’s posts, you only need to visit the following link:

url = f'https://weibo.cn/{user_id}?page={page}'
# page is the page number of the user's post list being requested
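The per-page parsing itself is left to the source code; the sketch below shows roughly what it could look like, assuming the weibo.cn mobile layout where each post sits in a div of class 'c' and its text in a span of class 'ctt' (the function name and class names here are assumptions, not taken from the original project):

from lxml import etree

def parse_weibo_page(session, user_id, page):
    # fetch one page of the user's posts and pull out the post texts
    url = f'https://weibo.cn/{user_id}?page={page}'
    res = session.get(url)
    selector = etree.HTML(res.content)
    contents = []
    for card in selector.xpath("//div[@class='c']"):
        text = card.xpath("string(.//span[@class='ctt'])")
        if text:
            contents.append(text.strip())
    return contents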

As for tricks, there are only two worth mentioning:

  • Save the data after every 20 pages of posts, so that an unexpected crawler crash does not wipe out everything crawled so far.
  • Pause for x seconds after every n pages, where n is randomly regenerated each time and x is randomly regenerated each time as well (see the sketch after this list).
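A minimal sketch of this save-and-pause loop (the helper names and the exact constants are assumptions, not the original implementation):

import random
import time

all_weibos = []
pages_since_pause = 0
page_block = random.randint(1, 5)                                  # n, re-rolled after every pause
for page in range(1, num_wb_pages + 1):
    all_weibos.extend(parse_weibo_page(session, user_id, page))   # hypothetical helper from above
    if page % 20 == 0:
        save_weibos(all_weibos)                                    # hypothetical save helper
    pages_since_pause += 1
    if pages_since_pause >= page_block:
        time.sleep(random.uniform(1, 6))                           # x seconds, re-rolled every time
        pages_since_pause = 0
        page_block = random.randint(1, 5)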

That’s the general idea; for the finer details of the processing, please read the source code yourself.

Data visualization

As usual, let’s visualize the crawled data. For convenience, we’ll look at the visualization of Liu Yifei’s Weibo data.

Let’s start with the word cloud built from all of her posts (original posts only):
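For reference, a minimal sketch of how such a word cloud can be built with jieba and wordcloud (the font path and the texts variable are assumptions; texts would be the list of original post texts collected by the crawler):

import jieba
from wordcloud import WordCloud

# segment the Chinese text and join the words with spaces for wordcloud
words = ' '.join(jieba.cut(' '.join(texts)))
wc = WordCloud(font_path='simhei.ttf',     # a Chinese font is required to render Chinese words
               background_color='white',
               width=800, height=600)
wc.generate(words)
wc.to_file('wordcloud.png')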

Next, how many of her posts are original and how many are retweets:

And how many posts did she publish each year?
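Yearly counts like these can be plotted with pyecharts; here is a minimal sketch using its v1-style API (the numbers below are placeholders, not real data):

from pyecharts import options as opts
from pyecharts.charts import Bar

# year -> number of posts, as counted from the crawled data (placeholder values)
counts_per_year = {'2018': 30, '2019': 25, '2020': 10}
bar = Bar()
bar.add_xaxis(list(counts_per_year.keys()))
bar.add_yaxis('number of posts', list(counts_per_year.values()))
bar.set_global_opts(title_opts=opts.TitleOpts(title='Posts per year'))
bar.render('posts_per_year.html')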

Sure enough, she posts far less than she used to. Let’s check out her very first post, zoaIU7o2d 😀:

“Hello, I’m Liu Yifei”

How many likes do her original posts get each year?

How many retweets?

And how many comments?

If you enjoyed this article, please give it a like and follow me. I share an article from the Python simulated-login series every day; the next one will cover automatic check-in for NetEase Cloud Music.

All done~ For the complete source code, see my profile or send me a private message to get the related files.