This section is a hands-on Scrapy project: crawling the information of Zhihu users.

Goal of this section

The contents of this section are as follows:

  • Starting from a big V (high-profile) user, recursively crawl the follower and followee lists to capture the detailed information of as many Zhihu users as possible.

  • Store the captured results in MongoDB, with deduplication.

Approach analysis

Every Zhihu user has a followee (following) list and a follower list, and big V users in particular have a great many of both.

If we start from a big V, we first grab his personal information, then fetch his follower and followee lists, then traverse every user in those lists, grab each of their details along with their own follower and followee lists, then traverse those lists in turn, and so on recursively. One user expands to hundreds, hundreds to tens of thousands, tens of thousands to millions: the social relationships naturally form a crawl network, and in this way we can reach essentially every user's information. Of course, users with zero followers and zero followees can be ignored.
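
To make the idea concrete before touching Scrapy, here is a rough plain-Python sketch of that traversal. Note that get_user_info, get_followees, get_followers and save are hypothetical placeholders for the interfaces and storage built later in this section, not real functions:

from collections import deque

def crawl(start_user):
    seen = {start_user}                    # users already queued, for deduplication
    queue = deque([start_user])
    while queue:
        user = queue.popleft()
        save(get_user_info(user))          # grab and store this user's details
        for other in get_followees(user) + get_followers(user):
            if other not in seen:          # skip users we have already seen
                seen.add(other)
                queue.append(other)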

Where does the crawled information come from? By analyzing the requests Zhihu makes, we can find the relevant API interfaces; through them we can obtain user details as well as follower and followee lists.

Now we’re going to start crawling.

Environment requirements

Python3

The Python version used in this project is Python3. Make sure you have Python3 installed before starting the project.

Scrapy

Scrapy is a powerful crawler framework, installed as follows:

pip3 install scrapy

MongoDB

Non-relational database. Install MongoDB and start the service before starting the project.

PyMongo

Python MongoDB connection library, installed as follows:

pip3 install pymongo
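
If you want to confirm that MongoDB and PyMongo are both working before writing any crawler code, a quick sanity check (assuming MongoDB is listening on the default localhost:27017) looks like this:

import pymongo

# Connect to the local MongoDB service and ping it; an exception here means
# the service is not running or not reachable.
client = pymongo.MongoClient('mongodb://localhost:27017')
client.admin.command('ping')
print('MongoDB is reachable')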

Create a project

With the above environment installed, we are ready to start. Create the project from the command line:

scrapy startproject zhihuuser

Create the crawler

Next we need to create a spider, again from the command line, but this time the commands must be run inside the project directory.

cd zhihuuser
scrapy genspider zhihu www.zhihu.com

Disable ROBOTSTXT_OBEY

Next you need to open settings.py and change the ROBOTSTXT_OBEY to False.

ROBOTSTXT_OBEY = False

It defaults to True, which means the crawler follows the rules in robots.txt. So what is robots.txt?

Generally speaking, robots.txt is a file implementing the Robots exclusion protocol. It is stored on the site's server and tells search-engine crawlers which directories or pages of the site they should not crawl. When Scrapy starts, it first fetches the site's robots.txt file and restricts the crawl scope accordingly.

Of course, we are not building a search engine, and in some cases the content we want to access is disallowed by robots.txt. So sometimes we need to set this option to False and choose not to comply with the Robots protocol.

So we set it to False here. This particular crawl might not even be restricted by it, but it is generally safest to disable it first.

Try an initial crawl

Without modifying any code yet, run the crawl with the following command:

scrapy crawl zhihu

You’ll notice an error like this in the crawl result:

500 Internal Server Error

Accessing Zhihu returns status code 500, which means the crawl failed. This is because we did not add any request headers: Zhihu detected that the User-Agent was not a browser and returned an error response.

The next step is to add request headers. You can attach headers to each Request, set them in the spider's custom_settings, or, most simply, add them to the global settings.

Open the settings.py file, uncomment DEFAULT_REQUEST_HEADERS and add something like this:

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}

This sets default headers for every request: any request that does not define its own headers will use these, and with the User-Agent added our crawler can disguise itself as a browser.

Next, re-run the crawler.

scrapy crawl zhihu

At this point, you will find that the returned status code is normal.

With this problem solved, we can analyze the page logic and implement the crawler properly.

Crawl process

Next we need to explore the interfaces for getting user details and the followee list.

Go back to the web page, open the browser's developer tools, and switch to the Network panel.

The first thing to do is pick a big V, for example 轮子哥 (vczh). His profile page URL is: www.zhihu.com/people/exci…

First open 轮子哥's profile page.

Here we can see his basic information, which is what we want to capture: name, headline, occupation, number of followers, number of upvotes, and so on.

Next we need to find the followee-list interface. Click the Following tab and scroll down to turn pages; among the requests we find an Ajax request whose URL contains followees. This is the interface for getting the followee list.

Let’s look at the request structure

First of all, it is a GET request, and the requested URL is www.zhihu.com/api/v4/memb…, followed by three query parameters: include, offset, and limit.

include is a query parameter specifying which basic fields to return for each followee, such as number of answers, number of articles, and so on.

offset is the paging offset. The request we are analyzing is for page 3 of the followee list, so offset is currently 40.

limit is the number of entries per page, here 20. From the offset above we can infer that offset 0 returns the first page of the followee list, offset 20 the second page, and so on.
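
As a small illustration of that relationship (plain Python, with names of our own choosing), the offset for page n is simply (n - 1) * limit:

# Build the followees URL for a given 1-based page number.
# FOLLOWS_URL mirrors the interface analyzed above; LIMIT matches the limit parameter.
FOLLOWS_URL = 'https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}'
LIMIT = 20

def followees_page_url(user, include, page):
    offset = (page - 1) * LIMIT   # page 1 -> offset 0, page 2 -> offset 20, page 3 -> offset 40
    return FOLLOWS_URL.format(user=user, include=include, offset=offset, limit=LIMIT)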

Then look at the return result:

You can see that the result contains two fields, data and paging. data holds the actual content, 20 entries here, each being the basic information of one user in the followee list.

paging contains several fields: is_end indicates whether the current page is the last one, and next is the link to the next page. So when handling paging we first check is_end to see whether paging has finished; if not, we take the next link and request the next page.
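
For reference, the paging field has roughly this shape; the field names follow the analysis above, but the values below are illustrative, not real data from the API:

paging = {
    'is_end': False,
    'next': 'https://www.zhihu.com/api/v4/members/excited-vczh/followees?offset=60&limit=20',
}

if not paging['is_end']:          # more pages remain
    next_page = paging['next']    # request this URL to fetch the next page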

So the followee list can be obtained through this interface.

Next, let's find the user-details interface. If we hover the mouse over any avatar in the followee list and watch the network requests, we see another Ajax request.

You can see that the request link is www.zhihu.com/api/v4/memb…, where the path contains the user's url_token and the only query parameter is include. The url_token can be obtained from the data returned above.

So to sum up:

  • To get a user's followee list, we request an interface like www.zhihu.com/api/v4/memb…, where user is the user's url_token, include is a fixed query parameter, offset is the paging offset, and limit is the number of entries per page.

  • To get a user's details, we request an interface like www.zhihu.com/api/v4/memb…, where user is the user's url_token and include is the query parameter.

With the interface logic sorted out above, we can begin to construct the request.

Generate the first requests

The first step is to request the start user's details and followee list. We define formatted URL templates with the variable parts extracted as parameters, then override the start_requests method to generate the first requests. Later we will analyze the followee list we get back and go further from there.

import json
from scrapy import Spider, Request
from zhihuuser.items import UserItem

class ZhihuSpider(Spider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    follows_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}'
    start_user = 'excited-vczh'
    user_query = 'locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics'
    follows_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    def start_requests(self):
        yield Request(self.user_url.format(user=self.start_user, include=self.user_query), self.parse_user)
        yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query, limit=20, offset=0),
                      self.parse_follows)

Then we implement the two parsing methods parse_user and parse_follows.

    def parse_user(self, response):
        print(response.text)
    def parse_follows(self, response):
        print(response.text)

For now they simply print the response; run the crawler and observe the output.

scrapy crawl zhihu

You will then see the following error:

401 HTTP status code is not handled or not allowed

Access is denied. Comparing with the request made in the browser, we see the browser sends an extra authorization header carrying an OAuth token.

OAuth

It stands for Open Authorization.

The OAuth token is obtained during the OAuth authorization flow; a request carrying this token can fetch from the resource-owning site any resources that the token has permission to access.

Here, with Zhihu not logged in, the OAuth value is:

oauth c3cef7c66a1843f8b3a9e6a1e3160e20

After observing it for a long time, it does not change, so it can be reused for quite a while. We therefore add it to DEFAULT_REQUEST_HEADERS, which becomes:

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20',
}

If we re-run the crawler now, we can see that the data is retrieved normally.

parse_user

Next let's handle the basic user information. First, look at what the details interface returns.

You can see that the result is very complete, so we simply declare an Item and save it all.

Declare a new UserItem in items.py:

from scrapy import Item, Field

class UserItem(Item):
    # define the fields for your item here like:
    id = Field()
    name = Field()
    avatar_url = Field()
    headline = Field()
    description = Field()
    url = Field()
    url_token = Field()
    gender = Field()
    cover_url = Field()
    type = Field()
    badge = Field()

    answer_count = Field()
    articles_count = Field()
    commercial_question_count = Field()
    favorite_count = Field()
    favorited_count = Field()
    follower_count = Field()
    following_columns_count = Field()
    following_count = Field()
    pins_count = Field()
    question_count = Field()
    thank_from_count = Field()
    thank_to_count = Field()
    thanked_count = Field()
    vote_from_count = Field()
    vote_to_count = Field()
    voteup_count = Field()
    following_favlists_count = Field()
    following_question_count = Field()
    following_topic_count = Field()
    marked_answers_count = Field()
    mutual_followees_count = Field()
    hosted_live_count = Field()
    participated_live_count = Field()

    locations = Field()
    educations = Field()
    employments = Field()

In the parsing method we parse the response text into a JSON object, then check each Item field in turn: if the field exists in the result, assign its value.

result = json.loads(response.text)
item = UserItem()
for field in item.fields:
    if field in result.keys():
        item[field] = result.get(field)
yield item

So when you get the item, you just yield it back.

This saves the user’s basic information.

Next we need the user's followee list, so we generate another request for it.

Add the following code at the end of parse_user:

yield Request(
            self.follows_url.format(user=result.get('url_token'), include=self.follows_query, limit=20, offset=0),
            self.parse_follows)

This generates a request for the user's followee list.

parse_follows

Next we handle the followee list. First we parse the response text, and then do two things:

  • For each user in the list, generate a request for their details.

  • Handle paging: check the paging field and request the next page of the followee list.

So rewrite parse_follows as follows:

results = json.loads(response.text)

if 'data' in results.keys():
    for result in results.get('data'):
        yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                      self.parse_user)

if 'paging' in results.keys() and results.get('paging').get('is_end') == False:
    next_page = results.get('paging').get('next')
    yield Request(next_page, self.parse_follows)

Thus, the overall code is as follows:

# -*- coding: utf-8 -*-
import json

from scrapy import Spider, Request
from zhihuuser.items import UserItem


class ZhihuSpider(Spider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    follows_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}'
    start_user = 'excited-vczh'
    user_query = 'locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics'
    follows_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    def start_requests(self):
        yield Request(self.user_url.format(user=self.start_user, include=self.user_query), self.parse_user)
        yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query, limit=20, offset=0),
                      self.parse_follows)

    def parse_user(self, response):
        result = json.loads(response.text)
        item = UserItem()

        for field in item.fields:
            if field in result.keys():
                item[field] = result.get(field)
        yield item

        yield Request(
            self.follows_url.format(user=result.get('url_token'), include=self.follows_query, limit=20, offset=0),
            self.parse_follows)

    def parse_follows(self, response):
        results = json.loads(response.text)

        if 'data' in results.keys():
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                              self.parse_user)

        if 'paging' in results.keys() and results.get('paging').get('is_end') == False:
            next_page = results.get('paging').get('next')
            yield Request(next_page,
                          self.parse_follows)

We now fetch each user's basic information and then recursively fetch their followee list for further requests.

Rerun the crawler, and you can see that it now crawls recursively in a continuous loop.

followers

Analysis shows that the follower-list API is almost identical: followees in the URL is simply replaced by followers, and everything else is the same. So we add follower handling with the same logic.

The final spider code looks like this:

# -*- coding: utf-8 -*-
import json

from scrapy import Spider, Request
from zhihuuser.items import UserItem


class ZhihuSpider(Spider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    follows_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}'
    followers_url = 'https://www.zhihu.com/api/v4/members/{user}/followers?include={include}&offset={offset}&limit={limit}'
    start_user = 'excited-vczh'
    user_query = 'locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics'
    follows_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'
    followers_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    def start_requests(self):
        yield Request(self.user_url.format(user=self.start_user, include=self.user_query), self.parse_user)
        yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query, limit=20, offset=0),
                      self.parse_follows)
        yield Request(self.followers_url.format(user=self.start_user, include=self.followers_query, limit=20, offset=0),
                      self.parse_followers)

    def parse_user(self, response):
        result = json.loads(response.text)
        item = UserItem()

        for field in item.fields:
            if field in result.keys():
                item[field] = result.get(field)
        yield item

        yield Request(
            self.follows_url.format(user=result.get('url_token'), include=self.follows_query, limit=20, offset=0),
            self.parse_follows)

        yield Request(
            self.followers_url.format(user=result.get('url_token'), include=self.followers_query, limit=20, offset=0),
            self.parse_followers)

    def parse_follows(self, response):
        results = json.loads(response.text)

        if 'data' in results.keys():
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                              self.parse_user)

        if 'paging' in results.keys() and results.get('paging').get('is_end') == False:
            next_page = results.get('paging').get('next')
            yield Request(next_page,
                          self.parse_follows)

    def parse_followers(self, response):
        results = json.loads(response.text)

        if 'data' in results.keys():
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                              self.parse_user)

        if 'paging' in results.keys() and results.get('paging').get('is_end') == False:
            next_page = results.get('paging').get('next')
            yield Request(next_page,
                          self.parse_followers)

The places that need to change are:

  • start_requests additionally yields the follower-list request.

  • parse_user additionally yields a follower-list request.

  • parse_followers issues the detail requests and handles paging for the follower list.

This completes the spider, which allows us to recursively crawl through the social network and pull down user details.

Summary

With the above spider, we implement the following logic:

  • The start_requests method generates the requests for the starting big V user's details and for his follower and followee lists.

  • parse_user extracts the detailed information and generates the follower- and followee-list requests.

  • parse_follows requests the details of each user in the followee list and turns its pages.

  • parse_followers requests the details of each user in the follower list and turns its pages.

Add an Item Pipeline

We use MongoDB for storage here, so we need an Item Pipeline, implemented as follows:

import pymongo


class MongoPipeline(object):
    collection_name = 'users'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Upsert keyed on url_token so that each user is stored only once.
        self.db[self.collection_name].update({'url_token': item['url_token']}, {'$set': dict(item)}, True)
        return item

The important part is process_item, which calls the collection's update method: the first argument is the query condition, here keyed on url_token; the second is the update document built from our item; and the third, True, enables upsert, so the record is updated if it already exists and inserted if it does not. This is how deduplication is guaranteed.
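
Note that the three-argument update call is the legacy PyMongo API; on newer PyMongo versions the same upsert-based deduplication can be written with update_one, for example:

    def process_item(self, item, spider):
        # Upsert keyed on url_token: update the document if it exists, insert it otherwise.
        self.db[self.collection_name].update_one(
            {'url_token': item['url_token']},
            {'$set': dict(item)},
            upsert=True)
        return item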

Also remember to enable the Item Pipeline in settings.py:

ITEM_PIPELINES = {
    'zhihuuser.pipelines.MongoPipeline': 300,
}
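
The pipeline above reads MONGO_URI and MONGO_DATABASE from the settings, so define them there as well; the values below assume a local MongoDB and a database name of our own choosing:

MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'zhihu'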

Then re-run the crawler

scrapy crawl zhihu

Now you should see normal output: the crawler runs continuously and users are saved to the database one by one.

Take a look at MongoDB: the crawled user details are all there.

At this point the whole crawler is basically finished. The core logic is the recursive traversal, and the stored results are deduplicated appropriately.

Higher efficiency

Of course, what we are running now is a single-machine crawler, so the speed is limited by one computer. To improve efficiency later, we can move to a distributed crawler and use Redis to maintain a shared crawl queue.
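
One common way to do this (an assumption here, not something implemented in this section) is the scrapy-redis extension, which swaps Scrapy's in-memory scheduler and duplicate filter for Redis-backed ones so that several machines can share one request queue. A minimal settings sketch might look like:

# settings.py -- a minimal scrapy-redis sketch; assumes `pip3 install scrapy-redis`
# and a Redis server reachable at the URL below.
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'              # share the request queue via Redis
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # share request fingerprints for dedup
SCHEDULER_PERSIST = True                                    # keep the queue between runs
REDIS_URL = 'redis://localhost:6379'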

For more on distributed crawler implementations, see "Do It Yourself, Well Fed! Python3 Web Crawler Practical Examples".