1. Technology stack used

  • Crawler: Python 2.7 + requests + json + bs4 + time
  • Analysis tool: ELK stack (Elasticsearch, Logstash, Kibana)
  • Development tool: PyCharm

2. Data results

573,347 records were crawled. Instead of using a thread pool in the Python code, I simply started the main() method 10 times, that is, 10 processes, which crawled the 570,000+ records in about 4 hours.
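For illustration only, and assuming main() is the crawl entry function shown later, launching the 10 processes from a single script could be sketched with multiprocessing (the author simply started main() 10 times by hand):

# A minimal sketch, not the author's exact setup: start 10 crawler
# processes from one script. "zhihu_crawler" is a hypothetical module
# name for the crawl code shown later in this article.
from multiprocessing import Process
from zhihu_crawler import main  # hypothetical module containing main()

def run_crawlers(worker_count=10):
    workers = [Process(target=main) for _ in range(worker_count)]
    for p in workers:
        p.start()   # launch the crawler processes
    for p in workers:
        p.join()    # wait until all of them finish

if __name__ == '__main__':
    run_crawlers()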

3. Simple visual analysis

1. Gender distribution

  • 0 (green) is male ^. ^
  • 1 is female
  • -1 is unknown gender

It can be seen that male users make up a large majority on Zhihu.

2. Top 30 users by follower count

The top 30 users with the most followers include Zhang Jiawei, Kai-fu Lee, Huang Jixin and so on. If you look these people up on Zhihu, you will find almost the same ranking, which suggests the extracted data is credible.

3. The Best of 2015

4. Distribution of the number of articles written: about 450,000 users, almost 90%, have never written an article on Zhihu, which indicates that most Zhihu users only read articles and answers, while content producers account for only about 10%.

4. Crawler architecture

The crawler architecture diagram is as follows:

Description:

  • Select the URL of an active user (such as Kai-fu Lee) as the entry URL, and store crawled URLs in a set.
  • Crawl the content, parse out the URLs of the users this user follows, add those URLs to another set, and use the already-crawled URLs as a filter.
  • Parse each user's profile information and write it to local disk (a minimal driver-loop sketch follows this list).
  • Logstash picks up the user data from local disk in real time and feeds it to Elasticsearch.
  • Kibana works with Elasticsearch to turn the data into user-friendly visualizations.
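To make the flow concrete, here is a minimal sketch of the driver loop, assuming the download() and parse() functions and the three sets defined in the coding section below; the followees API URL is only a placeholder pattern, and 'kaifulee' is an assumed url_token for the entry user:

# A minimal sketch of the driver loop, assuming download() and parse()
# from the coding section below. The URL pattern is a placeholder for
# the followees API of a user; the entry point is an active user such
# as Kai-fu Lee.
FOLLOWEES_URL = 'https://www.zhihu.com/api/v4/members/%s/followees'  # placeholder pattern

old_url_tokens = set()     # url_tokens already crawled (used as a filter)
new_url_tokens = set()     # url_tokens waiting to be crawled
saved_users_set = set()    # url_tokens already written to disk

def main(entry_token='kaifulee'):   # assumed url_token of the entry user
    new_url_tokens.add(entry_token)
    while new_url_tokens:
        token = new_url_tokens.pop()
        if token in old_url_tokens:
            continue
        old_url_tokens.add(token)
        parse(download(FOLLOWEES_URL % token))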

5. Coding

To crawl a URL:

import requests

def download(url):
    if url is None:
        return None
    try:
        response = requests.get(url, headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
            'authorization': 'your authorization'
        })
        print(response.content)
        if response.status_code == 200:
            return response.content
        return None
    except requests.RequestException:
        return None

Parse the content:

import json

def parse(response):
    try:
        print(response)
        json_body = json.loads(response)
        json_data = json_body['data']
        for item in json_data:
            # queue users we have not crawled yet, capping the pending set at 2000
            if item['url_token'] not in old_url_tokens:
                if len(new_url_tokens) < 2000:
                    new_url_tokens.add(item['url_token'])
            # save users we have not written to disk yet
            if item['url_token'] not in saved_users_set:
                jj = json.dumps(item)
                save(item['url_token'], jj)
                saved_users_set.add(item['url_token'])

        # follow the pagination until the last page
        if not json_body['paging']['is_end']:
            next_url = json_body['paging']['next']
            response2 = download(next_url)
            parse(response2)

    except Exception:
        print('parse fail')

Save to a local file:

def save(url_token, strs):
    f = open("/Users/forezp/Downloads/zhihu/user_" + url_token + ".txt", "w+")
    f.write(strs)
    f.close()

  • You need to replace the authorization value in the request headers with your own.
  • You need to change the file storage path to your own.

Download the source code: click here, and remember to give it a star!

6. How to obtain the authorization value

  • Open Chrome and go to www.zhihu.com/.
  • Find any user and open their profile page, then press F12 (or right-click and choose Inspect).
  • Click Follow and refresh the page, as shown in the picture (a sketch of plugging the copied value into the request headers follows this list):
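The copied value then replaces the 'your authorization' placeholder in the headers that download() sends; a minimal sketch, where both the authorization string and the URL below are placeholders, not real values:

import requests

# A minimal sketch: the authorization value copied from Chrome DevTools
# goes into the headers dict used by download(). The URL and the
# authorization string below are placeholders.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    'authorization': '<paste the value copied from DevTools here>',
}

response = requests.get('https://www.zhihu.com/api/v4/members/<url_token>/followees',
                        headers=HEADERS)  # placeholder URL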

7. Areas for improvement

  • Add a thread pool to improve crawler efficiency.
  • I only use set() to store URLs, capped at 2,000 entries as a simple cache strategy to avoid running out of memory; they could instead be stored in Redis (see the sketch after this list).
  • Crawled users are stored as local files; a better approach would be to store them in MongoDB.
  • Add a filter on which users to store, for example only keep users with more than 100 followers or who participate in more than 10 topics, to avoid capturing too many zombie accounts.
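A minimal sketch of the Redis idea above, assuming a local Redis instance and the redis-py client; the key names are made up for illustration:

import redis

# A minimal sketch, assuming a local Redis instance and the redis-py
# package; key names are illustrative only.
r = redis.StrictRedis(host='localhost', port=6379, db=0)

def add_new_token(url_token):
    # only queue tokens that have not been crawled yet
    if not r.sismember('old_url_tokens', url_token):
        r.sadd('new_url_tokens', url_token)

def next_token():
    # pop a pending token; returns None when the set is empty
    return r.spop('new_url_tokens')

def mark_crawled(url_token):
    r.sadd('old_url_tokens', url_token)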

8. About the ELK stack

Installing the ELK stack is not covered here; see the official website for details: www.elastic.co/

The Logstash configuration file is as follows:


input {
  # For detail config for log4j as input,
  # See: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-log4j.html

    file {
        path => "/Users/forezp/Downloads/zhihu/*"
    }


}
filter {
  #Only matched data are send to output.
}
output {
  # For detail config for elasticsearch as output,
  # See: https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html
 elasticsearch {
    action => "index"          #The operation on ES
    hosts  => "localhost:9200"   #ElasticSearch host, can be array.
    index  => "zhihu"          #The index to write data to.
  }
}
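Once the documents are in the zhihu index, the gender chart from the analysis section can be reproduced with a terms aggregation. The sketch below assumes the user JSON has actually been parsed into fields (for example via a json codec on the file input), which the bare configuration above does not do by itself:

import json
import requests

# A minimal sketch: run an Elasticsearch terms aggregation on the
# gender field (0 = male, 1 = female, -1 = unknown). Assumes the JSON
# lines were parsed into fields when they were indexed.
query = {
    "size": 0,
    "aggs": {
        "genders": {"terms": {"field": "gender"}}
    }
}
resp = requests.post('http://localhost:9200/zhihu/_search',
                     data=json.dumps(query),
                     headers={'Content-Type': 'application/json'})
print(resp.json()['aggregations']['genders']['buckets'])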

9. Epilogue

The 570,000 user records could be analyzed along many more dimensions, such as region, education background, and age; I won't list them all here. Beyond that, I find crawlers very interesting. In this era of upgraded content consumption, how to mine valuable data from the vast ocean of data on the Internet is something worth thinking about and practicing continuously.