This crawler collects the follow list and fan list of a Weibo user, along with each user's basic public information: nickname, ID, gender, location, and follower count. The data is saved to a MongoDB database, and several charts are then generated from it for a simple analysis.



I. Specific steps:

The site we choose to crawl is m.weibo.cn, the mobile version of Weibo. We can open a given user's profile directly, for example m.weibo.cn/profile/572… .

Then open the list of users they follow, open the developer tools, switch to the XHR filter, and scroll down the list; you will see many Ajax requests. They are GET requests that return JSON, and expanding a response shows a lot of user information.

These requests take two parameters, containerid and page; by incrementing the page value we can fetch further pages. The steps for obtaining the fans' user information are the same, except that the request URL and parameters differ slightly, so we just adjust them accordingly. A rough illustration of such a request is sketched below.
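The sketch below is only a minimal illustration: the containerid format shown is an assumption, so copy the real value from the XHR request visible in the developer tools.

import requests

uid = 6857214856  # example user ID (the one used for the profile request later)
url = "https://m.weibo.cn/api/container/getIndex"
params = {
    # assumed containerid format for the follow list; the fans list uses a similar one
    "containerid": "231051_-_followers_-_{}".format(uid),
    "page": 2,  # increase this value to fetch further pages
}
res = requests.get(url, params=params)
print(res.json()["data"])  # the followed users are nested inside the returned cards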

Since these requests return only the user's name and ID, without gender or other basic information, we open a user's Weibo profile and look at their basic information page. Open the developer tools and find the following request:

The user's ID here is 6857214856, so once we have a user ID we can construct the URL and parameters for the basic-information request as follows:

uid_str = "230283" + str(uid)
url = "https://m.weibo.cn/api/container/getIndex?containerid={}_-_INFO&title=%E5%9F%BA%E6%9C%AC%E8%B5%84%E6%96%99&luicode=10000011&lfid={}&featurecode=10000326".format(uid_str, uid_str)
data = {
    "containerid": "{}_-_INFO".format(uid_str),
    "title": "基本资料",  # "basic information", the URL-encoded value in the link above
    "luicode": 10000011,
    "lfid": int(uid_str),
    "featurecode": 10000326
}

The result is also returned in JSON, which is easy to extract. Because many users' profiles are incomplete, I only extract the nickname, gender, location and follower count; and because some accounts are not personal accounts and carry no gender field, I simply record those accounts as male.

One problem I found while crawling is that once the page number exceeds 250 the response comes back empty, so this method can crawl at most 250 pages.

All the extracted user information is stored in MongoDB. After crawling, the records are read back and several charts are drawn: a pie chart of the male/female ratio of the fans, a distribution chart of the users' locations, and a bar chart of the users' follower counts.
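The gender and location lookup is done by a helper, get_user_info, which is called in the code below. A minimal sketch of it might look like this; the exact card layout of the basic-information JSON (item_name/item_content pairs inside card_group) is an assumption, so adjust the key names to whatever the developer tools actually show:

import requests

def get_user_info(uid):
    """Fetch a user's basic-information page and return (sex, location)."""
    uid_str = "230283" + str(uid)
    url = ("https://m.weibo.cn/api/container/getIndex?"
           "containerid={}_-_INFO&luicode=10000011&lfid={}".format(uid_str, uid_str))
    sex, location = "男", ""  # default to male when no gender is given (non-personal accounts)
    try:
        cards = requests.get(url).json()["data"]["cards"]
        for card in cards:
            for item in card.get("card_group", []):
                # assumed field names: each entry carries item_name / item_content
                if item.get("item_name") == "性别":     # gender
                    sex = item.get("item_content", sex)
                elif item.get("item_name") == "所在地":  # location
                    location = item.get("item_content", location)
    except Exception as e:
        print(e)
    return sex, location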

II. Main code:

The JSON format of the first page differs from that of the other pages, so it is parsed separately. A try/except block prints the reason for any error.

The code to crawl and parse the first page is as follows:

import requests

def get_and_parse1(url):
    res = requests.get(url)
    cards = res.json()['data']['cards']
    info_list = []
    try:
        for i in cards:
            if "title" not in i:
                for j in i['card_group'][1]['users']:
                    user_name = j['screen_name']        # nickname
                    user_id = j['id']                    # user ID
                    fans_count = j['followers_count']    # follower count
                    sex, add = get_user_info(user_id)
                    info = {
                        "username": user_name,
                        "sex": sex,
                        "location": add,
                        "fans_count": fans_count,
                    }
                    info_list.append(info)
            else:
                for j in i['card_group']:
                    user_name = j['user']['screen_name']      # nickname
                    user_id = j['user']['id']                  # user ID
                    fans_count = j['user']['followers_count']  # follower count
                    sex, add = get_user_info(user_id)
                    info = {
                        "username": user_name,
                        "sex": sex,
                        "location": add,
                        "fans_count": fans_count,
                    }
                    info_list.append(info)
        if "followers" in url:
            print("Page 1 follow info crawled...")
        else:
            print("Page 1 fan info crawled...")
        save_info(info_list)
    except Exception as e:
        print(e)
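save_info, called above, writes each page of records into MongoDB. A minimal sketch using pymongo, assuming a local MongoDB instance and database/collection names of my own choosing (the original names are not shown in this post):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # assumed local MongoDB instance
collection = client["weibo"]["users"]               # assumed database and collection names

def save_info(info_list):
    """Insert one page of user records into MongoDB."""
    if info_list:
        collection.insert_many(info_list)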

The code to crawl and parse the other pages looks like this:

from time import sleep

def get_and_parse2(url, data):
    # pass the containerid/page dict as query parameters of the GET request
    res = requests.get(url, params=data, headers=get_random_ua())
    sleep(3)  # brief pause between requests
    info_list = []
    try:
        if 'cards' in res.json()['data']:
            card_group = res.json()['data']['cards'][0]['card_group']
        else:
            card_group = res.json()['data']['cardlistInfo']['cards'][0]['card_group']
        for card in card_group:
            user_name = card['user']['screen_name']      # nickname
            user_id = card['user']['id']                  # user ID
            fans_count = card['user']['followers_count']  # follower count
            sex, add = get_user_info(user_id)
            info = {
                "username": user_name,
                "sex": sex,
                "location": add,
                "fans_count": fans_count,
            }
            info_list.append(info)
        if "page" in data:
            print("Page {} crawled...".format(data['page']))
        else:
            print("Page {} crawled...".format(data['since_id']))
        save_info(info_list)
    except Exception as e:
        print(e)
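get_random_ua, used above, just returns a headers dict with a randomly chosen User-Agent so that consecutive requests look less uniform. A minimal sketch (the User-Agent strings below are only examples):

import random

UA_LIST = [
    # any set of common browser User-Agent strings will do
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0",
]

def get_random_ua():
    """Return a headers dict with a random User-Agent."""
    return {"User-Agent": random.choice(UA_LIST)}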

III. Operation results:

Various errors can occur at runtime: sometimes the response is empty, sometimes parsing fails, but the crawler still manages to collect most of the data. Below are the three charts generated at the end.
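For reference, charts like these can be drawn from the stored records with matplotlib roughly as follows (a minimal sketch reusing the MongoDB names assumed earlier; the location distribution chart is built the same way from the location field, and the original plotting code is in the GitHub repository):

from collections import Counter
import matplotlib.pyplot as plt
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017/")["weibo"]["users"]  # assumed names
users = list(collection.find())

# Pie chart of the male/female ratio.
sex_counts = Counter(u["sex"] for u in users)
plt.figure()
plt.pie(list(sex_counts.values()), labels=list(sex_counts.keys()), autopct="%.1f%%")
plt.title("Gender ratio")

# Bar chart of follower counts for the ten users with the most fans.
top = sorted(users, key=lambda u: u["fans_count"], reverse=True)[:10]
plt.figure()
plt.bar([u["username"] for u in top], [u["fans_count"] for u in top])
plt.xticks(rotation=45)
plt.title("Follower counts")

plt.show()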

The full code has been uploaded to GitHub: github.com/QAQ112233/W…