A review of Weibo crawlers

Today I will walk you through writing a Weibo crawler that can scrape trending topics or comments, which is convenient for data analysis and visualization.

The project's GitHub address is github.com/Python3Spid… Please do not use the data obtained by this crawler for any illegal purpose.

The project's article collection is at: mp.weixin.qq.com/mp/appmsgal…

Weibo crawlers mainly go in two directions: one crawls Weibo posts, with target fields such as the post text, the publisher, and the forward/comment/like counts; the other crawls Weibo comments, with target fields that are mainly the comment text and the commenter.

Weibo crawlers mainly target four sites: the PC sites weibo.com and weibo.cn, and the corresponding mobile (M) sites m.weibo.com (which cannot be browsed on a computer) and m.weibo.cn. Generally speaking, the .cn sites are simpler and uglier than the .com sites, and the mobile sites are simpler and uglier than the PC sites.

GitHub repository details

The repository layout, shown in the figure below, is divided into two main parts: the GUI feature-concentrated version and the non-GUI feature-independent version.

GUI feature-concentrated version

At first there was only the GUI feature-concentrated version, whose main code is GUI.py and WeiboCommentScrapy.py.

GUI.py contains both the crawler logic and the GUI, which is built on PyQt5. The crawler part consists of three classes, WeiboSearchScrapy, WeiboUserScrapy and WeiboTopicScrapy, all of which inherit from the Thread class in the threading module.

GUI.py implements the user and topic crawlers, i.e. crawling the posts of a specified user or under a specified topic. When you click submit in the interface, the corresponding crawler thread is started, as sketched below.
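The thread-based design looks roughly like the following. This is a minimal sketch rather than the repository's actual code; the constructor arguments user_id and page_limit, and the body of run, are illustrative assumptions.

from threading import Thread

class WeiboUserScrapy(Thread):
    """Sketch of a crawler thread: the GUI creates one of these per task."""
    def __init__(self, user_id, page_limit=50):
        super().__init__()
        self.user_id = user_id
        self.page_limit = page_limit  # weibo.cn only exposes around 50 pages

    def run(self):
        # The real class would request each page, parse it and append rows to a CSV.
        for page in range(1, self.page_limit + 1):
            print('crawling page {} of user {}'.format(page, self.user_id))

if __name__ == '__main__':
    # In the PyQt5 interface, the submit button's clicked handler would do
    # essentially this, so the GUI stays responsive while the crawler runs:
    t = WeiboUserScrapy('1234567890')
    t.start()
    t.join()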

The running effect of GUI.py is as follows:

Whether it is the user crawler or the topic crawler, there is a limit of roughly 50 pages.

WeiboCommentScrapy.py is the comment crawler: given a post's id on weibo.cn, it crawls the comments under that post.

Note that a post's id on weibo.cn looks like IjEDU1q3w, which is different from the id on m.weibo.cn (a long string of pure digits). The comment crawler can only crawl the first 100 pages.
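The two id formats are related by a widely documented base62 scheme (digits plus lower- and upper-case letters, with the numeric id split into 7-digit groups from the right). The conversion below is not taken from this repository; it is only an illustrative sketch of that scheme, and the function names are my own.

ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

def base62_encode(num):
    # Encode one non-negative integer group into base62.
    if num == 0:
        return '0'
    s = ''
    while num > 0:
        num, r = divmod(num, 62)
        s = ALPHABET[r] + s
    return s

def mid_to_url_id(mid):
    # Convert the pure-digit id used by m.weibo.cn into the short
    # weibo.cn-style id. The digits are split into 7-digit groups from the
    # right; each group is base62-encoded, and non-leading groups are
    # left-padded with '0' to 4 characters.
    mid = str(mid)
    result = ''
    while mid:
        group, mid = mid[-7:], mid[:-7]
        encoded = base62_encode(int(group))
        if mid:
            encoded = encoded.rjust(4, '0')
        result = encoded + result
    return result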

The data crawled for Weibo users, topics and comments are saved as CSV files under the user, topic and comment folders respectively.

2020-03-13 update: as actually tested, the code of the GUI feature-concentrated version still works, but the exe no longer does. Packaging and releasing a new exe after every code update is a hassle, so I have not updated it (the exe in the official-account backend was packaged from the first version of the code, while the code is now on its third version).

Note that the GUI version targets weibo.cn, the site with the ugliest interface. Do not supply cookies from weibo.com or m.weibo.cn, otherwise the following errors will occur:

encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
I/O error : encoder error
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
encoding error : input conversion failed due to input error, bytes 0xC3 0x87 0xC3 0x82
I/O error : encoder error

If you see Error: [Errno 13] Permission denied: 'comment/iayziu0ko.csv', it is because the CSV file is open in Excel while the program is still appending to it, so the file lock cannot be acquired. PyCharm's CSV plugin can be used to view the file instead.

Non-GUI feature-independent version

This folder contains three files: WeiboCommentScrapy.py, WeiboTopicScrapy.py and WeiboSuperCommentScrapy.py. The first two target weibo.cn. WeiboTopicScrapy.py has been updated to support searching by time span: for example, if a topic has 1000 pages we could normally only crawl 130 of them, but we can split the search by year-month-day (granularity of hours or finer is not supported), and each time slice is capped at 130 pages no matter how long it is, so far more of the topic can be covered.
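The splitting idea can be sketched as follows. This is not the repository's code, just an illustration; the helper name days_between and the date format are assumptions.

from datetime import date, timedelta

def days_between(start, end):
    # Yield every date between start and end (inclusive), so the topic
    # search can be issued one day at a time; the ~130-page cap then
    # applies to each day's results rather than to the whole topic.
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)

# Example: search a topic one day at a time across March 2020.
for day in days_between(date(2020, 3, 1), date(2020, 3, 31)):
    start_time = day.strftime('%Y-%m-%d')
    end_time = day.strftime('%Y-%m-%d')
    print('searching topic from {} to {}'.format(start_time, end_time))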

WeiboSuperCommentScrapy.py targets m.weibo.cn and is a comment crawler without the 100-page limit. For one post I got tens of thousands of comments, but for some posts that actually have tens of thousands of comments I could only get a few thousand, which is puzzling. When running this file you do not need to set the cookie yourself; it is obtained automatically by logging in with your account, a step I learned from a Jianshu blogger's article (www.jianshu.com/p/8dc04794e…). While debugging I also found that the parameter max_id_type increases by 1 at around page 17, and I added parsing and saving of the replies to each comment as well.

The parsing code is as follows; it extracts the fields directly from the JSON:

def info_parser(data):
    id, time, text = data['id'], data['created_at'], data['text']
    user = data['user']
    uid, username, following, followed, gender = \
        user['id'], user['screen_name'], user['follow_count'], user['followers_count'], user['gender']
    return {
        'wid': id,
        'time': time,
        'text': text,
        'uid': uid,
        'username': username,
        'following': following,
        'followed': followed,
        'gender': gender
    }
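To make the input and output concrete, here is a quick illustration of what info_parser consumes and returns. The sample values below are made up; only the field names come from the API response handled above.

sample = {
    'id': 4467107636950632,  # made-up comment id
    'created_at': 'Fri Mar 13 12:00:00 +0800 2020',
    'text': 'example comment',
    'user': {
        'id': 123456789,
        'screen_name': 'some_user',
        'follow_count': 10,
        'followers_count': 20,
        'gender': 'm',
    },
}
print(info_parser(sample))
# -> {'wid': 4467107636950632, 'time': 'Fri Mar 13 12:00:00 +0800 2020',
#     'text': 'example comment', 'uid': 123456789, 'username': 'some_user',
#     'following': 10, 'followed': 20, 'gender': 'm'}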

The fetching code is as follows:

def start_crawl(cookie_dict, id):
    base_url = 'https://m.weibo.cn/comments/hotflow?id={}&mid={}&max_id_type=0'
    next_url = 'https://m.weibo.cn/comments/hotflow?id={}&mid={}&max_id={}&max_id_type={}'
    page = 1
    id_type = 0
    comment_count = 0
    requests_count = 1
    max_id = 0  # so next_url can still be built if the very first page fails to parse
    res = requests.get(url=base_url.format(id, id), headers=headers, cookies=cookie_dict)
    while True:
        print('parse page {}'.format(page))
        page += 1
        try:
            data = res.json()['data']
            wdata = []
            max_id = data['max_id']  # cursor for the next page
            for c in data['data']:
                comment_count += 1
                row = info_parser(c)
                wdata.append(info_parser(c))
                # replies to this comment, if any
                if c.get('comments', None):
                    temp = []
                    for cc in c.get('comments'):
                        temp.append(info_parser(cc))
                        wdata.append(info_parser(cc))
                        comment_count += 1
                    row['comments'] = temp
                print(row)
            # append this page's rows to the post's CSV file
            with open('{}/{}.csv'.format(comment_path, id), mode='a+', encoding='utf-8-sig', newline='') as f:
                writer = csv.writer(f)
                for d in wdata:
                    writer.writerow([d['wid'], d['time'], d['text'], d['uid'], d['username'],
                                     d['following'], d['followed'], d['gender']])
            time.sleep(5)
        except:
            # parsing failed: dump the raw response and bump max_id_type
            print(res.text)
            id_type += 1
            print('{} comments crawled so far'.format(comment_count))
        res = requests.get(url=next_url.format(id, id, max_id, id_type),
                           headers=headers, cookies=cookie_dict)
        requests_count += 1
        if requests_count % 50 == 0:
            print(id_type)
        print(res.status_code)
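For completeness, here is a hedged sketch of how start_crawl might be driven. The cookie values and post id below are placeholders, and headers and comment_path are assumed to be module-level variables in the original file, which the snippet defines before calling the function.

import csv
import os
import time

import requests

comment_path = 'comment'
os.makedirs(comment_path, exist_ok=True)

# A plain desktop User-Agent; the real file builds its own headers.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Cookie dict from a logged-in m.weibo.cn session (placeholder values).
cookie_dict = {'SUB': '<your SUB cookie>', 'SUBP': '<your SUBP cookie>'}

# The id here is the long pure-digit id used by m.weibo.cn,
# not the short weibo.cn-style id mentioned earlier.
start_crawl(cookie_dict, '4467107636950632')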

The anti-crawling countermeasures are mainly setting the User-Agent and the Cookie. You are welcome to try it out, and if you have any questions, please leave a comment.
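As a minimal illustration of those two countermeasures (the header values are placeholders, not working credentials):

import requests

headers = {
    # A desktop browser User-Agent so the request is not flagged as a bot.
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    # Cookie copied from a logged-in browser session (placeholder value).
    'Cookie': 'SUB=<your cookie here>',
}

res = requests.get('https://weibo.cn/', headers=headers)
print(res.status_code)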