This article was first published on Zhihu
This article continues the previous one: building on multithreaded page turning, we now also crawl the secondary (detail) pages. Using the Douban Top250 as an example, and to avoid having our requests blocked for coming in too fast, we only grab 5 movies per page.
The crawler code is as follows
import requests
import time
from threading import Thread
from queue import Queue
import json
from bs4 import BeautifulSoup


def run_time(func):
    # Decorator that prints how long the decorated function took to run.
    def wrapper(*args, **kw):
        start = time.time()
        func(*args, **kw)
        end = time.time()
        print('running', end - start, 's')
    return wrapper


class Spider():

    def __init__(self):
        self.start_url = 'https://movie.douban.com/top250'
        self.qurl = Queue()        # queue of secondary-page URLs, produced and consumed concurrently
        self.data = list()         # scraped results
        self.item_num = 5          # movies taken from each page (this also caps the secondary pages) to avoid sending too many requests
        self.thread_num = 10       # number of threads that crawl secondary pages
        self.first_running = True  # True while the primary pages are still being parsed

    def parse_first(self, url):
        # Parse a primary (list) page and queue the URLs of its movie detail pages.
        print('crawling', url)
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')
        movies = soup.find_all('div', class_='info')[:self.item_num]
        for movie in movies:
            url = movie.find('div', class_='hd').a['href']
            self.qurl.put(url)

        nextpage = soup.find('span', class_='next').a
        if nextpage:
            nexturl = self.start_url + nextpage['href']
            self.parse_first(nexturl)
        else:
            self.first_running = False  # no next page: primary-page parsing is finished

    def parse_second(self):
        # Keep consuming while the primary pages are still producing URLs
        # or there are URLs left in the queue.
        while self.first_running or not self.qurl.empty():
            url = self.qurl.get()
            print('crawling', url)
            r = requests.get(url)
            soup = BeautifulSoup(r.content, 'lxml')
            mydict = {}
            title = soup.find('span', property='v:itemreviewed')
            mydict['title'] = title.text if title else None
            duration = soup.find('span', property='v:runtime')
            mydict['duration'] = duration.text if duration else None
            release_date = soup.find('span', property='v:initialReleaseDate')
            mydict['time'] = release_date.text if release_date else None
            self.data.append(mydict)

    @run_time
    def run(self):
        ths = []
        # One thread walks the primary pages and fills the URL queue.
        th1 = Thread(target=self.parse_first, args=(self.start_url, ))
        th1.start()
        ths.append(th1)
        # Several threads consume the queue and parse the secondary pages.
        for _ in range(self.thread_num):
            th = Thread(target=self.parse_second)
            th.start()
            ths.append(th)
        for th in ths:
            th.join()

        s = json.dumps(self.data, ensure_ascii=False, indent=4)
        with open('top_th1.json', 'w', encoding='utf-8') as f:
            f.write(s)
        print('Data crawling is finished.')


if __name__ == '__main__':
    Spider().run()
The overall idea is no different from the previous article. We allocate two containers: a queue for the secondary-page URLs and a list for the scraped data. A single thread works through the primary pages and keeps putting secondary-page URLs into the queue, while multiple threads take URLs off the queue and parse the secondary pages, which is where the speed-up comes from.
In addition, there is another point that needs attention.
In the previous article, the URL queue was generated in advance rather than being produced and consumed at the same time, so the crawler could simply stop once the queue was empty. Here, the primary pages and the secondary pages are parsed at the same time, which means secondary-page URLs are being produced and consumed simultaneously. We therefore have to make sure that:
- all threads can exit once page parsing is finished (if the loop were simply while True, the consuming threads would wait forever once the URL queue stays empty);
- the crawler does not quit early just because the secondary-page URLs are consumed so quickly that the queue is momentarily empty while the primary pages are still producing new ones.
For the second point, we define self.first_running: while it is True, the primary pages have not finished being parsed, so even if the secondary-page URL queue is empty at the moment, new secondary-page URLs can still appear, and the consumer threads should keep waiting rather than exit.
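To make the exit logic concrete, here is a minimal, self-contained sketch of the same producer/consumer arrangement with the crawling details stripped out. The names producer, consumer and producing are invented for the illustration, and the timeout on get() is an extra safety net that the crawler above does not use: a consumer that slips past the condition check just before the queue empties wakes up and re-checks instead of blocking forever.
from queue import Queue, Empty
from threading import Thread

q = Queue()
producing = True              # plays the role of self.first_running

def producer():
    global producing
    for i in range(20):       # stands in for parsing the primary pages
        q.put(i)              # stands in for queueing secondary-page URLs
    producing = False         # production is finished

def consumer():
    # Same exit condition as parse_second: keep going while items are still
    # being produced or the queue is not yet drained.
    while producing or not q.empty():
        try:
            item = q.get(timeout=1)   # wake up and re-check instead of blocking forever
        except Empty:
            continue
        print('handling', item)       # stands in for parsing a secondary page

threads = [Thread(target=producer)] + [Thread(target=consumer) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()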
Also, since this URL queue is a typical producer/consumer setup, we use Queue instead of a plain list, so that we do not have to implement the Condition locking ourselves.
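As a rough illustration of what Queue handles for us, a plain list shared between threads would need roughly the synchronisation below. The ListQueue class is invented purely for this sketch and is not part of the crawler.
from threading import Condition

class ListQueue:
    # A sketch of the synchronisation a plain list would need;
    # queue.Queue implements all of this (and more) internally.
    def __init__(self):
        self._items = []
        self._cond = Condition()

    def put(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()        # wake one waiting consumer

    def get(self):
        with self._cond:
            while not self._items:     # wait until a producer adds something
                self._cond.wait()
            return self._items.pop(0)

    def empty(self):
        with self._cond:
            return not self._items
queue.Queue takes care of this waiting and notification internally, and also offers extras such as maxsize, timeouts and task tracking, which is why it is the natural choice here.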
The reader can also try changing self.thread_num to see how the crawler speed changes.
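For example, something along these lines (a hypothetical snippet, not part of the article's code) prints the elapsed time for a few thread counts, since run() is already decorated with run_time; keep the request rate in mind if you try it.
for n in (1, 2, 5, 10):
    print('thread_num =', n)
    s = Spider()
    s.thread_num = n
    s.run()    # the run_time decorator prints the elapsed seconds for this setting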
Welcome to my Zhihu column
Column home: Programming in Python
Table of contents
Version description: Software and package version description