Preface

Recently I have been studying recommendation systems based on machine learning, which requires a lot of data to train the models. During model testing and verification, however, I suffered from the lack of Chinese datasets (arguably there are none at all; this is something that simply has not been done well), so I could only use foreign public recommendation datasets, the best known being the MovieLens movie rating dataset and the del.icio.us link recommendation dataset. Although we can roughly judge the merits of a recommendation model by computing the loss function, differences in language, culture and so on mean there is still a gap between how foreign users and we would rate a given movie. When I looked at the recommendation output, I was never sure whether the results were really as accurate as the loss function suggested, even when the recommended movies or links were very similar. Therefore, in order to have a Chinese dataset usable for training, this article records the process of scraping Douban movie reviews.

Website analysis

First of all, we need to analyze the website to be scraped, Douban Movies. The approach is to use a search engine and the browser's debugging tools to check whether any usable API exists; only if no API is found do we start analyzing the page structure to work out what information can be extracted.

Through a search engine I found the Douban developer platform, whose Douban Movie section documents detailed interfaces for getting movies, getting reviews and so on. Just as I was thinking the data-collection step was going to be very simple, the following image calmed me down:

If you registered as a developer on Douban before 2015, then congratulations: you can access any data you want through the API or SDK provided by Douban.

Film information acquisition

Although it is impossible to get an API key, I found in the documentation that some GET requests do not require auth, which means they can be used without an API key at all. Among them, the interface for getting the TOP250 movie list is:

http://api.douban.com/v2/movie/top250

The interface returns something like this:



It contains some detailed information about the movie, which is enough for the recommendation system.
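Before committing to any spider code, it can be worth poking at the interface by hand. A minimal sketch using the requests library, assuming the interface still answers without auth as it did when this was written:

import requests

# Ad-hoc check of the TOP250 interface mentioned above; no API key is passed,
# since (at the time of writing) this GET request did not require auth.
resp = requests.get('http://api.douban.com/v2/movie/top250')
print(resp.status_code)  # 200 if the interface is still open
print(resp.json())       # the raw JSON payload with the movie details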

In addition, using Chrome's debugging tools, I found an interface called via jQuery on the Douban Movie classification page, which is also a GET request and does not require auth.



The detailed interface is as follows. You can iterate through all movie entries by changing the start parameter:

https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start=20

This interface can return every movie indexed by Douban. In my tests, once start is raised to 10000 the returned data is already empty.
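For reference, the following is a minimal sketch of the shape of the JSON this interface returns, reconstructed only from the fields that the parse_movie callback shown later reads; the values are placeholders and the real response carries more fields than these:

import json

# Reconstructed shape of a new_search_subjects response; all values are placeholders.
sample_body = '''
{
  "data": [
    {
      "id": "1292064",
      "title": "placeholder title",
      "rate": "9.0",
      "url": "https://movie.douban.com/subject/1292064/",
      "cover": "https://example.com/poster.jpg"
    }
  ]
}
'''

for subject in json.loads(sample_body)["data"]:
    print(subject["id"], subject["title"], float(subject["rate"]))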

Film review information acquisition

With a way to obtain movie information in hand, the next step is to analyze how to obtain the review information. Each movie has a details page, such as the one for The Truman Show, which displays information about the movie along with its short comments and reviews:

https://movie.douban.com/subject/1292064/reviews    # reviews (long-form) page
https://movie.douban.com/subject/1292064/comments   # short comments page

As usual, I used the debugger to see if there was an API available, but this time I wasn’t so lucky. The review page was rendered by the server.

Without the interface, we can analyze the page, still using debugging tools:



Each review sits in a review-item div, and all the reviews sit inside a review-list div, so we can easily locate the details of each review with xpath. Below are the xpath statements for all the information; we will use them to extract the content when writing the crawler:

"//div[contains(@class,'review-list')]//div[contains(@class,'review-item')]" ". / / div [@ class = 'main - bd'] / / div [@ class = 'review - short'] / @ data - rid "author's head: ". / header [@ class = 'main - hd'] / / a [@ class = 'avator'] / / img / @ SRC "author's nickname:". / / headers / / a [@ class = 'name'] / text () "recommended degree (score) : ". / / headers / / span [contains (@ class, 'the main - the title - rating)] / @ title "film title:". / / div [@ class =' main - bd '] / / h2 / / a/text () "the reviews: [@ ". / / div class = 'main - bd'] / / div [@ class = 'short - content'] / text () "reviews details page links:". / / div [@ class = 'main - bd'] / / h2 / / a / @ href"Copy the code

We cannot get the numeric score directly; only its text description is available. After verification, the correspondence is as follows:

'力荐' (strongly recommended): 5, '推荐' (recommended): 4, '还行' (OK): 3, '较差' (fairly poor): 2, '很差' (very poor): 1

In the subsequent code we will convert this correspondence into the corresponding score.
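The parse function of the review crawler later in this article looks these labels up through a RATING_DICT that is never shown explicitly; the following is a minimal sketch of that mapping, where the Chinese strings are the rating titles Douban displays (treat the exact strings as an assumption if you adapt this):

# Hypothetical definition of the RATING_DICT used in parse_review below;
# keys are the @title values of the rating span, values are the scores.
RATING_DICT = {
    '力荐': 5,  # strongly recommended
    '推荐': 4,  # recommended
    '还行': 3,  # OK
    '较差': 2,  # fairly poor
    '很差': 1,  # very poor
}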

Implement the crawler

The analysis is now largely done and we have a way to obtain all the information we need, so next we start the concrete crawler implementation. We use Scrapy, a Python crawler framework, to simplify the development process. Installing Scrapy and setting up a virtualenv environment will not be covered in this article.

Create the project

## Create the Douban project
scrapy startproject Douban

The created project directory looks like this:



Among them:

spiders/: stores the spider code; the two crawlers we will write (movie information and review information) go in this folder.
items.py: the model classes; every piece of data that needs to be structured is defined in this file.
middlewares.py: the middlewares; our proxy downloader middleware will be defined here.
pipelines.py: the item pipelines; the code that writes scraped items to the database goes here.
settings.py: global crawler settings; if an individual crawler needs its own configuration, it can set the custom_settings attribute directly.

Movie crawler

Since there is an API interface for movie information, the data comes back already structured and we do not need to parse HTML pages.

Create a new file movies.py under spiders to define our crawler.

Since we get the movies through the API interface, there is no follow-up page parsing to do, so our crawler can inherit from Spider directly.

Define the crawler

class MovieSpider(Spider):
    name = 'movie'                      # crawler name
    allowed_domains = ["douban.com"]    # allowed domains
    # Custom crawler settings; these override the global settings
    custom_settings = {
        "ITEM_PIPELINES": {
            'Douban.pipelines.MoviePipeline': 300
        },
        "DEFAULT_REQUEST_HEADERS": {
            'accept': 'application/json, text/javascript, */*; q=0.01',
            'accept-encoding': 'gzip, deflate',
            'accept-language': 'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
            'referer': 'https://mm.taobao.com/search_tstar_model.htm?spm=719.1001036.1998606017.2.KDdsmP',
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97',
            'x-requested-with': 'XMLHttpRequest',
        },
        "ROBOTSTXT_OBEY": False  # ignore the robots.txt file
    }

Among these settings, ITEM_PIPELINES specifies the pipeline that will process the scraped items, DEFAULT_REQUEST_HEADERS makes our requests look like they come from a normal browser, and ROBOTSTXT_OBEY tells our crawler to ignore the warnings in robots.txt.

Next, start_requests tells the crawler which links to crawl:

def start_requests(self):
    url = '''https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={start}'''
    requests = []
    for i in range(500):
        request = Request(url.format(start=i*20), callback=self.parse_movie)
        requests.append(request)
    return requests

Since our earlier analysis showed that the interface stops returning data once the start parameter reaches 10000, we simply loop 500 times (500 × 20 = 10000) to construct all the links.

Next, parse what each interface returns:


def parse_movie(self, response):
    jsonBody = json.loads(response.body)
    subjects = jsonBody['data']
    movieItems = []
    for subject in subjects:
        item = MovieItem()
        item['id'] = int(subject['id'])
        item['title'] = subject['title']
        item['rating'] = float(subject['rate'])
        item['alt'] = subject['url']
        item['image'] = subject['cover']
        movieItems.append(item)
    return movieItems

In the request, we specify a parse_movie method to parse the returned content. Here we need to use an Item defined in items.py as follows:

Define the Item

class MovieItem(scrapy.Item):
    id = scrapy.Field()
    title = scrapy.Field()
    rating = scrapy.Field()
    genres = scrapy.Field()
    original_title = scrapy.Field()
    alt = scrapy.Field()
    image = scrapy.Field()
    year = scrapy.Field()

After the items are returned, Scrapy will call the Douban.pipelines.MoviePipeline we specified earlier in custom_settings to process each item; its job is to store the scraped data in the database.

Define Pipeline

class MoviePipeline(object):

    movieInsert = '''insert into movies(id,title,rating,genres,original_title,alt,image,year) values ('{id}','{title}','{rating}','{genres}','{original_title}','{alt}','{image}','{year}')'''

    def process_item(self, item, spider):

        id = item['id']
        sql = 'select * from movies where id=%s'% id
        self.cursor.execute(sql)
        results = self.cursor.fetchall()
        if len(results) > 0:
            rating = item['rating']
            sql = 'update movies set rating=%f where id=%s' % (rating, id)
            self.cursor.execute(sql)
        else:
            sqlinsert = self.movieInsert.format(
                id=item['id'],
                title=pymysql.escape_string(item['title']),
                rating=item['rating'],
                genres=item.get('genres'),
                original_title=item.get('original_title'),
                alt=pymysql.escape_string(item.get('alt')),
                image=pymysql.escape_string(item.get('image')),
                year=item.get('year')
            )
            self.cursor.execute(sqlinsert)
        return item

    def open_spider(self, spider):
        self.connect = pymysql.connect('localhost','root','******','douban', charset='utf8', use_unicode=True)
        self.cursor = self.connect.cursor()
        self.connect.autocommit(True)


    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()


In this pipeline, we connect to the MySQL database and insert each scraped item into the corresponding table.
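The article does not show the table definition itself. A minimal sketch of a movies table that fits the insert/update statements above and the current_page/total_page bookkeeping used later by the review crawler might look like this; the column names come from the SQL in this article, while the types and sizes are assumptions:

import pymysql

# Hypothetical schema for the movies table; column types and sizes are assumptions.
# current_page and total_page default to different values so the review crawler
# picks up movies whose reviews have not been crawled yet.
CREATE_MOVIES = '''
create table if not exists movies (
    id             bigint primary key,
    title          varchar(255),
    rating         float,
    genres         varchar(255),
    original_title varchar(255),
    alt            varchar(512),
    image          varchar(512),
    year           varchar(16),
    current_page   int default 0,
    total_page     int default -1
)
'''

connect = pymysql.connect(host='localhost', user='root', password='******',
                          database='douban', charset='utf8')
cursor = connect.cursor()
cursor.execute(CREATE_MOVIES)
connect.commit()
connect.close()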

Run the crawler

On the command line, enter:

scrapy crawl movie



Review crawler

The review crawler is considerably harder. The movie information came straight from an API interface: the returned data has a uniform format, there are basically no abnormal cases, and because the number of movies is limited the crawl finishes very quickly without triggering Douban's anti-crawler mechanism. While writing the review crawler, we will run into all of these problems.

The crawler logic

class ReviewSpider(Spider):
    name = "review"
    allowed_domains = ['douban.com']
    custom_settings = {
        "ITEM_PIPELINES": {
            'Douban.pipelines.ReviewPipeline': 300
        },
        "DEFAULT_REQUEST_HEADERS": {
            'connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'DNT': 1,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7',
            'cookie': 'bid=wpnjOBND4DA; ll="118159"; __utmc=30149280;',
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
        },
        "ROBOTSTXT_OBEY": False,
        # "DOWNLOAD_DELAY": 1,
        "RETRY_TIMES": 9,
        "DOWNLOAD_TIMEOUT": 10
    }

Compared with the movie crawler, custom_settings gains a few extra settings:

RETRY_TIMES: the maximum number of retries. Douban's anti-crawler mechanism limits an IP once it makes too many requests, so to get around it we crawl through proxy IPs and switch to a new one for each page. Proxy quality varies (paid proxies are better, but failures still happen), and a page could be dropped just because a proxy failed to connect; this setting makes Scrapy retry, and a request is only given up once the retry count exceeds this value. If your proxy IPs are of poor quality, increase this number.

DOWNLOAD_TIMEOUT: the download timeout, 60 seconds by default. We lower it to 10 seconds to speed up the overall crawl; otherwise, combined with RETRY_TIMES, a single bad proxy would make each retry wait up to a full minute.

DOWNLOAD_DELAY: the download delay. Set this if 403 is still returned even when using proxy IPs, since frequent visits from a single IP trigger Douban's anti-crawler mechanism.

def start_requests(self):
    self.connect = pymysql.connect('localhost', 'root', '******', 'douban', charset='utf8', use_unicode=True)
    self.cursor = self.connect.cursor()
    self.connect.autocommit(True)
    sql = "select id,current_page,total_page from movies"
    self.cursor.execute(sql)
    results = self.cursor.fetchall()
    url_format = '''https://movie.douban.com/subject/{movieId}/reviews?start={offset}'''
    for row in results:
        movieId = row[0]
        current_page = row[1]
        total_page = row[2]
        if current_page != total_page:
            url = url_format.format(movieId=movieId, offset=current_page*20)
            request = Request(url, callback=self.parse_review, meta={'movieId': movieId}, dont_filter=True)
            yield request

As usual, in start_requests we tell Scrapy which URLs to start crawling from. From our analysis, review page addresses have the following format:

https://movie.douban.com/subject/{movieId}/reviews?start={offset}

As for movieId, our previous crawler has already grabbed all the movie information, so here we read the movies back from the database, take each movieId, and construct the page link:

url = url_format.format(movieId=movieId, offset=current_page*20)

Because crawling all of Douban's reviews takes a very long time, all kinds of problems can cause the crawler to exit unexpectedly, so we need a mechanism that lets it resume from where it left off. That is what current_page and total_page are for: in the parsing step below, the current page number is saved each time a page is parsed, in case something unexpected happens.

def parse_review(self, response):
    movieId = response.request.meta['movieId']
    review_list = response.xpath("//div[contains(@class,'review-list')]//div[contains(@class,'review-item')]")
    for review in review_list:
        item = ReviewItem()
        item['id'] = review.xpath(".//div[@class='main-bd']//div[@class='review-short']/@data-rid").extract()[0]
        avator = review.xpath(".//header//a[@class='avator']/@href").extract()[0]
        item['username'] = avator.split('/')[-2]
        item['avatar'] = review.xpath("./header[@class='main-hd']//a[@class='avator']//img/@src").extract()[0]
        item['nickname'] = review.xpath(".//header//a[@class='name']/text()").extract()[0]
        item['movieId'] = movieId
        rate = review.xpath(".//header//span[contains(@class,'main-title-rating')]/@title").extract()
        if len(rate) > 0:
            rate = rate[0]
            item['rating'] = RATING_DICT.get(rate)
        item['create_time'] = review.xpath(".//header//span[@class='main-meta']/text()").extract()[0]
        item['title'] = review.xpath(".//div[@class='main-bd']//h2//a/text()").extract()[0]
        item['alt'] = review.xpath(".//div[@class='main-bd']//h2//a/@href").extract()[0]
        summary = review.xpath(".//div[@class='main-bd']//div[@class='short-content']/text()").extract()[0]
        item['summary'] = summary.strip().replace('\n', '').replace('\xa0(', '')
        yield item
    current_page = response.xpath("//span[@class='thispage']/text()").extract()
    total_page = response.xpath("//span[@class='thispage']/@data-total-page").extract()
    paginator = response.xpath("//div[@class='paginator']").extract()
    if len(paginator) == 0 and len(review_list):
        # No pagination bar but there is a review list: the movie has only one page of reviews
        sql = "update movies set current_page = 1, total_page=1 where id='%s'" % movieId
        self.cursor.execute(sql)
    elif len(paginator) and len(review_list):
        current_page = int(current_page[0])
        total_page = int(total_page[0])
        sql = "update movies set current_page = %d, total_page=%d where id='%s'" % (current_page, total_page, movieId)
        self.cursor.execute(sql)
        if current_page != total_page:
            url_format = '''https://movie.douban.com/subject/{movieId}/reviews?start={offset}'''
            next_request = Request(url_format.format(movieId=movieId, offset=current_page*20),
                                   callback=self.parse_review, dont_filter=True, meta={'movieId': movieId})
            yield next_request
    else:
        # The anti-crawler page was returned: re-queue the same request so it is retried with a new proxy
        yield response.request

Next, let's look at the parse function. Filling in the ReviewItem needs no extra explanation: the xpath statements from the earlier analysis locate each piece of content. The movieId is passed along from the start request through the Request's meta property, although you could of course also dig it out of the page itself.

current_page = response.xpath("//span[@class='thispage']/text()").extract()
total_page = response.xpath("//span[@class='thispage']/@data-total-page").extract()
paginator = response.xpath("//div[@class='paginator']").extract()

The basic code above is used to get the bottom navigation bar of the movie review page

However, the navigation bar will not be available in two cases:

1. When a movie has fewer than 20 reviews, there is only one page of reviews and no pagination bar.
2. When Douban's anti-crawler mechanism is triggered, the returned page is not a review page but a verification page, so naturally the pagination bar cannot be found.

So in the following code, I use these variables to determine the above cases:

1. If there is no pagination bar but there is a review list, the movie has only one page of reviews, so we just save current_page and total_page (both 1) to the movies table; there are no remaining reviews to crawl.
2. If neither is found, the anti-crawler mechanism was triggered and the returned page does not contain our data. Simply ignoring it would lose a lot of data (this situation is common), so we re-yield the same request and let Scrapy crawl the page again; since every crawl goes through a fresh proxy IP, there is a good chance the next attempt succeeds. One thing to note: Scrapy filters out duplicate requests by default, so we have to set dont_filter=True when constructing the Request so that repeated links are not filtered.
3. In the normal case, we use xpath to read the page numbers, construct the link of the next review page, and yield a new Request for Scrapy to crawl.

Review downloader middleware

As mentioned above, we bypass Douban's anti-crawler mechanism by fetching review pages through proxy IPs. The proxy is configured in a downloader middleware:

import random
from urllib import parse

import requests
from scrapy import signals


class DoubanDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    ip_list = None

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        # Refill the proxy pool from the online proxy interface when it runs dry
        if self.ip_list is None or len(self.ip_list) == 0:
            response = requests.request('get', 'http://api3.xiguadaili.com/ip/?tid=555688914990728&num=10&protocol=https').text
            self.ip_list = response.split('\r\n')

        # Pick a random proxy for this request and drop it from the pool
        ip = random.choice(self.ip_list)
        request.meta['proxy'] = "https://" + ip
        print("Current proxy: %s" % ip)
        self.ip_list.remove(ip)
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest

        if response.status == 403:
            # Blocked by the anti-crawler check: recover the original URL from the
            # "r" query parameter of the blocked request and re-queue it
            res = parse.urlparse(request.url)
            res = parse.parse_qs(res.query)
            url = res.get('r')
            if url and len(url) > 0:
                request = request.replace(url=res['r'][0])
            return request

        return response

Scrapy calls process_request before each page is downloaded and process_response after the download returns. In the former we call an online proxy-IP interface, take a proxy IP, and set the request's proxy meta key to switch proxies for that request (you could also read proxy IPs from a file). In the latter we check for a 403 status, which means the request was blocked by the anti-crawler detection; all we need to do is repackage the blocked request address and put it back into Scrapy's crawl queue.
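One thing not shown above is enabling the middleware: for Scrapy to actually call it, it has to be registered under DOWNLOADER_MIDDLEWARES, either globally in settings.py or per spider via custom_settings. A minimal sketch, where the dotted path is an assumption based on the Douban project layout created earlier:

# settings.py (or a spider's custom_settings); the path assumes the middleware
# lives in Douban/middlewares.py as generated by scrapy startproject.
DOWNLOADER_MIDDLEWARES = {
    'Douban.middlewares.DoubanDownloaderMiddleware': 543,
}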

Review Item

class ReviewItem(scrapy.Item):
    id = scrapy.Field()
    username = scrapy.Field()
    nickname = scrapy.Field()
    avatar = scrapy.Field()
    movieId = scrapy.Field()
    rating = scrapy.Field()
    create_time = scrapy.Field()
    title = scrapy.Field()
    summary = scrapy.Field()
    alt = scrapy.Field()

There is nothing special to explain here; save whichever fields you want.

Review Pipeline


class ReviewPipeline(object):

    reviewInsert = '''insert into reviews(id,username,nickname,avatar,summary,title,movieId,rating,create_time,alt) values ("{id}","{username}", "{nickname}","{avatar}", "{summary}","{title}","{movieId}","{rating}","{create_time}","{alt}")'''

    def process_item(self, item, spider):
        sql_insert = self.reviewInsert.format(
            id=item['id'],
            username=pymysql.escape_string(item['username']),
            nickname=pymysql.escape_string(item['nickname']),
            avatar=pymysql.escape_string(item['avatar']),
            summary=pymysql.escape_string(item['summary']),
            title=pymysql.escape_string(item['title']),
            rating=item['rating'],
            movieId=item['movieId'],
            create_time=pymysql.escape_string(item['create_time']),
            alt=pymysql.escape_string(item['alt'])
        )
        print("SQL:", sql_insert)
        self.cursor.execute(sql_insert)
        return item

    def open_spider(self, spider):
        self.connect = pymysql.connect('localhost','root','******','douban', charset='utf8', use_unicode=True)
        self.cursor = self.connect.cursor()
        self.connect.autocommit(True)


    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()
Similar to the movie pipeline before it, this is a basic database write operation.

Run the crawler

scrapy crawl review

As I write this, the review crawler is still running:



Checking the database, there are already 970,000 rows:



If you find my article helpful, please sponsor a cup of ☕️.