Author: xiaoyu

Python Data Science

Python data analyst


Recap of previous posts

Some time ago I shared a hands-on project on analyzing second-hand house prices in Beijing, split into an analysis part and a modeling part. The articles were warmly received after publication, for which I am grateful.

  • Data analysis in practice: Beijing second-hand housing price analysis

  • Data analysis in practice: Beijing second-hand housing price analysis (modeling)

Besides the data analysis, many readers were also particularly interested in the crawler and wanted to know how that part was implemented. This article shares the crawler part of the project, the prequel to the data analysis.

Thoughts before crawling

The crawler obtains second-hand housing listings by crawling chain X and An X. Two websites were chosen because their listing data complement each other.

The target is second-hand houses in Beijing. Since only one city is covered, the amount of data is not large, so Scrapy is used for the crawling and the results are stored in CSV files. In the end the chain X crawler collected 30,000+ records and the An X crawler 3,000+ records; chain X clearly has the more complete inventory.
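The project's storage code is not shown in this article; for reference, a minimal way to get CSV output from Scrapy is its built-in feed export (the file name below is just a placeholder):

# settings.py (sketch): write scraped items straight to a CSV file.
# Scrapy >= 2.1 supports the FEEDS setting; older versions can use
# FEED_URI / FEED_FORMAT, or simply run: scrapy crawl lianjia -o lianjia.csv
FEEDS = {
    'lianjia.csv': {'format': 'csv'},
}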

Scrapy crawls chain X

Before writing a crawler, you first have to be clear about what data you want. This project cares about data on second-hand housing, and the natural thought is that the detail page of each listing carries the most complete information. However, crawling every detail link increases crawl depth and hurts overall efficiency, and the data on the listing pages already meets the basic requirements, so the listing pages were chosen as the crawl target. Below is the housing information on a listing page (partial screenshot):

After deciding what to crawl, the crawler work can begin. We first define a subclass of scrapy.Item in items.py, then use scrapy.Field() to declare a field in the subclass for each piece of information above. The following code sets up all the required fields.

import scrapy

class LianjiaSpiderItem(scrapy.Item):
    # define the fields for your item here like:
    Id = scrapy.Field()
    Region = scrapy.Field()
    Garden = scrapy.Field()
    Layout = scrapy.Field()
    Size = scrapy.Field()
    Direction = scrapy.Field()
    Renovation = scrapy.Field()
    Elevator = scrapy.Field()
    Floor = scrapy.Field()
    Year = scrapy.Field()
    Price = scrapy.Field()
    District = scrapy.Field()
    pass

In the crawler file (custom-named) under the spiders folder, import the required libraries:

  • json: converting JSON strings;

  • scrapy: the Scrapy framework;

  • logging: logging;

  • BeautifulSoup: extracting page information with bs4;

  • table: a dictionary defined in settings;

  • LianjiaSpiderItem: the field definitions above.

# -*- coding:utf-8 -*-
import json
import scrapy
import logging
from bs4 import BeautifulSoup
from lianjia_spider.settings import table
from lianjia_spider.items import LianjiaSpiderItem

Now for the key part: the spider itself. The main work left to us is the parsing; we do not need to care about how pages are fetched, because the framework already implements scheduling and downloading under the hood, and we simply call into it.

For details, see the earlier Python crawler article on the Scrapy framework.

The parsing is done in LianjiaSpider, a subclass of scrapy.Spider. The subclass contains three functions, which parse the site layer by layer through callbacks:

  • start_requests: overrides the parent-class method, requests the initial URLs and puts them into the scheduling queue;

  • page_navigate: parses each initial URL page and iterates over all page-number links under it;

  • parse: crawls every listing under each page number, extracts the corresponding fields, and stores them in items.

The three functions are described below, together with their code.

start_requests

Every crawler needs initial URLs, from which it keeps following further URLs until the desired data is reached. Since a chain X second-hand-housing URL is the base URL concatenated with a region's pinyin, the start_requests function defines base_url and the list of Beijing region pinyin names to concatenate onto it.

Each link is then requested asynchronously with the scrapy.Request method, as follows:

class LianjiaSpider(scrapy.Spider):
    name = 'lianjia'
    base_url = 'https://bj.lianjia.com/ershoufang/'

    def start_requests(self):
        district = ['dongcheng', 'xicheng', 'chaoyang', 'haidian', 'fengtai',
                    'shijingshan', 'tongzhou', 'changping', 'daxing',
                    'yizhuangkaifaqu', 'shunyi', 'fangshan', 'mentougou',
                    'pinggu', 'huairou', 'miyun', 'yanqing', 'yanjiao', 'xianghe']
        for elem in district:
            region_url = self.base_url + elem
            yield scrapy.Request(url=region_url, callback=self.page_navigate)

page_navigate

After each region URL has been requested asynchronously, we need to go on and crawl all listing URLs within each region, and to get everything we have to handle pagination. In the page_navigate function, BeautifulSoup parses the HTML and extracts the page data from the page.

For details on how to use BeautifulSoup, see the earlier Python crawler article on BeautifulSoup parsing.

The page data is a JSON string, so it has to be converted to a dictionary with json.loads before max_number can be read. Finally, a for loop sends a request for each page-number URL and uses callback to hand off to the next function.
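As an illustration (the exact attribute contents are my assumption; only the totalPage key is relied on below), the pagination element carries a JSON string roughly like this:

import json

# Hypothetical value of the pagination div's "page-data" attribute
page_data = '{"totalPage": 100}'
max_number = json.loads(page_data)['totalPage']
print(max_number)  # 100

The actual page_navigate code is as follows: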


    def page_navigate(self, response):
        soup = BeautifulSoup(response.body, "html.parser")
        try:
            pages = soup.find_all("div", class_="house-lst-page-box")[0]
            if pages:
                dict_number = json.loads(pages["page-data"])
                max_number = dict_number['totalPage']
                for num in range(1, max_number + 1):
                    url = response.url + 'pg' + str(num) + '/'
                    yield scrapy.Request(url=url, callback=self.parse)
        except:
            logging.info("******* No information about second-hand houses in this area ********")


parse

In the parse function, BeautifulSoup first parses out house_info_list, all the listings under the current page number. The chain X listing page contains no region information, yet the region is very important for the later data analysis and cannot be obtained from the page content alone. So how do we get this field?

We can infer it from response.url: the URL is exactly the one we built by concatenating the region pinyin, so the region information is already baked into it. Simply recognizing the region pinyin in the URL solves the problem. The dictionary table is then used to map the pinyin to the Chinese region name, which is stored in the Region field.
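The contents of table are not shown in this article; a minimal sketch of what it could look like in settings.py (my own illustration, not the original code) is:

# settings.py (sketch): map the region pinyin that appears in the URL
# to the Chinese region name stored in the Region field.
table = {
    'dongcheng': '东城',
    'xicheng': '西城',
    'chaoyang': '朝阳',
    'haidian': '海淀',
    # ... one entry for each region in the district list above
}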

Next, we parse each house info entry in house_info_list. From the page structure of chain X, each entry contains three groups of information, which can be located via the class_ parameter. The three groups are house_info, position_info and price_info, and each group contains its related fields.

  • house_info: contains the building-related fields Garden, Size, Layout, Direction, Renovation and Elevator;

  • position_info: contains the location and age fields Floor, Year, District, etc.;

  • price_info: contains the price fields, such as the total price stored in Price.

The distinction between these groups is simply the position of their tags in the front-end HTML page.

The specific operations are shown in the following code:


    def parse(self, response):
        item = LianjiaSpiderItem()
        soup = BeautifulSoup(response.body, "html.parser")

        # Get all listing entries on the page
        house_info_list = soup.find_all(name="li", class_="clear")

        # Identify the region from the URL
        url = response.url
        url = url.split('/')
        item['Region'] = table[url[-3]]

        for info in house_info_list:
            item['Id'] = info.a['data-housecode']

            house_info = info.find_all(name="div", class_="houseInfo")[0]
            house_info = house_info.get_text()
            house_info = house_info.replace(' ', '')
            house_info = house_info.split('/')
            # print(house_info)
            try:
                item['Garden'] = house_info[0]
                item['Layout'] = house_info[1]
                item['Size'] = house_info[2]
                item['Direction'] = house_info[3]
                item['Renovation'] = house_info[4]
                if len(house_info) > 5:
                    item['Elevator'] = house_info[5]
                else:
                    item['Elevator'] = ' '
            except:
                print("Data saving error")

            position_info = info.find_all(name='div', class_='positionInfo')[0]
            position_info = position_info.get_text()
            position_info = position_info.replace(' ', '')
            position_info = position_info.split('/')
            # print(position_info)
            try:
                item['Floor'] = position_info[0]
                item['Year'] = position_info[1]
                item['District'] = position_info[2]
            except:
                print("Data saving error")

            price_info = info.find_all("div", class_="totalPrice")[0]
            item['Price'] = price_info.span.get_text()

            yield item


For the chain X crawl, xpath is not used because it was not very convenient for extracting some of the tags (on chain X specifically), so I used BeautifulSoup instead.

Scrapy crawls An X

For details, see the earlier article on the An X crawl and visual data analysis.

The following is the core crawler code, which works the same way as the chain X crawler. The difference is that xpath is used for parsing and ItemLoader is used to load and store the items.

# -*- coding:utf-8 -*-

import scrapy
from scrapy.loader import ItemLoader
from anjuke.items import AnjukeItem

class AnjukeSpider(scrapy.Spider):
    name = 'anjuke'
    custom_settings = {
        'REDIRECT_ENABLED': False
    }
    start_urls = ['https://beijing.anjuke.com/sale/']

    def start_requests(self):
        base_url = 'https://beijing.anjuke.com/sale/'
        for page in range(1, 51):
            url = base_url + 'p' + str(page) + '/'
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        num = len(response.xpath('//*[@id="houselist-mod-new"]/li').extract())
        house_info = response.xpath('//*[@id="houselist-mod-new"]')
        print(house_info)
        for i in range(1, num + 1):
            l = ItemLoader(AnjukeItem(), house_info)

            l.add_xpath('Layout', '//li[{}]/div[2]/div[2]/span[1]/text()'.format(i))
            l.add_xpath('Size', '//li[{}]/div[2]/div[2]/span[2]/text()'.format(i))
            l.add_xpath('Floor', '//li[{}]/div[2]/div[2]/span[3]/text()'.format(i))
            l.add_xpath('Year', '//li[{}]/div[2]/div[2]/span[4]/text()'.format(i))
            l.add_xpath('Garden', '//li[{}]/div[2]/div[3]/span/text()'.format(i))
            l.add_xpath('Region', '//li[{}]/div[2]/div[3]/span/text()'.format(i))
            l.add_xpath('Price', '//li[{}]/div[3]/span[1]/strong/text()'.format(i))

            yield l.load_item()
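The AnjukeItem definition is not shown in this article. A minimal sketch (my own, not the original project's code) could look like the following; since ItemLoader collects extracted values into lists by default, TakeFirst is used so each field ends up as a single string:

# items.py (sketch) for the An X crawler - an assumption, not the original code
import scrapy
from itemloaders.processors import MapCompose, TakeFirst
# On older Scrapy versions: from scrapy.loader.processors import MapCompose, TakeFirst

def _clean(value):
    # Strip surrounding whitespace from each extracted string
    return value.strip()

class AnjukeItem(scrapy.Item):
    Layout = scrapy.Field(input_processor=MapCompose(_clean), output_processor=TakeFirst())
    Size = scrapy.Field(input_processor=MapCompose(_clean), output_processor=TakeFirst())
    Floor = scrapy.Field(input_processor=MapCompose(_clean), output_processor=TakeFirst())
    Year = scrapy.Field(input_processor=MapCompose(_clean), output_processor=TakeFirst())
    Garden = scrapy.Field(input_processor=MapCompose(_clean), output_processor=TakeFirst())
    Region = scrapy.Field(input_processor=MapCompose(_clean), output_processor=TakeFirst())
    Price = scrapy.Field(input_processor=MapCompose(_clean), output_processor=TakeFirst())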

Note that An X's anti-crawling is strict: without a proxy IP pool it is very easy for requests to fail if you crawl too fast. Chain X, by contrast, is more lenient and can be crawled quickly.
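If you do not have a proxy pool, one simple mitigation (my own suggestion, not part of the original project) is to slow the crawl down through Scrapy's settings:

# settings.py (sketch): throttle the crawl instead of using a proxy pool
DOWNLOAD_DELAY = 2                 # seconds to wait between requests
RANDOMIZE_DOWNLOAD_DELAY = True    # jitter the delay so requests look less uniform
CONCURRENT_REQUESTS_PER_DOMAIN = 1
AUTOTHROTTLE_ENABLED = True        # let Scrapy adapt the delay to server load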

Conclusion

The above covers the core of the crawler part of this project. With it, the project's "trilogy" is complete: from crawling, to data analysis, to predictive modeling. Although the project is fairly simple and there is still plenty of room for improvement, I hope it gives you a good picture of the whole process.

Find out more about Python Data Science.