Introduction to Scrapy

Recently, some data-acquisition work at my job led me to study Scrapy properly. This post summarizes a short introduction to Scrapy along with some practical experience.

Since I was thinking about buying a home in Wuhan, I collected some listing data from Beike (贝壳找房). The specific fields can be stored in a database or a file. Here are the fields that I defined.

After the data is finally collected, some analysis is usually performed on it. Writing SQL statements directly works fine, but a friendlier and more intuitive approach is a visual display. Therefore, Pandas is used to process the data and Pyecharts to display it. Here is a simple rendering.

Installing Scrapy

If you have worked with Python before, you can skip this step.

pip install scrapy

The scrapy command-line tool is very convenient: it creates the project and can generate spider templates with a single command.

scrapy genspider

The final project structure looks like this:
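A typical layout after generation would look roughly like the sketch below, assuming the project is named BeikeSpider (as the middleware paths in settings.py later suggest); the spider file beike_wuhan.py is added by hand.

BeikeSpider/
    scrapy.cfg
    BeikeSpider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            beike_wuhan.py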

Using Scrapy

The generated files all serve a purpose. The data fields are defined in items.py.

import scrapy
class BeikeWuhanItem(scrapy.Item):
    name = scrapy.Field()
    lp_type = scrapy.Field()
    image = scrapy.Field()
    block = scrapy.Field()
    address = scrapy.Field()
    room_type = scrapy.Field()
    spec = scrapy.Field()
    ava_price = scrapy.Field()
    total_range = scrapy.Field()
    tags = scrapy.Field()
    detail_url = scrapy.Field()
    create_time = scrapy.Field()


You can see that the fields here are exactly the same as those defined in the database.

settings.py defines various configuration options. You can follow the links in the comments to find out how each one is used; Scrapy is well documented and beginner-friendly.

# Wait 10 seconds between downloads; requesting too often will get you blocked, unless you use a proxy pool
DOWNLOAD_DELAY = 10
# The number after each middleware represents its order (priority)
DOWNLOADER_MIDDLEWARES = {
   'BeikeSpider.middlewares.IPProxyMiddleware': 100,
   'BeikeSpider.middlewares.BeikespiderDownloaderMiddleware': 543,
}
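The two middleware classes registered above live in middlewares.py (BeikespiderDownloaderMiddleware is presumably the one generated by the project template). As a rough sketch of what an IP-proxy downloader middleware might look like, with placeholder proxy addresses that a real proxy pool would replace with live ones:

import random


class IPProxyMiddleware:
    # Placeholder addresses; replace with proxies taken from your own proxy pool
    PROXIES = [
        'http://127.0.0.1:8001',
        'http://127.0.0.1:8002',
    ]

    def process_request(self, request, spider):
        # Attach a randomly chosen proxy to every outgoing request
        request.meta['proxy'] = random.choice(self.PROXIES)
        return None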

The processing of the collected data is defined in pipelines.py.


import pymysql
import pymysql.cursors

# adjust these import paths to your project layout; the MySQL connection
# constants are assumed to be defined in settings.py
from BeikeSpider.items import BeikeWuhanItem
from BeikeSpider.settings import MYSQL_HOST, MYSQL_USER, MYSQL_PASS, MYSQL_DB


class BeikespiderPipeline:
    def __init__(self):
        self.connection = pymysql.connect(
            host=MYSQL_HOST,
            user=MYSQL_USER,
            password=MYSQL_PASS,
            db=MYSQL_DB,
            charset="utf8mb4",
            cursorclass=pymysql.cursors.DictCursor,
        )

    def process_item(self, item, spider):
        # Only handle items of the type defined in items.py
        if isinstance(item, BeikeWuhanItem):
            self.insert(item)
        return item

    def insert(self, item):
        # Build the INSERT statement dynamically from the item's fields
        cursor = self.connection.cursor()
        keys = item.keys()
        values = tuple(item.values())
        fields = ",".join(keys)
        temp = ",".join(["%s"] * len(keys))
        sql = "INSERT INTO wh_loupan (%s) VALUES (%s)" % (fields, temp)
        cursor.execute(sql, values)
        self.connection.commit()


The logic here writes the collected data to MySQL, but you could just as well write it to a file or to Excel. Note that after the pipeline is written, it must be registered in settings.py.
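A minimal sketch of that registration, again assuming the project module is named BeikeSpider (the number is the pipeline's priority; lower values run first):

# settings.py
ITEM_PIPELINES = {
    'BeikeSpider.pipelines.BeikespiderPipeline': 300,
}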

Finally, there is your own crawling logic. The file beike_wuhan.py is not generated automatically, so you need to create it yourself. This is the entry point of the crawler. The spider class declares the crawler's name and which URL to start from, and the start_requests method, overridden from the parent Spider class, issues the first request. The request is handed over to the Scrapy engine with the yield keyword, along with a callback that specifies what to do once the response has been downloaded. The parse method is likewise declared in the parent class and overridden here; it receives the response as its argument.

XPath is the quickest way to parse page data, but if deeply nested hierarchies make your eyes glaze over, consider one of the other selectors (such as CSS). I won't go into detail on how to use XPath; it is convenient to open the developer tools in your browser, select an element and copy its XPath. Extract the element information and fill a BeikeWuhanItem with the values. This process is fairly tedious and takes some patience. Items are also produced with yield. Each item is passed to the pipeline, where it is handled directly by the process_item function.
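A quick way to experiment with XPath expressions before writing them into parse is Scrapy's interactive shell; the URL below is simply the listing page used in this post, and the expression is the one from the spider code further down.

# open an interactive shell against the listing page
scrapy shell "https://wh.fang.ke.com/loupan/nhs1pg1"

# then, inside the shell, try selectors against the downloaded response, e.g.
response.xpath('/html/body/div[6]/ul[2]/li')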

Because you are not going to crawl only this one page, you need to keep going, which brings in page turning. The Beike page does have a next-page button, but it is rendered dynamically by JavaScript, so the button element is not present in the HTML returned when you request the address directly. It is worth noting that the simplest check is to look at the page source: if you can find the element there, you can get it through XPath; otherwise you cannot get it at all. So instead of checking for a next-page button, you have to decide whether to keep crawling from the total count. Observation shows that once you pass the last page, the total-count element on the page drops to zero, so there is no need to keep turning pages. The URL of the next page is again passed to the engine with yield, somewhat like a for loop that keeps downloading and parsing. There is also a bug on the Beike site: it claims 80 pages, but in fact there is no data beyond roughly page 40. When I first estimated the total number of pages I assumed a fixed page size, but some pages actually return 10 items and others 20, which is rather unsporting.

import datetime
import random
import re

from scrapy import Request, Spider

# import the item defined in items.py (adjust the module path to your project name)
from BeikeSpider.items import BeikeWuhanItem

# patterns used to tell room-type entries apart from the floor-area entry (ending in ㎡)
p_room_type = re.compile(r'[0-9]+')
p_room_spec = re.compile(r'[\s\S]*㎡$')

root_url = 'https://wh.fang.ke.com/'
default_schema = 'https:'


class BeikeWuhanSpider(Spider):
    name = "beike_wuhan"
    allowed_domains = ["wh.fang.ke.com"]

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.curr_page = 1
        self.url_template = "https://wh.fang.ke.com/loupan/nhs1pg%s"

    def start_requests(self):
        # Home page of Wuhan new-home listings
        baseurl = "https://wh.fang.ke.com/loupan/nhs1pg1"

        UserAgents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.62',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107',
        ]
        UserAgent = random.choice(UserAgents)

        headers = {'User-Agent': UserAgent}
        yield Request(
            url=baseurl,
            headers=headers,
            callback=self.parse)

    def parse(self, response):
        # The next-page button is rendered by JS, so it cannot be selected like this:
        # next_page = response.xpath('/html/body/div[7]/div[2]/')
        # if next_page is not None:
        #     next_page = response.urljoin(next_page)
        #     yield Request(next_page, callback=self.parse)
        # The page size is not fixed either: some pages have 10 items, others 20
        content = response.xpath('/html/body/div[6]/ul[2]/li')
        for item in content:
            image = item.xpath('./a/img/@src')[0].extract()
            name = item.xpath('./div[1]/div[1]/a/@title')[0].extract()
            detail_url = item.xpath('./div[1]/div[1]/a/@href')[0].extract()
            lp_type = item.xpath('./div[1]/div[1]/span[2]/text()').extract_first()
            address = item.xpath('./div[1]/a[1]/text()')[1].extract()
            block = ''
            if address:
                block = address.split('/')[0]
            room_type_texts = item.xpath('./div[1]/a[2]/span/text()').extract()
            room_types = []
            room_spec = ''
            for room_type_text in room_type_texts:
                if re.match(p_room_type, room_type_text):
                    room_types.append(room_type_text)
                if re.match(p_room_spec, room_type_text):
                    room_spec = room_type_text.split(' ')[1]

            room_type = '/'.join(room_types)
            t = item.xpath('./div[1]/div[3]/span/text()').getall()
            tags = ', '.join(t)

            ava_price = item.xpath('./div[1]/div[4]/div[1]/span[1]/text()').extract_first()
            total = item.xpath('./div[1]/div[4]/div[2]/text()').extract_first()
            if not total:
                total = ''

            wuhan_item = BeikeWuhanItem()
            wuhan_item['name'] = name
            wuhan_item['lp_type'] = lp_type
            wuhan_item['image'] = default_schema + image
            wuhan_item['block'] = block
            wuhan_item['address'] = address
            wuhan_item['room_type'] = room_type
            wuhan_item['spec'] = room_spec
            wuhan_item['ava_price'] = ava_price
            wuhan_item['total_range'] = total
            wuhan_item['tags'] = tags
            wuhan_item['detail_url'] = root_url + detail_url
            wuhan_item['create_time'] = datetime.datetime.now()
            yield wuhan_item

        # If the total number of buildings is 0, there is no next page
        count = response.xpath('/html/body/div[6]/div[2]/span[2]/text()').get()
        print('=' * 60, count)
        if count and int(count) > 0:
            self.curr_page += 1
            next_page = self.url_template % self.curr_page
            yield Request(next_page, callback=self.parse)
        else:
            self.crawler.engine.close_spider(self, 'Mission accomplished')

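With the spider written, it can be started from the project root with Scrapy's standard crawl command, using the name declared on the spider class:

scrapy crawl beike_wuhan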

Beike rate-limits frequent requests, so it is best to use a proxy pool; otherwise you will quickly get banned and have to click the "I'm not a robot" checkbox to lift the restriction. At this point the whole acquisition process has been described. The next step is to process and present the data.

Pandas Data Processing

Pandas can read records from the database in just a few lines of code.

import pandas as pd
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
from pyecharts.charts import Bar

plt.rcParams['font.sans-serif'] = 'SimHei'
engine = create_engine('mysql+pymysql://root:123456@localhost:3306/beike_spider')
sql = '''select * from wh_loupan;'''
# Load data to PD
df = pd.read_sql_query(sql, engine)


Then we filter the data, because the collected listings cover many building types, such as office buildings, shops and so on. For now I only care about residential buildings, so I filter with this operation.

# Filter out all other types and analyze only residential listings, keeping just 4 columns
houses = df.loc[df["lp_type"] == "Home", ['id', 'name', 'ava_price', 'block']]

Then again, why not just write the filter criteria in SQL? Sure, but this is to show pandas in action.

Then group by district and look at how the residential buildings are distributed across the districts.


# This returns a Series indexed by district, with the number of buildings in each district as its values
group_num = houses.groupby("block").size()
index = []
val = []
for i in group_num.items():
    index.append(i[0])
    val.append(i[1])

bar = Bar()
bar.add_xaxis(index)
bar.add_yaxis("Number of Buildings", val)
# render() generates a local HTML file; by default it writes render.html in the current directory
# A path can also be passed, e.g. bar.render("mychart.html")
bar.render("./figures/estate_distribution.html")


Finally, the chart is generated and presented intuitively.
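As an aside, matplotlib was imported and the SimHei font configured in the earlier snippet; if you only want a quick local plot rather than an HTML file, pandas' built-in plotting can use it directly. A minimal sketch, reusing the group_num Series and plt from above:

# Quick alternative: let pandas plot the grouped Series with matplotlib
group_num.plot(kind='bar', title='Number of buildings per district')
plt.show()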

The complete code can be found at github.com/Mr-Vincent/…

Conclusion

Previously I used the requests library to fetch pages directly, which meant handling connections, parsing, retries and other non-business dirty work myself, and the code grew uglier and uglier. Switching to Scrapy feels instantly refreshing, and efficiency has improved a lot. After getting used to a strongly typed language like Java, writing a dynamically typed language like Python can be frustrating. For example, you know your divisor is a number, yet the interpreter tells you it is a string. When calling a method you are not always sure how to pass the arguments, given all the fancy parameter lists. It takes some time for someone coming from a statically typed language to change their programming mindset.

For more not-so-wonderful content, follow my WeChat official account (not that following it guarantees you will get any).

References

Pandas documentation
Scrapy Tutorial