In the last video we built the environment and the battery; now we just need the prey and the weapon.

1. Selecting the Prey

Here we choose the Xici proxy (xicidaili) site as the example project, for the following two reasons:

  • The Xici proxy data is well structured and easy to crawl, which makes it well suited as a demonstration project
  • Proxy IPs may also come in handy in our own crawlers (availability is a bit low, but if you are not crawling at high volume they are usually fine for personal use)

Prey URL

http://www.xicidaili.com/nn

Note: although Xici claims to provide the only free proxy IP API on the entire web, it does not seem to be usable, since it returns no data at all… so we will do a little work ourselves.

2. Our Goal

Here is the initial plan for this Scrapy project:

  • Crawl the free domestic high-anonymity proxy IPs from the Xici proxy website
  • Save the crawled proxy IP into the MySQL database
  • Obtain the latest IPs by crawling in a loop, and deduplicate in a simple way by defining a unique key on the database table

3. Target Analysis

As the saying goes, know yourself and know your enemy. Whether that wins us every battle we will not worry about yet. Let's open the site in Chrome; it looks something like this.

Web page structure analysis

  • Browse the target site and pick out the data we need. From the page we can see that the Xici domestic high-anonymity list shows the country, IP, port, server location, anonymity, type, speed, connection time, survival time and verification time.
  • Right-click in the field-name (blue) row of the data area and choose Inspect; we find that the page data is rendered with an HTML table layout.

  • Scroll to the bottom of the page and you will see that the data is paginated. Clicking the next-page button shows that the page number is appended to the URL in a regular way, carrying the number of the current page, for example:
http://www.xicidaili.com/nn/3
  • The country is shown as a flag, and the speed and connection time are shown as two colored bars; at first glance these three pieces of information look tricky to extract.

At this point, move the mouse over one of the small flags on the page and right-click -> Inspect. We find that it is an img tag whose alt attribute contains the country code. So the country information is available to us after all.

Then move the mouse to a speed bar and right-click -> Inspect. We find a div whose title holds a value like 0.876 seconds, and the connection time follows the same pattern. So we are now fairly sure that every data item we need can be obtained, and we can get more IPs by turning pages.
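
Before writing any spider code, we can sanity-check these observations in a scrapy shell session. The following is only a minimal sketch, assuming the data table still carries the id "ip_list"; the same XPath expressions are reused in the spider later.

# Run: scrapy shell http://www.xicidaili.com/nn/1
# Inside the shell, the crawled page is already available as `response`.
rows = response.xpath('//table[@id="ip_list"]/tr')

row = rows[1]                                            # rows[0] is the header row
print(row.xpath('td[1]/img/@alt').extract_first())       # country code from the flag's alt attribute
print(row.xpath('td[7]/div[1]/@title').extract_first())  # speed, e.g. "0.876 seconds", from the div's title
print(row.xpath('td[8]/div[1]/@title').extract_first())  # connection time follows the same pattern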

4. Preparing Weapons and Ammunition

4.1 Preparing Weapons

Our weapon is, of course, Scrapy.

4.2 Preparing Ammunition

The ammunition includes:

  • pymysql (a Python library for operating MySQL)
  • fake-useragent (to disguise our identity)
  • pywin32

5. Opening Fire

5.1 Creating a Virtual Environment

C:\Users\jiang>workon
Pass a name to activate one of the following virtualenvs:
==============================================================================
test

C:\Users\jiang>mkvirtualenv proxy_ip
Using base prefix 'd:\\program files\\python36'
New python executable in D:\ProgramData\workspace\python\env\proxy_ip\Scripts\python.exe
Installing setuptools, pip, wheel... done.

5.2 Installing third-party Libraries

Note: in the following commands, a "(proxy_ip)" prefix indicates that the command is executed inside the proxy_ip Python virtual environment created above.

1 Install Scrapy

(proxy_ip) C:\Users\jiang>pip install scrapy -i https://pypi.douban.com/simple

If the Scrapy installation fails with an error like the following:

Error: Microsoft Visual C++ 14.0 is required. Get it with"Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

The solution can be as follows:

  • Download the Twisted .whl package from the following address:
https://www.lfd.uci.edu/~gohlke/pythonlibs/

Open the URL above, search for Twisted, and download the latest .whl file matching your Python version and Windows architecture, as in the following example.

As shown in the figure above, Twisted-18.4.0-cp36-cp36m-win_amd64.whl is the 64-bit Windows build for Python 3.6, which matches my environment. Download the file, locate it, and drag it into the CMD window to install it, for example:

(proxy_ip) C:\Users\jiang>pip install D:\ProgramData\Download\Twisted-18.4.0-cp36-cp36m-win_amd64.whl

Execute the following command to install scrapy

(proxy_ip) C:\Users\jiang>pip install scrapy -i https://pypi.douban.com/simple

2 Install pymysql

(proxy_ip) C:\Users\jiang>pip install pymysql -i https://pypi.douban.com/simple

3 Install fake-useragent

(proxy_ip) C:\Users\jiang>pip install fake-useragent -i https://pypi.douban.com/simple

4 Install pywin32

(proxy_ip) C:\Users\jiang>pip install pywin32 -i https://pypi.douban.com/simple

5.3 Creating a Scrapy Project

1 (Optional) Creating a working directory

We can create a dedicated directory for Python project files. For example, I create a python_projects directory under my user directory; you can use any name, or any location you can find again later.

(proxy_ip) C:\Users\jiang>mkdir python_projects

2 Create the Scrapy project

Go to the working directory and run the command that creates a Scrapy project:

(proxy_ip) C:\Users\jiang>cd python_projects

(proxy_ip) C:\Users\jiang\python_projects>scrapy startproject proxy_ip
New Scrapy project 'proxy_ip', using template directory 'd:\\programdata\\workspace\\python\\env\\proxy_ip\\lib\\site-packages\\scrapy\\templates\\project', created in:
    C:\Users\jiang\python_projects\proxy_ip

You can start your first spider with:
    cd proxy_ip
    scrapy genspider example example.com

At this point, you’re done creating a scrapy project. Next, let’s start our hunting program…

5.4 Key Configuration Code

1 Open the project

Use PyCharm to open our Scrapy project, choosing "Open in New Window".

2 Configure the project environment

File -> Settings -> Project: proxy_IP -> Project Interpreter -> Gear button -> Add…

Select “Existing environment” -> “…” button

Find Scripts/python.exe for the virtual environment proxy_ip created earlier, check it and confirm

The default location of the proxy_ip virtual environment is C:\Users\username\envs; in my case the location has been changed by modifying the WORKON_HOME environment variable.

3 Configure items.py

items.py defines the fields we crawl and how each field is handled. You can think of it as an Excel template: we define the header fields and their attributes, and then fill in the table according to that template.

import scrapy


class ProxyIpItem(scrapy.Item):
    country = scrapy.Field()
    ip = scrapy.Field()
    port = scrapy.Field()
    server_location = scrapy.Field()
    is_anonymous = scrapy.Field()
    protocol_type = scrapy.Field()
    speed = scrapy.Field()
    connect_time = scrapy.Field()
    survival_time = scrapy.Field()
    validate_time = scrapy.Field()
    source = scrapy.Field()
    create_time = scrapy.Field()

    def get_insert_sql(self):
        insert_sql = """
            insert into proxy_ip(
                country, ip, port, server_location,
                is_anonymous, protocol_type, speed, connect_time,
                survival_time, validate_time, source, create_time
            )
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        """

        params = (
            self["country"], self["ip"], self["port"], self["server_location"],
            self["is_anonymous"], self["protocol_type"], self["speed"], self["connect_time"],
            self["survival_time"], self["validate_time"], self["source"], self["create_time"])

        return insert_sql, params
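
For reference, a sketch of the MySQL table that these fields map into is shown below. The column types are my own assumptions; the actual table-creation SQL ships with the project repository. The unique key on (ip, port) is what provides the simple deduplication mentioned in our goals: inserting a duplicate row simply fails and is logged by the pipeline.

import pymysql

# assumption: column names mirror the item fields above; the types are a guess
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS proxy_ip (
    country         VARCHAR(16),
    ip              VARCHAR(64) NOT NULL,
    port            VARCHAR(16) NOT NULL,
    server_location VARCHAR(255),
    is_anonymous    VARCHAR(32),
    protocol_type   VARCHAR(16),
    speed           VARCHAR(32),
    connect_time    VARCHAR(32),
    survival_time   VARCHAR(32),
    validate_time   VARCHAR(32),
    source          VARCHAR(64),
    create_time     VARCHAR(32),
    UNIQUE KEY uniq_ip_port (ip, port)
) DEFAULT CHARSET=utf8;
"""

conn = pymysql.connect(host="localhost", user="root", password="root123",
                       db="crawler", charset="utf8")
with conn.cursor() as cursor:
    cursor.execute(CREATE_TABLE_SQL)
conn.commit()
conn.close()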

4 Write pipelines.py

pipelines.py is where we build Scrapy's pipelines. You can think of it this way: there is a reservoir from which we draw water to various places, such as a waterworks, factories, farmland…

So we built three pipes 1, 2 and 3 to connect water works, factories and farmland respectively. When the pipes are set up, we can open the switch (valve) whenever we need water, and the water in the reservoir can flow continuously to different destinations.

The scene here has a lot in common with that reservoir. The items yielded by the spider are the water in the reservoir. In the pipelines.py file you can define one pipeline per destination: a pipeline that writes to a file, or pipelines that write to databases such as MySQL, MongoDB or Elasticsearch. The valves of these pipelines live in settings.py; you can open more than one valve at a time, and you can even define the priority of each pipeline.

If you have read this far, don't you find it interesting? The big names of the programming world clearly referred to plenty of real-life examples when they designed this framework. While I cannot be sure the authors of Scrapy were thinking of reservoirs specifically, I am sure they drew on similar scenarios.

Here I also pay tribute to these predecessors: their framework design is excellent and simple to use. Thank you!

The pipeline that writes the data into MySQL is defined in pipelines.py as follows:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql
from twisted.enterprise import adbapi


class ProxyIpPipeline(object):
    """ xicidaili """
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("pymysql", **dbparms)
        # instantiate the pipeline with the connection pool
        return cls(dbpool)

    def process_item(self, item, spider):
        # use Twisted to run the MySQL insert asynchronously
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle exceptions
        return item

    def handle_error(self, failure, item, spider):
        # handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # perform the actual insert: build the insert SQL and params
        # from the item, then execute the statement
        insert_sql, params = item.get_insert_sql()
        print(insert_sql, params)
        try:
            cursor.execute(insert_sql, params)
        except Exception as e:
            print(e)

Note:

  • The pipeline defined above is the one that writes into MySQL, and its structure is essentially boilerplate. Only dbpool = adbapi.ConnectionPool("pymysql", **dbparms) needs to match the third-party library you use to connect to MySQL; if you use another library such as MySQLdb, change "pymysql" to "MySQLdb" accordingly (see the sketch after this list)
  • The MySQL information in dbparms is read from the settings file, so you only need to configure the MySQL connection details in settings.py
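
To make the first point concrete, here is a small sketch of how the connection pool is built: the first argument of adbapi.ConnectionPool is the name of the DB-API module, and the keyword arguments are passed straight to that module's connect() function, so switching drivers mostly means changing that name (plus any driver-specific parameters such as cursorclass).

from twisted.enterprise import adbapi

# a sketch: swap "pymysql" for "MySQLdb" (or another DB-API driver) if needed;
# the keyword arguments go directly to the driver's connect() call
dbpool = adbapi.ConnectionPool(
    "pymysql",
    host="localhost",
    db="crawler",
    user="root",
    passwd="root123",
    charset="utf8",
    use_unicode=True,
)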

Configure the pipeline in settings.py

The pipeline class we wrote above keeps the ProxyIpPipeline name that Scrapy generated when the project was created, so this is the pipeline enabled in settings.py by default.

For demonstration purposes, I have listed the default pipeline together with another custom pipeline setting below.
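
A sketch of what such a configuration looks like; the MongoDB pipeline class here is hypothetical, purely to illustrate a second valve:

ITEM_PIPELINES = {
    'proxy_ip.pipelines.ProxyIpPipeline': 300,       # writes to MySQL
    'proxy_ip.pipelines.ProxyIpMongoPipeline': 400,  # hypothetical pipeline writing to MongoDB
}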

The above configures two pipelines for us, one writing to MySQL and the other to MongoDB. The numbers 300 and 400 are the pipelines' execution priorities; the higher the value, the lower the priority, so items pass through the lower-numbered pipeline first.

The MySQL connection information is configured in settings.py as follows:

MYSQL_HOST = "localhost"
MYSQL_DBNAME = "crawler"
MYSQL_USER = "root"
MYSQL_PASSWORD = "root123"

5 Write the spider

The reservoir releases water downstream, but not everything in it is wanted: weeds, mud, stones and so on have to be filtered out. That filtering process is similar to what our spider does.

Create a xicidaili.py file under the spiders directory to hold the spider logic we need.

Since this article is already quite long, I will omit a detailed walkthrough of the spider here. You can read about spiders on the official site, whose home page includes a simple standard spider template; all we need to do is write the data-extraction rules on top of that template.

# -*- coding: utf-8 -*-
import scrapy

from scrapy.http import Request
from proxy_ip.items import ProxyIpItem
from proxy_ip.util import DatetimeUtil


class ProxyIp(scrapy.Spider):
    name = 'proxy_ip'
    allowed_domains = ['www.xicidaili.com']
    # start_urls = ['http://www.xicidaili.com/nn/1']

    def start_requests(self):
        start_url = 'http://www.xicidaili.com/nn/'

        for i in range(1, 6):
            url = start_url + str(i)
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        ip_table = response.xpath('//table[@id="ip_list"]/tr')
        for tr in ip_table[1:]:
            # create a fresh item for each row, so the asynchronous pipeline
            # never sees an item being overwritten by a later iteration
            proxy_ip = ProxyIpItem()

            # extract the selector for each column
            country = tr.xpath('td[1]/img/@alt')
            ip = tr.xpath('td[2]/text()')
            port = tr.xpath('td[3]/text()')
            server_location = tr.xpath('td[4]/a/text()')
            is_anonymous = tr.xpath('td[5]/text()')
            protocol_type = tr.xpath('td[6]/text()')
            speed = tr.xpath('td[7]/div[1]/@title')
            connect_time = tr.xpath('td[8]/div[1]/@title')
            survival_time = tr.xpath('td[9]/text()')
            validate_time = tr.xpath('td[10]/text()')

            # populate the item fields from the selectors
            proxy_ip['country'] = country.extract()[0].upper() if country else ' '
            proxy_ip['ip'] = ip.extract()[0] if ip else ' '
            proxy_ip['port'] = port.extract()[0] if port else ' '
            proxy_ip['server_location'] = server_location.extract()[0] if server_location else ' '
            proxy_ip['is_anonymous'] = is_anonymous.extract()[0] if is_anonymous else ' '
            proxy_ip['protocol_type'] = protocol_type.extract()[0] if protocol_type else ' '
            proxy_ip['speed'] = speed.extract()[0] if speed else ' '
            proxy_ip['connect_time'] = connect_time.extract()[0] if connect_time else ' '
            proxy_ip['survival_time'] = survival_time.extract()[0] if survival_time else ' '
            proxy_ip['validate_time'] = '20' + validate_time.extract()[0] + '00' if validate_time else ' '
            proxy_ip['source'] = 'www.xicidaili.com'
            proxy_ip['create_time'] = DatetimeUtil.get_current_localtime()

            yield proxy_ip
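
The DatetimeUtil helper imported above lives in proxy_ip/util.py, which is not shown in this post. A minimal sketch of what it might look like (the real project may implement it differently):

# proxy_ip/util.py -- a minimal sketch of the assumed helper
from datetime import datetime


class DatetimeUtil(object):

    @staticmethod
    def get_current_localtime():
        # current local time as a string, e.g. '2018-04-23 11:20:00'
        return datetime.now().strftime('%Y-%m-%d %H:%M:%S')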

6 middlewares.py

Most of the time, you only need to write or configure items, pipelines, spiders and settings to run a complete crawler project, but middlewares are useful in a few cases.

Look at the reservoir scenario again. By default the release process is simple: the waterworks needs water, it sends a request to the reservoir, and the reservoir receives the request, opens the valve and releases the filtered water downstream. But if the waterworks has a special requirement, for example it only wants water between 00:00 and 07:00 every day, that becomes a custom case.

middlewares.py is where such behaviour is defined, including the default request and response handling (the default "release water all day" behaviour)… and when we have special needs, we can define them in middlewares.py as well. The code below, taken from the project, uses fake-useragent to randomly switch the User-Agent of each request:

from fake_useragent import UserAgent


class RandomUserAgentMiddleware(object):
    """randomly switch the User-Agent of each request"""

    def __init__(self, crawler):
        super(RandomUserAgentMiddleware, self).__init__()
        self.ua = UserAgent()
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def get_ua():
            # look up the attribute named by RANDOM_UA_TYPE, e.g. ua.random
            return getattr(self.ua, self.ua_type)

        random_ua = get_ua()
        print("current using user-agent: " + random_ua)
        request.headers.setdefault("User-Agent", random_ua)
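
For a quick sanity check of the RANDOM_UA_TYPE values, note that fake-useragent exposes the browser names as attributes of the UserAgent object, which is exactly what the getattr call above relies on:

from fake_useragent import UserAgent

ua = UserAgent()
print(ua.random)   # a random User-Agent string
print(ua.chrome)   # a random Chrome User-Agent string
print(ua.firefox)  # a random Firefox User-Agent string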

Configure the middleware information in settings.py

# Crawl responsibly by identifying yourself (and your website) on the user-agent
RANDOM_UA_TYPE = "random"  {'ie', 'Chrome ',' Firefox ', 'random'... }

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'proxy_ip.middlewares.RandomUserAgentMiddleware': 100,
}

7 settings.py

settings.py gathers a lot of project information so the configuration can be managed in one place, as shown below:

# -*- coding: utf-8 -*-

import os
import sys
# Scrapy settings for proxy_ip project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'proxy_ip'

SPIDER_MODULES = ['proxy_ip.spiders']
NEWSPIDER_MODULE = 'proxy_ip.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
RANDOM_UA_TYPE = "random"  {'ie', 'Chrome ',' Firefox ', 'random'... }

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 10
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'proxy_ip.middlewares.ProxyIpSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'proxy_ip.middlewares.RandomUserAgentMiddleware': 100,
}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'proxy_ip.pipelines.ProxyIpPipeline': 300,
}

BASE_DIR = os.path.dirname(os.path.abspath(os.path.dirname(__file__)))
sys.path.insert(0, os.path.join(BASE_DIR, 'proxy_ip'))


# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = True

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


MYSQL_HOST = "localhost"
MYSQL_DBNAME = "crawler"
MYSQL_USER = "root"
MYSQL_PASSWORD = "root123"

6. Run tests

In the root directory of the proxy_ip project, create a main.py entry file to run the project with the following code:

# -*- coding:utf-8 -*-

__author__ = 'jiangzhuolin'

import sys
import os
import time

while True:
    os.system("scrapy crawl proxy_ip")  Scrapy crawl spider_name scrapy crawl spider_name
    print("The program goes to sleep...")
    time.sleep(3600)  # sleep for 1 hour, then crawl again


Right-click main.py and choose "Run 'main'" to see the effect.

7. Summary

My original intention was to write a detailed introduction to Scrapy and share my own understanding of the details. Unfortunately, for lack of experience, the more I wrote the more I worried about the length, so a lot of information has been simplified or left out, which may not be ideal for complete beginners. If you have already written a simple Scrapy project but do not yet understand the framework, I hope this helps fill in the gaps.

The code for this project has been pushed to my personal GitHub and Gitee. If GitHub is slow to access, you can get the complete code from Gitee, including the SQL that creates the MySQL database tables.

Github address: github.com/jiangzhuoli…

Gitee address: gitee.com/jzl975/prox…