“This is the 18th day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021”

Scrapy: crawling the Xici proxy site (xicidaili)

1. Create projects

  • scrapy startproject XcSpider

2. Create a crawler instance

  • scrapy genspider xcdl xicidaili.com

Mark the project folder as Sources Root (in PyCharm) to prevent errors when importing your own modules

3. Create a startup file main.py

from scrapy import cmdline
cmdline.execute('scrapy crawl xcdl'.split())

4. Overall tree structure of the project

tree /F (the /F switch also lists the files inside every folder)

│  items.py
│  main.py
│  middlewares.py
│  pipelines.py
│  settings.py
│  __init__.py
│
├───mysqlpipelines
│   │  pipelines.py
│   │  sql.py
│   │  __init__.py
│   │
│   └───__pycache__
│           pipelines.cpython-36.pyc
│           sql.cpython-36.pyc
│           __init__.cpython-36.pyc
│
├───spiders
│   │  xcdl.py
│   │  __init__.py
│   │
│   └───__pycache__
│           xcdl.cpython-36.pyc
│           __init__.cpython-36.pyc
│
└───__pycache__
        items.cpython-36.pyc
        pipelines.cpython-36.pyc
        settings.cpython-36.pyc
        __init__.cpython-36.pyc

5. Configure the settings.py file

  • Configure the MySQL and MongoDB connection settings
  • Set DEFAULT_REQUEST_HEADERS to add request headers, so the spider is not blocked by anti-crawling measures
# -*- coding: utf-8 -*-

# Scrapy settings for XcSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'XcSpider'

SPIDER_MODULES = ['XcSpider.spiders']
NEWSPIDER_MODULE = 'XcSpider.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'XcSpider (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                'Chrome/80.0.3987.149 Safari/537.36',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'XcSpider.middlewares.XcspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'XcSpider.middlewares.XcspiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   # 'XcSpider.pipelines.XcspiderPipeline': 300,
    'XcSpider.mysqlpipelines.pipelines.XicidailiPipeline': 300,
    'XcSpider.pipelines.XcPipeline': 200,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# enable log
LOG_FILE = 'xcdl.log'
LOG_LEVEL = 'ERROR'
LOG_ENABLED = True

# MySQL configuration
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'root'
MYSQL_PORT = 3306
MYSQL_DB = 'db_xici'

# MongoDB configuration
# MONGODB hostname
MONGODB_HOST = '127.0.0.1'
# MONGODB port number
MONGODB_PORT = 27017
# database name
MONGODB_DBNAME = 'XCDL'
# Table (collection) name where the data is stored
MONGODB_SHEETNAME = 'xicidaili'


6. items.py

  • Define fields for the data you need to crawl
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class XcspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class XiciDailiItem(scrapy.Item):
    country = scrapy.Field()
    ipaddress = scrapy.Field()
    port = scrapy.Field()
    serveraddr = scrapy.Field()
    isanonymous = scrapy.Field()
    type = scrapy.Field()
    alivetime = scrapy.Field()
    verificationtime = scrapy.Field()

7. xcdl.py

  • Process the response and extract the required data
# -*- coding: utf-8 -*-
import scrapy
from XcSpider.items import XiciDailiItem

class XcdlSpider(scrapy.Spider):
    name = 'xcdl'
    allowed_domains = ['xicidaili.com']
    start_urls = ['https://www.xicidaili.com/']

    def parse(self, response):
        # print(response.body.decode('utf-8'))
        items_1 = response.xpath('//tr[@class="odd"]')
        items_2 = response.xpath('//tr[@class=""]')
        items = items_1 + items_2

        infos = XiciDailiItem()
        for item in items:
            # Get country picture link
            counties = item.xpath('./td[@class="country"]/img/@src').extract()
            try:
                country = counties[0]
            except:
                country = 'None'
            # get ipaddress
            ipaddress = item.xpath('./td[2]/text()').extract()
            try:
                ipaddress = ipaddress[0]
            except:
                ipaddress = 'None'
            # for the port
            port = item.xpath('./td[3]/text()').extract()
            try:
                port = port[0]
            except:
                port = 'None'
            # get serveraddr
            serveraddr = item.xpath('./td[4]/text()').extract()
            try:
                serveraddr = serveraddr[0]
            except:
                serveraddr = 'None'
            # get isanonymous
            isanonymous = item.xpath('./td[5]/text()').extract()
            try:
                isanonymous = isanonymous[0]
            except:
                isanonymous = 'None'
            # for the type
            type = item.xpath('./td[6]/text()').extract()
            try:
                type = type[0]
            except:
                type = 'None'
            # Get the alive time
            alivetime = item.xpath('./td[7]/text()').extract()
            try:
                alivetime = alivetime[0]
            except:
                alivetime = 'None'
            # Get the verification time
            verificationtime = item.xpath('./td[8]/text()').extract()
            try:
                verificationtime = verificationtime[0]
            except:
                verificationtime = 'None'

            print(country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime)

            infos['country'] = country
            infos['ipaddress'] = ipaddress
            infos['port'] = port
            infos['serveraddr'] = serveraddr
            infos['isanonymous'] = isanonymous
            infos['type'] = type
            infos['alivetime'] = alivetime
            infos['verificationtime'] = verificationtime


            yield infos

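As a side note, the repeated try/except blocks above all do the same thing: take the first match or fall back to 'None'. Scrapy's SelectorList also supports this directly through get() with a default value, so the same extraction logic could be written more compactly. The sketch below is an optional drop-in replacement for the parse method of XcdlSpider above, not part of the original spider:

    # Equivalent, more compact extraction sketch using get() with a default value
    def parse(self, response):
        rows = response.xpath('//tr[@class="odd"]') + response.xpath('//tr[@class=""]')
        for row in rows:
            infos = XiciDailiItem()
            infos['country'] = row.xpath('./td[@class="country"]/img/@src').get(default='None')
            infos['ipaddress'] = row.xpath('./td[2]/text()').get(default='None')
            infos['port'] = row.xpath('./td[3]/text()').get(default='None')
            infos['serveraddr'] = row.xpath('./td[4]/text()').get(default='None')
            infos['isanonymous'] = row.xpath('./td[5]/text()').get(default='None')
            infos['type'] = row.xpath('./td[6]/text()').get(default='None')
            infos['alivetime'] = row.xpath('./td[7]/text()').get(default='None')
            infos['verificationtime'] = row.xpath('./td[8]/text()').get(default='None')
            yield infos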

8. pipelines.py

I. Save the data to MongoDB

  • Write the MongoDB pipeline in the project's default pipelines.py file
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
from XcSpider import settings

class XcspiderPipeline(object):
    def process_item(self, item, spider):
        return item

class XcPipeline(object):
    def __init__(self):
        host = settings.MONGODB_HOST
        port = settings.MONGODB_PORT
        dbname = settings.MONGODB_DBNAME
        sheetname = settings.MONGODB_SHEETNAME
        # Create the MongoDB client connection
        client = pymongo.MongoClient(host=host, port=port)
        # Get the database
        mydb = client[dbname]
        # Get the collection (table) where the data is stored
        self.post = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        self.post.insert(data)
        return item
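To quickly confirm that documents actually landed in MongoDB after a crawl, a small standalone script can be run. This is just a sketch that reuses the settings module above; it assumes a local mongod is running and pymongo 3.7+ (for count_documents):

# Quick MongoDB check: count the stored proxy documents and print a few of them
import pymongo
from XcSpider import settings

client = pymongo.MongoClient(host=settings.MONGODB_HOST, port=settings.MONGODB_PORT)
collection = client[settings.MONGODB_DBNAME][settings.MONGODB_SHEETNAME]

print('documents stored:', collection.count_documents({}))
for doc in collection.find().limit(3):
    print(doc)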

II. Save the data to MySQL

  • For MySQL, we write the pipeline in a separate package
  • Create a mysqlpipelines folder (Python package) in the project to hold the MySQL pipeline code
  • First, let’s write an SQL template -> sql.py
# -*- coding: UTF-8 -*-
"' = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = @ Project - > File: Project - > SQL @ IDE: PyCharm @ the Author: Ruochen @ Date: 2020/4/3 12:53 @ Desc = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ' ' '
import pymysql
from XcSpider import settings

MYSQL_HOST = settings.MYSQL_HOST
MYSQL_USER = settings.MYSQL_USER
MYSQL_PASSWORD = settings.MYSQL_PASSWORD
MYSQL_PORT = settings.MYSQL_PORT
MYSQL_DB = settings.MYSQL_DB

db = pymysql.connect(user=MYSQL_USER, password=MYSQL_PASSWORD, host=MYSQL_HOST, port=MYSQL_PORT, database=MYSQL_DB, charset="utf8")
cursor = db.cursor()

class Sql(object):

    @classmethod
    def insert_db_xici(cls, country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime):
        sql = 'insert into xicidaili(country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime)' \
              ' values (%(country)s, %(ipaddress)s, %(port)s, %(serveraddr)s, %(isanonymous)s, %(type)s, %(alivetime)s, %(verificationtime)s) '
        value = {
            'country': country,
            'ipaddress': ipaddress,
            'port': port,
            'serveraddr': serveraddr,
            'isanonymous': isanonymous,
            'type': type,
            'alivetime': alivetime,
            'verificationtime': verificationtime,
        }
        try:
            cursor.execute(sql, value)
            db.commit()
        except Exception as e:
            print('Failed to insert ----', e)
            db.rollback()

    # Deduplication check
    @classmethod
    def select_name(cls, ipaddress):
        sql = "select exists(select 1 from xicidaili where ipaddress=%(ipaddress)s)"
        value = {
            'ipaddress': ipaddress
        }
        cursor.execute(sql, value)
        return cursor.fetchall()[0]

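To see how the two class methods fit together (the same pattern the pipeline below follows), here is a purely illustrative snippet with made-up proxy values; it assumes the db_xici database and xicidaili table already exist:

# Illustrative use of the Sql helper: insert one fake row, then run the dedup query
from XcSpider.mysqlpipelines.sql import Sql

Sql.insert_db_xici('cn.png', '1.2.3.4', '8080', 'Beijing', 'High anonymity', 'HTTP', '1 day', '20-04-03 12:00')

# select_name() returns a one-element tuple: (1,) if the IP exists, (0,) otherwise
ret = Sql.select_name('1.2.3.4')
print('already stored:', ret[0] == 1)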
  • Then write the MySQL pipeline -> pipelines.py (inside the mysqlpipelines package)
# -*- coding: UTF-8 -*-
"' = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = @ Project - > File: Project - > pipelines @ IDE: PyCharm @ the Author: Ruochen @ Date: 2020/4/3 12:53 @ Desc: = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ' ' '
from XcSpider.items import XiciDailiItem
from .sql import Sql

class XicidailiPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, XiciDailiItem):
            ipaddress = item['ipaddress']
            ret = Sql.select_name(ipaddress)
            if ret[0] == 1:
                print('IP: {} already exists ----'.format(ipaddress))
            else:
                country = item['country']
                ipaddress = item['ipaddress']
                port = item['port']
                serveraddr = item['serveraddr']
                isanonymous = item['isanonymous']
                type = item['type']
                alivetime = item['alivetime']
                verificationtime = item['verificationtime']

                Sql.insert_db_xici(country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime)
                

9. Set up the pipelines in settings.py

  • This was already added to the settings.py file above, but let's go over it again
  • One is the MySQL pipeline and the other is the MongoDB pipeline
  • The priority can be set arbitrarily
  • They can be enabled at the same time, or each one separately

The ITEM_PIPELINES configuration in settings.py:

# from XcSpider.mysqlpipelines.pipelines import XicidailiPipeline
ITEM_PIPELINES = {
   # 'XcSpider.pipelines.XcspiderPipeline': 300,
    'XcSpider.mysqlpipelines.pipelines.XicidailiPipeline': 300,
    'XcSpider.pipelines.XcPipeline': 200,
}

10. Run the program

  • Now we can start our crawler by running the main.py file
  • You can then see the crawled data in the database

Note: the xicidaili table must already exist in MySQL. Its structure is shown below (the output of show create table xicidaili):

Create Table: CREATE TABLE `xicidaili` (
  `id` int(255) unsigned NOT NULL AUTO_INCREMENT,
  `country` varchar(1000) NOT NULL,
  `ipaddress` varchar(1000) NOT NULL,
  `port` int(255) NOT NULL,
  `serveraddr` varchar(50) NOT NULL,
  `isanonymous` varchar(30) NOT NULL,
  `type` varchar(30) NOT NULL,
  `alivetime` varchar(30) NOT NULL,
  `verificationtime` varchar(30) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=64 DEFAULT CHARSET=utf8;
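To spot-check the stored rows without a GUI client, a small pymysql script like the sketch below can be used. It simply reuses the MySQL settings from settings.py and is purely illustrative:

# Quick MySQL check: count stored proxies and preview a few rows
import pymysql
from XcSpider import settings

db = pymysql.connect(user=settings.MYSQL_USER, password=settings.MYSQL_PASSWORD,
                     host=settings.MYSQL_HOST, port=settings.MYSQL_PORT,
                     database=settings.MYSQL_DB, charset='utf8')
cursor = db.cursor()

cursor.execute('select count(*) from xicidaili')
print('rows stored:', cursor.fetchone()[0])

cursor.execute('select ipaddress, port, type, alivetime from xicidaili limit 5')
for row in cursor.fetchall():
    print(row)

db.close()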

End. Run result

The MySQL database

Mongo database

Finally, welcome to follow my personal WeChat official account "Little Ape Ruochen" for more IT technology, practical knowledge, and tech news.