I. Target address: www.23us.so/

II. Environment: MacOS + PyCharm + Python 3.5.3 + Scrapy + MySQL (install Scrapy and MySQL yourself)

Start building the project and writing the crawler. First, create a new Scrapy project in the terminal:

scrapy startproject dingdian
  • Create a mysqlpipelines package under the project to hold the MySQL pipeline and sql.py
  • Create our own spider (dingdian) under the spiders folder
  • Create a new run.py file to use as an entry point to run the crawler

The contents of run.py are shown below; 'dingdian' is the unique name given to the spider when the spider is defined:

from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'dingdian'])

The project structure
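(A rough sketch of the layout at this point; the exact placement of run.py and the file names inside mysqlpipelines are assumptions based on the imports used later, and the usual __init__.py files are omitted.)

dingdian/
├── scrapy.cfg
├── run.py
└── dingdian/
    ├── items.py
    ├── settings.py
    ├── spiders/
    │   └── dingdian.py
    └── mysqlpipelines/
        ├── sql.py
        └── pipelines.py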

Create the item models and write them in items.py:

import scrapy


class DingdianItem(scrapy.Item):
    # define the fields for your item here like:

    # novel name
    name = scrapy.Field()

    # author
    author = scrapy.Field()

    # novel URL
    novelurl = scrapy.Field()

    # serialization status
    serialstatus = scrapy.Field()

    # serial number
    serialnumber = scrapy.Field()

    # category
    category = scrapy.Field()

    # novel number
    name_id = scrapy.Field()

class DcontentItem(scrapy.Item):
    # Novel number
    id_name = scrapy.Field()

    # Chapter Content
    chaptercontent = scrapy.Field()

    # Used to bind chapter order
    num = scrapy.Field()

    # Section address
    chapterurl = scrapy.Field()

    # Chapter name
    chaptername = scrapy.Field()

Then let’s look at the entry address www.23us.so/

Magical Magic: www.23us.so/list/1_1.ht…

Wuxia Xiuzhen: www.23us.so/list/2_1.ht…

City romance: www.23us.so/list/3_1.ht…

History military: www.23us.so/list/4_1.h…

Online game competition: www.23us.so/list/5_1.ht…

Science fiction: www.23us.so/list/6_1.ht…

Terror: www.23us.so/list/7_1.ht…

Novels for girls: www.23us.so/list/8_1.ht…

Other: www.23us.so/list/9_1.ht…

Completed novels: www.23us.so/full.html

For the addresses above, the spider builds each request URL as base_url + the category number + '_1.html' (see start_requests below).

import re
import scrapy
from scrapy import Request
from bs4 import BeautifulSoup
from dingdian.items import DingdianItem, DcontentItem
from dingdian.mysqlpipelines.sql import Sql

class MySpider(scrapy.Spider):
    name = "dingdian"
    allowed_domains = ['23us.so']
    base_url = 'https://www.23us.so/list/'

    def start_requests(self):
        for i in range(1, 10):
            url = self.base_url + str(i) + '_1' + '.html' # url of fiction category
            yield Request(url, self.parse)
        # completed novels page
        yield Request('https://www.23us.so/full.html', callback=self.parse)

In the code above, we create a class MySpider that inherits from scrapy.Spider and define name = "dingdian" (note that this name is the third argument in run.py!). The name must be unique within the whole project; it cannot be repeated.

We also define allowed_domains. This is optional, but it is needed in some cases: for example, crawl rules only follow URLs whose domain is in allowed_domains and ignore the rest. The category URLs found above are built by string concatenation in start_requests.

Finally, the parse function receives the response for each of the requests above. Each response corresponds to a novel-category listing, and every category has many pages, so we first need to find the maximum page number.

    def parse(self, response):
        max_num = response.css('div.pagelink a.last::text').extract_first()

        for num in range(1, int(max_num) + 1):
            next_page = str(response.url)[:-7] + '_' + str(num) + '.html'
            if next_page is not None:
                yield Request(next_page, callback=self.get_name)

When next_page exists, we request it. Here response.css searches for tags with CSS selectors; there are many ways to locate tags, such as CSS selectors, XPath, or BeautifulSoup. You can use Chrome or Firefox developer tools to quickly obtain the CSS or XPath path of a given tag: open the element inspector in Chrome and use the picker tool.
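As a quick illustration of those options, here are three roughly equivalent ways to pull out the last-page number used above (a sketch only; the XPath and BeautifulSoup versions assume the pagination markup is literally a div with class pagelink containing an a with class last):

from bs4 import BeautifulSoup

def get_max_page(response):
    # CSS selector, as used in the spider above
    by_css = response.css('div.pagelink a.last::text').extract_first()
    # XPath equivalent (assumes the class attributes are exactly these values)
    by_xpath = response.xpath('//div[@class="pagelink"]//a[@class="last"]/text()').extract_first()
    # BeautifulSoup equivalent
    soup = BeautifulSoup(response.text, 'lxml')
    by_bs = soup.find('div', class_='pagelink').find('a', class_='last').get_text()
    return by_css, by_xpath, by_bs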

Between them, start_requests and parse extract the URLs of every listing page on the whole site and hand each page's response to the get_name function for processing.

    def get_name(self, response):
        tds = BeautifulSoup(response.text, 'lxml').find_all('tr', bgcolor='#FFFFFF')
        for td in tds:
            novelname = td.find('a').get_text()
            novelurl = td.find('a')['href']
            yield Request(novelurl, callback=self.get_chapterurl, meta={'name': novelname, 'url': novelurl})

We get the novel name and URL, and pass extra parameters to the get_chapterurl function via the Request's meta.

    def get_chapterurl(self, response):
        item = DingdianItem()

        item['name'] = str(response.meta['name']).replace('\xa0', ' ')
        item['novelurl'] = response.meta['url']
        category = response.css('table a::text').extract_first()
        author = response.css('table td::text').extract()[1]

        # Latest chapter
        bash_url = response.css('p.btnlinks a.read::attr(href)').extract_first()

        name_id = str(bash_url).split('/')[-2]

        item['category'] = str(category).replace('/', ' ')
        item['author'] = str(author).replace('/', ' ')
        item['name_id'] = name_id
        yield item

        yield Request(url=bash_url, callback=self.get_chapter, meta={'name_id': name_id})

We assign the required data to item[key] (note that each key must be a field defined in the items file).

Attention! response.meta[key] extracts the value passed in from the previous function.

yield item hands the item off to the pipeline for further processing, for example storing it in MySQL.
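For yielded items to actually reach the pipeline, it has to be enabled in settings.py. A minimal sketch, assuming DingDianPipeline lives in dingdian/mysqlpipelines/pipelines.py (adjust the dotted path to wherever you actually define it):

# settings.py
ITEM_PIPELINES = {
    # dotted path is an assumption -- point it at your DingDianPipeline
    'dingdian.mysqlpipelines.pipelines.DingDianPipeline': 300,
}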

Traversing each chapter of the novel is similar to the operations above: it is just a matter of finding the right tags; a rough sketch of those callbacks follows. After that, we will look at storing the data in MySQL through a pipeline.
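The get_chapter callback requested from get_chapterurl is not shown above. A minimal sketch of what that traversal might look like inside the spider, reusing DcontentItem; the CSS selectors here are illustrative placeholders, not taken from the real page:

    def get_chapter(self, response):
        # Chapter list page: grab every chapter link (selector is a placeholder)
        urls = response.css('td a::attr(href)').extract()
        names = response.css('td a::text').extract()
        for num, (url, name) in enumerate(zip(urls, names), start=1):
            yield Request(response.urljoin(url), callback=self.get_chaptercontent,
                          meta={'name_id': response.meta['name_id'], 'num': num,
                                'chaptername': name, 'chapterurl': url})

    def get_chaptercontent(self, response):
        item = DcontentItem()
        item['id_name'] = response.meta['name_id']
        item['num'] = response.meta['num']
        item['chaptername'] = str(response.meta['chaptername']).replace('\xa0', ' ')
        item['chapterurl'] = response.meta['chapterurl']
        # Chapter body (selector is a placeholder)
        item['chaptercontent'] = '\n'.join(response.css('dd#contents::text').extract())
        yield item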

To store the data, we first need to create two tables in MySQL:

DROP TABLE IF EXISTS `dd_name`;
CREATE TABLE `dd_name` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `xs_name` varchar(255) DEFAULT NULL,
  `xs_author` varchar(255) DEFAULT NULL,
  `category` varchar(255) DEFAULT NULL,
  `name_id` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=38 DEFAULT CHARSET=utf8mb4;
DROP TABLE IF EXISTS `dd_chaptername`;
CREATE TABLE `dd_chaptername` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `xs_chaptername` varchar(255) DEFAULT NULL,
  `xs_content` text,
  `id_name` int(11) DEFAULT NULL,
  `num_id` int(11) DEFAULT NULL,
  `url` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2726 DEFAULT CHARSET=gb18030;
SET FOREIGN_KEY_CHECKS=1;

The MySQL configuration is defined in settings.py. Fill in the host, account, password, port, and database according to your local setup; 'DingDianBooks' is the database I created for this project.

# mysql
MYSQL_HOSTS = '127.0.0.1'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '11111'
MYSQL_PORT = '3306'
MYSQL_DB = 'DingDianBooks'

To connect Python to MySQL you need an extra package. I am using mysql-connector, installed via pip.
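For reference, either of these pip packages provides the mysql.connector module used below:

pip install mysql-connector
# or the official distribution:
pip install mysql-connector-python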

Here is our sql.py file:

import mysql.connector
from dingdian import settings

# mysql
MYSQL_HOSTS = settings.MYSQL_HOSTS
MYSQL_USER = settings.MYSQL_USER
MYSQL_PASSWORD = settings.MYSQL_PASSWORD
MYSQL_PORT = settings.MYSQL_PORT
MYSQL_DB = settings.MYSQL_DB

cnx = mysql.connector.connect(user=MYSQL_USER, password=MYSQL_PASSWORD, host=MYSQL_HOSTS, database=MYSQL_DB)
cur = cnx.cursor(buffered=True)

class Sql:
    # Insert title + author + category
    @classmethod
    def insert_dd_name(cls, xs_name, xs_author, category, name_id):

        sql = 'INSERT INTO dd_name (`xs_name`, `xs_author`, `category`, `name_id`) VALUES (%(xs_name)s, %(xs_author)s, %(category)s, %(name_id)s)'

        value = {
            'xs_name' : xs_name,
            'xs_author': xs_author,
            'category': category,
            'name_id': name_id
        }

        cur.execute(sql, value)
        cnx.commit()

    # Deduplication check
    @classmethod
    def select_name(cls, name_id):
        sql = 'SELECT EXISTS(SELECT 1 FROM dd_name WHERE name_id=%(name_id)s)'
        value = {
            'name_id': name_id
        }
        cur.execute(sql, value)
        return cur.fetchall()[0]

We initialize a MySQL connection and cursor; insert_dd_name writes the four values passed to it into the database. select_name is a deduplication check: it looks up name_id and returns 1 if it already exists, 0 if it does not.
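The insert helper for the chapter table is not shown above. Following the same pattern, it might look roughly like this (the method name insert_dd_cname is an assumption; the columns match dd_chaptername):

    # Insert chapter name + content + novel number + chapter order + url (sketch)
    @classmethod
    def insert_dd_cname(cls, xs_chaptername, xs_content, id_name, num_id, url):
        sql = ('INSERT INTO dd_chaptername '
               '(`xs_chaptername`, `xs_content`, `id_name`, `num_id`, `url`) '
               'VALUES (%(xs_chaptername)s, %(xs_content)s, %(id_name)s, %(num_id)s, %(url)s)')
        value = {
            'xs_chaptername': xs_chaptername,
            'xs_content': xs_content,
            'id_name': id_name,
            'num_id': num_id,
            'url': url,
        }
        cur.execute(sql, value)
        cnx.commit()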

With sql.py complete, we can now write the pipeline:

from .sql import Sql
from dingdian.items import DingdianItem, DcontentItem

class DingDianPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, DingdianItem):
            name_id = item['name_id']
            ret = Sql.select_name(name_id)
            if ret[0] == 1:
                print('Already exists')
                pass
            else:
                xs_name = item['name']
                xs_author = item['author']
                category = item['category']
                Sql.insert_dd_name(xs_name, xs_author, category, name_id)

We create a DingDianPipeline class (don't forget to inherit from object) and define a process_item method with item and spider parameters; both are required. When the item is a DingdianItem, the deduplication check runs first, and then the values are taken out of the item and stored in the database. Writing to the other table is similar; details can be found in the code (a sketch of that branch is given below).
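For the chapter items, the corresponding branch of process_item might look roughly like this (a sketch only; it relies on the hypothetical insert_dd_cname helper sketched earlier, and a duplicate check on chapterurl could be added to sql.py in the same way as select_name):

        elif isinstance(item, DcontentItem):
            # Chapter item: write one row into dd_chaptername
            # (insert_dd_cname is the hypothetical helper sketched above)
            Sql.insert_dd_cname(item['chaptername'], item['chaptercontent'],
                                item['id_name'], item['num'], item['chapterurl'])
        return item  # process_item should return the item (or raise DropItem)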

At this point the crawler is basically complete; just run run.py in PyCharm to start it.
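Equivalently, you can start it from a terminal in the project root:

scrapy crawl dingdian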

Project address: GitHub