
Preface

Scrapy is a very mature crawler framework that encapsulates almost every module a developer needs, such as requests, proxies, logging, automatic URL deduplication and so on; any module you are not happy with can be tweaked slightly. But Scrapy by itself can only crawl static web pages. If you want to crawl dynamically rendered pages, you need a third-party component such as Selenium or Splash.

Why Choose Selenium

Both Selenium and Splash are excellent dynamic web rendering tools. However, Selenium has the following advantages compared with Splash:

  1. Selenium runs directly in the browser and simulates actions the way a real user would.
  2. Selenium’s documentation is more detailed and there are plenty of Chinese tutorials, while Splash’s are pitifully sparse.
  3. Selenium supports a more comprehensive and detailed set of commands for simulating user behavior. Splash also supports some user interactions, but I couldn’t work out how to use them from its manual.

Weighing all these points, Selenium is the better choice, but Splash is not without its advantages. I have written PHP crawlers before that used Splash. My personal take: if you don’t need particularly complicated interaction, Splash is still fine. With the HTTP API that Splash provides, all you have to do is pass the request headers, proxy and URL to Splash, and you get the rendered HTML back.
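To make the comparison concrete, here is a minimal sketch of that Splash workflow, assuming a Splash instance is listening on localhost:8050 (for example started from its Docker image); the target URL, proxy and header values below are placeholders, not values from this article:

import requests

splash_url = "http://localhost:8050/render.html"   # Splash's HTML rendering endpoint

payload = {
    "url": "https://movie.douban.com/explore",      # page to render
    "wait": 2,                                      # seconds to wait for JS to run
    "proxy": "http://127.0.0.1:8888",               # optional proxy (placeholder)
    "headers": {"User-Agent": "Mozilla/5.0"},       # headers for the outgoing request
}

# Splash accepts its arguments as a JSON POST body; the response body is the rendered HTML
resp = requests.post(splash_url, json=payload)
html = resp.text
print(html[:200])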

This article will implement the following

  1. The installation of Scrapy
  2. Configure Selenium
  3. Store the data in the database with SQLAlchemy

The installation of Scrapy

Install scrapy

Type pip install scrapy in your terminal.

Create a scrapy project

  1. To create a project, run: scrapy startproject <project name>
  2. To create a crawler file, run: scrapy genspider <file name> <domain>

I run:

scrapy startproject spider
scrapy genspider douban douban.com

The generated project directory looks like this

spider
    spider
        spiders
            __init__.py
            douban.py
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
    scrapy.cfg

Adding related directories

spider
    spider
        models              # mysql model classes
            __init__.py
        spiders
            __init__.py
            douban.py
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
    scrapy.cfg

Configure Selenium

I won’t cover installing the browser driver here, because the installation differs from system to system; detailed guides for every system are easy to find online, and it is not difficult.
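Before wiring the driver into Scrapy, a quick sanity-check sketch can confirm that Selenium is able to launch the browser. This assumes Chrome plus a matching chromedriver on your PATH; the douban URL is only an example:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")

# Raises an exception if chromedriver cannot be found or does not match Chrome
browser = webdriver.Chrome(options=options)
browser.get("https://movie.douban.com")
print(browser.title)   # the rendered page's title
browser.quit()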

Install the selenium package

Install it with pip:

pip install selenium

Change the code

spider/spiders/douban.py

import scrapy
from selenium import webdriver
# use a headless browser
from selenium.webdriver.chrome.options import Options

# headless browser settings
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/explore#!type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=0']

    # Instantiate a browser object
    def __init__(self):
        self.browser = webdriver.Chrome(options=chrome_options)
        super().__init__()

    def parse(self, response):
        print(response.body)
        pass

    def close(self, spider):
        print('Spider finished, closing the browser')
        self.browser.quit()

Change some settings in settings.py

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'

# Disable the robots.txt rules
ROBOTSTXT_OBEY = False

# Enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'spider.middlewares.SpiderDownloaderMiddleware': 543,
}

# Data processing pipeline
ITEM_PIPELINES = {
    'spider.pipelines.SpiderPipeline': 300,
}

Modify the middlewares.py middleware file: change the process_response method of the SpiderDownloaderMiddleware class so that the URL is loaded by the Selenium-driven browser.

# Add these imports at the top of middlewares.py
import time
from scrapy.http import HtmlResponse

# Inside the SpiderDownloaderMiddleware class, replace process_response:
    def process_response(self, request, response, spider):
        # Load the URL in the browser driven by Selenium
        spider.browser.get(url=request.url)
        # Wait for the page to finish loading
        time.sleep(1)
        row_response = spider.browser.page_source
        # The url argument is the URL the browser is currently visiting,
        # taken from the browser's current_url attribute
        return HtmlResponse(url=spider.browser.current_url,
                            body=row_response,
                            encoding="utf8",
                            request=request)
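The fixed time.sleep(1) either wastes a second on fast pages or is too short on slow ones. As an alternative sketch (my own variation, not part of the original code), Selenium’s explicit waits can block until the element that parse() reads later has actually appeared; the ".list-wp .list a" selector is borrowed from the parse() method further down:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Inside process_response, instead of time.sleep(1):
WebDriverWait(spider.browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".list-wp .list a"))
)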

At this point, selenium is configured
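You can already give the spider a trial run from the project root (the directory containing scrapy.cfg); it should print the Selenium-rendered HTML of the start URL:

scrapy crawl douban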

Store the data in the database with SQLAlchemy

In Python, the best known ORM framework is SQLAlchemy, which is well documented and easy to use, designed for efficient and high-performance database access, and implements a complete enterprise-class persistence model.
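The connection code below talks to MySQL through the pymysql driver, so both packages need to be installed first (a step the article does not spell out). Note that the snippets were written against SQLAlchemy 1.x, where create_engine still accepts the encoding argument:

pip install sqlalchemy pymysql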

Add the database connection configuration file

Database configuration file spider/common/conf.py

class conf:
    mysql = {
        # development server configuration information
        'dev': {
            'host': '127.0.0.1',
            'database': 'test',
            'user': 'user',
            'password': 'passord',
            'port': 3306,
        },
        # test server configuration information
        'test': {
            'host': '127.0.0.1',
            'database': 'test',
            'user': 'user',
            'password': 'passord',
            'port': 3306,
        },
        # production server configuration information
        'prod': {
            'host': '127.0.0.1',
            'database': 'test',
            'user': 'user',
            'password': 'passord',
            'port': 3306,
        },
    }

Runtime environment configuration file spider/common/env.py

# coding: utf-8

"""
Runtime environment options: dev / test / prod
"""


class env:
    env = 'dev'

Adding a database connection

Add the database connection in the spider/models/__init__.py file

# -*- coding: utf-8 -*-
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from spider.common.conf import conf
from spider.common.env import env

Base = declarative_base()

env = env.env
mysql = conf.mysql[env]

engine = create_engine(
    url="mysql+pymysql://{user}:{password}@{host}/{database}".format(
        user=mysql['user'],
        password=mysql['password'],
        host=mysql['host'],
        database=mysql['database']
    ),
    encoding='utf-8',
    pool_size=5,
    pool_pre_ping=True
)

Session = sessionmaker(bind=engine)()

Establish the data table model

Add the Movie.py file to the spider/models directory

from sqlalchemy import Column, String, Integer, DateTime

from spider.models import Base


class Movie(Base):
    __tablename__ = 'movie'   # table name in MySQL

    id = Column(Integer, primary_key=True)
    title = Column(String(100), comment='movie title')
    created_at = Column(DateTime, comment='creation time')
    updated_at = Column(DateTime, comment='update time')
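The pipeline further down assumes the movie table already exists in MySQL. If you would rather let SQLAlchemy create it from the model, a one-off sketch (run once, for example from a small script at the project root) could look like this; it relies on the engine defined in spider/models/__init__.py and the table name set in Movie.py:

from spider.models import Base, engine
from spider.models.Movie import Movie  # imported so the model registers itself on Base

Base.metadata.create_all(engine)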

Add the item field

Add MovieItem to items.py

import scrapy

class SpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class MovieItem(scrapy.Item):
    title = scrapy.Field()
    pass

Modify the crawler

spider/spiders/douban.py: add data parsing

import scrapy
from selenium import webdriver
# use a headless browser
from selenium.webdriver.chrome.options import Options

from spider.items import MovieItem

# headless browser settings
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/explore#!type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=0']

    # Instantiate a browser object
    def __init__(self):
        self.browser = webdriver.Chrome(options=chrome_options)
        super().__init__()

    def parse(self, response):
        for value in response.css('.list-wp>.list>a'):
            item = MovieItem()
            item['title'] = value.css('.cover-wp>img::attr(alt)').extract_first()
            yield item
        pass

    def close(self, spider):
        print('Spider finished, closing the browser')
        self.browser.quit()

Write to the database in the pipeline

pipelines.py writes the data to the database

import datetime

from spider.models import Session
from spider.models.Movie import Movie


class SpiderPipeline:

    def open_spider(self, spider):
        """Called when the spider starts."""
        print("Spider started....")

    def process_item(self, item, spider):
        """Write each item to the database."""
        data = Movie(
            title=item['title'],
            created_at=datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            updated_at=datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        )
        try:
            Session.add(data)
            Session.commit()
        except Exception:
            Session.rollback()
        return item

    def close_spider(self, spider):
        """Called when the spider closes; release the database connection."""
        Session.close()
        print("Spider finished, database connection closed")

Conclusion

With that, a simple crawler is complete. In the setup above, every request goes through the Selenium component; for static pages that don’t need rendering, you can simply decide in the middleware whether to use the browser (a sketch follows below). This article is only a simple wrapper; a production-grade crawler also needs proxies and should consider distributed and incremental crawling, which I won’t go into here.
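One possible shape for that middleware check (my own sketch; the use_selenium meta key is an assumed name, not something from this article): requests that need JavaScript rendering carry a flag, and only those are handed to the browser.

# In the spider, when yielding a request that needs rendering:
#     yield scrapy.Request(url, meta={'use_selenium': True}, callback=self.parse)

# In SpiderDownloaderMiddleware:
    def process_response(self, request, response, spider):
        if not request.meta.get('use_selenium'):
            # Static page: keep the response Scrapy downloaded itself
            return response
        spider.browser.get(url=request.url)
        time.sleep(1)
        return HtmlResponse(url=spider.browser.current_url,
                            body=spider.browser.page_source,
                            encoding="utf8",
                            request=request)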