The original article is reprinted from Liu Yue's technology blog: v3u.cn/a_id_83

Scrapy is a handy Python crawler framework that lets you crawl web pages by writing only a few components. However, when we need to crawl a large number of pages, a single server can no longer keep up (whether in processing speed or in the number of concurrent network requests), and this is where a distributed crawler shows its advantages.

Scrapy-redis is a distributed component based on Redis. It uses Redis to store and schedule the requests of a crawl, and to store the items the crawl produces for subsequent processing. Scrapy-redis rewrites some key parts of Scrapy, turning Scrapy into a distributed crawler that can run on multiple hosts simultaneously.

To put it bluntly: when a crawler takes a URL from Redis, it removes that URL from the queue, which guarantees that two crawlers never get the same URL. Even if two crawlers happen to submit the same URL at the same time, Redis deduplicates it again when the result comes back. So we can achieve a distributed effect by putting the Redis queue on one host and running crawlers on the other hosts. Scrapy-redis stays connected to Redis, so even when there is no URL in the Redis queue the crawler keeps polling for requests, and as soon as a new URL appears in the queue it starts crawling immediately.
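
The mechanism is easy to picture with plain redis-py: a shared list acts as the URL queue and a shared set holds request fingerprints. The sketch below is only a conceptual illustration of that idea, not scrapy-redis's actual internals (those live in its Scheduler and RFPDupeFilter classes), and the connection address is a placeholder.

import hashlib
import redis

# connect to the shared Redis instance (placeholder address)
r = redis.Redis(host='127.0.0.1', port=6379)

def push_url(url):
    # every worker pushes newly discovered URLs onto the same list
    r.lpush('test:start_urls', url)

def pop_url():
    # rpop is atomic, so two workers can never receive the same element
    raw = r.rpop('test:start_urls')
    return raw.decode() if raw else None

def seen_before(url):
    # a shared fingerprint set deduplicates URLs across all workers;
    # sadd returns 0 if the member was already present
    fingerprint = hashlib.sha1(url.encode()).hexdigest()
    return r.sadd('test:dupefilter', fingerprint) == 0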

First, install the crawler libraries on both the master and the slave hosts:

pip3 install requests scrapy scrapy-redis redis
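
To confirm the environment on each host, a quick Python check along these lines can be used (the printed versions will of course depend on what pip installed):

import scrapy
import scrapy_redis  # imported only to confirm it is installed
import redis

# print the installed versions of the main packages
print('scrapy', scrapy.__version__)
print('redis-py', redis.__version__)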

Next, install Redis on the master host:

# install redis
yum install redis
# start redis
systemctl start redis
# check the version number
redis-cli --version
# set redis to start on boot
systemctl enable redis.service

Edit the Redis configuration file with vim /etc/redis.conf: set protected-mode to no and comment out the bind line to allow remote access. Note that if the host runs on Alibaba Cloud, the security group policy also needs to expose port 6379:

# bind 127.0.0.1
protected-mode no

After changing the configuration, don’t forget to restart the service for it to take effect

systemctl restart redis
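
Before going further, it is worth checking from the slave host that the master's Redis is actually reachable. A minimal redis-py check might look like this; the host address and password are placeholders for whatever you configured:

import redis

# replace host/password with the values of your master's Redis instance
r = redis.Redis(host='master-host-ip', port=6379, password=None)
print(r.ping())  # True means the slave can reach the shared queue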

Then create a crawler project on each host:

scrapy startproject myspider

Create a new test.py under the project's spiders directory:

# guide package
import scrapy
import os
from scrapy_redis.spiders import RedisSpider

# define the fetching class
#class Test(scrapy.Spider):
class Test(RedisSpider):

    # define the crawler name, which matches the name used on the command line
    name = "test"

    # define key for redis
    redis_key = 'test:start_urls'

    # define header information
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/73.0.3683.86 Chrome/73.0.3683.86 Safari/537.36'
    }

    def parse(self, response):
        print(response.url)
        pass
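
In a real project, parse() would normally yield scraped items or follow-up requests instead of just printing the URL. A rough sketch of what that could look like on a recent Scrapy version is shown below; the CSS selectors are made up for illustration:

    def parse(self, response):
        # yield scraped data as a plain dict (or a scrapy.Item)
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
        # follow in-page links; scrapy-redis will queue and deduplicate them
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)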

Then modify settings.py and add the following configuration, where the Redis address is the one configured on the master host:

BOT_NAME = 'myspider'

SPIDER_MODULES = ['myspider.spiders']
NEWSPIDER_MODULE = 'myspider.spiders'

# Set Chinese encoding
FEED_EXPORT_ENCODING = 'utf-8'

# scrapy-redis host address
REDIS_URL = 'redis://[email protected]:6379'
# queue scheduling
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# do not clear the queue and fingerprints when a crawl finishes
SCHEDULER_PERSIST = True
# deduplicate requests via Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Do not follow robots
ROBOTSTXT_OBEY = False
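
Two optional scrapy-redis settings are also worth knowing about: the bundled RedisPipeline can push scraped items into Redis for later processing, and the scheduler's queue class can be swapped. A sketch of adding them to settings.py, assuming the defaults shipped with scrapy-redis:

# store scraped items in Redis (the list key defaults to <spider>:items)
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# how the shared request queue is ordered (priority queue is the default)
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'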

Finally, you can start the Scrapy service on both hosts

scrapy crawl test

At this point the services are up, but since there are no tasks in the Redis queue yet, the crawlers simply wait.

Enter Redis on the master host:

redis-cli

Push tasks into the queue in Redis:

lpush test:start_urls http://baidu.com
lpush test:start_urls http://chouti.com
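
The same seeding can also be done programmatically, which is handy when there are many start URLs. A small redis-py sketch, with a placeholder address and an example URL list:

import redis

# placeholder address for the master's Redis instance
r = redis.Redis(host='master-host-ip', port=6379)

start_urls = ['http://baidu.com', 'http://chouti.com']
for url in start_urls:
    # push onto the key the spider listens on (redis_key = 'test:start_urls')
    r.lpush('test:start_urls', url)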

You can see that the crawler services on the two servers each grab tasks from the queue, and thanks to Redis's deduplication no URL is crawled more than once.

When the crawl task is complete, you can flush the request fingerprints with the flushdb command so that previously crawled addresses can be fetched again.
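
Note that flushdb wipes the whole Redis database, queue included. If only the deduplication fingerprints need to be reset, deleting the spider's dupefilter key is gentler; in this setup the key should default to <spider name>:dupefilter. A minimal sketch with a placeholder address:

import redis

# placeholder address for the master's Redis instance
r = redis.Redis(host='master-host-ip', port=6379)

# remove only the request-fingerprint set so old URLs can be crawled again
r.delete('test:dupefilter')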

The original article is reprinted from Liu Yue's technology blog: v3u.cn/a_id_83