spiderman

A general-purpose distributed crawler framework based on Scrapy-Redis

Project address github.com/TurboWay/sp…


Contents

  • Screenshots

    • Collection results
    • Crawler metadata
    • Cluster mode
    • Standalone mode
    • Attachment download
    • Kafka real-time collection monitoring example
  • Introduction

    • Features
    • How it works
  • Quick start

    • Download and install
    • How to develop a new crawler
    • How to run a supplementary crawl
    • How to download attachments
    • How to scale to a distributed crawler
    • How to manage crawler metadata
    • How to use Kafka for real-time collection monitoring
    • How to use the crawler API
  • Other

    • Notes
    • Hive environment issues
    • Changelog
    • TODO

Collection results

Crawler metadata

Cluster mode

Standalone mode

Attachment download

Kafka real-time collection monitoring

Features

  • Automatic table creation

  • Automatic generation of crawler code; only a small amount of code needs to be written to complete a distributed crawler

  • Automatic storage of metadata, which makes statistics and supplementary crawls very convenient

  • Suited to multi-site collection: each crawler is customized independently without affecting the others

  • Easy to invoke: the number of pages to collect and the number of crawlers to start can be set through parameters

  • Simple to scale: choose standalone mode (default) or a distributed cluster as required

  • Easy data persistence, with support for multiple databases; just enable the relevant pipeline in the spider

    relational

    • mysql
    • sqlserver
    • oracle
    • postgresql
    • sqlite3

    non-relational

    • hbase
    • mongodb
    • elasticsearch
    • hdfs
    • hive
    • data files, such as CSV
  • Easy anti-crawl handling: a variety of anti-crawl middleware is already built in

    • Random UserAgent
    • Custom request Headers
    • Custom Cookies pool
    • Custom proxy IP addresses
    • Use requests inside scrapy
    • Payload requests
    • Render JS using Splash

How it works

  1. The message queue uses Redis, and the collection strategy is breadth-first, first-in, first-out
  2. Each crawler has a job file, which is used to generate the initial requests (instances of the ScheduledRequest class) and push them to Redis.

After all the initial requests have been pushed to Redis, a spider is started to parse the responses, write out the data, and push newly generated requests back to Redis, until every request in Redis has been consumed. (A rough job-side sketch of building these requests follows the class definition below.)

# scrapy_redis request class
class ScheduledRequest:

    def __init__(self, **kwargs):
        self.url = kwargs.get('url')                  # request url
        self.method = kwargs.get('method', 'GET')     # request method, defaults to GET
        self.callback = kwargs.get('callback')        # callback, the name of the spider's parse function
        self.body = kwargs.get('body')                # body, used as the POST form when method is POST
        self.meta = kwargs.get('meta')                # meta, carries metadata such as pagenum
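For illustration, here is a hedged sketch of what a job might build in step 2: one ScheduledRequest per start page. The URL is a placeholder and the import path of ScheduledRequest is omitted; how the batch is actually pushed to Redis is handled by the job base class (SPJob) and is not shown, so treat this as a sketch of the data shape rather than the framework's exact API.

# Hypothetical sketch: build the initial requests for the first 10 list pages.
# The URL is a placeholder; 'list' must match a key handled by the spider's get_callback.
initial_requests = [
    ScheduledRequest(
        url=f'https://example.com/list?page={page}',  # placeholder start URL
        method='GET',
        callback='list',              # resolved to self.list_parse by get_callback
        body=None,
        meta={'pagenum': page},       # carried through so items can record their page number
    )
    for page in range(1, 11)
]
# Pushing initial_requests to Redis is left to the SPJob base class and is not shown here.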
  3. The item class defines the table name, field names, sort numbers (to customize the field order), comments (to make the metadata easy to manage), and field types (only effective for relational database pipelines); a rough sketch of how this metadata can drive table creation follows the example class below
class zhifang_list_Item(scrapy.Item):
    # define table
    tablename = 'zhifang_list'
    tabledesc = 'list'
    # define the fields for your item here like:
    # the default field type is VARCHAR(length=255)
    # colname = scrapy.Field({'idx': 1, 'comment': 'name', 'type': VARCHAR(255)})
    tit = scrapy.Field({'idx': 1, 'comment': 'Title of house'})
    txt = scrapy.Field({'idx': 2, 'comment': 'House description'})
    tit2 = scrapy.Field({'idx': 3, 'comment': 'House floor'})
    price = scrapy.Field({'idx': 4, 'comment': 'House price'})
    agent = scrapy.Field({'idx': 5, 'comment': 'Real estate agent'})
    # default columns
    detail_full_url = scrapy.Field({'idx': 100, 'comment': 'Detail link'})   # generic field
    pkey = scrapy.Field({'idx': 101, 'comment': 'md5(detail_full_url)'})     # generic field
    pagenum = scrapy.Field({'idx': 102, 'comment': 'page number'})           # generic field
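The 'idx', 'comment', and 'type' metadata attached to each Field is what drives automatic table creation and the metadata records. The following is only a rough, hedged sketch of the idea, not the framework's actual pipeline code (the COMMENT syntax shown is MySQL-flavoured, and the helper name is made up for illustration):

# Illustrative sketch only: build a CREATE TABLE statement from the Field metadata
# (idx / comment / type) of an item class such as zhifang_list_Item defined above.
from sqlalchemy import VARCHAR

def build_create_table_sql(item_cls):
    cols = []
    # sort fields by their 'idx' so the column order matches the definitions
    for name, field in sorted(item_cls.fields.items(), key=lambda kv: kv[1].get('idx', 999)):
        col_type = field.get('type', VARCHAR(255))            # default type when none is given
        comment = field.get('comment', '')
        cols.append(f"{name} {col_type} COMMENT '{comment}'")  # MySQL-style column comment
    return f"CREATE TABLE {item_cls.tablename} (\n  " + ",\n  ".join(cols) + "\n);"

print(build_create_table_sql(zhifang_list_Item))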
  4. Deduplication strategy: by default no deduplication is performed and every collection run is independent. For deduplication (incremental collection), modify the following configuration
  • Job file (single crawler)
class zhifang_job(SPJob):

    def __init__(self):
        super().__init__(spider_name=zhifang_Spider.name)
        # self.delete()  # comment out this line for deduplication / incremental collection
  • Spider file (single crawler)
    custom_settings = {
        ...,
        'DUPEFILTER_CLASS': 'scrapy_redis.dupefilter.RFPDupeFilter',
        'SCHEDULER_PERSIST': True,  # enable persistence
    }

    def get_callback(self, callback):
        # url dedup flag: True = do not deduplicate, False = deduplicate
        callback_dt = {
            'list': (self.list_parse, False),
            'detail': (self.detail_parse, False),
        }
        return callback_dt.get(callback)
  • Bloom filter.

When the volume of collected data is very large, a Bloom filter can be used for deduplication. The algorithm occupies a small, controllable amount of memory, which makes it well suited to deduplicating massive data sets; the trade-off is a miss rate, i.e. some requests are wrongly judged as duplicates and the crawler skips them. The miss rate can be reduced by adjusting the filter count, the memory size, and the number of hash seeds. The default configuration of 1 filter, 256 MB of memory, and 7 seeds gives a miss probability of about 8.56e-05, which is enough to deduplicate about 93 million strings; at a miss rate of 0.000112 it can handle about 98 million strings. For how to trade these parameters off against the miss rate, see the tuning reference. (A back-of-the-envelope check of these figures is sketched after the configuration below.)

    custom_settings = {
        ...,
        'DUPEFILTER_CLASS': 'SP.bloom_dupefilter.BloomRFDupeFilter',  # use the Bloom filter
        'SCHEDULER_PERSIST': True,  # enable persistence
        'BLOOM_NUM': 1,    # number of Bloom filters; can be increased when the memory limit is reached
        'BLOOM_MEM': 256,  # Bloom filter memory size in MB; at most 512 MB (a redis string is limited to 512 MB)
        'BLOOM_K': 7,      # number of Bloom filter hashes; fewer hashes mean faster deduplication but a higher miss rate
    }

    def get_callback(self, callback):
        # url dedup flag: True = do not deduplicate, False = deduplicate
        callback_dt = {
            'list': (self.list_parse, False),
            'detail': (self.detail_parse, False),
        }
        return callback_dt.get(callback)
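For intuition about those figures, the standard Bloom filter estimate p ≈ (1 − e^(−k·n/m))^k reproduces them closely. The snippet below is just that arithmetic applied to the default configuration above; it is an independent back-of-the-envelope check, not code from the framework.

import math

def bloom_false_positive_rate(n_items, mem_mb=256, num_filters=1, k=7):
    """Approximate Bloom filter miss rate: (1 - e^(-k*n/m)) ** k."""
    m = mem_mb * 1024 * 1024 * 8     # bits per filter (one redis string of mem_mb MB)
    n = n_items / num_filters        # items handled by each filter
    return (1 - math.exp(-k * n / m)) ** k

print(bloom_false_positive_rate(93_000_000))  # ~8.4e-05, close to the 8.56e-05 quoted above
print(bloom_false_positive_rate(98_000_000))  # ~1.1e-04, close to the 0.000112 quoted above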

Download and install

  1. git clone github.com/TurboWay/sp… ; cd spiderman;
  2. virtualenv -p /usr/bin/python3 venv
  3. source venv/bin/activate
  4. pip install -i pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt
  5. Modify the configuration: vi SP/settings.py
  6. Run the demo: python SP_JOBS/zhifang_job.py

How to develop a new crawler

Run easy_scrapy.py to automatically generate the following code files from templates; it will also open spidername_job.py in the editor.

Category   Path                            Description
job        SP_JOBS/spidername_job.py       Write the initial requests
spider     SP/spiders/spidername.py        Write the parsing rules and generate new requests (a rough sketch follows below)
items      SP/items/spidername_items.py    Define the table name and fields

After the above code files are written, run python SP_JOBS/spidername_job.py directly

Or run the job with parameters: -p is the number of pages to collect, -n is the number of crawlers to start, e.g. python SP_JOBS/spidername_job.py -p 10 -n 1
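To make the spider side more concrete, a parse function for the 'list' callback might look roughly like the sketch below. The CSS selectors and page structure are placeholders rather than code from the project; only the item fields and the pagenum meta follow the definitions shown earlier.

def list_parse(self, response):
    """Hedged sketch of a 'list' callback; all selectors are placeholders."""
    for row in response.css('div.house-item'):                 # placeholder selector
        item = zhifang_list_Item()
        item['tit'] = row.css('a.title::text').get()           # placeholder selector
        item['txt'] = row.css('p.desc::text').get()            # placeholder selector
        item['price'] = row.css('span.price::text').get()      # placeholder selector
        item['detail_full_url'] = response.urljoin(row.css('a.title::attr(href)').get())
        item['pagenum'] = response.meta.get('pagenum')         # passed along from the initial request
        yield item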

How to run a supplementary crawl

Run easy_scrapy.py to automatically generate the following code file from a template; it will also open spidername_job_patch.py in the editor.

Category   Path                               Description
job        SP_JOBS/spidername_job_patch.py    Write the supplementary crawl requests

After the above code file is written, run python SP_JOBS/spidername_job_patch.py directly

How to download attachments

There are two ways to download attachments:

  • 1. Enable the attachment download pipeline directly in the spider
  • 2. Pass parameters to the custom downloader execute_download.py to download

Documents such as jpg/pdf/word files are collectively referred to as attachments. Downloading attachments takes up a lot of bandwidth, so for large-scale collection it is best to first store the structured table data and the attachment metadata in the database to guarantee data integrity, and then download the attachments with the downloader as required.

How to scale to a distributed crawler

There are two collection modes (controlled in the settings): standalone (default) and distributed cluster

To switch to a distributed crawler, enable the following configuration in spiderman/SP/settings.py

Note: the prerequisite is that every slave machine has the same crawler code and Python environment and can run the crawler demo

# False = standalone (default); True = distributed, which requires the SLAVES configuration below
CLUSTER_ENABLE = True

Setting            Meaning                                              Example
SLAVES             List of crawler machines                             [{'host': '172.16.122.12', 'port': 22, 'user': 'spider', 'pwd': 'spider'},
                                                                          {'host': '172.16.122.13', 'port': 22, 'user': 'spider', 'pwd': 'spider'}]
SLAVES_BALANCE     Crawler machine config (SSH load balancing)          {'host': '172.16.122.11', 'port': 2202, 'user': 'spider', 'pwd': 'spider'}
SLAVES_ENV         [Optional] Virtualenv path on the crawler machines   /home/spider/workspace/spiderman/venv
SLAVES_WORKSPACE   Project code path on the crawler machines            /home/spider/workspace/spiderman
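Putting the table above together, the relevant block of SP/settings.py might look roughly like the following sketch; the hosts, ports, and credentials are just the example values from the table, and the comments are paraphrases rather than the file's actual comments.

# False = standalone (default); True = distributed, which requires the SLAVES settings below
CLUSTER_ENABLE = True

# List of crawler machines (started over SSH)
SLAVES = [
    {'host': '172.16.122.12', 'port': 22, 'user': 'spider', 'pwd': 'spider'},
    {'host': '172.16.122.13', 'port': 22, 'user': 'spider', 'pwd': 'spider'},
]

# Optional alternative: a single SSH entry point with load balancing
# SLAVES_BALANCE = {'host': '172.16.122.11', 'port': 2202, 'user': 'spider', 'pwd': 'spider'}

# [Optional] virtualenv path on the crawler machines
SLAVES_ENV = '/home/spider/workspace/spiderman/venv'

# Project code path on the crawler machines
SLAVES_WORKSPACE = '/home/spider/workspace/spiderman'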

How to manage crawler metadata

Run easy_meta.py to automatically generate the metadata for all crawlers of the current project. By default it is recorded in a SQLite database, meta.db, which can be changed in the settings.

# the crawler meta
META_ENGINE = 'sqlite:///meta.db'

The data dictionary of the meta table is as follows:

Field name       Type           Comment
spider           varchar(50)    Crawler name
spider_comment   varchar(100)   Crawler description
tb               varchar(50)    Table name
tb_comment       varchar(100)   Table description
col_px           int            Field sort order
col              varchar(50)    Field name
col_comment      varchar(100)   Field description
author           varchar(20)    Developer
addtime          varchar(20)    Development time
insertime        varchar(20)    Metadata update time
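Because the metadata lands in an ordinary database table (SQLite by default), it can be queried directly. Below is a minimal sketch, assuming the default META_ENGINE and a table named meta as in the dictionary above; 'zhifang' is the demo crawler's name and the exact table name is an assumption.

import sqlite3

# Assumes the default META_ENGINE ('sqlite:///meta.db') and a table named meta.
conn = sqlite3.connect('meta.db')
rows = conn.execute(
    "SELECT tb, col_px, col, col_comment FROM meta WHERE spider = ? ORDER BY tb, col_px",
    ('zhifang',),   # the demo crawler
).fetchall()
for tb, col_px, col, col_comment in rows:
    print(tb, col_px, col, col_comment)
conn.close()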

How to use Kafka for real-time collection monitoring

  1. Configure Kafka (modify the KAFKA_SERVERS setting)
  2. Customize the monitoring rules (edit kafka_mon.py and run the script to start monitoring; a rough sketch is given below)
  3. Enable the Kafka pipeline in the spider (run the crawler job to start collecting)
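As a very rough idea of what a custom monitoring rule could do, the sketch below consumes collection records from Kafka and counts them per crawler. The topic name and message fields are assumptions made for illustration; the real rules belong in kafka_mon.py and should follow that script's conventions.

import json
from collections import Counter

from kafka import KafkaConsumer  # kafka-python

KAFKA_SERVERS = ['localhost:9092']   # should match the KAFKA_SERVERS setting
TOPIC = 'spiderman'                  # assumed topic name, for illustration only

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=KAFKA_SERVERS,
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

counts = Counter()
for message in consumer:
    record = message.value
    counts[record.get('spider', 'unknown')] += 1   # 'spider' field name is an assumption
    total = sum(counts.values())
    if total % 1000 == 0:                          # report every 1000 records
        print(f'{total} records collected: {dict(counts)}')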

How to use the crawler API

Run api.py directly, then visit http://127.0.0.1:2021/docs to view the API documentation

Notes

  1. Do not use tablename, isload, ctime, bizdate, spider, or similar names for item fields, because these are reserved as common columns and would conflict
  2. It is recommended to add a comment to every field in the items file; when the metadata is generated, the comments are imported into the metadata table, which makes the crawlers easier to manage

Hive environment issues

On Windows there are many pitfalls when connecting to Hive from Python 3, so to keep deployment simple, automatic Hive table creation is disabled by default when HDFS is used. To enable automatic Hive table creation, do the following:

  1. pip install -i pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt
  2. pip install --no-deps thrift-sasl==0.2.1
  3. To verify the environment, run SP.utils.ctrl_hive

If it runs successfully, the Hive environment is ready and you can enable automatic Hive table creation directly. If you run into problems, see the article on connecting to Hive from Python 3 on Windows.

Changelog

Date       Updates
20200803   1. Use a more elegant way to generate metadata;
           2. Adjust how parameters are passed to the pipeline functions;
           3. Adjust the download status field (isload => status)
20200831   1. Retry when writing data to the database fails;
           2. Optimize all pipelines: when a batch insert fails, automatically fall back to row-by-row inserts and discard only the abnormal records
20201104   1. The requests middleware supports DOWNLOAD_TIMEOUT and DOWNLOAD_DELAY
20201212   1. The payload middleware supports DOWNLOAD_TIMEOUT and DOWNLOAD_DELAY;
           2. Optimize the get_SP_cookies method, replacing Selenium with the lighter-weight Splash;
           3. Add a description of the deduplication strategy to the "How it works" section of the README
20210105   1. Add a Bloom filter
20210217   1. Adjust the Elasticsearch pipeline (compatible with Elasticsearch 7 and later) to use the table name as the index name
20210314   1. Merge all anti-crawl middleware into SPMiddleWare
20210315   1. Generate the initial job requests in a more elegant way;
           2. Optimize the headers middleware to reduce Redis memory usage;
           3. Remove the cookie middleware: a cookie is just one value in headers, so the headers middleware can be used directly;
           4. Remove the payload middleware: payload requests can be made directly;
           5. Add the CookiesPool middleware for randomly switching between multiple accounts
20210317   1. Add a distributed attachment downloader that works independently of scrapy
20210318   1. Add the API service