Goal setting

The goal is to crawl the seismic data published by the China Earthquake Networks Center and store it in MySQL: a one-time full crawl first, followed by daily incremental crawls.

Preparation

Analyzing the request path

Query seismic data on the China Earthquake Networks Center site (www.ceic.ac.cn/speedsearch) and inspect the network requests in the browser's developer tools; the following request paths can be observed.

Earthquakes within 24 hours: www.ceic.ac.cn/ajax/speeds…

Earthquakes within 48 hours: www.ceic.ac.cn/ajax/speeds…

Earthquakes within 7 days: www.ceic.ac.cn/ajax/speeds…

Earthquakes within 30 days: www.ceic.ac.cn/ajax/speeds…

Earthquakes within 1 year: www.ceic.ac.cn/ajax/speeds…

It can be seen that the query path for seismic data by time range follows the same pattern: www.ceic.ac.cn/ajax/speeds…

A few guesses about the parameters:

  1. The num parameter selects the time range of the query (1: 24 hours, 2: 48 hours, 3: 7 days, 4: 30 days, 6: 1 year)
  2. The page parameter is the page number
  3. The callback parameter, which contains a timestamp and is used to render the returned data on the page, is not needed for our purposes; testing confirms it can be omitted

So the path we will crawl is: www.ceic.ac.cn/ajax/speeds…

Our goal is a one-time full crawl plus a daily incremental crawl, so only two request paths are needed:

Full crawl: www.ceic.ac.cn/ajax/speeds…

Incremental crawl: www.ceic.ac.cn/ajax/speeds…

Analyzing the response

Enter the following URL in the browser: www.ceic.ac.cn/ajax/speeds…

The response body is Unicode-escaped; an online tool (www.jsons.cn/unicode) can be used to convert it to readable text.

The return value is a tuple-like wrapper containing a single JSON element. After formatting it with an online JSON tool (www.sojson.com), the useful data is the value under the key shuju, which is a list of seismic records. Each record has the following keys we care about: M (magnitude), O_TIME (origin time), EPI_LAT (epicenter latitude), EPI_LON (epicenter longitude), EPI_DEPTH (focal depth), LOCATION_C (location), ID (unique record ID, used later for incremental de-duplication).
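Before writing the spiders, this structure can be checked with a small standalone script. Below is a minimal sketch using the requests library and the full query path that appears in the spider code later (num selects the time range, page the page number); it assumes the body is a parenthesis-wrapped, Unicode-escaped JSON object:

```python
import json

import requests

# num=1 -> earthquakes within 24 hours, page=1 -> first page
url = 'http://www.ceic.ac.cn/ajax/speedsearch?num=1&&page=1'
body = requests.get(url).text

# Strip the wrapping parentheses and parse the JSON; json.loads also resolves
# the \uXXXX escapes. (The Scrapy spiders below unwrap the same body with eval().)
result = json.loads(body.strip().strip('()'))

for record in result['shuju']:
    print(record['M'], record['O_TIME'], record['EPI_LAT'], record['EPI_LON'],
          record['EPI_DEPTH'], record['LOCATION_C'], record['id'])
```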

Data Storage Planning

Create the EARTHQUAKE table in the MySQL database GEO_DB:

```sql
CREATE TABLE `EARTHQUAKE` (
  `EARTHQUAKE_LEVEL`   float(3,1)   DEFAULT NULL COMMENT 'earthquake magnitude',
  `EARTHQUAKE_TIME`    datetime(6)  DEFAULT NULL COMMENT 'earthquake time',
  `EARTHQUAKE_LON`     float(5,2)   DEFAULT NULL COMMENT 'epicenter longitude',
  `EARTHQUAKE_LAT`     float(5,2)   DEFAULT NULL COMMENT 'epicenter latitude',
  `EARTHQUAKE_DEPTH`   bigint(10)   DEFAULT NULL COMMENT 'epicenter depth',
  `EARTHQUAKE_ADDRESS` varchar(255) CHARACTER SET utf8 DEFAULT NULL COMMENT 'earthquake location',
  `VERSION`            datetime(6)  DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP(6) COMMENT 'update time',
  `DID`                varchar(20)  CHARACTER SET utf8 DEFAULT NULL COMMENT 'unique data ID'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
```

Scrapy code implementation

How Scrapy works

Project creation

```shell
scrapy startproject earthquake
scrapy genspider full_mount "www.ceic.ac.cn"
scrapy genspider increment "www.ceic.ac.cn"
```

The directory structure is as follows
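A sketch of the standard layout the commands above produce (the two spider files come from genspider; middlewares.py and scrapy.cfg are generated automatically by Scrapy):

```
earthquake/
├── scrapy.cfg
└── earthquake/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        ├── full_mount.py
        └── increment.py
```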

  1. Modify the configuration file settings.py to add the MySQL connection parameters
```python
# Do not obey robots.txt, to avoid the crawler being blocked by the site's rules
ROBOTSTXT_OBEY = False

# Enable the item pipeline (disabled by default)
ITEM_PIPELINES = {
    'earthquake.pipelines.EarthquakePipeline': 300,
}

# MySQL connection parameters
MYSQL_HOST = '***.***.***.***'
MYSQL_PORT = ***
MYSQL_USER = '***'
MYSQL_PASSWORD = '***'
MYSQL_DB = '***'
```
  2. Implement the data model items.py
```python
# -*- coding: utf-8 -*-
import scrapy


class EarthquakeItem(scrapy.Item):
    earthquake_level = scrapy.Field()    # magnitude
    earthquake_time = scrapy.Field()     # origin time
    earthquake_lon = scrapy.Field()      # epicenter longitude
    earthquake_lat = scrapy.Field()      # epicenter latitude
    earthquake_depth = scrapy.Field()    # focal depth
    earthquake_address = scrapy.Field()  # location
    version = scrapy.Field()             # update time
    did = scrapy.Field()                 # unique record ID
```
  3. Implement the full crawler full_mount.py
```python
# -*- coding: utf-8 -*-
import scrapy

from ..items import EarthquakeItem


class FullMountSpider(scrapy.Spider):
    name = 'full_mount'
    allowed_domains = ['www.ceic.ac.cn']
    # Full query path (earthquakes within 1 year)
    start_url = 'http://www.ceic.ac.cn/ajax/speedsearch?num=6&&page='
    # Number of pages to crawl
    MAX_PAGE = 60

    def start_requests(self):
        for i in range(1, self.MAX_PAGE + 1):
            yield scrapy.Request('%s%d' % (self.start_url, i),
                                 callback=self.parse, dont_filter=True)

    def parse(self, response):
        # The body is a parenthesis-wrapped JSON object; eval() unwraps it into a dict
        result = eval(response.body.decode('utf-8'))
        records = result['shuju']
        for record in records:
            item = EarthquakeItem()
            item['earthquake_level'] = record['M']
            item['earthquake_time'] = record['O_TIME']
            item['earthquake_lon'] = record['EPI_LON']
            item['earthquake_lat'] = record['EPI_LAT']
            item['earthquake_depth'] = record['EPI_DEPTH']
            item['earthquake_address'] = record['LOCATION_C']
            item['did'] = record['id']
            yield item
```
  4. Implement the incremental crawler increment.py
```python
# -*- coding: utf-8 -*-
import scrapy

from ..items import EarthquakeItem


class IncrementSpider(scrapy.Spider):
    name = 'increment'
    allowed_domains = ['www.ceic.ac.cn']
    # Incremental query path (earthquakes within 24 hours)
    start_url = 'http://www.ceic.ac.cn/ajax/speedsearch?num=1&&page='
    # Number of pages to crawl
    MAX_PAGE = 1

    def start_requests(self):
        for i in range(1, self.MAX_PAGE + 1):
            yield scrapy.Request('%s%d' % (self.start_url, i),
                                 callback=self.parse, dont_filter=True)

    def parse(self, response):
        result = eval(response.body.decode('utf-8'))
        records = result['shuju']
        for record in records:
            item = EarthquakeItem()
            item['earthquake_level'] = record['M']
            item['earthquake_time'] = record['O_TIME']
            item['earthquake_lon'] = record['EPI_LON']
            item['earthquake_lat'] = record['EPI_LAT']
            item['earthquake_depth'] = record['EPI_DEPTH']
            item['earthquake_address'] = record['LOCATION_C']
            item['did'] = record['id']
            yield item
```
  5. Implement the pipeline pipelines.py, which is responsible for writing the items into MySQL
```python
# -*- coding: utf-8 -*-
import logging

import pymysql
from scrapy.utils.project import get_project_settings

logging.getLogger().setLevel(logging.INFO)

settings = get_project_settings()


class EarthquakePipeline(object):

    def open_spider(self, spider):
        host = settings['MYSQL_HOST']
        port = settings['MYSQL_PORT']
        user = settings['MYSQL_USER']
        password = settings['MYSQL_PASSWORD']
        database = settings['MYSQL_DB']
        try:
            self.conn = pymysql.connect(host=host, port=port, user=user,
                                        password=password, database=database)
            logging.info('MySQL database connected successfully.')
        except Exception as e:
            logging.error('MySQL database connection failed!')
            raise e

    def process_item(self, item, spider):
        # Check whether the record already exists (de-duplicate by DID)
        select_sql = 'SELECT COUNT(1) AS CNT FROM EARTHQUAKE WHERE DID=\'%s\'' % (item['did'])
        try:
            cursor = self.conn.cursor()
            cursor.execute(select_sql)
            result_count, = cursor.fetchone()
            if result_count > 0:
                logging.info('Data already exists, no need to insert it again.')
                return item
        except Exception as e:
            logging.error('Data query error: %s' % select_sql)
            return item

        # Insert the new record
        insert_sql = 'INSERT INTO EARTHQUAKE(DID,EARTHQUAKE_LEVEL,EARTHQUAKE_TIME,EARTHQUAKE_LON,' \
                     'EARTHQUAKE_LAT,EARTHQUAKE_DEPTH,EARTHQUAKE_ADDRESS,VERSION) ' \
                     'VALUES(\'%s\',%s,\'%s\',%s,%s,%s,\'%s\',now())' % (
                         item['did'],
                         item['earthquake_level'],
                         item['earthquake_time'],
                         item['earthquake_lon'],
                         item['earthquake_lat'],
                         item['earthquake_depth'],
                         item['earthquake_address'])
        try:
            cursor = self.conn.cursor()
            cursor.execute(insert_sql)
            logging.info('Data inserted successfully!')
        except Exception as e:
            logging.error('Data insert error: %s' % insert_sql)
            self.conn.rollback()
            return item
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
        logging.info('MySQL database connection closed.')
```
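A note on the SQL construction: process_item() builds both statements with %-string formatting, which breaks if LOCATION_C contains a quote character and is open to SQL injection. Below is a minimal sketch of the same check-then-insert logic using pymysql's parameter binding (cursor.execute(sql, args)); it is an alternative, not what the code above does, and insert_if_new is a hypothetical helper name.

```python
def insert_if_new(conn, item):
    """Sketch: the same dedup-then-insert logic, but with pymysql parameter
    binding instead of %-string formatting."""
    with conn.cursor() as cursor:
        cursor.execute('SELECT COUNT(1) FROM EARTHQUAKE WHERE DID=%s', (item['did'],))
        (count,) = cursor.fetchone()
        if count:
            return False  # already stored
        cursor.execute(
            'INSERT INTO EARTHQUAKE(DID, EARTHQUAKE_LEVEL, EARTHQUAKE_TIME, EARTHQUAKE_LON,'
            ' EARTHQUAKE_LAT, EARTHQUAKE_DEPTH, EARTHQUAKE_ADDRESS, VERSION)'
            ' VALUES(%s, %s, %s, %s, %s, %s, %s, now())',
            (item['did'], item['earthquake_level'], item['earthquake_time'],
             item['earthquake_lon'], item['earthquake_lat'],
             item['earthquake_depth'], item['earthquake_address']))
    conn.commit()
    return True
```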

Full crawl execution

```shell
# Run the full crawler
scrapy crawl full_mount
```

Incremental crawl execution

```shell
# Run the incremental crawler
scrapy crawl increment
```
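Since the goal is a daily incremental crawl, the increment spider needs to be scheduled. A minimal sketch using cron, assuming the project lives at /path/to/earthquake and scrapy is on the PATH in the cron environment (adjust both to your setup):

```
# crontab entry: run the incremental crawler every day at 02:00
0 2 * * * cd /path/to/earthquake && scrapy crawl increment >> /tmp/earthquake_increment.log 2>&1
```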

Crawl results

```sql
SELECT * FROM `GEO_DB`.`EARTHQUAKE` LIMIT 1000;
```

Full code view

Gitee.com/angryshar12…