Recently I wanted to build a system for collecting trending COVID-19 information at home and abroad, so I needed to grab some data from Twitter and then run classification and sentiment analysis on it. As qualified programmers, we should have a "take what works" mindset: build our own project on other people's wheels instead of starting from scratch.

1. The Scrapy crawler framework

Scrapy is an application framework, implemented in Python, for crawling web sites and extracting structured data. Professional work calls for a professional framework, so for this project we decided to use Scrapy for the data crawling. If you're not familiar with Scrapy, check out my previous introductory post on the Python Scrapy crawler framework.
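If you have never seen a Scrapy spider, its basic shape looks roughly like this. This is a standalone sketch using the demo site from the official Scrapy tutorial, not code from this project:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # extract structured data with CSS selectors and yield it as items
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

Scrapy then takes care of scheduling requests, downloading pages, and passing the yielded items through its pipelines.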

2. Finding an open source project

To avoid reinventing the wheel, before starting the project I searched GitHub with the keywords "Scrapy" and "Twitter" for existing open source projects.

The search turned up quite a few open source projects that fit the bill, so how do we choose? I used three criteria: the number of stars (a rough signal that the project is good and recognized by the community), the last update time (a sign that the project is still being maintained), and the quality of the documentation. Judged this way, the first project in the results looks very good: it has a high star count, it was updated a few months ago, and its documentation is very detailed, so we can use it as the base for our own development. The project is jonbakerfish/TweetScraper.

3. Local installation and debugging

1. Clone the project

The project requires Scrapy and PyMongo (also install MongoDB if you want to save the data to a database). Set it up as follows:

$ git clone https://github.com/jonbakerfish/TweetScraper.git
$ cd TweetScraper/
$ pip install -r requirements.txt  #add '--user' if you are not root
$ scrapy list
$ #If the output is 'TweetScraper', then you are ready to go.

2. Data persistence

After reading the documentation, we find that the project offers three ways to persist the data: save it to a file, save it to MongoDB, or save it to a MySQL database. Because the captured data needs to be analyzed later, we save it to MySQL.

By default, the captured data is stored on disk under ./Data/tweet/ in JSON format, so we need to modify the configuration file TweetScraper/settings.py:

ITEM_PIPELINES = {
    # 'TweetScraper.pipelines.SaveToFilePipeline': 100,
    # 'TweetScraper.pipelines.SaveToMongoPipeline': 100,  # replace `SaveToFilePipeline` with this to use MongoDB
    'TweetScraper.pipelines.SavetoMySQLPipeline': 100,     # replace `SaveToFilePipeline` with this to use MySQL
}

# Settings for MySQL
MYSQL_SERVER = "18.126.219.16"
MYSQL_DB = "scraper"
MYSQL_TABLE = "tweets"     # the table will be created automatically
MYSQL_USER = "root"        # MySQL user to use (should have INSERT access granted to the Database/Table)
MYSQL_PWD = "admin123456"  # MySQL user's password
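Note that only the table is created automatically; the database named in MYSQL_DB has to exist before the pipeline can connect to it. Here is a minimal sketch for creating it with mysql.connector (my own addition, reusing the credentials above; replace them with your own):

import mysql.connector

cnx = mysql.connector.connect(user="root", password="admin123456", host="18.126.219.16")
cursor = cnx.cursor()
cursor.execute("CREATE DATABASE IF NOT EXISTS scraper")
cnx.close()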

3. Test run

Go to the root directory of the project and run the following command:

# enter the project directory
cd /work/Code/scraper/TweetScraper
scrapy crawl TweetScraper -a query="Novel coronavirus,#COVID-19"

Note that fetching Twitter data requires either a way around the firewall (a VPN or proxy) or a server deployed outside the country; the latter is what I used.
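If you run the crawler from inside the firewall instead, one option is to route Scrapy's requests through a local proxy. This is my own sketch, not part of TweetScraper; the proxy address is an assumption and should be replaced with whatever your VPN or proxy client exposes:

# TweetScraper/middlewares.py (hypothetical addition)
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours request.meta['proxy']
        request.meta['proxy'] = 'http://127.0.0.1:1080'

Then enable it in TweetScraper/settings.py:

DOWNLOADER_MIDDLEWARES = {
    'TweetScraper.middlewares.ProxyMiddleware': 543,
}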

[root@cs TweetScraper]# scrapy crawl TweetScraper -a query="Novel coronavirus,#COVID-19"
2020-04-16 19:22:40 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: TweetScraper)
2020-04-16 19:22:40 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) - [GCC 7.2.0], pyOpenSSL 18.0.0 (OpenSSL 1.0.2o 27 Mar 2018), cryptography 2.2.2, Platform Linux-3.10.0-862.el7.x86_64-with-centos-7.5.1804-Core
2020-04-16 19:22:40 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'TweetScraper', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'TweetScraper.spiders', 'SPIDER_MODULES': ['TweetScraper.spiders'], 'USER_AGENT': 'TweetScraper'}
2020-04-16 19:22:40 [scrapy.extensions.telnet] INFO: Telnet Password: 1fb55da389e595db
2020-04-16 19:22:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-04-16 19:22:41 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-04-16 19:22:41 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
Mysql connection is successful ################################## MySQLCursorBuffered: (Nothing executed yet)
2020-04-16 19:22:41 [TweetScraper.pipelines] INFO: Table 'tweets' already exists
2020-04-16 19:22:41 [scrapy.middleware] INFO: Enabled item pipelines:
['TweetScraper.pipelines.SavetoMySQLPipeline']
2020-04-16 19:22:41 [scrapy.core.engine] INFO: Spider opened
2020-04-16 19:22:41 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-04-16 19:22:41 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-04-16 19:23:45 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 1 pages/min), scraped 11 items (at 11 items/min)
2020-04-16 19:24:44 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 1 pages/min), scraped 22 items (at 11 items/min)
^C2020-04-16 19:26:27 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2020-04-16 19:26:27 [scrapy.core.engine] INFO: Closing spider (shutdown)
2020-04-16 19:26:43 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 1 pages/min), scraped 44 items (at 11 items/min)

As you can see, the project runs OK and the captured data has been saved in the database.

4. Cleaning the data

Because the captured tweets contain special characters such as emoji, inserting them into the database raises errors, so the captured text needs to be cleaned first.

Add a filter_emoji function to the TweetScraper/utils.py file:

import re

def filter_emoji(desstr, restr=''):
    """
    filter emoji
    desstr: origin str
    restr: replace str
    """
    # filter emoji
    try:
        res = re.compile(u'[\U00010000-\U0010ffff]')
    except re.error:
        res = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
    return res.sub(restr, desstr)
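A quick way to sanity-check the filter from the project root (a usage sketch, not part of the project code):

from TweetScraper.utils import filter_emoji

print(filter_emoji('Stay safe 😷🙏', ''))   # expected to print: 'Stay safe ' (emoji removed)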

Call this method in the tweetcrawler.py file:

from TweetScraper.utils import filter_emoji

def parse_tweet_item(self, items):
    for item in items:
        try:
            tweet = Tweet()

            tweet['usernameTweet'] = item.xpath('.//span[@class="username u-dir u-textTruncate"]/b/text()').extract()[0]

            ID = item.xpath('.//@data-tweet-id').extract()
            if not ID:
                continue
            tweet['ID'] = ID[0]

            ### get text content
            tweet['text'] = ' '.join(
                item.xpath('.//div[@class="js-tweet-text-container"]/p//text()').extract()
            ).replace(' # ', '#').replace(' @ ', '@')

            ### clean data[20200416]
            # tweet['text'] = re.sub(r"...", "", tweet['text'])
            tweet['text'] = filter_emoji(tweet['text'], '')

            if tweet['text'] == '':
                # If there is not text, we ignore the tweet
                continue

            ### get meta data
            tweet['url'] = item.xpath('.//@data-permalink-path').extract()[0]

            nbr_retweet = item.css('span.ProfileTweet-action--retweet > span.ProfileTweet-actionCount').xpath(
                '@data-tweet-stat-count').extract()
            if nbr_retweet:
                tweet['nbr_retweet'] = int(nbr_retweet[0])
            else:
                tweet['nbr_retweet'] = 0

            nbr_favorite = item.css('span.ProfileTweet-action--favorite > span.ProfileTweet-actionCount').xpath(
                '@data-tweet-stat-count').extract()
            if nbr_favorite:
                tweet['nbr_favorite'] = int(nbr_favorite[0])
            else:
                tweet['nbr_favorite'] = 0

            nbr_reply = item.css('span.ProfileTweet-action--reply > span.ProfileTweet-actionCount').xpath(
                '@data-tweet-stat-count').extract()
            if nbr_reply:
                tweet['nbr_reply'] = int(nbr_reply[0])
            else:
                tweet['nbr_reply'] = 0

            tweet['datetime'] = datetime.fromtimestamp(int(
                item.xpath('.//div[@class="stream-item-header"]/small[@class="time"]/a/span/@data-time').extract()[0]
            )).strftime('%Y-%m-%d %H:%M:%S')

            ### get photo
            has_cards = item.xpath('.//@data-card-type').extract()
            if has_cards and has_cards[0] == 'photo':
                tweet['has_image'] = True
                tweet['images'] = item.xpath('.//*/div/@data-image-url').extract()
            elif has_cards:
                logger.debug('Not handle "data-card-type":\n%s' % item.xpath('.').extract()[0])

            ### get animated_gif
            has_cards = item.xpath('.//@data-card2-type').extract()
            if has_cards:
                if has_cards[0] == 'animated_gif':
                    tweet['has_video'] = True
                    tweet['videos'] = item.xpath('.//*/source/@video-src').extract()
                elif has_cards[0] == 'player':
                    tweet['has_media'] = True
                    tweet['medias'] = item.xpath('.//*/div/@data-card-url').extract()
                elif has_cards[0] == 'summary_large_image':
                    tweet['has_media'] = True
                    tweet['medias'] = item.xpath('.//*/div/@data-card-url').extract()
                elif has_cards[0] == 'amplify':
                    tweet['has_media'] = True
                    tweet['medias'] = item.xpath('.//*/div/@data-card-url').extract()
                elif has_cards[0] == 'summary':
                    tweet['has_media'] = True
                    tweet['medias'] = item.xpath('.//*/div/@data-card-url').extract()
                elif has_cards[0] == '__entity_video':
                    pass  # TODO
                    # tweet['has_media'] = True
                    # tweet['medias'] = item.xpath('.//*/div/@data-src').extract()
                else:  # there are many other types of card2 !!!!
                    logger.debug('Not handle "data-card2-type":\n%s' % item.xpath('.').extract()[0])

            is_reply = item.xpath('.//div[@class="ReplyingToContextBelowAuthor"]').extract()
            tweet['is_reply'] = is_reply != []

            is_retweet = item.xpath('.//span[@class="js-retweet-text"]').extract()
            tweet['is_retweet'] = is_retweet != []

            tweet['user_id'] = item.xpath('.//@data-user-id').extract()[0]
            yield tweet

            if self.crawl_user:
                ### get user info
                user = User()
                user['ID'] = tweet['user_id']
                user['name'] = item.xpath('.//@data-name').extract()[0]
                user['screen_name'] = item.xpath('.//@data-screen-name').extract()[0]
                user['avatar'] = \
                    item.xpath('.//div[@class="content"]/div[@class="stream-item-header"]/a/img/@src').extract()[0]
                yield user
        except:
            logger.error("Error tweet:\n%s" % item.xpath('.').extract()[0])
            # raise

After this cleaning step, the data can be inserted into the table normally.

5. Translating into Chinese

Looking at the crawled content, it contains many languages: English, Japanese, Arabic, French, and so on. To understand what these tweets say, we need to translate them into Chinese. How? There is an open source Python Google Translate package on GitHub, ssut/py-googletrans. It is very capable: it detects the source language automatically and translates into whatever language we specify.

1. Installation

$ pip install googletrans

2. Basic usage

>>> from googletrans import Translator
>>> translator = Translator()
>>> translator.translate('안녕하세요.')
# <Translated src=ko dest=en text=Good evening. pronunciation=Good evening.>
>>> translator.translate('안녕하세요.', dest='ja')
# <Translated src=ko dest=ja text=こんにちは。 pronunciation=Kon'nichiwa.>
>>> translator.translate('veritas lux mea', src='la')
# <Translated src=la dest=en text=The truth is my light pronunciation=The truth is my light>

from googletrans import Translator

destination = 'zh-CN'   # target language
t = '안녕하세요.'
res = Translator().translate(t, dest=destination).text
print(res)              # prints the Chinese translation ("hello")

3. Using it in the project

Call the translator in the tweetcrawler.py file, and add a new text_cn field to the database table (see the sketch below); the modified crawler code follows after that.
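If the tweets table was already created by the earlier run, the new column has to be added to it by hand. A one-off sketch using mysql.connector and the same connection settings as before (my own addition, adjust as needed):

import mysql.connector

cnx = mysql.connector.connect(user="root", password="admin123456",
                              host="18.126.219.16", database="scraper")
cursor = cnx.cursor()
cursor.execute("ALTER TABLE tweets ADD COLUMN text_cn VARCHAR(280)")
cnx.close()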

# google translate[20200416]
# @see https://github.com/ssut/py-googletrans
from googletrans import Translator

def parse_tweet_item(self, items):
    for item in items:
        try:
            tweet = Tweet()

            tweet['usernameTweet'] = item.xpath('.//span[@class="username u-dir u-textTruncate"]/b/text()').extract()[0]

            ID = item.xpath('.//@data-tweet-id').extract()
            if not ID:
                continue
            tweet['ID'] = ID[0]

            ### get text content
            tweet['text'] = ' '.join(
                item.xpath('.//div[@class="js-tweet-text-container"]/p//text()').extract()
            ).replace(' # ', '#').replace(' @ ', '@')

            ### clean data[20200416]
            tweet['text'] = filter_emoji(tweet['text'], '')

            ### translate to Chinese[20200417]
            tweet['text_cn'] = Translator().translate(tweet['text'], dest='zh-cn').text

            if tweet['text'] == '':
                # If there is not text, we ignore the tweet
                continue

            # ... the rest of parse_tweet_item (meta data, media, reply/retweet flags,
            # user info) is identical to the listing in the previous section ...

        except:
            logger.error("Error tweet:\n%s" % item.xpath('.').extract()[0])
            # raise
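One caveat: translating every tweet synchronously inside the spider means a single failed or throttled call to the translation endpoint makes the item fall into the outer except block, so that tweet is logged as an error and dropped. A hypothetical safe_translate helper (my own addition) that falls back to the original text keeps the crawl running; you could call it in place of Translator().translate(...) above:

from googletrans import Translator

def safe_translate(text, dest='zh-cn'):
    try:
        return Translator().translate(text, dest=dest).text
    except Exception:
        # on any translation error, keep the original text so the item is not lost
        return text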

Add the new field to items.py as well:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
from scrapy import Item, Field


class Tweet(Item):
    ID = Field()        # tweet id
    url = Field()       # tweet url
    datetime = Field()  # post time
    text = Field()      # text content
    text_cn = Field()   # text content translated into Chinese
    user_id = Field()   # user id

Modify the database persistence method in pipelines.py to include the new text_cn field:

class SavetoMySQLPipeline(object):
    ''' pipeline that saves data to mysql '''

    def __init__(self):
        # connect to mysql server
        self.cnx = mysql.connector.connect(
            user=SETTINGS["MYSQL_USER"],
            password=SETTINGS["MYSQL_PWD"],
            host=SETTINGS["MYSQL_SERVER"],
            database=SETTINGS["MYSQL_DB"],
            buffered=True)
        self.cursor = self.cnx.cursor()
        print('Mysql connection is successful ##################################', self.cursor)
        self.table_name = SETTINGS["MYSQL_TABLE"]
        create_table_query = "CREATE TABLE `" + self.table_name + "` (\
                `ID` CHAR(20) NOT NULL,\
                `url` VARCHAR(140) NOT NULL,\
                `datetime` VARCHAR(22),\
                `text` VARCHAR(280),\
                `text_cn` VARCHAR(280),\
                `user_id` CHAR(20) NOT NULL,\
                `usernameTweet` VARCHAR(20) NOT NULL\
                )"
        try:
            self.cursor.execute(create_table_query)
        except mysql.connector.Error as err:
            logger.info(err.msg)
        else:
            self.cnx.commit()

    def find_one(self, trait, value):
        select_query = "SELECT " + trait + " FROM " + self.table_name + " WHERE " + trait + " = " + value + ";"
        try:
            val = self.cursor.execute(select_query)
        except mysql.connector.Error as err:
            return False
        if (val == None):
            return False
        else:
            return True

    def check_vals(self, item):
        ID = item['ID']
        url = item['url']
        datetime = item['datetime']
        text = item['text']
        user_id = item['user_id']
        username = item['usernameTweet']
        if (ID is None):
            return False
        elif (user_id is None):
            return False
        elif (url is None):
            return False
        elif (text is None):
            return False
        elif (username is None):
            return False
        elif (datetime is None):
            return False
        else:
            return True

    def insert_one(self, item):
        ret = self.check_vals(item)
        if not ret:
            return None
        ID = item['ID']
        user_id = item['user_id']
        url = item['url']
        text = item['text']
        text_cn = item['text_cn']
        username = item['usernameTweet']
        datetime = item['datetime']
        insert_query =  'INSERT INTO ' + self.table_name + ' (ID, url, datetime, text, text_cn, user_id, usernameTweet )'
        insert_query += ' VALUES ( %s, %s, %s, %s, %s, %s, %s)'
        insert_query += ' ON DUPLICATE KEY UPDATE'
        insert_query += ' url = %s, datetime = %s, text= %s, text_cn= %s, user_id = %s, usernameTweet = %s'
        try:
            # the parameters are repeated because both the insert and the update branch need them
            self.cursor.execute(insert_query, (
                ID, url, datetime, text, text_cn, user_id, username,
                url, datetime, text, text_cn, user_id, username
            ))
        except mysql.connector.Error as err:
            logger.info(err.msg)
        else:
            self.cnx.commit()

    def process_item(self, item, spider):
        if isinstance(item, Tweet):
            self.insert_one(dict(item))  # Item is inserted or updated.
            logger.debug("Add tweet:%s" % item['url'])
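One thing worth checking: the CREATE TABLE statement above does not declare a primary or unique key, and MySQL's ON DUPLICATE KEY UPDATE clause only fires when such a key exists, so repeated crawls may insert duplicate rows. If you want the update branch to work, you could add a key on ID with a one-off script like this (my own sketch, same connection assumptions as before):

import mysql.connector

cnx = mysql.connector.connect(user="root", password="admin123456",
                              host="18.126.219.16", database="scraper")
cursor = cnx.cursor()
cursor.execute("ALTER TABLE tweets ADD PRIMARY KEY (ID)")
cnx.close()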

4. Run it again

Then run the command again:

scrapy crawl TweetScraper -a query="Novel coronavirus,#COVID-19"

Looking at the database, you can see that the foreign-language tweets have been translated into Chinese.

