Preface

Sabre, spear, sword, and halberd; axe, battle-axe, hook, and fork; staff, lance, cudgel, and rod; whip, mace, hammer, and claw.

With a divine weapon in hand, no demon or monster is to be feared; cut through the thorns, and let the blood spatter like blossoms.

No one who walks the jianghu, the martial world, goes without a weapon.

Yet weapons come in every shape: long and short, edged and blunt, ropes and chains. For a beginner, it is truly a case of "a riot of blossoms dazzling the eye".

In the old days there was Bai Xiaosheng, who knew everything under heaven; today there is Bai Meisheng, who loves Python. Bai Xiaosheng's Weapon Spectrum once set the jianghu awash in blood. Now Bai Meisheng has compiled a Python Weapon Spectrum; can it stir up a storm of its own in the Python jianghu?

Today we open the "data analysis" chapter of this Divine Weapon Spectrum. It comes in three parts, covering in turn the data acquisition, data processing, and data visualization stages of data analysis.

This article is not just a parade of weapons; it also teaches you their basic use and helps you choose the right one, so that you can stand undefeated in the jianghu.

Let’s cut to the chase.

Part ONE: Data acquisition

When it comes to data acquisition, the best-known approach is the "crawler". Let's take a look at the crawler weapons Bai Meisheng has brought us. Are they really, as the rumors claim, able to "seal the throat at first blood"?

Requests

What? Since when does Requests count as a crawler?

Don't underestimate it! Requests may be just an HTTP request library, but used properly it is every bit as deadly.

A strike with Requests is as swift as the wind and as light as a leaf.

>>> import requests
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
'{"type":"User"... '
>>> r.json()
{'private_gists': 419, 'total_private_repos': 77, ...}

That’s it?

If it is an API service that returns JSON, then yes, that's it. We have the data.

If it is an API service that returns XML, we can pair Requests with the built-in xml module or the lxml parser to finish the job.

""" Content is a string in XML format, i.e., r.ext, such as 
        
        
        
         """
import xml.etree.ElementTree as ET

tree = ET.parse(content)
root = tree.getroot()
Walk through the node
for child in root:
    print(child.tag, child.attrib)
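
Assuming the root really does hold <country> elements as noted above (an assumption for illustration only), ElementTree can also query specific nodes directly; a minimal sketch:

# Sketch: targeted queries, assuming <country name="..."> children
for country in root.findall('country'):
    print(country.get('name'))          # value of the name attribute
first = root.find('country')            # first matching child, or None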

lxml is faster and more ferocious.

from lxml import etree

root = etree.XML(content)
for element in root.iter():
    print("%s - %s" % (element.tag, element.text))

lxml also supports the powerful XPath and XSLT syntaxes (see the references for the syntax documentation).

# Use xpath syntax to quickly locate nodes and extract data
r = root.xpath('country')
text = root.xpath('country/text()')

XSLT makes quick work of transforming documents.

from io import StringIO
from lxml import etree

xslt_root = etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
        <foo><xsl:value-of select="/a/b/text()" /></foo>
    </xsl:template>
</xsl:stylesheet>''')
transform = etree.XSLT(xslt_root)

f = StringIO('<a><b>Text</b></a>')
doc = etree.parse(f)
result_tree = transform(doc)

Worse still, what if it is an HTML document? Then BeautifulSoup or the lxml parser is called for.

BeautifulSoup is slow, but fortunately easy to understand.

from bs4 import BeautifulSoup

# content is the HTML string, the text returned by Requests
soup = BeautifulSoup(content, 'html.parser')

print(soup.title)
print(soup.title.name)
print(soup.find_all('a'))
print(soup.find(id="link3"))
for link in soup.find_all('a'):
    print(link.get('href'))
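
BeautifulSoup also accepts CSS selectors through select() and select_one(); a small sketch (the selector strings below are generic assumptions, not tied to any particular page):

# Sketch: CSS selectors with BeautifulSoup
for a in soup.select('p.story > a'):
    print(a.get_text(), a.get('href'))
first_link = soup.select_one('#link1')   # first match, or None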

That’s easy.

lxml is just as neat here.

html = etree.HTML(content)
result = etree.tostring(html, pretty_print=True, method="html")
print(result)
# It's time for xpath
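
Picking up where the comment above leaves off, XPath queries run directly against the parsed tree; a minimal sketch (the element names are generic assumptions):

# Sketch: XPath against the parsed HTML tree
links = html.xpath('//a/@href')          # all href attributes
titles = html.xpath('//title/text()')    # text of the <title> element
print(links, titles)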

As you can see, a wooden sword, plain as it is, has endless variations in the hands of a master; and if it happens to be elder wood, it is all the more formidable. For the fastest, most convenient data collection, reach for Requests!

Scrapy

Now let's turn to data acquisition with Scrapy and see what it brings us.

# Create a project
scrapy startproject tutorial
cd tutorial
# Create a crawler
scrapy genspider quotes quotes.toscrape.com

Then edit the generated spider file, spiders/quotes.py, inside the project.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        """Generate the initial requests."""
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        """Process the response returned for each request."""
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

And then you start the crawler.

scrapy crawl quotes

And that is far from all Scrapy can do!

Parse web pages

# CSS selectors
response.css('title::text').getall()
# XPath selectors
response.xpath('//title/text()').getall()

Automatically generate result files

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # The parse function can yield dictionaries or Item objects directly.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

Add the -o option to the crawl command to save the results straight to a file. A variety of formats is supported out of the box (CSV, JSON, JSON lines, XML), and it is easy to plug in a format of your own, as sketched after the command below.

scrapy crawl quotes -o quotes.json
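
As for "plugging in your own format", one way is a custom item exporter registered under FEED_EXPORTERS. The module path and format name below are illustrative assumptions, so treat this as a sketch rather than a recipe:

# myproject/exporters.py -- hypothetical pipe-separated exporter (sketch)
from scrapy.exporters import BaseItemExporter

class PipeSeparatedExporter(BaseItemExporter):
    """Write each item as one pipe-separated line."""

    def __init__(self, file, **kwargs):
        super().__init__(**kwargs)
        self.file = file

    def export_item(self, item):
        values = [str(v) for v in dict(item).values()]
        self.file.write(('|'.join(values) + '\n').encode('utf-8'))

# settings.py -- register the format name used on the command line
FEED_EXPORTERS = {
    'psv': 'myproject.exporters.PipeSeparatedExporter',
}

With that registered, a command along the lines of scrapy crawl quotes -o quotes.psv (or with the format given explicitly) should route items through the custom exporter.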

What if the data is paginated and there is a next page? Yield another request and let Scrapy take care of it.



class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        """The parse function yield dictionary or Item object is regarded as the result, and the yield request object (the follow method follows the link and quickly generates the corresponding request object) continues to crawl. """
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Is that it? Of course not. Scrapy also provides a variety of capabilities for data collection.

  • Powerful extension points: custom extensions and middleware are quick to write.
  • Flexible configuration: concurrency control, rate limiting, and more.
  • Custom item processing pipelines (see the sketch after this list).
  • Custom item storage.
  • Automatic stats collection.
  • Integrated email notifications.
  • A Telnet console, and more.
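
To give one concrete flavor: the item pipeline mentioned above is just a class with a process_item method, switched on in settings.py. The project path below is an illustrative assumption:

# myproject/pipelines.py -- minimal item pipeline sketch
from scrapy.exceptions import DropItem

class DropUntaggedPipeline:
    """Drop quotes that carry no tags; pass everything else through."""

    def process_item(self, item, spider):
        if not item.get('tags'):
            raise DropItem('quote has no tags')
        return item

# settings.py -- enable the pipeline (lower number = runs earlier)
ITEM_PIPELINES = {
    'myproject.pipelines.DropUntaggedPipeline': 300,
}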

And that is only the core; we have not even touched its community ecosystem!

  • Scrapyd: deploy and run crawlers as a service.
  • scrapy-splash: adds JS rendering to Scrapy.
  • scrapy-jsonrpc: control crawlers over a JSON-RPC service.
  • Gerapy: a web-based crawler management platform.
  • ScrapyWeb: another web-based crawler management platform.
  • ScrapyKeeper: yet another web-based crawler management platform.
  • Portia: a visual crawling platform that requires no coding.

We won't unpack these here.

For a fast and powerful data acquisition weapon, look no further than Scrapy!

Pyspider

Next, the mighty Swiss Army knife: Pyspider.

Pyspider is the Swiss Army knife of the crawler world, offering a complete data collection solution.

  • A built-in web management UI with task monitoring, project management, result viewing, and more.
  • Native support for many database backends, such as MySQL, MongoDB, SQLite, Elasticsearch, and PostgreSQL.
  • Native support for many message queues, such as RabbitMQ, Beanstalk, Redis, and Kombu.
  • Task priorities, automatic retries, scheduled tasks, JS rendering, and more (a sketch follows the basic example below).
  • A distributed architecture.

Crawlers, it’s that simple!

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
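
Building on the handler above, the task priority, retry, and scheduling features from the feature list appear as decorators and as parameters to self.crawl. A rough sketch with arbitrary values (parameter names follow the pyspider docs):

from pyspider.libs.base_handler import *

class PriorityHandler(BaseHandler):

    @every(minutes=60)          # re-run the entry point every hour
    def on_start(self):
        # Higher priority is scheduled first; retry up to 3 times on failure.
        self.crawl('http://scrapy.org/', callback=self.index_page,
                   priority=9, retries=3)

    @config(age=24 * 60 * 60)   # treat results younger than a day as fresh
    def index_page(self, response):
        return {"url": response.url, "title": response.doc('title').text()}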

Start the crawler framework.

pyspider

Then, we can manage and run the crawler through http://localhost:5000/.

We can use CSS selectors to quickly extract information from web pages.

    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            if re.match("http://www.imdb.com/title/tt\d+/$", each.attr.href):
                self.crawl(each.attr.href, callback=self.detail_page)
        self.crawl(response.doc('#right a').attr.href, callback=self.index_page)
        
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('.header > [itemprop="name"]').text(),
            "rating": response.doc('.star-box-giga-star').text(),
            "director": [x.text() for x in response.doc('[itemprop="director"] span').items()],
        }

Enable PhantomJS to render JS on a web page.

pyspider phantomjs

Then pass fetch_type='js' in the crawl call.

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://www.twitch.tv/directory/game/Dota%202',
                   fetch_type='js', callback=self.index_page)

    def index_page(self, response):
        return {
            "url": response.url,
            "channels": [{
                "title": x('.title').text(),
                "viewers": x('.info').contents()[2]."name": x('.info a').text(),
            } for x in response.doc('.stream.item').items()]
        }

You can also execute a snippet of JS to fetch dynamically generated page content.

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://www.pinterest.com/categories/popular/',
                   fetch_type='js', js_script="""
                   function() {
                       window.scrollTo(0, document.body.scrollHeight);
                   }
                   """, callback=self.index_page)

    def index_page(self, response):
        return {
            "url": response.url,
            "images": [{
                "title": x('.richPinGridTitle').text(),
                "img": x('.pinImg').attr('src'),
                "author": x('.creditName').text(),
            } for x in response.doc('.item').items() if x('.pinImg')]
        }

All right. The next question, naturally: Pyspider or Scrapy?

Just a quick comparison.

Scrapy is more extensible, has a more active community, and a richer ecosystem. Pyspider is more complete out of the box but less extensible: many features Scrapy needs extensions for, such as a web UI and JS rendering, come built into Pyspider.

Pyspider as a whole is easier to get started with and faster to deliver results; Scrapy offers more options and flexibility for complex scenarios.

So, which one do you choose?

Why choose? Grown-ups take both.

Afterword

This first part has introduced three divine weapons of the data acquisition field:

  • Requests: the plain yet magical elder-wood sword
  • Scrapy: fast and powerful
  • Pyspider: the simple and versatile Swiss Army knife

With these three divine weapons in hand, there is no way you cannot ride high in the crawler jianghu!

This has been the opening chapter of the data analysis volume of the Python Divine Weapon Spectrum. If you found it useful, please like, follow, and bookmark!

References

  • Requests
  • Python xml
  • Python lxml
  • XPath
  • XSLT
  • BeautifulSoup
  • Scrapy
  • Pyspider