Developing a crawler in Python is easy and pleasant because there are many mature libraries for it; a basic crawler can be written in just a dozen lines of code. However, sites with anti-crawling measures, pages that load their content dynamically with JS, and data that has to be collected from an app all take more thought, and distributed or high-performance crawlers need to be designed even more carefully.
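As a minimal sketch of that "dozen lines" claim, the snippet below uses Requests and PyQuery (both listed in the next section) to fetch a page and pull out its title and links; the target URL is only a placeholder.

import requests
from pyquery import PyQuery


def crawl(url):
    # fetch the page and parse the returned HTML
    response = requests.get(url, timeout=10)
    pq = PyQuery(response.text)
    title = pq("title").text()
    # collect every href on the page
    links = [a.attrib.get("href") for a in pq("a")]
    return title, links


if __name__ == "__main__":
    print(crawl("https://example.com"))  # placeholder URL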

Summary of tools commonly used for developing crawlers in Python

  1. Requests: HTTP request library for Python;
  2. PyQuery: HTML DOM parsing library for Python with jQuery-like syntax;
  3. BeautifulSoup: HTML and XML parsing library for Python;
  4. Selenium: browser automation framework (originally for testing), usable for crawlers;
  5. PhantomJS: a headless browser that, driven by Selenium, can fetch content loaded dynamically by JS;
  6. re: Python's built-in regular expression module;
  7. Fiddler: a packet-capture tool that works as a proxy server and can capture traffic from mobile phones;
  8. AnyProxy: a proxy server that lets you write your own rules to intercept requests and responses, usually used for collecting data from clients;
  9. Celery: distributed task framework for Python, used to build distributed crawlers;
  10. Gevent: coroutine-based network library for Python, used to build high-performance crawlers;
  11. GRequests: asynchronous Requests;
  12. aiohttp: asynchronous HTTP client/server framework;
  13. asyncio: Python's built-in asynchronous I/O and event-loop library;
  14. uvloop: a very fast event-loop implementation that works well with asyncio;
  15. concurrent.futures: Python's built-in module for executing tasks concurrently;
  16. Scrapy: Python crawler framework;
  17. Splash: a JavaScript rendering service that acts as a lightweight browser, scriptable with Lua and driven through its HTTP API (see the sketch after this list);
  18. Splinter: open-source Python web automation and testing tool;
  19. PySpider: Python crawler framework.
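For item 17, here is a minimal sketch of calling Splash's render.html endpoint with Requests; it assumes a Splash instance is already running locally on the default port 8050, and the target URL is a placeholder.

import requests


def render_with_splash(url):
    # ask the local Splash service to load the page, run its JS, and return the final HTML
    splash_endpoint = "http://localhost:8050/render.html"
    resp = requests.get(splash_endpoint, params={"url": url, "wait": 2}, timeout=30)
    resp.raise_for_status()
    return resp.text


if __name__ == "__main__":
    html = render_with_splash("https://example.com")  # placeholder URL
    print(len(html))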

How to think about crawling a site

    1. Can the data be taken directly from the HTML? That is, the data is embedded directly in the page's HTML structure;
    2. Is the data rendered into the page dynamically by JS? That is, the data sits inside JS code and is injected into the page by JS or loaded via Ajax;
    3. Does accessing the page require authentication? That is, the page can only be viewed after logging in;
    4. Is the data available directly through an API? Some data can be fetched from an API without parsing any HTML, and most APIs return JSON (see the sketch after this list);
    5. Does the data have to be collected from a client? For example, the WeChat app or the WeChat desktop client.
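A minimal sketch of case 4, hitting a JSON API directly with Requests; the endpoint and field names here are placeholders, not taken from any real site.

import requests


def fetch_from_api(page=1):
    # hypothetical JSON endpoint; real sites usually expose something similar
    url = "https://example.com/api/articles"
    resp = requests.get(url, params={"page": page, "size": 20}, timeout=10)
    resp.raise_for_status()
    data = resp.json()  # most APIs return JSON, so no HTML parsing is needed
    return [item.get("title") for item in data.get("list", [])]


if __name__ == "__main__":
    print(fetch_from_api(page=1))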

How to deal with anti-crawling measures

  1. Don't go too far: throttle the crawler's request rate so you don't overwhelm the target site, otherwise both sides lose;

  2. Use proxies to hide your real IP address and get around IP-based blocking;

  3. To make the crawler look like a normal browser user, add HTTP headers such as the following to each request:

    • Host: www.baidu.com
    • Connection: keep-alive
    • Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
    • User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36
    • Referer: s.weibo.com/user/gameli…
    • Accept-Encoding: gzip, deflate
    • Accept-Language: zh-CN,zh;q=0.8
  4. Check the site's cookies: in some cases cookies must be added to the request to pass certain server-side validation (see the sketch below).
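A small sketch combining points 2 to 4: sending browser-like headers, a cookie, and a proxy with Requests. The proxy address and cookie value are placeholders.

import requests

headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/59.0.3071.104 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
    "Referer": "http://s.weibo.com/",
}
proxies = {"http": "http://127.0.0.1:8888", "https": "http://127.0.0.1:8888"}  # placeholder proxy
cookies = {"SESSION": "placeholder-value"}  # placeholder cookie value

response = requests.get("http://s.weibo.com/", headers=headers,
                        proxies=proxies, cookies=cookies, timeout=10)
print(response.status_code)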

Case studies

1. Static page parsing

import pyquery
import re


def weixin_article_html_parser(html):
    """Parse the HTML of a WeChat (weixin) article page."""
    pq = pyquery.PyQuery(html)
    article = {
        "weixin_id": pq.find("#js_profile_qrcode "
                             ".profile_inner .profile_meta").eq(0).find("span").text().strip(),
        "weixin_name": pq.find("#js_profile_qrcode .profile_inner strong").text().strip(),
        "account_desc": pq.find("#js_profile_qrcode .profile_inner "
                                ".profile_meta").eq(1).find("span").text().strip(),
        "article_title": pq.find("title").text().strip(),
        "article_content": pq("#js_content").remove('script').text().replace(r"\r\n", ""),
        "is_orig": 1 if pq("#copyright_logo").length > 0 else 0,
        "article_source_url": pq("#js_sg_bar .meta_primary").attr('href') if pq(
            "#js_sg_bar .meta_primary").length > 0 else '',
    }

    # Values that only appear inside the page's inline JS variables, matched with regexps
    match = {
        "msg_cdn_url": {"regexp": "(?<=\").*(?=\")", "value": ""},                     # article cover image
        "var ct": {"regexp": "(?<=\")\\d{10}(?=\")", "value": ""},                     # post timestamp
        "publish_time": {"regexp": "(?<=\")\\d{4}-\\d{2}-\\d{2}(?=\")", "value": ""},  # publish date
        "msg_desc": {"regexp": "(?<=\").*(?=\")", "value": ""},                        # article summary
        "msg_link": {"regexp": "(?<=\").*(?=\")", "value": ""},                        # article link
        "msg_source_url": {"regexp": "(?<=').*(?=')", "value": ""},                    # original source link
        "var biz": {"regexp": "(?<=\")\\w{1}.+?(?=\")", "value": ""},
        "var idx": {"regexp": "(?<=\")\\d{1}(?=\")", "value": ""},
        "var mid": {"regexp": "(?<=\")\\d{10,}(?=\")", "value": ""},
        "var sn": {"regexp": "(?<=\")\\w{1}.+?(?=\")", "value": ""},
    }

    count = 0
    for line in html.split("\n"):
        for item, value in match.items():
            if item in line:
                m = re.search(value["regexp"], line)
                if m is not None:
                    count += 1
                    match[item]["value"] = m.group(0)
                break
        if count >= len(match):
            break

    article["article_short_desc"] = match["msg_desc"]["value"]
    article["article_pos"] = int(match["var idx"]["value"])
    article["article_post_time"] = int(match["var ct"]["value"])
    article["article_post_date"] = match["publish_time"]["value"]
    article["article_cover_img"] = match["msg_cdn_url"]["value"]
    article["article_source_url"] = match["msg_source_url"]["value"]
    article["article_url"] = "https://mp.weixin.qq.com/s?__biz={biz}&mid={mid}&idx={idx}&sn={sn}".format(
        biz=match["var biz"]["value"],
        mid=match["var mid"]["value"],
        idx=match["var idx"]["value"],
        sn=match["var sn"]["value"],
    )

    return article


if __name__ == '__main__':
    from pprint import pprint
    import requests

    url = ("https://mp.weixin.qq.com/s?__biz=MzI1NjA0MDg2Mw==&mid=2650682990&idx=1"
           "&sn=39419542de39a821bb5d1570ac50a313&scene=0#wechat_redirect")
    pprint(weixin_article_html_parser(requests.get(url).text))

    # Sample output (text translated from Chinese):
    # {'account_desc': 'Night listen, let more families get happier and happier.',
    #  'article_content': 'An Meng \xa0\xa0 Voice: What did Liu Xiao get? What was lost? ...',
    #  'article_cover_img': 'http://mmbiz.qpic.cn/mmbiz_jpg/4iaBNpgEXstYhQEnbiaD0AwbKhmCVWSeCPBQKgvnSSj9usO4q997wzoicNzl52K1sYSDHBicFGL7WdrmeS0K8niaiaaA/0?wx_fmt=jpeg',
    #  'article_pos': 1,
    #  'article_post_date': '2017-07-02',
    #  'article_post_time': 1499002202,
    #  'article_short_desc': 'Good night from Liu Xiao on Sunday.',
    #  'article_source_url': '',
    #  'article_title': '【Night Listening】Walk here',
    #  'article_url': 'https://mp.weixin.qq.com/s?__biz=MzI1NjA0MDg2Mw==&mid=2650682990&idx=1&sn=39419542de39a821bb5d1570ac50a313',
    #  'is_orig': 0,
    #  'weixin_id': 'yetingfm',
    #  'weixin_name': 'Night Listening'}

 

2. Parse JS-rendered pages with PhantomJS – Weibo search

Some pages use complex JS logic, involving all kinds of Ajax requests, some of which include encryption; analyzing that JS logic well enough to reproduce the data you want is extremely difficult, and without a solid JS foundation and familiarity with the various JS frameworks you shouldn't even attempt it. Rendering the page in a browser-like fashion and reading the HTML straight from the rendered page is much easier.

For example, the search results on s.weibo.com are rendered dynamically with JS, so fetching the HTML directly does not return any search results; we have to run the page's JS first, wait for the page to finish rendering, and then grab its HTML for parsing.

 


#!/usr/bin/env python3
# encoding: utf-8

import time
from urllib import parse

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from pyquery import PyQuery


def weibo_user_search(url: str):
    """Fetch the rendered HTML of a Weibo user-search page through PhantomJS."""
    desired_capabilities = DesiredCapabilities.CHROME.copy()
    desired_capabilities["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                                                                 "AppleWebKit/537.36 (KHTML, like Gecko) "
                                                                 "Chrome/59.0.3071.104 Safari/537.36")
    desired_capabilities["phantomjs.page.settings.loadImages"] = True
    # custom request headers
    desired_capabilities["phantomjs.page.customHeaders.Upgrade-Insecure-Requests"] = 1
    desired_capabilities["phantomjs.page.customHeaders.Cache-Control"] = "max-age=0"
    desired_capabilities["phantomjs.page.customHeaders.Connection"] = "keep-alive"

    driver = webdriver.PhantomJS(executable_path="/usr/bin/phantomjs",  # path to the phantomjs binary
                                 desired_capabilities=desired_capabilities,
                                 service_log_path="ghostdriver.log", )
    # page load timeout
    driver.set_page_load_timeout(60)
    # async script timeout
    driver.set_script_timeout(60)
    driver.maximize_window()

    try:
        driver.get(url=url)
        time.sleep(1)
        try:
            # hover over a result element to trigger lazy-loaded content
            # (the CSS selector was partially lost in the source; "p.company" is a best guess)
            company = driver.find_element_by_css_selector("p.company")
            ActionChains(driver).move_to_element(company)
        except WebDriverException:
            pass
        html = driver.page_source
        pq = PyQuery(html)
        person_lists = pq.find("div.list_person")
        if person_lists.length > 0:
            for index in range(person_lists.length):
                person_ele = person_lists.eq(index)
                print(person_ele.find(".person_name > a.W_texta").attr("title"))
        return html
    except (TimeoutException, Exception) as e:
        print(e)
    finally:
        driver.quit()


if __name__ == '__main__':
    weibo_user_search(url="http://s.weibo.com/user/%s" % parse.quote("news"))

    # Sample output (account names as printed, translated):
    # CCTV News
    # Sina News
    # News
    # Sina News client
    # China News Weekly
    # China News Network
    # Daily Business News
    # Thepaper
    # NetEase News client
    # Phoenix News client
    # Real Madrid News
    # Network News Broadcast
    # CCTV5 Sports News
    # Manchester United News
    # Sohu News client
    # Barca News
    # News Watch
    # Newagaki Knoyi News Agency
    # Kankan News KNEWS
    # CCTV News Review

 

3. Use Python to simulate a login and get cookies

Some websites are painful to deal with: you usually have to log in before you can get any data. Here is a simple example that logs in to a website, grabs the cookies, and then reuses them in other requests.

However, this only covers the case without a CAPTCHA; SMS verification, image CAPTCHAs, and email verification each have to be handled separately.

Target website: www.newrank.cn (as of 2017-07-03); if the site's structure changes, the code below will need to be adjusted.

 

#!/usr/bin/env python3
# encoding: utf-8

from time import sleep
from pprint import pprint

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


def login_newrank():
    """Log in to newrank.cn with PhantomJS and print the session cookies."""
    desired_capabilities = DesiredCapabilities.CHROME.copy()
    desired_capabilities["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                                                                 "AppleWebKit/537.36 (KHTML, like Gecko) "
                                                                 "Chrome/59.0.3071.104 Safari/537.36")
    desired_capabilities["phantomjs.page.settings.loadImages"] = True

    # custom request headers
    desired_capabilities["phantomjs.page.customHeaders.Upgrade-Insecure-Requests"] = 1
    desired_capabilities["phantomjs.page.customHeaders.Cache-Control"] = "max-age=0"
    desired_capabilities["phantomjs.page.customHeaders.Connection"] = "keep-alive"

    user = {"mobile": "user", "password": "password"}

    print("login account: %s" % user["mobile"])

    driver = webdriver.PhantomJS(executable_path="/usr/bin/phantomjs",
                                 desired_capabilities=desired_capabilities,
                                 service_log_path="ghostdriver.log", )

    # implicit wait
    driver.implicitly_wait(1)
    # page load timeout
    driver.set_page_load_timeout(60)
    # async script timeout
    driver.set_script_timeout(60)

    driver.maximize_window()

    try:
        driver.get(url="www.newrank.cn/public/logi…")  # login page URL as truncated in the source
        driver.find_element_by_css_selector(".login-normal-tap:nth-of-type(2)").click()
        sleep(0.2)
        driver.find_element_by_id("account_input").send_keys(user["mobile"])
        sleep(0.5)
        driver.find_element_by_id("password_input").send_keys(user["password"])
        sleep(0.5)
        driver.find_element_by_id("pwd_confirm").click()
        sleep(3)
        cookies = {cookie["name"]: cookie["value"] for cookie in driver.get_cookies()}
        pprint(cookies)
    except TimeoutException as exc:
        print(exc)
    except WebDriverException as exc:
        print(exc)
    finally:
        driver.quit()


if __name__ == '__main__':
    login_newrank()

    # Sample output:
    # login account: 15395100590
    # {'CNZZDATA1253878005': '1487200824-1499071649-%7C1499071649',
    #  'Hm_lpvt_a19fd7224d30e3c8a6558dcb38c4beed': '1499074715',
    #  'Hm_lvt_a19fd7224d30e3c8a6558dcb38c4beed': '1499074685,1499074713',
    #  'UM_distinctid': '15d07d0d4dd82b-054b56417-9383666-c0000-15d07d0d4deace',
    #  'name': '15395100590',
    #  'rmbuser': 'true',
    #  'token': 'A7437A03346B47A9F768730BAC81C514',
    #  'useLoginAccount': 'true'}

 

After the cookie is obtained it can be attached to subsequent requests, but because cookies expire they need to be refreshed regularly. This can be done with a cookie pool: a batch of accounts is logged in automatically at fixed intervals, the resulting cookies are stored in a database (Redis, MySQL, etc.), and each request fetches an available cookie from the database and carries it along; a minimal sketch follows.
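A minimal sketch of such a pool, assuming Redis as the store; the key name and the example account values are placeholders (login_newrank above would be one possible implementation of the login step).

import json
import random

import redis  # third-party client: pip install redis

POOL_KEY = "cookie_pool"  # placeholder Redis key


def store_cookies(account, cookies, r):
    # save one account's cookies; a real pool would also track expiry
    r.hset(POOL_KEY, account, json.dumps(cookies))


def get_random_cookies(r):
    # pick any available cookie set for the next request
    accounts = r.hkeys(POOL_KEY)
    if not accounts:
        return None
    return json.loads(r.hget(POOL_KEY, random.choice(accounts)))


if __name__ == "__main__":
    r = redis.StrictRedis(host="127.0.0.1", port=6379, db=0)
    store_cookies("15395100590", {"token": "A7437A03346B47A9F768730BAC81C514"}, r)
    print(get_random_cookies(r))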

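4. Render pages with PyQt5 (QWebEngineView) – newrank.cn

Besides PhantomJS, PyQt5's QWebEngineView can also render a JS-heavy page like a real browser. The script below opens a newrank.cn list page, parses the rendered HTML with pyquery once loadFinished fires, and appends the extracted rows to results.csv.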
import sys
import csv

import pyquery
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEngineView


class Browser(QWebEngineView):

    def __init__(self):
        super(Browser, self).__init__()
        self.__results = []
        self.loadFinished.connect(self.__result_available)

    @property
    def results(self):
        return self.__results

    def __result_available(self):
        self.page().toHtml(self.__parse_html)

    def __parse_html(self, html):
        pq = pyquery.PyQuery(html)
        for rows in [pq.find("#table_list tr"), pq.find("#more_list tr")]:
            for row in rows.items():
                columns = row.find("td")
                d = {
                    "avatar": columns.eq(1).find("img").attr("src"),
                    "url": columns.eq(1).find("a").attr("href"),
                    "name": columns.eq(1).find("a").attr("title"),
                    "fans_number": columns.eq(2).text(),
                    "view_num": columns.eq(3).text(),
                    "comment_num": columns.eq(4).text(),
                    "post_count": columns.eq(5).text(),
                    "newrank_index": columns.eq(6).text(),
                }
                self.__results.append(d)

        with open("results.csv", "a+", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["name", "fans_number", "view_num", "comment_num", "post_count",
                                                   "newrank_index", "url", "avatar"])
            writer.writerows(self.results)

    def open(self, url: str):
        self.load(QUrl(url))


if __name__ == '__main__':
    app = QApplication(sys.argv)
    browser = Browser()
    browser.open("www.newrank.cn/public/info…")  # list page URL as truncated in the source
    browser.show()
    app.exec_()

5. Use Fiddler to capture and analyze packets

  1. Capture browser traffic
  2. Capture mobile phone traffic with Fiddler

6. Use AnyProxy to capture data from clients

7. Summary of developing high-performance crawlers
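As a minimal illustration of the asynchronous tools listed above (asyncio and aiohttp), the following sketch fetches a batch of URLs concurrently; the URLs are placeholders.

import asyncio

import aiohttp


async def fetch(session, url):
    # fetch a single page and return its status and body length
    async with session.get(url) as resp:
        body = await resp.text()
        return url, resp.status, len(body)


async def crawl(urls):
    # share one session and schedule all requests concurrently
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)


if __name__ == "__main__":
    # import uvloop; asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())  # optional speed-up
    urls = ["https://example.com/page/%d" % i for i in range(1, 6)]  # placeholder URLs
    loop = asyncio.get_event_loop()
    for result in loop.run_until_complete(crawl(urls)):
        print(result)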