Extracting the web page source — the Requests toolkit

Before we can extract information from a web page, we must first fetch its source code, and the Requests library, built by Kenneth Reitz, is arguably the most useful and popular Python tool for fetching static pages today. Requests's official introduction states its philosophy (a set of PEP 20 idioms):

1. Beautiful is better than ugly.

2. Explicit is better than implicit.

3. Simple is better than complex.

4. Complex is better than complicated.

5. Readability counts.

The Requests toolkit is simple to use. The main function is requests.get(url), where url is the address of the page we want the source code from; the .text attribute of the returned response gives the page as a string.
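For example, here is a minimal sketch of fetching a page (using example.com as a stand-in URL, not a page from this article):

import requests

# Fetch a page and read its source as a string;
# .text decodes the response body using the detected encoding.
response = requests.get('https://www.example.com')
print(response.status_code)   # 200 on success
print(response.text[:200])    # the first 200 characters of the HTML source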



Information extraction — XPath

Web pages are usually HTML or XML documents, and when we want to extract their content, we need XPath to navigate the document structure and pull out the parts we want. There are four basic concepts we need to know before we get to XPath: nodes, elements, attributes, and text. Let's look at an example:

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>

In this document, <bookstore> is the document node, <author>J K. Rowling</author> is an element node, and lang="en" is an attribute node. J K. Rowling, 2005 and 29.99 are text nodes (text is usually what we want to crawl).
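To make these four concepts concrete, here is a short sketch that parses the bookstore document above with lxml (the same library used in the examples below) and selects an element's text and an attribute value:

from lxml import etree

xml = b'''<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>'''

root = etree.fromstring(xml)              # parse the XML document
print(root.xpath('//author/text()'))      # element text    -> ['J K. Rowling']
print(root.xpath('//title/@lang'))        # attribute value -> ['en']
print(root.xpath('//book/year/text()'))   # text node       -> ['2005']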

Example 1 — crawl Douban movie information

Next we walk through a very simple example: we want to crawl the director and actor information for the movie "Living Together in Time" from douban.com.

First we find the URL of the movie's page. Right-click to view the source code of the page (or open developer mode with Inspect). Next we look through the source for the information we want to collect; searching for a keyword, such as an actor's name, quickly locates the part we need. Then we analyze the structure of the HTML. For example, to find all the lead actors: as a general rule, every lead actor's name corresponds to an <a> node with the attribute rel="v:starring", so it is easy to use XPath syntax to navigate to those nodes and collect all the results.

Similarly, the <a> node containing the director's name has the attribute rel="v:directedBy". We can find the corresponding text through the same kind of location:

The specific code is as follows:

import requests
from lxml import etree

url = 'https://movie.douban.com/subject/27133303/?from=showing'  # the URL of the movie page

get = requests.get(url).text  # requests.get(url) fetches the page; .text converts the source into a string

selector = etree.HTML(get)  # convert the source code into an HTML tree that XPath recognizes

info = {}  # a dictionary used to store the information
info['movie'] = selector.xpath('//title/text()')[0].strip()  # locate the movie name
info['director'] = selector.xpath('//a[@rel="v:directedBy"]/text()')  # locate the director name
info['actors'] = selector.xpath('//a[@rel="v:starring"]/text()')  # locate the actor names
print(info)

Finally we get the result as a dictionary:

{'movie': 'Living Together in Time (Douban)', 'director': ['Su Lun'], 'actors': ['Lei Jia Yin', 'Tong Liya', 'Zhang Yi', 'Yu He Wei', 'Wang Zhengjia', 'Tao Hong', 'Li Nian', 'Li Guangjie', 'Yang 玏', 'Thorgrim', 'Xu Zheng', 'Yang Di', 'Fang Ling', 'Tony']}
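Note that the page <title> carries a ' (Douban)' suffix. If we only want the film name itself, a small cleanup step works; here is a sketch, assuming the suffix format shown above:

raw_title = 'Living Together in Time (Douban)'  # as returned by //title/text()
movie_name = raw_title.rsplit(' (', 1)[0]       # drop the trailing ' (...)' suffix
print(movie_name)                               # Living Together in Time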

Example 2 — crawl Douban movie information in JSON format

First of all, JSON is a lightweight data-interchange format. Its concise and clear hierarchy makes JSON an ideal data exchange language: it is easy for humans to read and write, easy for machines to parse and generate, and it effectively improves network transmission efficiency.
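As a minimal illustration (with made-up data, not the actual Douban response), parsing a JSON string turns it into nested Python dictionaries and lists:

import json

raw = '{"data": [{"title": "Example Film", "directors": ["Someone"], "rate": "8.5"}]}'
parsed = json.loads(raw)           # parse the JSON string into Python objects
print(parsed['data'][0]['title'])  # nested access works like plain dicts/lists -> Example Film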

In our crawling process, we can sometimes find complete JSON-formatted data among the pages returned in developer mode, and we can use the .json() method on the response from the Requests package to parse the retrieved text so that we can easily extract the content. Let's take Douban movies as an example. On the category page that lists movie information, right-click and open the Inspect panel. Be sure to refresh after you open developer mode, otherwise requests received before that point will not be displayed. Then select the Network tab and click the XHR filter; a returned page appears in the list, and double-clicking it opens the raw response. With a JSON-formatting browser plugin installed, the file displays in a clearer, indented format (for those of you who use Chrome, a JSON beautifier plugin is available in the Chrome store). A JSON file can be understood as a large dictionary containing many layers of smaller dictionaries and lists. Once we find such a JSON page, we just need to parse the response as JSON to easily extract the information. The code is as follows:

import requests

url = 'https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start=0'
get = requests.get(url).json()  # .json() parses the response body as JSON

get = get['data']  # the list of movies sits under the 'data' key

info = {}
for i in range(len(get)):
    info[get[i]['title']] = [get[i]['directors'], get[i]['rate']]  # extract the directors and rating for each film
print(info)
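Since the URL carries a start parameter, paging through more results is just a matter of incrementing it. Below is a sketch that assumes this endpoint returns 20 movies per page (an observation about the endpoint, not a documented guarantee):

import requests

base = 'https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start={}'

info = {}
for start in range(0, 60, 20):  # fetch the first three pages
    data = requests.get(base.format(start)).json()['data']
    for movie in data:
        info[movie['title']] = [movie['directors'], movie['rate']]
print(len(info))  # number of distinct titles collected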