Public account: You and the cabin by: Peter Editor: Peter

Powerful XPath

In the past, crawlers almost always parsed data with regular expressions. Regular expressions are powerful, but they are cumbersome and relatively slow to write. This article shows you how to get started quickly with a data parsing tool: XPath.

Introduction to XPath

XPath (XML Path Language) is a language for finding information in XML documents. XPath can be used to traverse elements and attributes in an XML document.

XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.

  • XPath is a query language
  • It finds nodes in the tree structure of XML (Extensible Markup Language) and HTML documents
  • XPath is a language for "finding people" based on their "address"

Quick start website: www.w3schools.com/xml/default…

XPath installation

Installation on macOS is very simple:

pip install lxml

On Linux, taking Ubuntu as an example:

sudo apt-get install python3-lxml

On Windows, search online for a tutorial; there is sure to be one, though the process is somewhat more involved.

How do you check whether the installation succeeded? If running import lxml reports no error, the installation was successful.
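The check described above can be sketched as a few lines of Python. This is a minimal sketch that assumes lxml was installed into the interpreter you are running; the version numbers printed will differ from machine to machine.

```python
# If either import fails, Python raises ImportError and lxml is not installed
import lxml
from lxml import etree

# LXML_VERSION is a tuple of integers; its exact value depends on the installation
version = etree.LXML_VERSION
print("lxml version:", version)
```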

How XPath parsing works

  • Instantiate an etree object and load the page source to be parsed into it
  • Call the object's xpath method with an XPath expression to locate tags and capture content

How do I instantiate an etree object?

  1. Load the source data from a local HTML document into the etree object: etree.parse(filePath)
  2. Load source data fetched from the Internet into the object: etree.HTML(page_text), where page_text is the source code content we retrieved
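The two loading routes can be sketched as follows. The page_text string here is a made-up sample standing in for downloaded source, and file_path in the comment is a hypothetical name; neither comes from the article's data.

```python
from lxml import etree

# page_text stands in for HTML source fetched from the Internet (made-up sample)
page_text = "<html><head><title>demo</title></head><body><p>hello</p></body></html>"

# etree.HTML parses a string and returns the root <html> element
root = etree.HTML(page_text)
print(root.tag)                    # html
paragraphs = root.xpath("//p/text()")
print(paragraphs)                  # ['hello']

# etree.parse reads a file and uses the XML parser by default, so the file must be
# well-formed. For looser HTML, an HTML parser can be passed explicitly
# (file_path is a hypothetical name):
# tree = etree.parse(file_path, etree.HTMLParser())
```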

Using XPath

Three special symbols

  • / : parsing starts from the root node and descends one level at a time
  • // : covers multiple levels, so intermediate levels can be skipped; at the start of an expression it means starting from any position
  • . : a dot represents the current node

Common path expressions

Here are the common XPath path expressions:

Expression   Description
nodename     Selects all child nodes named nodename.
/            Selects from the root node.
//           Selects matching nodes anywhere below the current node, regardless of their location.
.            Selects the current node.
..           Selects the parent of the current node.
@            Selects attributes.

For example:

Expression                  Result
books                       Selects all books elements that are children of the current node
/books                      Selects the root element books
books//title                Selects all title elements that are descendants of books
//price                     Selects all price elements
books/book[3]               Selects the third book child of books; the index starts at 1
/books/book[price>55.0]     Selects all book elements whose price is greater than 55
//@category                 Selects all attributes named category
/books/book/title/text()    Selects the text of every title in the document
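To see several of the expressions in the table at work, here is a minimal sketch run against a small books document. The XML content (titles, prices, categories) is invented for illustration and is not part of the article's data.

```python
from lxml import etree

# Invented sample document with a books root element
xml = """
<books>
  <book category="web"><title>Learning XML</title><price>39.95</price></book>
  <book category="cooking"><title>Everyday Italian</title><price>30.00</price></book>
  <book category="children"><title>Harry Potter</title><price>59.99</price></book>
</books>
""".strip()
root = etree.fromstring(xml)

titles = root.xpath("/books/book/title/text()")
prices = root.xpath("//price/text()")
third = root.xpath("/books/book[3]/title/text()")          # XPath index starts at 1
expensive = root.xpath("/books/book[price>55.0]/title/text()")
categories = root.xpath("//@category")

print(titles)      # ['Learning XML', 'Everyday Italian', 'Harry Potter']
print(prices)      # ['39.95', '30.00', '59.99']
print(third)       # ['Harry Potter']
print(expensive)   # ['Harry Potter']
print(categories)  # ['web', 'cooking', 'children']
```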

XPath operators

Operators are supported directly in XPath expressions:

Operator  Description                   Example
|         Union of two node-sets        //book | //cd
+         Addition                      6 + 4
-         Subtraction                   6 - 4
*         Multiplication                6 * 4
div       Division                      8 div 4
=         Equal                         price=9.80
!=        Not equal                     price!=9.80
<         Less than                     price<9.80
<=        Less than or equal to         price<=9.80
>         Greater than                  price>9.80
>=        Greater than or equal to      price>=9.80
or        Or                            price=9.80 or price=9.70
and       And                           price>9.00 and price<9.90
mod       Modulus (division remainder)  5 mod 2
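A few of these operators can be tried directly through lxml's xpath method. This is a sketch on an invented shop document; note that lxml returns XPath numbers as Python floats.

```python
from lxml import etree

# Invented sample data: two books and one cd, each with a price
root = etree.fromstring(
    "<shop>"
    "<book><price>9.80</price></book>"
    "<book><price>9.50</price></book>"
    "<cd><price>12.00</price></cd>"
    "</shop>"
)

total = root.xpath("6 + 4")        # arithmetic results come back as floats
quotient = root.xpath("8 div 4")
remainder = root.xpath("5 mod 2")
print(total, quotient, remainder)  # 10.0 2.0 1.0

# | merges two node-sets
merged = root.xpath("//book | //cd")
print(len(merged))                 # 3

# Comparison and boolean operators inside a predicate
equal = root.xpath("//book[price=9.80]/price/text()")
between = root.xpath("//book[price>9.00 and price<9.90]/price/text()")
either = root.xpath("//*[price=9.50 or price=12.00]/price/text()")
print(equal)     # ['9.80']
print(between)   # ['9.80', '9.50']
print(either)    # ['9.50', '12.00']
```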

HTML elements

An HTML element is everything from the start tag to the end tag. Basic syntax:

  • An HTML element begins with a start tag and ends with an end tag
  • The content of the element sits between the start tag and the end tag
  • Some HTML elements have empty content
  • Empty elements are closed in the start tag (they end where the start tag ends)
  • Most HTML elements can have attributes; lowercase attribute names are recommended

On empty elements: adding a slash to the start tag, such as <br />, is the proper way to close an empty element, and is accepted by HTML, XHTML, and XML.

Common attributes

Attribute  Value             Description
class      classname         Specifies the class name of the element
id         id                Specifies the unique ID of the element
style      style_definition  Specifies the inline style of the element
title      text              Specifies extra information about the element (can be shown as a tooltip)

HTML headings

There are six levels of headings in HTML.

Headings are defined with the tags <h1> to <h6>.

<h1> defines the largest heading, and <h6> defines the smallest heading.

Case analysis

The original data

Before using XPath to parse the data, we need to import lxml and instantiate an etree object:

# import the library
from lxml import etree

# instantiate the parsing object
tree = etree.parse("test.html")
tree

Here is the raw data to be parsed, test.html:

<html lang="en">
<head>
    <meta charset="UTF-8" />
    <title>Ancient poets and works</title>
</head>
<body>
    <div>
        <p>The poet's name</p>
    </div>
    <div class="name">
        <p>Li bai</p>
        <p>Bai juyi</p>
        <p>Li qingzhao</p>
        <p>Du fu</p>
        <p>Wang anshi</p>
        <a href="http://wwww.tang.com" title="Li Shimin" target="_self">
            <span> this is span </span>The poems written by ancient poets are really wonderful</a>
        <a href="" class="du">In front of the bed there was moonlight, and I thought it was frost on the ground</a>
        <img src="http://www.baidu.com/tang.jpg" alt="" />
    </div>
    <div class="tang">
        <ul>
            <li><a href="http://www.baidu.com" title="Baidu">Bai Di chao ci clouds, thousands of miles jiangling a day also</a></li>
            <li><a href="http://www.sougou.com" title="Sogou">Rain falls heavily during Qingming festival, and passers-by are dying on the road</a></li>
            <li><a href="http://www.360.com" alt="360">Qin Mingyue Han guan, thousands of long March people have not yet</a></li>
            <li><a href="http://www.sina.com" title="Bing">A gentleman gives speech to others, and a concubine gives wealth to others</a></li>
            <li><b>Su shi</b></li>
            <li><i>Su Xun</i></li>
            <li><a href="http://www.google.cn" id="Google">Welcome to Chrome</a></li>
        </ul>
    </div>
</body>
</html>

Getting the content of a single tag

For example, to get the content of the title tag, "Ancient poets and works":

title = tree.xpath("/html/head/title")
title

Every XPath query returns a list; here it is a list holding the single title element.

To retrieve the text content of the tag, use text():

# extract the content from the list
title = tree.xpath("/html/head/title/text()")[0]  # index 0 retrieves the first element
title
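The difference between the two queries is easier to see with their results printed. The sketch below uses only the head of test.html, inlined as a string so it runs without the file.

```python
from lxml import etree

# Only the <head> part of test.html, inlined so the sketch runs on its own
html = "<html><head><title>Ancient poets and works</title></head><body></body></html>"
tree = etree.HTML(html)

# Without text(): a list of element objects
elements = tree.xpath("/html/head/title")
print(type(elements))     # <class 'list'>
print(elements[0].tag)    # title

# With text(): a list of strings, from which index 0 picks the first
texts = tree.xpath("/html/head/title/text()")
print(texts)              # ['Ancient poets and works']
print(texts[0])           # Ancient poets and works
```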

Getting multiple contents within a tag

For example, to retrieve the div tags: the raw data contains three pairs of div tags, so the resulting list contains three elements. The expression can be written in three ways:

1. Use a single slash /: start positioning from the root node html and descend one level at a time

2. Use a double slash // in the middle: skip the intermediate levels, covering multiple levels

3. Use a double slash // at the beginning: start from any position
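The three ways of writing the expression can be sketched like this. The HTML is a trimmed-down stand-in for test.html that keeps the same three div tags, so each query should return three elements.

```python
from lxml import etree

# Trimmed-down stand-in for test.html with the same three div tags
html = """
<html><body>
<div><p>The poet's name</p></div>
<div class="name"><p>Li bai</p></div>
<div class="tang"><ul><li><b>Su shi</b></li></ul></div>
</body></html>
"""
tree = etree.HTML(html)

single = tree.xpath("/html/body/div")   # 1. one level at a time from the root
middle = tree.xpath("/html//div")       # 2. // in the middle skips levels
anywhere = tree.xpath("//div")          # 3. leading // starts from any position

print(len(single), len(middle), len(anywhere))   # 3 3 3
classes = [d.get("class") for d in anywhere]
print(classes)                                   # [None, 'name', 'tang']
```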

Locating by attribute

Use the form [@attribute_name="attribute_value"]:

name = tree.xpath('//div[@class="name"]')   # locate the div whose class attribute is "name"
name

Locating by index

Indexing in XPath starts at 1, unlike Python, where it starts at 0. For example, to locate all the p tags under the div tag whose class attribute is "name": there are 5 pairs of p tags, so the result should contain 5 elements.

# get all the data
index = tree.xpath('//div[@class="name"]/p')
index

To retrieve the third p tag:

# get a single specified element: the index starts at 1
index = tree.xpath('//div[@class="name"]/p[3]')
index

Getting text content

The first method: the text() method

1. Get the text under a specific tag:

# 1. / : single level
class_text = tree.xpath('//div[@class="tang"]/ul/li/b/text()')
class_text

# 2. // : multiple levels
class_text = tree.xpath('//div[@class="tang"]//b/text()')
class_text

2. Multiple contents under a tag

For example, to get the contents of all the p tags:

# get all the data
p_text = tree.xpath('//div[@class="name"]/p/text()')
p_text

For example, to get the content of the third p tag:

# get the content of the third tag
p_text = tree.xpath('//div[@class="name"]/p[3]/text()')
p_text

Alternatively, fetch the contents of all the p tags as a list and then use a Python index to pick out the item. Note that the Python index of the third item is 2.
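Both routes to the third poet can be compared side by side. The sketch inlines only the div with class "name" from test.html.

```python
from lxml import etree

# Only the five p tags from test.html, inlined so the sketch runs on its own
html = (
    '<div class="name"><p>Li bai</p><p>Bai juyi</p><p>Li qingzhao</p>'
    '<p>Du fu</p><p>Wang anshi</p></div>'
)
tree = etree.HTML(html)

# Route 1: take the whole list, then use a Python index (starts at 0)
p_text = tree.xpath('//div[@class="name"]/p/text()')
by_python_index = p_text[2]

# Route 2: index inside the XPath expression (starts at 1)
by_xpath_index = tree.xpath('//div[@class="name"]/p[3]/text()')[0]

print(by_python_index)   # Li qingzhao
print(by_xpath_index)    # Li qingzhao
```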

Getting content that is not directly under a tag:

Taking text() directly from the li tags returns an empty list, because the li tags have no text of their own.

To get all the content inside the li tags, combine the a, b, and i tags below them with the vertical bar |:

# merge the text of the a, b, and i tags under li
abi_text = tree.xpath('//div[@class="tang"]//li/a/text() | //div[@class="tang"]//li/b/text() | //div[@class="tang"]//li/i/text()')
abi_text

Understanding direct and non-direct content
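A short sketch may help here: /text() returns only the text that belongs directly to a tag, while //text() also collects the text of everything nested inside it. The HTML below is a shortened stand-in for two pieces of test.html, with whitespace removed so the output is easy to read.

```python
from lxml import etree

# Shortened stand-in for two pieces of test.html (whitespace removed for clarity)
html = (
    '<div class="name"><a title="Li Shimin"><span>this is span</span>'
    'The poems are wonderful</a></div>'
    '<div class="tang"><ul><li><b>Su shi</b></li></ul></div>'
)
tree = etree.HTML(html)

# Direct: only the text node that is a child of the a tag itself
direct = tree.xpath('//a[@title="Li Shimin"]/text()')
# Non-direct: the text of the a tag and of every tag nested inside it
nested = tree.xpath('//a[@title="Li Shimin"]//text()')
print(direct)   # ['The poems are wonderful']
print(nested)   # ['this is span', 'The poems are wonderful']

# The li tag has no text of its own, so the direct query is empty
li_direct = tree.xpath('//div[@class="tang"]//li/text()')
li_nested = tree.xpath('//div[@class="tang"]//li//text()')
print(li_direct)   # []
print(li_nested)   # ['Su shi']
```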

Getting attribute values

To get the value of an attribute, append @ plus the attribute name to the end of the expression:

1. Get the value of a single attribute

2. Get the values of an attribute across multiple tags
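Both cases can be sketched on a shortened copy of the div with class "tang". The link texts ("one", "two", "three") are placeholders; the href, title, and id values are taken from test.html.

```python
from lxml import etree

# Shortened copy of the "tang" div; link texts are placeholders
html = """
<div class="tang"><ul>
<li><a href="http://www.baidu.com" title="Baidu">one</a></li>
<li><a href="http://www.sougou.com" title="Sogou">two</a></li>
<li><a href="http://www.google.cn" id="Google">three</a></li>
</ul></div>
"""
tree = etree.HTML(html)

# 1. A single attribute value: the href of the a tag in the first li
first_href = tree.xpath('//div[@class="tang"]//li[1]/a/@href')
print(first_href)   # ['http://www.baidu.com']

# 2. Multiple values: the href of every a tag under the li tags
all_href = tree.xpath('//div[@class="tang"]//li/a/@href')
print(all_href)     # ['http://www.baidu.com', 'http://www.sougou.com', 'http://www.google.cn']

# Tags that lack the attribute are simply skipped
titles = tree.xpath('//div[@class="tang"]//li/a/@title')
print(titles)       # ['Baidu', 'Sogou']
```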

Attribute starts-with and contains

XPath can match attributes that start with a given string or that contain given characters. XPath 1.0, the version lxml implements, has no function for matching the end of a string.

  • Starts with: starts-with
  • Contains: contains

The syntax is:

//tag[starts-with(@attribute_name, "shared part of the string")]
//tag[contains(@attribute_name, "shared part of the string")]

1. Starts with a string

Get the text content of the a tags whose href attribute starts with http

2. Contains a string

Get the text content of the a tags whose title attribute contains Baidu
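The two functions can be sketched on a shortened HTML sample. The attribute values follow test.html, but the link texts are shortened placeholders, and the matching is case-sensitive, so "Baidu" is written with a capital B as in the title attribute.

```python
from lxml import etree

# Shortened sample; link texts are placeholders, attribute values follow test.html
html = """
<div class="tang"><ul>
<li><a href="http://www.baidu.com" title="Baidu">Baidu link</a></li>
<li><a href="" class="du">Empty link</a></li>
<li><a href="http://www.google.cn" id="Google">Google link</a></li>
</ul></div>
"""
tree = etree.HTML(html)

# 1. href starts with "http": the a tag with an empty href is excluded
starts = tree.xpath('//a[starts-with(@href, "http")]/text()')
print(starts)     # ['Baidu link', 'Google link']

# 2. title contains "Baidu": a tags without a title attribute never match
contains = tree.xpath('//a[contains(@title, "Baidu")]/text()')
print(contains)   # ['Baidu link']
```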

Conclusion

Here is a summary of using XPath:

  • // : gets content at any depth under a tag, not only its direct content
  • / : gets only the direct content of a tag
  • An index inside an XPath expression starts at 1; once the expression has returned a list, use a Python index, which starts at 0, to pick out items

Hands-on practice

Use XPath to crawl the names and URLs of all of Gu Long's novels on a novel website.

Gu Long, whose real name was Xiong Yaohua, was born in Jiangxi and graduated from Tamkang English College in Taiwan. From a young age he loved reading martial arts novels old and new as well as Western literature. He is generally thought to have been influenced by the novels of Eiji Yoshikawa, Dumas, Hemingway, Jack London, and Steinbeck, and even by Western philosophy such as Nietzsche and Sartre. (He himself said: "I like to steal moves from modern Japanese and Western novels.") As a result, his work kept renewing itself, overtook his predecessors, and opened up a new realm for the martial arts novel.

Web data analysis

Crawl the information on this website: www.kanunu8.com/zj/10867.ht…

When we click on a specific novel, such as “Two Pride”, we can enter the specific information of that novel:

Looking at the page source, we find that the names and URL addresses all sit inside tr tags:

Each tr tag contains three td tags, one per novel; each td holds one novel's address and name.

Get the source code of the web page

Send a web request to get the source code:

import requests
from lxml import etree
import pandas as pd

url = 'https://www.kanunu8.com/zj/10867.html'
headers = {'user-agent': 'Request header'}  # placeholder: replace with a real browser User-Agent string

response = requests.get(url=url, headers=headers)
result = response.content.decode('gbk')   # this page is GBK-encoded, so decode it as GBK
result

Extracting the information

1. Get the link address of each novel

tree = etree.HTML(result)

href_list = tree.xpath('//tbody/tr//a/@href')  # select the href attribute values
href_list[:5]

2. Get the name of each novel

name_list = tree.xpath('//tbody/tr//a/text()')  # select all the text under the a tags
name_list[:5]

3. Generate the DataFrame

# build a table of the names and addresses of Gu Long's novels
gulong = pd.DataFrame({
    "name": name_list,
    "url": href_list
})

gulong

4. Complete the URL addresses

In fact, the full URL address of every novel carries a prefix; for example, the complete address of Handsome Siblings is: www.kanunu8.com/book/4573/

gulong['url'] = 'https://www.kanunu8.com/book' + gulong['url']   # add the common prefix
gulong

# export to an Excel file
gulong.to_excel("gulong.xlsx", index=False)
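The string concatenation in step 4 only works if every scraped href has the shape the prefix expects. As an alternative sketch, not the article's method, the standard library's urljoin resolves each href against the page URL. The href values below are hypothetical examples, since the real scraped values are not shown in the article.

```python
from urllib.parse import urljoin

# The page the links were scraped from
base = "https://www.kanunu8.com/zj/10867.html"

# Hypothetical href values: one root-relative, one already absolute
hrefs = ["/book/4573/", "https://www.kanunu8.com/book/4573/"]

# urljoin resolves relative links against the page URL and leaves absolute ones alone
full_urls = [urljoin(base, href) for href in hrefs]
print(full_urls)
# ['https://www.kanunu8.com/book/4573/', 'https://www.kanunu8.com/book/4573/']
```

With a DataFrame like the one above, the same function could be applied per row, for example gulong['url'].apply(lambda href: urljoin(url, href)).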