This article is also published on my WeChat official account. You can follow it by scanning the QR code at the bottom of the article or by searching for "Geek Navigation" on WeChat. Articles are updated every weekday.

1. Introduction

In the first two articles we covered the basics of the Requests network library. Requesting every page of a website is only the first step of a crawler; next we need to extract the useful data from each page. There are many ways to do that, such as regular expressions, XPath, and BeautifulSoup (bs4). Today let's learn the syntax of XPath.

2. XPath

  • What is XPath? XPath (XML Path Language) is a language for finding information in XML documents. It can be used to traverse elements and attributes in an XML document.
  • What is XML? (from W3School)
    • XML stands for EXtensible Markup Language
    • XML is a markup language, much like HTML
    • XML is designed to transmit data, not display it
    • XML tags are not predefined. You need to define your own tags.
    • XML is designed to be self-descriptive.
    • XML is a W3C recommendation
  • Differences between XML and HTML
format    full name                      role
XML       Extensible Markup Language     used to transmit and store data
HTML      Hypertext Markup Language      used to display data

3. Preparation

pip3 install lxml
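If the installation succeeded, the library should import cleanly. A quick sanity check (assuming a standard Python 3 environment):

from lxml import etree
# LXML_VERSION is the installed lxml version as a tuple, e.g. (4, 9, 3, 0)
print(etree.LXML_VERSION)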

4. Usage

XPath uses path expressions to select nodes or sets of nodes in an XML document. These path expressions are very similar to those we see in a regular computer file system.

expression    meaning
/             select from the root node
//            select matching nodes anywhere in the document, regardless of their position
.             select the current node
..            select the parent of the current node
@             select attributes
text()        select the text content of a node
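The relative expressions "." and ".." make the most sense once you already have an element in hand. Here is a minimal, self-contained sketch; the tiny XML document is made up purely for illustration:

from lxml import etree

# A made-up document, just to illustrate "." and ".."
doc = etree.XML('<root><item id="a"><name>foo</name></item></root>')
item = doc.xpath('//item')[0]        # grab the <item> element
print(item.xpath('./name/text()'))   # ['foo'] - path relative to the current node
print(item.xpath('..')[0].tag)       # 'root'  - the parent of the current node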

Example:

from lxml import etree

data = """
       <div>
           <ul>
               <li class="item-0"><a href="link1.html">first item</a></li>
               <li class="item-1"><a href="link2.html">second item</a></li>
               <li class="item-inactive"><a href="link3.html">third item</a></li>
               <li class="item-1" id="1"><a href="link4.html">fourth item</a></li>
               <li class="item-0" data="2"><a href="link5.html">fifth item</a></li>
           </ul>
       </div>
       """

html = etree.HTML(data)  # construct an XPath parsing object; etree.HTML automatically corrects imperfect HTML

li_list = html.xpath('//ul/li')  # select all li nodes under ul
# li_list = html.xpath('//div/ul/li')  # equivalent: go through the div explicitly

a_list = html.xpath('//ul/li/a')  # select all a nodes under ul
href_list = html.xpath('//ul/li/a/@href')  # select the href value of every a node under ul
text_list = html.xpath('//ul/li/a/text()')  # select the text of every a node under ul
print(li_list)
print(a_list)
print(href_list)
print(text_list)

# print
[<Element li at 0x1015f4c48>, <Element li at 0x1015f4c08>, <Element li at 0x1015f4d08>, <Element li at 0x1015f4d48>, <Element li at 0x1015f4d88>]
[<Element a at 0x1015f4dc8>, <Element a at 0x1015f4e08>, <Element a at 0x1015f4e48>, <Element a at 0x1015f4e88>, <Element a at 0x1015f4ec8>]
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
['first item', 'second item', 'third item', 'fourth item', 'fifth item']

Notice that each value we print is a list object; if we want the individual values, we can iterate over the list.
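For example, to pair each link with its text (reusing href_list and text_list from the example above):

# Walk through the result lists; zip pairs each href with the corresponding text
for href, text in zip(href_list, text_list):
    print(href, text)
# link1.html first item
# link2.html second item
# ...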

Selecting unknown nodes: XPath wildcards are used to select unknown XML elements.

wildcard    meaning
*           matches any element node
@*          matches any attribute node
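A short sketch of the two wildcards, reusing the html object parsed in the first example above:

children = html.xpath('//ul/*')   # '*'  matches any element node - here, all li children of ul
attrs = html.xpath('//ul/li/@*')  # '@*' matches any attribute node - the class/id/data values of the li tags
print(children)
print(attrs)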

Example:

from lxml import etree

data = """
       <div>
           <ul>
               <li class="item-0"><a href="link1.html">first item</a></li>
               <li class="item-1"><a href="link2.html">second item</a></li>
               <li class="item-inactive"><a href="link3.html">third item</a></li>
               <li class="item-1" id="1"><a href="link4.html">fourth item</a></li>
               <li class="item-0" data="2"><a href="link5.html">fifth item</a></li>
           </ul>
       </div>
       """

html = etree.HTML(data)

li_list = html.xpath('//li[@class="item-0"]')  # select the li tags whose class is item-0
text_list = html.xpath('//li[@class="item-0"]/a/text()')  # select the text of the a tags under those li
li1_list = html.xpath('//li[@id="1"]')  # select the li tag whose id is 1
li2_list = html.xpath('//li[@data="2"]')  # select the li tag whose data is 2
print(li_list)
print(text_list)
print(li1_list)
print(li2_list)

# print
[<Element li at 0x101dd4cc8>, <Element li at 0x101dd4c88>]
['first item', 'fifth item']
[<Element li at 0x101dd4d88>]
[<Element li at 0x101dd4c88>]


Path expressions with predicates

expression        meaning
[n]               select the nth node (e.g. li[1] is the first li)
last()            select the last node
last()-1          select the next-to-last node
position()<3      select the first two nodes

Example:

from lxml import etree

data = """
       <div>
           <ul>
               <li class="item-0"><a href="link1.html">first item</a></li>
               <li class="item-1"><a href="link2.html">second item</a></li>
               <li class="item-inactive"><a href="link3.html">third item</a></li>
               <li class="item-1" id="1"><a href="link4.html">fourth item</a></li>
               <li class="item-0" data="2"><a href="link5.html">fifth item</a></li>
           </ul>
       </div>
       """

html = etree.HTML(data)

li_list = html.xpath('//ul/li[1]')  # select the first li node under ul
li1_list = html.xpath('//ul/li[last()]')  # select the last li node under ul
li2_list = html.xpath('//ul/li[last()-1]')  # select the next-to-last li node under ul
li3_list = html.xpath('//ul/li[position()<=3]')  # select the first 3 li nodes under ul
text_list = html.xpath('//ul/li[position()<=3]/a/@href')  # select the href values of the a tags in the first 3 li nodes
print(li_list)
print(li1_list)
print(li2_list)
print(li3_list)
print(text_list)

# print
[<Element li at 0x1015d3cc8>]
[<Element li at 0x1015d3c88>]
[<Element li at 0x1015d3d88>]
[<Element li at 0x1015d3cc8>, <Element li at 0x1015d3dc8>, <Element li at 0x1015d3e08>]
['link1.html', 'link2.html', 'link3.html']


5. Functions

expression     meaning
starts-with    select elements whose attribute value starts with the given string
contains       select elements whose attribute value contains the given string
and            combine conditions; both must hold
or             combine conditions; either may hold

Example:

from lxml import etree

data = """
       <div>
           <ul>
               <li class="item-0"><a href="link1.html">first item</a></li>
               <li class="item-1"><a href="link2.html">second item</a></li>
               <li class="item-inactive"><a href="link3.html">third item</a></li>
               <li class="item-1" id="1"><a href="link4.html">fourth item</a></li>
               <li class="item-0" data="2"><a href="link5.html">fifth item</a></li>
           </ul>
       </div>
       """

html = etree.HTML(data)

li_list = html.xpath('//li[starts-with(@class,"item-1")]')  # li tags whose class starts with item-1
li1_list = html.xpath('//li[contains(@class,"item-1")]')  # li tags whose class contains item-1
li2_list = html.xpath('//li[contains(@class,"item-0") and contains(@data,"2")]')  # li tags whose class contains item-0 and whose data contains 2
li3_list = html.xpath('//li[contains(@class,"item-1") or contains(@data,"2")]')  # li tags whose class contains item-1 or whose data contains 2
print(li_list)
print(li1_list)
print(li2_list)
print(li3_list)

# print
[<Element li at 0x101dcac08>, <Element li at 0x101dcabc8>]
[<Element li at 0x101dcac08>, <Element li at 0x101dcabc8>]
[<Element li at 0x101dcacc8>]
[<Element li at 0x101dcac08>, <Element li at 0x101dcabc8>, <Element li at 0x101dcacc8>]

These are some common uses of XPath. For more syntax, see W3School.

6. Browser plugins

You can also install XPath plug-ins in your browser to test expressions against a page:

  • Chrome plugin XPath Helper
  • Firefox plugin XPath Checker

Download these plug-ins from your browser's extension store; once installed, their icons appear in the upper corner of the browser window.

7. Summary

Now that we have both a network library and a parsing library, we can start the real crawler journey. The following articles will try to crawl several sites using Requests and XPath.
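As a quick preview, here is a minimal sketch of how the two fit together (the URL and the expression are placeholders, not a real target):

import requests
from lxml import etree

url = 'https://example.com'               # placeholder URL, swap in a real page
response = requests.get(url, timeout=10)  # download the page
html = etree.HTML(response.text)          # parse the downloaded HTML
links = html.xpath('//a/@href')           # pull out every link target on the page
print(links)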

Welcome to follow my official account so we can learn together.