This article is also published on my WeChat official account. You can follow it by scanning the QR code at the bottom of the article or by searching for "Geek Navigation" on WeChat. Articles are updated every weekday.

1. Introduction

In the first two articles we covered the basics of the Requests network library. Requesting every page of a website is only the first step of a crawler; next we need to extract the useful data from each page. There are many ways to do that, such as regular expressions, XPath, and BeautifulSoup (bs4). Today let's learn the syntax of XPath.

2. XPath

  • What is XPath? XPath (XML Path Language) is a language for finding information in XML documents. It can be used to traverse elements and attributes in an XML document.
  • What is XML? (from W3School)
    • XML stands for EXtensible Markup Language
    • XML is a markup language, much like HTML
    • XML is designed to transmit data, not display it
    • XML tags are not predefined. You need to define your own tags.
    • XML is designed to be self-descriptive.
    • XML is a W3C recommendation
  • Differences between XML and HTML
format    full name                      role
XML       Extensible Markup Language     used to transmit and store data
HTML      Hypertext Markup Language      used to display data

3. Preparation

pip3 install lxml
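If the installation succeeded, the library should import cleanly. A quick sanity check (assuming a standard Python 3 environment):

from lxml import etree
# LXML_VERSION is the installed lxml version as a tuple, e.g. (4, 9, 3, 0)
print(etree.LXML_VERSION)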

4. Usage

XPath uses path expressions to select nodes or sets of nodes in an XML document. These path expressions are very similar to those we see in a regular computer file system.

expression    meaning
/             select from the root node
//            select matching nodes anywhere in the document, regardless of their position
.             select the current node
..            select the parent of the current node
@             select attributes
text()        select the text content of a node
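The relative expressions "." and ".." make the most sense once you already have an element in hand. Here is a minimal, self-contained sketch; the tiny XML document is made up purely for illustration:

from lxml import etree

# A made-up document, just to illustrate "." and ".."
doc = etree.XML('<root><item id="a"><name>foo</name></item></root>')
item = doc.xpath('//item')[0]        # grab the <item> element
print(item.xpath('./name/text()'))   # ['foo'] - path relative to the current node
print(item.xpath('..')[0].tag)       # 'root'  - the parent of the current node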

Example:

from lxml import etree

data = """
       <div>
           <ul>
               <li class="item-0"><a href="link1.html">first item</a></li>
               <li class="item-1"><a href="link2.html">second item</a></li>
               <li class="item-inactive"><a href="link3.html">third item</a></li>
               <li class="item-1" id="1"><a href="link4.html">fourth item</a></li>
               <li class="item-0" data="2"><a href="link5.html">fifth item</a></li>
           </ul>
       </div>
       """

html = etree.HTML(data)  # construct an XPath parsing object; etree.HTML automatically corrects imperfect HTML

li_list = html.xpath('//ul/li')  # select all li nodes under ul
# li_list = html.xpath('//div/ul/li')  # equivalent: go through the div explicitly

a_list = html.xpath('//ul/li/a')  # select all a nodes under ul
href_list = html.xpath('//ul/li/a/@href')  # select the href value of every a node under ul
text_list = html.xpath('//ul/li/a/text()')  # select the text of every a node under ul
print(li_list)
print(a_list)
print(href_list)
print(text_list)

# print
[<Element li at 0x1015f4c48>, <Element li at 0x1015f4c08>, <Element li at 0x1015f4d08>, <Element li at 0x1015f4d48>, <Element li at 0x1015f4d88>]
[<Element a at 0x1015f4dc8>, <Element a at 0x1015f4e08>, <Element a at 0x1015f4e48>, <Element a at 0x1015f4e88>, <Element a at 0x1015f4ec8>]
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
['first item', 'second item', 'third item', 'fourth item', 'fifth item']

Notice that each value we print is a list object; if we want the individual values, we can iterate over the list.
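For example, to pair each link with its text (reusing href_list and text_list from the example above):

# Walk through the result lists; zip pairs each href with the corresponding text
for href, text in zip(href_list, text_list):
    print(href, text)
# link1.html first item
# link2.html second item
# ...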

Selecting unknown nodes: XPath wildcards are used to select unknown XML elements.

wildcard    meaning
*           matches any element node
@*          matches any attribute node
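A short sketch of the two wildcards, reusing the html object parsed in the first example above:

children = html.xpath('//ul/*')   # '*'  matches any element node - here, all li children of ul
attrs = html.xpath('//ul/li/@*')  # '@*' matches any attribute node - the class/id/data values of the li tags
print(children)
print(attrs)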

Example:

from lxml import etree

data = """
       <div>
           <ul>
               <li class="item-0"><a href="link1.html">first item</a></li>
               <li class="item-1"><a href="link2.html">second item</a></li>
               <li class="item-inactive"><a href="link3.html">third item</a></li>
               <li class="item-1" id="1"><a href="link4.html">fourth item</a></li>
               <li class="item-0" data="2"><a href="link5.html">fifth item</a></li>
           </ul>
       </div>
       """

html = etree.HTML(data)

li_list = html.xpath('//li[@class="item-0"]')  # select the li tags whose class is item-0
text_list = html.xpath('//li[@class="item-0"]/a/text()')  # select the text of the a tags under those li
li1_list = html.xpath('//li[@id="1"]')  # select the li tag whose id is 1
li2_list = html.xpath('//li[@data="2"]')  # select the li tag whose data is 2
print(li_list)
print(text_list)
print(li1_list)
print(li2_list)

# print
[<Element li at 0x101dd4cc8>, <Element li at 0x101dd4c88>]
['first item', 'fifth item']
[<Element li at 0x101dd4d88>]
[<Element li at 0x101dd4c88>]


Path expressions with predicates

expression        meaning
[n]               select the nth node (e.g. li[1] is the first li)
last()            select the last node
last()-1          select the next-to-last node
position()<3      select the first two nodes

Example:

from lxml import etree

data = """
       <div>
           <ul>
               <li class="item-0"><a href="link1.html">first item</a></li>
               <li class="item-1"><a href="link2.html">second item</a></li>
               <li class="item-inactive"><a href="link3.html">third item</a></li>
               <li class="item-1" id="1"><a href="link4.html">fourth item</a></li>
               <li class="item-0" data="2"><a href="link5.html">fifth item</a></li>
           </ul>
       </div>
       """

html = etree.HTML(data)

li_list = html.xpath('//ul/li[1]')  # select the first li node under ul
li1_list = html.xpath('//ul/li[last()]')  # select the last li node under ul
li2_list = html.xpath('//ul/li[last()-1]')  # select the next-to-last li node under ul
li3_list = html.xpath('//ul/li[position()<=3]')  # select the first 3 li nodes under ul
text_list = html.xpath('//ul/li[position()<=3]/a/@href')  # select the href values of the a tags in the first 3 li nodes
print(li_list)
print(li1_list)
print(li2_list)
print(li3_list)
print(text_list)

# print
[<Element li at 0x1015d3cc8>]
[<Element li at 0x1015d3c88>]
[<Element li at 0x1015d3d88>]
[<Element li at 0x1015d3cc8>, <Element li at 0x1015d3dc8>, <Element li at 0x1015d3e08>]
['link1.html', 'link2.html', 'link3.html']


5. Functions

expression     meaning
starts-with    select elements whose attribute value starts with the given string
contains       select elements whose attribute value contains the given string
and            combine conditions; both must hold
or             combine conditions; either may hold

Example:

from lxml import etree

data = """
       <div>
           <ul>
               <li class="item-0"><a href="link1.html">first item</a></li>
               <li class="item-1"><a href="link2.html">second item</a></li>
               <li class="item-inactive"><a href="link3.html">third item</a></li>
               <li class="item-1" id="1"><a href="link4.html">fourth item</a></li>
               <li class="item-0" data="2"><a href="link5.html">fifth item</a></li>
           </ul>
       </div>
       """

html = etree.HTML(data)

li_list = html.xpath('//li[starts-with(@class,"item-1")]')  # li tags whose class starts with item-1
li1_list = html.xpath('//li[contains(@class,"item-1")]')  # li tags whose class contains item-1
li2_list = html.xpath('//li[contains(@class,"item-0") and contains(@data,"2")]')  # li tags whose class contains item-0 and whose data contains 2
li3_list = html.xpath('//li[contains(@class,"item-1") or contains(@data,"2")]')  # li tags whose class contains item-1 or whose data contains 2
print(li_list)
print(li1_list)
print(li2_list)
print(li3_list)

# print
[<Element li at 0x101dcac08>, <Element li at 0x101dcabc8>]
[<Element li at 0x101dcac08>, <Element li at 0x101dcabc8>]
[<Element li at 0x101dcacc8>]
[<Element li at 0x101dcac08>, <Element li at 0x101dcabc8>, <Element li at 0x101dcacc8>]

These are some common uses of XPath. For more syntax, see W3School.

6. Browser plugins

You can also install XPath plug-ins in your browser to test expressions against a page:

  • Chrome plugin XPath Helper
  • Firefox plugin XPath Checker

Download these plug-ins from your browser's extension store; once installed, their icons appear in the upper corner of the browser window.

7. Summary

Now that we have both a network library and a parsing library, we can start the real crawler journey. The following articles will try to crawl several sites using Requests and XPath.
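As a quick preview, here is a minimal sketch of how the two fit together (the URL and the expression are placeholders, not a real target):

import requests
from lxml import etree

url = 'https://example.com'               # placeholder URL, swap in a real page
response = requests.get(url, timeout=10)  # download the page
html = etree.HTML(response.text)          # parse the downloaded HTML
links = html.xpath('//a/@href')           # pull out every link target on the page
print(links)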

Welcome to follow my official account so we can learn together.