Install the lxml package (for example with pip install lxml) before using it.

Getting started

Like BeautifulSoup, we first need to get a document tree.

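  • Convert text to a document tree object

    from lxml import etree

    if __name__ == '__main__':
        doc = '''
        <div>
            <ul>
                <li class="item-0"><a href="link1.html">first item</a></li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-inactive"><a href="link3.html">third item</a></li>
                <li class="item-1"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a> # note: missing a </li> closing tag
            </ul>
        </div>
        '''

        html = etree.HTML(doc)           # parse the text with the HTML parser
        result = etree.tostring(html)    # serialize the tree back to bytes
        print(str(result, 'utf-8'))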
  • Convert a file to a document tree object

    from lxml import etree

    # Read the external file index.html.
    # etree.parse() uses the XML parser by default, so the file must be well-formed.
    html = etree.parse('./index.html')
    result = etree.tostring(html, pretty_print=True)    # pretty_print=True formats the output
    print(result)

This prints the contents of the document.
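etree.parse() reads with the XML parser unless told otherwise, so a real-world HTML file with unclosed tags may fail to load. Here is a minimal sketch of passing etree.HTMLParser() explicitly. The page content is made up as a stand-in for index.html and is read from a StringIO rather than from disk:

```python
from io import StringIO

from lxml import etree

# Hypothetical page content standing in for index.html.
# The second <li> is deliberately left unclosed.
page = """
<div>
  <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a>
  </ul>
</div>
"""

# Passing an HTMLParser makes etree.parse() tolerant of broken markup.
tree = etree.parse(StringIO(page), etree.HTMLParser())
root = tree.getroot()

root_tag = root.tag                   # the HTML parser wraps the fragment in <html>
li_count = len(root.xpath("//li"))    # both <li> elements are recovered

print(root_tag, li_count)
print(etree.tostring(tree, pretty_print=True).decode("utf-8"))
```

With a file on disk, the same call would take the file path in place of the StringIO object.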

Nodes, elements, attributes, content

The idea of XPath is to find nodes by writing path expressions. Nodes include elements, attributes, and content.

  • Elements, for example:

    html -> <html> … </html>
    div -> <div> … </div>
    a -> <a> … </a>

Here we can see that an element means the same thing as a tag in HTML. A single element name cannot express a path, so element names are not used on their own.

Path expressions

    /     root node; also the separator between nodes
    //    any position in the document
    .     current node
    ..    parent node
    @     attribute
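Each of the symbols above can be tried in isolation. This is a small sketch using a made-up two-item document; the variable names are only illustrative:

```python
from lxml import etree

doc = """
<div>
  <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
  </ul>
</div>
"""
root = etree.XML(doc)    # root is the <div> element

from_root = root.xpath("/div/ul/li")    # /  : absolute path starting at the root node
anywhere = root.xpath("//a")            # // : every <a>, at any depth
first_a = anywhere[0]
current = first_a.xpath(".")            # .  : the current node itself
parent = first_a.xpath("..")            # .. : the parent of the current node
hrefs = root.xpath("//a/@href")         # @  : attribute values

print(len(from_root), len(anywhere), current[0].tag, parent[0].tag, hrefs)
```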

Wildcards

    *         any element
    @*        any attribute
    node()    any child node, whatever its type (element, text content, comment)
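To see what each wildcard actually returns, here is a short sketch on a made-up fragment that mixes text, elements, a comment and two attributes:

```python
from lxml import etree

root = etree.XML(
    '<ul class="items" id="menu">intro<li>one</li><!-- note --><li>two</li></ul>'
)

children = root.xpath("/ul/*")       # element children only
attrs = root.xpath("/ul/@*")         # every attribute value on <ul>
nodes = root.xpath("/ul/node()")     # text, elements and the comment

print([c.tag for c in children])
print(attrs)
print(len(nodes), repr(nodes[0]))
```

Note that node() picks up the text and the comment that * skips, while the attributes only appear through @*.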

Predicates

Use square brackets to qualify elements; these are called predicates.

    //a[n]                        n is an integer greater than zero: the <a> at the nth position among the <a> children
    //a[last()]                   last() is the last position: the last <a> child
    //a[last()-1]                 the second-to-last <a> child
    //a[position()<3]             the first two <a> children
    //a[@href='www.baidu.com']    <a> elements whose href is 'www.baidu.com'
    //book[@price>2]              <book> elements whose price value is greater than 2
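The predicates above can be checked against a small document. The sketch below uses an invented shop document with four links under one parent and two books:

```python
from lxml import etree

doc = """
<shop>
  <nav><a href="a.html">A</a><a href="www.baidu.com">B</a><a href="c.html">C</a><a href="d.html">D</a></nav>
  <book price="1.5"><title>Cheap</title></book>
  <book price="3"><title>Dear</title></book>
</shop>
"""
root = etree.XML(doc)

first = root.xpath("//a[1]/text()")                          # positions start at 1
last = root.xpath("//a[last()]/text()")
second_last = root.xpath("//a[last()-1]/text()")
first_two = root.xpath("//a[position()<3]/text()")
by_href = root.xpath("//a[@href='www.baidu.com']/text()")
pricey = root.xpath("//book[@price>2]/title/text()")         # numeric comparison on an attribute

print(first, last, second_last, first_two, by_href, pricey)
```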

Multiple paths

Use | to join two expressions; a node that matches either one is selected.

    //book/title | //book/price
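A quick sketch of the | operator on an invented store document; only the titles and prices should come back, not the authors:

```python
from lxml import etree

doc = """
<store>
  <book><title>XPath</title><price>30</price><author>A</author></book>
  <book><title>lxml</title><price>25</price><author>B</author></book>
</store>
"""
root = etree.XML(doc)

# Nodes matching either side of | end up in one result list.
nodes = root.xpath("//book/title | //book/price")
tags = [n.tag for n in nodes]

print(tags)
```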

Functions

XPath has many built-in functions. For more of them, see www.w3school.com.cn/xpath/xpath…

  • contains(string1,string2)
  • starts-with(string1,string2)
  • ends-with(string1,string2) # not supported
  • upper-case(string) # not supported
  • text()
  • last()
  • position()
  • node()

Note that last() is also a function; we already met it in the predicates section.

Examples

Locating elements

An expression that matches multiple elements returns a list.

    from lxml import etree

    if __name__ == '__main__':
        doc = '''
        <div>
            <ul>
                <li class="item-0"><a href="link1.html">first item</a></li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-inactive"><a href="link3.html">third item</a></li>
                <li class="item-1"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a> # note: missing a </li> closing tag
            </ul>
        </div>
        '''

        html = etree.HTML(doc)
        print(html.xpath("//li"))
        print(html.xpath("//p"))

【 Result 】

    [<Element li at 0x2b41b749848>, <Element li at 0x2b41b749808>, <Element li at 0x2b41b749908>, <Element li at 0x2b41b749948>, <Element li at 0x2b41b749988>]
    []

Matching by attribute value, reading the text, and stepping to the parent with ..:

    print(etree.tostring(html.xpath("//li[@class='item-inactive']")[0]))
    print(html.xpath("//li[@class='item-inactive']")[0].text)
    print(html.xpath("//li[@class='item-inactive']/a")[0].text)
    print(html.xpath("//li[@class='item-inactive']/a/text()"))
    print(html.xpath("//li[@class='item-inactive']/.."))
    print(html.xpath("//li[@class='item-inactive']/../li[@class='item-0']"))

【 Result 】

    b'<li class="item-inactive"><a href="link3.html">third item</a></li>\n '
    None                  # the <li> has no text of its own; the text belongs to the <a>
    third item
    ['third item']
    [<Element ul at 0x19cd8c4c848>]
    [<Element li at 0x15ea3c5b848>, <Element li at 0x15ea3c5b6c8>]

Using functions

contains

Sometimes matching with @class='...' is not appropriate, because it is an exact match: when the page's styling changes, classes such as active may be added to or removed from the attribute. In that case contains is very convenient.

    from lxml import etree

    if __name__ == '__main__':
        doc = '''
        <div>
            <ul>
                <p class="item-0 active"><a href="link1.html">first item</a></p>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-inactive"><a href="link3.html">third item</a></li>
                <li class="item-1"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a> # note: missing a </li> closing tag
            </ul>
        </div>
        '''

        html = etree.HTML(doc)
        print(html.xpath("//*[contains(@class,'item')]"))

【 Result 】

[<Element p at 0x23f4a9d12c8>, <Element li at 0x23f4a9d13c8>, <Element li at 0x23f4a9d1408>, <Element li at 0x23f4a9d1448>, <Element li at 0x23f4a9d1488>]
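Because contains is a plain substring test, it can match more than intended: 'active' is also a substring of 'item-inactive'. A common XPath 1.0 idiom for matching a whole class name pads the attribute with spaces first. The sketch below compares the two on a made-up list; it is one possible approach, not something the original example uses:

```python
from lxml import etree

root = etree.HTML(
    '<ul class="ul items">'
    '<li class="item-0 active">one</li>'
    '<li class="item-inactive">two</li>'
    '<li class="item-1">three</li>'
    '</ul>'
)

# Substring test: also matches "item-inactive".
loose = root.xpath("//*[contains(@class, 'active')]")

# Whole-token test: wrap the class list in spaces and look for " active ".
strict = root.xpath(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' active ')]"
)

print([e.text for e in loose])
print([e.text for e in strict])
```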

starts-with

    from lxml import etree

    if __name__ == '__main__':
        doc = '''
        <div>
            <ul class='ul items'>
                <p class="item-0 active"><a href="link1.html">first item</a></p>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-inactive"><a href="link3.html">third item</a></li>
                <li class="item-1"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a> # note: missing a </li> closing tag
            </ul>
        </div>
        '''

        html = etree.HTML(doc)
        print(html.xpath("//*[contains(@class,'item')]"))
        print(html.xpath("//*[starts-with(@class,'ul')]"))

【 Result 】

[<Element ul at 0x23384e51148>, <Element p at 0x23384e51248>, <Element li at 0x23384e51288>, <Element li at 0x23384e512c8>, <Element li at 0x23384e51308>, <Element li at 0x23384e51388>]
[<Element ul at 0x23384e51148>]

ends-with

print(html.xpath("//*[ends-with(@class,'ul')]"))

【 Result 】

Traceback (most recent call last):
  File "F:/OneDrive/pprojects/shoes-show-spider/test/xp5_test.py", line 18, in <module>
    print(html.xpath("//*[ends-with(@class,'ul')]"))
  File "src\lxml\etree.pyx", line 1582, in lxml.etree._Element.xpath
  File "src\lxml\xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
  File "src\lxml\xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Unregistered function

It appears that lxml does not support every function in the XPath function list. ends-with belongs to XPath 2.0, while lxml implements XPath 1.0.
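One way to get the effect of ends-with using only XPath 1.0 functions is to cut the tail off the attribute with substring and string-length and compare it. This is a sketch on a made-up list, offered as a workaround rather than an lxml feature:

```python
from lxml import etree

root = etree.HTML(
    '<ul>'
    '<li class="item-0 active">one</li>'
    '<li class="item-inactive">two</li>'
    '<li class="item-1">three</li>'
    '<li>four</li>'
    '</ul>'
)

suffix = "active"
# Take the last len(suffix) characters of @class and compare them with the suffix.
expr = (
    "//li[substring(@class, string-length(@class) - string-length('{0}') + 1) = '{0}']"
).format(suffix)

matches = [li.text for li in root.xpath(expr)]
print(matches)
```

The <li> without a class attribute is simply skipped, because its attribute value evaluates to an empty string.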

upper-case

As with ends-with, this function is not supported, and it fails with the same error, lxml.etree.XPathEvalError: Unregistered function.

print(html.xpath("//a[contains(upper-case(@class),'ITEM-INACTIVE')]"))
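In XPath 1.0 a case-insensitive comparison is usually written with translate, which maps each lowercase letter to its uppercase counterpart. The sketch below applies it to the class attribute of <li> elements in a made-up list and only handles ASCII letters:

```python
import string

from lxml import etree

root = etree.HTML(
    '<ul>'
    '<li class="item-0 active">one</li>'
    '<li class="Item-Inactive">two</li>'
    '<li class="item-1">three</li>'
    '</ul>'
)

LOWER = string.ascii_lowercase
UPPER = string.ascii_uppercase

# translate(@class, lower, upper) plays the role of upper-case(@class).
expr = f"//li[contains(translate(@class, '{LOWER}', '{UPPER}'), 'ITEM-INACTIVE')]"

matches = [li.text for li in root.xpath(expr)]
print(matches)
```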

text, last

    # select the last <li>
    print(html.xpath("//li[last()]/a/text()"))
    # this returns the content of every <a>, because each <a> is the last <a> inside its own parent
    print(html.xpath("//li/a[last()]/text()"))
    print(html.xpath("//li/a[contains(text(),'third')]"))

【 Result 】

['fifth item']
['second item', 'third item', 'fourth item', 'fifth item']
[<Element a at 0x26ab7bd1308>]

position

    print(html.xpath("//li[position()=2]/a/text()"))
    # result: ['third item']

This is a usage we have already seen in the predicates section.

A question remains: can position() be used as a path step, the way text() can?

    print(html.xpath("//li[last()]/a/position()"))
    # result: lxml.etree.XPathEvalError: Unregistered function

So the conclusion is that a function cannot be placed just anywhere in an expression and still return what you want.
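The reason is that text() and node() are node tests, which are allowed as path steps, while position() and last() are functions, which belong inside a predicate or at the start of an expression. If the goal is to learn an element's position, one XPath 1.0 option is to count its preceding siblings. A sketch on a made-up list:

```python
from lxml import etree

root = etree.HTML(
    '<ul>'
    '<li class="item-0">one</li>'
    '<li class="item-1">two</li>'
    '<li class="item-inactive">three</li>'
    '</ul>'
)

# A function call may be the whole expression; numeric results come back as float.
total = root.xpath("count(//li)")

# 1-based position of the item-inactive <li> among its <li> siblings.
index = root.xpath(
    "count(//li[@class='item-inactive']/preceding-sibling::li) + 1"
)

print(total, index)
```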

node

Returns all child nodes regardless of type (elements, text content, and so on). Attributes are not child nodes, so they are not included.

print(html.xpath("//ul/li[@class='item-inactive']/node()"))
print(html.xpath("//ul/node()"))

【 Result 】

    [<Element a at 0x239a0d197c8>]
    ['\n ', <Element li at 0x239a0d19788>, '\n ', <Element li at 0x239a0d19888>, '\n ', <Element li at 0x239a0d19908>, '\n ', <Element li at 0x239a0d19948>, '\n ', <Element li at 0x239a0d198c8>, ' closing tag\n ']

The final string is the part of the comment in doc that follows its literal </li>, which the parser reads as the closing tag.

Getting content

As already mentioned, you can use .text and text() to get the content of an element.

    from lxml import etree

    if __name__ == '__main__':
        doc = '''
        <div>
            <ul class='ul items'>
                <li class="item-0 active"><a href="link1.html">first item</a></li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-inactive"><a href="link3.html">third item</a></li>
                <li class="item-1"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a> # note: missing a </li> closing tag
            </ul>
        </div>
        '''

        html = etree.XML(doc)
        print(html.xpath("//a/text()"))
        print(html.xpath("//a")[0].text)
        print(html.xpath("//ul")[0].text)
        print(len(html.xpath("//ul")[0].text))
        print(html.xpath("//ul/text()"))

【 Result 】

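    ['first item', 'second item', 'third item', 'fourth item', 'fifth item']
    first item
                          # .text of <ul> is only a newline plus indentation, so this line looks blank
    18                    # depends on how far the first <li> is indented inside doc
    ['\n ', '\n ', '\n ', '\n ', '\n ', ' closing tag\n ']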

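This output shows the difference between text() and .text. text() is part of the XPath expression and returns a list containing every text node it matches. .text is a property of a single element and holds only the text that comes before that element's first child.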
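The difference is easier to see on an element that has both a child and trailing text. The following sketch uses a made-up fragment and also shows .tail and string(), which the example above does not use:

```python
from lxml import etree

root = etree.XML(
    '<ul>\n  <li><a href="link1.html">first item</a> after link</li>\n</ul>'
)
li = root.xpath("//li")[0]
a = li[0]

li_text = li.text                     # None: nothing precedes the first child <a>
a_text = a.text                       # text inside <a>
a_tail = a.tail                       # text that follows </a>, stored on the <a> element
li_text_nodes = li.xpath("text()")    # direct text-node children of <li>
li_string = li.xpath("string()")      # all descendant text joined together
ul_text = root.text                   # whitespace before the first <li>

print(repr(li_text), repr(a_text), repr(a_tail))
print(li_text_nodes, repr(li_string), repr(ul_text))
```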

Retrieve attributes

print(html.xpath("//a/@href"))
print(html.xpath("//li/@class"))

【 Result 】

['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
['item-0 active', 'item-1', 'item-inactive', 'item-1', 'item-0']
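Attributes can also be read from an element after it has been selected, which keeps elements without the attribute in the result. A sketch on a made-up list where the second link has no href:

```python
from lxml import etree

root = etree.HTML(
    '<ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li><a>no link</a></li>'
    '</ul>'
)

links = root.xpath("//a")

xpath_hrefs = root.xpath("//a/@href")       # only attributes that exist
hrefs = [a.get("href") for a in links]      # one entry per <a>; None when missing
first_attrs = dict(links[0].attrib)         # all attributes of one element

print(xpath_hrefs, hrefs, first_attrs)
```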

Custom functions

From using these functions we can conclude that some are supported and some are not. So which ones are supported? The answer is on the lxml website, at lxml.de/xpathxslt.h…: lxml supports XPath 1.0, provided through libxml2 and libxslt in a standards-compliant way. The official XPath 1.0 documentation, and documentation for other XPath versions, is available at www.w3.org/TR/xpath/

    lxml supports XPath 1.0, XSLT 1.0 and the EXSLT extensions through libxml2 and libxslt in a standards compliant way.

In addition, lxml lets you register custom functions to extend its XPath support: lxml.de/extensions…

    from lxml import etree

    def ends_with(context, s1, s2):
        # s1 is a list of attribute values, s2 is a string
        return s1[0].endswith(s2)

    if __name__ == '__main__':
        doc = '''
        <div>
            <ul class='ul items'>
                <li class="item-0 active"><a href="link1.html">first item</a></li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-inactive"><a href="link3.html">third item</a></li>
                <li class="item-1"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a> # note: missing a </li> closing tag
            </ul>
        </div>
        '''

        html = etree.XML(doc)
        ns = etree.FunctionNamespace(None)
        ns['ends-with'] = ends_with    # register the function under the XPath name ends-with

        print(html.xpath("//li[ends-with(@class,'active')]"))
        print(html.xpath("//li[ends-with(@class,'active')]/a/text()"))

【 Result 】

[<Element li at 0x2816ed30548>, <Element li at 0x2816ed30508>]
['first item', 'third item']
  • Parameter s1 receives the first argument of the XPath call, @class; note that @class arrives as a list
  • Parameter s2 receives the second argument of the XPath call, 'active', which is a string
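Because the first argument arrives as a list, indexing it with [0] raises an IndexError for any element that lacks the attribute. A slightly more defensive variant is sketched below; the parameter names values and suffix and the sample document are invented for illustration:

```python
from lxml import etree


def ends_with(context, values, suffix):
    """Return True if any value passed from XPath ends with suffix."""
    # A literal string argument arrives as a plain string.
    if isinstance(values, str):
        return values.endswith(suffix)
    # An attribute such as @class arrives as a list of strings,
    # which is empty when the element has no such attribute.
    return any(value.endswith(suffix) for value in values)


ns = etree.FunctionNamespace(None)
ns['ends-with'] = ends_with

root = etree.XML(
    '<ul>'
    '<li class="item-0 active">one</li>'
    '<li>two</li>'
    '<li class="item-inactive">three</li>'
    '</ul>'
)

matches = root.xpath("//li[ends-with(@class, 'active')]/text()")
print(matches)
```

The <li> without a class attribute is now skipped quietly instead of stopping the query.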

An example from lxml.de/extensions…:

    from lxml import etree

    def hello(context, a):
        return "Hello %s" % a

    ns = etree.FunctionNamespace(None)
    ns['hello'] = hello

    root = etree.XML('<a><b>Haegar</b></a>')
    print(root.xpath("hello('Dr. Falken')"))
