Install the lxml package before using it, for example with pip install lxml

Getting started

As with BeautifulSoup, we first need to obtain a document tree.

  • Convert text to a document tree object

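    from lxml import etree

    if __name__ == '__main__':
        doc = '''
            <div>
                <ul>
                    <li class="item-0 active"><a href="link1.html">first item</a></li>
                    <li class="item-1"><a href="link2.html">second item</a></li>
                    <li class="item-inactive"><a href="link3.html">third item</a></li>
                    <li class="item-1"><a href="link4.html">fourth item</a></li>
                    <li class="item-0"><a href="link5.html">fifth item</a> # Notice that one is missing here </li> closing tag
                </ul>
            </div>
        '''

        # etree.HTML parses the string and repairs the markup where needed
        html = etree.HTML(doc)
        result = etree.tostring(html)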
  • Convert a file to a document tree object

    from lxml import etree

    # Read the external file index.html
    html = etree.parse('./index.html')
    result = etree.tostring(html, pretty_print=True)
    print(result)

This prints the contents of the document.
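
Real-world HTML files are often not well-formed XML, which the default parser of etree.parse expects. A minimal sketch of parsing such a file with lxml's HTMLParser follows; the temporary file and its contents are assumptions made up for illustration.

```python
import os
import tempfile

from lxml import etree

# Hypothetical file contents: the second <li> is never closed.
SAMPLE = '<ul><li>first item</li><li>second item</ul>'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'index.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(SAMPLE)

    # HTMLParser tolerates sloppy HTML and repairs it while parsing.
    tree = etree.parse(path, parser=etree.HTMLParser())

items = tree.xpath('//li/text()')
repaired = etree.tostring(tree, pretty_print=True).decode('utf-8')
print(items)
print(repaired)
```

The serialized output contains a closing tag for every li, because the parser closed the open element itself.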

Nodes, elements, attributes, content

The idea of XPath is to locate nodes by means of path expressions. Nodes include elements, attributes, and content.

  • Elements, for example:

    html -> <html>…</html>
    div -> <div>…</div>
    a -> <a>…</a>

Here we can see that an element means the same thing as a tag in HTML. A single element name does not express a path by itself, so elements are not used on their own.

Path expression
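
As a minimal sketch, the basic XPath 1.0 path operators (/, //, ., .. and @) behave as follows in lxml. The sample document and the variable names are assumptions chosen for illustration.

```python
from lxml import etree

# Hypothetical sample document, not taken from the article
doc = '''
<div>
  <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
  </ul>
</div>
'''
html = etree.HTML(doc)

# "/" walks down from the root; etree.HTML wraps the fragment in <html><body>
root_divs = html.xpath('/html/body/div')

# "//" selects matching nodes anywhere in the document
all_li = html.xpath('//li')

# "@" selects an attribute
hrefs = html.xpath('//a/@href')

# ".." steps up to the parent element
parents = html.xpath('//a/..')

# "." is the current node, here the first <li>
first_link = all_li[0].xpath('./a')
```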

The wildcard
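
A short sketch of the XPath 1.0 wildcards: * matches any element and @* matches any attribute. The sample markup is an assumption for illustration.

```python
from lxml import etree

# Hypothetical sample document
doc = '<div><h2 id="title">Items</h2><p class="intro"><a href="link1.html">first item</a></p></div>'
html = etree.HTML(doc)

# "*" matches any element: all element children of <div>
child_tags = [el.tag for el in html.xpath('//div/*')]

# "@*" matches any attribute: every attribute value of <a>
link_attrs = html.xpath('//a/@*')

# "//*[@*]" selects every element that has at least one attribute
with_attrs = [el.tag for el in html.xpath('//*[@*]')]
```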

The predicate

Square brackets qualify (filter) the elements of a step; such a qualifier is called a predicate.
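
The following sketch shows a few common predicate forms: a position, last(), a position() comparison, an attribute test, and a test on a child node. The document is a made-up sample.

```python
from lxml import etree

# Hypothetical sample document
doc = '''
<ul>
  <li class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
  <li class="item-inactive"><a href="link3.html">third item</a></li>
  <li class="item-1"><a href="link4.html">fourth item</a></li>
</ul>
'''
html = etree.HTML(doc)

first = html.xpath('//li[1]/a/text()')                      # positions start at 1
last = html.xpath('//li[last()]/a/text()')                  # last <li> of its parent
first_two = html.xpath('//li[position() < 3]/a/text()')     # first two <li>
by_class = html.xpath('//li[@class="item-1"]/a/text()')     # exact attribute match
by_child = html.xpath('//li[a/@href="link3.html"]/@class')  # test on a child node
```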

Multiple paths

Use | to join two expressions; a node that matches either expression is selected.

//book/title | //book/price
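
A minimal sketch of this union expression in lxml. The bookstore XML is an assumed sample, since the article does not show the document being queried.

```python
from lxml import etree

# Hypothetical XML document for the //book/title | //book/price expression
xml = '''<bookstore>
  <book><title>Harry Potter</title><price>29.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
</bookstore>'''
root = etree.fromstring(xml)

# Each selected node matches one side of the union or the other
nodes = root.xpath('//book/title | //book/price')
pairs = [(node.tag, node.text) for node in nodes]
print(pairs)
```
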
XPath has many built-in functions. For more functions, see…

  • contains(string1,string2)
  • starts-with(string1,string2)
  • ends-with(string1,string2) # Not supported
  • upper-case(string) # Not supported
  • text()
  • last()
  • position()
  • node()

You can see that last() is also a function; we already met it in the section on predicates.


Locating elements

An expression can match multiple elements; the matches are returned as a list.
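
As a brief sketch (with an assumed one-line document), xpath() returns a plain Python list of elements that can be iterated, and an empty list when nothing matches.

```python
from lxml import etree

# Hypothetical sample document
html = etree.HTML('<ul><li>one</li><li>two</li><li>three</li></ul>')

items = html.xpath('//li')           # list of matching <li> elements
texts = [li.text for li in items]    # iterate like any other list
missing = html.xpath('//table')      # no match gives an empty list, not None

print(items)
print(texts)
print(missing)
```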

【 Result 】

【 Result 】

Using the function


Sometimes matching with @class='…' is not appropriate, because it is an exact match: when the styling changes, an element may gain or lose classes such as active. In that case contains is very convenient.

【 Result 】

[<Element p at 0x23f4a9d12c8>, <Element li at 0x23f4a9d13c8>, <Element li at 0x23f4a9d1408>, <Element li at 0x23f4a9d1448>, <Element li at 0x23f4a9d1488>]
    from lxml import etree

    if __name__ == '__main__':
        doc = '''
            <div>
                <ul class='ul items'>
                    <p class="item-0 active"><a href="link1.html">first item</a></p>
                    <li class="item-1"><a href="link2.html">second item</a></li>
                    <li class="item-inactive"><a href="link3.html">third item</a></li>
                    <li class="item-1"><a href="link4.html">fourth item</a></li>
                    <li class="item-0"><a href="link5.html">fifth item</a> # The lack of a </li> closing tag
                </ul>
            </div>
        '''

        html = etree.HTML(doc)
        print(html.xpath("//*[contains(@class,'item')]"))
        print(html.xpath("//*[starts-with(@class,'ul')]"))

【 Result 】

[<Element ul at 0x23384e51148>, <Element p at 0x23384e51248>, <Element li at 0x23384e51288>, <Element li at 0x23384e512c8>, <Element li at 0x23384e51308>, <Element li at 0x23384e51388>]
[<Element ul at 0x23384e51148>]
【 Result 】

Traceback (most recent call last):
  File "F:/OneDrive/pprojects/shoes-show-spider/test/", line 18, in <module>
  File "src\lxml\etree.pyx", line 1582, in lxml.etree._Element.xpath
  File "src\lxml\xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
  File "src\lxml\xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Unregistered function
It appears that lxml does not support every function in the XPath function list.


Like ends-with, upper-case is not supported and raises the same error: lxml.etree.XPathEvalError: Unregistered function
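
Both unsupported functions can be approximated with functions that do exist in XPath 1.0: substring with string-length stands in for ends-with, and translate stands in for upper-case. This is a sketch of one common workaround, using an assumed sample document and made-up variable names.

```python
from lxml import etree

# Hypothetical sample document
doc = '''
<ul>
  <li class="item-0 active"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
  <li class="item-inactive"><a href="link3.html">third item</a></li>
</ul>
'''
html = etree.HTML(doc)

# ends-with(@class, 'active'): compare the trailing part of @class with 'active'
ENDS_WITH_ACTIVE = (
    "//li[substring(@class, string-length(@class) - string-length('active') + 1)"
    " = 'active']/a/text()"
)
ends_with_active = html.xpath(ENDS_WITH_ACTIVE)

# upper-case(...): map each lowercase ASCII letter to its uppercase form.
# $lower and $upper are XPath variables supplied as keyword arguments.
LOWER = 'abcdefghijklmnopqrstuvwxyz'
UPPER = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
shouted = html.xpath(
    'translate(string(//li[1]/a), $lower, $upper)', lower=LOWER, upper=UPPER
)
```

Note that translate only handles the characters listed, so this version covers ASCII letters only.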

text() and last()

【 Result 】

['fifth item']
['second item', 'third item', 'fourth item', 'fifth item']
[<Element a at 0x26ab7bd1308>]
This is an example that we’ve seen before

This raises the question of whether position() can be used in the same way as text().
The conclusion is that a function cannot be placed just anywhere in an expression to fetch whatever we want.
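
A small sketch of that difference, on an assumed sample document: text() is a node test and may appear as a path step, while position() is a function that only makes sense inside a predicate. Using it as a step is rejected by lxml.

```python
from lxml import etree

# Hypothetical sample document
html = etree.HTML(
    '<ul><li><a>first item</a></li><li><a>second item</a></li></ul>'
)

# position() inside a predicate, text() as the final step: valid
second = html.xpath('//li[position() = 2]/a/text()')

# position() as a path step is not valid XPath 1.0
try:
    html.xpath('//li/position()')
    position_as_step_ok = True
except etree.XPathError as exc:
    position_as_step_ok = False
    print('rejected:', exc)
```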


node() returns all child nodes regardless of type (elements, text content, comments). Attributes are not child nodes, so they are selected with @* instead.
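
The sketch below compares node() with * and @* on an assumed XML snippet. The helper kind() is a hypothetical function introduced only to label each result.

```python
from lxml import etree

# Hypothetical sample: element, text and comment children plus one attribute
root = etree.fromstring(
    '<ul class="items"><li>a</li>loose text<!-- note --><li>b</li></ul>'
)


def kind(node):
    """Label an XPath result as text, comment, or by its tag name."""
    if isinstance(node, str):
        return 'text'
    if node.tag is etree.Comment:
        return 'comment'
    return node.tag


all_children = [kind(n) for n in root.xpath('./node()')]  # every child node
elements_only = [kind(n) for n in root.xpath('./*')]      # element children only
attributes = root.xpath('./@*')                           # attributes, separately
```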

【 Result 】

Access to content

As already mentioned, both .text and text() can be used to get the content of an element.

【 Result 】

['first item', 'second item', 'third item', 'fourth item', 'fifth item']
first item
18
['\n ', '\n ', '\n ', '\n ', '\n ', 'close tag \n ']

From this output we can observe the difference between text() and .text. It is not easy to put into words, so try to sum it up for yourself.
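
One way to see the difference is the sketch below, on an assumed XML snippet: text() is an XPath node test that returns a list of all matching text nodes, while .text is a property of a single element holding only the text before its first child.

```python
from lxml import etree

# Hypothetical sample: <ul> has text both before and after its <li> child
root = etree.fromstring('<ul>head<li><a>first item</a></li>tail</ul>')

link_texts = root.xpath('//a/text()')  # XPath: always a list
link_text = root.xpath('//a')[0].text  # property: a single string

ul_texts = root.xpath('/ul/text()')    # every text node directly under <ul>
ul_text = root.text                    # only the text before the first child
li_text = root[0].text                 # None: <li> starts with an element
li_tail = root[0].tail                 # text that follows </li>
```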

Retrieve attributes

【 Result 】

['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
['item-0 active', 'item-1', 'item-inactive', 'item-1', 'item-0']
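
As a sketch on a shortened, assumed document, attribute values can be read either with @name in the XPath expression or from an element with get() and attrib.

```python
from lxml import etree

# Hypothetical sample document
html = etree.HTML(
    '<ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '</ul>'
)

hrefs = html.xpath('//a/@href')     # attribute values as a list of strings
classes = html.xpath('//li/@class')

first_link = html.xpath('//a')[0]
href = first_link.get('href')       # one attribute of one element
attrs = dict(first_link.attrib)     # all attributes of that element
```
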
Custom function

Working with these functions shows that some are supported and some are not, which raises the question of which ones lxml actually supports. The answer is on the lxml website, lxml.de/xpathxslt.h…: lxml supports XPath 1.0, XSLT 1.0 and the EXSLT extensions through libxml2 and libxslt in a standards-compliant way. The official XPath 1.0 documentation, and the documentation for other XPath versions, is published by the W3C.

In addition, lxml lets you register custom functions to extend its XPath support…

【 Result 】

[<Element li at 0x2816ed30548>, <Element li at 0x2816ed30508>]
['first item', 'third item']
  • Parameter s1 receives the first argument of the XPath call, @class; note that @class arrives as a list
  • Parameter s2 receives the second argument of the XPath call, 'active', which is a string
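
A sketch consistent with the results and parameters described above, though not necessarily the original code: a Python function named ends_with is registered under the XPath name ends-with through lxml's FunctionNamespace. The sample document and the function body are assumptions.

```python
from lxml import etree


def ends_with(context, s1, s2):
    """XPath extension: true if the first value in s1 ends with s2.

    s1 is the list of values selected by @class (empty when the attribute
    is absent); s2 is a plain string such as 'active'.
    """
    return bool(s1) and s1[0].endswith(s2)


# Register the function in the default (empty) function namespace.
# The registration is global for the running process.
ns = etree.FunctionNamespace(None)
ns['ends-with'] = ends_with

# Hypothetical sample document
doc = '''
<ul>
  <li class="item-0 active"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
  <li class="item-inactive"><a href="link3.html">third item</a></li>
  <li class="item-1"><a href="link4.html">fourth item</a></li>
  <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
'''
html = etree.HTML(doc)

matches = html.xpath("//li[ends-with(@class, 'active')]")
texts = html.xpath("//li[ends-with(@class, 'active')]/a/text()")
print(matches)
print(texts)
```

Both "item-0 active" and "item-inactive" end with 'active', so two li elements match.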

