😀 This is the 10th original article in the crawler column

Earlier we saw how to extract page content with XPath using lxml and with CSS selectors using pyquery. Either XPath or CSS selectors is sufficient for most content extraction, so you can pick whichever library you prefer.

But at this point you might ask: can I use both? When extracting content, sometimes XPath is easier to write and sometimes a CSS selector is easier. Can the two be used together? The answer is yes.

Here we introduce another parsing library, called Parsel.

Note: If you have ever used the Scrapy framework, you will notice that the Parsel API is very similar to the Scrapy selector API. That is because Scrapy's selectors are a wrapper around Parsel, so once you have learned this library, Scrapy selectors will come naturally.

1. Introduction

Parsel is a library that parses HTML and XML. It supports extracting and modifying content with XPath and CSS selectors, and it also integrates regular expression extraction. It is flexible and powerful, and it is the underlying support for Scrapy, Python's most popular crawler framework.

2. Preparation

Before starting this section, make sure that the Parsel library is installed. If not, use pip3 to install it:

pip3 install parsel

For more detailed installation instructions, see setup.scrape.center/parsel.

Once installed, we are ready to start this section.

3. Initialization

Let’s start with the example HTML from the previous section and declare the html variable as follows:

html = ''' 
       '''

And then, typically, we declare a Selector object using parsel’s Selector class, like this:

from parsel import Selector
selector = Selector(text=html)

Here we create a Selector object, passing the HTML string we just declared as the text argument, and assign it to the selector variable.

Once we have a Selector object, we can call its css and xpath methods, passing in a CSS selector or an XPath expression, to extract content. For example, here we extract the nodes whose class contains item-0:

items = selector.css('.item-0')
print(len(items), type(items), items)
items2 = selector.xpath('//li[contains(@class, "item-0")]')
print(len(items2), type(items2), items2)

We first extract the nodes with the css method and print the length, type, and content of the result. The xpath version is written the same way. The output is as follows:

3 <class 'parsel.selector.SelectorList'> [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' item-0 ')]" data='<li class="item-0">first item</li>'>, <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' item-0 ')]" data='<li class="item-0 active"><a href="li...'>, <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' item-0 ')]" data='<li class="item-0"><a href="link5.htm...'>]
3 <class 'parsel.selector.SelectorList'> [<Selector xpath='//li[contains(@class, "item-0")]' data='<li class="item-0">first item</li>'>, <Selector xpath='//li[contains(@class, "item-0")]' data='<li class="item-0 active"><a href="li...'>, <Selector xpath='//li[contains(@class, "item-0")]' data='<li class="item-0"><a href="link5.htm...'>]

You can see that both results are SelectorList objects, which are iterable. We can use len to get the length, which is 3 in both cases, and the extracted nodes are the same: the first, third, and fifth li nodes. Each node is still returned as a Selector object, whose data attribute holds the HTML of the extracted node.

But here you may have a question: the first time, didn't we use the css method to extract the nodes? Why does each Selector object in the result show an xpath attribute rather than a css attribute? This is because, behind the css method, the CSS selector we pass in is first converted to XPath, and XPath is what actually performs the extraction. The conversion from CSS selector to XPath is done underneath by the cssselect library. For example, the CSS selector .item-0 converts to the XPath descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' item-0 ')], which is why the output Selector has an xpath attribute. You don't need to worry about this, though: it doesn't affect the extraction result, only how the selector is represented.

4. Extract text

Since the result we just extracted is an iterable SelectorList, to get the text of every extracted li node we need to iterate over it, as follows:

from parsel import Selector
selector = Selector(text=html)
items = selector.css('.item-0')
for item in items:
    text = item.xpath('.//text()').get()
    print(text)

Here we iterate over items, binding each element to item. Each item is a Selector object, so we can call its css or xpath method again to extract content. In this example we use the XPath expression .//text() to select all the text inside the current node. Without any further call, the result would still be a SelectorList. SelectorList has a get method, which extracts the content of the Selector objects it contains.

The running results are as follows:

first item
third item
fifth item

So what the get method does here is take the first Selector in the SelectorList and return its extracted content.

Let’s look at another example:

result = selector.xpath('//li[contains(@class, "item-0")]//text()').get()
print(result)

The following output is displayed:

first item

Here we use //li[contains(@class, "item-0")]//text() to select the text nodes of all li nodes whose class contains item-0. The returned SelectorList therefore holds three text nodes, one per li, but get returns only the text of the first one, because it only ever extracts the result of the first Selector object.

So is there a way to extract the content of all the Selector objects? Yes, that’s the getall method.

Therefore, if we want to extract the text content of all the matching li nodes, we can rewrite it as follows:

result = selector.xpath('//li[contains(@class, "item-0")]//text()').getall()
print(result)

The following output is displayed:

['first item', 'third item', 'fifth item']

This time we get a list, with one entry for each Selector object.

So, to extract results from a SelectorList, you can use either get or getall: the former returns the content of the first Selector, the latter returns the content of every Selector in order.

In addition, the xpath call above can be rewritten with the css method, as follows:

result = selector.css('.item-0 *::text').getall()
print(result)

Here * selects all descendant nodes (including plain text nodes), and ::text is appended to extract the text. The final result is the same.

That covers the basics of text extraction.

5. Extract attributes

We just saw that text can be extracted by adding //text() to the XPath. So how do we extract attributes? In a similar way: express it directly in the XPath or CSS selector.

For example, we extract the href attribute of the a node inside the third li node as follows:

from parsel import Selector
selector = Selector(text=html)
result = selector.css('.item-0.active a::attr(href)').get()
print(result)
result = selector.xpath('//li[contains(@class, "item-0") and contains(@class, "active")]/a/@href').get()
print(result)

Here we implement it in two ways, one with the css method and one with the xpath method. We select the third li node by requiring both the item-0 and active classes, and then select the a node inside it. With a CSS selector, append ::attr() and pass in the attribute name to select the attribute. With XPath, simply use /@ followed by the attribute name. Finally, call get to extract the result.

The running results are as follows:

link3.html
link3.html

You can see that both methods extract the corresponding href attribute correctly.

6. Regular expression extraction

In addition to the usual css and xpath methods, Selector objects also provide regular expression extraction methods. Let’s look at an example:

from parsel import Selector
selector = Selector(text=html)
result = selector.css('.item-0').re('link.*')
print(result)

Here we first use the css method to extract all the nodes whose class contains item-0, and then call re with the pattern link.* to match everything from link to the end of the line.

The running results are as follows:

['link3.html"><span class="bold">third item</span></a></li>', 'link5.html">fifth item</a></li>']

As you can see, the re method iterates over all the extracted Selector objects, searches the source code of each node for matches of the given regular expression, and returns the matches as a list.

Of course, if the css method has already narrowed the result further, for example to the text of the nodes, then the re method only searches that text:

from parsel import Selector
selector = Selector(text=html)
result = selector.css('.item-0 *::text').re('.*item')
print(result)

The running results are as follows:

['first item', 'third item', 'fifth item']

Alternatively, we can use the re_first method to extract only the first result that matches the pattern:

from parsel import Selector
selector = Selector(text=html)
result = selector.css('.item-0').re_first('<span class="bold">(.*?)</span>')
print(result)

Here we call the re_first method to extract the text contained in the span tag. The part to extract is wrapped in parentheses as a capture group, and the output is whatever the group matched:

third item

These examples show the basic usage of regular expression matching: re returns all matches as a list, while re_first returns only the first match. Choose whichever fits the situation.

7. Summary

Parsel is an extraction library that combines XPath, CSS selectors, and regular expressions. For more on its usage, see the official documentation: parsel.readthedocs.io/.

Code for this section: github.com/Python3WebS…

Thank you very much for reading. For more content, follow my public accounts “Attack Coder” and “Cui Qingcai | Jingmi”.