Moment For Technology

Python - Data extraction for crawlers

Posted on Dec. 2, 2022, 5:51 p.m. by 李懿
Category: Back-end Tags: python, back-end, crawler, data mining

Welcome to follow the official WeChat account: FSA Full Stack Action

1. Overview

1. Classification of response content

  • Structured response content
    • JSON strings: extract specific data with modules such as re, json, and jsonpath
    • XML strings: extract specific data with modules such as re and lxml
  • Unstructured response content
    • HTML strings: extract specific data with modules such as re, lxml, Beautiful Soup, and PyQuery

Note: The re module requires regular-expression syntax, and the lxml module requires XPath syntax.
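For instance, a field in a JSON response body can be pulled out either by parsing the whole structure or by matching the raw string. A minimal sketch of both approaches (the response body here is made up for illustration):

```python
import json
import re

# A hypothetical JSON response body
raw = '{"user": {"name": "lqr", "age": 18}}'

# Structured approach: parse with the json module, then index into the result
name_from_json = json.loads(raw)['user']['name']

# Regex approach: match the raw string directly, no parsing needed
name_from_re = re.search(r'"name":\s*"([^"]*)"', raw).group(1)

print(name_from_json, name_from_re)  # lqr lqr
```

Parsing is safer for nested structures; a regex can be quicker for a single flat field.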

2. The difference between XML and HTML

| Format | Full name | Design goal |
| --- | --- | --- |
| XML | eXtensible Markup Language | Designed to transmit and store data; the focus is on the structure of the data |
| HTML | HyperText Markup Language | Designed to display data; the focus is on how to present it |

2. Jsonpath module

1. Introduction

  • Scenario: extract data directly from a deeply nested dictionary
  • Installation: pip install jsonpath
  • Usage:

from jsonpath import jsonpath
ret = jsonpath(data, 'JsonPath syntax rule string')

Note: data is a dictionary, and the return value ret is a list.

2. Jsonpath syntax rules

Common core syntax

| JSONPath | Description |
| --- | --- |
| `$` | The root node |
| `.` or `[]` | Select a child node |
| `..` | Select descendant nodes, starting from any position |

Complete syntax reference: kubernetes.io/useful/docs/ref...

from jsonpath import jsonpath

data = {'key1': {'key2': {'key3': {'key4': {'key5': {'key6': 'lqr'}}}}}}

# Without jsonpath, every level has to be indexed by hand:
# print(data['key1']['key2']['key3']['key4']['key5']['key6'])

# jsonpath returns a list, so index into the result
print(jsonpath(data, '$.key1.key2.key3.key4.key5.key6')[0])
print(jsonpath(data, '$..key6')[0])
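The descendant operator `$..` can be pictured as a recursive search through the nested structure. A minimal stdlib-only sketch of the idea (not the jsonpath module's actual implementation):

```python
def find_descendants(data, key):
    """Recursively collect every value stored under `key`,
    mimicking the JsonPath descendant expression '$..key'."""
    results = []
    if isinstance(data, dict):
        for k, v in data.items():
            if k == key:
                results.append(v)
            results.extend(find_descendants(v, key))
    elif isinstance(data, list):
        for item in data:
            results.extend(find_descendants(item, key))
    return results

data = {'key1': {'key2': {'key3': {'key4': {'key5': {'key6': 'lqr'}}}}}}
print(find_descendants(data, 'key6'))  # ['lqr']
```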

3. LXML Module

1. Introduction

  • lxml module: uses XPath syntax to quickly locate elements in HTML/XML documents and obtain node information (text content, attribute values)
  • XPath: a language for finding information in HTML/XML documents; it can be used to traverse elements and attributes
  • Relationship: extracting data from XML and HTML requires the lxml module plus XPath syntax

2. XPath syntax

Complete syntax description: www.w3school.com.cn/xpath/index...

1) Syntax for locating nodes and extracting attributes or text content

| Expression | Description |
| --- | --- |
| `nodename` | Selects all children of the named node |
| `/` | Selects from the root node |
| `//` | Selects matching nodes anywhere in the document, regardless of position |
| `.` | Selects the current node |
| `..` | Selects the parent of the current node |
| `@` | Selects attributes |
| `text()` | Selects the text content |

For example:

<!-- Select the title tag -->
/html/head/title    absolute path
//title             relative path (relative to the entire HTML document)
//title/.           the current node
//title/..          the parent node
<!-- Get the text content between the opening and closing tags -->
//title/text()
<!-- Get the value of a given attribute on the selected node -->
//link/@href
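A subset of these expressions can be tried without third-party packages; a minimal sketch using the standard library's `xml.etree.ElementTree` (note it uses `.//` for relative paths and exposes text via `.text` rather than `text()`):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring('<html><head><title>demo</title></head><body></body></html>')

# Absolute-style path from the root element
print(doc.find('./head/title').text)  # demo

# .// is ElementTree's counterpart of the // relative path
print(doc.find('.//title').text)      # demo
```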

Recommended Chrome extension: XPath Helper (chrome.google.com/webstore/de...)

2) Syntax for selecting specific nodes

| Path expression | Result |
| --- | --- |
| `//title[@lang="eng"]` | Selects all title elements whose lang attribute value is "eng" |
| `/bookstore/book[1]` | Selects the first book element that is a child of bookstore |
| `/bookstore/book[last()]` | Selects the last book element that is a child of bookstore |
| `/bookstore/book[last()-1]` | Selects the next-to-last book element that is a child of bookstore |
| `/bookstore/book[position()>1]` | Selects the book elements under bookstore, starting from the second |
| `//book/title[text()='Harry Potter']` | Selects only the title elements under book whose text is "Harry Potter" |
| `/bookstore/book[price>35.00]/title` | Selects all title elements of book elements under bookstore whose price child element has a value greater than 35.00 |

Note: In XPath, the position of the first element is 1, the position of the last element is last(), and the next-to-last is last()-1.
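The position predicates can be tried out with the standard library as well; `xml.etree.ElementTree` supports the `[1]`, `[last()]`, and `[last()-1]` forms. A stdlib sketch, with a made-up bookstore document:

```python
import xml.etree.ElementTree as ET

xml_text = """
<bookstore>
  <book><title>Book A</title></book>
  <book><title>Book B</title></book>
  <book><title>Book C</title></book>
</bookstore>
"""
root = ET.fromstring(xml_text)

print(root.find('./book[1]/title').text)         # first book: Book A
print(root.find('./book[last()]/title').text)    # last book: Book C
print(root.find('./book[last()-1]/title').text)  # next-to-last: Book B
```

ElementTree does not support `position()>1` or value comparisons like `[price>35.00]`; for the full syntax, use lxml as described below.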

For example:

<!-- Select by position -->
/html/body/div[3]/div/div[1]/div
/html/body/div[3]/div/div[1]/div[3]
/html/body/div[3]/div/div[1]/div[last()]
/html/body/div[3]/div/div[1]/div[last()-1]
/html/body/div[3]/div/div[1]/div[position()>=10]    range selection
<!-- Select by attribute: @ inside [] filters nodes by attribute name and value; /@ at the end takes the attribute value -->
//div[@id="content-left"]/div/@id
<!-- Select by child-element value -->
//span[i>2000]
//div[span[2]>=9.4]
<!-- Select nodes whose attribute contains a substring -->
//div[contains(@id,"qiushi_tag_")]

3) Syntax for selecting unknown nodes

Unknown HTML and XML elements can be selected with wildcards:

| Wildcard | Description |
| --- | --- |
| `*` | Matches any element node |
| `@*` | Matches any attribute node |
| `node()` | Matches a node of any kind |

Supplement: by using the `|` operator in a path expression, you can select several paths at once.

For example:

/bookstore/*       selects all child elements of the bookstore element
//*                selects all elements in the document
//title[@*]        selects all title elements that have at least one attribute
//h2/a | //td/a    combines two XPath expressions with |
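The `*` wildcard can also be demonstrated with the standard library; a small stdlib sketch with made-up element names:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<bookstore><book/><magazine/><dvd/></bookstore>')

# * matches any element node, so ./* selects every child of bookstore
children = [el.tag for el in root.findall('./*')]
print(children)  # ['book', 'magazine', 'dvd']
```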

3. Using the lxml module

1) Import the lxml etree library

from lxml import etree

# If the import above fails, use the import below instead:
# from lxml import html
# etree = html.etree

Note: Some IDEs flag `from lxml import etree` as an error, but it does not actually affect the code; it just looks a little awkward.

2) Use `etree.HTML` to get an Element object

`etree.HTML` converts an HTML string (bytes or str) into an Element object, which has an `xpath` method that returns a list of results:

html = etree.HTML(text)
ret_list = html.xpath('xpath syntax rule string')

3) Three cases for the list returned by the xpath method

  • Empty list: no elements were located
  • List of strings: the match was text content or attribute values
  • List of Elements: tags matching the criteria were found; the Element objects in the list can run xpath again

For example:

from lxml import etree

text = """
<div>
  <ul>
    <li><a href="link1.html">first item</a></li>
    <li><a href="link2.html">second item</a></li>
    <li><a href="link3.html">third item</a></li>
    <li><a href="link4.html">fourth item</a></li>
    <li><a href="link5.html">fifth item</a></li>
  </ul>
</div>
"""

html = etree.HTML(text)
# print(html)

# List of strings
print(html.xpath('//a[@href="link1.html"]/text()'))
print(html.xpath('//a[@href="link1.html"]/text()')[0])

text_list = html.xpath('//a/text()')
link_list = html.xpath('//a/@href')
# for text in text_list:
#     myindex = text_list.index(text)
#     link = link_list[myindex]
#     print(text, link)
for text, link in zip(text_list, link_list):
    print(text, link)

# List of Elements
el_list = html.xpath('//a')
for el in el_list:
    # print(el.xpath('//text()'))  # wrong: a leading // is relative to the entire document
    print(el.xpath('./text()'))    # ok: relative to the current element
    print(el.xpath('.//text()'))   # ok
    print(el.xpath('text()'))      # ok

4) Use of the `etree.tostring` function

The `etree.tostring` function converts an Element object back into an HTML string:

from lxml import etree

text = """
<div>
  <ul>
    <li><a href="link1.html">first item</a></li>
    <li><a href="link2.html">second item</a></li>
    <li><a href="link3.html">third item</a></li>
    <li><a href="link4.html">fourth item</a></li>
    <li><a href="link5.html">fifth item</a></li>
  </ul>
</div>
"""

html = etree.HTML(text)
html_str = etree.tostring(html).decode()
print(html_str)

Note: Some tags, such as html and body, may be added to the output, because etree.HTML(text) automatically repairs syntax errors in the original HTML.
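For comparison, the standard library has a similarly named serializer, but its parser requires well-formed input and does not repair markup or add wrapper tags the way `etree.HTML` does. A stdlib sketch:

```python
import xml.etree.ElementTree as ET

el = ET.fromstring('<ul><li>first</li></ul>')

# Serialize back to a string; no html/body wrapper is added
print(ET.tostring(el).decode())  # <ul><li>first</li></ul>
```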

If this article has been helpful to you, please follow my WeChat official account: FSA Full Stack Action, which is the biggest encouragement for me. The account covers not only Android but also iOS, Python, and other topics; there may be skills you want to learn.
