Moment For Technology

Python - Data extraction for crawlers

Posted on Dec. 2, 2022, 5:51 p.m. by 李懿
Category: Back end | Tags: python, back end, crawler, data mining

Welcome to follow the official WeChat account: FSA Full Stack Action

1. Overview

1. Classification of response content

  • Structured response content
    • JSON strings: you can extract specific data with modules such as re, json, and jsonpath
    • XML strings: you can extract specific data with modules such as re and lxml
  • Unstructured response content
    • HTML strings: you can extract specific data with modules such as re, lxml, Beautiful Soup, and PyQuery

Note: The re module requires regular expression syntax, and the lxml module requires XPath syntax.
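As a rough illustration of the two routes for a JSON response, the sketch below parses the string with json and, separately, captures the same value with re. The response body and its field names are made up for the example; they are not from any real API.

```python
import json
import re

# Hypothetical response body; the field names are invented for illustration.
response_text = '{"user": {"name": "lqr", "id": 42}}'

# Structured route: parse the JSON string into a dict, then index into it.
data = json.loads(response_text)
name_from_json = data["user"]["name"]

# Pattern route: treat the response as plain text and capture with a regex.
match = re.search(r'"name":\s*"([^"]+)"', response_text)
name_from_re = match.group(1) if match else None

print(name_from_json, name_from_re)
```

Parsing is usually the sturdier choice for structured content, since a regular expression depends on the exact text layout of the response.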

2. The difference between XML and HTML

Format  Full name                   Design goal
XML     eXtensible Markup Language  Designed to transmit and store data; the focus is on the structure of the data
HTML    HyperText Markup Language   Designed to display data; the focus is on how to present the data well

2. jsonpath module

1. Introduction

  • Scenario: extracting data directly from complex dictionaries with many levels of nesting
  • Installation: pip install jsonpath
  • Usage:
from jsonpath import jsonpath
ret = jsonpath(data, 'JSONPath expression string')

Note: data must be a dictionary, and the returned ret is a list.

2. JSONPath syntax rules

Common core syntax

JSONPath  Description
$         The root node
. or []   Child nodes
..        Descendant nodes, at any depth

Complete syntax description: kubernetes.io/useful/docs/ref...

from jsonpath import jsonpath

data = {'key1': {'key2': {'key3': {'key4': {'key5': {'key6': 'lqr'}}}}}}

# print(data['key1']['key2']['key3']['key4']['key5']['key6'])

# The result of jsonpath is a list, so index into it to get the data
print(jsonpath(data, '$.key1.key2.key3.key4.key5.key6')[0])
print(jsonpath(data, '$..key6')[0])
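To get a feel for what the .. operator does, here is a minimal pure-Python sketch of a recursive search by key. find_all is a hypothetical helper written for this illustration; it is not part of the jsonpath module and is not how the module is implemented.

```python
def find_all(node, key):
    """Collect every value stored under `key` at any depth of nested dicts and lists."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending, since the key may appear again further down.
            found.extend(find_all(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found


data = {'key1': {'key2': {'key3': {'key4': {'key5': {'key6': 'lqr'}}}}}}
result = find_all(data, 'key6')
print(result[0])
```

Like '$..key6', the search does not care how deep the key sits, which is why the result is a list: the same key can match at several levels. As far as I know, jsonpath returns False rather than an empty list when nothing matches, so it is worth checking the result before indexing it.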

3. lxml module

1. Introduction

  • lxml module: uses XPath syntax to quickly locate elements in HTML/XML documents and obtain node information (text content, attribute values)
  • XPath: a language for finding information in HTML/XML documents; it can be used to traverse the elements and attributes of an HTML/XML document
  • Relationship: extracting data from XML and HTML requires the lxml module together with XPath syntax

2. XPath syntax

Complete syntax description:

1) XPath syntax for locating nodes and extracting attributes or text content

Expression  Description
nodename    Selects all nodes with this name.
/           Selects from the root node.
//          Selects matching nodes anywhere below the current node, regardless of their position in the document.
.           Selects the current node.
..          Selects the parent of the current node.
@           Selects attributes.
text()      Selects text.

For example:

<!------- Select the title tag -------->
/html/head/title     absolute path
/html//title         relative path
//title              relative to the entire HTML document
//title/.            current node
//title/./../..      parent nodes (each .. moves up one level)

<!------- Get the text content between the opening and closing tags -------->
//title/text()

<!------- Get the value of a given attribute from the selected tag -------->
//link/@href
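The expressions above can be tried out with lxml (its usage is covered later in this article). This is a small sketch against a made-up page; the title text and the style.css file name are assumptions for the example.

```python
from lxml import etree

# A tiny made-up page, just enough to exercise the expressions above.
page = """
<html>
  <head>
    <title>Demo page</title>
    <link rel="stylesheet" href="style.css">
  </head>
  <body></body>
</html>
"""

doc = etree.HTML(page)

titles_abs = doc.xpath('/html/head/title')   # absolute path
titles_any = doc.xpath('//title')            # anywhere in the document
title_text = doc.xpath('//title/text()')     # text between the tags
link_href = doc.xpath('//link/@href')        # value of the href attribute
parent_tag = doc.xpath('//title/..')[0].tag  # .. moves up to the parent of <title>

print(title_text, link_href, parent_tag)
```

Expressions ending in text() or @href give back strings, while the others give back Element objects.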

Recommended: the Chrome extension [XPath Helper]

2) XPath syntax for selecting specific nodes

Path expression                        Result
//title[@lang="eng"]                   Selects all title elements whose lang attribute is "eng"
/bookstore/book[1]                     Selects the first book element that is a child of bookstore
/bookstore/book[last()]                Selects the last book element that is a child of bookstore
/bookstore/book[last()-1]              Selects the second-to-last book element that is a child of bookstore
/bookstore/book[position()>1]          Selects the book elements under bookstore, starting from the second
//book/title[text()='Harry Potter']    Selects the title elements under book whose text is Harry Potter
/bookstore/book[price>35.00]/title     Selects the title elements of every book under bookstore whose price child has a value greater than 35.00

Note: In XPath, the position of the first element is 1, the position of the last element is last(), and the second from last is last()-1.
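The predicates in the table can be checked against a small document. The bookstore data below is made up for this sketch (the titles and prices are illustrative only), and it is parsed as XML with etree.fromstring.

```python
from lxml import etree

# Made-up bookstore data; titles and prices are illustrative only.
xml = """
<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
  <book><title lang="fr">Le Petit Prince</title><price>35.50</price></book>
</bookstore>
"""

root = etree.fromstring(xml)

first = root.xpath('/bookstore/book[1]/title/text()')             # positions start at 1
last = root.xpath('/bookstore/book[last()]/title/text()')
penultimate = root.xpath('/bookstore/book[last()-1]/title/text()')
from_second = root.xpath('/bookstore/book[position()>1]/title/text()')
english = root.xpath('//title[@lang="eng"]/text()')
expensive = root.xpath('/bookstore/book[price>35.00]/title/text()')

print(first, last, penultimate)
print(from_second, english, expensive)
```

Note that book[0] would match nothing, because XPath positions are 1-based, unlike Python list indexes.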

For example:

<!-------- Qualify nodes by index -------->
/html/body/div[3]/div/div[1]/div
/html/body/div[3]/div/div[1]/div[last()]
/html/body/div[3]/div/div[1]/div[last()-1]
/html/body/div[3]/div/div[1]/div[position()>=10]    range selection

<!-------- Qualify nodes by attribute value -------->
//div[@id="content-left"]/div/@id    @ inside [] qualifies the node by attribute name and value; /@ at the end takes the attribute value

<!-------- Qualify nodes by the value of a child node -------->
//span[i>2000]
//div[span[2]>=9.4]

<!------- Qualify nodes with contains -------->
//div[contains(@id, "qiushi_tag_")]
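The two roles of @ and the contains() function from the examples above can be seen side by side in a short sketch. The markup is made up; the id values only mimic the pattern used in the expressions.

```python
from lxml import etree

# Made-up markup; the id values only mimic the pattern used in the expressions above.
page = """
<div id="content-left">
  <div id="qiushi_tag_101"><span>first</span></div>
  <div id="qiushi_tag_102"><span>second</span></div>
  <div id="sidebar"><span>other</span></div>
</div>
"""

doc = etree.HTML(page)

# @id inside [] filters nodes; /@id at the end returns the attribute values.
child_ids = doc.xpath('//div[@id="content-left"]/div/@id')

# contains() keeps only the nodes whose id includes the given substring.
tagged = doc.xpath('//div[contains(@id, "qiushi_tag_")]/span/text()')

print(child_ids)
print(tagged)
```

contains() is handy when part of an attribute value changes from item to item, as with the numeric suffix here.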

3) XPath syntax for selecting unknown nodes

You can select unknown HTML and XML elements by wildcard characters

Wildcard  Description
*         Matches any element node.
@*        Matches any attribute node.
node()    Matches any type of node.

Supplement: By using the "|" operator in a path expression, you can select several paths at once.

For example:

/bookstore/*       Selects all child elements of the bookstore element.
//*                Selects all elements in the document.
//title[@*]        Selects all title elements that have at least one attribute.
//h2/a|//td/a      Combines two XPath expressions with |.
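A short sketch of the wildcards and the | operator, run against a made-up page (the tags and link targets are assumptions for the example):

```python
from lxml import etree

# Made-up page with links in two different places.
page = """
<html><body>
  <h2><a href="/a">heading link</a></h2>
  <table><tr><td><a href="/b">cell link</a></td></tr></table>
  <p class="note">plain paragraph</p>
  <p>no attributes</p>
</body></html>
"""

doc = etree.HTML(page)

all_tags = [el.tag for el in doc.xpath('//*')]         # every element in the document
with_attrs = [el.tag for el in doc.xpath('//*[@*]')]   # elements that have any attribute
links = doc.xpath('//h2/a/text() | //td/a/text()')     # union of two paths

print(all_tags)
print(with_attrs)
print(links)
```

The union is returned as a single list in document order, so the results of the two paths are merged rather than grouped by path.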

3. Using the lxml module

1) Import the etree library from lxml

from lxml import etree

# If the import above reports an error, use the import below instead
from lxml import html
etree = html.etree

Note: The error flagged on from lxml import etree does not actually affect the code at run time; it just looks a little awkward.

2) Use etree.HTML to convert to an Element object

etree.HTML converts an HTML string (bytes or str) into an Element object. The Element object has an xpath method, which returns a list of results:

html = etree.HTML(text)
ret_list = html.xpath('XPath expression string')

3) The three kinds of list returned by the xpath method

  • Empty list: No elements are located
  • String list: Matches the text content or the value of an attribute
  • Element list: tags matching the rule were found; the Element objects in the list can call xpath again

For example:

from lxml import etree

text = """
<div>
  <ul>
    <li><a href="link1.html">first item</a></li>
    <li><a href="link2.html">second item</a></li>
    <li><a href="link3.html">third item</a></li>
    <li><a href="link4.html">fourth item</a></li>
    <li><a href="link5.html">fifth item</a></li>
  </ul>
</div>
"""

html = etree.HTML(text)
# print(html)

print(html.xpath('//a[@href="link1.html"]/text()'))
print(html.xpath('//a[@href="link1.html"]/text()')[0])

# String list
text_list = html.xpath('//a/text()')
link_list = html.xpath('//a/@href')

# for text in text_list:
#     myindex = text_list.index(text)
#     link = link_list[myindex]
#     print(text, link)

for text, link in zip(text_list, link_list):
    print(text, link)

# Element list
el_list = html.xpath('//a')
for el in el_list:
    print(el.xpath('//text()'))   # ×: a leading // is relative to the entire document
    print(el.xpath('./text()'))   # √
    print(el.xpath('.//text()'))  # √
    print(el.xpath('text()'))     # √
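The three kinds of result, and the difference between a leading // and a leading . when calling xpath on an Element, can be isolated in a smaller sketch. The two-item list below is made up for the illustration.

```python
from lxml import etree

# Made-up two-item list, kept on one line for brevity.
page = ('<ul><li><a href="link1.html">first item</a></li>'
        '<li><a href="link2.html">second item</a></li></ul>')
doc = etree.HTML(page)

empty = doc.xpath('//img')        # nothing matches: empty list
strings = doc.xpath('//a/@href')  # attribute values: list of strings
elements = doc.xpath('//a')       # tags: list of Element objects

first_a = elements[0]
whole_doc = first_a.xpath('//text()')  # leading // searches the whole document again
own_text = first_a.xpath('./text()')   # leading . stays inside first_a

print(empty, strings, len(elements))
print(whole_doc, own_text)
```

Starting the expression with . (or with no leading slash at all) is what keeps a follow-up query scoped to the element it is called on.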

4) Using the etree.tostring function

The etree.tostring function converts an Element object back into an HTML string. It returns bytes, so call decode() to get a str:

from lxml import etree

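text = """
<div>
  <ul>
    <li><a href="link1.html">first item</a></li>
    <li><a href="link2.html">second item</a></li>
    <li><a href="link3.html">third item</a></li>
    <li><a href="link4.html">fourth item</a></li>
    <li><a href="link5.html">fifth item</a></li>
  </ul>
</div>
"""

html = etree.HTML(text)
html_str = etree.tostring(html).decode()
print(html_str)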

Note: Some tags may be added, such as <html> and <body>, because etree.HTML(text) automatically completes and repairs the original HTML.
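The automatic completion is easiest to see with deliberately broken input. In this sketch the fragment is made up and its closing tags are left out on purpose:

```python
from lxml import etree

# Made-up fragment; the closing </li> and </ul> are deliberately missing.
fragment = '<ul><li>first item</li><li>second item'

doc = etree.HTML(fragment)
html_str = etree.tostring(doc).decode()

print(html_str)
```

The printed string is wrapped in html and body tags and has the missing closing tags filled in. Because of this, the serialized output can differ from the text the server actually sent, so it is safer to write XPath expressions against the output of etree.tostring than against the raw response. Passing encoding='unicode' to etree.tostring returns a str directly, which avoids the decode() call.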

