
First, the type of data

The types of data in a web page can be simply divided into the following three categories:

1. Structured data

Data that can be represented in a uniform structure. It can be represented and stored in a relational database as two-dimensional tables. Its general characteristics: data is organized in rows, each row holds the information of one entity, and every row has the same attributes.

mysql> select * from `MySQL`;

id      name          age    gender
Aid1    Ma Huateng    46     male
Aid2    Ma            53     male
Aid3    Li Yanhong    49     male

2. Semi-structured data

Semi-structured data is a form of structured data that does not conform to the data model of relational databases or other data tables, but contains tags that separate semantic elements and impose a hierarchy on records and fields. It is therefore also called a self-describing structure. Common semi-structured formats such as HTML, XML, and JSON are in effect stored as tree or graph structures.

For example, a simple XML representation:

<person>
    <name>A</name>
    <age>13</age>
    <class>aid1710</class>
    <gender>female</gender>
</person>

or

<person>
    <name>B</name>
    <gender>male</gender>
</person>

The order of the child elements in a node does not matter, and different records need not have the same number of attributes. This format can freely express a lot of useful information, including self-describing information (metadata). Semi-structured data is therefore highly extensible and is especially well suited to large-scale distribution on the Internet.
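As a rough illustration of the self-describing property, the sketch below parses the two <person> records from the text with Python's standard xml.etree.ElementTree module. The surrounding <people> root element and the person_to_dict helper are assumptions added only so the two records can live in one document.

```python
import xml.etree.ElementTree as ET

# The two <person> records from the text, wrapped in a hypothetical <people> root.
xml_text = """
<people>
  <person><name>A</name><age>13</age><class>aid1710</class><gender>female</gender></person>
  <person><name>B</name><gender>male</gender></person>
</people>
"""


def person_to_dict(person):
    """Turn one <person> element into a dict keyed by child tag name."""
    return {child.tag: child.text for child in person}


people = [person_to_dict(p) for p in ET.fromstring(xml_text).iter("person")]

for record in people:
    # Each record carries its own field names, so the sets of keys may differ.
    print(sorted(record))
```

Because every value is labelled by its tag, the second record can simply leave out age and class without breaking the parser or the first record.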

3. Unstructured data

Unstructured data has no fixed structure. Documents, images, video, and audio are all unstructured data. Such data is generally stored whole, usually in a binary format.

All data that is neither structured nor semi-structured is unstructured.

Second, XML, HTML, DOM, and JSON

1. XML, HTML, DOM

XML (Extensible Markup Language) is a meta-language used to define other languages. Its predecessor is SGML (Standard Generalized Markup Language). XML has no predefined tag set and no predefined grammar, but it does have syntax rules. Every XML document must be well-formed in order to be parsed correctly by any kind of application: each opening tag must have a matching closing tag, tags must not overlap out of order, and the markup must follow the specification. An XML document may also be valid, but it does not have to be. A valid document is one that conforms to its Document Type Definition (DTD); a document that conforms to a schema is schema-valid.

HTML (HyperText Markup Language) is the description language of the World Wide Web. The differences and connections between HTML and XML are as follows.

XML and HTML are both designed to work with data or data structures and look broadly similar, but they differ clearly in nature. The points below summarize material available on the Internet.

(1) Different syntax requirements:

  1. HTML is case-insensitive; XML is strictly case-sensitive.

  2. HTML is sometimes lenient: if the context clearly shows where a paragraph or list item ends, you can omit closing tags such as </p> or </li>. XML is a strict tree structure, and the closing tag must never be omitted.

  3. In XML, an element that has a single tag and no matching closing tag must end with a / character (for example <br/>). This way the parser knows not to look for a closing tag.

  4. In XML, attribute values must be enclosed in quotes. In HTML, the quotes are optional.

  5. In HTML, an attribute name may appear without a value. In XML, every attribute must have a value.

  6. In XML documents, the parser does not automatically discard white space; HTML collapses runs of spaces.

XML has stricter syntactic requirements than HTML.
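The difference in strictness can be seen with Python's standard library alone. The sketch below feeds the same fragment, with its closing list-item tags omitted, to the lenient html.parser module and to the strict xml.etree.ElementTree parser. The snippet and the TagCollector class are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# A list whose closing </li> tags are omitted, which HTML tolerates.
snippet = "<ul><li>one<li>two</ul>"


class TagCollector(HTMLParser):
    """Records every start tag the lenient HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(snippet)
collector.close()
html_tags = collector.tags  # the HTML parser reads the fragment without complaint

try:
    ET.fromstring(snippet)  # the XML parser requires every tag to be closed
    xml_ok = True
except ET.ParseError:
    xml_ok = False

print(html_tags, xml_ok)
```

The HTML parser reports all three start tags, while the XML parser rejects the fragment as not well-formed.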

(2) Different tags:

  1. HTML uses a fixed, built-in set of tags; XML has no built-in tags.

  2. HTML tags are predefined; XML tags are free-form, user-defined, and extensible.

(3) Different functions:

  1. HTML is used to display data; XML is used to describe and store data, so it can serve as a medium for persistence. HTML combines data with its display and renders the data on a page; XML separates the data from the display. XML focuses on the content of the data, whereas HTML focuses on what the data looks like.

  2. XML is not a replacement for HTML. The two languages serve different purposes, and XML can be seen as a complement to HTML.

  3. XML has no behavior of its own. Like HTML, XML does not do anything by itself.

  4. XML is probably best described as a cross-platform, hardware- and software-independent tool for processing and transmitting information.

  5. XML is expected to become ubiquitous and the most common tool for data processing and data transmission.

About the DOM:

The Document Object Model (DOM) is a standard programming interface recommended by the W3C for working with extensible markup languages. On a web page, the objects that make up the page (or document) are organized in a tree structure, and the standard model for representing those objects is called the DOM. The history of the DOM goes back to the “browser war” between Microsoft and Netscape in the late 1990s. Competing over JavaScript and JScript, both sides packed their browsers with powerful features. Microsoft added many proprietary technologies to the web, including VBScript, ActiveX, and its own DHTML format, which left many web pages unusable on non-Microsoft platforms and browsers. The DOM was a product of that period.

The DOM allows the content and structure of a document to be accessed and modified in a platform- and language-independent way. In other words, it is a common way to represent and process an HTML or XML document. The DOM is important because its design follows the specifications of the Object Management Group (OMG), so it can be used from any programming language. It was originally conceived as a way to make JavaScript portable between browsers, but its use now goes far beyond that. DOM techniques let a page change dynamically, for example by showing or hiding an element, changing its attributes, or adding an element, which greatly improves page interactivity.

The DOM is in effect a document model described in an object-oriented way. It defines the objects needed to represent and modify a document, the behavior and properties of those objects, and the relationships between them. You can think of the DOM as a tree representation of the data and structure on the page, although the page itself need not be implemented as a tree.

With JavaScript you can restructure an entire HTML document: you can add, remove, change, or rearrange items on the page. To change something on the page, JavaScript needs access to all the elements in the HTML document. That access, along with the methods and properties for adding, moving, changing, or removing HTML elements, is provided by the Document Object Model (DOM).
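Since the DOM is language-independent, the same add, change, and remove operations can be tried outside the browser. The sketch below uses Python's standard xml.dom.minidom module, which implements a subset of the DOM interface. The <page> and <item> element names and the lang attribute are invented for the example.

```python
from xml.dom import minidom

# A tiny hypothetical document to manipulate.
doc = minidom.parseString("<page><item>first</item><item>second</item></page>")
root = doc.documentElement

# Add: create an element node with a text child and append it to the tree.
new_item = doc.createElement("item")
new_item.appendChild(doc.createTextNode("third"))
root.appendChild(new_item)

# Change: set an attribute on an existing element.
root.setAttribute("lang", "en")

# Remove: detach the first <item> from its parent.
first = root.getElementsByTagName("item")[0]
root.removeChild(first)

texts = [node.firstChild.data for node in root.getElementsByTagName("item")]
print(texts)
print(root.toxml())
```

The method names (createElement, appendChild, setAttribute, removeChild, getElementsByTagName) are the same ones JavaScript exposes on a browser document, which is the point of a shared object model.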

2. JSON file

JSON (JavaScript Object Notation) is a lightweight data interchange format. It is based on a subset of ECMAScript (the JavaScript specification standardized by Ecma International) and uses a text format that is completely independent of any programming language to store and represent data. Its simple, clear hierarchy makes JSON an ideal data exchange language. It is easy for people to read and write, easy for machines to parse and generate, and efficient to transmit over a network.

JSON syntax rules:

In JS, everything is an object. Any type that JSON supports, such as strings, numbers, objects, and arrays, can therefore be represented in JSON.

Objects and arrays are two special and commonly used types:

1. Objects are represented as key-value pairs

2. Data items are separated by commas

3. Curly braces hold objects

4. Square brackets hold arrays

JSON key-value pairs are one way to store a JS object, and they are written in much the same way as a JS object literal. In each pair the key name comes first, enclosed in double quotation marks (""), followed by a colon (:) and then the value:

{"firstName": "Json", "class": "aid1710"}

This is easy to understand; it is equivalent to this JavaScript object literal:

{firstName: "Json", "class": "aid1710"}

JSON and JS object relationship:

Many people are unclear about the relationship between JSON and JS objects, or even about which is which. JSON is a string representation of a JS object: it uses text to describe the contents of a JS object and is, in essence, a string.

For example:

var obj = {a: 'Hello', b: 'World'};            // This is a JS object
var json = '{"a": "Hello", "b": "World"}';     // This is a JSON string, essentially a string

A simple demonstration of JSON in Python:

See josntest.py for a code example
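The referenced file is not reproduced here, so the following is only a minimal sketch of what such a demonstration might look like, using Python's standard json module. The scores and active fields are hypothetical additions to the firstName/class example, included to show arrays and booleans.

```python
import json

# A Python dict mirroring the JSON example above, plus two hypothetical fields.
obj = {"firstName": "Json", "class": "aid1710", "scores": [90, 85], "active": True}

text = json.dumps(obj)        # Python object -> JSON string
restored = json.loads(text)   # JSON string -> Python object

print(type(text).__name__)    # the serialized form is an ordinary string
print(text)
print(restored["scores"][0])
```

As with JS objects, the dict and its JSON text are different things: json.dumps produces a plain string (Python's True becomes JSON's true), and json.loads rebuilds an equal dict from it.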

JSON vs. XML:

1. Readability:

JSON and XML are comparable in readability: one has a simple syntax, the other a formal tag structure, and it is hard to pick a winner.

2. Extensibility:

XML is naturally extensible, and so is JSON; there is nothing XML can be extended to express that JSON cannot. JSON, however, is on home ground in JavaScript and can store JavaScript composite objects, which gives it an advantage over XML there.

3. Coding difficulty:

XML has a wealth of encoding tools, such as Dom4j and JDom, and JSON has tools as well. Even without tools, a skilled developer can write XML documents and JSON strings quickly, but an XML document needs many more structural characters.

4. Decoding difficulty:

XML can be parsed in two ways:

One is through the document model, which retrieves a group of tags by indexing from a parent tag, for example xmlData.getElementsByTagName("tagName"). This only works when the document structure is known in advance, so it cannot be wrapped up as a general-purpose routine.

The other is to traverse the nodes (document and childNodes). This can be done recursively, but the parsed data still comes in many shapes and often does not match what the caller expected. Any extensible structured data of this kind is hard to parse, and the same is true of JSON. When the structure is known in advance, using JSON for data transfer lets you write practical, clean, and readable code.

If you are a pure front-end developer, you will love JSON. A back-end application developer may be less enthusiastic, because XML is the true structured markup language for data transfer. Parsing JSON without knowing its structure is a nightmare: it costs time and effort, the code becomes redundant and long-winded, and the result is often unsatisfactory.

That does not stop many front-end developers from choosing JSON, because toJSONString() in json.js shows the string structure of a JSON value. Working without that string would still be a nightmare, but people who use JSON regularly can read the structure from it and then manipulate the data much more easily. All of this concerns parsing XML and JSON for data transfer in JavaScript only.

In JavaScript, JSON is on home turf, and there its advantages over XML are considerable.

If JavaScript composite objects are stored in JSON and their structure is unknown, many programmers will find JSON just as painful to parse. Beyond the points above, another big difference between JSON and XML is the effective data ratio. JSON is more efficient as a transmission format because it does not need strict closing tags as XML does. A larger share of each packet is actual data, which reduces the network load for the same amount of information.

Example comparison:

XML and JSON both use structured approaches to mark up data, so let’s do a simple comparison.

Data of some provinces and cities in China expressed in XML are as follows:

<?xml version="1.0" encoding="utf-8"?>
<country>
    <name>China</name>
    <province>
        <name>Heilongjiang</name>
        <cities>
            <city>Harbin</city>
            <city>Daqing</city>
        </cities>
    </province>
    <province>
        <name>Guangdong</name>
        <cities>
            <city>Guangzhou</city>
            <city>Shenzhen</city>
            <city>Zhuhai</city>
        </cities>
    </province>
    <province>
        <name>Taiwan</name>
        <cities>
            <city>Taipei</city>
            <city>Kaohsiung</city>
        </cities>
    </province>
    <province>
        <name>Xinjiang</name>
        <cities>
            <city>Urumqi</city>
        </cities>
    </province>
</country>

Expressed in JSON as follows:

{
    "name": "China",
    "province": [
        {"name": "Heilongjiang", "cities": {"city": ["Harbin", "Daqing"]}},
        {"name": "Guangdong", "cities": {"city": ["Guangzhou", "Shenzhen", "Zhuhai"]}},
        {"name": "Taiwan", "cities": {"city": ["Taipei", "Kaohsiung"]}},
        {"name": "Xinjiang", "cities": {"city": ["Urumqi"]}}
    ]
}

As you can see, JSON’s simple syntax and clear hierarchy make it noticeably easier to read than XML, and because JSON needs far fewer characters than XML to exchange the same data, it can significantly reduce the bandwidth required for transmission.
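The size claim can be checked roughly with a few lines of Python. The sketch below serializes part of the province data both ways using only the standard library and compares the character counts. The to_xml helper is an assumption written for this example, and the exact numbers depend on whitespace and on how the XML is laid out.

```python
import json
import xml.etree.ElementTree as ET

# Three of the provinces from the example above.
data = {
    "name": "China",
    "province": [
        {"name": "Heilongjiang", "cities": {"city": ["Harbin", "Daqing"]}},
        {"name": "Guangdong", "cities": {"city": ["Guangzhou", "Shenzhen", "Zhuhai"]}},
        {"name": "Taiwan", "cities": {"city": ["Taipei", "Kaohsiung"]}},
    ],
}


def to_xml(country_data):
    """Hypothetical helper: build the <country> document used in the text."""
    country = ET.Element("country")
    ET.SubElement(country, "name").text = country_data["name"]
    for prov in country_data["province"]:
        province = ET.SubElement(country, "province")
        ET.SubElement(province, "name").text = prov["name"]
        cities = ET.SubElement(province, "cities")
        for city in prov["cities"]["city"]:
            ET.SubElement(cities, "city").text = city
    return ET.tostring(country, encoding="unicode")


json_text = json.dumps(data, separators=(",", ":"))  # compact, no extra spaces
xml_text = to_xml(data)                               # compact, no indentation

print(len(json_text), len(xml_text))
```

Both outputs are written without optional whitespace so that the comparison reflects only the structural characters each format requires.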

Third, how to extract information from web pages

1. XPath and lxml

XPath is a language for finding information in XML documents; it navigates through the elements and attributes of a document. An understanding of XPath is the foundation of many advanced XML applications.

lxml is a third-party Python library for processing XML. It wraps libxml2 and libxslt, both written in C, and extends the well-known ElementTree API with a simple and powerful Python API.

Install: pip install lxml

Use: from lxml import etree

1. XPath terminology:

In the context of XPath, an XML document is treated as a tree of nodes, and the root of the tree is also called the document node. XPath classifies the nodes in the tree into seven kinds: element, attribute, text, namespace, processing-instruction, comment, and document nodes.

Take a look at an example XML document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>

In the XML document above:

<bookstore> (this is the “root”)

<author>J K. Rowling</author> (this is an “element”)

lang="en" (this is an “attribute”)

Look at it from another perspective:

bookstore (root)
    book (element)
        title (element)
            lang = en (attribute)
            text = Harry Potter (text)
        author (element)
            text = J K. Rowling (text)
        year (element)
            text = 2005 (text)
        price (element)
            text = 29.99 (text)

2. Relationships between nodes

Parent: Every element has exactly one parent, and the parent of the topmost element is the root node. Likewise, every attribute has a parent, which is the element that carries it. In the XML document above, the root bookstore is the parent of the element book, book is the parent of the elements title, author, year, and price, and title is the parent of lang.

Children: An element can have zero or more children. In the XML document above, title, author, year, and price are the children of book.

Siblings: Nodes with the same parent are siblings of each other. In the XML document above, title, author, year, and price are siblings.

Ancestors: A node's parent, the parent's parent, and so on, up to and including the root node. In the XML document above, the ancestors of title, author, year, and price are book and bookstore.

Descendants: A node's children, the children's children, and so on, down to the last level. In the XML document above, the descendants of bookstore are book, title, author, year, and price.
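These relationships can be inspected directly with lxml's element API. The following is a small sketch that assumes lxml is installed; it parses the bookstore document and asks the title element for its relatives. The variable names are illustrative only.

```python
from lxml import etree

# The bookstore document from the text. Bytes are used because the text
# carries an encoding declaration.
xml = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>"""

root = etree.fromstring(xml)          # the <bookstore> element
title = root.find("book/title")

parent = title.getparent().tag                            # parent of title
children = [child.tag for child in title.getparent()]     # children of book
siblings = [el.tag for el in title.itersiblings()]        # siblings after title
ancestors = [el.tag for el in title.iterancestors()]      # nearest ancestor first
descendants = [el.tag for el in root.iterdescendants()]   # everything below bookstore

print(parent, children, siblings, ancestors, descendants)
```

Note that itersiblings() only walks forward by default, so title itself and anything before it are not included in siblings.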

3. Selecting nodes

The following shows how to write basic path expressions. Remember that an XPath path expression is always evaluated relative to a node; the initial current node is usually the root node, much as paths work when changing directories on Linux.

Expression    Description

nodename      Selects the child nodes named nodename under the matched node.

/             At the start of a path, selects from the root node.

//            Selects nodes among the descendants of the matched node, regardless of where they are.

.             Selects the current node.

..            Selects the parent node of the current node.

@             Selects attributes.

4. Wildcards

*             Matches any element.

@*            Matches any attribute.

node()        Matches a node of any type.

5. Predicates and conditional selection

A predicate is used to find a specific node or the nodes that meet a certain condition. A predicate is written in square brackets. With the “|” operator you can combine several paths and select the nodes that match any of them.

See lxmltest.py for an example.
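That file is not included in this post, so here is a short stand-in sketch of path expressions, predicates, and the “|” operator with lxml (assumed installed). The second book and the category attributes are hypothetical additions to the bookstore example, there only to give the predicates something to distinguish.

```python
from lxml import etree

# Bookstore example extended with a hypothetical second book and
# hypothetical category attributes.
xml = b"""<bookstore>
  <book category="fiction">
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book category="web">
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = etree.fromstring(xml)

all_titles = root.xpath("//title/text()")                      # // : at any depth
first_title = root.xpath("/bookstore/book[1]/title/text()")    # [1] : first book
expensive = root.xpath("//book[price>35]/title/text()")        # condition on a child value
langs = root.xpath("//title/@lang")                            # @ : attribute values
by_attr = root.xpath("//book[@category='web']/title/text()")   # condition on an attribute
either = root.xpath("//book/title | //book/price")             # | : nodes matching either path

print(all_titles, first_title, expensive, langs, by_attr, len(either))
```

Each xpath() call returns a list: strings for text() and attribute selections, element objects for element selections.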

6. Axes

XPath axes: an axis defines a set of nodes relative to the current node.

Axis name             Meaning

ancestor              Selects all ancestors of the current node, up to the root.

ancestor-or-self      Selects all ancestors of the current node and the current node itself.

attribute             Selects all attributes of the current node.

child                 Selects all children of the current node.

descendant            Selects all descendants of the current node.

descendant-or-self    Selects all descendants of the current node and the current node itself.

following             Selects all nodes in the document after the end tag of the current node.

following-sibling     Selects all sibling nodes after the current node.

namespace             Selects all namespace nodes of the current node.

parent                Selects the parent node of the current node.

preceding             Selects all nodes in the document before the start tag of the current node.

preceding-sibling     Selects all sibling nodes before the current node.

self                  Selects the current node.

7. Location path expressions

A location path can be absolute or relative. An absolute path starts with a slash (/). Each path consists of one or more steps separated by slashes (/).

Absolute path: /step/step/…

Relative path: step/step/…

Each step is evaluated against the nodes in the current node set.

A step consists of three parts:

Axis: defines the relationship between the selected nodes and the current node.

Node test: identifies the nodes within the axis.

Predicate: zero or more conditions that further filter the selected node set.

Step syntax: axisname::nodetest[predicate]
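To make the step syntax concrete, the sketch below evaluates a few axis-based steps with lxml (assumed installed) against the bookstore example, starting from the year element. The names helper is a convenience written for this example.

```python
from lxml import etree

root = etree.fromstring(
    "<bookstore><book>"
    "<title lang='en'>Harry Potter</title>"
    "<author>J K. Rowling</author>"
    "<year>2005</year>"
    "<price>29.99</price>"
    "</book></bookstore>"
)
year = root.xpath("//year")[0]   # the context node for the relative steps below


def names(nodes):
    """Hypothetical helper: list the tag names of the selected elements."""
    return [node.tag for node in nodes]


ancestors = names(year.xpath("ancestor::*"))           # axis = ancestor, node test = *
with_self = names(year.xpath("ancestor-or-self::*"))
following = names(year.xpath("following-sibling::*"))
preceding = names(year.xpath("preceding-sibling::*"))
attrs = root.xpath("//title/attribute::lang")          # long form of @lang

# An absolute path of three steps, the last one carrying a predicate.
step = root.xpath("/bookstore/child::book/child::title[@lang='en']/text()")

print(ancestors, with_self, following, preceding, attrs, step)
```

The abbreviated forms used earlier are shorthand for these axes: a bare name means child::name, @lang means attribute::lang, and .. means parent::node().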

2. BeautifulSoup4

Beautiful Soup is an HTML/XML parser written in Python. It copes well with non-standard markup and builds a parse tree from it. It provides simple, commonly used operations for navigating, searching, and modifying the parse tree, and it can save you a lot of programming time.

Install: pip install beautifulsoup4

Use:

Import the Beautiful Soup library into the program:

from bs4 import BeautifulSoup                      # Beautiful Soup 4

# The older Beautiful Soup 3 package used these imports instead:
# from BeautifulSoup import BeautifulSoup          # For processing HTML
# from BeautifulSoup import BeautifulStoneSoup     # For processing XML
# import BeautifulSoup                             # To get everything

For example:

from bs4 import BeautifulSoup
import re

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']

# The parser is named explicitly; 'lxml' is the library installed earlier
soup = BeautifulSoup(''.join(doc), 'lxml')
print(soup.prettify())

Navigating the soup is simple. Continuing the example above (attribute names are the Beautiful Soup 4 spellings):

soup.contents[0].name
# 'html'

soup.contents[0].contents[0].name
# 'head'

head = soup.contents[0].contents[0]
head.parent.name
# 'html'

head.next_element
# <title>Page title</title>

head.next_sibling.name
# 'body'

head.next_sibling.contents[0]
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

head.next_sibling.contents[0].next_sibling
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>

You can also search the soup for a specific tag, or for tags with a specific attribute, and modifying the soup is just as easy.
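Searching and modifying might look like the sketch below. It assumes the beautifulsoup4 package is installed and uses Python's built-in html.parser backend; the markup is the same page as above but with the paragraphs explicitly closed, so the result does not depend on how a particular parser repairs missing tags.

```python
from bs4 import BeautifulSoup

html = (
    '<html><head><title>Page title</title></head><body>'
    '<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Search by tag name, and by tag name plus attribute value.
title_text = soup.title.string
centered = soup.find_all("p", align="center")
second = soup.find("p", id="secondpara")
bold_words = [b.get_text() for b in soup.find_all("b")]

# Modify the tree in place: change an attribute and replace a tag's text.
second["align"] = "left"
soup.title.string = "New title"

print(title_text, bold_words)
print(soup.find("p", id="secondpara"))
```

find_all returns a list of every match, while find returns only the first one (or None if nothing matches).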

BS4 vs. lxml:

lxml: implemented in C; traverses only the part of the document it needs; fast; the API is more complex and less friendly.

BS4: implemented in Python; loads the entire document; slower; simple, user-friendly API.

3. Regular expression re

Regular expressions are used to retrieve and replace text that matches a pattern (rule). They are a very powerful tool for filtering text and matching rules, and an essential tool in Python crawlers.

Basic matching rules:

[0-9]     Any digit; equivalent to \d

[a-z]     Any lowercase letter

[A-Z]     Any uppercase letter

[^0-9]    Any non-digit; equivalent to \D

\w        Equivalent to [A-Za-z0-9_]: letters, digits, and underscore

\W        The negation of \w

.         Any character (except a newline, by default)

[]        Matches any one of the characters or subexpressions inside the brackets

[^]       Negates a character set

*         Matches the preceding character or subexpression 0 or more times

+         Matches the preceding character or subexpression 1 or more times

?         Matches the preceding character 0 or 1 times

^         Matches the beginning of the string

$         Matches the end of the string
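A few of these rules can be tried out with Python's standard re module. The sample strings in this sketch are invented for illustration.

```python
import re

# [0-9]+ (the same as \d+): runs of digits
digits = re.findall(r"[0-9]+", "order 66, aisle 7")

# \w+ : runs of letters, digits, and underscores
words = re.findall(r"\w+", "snake_case and CamelCase!")

# [^0-9] : anything that is not a digit, removed here with sub()
only_digits = re.sub(r"[^0-9]", "", "tel: 555-0100")

# ? : the preceding character may appear 0 or 1 times
optional = [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour", "colouur")]

# ^ and $ : anchor the pattern to the start and end of the string
anchored = (bool(re.search(r"^abc$", "abc")), bool(re.search(r"^abc$", "xabc")))

print(digits, words, only_digits, optional, anchored)
```

Raw strings (the r prefix) keep backslashes such as \w and \d from being interpreted by Python before the pattern reaches the regex engine.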

Using regular expressions in Python

Python’s re module

Pattern: a compiled regular expression

Several important methods:

match: matches once, only at the beginning of the string;

search: scans the string and returns the first match found at any position;

findall: returns all matches;

split: splits the string at each match;

sub: replaces matches;
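The five methods can be compared side by side on one compiled pattern. This is a minimal sketch; the sample text is made up.

```python
import re

pattern = re.compile(r"\d+")   # a compiled Pattern object
text = "a1b22c333"

at_start = pattern.match(text)        # None: the text does not begin with a digit
first = pattern.search(text)          # first match anywhere in the text
all_matches = pattern.findall(text)   # every non-overlapping match
parts = pattern.split(text)           # pieces between the matches
masked = pattern.sub("#", text)       # each match replaced by "#"

print(at_start, first.group(), all_matches, parts, masked)
```

match and search return a match object (or None), whose group() method gives the matched text; findall, split, and sub return plain lists or strings.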

Two modes to watch out for:

Greedy mode: (.*)

Lazy mode: (.*?)
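The difference between the two modes shows up as soon as a line contains more than one candidate match. In the sketch below, which uses an invented HTML fragment, the greedy group runs to the last closing tag while the lazy group stops at the first.

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* grabs as much as possible, so the match spans both tags.
greedy = re.findall(r"<b>(.*)</b>", html)

# Lazy: .*? grabs as little as possible, so each tag is matched separately.
lazy = re.findall(r"<b>(.*?)</b>", html)

print(greedy)
print(lazy)
```

When scraping repeated elements from a page, the lazy form is usually the one that gives one result per element.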

  1. Use regular expressions to achieve the following effect:

i=d%0A&from=AUTO&to=AUTO&smartresult=dict

Convert to the following form:

i:d%0A

from:AUTO

to:AUTO

smartresult:dict
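One possible solution, offered as a sketch rather than the intended answer: capture each key and value with findall, then join them with a colon. The pattern assumes that keys and values contain no "&" or "=" characters.

```python
import re

query = "i=d%0A&from=AUTO&to=AUTO&smartresult=dict"

# ([^&=]+) captures a key, ([^&]*) captures its value up to the next "&".
pairs = re.findall(r"([^&=]+)=([^&]*)", query)

result = "\n".join(f"{key}:{value}" for key, value in pairs)
print(result)
```

The same conversion could be written with re.sub, but collecting the pairs first also leaves them available as data, for example to build a dict of request parameters.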

Summary: comparison of re, BS4, and lxml