[Python3 Web Crawler Development in Practice] 2-Crawler Basics 2-Web Page Basics

When you visit different websites in a browser, each page looks different. Have you ever wondered why a page looks the way it does? In this section, we look at the basic composition, structure, and nodes of a web page.

1. Composition of web pages

Web pages can be divided into three main parts: HTML, CSS, and JavaScript. If you compare a web page to a person, HTML is the skeleton, JavaScript is the muscle, and CSS is the skin. The three combine to make a complete web page. Let's look at what each part does.

(1) HTML

HTML is a language used to describe web pages. Its full name is Hyper Text Markup Language. Web pages contain elements such as text, buttons, images, and video, and HTML provides their basic structure. Different types of elements are represented by different tags: for example, images are represented by the img tag, videos by the video tag, and paragraphs by the p tag. These elements are usually laid out by nesting and combining them inside the layout tag div, and the arrangement and nesting of the various tags forms the framework of the web page.

Open Baidu in Chrome, right-click and select Inspect (or press F12) to open the developer tools. The source code of the web page is displayed in the Elements tab, as shown in Figure 2-9.

Figure 2-9 Source code

This is HTML, and the entire web page is made up of nested tags. The node elements defined by these tags are nested and combined to form a complex hierarchical relationship, which forms the architecture of the web page.
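
To make the idea of nested tags concrete, here is a minimal sketch that prints the nesting of a small page using Python's standard html.parser module. The SAMPLE string is a hypothetical fragment invented for illustration, not Baidu's real source, and the VOID_TAGS set is a deliberately short, assumed list of tags that have no closing tag.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not increase the depth.
# This short list is an assumption that covers only the sample below.
VOID_TAGS = {"meta", "img", "br", "hr", "input", "link"}


class OutlinePrinter(HTMLParser):
    """Collects one indented line per opening tag to show the nesting."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


# Hypothetical page fragment, not Baidu's real source.
SAMPLE = "<html><body><div><p>Hello</p><img src='a.png'></div></body></html>"

printer = OutlinePrinter()
printer.feed(SAMPLE)
print("\n".join(printer.lines))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the HTML.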

(2) CSS

HTML defines the structure of a web page, but a page built with HTML alone is not attractive; it may be nothing more than a simple arrangement of node elements. To make the page look better, we use CSS.

CSS stands for Cascading Style Sheets. "Cascading" means that when several style files are referenced in HTML and their styles conflict, the browser resolves the conflict according to a cascading order. "Style" refers to text size, color, element spacing, arrangement, and so on.

CSS is currently the only standard for styling and laying out web pages, and it is what makes pages look better.

The CSS appears in the right-hand pane of Figure 2-9. For example:

#head_wrapper.s-ps-islite .s-p-top {
    position: absolute;
    bottom: 40px;
    width: 100%;
    height: 181px;
}

This is one CSS style rule. The part before the braces is a CSS selector. This selector first selects the node whose id is head_wrapper and whose class is s-ps-islite, and then selects the nodes inside it whose class is s-p-top. Inside the braces are the individual style declarations: position specifies that the element uses absolute positioning, bottom specifies that the element is offset 40 pixels from the bottom, width specifies a width of 100% of the parent element, and height specifies the height of the element. In other words, we write position, width, height, and other style settings in this form, enclose them in braces, and put a CSS selector in front. The style then applies to the elements the selector matches, and those elements are displayed accordingly.

In practice, the style rules for an entire site are usually defined in one place and written into a CSS file (with the suffix css). In the HTML, you only need a link tag to import the CSS file, and the whole page takes on the defined look.
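
The "selector, then declarations in braces" shape of a rule can be shown with a small sketch. The parse_css_rule function below is a hypothetical helper written for this illustration; it assumes a single rule with no comments or nested braces and is not a general CSS parser.

```python
def parse_css_rule(rule):
    """Split one simple CSS rule into its selector and a dict of declarations.

    Assumes a single rule with no nested braces or comments.
    """
    selector, _, body = rule.partition("{")
    body = body.rsplit("}", 1)[0]
    declarations = {}
    for item in body.split(";"):
        if ":" in item:
            name, _, value = item.partition(":")
            declarations[name.strip()] = value.strip()
    return selector.strip(), declarations


RULE = """
#head_wrapper.s-ps-islite .s-p-top {
    position: absolute;
    bottom: 40px;
    width: 100%;
    height: 181px;
}
"""

selector, declarations = parse_css_rule(RULE)
print(selector)       # the part before the braces
print(declarations)   # property name -> value
```

The selector says which nodes the rule applies to, and each key in the dictionary is one style property set on those nodes.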

(3) JavaScript

JavaScript, or JS for short, is a scripting language. HTML and CSS together can only present static information to users; they lack interactivity. The interactive and animated effects we see on web pages, such as download progress bars, prompt boxes, and image carousels, are usually the work of JavaScript. With JavaScript, a page is no longer just something to browse and display: it becomes real-time, dynamic, and interactive.

JavaScript is also usually loaded as a separate file, with the suffix js, and is included in the HTML with a script tag, for example:

<script src="jquery-2.1.0.js"></script>

To sum up, HTML defines the content and structure of a web page, CSS describes the layout of a web page, and JavaScript defines the behavior of a web page.
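
Because CSS and JavaScript usually live in separate files, a crawler often needs to find those references in the HTML. The following is a minimal sketch using the standard html.parser module. The PAGE string is a hypothetical page, and style.css is an assumed file name chosen for the example.

```python
from html.parser import HTMLParser


class ResourceFinder(HTMLParser):
    """Records stylesheet and script URLs referenced by a page."""

    def __init__(self):
        super().__init__()
        self.stylesheets = []
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.stylesheets.append(attrs["href"])
        elif tag == "script" and attrs.get("src"):
            self.scripts.append(attrs["src"])


# Hypothetical page; style.css is an assumed file name.
PAGE = """
<html>
<head>
<link rel="stylesheet" href="style.css">
<script src="jquery-2.1.0.js"></script>
</head>
<body><p>Hello</p></body>
</html>
"""

finder = ResourceFinder()
finder.feed(PAGE)
print(finder.stylesheets)
print(finder.scripts)
```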

2. The structure of the page

Let's first get a feel for the basic structure of HTML with an example. Create a new text file with the suffix html and the following content:

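<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>This is a Demo</title>
</head>
<body>
<div id="container">
<div class="wrapper">
<h2 class="title">Hello World</h2>
<p class="text">Hello, this is a paragraph.</p>
</div>
</div>
</body>
</html>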

This is a very simple HTML example. It begins with a DOCTYPE declaration that defines the document type, followed by the outermost html tag, which ends with the corresponding closing tag. Inside it are the head tag and the body tag, representing the header and body of the page respectively; both also need closing tags. The head tag holds page configuration and references, such as:

<meta charset="UTF-8">

It specifies a UTF-8 encoding for the web page.

The title tag defines the title of the page, which is displayed in the browser tab rather than in the body. The body tag contains what is displayed in the main area of the page. The div tag defines a block in the page. Its id is container; id is a very common attribute, and an id value is unique within a page. Inside this block is another div tag whose class is wrapper; class is also a very common attribute, often used together with CSS to set styles. Inside that block is an h2 tag, which represents a second-level heading, and a p tag, which represents a paragraph. Each has its own class attribute, and the content written inside them is rendered on the page.

After saving the code, open the file in a browser, and you will see the content shown in Figure 2-10.

Figure 2-10 Running results

As you can see, the tab displays the text "This is a Demo", which is the text we defined in the title tag inside head. The main area of the page is generated from the elements defined inside the body tag, and shows the second-level heading and the paragraph.

This example shows the general structure of a web page. In its standard form, a web page has a head tag and a body tag nested inside an html tag. The head defines the page's configuration and references, and the body defines the page's content.
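
The same structure can be read programmatically. The sketch below uses the standard html.parser module to pull the title, heading, and paragraph text out of the demo page. TextByTag is a hypothetical helper class written for this example, and the page is embedded as a string so the snippet runs on its own.

```python
from html.parser import HTMLParser


class TextByTag(HTMLParser):
    """Collects the text found directly inside a chosen set of tags."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.stack = []   # currently open tags
        self.found = {}   # tag name -> list of text pieces

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching opening tag; this also discards
            # void tags such as <meta> that never get a closing tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] in self.wanted:
            self.found.setdefault(self.stack[-1], []).append(text)


DEMO = """<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>This is a Demo</title>
</head>
<body>
<div id="container">
<div class="wrapper">
<h2 class="title">Hello World</h2>
<p class="text">Hello, this is a paragraph.</p>
</div>
</div>
</body>
</html>"""

parser = TextByTag(["title", "h2", "p"])
parser.feed(DEMO)
print(parser.found)
```

The title text comes from inside head, while the heading and paragraph come from inside body, matching what the browser shows in the tab and in the page area.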

3. Node tree and relationship between nodes

In HTML, everything defined by a tag is a node, and together these nodes form an HTML DOM tree.

Let's take a look at the DOM. DOM is a W3C standard, and its full English name is Document Object Model. It defines a standard for accessing HTML and XML documents:

The W3C Document Object Model (DOM) is a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure, and style of documents.

The W3C DOM standard is divided into three distinct parts.

Core DOM: The standard model for any structured document.
XML DOM: The standard model for XML documents.
HTML DOM: The standard model for HTML documents.

According to the W3C HTML DOM standard, everything in an HTML document is a node.

The entire document is a document node;
Each HTML element is an element node;
The text inside an HTML element is a text node;
Each HTML attribute is an attribute node;
Comments are comment nodes.
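
These node types can be observed with Python's standard xml.dom.minidom module, which implements a subset of the W3C DOM. This is only a sketch: minidom is an XML parser rather than an HTML parser, so the sample string is a small, well-formed fragment invented for the example, and the NAMES dictionary is just a lookup table for printing.

```python
from xml.dom import Node, minidom

# minidom parses XML, so the sample must be well-formed.
doc = minidom.parseString('<div id="container"><!-- note --><p>Hello</p></div>')

# Human-readable names for the node type constants used below.
NAMES = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.ATTRIBUTE_NODE: "attribute",
    Node.COMMENT_NODE: "comment",
}

div = doc.documentElement
comment, p = div.childNodes

kinds = [
    NAMES[doc.nodeType],                         # the whole document
    NAMES[div.nodeType],                         # the <div> element
    NAMES[div.getAttributeNode("id").nodeType],  # the id attribute
    NAMES[comment.nodeType],                     # the <!-- note --> comment
    NAMES[p.firstChild.nodeType],                # the text inside <p>
]
print(kinds)
```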

The HTML DOM treats an HTML document as a tree structure, which is called a node tree, as shown in Figure 2-11.

Figure 2-11 Node tree

With the HTML DOM, all nodes in the tree are accessible through JavaScript, and all HTML node elements can be modified, created or deleted.
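
In a browser these operations are performed with JavaScript. As a rough Python analogue, the sketch below modifies, creates, and deletes nodes with the standard xml.dom.minidom module, whose method names follow the DOM. The sample markup is a hypothetical fragment, and this swaps in Python for JavaScript purely for illustration.

```python
from xml.dom import minidom

doc = minidom.parseString("<body><h2>Hello World</h2><p>Old text</p></body>")
body = doc.documentElement
h2, p = body.childNodes

# Modify: change the text node inside <p>.
p.firstChild.data = "New text"

# Create: build a new <p> element and append it to <body>.
extra = doc.createElement("p")
extra.appendChild(doc.createTextNode("Added later"))
body.appendChild(extra)

# Delete: remove the <h2> element.
body.removeChild(h2)

result = body.toxml()
print(result)
```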

Nodes in a node tree have hierarchical relationships with each other. We often use the terms parent, child, and sibling to describe these relationships. A parent node has child nodes, and child nodes at the same level are called siblings.

In the node tree, the top node is called the root. Except for the root node, every node has a parent node, and a node can have any number of children or siblings. Figure 2-12 shows the node tree and the relationships between nodes.

Figure 2-12 Node tree and relationships among nodes
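
The parent, child, and sibling relationships can be explored with the standard xml.dom.minidom module. This is a minimal sketch on a hypothetical, well-formed document; parentNode, childNodes, nextSibling, and previousSibling are standard DOM properties that minidom provides.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<html><head><title>Demo</title></head><body><h2>Hi</h2><p>Text</p></body></html>"
)

root = doc.documentElement           # the <html> element at the top of the tree
head, body = root.childNodes         # children of the root
h2, p = body.childNodes              # children of <body>

root_tag = root.tagName
parent_of_h2 = h2.parentNode.tagName
children_of_body = [child.tagName for child in body.childNodes]
next_of_h2 = h2.nextSibling.tagName
previous_of_p = p.previousSibling.tagName

print(root_tag, parent_of_h2, children_of_body, next_of_h2, previous_of_p)
```

Here h2 and p share the parent body, which makes them siblings of each other.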

This section draws on W3School: www.w3school.com.cn/htmldom/dom…

4. The selector

We know that web pages are composed of nodes, and that CSS sets different style rules for different nodes. So how do we locate a node?

In CSS, we use CSS selectors to locate nodes. For example, the div node whose id is container in the example above can be written as #container, where # indicates selection by id and is followed by the id value. Similarly, to select nodes whose class is wrapper, use .wrapper, where the leading dot (.) indicates selection by class and is followed by the class name. You can also select by tag name: to select second-level headings, use h2 directly. Selecting by id, class, and tag name are the three most common forms, so keep them in mind.

CSS selectors also support nested selection: separate the selectors with spaces. For example, #container .wrapper p selects the node whose id is container, then the nodes inside it whose class is wrapper, and then the p nodes inside those. Writing selectors with no space between them means the conditions apply to the same node. For example, div#container .wrapper p.text selects the div node whose id is container, then the nodes inside it whose class is wrapper, and then the p nodes inside those whose class is text. This is how CSS selectors work, and their filtering capabilities are quite powerful.

CSS selectors have other syntax rules as well, as shown in Table 2-4.
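
To show how id, class, tag, and space-separated nesting combine, here is a toy matcher built on the standard xml.etree.ElementTree module. It is only a sketch: select, matches, and parse_compound are hypothetical helpers written for this example, they understand only tag names, #id, .class, and the space (descendant) combinator, and real crawlers would normally rely on a dedicated parsing library instead.

```python
import re
import xml.etree.ElementTree as ET


def parse_compound(compound):
    """Split 'div#container' or 'p.text' into (tag, id, classes).

    Only tag names, #id and .class are understood.
    """
    tag = re.match(r"[A-Za-z][\w-]*", compound)
    ident = re.search(r"#([\w-]+)", compound)
    classes = re.findall(r"\.([\w-]+)", compound)
    return (tag.group(0) if tag else None,
            ident.group(1) if ident else None,
            classes)


def matches(element, compound):
    """Check whether one element satisfies every part of a compound selector."""
    tag, ident, classes = parse_compound(compound)
    if tag and element.tag != tag:
        return False
    if ident and element.get("id") != ident:
        return False
    own_classes = element.get("class", "").split()
    return all(name in own_classes for name in classes)


def select(root, selector):
    """Return elements matching a space-separated (descendant) selector."""
    steps = selector.split()
    # First step: any element in the tree, the root included.
    current = [el for el in root.iter() if matches(el, steps[0])]
    for step in steps[1:]:
        found = []
        for parent in current:
            for el in parent.iter():
                if el is not parent and matches(el, step) and el not in found:
                    found.append(el)
        current = found
    return current


DOC = ET.fromstring("""
<div id="container">
  <div class="wrapper">
    <h2 class="title">Hello World</h2>
    <p class="text">Hello, this is a paragraph.</p>
  </div>
</div>
""")

for selector in ["#container", ".wrapper", "h2",
                 "#container .wrapper p", "div#container .wrapper p.text"]:
    print(selector, "->", [el.tag for el in select(DOC, selector)])
```

Each space in the selector narrows the search to descendants of the previous match, while parts written without a space must all hold for the same node.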

Table 2-4 Other syntax rules for CSS selectors

Selector	Example	Description
`.class`	`.intro`	Select all nodes with class="intro"
`#id`	`#firstname`	Select all nodes with id="firstname"
`*`	`*`	Select all nodes
`element`	`p`	Select all p nodes
`element,element`	`div,p`	Select all div nodes and all p nodes
`element element`	`div p`	Select all p nodes inside div nodes
`element>element`	`div>p`	Select all p nodes whose parent is a div node
`element+element`	`div+p`	Select all p nodes immediately after a div node
`[attribute]`	`[target]`	Select all nodes with a target attribute
`[attribute=value]`	`[target=blank]`	Select all nodes with target="blank"
`[attribute~=value]`	`[title~=flower]`	Select all nodes whose title attribute contains the word flower
`:link`	`a:link`	Select all unvisited links
`:visited`	`a:visited`	Select all visited links
`:active`	`a:active`	Select the active link
`:hover`	`a:hover`	Select the link under the mouse pointer
`:focus`	`input:focus`	Select the input node that has focus
`:first-letter`	`p:first-letter`	Select the first letter of each p node
`:first-line`	`p:first-line`	Select the first line of each p node
`:first-child`	`p:first-child`	Select all p nodes that are the first child of their parent
`:before`	`p:before`	Insert content before the content of each p node
`:after`	`p:after`	Insert content after the content of each p node
`:lang(language)`	`p:lang(it)`	Select all p nodes whose lang attribute value starts with "it"
`element1~element2`	`p~ul`	Select all ul nodes preceded by a p node with the same parent
`[attribute^=value]`	`a[src^="https"]`	Select all a nodes whose src attribute value begins with "https"
`[attribute$=value]`	`a[src$=".pdf"]`	Select all a nodes whose src attribute value ends with ".pdf"
`[attribute*=value]`	`a[src*="abc"]`	Select all a nodes whose src attribute value contains the substring "abc"
`:first-of-type`	`p:first-of-type`	Select all p nodes that are the first p node of their parent
`:last-of-type`	`p:last-of-type`	Select all p nodes that are the last p node of their parent
`:only-of-type`	`p:only-of-type`	Select all p nodes that are the only p node of their parent
`:only-child`	`p:only-child`	Select all p nodes that are the only child of their parent
`:nth-child(n)`	`p:nth-child(2)`	Select all p nodes that are the second child of their parent
`:nth-last-child(n)`	`p:nth-last-child(2)`	Same as above, but counting from the last child
`:nth-of-type(n)`	`p:nth-of-type(2)`	Select all p nodes that are the second p node of their parent
`:nth-last-of-type(n)`	`p:nth-last-of-type(2)`	Same as above, but counting from the last p node
`:last-child`	`p:last-child`	Select all p nodes that are the last child of their parent
`:root`	`:root`	Select the root node of the document
`:empty`	`p:empty`	Select all p nodes with no children (including text nodes)
`:target`	`#news:target`	Select the currently active #news node
`:enabled`	`input:enabled`	Select each enabled input node
`:disabled`	`input:disabled`	Select each disabled input node
`:checked`	`input:checked`	Select each checked input node
`:not(selector)`	`:not(p)`	Select all nodes that are not p nodes
`::selection`	`::selection`	Select the part of a node that the user has selected
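
The attribute operators in the table differ only in how they compare the attribute value, which the following sketch spells out. attribute_matches is a hypothetical helper written for this illustration, and the URLs and titles are made-up sample values.

```python
def attribute_matches(value, operator, expected):
    """Check one attribute value against a CSS attribute-selector operator."""
    if operator == "=":
        return value == expected
    if expected == "":
        # With an empty value, ~=, ^=, $= and *= match nothing in CSS.
        return False
    if operator == "~=":
        return expected in value.split()   # whole word in a space-separated list
    if operator == "^=":
        return value.startswith(expected)  # prefix
    if operator == "$=":
        return value.endswith(expected)    # suffix
    if operator == "*=":
        return expected in value           # substring anywhere
    raise ValueError(f"unsupported operator: {operator}")


# Made-up sample values for illustration.
url = "https://example.com/files/report.pdf"
print(attribute_matches(url, "^=", "https"))             # prefix check
print(attribute_matches(url, "$=", ".pdf"))              # suffix check
print(attribute_matches(url, "*=", "abc"))               # substring check
print(attribute_matches("red flower", "~=", "flower"))   # whole-word check
print(attribute_matches("sunflower", "~=", "flower"))    # not a whole word
```

The last two calls show the difference between ~= and *=: "sunflower" contains the substring "flower" but not the separate word flower.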

Another commonly used selector, described in more detail later, is XPath.

This section introduced the basic structure of a web page and the relationships between nodes. With this understanding, we will have a clearer idea of how to parse and extract web content.

This resource was first published on Cui Qingcai's personal blog, Jingmi: Python3 Web Crawler Development in Practice tutorial | Jingmi

