😀 This is the third article in my original web crawler column.

When you visit different websites with a browser, every page looks different. Have you ever wondered why a page looks the way it does? In this section, we will look at the composition, structure, and nodes of a web page.

1. Composition of web pages

A web page can be divided into three main parts: HTML, CSS, and JavaScript. If you compare a web page to a person, HTML is the skeleton, JavaScript is the muscle, and CSS is the skin. The three combine to form a complete web page. Let's go through what each part does.

(1) HTML

HTML stands for HyperText Markup Language. In practice the full name is rarely used, and it is simply called HTML.

HTML is a language used to describe web pages. A page contains text, buttons, images, videos, and other elements, and HTML is their underlying framework. Different kinds of elements are represented by different tags: an image by the img tag, a video by the video tag, and a paragraph by the p tag. The layout tag div is often used to nest and combine them. Arranged and nested in different ways, these tags form the framework of the page.

So what does HTML look like? Open any website, for example Taobao at www.taobao.com, then right-click and choose "Inspect", or press F12, to open the browser developer tools. Switch to the Elements panel to see the HTML behind the Taobao page. It consists of a series of tags, which the browser parses and renders into the nodes that make up the page we see. For example, an input box corresponds to an input tag, which is used to enter text.

Different tags serve different functions. The nodes defined by these tags nest within and combine with one another to form a complex hierarchy, and this hierarchy is the architecture of the web page.
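
To make the idea of nesting concrete, here is a minimal sketch in Python (our choice of language for illustration; nothing in this section requires it). It uses the standard library's html.parser to list the tags of a small fragment together with their depth. The fragment, the OutlineParser class, and the VOID_TAGS set are all made up for this example, and the list of void tags is deliberately incomplete.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag, so they must not increase the depth.
# This is a partial list chosen for the sample below, not the full HTML list.
VOID_TAGS = {"meta", "img", "input", "br", "hr", "link"}


class OutlineParser(HTMLParser):
    """Collects (depth, tag) pairs to show how tags nest."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


# Hypothetical fragment: a div containing a paragraph with an input, and an image.
SAMPLE = '<div><p>Search: <input type="text"></p><img src="logo.png"></div>'

parser = OutlineParser()
parser.feed(SAMPLE)
parser.close()

# Print each tag indented by its nesting depth.
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

The indentation of the printed tags mirrors the hierarchy that the Elements panel shows as a collapsible tree.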

(2) CSS

HTML defines the structure of a web page, but a page laid out with HTML alone is not attractive; it may be nothing more than a plain arrangement of node elements. To make the page look better, we use CSS.

CSS stands for Cascading Style Sheets. "Cascading" means that when HTML references several style files and their styles conflict, the browser resolves the conflict according to a cascading order. "Style" refers to text size, color, element spacing, arrangement, and so on. CSS is currently the only standard for web page layout and styling, and it is what makes pages look better.

In the browser developer tools, the Styles panel displays the CSS styles of the selected node, such as this snippet of CSS:

#head_wrapper.s-ps-islite .s-p-top {
  position: absolute;
  bottom: 40px;
  width: 100%;
  height: 181px;
}

This is a CSS style rule. The part before the braces is a CSS selector. This selector first selects the node whose id is head_wrapper and whose class is s-ps-islite, and then selects the node inside it whose class is s-p-top. Inside the braces are the style declarations, one per line. For example, position sets the layout mode of the node to absolute, bottom sets its bottom offset to 40 pixels, width sets its width to 100% so that it fills the parent node, and height sets its height. In other words, we write the position, width, height, and other style settings in this form, enclose them in braces, and put a CSS selector in front. The style then applies to the nodes that the selector matches, and those nodes are displayed accordingly.

In a web page, the style rules for the whole page are usually defined together in a CSS file (with the .css extension). In HTML, a link tag is all it takes to bring in a finished CSS file and apply its styles to the whole page.
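
The anatomy described above, a selector followed by declarations in braces, can be sketched in a few lines of Python. The parse_css_rule function is a hypothetical helper written for this illustration; it handles only a single flat rule like the one shown and ignores comments, nested blocks, and semicolons inside quoted strings, so it is not a real CSS parser.

```python
def parse_css_rule(rule: str):
    """Split one simple CSS rule into its selector and a dict of declarations."""
    # Everything before the first "{" is the selector.
    selector, _, rest = rule.partition("{")
    # Everything up to the last "}" is the declaration block.
    body = rest.rsplit("}", 1)[0]
    declarations = {}
    for item in body.split(";"):
        if ":" in item:
            # Split on the first colon only, so values may contain colons.
            prop, _, value = item.partition(":")
            declarations[prop.strip()] = value.strip()
    return selector.strip(), declarations


RULE = """
#head_wrapper.s-ps-islite .s-p-top {
  position: absolute;
  bottom: 40px;
  width: 100%;
  height: 181px;
}
"""

selector, declarations = parse_css_rule(RULE)
print(selector)
for prop, value in declarations.items():
    print(f"{prop} -> {value}")
```

Seeing the rule as a selector plus a property-to-value mapping is a useful mental model: the selector says which nodes, the mapping says how they look.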

(3) JavaScript

JavaScript, or JS for short, is a scripting language. HTML and CSS together give users only static information, with no interactivity. The interactive and animated effects we see on web pages, such as download progress bars, prompt boxes, and image carousels, are usually the work of JavaScript. With JavaScript, the relationship between users and information is no longer limited to browsing and display; pages become real-time, dynamic, and interactive.

JavaScript is also usually loaded from a separate file with the .js extension, which is brought into HTML with a script tag, for example:

<script src="jquery-2.1.0.js"></script>
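
Since both CSS files and JavaScript files are pulled in through tags, a program can list them by reading those tags. The following is a small Python sketch using the standard library's html.parser. The ResourceParser class, the sample page, and the file name style.css are assumptions made for this example; only jquery-2.1.0.js comes from the text.

```python
from html.parser import HTMLParser


class ResourceParser(HTMLParser):
    """Records the external CSS and JavaScript files a page refers to."""

    def __init__(self):
        super().__init__()
        self.stylesheets = []
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <link rel="stylesheet" href="..."> imports a CSS file.
        if tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.stylesheets.append(attrs["href"])
        # <script src="..."> loads a JavaScript file; inline scripts have no src.
        elif tag == "script" and attrs.get("src"):
            self.scripts.append(attrs["src"])


# Hypothetical page head for illustration.
PAGE = """
<head>
  <link rel="stylesheet" href="style.css">
  <script src="jquery-2.1.0.js"></script>
  <script>console.log("inline script, no src");</script>
</head>
"""

parser = ResourceParser()
parser.feed(PAGE)
parser.close()
print("CSS files:", parser.stylesheets)
print("JS files:", parser.scripts)
```

The inline script is skipped on purpose, because it has no src attribute and therefore no separate file to fetch.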

To sum up, HTML defines the content and structure of a web page, CSS describes the style of a web page, and JavaScript defines the behavior of a web page.

2. Structure of web pages

Let's first get a feel for the basic structure of HTML with an example. Create a new text file called test.html with the following contents:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8" />
    <title>This is a Demo</title>
  </head>
  <body>
    <div id="container">
      <div class="wrapper">
        <h2 class="title">Hello World</h2>
        <p class="text">Hello, this is a paragraph.</p>
      </div>
    </div>
  </body>
</html>

This is about the simplest possible HTML example. It begins with DOCTYPE, which declares the document type. The outermost tag is html, with a matching closing tag at the end. Inside it are the head and body tags, which represent the page header and the page body and also need closing tags. The head tag holds page configuration and references, such as:

<meta charset="UTF-8" />

It specifies a UTF-8 encoding for the web page.

The title tag defines the title of the page, which is displayed in the browser tab rather than in the page body. The body tag contains what is displayed in the body of the page. The div tag defines a block in the page. Its id is container; id is a very common attribute, and an id value is unique within a page, so it can be used to find this block. Inside this block is another div tag whose class is wrapper; class is also a very common attribute and is often used together with CSS to set styles. Inside that block is an h2 tag, which represents a second-level heading, and a p tag, which represents a paragraph. Each has its own class attribute, and the content written inside each tag is what gets rendered on the page.

After saving the code, double-click the file to open it in your browser. The result is shown in the figure below.

As you can see, the tab displays the text This is a Demo, which is what we defined in the title tag inside head. The body of the page is generated from the elements defined inside the body tag, and it shows the second-level heading and the paragraph.

This example shows the general structure of a web page. In the standard form, the head and body tags are nested inside an html tag; head defines the page configuration and references, and body defines the page content.
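
As a rough check of this structure, the Python sketch below feeds the same test.html content to the standard library's html.parser and records which text sits directly inside which tag. The TextByTag class is a hypothetical helper written for this example and assumes well-nested markup like the sample.

```python
from html.parser import HTMLParser

HTML = """<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8" />
    <title>This is a Demo</title>
  </head>
  <body>
    <div id="container">
      <div class="wrapper">
        <h2 class="title">Hello World</h2>
        <p class="text">Hello, this is a paragraph.</p>
      </div>
    </div>
  </body>
</html>
"""


class TextByTag(HTMLParser):
    """Maps each tag name to the non-blank text found directly inside it."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, innermost last
        self.text = {}

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags such as <meta ... /> hold no text

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Ignore the whitespace between tags; keep real text only.
        if data.strip() and self.stack:
            self.text.setdefault(self.stack[-1], []).append(data.strip())


parser = TextByTag()
parser.feed(HTML)
parser.close()
print("tab title:", parser.text["title"])
print("heading:", parser.text["h2"])
print("paragraph:", parser.text["p"])
```

The title text comes from inside head and the heading and paragraph from inside body, matching what the browser shows in the tab and in the page.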

3. Node tree and relationships between nodes

In HTML, everything defined by a tag is a node, and together the nodes form an HTML node tree, also known as the HTML DOM tree.

So what is the DOM? DOM stands for Document Object Model, a W3C (World Wide Web Consortium) standard. It defines a standard for accessing HTML and XML documents. According to the W3C HTML DOM standard, everything in an HTML document is a node.

  • The entire document is a document node.
  • Each HTML tag corresponds to an element node. The html tag in the example above is the root element node.
  • The text inside a node is a text node. For example, an a node represents a hyperlink, and the text inside it is a text node.
  • Each attribute of a node is an attribute node. For example, an a node has an href attribute, which is an attribute node.
  • A comment is a comment node. HTML has a special syntax for comments, and a comment also corresponds to a node.
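
The node kinds listed above can be observed from Python with the standard library's xml.dom.minidom. Note the assumption: minidom is an XML DOM implementation, so the snippet must be well-formed XML, whereas a browser's HTML DOM is more forgiving. The document string, the NAMES mapping, and the kind helper are made up for this illustration, and example.com is a placeholder URL.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# A tiny well-formed document with no whitespace between tags,
# so that no extra whitespace-only text nodes appear.
DOC = (
    "<html><body><!-- a comment -->"
    '<a href="https://example.com">link text</a>'
    "</body></html>"
)

# Human-readable names for the nodeType constants used here.
NAMES = {
    Node.DOCUMENT_NODE: "document node",
    Node.ELEMENT_NODE: "element node",
    Node.TEXT_NODE: "text node",
    Node.ATTRIBUTE_NODE: "attribute node",
    Node.COMMENT_NODE: "comment node",
}


def kind(node):
    """Return the readable name of a node's type."""
    return NAMES[node.nodeType]


dom = minidom.parseString(DOC)
root = dom.documentElement            # the <html> root element
body = root.firstChild                # the <body> element
comment, anchor = body.childNodes     # the comment and the <a> element
text = anchor.firstChild              # the text inside <a>
href = anchor.getAttributeNode("href")

print("whole document:", kind(dom))
print("<html>:", kind(root))
print("<a>:", kind(anchor))
print("text inside <a>:", kind(text), repr(text.data))
print("href:", kind(href), repr(href.value))
print("comment:", kind(comment), repr(comment.data))
```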

Therefore, the HTML DOM treats an HTML document as a tree structure, which is called a node tree, as shown in the figure below:

With the HTML DOM, all nodes in the tree are accessible through JavaScript, and all HTML node elements can be modified, created or deleted.

Nodes in a node tree have hierarchical relationships with one another. We often use the terms parent, child, and sibling to describe these relationships. A parent node has child nodes, and child nodes that share the same parent are called siblings.

In the node tree, the top node is called the root. Except for the root node, every node has a parent, and a node can have any number of children or siblings. The figure shows a node tree and the relationships between its nodes.
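
The parent, child, and sibling relationships can be explored with the same kind of sketch. The example below again relies on Python's xml.dom.minidom as a stand-in for a browser DOM (an assumption for illustration) and reuses the div, h2, and p structure from test.html, written on one logical line so that no whitespace-only text nodes appear between tags.

```python
from xml.dom import minidom

DOC = (
    '<div id="container"><div class="wrapper">'
    '<h2 class="title">Hello World</h2>'
    '<p class="text">Hello, this is a paragraph.</p>'
    "</div></div>"
)

dom = minidom.parseString(DOC)
container = dom.documentElement   # root element of this fragment
wrapper = container.firstChild    # only child of container
h2, p = wrapper.childNodes        # the two children of wrapper

# Parent and child: wrapper is the parent of both h2 and p.
print(h2.parentNode.getAttribute("class"))
print([child.tagName for child in wrapper.childNodes])

# Siblings: h2 and p share the same parent.
print(h2.nextSibling.tagName)
print(p.previousSibling.tagName)

# The document node sits at the very top and has no parent.
print(dom.parentNode)
```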

4. Selectors

We know that a web page is made up of nodes and that CSS selectors apply different style rules to different nodes. So how do we locate a node?

In CSS, we use CSS selectors to locate nodes. For example, the div node whose id is container in the example above can be written as #container, where # means select by id and is followed by the id value. If we want to select nodes whose class is wrapper, we can use .wrapper, where the leading dot (.) means select by class and is followed by the class name. A third way is to filter by tag name. For example, to select the second-level heading, just use h2. These are the three most common forms: filtering by id, by class, and by tag name. Keep them in mind.

In addition, CSS selectors support nested selection: selectors separated by spaces express nesting. For example, #container .wrapper p first selects the node whose id is container, then the node inside it whose class is wrapper, and then the p node inside that. Selectors written together without a space must all apply to the same node. For example, div#container .wrapper p.text first selects the div node whose id is container, then the node inside it whose class is wrapper, and then the p node inside that whose class is text. This is how CSS selectors work, and their filtering capabilities are quite powerful.
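
To show the difference between a space and no space, here is a toy selector matcher in Python built on the standard library's xml.etree.ElementTree. It is only a sketch under narrow assumptions: it understands tag names, #id, .class, and the space (descendant) combinator, and nothing else. The functions parse_compound, matches, and select are hypothetical helpers, not part of any library, and a real crawler would rely on a proper selector engine.

```python
import re
import xml.etree.ElementTree as ET


def parse_compound(compound):
    """Split a compound selector such as 'div#container' or 'p.text' into parts."""
    tag = re.match(r"[a-zA-Z][\w-]*", compound)
    ids = re.findall(r"#([\w-]+)", compound)
    classes = re.findall(r"\.([\w-]+)", compound)
    return (tag.group(0) if tag else None), ids, classes


def matches(element, compound):
    """True if one element satisfies every part of a compound selector."""
    tag, ids, classes = parse_compound(compound)
    if tag and element.tag != tag:
        return False
    if any(element.get("id") != wanted for wanted in ids):
        return False
    element_classes = element.get("class", "").split()
    return all(wanted in element_classes for wanted in classes)


def select(root, selector):
    """Return elements matching a space-separated (descendant) selector."""
    parts = selector.split()
    matched = [el for el in root.iter() if matches(el, parts[0])]
    for compound in parts[1:]:
        narrowed = []
        for scope in matched:
            for el in scope.iter():  # iter() yields scope itself first
                if el is not scope and matches(el, compound) and el not in narrowed:
                    narrowed.append(el)
        matched = narrowed
    return matched


PAGE = """<div id="container">
  <div class="wrapper">
    <h2 class="title">Hello World</h2>
    <p class="text">Hello, this is a paragraph.</p>
  </div>
</div>"""

root = ET.fromstring(PAGE)
for selector in ["#container .wrapper p", "div#container .wrapper p.text",
                 ".title", "#container.wrapper"]:
    print(selector, "->", [el.tag for el in select(root, selector)])
```

The last selector, #container.wrapper, finds nothing: without the space it asks for a single node that has both the id container and the class wrapper, and no such node exists in this page.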

You can test CSS selectors in the browser. Open the developer tools again, switch to the Elements panel, and press Ctrl + F (or Command + F on a Mac). A search box appears in the lower left corner, as shown in the figure.

Type .title to select the node whose class is title. The matching node is highlighted in the page, as shown in the figure below:

Type div#container .wrapper p.text to select, layer by layer, the p node with class text inside the node with class wrapper inside the div node with id container, as shown in the figure below:

In addition, CSS selectors have some other syntax rules, as shown in the following table.

Other syntax rules for CSS selectors

| Selector | Example | Description |
| --- | --- | --- |
| .class | .intro | Selects all nodes with class="intro" |
| #id | #firstname | Selects all nodes with id="firstname" |
| * | * | Selects all nodes |
| element | p | Selects all p nodes |
| element,element | div,p | Selects all div nodes and all p nodes |
| element element | div p | Selects all p nodes inside div nodes |
| element>element | div>p | Selects all p nodes whose parent is a div node |
| element+element | div+p | Selects each p node that immediately follows a div node |
| [attribute] | [target] | Selects all nodes that have a target attribute |
| [attribute=value] | [target=blank] | Selects all nodes with target="blank" |
| [attribute~=value] | [title~=flower] | Selects all nodes whose title attribute contains the word flower |
| :link | a:link | Selects all unvisited links |
| :visited | a:visited | Selects all visited links |
| :active | a:active | Selects the active link |
| :hover | a:hover | Selects the link the mouse pointer is over |
| :focus | input:focus | Selects the input node that has focus |
| :first-letter | p:first-letter | Selects the first letter of each p node |
| :first-line | p:first-line | Selects the first line of each p node |
| :first-child | p:first-child | Selects every p node that is the first child of its parent |
| :before | p:before | Inserts content before the content of each p node |
| :after | p:after | Inserts content after the content of each p node |
| :lang(language) | p:lang(it) | Selects every p node whose lang attribute value starts with it |
| element1~element2 | p~ul | Selects every ul node that is preceded by a p node |
| [attribute^=value] | a[src^="https"] | Selects every a node whose src attribute value starts with https |
| [attribute$=value] | a[src$=".pdf"] | Selects every a node whose src attribute value ends with .pdf |
| [attribute*=value] | a[src*="abc"] | Selects every a node whose src attribute value contains the substring abc |
| :first-of-type | p:first-of-type | Selects every p node that is the first p node of its parent |
| :last-of-type | p:last-of-type | Selects every p node that is the last p node of its parent |
| :only-of-type | p:only-of-type | Selects every p node that is the only p node of its parent |
| :only-child | p:only-child | Selects every p node that is the only child of its parent |
| :nth-child(n) | p:nth-child(2) | Selects every p node that is the second child of its parent |
| :nth-last-child(n) | p:nth-last-child(2) | Same as above, counting from the last child |
| :nth-of-type(n) | p:nth-of-type(2) | Selects every p node that is the second p node of its parent |
| :nth-last-of-type(n) | p:nth-last-of-type(2) | Same as above, counting from the last child |
| :last-child | p:last-child | Selects every p node that is the last child of its parent |
| :root | :root | Selects the root node of the document |
| :empty | p:empty | Selects every p node that has no children (including text nodes) |
| :target | #news:target | Selects the currently active #news node |
| :enabled | input:enabled | Selects every enabled input node |
| :disabled | input:disabled | Selects every disabled input node |
| :checked | input:checked | Selects every checked input node |
| :not(selector) | :not(p) | Selects every node that is not a p node |
| ::selection | ::selection | Selects the part of a node that the user has selected |
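
The attribute selectors in the table differ only in how they compare the attribute value, which maps neatly onto ordinary string operations. The Python sketch below mimics that comparison for five of the operators. attribute_matches is a hypothetical helper for illustration; it compares plain strings only and does not parse selectors or touch a real page. The sample values, including the example.com URL, are placeholders.

```python
def attribute_matches(value, operator, expected):
    """Mimic how CSS compares an attribute value for a few operators."""
    if value is None:  # the node does not have the attribute at all
        return False
    if operator == "=":
        return value == expected
    if operator == "~=":
        # The value is a whitespace-separated list of words.
        return expected in value.split()
    if operator in ("^=", "$=", "*="):
        if expected == "":  # an empty string never matches these operators
            return False
        if operator == "^=":
            return value.startswith(expected)
        if operator == "$=":
            return value.endswith(expected)
        return expected in value
    raise ValueError(f"unsupported operator: {operator}")


url = "https://example.com/files/report.pdf"
print(attribute_matches(url, "^=", "https"))              # starts with https
print(attribute_matches(url, "$=", ".pdf"))               # ends with .pdf
print(attribute_matches(url, "*=", "files"))              # contains files
print(attribute_matches("tulip flower", "~=", "flower"))  # has the word flower
print(attribute_matches("sunflower", "~=", "flower"))     # substring, not a word
```

The last two calls show why ~= and *= are separate operators: flower is a whole word in "tulip flower" but only a substring of "sunflower".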

There is another commonly used way to select nodes, XPath, which is described in detail later.

5. Summary

This section introduced the structure of web pages and the relationships between nodes. With this knowledge, we will have a clearer idea of how to parse and extract web content.
