Arm your crawlers with advanced XPath

I haven't written a blog post for a while. One reason is that my motivation to study dropped rather than rose after graduating. Another is that I spent a long time studying compiler principles without coming to understand them well, so I had no knowledge of my own to write up. Things have gradually settled down now, and I will go back to updating the blog as before.

Introduction

Why XPath? I always thought I knew a fair bit about XPath, but the more I learned about it, the more I admired it.

In short, I used to think of XPath as a “gun” for structured documents, not a “cruise missile” that automatically tracks targets.

Let's work up to the power of XPath slowly, starting from its basics.

What is `XPath`

First of all, XPath is a language. You can think of it as being like regular expressions, or like SQL: all of them exist to find what we want in some data. Whereas SQL fetches data from a database, XPath fetches data from an XML document.

So we know XPath works on XML. And what is XML? It is a structured document, which we can also think of as a tree.

<root>
    <son> I'm son </son>
</root>

For example, a simple XML document opens with the start tag of a root node and ends with that same node's closing tag. There can be many children or grandchildren in between, but every node other than the root has exactly one parent.

A document like this is the kind of tree-structured data we often meet in our programs. XPath works on this regular, structured text rather than manipulating raw text the way regular expressions do. You can use regular expressions on XML documents, but they cannot capture the relationships between XML nodes, and expressing those relationships is where XPath is most powerful.

Among these relationships, the most common and central one is the parent-child relationship, written simply as /. For example, let's make the XML above slightly more complex by adding a second son:

<root>
    <son id="1"> I'm son1 </son>
    <son id="2"> I'm son2</son>
</root>

To get the first son (the one whose id is 1), we simply use this XPath expression:

/root/son[@id='1']

PS: In XPath we can also use // to match at any depth. With //son[@id='1'] we find the son without caring who its parent is. Going further, if we don't care what the node is as long as its id is 1, we can replace the node name with the node() function: //node()[@id='1'].
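To try these expressions out, here is a minimal sketch in Python. It assumes the third-party lxml library (pip install lxml), which the original post does not name, and the variable names are only illustrative.

```python
from lxml import etree  # assumption: third-party library, not named in the post

XML = """
<root>
    <son id="1">I'm son1</son>
    <son id="2">I'm son2</son>
</root>
"""

root = etree.fromstring(XML.strip())

# Absolute path: <son> must be a direct child of the top-level <root>.
by_path = root.xpath("/root/son[@id='1']")

# Descendant search: any <son> with id 1, whoever its parent is.
by_tag = root.xpath("//son[@id='1']")

# Any node at all whose id attribute is 1.
by_any = root.xpath("//node()[@id='1']")

for result in (by_path, by_tag, by_any):
    print([element.text for element in result])
```

All three expressions select the same element here. They only start to differ once the document contains other nodes with id 1 or sons nested deeper in the tree.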

Let's look at this XPath again. It defines a relationship between root and son. The power of XPath is that a very short expression can define a relationship between nodes. Here root and son are both nodes: / specifies the parent-child relationship between them, and [] constrains a node by its own child nodes or attributes (@ selects an attribute).

To master XPath, it helps to see that XPath is all about the relationship between "faces" and "points". "Faces" represent nodes, and "points" represent attributes. A face can contain many points, and a point, looked at closely enough, can itself be made up of many smaller faces.

In the example above, root is the parent of son, and we can constrain root with conditions on its children, such as root[count(son)=2] (a root with exactly two son children). The son inside that condition can itself be constrained by attributes, as in root[count(son[@id])=2] (a root with exactly two son children that carry an id attribute). From this we can see that node and attribute constraints can be nested inside each other.
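A small sketch of these nested constraints, again assuming the third-party lxml library. Note that the XPath count() function takes a single node-set argument, so the comparison is written as count(son)=2.

```python
from lxml import etree  # assumption: third-party library, not named in the post

root = etree.fromstring("""<root>
    <son id="1">I'm son1</son>
    <son id="2">I'm son2</son>
</root>""")

# <root> is selected only if it has exactly two <son> children.
two_sons = root.xpath("/root[count(son)=2]")

# The inner predicate narrows what is counted: only sons with an id attribute.
two_sons_with_id = root.xpath("/root[count(son[@id])=2]")

# A constraint that does not hold selects nothing.
three_sons = root.xpath("/root[count(son)=3]")

print(len(two_sons), len(two_sons_with_id), len(three_sons))
```

With this document the first two expressions each select the root element and the third selects nothing.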

As you can see from this little example above, the power of XPath is that it can be used to constrain in great detail the relationships between nodes and other nodes or attributes. These relationships can be absolute or relative, depending on how you choose. Absolute means strict, relative means loose.

PS: Of course, we are using "attribute" in a broad sense here. In XPath, nodes also have text values, which can be regarded as a text attribute of the node.
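As an illustration of treating text like an attribute, the sketch below (assuming the third-party lxml library) selects a node's text with text() and also uses the text inside a predicate, the same way @id is used.

```python
from lxml import etree  # assumption: third-party library, not named in the post

root = etree.fromstring("""<root>
    <son id="1">I'm son1</son>
    <son id="2">I'm son2</son>
</root>""")

# text() selects the text node children of the matched element.
texts = root.xpath("/root/son[@id='2']/text()")

# Text can constrain a node just like an attribute can.
ids = root.xpath("//son[contains(text(), 'son2')]/@id")

print(texts, ids)
```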

Other relationships in `XPath`

Earlier we introduced the most important relationship in XPath: the parent-child relationship. It is the one we use most with XPath, and most tutorials on the web are built around it. We won't cover it in more detail in this post; you can learn more about it on W3CSchool.

Let's use a question to introduce the other relationships. First we modify the simple XML above:

<root>
    <son id="1"> I'm son1 </son>
    <target> son1 target</target>
    <son id="2"> I'm son2</son>
    <target> son2 target </target>
</root>

We have introduced two targets, and now we want the target next to the son with id 1. Using only the parent-child relationship, we can get it with /root/target[1] (XPath indexing starts at 1), but that adds a constraint: it must be the first target node under root. If the XML is less regular, so that son and target still come together as a set but their positions vary, we can't rely on parent-child relationships alone to locate the node.

So we introduce the idea that son and target are siblings: if we know where son is, we can locate target from it. How do we write this XPath? First we locate the son:

/root/son[@id='1']

Then, starting from that son, we take the sibling that follows it (there is a matching preceding-sibling syntax too):

/root/son[@id='1']/following-sibling::target[1]

Here target[1] means the first target after the son, that is, the one nearest to it.
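The sketch below shows the difference the [1] makes. It assumes the third-party lxml library, and the variable names are only illustrative.

```python
from lxml import etree  # assumption: third-party library, not named in the post

root = etree.fromstring("""<root>
    <son id="1">I'm son1</son>
    <target>son1 target</target>
    <son id="2">I'm son2</son>
    <target>son2 target</target>
</root>""")

# Without a position, every <target> sibling after son 1 is selected.
all_following = root.xpath("/root/son[@id='1']/following-sibling::target")

# With [1], only the nearest following <target> sibling is selected.
nearest = root.xpath("/root/son[@id='1']/following-sibling::target[1]")

print([t.text for t in all_following])
print([t.text for t in nearest])
```

Because the lookup is anchored on the son rather than on a fixed position under root, it should keep working if other nodes are inserted before the son.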

Of course, we can also reach the same target from the second son. The corresponding preceding-sibling syntax is:

/root/son[@id='2']/preceding-sibling::target[1]

As the names suggest, following-sibling looks at the siblings after the current node, and preceding-sibling looks at the siblings before it.

Now let's wrap each son and target in a group, which is probably more common in real documents:
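One detail worth checking in code: on a backward-looking axis such as preceding-sibling, position 1 is the sibling closest to the starting node, not the first one in the document. A minimal sketch, assuming the third-party lxml library:

```python
from lxml import etree  # assumption: third-party library, not named in the post

root = etree.fromstring("""<root>
    <son id="1">I'm son1</son>
    <target>son1 target</target>
    <son id="2">I'm son2</son>
    <target>son2 target</target>
</root>""")

# Nearest <target> sibling before son 2.
target_before_son2 = root.xpath("/root/son[@id='2']/preceding-sibling::target[1]")

# Starting from the last <target>, position 1 is the closest sibling before it
# (son 2), not the first child of <root>.
closest = root.xpath("/root/target[last()]/preceding-sibling::*[1]")

print(target_before_son2[0].text)
print(closest[0].tag, closest[0].get("id"))
```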

<root>
<group>
    <son id="1"> I'm son1 </son>
    <target> son1 target</target>
</group>
<group>
    <son id="2"> I'm son2</son>
    <target> son2 target </target>
</group>
</root>

If we keep using the sibling expression above, we find that it no longer matches the target. At this point, drop the sibling constraint:

/root/group/son[@id='2']/preceding::target[1]

You'll be amazed at how accurately XPath finds our target. The remarkable thing is that it allows a "mountain over mountain" lookup: the match is not confined to the son's own group and can cross into an earlier one.

If you use regular expressions or plain parent-child relationships, you must first find the groups and then iterate over all of them with a for loop to find the son...
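To make the comparison concrete, the sketch below runs both XPath expressions against the grouped document and then does the same lookup with a hand-written loop. It assumes the third-party lxml library, and previous_target_by_loop is a hypothetical helper written only for this illustration.

```python
from lxml import etree  # assumption: third-party library, not named in the post

root = etree.fromstring("""<root>
<group>
    <son id="1">I'm son1</son>
    <target>son1 target</target>
</group>
<group>
    <son id="2">I'm son2</son>
    <target>son2 target</target>
</group>
</root>""")

# The sibling axis stays inside son 2's own group, so it finds nothing.
by_sibling = root.xpath("/root/group/son[@id='2']/preceding-sibling::target[1]")

# The preceding axis crosses group boundaries and finds the nearest earlier target.
by_preceding = root.xpath("/root/group/son[@id='2']/preceding::target[1]")


def previous_target_by_loop(root, son_id):
    """Hypothetical manual equivalent: walk every group, remembering the last target seen."""
    last_target = None
    for group in root.findall("group"):
        for child in group:
            if child.tag == "son" and child.get("id") == son_id:
                return last_target
            if child.tag == "target":
                last_target = child
    return None


print(by_sibling)
print(by_preceding[0].text)
print(previous_target_by_loop(root, "2").text)
```

The loop only handles this one document shape, while the XPath expression states the relationship directly.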

Conclusion

With a simple expression we can achieve what might otherwise take hundreds of lines of code. Before I understood XPath's relationship constraints, I wrote a lot of code just to locate nodes like these, and now a single line does it. We have to admire the wisdom of our predecessors. I just want to say "delicious~".
