In this section, we explain zero-width assertions, a common and important topic in regular expressions.

## A motivating example

Suppose we need to extract the question-and-answer pairs from a passage like the following (only the first two pairs are shown here):

Q: I am using Windows XP+Service Pack 2. Why can’t I install the control that input the card number and password?

A: In Windows XP+Service Pack 2, Windows 2003 and other operating systems, users can choose whether to install controls.

Q: Why does the card number input box I see show an * sign?

A: Your browser forbids downloading and executing ActiveX controls. In this case, you must enable your browser’s ActiveX permissions. To do so: from the browser menu select “Tools” | “Internet Options”; in the dialog that pops up, select “Security” | “Internet” | “Custom Level”; in the next dialog, select the “reset to security level” option, click the “Reset” button, and confirm.

If we were implementing this in Python, we would naturally think of the `re.split()` or `re.findall()` function. With `split()`, we would probably write something like this:

    import re

    # text holds the Q&A passage shown above
    results = re.split('Q: |A: ', text)
    for index, result in enumerate(results[1:]):
        print(('Q' if index % 2 == 0 else 'A') + ': ' + result)

Here the first argument to `split()` is the regular expression `Q: |A: `, which splits the passage wherever `Q: ` or `A: ` occurs. `re.split()` splits a string by a regular expression, which makes it more powerful than the string's own `split()` method. The result is a list of odd length whose first element is an empty string.

This is because the delimiter is itself part of the text, and the text begins with one: the very first thing the split finds is the `Q: ` marker, and what lies to its left is an empty string. So the first element of the result is empty and the remaining elements are the actual sentences. We therefore slice off the first element and then loop over the rest, printing the pieces alternately labelled `Q: ` and `A: `.

This works, and it extracts everything without a problem, but it doesn't feel very elegant. The questions and answers come back as separate items instead of being grouped into pairs, and `split()` returns an unwanted first element that has to be sliced away.
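The leading empty string is easy to reproduce. The following minimal sketch uses a shortened, made-up two-pair text (`sample` is an assumption for illustration, not the original passage) and shows both the empty first element and the slicing needed to pair the pieces up again.

```python
import re

# Hypothetical, shortened stand-in for the Q&A passage.
sample = 'Q: What is 1 + 1?\nA: 2\nQ: What is 2 + 2?\nA: 4\n'

parts = re.split('Q: |A: ', sample)

# The text starts with a delimiter, so the piece to its left is ''.
print(repr(parts[0]))
print(len(parts))

# Drop the empty first element, then pair the rest up two at a time.
cleaned = [part.strip() for part in parts[1:]]
pairs = list(zip(cleaned[0::2], cleaned[1::2]))
print(pairs)
```

The pairing step at the end is exactly the extra bookkeeping that makes the `split()` approach feel clumsy.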

So we turn to the `findall()` function instead, and might write:

    import re

    results = re.findall('Q: (.*?)A: (.*?)', text, re.S)
    for result in results:
        print('Q: ' + result[0], 'A: ' + result[1], sep='\n')

Because the pattern says nothing about where the answer ends, the second non-greedy group matches as little as possible, which is nothing at all. As a result, the answers are not captured:

    Q: I am using Windows XP+Service Pack 2. Why can’t I install the control that input the card number and password?
    A:
    Q: Why does the card number input box I see show an * sign?
    A:
    Q: After reading the above questions, I still cannot log in. What should I do?
    A:
    Q: The public login page of personal online banking cannot be displayed.
    A:
    Q: I always make mistakes when I input my account number and card number. How do I input?
    A:
    Q: My passbook has no password. How can I check my balance in the popular version of personal Online banking?
    A:

The pattern needs something to mark where an answer ends. The obvious candidate is the `Q: ` that begins the next question, so we might rewrite it like this:

    import re

    results = re.findall('Q: (.*?)A: (.*?)Q: ', text, re.S)
    for result in results:
        print('Q: ' + result[0], 'A: ' + result[1], sep='\n')

This may seem like a good idea, but the result contains only some of the pairs.

`findall()` finds every non-overlapping match of the regular expression by scanning the text with an internal search position. Because the pattern ends with `Q: `, the first match consumes the `Q: ` that begins the second pair, and the search position is left just after it. The next scan starts from there, so the second pair no longer has its `Q: ` and cannot match; the next match is only found at the third pair. The same thing then happens to the fourth pair. Consequently, only the first, third, and fifth pairs are returned.
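The skipping behaviour can be reproduced on a tiny input. This is a sketch on a made-up four-pair string (`sample` and the `q1`/`a1` placeholders are assumptions for illustration); it uses `re.finditer()` to show what the scanner sees immediately after each match.

```python
import re

# Hypothetical four-pair text; the numbering makes skipped pairs easy to spot.
sample = 'Q: q1 A: a1 Q: q2 A: a2 Q: q3 A: a3 Q: q4 A: a4'
pattern = 'Q: (.*?) A: (.*?) Q: '

# The trailing 'Q: ' is consumed as part of every match.
consumed = re.findall(pattern, sample, re.S)
print(consumed)

# What follows each match: the next pair's question, already stripped of its 'Q: '.
next_after = [sample[m.end():m.end() + 2] for m in re.finditer(pattern, sample, re.S)]
print(next_after)
```

Only the odd-numbered pairs come back, and the text right after each match is a question that has lost its `Q: ` marker.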

So, to retrieve every question-and-answer pair with this approach, we need a zero-width assertion.

The solution is as follows:

    import re

    results = re.findall(r'Q: (.*?)A: (.*?)(?=Q: |\Z)', text, re.S)
    for result in results:
        print('Q: ' + result[0], 'A: ' + result[1], sep='\n')

This time every question is returned together with its answer.

The key is the `(?=...)` group, which here contains either `Q: ` or the end-of-string anchor `\Z`. It guarantees that the search position does not move past the next `Q: ` during a match, while still marking where the answer ends, so every pair is found in full.
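Wrapped up as a small helper, the lookahead version might look like the sketch below. The function name `extract_pairs`, the whitespace stripping, and the three-pair `sample` are all assumptions added for illustration.

```python
import re

def extract_pairs(text):
    """Return (question, answer) tuples from text marked up with 'Q: ' and 'A: '."""
    # The lookahead ends each answer at the next 'Q: ' or at the end of the
    # string without consuming it, so the next match can start right there.
    pattern = r'Q: (.*?)A: (.*?)(?=Q: |\Z)'
    return [(q.strip(), a.strip()) for q, a in re.findall(pattern, text, re.S)]

# Hypothetical three-pair text.
sample = 'Q: q1\nA: a1\nQ: q2\nA: a2\nQ: q3\nA: a3\n'
pairs = extract_pairs(sample)
print(pairs)
```

No pair is skipped this time, because the `Q: ` that ends one answer is still available to start the next match.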

## Zero-width assertions

A zero-width assertion, as its name implies, is a match of zero width: whatever it matches is not stored in the match result. The expression only describes a position, for example what must appear immediately to the right of a given character.
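"Zero width" can be observed directly from the span of a match. In this short sketch (the string `'foobar'` is just an illustrative input), a lookahead on its own matches at a position but covers no characters, whereas an ordinary pattern covers the characters it matched.

```python
import re

word = 'foobar'

# A lookahead succeeds at the position before 'bar' but consumes nothing.
lookahead = re.search(r'(?=bar)', word)
print(lookahead.span())          # start and end are equal
print(repr(lookahead.group()))   # the matched text is empty

# A plain pattern, by contrast, consumes the characters it matches.
plain = re.search(r'bar', word)
print(plain.span())
```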

The `?=` we used above is one of them. The others are `?<=`, `?!`, and `?<!`.

• `?=` is the zero-width positive lookahead assertion: it asserts that the text following the current position matches the expression inside it.
• `?<=` is the zero-width positive lookbehind assertion: it asserts that the text preceding the current position matches the expression inside it.
• `?!` is the zero-width negative lookahead assertion: it asserts that the text following the current position does not match the expression inside it.
• `?<!` is the zero-width negative lookbehind assertion: it asserts that the text preceding the current position does not match the expression inside it.
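The four assertions can be compared side by side on one input. This is a minimal sketch on a made-up string (`line` and the `cat`/`dog` tokens are assumptions chosen only to keep the results easy to read).

```python
import re

line = 'cat1 dog2 cat3 dog4'

# (?=...)  positive lookahead: digits followed by ' dog'
followed_by_dog = re.findall(r'\d(?= dog)', line)
# (?<=...) positive lookbehind: digits preceded by 'cat'
preceded_by_cat = re.findall(r'(?<=cat)\d', line)
# (?!...)  negative lookahead: digits NOT followed by ' dog'
not_followed_by_dog = re.findall(r'\d(?! dog)', line)
# (?<!...) negative lookbehind: digits NOT preceded by 'cat'
not_preceded_by_cat = re.findall(r'(?<!cat)\d', line)

print(followed_by_dog, preceded_by_cat)
print(not_followed_by_dog, not_preceded_by_cat)
```

In every case only the digit itself is returned; the surrounding `cat` or `dog` text is checked but never becomes part of the match.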

### ?=

First, let's look at `?=`, which asserts that the text following the current position matches the expression inside it.

Let’s say we have a string like this:

    text = 'My personal email is [email protected], my personal blog is cuiqingcai.com, my personal official account is Attack Coder'

Here we want to extract the clause “My personal email is ...” on its own. Without a zero-width assertion, we need to put an end marker after the clause, or match the email address separately as the marker. We might write:

    import re

    text = 'My personal email is [email protected], my personal blog is cuiqingcai.com, my personal official account is Attack Coder'
    result = re.search('My personal email is (.*?), my personal blog', text)
    print('Whole sentence result: ' + result.group(), 'First match result: ' + result.group(1), sep='\n')

At the end of the regular expression we add `, my personal blog` as the end marker, and match the email part with a non-greedy group. Let's look at the result:

    Whole sentence result: My personal email is [email protected], my personal blog
    First match result: [email protected]

The first group successfully captures the email address, but the whole-sentence result is not ideal: it includes the end marker we added, so it is not a clean sentence.

If we use `?=` instead, the marker no longer appears in the result. We rewrite the code as follows:

    import re

    text = 'My personal email is [email protected], my personal blog is cuiqingcai.com, my personal official account is Attack Coder'
    result = re.search('My personal email is (.*?)(?=, my personal blog)', text)
    print('Whole sentence result: ' + result.group(), 'First match result: ' + result.group(1), sep='\n')

Here we have changed the end marker to `(?=, my personal blog)`, so this part is matched with zero width: the match must be followed by `, my personal blog`, but that text does not appear in the match result.

The output is as follows:

    Whole sentence result: My personal email is [email protected]
    First match result: [email protected]

You can see that the whole-sentence result no longer carries the unwanted suffix.
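The difference between consuming the marker and merely looking ahead at it also shows up in where the match ends. The sketch below uses a made-up string (`sample` is an assumption for illustration) and compares `end()` for the two styles of pattern.

```python
import re

sample = 'name is Alice, age is 30'

# The end marker is part of the match, so the match ends after ', age'.
consuming = re.search(r'name is (.*?), age', sample)
# The end marker is only asserted, so the match ends before ', age'.
lookahead = re.search(r'name is (.*?)(?=, age)', sample)

print(repr(consuming.group()), consuming.end())
print(repr(lookahead.group()), lookahead.end())

# Everything from the marker onwards is still unconsumed.
remaining = sample[lookahead.end():]
print(repr(remaining))
```

Both patterns capture the same group; only the overall match and the position where scanning would resume differ.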

### ?<=

Next, let's look at `?<=`, which asserts that the text preceding the current position matches the expression inside it. For example, to extract the blog clause we can write:

    import re

    text = 'My personal email is [email protected], my personal blog is cuiqingcai.com, my personal official account is Attack Coder'
    result = re.search('(?<=, )my personal blog is (.*?)(?=,)', text)
    print('Whole sentence result: ' + result.group(), 'First match result: ' + result.group(1), sep='\n')

Here we put a zero-width assertion for the comma in front of `my personal blog` using `?<=`, and another for the comma at the end of the clause using `?=`, so neither marker appears in the match. The result is as follows:

    Whole sentence result: my personal blog is cuiqingcai.com
    First match result: cuiqingcai.com

You can see that the whole-sentence result is now exactly the clause we wanted.
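One practical caveat when writing lookbehinds in Python: to the best of my knowledge, the standard `re` module only accepts lookbehind expressions of fixed width and rejects others when the pattern is compiled. The sketch below illustrates this with two made-up patterns; the exact error message may vary between Python versions.

```python
import re

# Fixed-width lookbehind (always exactly two characters): accepted.
fixed = re.compile(r'(?<=, )blog')
print(fixed.search('email, blog').group())

# Variable-width lookbehind (a comma plus any amount of whitespace):
# rejected by the standard re module at compile time.
try:
    re.compile(r'(?<=,\s*)blog')
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print('re.error:', exc)
```

This is why the example above spells out the comma and the space literally rather than using something like `\s*` inside the lookbehind.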

### ?!

`?!` is the zero-width negative lookahead assertion, which asserts that the text following the current position does not match the expression inside it. Like `?=`, it looks at the text that follows, but with the opposite meaning: it specifies what the following text must not be. We modify the previous example as follows:

    import re

    text = 'My personal email is [email protected], my personal blog is cuiqingcai.com, my personal official account is Attack Coder'
    result = re.search('My personal email is (.*?)(?!, my personal official account)(?=, my personal blog)', text)
    print('Whole sentence result: ' + result.group(), 'First match result: ' + result.group(1), sep='\n')

We keep the `(?=, my personal blog)` marker, and additionally use `?!` with a second marker, `, my personal official account`. Together they require that the match be followed by `, my personal blog` and not by `, my personal official account`. The result is as follows:

    Whole sentence result: My personal email is [email protected]
    First match result: [email protected]
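A negative lookahead is often easiest to understand as a filter. In this sketch (the string and the placeholder values `a`, `b`, `c` are made up for illustration), we collect the word after `personal ` unless that word is `blog`.

```python
import re

sample = 'My personal email is a, my personal blog is b, my personal account is c'

# Match 'personal <word>' only when <word> does not start with 'blog'.
not_blog = re.findall(r'personal (?!blog)(\w+)', sample)
print(not_blog)

# Without the assertion, every 'personal <word>' is returned.
everything = re.findall(r'personal (\w+)', sample)
print(everything)
```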

### ?<!

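`?<!` is the zero-width negative lookbehind assertion, which asserts that the text preceding the current position does not match the expression inside it. We modify the earlier example as follows:

    import re

    text = 'My personal email is [email protected], my personal blog is cuiqingcai.com, my personal official account is Attack Coder'
    # The text inside (?<!...) is an illustrative choice: "not preceded by a period and a space"
    result = re.search(r'(?<!\. )(?<=, )my personal blog is (.*?)(?=,)', text)
    print('Whole sentence result: ' + result.group(), 'First match result: ' + result.group(1), sep='\n')

Here `?<!` names text that must not appear immediately before the match, while `?<=` still requires the preceding comma and `?=` the following one.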

The running results are as follows:

    Whole sentence result: my personal blog is cuiqingcai.com
    First match result: cuiqingcai.com
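A negative lookbehind `(?<!...)` has one pitfall worth seeing in isolation: if the pattern can simply start one character later, the assertion is easily sidestepped. The sketch below uses a made-up string (`sample` is an assumption for illustration) to find numbers that are not prices, with and without a word boundary.

```python
import re

sample = 'cost $30 for 12 items'

# Naive attempt: '3' is rejected (it follows '$'), but the match then
# starts at '0', which is preceded by '3' rather than '$'.
naive = re.findall(r'(?<!\$)\d+', sample)
print(naive)

# Adding \b forces the number to start at a word boundary, so '30' is
# rejected as a whole.
careful = re.findall(r'(?<!\$)\b\d+', sample)
print(careful)
```

The assertion only constrains the single position where it is checked, so it usually needs to be combined with an anchor such as `\b`.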

## Common usage

In the examples above we used `search()` to match content. In practice this is less common, because we usually care more about the captured groups than about the whole match. More often we use `findall()` to match multiple results, as in our opening example. Taking the same string, the following code outputs the personal email, blog, and official account:

    import re

    text = 'My personal email is [email protected], my personal blog is cuiqingcai.com, my personal official account is Attack Coder'
    results = re.findall(r'personal (.*?) is (.*?)(?=, |\Z)', text)
    for result in results:
        print(result[0] + ': ' + result[1])

Here we match the word `personal`, then a non-greedy group, then the word `is`, then another non-greedy group. The key is the end marker: it must be a zero-width assertion for all three results to be found. Its content is `, |\Z`, meaning a comma followed by a space, or the end of the string.

The output is as follows:

    email: [email protected]
    blog: cuiqingcai.com
    official account: Attack Coder

In this way we output the email, blog, and official account, and the matching is smooth and convenient.
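Because `findall()` returns a list of two-element tuples when the pattern has two groups, the result converts directly into a dictionary. A minimal sketch, using a made-up string with placeholder values (`me@example.com` and `example.com` are assumptions for illustration):

```python
import re

sample = 'My personal email is me@example.com, my personal blog is example.com'

# Each match yields a (label, value) tuple, which dict() accepts as-is.
found = re.findall(r'personal (.*?) is (.*?)(?=, |\Z)', sample)
info = dict(found)

print(found)
print(info['blog'])
```

This keeps the extraction in a single expression and makes the values available by name afterwards.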

## Conclusion

After this section, you should have a general understanding of the basic usage and typical applications of zero-width assertions in regular expressions. Once you understand them, regular expression matching becomes much more comfortable.

This resource was first published on Cui Qingcai's personal blog, Jingmi: Python3 Web Crawler Development in Practice tutorial | Jingmi

For more crawler-related content, please follow my personal WeChat official account: Attack Coder
