Python crawlers - Urllib and RE modules (Part 1)


The following article is from Tencent Cloud by keinYe

The main purpose of a crawler is to fetch pages from a website and parse out the useful information they contain. Fetching web content can be done with Python's built-in urllib module (urllib2 in the Python 2 code used here). Parsing the information is more complicated, and Python offers many modules for it. Today we use regular expressions, through Python's built-in re module, to parse the data. Together, urllib and re let us build a crawler that reads data from a web page.

Identify the target

Our goal is to obtain the tiered ("ladder") prices of components on Lichuang Mall, where different purchase quantities correspond to different unit prices. The data we want is the price table shown on each product page.

Implementation plan

First, we request the product page (item.szlcsc.com/213095.html…) and read its content with the urllib2.urlopen method:

url = 'https://item.szlcsc.com/213095.html'
response = urllib2.urlopen(url)
html_text = response.read().decode('utf-8')
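For readers on Python 3, where urllib2 no longer exists, the same fetch can be sketched with urllib.request. This is a minimal sketch rather than the article's code: the helper name fetch_html, the timeout, and the User-Agent header are assumptions added for illustration.

```python
import urllib.request


def fetch_html(url, timeout=10.0, encoding="utf-8"):
    """Download a URL and return the decoded body (hypothetical helper)."""
    # Assumption: a browser-like User-Agent is sent because some sites reject
    # the default one; whether this particular site needs it is not verified.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    # The response object is a context manager, so it is closed automatically.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode(encoding)
```

Calling fetch_html('https://item.szlcsc.com/213095.html') would return the page text that the regular expressions then search.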

To extract the tiered prices, look at the corresponding HTML. Each price tier is a separate table row (a tr element), and the quantity and its price within each row are held in table cells (td elements). We can use the following regular expressions to extract the quantity and the price.

# Regular expression that extracts each price tier (one table row)
'(.*?) '
# Extract the quantity in a row
'(.*?) '
# Extract the price in a row
"(.*?) "
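The row-then-cell approach described above can be sketched on a small HTML sample. The markup in SAMPLE_HTML, the two patterns, and the name extract_rows are hypothetical and chosen for illustration; the real page's tags and attributes may differ. The non-greedy (.*?) stops at the first closing tag, and re.S lets the dot match newlines so a row can span several lines.

```python
import re

# Hypothetical markup, not copied from the real page.
SAMPLE_HTML = """
<table>
  <tr class="tier">
    <td>1~9</td><td>9.21</td>
  </tr>
  <tr class="tier">
    <td>10~29</td><td>6.81</td>
  </tr>
</table>
"""

# Assumed patterns: one captures the inside of each row, one each cell.
ROW_RE = re.compile(r"<tr[^>]*>(.*?)</tr>", re.S)
CELL_RE = re.compile(r"<td[^>]*>(.*?)</td>", re.S)


def extract_rows(html):
    """Return a list of rows, each a list of stripped cell strings."""
    rows = []
    for row_html in ROW_RE.findall(html):
        rows.append([cell.strip() for cell in CELL_RE.findall(row_html)])
    return rows


if __name__ == "__main__":
    print(extract_rows(SAMPLE_HTML))
```

Matching rows first and cells second keeps each pattern short, at the cost of relying on the page's markup staying stable.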

Now let’s look at the complete program

# -*- coding:utf-8 -*-

import urllib2
import re


def find_number(row):
    """Get the quantity range in a table row."""
    res = r'(.*?) '
    find_str = re.findall(res, row, re.S)[0]
    # Remove the units
    res_2 = '[1-9]{1}[\\d ~\\s]*\\d'
    find_str = re.findall(res_2, find_str, re.S)[0]
    # Remove whitespace from the string
    strinfo = re.compile('[\\s]')
    return re.sub(strinfo, '', find_str)


def find_price(row):
    """Get the price in a table row."""
    res = r"(.*?) "
    find_str = re.findall(res, row, re.S)
    # Return 'None' if the row has no price
    if len(find_str):
        # Remove the units from the price
        res_2 = '[1-9]{1}[\\d\\.]*'
        find_str = re.findall(res_2, find_str[0], re.S)
        return find_str[0]
    else:
        return 'None'


url = 'https://item.szlcsc.com/213095.html'
# Read the web page and decode its content
response = urllib2.urlopen(url)
html_text = response.read().decode('utf-8')
# Each match is the content of one price-tier row
res_tr = r'(.*?) '
m_tr = re.findall(res_tr, html_text, re.S)
print '%4s | %10s | %5s' % ('No.', 'Quantity', 'Price')
print "-------------------------"
for n, value in enumerate(m_tr):
    print '%4d | %10s | %5s' % (n + 1, find_number(value), find_price(value))
    print "-------------------------"
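The unit-stripping step in find_number and find_price can be tried in isolation. The sketch below is written for Python 3; the cell texts in the tests and the names clean_quantity and clean_price are hypothetical. The quantity pattern is the article's, written as a raw string. The price pattern is a swapped-in alternative, not the article's: the original starts matching at the first digit from 1 to 9, so a price such as 0.85 would come back as 85.

```python
import re

# The article's quantity pattern as a raw string. It needs at least two
# characters, so a single-digit quantity such as "5" does not match.
QUANTITY_RE = re.compile(r"[1-9][\d~\s]*\d")
# Swapped-in price pattern (an assumption): digits with an optional
# fractional part, so a leading zero is kept.
PRICE_RE = re.compile(r"\d+(?:\.\d+)?")


def clean_quantity(cell_text):
    """Return the quantity range without units or whitespace, or None."""
    match = QUANTITY_RE.search(cell_text)
    if match is None:
        return None
    return re.sub(r"\s", "", match.group(0))


def clean_price(cell_text):
    """Return the first number in the cell as a string, or None."""
    match = PRICE_RE.search(cell_text)
    return match.group(0) if match else None
```

Returning None instead of indexing [0] on an empty result avoids an IndexError when a cell has no number at all.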

The inspection results

With the code written, let's verify that it works. Running the program above produces the following output:

 No. |   Quantity | Price
-------------------------
   1 |        1~9 |  9.21
-------------------------
   2 |      10~29 |  6.81
-------------------------
   3 |      30~99 |  6.37
-------------------------
   4 |    100~499 |  5.93
-------------------------
   5 |    500~999 |  5.73
-------------------------
   6 |       1000 |  5.64
-------------------------

Compare the execution results with the web page information we saw earlier, and you can see that the program executed correctly and got the correct results.

Next, we change the address to item.szlcsc.com/8796.html,…

At this point, we execute the program again and get the following results:

 No. |   Quantity | Price
-------------------------
   1 |        1~9 | 13.82
-------------------------
   2 |      10~29 | 11.75
-------------------------
   3 |      30~99 | 11.37
-------------------------
   4 |    100~499 | 10.99
-------------------------
   5 |    500~999 | 10.82
-------------------------
   6 |  1000~1999 | 10.61
-------------------------
   7 |       2000 |  None
-------------------------

As you can see, the above results are exactly the same as in the web page, and the code does what we intended.
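The table layout, including the None shown for a tier without a listed price, can be reproduced separately from the scraping. The sketch below targets Python 3 and uses the same format strings as the program; the function name format_table, the English column labels, and the sample rows are assumptions for illustration.

```python
SEPARATOR = "-" * 25


def format_table(rows):
    """Format (quantity, price) pairs as a table; price may be None."""
    lines = ["%4s | %10s | %5s" % ("No.", "Quantity", "Price"), SEPARATOR]
    for n, (quantity, price) in enumerate(rows, start=1):
        # A missing price is shown as the text 'None', as in the program.
        shown_price = "None" if price is None else price
        lines.append("%4d | %10s | %5s" % (n, quantity, shown_price))
        lines.append(SEPARATOR)
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_table([("1000~1999", "10.61"), ("2000", None)]))
```

Building the text first and printing it once makes the layout easy to check without fetching a page.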

Note: This code was verified with Python 2.7.10.
