When we talk about web crawlers they may sound mysterious, but in fact they are not nearly as magical as we imagine; today we will lift that veil. A simple web weather crawler can be implemented in two steps.

A crawler simply consists of two parts:

1. Obtain the text of the web page.

2. Parse it to extract the data we want.



1. Get the text of the web page.

Python is very handy for fetching HTML: with the help of the urllib library, you only need a few lines of code.

# Import the urllib library
import urllib

def getHtml(url):
    # Open the URL, read the page source, and return it
    page = urllib.urlopen(url)
    html = page.read()
    page.close()
    return html

This returns the source code of the web page, which is the HTML code.
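Note that the snippet above relies on Python 2's urllib.urlopen. On Python 3 the same call lives in urllib.request; a minimal sketch of the equivalent fetch, where the URL and the utf-8 encoding are only placeholder assumptions:

# Python 3 version: urlopen moved into urllib.request
from urllib.request import urlopen

def getHtml(url):
    page = urlopen(url)
    # read() returns bytes in Python 3, so decode it to a string
    # (the page's real encoding may differ from utf-8)
    html = page.read().decode('utf-8')
    page.close()
    return html

# Example call; replace the URL with the weather page you actually want to crawl
# html = getHtml('http://example.com/weather')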

So how do we get the information we want out of it? For that we need the most commonly used tool in web page analysis: regular expressions.

2. Use regular expressions to extract what you need.

When using regular expressions, you need to carefully observe the information structure of the web page and write the correct regular expression.

Python regular expressions are also concise:

# Import the regular expression library
import re

def getWeather(html):
    # The HTML tags that surround each group are specific to the weather page
    # and are omitted here; the three groups capture the city, the lowest
    # temperature and the highest temperature
    reg = '(.*?).*?(.*?).*?(.*?)'
    weatherList = re.compile(reg).findall(html)
    return weatherList

Here reg is the regular expression and html is the text obtained in the first step. findall finds all the substrings in html that match the regular expression and puts them into weatherList. We can then enumerate the data in weatherList and print it out.
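For example, assuming each entry in weatherList is a tuple of (city, lowest, highest), which depends on how your regular expression is written, printing the results could look like this:

# Iterate over the matched tuples and print each city's temperatures
for city, low, high in weatherList:
    print(city, low, high)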

There are two things to note about the regular expression reg here.

One is "(.*?)". Whatever is inside the parentheses is what we get back, and if there are multiple pairs of parentheses, each result returned by findall will contain the contents of all of them. There are three pairs of parentheses above, corresponding to the city, the lowest temperature, and the highest temperature.

The other is ".*?". Python regular expression matching is greedy by default, that is, it matches as many characters as possible. Putting a question mark after the quantifier switches it to non-greedy mode, which matches as few characters as possible. Here we need to match the information for many cities, so non-greedy mode is required; otherwise there would be only a single, incorrect match.
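A small self-contained demonstration of the difference, using a made-up HTML fragment rather than the real weather page:

import re

# Hypothetical fragment with two cities; the real page structure will differ
html = '<b>Beijing</b><b>Shanghai</b>'

greedy = re.findall('<b>(.*)</b>', html)    # greedy: one long match
lazy = re.findall('<b>(.*?)</b>', html)     # non-greedy: one match per city

print(greedy)   # ['Beijing</b><b>Shanghai']
print(lazy)     # ['Beijing', 'Shanghai']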