Small knowledge, big challenge! This article is part of the "Essential Tips for Programmers" writing activity.

This post is a crawler case study. The target site is www.weather.com.cn/, and the goal is to crawl the current weather conditions for a specified city.

Analysis of the website

First, open the page that holds the target data, taking Jinan, Shandong Province as an example: www.weather.com.cn/weather1d/1…

Obviously there are a lot of charts on this page, and most of the data is loaded dynamically. For this kind of page, press F12 to open developer tools, switch to the Network tab, and capture the traffic. On refreshing the page, the browser sends many requests, and the request headers and bodies can be seen clearly. Among them I found this request: www.weather.com.cn/weather1d/1…

Checking the Response, we find that the weather information we want is in the response body.

After analyzing the site, we know we can get all the weather information we want: just send the request, then parse the page with certain rules to pull out the data we need.
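
The "certain rules" here are just string patterns applied to the page source. As a minimal offline sketch of that idea (the HTML fragment and the `hidden_title` attribute below are made up for illustration, not the site's real markup), extracting a value with a regular expression looks like this:

```python
import re

# Made-up fragment standing in for the downloaded page body;
# the real weather.com.cn markup differs, so treat this as illustration only.
sample_html = '<input type="hidden" id="hidden_title" value="Jinan: sunny, 25C" />'

# Compile a pattern whose single capture group grabs the value attribute.
pattern = re.compile(r'id="hidden_title" value="(.*?)"')
matches = pattern.findall(sample_html)
print(matches[0])  # → Jinan: sunny, 25C
```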

Code implementation

Here I parse the page with `re` regular expressions; of course, other approaches work too. The code is not difficult and has almost no pitfalls, so I won't walk through it and will just post it directly:

import requests
import re


class WeatherSpider(object):
    """Weather crawler"""

    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/81.0.4044.43 Safari/537.36'
        }
        self.url = 'http://www.weather.com.cn/weather1d/101120101.shtml'

    def spider(self):
        response = requests.get(url=self.url, headers=self.headers)
        content = response.content.decode('utf-8')
        # Use re.compile() to build pattern objects from regular-expression strings.
        # Note: the HTML tags inside the original patterns were stripped when this
        # article was published, so the patterns below are left as placeholders.
        re_weather = re.compile(' ')
        re_update_time = re.compile(' ')
        weather = re_weather.findall(content)
        update_time = re_update_time.findall(content)
        print(weather[0])
        print('Last Update at:', update_time[0])
        # The HTML tags inside this pattern were also stripped in publishing;
        # only the capture-group skeleton survives.
        # re.S makes . match all characters, including newlines.
        life_index = re.compile('.(.*?)\n(.*?)\n'
                                '(.*?)'
                                '.*?\n', re.S)
        more_weather = life_index.findall(content)
        # print(more_weather)
        for item in more_weather:
            print(item[1] + ':', item[0] + ",", item[2])

Get city number

The example above hard-codes Jinan, so someone may ask: how do we crawl the weather for a specified city? Looking carefully at the link www.weather.com.cn/weather1d/1…, we find the string of digits 101120101, so a reasonable guess is that 101120101 is the site's number for Jinan.

The facts bear this out: each city has a different number, so as long as we get the number of a specified city, we can get that city's weather. How, then, do we get a city's number?
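
If the guess holds, substituting a different number into the same URL template should fetch a different city's page. A tiny sketch of that substitution, using only the number already observed in this article's URL:

```python
# URL template with a %s placeholder for the city number,
# matching the weather1d link structure seen above.
url_template = 'http://www.weather.com.cn/weather1d/%s.shtml'

# 101120101 is Jinan's number, as observed in the link.
jinan_url = url_template % '101120101'
print(jinan_url)  # → http://www.weather.com.cn/weather1d/101120101.shtml
```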

I found the city search bar on the home page. Searching sends a request to this link toy1.weather.com.cn/search?city… for Jinan:

I tried it with Postman, and sure enough, the response contains Jinan's number, 101120101.

toy1.weather.com.cn/search?city…

The following code emulates this request and then processes the response body to extract the city number.

def get_no(cityname):
    response = requests.get(url="http://toy1.weather.com.cn/search?cityname=" + cityname)
    content = response.content.decode('utf-8')
    # The body is a JSON-like list; eval() turns it into Python objects.
    content_t = eval(content)[0]["ref"]
    print(content_t, type(content_t))
    # The city number is the part of "ref" before the first ~
    city_no = content_t.split("~")[0]
    print(city_no)
    return city_no
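
One design note: `eval()` on a response body executes whatever the server sends, so `json.loads` is the safer choice when the body is valid JSON. A sketch of the same extraction under that assumption (the sample string below is illustrative, not a captured response):

```python
import json

# Illustrative stand-in for the search endpoint's response body;
# the real payload may carry extra fields.
sample_body = '[{"ref": "101120101~jinan~Jinan"}]'

records = json.loads(sample_body)          # parse instead of eval()
city_no = records[0]["ref"].split("~")[0]  # number is the part before the first ~
print(city_no)  # → 101120101
```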

Code integration

Let's put the two pieces of code together:

import requests
import re


class WeatherSpider(object):
    """Weather crawler"""

    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/81.0.4044.43 Safari/537.36'
        }
        self.url = 'http://www.weather.com.cn/weather1d/%s.shtml'
        self.no_url = "http://toy1.weather.com.cn/search?cityname="

    def spider(self, city_no):
        try:
            print("Send request:", self.url % city_no)
            response = requests.get(url=self.url % city_no, headers=self.headers)
            content = response.content.decode('utf-8')
            # Use re.compile() to build pattern objects from regular-expression strings.
            # Note: the HTML tags inside the original patterns were stripped when this
            # article was published, so the patterns below are left as placeholders.
            re_weather = re.compile(' ')
            re_update_time = re.compile(' ')
            weather = re_weather.findall(content)
            update_time = re_update_time.findall(content)
            print(weather[0])
            print('Last Update at:', update_time[0])
            # The HTML tags inside this pattern were also stripped in publishing;
            # only the capture-group skeleton survives.
            # re.S makes . match all characters, including newlines.
            life_index = re.compile('.(.*?)\n(.*?)\n'
                                    '(.*?)'
                                    '.*?\n', re.S)
            more_weather = life_index.findall(content)
            # print(more_weather)
            for item in more_weather:
                print(item[1] + ':', item[0] + ",", item[2])
        except Exception as e:
            print("Abnormal", e)
            return False

    def get_no(self, cityname):
        try:
            print("Send request:", self.no_url + cityname)
            response = requests.get(url=self.no_url + cityname, headers=self.headers)
            content = response.content.decode('utf-8')
            if len(eval(content)) <= 0:
                return False
            content_t = eval(content)[0]["ref"]
            # print(content_t, type(content_t))
            city_no = content_t.split("~")[0]
            print("%s No.:" % cityname, city_no)
            return city_no
        except Exception as e:
            print("Abnormal", e)
            return False


if __name__ == '__main__':
    while True:
        cityname = input("Please enter a city name (press Q to end the program):")
        if cityname == "Q":
            break
        w_spider = WeatherSpider()
        city_no = w_spider.get_no(cityname)
        if city_no:
            w_spider.spider(city_no)
        else:
            print("Input error")

Because China Weather Network's anti-crawling measures are not very strict, the whole process was relatively smooth. In fact, the site also exposes request links for 7-day, 15-day, and 40-day forecasts; I won't cover those here, but interested readers can dig into them.

Finally, thanks to my girlfriend for her tolerance, understanding, and support in work and life!