Small knowledge, big challenge! This article is part of the "Essential Tips for Programmers" writing activity.

This post is a crawler case study. The target site is www.weather.com.cn/, and the goal is to crawl the current weather conditions for a specified city.

Analysis of the website

First, open the page that holds the target data, taking Jinan, Shandong Province as an example: www.weather.com.cn/weather1d/1…

Obviously there are a lot of charts on this page, and most of the data is loaded dynamically. For this kind of page, press F12 to open developer tools, switch to the Network tab, and capture the traffic. On refreshing the page, the browser sends many requests, and the request headers and bodies can be seen clearly. Among them I found this request: www.weather.com.cn/weather1d/1…

Checking the Response, we find that the weather information we want is in the response body.

After analyzing the site, we know we can get all the weather information we want: just send the request, then parse the page with certain rules to pull out the data we need.
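
The "certain rules" here are just string patterns applied to the page source. As a minimal offline sketch of that idea (the HTML fragment and the `hidden_title` attribute below are made up for illustration, not the site's real markup), extracting a value with a regular expression looks like this:

```python
import re

# Made-up fragment standing in for the downloaded page body;
# the real weather.com.cn markup differs, so treat this as illustration only.
sample_html = '<input type="hidden" id="hidden_title" value="Jinan: sunny, 25C" />'

# Compile a pattern whose single capture group grabs the value attribute.
pattern = re.compile(r'id="hidden_title" value="(.*?)"')
matches = pattern.findall(sample_html)
print(matches[0])  # → Jinan: sunny, 25C
```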

Code implementation

Here I parse the page with `re` regular expressions; of course, other approaches work too. The code is not difficult and has almost no pitfalls, so I won't walk through it and will just post it directly:

import requests
import re


class WeatherSpider(object):
    """Weather crawler"""

    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/81.0.4044.43 Safari/537.36'
        }
        self.url = 'http://www.weather.com.cn/weather1d/101120101.shtml'

    def spider(self):
        response = requests.get(url=self.url, headers=self.headers)
        content = response.content.decode('utf-8')
        # Use re.compile() to build pattern objects from regular-expression strings.
        # Note: the HTML tags inside the original patterns were stripped when this
        # article was published, so the patterns below are left as placeholders.
        re_weather = re.compile(' ')
        re_update_time = re.compile(' ')
        weather = re_weather.findall(content)
        update_time = re_update_time.findall(content)
        print(weather[0])
        print('Last Update at:', update_time[0])
        # The HTML tags inside this pattern were also stripped in publishing;
        # only the capture-group skeleton survives.
        # re.S makes . match all characters, including newlines.
        life_index = re.compile('.(.*?)\n(.*?)\n'
                                '(.*?)'
                                '.*?\n', re.S)
        more_weather = life_index.findall(content)
        # print(more_weather)
        for item in more_weather:
            print(item[1] + ':', item[0] + ",", item[2])

Get city number

The example above hard-codes Jinan, so someone may ask: how do we crawl the weather for a specified city? Looking carefully at the link www.weather.com.cn/weather1d/1…, we find the string of digits 101120101, so a reasonable guess is that 101120101 is the site's number for Jinan.

The facts bear this out: each city has a different number, so as long as we get the number of a specified city, we can get that city's weather. How, then, do we get a city's number?
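
If the guess holds, substituting a different number into the same URL template should fetch a different city's page. A tiny sketch of that substitution, using only the number already observed in this article's URL:

```python
# URL template with a %s placeholder for the city number,
# matching the weather1d link structure seen above.
url_template = 'http://www.weather.com.cn/weather1d/%s.shtml'

# 101120101 is Jinan's number, as observed in the link.
jinan_url = url_template % '101120101'
print(jinan_url)  # → http://www.weather.com.cn/weather1d/101120101.shtml
```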

I found the city search bar on the home page. Searching sends a request to this link toy1.weather.com.cn/search?city… for Jinan:

I tried it with Postman, and sure enough, the response contains Jinan's number, 101120101.

toy1.weather.com.cn/search?city…

The following code emulates this request and then processes the response body to extract the city number.

def get_no(cityname):
    response = requests.get(url="http://toy1.weather.com.cn/search?cityname=" + cityname)
    content = response.content.decode('utf-8')
    # The body is a JSON-like list; eval() turns it into Python objects.
    content_t = eval(content)[0]["ref"]
    print(content_t, type(content_t))
    # The city number is the part of "ref" before the first ~
    city_no = content_t.split("~")[0]
    print(city_no)
    return city_no
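
One design note: `eval()` on a response body executes whatever the server sends, so `json.loads` is the safer choice when the body is valid JSON. A sketch of the same extraction under that assumption (the sample string below is illustrative, not a captured response):

```python
import json

# Illustrative stand-in for the search endpoint's response body;
# the real payload may carry extra fields.
sample_body = '[{"ref": "101120101~jinan~Jinan"}]'

records = json.loads(sample_body)          # parse instead of eval()
city_no = records[0]["ref"].split("~")[0]  # number is the part before the first ~
print(city_no)  # → 101120101
```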

Code integration

Let's put the two pieces of code together:

import requests
import re


class WeatherSpider(object):
    """Weather crawler"""

    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/81.0.4044.43 Safari/537.36'
        }
        self.url = 'http://www.weather.com.cn/weather1d/%s.shtml'
        self.no_url = "http://toy1.weather.com.cn/search?cityname="

    def spider(self, city_no):
        try:
            print("Send request:", self.url % city_no)
            response = requests.get(url=self.url % city_no, headers=self.headers)
            content = response.content.decode('utf-8')
            # Use re.compile() to build pattern objects from regular-expression strings.
            # Note: the HTML tags inside the original patterns were stripped when this
            # article was published, so the patterns below are left as placeholders.
            re_weather = re.compile(' ')
            re_update_time = re.compile(' ')
            weather = re_weather.findall(content)
            update_time = re_update_time.findall(content)
            print(weather[0])
            print('Last Update at:', update_time[0])
            # The HTML tags inside this pattern were also stripped in publishing;
            # only the capture-group skeleton survives.
            # re.S makes . match all characters, including newlines.
            life_index = re.compile('.(.*?)\n(.*?)\n'
                                    '(.*?)'
                                    '.*?\n', re.S)
            more_weather = life_index.findall(content)
            # print(more_weather)
            for item in more_weather:
                print(item[1] + ':', item[0] + ",", item[2])
        except Exception as e:
            print("Abnormal", e)
            return False

    def get_no(self, cityname):
        try:
            print("Send request:", self.no_url + cityname)
            response = requests.get(url=self.no_url + cityname, headers=self.headers)
            content = response.content.decode('utf-8')
            if len(eval(content)) <= 0:
                return False
            content_t = eval(content)[0]["ref"]
            # print(content_t, type(content_t))
            city_no = content_t.split("~")[0]
            print("%s No.:" % cityname, city_no)
            return city_no
        except Exception as e:
            print("Abnormal", e)
            return False


if __name__ == '__main__':
    while True:
        cityname = input("Please enter a city name (press Q to end the program):")
        if cityname == "Q":
            break
        w_spider = WeatherSpider()
        city_no = w_spider.get_no(cityname)
        if city_no:
            w_spider.spider(city_no)
        else:
            print("Input error")

Because China Weather Network's anti-crawling measures are not very strict, the whole process was relatively smooth. In fact, the site also exposes request links for 7-day, 15-day, and 40-day forecasts; I won't cover those here, but interested readers can dig into them.

Finally, thanks to my girlfriend for her tolerance, understanding, and support in work and life!