Preface

I thought about reaching for Scrapy, but it is not that hard to pick up; you can get most of the way just by following the official documentation, especially the first tutorial. On the other hand, if you don't have a solid crawler foundation, Scrapy won't save you either, and just installing it is a problem for a lot of people because of some historical baggage; it is, after all, an old framework dating back to Python 2. There is one more reason to skip it: if I were going to build a crawler with Scrapy, it would be a distributed crawler, but what I'm building here is really just a client, a single spider, so that would be overkill.

The target

Today we are going to fetch the weather using the China Weather network's API.

BaseUrl = "http://wthrcdn.etouch.cn/weather_mini?city={}"

There are plenty of weather crawlers on the Internet, including ones that scrape the China Weather site directly, but I have never understood why you would load the web page and then pick it apart with XPath or regular expressions. The page is clearly rendered from this same API. Why reverse-parse the rendered result to get at the data when you can simply request the data itself?

The request format

Back to our interface: it is a plain GET request, with the city name or city code passed in the city field, and the response is JSON. Converted into a dictionary, it looks like this:

{'data':
    {'yesterday':
        {'date': 'Saturday the 5th', 'high': 'high temperature 16℃', 'fx': 'Northeast wind',
         'low': 'low temperature 9℃', 'fl': '...', 'type': 'cloudy'},
     'city': 'jiujiang',
     'forecast': [
        {'date': 'Sunday the 6th', 'high': 'high temperature 12℃', 'fengli': '...',
         'low': 'low temperature 7℃', 'fengxiang': 'Northeast wind', 'type': 'rain'},
        {'date': 'Monday the 7th', 'high': 'high temperature 14℃', 'fengli': '...',
         'low': 'low temperature 7℃', 'fengxiang': 'the north', 'type': 'cloudy'},
        {'date': 'Tuesday the 8th', 'high': 'high temperature 19℃', 'fengli': '...',
         'low': 'low temperature 8℃', 'fengxiang': 'Southeast wind', 'type': 'or'},
        {'date': 'Wednesday the 9th', 'high': 'high temperature 21℃', 'fengli': '...',
         'low': 'low temperature 11℃', 'fengxiang': 'Southeast wind', 'type': 'or'},
        {'date': 'Thursday the 10th', 'high': 'high temperature 23℃', 'fengli': '...',
         'low': 'low temperature 11℃', 'fengxiang': 'the south', 'type': 'cloudy'}],
     'ganmao': 'Common cold season, appropriate to reduce the frequency of going out, appropriate hydration, appropriate clothing.',
     'wendu': '8'},
 'status': 1000,
 'desc': 'OK'}
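Before building anything on top of it, a quick sanity check of the interface. This is a minimal sketch; the city name and the printed fields are taken from the sample response above:

import requests

BaseUrl = "http://wthrcdn.etouch.cn/weather_mini?city={}"

# One plain GET request; the city goes straight into the query string
resp = requests.get(BaseUrl.format("Jiujiang"), timeout=3)
data = resp.json()

print(data["status"], data["desc"])                 # 1000 OK on success
print(data["data"]["city"], data["data"]["wendu"])  # city and current temperature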

Request limits

I have to say, this China Weather network interface is the GOAT (YYDS): it is completely unthrottled. Why does that matter? What I need is the weather for the whole country, county seats included, and there are thousands of county seats in China; the data also has to be collected in several stages, so we are talking at least 20,000 requests a day to start with. If there were limits, we would have to fight through anti-crawling measures, but in my tests there are none.

Synchronous access with requests

So let's do a comparison; no comparison, no harm, right? Since the code is simple, I'll go straight to it.

import time

import requests
from datetime import datetime


class GetWeather(object):

    urlWheather = "http://wthrcdn.etouch.cn/weather_mini?city={}"
    requests = requests
    error = {}
    today = datetime.today().day
    weekday = datetime.today().weekday()
    week = {0: "Monday", 1: "Tuesday", 2: "Wednesday", 3: "Thursday",
            4: "Friday", 5: "Saturday", 6: "Sunday"}

    def __getday(self) -> str:
        # Build a date string in the same format as the API's 'date' field
        day = str(self.today) + "Day" + self.week.get(self.weekday)
        return day

    def get_today_weather(self, city: str) -> dict:
        # Pull the full forecast, then pick out today's entry
        data = self.getweather(city)
        data = data.get("data").get("forecast")
        today = self.__getday()
        for today_w in data:
            if today_w.get("date") == today:
                return today_w

    def getweather(self, city: str, timeout: int = 3) -> dict:
        url = self.urlWheather.format(city)
        try:
            resp = self.requests.get(url, timeout=timeout)
            jsondata = resp.json()
            return jsondata
        except Exception:
            self.error['error'] = "Weather acquisition anomaly"
            return self.error

    def getweathers(self, citys: list, timeout: int = 3):
        # Fetch several cities in one call; returns {city: raw json}
        wheathers_data = {}
        for city in citys:
            url = self.urlWheather.format(city)
            try:
                resp = self.requests.get(url=url, timeout=timeout)
                wheather_data = resp.json()
                wheathers_data[city] = wheather_data
            except Exception:
                self.error['error'] = "Weather acquisition anomaly"
                return self.error

        return wheathers_data


if __name__ == '__main__':
    getweather = GetWeather()

    start = time.time()
    times = 1
    for i in range(5000):
        data = getweather.get_today_weather("Jiujiang")
        if times % 100 == 0:
            print(data, "request no.", times)
        times += 1

    print("Made", times - 1, "requests in", time.time() - start, "seconds")

This code is just a simple wrapper. Let's see how long 5,000 requests take.

Here I requested the same city, Jiujiang, 5,000 times.

Asynchronous access

I didn't wrap this code up, so it looks a bit messy. There are a couple of caveats here.

System limits

Because async ultimately sits on top of the operating system, there is a ceiling on how many connections you can have in flight at once, and the event loop has to switch between coroutines. It looks a bit like Python's own multithreading, except that this "multithreading" only switches when a coroutine hits I/O; otherwise it never yields. So you have to cap the concurrency yourself.
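Here is a minimal sketch of that capping pattern in isolation; the helper name limited_fetch and the cap of 500 are my own illustrative choices, not part of the article's code:

import asyncio

async def limited_fetch(i: int, semaphore: asyncio.Semaphore) -> int:
    # Only as many coroutines as the semaphore allows get past this line at once
    async with semaphore:
        await asyncio.sleep(0.01)  # stand-in for the real network I/O
        return i

async def main() -> None:
    semaphore = asyncio.Semaphore(500)  # stay below the OS connection ceiling
    results = await asyncio.gather(*(limited_fetch(i, semaphore) for i in range(5000)))
    print(len(results), "tasks finished")

asyncio.run(main())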

The code

import time

import aiohttp
from datetime import datetime
import asyncio

BaseUrl = "http://wthrcdn.etouch.cn/weather_mini?city={}"

WeekIndex = {0: "Monday", 1: "Tuesday", 2: "Wednesday", 3: "Thursday",
             4: "Friday", 5: "Saturday", 6: "Sunday"}

today = datetime.today().day
day = str(today) + "Day" + WeekIndex.get(datetime.today().weekday())

TIMES = 0

async def request(city: str, semaphore: asyncio.Semaphore, timeout: int = 3):
    url = BaseUrl.format(city)
    try:
        async with semaphore:
            async with aiohttp.request("GET", url,
                                       timeout=aiohttp.ClientTimeout(total=timeout)) as resp:
                # The API does not return an application/json content type,
                # so disable aiohttp's content-type check
                data = await resp.json(content_type=None)
                return data
    except Exception as e:
        raise e


def getweather(task):
    data = task.result()
    return data

def get_today_weather(task):
    global TIMES
    data = task.result()  # get the result of the finished task

    data = data.get("data").get("forecast")

    for today_w in data:
        if today_w.get("date") == day:
            # += is safe here: coroutines only switch at await points,
            # so this counter cannot be interrupted mid-update
            TIMES += 1
            if TIMES % 100 == 0:
                print(today_w, "request no.", TIMES)
            return today_w


if __name__ == '__main__':
    # The OS caps concurrent connections (roughly 509 on Windows, 1024 on Linux),
    # so keep the semaphore below that
    semaphore = asyncio.Semaphore(500)
    start = time.time()
    tasks = []
    for i in range(5000):
        c = request("Jiujiang", semaphore, 3)
        task = asyncio.ensure_future(c)
        task.add_done_callback(get_today_weather)
        tasks.append(task)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))
    print("Took", time.time() - start, "seconds")
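One caveat if you run this on a newer interpreter: calling asyncio.get_event_loop() with no loop running has been deprecated since Python 3.10. A sketch of the same driver restructured around asyncio.run, reusing the request and get_today_weather definitions above:

async def main() -> None:
    # Create the semaphore inside the running loop
    semaphore = asyncio.Semaphore(500)
    tasks = []
    for i in range(5000):
        task = asyncio.ensure_future(request("Jiujiang", semaphore, 3))
        task.add_done_callback(get_today_weather)
        tasks.append(task)
    await asyncio.wait(tasks)

if __name__ == '__main__':
    start = time.time()
    asyncio.run(main())
    print("Took", time.time() - start, "seconds")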