“This is the 19th day of my participation in the November Gwen Challenge. See details: The Last Gwen Challenge in 2021.”

Preface

Using Python to crawl street photos from Toutiao (Today's Headlines). Without further ado, let's get started.

The development tools

Python version: 3.6.4

Related modules:

re module;

Requests module;

And some modules that come with Python.

Environment set up

Install Python and add it to your environment variables, then use pip to install the required modules.
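For example, assuming a standard Python 3 installation, the only third-party dependency can be installed with pip (re and json ship with Python):

```shell
# install the only third-party module used below; re and json are in the standard library
python3 -m pip install requests
```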

Detailed Browser Information

Code to get the article links:

import requests
import json
import re

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}

def get_first_data(offset):
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': 'photos',
        'autoload': 'true',
        'count': '20',
        'cur_tab': '1',
        'from': 'search_tab'
    }
    response = requests.get(url='https://www.toutiao.com/search_content/', headers=headers, params=params)
    try:
        response.raise_for_status()
        return response.text
    except Exception as exc:
        print("Fetch failed")
        return None

def handle_first_data(html):
    # parse the JSON response and yield each article link
    data = json.loads(html)
    if data and "data" in data.keys():
        for item in data.get("data"):
            yield item.get("article_url")

Call raise_for_status() on the Response object. If an error occurs downloading the file, an exception will be thrown. You’ll need to wrap the lines with try and except statements to handle this error without crashing the program.
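To see this behavior without hitting the network, here is a minimal sketch using a hand-built Response object (a testing shortcut, not something the crawler itself does):

```python
import requests

# build a Response by hand to simulate a failed download
resp = requests.models.Response()
resp.status_code = 404

try:
    resp.raise_for_status()
    print("no error")
except requests.exceptions.HTTPError as exc:
    # any 4xx/5xx status makes raise_for_status() raise HTTPError
    print("caught HTTPError")
```

A 2xx status, by contrast, makes raise_for_status() return silently, so the try/except costs nothing on success.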

Also attached is the Requests module technical documentation at http://cn.python-requests.org/zh_CN/latest/
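For reference, requests builds the query string from the `params` dict roughly the way the standard library's `urlencode` does; a minimal sketch (offset 0 as an example):

```python
from urllib.parse import urlencode

# the same kind of parameters the crawler sends
params = {'offset': 0, 'format': 'json', 'keyword': 'photos', 'count': '20'}
query = urlencode(params)
print('https://www.toutiao.com/search_content/?' + query)
```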

Code to get the image links:

def get_second_data(url):
    if url:
        try:
            response = requests.get(url, headers=headers)
            response.raise_for_status()
            return response.text
        except Exception as exc:
            print("Error fetching link")
            return None

def handle_second_data(html):
    if html:
        # extract the gallery JSON embedded in the page's inline script
        pattern = re.compile(r'gallery: JSON.parse\((.*?) \], ', re.S)
        result = re.search(pattern, html)
        if result:
            imageurl = []
            # the captured group is a JSON string literal, so decode it twice
            data = json.loads(json.loads(result.group(1)))
            if data and "sub_images" in data.keys():
                sub_images = data.get("sub_images")
                images = [item.get('url') for item in sub_images]
                for image in images:
                    imageurl.append(image)
                return imageurl
        else:
            print("have no result")
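To illustrate why `handle_second_data` calls `json.loads` twice: the regex captures a JSON string literal whose contents are themselves JSON. A sketch against a synthetic snippet that mimics the page's inline `gallery: JSON.parse(...)` script (the URL is made up):

```python
import re
import json

# build a synthetic page fragment: JSON text wrapped in a JSON string literal,
# as it appears inside JSON.parse(...) on the article page
inner = json.dumps({"sub_images": [{"url": "http://example.com/a.jpg"}]})
html = 'gallery: JSON.parse(' + json.dumps(inner) + ' ], '

pattern = re.compile(r'gallery: JSON.parse\((.*?) \], ', re.S)
result = re.search(pattern, html)

# first loads() unwraps the string literal, second parses the JSON inside it
data = json.loads(json.loads(result.group(1)))
print(data["sub_images"][0]["url"])  # http://example.com/a.jpg
```

`re.S` makes `.` match newlines too, which matters on the real page where the script block spans several lines.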

Code to download the images:

def download_image(imageUrl):
    for url in imageUrl:
        try:
            image = requests.get(url).content
        except Exception:
            continue
        # name the file after the last 10 characters of the URL
        with open("images" + str(url[-10:]) + ".jpg", "wb") as ob:
            ob.write(image)
            print(url[-10:] + " downloaded successfully! " + url)

def main():
    html = get_first_data(0)
    for url in handle_first_data(html):
        html = get_second_data(url)
        if html:
            result = handle_second_data(html)
            if result:
                try:
                    download_image(result)
                except KeyError:
                    print("{0} has a problem, skipped".format(result))
                    continue

if __name__ == '__main__':
    main()

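The saved filename is built from the last 10 characters of the image URL. A small self-contained sketch of that step, using a made-up URL and placeholder bytes instead of a real download:

```python
url = "http://p3.example.com/large/abcd1234ef"  # hypothetical image URL
filename = "images" + url[-10:] + ".jpg"        # -> imagesabcd1234ef.jpg

content = b"\xff\xd8\xff\xe0"  # placeholder bytes standing in for requests.get(url).content
with open(filename, "wb") as ob:
    ob.write(content)

print(filename)
```

Note that the `"images"` prefix becomes part of the filename in the current directory; saving into an `images/` folder instead would require creating the directory first and joining the path with `os.path.join`.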

Finally, the images download successfully.

Check the details