A while back a colleague happened to send me a joke Tieba as well; without further ado, dear reader, here is the link: 段友之家 (Duan You's Home): https://tieba.baidu.com/f?ie=utf-8&kw=段友之家

Then I saw that the Duan You (fellow joke fans) had posted plenty of good material there, so I wanted to crawl their pictures and videos, and that became the topic of this article:

Honestly, crawling website data with Python is about as basic as it gets. It isn't difficult, but I still want to share it with you, so we can learn and exchange ideas.

The main modules used to crawl data from these pages are bs4, requests, and os, all of them common ones.

The basic idea: request the page's HTML with the requests module, parse the response with BeautifulSoup from the bs4 package, and then pick out the addresses of the choice pictures and videos with CSS selectors. A short selector demo follows, then the main code.
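Here is that demo: a minimal, self-contained sketch. The markup is a toy snippet I made up, but the class name and the bpic attribute match what the real crawl code below looks for.

import bs4

# Toy HTML imitating Tieba's thread-list markup (illustrative only).
html = '<div class="vpic_wrap"><img bpic="https://example.com/a.jpg"></div>'
soup = bs4.BeautifulSoup(html, "html.parser")
for img in soup.select('.vpic_wrap img'):  # CSS selector: <img> tags inside .vpic_wrap
    print(img.get('bpic'))                 # prints https://example.com/a.jpg

And now the main code: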

def download_file(web_url):
    """Download one page, save its images and videos, then follow the next-page link."""
    # Fetch the page
    print('Downloading web page: %s...' % web_url)
    result = requests.get(web_url)
    soup = bs4.BeautifulSoup(result.text, "html.parser")
    # Find image resources
    img_list = soup.select('.vpic_wrap img')
    if not img_list:
        print('No image resource found!')
    else:
        # Found some; write each image out
        for img_info in img_list:
            file_url = img_info.get('bpic')
            write_file(file_url, 1)
    # Find video resources
    video_list = soup.select('.threadlist_video a')
    if not video_list:
        print('No video resource found!')
    else:
        # Found some; write each video out
        for video_info in video_list:
            file_url = video_info.get('data-video')
            write_file(file_url, 2)
    print('End of downloading resource:', web_url)
    # Follow the pager's "next" link, if there is one
    next_link = soup.select('#frs_list_pager .next')
    if not next_link:
        print('Download material finished!')
    else:
        url = next_link[0].get('href')
        download_file('https:' + url)
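A caveat from practice: some sites serve stripped-down or empty markup to the default requests User-Agent, and I would not be surprised if Tieba does too. If select() comes back empty on a page that clearly has pictures, try sending a browser-like header. A minimal sketch, assuming the header value below (it is illustrative; any mainstream browser string should do):

import requests

# Illustrative User-Agent string; swap in whatever your own browser sends.
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

result = requests.get('https://tieba.baidu.com/f?ie=utf-8&kw=段友之家', headers=HEADERS, timeout=10)
result.raise_for_status()  # fail fast on an HTTP error instead of parsing an error page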

Getting the image and video addresses is only half the job; the resources still have to be written to disk. The way to do that is to read the remote file in binary and write it into a local folder sorted by type. The main code is as follows (a note on streaming large files comes right after it):

def write_file(file_url, file_type):
    """Download one resource and write it into a folder chosen by type."""
    res = requests.get(file_url)
    res.raise_for_status()
    # Sort files into folders by type
    if file_type == 1:
        file_folder = 'nhdz\\jpg'
    elif file_type == 2:
        file_folder = 'nhdz\\mp4'
    else:
        file_folder = 'nhdz\\other'
    # Create the folder if it does not exist yet
    if not os.path.exists(file_folder):
        os.makedirs(file_folder)
    # Derive the local file name from the URL, dropping any query string
    file_name = os.path.basename(file_url)
    str_index = file_name.find('?')
    if str_index > 0:
        file_name = file_name[:str_index]
    file_path = os.path.join(file_folder, file_name)
    # Open the local file and write the resource in chunks
    print('Writing resource file:', file_path)
    with open(file_path, 'wb') as image_file:
        for chunk in res.iter_content(100000):
            image_file.write(chunk)
    print('Write complete!')
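One design note on write_file: requests.get() reads the whole response into memory before iter_content() ever runs, which is fine for pictures but wasteful for long videos. Passing stream=True makes iter_content() pull the body off the network chunk by chunk instead. A sketch of that variant (same behavior otherwise; stream_to_disk is my own name, not part of the article's code):

import requests

def stream_to_disk(file_url, file_path, chunk_size=100000):
    """Stream a remote file to disk without holding it all in memory."""
    # stream=True defers the body download; iter_content() then reads it in chunks.
    with requests.get(file_url, stream=True, timeout=30) as res:
        res.raise_for_status()
        with open(file_path, 'wb') as fh:
            for chunk in res.iter_content(chunk_size):
                fh.write(chunk)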

Finally, here's the full code. Otherwise someone could say I only tell half the story, promise goodies and never deliver the whole thing, and that just wouldn't do. Don't worry, here it is:

#! /usr/bin/env python
# -*- coding: utf-8 -*-

""Author: Cuizy Time: 2018-05-19""

import requests
import bs4
import os


def write_file(file_url, file_type):
    """Download one resource and write it into a folder chosen by type."""
    res = requests.get(file_url)
    res.raise_for_status()
    # Sort files into folders by type
    if file_type == 1:
        file_folder = 'nhdz\\jpg'
    elif file_type == 2:
        file_folder = 'nhdz\\mp4'
    else:
        file_folder = 'nhdz\\other'
    # Create the folder if it does not exist yet
    if not os.path.exists(file_folder):
        os.makedirs(file_folder)
    # Derive the local file name from the URL, dropping any query string
    file_name = os.path.basename(file_url)
    str_index = file_name.find('?')
    if str_index > 0:
        file_name = file_name[:str_index]
    file_path = os.path.join(file_folder, file_name)
    # Open the local file and write the resource in chunks
    print('Writing resource file:', file_path)
    with open(file_path, 'wb') as image_file:
        for chunk in res.iter_content(100000):
            image_file.write(chunk)
    print('Write complete!')


def download_file(web_url):
    """Download one page, save its images and videos, then follow the next-page link."""
    # Fetch the page
    print('Downloading web page: %s...' % web_url)
    result = requests.get(web_url)
    soup = bs4.BeautifulSoup(result.text, "html.parser")
    # Find image resources
    img_list = soup.select('.vpic_wrap img')
    if not img_list:
        print('No image resource found!')
    else:
        # Found some; write each image out
        for img_info in img_list:
            file_url = img_info.get('bpic')
            write_file(file_url, 1)
    # Find video resources
    video_list = soup.select('.threadlist_video a')
    if not video_list:
        print('No video resource found!')
    else:
        # Found some; write each video out
        for video_info in video_list:
            file_url = video_info.get('data-video')
            write_file(file_url, 2)
    print('End of downloading resource:', web_url)
    # Follow the pager's "next" link, if there is one
    next_link = soup.select('#frs_list_pager .next')
    if not next_link:
        print('Download material finished!')
    else:
        url = next_link[0].get('href')
        download_file('https:' + url)


# Main program entry
if __name__ == '__main__':
    web_url = 'https://tieba.baidu.com/f?ie=utf-8&kw=段友之家'  # the 段友之家 (Duan You's Home) Tieba
    download_file(web_url)
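One last design note: download_file() follows the "next page" link by calling itself, so every page adds a stack frame, and Python's default recursion limit is about 1000. On a board with thousands of pages that would eventually raise RecursionError. A sketch of an iterative variant (the resource extraction is elided; crawl_pages is my own name):

import bs4
import requests

def crawl_pages(start_url):
    """Walk the pager with a loop instead of recursion."""
    web_url = start_url
    while web_url:
        result = requests.get(web_url)
        soup = bs4.BeautifulSoup(result.text, "html.parser")
        # ... extract images and videos here, exactly as in download_file() ...
        next_link = soup.select('#frs_list_pager .next')
        web_url = 'https:' + next_link[0].get('href') if next_link else None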