• I tried to write a crawler for Pexels image content and ran into a few problems worth recording.

The main problems

  • The site has anti-crawling measures; I bypassed them with Selenium.
  • Selenium is also detected: when the browser is identified as a WebDriver, the site's JS requests return 403. The fix is to hide the webDriver identity so the anti-crawling check passes.
  • Selenium could not collect the desired links on the page (more precisely, it could collect the thumbnail links, but the thumbnails' resolution was too low). After studying the image URL patterns, I found that every image has its own ID.
  • I didn't know how to build a higher-resolution URL from that ID. Fortunately, Pexels provides a default download method: a Download link per image. Images can be fetched through it, but instead of the link itself, the crawler uses its redirect target to download the content.
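The ID-to-download-URL mapping and the filename extraction from the redirect target can be sketched as below (the photo ID and the Location URL are made-up examples for illustration):

```python
from urllib.parse import urlparse

DOWNLOAD_URL_TEMPLATE = 'https://www.pexels.com/photo/{image_id}/download/'


def build_download_url(image_id: str) -> str:
    """Turn a photo ID scraped from the page into its Download link."""
    return DOWNLOAD_URL_TEMPLATE.format(image_id=image_id)


def image_name_from_location(location: str) -> str:
    """Derive a local file name from the redirect target's URL path."""
    return urlparse(location).path.split('/')[-1]


print(build_download_url('1108099'))
# → https://www.pexels.com/photo/1108099/download/
# A HEAD request to that URL answers 302 with a Location header such as
# (hypothetical) https://images.pexels.com/photos/1108099/pexels-photo-1108099.jpeg
print(image_name_from_location('https://images.pexels.com/photos/1108099/pexels-photo-1108099.jpeg'))
# → pexels-photo-1108099.jpeg
```

The crawler below applies exactly this idea: a HEAD request to the Download link, then a GET against the Location header.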

The source code

import requests
import time
import os
import logging

from urllib.parse import urlparse
from selenium import webdriver
from multiprocessing import Pool


PEXELS_URL = 'https://www.pexels.com/'
DOWNLOAD_URL_KEY = 'https://www.pexels.com/photo/{image_id}/download/'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'
}
DOWNLOAD_LENGTH = 50  # Minimum number of page elements; the actual number of downloads may exceed this
SCROLL_HEIGHT = 2000  # Scroll pixels
SLEEP_SECONDS = 5  # Number of sleep seconds
CPU_COUNT = os.cpu_count()
logging.basicConfig(
    filename='log.txt',
    level=logging.INFO,
    filemode='w+',
    format='%(levelname)s:%(asctime)s: %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'
)
IMAGE_PATH = './images/'
os.makedirs(IMAGE_PATH, exist_ok=True)
EXISTED_IMAGES = set(os.listdir(IMAGE_PATH))


def get_image_ids():
    """Collect the image IDs from the website via Selenium."""
    browser = webdriver.Chrome(executable_path='./chromedriver')
    # Hide navigator.webdriver to avoid anti-crawling detection.
    # I wasted a lot of time on this; the answer came from
    # https://juejin.cn/post/6844904095749242887 -- thanks to the author.
    browser.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """Object.defineProperty(navigator, 'webdriver', { get: () => undefined })"""
    })

    url = PEXELS_URL
    browser.get(url)
    browser.maximize_window()
    elements = browser.find_elements_by_xpath('//article')
    scroll_height = SCROLL_HEIGHT
    while len(elements) < DOWNLOAD_LENGTH:
        browser.execute_script('window.scrollTo(0, {})'.format(scroll_height))  # Execute JS via Selenium to scroll down the page
        time.sleep(SLEEP_SECONDS)
        scroll_height += SCROLL_HEIGHT
        elements = browser.find_elements_by_xpath('//article')
    image_ids = [ele.get_attribute('data-photo-modal-medium-id') for ele in elements]
    browser.close()
    logging.info(f'image_ids: {image_ids}')
    return image_ids


def get_download_urls(image_ids):
    return [DOWNLOAD_URL_KEY.format(image_id=_id) for _id in image_ids]


def download_image(image_url):
    parse_result = urlparse(image_url)
    path = parse_result.path
    image_name = path.split('/')[-1]
    if image_name in EXISTED_IMAGES:
        logging.info(f'Image {image_name} already exists; skipping re-download')
        return None

    response = requests.get(image_url, headers=headers)
    if response.status_code != 200:
        message = 'Downloading {} failed. Status_code: {}'.format(image_url, response.status_code)
        logging.error(message)
        return None

    prefix = IMAGE_PATH
    with open(prefix + image_name, 'wb') as image:
        image.write(response.content)
    message = 'Download {} successful. url: {}'.format(image_name, image_url)
    logging.info(message)


def get_image_url(need_redirect_url):
    """Since the anti-crawling cannot be beaten head-on, bypass it another way:
    1. Use Selenium to obtain the URL of the Download button on the page.
    2. That URL does not point at the image directly; testing showed that
       its redirect target is the actual image URL.
    3. The Download URL is also protected, but a plain request is not blocked.
    4. On HTTP 302, the image URL is in the response's Location header.
    """
    response = requests.head(need_redirect_url, headers=headers)
    if response.status_code != 302:
        message = '{} no redirection occurred. Code: {}'.format(need_redirect_url, response.status_code)
        logging.error(message)
        return None
    location = response.headers.get('location')
    logging.info(f'get_image_url success. location: {location}')
    return location


def download(need_redirect_url):
    image_url = get_image_url(need_redirect_url)
    if image_url:
        download_image(image_url)


def main():
    image_ids = get_image_ids()
    download_urls = get_download_urls(image_ids)
    logging.info(f'image_ids: {image_ids}, download_urls: {download_urls}')

    p = Pool(max(CPU_COUNT // 2, 1))
    for url in download_urls:
        p.apply_async(func=download, args=(url,))

    p.close()
    p.join()


if __name__ == "__main__":
    main()


Things to note

  • If you want to use this crawler, you need to download the WebDriver matching your browser, and pay attention to the browser's version number; mine is Chrome 92.0.4515.107.
  • The way I hide the webDriver identity in this article works with the current version of Chrome; other versions may have changed things, so I can't say whether it still applies.
  • I have personally verified that it works.
  • It turns out that Pexels also provides a public API, though it has a rate limit. If you can't write a crawler and need the pictures, check out its API.
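As a sketch of what using the official API might look like, the snippet below parses a response in the shape I understand the Pexels API v1 to return (the endpoint, header, and JSON fields are stated to the best of my knowledge, and the sample data is made up; check the official docs before relying on this):

```python
# Hypothetical sketch of consuming the official Pexels API (v1).
# A real key from https://www.pexels.com/api/ would be sent in the
# Authorization header; mind the rate limit mentioned above.

def extract_original_urls(api_response: dict) -> list:
    """Pull the original-size image URLs out of a search/curated response."""
    return [photo['src']['original'] for photo in api_response.get('photos', [])]


# Made-up sample response in the documented shape:
sample = {
    'page': 1,
    'per_page': 2,
    'photos': [
        {'id': 1, 'src': {'original': 'https://images.pexels.com/photos/1/a.jpeg'}},
        {'id': 2, 'src': {'original': 'https://images.pexels.com/photos/2/b.jpeg'}},
    ],
}
print(extract_original_urls(sample))
```

A live call would look roughly like `requests.get('https://api.pexels.com/v1/curated', headers={'Authorization': API_KEY})`, but the parsing above works against any response of that shape.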