Pet cats with code! This post is participating in the [Cat Essay Campaign].

Preface

Use Python to crawl cat pictures, then combine thousands of them into one big cat photomosaic 🐱!

Crawl cat pictures

This article uses Python 3.10.0, which can be downloaded from www.python.org.

The Python installation and configuration process is not covered in detail here.

1. Crawl the art material website

Crawl target: cat pictures

First install the necessary libraries:

pip install BeautifulSoup4
pip install requests
pip install urllib3
pip install lxml

Code for crawling the pictures:

from bs4 import BeautifulSoup
import requests
import urllib.request
import os

# first page cat picture url
url = 'https://www.huiyi8.com/tupian/tag-%E7%8C%AB%E5%92%AA/1.html'
# Image save path, where r stands for unescaped
path = r"/Users/lpc/Downloads/cats/"
# If the directory exists, skip it; otherwise create it
if os.path.exists(path):
    pass
else:
    os.mkdir(path)


# Get all cat pics page addresses
def allpage():
    all_url = []
    # Loop over the list pages
    for i in range(1, 20):
        # Replace the page number; url[-6] is the sixth character from the end of the address
        each_url = url.replace(url[-6], str(i))
        # Add each constructed URL to the all_url list
        all_url.append(each_url)
    # Return all collected addresses
    return all_url


# Main entry point
if __name__ == '__main__':
    # Call allpage() to get all page URLs
    img_url = allpage()
    for url in img_url:
        # Fetch the page source
        requ = requests.get(url)
        req = requ.text.encode(requ.encoding).decode()
        html = BeautifulSoup(req, 'lxml')
        # List for the matching img tags
        img_urls = []
        # Iterate over all img tags in the HTML
        for img in html.find_all('img'):
            # Keep only src values that start with http and end with jpg
            if img["src"].startswith('http') and img["src"].endswith("jpg"):
                # Add the matching img tag to the img_urls list
                img_urls.append(img)
        # Loop over all collected img tags
        for k in img_urls:
            # Get the image URL
            img = k.get('src')
            # Get the image name; the cast to str matters because alt may be None
            name = str(k.get('alt'))
            # Build the file name for the image
            file_name = path + name + '.jpg'
            # Download the cat picture by URL and save it under that name
            with open(file_name, "wb") as f, requests.get(img) as res:
                f.write(res.content)
            # Print the crawled image URL and file name
            print(img, file_name)

📢 Note: the code above cannot be run as copied; you must change the image download path /Users/lpc/Downloads/cats/ to a local path of your own!
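If you prefer not to edit the path by hand, a portable variant (my suggestion, not from the original) derives the save directory from the current user's home folder with pathlib and creates it if missing:

from pathlib import Path

# Hypothetical portable save location: ~/Downloads/cats
path = Path.home() / "Downloads" / "cats"
# Create the directory (and any missing parents) if it does not exist yet
path.mkdir(parents=True, exist_ok=True)

Since the crawl code above concatenates strings, you would then build file names with path / f"{name}.jpg" instead of path + name + '.jpg'.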

Crawl succeeded:

A total of 346 cat pictures!
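To double-check the count, a quick sanity check over the save directory (assuming path is the folder used in the script above):

import os
print(len(os.listdir(path)))  # should print 346 after this run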

2. Crawl the ZOL website

ZOL URL: Cute cats

Crawl code:

import requests
import time
import os
from lxml import etree

# The request URL
url = 'https://desk.zol.com.cn/dongwu/mengmao/1.html'
# Image save path, where r stands for unescaped
path = r"/Users/lpc/Downloads/ZOL/"
if os.path.exists(path):  # If the directory exists, skip it; otherwise create it
    pass
else:
    os.mkdir(path)
# Request headers
headers = {
    "Referer": "http://desk.zol.com.cn/dongman/1920x1080/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
}

headers2 = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36 SE 2.x MetaSr 1.0",
}


def allpage():  # Get all page URLs
    all_url = []
    for i in range(1, 4):  # Number of pages to turn
        each_url = url.replace(url[-6], str(i))  # Replace the page number
        all_url.append(each_url)
    return all_url  # Return the list of addresses


# TODO: fetch the HTML page and parse it
if __name__ == '__main__':
    img_url = allpage()  # call function
    for url in img_url:
        # send request
        resq = requests.get(url, headers=headers)
        # Show whether the request succeeded
        print(resq)
        # The page obtained after parsing the request
        html = etree.HTML(resq.text)
        # Get the URL of the HD image page under the A tag
        hrefs = html.xpath('.//a[@class="pic"]/@href')
        # TODO go deeper to get hd images
        for i in range(1, len(hrefs)):
            # request
            resqt = requests.get("https://desk.zol.com.cn" + hrefs[i], headers=headers)
            # parse
            htmlt = etree.HTML(resqt.text)
            srct = htmlt.xpath('.//img[@id="bigImg"]/@src')
            # The name of the screenshot
            imgname = srct[0].split('/')[-1]
            # Fetch the image from the URL
            img = requests.get(srct[0], headers=headers2)
            # Write the image to a file
            with open(path + imgname, "ab") as file:
                file.write(img.content)
            # Print the crawled image
            print(img, imgname)
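One observation: the script imports time but never calls it. A polite crawler pauses between requests; a minimal sketch of the inner download loop with a delay added (the 0.5-second value is my assumption, not from the original):

import time

for i in range(1, len(hrefs)):
    resqt = requests.get("https://desk.zol.com.cn" + hrefs[i], headers=headers)
    # ... parse the page and save the image exactly as in the script above ...
    time.sleep(0.5)  # Pause half a second between requests to go easy on the server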

Crawl succeeded:

A total of 81 cat pictures!

3. Crawl the Baidu Images website

Crawl target: Baidu cat pictures

1. Crawl code:

import requests
import os
from lxml import etree
path = r"/Users/lpc/Downloads/baidu1/"
# If the directory exists, skip it; otherwise create it
if os.path.exists(path):
    pass
else:
    os.mkdir(path)

page = input('Please enter how many pages to crawl: ')
page = int(page) + 1
header = {
    'User-Agent': 'the Mozilla / 5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
n = 0
pn = 1
# pn is the offset of the first result; Baidu Images loads 30 pictures at a time by default
for m in range(1, page):
    url = 'https://image.baidu.com/search/acjson?'

    param = {
        'tn': 'resultjson_com',
        'logid': '7680290037940858296',
        'ipn': 'rj',
        'ct': '201326592',
        'is': ' ',
        'fp': 'result',
        'queryWord': 'cat',
        'cl': '2',
        'lm': '1',
        'ie': 'utf-8',
        'oe': 'utf-8',
        'adpicid': ' ',
        'st': '1',
        'z': ' ',
        'ic': '0',
        'hd': '1',
        'latest': ' ',
        'copyright': ' ',
        'word': 'cat',
        's': ' ',
        'se': ' ',
        'tab': ' ',
        'width': ' ',
        'height': ' ',
        'face': '0',
        'istype': '2',
        'qc': ' ',
        'nc': '1',
        'fr': ' ',
        'expermode': ' ',
        'nojc': ' ',
        'acjsonfr': 'click',
        'pn': pn,  # Offset of the first picture
        'rn': '30',
        'gsm': '3c',
        '1635752428843=': ' ',
    }
    page_text = requests.get(url=url, headers=header, params=param)
    page_text.encoding = 'utf-8'
    page_text = page_text.json()
    print(page_text)
    # Fetch the dictionary of all links and store it in a list
    info_list = page_text['data']
    # Delete the last element of the list because it is empty
    del info_list[-1]
    # define a list of stored image addresses
    img_path_list = []
    for i in info_list:
        img_path_list.append(i['thumbURL'])
    # Then take out all the picture addresses and download them
    # n will be the name of the image
    for img_path in img_path_list:
        img_data = requests.get(url=img_path, headers=header).content
        img_path = path + str(n) + '.jpg'
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
        n = n + 1

    pn += 30  # Advance by one page (30 pictures per request)
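The script already deletes the empty last element of data, but individual entries can also lack a thumbURL. A defensive variant of the extraction loop (my suggestion, not from the original) skips such entries instead of raising a KeyError:

# Keep only entries that actually carry a thumbnail URL
img_path_list = [i['thumbURL'] for i in info_list if i.get('thumbURL')]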

2. Crawl code:

# -*- coding:utf-8 -*-
import requests
import re, time, datetime
import os
import random
import urllib.parse
from PIL import Image  # Import the image module

imgDir = r"/Volumes/DBA/python/img/"
# To avoid anti-crawler blocks, prepare several request headers to rotate through
# Chrome, Firefox, Edge
headers = [
    {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Connection': 'keep-alive'
    },
    {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Connection': 'keep-alive'
    },
    {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19041',
        'Accept-Language': 'zh-CN',
        'Connection': 'keep-alive'
    }
]

picList = []  # Empty list to store the image URLs

keyword = input("Please enter the search term: ")
kw = urllib.parse.quote(keyword)  # URL-encode the keyword


# Fetch one page of thumbnail URLs from the Baidu search results
def getPicList(kw, n):
    global picList
    weburl = r"https://image.baidu.com/search/acjson?tn=resultjson_com&logid=11601692320226504094&ipn=rj&ct=201326592&is=&fp=result&queryWord={kw}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=&z=&ic=&hd=&latest=&copyright=&word={kw}&s=&se=&tab=&width=&height=&face=&istype=&qc=&nc=1&fr=&expermode=&force=&cg=girl&pn={n}&rn=30&gsm=1e&1611751343367=".format(
        kw=kw, n=n * 30)
    req = requests.get(url=weburl, headers=random.choice(headers))
    req.encoding = req.apparent_encoding  # Prevent garbled Chinese characters
    webJSON = req.text
    imgurlReg = '"thumbURL":"(.*?)"'  # Regular expression for the thumbnail URLs
    picList = picList + re.findall(imgurlReg, webJSON, re.DOTALL | re.I)


for i in range(150):  # Plenty of iterations; if there aren't that many images, picList simply stops growing
    getPicList(kw, i)

for item in picList:
    # Extension and file name
    hz = ".jpg"
    picName = str(int(time.time() * 1000))  # Millisecond timestamp as the name
    # Request the image
    imgReq = requests.get(url=item, headers=random.choice(headers))
    # Save the image
    with open(imgDir + picName + hz, "wb") as f:
        f.write(imgReq.content)
    # Open the image with the Image module
    im = Image.open(imgDir + picName + hz)
    bili = im.width / im.height  # Width-to-height ratio, used to resize the image proportionally
    # Resize the image so that the smallest side is 50
    if bili >= 1:
        newIm = im.resize((round(bili * 50), 50))
    else:
        newIm = im.resize((50, round(50 * im.height / im.width)))
    # Crop the 50x50 top-left portion of the image
    clip = newIm.crop((0, 0, 50, 50))
    clip.convert("RGB").save(imgDir + picName + hz)  # Save the cropped image
    print(picName + hz + " Done")
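As an aside, Pillow's ImageOps.fit performs the same "scale the short side, then crop to size" in a single call, so the resize-and-crop step above could be simplified; a sketch with placeholder file names:

from PIL import Image, ImageOps

im = Image.open("cat_original.jpg")         # placeholder input file
thumb = ImageOps.fit(im, (50, 50))          # Resize and crop to 50x50 in one step
thumb.convert("RGB").save("cat_thumb.jpg")  # placeholder output file

One small difference: ImageOps.fit crops around the center by default, while the code above crops from the top-left corner.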

Crawl succeeded:

Bottom line: 1,600 cat pictures from three sites!
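Because three different sites were crawled, some files may be byte-identical duplicates. A small sketch (my addition; the folder path is an assumption, point it at your own download directory) that removes duplicates by MD5 hash:

import hashlib
import os

img_dir = "/Users/lpc/Downloads/cats/"  # assumed folder holding the downloaded images
seen = set()
for name in os.listdir(img_dir):
    full = os.path.join(img_dir, name)
    # Hash the file contents to detect byte-identical copies
    with open(full, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    if digest in seen:
        os.remove(full)  # Drop the duplicate copy
    else:
        seen.add(digest)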

Photomosaic (thousand-image imaging)

After crawling thousands of images, the next step is to stitch them together into one big cat picture, a technique known as a photomosaic ("thousand-image imaging").

1. Foto-Mosaik-Edda software implementation

Download the Foto-Mosaik-Edda installer; if the download link is unavailable, search for Foto-Mosaik-Edda instead.

Installing Foto-Mosaik-Edda on Windows is fairly easy!

📢 Note: the .NET Framework 2 must be installed in advance, otherwise the installer fails with an error message.

How to enable the .NET Framework 2:

Confirm successful enablement:

Now you can continue installing!

After installation, open the following:

Step 1: Create a gallery:

Step 2: Generate the photomosaic:

Check the gallery created in step 1:

A moment of wonder:

Make another cute cat:

And you’re done!

2. Python implementation

First, pick an image:

Run the following code:

# -*- coding:utf-8 -*-
from PIL import Image
import os
import numpy as np

imgDir = r"/Volumes/DBA/python/img/"
bgImg = r"/Users/lpc/Downloads/494.jpg"


# Get the average color value of an image
def compute_mean(imgPath):
    """
    :param imgPath: thumbnail path
    :return: (R, G, B) average color of the whole thumbnail
    """
    im = Image.open(imgPath)
    im = im.convert("RGB")  # Convert to RGB mode
    # Convert the image into an array: one row per pixel row, storing each pixel's color
    # e.g. [[[60 33 24] [58 34 24] ... [188 152 136] [99 96 113]] ...]
    imArray = np.array(im)
    # mean() averages each channel over the whole image
    R = np.mean(imArray[:, :, 0])  # Average of all R values
    G = np.mean(imArray[:, :, 1])  # Average of all G values
    B = np.mean(imArray[:, :, 2])  # Average of all B values
    return (R, G, B)


def getImgList():
    """
    Get thumbnail paths and average colors.
    :return: list of dicts storing each image path and its average color value
    """
    imgList = []
    for pic in os.listdir(imgDir):
        imgPath = imgDir + pic
        imgRGB = compute_mean(imgPath)
        imgList.append({
            "imgPath": imgPath,
            "imgRGB": imgRGB
        })
    return imgList


def computeDis(color1, color2):
    """
    Compute the color difference of two images as a distance in color space:
    dis = (R**2 + G**2 + B**2) ** 0.5
    """
    dis = 0
    for i in range(len(color1)):
        dis += (color1[i] - color2[i]) ** 2
    dis = dis ** 0.5
    return dis


def create_image(bgImg, imgDir, N=2, M=50):
    """
    bgImg: path of the background image
    imgDir: thumbnail directory
    N: zoom factor for the background image
    M: size of each thumbnail (M x M)
    """
    # Get the list of thumbnails
    imgList = getImgList()

    # fetch image
    bg = Image.open(bgImg)
    # bg = bg.resize((bg.size[0] // N, bg.size[1] // N))  # Zoom; recommended for large originals, otherwise the run takes a long time
    bgArray = np.array(bg)
    width = bg.size[0] * M  # Width of the new image: each pixel is magnified M times
    height = bg.size[1] * M  # Height of the new image

    # Create a blank new image
    newImg = Image.new('RGB', (width, height))

    # Loop fill diagram
    for x in range(bgArray.shape[0]):  # x: row index of the original image
        for y in range(bgArray.shape[1]):  # y: column index of the original image
            # Find the image with the smallest distance
            minDis = 10000
            index = 0
            for img in imgList:
                dis = computeDis(img['imgRGB'], bgArray[x][y])
                if dis < minDis:
                    index = img['imgPath']
                    minDis = dis
            # end of loop, index stores the image path with the closest color
            # minDis stores color differences
            # fill
            tempImg = Image.open(index)  # Open the image with the smallest color distance
            # Resize the thumbnail; strictly optional here since the images were already resized at download time
            tempImg = tempImg.resize((M, M))
            # Paste the thumbnail onto the new image; note x maps to rows and y to columns, spaced M apart
            newImg.paste(tempImg, (y * M, x * M))
            print('(%d, %d)' % (x, y))  # Print progress: the current x, y

    # Save images
    newImg.save('final.jpg')  # Save the image last


create_image(bgImg, imgDir)

Running results:

As you can see from the image above, the mosaic's resolution is close to that of the original image.

📢 Note: Python is slow to run!
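Most of the time goes into the pure-Python double loop that compares every background pixel against every thumbnail color. A possible NumPy speed-up (my sketch, not from the original article; it reuses the imgList and bgArray variables from the code above and assumes the distance matrix fits in memory) computes all nearest matches at once:

import numpy as np

# Stack all thumbnail mean colors into one (K, 3) array
colors = np.array([img["imgRGB"] for img in imgList])
# Flatten the background image into (W*H, 3) pixel rows
pixels = bgArray.reshape(-1, 3).astype(np.float64)
# Squared Euclidean distance from every pixel to every thumbnail color: shape (W*H, K)
dists = ((pixels[:, None, :] - colors[None, :, :]) ** 2).sum(axis=2)
# Index of the closest thumbnail for each background pixel
nearest = dists.argmin(axis=1)

For large backgrounds, process the pixel rows in chunks so the (W*H, K) matrix stays small; the paste loop then looks up imgList[nearest[i]] instead of scanning all thumbnails.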

Final words

😄 Great, now we can happily pet cats ~

References for this article:

  • Python batch-crawls cat pictures
  • Python implements multi-threaded concurrent downloading of large files
  • Python crawls HD images from ZOL desktop wallpapers
  • Python crawls Baidu images
  • Python: how to implement photomosaic imaging, a primer
  • Note 17: playing with photomosaics