This article took part in the "Digitalstar Project" creative incentive challenge and won a creative gift package.

Crawling keaitupian.net (a cute-picture site) with two threads

This post speeds up a Python crawler by implementing it with two threads, and the crawl turns up some unexpected goodies along the way.

Crawl target analysis

Crawl target

  • The cute-picture site: www.keaitupian.net/
  • The site's picture categories are very rich (cute girls, glamour shots, and more). To keep the focus on technique, this post grabs only the cartoon category and leaves the rest to you, dear readers.

Python modules used

  • Modules used: requests, re, and threading;
  • threading is the newly added module, used to run the crawler in parallel.

Key learning content

  1. The basic crawler routine;
  2. Crawling data when the total page count is unknown;
  3. A crawler with a fixed number of threads.

List page and detail page analysis

  • We are grabbing cartoon pictures, so the list page is https://www.keaitupian.net/dongman/. Clicking through several page numbers reveals the following pattern:
  • www.keaitupian.net/dongman/lis…
  • www.keaitupian.net/dongman/lis…
  • www.keaitupian.net/dongman/lis…
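Judging from the list-110.html example tested below, the list pages follow a `list-{n}.html` pattern. A small helper (my sketch, assuming that pattern holds) can generate any list-page URL:

```python
def list_page_url(n):
    # Pattern inferred from the list-110.html example tested in this article
    return f"https://www.keaitupian.net/dongman/list-{n}.html"

print(list_page_url(110))  # https://www.keaitupian.net/dongman/list-110.html
```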

Since the total number of list pages cannot be read directly, try a deliberately large page number: opening https://www.keaitupian.net/dongman/list-110.html shows that the page does not exist, as in the figure below.

After further testing, it turns out this category has 77 list pages.
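The "try a large number" test above can be automated with a binary search. Here is a sketch (not the article's code) that assumes a hypothetical `page_exists` callable, e.g. one that requests a list page and checks for a 404; the demo below fakes a site with 77 pages, matching the article's finding:

```python
def find_last_page(page_exists, lo=1, hi=1000):
    """Binary-search the largest page number n in [lo, hi] with page_exists(n) True.

    page_exists is any callable; in practice it would issue an HTTP request
    and check the status code (hypothetical, not part of the original article).
    """
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias upward so the search terminates
        if page_exists(mid):
            lo = mid   # page exists: the answer is mid or later
        else:
            hi = mid - 1  # page missing: the answer is before mid
    return lo

# Demo with a fake predicate standing in for real HTTP checks:
print(find_last_page(lambda n: n <= 77))  # 77
```

This costs about log2(1000) ≈ 10 requests instead of probing page numbers one by one.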

Click into any picture's detail page to inspect it. The detail page also has paging, and that paging can cross list-page boundaries: after turning past page 9/9 of one set, you land in the next set of pictures. That means the data can be fetched directly from detail pages.

Open the last set of photos on list page 77 and inspect its paging code: the link for turning right on the final page is empty, i.e., there is no further page.

View the last page's data at: https://www.keaitupian.net/article/280-8.html#.

That completes the target-site analysis; next, let's organize the overall logic and requirements.

Organizing the requirements

  1. Randomly pick a detail-page address as the crawler's starting page;
  2. One thread saves the images;
  3. Another thread collects the next-page addresses.
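The two threads above form a classic producer-consumer pair. As a side sketch (not the article's code), Python's `queue.Queue` does its locking internally, so the shared list plus manual mutex used later could be replaced with something like this (the URLs here are hypothetical placeholders):

```python
import queue
import threading

q = queue.Queue()
results = []

def producer():
    # Stand-in for the thread that collects next-page addresses
    for n in range(3):
        q.put(f"https://www.keaitupian.net/article/page-{n}.html")  # hypothetical URLs
    q.put(None)  # sentinel: tells the consumer there is nothing more coming

def consumer():
    # Stand-in for the thread that downloads each queued address
    while True:
        url = q.get()
        if url is None:
            break
        results.append(url)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(len(results))  # 3
```

`Queue.get()` also blocks while the queue is empty, which removes the busy-wait loop you will see in the hand-rolled version below.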

Coding time

Fetching the target request addresses

Based on the requirements above, first implement the URL-collecting thread, which repeatedly crawls next-page URL addresses and saves them to a global list.

threading.Thread is used to create and start the threads, and a threading.Lock mutex keeps the data handoff between threads safe.

Lock declaration:

mutex = threading.Lock()

Use of locks:

global urls
# locked
mutex.acquire()

urls.append(next_url)
mutex.release()
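As a side note (not part of the original code), the same lock is safer in its context-manager form, which releases automatically even if an exception is raised between acquire and release:

```python
import threading

mutex = threading.Lock()
urls = []

def append_url(next_url):
    # "with" acquires the lock on entry and guarantees release on exit,
    # even if urls.append were to raise
    with mutex:
        urls.append(next_url)

append_url("https://www.keaitupian.net/article/280-8.html")
print(urls)  # ['https://www.keaitupian.net/article/280-8.html']
```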

Write the URL to obtain the address code as follows:

import requests
import re
import threading
import time

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"}

# global urls
urls = []

mutex = threading.Lock()


# loop to get URL
def get_image(start_url):
    global urls
    urls.append(start_url)
    next_url = start_url
    while next_url != "#":
        res = requests.get(url=next_url, headers=headers)

        if res is not None:
            html = res.text
            # The original regex was lost when this article was extracted;
            # this next-page pattern is a placeholder guess - adjust it to
            # the site's actual HTML before running
            pattern = re.compile(r'<a class="ptnext" href="(.*?)"')
            match = pattern.search(html)
            if match:
                next_url = match.group(1)
                if next_url == "#":  # empty paging link: last page reached
                    break
                if next_url.find('www.keaitupian') < 0:
                    next_url = f"https://www.keaitupian.net{next_url}"
                print(next_url)
                # lock before touching the shared list
                mutex.acquire()
                urls.append(next_url)
                # release the lock
                mutex.release()
            else:
                break  # no next-page link found; stop instead of looping forever


if __name__ == '__main__':
    # Fetch image thread
    gets = threading.Thread(target=get_image, args=(
        "https://www.keaitupian.net/article/202389.html",))
    gets.start()

Run the code to collect the target addresses; the console output looks like this:

Extracting picture addresses and saving the images

Now for the last step: request each link address collected by the code above, extract the picture address from it, and save the picture.

Saving images also runs in its own thread, handled by the save_image function below:

# Save image thread
def save_image():
    global urls
    print(urls)

    while True:
        # lock before touching the shared list
        mutex.acquire()
        if len(urls) > 0:
            # get the first item in the list
            img_url = urls[0]
            # delete the first item in the list
            del urls[0]
            # release the lock before the slow network request
            mutex.release()
            res = requests.get(url=img_url, headers=headers)

            if res is not None:
                html = res.text

                # The original regex was lost when this article was extracted;
                # this image pattern is a placeholder guess - adjust it to
                # the site's actual HTML before running
                pattern = re.compile(r'<img src="(.*?)" alt=')

                img_match = pattern.search(html)

                if img_match:
                    img_data_url = img_match.group(1)
                    print("Grabbing picture:", img_data_url)
                    try:
                        res = requests.get(img_data_url)
                        with open(f"images/{time.time()}.png", "wb+") as f:
                            f.write(res.content)
                    except Exception as e:
                        print(e)
        else:
            # release the lock here too - the original code skipped this, and
            # a second acquire() on the non-reentrant Lock would deadlock
            mutex.release()
            print("Waiting... if this lasts a long time, you can shut it down.")
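One caveat about the saving snippet above: the `images/` directory must already exist, and `time.time()`-based filenames can collide if two downloads finish close together. A hedged alternative (my suggestion, not the original code) creates the folder on demand and uses a UUID for the name:

```python
import os
import uuid

def unique_image_path(folder="images", ext="png"):
    # Create the folder on first use; exist_ok=True makes repeats harmless
    os.makedirs(folder, exist_ok=True)
    # uuid4 hex names cannot collide the way timestamp names can
    return os.path.join(folder, f"{uuid.uuid4().hex}.{ext}")

p1 = unique_image_path()
p2 = unique_image_path()
print(p1 != p2)  # True
```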

Back in the main function, create and start a thread for this function as well:

if __name__ == '__main__':
    # Fetch image thread
    gets = threading.Thread(target=get_image, args=(
        "https://www.keaitupian.net/article/202389.html",))
    gets.start()

    save = threading.Thread(target=save_image)
    save.start()
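One final note (my addition, not from the original): as written, save_image loops forever, so the script never exits on its own and has to be shut down by hand. A `threading.Event` gives workers a clean shutdown signal that the main thread can set once crawling is done; a minimal sketch:

```python
import threading

stop = threading.Event()

def worker(out):
    # Loop until the main thread signals shutdown
    while not stop.is_set():
        out.append(1)        # stand-in for "save one image"
        stop.wait(0.01)      # brief sleep that wakes immediately on stop.set()

results = []
t = threading.Thread(target=worker, args=(results,))
t.start()

stop.set()   # signal the worker to finish
t.join()     # wait for it to exit
print(t.is_alive())  # False
```

With this pattern, the URL-collecting thread could call `stop.set()` after appending the last address, and both workers would wind down on their own.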