This article took part in the "Digitalstar Project" creative incentive challenge and won a creative gift package.

Crawling keaitupian.net (a cute-picture site) with two threads

This post speeds up a Python crawler by implementing it with two threads, and the crawl turns up some unexpected goodies along the way.

Crawl target analysis

Crawl target

  • The cute-picture site: www.keaitupian.net/
  • The site's picture categories are very rich (cute girls, glamour shots, and more). To keep the focus on technique, this post grabs only the cartoon category and leaves the rest to you, dear readers.

Python modules used

  • Modules used: requests, re, and threading;
  • threading is the newly added module, used to run the crawler in parallel.

Key learning content

  1. The basic crawler routine;
  2. Crawling data when the total page count is unknown;
  3. A crawler with a fixed number of threads.

List page and detail page analysis

  • We are grabbing cartoon pictures, so the list page is https://www.keaitupian.net/dongman/. Clicking through several page numbers reveals the following pattern:
  • www.keaitupian.net/dongman/lis…
  • www.keaitupian.net/dongman/lis…
  • www.keaitupian.net/dongman/lis…
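Judging from the list-110.html example tested below, the list pages follow a `list-{n}.html` pattern. A small helper (my sketch, assuming that pattern holds) can generate any list-page URL:

```python
def list_page_url(n):
    # Pattern inferred from the list-110.html example tested in this article
    return f"https://www.keaitupian.net/dongman/list-{n}.html"

print(list_page_url(110))  # https://www.keaitupian.net/dongman/list-110.html
```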

Since the total number of list pages cannot be read directly, try a deliberately large page number: opening https://www.keaitupian.net/dongman/list-110.html shows that the page does not exist, as in the figure below.

After further testing, it turns out this category has 77 list pages.
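The "try a large number" test above can be automated with a binary search. Here is a sketch (not the article's code) that assumes a hypothetical `page_exists` callable, e.g. one that requests a list page and checks for a 404; the demo below fakes a site with 77 pages, matching the article's finding:

```python
def find_last_page(page_exists, lo=1, hi=1000):
    """Binary-search the largest page number n in [lo, hi] with page_exists(n) True.

    page_exists is any callable; in practice it would issue an HTTP request
    and check the status code (hypothetical, not part of the original article).
    """
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias upward so the search terminates
        if page_exists(mid):
            lo = mid   # page exists: the answer is mid or later
        else:
            hi = mid - 1  # page missing: the answer is before mid
    return lo

# Demo with a fake predicate standing in for real HTTP checks:
print(find_last_page(lambda n: n <= 77))  # 77
```

This costs about log2(1000) ≈ 10 requests instead of probing page numbers one by one.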

Click into any picture's detail page to inspect it. The detail page also has paging, and that paging can cross list-page boundaries: after turning past page 9/9 of one set, you land in the next set of pictures. That means the data can be fetched directly from detail pages.

Open the last set of photos on list page 77 and inspect its paging code: the link for turning right on the final page is empty, i.e., there is no further page.

View the last page's data at: https://www.keaitupian.net/article/280-8.html#.

That completes the target-site analysis; next, let's organize the overall logic and requirements.

Organizing the requirements

  1. Randomly pick a detail-page address as the crawler's starting page;
  2. One thread saves the images;
  3. Another thread collects the next-page addresses.
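The two threads above form a classic producer-consumer pair. As a side sketch (not the article's code), Python's `queue.Queue` does its locking internally, so the shared list plus manual mutex used later could be replaced with something like this (the URLs here are hypothetical placeholders):

```python
import queue
import threading

q = queue.Queue()
results = []

def producer():
    # Stand-in for the thread that collects next-page addresses
    for n in range(3):
        q.put(f"https://www.keaitupian.net/article/page-{n}.html")  # hypothetical URLs
    q.put(None)  # sentinel: tells the consumer there is nothing more coming

def consumer():
    # Stand-in for the thread that downloads each queued address
    while True:
        url = q.get()
        if url is None:
            break
        results.append(url)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(len(results))  # 3
```

`Queue.get()` also blocks while the queue is empty, which removes the busy-wait loop you will see in the hand-rolled version below.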

Coding time

Fetching the target request addresses

Based on the requirements above, first implement the URL-collecting thread, which repeatedly crawls next-page URL addresses and saves them to a global list.

threading.Thread is used to create and start the threads, and a threading.Lock mutex keeps the data handoff between threads safe.

Lock declaration:

mutex = threading.Lock()

Use of locks:

global urls
# locked
mutex.acquire()

urls.append(next_url)
mutex.release()
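As a side note (not part of the original code), the same lock is safer in its context-manager form, which releases automatically even if an exception is raised between acquire and release:

```python
import threading

mutex = threading.Lock()
urls = []

def append_url(next_url):
    # "with" acquires the lock on entry and guarantees release on exit,
    # even if urls.append were to raise
    with mutex:
        urls.append(next_url)

append_url("https://www.keaitupian.net/article/280-8.html")
print(urls)  # ['https://www.keaitupian.net/article/280-8.html']
```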

Write the URL to obtain the address code as follows:

import requests
import re
import threading
import time

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"}

# global urls
urls = []

mutex = threading.Lock()


# loop to get URL
def get_image(start_url):
    global urls
    urls.append(start_url)
    next_url = start_url
    while next_url != "#":
        res = requests.get(url=next_url, headers=headers)

        if res is not None:
            html = res.text
            # The original regex was lost when this article was extracted;
            # this next-page pattern is a placeholder guess - adjust it to
            # the site's actual HTML before running
            pattern = re.compile(r'<a class="ptnext" href="(.*?)"')
            match = pattern.search(html)
            if match:
                next_url = match.group(1)
                if next_url == "#":  # empty paging link: last page reached
                    break
                if next_url.find('www.keaitupian') < 0:
                    next_url = f"https://www.keaitupian.net{next_url}"
                print(next_url)
                # lock before touching the shared list
                mutex.acquire()
                urls.append(next_url)
                # release the lock
                mutex.release()
            else:
                break  # no next-page link found; stop instead of looping forever


if __name__ == '__main__':
    # Fetch image thread
    gets = threading.Thread(target=get_image, args=(
        "https://www.keaitupian.net/article/202389.html",))
    gets.start()

Run the code to collect the target addresses; the console output looks like this:

Extracting picture addresses and saving the images

Now for the last step: request each link address collected by the code above, extract the picture address from it, and save the picture.

Saving images also runs in its own thread, handled by the save_image function below:

# Save image thread
def save_image():
    global urls
    print(urls)

    while True:
        # lock before touching the shared list
        mutex.acquire()
        if len(urls) > 0:
            # get the first item in the list
            img_url = urls[0]
            # delete the first item in the list
            del urls[0]
            # release the lock before the slow network request
            mutex.release()
            res = requests.get(url=img_url, headers=headers)

            if res is not None:
                html = res.text

                # The original regex was lost when this article was extracted;
                # this image pattern is a placeholder guess - adjust it to
                # the site's actual HTML before running
                pattern = re.compile(r'<img src="(.*?)" alt=')

                img_match = pattern.search(html)

                if img_match:
                    img_data_url = img_match.group(1)
                    print("Grabbing picture:", img_data_url)
                    try:
                        res = requests.get(img_data_url)
                        with open(f"images/{time.time()}.png", "wb+") as f:
                            f.write(res.content)
                    except Exception as e:
                        print(e)
        else:
            # release the lock here too - the original code skipped this, and
            # a second acquire() on the non-reentrant Lock would deadlock
            mutex.release()
            print("Waiting... if this lasts a long time, you can shut it down.")
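One caveat about the saving snippet above: the `images/` directory must already exist, and `time.time()`-based filenames can collide if two downloads finish close together. A hedged alternative (my suggestion, not the original code) creates the folder on demand and uses a UUID for the name:

```python
import os
import uuid

def unique_image_path(folder="images", ext="png"):
    # Create the folder on first use; exist_ok=True makes repeats harmless
    os.makedirs(folder, exist_ok=True)
    # uuid4 hex names cannot collide the way timestamp names can
    return os.path.join(folder, f"{uuid.uuid4().hex}.{ext}")

p1 = unique_image_path()
p2 = unique_image_path()
print(p1 != p2)  # True
```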

Back in the main function, create and start a thread for this function as well:

if __name__ == '__main__':
    # Fetch image thread
    gets = threading.Thread(target=get_image, args=(
        "https://www.keaitupian.net/article/202389.html",))
    gets.start()

    save = threading.Thread(target=save_image)
    save.start()
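One final note (my addition, not from the original): as written, save_image loops forever, so the script never exits on its own and has to be shut down by hand. A `threading.Event` gives workers a clean shutdown signal that the main thread can set once crawling is done; a minimal sketch:

```python
import threading

stop = threading.Event()

def worker(out):
    # Loop until the main thread signals shutdown
    while not stop.is_set():
        out.append(1)        # stand-in for "save one image"
        stop.wait(0.01)      # brief sleep that wakes immediately on stop.set()

results = []
t = threading.Thread(target=worker, args=(results,))
t.start()

stop.set()   # signal the worker to finish
t.join()     # wait for it to exit
print(t.is_alive())  # False
```

With this pattern, the URL-collecting thread could call `stop.set()` after appending the last address, and both workers would wind down on their own.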