Preface

The last post was a bit of a digression, so let's get back to the technical content. As the final part of this crawler basics series, this article wraps up with the module design of a crawler program.

In my years of crawler development, I have usually divided a crawler program into four modules.

As shown in the figure, apart from the proxy module, which is introduced only when needed, the request, parse, and store modules are essential.

Proxy module

The proxy module mainly builds a proxy IP pool. In the third article I explained why proxy IPs are needed: many websites identify crawlers by request frequency, that is, they record the number of requests coming from one IP within a period of time, so switching proxy IPs helps avoid being blocked and improves crawler efficiency.

Concept

What is a proxy IP address pool?

Like a thread pool or connection pool, it places multiple proxy IPs in a common area in advance for multiple crawlers to use, and each one is returned after use.

Why do you need a proxy pool?

Normally, this is how we add a proxy IP in a program:

proxies = {
    'https': 'https://183.220.xxx.xx:80'
}
response = requests.get(url, proxies=proxies)

Written this way, we can only ever use one IP. Someone might say:

Even if a collection can store multiple proxy IPs, you would still have to stop the program and modify the code whenever an invalid IP needs to be deleted or a new one added. When I was learning to program, my teacher used to say:

Even now those words still ring in my ears. The proxy module therefore provides the ability to flexibly add and delete proxy IPs and to verify whether an IP is still valid.

Implementation

Currently, MySQL is generally used to store the proxy IPs. Take a look at the table design for the proxy pool.

CREATE TABLE `proxy` (
  `ip` varchar(100) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

My table design is simple and crude: only one field. You can extend it according to your own needs during development.

Take another look at the data in the table:

As shown in the figure, each stored proxy consists of the supported protocol, the IP address, and the port.

Finally, here is the proxy pool code:

import requests
import pymysql

class proxyPool:
    # Initialize the database connection
    def __init__(self, host, db, user, password, port):
        self.conn = pymysql.connect(host=host,
                                    database=db,
                                    user=user,
                                    password=password,
                                    port=port,
                                    charset='utf8')

    # Get an IP from the database
    def get_ip(self):
        cursor = self.conn.cursor()
        cursor.execute('select ip from proxy order by rand() limit 1')
        ip = cursor.fetchone()
        # If the table returned an IP, validate it; otherwise return an empty string
        if ip:
            judge = self.judge_ip(ip[0])
            # If the IP is usable, return it directly. If not, call this
            # method again to fetch another IP from the database
            if judge:
                return ip[0]
            else:
                return self.get_ip()
        else:
            return ''

    # Check whether an IP is usable
    def judge_ip(self, ip):
        http_url = 'https://www.baidu.com'
        try:
            proxy_dict = {
                'http': ip,
                'https': ip,
            }
            response = requests.get(http_url, proxies=proxy_dict, timeout=5)
        except Exception:
            self.delete_ip(ip)
            return False
        else:
            code = response.status_code
            if 200 <= code < 300:
                return True
            else:
                self.delete_ip(ip)
                return False

    # Delete an invalid IP from the database
    def delete_ip(self, ip):
        delete_sql = 'delete from proxy where ip=%s'
        cursor = self.conn.cursor()
        cursor.execute(delete_sql, (ip,))
        self.conn.commit()

The proxy pool workflow is divided into two parts:

  1. Get an IP from the database. If no IP is available, no proxy is used and an empty string is returned. If one is available, go to the next step.
  2. Verify whether the IP is valid. If it is invalid, delete it and repeat step 1. If it is valid, return it.

Use

The ultimate purpose of a proxy pool is to provide valid proxy IPs. You can also separate the proxy pool from the crawler and turn it into a web interface, so that crawlers fetch a proxy IP via a URL; for that you would build a small web service with Flask or Django.
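For reference, below is a minimal sketch of what such a web interface could look like, assuming the proxyPool class above is saved in a file named proxy_pool.py; the route name, host, port, and connection parameters are arbitrary choices for illustration, not an established interface.

# Minimal sketch: expose the proxy pool over HTTP with Flask.
# Assumes the proxyPool class above lives in proxy_pool.py; route name,
# host, port, and database credentials are placeholders.
from flask import Flask

from proxy_pool import proxyPool

app = Flask(__name__)
pool = proxyPool('127.0.0.1', 'test', 'root', 'root', 3306)

@app.route('/get_ip')
def get_ip():
    # Return a validated proxy IP, or an empty string if none is available
    return pool.get_ip()

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

A crawler on another machine could then simply request http://host:5000/get_ip to obtain a proxy before crawling.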

I usually just put it in my crawler. The sample code is as follows:

pool = proxyPool('47.102.xxx.xxx', 'test', 'root', 'root', 3306)
proxy_ip = pool.get_ip()
url = 'https://v.qq.com/detail/m/m441e3rjq9kwpsc.html'
proxies = {
    'http': proxy_ip,
    'https': proxy_ip
}
if proxy_ip:
    response = requests.get(url, proxies=proxies)
else:
    response = requests.get(url)
print(response.text)

Sources of proxy IPs

As discussed earlier, proxy IPs can either be purchased or collected for free from the web. Because free IPs have a low survival rate, the proxy pool mainly targets free IPs.

Generally, a separate crawler is developed to crawl free IPs and put them into the database, where their availability is then verified, as sketched below.
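As a rough illustration, such a harvester might look like the following, reusing the proxy table above. The listing URL and the regular expression are hypothetical placeholders, since every free-proxy site structures its pages differently.

# Sketch of a free-proxy harvester writing into the proxy table above.
# The listing URL and the regex are hypothetical placeholders.
import re

import pymysql
import requests

def harvest_free_ips():
    conn = pymysql.connect(host='127.0.0.1', database='test', user='root',
                           password='root', port=3306, charset='utf8')
    cursor = conn.cursor()
    # Hypothetical page listing free proxies
    html = requests.get('https://example.com/free-proxy-list', timeout=10).text
    # Assume entries appear in the page as ip:port pairs, e.g. 183.220.1.2:8080
    for ip, port in re.findall(r'(\d+\.\d+\.\d+\.\d+):(\d+)', html):
        cursor.execute('insert into proxy (ip) values (%s)',
                       (f'https://{ip}:{port}',))
    conn.commit()
    conn.close()

if __name__ == '__main__':
    harvest_free_ips()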

Request/parse module

The previous crawler samples all crawled a single URL. In practice, a crawler is usually built around an entire website, and in the final analysis its design comes down to the request module and the parse module.

If you want to crawl an entire site, you must first determine a site entrance, that is, the URL the crawler visits first. The returned page is then parsed either to extract data or to obtain the next level of URLs to request.
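Abstractly, that flow can be sketched as a small loop like the one below. parse_data and parse_next_urls are placeholder functions you would implement for the specific site; they are not library calls.

# Generic sketch of the "entrance -> parse -> next level" flow described above.
import requests

def parse_data(html):
    # Placeholder: extract and store the data you care about
    pass

def parse_next_urls(html):
    # Placeholder: return the next level of URLs found in this page
    return []

def crawl_site(entry_url, max_depth=3):
    to_visit = [(entry_url, 0)]   # start from the site entrance
    seen = set()
    while to_visit:
        url, depth = to_visit.pop()
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        html = requests.get(url, timeout=10).text
        parse_data(html)
        for next_url in parse_next_urls(html):
            to_visit.append((next_url, depth + 1))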

Here we take Tencent Video as an example and crawl animation information.

1. Select the website entrance

Analyze the requirements and select the site entrance. Here it is clear that the animation channel URL is the site entrance.

After we request the site entrance, that is, the animation channel, we parse the returned page content. From the page we can see that the animation channel contains categories such as Chinese anime, Japanese anime, fighting, and so on.

View the source of the web page:

As shown in the figure above, we can parse the URL of each category out of the animation channel home page.
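For illustration, that parse step might look roughly like the snippet below. Both the channel URL and the XPath expression are assumptions about the page structure, so check the actual page source before relying on them.

# Illustrative only: pull category links out of the animation channel page.
# The channel URL and the XPath are assumptions, not verified selectors.
import requests
from lxml import etree

html = requests.get('https://v.qq.com/channel/cartoon', timeout=10).text
tree = etree.HTML(html)
for link in tree.xpath('//a[contains(@href, "cartoon")]'):
    print(link.get('href'), link.text)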

2. Category request

After obtaining the URL of each category, we continue to send requests. Here the Chinese anime URL is requested first, and the returned page content is as follows:

As shown in the picture, this is the animation list under the Chinese anime category. In the browser we can click any animation to enter its play page, so from this page we can parse the links to the play pages of these Chinese anime.

We view the source of this page:

As shown in the figure, we can obtain the URL of each play page.

3. Go to the details page

Take the first Chinese anime, Douluo Continent, as an example: we get its play page URL, make a request, and receive the content of the play page.

We find that clicking the entry in the top right corner of the Douluo Continent play page takes us to its details page. Therefore, we need to parse out the details page URL from the upper right corner in order to obtain the details page content.

4. Obtain data

Parse the web content of the details page and extract the data you want. The specific code is in the sample from the first article.

From the above four steps, we can see that a crawler works through a website step by step, visit by visit. We need to find the site entrance, be clear about what data we want to obtain, and plan the path from the site entrance to that data.

Of course, there is still a lot of room for optimization. For example, step 2 can skip step 3 and request the details page of step 4 directly. Let's compare the URLs of the play page and the details page.

# Douluo Continent
# play page:    https://v.qq.com/x/cover/m441e3rjq9kwpsc.html
# details page: https://v.qq.com/detail/m/m441e3rjq9kwpsc.html

# Little Fox Demon Matchmaker
# play page:    https://v.qq.com/x/cover/0sdnyl7h86atoyt.html
# details page: https://v.qq.com/detail/0/0sdnyl7h86atoyt.html

It's not hard to see the pattern in the two pairs of URLs above, so after parsing the play page URL in step 2 we can construct the details page URL directly.
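Assuming the pattern inferred from these two pairs holds more generally (which is not guaranteed), a small helper could build the details page URL straight from the play page URL:

# Helper inferred from the two URL pairs above; the pattern
# /x/cover/<id>.html -> /detail/<first character of id>/<id>.html
# is only an observation from these examples, not a documented rule.
def cover_to_detail(cover_url):
    cover_id = cover_url.split('/')[-1].split('.')[0]
    return f'https://v.qq.com/detail/{cover_id[0]}/{cover_id}.html'

print(cover_to_detail('https://v.qq.com/x/cover/m441e3rjq9kwpsc.html'))
# -> https://v.qq.com/detail/m/m441e3rjq9kwpsc.html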

Note: the crawl analysis of Tencent Video above is only meant as a process reference; actual development may involve asynchronous requests and other topics.

Storage module

Crawling only becomes meaningful once the crawled data is stored.

Usually the data is text, images, and so on. Here is how to download an image and save it to a local directory.

Image download

When I used Scrapy's built-in image pipeline to download GIFs, I spent a long time on it without success. So I went back to basics, and it finally worked: one line of code saved the day.

The code is as follows:

urllib.request.urlretrieve(imageUrl, filename)

So, whether you hand-roll your crawler or use Scrapy in the future, when you just need to download an image, remember this one line of code!

Find an image link to test it out:

Right-click the image and select Copy image address.

Put the image address into the program; the code is as follows:

import urllib.request
urllib.request.urlretrieve('http://puui.qpic.cn/vcover_vt_pic/0/m441e3rjq9kwpsc1607693898908/0', './1.jpg')

I saved the image to the current directory, named it 1.jpg, and ran the program.

Text data

  1. Store it in a file
with open("/path/file.txt", 'a', encoding='utf-8') as f:
    f.write(data + '\n')
  2. Use the pymysql module to store data in MySQL tables (a brief sketch of options 2 and 3 follows this list)

  3. Use the pandas or xlwt module to store the data in Excel
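Below is a brief sketch of options 2 and 3. The table name, its columns, the sample record, and the output file name are examples only.

# Minimal sketches of options 2 and 3 above; names are illustrative only.
import pandas as pd
import pymysql

# Option 2: pymysql -- insert one record into a MySQL table
conn = pymysql.connect(host='127.0.0.1', database='test', user='root',
                       password='root', port=3306, charset='utf8')
cursor = conn.cursor()
cursor.execute('insert into anime (name, score) values (%s, %s)',
               ('Douluo Continent', 9.2))
conn.commit()
conn.close()

# Option 3: pandas -- dump records to an Excel file (needs openpyxl installed)
df = pd.DataFrame([{'name': 'Douluo Continent', 'score': 9.2}])
df.to_excel('./anime.xlsx', index=False)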

Conclusion

This article mainly described my understanding of crawler module design, and it also serves as a summary of the crawler basics. Looking forward to our next meeting.



What I write comes from my own day-to-day practice, explained from my own perspective from 0 to 1, to make sure it can really be understood.

This article will also be published on my public account [entry to the road to give up]; I look forward to your attention.