Hello, this is Kuls

1. Before we start

Let’s get right to today’s topic: can you really write a crawler? I chose this title because we usually write a crawler as a single .py file with a handful of requests calls, but a formal project has to handle many more situations, so we need to split these functions into modules to make our crawler more robust.

2. The architecture and running process of basic crawler

First of all, what does the architecture of a basic crawler look like? Here is a rough picture:

Here are the functions of these five modules:

- The crawler scheduler mainly calls the other four modules; "scheduling" simply means coordinating them.
- The URL manager is responsible for managing URL links. URLs are split into crawled and uncrawled sets, and the URL manager keeps track of both; it also provides interfaces for adding and fetching new URL links.
- The HTML downloader downloads the HTML of the page to be crawled.
- The HTML parser extracts the data we want from the downloaded HTML source, sends any newly discovered URL links to the URL manager, and sends the processed data to the data store.
- The data store saves the data handed over by the HTML parser locally.
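Before we dive into each module, here is a minimal sketch of how the data flows between the five modules when the crawler runs. The class and method names simply anticipate the code we will write in the sections below, and the imports assume the modules live in a base package, as in the scheduler's imports later:

# A minimal sketch of the data flow between the five modules.
from base.URLManager import URLManager
from base.HTMLDownload import HTMLDownload
from base.HTMLParser import HTMLParser
from base.DataOutput import DataOutput

manager = URLManager()        # keeps track of crawled / uncrawled URLs
downloader = HTMLDownload()   # fetches raw HTML
parser = HTMLParser()         # extracts new URLs and page data
output = DataOutput()         # stores the extracted data

manager.add_new_url("http://www.runoob.com/w3cnote/page/1")
while manager.has_new_url():
    url = manager.get_new_url()                 # scheduler asks for an uncrawled URL
    html = downloader.download(url)             # downloader fetches the page
    new_urls, data = parser.parser(url, html)   # parser extracts links and data
    manager.add_new_urls(new_urls)              # new links go back to the URL manager
    output.store_data(data)                     # extracted data goes to the data store
output.output_html()                            # finally, write everything out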

3. Hands-on: crawling Runoob notes

That is roughly all of it. I believe you now have a preliminary understanding of the overall architecture, so let me pick a simple website and show you how to crawl its information with this crawler architecture:

We want to grab the information from the list above. I’m going to skip analyzing the site here; if you don’t know how to analyze a site, check out the crawler projects I wrote about earlier.

First, let’s write the URL manager (URLManager.py)

class URLManager(object):
    def __init__(self):
        self.new_urls = set()  # uncrawled URLs
        self.old_urls = set()  # crawled URLs

    def has_new_url(self):
        # Check whether there are any uncrawled URLs
        return self.new_url_size() != 0

    def get_new_url(self):
        # Get an uncrawled URL
        new_url = self.new_urls.pop()
        # After taking it out, add it to the crawled set
        self.old_urls.add(new_url)
        return new_url

    def add_new_url(self, url):
        # Add a new link to the uncrawled set (single link)
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        # Add new links to the uncrawled set (a collection of links)
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def new_url_size(self):
        # Number of uncrawled URLs
        return len(self.new_urls)

    def old_url_size(self):
        # Number of crawled URLs
        return len(self.old_urls)

The manager is basically just two sets: one holds the URLs that have already been crawled, and the other holds the URLs that have not been crawled yet. I use the set type here because a set deduplicates automatically.
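For example, here is a quick sanity check of the deduplication (assuming the class above is saved as base/URLManager.py, matching the scheduler's imports later):

from base.URLManager import URLManager

manager = URLManager()
manager.add_new_url("http://www.runoob.com/w3cnote/page/1")
manager.add_new_url("http://www.runoob.com/w3cnote/page/1")  # duplicate, silently ignored
manager.add_new_urls({"http://www.runoob.com/w3cnote/page/2"})

print(manager.new_url_size())   # 2 -- the duplicate was dropped by the set
url = manager.get_new_url()     # pops one uncrawled URL...
print(manager.old_url_size())   # 1 -- ...and records it as crawled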

Next, the HTML downloader (HTMLDownload.py)

import requests


class HTMLDownload(object):
    def download(self, url):
        if url is None:
            return
        # Use a session and set a browser-like User-Agent
        s = requests.Session()
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
        res = s.get(url)
        # Check whether the page was fetched successfully
        if res.status_code == 200:
            res.encoding = 'utf-8'
            return res.text
        return None

As you can see, here we simply fetch the HTML source of the given URL.
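If you want to try the downloader on its own, something like the following will do (the entry page is the same one the scheduler uses later; in a real project you would probably also pass a timeout, e.g. s.get(url, timeout=10)):

from base.HTMLDownload import HTMLDownload

downloader = HTMLDownload()
html = downloader.download("http://www.runoob.com/w3cnote/page/1")
if html is not None:
    print(html[:200])   # first 200 characters of the page source
else:
    print("download failed")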

Then look at the HTML parser (HTMLParser.py)

from bs4 import BeautifulSoup


class HTMLParser(object):

    def parser(self, page_url, html_cont):
        """
        :param page_url: URL of the downloaded page
        :param html_cont: content of the downloaded page
        :return: the new URLs and the extracted data
        """
        if page_url is None or html_cont is None:
            return
        soup = BeautifulSoup(html_cont, 'html.parser')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data

    def _get_new_urls(self, page_url, soup):
        """
        Extract the set of new URLs
        :param page_url: URL of the downloaded page
        :param soup: soup object
        :return: set of new URLs
        """
        new_urls = set()
        for link in range(1, 100):
            # Build the URL of each list page
            new_url = "http://www.runoob.com/w3cnote/page/" + str(link)
            new_urls.add(new_url)
        print(new_urls)
        return new_urls

    def _get_new_data(self, page_url, soup):
        """
        Extract the useful data
        :param page_url: URL of the downloaded page
        :param soup: soup object
        :return: the extracted data
        """
        data = {}
        data['url'] = page_url
        title = soup.find('div', class_='post-intro').find('h2')
        print(title)
        data['title'] = title.get_text()
        summary = soup.find('div', class_='post-intro').find('p')
        data['summary'] = summary.get_text()
        return data

Here we parse the HTML that the downloader fetched and pull out the data we want. If you are not familiar with BeautifulSoup, you can look up my earlier article.
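To see what _get_new_data pulls out, you can feed the parser a small HTML fragment that mimics the div.post-intro structure it expects (the fragment below is made up purely for illustration):

from base.HTMLParser import HTMLParser

sample_html = """
<div class="post-intro">
    <h2>Sample note title</h2>
    <p>A short summary of the note.</p>
</div>
"""

parser = HTMLParser()
# parser() also prints the generated /w3cnote/page/N link set
new_urls, data = parser.parser("http://www.runoob.com/w3cnote/page/1", sample_html)
print(data['title'])    # text of the <h2> inside div.post-intro
print(data['summary'])  # text of the first <p> inside div.post-intro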

Moving on, data storage (DataOutput.py)

import codecs
class DataOutput(object):

    def __init__(self):
        self.datas = []

    def store_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        fout = codecs.open('baike.html', 'a', encoding='utf-8')
        fout.write("<html>")
        fout.write("<head><meta charset='utf-8'/></head>")
        fout.write("<body>")
        fout.write("<table>")
        # Write out the collected data, clearing the in-memory list as we go
        for data in self.datas[:]:
            fout.write("<tr>")
            fout.write("<td>%s</td>" % data['url'])
            fout.write("<td>%s</td>" % data['title'])
            fout.write("<td>[%s]</td>" % data['summary'])
            fout.write("</tr>")
            self.datas.remove(data)
        fout.write("</table>")
        fout.write("</body>")
        fout.write("</html>")
        fout.close()

You may notice that I’m storing the data in an HTML file, but you could just as easily store it in a MySQL database or a CSV file; that’s up to you. I’m only using HTML here for demonstration purposes.
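For instance, a CSV-backed variant could look roughly like this. It is just a sketch using Python's standard csv module; the file name baike.csv and the column order are my own choices, and output_csv plays the role of output_html:

import csv

class DataOutputCSV(object):
    # Same interface as DataOutput above, but writes rows to a CSV file
    def __init__(self):
        self.datas = []

    def store_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_csv(self):
        with open('baike.csv', 'a', newline='', encoding='utf-8') as fout:
            writer = csv.writer(fout)
            for data in self.datas:
                writer.writerow([data['url'], data['title'], data['summary']])
            self.datas = []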

And finally, the crawler scheduler (SpiderMan.py)

from base.DataOutput import DataOutput
from base.HTMLParser import HTMLParser
from base.HTMLDownload import HTMLDownload
from base.URLManager import URLManager


class SpiderMan(object):
    def __init__(self):
        self.manager = URLManager()
        self.downloader = HTMLDownload()
        self.parser = HTMLParser()
        self.output = DataOutput()

    def crawl(self, root_url):
        # Add the entry URL
        self.manager.add_new_url(root_url)
        # While the URL manager still has new URLs and we have fetched fewer than 100
        while(self.manager.has_new_url() and self.manager.old_url_size() < 100):
            try:
                # Get a new URL from the URL manager
                new_url = self.manager.get_new_url()
                print(new_url)
                # The HTML downloader downloads the page
                html = self.downloader.download(new_url)
                # The HTML parser extracts the page data
                new_urls, data = self.parser.parser(new_url, html)
                print(new_urls)
                # Add the extracted URLs to the URL manager
                self.manager.add_new_urls(new_urls)
                # The data store keeps the extracted data
                self.output.store_data(data)
                print("%s links have been fetched" % self.manager.old_url_size())
            except Exception as e:
                print("crawl failed")
                print(e)
        # The data store writes the collected data out in the specified format
        self.output.output_html()


if __name__ == '__main__':
    spider_man = SpiderMan()
    spider_man.crawl("http://www.runoob.com/w3cnote/page/1")

I believe you can follow everything up to this point: the scheduler simply calls the four modules we wrote above. Here is the result of running it:

Conclusion

We have walked through the five modules of a basic crawler architecture. Whether a crawler project is small or large, it cannot do without these five modules. I hope you will write the code out yourself following this structure; it will help your understanding, and if you build your future crawlers on this architecture, they will be more standard and complete.