I started learning Python only recently and wanted to write a small crawler project to get some practice. Since the "×× dolphin" book incident I have moved over to Juejin, and I think Juejin does very well in every respect, especially the quality of the articles; the editor also makes writing very comfortable.

However, every time I searched for a keyword I was interested in, there were so many articles that I couldn't even find a button to sort by "likes", and I had to scroll a long way down to find the articles I wanted. So I wondered whether I could build a small feature with the crawler skills I had just learned: enter a keyword and a minimum number of likes, and it automatically produces a list of articles that meet the requirement (more likes than the threshold we set). Let's look at the results first:

Now let's get to the actual work:

1. Project composition
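
The project is split into a handful of small modules. The grouping below is inferred from the import paths used later in the controller, so treat it as a rough sketch of the layout rather than the exact file arrangement:

crawler/
    url/url_manager.py              # UrlManager
    downloader/html_downloader.py   # HtmlDownloader
    parser/json_parser.py           # JsonParser
    parser/html_parser.py           # HtmlParser
    beans/result_bean.py            # ResultBean
    outputer/html_outputer.py       # HtmlOutputer
    main_controller.py              # MainController and the entry point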

2. URL manager (url_manager)

The URL manager is mainly responsible for generating and maintaining the links that need to be crawled. For Juejin, the site we want to crawl, these fall into two categories: static page URLs and URLs for pages constructed dynamically via AJAX.

The composition of the URLs for these two kinds of request is quite different, and so is the content they return: the static page URL returns an HTML page, while the AJAX request returns a JSON string. To handle both types of access, we can write the URL manager as follows:

class UrlManager(object):
    def __init__(self):
        self.new_urls = set()   # set of new urls

    # Build the url used to access the static page
    def build_static_url(self, base_url, keyword):
        static_url = base_url + '?query=' + keyword
        return static_url

    # Construct new urls from the input base_url and params (parameter dictionary)
    # e.g. https://search-merger-ms.juejin.im/v1/search?query=python&page=1&raw_result=false&src=web
    # start_page is the first page to request, end_page the last (exclusive), and gap the interval between pages
    def build_ajax_url(self, base_url, params, start_page=0, end_page=4, gap=1):
        if base_url is None:
            print('Invalid param base_url!')
            return None
        if params is None or len(params) == 0:
            print('Invalid param request_params!')
            return None
        if end_page < start_page:
            raise Exception('start_page is bigger than end_page!')
        equal_sign = '='    # joins key and value inside a pair
        and_sign = '&'      # joins different key-value pairs
        # Concatenate the base_url and the parameters into a url and add it to the set
        for page_num in range(start_page, end_page, gap):
            param_list = []
            params['page'] = str(page_num)
            for item in params.items():
                param_list.append(equal_sign.join(item))    # each key-value pair is joined with '='
            param_str = and_sign.join(param_list)           # pairs are joined with '&'
            new_url = base_url + '?' + param_str
            self.new_urls.add(new_url)
        return None

    # Get a new url from the url set
    def get_new_url(self):
        if self.new_urls is None or len(self.new_urls) == 0:
            print('there are no new_url!')
            return None
        return self.new_urls.pop()

    # Check whether there are any urls left in the set
    def has_more_url(self):
        if self.new_urls is None or len(self.new_urls) == 0:
            return False
        else:
            return True

As shown above, the initialization function __init__ maintains a set into which the constructed URLs are put. A URL breaks down into a base URL plus access parameters, joined by '?', with the parameters themselves joined by '&'. build_ajax_url concatenates these parts into complete URLs and adds them to the set, get_new_url takes one URL out of the set, and has_more_url checks whether any unconsumed URLs remain.
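
A minimal usage sketch of the manager (the iteration order can vary because a set is unordered):

manager = UrlManager()
manager.build_ajax_url('https://search-merger-ms.juejin.im/v1/search',
                       {'query': 'python', 'raw_result': 'false', 'src': 'web'},
                       start_page=0, end_page=2)
while manager.has_more_url():
    print(manager.get_new_url())
# prints two urls like https://search-merger-ms.juejin.im/v1/search?query=python&raw_result=false&src=web&page=0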

3. Downloader (html_downloader)

import urllib.request

class HtmlDownloader(object):
    def download(self, url):
        if url is None:
            print('one invalid url is found!')
            return None
        response = urllib.request.urlopen(url)
        if response.getcode() != 200:
            print('response from %s is invalid!' % url)
            return None
        return response.read().decode('utf-8')

This code is fairly simple: it uses the urllib library to request the URL and returns the response body decoded as UTF-8.
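
Note that for HTTP error codes urllib.request.urlopen raises urllib.error.HTTPError (and URLError for network failures) instead of returning a response, so the getcode() check above will rarely fire. A small sketch of a more defensive variant (safe_download is just an illustrative name, not part of the project):

import urllib.error
import urllib.request

def safe_download(url):
    try:
        response = urllib.request.urlopen(url, timeout=10)
        return response.read().decode('utf-8')
    except (urllib.error.HTTPError, urllib.error.URLError) as e:
        print('request to %s failed: %s' % (url, e))
        return None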

4. JSON parser (json_parser)

import json
from crawler.beans import result_bean


class JsonParser(object):
    # Parse the JSON string into an object
    def json_to_object(self, json_content):
        if json_content is None:
            print('parse error! json is None!')
            return None
        print('json', str(json_content))
        return json.loads(str(json_content))

    # Extract title, link, collectionCount and other data from the JSON object, wrap them in a bean object,
    # and add these objects to the result list
    def build_bean_from_json(self, json_collection, baseline):
        if json_collection is None:
            print('build bean from json error! json_collection is None!')
            return None
        article_list = json_collection['d']  # list of articles
        result_list = []    # list of results
        for element in article_list:
            starCount = element['collectionCount']  # number of favourites, i.e. number of likes
            if int(starCount) > baseline:   # if the like count exceeds the baseline, build a result object and add it to the list
                title = element['title']
                link = element['originalUrl']
                result = result_bean.ResultBean(title, link, starCount)
                result_list.append(result)      # add to the result list
                print(title, link, starCount)
        return result_list

JSON parsing is split into two steps: 1. convert the JSON string into a dictionary object; 2. extract the article title, link, like count and other information from that dictionary, and decide, based on the baseline, whether to wrap the data in a result object and add it to the result list.
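
To make the expected structure concrete, here is a minimal usage sketch with a hand-built JSON string containing only the fields the parser actually reads ('d', 'title', 'originalUrl', 'collectionCount'); the real response from the search endpoint carries many more fields, and the sketch assumes the project modules above are importable:

parser = JsonParser()
sample = '{"d": [{"title": "Python tips", "originalUrl": "https://juejin.im/post/1", "collectionCount": 42}]}'
json_collection = parser.json_to_object(sample)
results = parser.build_bean_from_json(json_collection, baseline=10)  # keeps articles with more than 10 likes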

5. HTML parser (html_parser)

Visiting 'https://juejin.im/search?query=python' returns an HTML page, but only one page of data, roughly the same amount as one request to 'https://search-merger-ms.juejin.im/v1/search?query=python&page=0&raw_result=false&src=web'. The difference is that the former returns HTML while the latter returns JSON. The HTML parser is included here as well, even though the project does not strictly need it:

from bs4 import BeautifulSoup
from crawler.beans import result_bean


class HtmlParser(object):

    # Create a BeautifulSoup object to structure the HTML
    def build_soup(self, html_content):
        self.soup = BeautifulSoup(html_content, 'html.parser')
        return self.soup

    # Filter the tags according to the number of likes
    def get_dom_by_star(self, baseline):
        doms = self.soup.find_all('span', class_='count')
        # Keep only the nodes whose like count is not below the baseline
        # (build a new list rather than removing items while iterating)
        doms = [dom for dom in doms if int(dom.get_text()) >= baseline]
        return doms

    # Build result objects from the nodes and add them to a list
    def build_bean_from_html(self, baseline):
        doms = self.get_dom_by_star(baseline)
        if doms is None or len(doms) == 0:
            print('doms is empty!')
            return None
        results = []
        for dom in doms:
            starCount = dom.get_text()      # number of likes
            root = dom.find_parent('div', class_='info-box')    # node for this article
            a = root.find('a', class_='title', target='_blank') # the <a> tag holds the article title and link
            link = 'https://juejin.cn' + a['href'] + '/detail'  # construct the link
            title = a.get_text()
            results.append(result_bean.ResultBean(title, link, starCount))
            print(link, title, starCount)
        return results

Parsing the HTML efficiently relies on the 'bs4' (BeautifulSoup 4) module.
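
If it is not installed yet, it can be added with 'pip install beautifulsoup4'. A tiny standalone sketch of the calls used above, run on a made-up snippet of HTML:

from bs4 import BeautifulSoup   # pip install beautifulsoup4

html = '<div class="info-box"><a class="title" target="_blank" href="/post/1">Demo</a><span class="count">15</span></div>'
soup = BeautifulSoup(html, 'html.parser')
span = soup.find('span', class_='count')
print(span.get_text())   # '15'
print(span.find_parent('div', class_='info-box').find('a', class_='title')['href'])   # '/post/1'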

6. Result object (result_bean)

The result object encapsulates one crawl result: the article title, its link, and the number of likes:

# Each article is saved as a bean object containing its title, link, and number of likes
class ResultBean(object):
    def __init__(self, title, link, starCount=10):
        self.title = title
        self.link = link
        self.starCount = starCount

7. HTML outputer (html_outputer)

class HtmlOutputer(object):

    def __init__(self):
        self.datas = []     # list of result objects to output

    # Take in the data to output (the result list)
    def build_data(self, datas):
        if datas is None:
            print('Invalid data for output!')
            return None
        # Decide whether to overwrite or append
        if self.datas is None or len(self.datas) == 0:
            self.datas = datas
        else:
            self.datas.extend(datas)

    # Write the HTML file
    def output(self):
        fout = open('output.html', 'w', encoding='utf-8')
        fout.write('<html>')
        fout.write("<head>")
        fout.write("<link rel=\"stylesheet\" href=\"http://cdn.static.runoob.com/libs/bootstrap/3.3.7/css/bootstrap.min.css\">")
        fout.write("<script src=\"http://cdn.static.runoob.com/libs/bootstrap/3.3.7/js/bootstrap.min.js\"></script>")
        fout.write("</head>")
        fout.write("<body>")
        fout.write("<table class=\"table table-striped\" width=\"200\">")

        fout.write("<thead><tr><td><strong>Article</strong></td><td><strong>Stars</strong></td></tr></thead>")
        for data in self.datas:
            fout.write("<tr>")
            fout.write("<td width=\"100\"><a href=\"%s\" target=\"_blank\">%s</a></td>" % (data.link, data.title))
            fout.write("<td>%s</td>" % data.starCount)
            fout.write("</tr>")

        fout.write("</table>")
        fout.write("</body>")
        fout.write("</html>")
        fout.close()

The parsed list of result objects is written out as a table in an HTML file.
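
A minimal usage sketch, assuming the ResultBean class from the previous section is importable as crawler.beans.result_bean; it writes output.html into the working directory:

outputer = HtmlOutputer()
outputer.build_data([result_bean.ResultBean('Demo article', 'https://juejin.im/post/1', 42)])
outputer.output()   # produces output.html with a one-row table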

8. Controller (main_controller)

from crawler.url import url_manager
from crawler.downloader import html_downloader
from crawler.parser import html_parser, json_parser
from crawler.outputer import html_outputer


class MainController(object):
    def __init__(self):
        self.url_manager = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownloader()
        self.html_parser = html_parser.HtmlParser()
        self.html_outputer = html_outputer.HtmlOutputer()
        self.json_parser = json_parser.JsonParser()

    def craw(self, func):
        def in_craw(baseline):
            print('begin to crawl..')
            results = []
            while self.url_manager.has_more_url():
                content = self.downloader.download(self.url_manager.get_new_url())  # download the content of the next url
                parsed = func(content, baseline)
                if parsed:      # skip pages that failed to download or parse
                    results.extend(parsed)
            self.html_outputer.build_data(results)
            self.html_outputer.output()
            print('crawler end..')
        print('call craw..')
        return in_craw

    def parse_from_json(self, content, baseline):
        json_collection = self.json_parser.json_to_object(content)
        results = self.json_parser.build_bean_from_json(json_collection, baseline)
        return results

    def parse_from_html(self, content, baseline):
        self.html_parser.build_soup(content)  # build the HTML page into a soup tree with BeautifulSoup
        results = self.html_parser.build_bean_from_html(baseline)
        return results

In the controller, the __init__ function creates instances of the modules above. The functions parse_from_json and parse_from_html parse results from JSON and HTML respectively. The craw function uses a closure to abstract over the parser, which makes it easy to select the one we want: we simply pass the parser in as the parameter 'func'. This is similar to using interfaces in Java, but more flexible.
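
The pattern itself is just a function that returns a closure; a stripped-down sketch of the same idea, with purely illustrative names:

def make_craw(parse_func):
    def in_craw(baseline):
        content = '{"d": []}'                 # stands in for downloaded content
        return parse_func(content, baseline)  # the injected parser does the real work
    return in_craw

craw_demo = make_craw(lambda content, baseline: [])   # any callable with this signature fits
print(craw_demo(10))    # -> []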

if __name__ == '__main__':
    base_url = 'https://juejin.im/search'    # url of the static HTML page to crawl (without parameters)
    ajax_base_url = 'https://search-merger-ms.juejin.im/v1/search'      # url accessed via Ajax (without parameters, returns JSON)
    keyword = 'python'  # search keyword
    baseline = 10   # minimum number of likes
    # Create the controller object
    crawler_controller = MainController()
    static_url = crawler_controller.url_manager.build_static_url(base_url, keyword)     # build the static url

    # craw_html = crawler_controller.craw(crawler_controller.parse_from_html)  # select the HTML parser
    # static_url, baseline

    # Example url for an Ajax request: 'https://search-merger-ms.juejin.im/v1/search?query=python&page=0&raw_result=false&src=web'
    params = {}     # request parameters
    # Initialize the request parameters
    params['query'] = keyword
    params['page'] = '1'
    params['raw_result'] = 'false'
    params['src'] = 'web'
    crawler_controller.url_manager.build_ajax_url(ajax_base_url, params)    # build the ajax access urls
    craw_json = crawler_controller.craw(crawler_controller.parse_from_json) # select the JSON parser
    craw_json(baseline)     # start crawling

Done!