There is hardly a better language for writing a crawler than Python. The Python community provides a wealth of crawler tools that work right out of the box, so you can have a working crawler in minutes. In this article we will write one that downloads Liao Xuefeng's Python tutorial as a PDF ebook for offline reading.

Before writing the crawler, let's analyze the structure of the site's pages. On the left of each page is the tutorial's table of contents, where each entry links to an article shown on the right. At the top right is the article's title, and below it the article body, which is the content we care about: what we want to crawl is the body text of every page. Below the body is the user comment section, which is of no use to us, so we can ignore it.

Tools to prepare

Once you understand the basic structure of the site, you can prepare the packages the crawler depends on. requests and BeautifulSoup are two powerful tools for this job: requests handles the web requests and BeautifulSoup handles the HTML parsing. With these two, we don't need a heavyweight crawler framework like Scrapy. wkhtmltopdf is a cross-platform tool for converting HTML to PDF, and pdfkit is a Python wrapper around wkhtmltopdf. First install the Python dependencies, then install wkhtmltopdf:

pip install requests
pip install beautifulsoup4
pip install html5lib   # parser used by BeautifulSoup in the examples below
pip install pdfkit
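
As a quick sanity check that the libraries are installed correctly, you can fetch a page and print its title (a minimal sketch; any URL will do):

import requests
from bs4 import BeautifulSoup

# Fetch a page and parse it; printing the <title> confirms that both
# requests and BeautifulSoup (with the html5lib parser) work.
response = requests.get("http://www.liaoxuefeng.com")
soup = BeautifulSoup(response.content, "html5lib")
print(soup.title.string)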

Install wkhtmltopdf

On Windows, download the stable build from the wkhtmltopdf official website and install it, then add the program's installation path to the system $PATH variable; otherwise pdfkit will not be able to find wkhtmltopdf and will fail with "No wkhtmltopdf executable found". On Ubuntu and CentOS it can be installed directly from the command line:

$ sudo apt-get install wkhtmltopdf  # ubuntu
$ sudo yum install wkhtmltopdf      # centos
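
If you would rather not modify $PATH on Windows, pdfkit can also be pointed at the wkhtmltopdf binary explicitly through its configuration object. A minimal sketch, assuming the default Windows install location (adjust the path for your machine):

import pdfkit

# Tell pdfkit where the wkhtmltopdf executable lives instead of
# relying on $PATH (the path below is an example; adjust as needed).
config = pdfkit.configuration(
    wkhtmltopdf=r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe")

# Pass the configuration along with the conversion call.
pdfkit.from_url("http://example.com", "out.pdf", configuration=config)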

The crawler implementation

Once everything is ready, we can get to the code, but before writing any, let's organize our thinking. The goal of the program is to save the HTML body of every URL locally, then use pdfkit to convert those files into a single PDF file. Let's split the task in two: first, save the HTML body for one URL locally; then find all the URLs and perform the same operation on each.

Use Chrome to find the tag that wraps the body of the page. Press F12 and locate the corresponding div tag: <div class="x-wiki-content">. This div holds the body content of the page. After loading the whole page locally with requests, we can use BeautifulSoup to walk the HTML DOM and extract the body content.


import requests
from bs4 import BeautifulSoup


def parse_url_to_html(url):
    """Download the page at url and save its body content locally."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html5lib")
    # The article body lives in the div with class "x-wiki-content".
    body = soup.find_all(class_="x-wiki-content")[0]
    html = str(body)
    with open("a.html", "wb") as f:
        f.write(html.encode("utf-8"))

The second step is to parse out all the URLs on the left side of the page. In the same way, use F12 to find the left-hand menu tag: <ul class="uk-nav uk-nav-side">.

Because there are two elements on the page with the class uk-nav uk-nav-side, the actual table of contents is the second one. We collect all the URLs, then apply the URL-to-HTML function written in the first step to each of them.

def get_url_list():
    """Get the list of all URLs in the table of contents."""
    response = requests.get("http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000")
    soup = BeautifulSoup(response.content, "html5lib")
    # The second "uk-nav uk-nav-side" element is the table of contents.
    menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1]
    urls = []
    for li in menu_tag.find_all("li"):
        url = "http://www.liaoxuefeng.com" + li.a.get('href')
        urls.append(url)
    return urls

The final step is to convert the HTML to PDF. This part is easy because pdfkit encapsulates all the conversion logic; we just need to call pdfkit.from_file:

import pdfkit


def save_pdf(htmls, file_name):
    """Convert all HTML files into a single PDF file."""
    options = {
        'page-size': 'Letter',
        'encoding': "UTF-8",
        'custom-header': [
            ('Accept-Encoding', 'gzip')
        ]
    }
    pdfkit.from_file(htmls, file_name, options=options)

Run save_pdf and the ebook PDF file is generated.
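
Putting the pieces together, a minimal driver might look like the following sketch. It assumes parse_url_to_html is extended to take an output file name (as written above it always saves to a.html), and the file names here are illustrative:

def main():
    urls = get_url_list()
    # One numbered HTML file per article in the table of contents.
    htmls = ["{}.html".format(i) for i in range(len(urls))]
    for url, name in zip(urls, htmls):
        parse_url_to_html(url, name)
    save_pdf(htmls, "liaoxuefeng-python-tutorial.pdf")


if __name__ == '__main__':
    main()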

Conclusion

The total amount of code adds up to less than 50 lines. But wait: the code above omits a few details, such as how to fetch the article title, and the fact that the img tags in the body use relative paths, which must be converted to absolute paths for the images to display properly in the PDF. The temporary HTML files also need to be deleted afterwards. All of these details are handled in the code on GitHub.
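
As an illustration of the image-path detail, the relative src attributes can be rewritten with BeautifulSoup before the body is saved. A minimal sketch, assuming the same site base URL as the code above (the function name is mine, not from the original repo):

def fix_img_paths(soup):
    """Rewrite relative img src attributes to absolute URLs so that
    wkhtmltopdf can fetch the images when rendering the PDF."""
    for img in soup.find_all("img"):
        src = img.get("src", "")
        if src.startswith("/"):
            img["src"] = "http://www.liaoxuefeng.com" + src
    return soup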

The complete code can be downloaded from GitHub; it has been tested on Windows. Feel free to fork it and improve it yourself. Readers who cannot access GitHub can use the Gitee mirror. The PDF file of the "Liao Xuefeng Python Tutorial" ebook can be downloaded for free by following the public account "A Programmer's Micro Station" and replying "PDF".

This article was first published on the public account "A Programmer's Micro Station" (id: VTtalk).
