I recently came across the Java NIO tutorial at tutorials.jenkov.com/java-nio, which is of high quality. I checked the website and found no link for a PDF download. The site is also hosted abroad and can only be reached through a proxy, so I decided to scrape the content myself and turn it into an e-book.

The main idea

Looking at the page structure, the main task is to collect the link of each article from the tutorial directory on the left, then fetch each page and convert its content to PDF. Of course, this kind of thing is not something to do by hand.

The script fetches each page with Python's Requests library, extracts the required parts with BeautifulSoup, and generates the PDF e-book from the HTML with PDFKit.
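The link-gathering step described above can be sketched as follows. This is a minimal sketch, assuming the `href` values in the tutorial directory are relative paths; the sample paths are hypothetical and only show the shape of the data, and `resolve_links` is an illustrative helper name. It uses `urllib.parse.urljoin` from the standard library instead of string concatenation, so both site-relative and page-relative links resolve correctly.

```python
from urllib.parse import urljoin

# Index page that holds the tutorial directory (URL taken from the article).
INDEX_URL = "http://tutorials.jenkov.com/java-nio/index.html"


def resolve_links(index_url, hrefs):
    """Turn href values from the directory into absolute URLs,
    dropping duplicates while keeping the original order."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(index_url, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical href values, for illustration only.
sample_hrefs = ["/java-nio/overview.html", "channels.html", "/java-nio/overview.html"]
links = resolve_links(INDEX_URL, sample_hrefs)
print(links)
```

Removing duplicates matters here because every link becomes one chapter of the e-book, and a repeated link would produce a repeated chapter.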

The result is a PDF e-book with bookmarks.

Full code and the pitfalls encountered

  1. First, set CSS display: none on the parts of the fetched pages that are not needed.
  2. wkhtmltopdf must be installed on your machine (on Ubuntu: apt install wkhtmltopdf).
  3. Bookmarks and header or watermark-style requirements are configured by passing the relevant options through PDFKit. See wkhtmltopdf.org/usage/wkhtm… for reference.
  4. Text and tables get split across pages when converting HTML to PDF, and this has to be dealt with. The page-break-before and similar styles suggested online do not solve it perfectly.
  5. This code mainly targets tutorials.jenkov.com/java-nio/. With some Python background you should soon be able to adapt it to scrape other online resources into e-books.
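For item 3, it may help to see how an option dictionary relates to the wkhtmltopdf command line. The sketch below is a simplified illustration, not pdfkit's actual code: the helper `to_cli_args` is hypothetical, and it assumes the convention that each key becomes a `--key` flag and that a value of `None` marks a switch that takes no argument.

```python
def to_cli_args(options):
    """Hypothetical helper: show roughly which wkhtmltopdf flags an
    options dictionary corresponds to. Not part of pdfkit."""
    args = []
    for key, value in options.items():
        args.append("--" + key)
        # None marks a switch without an argument, e.g. --outline.
        if value is not None:
            args.append(str(value))
    return args


sample_options = {
    "page-size": "Letter",
    "margin-top": "5mm",
    "outline": None,       # keep the bookmark tree
    "outline-depth": 5,    # how many heading levels become bookmarks
}
print(" ".join(to_cli_args(sample_options)))
```

Seen this way, setting both `no-outline` and `outline` in one dictionary sends two contradictory switches, so only one of them should be present.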
import requests
import pdfkit
from bs4 import BeautifulSoup

# wkhtmltopdf options passed through pdfkit; None means a switch without an argument.
options = {
    'page-size': 'Letter',
    'margin-top': '5mm',
    'margin-bottom': '20mm',
    'encoding': "UTF-8",
    'header-right': "This resource was collected from the Internet by www.codehome.vip",
    'header-font-size': 8,
    'outline': None,       # generate bookmarks
    'outline-depth': 5,
}


def dealHtml(url):
    """Download one article and patch its HTML/CSS so it prints cleanly."""
    print(url)
    r = requests.get(url)
    html = r.text
    # Strip </head>; the replacement string is empty in the published code.
    html = html.replace("</head>", '''''')
    # Hide navigation, update notice and social widgets by adding display:none to their CSS rules.
    html = html.replace("#bottomNavBar2Parent{", "#bottomNavBar2Parent{display:none;")
    html = html.replace("#lastUpdate{", "#lastUpdate{display:none;")
    html = html.replace("#bottomSocial{", "#bottomSocial{display:none;")
    # Make image URLs absolute so wkhtmltopdf can load them.
    html = html.replace('img src="/images/', 'img src="http://tutorials.jenkov.com/images/')
    # Try to keep code boxes from being split across pages.
    html = html.replace('class="codeBox"', 'class="codeBox" style="overflow:hidden; display: inline-block; page-break-inside: avoid !important; page-break-after: avoid !important; page-break-before: avoid !important;" ')
    html = html.replace('.topBar, .footer {', '.topBar, .footer {display:none; ')
    return html

website = "http://tutorials.jenkov.com"
r = requests.get("http://tutorials.jenkov.com/java-nio/index.html")
soup = BeautifulSoup(r.content, "lxml")
hreflist = []
# Links of the articles in the left-hand tutorial directory.
hrefs = soup.select("#trailToc > ol > li > a")
allbook = ""
for href in hrefs:
    newHref = website + href['href']
    hreflist.append(newHref)
    allbook = allbook + dealHtml(newHref)
try:
    pdfkit.from_string(allbook, "java-nio.pdf", options=options)
except OSError as e:
    # pdfkit raises IOError (OSError) when wkhtmltopdf reports an error; print it instead of hiding it.
    print(e)
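An alternative to patching each CSS rule inside `dealHtml` is to inject one extra `<style>` block before `</head>`. This is only a sketch under assumptions: `hide_elements` is a hypothetical helper, and the selector list simply reuses the ids and classes that appear in the code above. It relies on `!important` so the injected rule overrides the page's own styles.

```python
def hide_elements(html, selectors):
    """Hypothetical helper: hide the given CSS selectors by injecting a
    <style> block just before </head>. Returns html unchanged if there
    is no </head> tag or no selectors."""
    if not selectors or "</head>" not in html:
        return html
    rule = ", ".join(selectors) + " { display: none !important; }"
    style = "<style>" + rule + "</style>"
    # Only the first </head> is touched.
    return html.replace("</head>", style + "</head>", 1)


# Selectors taken from the replacements in the article's code.
SELECTORS = ["#bottomNavBar2Parent", "#lastUpdate", "#bottomSocial", ".topBar", ".footer"]

page = "<html><head><title>t</title></head><body><div id='lastUpdate'>x</div></body></html>"
print(hide_elements(page, SELECTORS))
```

The advantage of this form is that it does not depend on the exact spelling of the site's stylesheet, such as whether a rule is written `#lastUpdate{` or `#lastUpdate {`.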

If you want this e-book, follow the WeChat official account "programmers push" and send "Nioo" to it in the chat.