Repo address: github.com/96chh/crawl…

Features

Crawls the digest highlights of a Knowledge Planet (zsxq) group and turns them into a PDF e-book.

Preview

Usage

if __name__ == '__main__':
    start_url = 'https://api.zsxq.com/v1.10/groups/454584445828/topics?scope=digests&count=20'
    make_pdf(get_data(start_url))

Change start_url to the URL of the planet (group) you want to crawl.

Also install wkhtmltox; see "Create a PDF ebook" below.

Simulated login

We crawl the web version of Knowledge Planet: wx.zsxq.com/dweb/#.

The site does not rely on cookies to decide whether you are logged in; instead it checks the Authorization field in the request headers.

So you need to replace both Authorization and User-Agent with your own values.

headers = {
    'Authorization': '3704A4EE-377E-1C88-B031-0A42D9E9Bxxx',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'
}

Analyzing the page

After logging in successfully, my first instinct is to right-click, inspect, and view the page source.

This page, however, does not put the content in the document served at the address-bar URL. Instead it loads the content asynchronously via XHR requests, so the job is simply to find the right API endpoint.

Endpoint for the digest (best-of) section: api.zsxq.com/v1.10/group…

This endpoint returns the latest 20 posts; older data lives behind different request URLs, which is the paging problem discussed below.
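To make the response handling concrete, here is a minimal hedged sketch of pulling one page of posts out of a response. The function name is mine, not the repo's; the only fields it relies on (succeeded, resp_data, topics) are the ones shown later in this post.

```python
def extract_topics(payload):
    """Pull one page of posts out of a zsxq API response.

    Assumed shape (inferred from this post):
    {"succeeded": true, "resp_data": {"topics": [...]}}.
    """
    if not payload.get('succeeded'):
        # Most likely cause: the Authorization header was rejected
        raise RuntimeError('request failed; check the Authorization header')
    return payload.get('resp_data', {}).get('topics', [])
```

Each call yields at most 20 posts; an empty list signals the last page.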

Create a PDF ebook

  • Install wkhtmltox from wkhtmltopdf.org/downloads.h… and add its bin directory to your PATH after installation.
  • pip install pdfkit
  • pdfkit converts HTML documents to PDF and builds the table of contents automatically from the H (heading) tags in the HTML.
  • The digest posts originally have no titles, so I use the first 6 characters of each question as its title to tell the posts apart.
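Since pdfkit derives the table of contents from heading tags, each post needs a heading. The titling rule above can be sketched as follows (the helper names are mine, not the repo's):

```python
def make_title(question: str, length: int = 6) -> str:
    """Use the first few characters of a question as its title."""
    return question[:length]

def wrap_post(question: str, answer_html: str) -> str:
    """Wrap one post in HTML; the <h2> tag becomes a TOC entry in the PDF."""
    return f'<h2>{make_title(question)}</h2><p>{question}</p>{answer_html}'
```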

The next steps refine the result:

Crawling images

In the returned data, the images key holds a post's images; just extract the large (high-resolution) URL.

The key step is inserting <img> tags into the HTML document.

I use BeautifulSoup to manipulate the DOM.

Note that a post may have more than one image, so iterate over all of them with a for loop:

if content.get('images'):
    soup = BeautifulSoup(html_template, 'html.parser')
    for img in content.get('images'):
        url = img.get('large').get('url')
        img_tag = soup.new_tag('img', src=url)
        soup.body.append(img_tag)
    # Serialize once, after all images have been appended
    html_img = str(soup)
    html = html_img.format(title=title, text=text)

The paging problem

  • api.zsxq.com/v1.10/group…

  • The end_time parameter appended to the path marks the last loaded post; it is what drives paging.

  • end_time must be URL-escaped, which urllib.parse.quote handles. The real question is where end_time comes from.

  • After careful observation, I found that each request returns 20 posts, and the next link's end_time is derived from the last post of the current batch.

  • For example, if the last post's create_time is 2018-01-10T11:49:39.668+0800, then the next link's end_time is 2018-01-10T11:49:39.667+0800. Note the 668 versus 667: a difference of exactly 1 millisecond.

# Subtract 1 from the millisecond field (characters 20-22) of create_time
end_time = create_time[:20] + str(int(create_time[20:23]) - 1) + create_time[23:]
  • When the last page is reached, the response is:

{"succeeded": true, "resp_data": {"topics": []}}

next_page = rsp.json().get('resp_data').get('topics')
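Putting the paging pieces together, here is a hedged sketch of the URL computation: subtract one millisecond from the last post's create_time (using datetime rather than string slicing, which avoids trouble when the millisecond field is 000), then URL-escape it with urllib.parse.quote. The parameter name end_time follows this post; the base URL and helper names are my own illustration.

```python
from datetime import datetime, timedelta
from urllib.parse import quote

def prev_millisecond(create_time: str) -> str:
    """Subtract 1 ms from a timestamp like 2018-01-10T11:49:39.668+0800."""
    t = datetime.strptime(create_time, '%Y-%m-%dT%H:%M:%S.%f%z')
    t -= timedelta(milliseconds=1)
    s = t.strftime('%Y-%m-%dT%H:%M:%S.%f%z')  # %f prints 6 digits
    return s[:23] + s[26:]                    # trim microseconds to ms

def next_page_url(base_url: str, last_create_time: str) -> str:
    """Build the next page's URL from the last post of the current batch."""
    end_time = prev_millisecond(last_create_time)
    return f'{base_url}&end_time={quote(end_time)}'
```

Loop until the topics list comes back empty, as shown by the end-of-pages response above.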

Create a beautiful PDF

Use CSS styles to control font size, layout, color, etc. See the test.css file.

Pass the file in via the options dict:

options = {
    'user-style-sheet': 'test.css',
}
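A hedged sketch of wiring the stylesheet through pdfkit (pdfkit.from_string and the user-style-sheet option are real; the helper names and the extra encoding option are my additions):

```python
def build_options(css_path: str = 'test.css') -> dict:
    # Keys map to wkhtmltopdf flags: --user-style-sheet, --encoding
    return {
        'user-style-sheet': css_path,
        'encoding': 'utf-8',
    }

def render_pdf(html: str, out_path: str = 'ebook.pdf') -> None:
    import pdfkit  # lazy import; needs wkhtmltopdf's bin dir on PATH
    pdfkit.from_string(html, out_path, options=build_options())
```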

The hardest problem of all: buddy, give the repo a star!!!