This is the 13th day of my participation in the First Challenge 2022. For details, see: First Challenge 2022.

preface

I have recently been organizing blog posts I wrote earlier, and ran into a problem: some previously published posts no longer have local source files. Moving these articles around is troublesome, because manually copying the rich text and pasting it into markdown loses images, breaks formatting, and causes all sorts of other problems; sometimes it would honestly be easier to rewrite the post than to fix the paste. To solve this once and for all, I wrote a small crawler script that fetches the posts without source files in bulk and automatically converts them into markdown files, ready to use.

preparation

Since the script is Python-based, make sure you have a Python 3 environment installed first, and then install the following two libraries.

The first one

pip install newspaper3k

newspaper3k is a library for crawling articles. There are many general-purpose crawler libraries we could use instead, such as requests, requests-html, httpx, and so on, but newspaper3k is chosen because it is tailored for articles: the article's author, the static resources in the article, the publication time and so on can all be extracted directly.
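For example, here is a minimal sketch of the metadata newspaper3k exposes after parsing, using the same example URL as later in this post; the attributes shown are standard Article fields, and some may be empty depending on the page.

from newspaper import Article

url = "https://www.u1s1.vip/docs/MacOS/MacOS-1"

article = Article(url, language='zh')
article.download()
article.parse()

print(article.title)         # article title
print(article.authors)       # detected authors (may be an empty list)
print(article.publish_date)  # publication time, if the page exposes it
print(article.top_image)     # URL of the main image
print(article.images)        # set of image URLs found in the article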

The second one

pip install html2text

html2text is a library that converts HTML into plain-text formats. We can use it to convert the HTML content of the crawled article directly into markdown-formatted text and then save it to a file.
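As a quick sketch of how html2text is typically used (the option values here are only illustrative and not part of the script below):

import html2text

converter = html2text.HTML2Text()
converter.body_width = 0         # don't hard-wrap long lines
converter.ignore_links = False   # keep hyperlinks as markdown links
converter.ignore_images = False  # keep image references as markdown images

markdown_text = converter.handle("<h1>Hello</h1><p>An <b>HTML</b> fragment.</p>")
print(markdown_text)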

With the above pieces in place, we just need to automate the process in batches to get the job done quickly.

The practice part

Crawl the content of the article

Original page:

code

from newspaper import Article

# The article we plan to crawl
url = "https://www.u1s1.vip/docs/MacOS/MacOS-1"

# Instantiate the crawler; specify the language as Chinese so the article text is extracted correctly
article = Article(url, language='zh')
# Download the article
article.download()

# View the HTML content of the article
print(article.html)

# Parse the content so we can access the plain text
article.parse()
print(article.text)

Code run result

Convert to markdown

Obviously, the plain text we crawled above is not enough for our needs, since it loses the original formatting, so we need to add a step that converts the HTML to markdown.

code

from newspaper import Article
import html2text as ht

url = "https://www.u1s1.vip/docs/MacOS/MacOS-1"

article = Article(url)
article.download()

# Get the HTML content
html = article.html

# Instantiate the html2text converter
runner = ht.HTML2Text()
# Convert HTML to Markdown
res = runner.handle(html)
# print(res)

# Save the markdown content to res.md
with open('res.md', mode='w') as f:
    f.write(res)

Code execution results

Why the UnicodeEncodeError? Because the content we are writing to the file contains Chinese characters, we need to declare the file's encoding as UTF-8 explicitly.

The modified code

from newspaper import Article
import html2text as ht

url = "https://www.u1s1.vip/docs/MacOS/MacOS-1"

article = Article(url)
article.download()

# Get the HTML content
html = article.html

# Instantiate the html2text converter
runner = ht.HTML2Text()
# Convert HTML to Markdown
res = runner.handle(html)
# print(res)

# Save the markdown content to res.md, declaring UTF-8 encoding
with open('res.md', mode='w', encoding='utf-8') as f:
    f.write(res)

Code execution results

results

Let's open res.md and have a look.

We can see that the markdown content is basically the same as the original article, except for some extra site headers and footers, which we can simply delete (a small cleanup sketch follows below). The editor also shows a number of yellow warnings about markdown syntax, but they don't affect how the content renders, so let's preview it and see how it looks.
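A hypothetical cleanup step might look like the following; the start and end markers here are assumptions and would need to be adjusted to match the site being crawled.

def trim_markdown(md, start_marker="# ", end_marker="Copyright"):
    # Keep only the lines between the first heading and a known footer marker.
    # Both markers are assumptions; adjust them for the actual site.
    lines = md.splitlines()
    start = next((i for i, line in enumerate(lines) if line.startswith(start_marker)), 0)
    end = next((i for i, line in enumerate(lines) if end_marker in line), len(lines))
    return "\n".join(lines[start:end])

with open('res.md', encoding='utf-8') as f:
    cleaned = trim_markdown(f.read())

with open('res.md', mode='w', encoding='utf-8') as f:
    f.write(cleaned)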

conclusion

From the above experience, we can confirm that this approach meets our needs well: the original article's content, formatting, code, and image-hosting links are all preserved.

If we need to process many articles, we can simply store their URLs in a list and loop over them, as sketched below.
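A minimal sketch of that batch loop, assuming a hypothetical list of article URLs and article_0.md, article_1.md, ... as output file names:

from newspaper import Article
import html2text as ht

# Placeholder URLs for the posts that have no local source files
urls = [
    "https://www.u1s1.vip/docs/MacOS/MacOS-1",
    # ... more article URLs ...
]

runner = ht.HTML2Text()

for index, url in enumerate(urls):
    article = Article(url, language='zh')
    article.download()
    md = runner.handle(article.html)
    # The output file name is just an assumed naming scheme
    with open(f'article_{index}.md', mode='w', encoding='utf-8') as f:
        f.write(md)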