【 play 】 Batch save web pages to local - m_downlink

GitHub

At first, I wrote a script to download blog articles and other favorites to a local collection, and generated a directory file for easy reference. Later, after tidying it up, I posted it to GitHub for everyone to use. I have also packaged it as an .exe executable for those who do not have a Python environment.

The pages I download are mostly ones I find worth collecting: substantial, practical content or simply well-written articles. Collecting links has the advantage of saving space and being convenient. The drawback is that one day the link may return a 404 or a message such as "the content does not exist", because the author may have deleted the article.

Examples of finished products:

Project Document Description

The project consists of the following files:

m_downlink.py: the main script that downloads web pages; it also accepts command-line parameters
config.py: the configuration file for setting parameters
m_service.py: the service startup file; it serves the downloaded directory file over HTTP so it can be accessed
chromedriver.exe: the Chrome driver that Selenium calls

config.py

Start with the configuration. Open config.py and you can set the following parameters:

# Folder where the downloaded files are saved. Note: use an absolute path
DOWNLOAD = r'C:\Users\Administrator\Desktop\xxx'

# File listing the addresses to download locally, such as https://juejin.cn/user/2805609406139950
# One address per line; multiple lines are allowed and empty lines are ignored
LINK_FILE = r'./m_links.txt'

# If your default input method is Chinese, set True to switch the input method; if English, use False
SHIFT = True

# Debug mode for the service, either True or False
DEBUG = False

# Address the service binds to: "127.0.0.1" allows access from the current host only, "0.0.0.0" allows access from the current LAN
HOST = '0.0.0.0'

# Port number the service binds to
PORT = 8009

# Number used to name the first download directory; each following directory adds 1
NUMBER = 1

# Maximum number of scrollbar operations
SCROLL_MAX = 5

# Scrollbar distance per operation, in px (pixels)
SCROLL_MOVE = 700

Note that the DOWNLOAD parameter requires an absolute path, not a relative path.
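
Because a relative DOWNLOAD path is easy to set by mistake, a small check before starting can catch it. The following is a minimal sketch, not part of the project: the function name check_config is hypothetical, and it assumes the values use Windows-style paths as in the example above.

```python
from pathlib import PureWindowsPath


def check_config(download: str, port: int, scroll_max: int, scroll_move: int) -> list[str]:
    """Return a list of problems found in the given settings (empty if none).

    Hypothetical helper for illustration; m_downlink itself may validate differently.
    """
    problems = []
    # PureWindowsPath parses Windows paths on any OS without touching the disk.
    if not PureWindowsPath(download).is_absolute():
        problems.append("DOWNLOAD must be an absolute path")
    if not 1 <= port <= 65535:
        problems.append("PORT must be between 1 and 65535")
    if scroll_max < 0 or scroll_move <= 0:
        problems.append("SCROLL_MAX must be >= 0 and SCROLL_MOVE must be > 0")
    return problems


if __name__ == "__main__":
    print(check_config(r"C:\Users\Administrator\Desktop\xxx", 8009, 5, 700))
    print(check_config(r"./downloads", 8009, 5, 700))
```

An empty list means the settings passed these basic checks; it does not verify that the folder actually exists.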

m_downlink.py

This is the main script. It does the following:

Download web page content to the configured path
Fix image display problems
Write the directory file

The script uses Selenium with ChromeDriver to save each page through keyboard operations, writing it to a folder named with a number. It lets the browser load and cache the images first, and after all download tasks are complete it writes a directory file named links.html.
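
To make the last step concrete, here is a minimal sketch of how a directory file like links.html could be generated. It is not the project's actual code: the function build_directory_html, the (folder, title) input, and the assumption that each numbered folder contains an m_index.html page are all illustrative.

```python
from html import escape


def build_directory_html(entries: list[tuple[str, str]]) -> str:
    """Build a simple HTML index from (folder_name, title) pairs.

    Assumes each page lives at <folder_name>/m_index.html (an assumption
    made for this sketch, not a documented layout).
    """
    items = []
    for folder, title in entries:
        href = escape(f"{folder}/m_index.html", quote=True)
        items.append(f'<li><a href="{href}">{escape(title)}</a></li>')
    body = "\n".join(items)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>links</title></head>\n'
        f"<body><ul>\n{body}\n</ul></body></html>\n"
    )


if __name__ == "__main__":
    page = build_directory_html([("0001", "First article"), ("0002", "Tips & tricks")])
    print(page)
```

Escaping the titles keeps characters such as & or < in an article name from breaking the generated page.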

The basic use is to run directly with a complete configuration file:

python3 m_downlink.py

For special requirements, the script accepts the following parameters:

-t Test whether the components run properly.
-d Download the links in the specified LINK_FILE to the DOWNLOAD directory.
-w Generate the directory file in the DOWNLOAD directory.
-f [number:number] Find the link data at the specified index position and use it as the download input.
-n [number] Specify the starting number of the download directory name.
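
A command-line interface like this is commonly built with argparse. The sketch below mirrors the flags listed above; it is an illustration under that assumption, not the script's real parser, and the destination names (test, download, write, find, number) are made up here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented flags (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(prog="m_downlink.py")
    parser.add_argument("-t", dest="test", action="store_true",
                        help="test whether the components run properly")
    parser.add_argument("-d", dest="download", action="store_true",
                        help="download the links in LINK_FILE to DOWNLOAD")
    parser.add_argument("-w", dest="write", action="store_true",
                        help="generate the directory file in DOWNLOAD")
    parser.add_argument("-f", dest="find", metavar="RANGE",
                        help="only use the links at the given position or range")
    parser.add_argument("-n", dest="number", type=int, metavar="N",
                        help="starting number for download directory names")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-f", "5:55", "-d"])
    print(args.download, args.find, args.number)
```

Leaving -n as None when it is not given makes it easy to fall back to NUMBER from config.py.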

For example, to test whether ChromeDriver is running properly:

python3 m_downlink.py -t

To download only, without generating the directory file:

python3 m_downlink.py -d

To generate the directory file for links that have already been downloaded:

python3 m_downlink.py -w

The NUMBER setting in the configuration file and the -n parameter do the same thing; when both are given, -n takes precedence. If you want the folder names to start from 100, use -n 100, which gives 0100:

python3 m_downlink.py -n 100
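
The naming rule can be sketched as follows. The four-digit zero padding is inferred from the 0100 example above, and both function names are hypothetical.

```python
from typing import Optional


def resolve_start(cli_number: Optional[int], config_number: int) -> int:
    """-n on the command line takes precedence over NUMBER in config.py."""
    return config_number if cli_number is None else cli_number


def folder_names(start: int, count: int, width: int = 4) -> list[str]:
    """Return `count` consecutive folder names starting at `start`, zero-padded."""
    return [f"{start + i:0{width}d}" for i in range(count)]


if __name__ == "__main__":
    start = resolve_start(cli_number=100, config_number=1)
    print(folder_names(start, 3))  # ['0100', '0101', '0102']
```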

If the configured LINK_FILE contains 100 links and you only need to download some of them, use -f to specify the range:

# Download only the fourth link
python3 m_downlink.py -f 4 -d

# Download links 5 to 55
python3 m_downlink.py -f 5:55 -d

# Download the links starting from 55
python3 m_downlink.py -f 55:999 -d
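
One way to interpret the -f value is shown below. This is a sketch under assumptions the article does not confirm: positions are taken to be 1-based and ranges inclusive at both ends, and select_links is a hypothetical name.

```python
def select_links(links: list[str], spec: str) -> list[str]:
    """Pick links by a spec such as "4" or "5:55" (1-based, inclusive; assumed)."""
    if ":" in spec:
        first_text, last_text = spec.split(":", 1)
        first, last = int(first_text), int(last_text)
    else:
        first = last = int(spec)
    if first < 1 or last < first:
        raise ValueError(f"invalid range: {spec!r}")
    # Slicing clamps to the list length, so an oversized end such as 999 is safe.
    return links[first - 1:last]


if __name__ == "__main__":
    links = [f"https://example.com/post/{i}" for i in range(1, 11)]
    print(select_links(links, "4"))
    print(select_links(links, "8:999"))
```

Using a large end value such as 999 is a simple way to say "everything from here on" without knowing how many links the file contains.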

Note: do not use the mouse after the script starts, because that takes the focus away from the browser window and the download may fail. By default, the script scrolls the page at most SCROLL_MAX times, moving SCROLL_MOVE pixels each time. If the page has still not reached the bottom after the maximum number of scrolls:

① If image loading is not important to you, press 1 right away to continue

② If you need the images to load, scroll manually with the mouse to the desired height, then press 1 to continue

The manual step exists to handle sites such as Juejin, where scrolling past the end of an article loads a long stream of recommended posts through partial refreshes. The script cannot reliably predict where the article text ends, and the recommendations are not something I want to save, so a human makes the call.
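
The scrolling behaviour described above could look roughly like this with Selenium. The sketch relies only on the WebDriver method execute_script; the function name scroll_page and the JavaScript used to detect the page bottom are assumptions, not the project's code.

```python
def scroll_page(driver, scroll_max: int, scroll_move: int) -> bool:
    """Scroll down at most `scroll_max` times by `scroll_move` pixels each.

    Returns True if the bottom of the page was reached, and False if the
    limit was hit first (the point where the user would be asked to press 1).
    """
    at_bottom_js = (
        "return window.innerHeight + window.scrollY"
        " >= document.documentElement.scrollHeight - 1;"
    )
    for _ in range(scroll_max):
        if driver.execute_script(at_bottom_js):
            return True
        # arguments[0] receives the extra value passed to execute_script.
        driver.execute_script("window.scrollBy(0, arguments[0]);", scroll_move)
    return bool(driver.execute_script(at_bottom_js))
```

With a real browser this would be called as scroll_page(driver, SCROLL_MAX, SCROLL_MOVE), where driver is a selenium.webdriver.Chrome instance. With the defaults, a page taller than roughly 5 × 700 pixels plus one screen would return False.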

After the run completes, follow the prompts:

m_service.py

This file uses Flask to provide a web service. It is only a few lines of code and can be run directly with no parameters:

python3 m_service.py

Once started, you can access it in a browser at 127.0.0.1 plus your port, for example http://127.0.0.1:8009 with the PORT value shown in config.py.
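
For readers curious what such a service boils down to, here is a minimal stand-in. Note that it deliberately uses Python's standard-library http.server rather than Flask, so it runs without extra packages; it is not the code in m_service.py, and the function name make_server is hypothetical. The directory, host and port parameters correspond to DOWNLOAD, HOST and PORT in config.py.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory: str, host: str = "127.0.0.1", port: int = 8009) -> ThreadingHTTPServer:
    """Create an HTTP server that serves static files from `directory`.

    Standard-library stand-in for the Flask service (an assumption for
    illustration only).
    """
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer((host, port), handler)
```

Calling make_server(download_dir, "0.0.0.0", 8009).serve_forever() would then serve the download folder to the LAN until interrupted, where download_dir is a placeholder for your DOWNLOAD path.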

m_links.txt

This is the file specified by LINK_FILE in the configuration file. It contains the links you want to download, for example:

https://juejin.cn/post/6859632848761651213
https://juejin.cn/post/6859625125365415949
...

You can name the file whatever you want; just set LINK_FILE in config.py to match.
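
Reading such a file takes only a few lines. This is a sketch rather than the project's code: read_links is a hypothetical name and UTF-8 encoding is assumed.

```python
from pathlib import Path


def read_links(link_file: str) -> list[str]:
    """Read one URL per line from `link_file`, skipping empty lines."""
    text = Path(link_file).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    import tempfile

    # Write a small sample file so the example runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "m_links.txt"
        sample.write_text(
            "https://juejin.cn/post/6859632848761651213\n\n"
            "https://juejin.cn/post/6859625125365415949\n",
            encoding="utf-8",
        )
        print(read_links(str(sample)))
```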

Running without a Python environment

The project also includes packaged Windows .exe executables, which means you can use this project without a Python runtime environment. The packaged files are in the lib folder, where you will find two files, downlink.exe and service.exe. They are run in basically the same way as the .py files and accept the same parameters. Without parameters, you can simply double-click the .exe file to start. If you want to specify parameters, go to the bin directory where the files are located, click the address bar of the directory window, type cmd, and press Enter to open a terminal window:

Then run, for example:

downlink.exe -t
downlink.exe -n 100 -d

Likewise, service.exe can be started from a terminal:

service.exe

When prompted, allow access.

Download

If you have a Python environment, I recommend downloading m_downlink.zip directly, which contains everything except lib. The packaged executables are about 88 MB, so they are slow to download, and leaving them out has no effect on your use. I suggest you download project.zip.

Local File Description

Each web page downloaded locally comes with two .html files.

The folder holds the page's static resources; the .html file marked in red is the original downloaded page, and m_index.html, marked in green, is the optimized page.
