Preface
Baidu, did you think we would be satisfied just because Baidu Netdisk removed its speed limit? Of course not; there is still Baidu Wenku! Perfectly good documents that we are simply not allowed to download… So today, write a Wenku downloader, Weeker, along with me. Reject Wenku, starting with me.
Our downloader is a GUI program. We first write the core file (get.py), then the command-line parsing file (weeker.py) using argparse-style syntax, and finally use Gooey to convert the CLI into a GUI. (If you would rather keep a plain command line, python-fire is covered in the notes at the end.)
Preparation
Installation
- Install Python 3.8;
- `pip install requests python-docx beautifulsoup4 Gooey` (note that the `docx` module is installed as `python-docx`; a quick import check follows).
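To confirm the installs, a minimal sanity check (my addition, not part of the project files); the import names differ from the pip package names:

```python
# python-docx imports as docx, beautifulsoup4 imports as bs4.
import requests
import docx
import bs4
import gooey
print("all imports OK")
```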
Directory
Initialize the project (the following script assumes Unix or Linux):

```bash
cd /path/to/project
mkdir Weeker
touch get.py weeker.py
```
The crawler core
The first step is to open get.py and import the libraries:
```python
from os import getcwd, system
from re import sub
import requests
import docx
from bs4 import BeautifulSoup
```
The functions of each module are as follows:
| Module | Role |
| --- | --- |
| os | Gets the current working directory |
| re | Replaces specific characters in the document |
| requests | Makes the network requests; needs no introduction |
| docx | Converts the TXT output to DOCX format |
| bs4 | Parses the text out of the HTML |
Since we need to determine the path when saving the file, we define a `pwd` constant to store the current path:

```python
pwd = getcwd()  # current working directory
```
Next, declare a `get(url, ua, path, output, convert)` function to implement our crawler. Its parameters are:
| Parameter | Role |
| --- | --- |
| url | The document address, e.g. one found at random: wenku.baidu.com/view/11ebd2… |
| ua | The User-Agent. In my tests, a browser UA does not work: you crawl an ad screen telling you that you must log in. So we use Googlebot or Baiduspider to bypass the UA detection and make Baidu think we are a search engine (which is exactly how search engines get to index these pages). I recommend the latter; after all, Baidu and Wenku are family. A sketch of the UA strings follows this table. |
| path | The storage directory, excluding the file name. |
| output | The file name, with suffix. |
| convert | The target format. Because the author is lazy, this field only accepts docx. |
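To make ua concrete, here is a minimal sketch. The GUI later in this article passes the bare names "Googlebot" / "Baiduspider" straight into the User-Agent header; the full crawler UA strings and the default argument values below are my assumptions, not the article's code:

```python
# Commonly published crawler User-Agent strings (assumptions, not from the article).
GOOGLEBOT = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
BAIDUSPIDER = 'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)'

# Hypothetical signature of the function described above; the defaults are mine.
def get(url, ua='Baiduspider', path='.', output='out.txt', convert='docx'):
    ...
```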
Write the get() function
Get HTML & parse
Move the cursor into the get() function. First we request the page with requests as usual, then hand the response to bs4:
```python
headers = {'User-Agent': ua}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")
everyline = []  # an array to store each line of the document
```
Add the title
We give the document a title, which is the title of the page.
```python
everyline.append(soup.title.string)
```
But this has a problem: the appended title reads "xxxxxx_百度文库" ("xxxxxx_Baidu Wenku"), which is very unsightly. So bring in sub to strip it:
```python
everyline.append(sub('_百度文库', '', soup.title.string, 1))
```
Get the body
Inspecting the page shows that the body text lives in divs whose class is `bd doc-reader`, and the raw text inside contains \n, \x0c, and spaces (\n is the newline character). We split on the newlines into our array and then strip the other two characters separately:
```python
for doc in soup.find_all('div', attrs={"class": "bd doc-reader"}):
    everyline.extend(doc.get_text().split('\n'))
# delete the spaces and \x0c
everyline = [i.replace(' ', '') for i in everyline]
everyline = [i.replace('\x0c', '') for i in everyline]
```
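If you want to see what find_all and get_text are doing here, a tiny self-contained demo (the HTML below is made up, not the real Wenku markup):

```python
from bs4 import BeautifulSoup

# A made-up miniature of the Wenku page, just to show the extraction.
html = '<div class="bd doc-reader">line one\nline two</div>'
soup = BeautifulSoup(html, "html.parser")
lines = []
for doc in soup.find_all('div', attrs={"class": "bd doc-reader"}):
    lines.extend(doc.get_text().split('\n'))
print(lines)  # ['line one', 'line two']
```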
Save the file
The next step is to save the file. We write the text out as TXT first, then check the convert parameter: if docx was passed, we take the TXT and convert it to DOCX.
```python
final_path = path
# If it is a relative path, prepend pwd to make it absolute.
if not path.startswith('/'):
    final_path = pwd + '/' + final_path
try:
    file = open(final_path + '/' + output, 'w', encoding='utf-8')
    for line in everyline:
        file.write(line)
        file.write('\n')
    file.close()
except FileNotFoundError as err:
    print("wenku: error:", err)  # report the bad directory
    return
if convert == 'docx':
    doku = docx.Document()  # create an empty DOCX document
    with open(final_path + '/' + output) as f:
        doku.add_paragraph(f.read())  # add the whole text as one paragraph
    doku.save(final_path + '/' + output + '.' + convert)  # save as .docx
    system('rm ' + final_path + '/' + output)  # delete the TXT saved in the try block
```
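That completes get.py. A quick way to exercise it before the GUI exists (the URL and file names below are placeholders, not from the article):

```python
# Hypothetical test drive of get(); the URL is a placeholder.
from get import get

get('https://wenku.baidu.com/view/xxxxxxxx', ua='Baiduspider',
    path='.', output='mydoc.txt', convert='docx')
# Expected result: ./mydoc.txt is written, a ./mydoc.txt.docx is produced,
# and the intermediate TXT is deleted.
```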
Create the GUI
Open weeker.py. First come two import lines; Gooey converts a CLI into a GUI using argparse-like syntax.
```python
from gooey import Gooey, GooeyParser
import get
```
Then add the if __name__ == '__main__' block:

```python
if __name__ == '__main__':
    main()
```

Now let's define this main():
```python
@Gooey(encoding='utf-8', program_name="Weeker", language='chinese')
def main():
    parser = GooeyParser(description="Weeker, Cheers!")
    parser.add_argument("url", metavar='Document address', widget="TextField")
    parser.add_argument("ua", metavar='User-Agent', widget="Dropdown",
                        choices=["Googlebot", "Baiduspider"])
    parser.add_argument("path", metavar='Save path', widget="DirChooser")
    parser.add_argument("output", metavar='File name', widget="TextField")
    parser.add_argument("convert", metavar='Convert to', widget="Dropdown",
                        choices=["docx"])
    args = parser.parse_args()
    get.get(args.url, ua=args.ua, path=args.path, output=args.output,
            convert=args.convert)
```
@Gooey is a decorator that turns main() into a Gooey app. Inside main we call parser.add_argument just as we would with argparse, and finally args = parser.parse_args(); the input for each argument is read from the args members and passed on to get.py. Run it, and something amazing happens:
We have successfully converted the CLI into a GUI!
Note I: If you prefer the command line, search for python-fire on GitHub; it exposes functions and arguments directly as a CLI, to great effect (see the sketch below).
Note II: Due to problems with my computer, the finished product could not be packaged, so please build it yourself if needed.
Note III: The two .py files are attached.
Note IV: I just noticed a wrong import in the source code. If you downloaded the source, please check it against the code in this article first.
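For reference, a minimal python-fire wrapper around get.py might look like this (a sketch assuming fire is pip-installed; this is not the article's code):

```python
# fire_cli.py - hypothetical python-fire wrapper around get.py.
import fire
import get

if __name__ == '__main__':
    # Exposes get(url, ua, path, output, convert) as, e.g.:
    #   python fire_cli.py URL --ua Baiduspider --path . --output doc.txt --convert docx
    fire.Fire(get.get)
```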