Preface

Recently I have been working on monitoring-related infrastructure and found that many of the scripts are written in Python. I heard the saying "Life is short, I use Python" a long time ago, and it is not a joke: with the rise of artificial intelligence, machine learning, and deep learning, most of the AI code out there is written in Python. So, in the age of artificial intelligence, it is time to learn some Python.

Getting started

If you have no experience with any programming language, it is recommended to start from the basics, whether with a book, a video course, or a text tutorial.

If you already have experience in other languages, it is recommended to start with a case study, such as crawling the pictures from a website.

That way you pick up the syntax as you go; as long as you have a general feel for programming languages, you can read the code without much trouble.

Experienced developers are not advised to start from scratch with a video or a book, because learning a language that way is simply too time-consuming.

Of course, going deeper and learning the language systematically can come later.

Software tools

Python3

The version used here is Python 3.7.1, the latest release at the time of writing.

Installation tutorial recommended:

http://www.runoob.com/python3/python3-install.html

Windows download address:

https://www.python.org/downloads/windows

Linux Download address:

https://www.python.org/downloads/source

PyCharm

Visual development tool:

http://www.jetbrains.com/pycharm

Case study

Implementation steps

Take the girl-picture site as an example. It is actually very simple and breaks down into the following four steps (a rough outline in code follows the list):

  • Get the page count of the home page and create a folder for each page number

  • Get the column (album) addresses on each page

  • Enter each column and get its page count (each column contains multiple pictures, displayed across pages)

  • Get the images on each of those pages and download them
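Here is the rough outline in code of those four steps. It is only a sketch with hypothetical helper names (get_page_count, get_album_links and the rest are placeholders, not the real script); the full implementation is in the "Code implementation" section below.

# Hypothetical outline of the four steps; all helper names are placeholders
def get_page_count(url):                        # step 1: how many pages the home page has
    return 0                                    # stub
def get_album_links(url, page_no):              # step 2: column (album) addresses on one page
    return []                                   # stub
def get_album_page_count(album_url):            # step 3: how many pictures one column has
    return 0                                    # stub
def download_image(album_url, pic_no, folder):  # step 4: fetch and save one picture
    pass                                        # stub

def crawl(base_url, save_path):
    for page_no in range(1, get_page_count(base_url) + 1):
        folder = save_path + '/' + str(page_no)              # one folder per page number
        for album_url in get_album_links(base_url, page_no):
            for pic_no in range(1, get_album_page_count(album_url) + 1):
                download_image(album_url, pic_no, folder)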

Things to note

There are a few things to note during the crawl that may help you:

1) Importing libraries. A library is similar to a framework or utility class in Java: the low-level details are already encapsulated for you.

Installing third-party libraries


     
# Windows, Python 3 installed directly
pip install bs4
pip install requests

# Linux, Python 2 and Python 3 coexist
pip3 install bs4
pip3 install requests


Importing third-party libraries


     
# Import the requests library
import requests
# Import the file-operation library
import os
# bs4's BeautifulSoup is one of the most popular libraries for parsing HTML in Python crawlers
import bs4
from bs4 import BeautifulSoup
# Standard library
import sys
# Work around the Chinese encoding problem under Python 3.x
import importlib
importlib.reload(sys)


2) Define functions for each task. A crawler can easily run to hundreds of lines, so try not to write it as one big lump.


     
def download(page_no, file_path):
    # write the crawling logic here
    pass


3) Define global variables


     
# Give every request a header that mimics Chrome
# (defined at module level, so it is a global variable)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}

# Required inside a function before use:
# tells the interpreter that the headers used in this function is the global headers defined above,
# not a local variable of the function
global headers

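As a standalone note (not part of the crawler itself), the sketch below shows how the global keyword behaves: reading a module-level variable inside a function needs nothing special, but rebinding it, as the download() function does later when it replaces headers, requires the global declaration.

# Standalone sketch of the `global` keyword; the URL is just an example value
headers = {'User-Agent': 'Mozilla/5.0'}          # module-level (global) variable

def add_referer(href):
    global headers                               # needed because headers is reassigned below
    headers = {'User-Agent': headers['User-Agent'], 'Referer': href}

add_referer('http://www.mzitu.com/')
print(headers)                                   # now contains both User-Agent and Referer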

4) Anti-hotlinking

Some sites add anti-hotlinking protection; Python handles it easily by sending a suitable Referer header:


     
headers = {'Referer': href}
img = requests.get(url, headers=headers)

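Put together, a self-contained version of this trick might look like the sketch below; both URLs are made-up placeholders, since in the real script they come from the parsed column pages.

import requests

href = 'http://www.mzitu.com/12345'             # hypothetical column page the image belongs to
url = 'http://example.com/2018/01/01a01.jpg'    # hypothetical direct image address

headers = {'Referer': href}                     # pretend the request comes from the column page
img = requests.get(url, headers=headers)
with open(url.split('/')[-1], 'wb') as f:       # save under the file-name part of the URL
    f.write(img.content)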

5) Switching Python versions

The Linux server is an Alibaba Cloud server; the default version is Python 2, and Python 3 was installed manually.


     
[root@AY140216131049Z mzitu]# python2 -V
Python 2.7.5
[root@AY140216131049Z mzitu]# python3 -V
Python 3.7.1
# Default version
[root@AY140216131049Z mzitu]# python -V
Python 2.7.5
# Temporarily switch versions (find the install path with <whereis python>)
[root@AY140216131049Z mzitu]# alias python='/usr/local/bin/python3.7'
[root@AY140216131049Z mzitu]# python -V
Python 3.7.1


6) Exception capture

Some pages may be abnormal during the crawl, so catch the exceptions so that they do not affect the subsequent pages.


     
try:
    # business logic goes here
    pass
except Exception as e:
    print(e)

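For example, a minimal sketch of this pattern (the URLs are made-up placeholders) shows that one failing page does not stop the rest of the loop:

import requests

urls = ['http://www.mzitu.com/page/1', 'http://no-such-host.invalid/', 'http://www.mzitu.com/page/2']
for url in urls:
    try:
        res = requests.get(url, timeout=10)     # the per-page business logic goes here
        print(url, res.status_code)
    except Exception as e:
        print(e)                                # log the error and continue with the next URL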

Code implementation

Edit the script: vi mzitu.py


     
#!/usr/bin/python
# coding=utf-8
# Import the requests library
import requests
# Import the file-operation library
import os
import bs4
from bs4 import BeautifulSoup
import sys
import importlib
importlib.reload(sys)

# Give every request a header that mimics Chrome
global headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
# Site address
mziTu = 'http://www.mzitu.com/'
# Define the storage location
global save_path
save_path = '/mnt/data/mzitu'

# Create a folder
def createFile(file_path):
    if os.path.exists(file_path) is False:
        os.makedirs(file_path)
    # Switch the working directory to the folder created above
    os.chdir(file_path)

# Download files
def download(page_no, file_path):
    global headers
    res_sub = requests.get(page_no, headers=headers)
    # Parse the HTML
    soup_sub = BeautifulSoup(res_sub.text, 'html.parser')
    # Get the column addresses on this page
    all_a = soup_sub.find('div', class_='postlist').find_all('a', target='_blank')
    count = 0
    for a in all_a:
        count = count + 1
        if (count % 2) == 0:
            # Print the column index
            print("Column no. " + str(count))
            # Extract the href
            href = a.attrs['href']
            print(href)
            res_sub_1 = requests.get(href, headers=headers)
            soup_sub_1 = BeautifulSoup(res_sub_1.text, 'html.parser')
            # ------ It is best to use exception handling here ------
            try:
                # Get the maximum number of pictures in the column
                pic_max = soup_sub_1.find('div', class_='pagenavi').find_all('span')[6].text
                print(pic_max)
                for j in range(1, int(pic_max) + 1):
                    # print(str(j))
                    # j is an int and needs to be converted to a string
                    href_sub = href + "/" + str(j)
                    print(href_sub)
                    res_sub_2 = requests.get(href_sub, headers=headers)
                    soup_sub_2 = BeautifulSoup(res_sub_2.text, "html.parser")
                    img = soup_sub_2.find('div', class_='main-image').find('img')
                    if isinstance(img, bs4.element.Tag):
                        # Extract the src
                        url = img.attrs['src']
                        array = url.split('/')
                        file_name = array[len(array)-1]
                        # print(file_name)
                        # Add the Referer header
                        headers = {'Referer': href}
                        img = requests.get(url, headers=headers)
                        # print('Start saving the image')
                        f = open(file_name, 'ab')
                        f.write(img.content)
                        # print(file_name, 'image saved!')
                        f.close()
            except Exception as e:
                print(e)

# Main method
def main():
    res = requests.get(mziTu, headers=headers)
    # Use html.parser
    soup = BeautifulSoup(res.text, 'html.parser')
    # Create the root folder
    createFile(save_path)
    # Get the total number of pages from the first page
    img_max = soup.find('div', class_='nav-links').find_all('a')[3].text
    # print("Total pages: " + img_max)
    for i in range(1, int(img_max) + 1):
        # Get the URL for each page
        if i == 1:
            page = mziTu
        else:
            page = mziTu + 'page/' + str(i)
        file = save_path + '/' + str(i)
        createFile(file)
        # Download the images for each page
        print(page)
        download(page, file)

if __name__ == '__main__':
    main()


Run the script on the Linux server with the following commands:


     
python3 mzitu.py
# or run it in the background
nohup python3 -u mzitu.py > mzitu.log 2>&1 &


So far only one column of picture sets has been crawled: 17 GB in total, 5,332 pictures.


     
[root@itstyle mzitu]# du -sh
17G     .
[root@itstyle mzitu]# ll -s
total 5332


Next, open your eyes wide: the exciting picture sets are coming.

Summary

As a beginner's script, it is bound to have some problems or room for optimization. If any Python experts come across this, feel free to leave a comment.

In fact, the script is very simple. From configuring the environment and installing the integrated development environment to writing the script and running the whole thing smoothly, it took about four or five hours, and in the end the script ran single-threaded. Limited by the server's bandwidth and configuration, downloading the 17 GB of images took another three or four hours. As for the remaining 83 GB, you can download it yourself.