Preface

Recently I have been working on monitoring-related infrastructure and found that many of the scripts are written in Python. I heard the saying "Life is short, I use Python" a long time ago, and it is not a joke: with the rise of artificial intelligence, machine learning, and deep learning, most of the AI code out there is written in Python. So, in the age of artificial intelligence, it is time to learn some Python.

Getting started

If you have no experience with any programming language, it is recommended to start from the basics, whether with a book, a video course, or a text tutorial.

If you already have experience in other languages, it is recommended to start with a case study, such as crawling the pictures from a website.

That way you pick up the syntax as you go; as long as you have a general feel for programming languages, you can read the code without much trouble.

Experienced developers are not advised to start from scratch with a video or a book, because learning a language that way is simply too time-consuming.

Of course, going deeper and learning the language systematically can come later.

Software tools

Python3

The version used here is Python 3.7.1, the latest release at the time of writing.

Installation tutorial recommended:

http://www.runoob.com/python3/python3-install.html

Windows download address:

https://www.python.org/downloads/windows

Linux Download address:

https://www.python.org/downloads/source

PyCharm

Visual development tool:

http://www.jetbrains.com/pycharm

Case study

Implementation steps

Take the girl-picture site as an example. It is actually very simple and breaks down into the following four steps (a rough outline in code follows the list):

  • Get the page count of the home page and create a folder for each page number

  • Get the column (album) addresses on each page

  • Enter each column and get its page count (each column contains multiple pictures, displayed across pages)

  • Get the images on each of those pages and download them
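Here is the rough outline in code of those four steps. It is only a sketch with hypothetical helper names (get_page_count, get_album_links and the rest are placeholders, not the real script); the full implementation is in the "Code implementation" section below.

# Hypothetical outline of the four steps; all helper names are placeholders
def get_page_count(url):                        # step 1: how many pages the home page has
    return 0                                    # stub
def get_album_links(url, page_no):              # step 2: column (album) addresses on one page
    return []                                   # stub
def get_album_page_count(album_url):            # step 3: how many pictures one column has
    return 0                                    # stub
def download_image(album_url, pic_no, folder):  # step 4: fetch and save one picture
    pass                                        # stub

def crawl(base_url, save_path):
    for page_no in range(1, get_page_count(base_url) + 1):
        folder = save_path + '/' + str(page_no)              # one folder per page number
        for album_url in get_album_links(base_url, page_no):
            for pic_no in range(1, get_album_page_count(album_url) + 1):
                download_image(album_url, pic_no, folder)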

Things to note

There are a few things to note during the crawl that may help you:

1) Importing libraries. A library is similar to a framework or utility class in Java: the low-level details are already encapsulated for you.

Installing third-party libraries


     
# Windows, Python 3 installed directly
pip install bs4
pip install requests

# Linux, Python 2 and Python 3 coexist
pip3 install bs4
pip3 install requests


Importing third-party libraries


     
# Import the requests library
import requests
# Import the file-operation library
import os
# bs4's BeautifulSoup is one of the most popular libraries for parsing HTML in Python crawlers
import bs4
from bs4 import BeautifulSoup
# Standard library
import sys
# Work around the Chinese encoding problem under Python 3.x
import importlib
importlib.reload(sys)


2) Define functions for each task. A crawler can easily run to hundreds of lines, so try not to write it as one big lump.


     
def download(page_no, file_path):
    # write the crawling logic here
    pass


3) Define global variables


     
# Give every request a header that mimics Chrome
# (defined at module level, so it is a global variable)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}

# Required inside a function before use:
# tells the interpreter that the headers used in this function is the global headers defined above,
# not a local variable of the function
global headers

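As a standalone note (not part of the crawler itself), the sketch below shows how the global keyword behaves: reading a module-level variable inside a function needs nothing special, but rebinding it, as the download() function does later when it replaces headers, requires the global declaration.

# Standalone sketch of the `global` keyword; the URL is just an example value
headers = {'User-Agent': 'Mozilla/5.0'}          # module-level (global) variable

def add_referer(href):
    global headers                               # needed because headers is reassigned below
    headers = {'User-Agent': headers['User-Agent'], 'Referer': href}

add_referer('http://www.mzitu.com/')
print(headers)                                   # now contains both User-Agent and Referer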

4) Anti-hotlinking

Some sites add anti-hotlinking protection; Python handles it easily by sending a suitable Referer header:


     
headers = {'Referer': href}
img = requests.get(url, headers=headers)

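Put together, a self-contained version of this trick might look like the sketch below; both URLs are made-up placeholders, since in the real script they come from the parsed column pages.

import requests

href = 'http://www.mzitu.com/12345'             # hypothetical column page the image belongs to
url = 'http://example.com/2018/01/01a01.jpg'    # hypothetical direct image address

headers = {'Referer': href}                     # pretend the request comes from the column page
img = requests.get(url, headers=headers)
with open(url.split('/')[-1], 'wb') as f:       # save under the file-name part of the URL
    f.write(img.content)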

5) Switching Python versions

The Linux server is an Alibaba Cloud server; the default version is Python 2, and Python 3 was installed manually.


     
[root@AY140216131049Z mzitu]# python2 -V
Python 2.7.5
[root@AY140216131049Z mzitu]# python3 -V
Python 3.7.1
# Default version
[root@AY140216131049Z mzitu]# python -V
Python 2.7.5
# Temporarily switch versions (find the install path with <whereis python>)
[root@AY140216131049Z mzitu]# alias python='/usr/local/bin/python3.7'
[root@AY140216131049Z mzitu]# python -V
Python 3.7.1


6) Exception capture

Some pages may be abnormal during the crawl, so catch the exceptions so that they do not affect the subsequent pages.


     
try:
    # business logic goes here
    pass
except Exception as e:
    print(e)

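For example, a minimal sketch of this pattern (the URLs are made-up placeholders) shows that one failing page does not stop the rest of the loop:

import requests

urls = ['http://www.mzitu.com/page/1', 'http://no-such-host.invalid/', 'http://www.mzitu.com/page/2']
for url in urls:
    try:
        res = requests.get(url, timeout=10)     # the per-page business logic goes here
        print(url, res.status_code)
    except Exception as e:
        print(e)                                # log the error and continue with the next URL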

Code implementation

Edit the script: vi mzitu.py


     
#!/usr/bin/python
# coding=utf-8
# Import the requests library
import requests
# Import the file-operation library
import os
import bs4
from bs4 import BeautifulSoup
import sys
import importlib
importlib.reload(sys)

# Give every request a header that mimics Chrome
global headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
# Site address
mziTu = 'http://www.mzitu.com/'
# Define the storage location
global save_path
save_path = '/mnt/data/mzitu'

# Create a folder
def createFile(file_path):
    if os.path.exists(file_path) is False:
        os.makedirs(file_path)
    # Switch the working directory to the folder created above
    os.chdir(file_path)

# Download files
def download(page_no, file_path):
    global headers
    res_sub = requests.get(page_no, headers=headers)
    # Parse the HTML
    soup_sub = BeautifulSoup(res_sub.text, 'html.parser')
    # Get the column addresses on this page
    all_a = soup_sub.find('div', class_='postlist').find_all('a', target='_blank')
    count = 0
    for a in all_a:
        count = count + 1
        if (count % 2) == 0:
            # Print the column index
            print("Column no. " + str(count))
            # Extract the href
            href = a.attrs['href']
            print(href)
            res_sub_1 = requests.get(href, headers=headers)
            soup_sub_1 = BeautifulSoup(res_sub_1.text, 'html.parser')
            # ------ It is best to use exception handling here ------
            try:
                # Get the maximum number of pictures in the column
                pic_max = soup_sub_1.find('div', class_='pagenavi').find_all('span')[6].text
                print(pic_max)
                for j in range(1, int(pic_max) + 1):
                    # print(str(j))
                    # j is an int and needs to be converted to a string
                    href_sub = href + "/" + str(j)
                    print(href_sub)
                    res_sub_2 = requests.get(href_sub, headers=headers)
                    soup_sub_2 = BeautifulSoup(res_sub_2.text, "html.parser")
                    img = soup_sub_2.find('div', class_='main-image').find('img')
                    if isinstance(img, bs4.element.Tag):
                        # Extract the src
                        url = img.attrs['src']
                        array = url.split('/')
                        file_name = array[len(array)-1]
                        # print(file_name)
                        # Add the Referer header
                        headers = {'Referer': href}
                        img = requests.get(url, headers=headers)
                        # print('Start saving the image')
                        f = open(file_name, 'ab')
                        f.write(img.content)
                        # print(file_name, 'image saved!')
                        f.close()
            except Exception as e:
                print(e)

# Main method
def main():
    res = requests.get(mziTu, headers=headers)
    # Use html.parser
    soup = BeautifulSoup(res.text, 'html.parser')
    # Create the root folder
    createFile(save_path)
    # Get the total number of pages from the first page
    img_max = soup.find('div', class_='nav-links').find_all('a')[3].text
    # print("Total pages: " + img_max)
    for i in range(1, int(img_max) + 1):
        # Get the URL for each page
        if i == 1:
            page = mziTu
        else:
            page = mziTu + 'page/' + str(i)
        file = save_path + '/' + str(i)
        createFile(file)
        # Download the images for each page
        print(page)
        download(page, file)

if __name__ == '__main__':
    main()


Run the script on the Linux server with the following commands:


     
python3 mzitu.py
# or run it in the background
nohup python3 -u mzitu.py > mzitu.log 2>&1 &


So far only one column of picture sets has been crawled: 17 GB in total, 5,332 pictures.


     
[root@itstyle mzitu]# du -sh
17G     .
[root@itstyle mzitu]# ll -s
total 5332


Next, open your eyes wide: the exciting picture sets are coming.

Summary

As a beginner's script, it is bound to have some problems or room for optimization. If any Python experts come across this, feel free to leave a comment.

In fact, the script is very simple. From configuring the environment and installing the integrated development environment to writing the script and running the whole thing smoothly, it took about four or five hours, and in the end the script ran single-threaded. Limited by the server's bandwidth and configuration, downloading the 17 GB of images took another three or four hours. As for the remaining 83 GB, you can download it yourself.