Developing crawlers with Python 3. Anaconda is a Python distribution that makes managing Python environments easy; here is a quick review of the basic conda commands.

1- Common Anaconda commands

  • Check the conda version:
conda --version
  • Upgrade the conda version:
conda update conda
  • Create and activate an environment

Use the “conda create” command followed by the name you want to give the environment:

conda create --name ybyCrawler python=3.8

This command creates a virtual environment named your_env_name with Python version X.X. The ybyCrawler environment directory can then be found under the envs folder in the Anaconda installation directory.

  • To see which packages are installed:
conda list
  • See which environments have been created (either command works):
conda env list
conda info -e
  • Activate a virtual environment:
conda activate your_env_name
  • Install packages in a virtual environment

Install a package into your_env_name:

conda install -n your_env_name [package]
  • Deactivate the virtual environment:
conda deactivate
  • Delete a virtual environment:
conda remove -n your_env_name --all
  • Delete a package from the virtual environment:
conda remove --name your_env_name package_name
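
Putting these together, a typical workflow for setting up the crawler environment might look like this (the packages installed at the end are just an example, not prescribed here):

conda create --name ybyCrawler python=3.8
conda activate ybyCrawler
conda install -n ybyCrawler requests lxml pymysql
conda deactivate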

2- Related libraries required by crawlers

A crawler's work can be roughly divided into a few steps: fetching pages, parsing pages, and storing data.

  • Request libraries:

To fetch a page, we need to simulate a browser sending a request to the server, so we need Python libraries that can perform HTTP requests (a minimal example follows the list below):

- requests
- Selenium
- aiohttp.....
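
As a rough sketch, fetching a page with requests might look like this (the URL and User-Agent are placeholders):

import requests

# Pretend to be a browser and fetch a page
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://example.com', headers=headers)
print(response.status_code)
print(response.text[:200])  # first 200 characters of the HTML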
  • Parsing libraries:

After fetching the page source, the next step is to extract information from it. There are many ways to do this; regular expressions (re) work, but they are relatively tedious to write, which is where parsing libraries come in:

- lxml
- Beautiful Soup
- pyquery

These libraries provide powerful parsing methods such as XPath and CSS selectors.
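
For example, pulling links out of an HTML fragment with lxml and XPath might look like this (the HTML snippet is made up for illustration):

from lxml import etree

# html would normally be the page source fetched by the request library
html = '<div><a href="/page1">Page 1</a><a href="/page2">Page 2</a></div>'
tree = etree.HTML(html)

# Extract link texts and hrefs with XPath expressions
titles = tree.xpath('//a/text()')
links = tree.xpath('//a/@href')
print(titles, links)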

  • Databases:

For the data-storage step:

- MySQL
- Redis.....
  • Storage libraries:

PyMySQL is used to interact with MySQL from Python.
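
A minimal sketch of writing a scraped record into MySQL with PyMySQL (the connection parameters, database, and table are placeholders):

import pymysql

# Connect to a local MySQL server
conn = pymysql.connect(host='localhost', user='root', password='password', database='spider')
try:
    with conn.cursor() as cursor:
        cursor.execute('CREATE TABLE IF NOT EXISTS pages (url VARCHAR(255), title VARCHAR(255))')
        cursor.execute('INSERT INTO pages (url, title) VALUES (%s, %s)',
                       ('https://example.com', 'Example page'))
    conn.commit()
finally:
    conn.close()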

  • App-crawling libraries:

Besides web pages, crawlers can also grab app data. To load a page, an app first has to fetch data, usually by requesting a server-side interface. Since an app has no browser-style tool for directly inspecting those background requests, packet-capture tools are mainly used to capture the data (a small mitmdump sketch follows the list below):

- Charles
- mitmproxy
- mitmdump.....
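
As a rough sketch, a mitmdump addon script for logging JSON API responses passing through the proxy might look like this (the file name and content-type filter are just examples; run it with mitmdump -s capture.py):

# capture.py
from mitmproxy import http

def response(flow: http.HTTPFlow) -> None:
    # Print the URL and the start of the body for JSON responses
    if 'application/json' in flow.response.headers.get('content-type', ''):
        print(flow.request.url)
        print(flow.response.get_text()[:200])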

For automating interaction with app pages:

- Appium
  • Crawler frameworks:

A lot of crawler code is reusable, so a number of frameworks have naturally emerged to capture the common parts (a minimal Scrapy example follows the list below):

- pyspider
- Scrapy
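
As an illustration, a minimal Scrapy spider looks roughly like this (the site and selectors come from the quotes.toscrape.com demo site commonly used in Scrapy tutorials, not from this article):

import scrapy

class QuotesSpider(scrapy.Spider):
    # A tiny spider that yields one item per quote on the page
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

It can be run without creating a full project via scrapy runspider quotes_spider.py -o quotes.json (the file name here is hypothetical).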
  • Deployment libraries:

For deploying the crawler to a host:

- Docker
- Scrapyd