Developing crawlers with Python 3. Anaconda is a Python distribution that makes managing Python environments easy; here is a quick review of the basic conda commands.

1- Common Anaconda commands

  • Check the conda version:
conda --version
  • Upgrade the conda version:
conda update conda
  • Create and activate an environment

Use the “conda create” command followed by the name you want to give the environment:

conda create --name ybyCrawler python=3.8

This command creates a virtual environment named your_env_name with Python version X.X. The ybyCrawler environment directory can then be found under the envs folder in the Anaconda installation directory.

  • To see which packages are installed:
conda list
  • See which environments have been created (either command works):
conda env list
conda info -e
  • Activate a virtual environment:
conda activate your_env_name
  • Install packages in a virtual environment

Install a package into your_env_name:

conda install -n your_env_name [package]
  • Deactivate the virtual environment:
conda deactivate
  • Delete a virtual environment:
conda remove -n your_env_name --all
  • Delete a package from the virtual environment:
conda remove --name your_env_name package_name
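
Putting these together, a typical workflow for setting up the crawler environment might look like this (the packages installed at the end are just an example, not prescribed here):

conda create --name ybyCrawler python=3.8
conda activate ybyCrawler
conda install -n ybyCrawler requests lxml pymysql
conda deactivate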

2- Related libraries required by crawlers

A crawler's work can be roughly divided into a few steps: fetching pages, parsing pages, and storing data.

  • Request libraries:

To fetch a page, we need to simulate a browser sending a request to the server, so we need Python libraries that can perform HTTP requests (a minimal example follows the list below):

- requests
- Selenium
- aiohttp.....
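
As a rough sketch, fetching a page with requests might look like this (the URL and User-Agent are placeholders):

import requests

# Pretend to be a browser and fetch a page
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://example.com', headers=headers)
print(response.status_code)
print(response.text[:200])  # first 200 characters of the HTML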
  • Parsing libraries:

After fetching the page source, the next step is to extract information from it. There are many ways to do this; regular expressions (re) work, but they are relatively tedious to write, which is where parsing libraries come in:

- lxml
- Beautiful Soup
- pyquery

These libraries provide powerful parsing methods such as XPath and CSS selectors.
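
For example, pulling links out of an HTML fragment with lxml and XPath might look like this (the HTML snippet is made up for illustration):

from lxml import etree

# html would normally be the page source fetched by the request library
html = '<div><a href="/page1">Page 1</a><a href="/page2">Page 2</a></div>'
tree = etree.HTML(html)

# Extract link texts and hrefs with XPath expressions
titles = tree.xpath('//a/text()')
links = tree.xpath('//a/@href')
print(titles, links)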

  • Databases:

For the data-storage step:

- MySQL
- Redis.....
  • Storage libraries:

PyMySQL is used to interact with MySQL from Python.
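
A minimal sketch of writing a scraped record into MySQL with PyMySQL (the connection parameters, database, and table are placeholders):

import pymysql

# Connect to a local MySQL server
conn = pymysql.connect(host='localhost', user='root', password='password', database='spider')
try:
    with conn.cursor() as cursor:
        cursor.execute('CREATE TABLE IF NOT EXISTS pages (url VARCHAR(255), title VARCHAR(255))')
        cursor.execute('INSERT INTO pages (url, title) VALUES (%s, %s)',
                       ('https://example.com', 'Example page'))
    conn.commit()
finally:
    conn.close()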

  • App-crawling libraries:

Besides web pages, crawlers can also grab app data. To load a page, an app first has to fetch data, usually by requesting a server-side interface. Since an app has no browser-style tool for directly inspecting those background requests, packet-capture tools are mainly used to capture the data (a small mitmdump sketch follows the list below):

- Charles
- mitmproxy
- mitmdump.....
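
As a rough sketch, a mitmdump addon script for logging JSON API responses passing through the proxy might look like this (the file name and content-type filter are just examples; run it with mitmdump -s capture.py):

# capture.py
from mitmproxy import http

def response(flow: http.HTTPFlow) -> None:
    # Print the URL and the start of the body for JSON responses
    if 'application/json' in flow.response.headers.get('content-type', ''):
        print(flow.request.url)
        print(flow.response.get_text()[:200])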

For automating interaction with app pages:

- Appium
  • Crawler frameworks:

A lot of crawler code is reusable, so a number of frameworks have naturally emerged to capture the common parts (a minimal Scrapy example follows the list below):

- pyspider
- Scrapy
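
As an illustration, a minimal Scrapy spider looks roughly like this (the site and selectors come from the quotes.toscrape.com demo site commonly used in Scrapy tutorials, not from this article):

import scrapy

class QuotesSpider(scrapy.Spider):
    # A tiny spider that yields one item per quote on the page
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

It can be run without creating a full project via scrapy runspider quotes_spider.py -o quotes.json (the file name here is hypothetical).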
  • Deployment libraries:

For deploying the crawler to a host:

- Docker
- Scrapyd