  • Python web crawler (1) – Getting started
  • Python web crawler (2) – urllib crawler cases
  • Python web crawler (3) – Advanced crawlers
  • Python web crawler (4) – XPath
  • Python web crawler (5) – Requests and Beautiful Soup
  • Python web crawler (6) – The Scrapy framework
  • Python web crawler (7) – CrawlSpider in depth
  • Python web crawler (8) – A simple translation program using the Youdao Dictionary
  • A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web page chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less commonly used names include ant, auto-indexer, emulator and worm.
  • Learning web crawling with Python can be broken down into four major steps: define the target, crawl, extract, and process/store the data.
  • Define the target (know which site or area you are going to scrape)
  • Crawl (download all the content of the target site)
  • Extract (filter out the data that is of no use to us)
  • Process the data (store it and use it the way we want)

In a nutshell, a web crawler works much like a browser: given a URL, it returns the required data directly, without the user having to operate the browser step by step.

Recommended reading: one article that covers everything you need to know about anti-crawler measures.

### 1. General crawler: the crawler system used by search engines

  1. Objective: to download as many web pages as possible from the Internet to form a backup on the local server
  2. Implementation: save the web pages as snapshots on the server
  3. Extract keywords and discard junk data, to provide users with a way to access the content
  4. Search engine ranking – PageRank value – pages are ranked according to site traffic

#### 1.1 Crawl process

  1. Select an existing URL address and add it to the crawl queue
  2. Extract the URL, resolve the host IP via DNS, and add the target host IP to the crawl queue
  3. Analyse the web page content, extract the links it contains, and go back to step 1

#### 1.2 How a search engine acquires the URLs of new websites

  1. The site owner actively pushes the URL address -> submits the URL address to the search engine -> Baidu Webmaster Platform
  2. External links on other websites
  3. The search engine works together with DNS service providers to include information about new websites

#### 1.3 Restriction on general crawlers: the Robots protocol (convention file robots.txt)
  • Robots protocol: the protocol specifies which pages a general-purpose crawler is permitted to access
  • The Robots protocol is only a convention; in general it is the programs of large companies and search engines that comply with it (a minimal check against a site's robots.txt is sketched below)
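As a hedged illustration (not part of the original article), the standard library can check whether a URL may be crawled under a site's robots.txt. The sketch assumes Python 2 to match the urllib2 code used later in this article (in Python 3 the module is urllib.robotparser), and www.example.com is only a placeholder host.

```python
# Minimal sketch: honouring the Robots protocol before crawling.
# Assumes Python 2 (module name `robotparser`); in Python 3 use `urllib.robotparser`.
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")   # placeholder site
rp.read()                                         # download and parse robots.txt

# can_fetch() reports whether the given User-Agent may crawl the given URL
print(rp.can_fetch("*", "http://www.example.com/some/page.html"))
```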
#### 1.4 Defects

  • It can only crawl text-related data, not multimedia files (pictures, music, video) or other binary files (code, scripts, etc.)
  • The results it provides are uniform: everyone receives the same general-purpose result, which cannot be differentiated for a specific kind of user

### 2. Focused crawler: a data-acquisition program developed for specific users to overcome the defects of the general crawler; it is demand-oriented and demand-driven

# 2. HTTP & HTTPS

  • HTTP: HyperText Transfer Protocol

  • HTTPS: Secure Hypertext Transfer Protocol, i.e. HTTP over SSL/TLS

  • HTTP request: web page access on the network generally uses the hypertext transfer protocol to transmit all kinds of data. Each URL access initiated from the browser is called a request, and the data returned is called the response.

  • Packet capture tool: a tool used to capture the data packets transmitted over the network during an access. Packet capture is a term from network programming and refers to capturing and parsing the data transmitted over the network. Professional capture tools such as Sniffer, Wireshark, WinNetCap and WinSock were commonly used in the past; nowadays Fiddler is used for packet capture (see the Fiddler download page).

    • Fiddler panels include: Statistics (performance analysis of a request), Inspectors (inspect request and response content), and Timeline (request response time), among others
    • Fiddler sets up the decryption of HTTPS network data
    • Fiddler captures iPhone/Android data packets
    • Fiddler has built-in commands and breakpoints
  • Set a proxy in the browser for packet capture – a Chrome extension such as Falcon Proxy is recommended for quickly switching between proxies

# 3. urllib2

  • urllib2 can be used as an extension of urllib. Its obvious advantage is that it can accept a Request object as an argument, which makes it possible to control the HTTP request headers and thus simulate browser behaviour and login operations.
  • In Python 3, urllib2 was reorganised and is now available as urllib.request.
  • Usage details of the Python standard library urllib2
  • urllib:

Encoding function: urlencode(); remote data retrieval: urlretrieve()

  • urllib2:

Request(), urlopen()

urllib2.urlopen() -> response
    response.read()      # fetch the web page data
    response.info()      # fetch the response header information
    response.geturl()    # fetch the URL that was actually accessed
    response.getcode()   # fetch the HTTP status code

Comments:

  • urlopen(url, data, timeout)

    • The first parameter, url, is the link to open
    • The second parameter, data, is the data to be sent when accessing the URL
    • The third parameter, timeout, sets the timeout period (a short sketch using data and timeout follows after this list)
  • The response object has a read() method that returns the retrieved web page content: response.read()

  • The first argument to urlopen() can also be a Request object – essentially an instance of the Request class, constructed from the URL, the data, and so on
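As a hedged sketch (not from the original article), the data and timeout parameters can be exercised like this; httpbin.org/post is only a placeholder test endpoint, and the code assumes Python 2 to stay consistent with the urllib2 examples below.

```python
# Minimal sketch: POST form data with a timeout, assuming Python 2 / urllib2.
import urllib
import urllib2

url = "http://httpbin.org/post"              # placeholder test endpoint
data = urllib.urlencode({"kw": "crawler"})   # form-encode the POST body

# Passing data makes this a POST request; timeout is in seconds.
response = urllib2.urlopen(url, data, timeout=5)
print(response.read())
```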


Code example 1

# -*- coding:utf-8 -*-
# import the urllib2 module
import urllib2

# open the URL and read the response body
response = urllib2.urlopen('https://www.baidu.com')
content = response.read()
print(content)


#### 1. Introduction to the headers attribute

User-Agent: some servers or proxies use this value to determine whether a request was sent by a real browser.

Content-Type: when a REST interface is used, the server checks this value to decide how to parse the content in the HTTP body. Typical values are:

  • application/xml: used in XML RPC calls, such as RESTful/SOAP
  • application/json: used in JSON RPC calls
  • application/x-www-form-urlencoded: used when the browser submits a web form

When calling a RESTful or SOAP service provided by the server, an incorrect Content-Type setting will cause the server to refuse service.
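As a hedged illustration (placeholder URL, Python 2 / urllib2 as in the rest of this article), a Content-Type header for a JSON body could be set like this:

```python
# Minimal sketch: send a JSON body with an explicit Content-Type header.
import json
import urllib2

url = "http://www.example.com/api"                       # placeholder endpoint
body = json.dumps({"query": "crawler"})                  # JSON-encode the payload

request = urllib2.Request(url, data=body)
request.add_header("Content-Type", "application/json")   # tell the server how to parse the body

response = urllib2.urlopen(request)
print(response.read())
```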

Note: in Sublime Text you can turn copied header lines into dictionary entries with the regular-expression replacement `^(.*):(.*)$` -> `"\1":"\2",`; in PyCharm the equivalent replacement is `"$1":"$2",`.

  • Randomly add or modify the User-Agent

You can add or modify a specific header by calling Request.add_header(), or inspect an existing header by calling Request.get_header().

# urllib2_add_headers.py

import urllib2
import random

url = "http://www.itcast.cn"

ua_list = [
    "Mozilla/5.0 (Windows NT 6.1;) Apple.... ",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0)... ",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X.... ",
    "Mozilla/5.0 (Macintosh; Intel Mac OS... "
]

user_agent = random.choice(ua_list)

request = urllib2.Request(url)

# A specific header can also be added/modified by calling Request.add_header()
request.add_header("User-Agent", user_agent)

# Uppercase the first letter, lower case all the rest
request.get_header("User-agent")

response = urllib2.urlopen(request)

html = response.read()
print(html)

Code example 2: disguise the request as a browser

# -*- coding:utf-8 -*-
# import urllib2 and the Request class
import urllib2
from urllib2 import Request

# disguise the request as coming from a browser
my_header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.108 Safari/537.36 2345Explorer/8.7.0.16013'}

request = Request('https://www.baidu.com', headers=my_header)
response = urllib2.urlopen(request)
content = response.read()
print(content)


#### 2. Referer (page jump)

Referer: indicates the URL of the page from which the current request originated; the user reached the currently requested page from that Referer page. This property can be used to track which page, and which site, a web request came from.

Sometimes when downloading images from a website you need the corresponding Referer header, otherwise the image cannot be downloaded. That is because the site uses anti-hotlinking: it checks the Referer to decide whether the request comes from its own pages, rejects the request if it does not, and allows the download if it does (a sketch is given below).
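As a hedged sketch (placeholder URLs, Python 2 / urllib2 as elsewhere in this article), a Referer header can be supplied when fetching a hotlink-protected image:

```python
# Minimal sketch: supply a Referer header to fetch an anti-hotlinked image.
import urllib2

img_url = "http://www.example.com/images/pic.jpg"        # placeholder image URL
headers = {"Referer": "http://www.example.com/gallery"}  # pretend we came from the site's own page

request = urllib2.Request(img_url, headers=headers)
response = urllib2.urlopen(request)

# write the binary image data to a local file
with open("pic.jpg", "wb") as f:
    f.write(response.read())
```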

#### 3. Accept-Encoding (file codec format)

**Accept-Encoding:** indicates the encoding methods acceptable to the browser. Encoding is different from the file format: it is used to compress the file and speed up file transfer. The browser decodes the web response after receiving it and only then checks the file format, which in many cases can save a great deal of download time.

Example: Accept-Encoding: gzip;q=1.0, identity;q=0.5, *;q=0

If multiple encodings match, they are ranked by q value. In this example gzip and identity compression are supported, and a browser that supports gzip will be returned a gzip-encoded HTML page (as sketched below). If this field is not set in the request header, the server assumes the client can accept any content encoding.
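As a hedged sketch (placeholder URL, Python 2 / urllib2): if the server answers with Content-Encoding: gzip, the body has to be decompressed before use.

```python
# Minimal sketch: request a gzip-compressed response and decompress it.
import gzip
import StringIO   # Python 2; in Python 3 use io.BytesIO instead
import urllib2

request = urllib2.Request("http://www.example.com")   # placeholder URL
request.add_header("Accept-Encoding", "gzip")
response = urllib2.urlopen(request)

body = response.read()
if response.info().getheader("Content-Encoding") == "gzip":
    buf = StringIO.StringIO(body)
    body = gzip.GzipFile(fileobj=buf).read()           # decompress the gzip body
print(body)
```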

#### 4. Accept-Language (language type)

Accept-Language: indicates the languages that the browser can accept. For example, en or en-us indicates English, and zh or zh-cn indicates Chinese.

#### 5. Accept-Charset (character encoding)

Accept-charset: Specifies the character encoding that is acceptable to the browser.

Example: Accept-Charset: iso-8859-1, gb2312, utf-8

  • ISO-8859-1: commonly called Latin-1. Latin-1 contains additional characters that are indispensable for writing Western European languages. The default value for English-language browsers is ISO-8859-1.
  • GB2312: the standard Simplified Chinese character set.
  • UTF-8: a variable-length character encoding of Unicode that solves the problem of displaying text in multiple languages, thereby supporting internationalisation and localisation of applications. If this field is not set in the request header, any character set is accepted by default.

#### 6. Cookie

Cookie: the browser uses this property to send cookies to the server. A cookie is a small piece of data stored in the browser; it can record user information related to the server and can also be used to implement sessions, which will be discussed in more detail later.

#### 7. Content-Type (POST data type)

Content-type: The Type of Content to be represented in a POST request.

Example: Content-Type: text/xml; charset=gb2312

Indicates that the message body of the request contains plain text XML data with the character encoding gb2312.

#### 8. HTTP response from the server

The HTTP response also consists of four parts: the status line, the message headers, a blank line, and the response body.

In theory, all of the response header information should simply answer the request headers. In practice, however, for efficiency, security and other reasons, the server adds its own response headers, such as the following:

## 1. Cache-Control: must-revalidate, no-cache, private

This value tells the client that the server does not want the client to cache the resource, and that the next time it requests the resource, it must request the server again and cannot obtain the resource from the cached copy.

  • Cache-Control is the most important piece of information in the response header. When the client's request header contains Cache-Control: max-age=0, explicitly indicating that it will not cache the server's resource, the response's Cache-Control is usually no-cache, meaning "then don't cache it".

  • When the client's request header does not include Cache-Control, the server specifies a different cache policy for each kind of resource. For example, OSChina caches image resources with Cache-Control: max-age=86400, which means that for 86400 seconds (one day) from the current time the client may read the resource directly from its cached copy without asking the server again.

## 2. Connection: keep-alive

This field responds to the client’s Connection: keep-alive, telling the client that the server’s TCP Connection is also a long Connection and that the client can continue to use this TCP Connection to send HTTP requests.

## 3. Content-Encoding: gzip

Tells the client that the resource sent by the server is gzip-encoded; when the client sees this information, it should decompress the resource with gzip.

## 4. Content-Type: text/html; charset=UTF-8

Tells the client the type of the resource file and its character encoding. The client decodes the resource using UTF-8 and then parses it as HTML. The garbled pages we often see are usually caused by the server not returning the correct encoding.
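As a hedged sketch (placeholder URL, Python 2 / urllib2), the charset can be read from the Content-Type response header and used to decode the body:

```python
# Minimal sketch: decode the response body using the charset from Content-Type.
import urllib2

response = urllib2.urlopen("http://www.example.com")             # placeholder URL
content_type = response.info().getheader("Content-Type")         # e.g. "text/html; charset=UTF-8"

charset = "utf-8"                                                 # fall back to UTF-8
if content_type and "charset=" in content_type:
    charset = content_type.split("charset=")[-1].strip()

html = response.read().decode(charset)
print(html)
```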

## 5. Date: Sun, 21 Sep 2016 06:18:21 GMT

This is the server's time when it sent the resource. GMT is Greenwich Mean Time; the times sent in the HTTP protocol are all in GMT, mainly to avoid confusion when clients in different time zones request resources from one another over the Internet.

## 6. Expires: Sun, 1 Jan 2000 01:00:00 GMT

This response header is also cache-related and tells the client that before this time it may access the cached copy directly. Obviously this value can be problematic, because the client clock and the server clock are not necessarily the same, and differing clocks can cause issues. So this response header is less reliable than Cache-Control: max-age=*, because max-age is a relative time, which is both easier to understand and more accurate.

## 7. Pragma: no-cache

This has the same meaning as Cache-Control.

## 8. Server: Tengine/1.4.6

This is the server software and its version; it simply informs the client which server is responding.

## 9. Transfer-Encoding: chunked

This response header tells the client that the server is sending the resource in chunks. Chunked resources are generally generated dynamically: when sending begins, the server does not yet know the final size of the resource, so it sends it chunk by chunk. Each chunk is independent and carries its own length, and the last chunk has length 0; when the client reads this zero-length chunk, it knows the resource has been transferred completely.

## 10. Vary: Accept-Encoding

Tells the cache server to cache both the compressed and the uncompressed version of the file. This field is not very useful nowadays, because almost all browsers support compression.

Response status code

The response status code consists of three digits, the first of which defines the category of the response and has five possible values.

Common status codes:

  • 100 to 199: the server has successfully received part of the request and requires the client to submit the rest to complete processing.

  • 200 to 299: indicates that the server successfully receives the request and completes the processing. Usually 200 (OK request successful).

  • 300 to 399: the client needs to refine the request further to complete it, for example because the requested resource has moved to a new address. Common codes are 302 (the requested page has temporarily moved to a new URL), 307, and 304 (use the cached resource).

  • 400 to 499: An error occurs in the request from the client, such as 404 (the server cannot find the requested page) or 403 (the server denies access to the requested page because the permission is insufficient).

  • 500 to 599: an error occurred on the server and the request could not be completed, usually 500 (the server encountered an unexpected condition).
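As a hedged sketch (placeholder URL, Python 2 / urllib2), the status code can be read and HTTP errors caught like this:

```python
# Minimal sketch: read the status code and handle HTTP/URL errors.
import urllib2

try:
    response = urllib2.urlopen("http://www.example.com/maybe-missing")  # placeholder URL
    print(response.getcode())          # e.g. 200 on success
except urllib2.HTTPError as e:
    print(e.code)                      # e.g. 404 or 403
except urllib2.URLError as e:
    print(e.reason)                    # network-level failure (DNS, refused connection, ...)
```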

## Cookie and Session

The interaction between the server and the client is limited to the request/response process; the connection is then closed, and on the next request the server treats the client as a new one.

In order to maintain the link between them and let the server know that a request comes from a previous user, the client's information must be kept somewhere.

Cookie: Identifies a user by recording information on the client.

Session: Identifies the user through the information recorded on the server.
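As a hedged sketch (placeholder URLs, Python 2; in Python 3 the modules are http.cookiejar and urllib.request), cookies can be kept across requests with a CookieJar so the server can recognise the same client:

```python
# Minimal sketch: keep cookies across requests so the server can recognise the client.
import cookielib   # Python 2; in Python 3 use http.cookiejar
import urllib2

cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

# The first request receives the cookies the server sets...
opener.open("http://www.example.com/login")       # placeholder URL
# ...and later requests through the same opener send them back automatically.
response = opener.open("http://www.example.com/profile")
print(response.read())
```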