Understanding Python network programming

While learning Python network programming recently, I read a number of articles on the subject and found that most of them were either hard to follow or too shallow. I figured: why not write one myself along the way? That is how this article came about. I believe technology is not all cold and impersonal, and that looking at it from a human point of view makes the joy of programming easier to appreciate. This article tries, in plain words, to share how to understand network programming in Python.

Many people in the Python world like to use Python for web crawlers, so what actually happens when a page is requested?

Now let's write the simplest client/server program:

  1. To start an HTTP server listening on port 8000, run the following command:

    python3 -m http.server 8000
    Serving HTTP on 0.0.0.0 port 8000 ...
  2. Then write a program to make an HTTP request to the server:

    import requests
    r = requests.get('http://127.0.0.1:8000/')
    print(r)
  3. Then execute the program:

    bash3.2$ python test.py
    <Response [200]>

As you can see, the server returns a 200 success response.

Ok, now let’s summarize the request process:

  1. The client sends an HTTP(GET) request to the server.
  2. The server returns an HTTP(200) response to the client.

This is the most abstract process we can see, so let’s take a closer look at what happens with tcpdump:

Use tcpdump on the command line to capture TCP traffic on the loopback interface:

tcpdump -i lo0 port 8000

Or you can use the -w argument to write the capture to a file and open it in Wireshark later:

tcpdump -i lo0 port 8000 -w test.cap

Now execute the program:

bash3.2$ python test.py
<Response [200]>

Not surprisingly, we can see tcpdump output similar to the following:

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo0, link-type NULL (BSD loopback), capture size 262144 bytes
23:46:06.464962 IP localhost.49329 > localhost.irdmi: Flags [S], seq 1191154495, win 65535, options [mss 16344,nop,wscale 5,nop,nop,TS val 178410641 ecr 0,sackOK,eol], length 0
23:46:06.465018 IP localhost.irdmi > localhost.49329: Flags [S.], seq 1405387906, ack 1191154496, win 65535, options [mss 16344,nop,wscale 5,nop,nop,TS val 178410641 ecr 178410641,sackOK,eol], length 0
23:46:06.465029 IP localhost.49329 > localhost.irdmi: Flags [.], ack 1, win 12759, options [nop,nop,TS val 178410641 ecr 178410641], length 0
23:46:06.465039 IP localhost.irdmi > localhost.49329: Flags [.], ack 1, win 12759, options [nop,nop,TS val 178410641 ecr 178410641], length 0
23:46:06.465065 IP localhost.49329 > localhost.irdmi: Flags [P.], seq 1:146, ack 1, win 12759, options [nop,nop,TS val 178410641 ecr 178410641], length 145
23:46:06.465079 IP localhost.irdmi > localhost.49329: Flags [.], ack 146, win 12754, options [nop,nop,TS val 178410641 ecr 178410641], length 0
23:46:06.467141 IP localhost.irdmi > localhost.49329: Flags [P.], seq 1:156, ack 146, win 12754, options [nop,nop,TS val 178410642 ecr 178410641], length 155
23:46:06.467171 IP localhost.49329 > localhost.irdmi: Flags [.], ack 156, win 12754, options [nop,nop,TS val 178410643 ecr 178410642], length 0
23:46:06.467231 IP localhost.irdmi > localhost.49329: Flags [P.], seq 156:5324, ack 146, win 12754, options [nop,nop,TS val 178410643 ecr 178410643], length 5168
23:46:06.467245 IP localhost.49329 > localhost.irdmi: Flags [.], ack 5324, win 12593, options [nop,nop,TS val 178410643 ecr 178410643], length 0
23:46:06.467313 IP localhost.irdmi > localhost.49329: Flags [F.], seq 5324, ack 146, win 12754, options [nop,nop,TS val 178410643 ecr 178410643], length 0
23:46:06.467331 IP localhost.49329 > localhost.irdmi: Flags [.], ack 5325, win 12593, options [nop,nop,TS val 178410643 ecr 178410643], length 0
23:46:06.468442 IP localhost.49329 > localhost.irdmi: Flags [F.], seq 146, ack 5325, win 12593, options [nop,nop,TS val 178410644 ecr 178410643], length 0
23:46:06.468479 IP localhost.irdmi > localhost.49329: Flags [.], ack 147, win 12754, options [nop,nop,TS val 178410644 ecr 178410644], length 0

It can be seen from the results:

  1. The client sends a SYN packet to request the server to establish a TCP connection.
  2. The server returns a SYN+ACK packet, indicating that the server receives the request from the client and agrees to establish a TCP connection with the client.
  3. The client sends an ACK packet, indicating that it knows that the server agrees to establish a TCP connection. Then, the communication starts.
  4. The client and server constantly exchange information, receive messages, and return replies.
  5. Finally, after data transmission is complete, the server sends a FIN packet to start terminating the connection. The client replies with an ACK, then sends its own FIN packet, and finally the server returns an ACK response.

If you think about it, this process is very similar to making a phone call in the real world: dial the number, establish the connection, confirm that the other side has picked up, exchange information, and hang up. This is why TCP is often said to be connection-oriented.

Use the lsof command to check the file descriptor listening on port 8000:

lsof -n -i:8000          
COMMAND    PID   USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
python3.4 1128 tonnie    4u  IPv4 0x17036ae156ec58cf      0t0  TCP *:irdmi (LISTEN)

The server process shows LISTEN, indicating that it is in the listening state.

Using the previous example to explain TCP state transitions, suppose a request comes in from the client:

  1. After receiving a SYN packet from the client, the server returns a SYN+ACK packet and enters the SYN_RCVD state.
  2. After receiving the ACK reply from the client, the server establishes a connection and enters the ESTABLISHED state.
  3. After data transmission is complete, the server sends a FIN packet to the client and enters the FIN_WAIT_1 state.
  4. After receiving the ACK response from the client, the server enters the FIN_WAIT_2 state.
  5. After receiving the FIN packet from the client, the server returns an ACK response and waits for the connection to close and enters the TIME_WAIT state.
  6. After waiting 2 MSL (two maximum segment lifetimes), the server enters the CLOSED state and the connection is closed.

As for the client, it likewise has a state at each stage. The following figure shows the TCP state transition diagram:





Let's look at the four-layer TCP/IP model:

  1. The application layer, which includes HTTP, DNS, FTP, SSH, and so on.
  2. The transport layer, which includes TCP, UDP, and so on.
  3. The network layer, which includes IP, ARP, and so on.
  4. The network interface (link) layer, which includes Ethernet, PPP, and so on.




In the above program, the communication between the client and the server goes through these four layers. How does this Python program establish and close connections and transfer data? The answer is a set of methods provided through sockets.

A socket is an IPC mechanism that enables applications on the same host or on different hosts to exchange data. Sockets sit between the transport layer and the application layer in the figure above, so you can think of a socket as the set of communication interfaces between those two layers, or as an abstract communication device. Through sockets, an application can easily communicate with other applications.

Now let's reduce the client code to its simplest socket form:

import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(('127.0.0.1', 8000))
sock.send(b'GET / HTTP/1.1\r\nHost: 127.0.0.1:8000\r\n\r\n')
data = sock.recv(4096)
print(data)
sock.close()

Does it feel very similar to the TCP connection process above? This is just a code way to abstract the representation of the process.

Let’s look at the simplest code on the server side:

import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(('127.0.0.1', 8000))
sock.listen(5)
while 1:
    cli_sock, cli_addr = sock.accept()
    req = cli_sock.recv(4096)
    cli_sock.send(b'hello world')
    cli_sock.close()

The flow is also very simple. To summarize both sides:

Server side:

  1. socket.socket creates a socket object, specifying the domain and protocol; a file descriptor is associated with the socket object.
  2. Call sock.setsockopt to set a socket option; in this example, socket.SO_REUSEADDR is set to 1.
  3. Call sock.bind to bind the socket object to an address, which takes a tuple of host address and port as arguments.
  4. A call to sock.listen notifies the system to start listening for connections from clients. The parameter is the maximum number of pending connections in the queue.
  5. A call to sock.accept blocks the call until it returns a tuple containing the socket object used to talk to the client and the address information of the client.
  6. Call cli_sock.recv to receive data from the client, in this case b'GET / HTTP/1.1\r\nHost: 127.0.0.1:8000\r\n\r\n'.
  7. Call the cli_sock.send method to send the data to the client.
  8. Call cli_sock.close to end the connection.

Client:

  1. socket.socket creates a socket object, specifying the domain and protocol; a file descriptor is associated with the socket object.
  2. Call sock.connect to connect to the peer server process through the specified host and port.
  3. Call sock.send to send data to the server.
  4. Call sock.recv to receive data from the server.
  5. Call sock.close to close the connection.

Socket data is obtained from the read/write buffer maintained by the kernel, as shown in the following figure:





A standard system call is made each time data is written or read from the buffer, such as:

int read(fd, buf, bufsize);
int write(fd, buf, bufwrite);

These calls write or read the data a chunk at a time. For large files, however, the cost of executing many read and write system calls is considerable, which is where the sendfile system call comes in:
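
For instance, here is a minimal sketch, assuming Python 3.5+ (where socket objects expose sendfile(), which uses the OS-level sendfile call where available); the file name is just a placeholder:

import socket

def serve_file(conn: socket.socket, path: str) -> None:
    # Hand the whole file to the kernel in one call instead of looping over
    # read()/send(); conn.sendfile() uses os.sendfile() where available.
    with open(path, 'rb') as f:
        conn.sendfile(f)

# Usage sketch:
# srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# srv.bind(('127.0.0.1', 8000))
# srv.listen(1)
# conn, addr = srv.accept()
# serve_file(conn, 'big_file.bin')   # 'big_file.bin' is a placeholder name
# conn.close()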





Socket domains

In the programs above, we created the socket object with the AF_INET parameter, which indicates that the socket communicates over IPv4.

This socket is also called an Internet Domain socket, and it defines an address like this:

struct in_addr {
     in_addr_t s_addr;     // a 32-bit unsigned integer.
};
struct sockaddr_in {
     sa_family_t sin_family;     //AF_INET
     in_port_t sin_port;     // port number
     struct in_addr sin_addr;     // IPv4 address
     unsigned char __pad[X];
};

In contrast, there is a socket type called Unix Domain socket, which is created with the AF_UNIX parameter. It defines an address like this:

struct sockaddr_un {
     sa_family_t sun_family;     //AF_UNIX
     char sun_path[108];     // socket pathname
};

When a bind operation is performed on a Unix domain socket, an entry is created in the file system, with a one-to-one correspondence between socket and pathname. Generally speaking, Unix domain sockets are only used for communication between applications on the same host. Another feature of Unix domain sockets is that directory permissions can be used to control access to the socket. (For example, mysql.sock is a carrier for a Unix domain socket.)
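
A minimal sketch of a Unix domain socket; the path below is arbitrary and only for illustration:

import os
import socket

SOCK_PATH = '/tmp/example.sock'   # illustrative path, not a real service

if os.path.exists(SOCK_PATH):
    os.remove(SOCK_PATH)

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(SOCK_PATH)            # creates an entry in the file system
server.listen(1)

client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(SOCK_PATH)

conn, _ = server.accept()
client.send(b'ping')
print(conn.recv(4096))            # b'ping'

for s in (client, conn, server):
    s.close()
os.remove(SOCK_PATH)              # clean up the file system entry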

Socket protocols

For the protocol, we use SOCK_STREAM to indicate a stream socket (i.e. TCP), or we can specify SOCK_DGRAM to indicate a datagram socket (i.e. UDP); a minimal UDP sketch follows the list below.

Some basic differences between TCP and UDP:

  1. TCP is connection-oriented, UDP is not.
  2. TCP is byte-stream oriented and has no message boundaries, so it can suffer from the "sticky packet" problem of messages running together; UDP is message (datagram) oriented.
  3. TCP does its best to ensure reliable delivery of data, while UDP does not by default.
  4. The TCP header is 20 bytes, and the UDP header is 8 bytes.
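
To make the contrast concrete, here is a minimal UDP sketch (assuming local port 9000 is free); note that there is no connect/accept handshake and each recvfrom() returns exactly one datagram:

import socket

server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(('127.0.0.1', 9000))

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b'hello', ('127.0.0.1', 9000))   # one datagram, no connection

data, addr = server.recvfrom(4096)             # message boundary preserved
print(data, addr)

client.close()
server.close()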

Socket channels

Generally speaking, a socket's channel is bidirectional, that is, a socket can both read and write. Sometimes you need a half-open socket; for that, call shutdown on the socket with one of the following flags:

  • SHUT_RD closes the read end of the connection.
  • SHUT_WR closes the write end of the connection.
  • SHUT_RDWR closes both the read and write ends of a connection.

Note that shutdown() does not close the file descriptor itself; that still requires a separate call to close().
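
A small sketch of a half-open socket against the HTTP server from the beginning of the article: after shutdown(SHUT_WR) the client can no longer send, but it can still read the server's response.

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(('127.0.0.1', 8000))
sock.send(b'GET / HTTP/1.1\r\nHost: 127.0.0.1:8000\r\n\r\n')
sock.shutdown(socket.SHUT_WR)      # "nothing more from me" -- sends FIN

chunks = []
while True:
    data = sock.recv(4096)
    if not data:                   # empty read: the server closed its side too
        break
    chunks.append(data)
print(b''.join(chunks)[:80])

sock.close()                       # shutdown() alone does not release the fd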

Now that you have a general understanding of sockets, let’s explore how a socket server is written.

Back to the original code:

python3 -m http.server 8000
Serving HTTP on 0.0.0.0 port 8000 ...

We directly bind to port 8000 using Python’s built-in HTTPServer.

Here is the test function in Python 3's http.server module:

def test(HandlerClass=BaseHTTPRequestHandler,
         ServerClass=HTTPServer, protocol="HTTP/1.0", port=8000, bind=""):
    server_address = (bind, port)

    HandlerClass.protocol_version = protocol
    httpd = ServerClass(server_address, HandlerClass)

    sa = httpd.socket.getsockname()
    print("Serving HTTP on", sa[0]."port", sa[1]."...")
    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        print("\nKeyboard interrupt received, exiting.")
        httpd.server_close()
        sys.exit(0)

The test method is called when http.server is run as a module, creating a test server that uses HTTPServer as the server class by default and BaseHTTPRequestHandler as the request handling class.

Look at HTTPServer, the server we started with:

class HTTPServer(socketserver.TCPServer):

    allow_reuse_address = 1

    def server_bind(self):
        socketserver.TCPServer.server_bind(self)
        host, port = self.socket.getsockname()[:2]
        self.server_name = socket.getfqdn(host)
        self.server_port = port

As you can see, HTTPServer inherits from socketserver.TCPServer. The class hierarchy of the socketserver module looks like this:

        +------------+
        | BaseServer |
        +------------+
              |
              v
        +-----------+        +------------------+
        | TCPServer |------->| UnixStreamServer |
        +-----------+        +------------------+
              |
              v
        +-----------+        +--------------------+
        | UDPServer |------->| UnixDatagramServer |
        +-----------+        +--------------------+

As you can see, TCPServer inherits from BaseServer, and UDPServer inherits from TCPServer.

The TCPServer class uses socket.AF_INET (IPv4) and socket.SOCK_STREAM (TCP) by default and creates a socket object during initialization. Note that socket.socket() only creates the socket object; the actual binding happens later in server_bind().

class TCPServer(BaseServer):
    address_family = socket.AF_INET

    socket_type = socket.SOCK_STREAM

    request_queue_size = 5

    allow_reuse_address = False

    def __init__(self, server_address, RequestHandlerClass, bind_and_activate=True):
        BaseServer.__init__(self, server_address, RequestHandlerClass)
        self.socket = socket.socket(self.address_family,
                                    self.socket_type)
        if bind_and_activate:
            try:
                self.server_bind()
                self.server_activate()
            except:
                self.server_close()
                raise

The actual binding takes place in self.server_bind(). Now let’s look at this method, which binds the socket object to the address given in __init__ initialization and retrieves the server address:

def server_bind(self):
    if self.allow_reuse_address:
        self.socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    self.socket.bind(self.server_address)
    self.server_address = self.socket.getsockname()

The listening step takes place in self.server_activate(), immediately after the binding, where the socket starts listening for incoming connections on the bound address:

def server_activate(self):
    self.socket.listen(self.request_queue_size)

The next question is: when a client initiates a connection request, what does the server class do? The answer can be found in BaseServer, which TCPServer inherits from.

Find BaseServer’s serve_forever method:

def serve_forever(self, poll_interval=0.5):
    self.__is_shut_down.clear()
    try:
        while not self.__shutdown_request:
            r, w, e = _eintr_retry(select.select, [self], [], [],
                                    poll_interval)
            if self in r:
                self._handle_request_noblock()

            self.service_actions()
    finally:
        self.__shutdown_request = False
        self.__is_shut_down.set()

As long as the server has not been shut down, the while loop polls the socket with select, which returns the file descriptors that are ready for IO. When a readable event is detected, the _handle_request_noblock method is called to handle the socket:

def get_request(self):
    return self.socket.accept()

def _handle_request_noblock(self):
    try:
        request, client_address = self.get_request()
    except OSError:
        return
    if self.verify_request(request, client_address):
        try:
            self.process_request(request, client_address)
        except:
            self.handle_error(request, client_address)
            self.shutdown_request(request)

In the _handle_request_noblock method, the server accepts the readable socket (the request) and calls process_request to handle it; if an exception occurs, it calls handle_error and then shutdown_request to close the request.

def process_request(self, request, client_address):
    self.finish_request(request, client_address)
    self.shutdown_request(request)

def finish_request(self, request, client_address):
    self.RequestHandlerClass(request, client_address, self)

def shutdown_request(self, request):
    self.close_request(request)

Finally, look at what process_request does: it first calls finish_request, which instantiates a RequestHandlerClass to handle the request, and when processing is complete it calls shutdown_request to end the request.

The UDPServer is almost the same as the TCPServer, with only a few important parameters changed:

class UDPServer(TCPServer):
    allow_reuse_address = False

    socket_type = socket.SOCK_DGRAM

    max_packet_size = 8192

    def get_request(self):
        data, client_addr = self.socket.recvfrom(self.max_packet_size)
        return (data, self.socket), client_addr

That's pretty much it for the server classes; now let's look at the request handler classes.

Let's start with the base class, BaseRequestHandler:

class BaseRequestHandler:
    def __init__(self, request, client_address, server):
        self.request = request
        self.client_address = client_address
        self.server = server
        self.setup()
        try:
            self.handle()
        finally:
            self.finish()

It receives a request (a socket) as an argument, calls self.setup() to set up file objects for reading and writing, calls self.handle() to handle the request, and finally calls self.finish() to end the processing.

Now look at the StreamRequestHandler class:

class StreamRequestHandler(BaseRequestHandler):
    rbufsize = -1
    wbufsize = 0

    timeout = None

    disable_nagle_algorithm = False

    def setup(self):
        self.connection = self.request
        if self.timeout is not None:
            self.connection.settimeout(self.timeout)
        if self.disable_nagle_algorithm:
            self.connection.setsockopt(socket.IPPROTO_TCP,
                                       socket.TCP_NODELAY, True)
        self.rfile = self.connection.makefile('rb', self.rbufsize)
        self.wfile = self.connection.makefile('wb', self.wbufsize)

    def finish(self):
        if not self.wfile.closed:
            try:
                self.wfile.flush()
            except socket.error:
                pass
        self.wfile.close()
        self.rfile.close()

The setup step creates a file object for reading from the socket and a file object for writing to it. The finish step flushes the write buffer and closes both file objects.

handle is the core step that processes the request. In BaseHTTPRequestHandler, the handler processes one request on the socket and, as long as the connection has not been closed and no timeout or exception has occurred, goes on to process the next request on the same connection (e.g. for keep-alive or large data transfers):

class BaseHTTPRequestHandler(socketserver.StreamRequestHandler):
    def handle(self):
        self.handle_one_request()
        while not self.close_connection:
            self.handle_one_request()

The rest is too trivial to post here. At this point, the server has finished processing one request from the client.
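
As a quick illustration of the handler side (my own sketch, not code from the article's server): handle_one_request() parses the request line and headers and then dispatches to do_GET, do_POST, and so on, so a custom handler only has to implement those methods.

from http.server import HTTPServer, BaseHTTPRequestHandler

class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b'hello world'
        self.send_response(200)                        # status line
        self.send_header('Content-Type', 'text/plain') # headers
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()                             # blank line
        self.wfile.write(body)                         # message body

HTTPServer(('127.0.0.1', 8000), HelloHandler).serve_forever()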

The http.server module also provides another handler class, CGIHTTPRequestHandler, which selects a CGI script to execute based on the request information. CGI is more flexible but has its drawbacks, which is why various alternatives exist: FastCGI, mod_python, WSGI... For those interested, see HOWTO Use Python in the Web. For anything that does not need that complexity, the built-in request handler classes are just fine.
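
As a taste of what WSGI standardizes, here is a minimal WSGI application served with the standard library's wsgiref (a sketch; the port is arbitrary):

from wsgiref.simple_server import make_server

def app(environ, start_response):
    # environ carries the parsed request; start_response sets status and headers.
    start_response('200 OK', [('Content-Type', 'text/plain; charset=utf-8')])
    return [b'hello world']

httpd = make_server('127.0.0.1', 8080, app)
httpd.serve_forever()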

Back to HTTPServer: in a production environment nobody would be foolish enough to use the built-in HTTPServer directly as-is, because it is single-process and can only handle one request at a time for the lifetime of that request. For this reason the socketserver module also provides ThreadingMixIn and ForkingMixIn, which create a new thread or process for each incoming request.

Mixing ThreadingMixIn or ForkingMixIn into a server class is simple:

class ThreadingHTTPServer(ThreadingMixIn, HTTPServer):
    pass
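
A quick way to try it, assuming we serve the current directory with SimpleHTTPRequestHandler from http.server:

from http.server import HTTPServer, SimpleHTTPRequestHandler
from socketserver import ThreadingMixIn

class ThreadingHTTPServer(ThreadingMixIn, HTTPServer):
    pass

# Each incoming request is now handled in its own thread.
ThreadingHTTPServer(('127.0.0.1', 8000), SimpleHTTPRequestHandler).serve_forever()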

From ThreadingMixIn's source code you can see that it overrides the server class's process_request method, the one called when the server processes a request. In ThreadingMixIn's version, a new thread is created to handle each request, which greatly improves the server's ability to serve concurrent requests.

class ThreadingMixIn:
    daemon_threads = False

    def process_request_thread(self, request, client_address):
        try:
            self.finish_request(request, client_address)
            self.shutdown_request(request)
        except:
            self.handle_error(request, client_address)
            self.shutdown_request(request)

    def process_request(self, request, client_address):
        t = threading.Thread(target = self.process_request_thread,
                             args = (request, client_address))
        t.daemon = self.daemon_threads
        t.start()

But this will not satisfy everyone: one thread per request means a hundred threads for a hundred requests, and what about ten thousand, or a hundred thousand? In a real-world environment, the number of threads generally has to be kept within bounds (for example, with a thread pool) to limit the load on the system.
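
A rough sketch of that idea: a mix-in in the spirit of ThreadingMixIn, but handing requests to a fixed-size concurrent.futures pool instead of spawning one thread per request (the class names and pool size here are my own, not part of the standard library):

from concurrent.futures import ThreadPoolExecutor
from http.server import HTTPServer, SimpleHTTPRequestHandler

class PoolMixIn:
    pool = ThreadPoolExecutor(max_workers=10)   # upper bound on worker threads

    def process_request(self, request, client_address):
        # Queue the request instead of creating a brand-new thread for it.
        self.pool.submit(self.process_request_thread, request, client_address)

    def process_request_thread(self, request, client_address):
        try:
            self.finish_request(request, client_address)
        except Exception:
            self.handle_error(request, client_address)
        finally:
            self.shutdown_request(request)

class PooledHTTPServer(PoolMixIn, HTTPServer):
    pass

PooledHTTPServer(('127.0.0.1', 8000), SimpleHTTPRequestHandler).serve_forever()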

Now let's return to sockets and look at IO models.

We know that socket input requires two phases:

  1. Wait for the data to be ready.
  2. Copy data from the kernel to the process.

Because the waiting phase blocks, we used multithreading above to reduce the impact of that blocking.

Here are five IO models:

Blocking IO model

recv -> datagram not ready -> wait for data -> datagram ready -> copy data from kernel to user space -> copy complete -> return success indication

Non-blocking IO model

recv -> datagram not ready -> return EWOULDBLOCK -> recv -> datagram not ready -> return EWOULDBLOCK -> ... -> datagram ready -> copy data from kernel to user space -> copy complete -> return success indication

Characteristics: the application keeps polling, which consumes a lot of CPU time.

IO multiplexing model

select -> datagram not ready -> datagram ready -> return readable condition -> recv -> copy data from kernel to user space -> copy complete -> return success indication

Signal-driven model

establish signal handler (sigaction) -> kernel delivers SIGIO -> recv -> copy data from kernel to user space -> copy complete -> return success indication

Asynchronous IO model

aio_read -> datagram not ready -> datagram ready -> copy data from kernel to user space -> copy complete -> deliver the signal specified in aio_read

Characteristics: the process is not blocked at all; it is only notified by a signal after the data copy is complete.

There is no doubt that we have been using the blocking IO model since the beginning, which is inefficient.

To achieve better performance, we generally use the IO multiplexing model, e.g. the select and poll operations. The process checks multiple file descriptors at once to find out whether any of them is ready for IO. Once the kernel finds that one or more of the IO conditions specified by the process are ready (input is ready to be read, or the descriptor can accept more output), it notifies the process.
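
A minimal sketch of the idea with select, rewriting the earlier "hello world" server so that one process watches the listening socket and all connected clients at the same time:

import select
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('127.0.0.1', 8000))
server.listen(5)
server.setblocking(False)

sockets = [server]
while True:
    readable, _, _ = select.select(sockets, [], [])   # block until something is ready
    for s in readable:
        if s is server:                 # the listening socket: a new client
            conn, addr = server.accept()
            conn.setblocking(False)
            sockets.append(conn)
        else:                           # an existing client has data for us
            data = s.recv(4096)
            if data:
                s.send(b'hello world')
            else:                       # empty read: the peer closed the connection
                sockets.remove(s)
                s.close()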

However, the drawback of select and poll is that every call scans the whole set of descriptors being checked. When the number of monitored socket file descriptors is large, performance degrades rapidly and CPU consumption becomes serious.

The signal-driven model has an advantage over them: when input data arrives on the specified file descriptor, the kernel sends a signal to the process that requested the data, so the process can work on other tasks and simply wait to be notified by the signal.

epoll goes one step further by listening for file descriptors in an event-driven way, avoiding the complexity of signal handling. Events are registered for file descriptors, the kernel monitors them, and the application process is notified when a descriptor becomes ready.

On highly concurrent network workloads, epoll typically outperforms select and poll by orders of magnitude.
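
In Python the standard selectors module wraps these mechanisms and picks the best one available on the platform (epoll on Linux, kqueue on BSD/macOS); here is a sketch of the same server written with it:

import selectors
import socket

sel = selectors.DefaultSelector()       # epoll/kqueue/poll/select, whichever is best

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('127.0.0.1', 8000))
server.listen(5)
server.setblocking(False)
sel.register(server, selectors.EVENT_READ)

while True:
    for key, _ in sel.select():         # wait for registered descriptors to become ready
        sock = key.fileobj
        if sock is server:
            conn, addr = server.accept()
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:
            data = sock.recv(4096)
            if data:
                sock.send(b'hello world')
            else:
                sel.unregister(sock)
                sock.close()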

There are two triggering modes for these readiness notifications:

  • Level-triggered: a file descriptor is reported ready as long as an IO call on it would not block. (Supported by: select, poll, epoll, etc.)
  • Edge-triggered: a notification is delivered only when new IO activity (e.g. new input) has occurred on the file descriptor since it was last checked. (Supported by: signal-driven IO, epoll, etc.)

In real development, pay attention to the difference between the two: understand why edge triggering can lead to socket starvation and how to deal with it.

To summarize the five IO models in one diagram, it looks like this:





Using an IO multiplexing model can effectively improve the efficiency of network programs.

HTTP

Now let's look at HTTP. HTTP is a stateless protocol built on top of TCP, sitting at the application layer of the four-layer model; it uses TCP to transmit its message data.

Take typing a URL into the browser as an example and walk through the HTTP request process:

  1. The browser first parses the host name and port from the URL. The general format of a URL is: <scheme>://<user>:<password>@<host>:<port>/<path>;<params>?<query>#<frag>.
  2. Browsers translate host names into IP addresses (DNS).
  3. The browser establishes a TCP connection with the server.
  4. The browser sends an HTTP request packet over the TCP connection.
  5. The server returns an HTTP response packet on the TCP connection.
  6. The connection is closed and the browser renders the document.

HTTP request information consists of several elements:

  1. A request line, such as GET /index.html HTTP/1.1, indicates that the file index.html is being requested.
  2. Request header (header).
  3. A blank line.
  4. The body of the message.

For example, in the first example, we make a request to port 8000:

GET / HTTP/1.1          (request line)
Host: 127.0.0.1:8000    (request header)

You will get the following response:

HTTP/1.0 200 OK    (response line)
Content-Length: 5252
Content-Type: text/html; charset=utf-8
Date: Tue, 21 Feb 2017 08:36:01 GMT
Server: SimpleHTTP/0.6 Python/3.4.5

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Directory listing for /</title>
</head>
<body>
<h1>Directory listing for /</h1>
...

The key to HTTP lies in its headers; the header information determines what the client and the server can do.
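
You can inspect both sides' headers easily with requests, reusing the very first example:

import requests

r = requests.get('http://127.0.0.1:8000/')
print(r.request.headers)   # headers the client sent (User-Agent, Accept, ...)
print(r.headers)           # headers the server returned (Content-Type, Server, ...)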

HTTP status codes

  • 1XX message — The request has been received by the server and continues processing
  • 2XX success – The request was successfully received, understood, and accepted by the server
  • 3XX redirection – Subsequent action is required to complete this request
  • 4XX request error – The request contains a lexical error or cannot be executed
  • 5XX server error – An error occurred while the server was processing a valid request

HTTP & DOM

DOM stands for Document Object Model. When we write a crawler, we usually need to parse HTML pages, and that is where a DOM parser comes in to analyze the captured page.

lxml and BeautifulSoup serve us well, but how do they actually parse HTML?

Python's built-in html.parser module provides an HTML parser:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')
# ----------------------------------------------------------------
'''
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
'''

You can see how the DOM is parsed in its source code.

HTTP & RESTful

Recommended reading: RESTful API design best practices

HTTP More

Recommended reading: The Definitive Guide to HTTP

DNS

Host-name-to-IP translation is usually performed through DNS queries. DNS is a large distributed database that organizes host names in a hierarchical namespace; the domain name of a node is made up of the names of all nodes along the path from that node up to the root.





DNS queries can easily be done with the dnspython package:

import dns.resolver

domain = 'baidu.com'
A = dns.resolver.query(domain, 'A')
for answer in A.response.answer:
    for item in answer.items:
        print(item.address)
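
If all you need is name-to-address resolution, the standard library is enough; socket.getaddrinfo() asks the system resolver, which may in turn perform a DNS query:

import socket

# Resolve baidu.com for TCP port 80 and print each address family and sockaddr.
for family, _, _, _, sockaddr in socket.getaddrinfo('baidu.com', 80,
                                                    proto=socket.IPPROTO_TCP):
    print(family, sockaddr)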

FTP

In the Python world, using FTP is very simple: the built-in ftplib module is all you need to operate on a remote machine over FTP:

from ftplib import FTP
with FTP("ftp1.at.proftpd.org") as ftp:
    ftp.login()
    ftp.dir()
'230 Anonymous login ok, restrictions apply.'
dr-xr-xr-x   9 ftp      ftp           154 May  6 10:43 .
dr-xr-xr-x   9 ftp      ftp           154 May  6 10:43 ..
dr-xr-xr-x   5 ftp      ftp          4096 May  6 10:43 CentOS
dr-xr-xr-x   3 ftp      ftp            18 Jul 10  2008 Fedora

XML-RPC

Setting up an XML-RPC server and client is equally simple.

Server

from xmlrpc.server import SimpleXMLRPCServer
import datetime
import sys

class ExampleService:
    def getData(self):
        return '42'

    class currentTime:
        @staticmethod
        def getCurrentTime():
            return datetime.datetime.now()

server = SimpleXMLRPCServer(("localhost", 8000))
server.register_function(pow)
server.register_function(lambda x,y: x+y, 'add')
server.register_instance(ExampleService(), allow_dotted_names=True)
server.register_multicall_functions()
print('Serving XML-RPC on localhost port 8000')
try:
    server.serve_forever()
except KeyboardInterrupt:
    print("\nKeyboard interrupt received, exiting.")
    sys.exit(0)

Client

from xmlrpc.client import ServerProxy, MultiCall, Error
server = ServerProxy("http://localhost:8000")

try:
    print(server.currentTime.getCurrentTime())
except Error as v:
    print("ERROR", v)

multi = MultiCall(server)
multi.getData()
multi.pow(2, 9)
multi.add(1, 2)
try:
    for response in multi():
        print(response)
except Error as v:
    print("ERROR", v)Copy the code

SMTP & POP3

import smtplib
import poplib
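
A minimal sketch of what these modules are for; the host names, account, and password below are placeholders, not real servers:

import poplib
import smtplib
from email.message import EmailMessage

SMTP_HOST, POP3_HOST = 'smtp.example.com', 'pop.example.com'   # placeholders
USER, PASSWORD = 'user@example.com', 'secret'                  # placeholders

# Send a message over SMTP.
msg = EmailMessage()
msg['From'] = USER
msg['To'] = 'friend@example.com'
msg['Subject'] = 'Hello'
msg.set_content('Sent from Python.')

with smtplib.SMTP(SMTP_HOST, 587) as smtp:
    smtp.starttls()
    smtp.login(USER, PASSWORD)
    smtp.send_message(msg)

# Check the mailbox over POP3.
pop = poplib.POP3_SSL(POP3_HOST)
pop.user(USER)
pop.pass_(PASSWORD)
print(pop.stat())          # (message count, mailbox size in bytes)
pop.quit()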

End

When it comes to network programming, this article only scratches the tip of the iceberg; there is far more to say than my limited knowledge covers, and interested readers are encouraged to explore further on their own.