One, foreword

As a crawler engineer, I often need to crawl real-time data at work, such as live sports scores, stock market quotes, or price changes in the cryptocurrency market, as shown in the figure below:

On the Web, polling and WebSocket are the two common ways to deliver ‘real-time’ data updates. With polling, the client requests a server interface at a fixed interval (for example, every second) to give the impression of real-time updates. The data looks live, but it is only refreshed once per interval, so it is not truly real-time. Polling is a pull model: the client actively pulls data from the server.

WebSocket is a push model: the server actively pushes data to the client as soon as it changes, which is a true real-time update.
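The difference between pull and push can be illustrated with a small simulation that needs no network. This is only a sketch: the update times, the prices, and the 1-second polling interval are made-up values, and the function names poll and push are hypothetical.

```python
def poll(updates, interval, duration):
    """Return the values a polling client sees.

    updates: list of (time, value) pairs sorted by time (hypothetical data).
    The client asks for the latest value every `interval` seconds.
    """
    seen = []
    t = interval
    while t <= duration:
        # Latest value published at or before the polling moment.
        current = [value for when, value in updates if when <= t]
        if current and (not seen or seen[-1] != current[-1]):
            seen.append(current[-1])
        t += interval
    return seen


def push(updates):
    """Return the values a push client sees: every update, as it happens."""
    return [value for _, value in updates]


# Hypothetical price updates: three of them arrive within the first second.
updates = [(0.2, 55.42), (0.5, 55.63), (0.9, 55.59), (2.4, 55.60)]

polled = poll(updates, interval=1.0, duration=3.0)
pushed = push(updates)
print("polling saw:", polled)
print("push saw:   ", pushed)
```

In this run the polling client never sees the two intermediate prices, because they were replaced before the next request was made, while the push client receives every update.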

What is WebSocket

WebSocket is a protocol for full duplex communication over a single TCP connection. It makes it easier to exchange data between the client and the server, allowing the server to actively push data to the client. In the WebSocket API, the browser and server only need to complete a handshake to create a persistent connection and two-way data transfer.

WebSocket advantages

  • Less control overhead: the handshake happens only once and HTTP headers are sent only during that handshake; after that, each message carries just a small frame header plus the data. Compared with repeated HTTP requests, WebSocket uses far fewer resources.
  • Better real-time behavior: because the server can push a message the moment it is ready, latency is lower than with HTTP polling, where an update has to wait for the next request. In the same period WebSocket can also deliver many more messages than one polling interval allows.
  • Binary support: WebSocket supports binary frames, so binary data can be sent as-is without text encoding, which makes transmission more economical.
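To put a rough number on the control overhead, the sketch below computes the size of a WebSocket frame header following the frame layout in RFC 6455 and compares it with a set of HTTP request headers. The HTTP headers, the host example.com, and the function name frame_header_size are assumptions made for illustration; real requests usually carry more headers than this.

```python
def frame_header_size(payload_length, masked):
    """Size in bytes of a WebSocket frame header as laid out in RFC 6455."""
    size = 2  # FIN/opcode byte + mask bit/7-bit length byte
    if payload_length > 65535:
        size += 8  # 64-bit extended payload length
    elif payload_length > 125:
        size += 2  # 16-bit extended payload length
    if masked:
        size += 4  # masking key, required on client-to-server frames
    return size


# The subscription message used later in this article.
message = b'{"action":"subscribe","args":["QuoteBin5m:14"]}'

# A hypothetical, fairly small set of HTTP request headers for comparison.
http_headers = (
    b"GET /v1/quote HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"User-Agent: demo-client\r\n"
    b"Accept: application/json\r\n"
    b"\r\n"
)

print("WebSocket header, client to server:", frame_header_size(len(message), True))
print("WebSocket header, server to client:", frame_header_size(len(message), False))
print("HTTP request headers:", len(http_headers))
```

With polling, the HTTP headers are repeated on every request, whereas a WebSocket connection pays only the few header bytes per message after the handshake.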

Crawlers facing HTTP and WebSocket

Python has many network request libraries, and Requests is one of the most commonly used for simulating network requests. However, the requests it sends are based on the HTTP protocol. Requests is of no use when the target is a WebSocket endpoint; you must use a library that can open WebSocket connections.

Three, crawling approach

Take the real-time data on litecoin’s official website http://www.laiteb.com/ as an example. The WebSocket handshake happens only once, so to observe it in the browser developer tools you must open the developer tools first, switch to the Network tab, and then enter the URL or refresh the current page. Only then are the WebSocket handshake request and the data transfer recorded. Chrome is used as an example:

In the request list, the entry named realTime is the WebSocket connection.

Unlike HTTP requests, WebSocket connection addresses begin with ws:// or wss://, and the status code of a successful connection is 101 (Switching Protocols) instead of 200.
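What a client checks in the handshake response can be sketched as follows. RFC 6455 defines how the server derives Sec-WebSocket-Accept from the Sec-WebSocket-Key the client sent. The helper names expected_accept and handshake_ok, and the assumption that headers arrive as a dict with lower-cased names, are illustrative and not taken from any particular library.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for computing Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def expected_accept(sec_websocket_key):
    """Compute the Sec-WebSocket-Accept value the server must return."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def handshake_ok(status_code, headers, sec_websocket_key):
    """Check a handshake response: status 101 plus a matching accept header.

    `headers` is assumed to be a dict with lower-cased header names.
    """
    return (
        status_code == 101
        and headers.get("upgrade", "").lower() == "websocket"
        and headers.get("sec-websocket-accept") == expected_accept(sec_websocket_key)
    )


# Sample key and accept value taken from the example in RFC 6455.
key = "dGhlIHNhbXBsZSBub25jZQ=="
response_headers = {
    "upgrade": "websocket",
    "connection": "Upgrade",
    "sec-websocket-accept": "s3pPLMBiTxaQ9kYGzzhZRbK+xOo=",
}
print(handshake_ok(101, response_headers, key))
print(handshake_ok(200, response_headers, key))
```

A WebSocket client library performs this kind of check for you; the sketch only shows why a 200 response does not count as a successful WebSocket connection.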

The Headers tab records the request and response information, while the Frames tab records the data transmitted between the two parties, which is also the data we need to crawl:

In the Frames figure, a green up arrow marks data sent by the client to the server, and an orange down arrow marks data pushed by the server to the client.

As can be seen from the data order, the client sends first:

{"action":"subscribe","args":["QuoteBin5m:14"]}

Then the server pushes messages (continuously):

{"group":"QuoteBin5m:14","data":[{"low":"55.42","high":"55.63","open":"55.42","close":"55.59","last_price":"55.59","avg_price":"55.5111587372932781077","volume":"40078","timestamp":1551941701,"rise_fall_rate":"0.0030674846625766871","rise_fall_value":"0.17","base_coin_volume":"400.78","quote_coin_volume":"22247.7621987324"}]}
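Both messages are JSON, so they can be built and parsed with the standard library rather than handled as raw strings. The sketch below uses a shortened copy of the pushed message; the function name parse_quotes and the choice of fields to keep are assumptions for illustration.

```python
import json
from datetime import datetime, timezone
from decimal import Decimal

# Build the subscription message instead of hand-writing the JSON string.
subscribe = json.dumps(
    {"action": "subscribe", "args": ["QuoteBin5m:14"]}, separators=(",", ":")
)

# A shortened copy of the pushed message shown above.
pushed = (
    '{"group":"QuoteBin5m:14","data":[{"low":"55.42","high":"55.63",'
    '"open":"55.42","close":"55.59","volume":"40078","timestamp":1551941701}]}'
)


def parse_quotes(raw):
    """Turn one pushed message into a list of simple quote dicts."""
    if isinstance(raw, bytes):  # frames may arrive as bytes
        raw = raw.decode("utf-8")
    message = json.loads(raw)
    quotes = []
    for item in message.get("data", []):
        quotes.append({
            "group": message.get("group"),
            # Prices are strings in the feed; Decimal avoids float rounding.
            "open": Decimal(item["open"]),
            "close": Decimal(item["close"]),
            "time": datetime.fromtimestamp(item["timestamp"], tz=timezone.utc),
        })
    return quotes


print(subscribe)
for quote in parse_quotes(pushed):
    print(quote["group"], quote["open"], quote["close"], quote["time"].isoformat())
```

Keeping the prices as Decimal rather than float is a design choice that preserves the exact digits the server sent, which matters for values such as avg_price.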

Therefore, the whole process from initiating handshake to obtaining data is as follows:

So, here are the questions:

  • How is the handshake performed?
  • How is the connection kept alive?
  • How are messages sent and received?
  • Is there a library that makes all of this easy?

Four, aiowebsocket

There are many Python libraries for connecting to WebSocket servers, but the easy-to-use and stable ones are websocket-client (synchronous), websockets (asynchronous), and aiowebsocket (asynchronous).

Depending on your project requirements, you can choose any of the three. This article introduces the asynchronous WebSocket client aiowebsocket. Its GitHub address is https://github.com/asyncins/aiowebsocket.

aiowebsocket is an asynchronous WebSocket client that follows the WebSocket specification and is lighter and faster than other libraries.

Installation is as simple as for any other library: run pip install aiowebsocket. Once it is installed, we can try the sample code provided in the README:

import asyncio
import logging
from datetime import datetime
from aiowebsocket.converses import AioWebSocket


async def startup(uri):
    async with AioWebSocket(uri) as aws:
        converse = aws.manipulator
        message = b'AioWebSocket - Async WebSocket Client'
        while True:
            await converse.send(message)
            print('{time}-Client send: {message}'
                  .format(time=datetime.now().strftime('%Y-%m-%d %H:%M:%S'), message=message))
            mes = await converse.receive()
            print('{time}-Client receive: {rec}'
                  .format(time=datetime.now().strftime('%Y-%m-%d %H:%M:%S'), rec=mes))


if __name__ == '__main__':
    remote = 'ws://echo.websocket.org'
    try:
        asyncio.get_event_loop().run_until_complete(startup(remote))
    except KeyboardInterrupt as exc:
        logging.info('Quit.')

The output after running is as follows:

2019-03-07 15:43:55-Client send: b'AioWebSocket - Async WebSocket Client'
2019-03-07 15:43:55-Client receive: b'AioWebSocket - Async WebSocket Client'
2019-03-07 15:43:55-Client send: b'AioWebSocket - Async WebSocket Client'
2019-03-07 15:43:56-Client receive: b'AioWebSocket - Async WebSocket Client'
2019-03-07 15:43:56-Client send: b'AioWebSocket - Async WebSocket Client'
...

Client send: a message sent by the client to the server

Client receive: a message pushed by the server to the client

Five, writing code to obtain the data

Back to the crawl target, litecoin’s official website:

From the network request record above, we know that the WebSocket address of the target website is wss://api.bbxapp.vip/v1/ifcontract/realTime. The address shows that the site uses wss, the secure version of ws; the two relate to each other the way HTTP relates to HTTPS. aiowebsocket recognizes and handles SSL automatically, so nothing extra is needed; just pass the target address as the connection URI:

import asyncio
import logging
from datetime import datetime
from aiowebsocket.converses import AioWebSocket


async def startup(uri):
    async with AioWebSocket(uri) as aws:
        converse = aws.manipulator
        while True:
            mes = await converse.receive()
            print('{time}-Client receive: {rec}'
                  .format(time=datetime.now().strftime('%Y-%m-%d %H:%M:%S'), rec=mes))


if __name__ == '__main__':
    remote = 'wss://api.bbxapp.vip/v1/ifcontract/realTime'
    try:
        asyncio.get_event_loop().run_until_complete(startup(remote))
    except KeyboardInterrupt as exc:
        logging.info('Quit.')

Run the code and watch the output: nothing happens. There is no output and no disconnection; the program simply keeps running without printing anything:

Why is that?

Is it the other side that won’t accept our request?

Or are there any anti-crawler restrictions?

In fact, the flow chart above illustrates this problem:

One step in the process requires the client to send a specific message to the server; only after verifying it does the server start pushing data continuously. The sending code therefore has to be added after the handshake completes and before messages are read:

import asyncio
import logging
from datetime import datetime
from aiowebsocket.converses import AioWebSocket


async def startup(uri):
    async with AioWebSocket(uri) as aws:
        converse = aws.manipulator
        # The client sends the subscription message to the server
        await converse.send('{"action":"subscribe","args":["QuoteBin5m:14"]}')
        while True:
            mes = await converse.receive()
            print('{time}-Client receive: {rec}'
                  .format(time=datetime.now().strftime('%Y-%m-%d %H:%M:%S'), rec=mes))


if __name__ == '__main__':
    remote = 'wss://api.bbxapp.vip/v1/ifcontract/realTime'
    try:
        asyncio.get_event_loop().run_until_complete(startup(remote))
    except KeyboardInterrupt as exc:
        logging.info('Quit.')


Save and run, and you’ll see a steady stream of data coming:

At this point, the crawler is able to retrieve the desired data.
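The silent run seen earlier, where the program kept running without output, is easier to diagnose if each receive call has a time limit. The sketch below shows the idea with asyncio.wait_for. SilentConnection and ChattyConnection are hypothetical stand-ins so that the example runs without a network; with the code in this article the awaited call would be converse.receive(), assuming it behaves as an ordinary coroutine.

```python
import asyncio


class SilentConnection:
    """Hypothetical stand-in for a connection that never pushes anything."""

    async def receive(self):
        await asyncio.sleep(3600)  # pretend the server stays silent


class ChattyConnection:
    """Hypothetical stand-in for a connection that pushes immediately."""

    async def receive(self):
        return b'{"group":"QuoteBin5m:14","data":[]}'


async def receive_or_none(connection, timeout):
    """Wait for one message, returning None if nothing arrives in time."""
    try:
        return await asyncio.wait_for(connection.receive(), timeout=timeout)
    except asyncio.TimeoutError:
        return None


async def main():
    silent = await receive_or_none(SilentConnection(), timeout=0.1)
    chatty = await receive_or_none(ChattyConnection(), timeout=0.1)
    return silent, chatty


silent, chatty = asyncio.run(main())
print("silent connection ->", silent)
print("chatty connection ->", chatty)
```

A None result is a hint that the server is waiting for something, such as the subscription message, rather than that the connection has failed.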

What does aiowebsocket do

The code is not long. To use it, you only need to fill in the WebSocket address of the target website and then send data according to the process above. So what does aiowebsocket do along the way?

  • First, aiowebsocket sends a handshake request to the server at the given WebSocket address and verifies the handshake result.
  • Once the handshake is confirmed to have succeeded, the data is sent to the server.
  • To keep the connection open throughout, aiowebsocket automatically exchanges ping/pong frames with the server.
  • Finally, aiowebsocket reads the messages pushed by the server.
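The ping/pong step can be sketched at the frame level. The code below follows the frame layout in RFC 6455 and is not aiowebsocket's actual implementation; the function names are hypothetical, and only short, single-frame messages are handled.

```python
OPCODE_TEXT, OPCODE_PING, OPCODE_PONG = 0x1, 0x9, 0xA


def parse_small_frame(frame):
    """Parse an unmasked server frame with a payload of at most 125 bytes."""
    opcode = frame[0] & 0x0F
    length = frame[1] & 0x7F
    if frame[1] & 0x80 or length > 125:
        raise ValueError("only short, unmasked frames are handled in this sketch")
    return opcode, frame[2:2 + length]


def build_masked_frame(opcode, payload, mask_key):
    """Build a single-frame client message; client frames must be masked."""
    if len(payload) > 125 or len(mask_key) != 4:
        raise ValueError("sketch supports payloads up to 125 bytes and 4-byte keys")
    masked = bytes(b ^ mask_key[i % 4] for i, b in enumerate(payload))
    return bytes([0x80 | opcode, 0x80 | len(payload)]) + mask_key + masked


def reply_to(frame, mask_key):
    """Return the pong frame for a ping, or None for any other frame."""
    opcode, payload = parse_small_frame(frame)
    if opcode == OPCODE_PING:
        return build_masked_frame(OPCODE_PONG, payload, mask_key)
    return None


# Ping frame carrying "Hello", as in the RFC 6455 examples. A real client
# would pick a fresh random mask key per frame; a fixed one keeps this
# example reproducible.
ping = bytes([0x89, 0x05]) + b"Hello"
pong = reply_to(ping, mask_key=bytes([0x37, 0xFA, 0x21, 0x3D]))
print(pong.hex())
```

A pong echoes the payload of the ping it answers, which is why reply_to passes the payload through unchanged.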

Note: if you find aiowebsocket helpful, please give it a Star on GitHub at https://github.com/asyncins/aiowebsocket. If you run into problems or have suggestions for aiowebsocket, you can raise them on GitHub as well. Every suggestion helps aiowebsocket improve so that it can keep serving you.