1. Prepare knowledge

Today let’s talk about the distributed process crawler. Before writing one, we first need to understand the concept of "distributed": literally it means spreading the work out, and in practice it means the work can be split up and run separately.

A distributed process setup distributes the work across multiple machines and makes full use of each of them to complete our crawling task. Distributed processes rely on the multiprocessing module, which not only supports multiple processes, but whose managers submodule also supports distributing those processes across multiple machines.

We can write a server process that acts as the scheduler and distributes the crawling tasks to the other processes, relying on network communication to manage them.

2. Simulate a distributed process crawler

Let’s simulate a distributed process crawler. Suppose we need to crawl all the images on a photo site. Following the distributed process idea, we would create one process to crawl the image links and store them in a Queue, while other processes read links from the Queue, download the images, and do whatever other (local) work is needed.
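For a concrete picture, here is a minimal sketch (not part of the example code below) of what the link-collecting process might do, assuming a hypothetical page URL and assuming the image links can be pulled out of <img src="..."> tags with the third-party requests library:

import re
import queue

import requests  # third-party: pip install requests

def collect_image_links(page_url, task_queue):
    # Fetch one page and push every <img src="..."> link into the task queue
    html = requests.get(page_url, timeout=10).text
    for link in re.findall(r'<img[^>]+src="([^"]+)"', html):
        task_queue.put(link)

# Hypothetical usage with a plain local queue
links = queue.Queue()
collect_image_links("https://example.com/photos", links)  # placeholder URL
print("queued %d image links" % links.qsize())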

In fact, our Queue is exposed to the network; the distribution step simply wraps it up so that it can be reached remotely. This is the so-called networking of a local queue.

Next, let’s analyze how to create a distributed server process, which can be divided into six steps:

  1. Create the queues used for communication between processes. Generally there are two kinds of processes: a service process and task processes. The service process creates a task queue, task_queue, as the channel for passing tasks to the task processes, and a result queue, result_queue, as the channel for the task processes to send results back after completing their tasks. In a distributed environment, tasks must be added through the Queue interface obtained from the QueueManager.

  2. Register the queues from step 1 on the network so they are exposed to other processes or hosts. After registration we obtain the network queues, which act as images of the local queues.

  3. Create a QueueManager object, instantiate it, and bind the port and authentication key.

  4. Start the instance created in step 3, i.e. start the Manager, which oversees the information channel.

  5. Obtain the Queue objects accessed over the network through methods of the Manager instance, i.e. localize the network queue objects.

  6. Put tasks into the network queue so they are automatically distributed to the task processes for handling.

taskManager.py:

import queue
from multiprocessing.managers import BaseManager
from multiprocessing import freeze_support

# The number of tasks
task_num = 500

# Define the send (task) and receive (result) queues
task_queue = queue.Queue(task_num)
result_queue = queue.Queue(task_num)

def get_task():
    return task_queue

def get_result():
    return result_queue

# Create a QueueManager subclass
class QueueManager(BaseManager):
    pass

def run():
    # On Windows the bound callable cannot be a lambda, so the functions must be defined beforehand
    QueueManager.register('get_task_queue', callable=get_task)
    QueueManager.register('get_result_queue', callable=get_result)
    # On Windows the IP address must be given explicitly; on Linux it defaults to the local host
    manager = QueueManager(address=('127.0.0.1', 8001), authkey='jap'.encode('utf-8'))
    # Start the manager
    manager.start()
    try:
        # Get the task queue and result queue over the network
        task = manager.get_task_queue()
        result = manager.get_result_queue()
        # Add tasks: placeholder "URLs" stand in for real image links here
        for url in ["ImageUrl_" + str(i) for i in range(500)]:
            print("Add task %s" % url)
            task.put(url)
        # Get the results
        print("Getting results...")
        for i in range(500):
            print("result is %s" % result.get(timeout=10))
    except:
        print('Manager error')
    finally:
        # Be sure to shut down, otherwise errors about unclosed pipes will be reported
        manager.shutdown()

if __name__ == '__main__':
    # This line avoids problems that may occur with multiple processes on Windows
    freeze_support()
    run()
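Once taskManager.py is running, the manager listens on port 8001 and waits for task processes to connect. Start it first, then start one or more task processes (described next), either on the same machine or on another machine that can reach the server.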

Next, create the task process (taskWorker.py), which can be divided into four steps:

  1. Create a QueueManager subclass, and use QueueManager to register the names of the methods used to obtain the queues. A task process can only obtain a queue from the network by name, so the names must match those registered on the server.

  2. Connect to the server; the address, port, and authentication key must be the same as on the server.

  3. Obtain the Queue objects from the network, i.e. localize them.

  4. Fetch tasks from the task queue and write the results to the result queue.

import time
from multiprocessing.managers import BaseManager

# Create a QueueManager subclass
class QueueManager(BaseManager):
    pass

# Step 1: use QueueManager to register the names of the methods used to get the queues
QueueManager.register('get_task_queue')
QueueManager.register('get_result_queue')

# Step 2: connect to the server
server_addr = '127.0.0.1'
print('Connect to server %s' % server_addr)
# The port and authentication key must be the same as on the server
m = QueueManager(address=(server_addr, 8001), authkey='jap'.encode('utf-8'))
# Connect over the network
m.connect()

# Step 3: get the queue objects
task = m.get_task_queue()
result = m.get_result_queue()

# Step 4: fetch tasks from the task queue and write results to the result queue
while not task.empty():
    url = task.get(True, timeout=5)
    print("run task download %s" % url)
    time.sleep(1)  # simulate downloading the image
    # Write the result to the result queue
    result.put("%s ---> success" % url)

print("exit")
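In this simulation the worker only sleeps for a second instead of downloading anything. As a rough sketch (not part of the original code), the download step could look something like the following, assuming the queued tasks are real image URLs and that the third-party requests library is available:

import os
import requests  # third-party: pip install requests

def download_image(url, save_dir="images"):
    # Download one image URL into save_dir and return the local file path
    os.makedirs(save_dir, exist_ok=True)
    filename = os.path.join(save_dir, url.split("/")[-1] or "unnamed.jpg")
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    with open(filename, "wb") as f:
        f.write(resp.content)
    return filename

Inside the worker loop, the time.sleep(1) call would then be replaced by download_image(url), and the saved path (or an error message) would be put into the result queue.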

The detailed steps are also written as comments in the code. Of course, we can start multiple task processes; each one completes its own work without interfering with the others, and because every URL is taken from the shared task queue exactly once, no URL is crawled twice. In this way we neatly implement crawling our tasks with multiple processes.

The above is a very simple small example of a distributed process crawler. You can use this approach to practice on a small project of your own. Note also that we can distribute the tasks to multiple machines, which lets us carry out large-scale crawls.
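To actually spread the workers over several machines, only the addresses need to change: the server must bind to an address the other machines can reach, and each worker must connect to the server's IP. A minimal sketch, assuming the server's LAN IP is 192.168.1.100 (a placeholder):

# In taskManager.py: bind to all interfaces instead of the loopback address
manager = QueueManager(address=('', 8001), authkey='jap'.encode('utf-8'))

# In taskWorker.py on another machine: point at the server's real IP
server_addr = '192.168.1.100'  # placeholder, replace with the server machine's IP
m = QueueManager(address=(server_addr, 8001), authkey='jap'.encode('utf-8'))
m.connect()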

These are some of the common techniques used in Python crawlers.


Reposted from JAP.