
Left hand R, right hand Python series - Using multiple processes for task processing

Posted on Nov. 27, 2023, 10:20 p.m. by Donna Holmes
Category: The back-end. Tags: python, unit testing, the back-end, windows

Data capture and other intensive task processing often run into performance bottlenecks; with multi-process tools to lean on, efficiency can often improve a great deal.

Both R and Python can download binary files using multiple processes. Let's look at R first.

Read in the list of files to be downloaded:

library("dplyr") mydata-read.csv("D:/Python/File/toutiaoreport.csv",stringsAsFactors = FALSE,check.names = FALSE) Extract report URL and report name:  mydata1-mydata[1:10,] %% .[,c("title","path")] mydata1$path - paste0("",mydata1$path) mydata1$title-sub("\\?","",mydata1$title)Copy the code

In R, there are three options for downloading files:

Option 1 -- build an explicit loop:

# Build the download function:
myworks <- function(data){
    setwd("D:/R")
    dir.create("folder1", showWarnings = FALSE)
    for (i in 1:nrow(data)){
        download.file(data$path[i], paste0("./folder1/", data$title[i], ".pdf"), quiet = TRUE, mode = "wb")
        cat(sprintf("Downloading file [%d]", i), "\n")
    }
    cat("All download tasks are complete!", "\n")
}
system.time(myworks(mydata1))

There are 10 PDF files in total, averaging about 4.5 MB each (roughly 44.5 MB in all), and no wait time was inserted between downloads; the whole job took about 100 seconds.

Option 2 -- use the vectorized functions in the plyr package:

### Use vectorized functions:
library("plyr")
library("dplyr")
library("foreach")

mylist <- foreach(x = 1:nrow(mydata1), .combine = 'c') %do% list(mydata1[x, ])

setwd("D:/R")
dir.create("folder2", showWarnings = FALSE)

downloadCSV <- function(filelinks) {
    download.file(filelinks$path, destfile = paste0("D:/R/folder2/", filelinks$title, ".pdf"), quiet = TRUE, mode = "wb")
}

url <- ""
system.time(
    l_ply(mylist, downloadCSV, .progress = "text")
)

A little sad: for the same 10 PDF documents the time hardly changes. This run took 99.89 seconds, only 0.02 seconds less than last time's 99.91. But I was on the campus network (especially bad network speed); if you have fast broadband you may want to rerun the comparison.

Option 3 -- use a multi-process package for concurrent processing:


The multi-process package used here is the foreach package, but you can also try the parallel package.

# Note: %dopar% only runs in parallel if a backend is registered first
# (e.g. via doParallel::registerDoParallel()); otherwise it silently
# falls back to sequential execution.
downloadCSV <- function(filelinks) {
    download.file(filelinks$path, destfile = paste0("D:/R/folder3/", filelinks$title, ".pdf"), quiet = TRUE, mode = "wb")
}

foreach(d = mylist, .combine = c) %dopar% downloadCSV(d)

This time the total was... 99.46 seconds. Okay, I may have been using fake multiple processes (one likely reason: %dopar% runs sequentially unless a parallel backend is registered). Still, overall the time dropped from 99.91 to 98.72 seconds, saving nearly 1.19 seconds.

And the code looks a lot more elegant (well, I can't keep making this up ~_~).

At present I don't know much about multi-processing in R. If I come to a better understanding later, I will reorganize this part; if you are interested, you can also explore the internal multi-process execution mechanism of the foreach package on your own.


Now the same job in Python. First, read in the file list:

import time, os
from urllib import request
import threading
from multiprocessing import Pool
import pandas as pd

### Read in the file to be downloaded:
os.chdir("D:/Python/File")
mydata = pd.read_csv("toutiaoreport.csv", encoding='gbk')
mydata1 = mydata.loc[:9, ["title", "path"]]
mydata1['path'] = ["" + path for path in mydata1['path']]
mydata1['title'] = [text.replace("?", "") for text in mydata1.title]

Option 1 -- download using an explicitly declared loop:

def getPDF(mydata1):
    os.makedirs("folder1")
    os.chdir("D:/Python/File/folder1")
    i = 0
    while i < len(mydata1):
        print("Downloading file [{}]!".format(i + 1))
        request.urlretrieve(mydata1['path'][i], mydata1['title'][i] + ".pdf")
        i += 1

if __name__ == '__main__':
    t0 = time.time()
    getPDF(mydata1)
    t1 = time.time()
    total = t1 - t0
    print("Elapsed time: {}".format(total))

This is about three seconds slower than the R loop, so let's try multi-threading/multi-processing to download these PDFs.

Option 2 -- Download using the threading package:

def executeThread(i):
    request.urlretrieve(mydata1['path'][i], "D:/Python/File/folder2/" + mydata1['title'][i] + ".pdf")

def main():
    try:
        os.makedirs("D:/Python/File/folder2")
    except:
        pass
    threads = []
    # Create and start one thread per file:
    for i in range(len(mydata1)):
        thread = threading.Thread(target=executeThread, args=(i,))
        threads.append(thread)
        thread.start()
    # Wait for all downloads to finish:
    for i in threads:
        i.join()

if __name__ == '__main__':
    t0 = time.time()
    main()
    t1 = time.time()
    total = t1 - t0
    print("Elapsed time: {}".format(total))

The total time is 98.15953660011292 seconds, which only saves about four seconds compared to the explicit loop.
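As a side note, the standard library's concurrent.futures module can express the same thread-per-download idea more compactly. This is a minimal sketch, assuming the imports and the mydata1 frame from the setup step above (max_workers=4 is an arbitrary choice):

from concurrent.futures import ThreadPoolExecutor

def fetch(i):
    # Same per-file download as executeThread above
    request.urlretrieve(mydata1['path'][i], "D:/Python/File/folder2/" + mydata1['title'][i] + ".pdf")

if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=4) as executor:
        # map() hands one index to each worker thread and blocks until all finish
        list(executor.map(fetch, range(len(mydata1))))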

Option 3 -- use the multi-process functionality provided by the multiprocessing package:

links = mydata1['path'].tolist()

def downloadPDF(i):
    request.urlretrieve(i, os.path.basename(i))

def shell():
    # Create the target folder if it does not exist yet:
    if not os.path.exists("D:/Python/File/folder3"):
        os.makedirs("D:/Python/File/folder3")
        os.chdir("D:/Python/File/folder3")
    else:
        os.chdir("D:/Python/File/folder3")
    # Start timing:
    t0 = time.time()
    # Multi-process download with a pool of 4 workers:
    pool = Pool(processes=4)
    pool.map(downloadPDF, links)
    pool.close()
    t1 = time.time()
    total = t1 - t0
    print("Elapsed time: {}".format(total))

if __name__ == "__main__":
    shell()

When using the process-pool feature of the multiprocessing package, my code ran locked and suspended, with no output, no exit, and it could not even be force-interrupted. This is a particular pitfall on Windows, which has no fork(): child processes are started by spawning, which re-imports the main module, so unguarded multiprocessing code can end up stuck.
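The standard workaround is to keep every process-creating call under an if __name__ == "__main__" guard and run the script as a file rather than interactively. Here is a minimal spawn-safe sketch of that pattern; the file name download_reports.py and the URLs are illustrative placeholders, not the real report links:

# download_reports.py -- minimal spawn-safe sketch (file name and URLs are placeholders)
import os
import time
from multiprocessing import Pool
from urllib import request

def download_pdf(url):
    # The worker must live at module top level: on Windows each child
    # process re-imports this module and looks the function up by name.
    request.urlretrieve(url, os.path.basename(url))

if __name__ == "__main__":
    links = ["https://example.com/report1.pdf",
             "https://example.com/report2.pdf"]
    os.makedirs("folder3", exist_ok=True)
    os.chdir("folder3")
    t0 = time.time()
    with Pool(processes=4) as pool:  # the pool is created only in the parent process
        pool.map(download_pdf, links)
    print("Elapsed time: {}".format(time.time() - t0))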

--------

Today's takeaway: the final multiprocessing example does not run in an interactive IDE; it needs to be wrapped into a .py file and run directly from CMD or PowerShell.
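For example, assuming the guarded script above was saved as download_reports.py (a placeholder name) under D:/Python/File, it can be launched from PowerShell or CMD like this:

cd D:\Python\File
python download_reports.py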
