
Left hand R, right hand Python series: Using multiple processes for task processing

Posted on Nov. 27, 2023, 10:20 p.m. by Donna Holmes
Category: Back-end | Tags: python, unit testing, back-end, windows

Data-capture workloads with many tasks often run into performance bottlenecks; if you have multi-process tools to lean on, efficiency can usually be improved a great deal.

In both R and Python, you can speed up downloading binary files by using multiple processes. Let's start with the R side.

Read in the list of files to be downloaded:

library("dplyr") mydata-read.csv("D:/Python/File/toutiaoreport.csv",stringsAsFactors = FALSE,check.names = FALSE) Extract report URL and report name:  mydata1-mydata[1:10,] %% .[,c("title","path")] mydata1$path - paste0("https://mlab.toutiao.com/report/download/",mydata1$path) mydata1$title-sub("\\?","",mydata1$title)Copy the code

In R, there are three options for downloading files:

Option 1 -- Write an explicit loop:

# Build the download function:
myworks <- function(data){
    setwd("D:/R")
    dir.create("folder1", showWarnings = FALSE)
    for (i in 1:nrow(data)){
        download.file(data$path[i], paste0("./folder1/", data$title[i], ".pdf"), quiet = TRUE, mode = "wb")
        cat(sprintf("Downloading file [%d]", i), "\n")
    }
    cat("All download tasks are complete!", "\n")
}
system.time(myworks(mydata1))

There are 10 PDF files in total, averaging about 4.5 MB each (roughly 44.5 MB in all), and no delay is inserted between downloads; the whole job took about 100 seconds.

Option 2 -- Use the vectorized function from the plyr package:

### Use a vectorized function
library("plyr")
library("dplyr")
library("foreach")
mylist <- foreach(x = 1:nrow(mydata1), .combine = 'c') %do% list(mydata1[x, ])
setwd("D:/R")
dir.create("folder2", showWarnings = FALSE)
downloadCSV <- function(filelinks) {
    download.file(filelinks$path, destfile = paste0("D:/R/folder2/", filelinks$title, ".pdf"), quiet = TRUE, mode = "wb")
}
url <- "https://mlab.toutiao.com/report/download/"
system.time(
    l_ply(mylist, downloadCSV, .progress = "text")
)

A little disappointing: for the same 10 PDF documents the time barely moved, 99.89 seconds this run versus 99.91 seconds last time, a saving of just 0.02 seconds. Since the downloads are network-bound and l_ply still issues them one at a time, this is not too surprising. (I am on a campus network with especially poor speed; if you have fast broadband, you may want to rerun the test.)

Option 3 -- Use a multi-process package for concurrent processing:

library("parallel")
library("foreach")
library("iterators")

The multiprocessing framework used here is the foreach package (with the doParallel backend), but you can also try the parallel package that ships with R.

# Wrap the download in tryCatch so one failed link doesn't abort the whole job:
downloadCSV <- function(filelinks) {
    tryCatch({
        download.file(filelinks$path, destfile = paste0("D:/R/folder3/", filelinks$title, ".pdf"), quiet = TRUE, mode = "wb")
        "OK"
    }, error = function(e) {
        "Trouble"
    })
}

system.time({
  library("doParallel")
  setwd("D:/R")
  dir.create("folder3", showWarnings = FALSE)
  cl <- makeCluster(4)     # spin up a cluster of 4 workers
  registerDoParallel(cl)   # register it as the %dopar% backend
  foreach(d = mylist, .combine = c) %dopar% downloadCSV(d)
  stopCluster(cl)
})

This time the total was... 99.46 seconds. OK, I may have achieved only fake multiprocessing, but overall the time did fall, from 99.91 down to 98.72 seconds, still a saving of nearly 1.19 seconds. With a single slow network link, bandwidth rather than CPU is the bottleneck, so parallel workers can only help so much.

And the code looks a lot more elegant (OK, I can't keep making this up ~_~).

At present I don't know much about multiprocessing in R; if I come to a better understanding later, I will reorganize this part. If you are interested, you can also dig into the internal parallel execution mechanism of the foreach package yourself.

Python:

import time, os
from urllib import request
import threading
from multiprocessing import Pool
import pandas as pd

### Read in the report list
os.chdir("D:/Python/File")
mydata = pd.read_csv("toutiaoreport.csv", encoding='gbk')
mydata1 = mydata.loc[:9, ["title", "path"]]
mydata1['path'] = ["https://mlab.toutiao.com/report/download/" + path for path in mydata1['path']]
mydata1['title'] = [text.replace("?", "") for text in mydata1.title]

Option 1 -- download using an explicitly declared loop:

def getPDF(mydata1):
    os.makedirs("folder1")
    os.chdir("D:/Python/File/folder1")
    i = 0
    while i < len(mydata1):
        print("Downloading file [{}]!".format(i + 1))
        request.urlretrieve(mydata1['path'][i], mydata1['title'][i] + ".pdf")
        i += 1

if __name__ == '__main__':
    t0 = time.time()
    getPDF(mydata1)
    t1 = time.time()
    total = t1 - t0
    print("Total time: {}".format(total))

This is three seconds slower than the R loop, so let's try multithreading/multiprocessing to download these PDFs.

Option 2 -- Download using the threading package:

def executeThread(i):
    request.urlretrieve(mydata1['path'][i], "D:/Python/File/folder2/" + mydata1['title'][i] + ".pdf")

def main():
    try:
        os.makedirs("D:/Python/File/folder2")
    except:
        pass
    threads = []
    # Spawn one thread per file, then wait for them all to finish:
    for i in range(len(mydata1)):
        thread = threading.Thread(target=executeThread, args=(i,))
        threads.append(thread)
        thread.start()
    for i in threads:
        i.join()

if __name__ == '__main__':
    t0 = time.time()
    main()
    t1 = time.time()
    total = t1 - t0
    print("Total time: {}".format(total))

The total time is 98.15953660011292 seconds, which saves only about four seconds compared with the explicit loop.
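As an aside, the standard library's concurrent.futures module can express the same one-task-per-thread idea with less bookkeeping. The sketch below is not from the original post; it assumes the same mydata1 frame and folder2 directory used above, and caps the pool at 4 workers instead of starting one thread per file:

from concurrent.futures import ThreadPoolExecutor
from urllib import request

def fetch(row):
    # row is a (path, title) pair drawn from mydata1
    path, title = row
    request.urlretrieve(path, "D:/Python/File/folder2/" + title + ".pdf")
    return title

# The executor joins all workers when the with-block exits:
with ThreadPoolExecutor(max_workers=4) as pool:
    for title in pool.map(fetch, zip(mydata1['path'], mydata1['title'])):
        print("Finished:", title)

Bounding the worker count matters more as the file list grows, since hundreds of simultaneous connections would mostly fight each other for bandwidth.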

Option 3 -- Use the process pool provided by the multiprocessing package:

links = mydata1['path'].tolist()

def downloadPDF(i):
    request.urlretrieve(i, os.path.basename(i))

def shell():
    # Create the target folder if needed, then work inside it:
    if not os.path.exists("D:/Python/File/folder3"):
        os.makedirs("D:/Python/File/folder3")
        os.chdir("D:/Python/File/folder3")
    else:
        os.chdir("D:/Python/File/folder3")
    # Start timing:
    t0 = time.time()
    # Multi-process pool with 4 workers:
    pool = Pool(processes=4)
    pool.map(downloadPDF, links)
    pool.close()
    t1 = time.time()
    total = t1 - t0
    print("Elapsed time: {}".format(total))

if __name__ == "__main__":
    shell()

When I used the process pool from the multiprocessing package, my code simply hung: no output, no exit, and it could not even be interrupted. This is a Windows-specific pitfall: Windows has no fork(), so multiprocessing must spawn fresh interpreter processes, and that does not play well with interactive sessions.

-- -- -- -- -- -- -- --

Today's takeaway: the last multiprocessing example will not run in an interactive IDE. It has to be saved as a .py file, with the entry point guarded by if __name__ == '__main__':, and run directly from CMD or PowerShell.
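To make that concrete, here is a minimal self-contained sketch of the same pattern (the file name and URLs are placeholders, not from the original post). Saved as download_pool.py and launched with python download_pool.py, it works on Windows because the spawned workers re-import the module and skip the guarded block instead of re-running the pool:

# download_pool.py -- minimal sketch; the URL list is a placeholder
import os
import time
from multiprocessing import Pool
from urllib import request

LINKS = [
    "https://example.com/a.pdf",   # hypothetical URLs for illustration
    "https://example.com/b.pdf",
]

def download(url):
    # Save each file under its basename in the current directory
    request.urlretrieve(url, os.path.basename(url))
    return url

if __name__ == "__main__":
    # Everything below runs only in the parent process; on Windows the
    # spawned workers re-import this module and never enter this block.
    t0 = time.time()
    with Pool(processes=4) as pool:
        for done in pool.map(download, LINKS):
            print("Finished:", done)
    print("Elapsed: {:.2f}s".format(time.time() - t0))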
