Last time I covered the basics of exception catching and error handling in R and Python. Today I'll practice with a small case study so that your programs run smoothly from start to finish.

The target of this case study is the industry reports on the Toutiao Index site, all of which are PDF files. The PDF addresses have to be found by capturing the site's network requests. I then randomly pick 5 of them (PDF downloads depend on network speed and are very slow) and replace two of those with addresses that do not exist.

This type of error is very common. In practice there are many kinds of errors and you need to identify them carefully, but the basic idea is the same: when a bad address would block the program, use an exception-handling function to catch the error, then use next (continue in Python) to skip it.
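The catch-and-skip idea can be sketched without any network access. The snippet below is a minimal illustration, not the code used later in this post; `process_all` and `parse_number` are hypothetical names invented for the example.

```python
def process_all(items, task):
    """Run task on every item; skip the ones that raise and keep looping."""
    done, skipped = [], []
    for item in items:
        try:
            result = task(item)      # the step that may fail
        except Exception:
            skipped.append(item)     # remember what failed
            continue                 # jump straight to the next item
        done.append(result)          # reached only when task succeeded
    return done, skipped


def parse_number(text):
    # int() raises ValueError for non-numeric text
    return int(text)


done, skipped = process_all(["1", "2", "oops", "4"], parse_number)
print(done)     # [1, 2, 4]
print(skipped)  # ['oops']
```

The bad item is recorded and the loop still reaches the last element, which is the behaviour we want from the download loops below.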

Error handling in R language loops:

library("httr") 
library("dplyr") 
library("jsonlite")
url<-"https://index.toutiao.com/api/report"

Build the download function:

GETPDF <- function(url) {
  myresult <- data.frame()
  headers <- c(
    "Host" = "index.toutiao.com",
    "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
  )
  payload <- list("page" = 1, "size" = 12)
  for (i in 1:16) {
    # Request page i of the report list
    payload[["page"]] <- i
    web <- GET(url, add_headers(.headers = headers), query = payload, verbose())
    # Parse the JSON body and keep its 9th element (the report records)
    content <- web %>% content(as = "text", encoding = "UTF-8") %>% fromJSON() %>% `[[`(9)
    myresult <- rbind(myresult, content)
    Sys.sleep(runif(1))
    print(sprintf("Page %d loaded", i))
  }
  print("All pages loaded")
  return(myresult)
}

Run the data fetching function:

myresult<-GETPDF(url)

myresult <- arrange(myresult, id)

# Complete the PDF links in the data frame
myresult$path <- paste0("https://mlab.toutiao.com/report/download/", myresult$path)
# Randomly select five titles and addresses
Test <- myresult[sample(nrow(myresult), 5), c("title", "path")]

Test[5, 2]
# '//mlab.toutiao.com/report/download/report47.pdf'
# Set addresses 3 and 5 to out-of-range ones (the URL is well-formed, but its
# report index is out of range, so no valid data can be requested)
Test[3, 2] <- "https://mlab.toutiao.com/report/download/report570.pdf"
Test[5, 2] <- "https://mlab.toutiao.com/report/download/report470.pdf"

Requesting an out-of-range address in a browser returns a page like this!

Next, loop over this set of addresses, two of which are out of range, and download the PDFs.

Without error handling:

setwd("D:/R")
for (i in 1:nrow(Test)) {
  download.file(Test$path[i], paste0(Test$title[i], ".pdf"), mode = "wb")
  print(sprintf("File %d downloaded", i))
  Sys.sleep(runif(1))
}

Add error-catching code (option 1: use tryCatch):

for (i in 1:nrow(Test)) {
  tryCatch({
    download.file(Test$path[i], paste0(Test$title[i], ".pdf"), mode = "wb")
    print(sprintf("File %d downloaded", i))
  }, error = function(e) {
    # Report the failure; the loop then moves on to the next file
    print(sprintf("File %d failed to download", i))
  })
  Sys.sleep(runif(1))
}

Add error-catching code (option 2: use try):

for (i in 1:nrow(Test)) {
  # try() performs the download and returns a 'try-error' object if it fails
  Error <- try(download.file(Test$path[i], paste0(Test$title[i], ".pdf"), mode = "wb"))
  if (!'try-error' %in% class(Error)) {
    print(sprintf("File %d downloaded", i))
  } else {
    print(sprintf("File %d failed to download", i))
    next
  }
  Sys.sleep(runif(1))
}

Both tryCatch and try let a loop get past a failing iteration. tryCatch resembles the try/catch construct common in other languages: once the error is caught, the handler runs and the loop simply carries on with the next iteration. With try, you inspect the returned object yourself and, if it is a 'try-error', call next explicitly to skip to the next iteration.
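For readers more at home in Python, the difference between the two R styles can be mimicked roughly as follows. This is only an analogy: `attempt` is a hypothetical helper written for this sketch (Python has no built-in equivalent of R's try), and `reciprocal` stands in for the download step.

```python
def attempt(func, *args):
    """Rough analogue of R's try(): return the result, or the exception object on failure."""
    try:
        return func(*args)
    except Exception as exc:
        return exc


def reciprocal(x):
    return 1 / x  # raises ZeroDivisionError when x == 0


# Style 1 (like tryCatch): the handler sits right next to the risky call.
handled = []
for x in [1, 0, 4]:
    try:
        handled.append(reciprocal(x))
    except ZeroDivisionError:
        handled.append(None)

# Style 2 (like try): run first, then inspect what came back and skip on failure.
inspected = []
for x in [1, 0, 4]:
    outcome = attempt(reciprocal, x)
    if isinstance(outcome, Exception):
        inspected.append(None)
        continue
    inspected.append(outcome)

print(handled)    # [1.0, None, 0.25]
print(inspected)  # [1.0, None, 0.25]
```

Both loops finish and produce the same result; they differ only in where the decision to skip is made.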

If you do no exception handling at all, however, the first error is reported in the console and the whole loop stops, which is not what we want.

Python:

import json
import random
import requests
import pandas as pd
import os
import time

Fetch the PDF download addresses:

headers = {
    "Host": "index.toutiao.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
}
payload = {"page": 1, "size": 12}
url = "https://index.toutiao.com/api/report"

def GETPDF(url, headers=headers, payload=payload):
    fullinfo = []
    for i in range(1, 17):
        # Request page i of the report list
        payload['page'] = i
        r = requests.get(url, params=payload, headers=headers)
        content = r.json()
        fullinfo = fullinfo + content['data']
        print("Page {} loaded".format(i))
    print("All pages loaded!")
    return fullinfo

mydata = GETPDF(url)
mydata = pd.DataFrame(mydata)


# Select 5 records randomly and renumber them 0..4
Test = mydata.loc[random.sample(range(len(mydata)), 5), ['title', 'path']].reset_index(drop=True)
# Build the complete download links
Test['path'] = ['https://mlab.toutiao.com/report/download/' + text for text in Test['path']]
# Replace the 3rd and 5th addresses (row labels 2 and 4) with an unreachable one
Test.loc[[2, 4], 'path'] = 'https://mla.toutiao.com/download/500.pdf'
os.chdir("D:/Python/File")

Without any error handling:

for i in range(len(Test)):
    file = requests.get(Test['path'][i]).content
    with open('{}.pdf'.format(Test['title'][i]), 'wb') as f:
        f.write(file)
    print("File {} downloaded!".format(i + 1))
    time.sleep(1)

With error handling in place:

for i in range(len(Test)):
    try:
        file = requests.get(Test['path'][i]).content
        with open('{}.pdf'.format(Test['title'][i]), 'wb') as f:
            f.write(file)
        print("File {} downloaded!".format(i + 1))
    except requests.exceptions.ConnectionError as e:
        # Report the failure and move on to the next file
        print("File {} failed to download:".format(i + 1), e)
        continue
    time.sleep(0.5)

# Save this data for future use!
mydata.to_csv("D:/Python/File/toutiaoreport.csv")

As you can see, the mechanisms for catching and working around errors in R and Python are both easy to understand: put the error-catching construct in place and specify what to do when an error occurs. This typically comes up when downloading binary files or extracting data in a loop. With the next statement in R or the continue statement in Python, the loop skips the failed tasks and keeps running until it finishes and exits on its own.
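As mentioned at the start, real errors come in more than one type. An unreachable host and a server that answers with an error status such as 404 are different failures, and a handler written for one will not necessarily catch the other. The sketch below shows one way to catch both in the same loop. It is an illustration under assumptions: it uses the standard library's urllib rather than requests to stay dependency-free, and `fetch_bytes` and `download_all` are hypothetical names, not part of the code above.

```python
import urllib.error
import urllib.request


def fetch_bytes(url, timeout=10):
    """Return the body at url; urlopen raises HTTPError for 4xx/5xx replies."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(urls, fetch=fetch_bytes):
    """Try every URL; return the bodies that arrived and a list of (url, reason) failures."""
    bodies, failures = {}, []
    for url in urls:
        try:
            bodies[url] = fetch(url)
        except urllib.error.URLError as exc:
            # URLError covers unreachable hosts; HTTPError (its subclass) covers
            # error statuses. Either way, record it and let the loop move on.
            failures.append((url, str(exc)))
    return bodies, failures
```

Passing the fetch function as a parameter is a design choice made for this sketch: it lets the loop be exercised with a stand-in function instead of real network requests, and the list of failures can be retried or logged afterwards.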

Past cases are on my GitHub: github.com/ljtyduyu/Da…