Recently, while practicing web scraping with R and Python, I ran into some annoying CAPTCHA (verification code) problems, took quite a few detours, and finally solved them.

Before getting into it, one remark: Python has a far more complete scraping ecosystem and a wealth of shared scraping tutorials, yet most of the same things can be done in R with RCurl + httr. Unfortunately, far fewer scraping enthusiasts use R than Python, and advanced scraping tutorials for R are so scarce that you have to dig through Stack Overflow bit by bit.

I hope this case gives you a few useful ideas.

R

library("RCurl")
library("XML")
library("dplyr")
library("ggplot2")
library("ggimage")

When a crawler logs in to an educational administration system, the biggest difficulty is the verification code. Normally, the first visit to the login page in a browser triggers a request for the verification code image. You then enter the verification code, account, and password, and clicking the login button sends a POST request that submits the form. Both requests happen in the same browser session, so you don't have to worry about inconsistent cookies.

With a crawler, however, you have to use a cookie manager to store the cookie from the first request, so that the verification code request and the login request are bound to the same session. All subsequent requests then reuse that cookie automatically, and you can request and traverse all the sub-pages.
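To make the cookie point concrete, here is a minimal Python sketch that runs entirely on the local machine. The tiny server, its `/code` and `/login` paths, and the cookie value `JSESSIONID=abc123` are all made up for illustration and are not the real system's behaviour. It only shows that an opener with a cookie jar sends back the cookie it received earlier, while an opener without one does not.

```python
import http.cookiejar
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener


class FakeLoginHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the site: /code issues a cookie, other paths echo it."""

    def do_GET(self):
        self.send_response(200)
        if self.path == "/code":
            # First visit: hand out a session cookie, as a captcha endpoint might.
            body = b"captcha-image-bytes"
            self.send_header("Set-Cookie", "JSESSIONID=abc123; Path=/")
        else:
            # Later visits: report which cookie (if any) came with the request.
            body = self.headers.get("Cookie", "no cookie").encode()
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), FakeLoginHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port
no_proxy = ProxyHandler({})  # talk to the local server directly

# One opener with a cookie jar plays the role of the browser session.
jar = http.cookiejar.CookieJar()
session = build_opener(no_proxy, HTTPCookieProcessor(jar))
session.open(base + "/code").read()
with_jar = session.open(base + "/login").read().decode()

# An opener without the jar starts a new, unrelated session.
without_jar = build_opener(no_proxy).open(base + "/login").read().decode()

server.shutdown()
print(with_jar)     # JSESSIONID=abc123
print(without_jar)  # no cookie
```

The R and Python code in this article relies on the same idea: one curl handle, httr handle, opener, or requests session is reused for every request.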

The following are the login and verification code request addresses of the Academic Affairs Office:

login <- "http://202.199.165.193/loginAction.do"
codein <- "http://202.199.165.193/validateCodeAction.do"
# construct the request header:
header <- c(
  "Accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
  "Connection" = "keep-alive",
  "User-Agent" = "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
  "Content-Type" = "application/x-www-form-urlencoded"
)
# construct the form body: account number, password, and verification code (blank for now)
payload <- c(zjh = "*******", mm = "*****", v_yzm = "")

Using the cookie manager:

d <- debugGatherer()
chandle <- getCurlHandle(debugfunction = d$update, followlocation = TRUE, cookiefile = "", verbose = TRUE)
# visit the login page once so that the handle stores the cookie:
postForm(login, httpheader = header, .params = payload, .encoding = "GBK", curl = chandle, style = "post")
# request the verification code with the same handle and save it locally:
getBinaryURL(codein, httpheader = header, curl = chandle) %>% writeBin("vcode.jpg")
# display the verification code image:
ggplot() + geom_image(aes(x = 1, y = 1, image = "vcode.jpg"), size = .1) + theme_void()

# type in the verification code shown in the image:
payload[['v_yzm']] <- readline("Please enter verification code: ")
# log in again, this time with the verification code:
postForm(login, httpheader = header, .params = payload, .encoding = "GBK", curl = chandle, style = "post")
# once logged in, request the grades sub-page:
url <- URLencode("http://202.199.165.193/gradeLnAllAction.do?type=ln&oper=qbinfo", reserved = FALSE)
mysocre <- postForm(url, httpheader = header, .params = payload, .encoding = "GBK", curl = chandle, style = "post")
myresult <- mysocre %>% iconv("GBK", "UTF-8") %>% htmlParse(encoding = "UTF-8")
# extract the bold titles above the tables:
scorename <- myresult %>% getNodeSet("//table//tr//td[@valign='middle']/b") %>% lapply(xmlValue, trim = T) %>% unlist()
# extract the column names:
namelabel <- myresult %>% getNodeSet("//table[@class='titleTop2']//th") %>% lapply(xmlValue, trim = T) %>% unlist() %>% unique()
# extract the grade tables:
scoreclass <- myresult %>% getNodeSet("//table[@class='titleTop2']")
classall <- data.frame()
for (i in 1:8) {
  # keep rows 3 to (n - 1) of each table and stack them
  classall <- rbind(classall, scoreclass %>% `[[`(i) %>% readHTMLTable() %>% .[3:(nrow(.) - 1), ])
}
# name the columns:
names(classall) <- namelabel
# preview:
head(classall)

Next, the same workflow with the httr package:

library("httr") 
library("dplyr") 
library("jsonlite")
library("curl")
library("magrittr")
library("plyr")
library("rlist")
library("jpeg")
library("ggimage")
library("rvest")

Log in to the educational administration system with POST:

# header and login information
header <- c(
  "Accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
  "Connection" = "keep-alive",
  "User-Agent" = "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
  "Content-Type" = "application/x-www-form-urlencoded"
)
payload <- c(zjh = "*******", mm = "********", v_yzm = "")
# login address and verification code address
login <- "http://202.199.165.193/loginAction.do"
codeurl <- "http://202.199.165.193/validateCodeAction.do"
# create a handle so that all requests share the same cookies
h <- handle(login)
# first request, to obtain the cookie
POST(login, add_headers(.headers = header), body = payload, encode = "form", verbose(), handle = h)
# request the verification code with the same handle and save it locally
image <- POST(codeurl, add_headers(.headers = header), body = payload, encode = "form", verbose(), handle = h) %>% content()
image %>% writeJPEG("vcode.jpg")
# display the verification code image
ggplot() + geom_image(aes(x = 1, y = 1, image = "vcode.jpg"), size = .1) + theme_void()
# type in the verification code shown in the image
payload[['v_yzm']] <- readline("Please enter verification code: ")
# log in again, this time with the verification code
r <- POST(login, add_headers(.headers = header), body = payload, encode = "form", verbose(), handle = h)
# once logged in, the information you need can be requested from a sub-page
url <- URLencode("http://202.199.165.193/gradeLnAllAction.do?type=ln&oper=qbinfo", reserved = FALSE)
myresult <- POST(url, add_headers(.headers = header), body = payload, encode = "form", verbose(), handle = h)
# parse the result; you can use either the rvest package or the XML package
mytable <- myresult %>% content(as = "parsed", type = "text/html", encoding = "GBK") %>% html_nodes(xpath = "//table[@class='titleTop2']") %>% html_table(fill = TRUE)

Python:

import http.cookiejar
from urllib.request import build_opener, HTTPCookieProcessor, Request  
from urllib.parse import urlencode  
from PIL import Image
import matplotlib.pyplot as plt
import re
import sys
import importlib
importlib.reload(sys)

Enable cookie management:

cookie = http.cookiejar.CookieJar()
opener = build_opener(HTTPCookieProcessor(cookie))
# login form: account number, password, and verification code (blank for now)
values = {'zjh': '*******', 'mm': '*****', 'v_yzm': ''}
postdata = urlencode(values).encode(encoding='UTF-8')
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Referer': 'http://202.199.165.193/loginAction.do'
}
# first request to the login page, to get the cookies:
login = "http://202.199.165.193/loginAction.do"
request = Request(login, postdata, headers=header)
response = opener.open(request)
print(response.getcode())  # 200
for item in cookie:
    print('Name = ' + item.name)
    print('Value = ' + item.value)

Get the verification code and save it:

yzm = opener.open("http://202.199.165.193/validateCodeAction.do")
yzm_data = yzm.read()
yzm_pic = open('yzm.jpg', 'wb')
yzm_pic.write(yzm_data)
yzm_pic.close()

Display the image and read the verification code from the user:

img = Image.open('yzm.jpg')
plt.figure("code", figsize=(1.2, 2.4))
plt.axis('off')
plt.imshow(img)
plt.show()
values['v_yzm'] = input('Please enter verification code: ')
login = "http://202.199.165.193/loginAction.do"
postdata = urlencode(values).encode(encoding='UTF-8')
# simulate the login request, this time with the verification code
request = Request(login, postdata, header)
response = opener.open(request)
print(response.read().decode('gbk'))

Request required content information:

url = "http://202.199.165.193/gradeLnAllAction.do"
# lnxndm is the term name: 2014-2015 academic year, second semester (two semesters)
values = {"type": "ln", "oper": "qbinfo", "lnxndm": "2014-2015学年第二学期(两学期)"}
data = urlencode(values)
geturl = url + "?" + data
response = opener.open(geturl)
print(response.read().decode('gbk'))
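The Python code here only prints the raw HTML, whereas the R version pulls out the tables with class `titleTop2`. As a rough Python counterpart, here is a minimal sketch using only the standard library. The `ClassTableParser` class and the sample HTML are invented for illustration; the real page layout may differ, and the sketch assumes no tables are nested inside the matching ones.

```python
from html.parser import HTMLParser


class ClassTableParser(HTMLParser):
    """Collect the cell text of every <table> whose class equals `wanted`."""

    def __init__(self, wanted="titleTop2"):
        super().__init__()
        self.wanted = wanted
        self.tables = []       # one list of rows per matching table
        self._inside = False   # True while inside a matching table
        self._row = None       # cells of the row being read
        self._cell = None      # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("class") == self.wanted:
            self._inside = True
            self.tables.append([])
        elif self._inside and tag == "tr":
            self._row = []
        elif self._row is not None and tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table":
            self._inside = False


# Made-up sample standing in for the decoded response text.
sample_html = """
<table class="other"><tr><td>ignored</td></tr></table>
<table class="titleTop2">
  <tr><th>Course</th><th>Credits</th><th>Grade</th></tr>
  <tr><td> Calculus </td><td>4</td><td>88</td></tr>
</table>
"""

parser = ClassTableParser()
parser.feed(sample_html)
print(parser.tables)
# [[['Course', 'Credits', 'Grade'], ['Calculus', '4', '88']]]
```

With the real page, you would feed the parser the decoded text, for example `parser.feed(response.read().decode('gbk'))`.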

Requests version:

import requests
from PIL import Image
import matplotlib.pyplot as plt
import sys
import importlib
importlib.reload(sys)

Login Information:

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
values = {'zjh': '*****', 'mm': '***', 'v_yzm': ''}
login = "http://202.199.165.193/loginAction.do"
codein = "http://202.199.165.193/validateCodeAction.do"

Simulate sending a login request:

s = requests.session()
log = s.post(login,data=values,headers=header)
print (log.text)
imgresponse = s.post(codein,headers=header).content
print (imgresponse)

Obtain the verification code:

yzm_pic = open('yzm.jpg', 'wb')
yzm_pic.write(imgresponse)
yzm_pic.close()
img = Image.open('yzm.jpg')
plt.figure("code", figsize=(1.2, 2.4))
plt.axis('off')
plt.imshow(img)
plt.show()

Complete the login information and post to the login page again:

values['v_yzm'] = input('Please enter verification code: ')
s.post(login, data=values, headers=header).text

Request other sub-pages to obtain the required information:

url = "http://202.199.165.193/gradeLnAllAction.do?type=ln&oper=qbinfo&lnxndm=2011-2012%D1%A7%C4%EA%B5%DA%D2%BB%D1%A7%C6%DA(%C1%BD%D1%A7%C6%DA)"
response = s.post(url, data=values, headers=header)
print(response.text)
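The percent-escapes in the `lnxndm` parameter of that URL are the GBK bytes of the Chinese term name. As a small sketch using only the standard library, the same query string can be built from readable text instead of pasted escapes. The `params` dictionary is my own reconstruction, obtained by decoding the escapes in that URL.

```python
from urllib.parse import unquote, urlencode

params = {
    "type": "ln",
    "oper": "qbinfo",
    # term name: 2011-2012 academic year, first semester (two semesters)
    "lnxndm": "2011-2012学年第一学期(两学期)",
}

# encoding="gbk" escapes the Chinese characters as GBK bytes;
# safe="()" leaves the parentheses unescaped, as in the hand-written URL.
query = urlencode(params, safe="()", encoding="gbk")
print(query)
# type=ln&oper=qbinfo&lnxndm=2011-2012%D1%A7%C4%EA%B5%DA%D2%BB%D1%A7%C6%DA(%C1%BD%D1%A7%C6%DA)

# Decoding the escapes as GBK recovers the readable text.
decoded = unquote("%D1%A7%C4%EA%B5%DA%D2%BB%D1%A7%C6%DA", encoding="gbk")
print(decoded)
# 学年第一学期
```

Note that `urlencode` escapes text as UTF-8 unless an `encoding` is given, which produces different bytes from the GBK escapes shown here.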

References:

https://cran.r-project.org/web/packages/RCurl/RCurl.pdf
http://blog.csdn.net/sinat_26917383/article/details/51123164
https://cran.r-project.org/web/packages/httr/httr.pdf
https://docs.python.org/2/library/urllib.html

Past examples are available on my GitHub: https://github.com/ljtyduyu/DataWarehouse/tree/master/File