In particular, the R crawler toolchain (RCurl + XML, httr + rvest [xml2 + selectr]) has already built up a rich body of tutorials.

However, all of these target static pages (packet capture and API access aside), and many dynamic web pages offer no API at all, which leaves the job to Selenium, a browser-driver technology.

Fortunately, R already has a Selenium interface package, RSelenium, which makes it possible to crawl dynamic web pages. Earlier this year I wrote a crawler for Shixiseng, an internship recruitment site, using Rwebdriver, another Selenium interface package in R.

Internship recruitment website crawler and data visualization

At the time my technique was immature and my approach naive: I used page navigation to traverse all 500 pages of content. I did eventually crawl all of the data, but it took nearly 40 minutes and was very inefficient. (If you are interested you can refer to the article above, but Shixiseng's official site has recently been heavily redesigned, so it is certainly harder now than it was then, and that code may no longer work.)

Recently I found time to study the RSelenium package. I would like to thank teacher Chen Weir for his talk "Creating a Flexible and Powerful Web Crawler with RSelenium" at the Shanghai R Language Conference. I could not make it to the venue, but I had the pleasure of watching the recorded version, which cleared up several details that had been confusing me; many thanks.

Creating a Flexible and Powerful Web Crawler Using RSelenium: www.xueqing.tv/course/88 | www.youtube.com/watch?v=ic6…

There are several packages in R that can handle dynamic web pages (additions welcome):

  • RSelenium (recommended)
  • Rwebdriver (not very mature)
  • seleniumpipes
  • Rdom (a high-level wrapper, not flexible enough)
  • Rcrawler (multi-process support)
  • webshot (specifically for screenshots of dynamic pages)

The rest of this post shares today's example. The target is Lagou.com (don't ask why; simply because I had never crawled Lagou before)!

Before introducing the case, ensure that the system meets the following conditions:

The Selenium server is available locally and has been added to the system path;

The PhantomJS headless browser is installed locally and added to the system path;

The RSelenium package is installed.
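
If you are not sure whether these prerequisites are in place, a quick sanity check from R can save some debugging time. This is just a minimal sketch, assuming the packages are installed from CRAN:

# Install the packages used in this post (skip if already installed)
install.packages(c("RSelenium", "magrittr", "xml2"))

# Confirm that java and phantomjs can be found on the system path
Sys.which("java")
Sys.which("phantomjs")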

Because automated clicking is involved, I spent an afternoon fiddling with Chrome and the click on the link kept failing. The reason turned out to be that the Lagou page is very long and the next-page button lies outside the default window viewport; controlling the scrollbar with a JS script also failed, for reasons unknown. I have seen someone succeed with the Firefox browser, but I have not tried it myself. Instead I use the PhantomJS headless browser, which does not care whether an element is hidden outside the window.

R language version:

### NOTE: these two lines are run in CMD or PowerShell, not in R!
### Keep that window open until the Selenium service is no longer needed.
# cd D:\
# java -jar selenium-server-standalone-3.3.1.jar

## The Selenium server can also be launched directly from R (no pop-up window):
system('java -jar "D:/selenium-server-standalone-2.53.1.jar"', wait = FALSE, invisible = FALSE)

# Load packages
library("RSelenium")
library("magrittr")
library("xml2")
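
Before connecting, you can optionally check that the Selenium server is actually listening. The sketch below assumes the default port 4444 and uses httr, which is not otherwise required for this example:

# Quick sanity check against the default Selenium endpoint (assumption: port 4444)
resp <- httr::GET("http://localhost:4444/wd/hub/status")
httr::status_code(resp)   # 200 means the server is up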

Start the service

# Disguise the PhantomJS browser UserAgent
eCap <- list(phantomjs.page.settings.userAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0")
### Why disguise the UA even though we use PhantomJS?
### PhantomJS is built for testing the web side of your own projects; when you
### scrape someone else's site, its default UA announces itself as PhantomJS,
### which is a blatant provocation!
remDr <- remoteDriver(browserName = "phantomjs", extraCapabilities = eCap)
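
Before wiring up the full crawler, it can be worth confirming that the PhantomJS session works end to end. A minimal sketch (the target URL is simply the listing page used later in this post):

# Open a session, load the listing page, and check that a title comes back
remDr$open()
remDr$navigate("https://www.lagou.com/zhaopin")
remDr$getTitle()   # should return the rendered page title
remDr$close()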

Build an automated fetching function:

myresult <- function(remDr, url){
  # Empty data frame to collect the results
  myresult <- data.frame()
  # Open a browser session and navigate to the target page
  remDr$open()
  remDr$navigate(url)
  # Initialize a page counter
  i <- 0
  while(TRUE){
    # The counter accumulates
    i <- i + 1
    # Grab the rendered DOM of the current page
    pagecontent <- remDr$getPageSource()[[1]]
    # Build a temporary root node (to save redundant code)
    con_list_item <- pagecontent %>% read_html() %>% xml_find_all('//ul[@class="item_con_list"]/li')
    # Position name
    position.name <- con_list_item %>% xml_attr("data-positionname")
    # Company
    position.company <- con_list_item %>% xml_attr("data-company")
    # Salary
    position.salary <- con_list_item %>% xml_attr("data-salary")
    # Link to the job detail page
    position.link <- pagecontent %>% read_html() %>% xml_find_all('//div[@class="p_top"]/a') %>% xml_attr("href")
    # Experience requirement
    position.exprience <- pagecontent %>% read_html() %>% xml_find_all('//div[@class="p_bot"]/div[@class="li_b_l"]') %>% xml_text(trim = TRUE)
    # Industry
    position.industry <- pagecontent %>% read_html() %>% xml_find_all('//div[@class="industry"]') %>% xml_text(trim = TRUE) %>% gsub("[[:space:]\\u00a0]+|\\n", "", .)
    # Bonus / benefits
    position.bonus <- pagecontent %>% read_html() %>% xml_find_all('//div[@class="list_item_bot"]/div[@class="li_b_l"]') %>% xml_text(trim = TRUE) %>% gsub("[[:space:]\\u00a0]+|\\n", "/", .)
    # Work environment
    position.environment <- pagecontent %>% read_html() %>% xml_find_all('//div[@class="li_b_r"]') %>% xml_text(trim = TRUE)
    # Collect the current page into a data frame
    mydata <- data.frame(position.name, position.company, position.salary, position.link, position.exprience, position.industry, position.bonus, position.environment, stringsAsFactors = FALSE)
    # Append it to the data collected so far
    myresult <- rbind(myresult, mydata)
    # Sleep 0.5 ~ 1.5 seconds between pages
    Sys.sleep(runif(1, 0.5, 1.5))
    # If the pager has not yet reached page 30, click the next-page button
    if ((pagecontent %>% read_html() %>% xml_find_all('//div[@class="page-number"]/span[1]') %>% xml_text()) != "30"){
      remDr$findElement('xpath', '//div[@class="pager_container"]/a[last()]')$clickElement()
      # Print the current progress
      cat(sprintf("Page %d is done!", i), sep = "\n")
    } else {
      # All pages have been crawled, exit the loop
      break
    }
  }
  # Close the browser session
  remDr$close()
  cat("All work is done!", sep = "\n")
  # Return the collected data
  return(myresult)
}
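
One note on the next-page click: Selenium clicks can fail intermittently (slow rendering, elements not yet attached). If that happens, a more defensive variant of the click step might wrap it in tryCatch. This is only an illustrative sketch, not part of the function above:

# Defensive next-page click (illustrative sketch)
next_btn <- tryCatch(
  remDr$findElement('xpath', '//div[@class="pager_container"]/a[last()]'),
  error = function(e) NULL
)
if (!is.null(next_btn)) {
  next_btn$clickElement()
} else {
  cat("Next-page button not found, stopping here.", sep = "\n")
}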

Run the crawl function


url <- "https://www.lagou.com/zhaopin"
myresult <- myresult(remDr, url)
# Preview the data
DT::datatable(myresult)
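
If you want to keep the results beyond the current session, you can write them to disk; the file name below is just an example:

# Persist the crawled positions for later analysis (file name is an example)
write.csv(myresult, "lagou_positions.csv", row.names = FALSE)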

Python version:

import os,random,time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from lxml import etree

Start the service

dcap = dict(DesiredCapabilities.PHANTOMJS)
# Disguise the UA here:
dcap["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0")
# Start the PhantomJS headless browser
driver = webdriver.PhantomJS(desired_capabilities=dcap)

Build an automated fetching function:

def getlaogou(driver, url):
    # Initialize an empty dict of lists to collect the results
    myresult = {
        "position_name": [],
        "position_company": [],
        "position_salary": [],
        "position_link": [],
        "position_exprience": [],
        "position_industry": [],
        "position_environment": []
    }
    driver.get(url)
    # Counter initialization
    i = 0
    while True:
        # The counter accumulates
        i += 1
        # Grab the rendered DOM of the current page and parse it
        pagecontent = driver.page_source
        result = etree.HTML(pagecontent)
        # Use each list's extend method to collect the data
        myresult["position_name"].extend(result.xpath('//ul[@class="item_con_list"]/li/@data-positionname'))
        myresult["position_company"].extend(result.xpath('//ul[@class="item_con_list"]/li/@data-company'))
        myresult["position_salary"].extend(result.xpath('//ul[@class="item_con_list"]/li/@data-salary'))
        myresult["position_link"].extend(result.xpath('//div[@class="p_top"]/a/@href'))
        myresult["position_exprience"].extend([text.xpath('string(.)').strip() for text in result.xpath('//div[@class="p_bot"]/div[@class="li_b_l"]')])
        myresult["position_industry"].extend([text.strip() for text in result.xpath('//div[@class="industry"]/text()')])
        myresult["position_environment"].extend(result.xpath('//div[@class="li_b_r"]/text()'))
        # Sleep between pages
        time.sleep(random.choice(range(3)))
        # If the pager has not yet reached page 30, click the next-page button
        if result.xpath('//div[@class="page-number"]/span[1]/text()')[0] != '30':
            driver.find_element_by_xpath('//div[@class="pager_container"]/a[last()]').click()
            # Print the current progress at the same time
            print("Page [{}] is done!".format(i))
        else:
            # All pages have been crawled, exit the loop
            print("everything is OK")
            break
    # Exit and close the Selenium session
    driver.quit()
    # Return the data
    return pd.DataFrame(myresult)

Run the crawl function


url = "https://www.lagou.com/zhaopin"
mydata = getlaogou(driver,url) 

For the online course, please see the link below:

Hellobi Live | Applications of R Language Data Visualization in Business Scenarios

For past case studies, please visit my GitHub: github.com/ljtyduyu/Da…