[Python 3 Web Crawler Development in Practice] 7. Crawling Dynamically Rendered Pages - 2. Using Splash

Splash is a JavaScript rendering service: a lightweight browser with an HTTP API that also integrates with Python's Twisted and Qt libraries. With it, we can also crawl dynamically rendered pages.

1. Features

With Splash, we can:

Render multiple web pages asynchronously;
Get the source code or a screenshot of a rendered page;
Speed up page rendering by turning off image loading or applying Adblock rules;
Execute custom JavaScript in the page;
Control the page rendering process through Lua scripts;
Get a detailed record of the rendering process in HAR (HTTP Archive) format.

Next, let’s see how it’s used.

2. Preparation

Before you start, make sure Splash is properly installed and its service is running. If not, refer to Chapter 1.

3. A First Example

First, let's test the rendering process through the web page that Splash itself provides. Assuming the Splash service runs on port 8050 of the local machine, open http://localhost:8050/ to see its web page, as shown in Figure 7-6.

Figure 7-6 Web page

The right side of Figure 7-6 shows a rendering example. At the top there is an input box whose default value is google.com. To test with Baidu instead, change the content to www.baidu.com and click the Render me! button to start rendering. The result is shown in Figure 7-7.

Figure 7-7 Running result

As you can see, the page returns with rendered screenshots, HAR loading statistics, and the source code for the page.

According to HAR’s results, Splash executed the entire webpage rendering process, including the loading of CSS and JavaScript, and the rendered page was completely consistent with the result obtained in the browser.

So what controls this process? If you go back to the home page, you can see that there is actually a script that reads as follows:

function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end

This script is written in Lua. Even without knowing the syntax of the language, a literal reading of the script tells us that it first calls the go() method to load the page, then calls the wait() method to wait for a certain amount of time, and finally returns the source code, a screenshot, and the HAR information of the page.

So far, we have learned that Splash controls the page loading process through Lua scripts. The loading process fully simulates a browser, and the results can be returned in various formats, such as the page source code and screenshots.

Next, let's take a look at how Lua scripts are written and how the related APIs are used.

4. Splash Lua Scripts

Splash can perform a series of rendering operations using Lua scripts, which allows us to use Splash to simulate operations similar to Chrome and PhantomJS.

First, let’s take a look at the entry and execution of the Splash Lua script.

Entry and return values

Let’s start with a basic example:

function main(splash, args)
  splash:go("http://www.baidu.com")
  splash:wait(0.5)
  local title = splash:evaljs("document.title")
  return {title=title}
end

Paste this code into the code editing area of the http://localhost:8050/ page we just opened and click the Render me! button to test it.

It returns the title of the page, as shown in Figure 7-8. Here we pass JavaScript code to the evaljs() method: document.title evaluates to the page title, which we assign to the title variable and then return.

Figure 7-8 Running result

Notice that the method we defined here is named main(). This name is fixed: Splash calls this method by default.

The return value of this method can be either a dictionary (a Lua table) or a string, and it is finally converted into the Splash HTTP response. For example:

function main(splash)
    return {hello="world!"}
end

returns content in dictionary form, while:

function main(splash)
    return 'hello'
end

returns content in string form.

Asynchronous processing

Splash supports asynchronous processing, but the callback method is not explicitly specified here. The jump of the callback is done inside Splash. The following is an example:

function main(splash, args)
  local example_urls = {"www.baidu.com", "www.taobao.com", "www.zhihu.com"}
  local urls = args.urls or example_urls
  local results = {}
  for index, url in ipairs(urls) do
    local ok, reason = splash:go("http://" .. url)
    if ok then
      splash:wait(2)
      results[url] = splash:png()
    end
  end
  return results
end

The result shows the screenshots of the three sites, as shown in Figure 7-9.

Figure 7-9 Running result

The wait() method called inside the script is similar to sleep() in Python and takes the number of seconds to wait. When Splash executes this method, it switches to other tasks and returns to continue the script after the specified time.

Note that, unlike Python, Lua concatenates strings with the .. operator rather than +. If necessary, a quick overview of Lua syntax can be found at www.runoob.com/lua/lua-bas… .

The script also checks for loading errors. The go() method returns the status of the page load. If the page returns a 4xx or 5xx status code, the ok variable is empty and no screenshot is returned for that page.

5. Properties of the Splash Object

Notice that the first argument to the main() method in the previous example is splash. This object is very important. It is similar to the WebDriver object in Selenium, and we can call some of its properties and methods to control the loading process. Now, let’s look at its properties.

args

This property holds the parameters configured at load time, such as the URL. For a GET request it also contains the GET request parameters; for a POST request it contains the data submitted with the form. Splash also passes the same value as the second argument of main(), for example:

function main(splash, args)
    local url = args.url
end

Here the second parameter args is equivalent to the splash.args property, so the code above is equivalent to:

function main(splash)
    local url = splash.args.url
end
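To make the origin of args more concrete, here is a minimal Python sketch of how parameters could be sent to the execute interface (introduced at the end of this section) so that they show up as args in the Lua script. It relies on my understanding that Splash accepts a JSON body containing lua_source plus arbitrary extra fields; the helper name build_execute_payload and the wait field are hypothetical, chosen only for illustration.

```python
import json

# Lua script that reads its inputs from args instead of hard-coding them.
LUA_SCRIPT = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  return {title = splash:evaljs("document.title")}
end
"""


def build_execute_payload(lua_source, **script_args):
    """Return a JSON body for a POST to the execute interface.

    Hypothetical helper: every extra keyword is assumed to become
    a field of `args` (and `splash.args`) inside the Lua script.
    """
    payload = {"lua_source": lua_source}
    payload.update(script_args)
    return json.dumps(payload)


body = build_execute_payload(LUA_SCRIPT, url="https://www.baidu.com", wait=0.5)
print(body)
```

The body would then be POSTed to http://localhost:8050/execute with a Content-Type of application/json, for example with requests.post(..., data=body, headers={'Content-Type': 'application/json'}).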

js_enabled

This property is Splash’s JavaScript execution switch and can be set to true or false to control whether or not the JavaScript code executes. The default is true. For example, JavaScript code is prohibited here:

function main(splash, args)
  splash:go("https://www.baidu.com")
  splash.js_enabled = false
  local title = splash:evaljs("document.title")
  return {title=title}
end

Calling the evaljs() method to execute JavaScript code then throws an exception:

{
    "error": 400,
    "type": "ScriptError",
    "info": {
        "type": "JS_ERROR",
        "js_error_message": null,
        "source": "[string \"function main(splash, args)\r...\"]",
        "message": "[string \"function main(splash, args)\r...\"]:4: unknown JS error: None",
        "line_number": 4,
        "error": "unknown JS error: None",
        "splash_method": "evaljs"
    },
    "description": "Error happened while executing Lua script"
}

Generally speaking, this property is not set. It is enabled by default.

resource_timeout

This property sets the load timeout in seconds. If set to 0 or nil (similar to None in Python), the timeout is not detected. The following is an example:

function main(splash)
  splash.resource_timeout = 0.1
  assert(splash:go('https://www.taobao.com'))
  return splash:png()
end

Here the timeout is set to 0.1 seconds. If no response is received within 0.1 seconds, an exception is thrown with the following error:

{
    "error": 400,
    "type": "ScriptError",
    "info": {
        "error": "network5",
        "type": "LUA_ERROR",
        "line_number": 3,
        "source": "[string \"function main(splash)\r...\"]",
        "message": "Lua error: [string \"function main(splash)\r...\"]:3: network5"
    },
    "description": "Error happened while executing Lua script"
}

This property is useful for pages that load slowly: if there is no response within the given time, an exception is thrown and the page can simply be skipped.
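When such a script is run from a program rather than from the web page, the error arrives as a JSON document like the ones shown above. The following Python sketch condenses it into a single line for logging. The function name describe_splash_error is hypothetical, and the sample dictionary only repeats the fields of the resource_timeout error shown above that the function actually reads.

```python
def describe_splash_error(payload):
    """Summarize a decoded Splash error document in one line.

    Missing fields are tolerated because only the two error shapes
    shown in the text are known here.
    """
    info = payload.get("info", {})
    return "{} ({}) at line {}: {}".format(
        payload.get("type"),
        info.get("type"),
        info.get("line_number"),
        info.get("error"),
    )


# Fields copied from the resource_timeout error above.
sample = {
    "error": 400,
    "type": "ScriptError",
    "info": {"error": "network5", "type": "LUA_ERROR", "line_number": 3},
    "description": "Error happened while executing Lua script",
}

summary = describe_splash_error(sample)
print(summary)
```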

images_enabled

This property sets whether an image is loaded, which it is by default. If this attribute is disabled, you can save network traffic and speed up web page loading. However, it is important to note that disabling image loading may affect JavaScript rendering. Disabling the image affects the height of its outer DOM node, which in turn affects the position of the DOM node. Therefore, if the JavaScript has operations on the image node, its execution will be affected.

It’s also worth noting that Splash uses caching. If the page image is loaded at the beginning, and then the page is reloaded after image loading is disabled, the loaded image may still be displayed. In this case, Splash can be directly restarted.

The following is an example of disabling image loading:

function main(splash, args)
  splash.images_enabled = false
  assert(splash:go('https://www.jd.com'))
  return {png=splash:png()}
end

This will return screenshots without any images and will load much faster.

plugins_enabled

This property controls whether browser plug-ins (such as the Flash plug-in) are enabled. It is false by default, which means plug-ins are disabled. It can be turned on or off as follows:

splash.plugins_enabled = true/false

scroll_position

By setting this property, we can scroll the page vertically or horizontally. It is a commonly used property, as shown in the following example:

function main(splash, args)
  assert(splash:go('https://www.taobao.com'))
  splash.scroll_position = {y=400}
  return {png=splash:png()}
end

This allows us to scroll down the page by 400 pixels, as shown in Figure 7-10.

Figure 7-10 Running result

To scroll the page horizontally, pass in the x argument as follows:

splash.scroll_position = {x=100, y=200}

6. Methods of the Splash Object

In addition to the properties described earlier, a Splash object has the following methods.

go()

This method requests a link. It can simulate GET and POST requests and supports passing in headers, form data, and other data. It is used as follows:

ok, reason = splash:go{url, baseurl=nil, headers=nil, http_method="GET", body=nil, formdata=nil}

The parameters are described as follows.

url: The requested URL.
baseurl: Optional; empty by default. The base URL used to resolve relative resource paths.
headers: Optional; empty by default. The request headers.
http_method: Optional; the default value is GET, and POST is also supported.
body: Optional; empty by default. The raw body of a POST request, for example a JSON string sent with a Content-Type of application/json.
formdata: Optional; empty by default. The form data of a POST request, sent with a Content-Type of application/x-www-form-urlencoded.

The method returns a pair of values, ok and reason. If ok is empty, an error occurred while loading the page, and the reason variable contains the cause of the error. The following is an example:

function main(splash, args)
  local ok, reason = splash:go{"http://httpbin.org/post", http_method="POST", body="name=Germey"}
  if ok then
    return splash:html()
  end
end

Here we simulate a POST request, pass in the form data for the POST, and, if successful, return the source code for the page.

The running results are as follows:

<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "Germey"
  },
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en,*",
    "Connection": "close",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "Origin": "null",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/602.1 (KHTML, like Gecko) Splash Version/9.0 Safari/602.1"
  },
  "json": null,
  "origin": "60.207.237.85",
  "url": "http://httpbin.org/post"
}
</pre></body></html>

As you can see, we successfully implemented the POST request and sent the form data.

wait()

This method controls how long to wait on a page. It is used as follows:

ok, reason = splash:wait{time, cancel_on_redirect=false, cancel_on_error=true}

The parameters are described as follows.

time: The number of seconds to wait.
cancel_on_redirect: Optional; the default value is false. If true, waiting stops when a redirect occurs and the redirect result is returned.
cancel_on_error: Optional; the default value is true, as in the signature above. If true, waiting stops when a loading error occurs.

The result is also a combination of ok and reason.

Here is an example:

function main(splash)
    splash:go("https://www.taobao.com")
    splash:wait(2)
    return {html=splash:html()}
end

This visits Taobao, waits 2 seconds, and then returns the page source code.

jsfunc()

This method lets Lua call a function defined in JavaScript. The JavaScript function must be wrapped in double square brackets, which effectively converts the JavaScript function into a Lua function. The following is an example:

function main(splash, args)
  local get_div_count = splash:jsfunc([[
  function () {
    var body = document.body;
    var divs = body.getElementsByTagName('div');
    return divs.length;
  }
  ]])
  splash:go("https://www.baidu.com")
  return ("There are %s DIVs"):format(
    get_div_count())
end

The running results are as follows:

There are 21 DIVs

First we declare a function defined in JavaScript, then call it after the page has loaded to count the number of div nodes in the page.

For more details on converting between JavaScript and Lua, refer to the official documentation: splash.readthedocs.io/en/stable/s… .

evaljs()

This method executes JavaScript code and returns the result of the last JavaScript statement. It is used as follows:

result = splash:evaljs(js)

For example, we could use the following code to get the page title:

local title = splash:evaljs("document.title")

runjs()

This method also executes JavaScript code. Its functionality is similar to evaljs(), but it is intended for performing actions or declaring functions rather than returning a value. For example:

function main(splash, args)
  splash:go("https://www.baidu.com")
  splash:runjs("foo = function() { return 'bar' }")
  local result = splash:evaljs("foo()")
  return result
end

Here we declare a JavaScript function with runjs(), then call it with evaljs() and return the result.

The running results are as follows:

bar

autoload()

This method sets JavaScript to be loaded automatically on every page visit. It is used as follows:

ok, reason = splash:autoload{source_or_url, source=nil, url=nil}

The parameters are described as follows.

source_or_url: JavaScript code or a link to a JavaScript library.
source: JavaScript code.
url: A link to a JavaScript library.

This method only loads the JavaScript code or library; it does not perform any operation. To perform an operation, call the evaljs() or runjs() method. The following is an example:

function main(splash, args)
  splash:autoload([[
    function get_document_title(){
      return document.title;
    }
  ]])
  splash:go("https://www.baidu.com")
  return splash:evaljs("get_document_title()")
end

Here we declare a JavaScript function by calling the autoload() method and then execute it with the evaljs() method.

The running result is the title of the Baidu home page:

百度一下，你就知道

We can also use the autoload() method to load libraries such as jQuery, as shown in the following example:

function main(splash, args)
  assert(splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js"))
  assert(splash:go("https://www.taobao.com"))
  local version = splash:evaljs("$.fn.jquery")
  return 'JQuery version: ' .. version
end

The running results are as follows:

JQuery version: 2.1.3

call_later()

This method schedules a task to run after a given delay. It returns a timer object, and a pending task can be cancelled with the timer's cancel() method before it runs. The following is an example:

function main(splash, args)
  local snapshots = {}
  local timer = splash:call_later(function()
    snapshots["a"] = splash:png()
    splash:wait(1.0)
    snapshots["b"] = splash:png()
  end, 0.2)
  splash:go("https://www.taobao.com")
  splash:wait(3.0)
  return snapshots
end

Here we schedule a task that takes a screenshot at 0.2 seconds, waits 1 second, and takes another screenshot at 1.2 seconds. The page visited is Taobao, and the script finally returns both screenshots. Figure 7-11 shows the result.

Figure 7-11 Running result

It can be found that when the first screenshot is taken, the page has not been loaded, and the screenshot is empty. The second time the page is loaded successfully.

http_get()

This method simulates sending an HTTP GET request. It is used as follows:

response = splash:http_get{url, headers=nil, follow_redirects=true}

The parameters are described as follows.

url: The request URL.
headers: Optional; empty by default. The request headers.
follow_redirects: Optional; whether to follow redirects automatically. The default value is true.

The following is an example:

function main(splash, args)
  local treat = require("treat")
  local response = splash:http_get("http://httpbin.org/get")
  return {
    html=treat.as_string(response.body),
    url=response.url,
    status=response.status
  }
end

The running results are as follows:

Splash Response: Object
html: String (length 355)
{
  "args": {},
  "headers": {
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en,*",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/602.1 (KHTML, like Gecko) Splash Version/9.0 Safari/602.1"
  },
  "origin": "60.207.237.85",
  "url": "http://httpbin.org/get"
}
status: 200
url: "http://httpbin.org/get"

http_post()

Like the http_get() method, this method simulates sending a request, in this case a POST request, and it takes an additional body argument. It is used as follows:

response = splash:http_post{url, headers=nil, follow_redirects=true, body=nil}

The parameters are described as follows.

url: The request URL.
headers: Optional; empty by default. The request headers.
follow_redirects: Optional; whether to follow redirects automatically. The default value is true.
body: Optional; empty by default. The request body, such as form data.

Here is an example:

function main(splash, args)
  local treat = require("treat")
  local json = require("json")
  local response = splash:http_post{"http://httpbin.org/post",
      body=json.encode({name="Germey"}),
      headers={["content-type"]="application/json"}}
  return {
    html=treat.as_string(response.body),
    url=response.url,
    status=response.status
  }
end

The running results are as follows:

Splash Response: Object
html: String (length 533)
{
  "args": {},
  "data": "{\"name\": \"Germey\"}",
  "files": {},
  "form": {},
  "headers": {
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en,*",
    "Connection": "close",
    "Content-Length": "18",
    "Content-Type": "application/json",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/602.1 (KHTML, like Gecko) Splash Version/9.0 Safari/602.1"
  },
  "json": {
    "name": "Germey"
  },
  "origin": "60.207.237.85",
  "url": "http://httpbin.org/post"
}
status: 200
url: "http://httpbin.org/post"

As you can see, we successfully simulated a POST request and sent the data, which httpbin received as JSON.

set_content()

This method sets the content of the page, as shown in the following example:

function main(splash)
  assert(splash:set_content("<html><body><h1>hello</h1></body></html>"))
  return splash:png()
end

Figure 7-12 shows the result.

Figure 7-12 Running results

html()

This method returns the source code of the web page. It is a very simple and commonly used method. The following is an example:

function main(splash, args)
  splash:go("https://httpbin.org/get")
  return splash:html()
end

The running results are as follows:

<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en,*",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/602.1 (KHTML, like Gecko) Splash Version/9.0 Safari/602.1"
  },
  "origin": "60.207.237.85",
  "url": "https://httpbin.org/get"
}
</pre></body></html>

png()

This method returns a screenshot of the web page in PNG format, as shown in the following example:

function main(splash, args)
  splash:go("https://www.taobao.com")
  return splash:png()
end

jpeg()

This method returns a screenshot of the web page in JPEG format, as shown in the following example:

function main(splash, args)
  splash:go("https://www.taobao.com")
  return splash:jpeg()
end

har()

This method returns a description of the page loading process, as shown in the following example:

function main(splash, args)
  splash:go("https://www.baidu.com")
  return splash:har()
end

Figure 7-13 shows the result, which shows the details of each request record during page loading.

Figure 7-13 Running result

url()

This method returns the URL currently being visited, as shown in the following example:

function main(splash, args)
  splash:go("https://www.baidu.com")
  return splash:url()
end

The running results are as follows:

https://www.baidu.com/

get_cookies()

This method returns the cookies of the current page, as shown in the following example:

function main(splash, args)
  splash:go("https://www.baidu.com")
  return splash:get_cookies()
end

The running results are as follows:

Splash Response: Array[2]
0: Object
  domain: ".baidu.com"
  expires: "2085-08-21T20:13:23Z"
  httpOnly: false
  name: "BAIDUID"
  path: "/"
  secure: false
  value: "C1263A470B02DEF45593B062451C9722:FG=1"
1: Object
  domain: ".baidu.com"
  expires: "2085-08-21T20:13:23Z"
  httpOnly: false
  name: "BIDUPSID"
  path: "/"
  secure: false
  value: "C1263A470B02DEF45593B062451C9722"
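If a script that returns get_cookies() is run from Python, the result would presumably arrive as a JSON array of objects with the fields listed above. As a rough sketch under that assumption, the following code turns such a list into a Cookie request header value that could be reused in later plain HTTP requests. The function name cookies_to_header is hypothetical, and the sample list repeats only the name and value fields from the output above.

```python
def cookies_to_header(cookies):
    """Join cookie objects into a single Cookie header value.

    Only the `name` and `value` fields are used; attributes such as
    domain, path and expires are not part of a Cookie request header.
    """
    return "; ".join("{}={}".format(c["name"], c["value"]) for c in cookies)


# Values copied from the get_cookies() output shown above.
cookies = [
    {"name": "BAIDUID", "value": "C1263A470B02DEF45593B062451C9722:FG=1"},
    {"name": "BIDUPSID", "value": "C1263A470B02DEF45593B062451C9722"},
]

header = cookies_to_header(cookies)
print(header)
```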

add_cookie()

This method adds a cookie to the current page. It is used as follows:

cookies = splash:add_cookie{name, value, path=nil, domain=nil, expires=nil, httpOnly=nil, secure=nil}

The parameters of the method correspond to the attributes of the cookie.

The following is an example:

function main(splash)
  splash:add_cookie{"sessionid", "237465ghgfsd", "/", domain="http://example.com"}
  splash:go("http://example.com/")
  return splash:html()
end

clear_cookies()

This method clears all cookies, as shown in the following example:

function main(splash)
  splash:go("https://www.baidu.com/")
  splash:clear_cookies()
  return splash:get_cookies()
end

Here we clear all the Cookies and call get_cookies() to return the results.

The running results are as follows:

Splash Response: Array[0]

As you can see, the Cookies are all cleared without any results.

get_viewport_size()

This method returns the current browser page size, that is, its width and height, as shown in the following example:

function main(splash)
  splash:go("https://www.baidu.com/")
  return splash:get_viewport_size()
end

The running results are as follows:

Splash Response: Array[2]
0: 1024
1: 768

set_viewport_size()

This method sets the current browser page size, that is, its width and height. It is used as follows:

splash:set_viewport_size(width, height)

For example, here we visit a page with a width-adaptive layout:

function main(splash)
  splash:set_viewport_size(400, 700)
  assert(splash:go("http://cuiqingcai.com"))
  return splash:png()
end

Figure 7-14 shows the result.

Figure 7-14 Running result

set_viewport_full()

This method sets the browser to display the page in full, as shown in the following example:

function main(splash)
  splash:set_viewport_full()
  assert(splash:go("http://cuiqingcai.com"))
  return splash:png()
end

set_user_agent()

This method sets the User-Agent of the browser, as shown in the following example:

function main(splash)
  splash:set_user_agent('Splash')
  splash:go("http://httpbin.org/get")
  return splash:html()
end

Here we set the User-Agent of the browser to Splash. The running result is as follows:

<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en,*",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "Splash"
  },
  "origin": "60.207.237.85",
  "url": "http://httpbin.org/get"
}
</pre></body></html>

You can see that the User-Agent has been set successfully.

set_custom_headers()

This method sets the request headers, as shown in the following example:

function main(splash)
  splash:set_custom_headers({
     ["User-Agent"] = "Splash",
     ["Site"] = "Splash",
  })
  splash:go("http://httpbin.org/get")
  return splash:html()
end

Here we set the User-Agent and Site fields in the request headers. The running result is as follows:

<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "args": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en,*",
    "Connection": "close",
    "Host": "httpbin.org",
    "Site": "Splash",
    "User-Agent": "Splash"
  },
  "origin": "60.207.237.85",
  "url": "http://httpbin.org/get"
}
</pre></body></html>

select()

This method selects the first node that matches a CSS selector, which is passed in as its argument. If several nodes match, only the first one is returned. The following is an example:

function main(splash)
  splash:go("https://www.baidu.com/")
  input = splash:select("#kw")
  input:send_text('Splash')
  splash:wait(3)
  return splash:png()
end

Here we first visit Baidu, then select the search box, then call send_text() to fill in the text, then return the screenshot of the web page.

The result is shown in Figure 7-15. As you can see, we successfully filled in the input field.

Figure 7-15 Running result

select_all()

This method selects all matching nodes and takes a CSS selector as its argument. The following is an example:

function main(splash)
  local treat = require('treat')
  assert(splash:go("http://quotes.toscrape.com/"))
  assert(splash:wait(0.5))
  local texts = splash:select_all('.quote .text')
  local results = {}
  for index, text in ipairs(texts) do
    results[index] = text.node.innerHTML
  end
  return treat.as_array(results)
end

Here we use a CSS selector to select the nodes that hold the quote text, and then iterate over all of them to extract the text.

The running results are as follows:

Splash Response: Array[10]
0: "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”"
1: "“It is our choices, Harry, that show what we truly are, far more than our abilities.”"
2: "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”"
3: "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”"
4: "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"
5: "“Try not to become a man of success. Rather become a man of value.”"
6: "“It is better to be hated for what you are than to be loved for what you are not.”"
7: "“I have not failed. I've just found 10,000 ways that won't work.”"
8: "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”"
9: "“A day without sunshine is like, you know, night.”"

As you can see, we successfully extracted the text content of all 10 nodes.

mouse_click()

This method simulates a mouse click at the coordinates x and y passed in as arguments. Alternatively, you can select a node and call this method on it directly, as shown in the following example:

function main(splash)
  splash:go("https://www.baidu.com/")
  input = splash:select("#kw")
  input:send_text('Splash')
  submit = splash:select('#su')
  submit:mouse_click()
  splash:wait(3)
  return splash:png()
end

Here we first select the input box of the page and enter some text, then select the submit button and call the mouse_click() method to submit the query. The script then waits three seconds and returns a screenshot, as shown in Figure 7-16.

Figure 7-16 Running result

As you can see, we successfully obtained the page content after the query, simulating a Baidu search.

The sections above cover the commonly used API operations of Splash. Some APIs are not covered here; for a more detailed and authoritative explanation, see the official documentation at splash.readthedocs.io/en/stable/s… , which describes all API operations on the splash object. There are also APIs for operating on page elements, documented at splash.readthedocs.io/en/stable/s… .

7. Calling the Splash API

The Lua scripts above were all tested in the Splash web page. How can we use Splash to render pages from our own programs, and how can we combine it with Python to crawl JavaScript-rendered pages?

Splash provides a set of HTTP API interfaces for this. We only need to request these interfaces and pass the corresponding parameters. They are briefly introduced below.

render.html

This interface returns the HTML code of a page after JavaScript rendering. Its address is the address where Splash is running plus the interface name, for example http://localhost:8050/render.html. You can test it with curl:

curl http://localhost:8050/render.html?url=https://www.baidu.com

We pass the interface a url parameter that specifies the page to render, and the result is the source code of the rendered page.

Implemented in Python, the code looks like this:

import requests
url = 'http://localhost:8050/render.html?url=https://www.baidu.com'
response = requests.get(url)
print(response.text)

This prints the source code of the Baidu page after rendering.

This interface also accepts other parameters, such as wait, which specifies the number of seconds to wait. To make sure the page has loaded completely, you can increase the wait time, for example:

import requests
url = 'http://localhost:8050/render.html?url=https://www.taobao.com&wait=5'
response = requests.get(url)
print(response.text)

The response time increases accordingly: here it takes more than 5 seconds to obtain the source code of the Taobao page.
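The URLs above are concatenated by hand, which becomes fragile once the target URL contains its own query string. As a small sketch of one alternative, the query string can be built with the standard library's urlencode so that each parameter is escaped. The helper name build_render_url is hypothetical, and the Splash address is the local one assumed throughout the text.

```python
from urllib.parse import urlencode

SPLASH = "http://localhost:8050"  # local Splash address, as assumed in the text


def build_render_url(endpoint, **params):
    """Return a Splash API URL with every parameter percent-encoded."""
    return "{}/{}?{}".format(SPLASH, endpoint, urlencode(params))


render_url = build_render_url("render.html", url="https://www.taobao.com", wait=5)
print(render_url)
# prints http://localhost:8050/render.html?url=https%3A%2F%2Fwww.taobao.com&wait=5
```

The resulting string can be passed to requests.get() exactly like the hand-written URLs above.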

The interface also supports proxy settings, image-loading settings, custom headers, and the request method. For details, see the official documentation at splash.readthedocs.io/en/stable/a… .

render.png

This interface takes screenshots of web pages with several more parameters than render.html, such as width and height, and returns binary PNG image data. The following is an example:

1	curl http://localhost:8050/render.png? url=https://www.taobao.com&wait=5&width=1000&height=700

Here we pass width and height to set the page size to 1000×700 pixels.

If implemented in Python, you can save the returned binary data as a PNG image as follows:

import requests

url = 'http://localhost:8050/render.png?url=https://www.jd.com&wait=5&width=1000&height=700'
response = requests.get(url)
with open('jd.png', 'wb') as f:
    f.write(response.content)

The resulting image is shown in Figure 7-17.

Figure 7-17 Running result

This gives us a screenshot of the rendered JD.com home page. For the full list of parameters, see the official documentation: splash.readthedocs.io/en/stable/a… .
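When rendering fails (a timeout, for example), the response body may be an error message rather than an image, and writing it straight to disk produces a file that no viewer can open. The sketch below guards against that by checking the standard 8-byte PNG signature first. The helper name save_png is hypothetical, and data stands for response.content from the example above.

```python
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def save_png(data, path):
    """Write data to path only if it starts like a PNG; return True on success."""
    if not data.startswith(PNG_SIGNATURE):
        return False
    Path(path).write_bytes(data)
    return True


# Usage with the request above: save_png(response.content, "jd.png")
```

Checking response.status_code before saving would be another reasonable safeguard.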

render.jpeg

This interface is similar to render.png, except that it returns the image as binary JPEG data.

Compared with render.png, it takes one extra parameter, quality, which sets the image quality.
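A minimal sketch of a render.jpeg request, reusing the JD.com page and dimensions from the render.png example. The quality value of 60 is an arbitrary choice for illustration, and the output file name in the comment is hypothetical.

```python
from urllib.parse import urlencode

params = {
    "url": "https://www.jd.com",
    "wait": 5,
    "width": 1000,
    "height": 700,
    "quality": 60,  # illustrative value; lower means a smaller, rougher image
}

jpeg_url = "http://localhost:8050/render.jpeg?" + urlencode(params)
print(jpeg_url)
# response = requests.get(jpeg_url), then write response.content to e.g. 'jd.jpg'
```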

render.har

This interface returns the HAR data recorded while the page loads. For example:

curl 'http://localhost:8050/render.har?url=https://www.jd.com&wait=5'

As Figure 7-18 shows, the result is a large amount of JSON-formatted data containing the HAR data from the page load.

Figure 7-18 Running result
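Because HAR is plain JSON, the result is easy to inspect in Python. The sketch below lists the status code and URL of every recorded request, relying on the standard HAR layout (log, entries, request, response). The sample data is hand-written for illustration, including the hypothetical missing.js entry; in practice it would come from json.loads(response.text).

```python
def summarize_har(har):
    """Return (status, url) pairs for each entry recorded in a HAR dict."""
    return [
        (entry["response"]["status"], entry["request"]["url"])
        for entry in har["log"]["entries"]
    ]


# Hand-written stand-in for the parsed render.har response.
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://www.jd.com/"}, "response": {"status": 200}},
            {"request": {"url": "https://www.jd.com/missing.js"}, "response": {"status": 404}},
        ]
    }
}

for status, url in summarize_har(sample_har):
    print(status, url)
```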

render.json

This interface combines the functionality of all the previous interfaces and returns the result as JSON. For example:

curl http://localhost:8050/render.json?url=https://httpbin.org

The results are as follows:

{"title": "httpbin(1): HTTP Client Testing Service", "url": "https://httpbin.org/", "requestedUrl": "https://httpbin.org/", "geometry": [0, 0, 1024, 768]}

As you can see, the corresponding request data is returned in JSON form.

We can control what is returned by passing different parameters. With html=1, the result also includes the page source; with png=1, it includes a PNG screenshot of the page; with har=1, it includes the page's HAR data. For example:

curl 'http://localhost:8050/render.json?url=https://httpbin.org&html=1&har=1'

The returned JSON now contains the page source and the HAR data.
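Binary data cannot be placed in JSON directly, so with png=1 the screenshot arrives as text. The sketch below assumes it is base64-encoded under a "png" key, which is my understanding of render.json rather than something stated above. The helper name extract_png is hypothetical and the sample response is hand-built.

```python
import base64
import json


def extract_png(render_json_text):
    """Decode the base64 "png" field of a render.json response into bytes."""
    result = json.loads(render_json_text)
    return base64.b64decode(result["png"])


# Hand-built stand-in for response.text; a real one holds a full screenshot.
fake_image = b"\x89PNG\r\n\x1a\n...pixels..."
sample_text = json.dumps({
    "url": "https://httpbin.org/",
    "png": base64.b64encode(fake_image).decode("ascii"),
})

png_bytes = extract_png(sample_text)
print(len(png_bytes), "bytes decoded")
```

The decoded bytes can then be written to a file in 'wb' mode, just like response.content from render.png.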

There are more parameters as well. For details, see the official documentation: splash.readthedocs.io/en/stable/a… .

execute

This is the most powerful interface: it runs Splash Lua scripts.

The render.html and render.png interfaces above are enough for ordinary JavaScript-rendered pages, but they cannot handle interaction with the page. For that we need the execute interface.

Let’s start with the simplest script that returns data directly:

function main(splash)
  return 'hello'
end

We then URL-encode this script and append it to the execute interface as the lua_source parameter:

curl http://localhost:8050/execute?lua_source=function+main%28splash%29%0D%0A++return+%27hello%27%0D%0Aend

The running results are as follows:

hello

Here we pass the URL-encoded Lua script through the lua_source parameter, and the execute interface returns the result of running the script.
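The encoded string in the curl command does not have to be written by hand. As a small sketch, quote_plus() from the standard library reproduces it exactly, assuming the script uses Windows-style line breaks (\r\n) and a two-space indent, which is what the %0D%0A and ++ sequences suggest.

```python
from urllib.parse import quote_plus

# The script from above, with \r\n line breaks and a two-space indent.
lua = "function main(splash)\r\n  return 'hello'\r\nend"

encoded = quote_plus(lua)
print(encoded)
# → function+main%28splash%29%0D%0A++return+%27hello%27%0D%0Aend
```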

What we care about more is how to do this in Python. The same example implemented in Python looks like this:

import requests
from urllib.parse import quote

lua = '''
function main(splash)
  return 'hello'
end
'''

url = 'http://localhost:8050/execute?lua_source=' + quote(lua)
response = requests.get(url)
print(response.text)

The running results are as follows:

hello

Here we wrap the Lua script in a triple-quoted Python string and URL-encode it with the quote() method from urllib.parse. We then build the Splash request URL, passing the encoded script as the lua_source parameter. The response contains the result of running the Lua script.
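To see that the encoding step loses nothing, the sketch below builds the same URL and then decodes its query string with the standard library, roughly the way a server would. It only illustrates the round trip and makes no claim about how Splash itself parses the request.

```python
from urllib.parse import parse_qs, quote, urlsplit

lua = """
function main(splash)
  return 'hello'
end
"""

url = "http://localhost:8050/execute?lua_source=" + quote(lua)

# Decode the query string again and compare with the original script.
decoded = parse_qs(urlsplit(url).query)["lua_source"][0]
print(decoded == lua)
```

Spaces and line breaks in the script become %20 and %0A in the URL, and decoding restores them unchanged.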

Let’s take another example:

import requests
from urllib.parse import quote

lua = '''
function main(splash, args)
  local treat = require("treat")
  local response = splash:http_get("http://httpbin.org/get")
  return {
    html=treat.as_string(response.body),
    url=response.url,
    status=response.status
  }
end
'''

url = 'http://localhost:8050/execute?lua_source=' + quote(lua)
response = requests.get(url)
print(response.text)

The running results are as follows:

{"url": "http://httpbin.org/get", "status": 200, "html": "{\n  \"args\": {}, \n  \"headers\": {\n    \"Accept-Encoding\": \"gzip, deflate\", \n    \"Accept-Language\": \"en,*\", \n    \"Connection\": \"close\", \n    \"Host\": \"httpbin.org\", \n    \"User-Agent\": \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/602.1 (KHTML, like Gecko) Splash Version/9.0 Safari/602.1\"\n  }, \n  \"origin\": \"60.207.237.85\", \n  \"url\": \"http://httpbin.org/get\"\n}\n"}

As you can see, the return result is JSON, and we successfully obtained the requested URL, status code, and page source code.
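Since the result is JSON, it can be turned into Python objects with the json module. Note that the html field is itself a string holding httpbin's JSON, so it needs a second decoding step. The sketch below uses a shortened, hand-written stand-in for response.text so that it runs without a Splash server.

```python
import json

# Shortened stand-in for response.text from the example above.
response_text = json.dumps({
    "url": "http://httpbin.org/get",
    "status": 200,
    "html": json.dumps({"args": {}, "url": "http://httpbin.org/get"}),
})

result = json.loads(response_text)  # outer object built from the Lua table
body = json.loads(result["html"])   # httpbin's own JSON, carried as a string

print(result["status"], body["url"])
# → 200 http://httpbin.org/get
```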

In this way, every Lua script discussed earlier can be driven from Python. Dynamic rendering, simulated clicks, form submission, page scrolling, and delayed waits can all be controlled freely, and we can also obtain the page source and screenshots.
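Longer scripts and extra arguments become awkward to fit into a query string. One alternative, sketched below, is to send everything as a JSON body in a POST request. This assumes that the execute interface accepts JSON POST bodies and exposes the extra keys to the script as args, which matches my reading of the Splash documentation but is not shown in this section. The script is adapted from the one on the Splash home page.

```python
import json

lua = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  return {html = splash:html()}
end
"""

# Keys other than lua_source are assumed to reach the script as args.url, args.wait.
payload = {"lua_source": lua, "url": "https://www.baidu.com", "wait": 0.5}

body = json.dumps(payload)
print(len(body), "bytes of JSON to POST to http://localhost:8050/execute")
# requests.post("http://localhost:8050/execute", json=payload) sends this body.
```

Sending wait as a JSON number keeps it numeric on the Lua side, whereas values in a query string always arrive as strings.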

At this point, we can use Python and Splash to crawl JavaScript-rendered pages. Besides Selenium, the Splash service described in this section also offers powerful rendering, and it renders pages without needing a browser, which makes it very convenient to use.

This article was first published on Cui Qingcai's personal blog, Jingmi: Python3 Web Crawler Development in Practice tutorial | Jingmi

