Puppeteer is the Chrome team's official Node library for controlling Headless Chrome. After its announcement, several of the industry's browser-automation projects wound down maintenance, including PhantomJS, and the Selenium IDE for Firefox project was also discontinued for lack of maintainers.

Summary

This article uses Headless Chrome, Puppeteer, Node.js, and MySQL to crawl Sina Weibo: log in, crawl the news posts on the People's Daily homepage, and save them to a MySQL database.

Installation

Puppeteer installation can fail because the bundled Chromium package fails to download. The fix was covered in a previous article, so it is not repeated here: Puppeteer installation pitfalls – a solution.
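
If the download keeps failing, one common workaround (an assumption added here, not a step from the linked article) is to skip the bundled Chromium download at install time and point Puppeteer at a locally installed Chromium/Chrome instead, which is exactly what the executablePath option in the screenshot example below does:

# Skip downloading the bundled Chromium during install (assumed workaround)
$ PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1 npm i puppeteer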

Get started

We'll start with a screenshot example to get familiar with the APIs Puppeteer uses to launch the browser and drive the page.

screenshot.js

const puppeteer = require('puppeteer');

(async () => {
    const pathToExtension = require('path').join(__dirname, '../chrome-mac/Chromium.app/Contents/MacOS/Chromium');
    const browser = await puppeteer.launch({
        headless: false,
        executablePath: pathToExtension
    });
    const page = await browser.newPage();
    await page.setViewport({width: 1000, height: 500});
    await page.goto('https://weibo.com/rmrb');
    await page.waitForNavigation();
    await page.screenshot({path: 'rmrb.png'});
    await browser.close();
})();
  • puppeteer.launch launches a Chromium instance; both puppeteer.launch and puppeteer.connect return a Browser object.
    • executablePath: the path of the Chromium or Chrome executable to launch.
    • headless: whether to launch the browser in headless mode. Headless Chrome is Chrome running without a visible UI, for automated testing and for servers that don't need a graphical interface.
  • browser.newPage opens a new page and returns a Promise. Most methods in the Puppeteer API return Promise objects, so we use them with async/await.

Run the code

$ node screenshot.js

The screenshot is saved to the project root directory.

Analyze page structure and extract news

Our goal is to extract the text and date of each of People's Daily's Weibo posts.

  • Each post: div[action-type=feed_list_item]
  • Post text: div[action-type=feed_list_item] > .WB_detail > .WB_text
  • Post date: the date attribute of the first a element under div[action-type=feed_list_item] > .WB_detail > .WB_from

Puppeteer provides page.evaluate for extracting page elements, because it runs inside the browser context. Once the page has loaded, page.evaluate can be used to analyze the DOM nodes.

page.evaluate(pageFunction, ...args)

  • pageFunction <[function]|[string]> the function (or code string) to be evaluated in the page context
  • ...args <...[Serializable]|[JSHandle]> arguments to pass to pageFunction
  • returns: <[Promise]<[Serializable]>> the result of running pageFunction

If pageFunction returns a Promise, page.evaluate waits for the promise to resolve and returns its value.

If pageFunction returns a value that cannot be serialized, page.evaluate resolves to undefined.
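
A minimal sketch of this behavior (an illustration added here, not from the original article), assuming a page object like the one above:

const count = await page.evaluate((selector) => {
    // pageFunction runs in the browser, so document is available here;
    // returning a Promise is fine: page.evaluate waits for it to resolve
    return Promise.resolve(document.querySelectorAll(selector).length);
}, 'div[action-type=feed_list_item]');
console.log(count); // a plain number, serialized back to the Node side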

The code for analyzing weibo page information is as follows:

const LIST_SELECTOR = 'div[action-type=feed_list_item]'
return await page.evaluate((infoDiv)=> {
    return Array.prototype.slice.apply(document.querySelectorAll(infoDiv))
        .map($userListItem => {
            var weiboDiv = $($userListItem)
            var webUrl = 'http://weibo.com'
            var weiboInfo = {
                "tbinfo": weiboDiv.attr("tbinfo"),
                "mid": weiboDiv.attr("mid"),
                "isforward": weiboDiv.attr("isforward"),
                "minfo": weiboDiv.attr("minfo"),
                "omid": weiboDiv.attr("omid"),
                "text": weiboDiv.find(".WB_detail>.WB_text").text().trim(),
                'link': webUrl.concat(weiboDiv.find(".WB_detail>.WB_from a").eq(0).attr("href")),
                "sendAt": weiboDiv.find(".WB_detail>.WB_from a").eq(0).attr("date")
            };

            if (weiboInfo.isforward) {
                var forward = weiboDiv.find("div[node-type=feed_list_forwardContent]");
                if (forward.length > 0) {
                    var forwardUser = forward.find("a[node-type=feed_list_originNick]");
                    var userCard = forwardUser.attr("usercard");
                    weiboInfo.forward = {
                        name: forwardUser.attr("nick-name"),
                        id: userCard ? userCard.split("=")[1] : "error",
                        text: forward.find(".WB_text").text().trim(),
                        "sendAt": weiboDiv.find(".WB_detail>.WB_from a").eq(0).attr("date")
                    };
                }
            }
            return weiboInfo
        })
}, LIST_SELECTOR)


We pass the news-block selector LIST_SELECTOR as an argument to page.evaluate. Inside pageFunction, which runs in the page's context, we can use document methods to operate on the DOM: we traverse each news-block div, analyze its structure, and pull out the corresponding information.

A small digression ~

Because I'm not comfortable manipulating DOM nodes with native JS methods (2333), I decided to make jQuery available in the page during development.

Method 1

page.addScriptTag(options)

Injects a script tag into the current page, specified by either src (URL), path, or content.

  • options <[Object]>
    • url <[string]> URL of the script to add (its src)
    • path <[string]> path to a JS file to inject into the frame; a relative path is resolved against the current working directory
    • content <[string]> JS code to inject into the page
    • type <[string]> script type; use 'module' to inject an ES6 module
  • returns: <[Promise]<[ElementHandle]>> a Promise that resolves to the injected script tag, once the script's onload fires or the code is injected into the frame.

So we just add it to the code:

await page.addScriptTag({url: "https://code.jquery.com/jquery-3.2.1.min.js"})

Then you can fly happily!

Method 2

It's even more convenient if the page you are visiting already includes jQuery!

await page.evaluate(()=> {
    var $ = window.$
})

Inside pageFunction, simply assign window.$ to a local $ variable.

End of digression ~

Note: only pageFunction runs with a page context. If you call document or jQuery methods elsewhere in the Node program, there is no document environment and an error is thrown:

(node:3346) UnhandledPromiseRejectionWarning: ReferenceError: document is not defined
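
For example (a sketch added here, assuming the same page object as above), the first line fails in the Node context while the second works because the DOM access happens inside pageFunction:

// Wrong: document only exists inside the browser context
// const title = document.title;  // ReferenceError: document is not defined

// Right: do the DOM work inside page.evaluate
const title = await page.evaluate(() => document.title);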

Extract comments for each story

It’s not enough just to grab the news, we also need the hot comments on each story.

We found that clicking the comment button in the action bar loads that post's comments.

  1. After analyzing the news block DOM element, we simulate clicking the comment button
$('.WB_handle span[node-type=comment_btn_text]').each(async(i, v)=>{
    $(v).trigger('click')
})
  2. Listen for page requests using the listener event: 'response'

event: ‘response’

  • <[Response]>

Triggered when a corresponding [response] is received for a request on the page.

Figure: When we click the comment button, the browser sends a lot of requests, and our goal is to extract the comment request.

We need to use several methods in Class Response to listen for the browser’s Response, analyze it and extract the comments.

  • response.url() Contains the URL of the response.
  • response.text()
    • returns: <Promise> Promise which resolves to a text representation of response body.
page.on('response', async(res)=> {
    const url = res.url()
    if (url.indexOf('small') > -1) {
        let text = await res.text()
        var mid = getQueryVariable(res.url(), 'mid');
        var delHtml = delHtmlTag(JSON.parse(text).data.html)
        var matchReg = /\：.*?(?= )/gi;
        var matchRes = delHtml.match(matchReg)
        if (matchRes && matchRes.length) {
            let comment = []
            matchRes.map((v)=> {
                comment.push({mid, content: JSON.stringify(v.split('：')[1])})
            })
            pool.getConnection(function (err, connection) {
                save.comment({"connection": connection, "res": comment}, function () {
                    console.log('insert success')
                })
            })
        }
    }
})
  1. Use res.url() to get the response URL and check whether it contains the keyword "small".
  2. The mid query parameter in the URL tells us which news post the comments belong to.
  3. Use res.text() to get the response body and strip the HTML tags from the DOM fragment in body.data (a sketch of the helpers is given below).
  4. Extract the comments we need from the resulting plain text.
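
The snippet above relies on two helpers, getQueryVariable and an HTML-stripping function (reconstructed above as delHtmlTag), that the article does not show. A minimal sketch of what they might look like (an assumption, not the author's original code):

// Pull a single query-string parameter (e.g. mid) out of a URL
function getQueryVariable(url, name) {
    var query = url.split('?')[1] || '';
    var pair = query.split('&').find(function (kv) {
        return kv.split('=')[0] === name;
    });
    return pair ? decodeURIComponent(pair.split('=')[1]) : null;
}

// Strip HTML tags so only the text content remains
function delHtmlTag(html) {
    return html.replace(/<[^>]+>/g, '');
}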

Save to MySQL

We use MySQL to store the news and comments.

$ npm i mysql -D

mysql is a MySQL driver library for Node.js. It is written in pure JavaScript and requires no compilation.

  1. Create config.js, create a local database connection, and export the configuration.

config.js

var mysql = require('mysql');
var ip = 'http://127.0.0.1:3000';
var host = 'localhost';
var pool = mysql.createPool({
    host: '127.0.0.1',
    user: 'root',
    password: 'xxxx',
    database: 'yuan_place',
    connectTimeout: 30000
});
module.exports = {
    ip: ip,
    pool: pool,
    host: host,
}
  2. Require config.js in the crawler:
page.on('response', async(res)=> {
    ...
    if (matchRes && matchRes.length) {
        let comment = []
        matchRes.map((v)=> {
            comment.push({mid, content: JSON.stringify(v.split('：')[1])})
        })
        pool.getConnection(function (err, connection) {
            save.comment({"connection": connection, "res": comment}, function () {
                console.log('insert success')
            })
        })
    }
    ...
})

const content = await getWeibo(page)
pool.getConnection(function (err, connection) {
    save.content({"connection": connection, "res": content}, function () {
        console.log('insert success')
    })
})
  3. Then we write a save.js that handles the data-insertion logic.

The structure of the two tables is as follows:
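
The original schema screenshot is not reproduced here; a sketch inferred from the insert statements in save.js (column names come from the code, while types and lengths are assumptions) might look like:

CREATE TABLE sina_content (
    tbinfo    VARCHAR(255),
    mid       VARCHAR(64),
    isforward VARCHAR(8),
    minfo     TEXT,
    omid      VARCHAR(64),
    `text`    TEXT,
    sendAt    DATETIME,
    cid       VARCHAR(64),
    clink     VARCHAR(255),
    fname     VARCHAR(255),
    fid       VARCHAR(64),
    ftext     TEXT,
    fsendAt   DATETIME
);

CREATE TABLE sina_comment (
    mid     VARCHAR(64),
    content TEXT
);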

Now we can happily start loading data into the database.

save.js

// assumed requires (not shown in the original snippet)
var async = require('async');
var debug = require('debug')('save');

exports.content = function(list, callback){
    console.log('save news')
    var connection = list.connection
    async.forEach(list.res, function(item, cb){
        debug('save news', JSON.stringify(item));
        var data = [item.tbinfo, item.mid, item.isforward, item.minfo, item.omid, item.text, new Date(parseInt(item.sendAt)), item.cid, item.clink]
        if(item.forward){
            var fo = item.forward
            data = data.concat([fo.name, fo.id, fo.text, new Date(parseInt(fo.sendAt))])
        }else{
            data = data.concat(['', '', '', new Date()])
        }
        connection.query('select * from sina_content where mid = ?', [item.mid], function (err, res) {
            if(err){
                console.log(err)
            }
            if(res && res.length){
                // already saved, skip it
                cb();
            }else{
                connection.query('insert into sina_content(tbinfo,mid,isforward,minfo,omid,text,sendAt,cid,clink,fname,fid,ftext,fsendAt) values(?,?,?,?,?,?,?,?,?,?,?,?,?)', data, function(err, result){
                    if(err){
                        console.log('kNewscom', err)
                    }
                    cb();
                })
            }
        })
    }, callback);
}

exports.comment = function(list, callback){
    console.log('save comment')
    var connection = list.connection
    async.forEach(list.res, function(item, cb){
        debug('save comment', JSON.stringify(item));
        var data = [item.mid, item.content]
        connection.query('select * from sina_comment where mid = ?', [item.mid], function (err, res) {
            if(res && res.length){
                cb();
            }else{
                connection.query('insert into sina_comment(mid,content) values(?,?)', data, function(err, result){
                    if(err){
                        console.log(item.mid, item.content, item)
                        console.log('comment', err)
                    }
                    cb();
                });
            }
        })
    }, callback);
}

Run the program, and you’ll find that the data is already in the library.

A fiddly extra the project doesn't really need: simulating login

Even without logging in, you can already fetch the news and comments. But! How could the pursuit of progress stop here? So here is a login module the project doesn't strictly need: bring it in if you need it, leave it out if you don't.

Add a creds.js file to the project root.

module.exports = {
  username: '<WEIBO_USERNAME>',
  password: '<WEIBO_PASSWORD>'
};
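
The crawler then needs to load these credentials, presumably with something like the following (this require line is not shown in the article):

const CREDS = require('./creds');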
  1. Use page.click to simulate a page click.

page.click(selector[, options])

  • selector A selector to search for element to click. If there are multiple elements satisfying the selector, the first will be clicked.
  • options
    • button left, right, or middle, defaults to left.
    • clickCount defaults to 1. See UIEvent.detail.
    • delay Time to wait between mousedown and mouseup in milliseconds. Defaults to 0.
    • returns: Promise which resolves when the element matching selector is successfully clicked. The Promise will be rejected if there is no element matching selector.

Because page.click returns a Promise, we await it:

    await page.click('.gn_login_list li a[node-type="loginBtn"]');

  2. await page.waitFor(2000) waits 2 seconds for the login input box to appear.

  3. Use page.type to enter the username and password (the {delay: 30} option simulates human typing speed and can be adjusted as needed), then click the login button and use page.waitForNavigation() to wait until login succeeds.

    await page.type('input[name=username]',CREDS.username,{delay:30});
    await page.type('input[name=password]',CREDS.password,{delay:30});
    await page.click('.item_btn a');
    await page.waitForNavigation();

Because the test account I used is not bound to a mobile phone number, the steps above are enough to log in. If the account is bound to a phone number, you also need to scan a QR code with the mobile client for secondary verification.

Finally

Crawler screenshots:

The crawler demo is here: github.com/wallaceyuan…

Reference documents:

Github.com/GoogleChrom… Github.com/GoogleChrom…

If you found this fun, follow me ~ and feel free to bookmark and comment ~~~

Job posting

ByteDance is hiring!

Position: Front-end Development (Senior), ToB direction, Video Cloud (Base: Shanghai, Beijing)

1. Productize multimedia services such as VOD, live streaming, and real-time communication, and build the business cloud platform;

2. Build the multimedia quality system and the operations and maintenance system, and develop the related systems;

3. Be good at abstract design and engineering thinking, care about interaction, and craft the ultimate user experience.

Job requirements

1. A major in computer science, communications, or electronic information science is preferred;

2. Familiar with front-end technologies, including HTML/CSS/JavaScript/Node.js;

3. In-depth knowledge of JavaScript and experience with mainstream frameworks such as React or Vue.js;

4. Familiar with Node.js and frameworks such as Express/Koa; experience developing large server-side programs is a plus;

5. Some understanding of user experience, interaction, and user-needs analysis; product or interface design experience is a plus;

6. Having your own technical products or open-source work, or being an active open-source contributor, is a plus.

About the team

Building on the audio and video technology and infrastructure behind products such as Douyin and Xigua Video, the Video Cloud team provides customers with one-stop audio and video services, including VOD, live streaming, real-time communication, and image processing. Internally it serves as the company's video technology center supporting in-house business; externally it delivers productized audio and video solutions for enterprise users.

The team has a standardized iteration process and well-defined project roles, a strong technical culture, and embraces the open-source community with regular sharing sessions, so everyone can grow with the fast-moving business and change the world with technology!

How to apply

Send your resume directly to: [email protected]

You can also scan the referral QR code to apply online. Looking forward to having you! ~