A few words up front

Thank you so much for your likes and attention. This is actually the first time I'm writing on the Nuggets. I only stumbled into crawlers by accident a little while ago, and I haven't been learning Node for very long; although I've built a few back-end projects with Node, I'm still a newcomer to both Node and crawlers. This article mainly shares some basic knowledge of Node and crawlers. I hope it helps you, and I also hope to exchange ideas and learn together with everyone.

By the way, I've also set up a personal home page with my own technical articles, as well as personal thoughts, reflections and logs. From now on, all articles will be posted there first and then synced to other platforms. Friends who are interested are welcome to drop by whenever you have time. Thanks again for your support!

I. What is a crawler

A web crawler (also known as a web spider or web robot, and more often called a web chaser in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other less commonly used names include ant, auto-indexer, emulator or worm. See Wikipedia's introduction to web crawlers.

II. Classification of crawlers

  • General-purpose web crawler (whole-web crawler)

The crawl targets expand from a few seed URLs to the entire Web; data is gathered mainly for portal search engines and large Web service providers.

  • Focused web crawler (topic crawler)

A web crawler that selectively crawls pages related to predefined topics.

  • Incremental web crawler

A crawler that incrementally updates pages it has already downloaded and only crawls pages that are new or have changed, which to some extent guarantees that the crawled pages are as fresh as possible.

  • Deep Web crawler

A crawler for deep-web pages, i.e. pages that can only be reached after a keyword search or after logging in.

III. Crawling strategies

  • General-purpose web crawler (whole-web crawler)

Depth-first strategy, breadth-first strategy

  • Focused web crawler (topic crawler)

Crawling strategies based on content evaluation (content relevance), on link-structure evaluation, on reinforcement learning (link importance), and on context graphs (distance, i.e. the weight of the edge between two nodes in graph theory).

  • Incremental web crawler

Uniform update, individual update, classification-based update, and adaptive frequency-adjustment update.

  • Deep Web crawler

The most important part of a deep-web crawler's workflow is form filling, which comes in two flavours: form filling based on domain knowledge and form filling based on analysis of the page structure.

The behavior of modern web crawlers is usually the result of a combination of four strategies:

  • Selection strategy: decides which pages to download;
  • Re-visit strategy: decides when to check pages for changes;
  • Politeness strategy: describes how to avoid overloading the target site;
  • Parallelization strategy: describes how crawlers cooperate to achieve distributed crawling.

IV. The process of writing a simple web crawler

  1. Determine what to crawl (website/page)
  2. Analyze the page content (target data / DOM structure)
  3. Choose the development language, framework, tools, etc.
  4. Code, test, and crawl the data
  5. Optimize

A simple Baidu news crawler

Determine what to crawl (website/page)

Baidu News homepage: news.baidu.com

Analyze page content (target data /DOM structure)

· · · · · ·

Identify development languages, frameworks, tools, etc

Node.js (Express) + Sublime Text 3

Code, test, crawl data

Coding…

Let’s start

Create the project directory

1. Create the project directory baiduNews under the appropriate disk directory (my project directory is: F:\web\baiduNews)

Note: I'm writing this on a really underpowered computer, so running the project from WebStorm or VS Code is a bit of a struggle. Therefore, the command-line operations below are all done in the Windows DOS command prompt.

Initialize package.json

1. Go to the project root directory baiduNews in the DOS command line. 2. Run npm init to initialize the package.json file.

Install dependencies

  • express (used to build a simple HTTP server; of course, you could also use Node's built-in http module)
  • superagent (a convenient, lightweight, progressive third-party module for making client-side HTTP requests in Node)
  • cheerio (a Node implementation of the core of jQuery; if you have used jQuery it is very easy to pick up. It is mainly used here to select page elements and extract the data in them.)

// I prefer to use yarn to install dependencies; you can also use npm install.
yarn add express
yarn add superagent
yarn add cheerio

After the dependencies are installed, you can check package.json to confirm that they were installed successfully.
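
For reference, a freshly initialized package.json with these three dependencies looks roughly like the excerpt below (the exact version numbers are only illustrative and will differ on your machine):

{
  "name": "baidunews",
  "version": "1.0.0",
  "main": "index.js",
  "dependencies": {
    "cheerio": "^1.0.0-rc.2",
    "express": "^4.16.3",
    "superagent": "^3.8.3"
  }
}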

Start coding

1. Use express to start a simple local HTTP server

1. Create the index.js file in the root directory of the project.

2. After creating index.js, we first instantiate an express object and use it to start an HTTP service listening on local port 3000.

const express = require('express');
const app = express();

// ...

let server = app.listen(3000, function () {
  let host = server.address().address;
  let port = server.address().port;
  console.log('Your App is running at http://%s:%s', host, port);
});

Yes, it's that simple: less than 10 lines of code to set up a simple local HTTP service.

3. Following convention, we expect this service to return the customary "Hello World" when we visit the local address http://localhost:3000. Add the following code to index.js:

app.get('/', function (req, res) {
  res.send('Hello World!');
});

Now run node index.js under the project root directory baiduNews in DOS to start the project. Then open your browser and visit http://localhost:3000; you will see the words 'Hello World!'. In the same way, once we have fetched the information on the Baidu News homepage, we will be able to see it by visiting http://localhost:3000.

2. Fetch the news on the Baidu News homepage

1. First of all, let's analyze the page structure of the Baidu News homepage.

The Baidu News homepage is roughly divided into "hot news", "local news", "domestic news", "international news"... and so on. This time we will first try to capture the news data from the "hot news" block on the left and the "local news" block further down.

Press F12 to open the Chrome console and inspect the page elements. After examining the DOM structure of the left-hand "hot news" block, we find that all the "hot news" information (the news title and the link to the news page) sits in the a tags inside the li tags under the ul of the div with id #pane-news. Expressed as a jQuery selector: #pane-news ul li a.

2. To crawl the news data, we first request the target page with superagent and obtain the content of the whole news homepage.

const superagent = require('superagent');

let hotNews = [];   // hot news
let localNews = []; // local news

/**
 * index.js
 * [description] - use superagent.get() to request the Baidu News homepage
 */
superagent.get('http://news.baidu.com/').end((err, res) => {
  if (err) {
    // the request failed or errored
    console.log(`Hot news fetching failed - ${err}`);
  } else {
    // the request succeeded, extract the hot news from the page
    hotNews = getHotNews(res);
  }
});

3. After obtaining the page content, we define a function getHotNews() to extract the "hot news" data from the page.

const cheerio = require('cheerio');

/**
 * index.js
 * [description] - extract the "hot news" from the page
 */
let getHotNews = (res) => {
  let hotNews = [];
  // On success, the HTML returned by the request to http://news.baidu.com/ is contained in res.text.
  /* Use the cheerio module's cheerio.load() method,
     passing the HTML document as a parameter;
     afterwards you can use $(selector) to query the page just like jQuery. */
  let $ = cheerio.load(res.text);
  // Find the page elements containing the target data and extract them.
  $('div#pane-news ul li a').each((idx, ele) => {
    // cheerio's $('selector').each() iterates over all matched DOM elements.
    // idx is the index of the current element, ele is the current DOM element.
    let news = {
      title: $(ele).text(),      // get the news title
      href: $(ele).attr('href')  // get the link to the news page
    };
    hotNews.push(news); // push into the result array
  });
  return hotNews;
};

Here are a few more points:

  1. async/await is said to be the ultimate solution for asynchronous programming: it lets us write asynchronous code with a synchronous way of thinking. Together with Promise, async/await solves the "callback hell" of asynchronous programming and makes asynchronous flow control friendly and clear (a small sketch follows this list).
  2. The superagent module provides methods such as get, post, delete, etc., which make Ajax-style requests very convenient. When the request finishes, the .end() callback is executed. .end() accepts a function as a parameter, and that function receives two arguments, err and res. When the request fails, err contains the returned error information; when it succeeds, err is null and the returned data is contained in the res parameter.
  3. The cheerio module's .load() method takes the HTML document as a parameter; afterwards you can use $(selector) just like in jQuery to select page elements, and you can also iterate over elements with an .each() similar to jQuery's. There are many more methods, which you can Google/Baidu.
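
As a small sketch of point 1 (not the approach used in the rest of this article, just an illustration), the same superagent request could be written with async/await instead of the .end() callback, since superagent requests can be awaited directly:

// a minimal sketch: the same request written with async/await
const superagent = require('superagent');

async function fetchBaiduNewsPage() {
  try {
    // superagent requests are thenable, so they can be awaited
    const res = await superagent.get('http://news.baidu.com/');
    return res.text; // the HTML of the page
  } catch (err) {
    console.log(`Fetching failed - ${err}`);
    return null;
  }
}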

4. Return the captured data to the front-end browser

const app = express(); instantiates an express object, app. app.get('', async () => {}) takes two arguments: the first is a route path of type String, representing the path of the Ajax request; the second is a Function whose code is executed when that path is requested.

/**
 * [description] - routes
 */
// When a GET request hits http://localhost:3000/, return the captured hot news
app.get('/', async (req, res, next) => {
  res.send(hotNews);
});

Run node index.js in the project root directory baiduNews in DOS to start the project. Then open your browser, visit http://localhost:3000, and you will see the captured data returned to the front-end page. Note: since my Chrome has the JSONView extension installed, the returned data is automatically formatted into structured JSON for easy viewing.

OK!! With that, a simple crawler for Baidu "hot news" is done!!

To summarize, the steps are simple:

  1. Use express to start a simple HTTP service
  2. Analyze the DOM structure of the target page and find the DOM elements containing the information to capture
  3. Use superagent to request the target page
  4. Use cheerio to select the page elements and extract the target data
  5. Return the data to the front-end browser
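
Putting the snippets above together, a minimal index.js for the "hot news" part looks roughly like this (nothing new here, just the pieces from above assembled in one place):

const express = require('express');
const superagent = require('superagent');
const cheerio = require('cheerio');

const app = express();
let hotNews = []; // hot news

// extract the "hot news" from the requested page
let getHotNews = (res) => {
  let news = [];
  let $ = cheerio.load(res.text);
  $('div#pane-news ul li a').each((idx, ele) => {
    news.push({
      title: $(ele).text(),     // news title
      href: $(ele).attr('href') // link to the news page
    });
  });
  return news;
};

// request the Baidu News homepage
superagent.get('http://news.baidu.com/').end((err, res) => {
  if (err) {
    console.log(`Hot news fetching failed - ${err}`);
  } else {
    hotNews = getHotNews(res);
  }
});

// return the captured data to the front-end browser
app.get('/', async (req, res, next) => {
  res.send(hotNews);
});

app.listen(3000, function () {
  console.log('Your App is running at http://localhost:3000');
});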

Now, moving on to our goal of capturing the "local news" data (we will hit an interesting problem along the way), we naturally think of using the same approach. 1. Analyze the DOM structure of the "local news" part of the page, as shown below:

Press F12 to open the console and inspect the "local news" DOM elements. We find that "local news" consists of two main parts: the news on the left and the "news & information" on the right. All of this target data sits in the div with id #local_news. The left-hand news data is in the a tags inside the li tags under the ul with id #localnews-focus, and includes the news title and the page link. The "news & information" data is in the a tags inside the li tags under the ul inside the div with id #localnews-zixun, and also includes the news title and the page link.

2. OK! Having analyzed the DOM structure and located the data, we define a getLocalNews() function to crawl the data, just as we did for the "hot news".

/**
 * [description] - extract the "local news" from the page
 */
let getLocalNews = (res) => {
  let localNews = [];
  let $ = cheerio.load(res.text);
  // left-hand news under ul#localnews-focus
  $('ul#localnews-focus li a').each((idx, ele) => {
    let news = {
      title: $(ele).text(),
      href: $(ele).attr('href'),
    };
    localNews.push(news);
  });
  // right-hand "news & information" under div#localnews-zixun
  $('div#localnews-zixun ul li a').each((index, item) => {
    let news = {
      title: $(item).text(),
      href: $(item).attr('href')
    };
    localNews.push(news);
  });
  return localNews;
};

Accordingly, after requesting the page in superagent.get(), we also need to call getLocalNews() to extract the local news data. The superagent.get() call becomes:

superagent.get('http://news.baidu.com/').end((err, res) => {
  if (err) {
    // the request failed or errored
    console.log(`Hot news fetching failed - ${err}`);
  } else {
    // the request succeeded, extract the hot news and local news from the page
    hotNews = getHotNews(res);
    localNews = getLocalNews(res);
  }
});

We also need to return the data to the front-end browser in the app.get() route. The app.get() route code becomes:

/**
 * [description] - routes
 */
// When a GET request hits http://localhost:3000/, return both hot news and local news
app.get('/', async (req, res, next) => {
  res.send({
    hotNews: hotNews,
    localNews: localNews
  });
});

Coding complete, how exciting!! Start the project in DOS and visit http://localhost:3000 in a browser.

Then something embarrassing happens!! Only the hot news is returned, while the local news is an empty array []. I checked the code and couldn't find anything wrong, so why does it keep returning an empty array? After some digging, I finally found where the problem is!

An interesting problem

Let's see what the res returned by superagent.get('http://news.baidu.com/').end((err, res) => {}) actually contains:

// define a global variable pageRes
let pageRes = {};

// store the res from superagent.get() in pageRes
superagent.get('http://news.baidu.com/').end((err, res) => {
  if (err) {
    console.log(`Hot news fetching failed - ${err}`);
  } else {
    // the request succeeded
    // hotNews = getHotNews(res)
    // localNews = getLocalNews(res)
    pageRes = res;
  }
});

// return pageRes to the front-end browser so we can inspect it
app.get('/', async (req, res, next) => {
  res.send({
    // hotNews: hotNews,
    // localNews: localNews,
    pageRes: pageRes
  });
});

Visit http://localhost:3000, and the following information is displayed:

As you can see, the text field of the return value is a string containing the HTML code of the whole page. To make it easier to look at, we can return the value of the text field directly to the front-end browser, so that we can see the page as the browser renders it.

Modify the value returned to the front-end browser:

app.get('/', async (req, res, next) => {
  res.send(pageRes.text);
});

Visit http://localhost:3000, and the following information is displayed:

After inspecting the elements, we find that the DOM element from which we were trying to capture the target data is empty; there is no data in it! Now everything becomes clear: when we request the Baidu News homepage with superagent.get(), the page content contained in res has not yet generated the "local news" data we want, and the DOM nodes are empty, hence the earlier result: the returned data is always an empty array [].

In the Network tab of the console we find that the page requests an interface like this: http://localhost:3000/widget?id=LocalNews&ajax=json&t=1526295667917, with status 404. This must be Baidu News' interface for fetching "local news", and now everything makes sense! "Local news" is obtained by dynamically requesting this interface after the page has loaded, so when the page we fetched with superagent.get() requests this interface, the hostname part of the interface URL becomes our local address, and since no such interface exists on the local machine we get a 404 and cannot request the data.

Now that we have found the cause, let's figure out how to fix it!! There are two options:

  1. Use superagent to request the correct, real Baidu "local news" interface directly, then return the data to the front-end browser (a rough sketch of this idea follows the list).
  2. Use a third-party npm package to simulate a browser visiting the Baidu News homepage, and once the "local news" has loaded in that simulated browser, capture the data and return it to the front-end browser.
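
To give a feel for option 1, here is a rough sketch. Note that the interface URL below is only an assumption, reconstructed from the 404 request seen in the console with the hostname corrected to news.baidu.com; the real parameters and response format may well differ, so treat it purely as an illustration:

// rough sketch of option 1: request the "local news" interface directly
// NOTE: the URL and its parameters are assumptions based on the 404 request above
superagent
  .get('http://news.baidu.com/widget?id=LocalNews&ajax=json')
  .end((err, res) => {
    if (err) {
      console.log(`Local news interface request failed - ${err}`);
    } else {
      // inspect res.text first to see what the interface actually returns,
      // then decide how to parse it
      console.log(res.text);
    }
  });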

All right, let's try the second, more interesting option.

Use the Nightmare automated testing tool

Electron lets you create desktop applications with pure JavaScript, giving you access to Chrome's rich native APIs. You can think of it as a variant of Node.js focused on desktop applications rather than web servers. Its browser-based nature makes it very convenient for all kinds of responsive interaction.

Nightmare is a framework based on Electron for web automated testing and crawling, providing the same automation capabilities as PhantomJS. It can simulate user behavior on a page and trigger asynchronous data loading, it can also access a URL directly to fetch data like the request library, and you can set page load delays, so triggering scripts, whether manually or through simulated behavior, is a breeze.

Install dependencies

// install nightmare
yarn add nightmare

For “Local news”, continue coding…

Add the following code to index.js:

const Nightmare = require('nightmare');      // automated testing / crawling tool
const nightmare = Nightmare({ show: true }); // show: true displays the built-in browser

/**
 * [description] - fetch the "local news" on the page
 * [remark] - Baidu's "local news" is filled in by JS after the page loads,
 *            based on the visitor's IP location, so to capture it we need an
 *            automated testing tool like Nightmare to simulate a browser
 *            environment, let the JS run, and then extract the data.
 */
nightmare
  .goto('http://news.baidu.com/')
  .wait('div#local_news')
  .evaluate(() => document.querySelector('div#local_news').innerHTML)
  .then(htmlStr => {
    // get the local news data
    localNews = getLocalNews(htmlStr);
  })
  .catch(error => {
    console.log(`Local news fetching failed - ${error}`);
  });

Change the getLocalNews() function to:

/**
 * [description] - extract the "local news" from the HTML string
 */
let getLocalNews = (htmlStr) => {
  let localNews = [];
  let $ = cheerio.load(htmlStr);
  // left-hand news under ul#localnews-focus
  $('ul#localnews-focus li a').each((idx, ele) => {
    let news = {
      title: $(ele).text(),
      href: $(ele).attr('href'),
    };
    localNews.push(news);
  });
  // right-hand "news & information" under div#localnews-zixun
  $('div#localnews-zixun ul li a').each((index, item) => {
    let news = {
      title: $(item).text(),
      href: $(item).attr('href')
    };
    localNews.push(news);
  });
  return localNews;
};

Change app.get(‘/’) route to:

/**
 * [description] - routes
 */
// When a GET request hits http://localhost:3000/, return both hot news and local news
app.get('/', async (req, res, next) => {
  res.send({
    hotNews: hotNews,
    localNews: localNews
  });
});

Now restart the project from the DOS command line and visit http://localhost:3000 in the browser to see what the page shows and check whether the "local news" data has been captured!

At this point, a simple but complete crawler that captures the "hot news" and "local news" from the Baidu News page is done!!

Finally, the overall idea is as follows:

  1. Use express to start a simple HTTP service
  2. Analyze the DOM structure of the target page and find the DOM elements containing the information to capture
  3. Use superagent to request the target page
  4. For dynamic pages (pages that need to run JS after loading, or that request an interface for their data), use Nightmare to simulate browser access
  5. Use cheerio to select the page elements and extract the target data

The complete code

GitHub address: complete code

Later on, I'd like to go a step further and crawl some of the nicer-looking images on certain sites (just kidding), which will involve concurrency control and some anti-anti-crawler strategies, and use crawlers to scrape sites that require logging in and entering a captcha. Everyone is welcome to follow along and exchange ideas and corrections.

I want to say

Thanks again for your likes, follows and comments; thank you all for your support! I consider myself a semi-literary programmer who loves writing, music and coding. I've been thinking for a while about writing both technical and more literary articles. Although my foundation isn't great, I feel that writing, whether technical or literary, pushes me to think, learn and communicate. After all, busy as every day is, you still have to live a life of your own. So, if I write some good articles in the future, I will share them with you enthusiastically! Thanks again for your support!