Java Web crawler (1)

This article continues with the crawler framework NetDiscovery: how to make repeated requests, and how to drive crawlers with the crawler container engine.

1) Examples of repeated requests

Since the second half of 2017, terms such as Bitcoin, digital currency, virtual currency, blockchain, and decentralization have appeared in the media so frequently that they have been hard to miss.

(This article does not examine whether digital currency is the future.)

Risk tip: speculation carries risk; enter the market with caution.

I found an aggregated-information platform for digital currencies, which periodically updates the price of each currency. If you want a program that captures this data, it needs to be able to fetch the data repeatedly.

Objective: to obtain the price information of the digital currency EOS from this page at regular intervals.

2) Code implementation

  • Parsing class
package com.sinkinka.parser;

import com.cv4j.netdiscovery.core.domain.Page;
import com.cv4j.netdiscovery.core.parser.Parser;

public class EosParser implements Parser {

   @Override
   public void process(Page page) {

        // XPath pointing at the element that holds the current price
        String xpathStr = "//div[@class='coinprice']/text()";
        String marketPrice = page.getHtml().xpath(xpathStr).get();
        System.out.println("marketPrice=" + marketPrice);
    }
}
  • Execution method
package com.sinkinka;

import com.cv4j.netdiscovery.core.Spider;
import com.cv4j.netdiscovery.extra.downloader.httpclient.HttpClientDownloader;
import com.sinkinka.parser.EosParser;

public class EosSpider {

    public static void main(String[] args) {

        String eosUrl = "https://www.feixiaohao.com/currencies/eos/";

        long periodTime = 1000 * 600; // repeat interval: 600 seconds (10 minutes), in milliseconds

        Spider.create()
                .name("EOS")
                .repeatRequest(periodTime, eosUrl)
                .initialDelay(periodTime) // note: initialDelay should be set to a value >= periodTime
                .parser(new EosParser())
                .downloader(new HttpClientDownloader())
                .run();
    }
}
  • The execution result: the console prints a marketPrice=... line once per period. (A standalone sketch for sanity-checking the XPath follows below.)
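
Before relying on the XPath inside the crawler, you can sanity-check it on its own. Below is a minimal, standalone sketch using Jsoup plus the Xsoup XPath library; this pairing is chosen purely for illustration and is not necessarily what NetDiscovery uses internally, and the HTML string and price value are made up for the example.

package com.sinkinka.parser;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import us.codecraft.xsoup.Xsoup;

public class XpathCheck {

    public static void main(String[] args) {
        // Illustrative stand-in for the real page, which is far more complex
        String html = "<div class=\"coinprice\">$10.00</div>";
        Document doc = Jsoup.parse(html);

        // The same XPath expression as in EosParser
        String price = Xsoup.compile("//div[@class='coinprice']/text()")
                .evaluate(doc)
                .get();

        System.out.println("marketPrice=" + price); // prints marketPrice=$10.00
    }
}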

3) Crawler container engine

There are hundreds, if not thousands, of digital currencies, and each currency's information lives on a separate page. What if you want to obtain information about multiple digital currencies at the same time?

Relying on the framework, one implementation approach is: define a crawler for each digital currency, put the crawlers into a container, and let the crawler engine drive them.

Code examples:

package com.sinkinka;

import com.cv4j.netdiscovery.core.Spider;
import com.cv4j.netdiscovery.core.SpiderEngine;
import com.cv4j.netdiscovery.extra.downloader.httpclient.HttpClientDownloader;
import com.sinkinka.parser.EosParser;

public class TestSpiderEngine {

    public static void main(String[] args) {

        // Create the crawler engine (the container)
        SpiderEngine engine = SpiderEngine.create();

        String eosUrl = "https://www.feixiaohao.com/currencies/eos/";
        long periodTime1 = 1000 * 5;

        Spider spider1 = Spider.create()
                .name("EOS")
                .repeatRequest(periodTime1, eosUrl)
                .parser(new EosParser())
                .downloader(new HttpClientDownloader())
                .initialDelay(periodTime1);

        engine.addSpider(spider1);
        // engine.addSpider(spider2);
        // ...

        engine.httpd(8088);     // expose an HTTP port for querying spider status
        engine.runWithRepeat();
    }
}

Interface to access container state:

The interface address: http://127.0.0.1:8088/netdiscovery/spiders

What is returned:

{
    "code": 200,
    "data": [
        {
            "downloaderType": "HttpClientDownloader",  // which downloader is used
            "leftRequestSize": 0,                      // number of requests left in the queue
            "queueType": "DefaultQueue",               // queue type: JDK (DefaultQueue), Redis, or Kafka
            "spiderName": "EOS",                       // the spider's name, unique within the engine
            "spiderStatus": 1,                         // 1: running, 2: paused, 4: stopped
            "totalRequestSize": 1                      // total requests enqueued; totalRequestSize minus leftRequestSize equals the repeated requests completed so far
        }
    ],
    "message": "success"
}
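
To consume this status from a program instead of a browser, a plain HTTP GET is all that is needed. Below is a minimal sketch using java.net.HttpURLConnection, assuming the engine from the example above is running locally on port 8088:

package com.sinkinka;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SpiderStatusClient {

    public static void main(String[] args) throws Exception {
        URL url = new URL("http://127.0.0.1:8088/netdiscovery/spiders");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // Read the JSON response shown above into a string
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line);
            }
        } finally {
            conn.disconnect();
        }

        System.out.println(body);
    }
}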

4) Summary

This article briefly introduced NetDiscovery's support for repeated requests and the crawler container engine. This is where a framework shows its value: implementing the same behavior yourself would take considerably more code. The crawler engine has many more features, so stay tuned.


Today is Valentine's Day in the West; I wish everyone a happy Valentine's Day!

I wish you all good health, family harmony, and success at work!
