NetDiscovery is a crawler framework based on Vert.x and RxJava 2. I recently added two modules to it: a Selenium module and a DSL module.

Selenium module

This module was added so that the crawler can drive a real browser, simulating human behavior to complete its crawling tasks.

Selenium is a tool for testing web applications. Selenium tests run directly in the browser, just as real users do. Supported browsers include Internet Explorer (7, 8, 9, 10, 11), Mozilla Firefox, Safari, Google Chrome, Opera, and more. Its main capabilities include: testing browser compatibility (verifying that your application works well on different browsers and operating systems), testing system functionality (creating regression tests to verify software functionality and user requirements), and automatically recording user actions to generate test scripts in .NET, Java, Perl, and other languages.

Selenium includes a set of tools and APIs: Selenium IDE, Selenium RC, Selenium WebDriver, and Selenium Grid.

Selenium WebDriver is a tool that supports browser automation. It includes a set of libraries for different languages and “drivers” that automate actions on the browser.

1.1 For Multiple Browsers

Thanks to Selenium WebDriver, the Selenium module works with multiple browsers. Chrome, Firefox, IE, and PhantomJS (a headless, scriptable WebKit browser engine) are currently supported in this module.

package com.cv4j.netdiscovery.selenium;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.ie.InternetExplorerDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.remote.CapabilityType;
import org.openqa.selenium.remote.DesiredCapabilities;

/**
 * Created by tony on 2018/1/28.
 */
public enum Browser implements WebDriverInitializer {

    CHROME {
        @Override
        public WebDriver init(String path) {
            System.setProperty("webdriver.chrome.driver", path);
            return new ChromeDriver();
        }
    },
    FIREFOX {
        @Override
        public WebDriver init(String path) {
            System.setProperty("webdriver.gecko.driver", path);
            return new FirefoxDriver();
        }
    },
    IE {
        @Override
        public WebDriver init(String path) {
            System.setProperty("webdriver.ie.driver", path);
            return new InternetExplorerDriver();
        }
    },
    PHANTOMJS {
        @Override
        public WebDriver init(String path) {

            DesiredCapabilities capabilities = new DesiredCapabilities();
            capabilities.setCapability("phantomjs.binary.path", path);
            capabilities.setCapability(CapabilityType.ACCEPT_SSL_CERTS, true);
            capabilities.setJavascriptEnabled(true);
            capabilities.setCapability("takesScreenshot", true);
            capabilities.setCapability("cssSelectorsEnabled", true);

            return new PhantomJSDriver(capabilities);
        }
    }
}

1.2 WebDriverPool

WebDriverPool exists because starting a WebDriver process each time is expensive, so the module maintains an object pool instead. I used Apache's Commons Pool component for the pooling.

package com.cv4j.netdiscovery.selenium.pool;

import org.apache.commons.pool2.impl.GenericObjectPool;
import org.openqa.selenium.WebDriver;

/**
 * Created by tony on 2018/3/9.
 */
public class WebDriverPool {

    private static GenericObjectPool<WebDriver> webDriverPool = null;

    /**
     * If you want to use WebDriverPool, you must call the init() method first.
     *
     * @param config
     */
    public static void init(WebDriverPoolConfig config) {

        webDriverPool = new GenericObjectPool<>(new WebDriverPooledFactory(config));
        webDriverPool.setMaxTotal(Integer.parseInt(System.getProperty(
                "webdriver.pool.max.total", "20"))); // The maximum number of pooled objects
        webDriverPool.setMinIdle(Integer.parseInt(System.getProperty(
                "webdriver.pool.min.idle", "1")));   // The minimum number of idle objects
        webDriverPool.setMaxIdle(Integer.parseInt(System.getProperty(
                "webdriver.pool.max.idle", "20"))); // The maximum number of idle objects allowed

        try {
            webDriverPool.preparePool();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static WebDriver borrowOne() {

        if (webDriverPool != null) {

            try {
                return webDriverPool.borrowObject();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
        return null;
    }

    public static void returnOne(WebDriver driver) {

        if (webDriverPool != null) {
            webDriverPool.returnObject(driver);
        }
    }

    public static void destory() {

        if (webDriverPool != null) {
            webDriverPool.clear();
            webDriverPool.close();
        }
    }

    public static boolean hasWebDriverPool() {

        return webDriverPool != null;
    }
}
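The pooling idea itself is small: pre-create objects, hand them out on borrow, take them back on return. A minimal stdlib-only sketch of that shape (a hypothetical SimplePool, not the actual Commons Pool API):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Hypothetical, stripped-down pool illustrating what Commons Pool provides:
// objects are created once, then borrowed and returned instead of being
// recreated for every request.
public class SimplePool<T> {

    private final BlockingQueue<T> idle;

    public SimplePool(Supplier<T> factory, int size) {
        this.idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.offer(factory.get()); // pre-fill, analogous to preparePool()
        }
    }

    public T borrowOne() throws InterruptedException {
        return idle.take(); // blocks until an instance is free
    }

    public void returnOne(T obj) {
        idle.offer(obj);
    }

    public int idleCount() {
        return idle.size();
    }
}
```

With a WebDriver factory plugged in, borrowOne() and returnOne() would bracket each download, which is exactly the usage pattern SeleniumDownloader relies on.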

1.3 SeleniumAction

Selenium simulates browser behavior such as clicking, scrolling, navigating back, and so on. The SeleniumAction class abstracted here represents one such simulated event.

package com.cv4j.netdiscovery.selenium.action;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;

/**
 * Created by tony on 2018/3/3.
 */
public abstract class SeleniumAction {

    public abstract SeleniumAction perform(WebDriver driver);

    public SeleniumAction doIt(WebDriver driver) {

        return perform(driver);
    }

    public static SeleniumAction clickOn(By by) {
        return new ClickOn(by);
    }

    public static SeleniumAction getUrl(String url) {
        return new GetURL(url);
    }

    public static SeleniumAction goBack() {
        return new GoBack();
    }

    public static SeleniumAction closeTabs() {
        return new CloseTab();
    }
}
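SeleniumAction is essentially the command pattern: each object encapsulates one browser event, and a list of them can be replayed in order. A stdlib-only sketch, with a hypothetical FakeDriver standing in for WebDriver so it runs without Selenium:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for WebDriver, just to make the sketch runnable.
class FakeDriver {
    final List<String> log = new ArrayList<>();
}

// Mirrors the role of SeleniumAction: one simulated event per object.
abstract class Action {

    abstract Action perform(FakeDriver driver);

    static Action getUrl(String url) {
        return new Action() {
            Action perform(FakeDriver d) {
                d.log.add("get " + url);
                return null;
            }
        };
    }

    static Action clickOn(String selector) {
        return new Action() {
            Action perform(FakeDriver d) {
                d.log.add("click " + selector);
                return null;
            }
        };
    }
}
```

Running a list of these actions in order leaves the driver log in the same order, which is how SeleniumDownloader later replays its action list.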

1.4 SeleniumDownloader

A Downloader is the component of the crawler framework that performs network requests; for example, Vert.x's WebClient and OkHttp3 can both serve as Downloader implementations. If Selenium is required, SeleniumDownloader must be used to perform the request.

The SeleniumDownloader class can take one or more SeleniumActions. If there are multiple actions, they are executed sequentially.

Most importantly, SeleniumDownloader borrows its WebDriver from the WebDriverPool and returns it to the pool when it is closed.

package com.cv4j.netdiscovery.selenium.downloader;

import com.cv4j.netdiscovery.core.config.Constant;
import com.cv4j.netdiscovery.core.domain.Request;
import com.cv4j.netdiscovery.core.domain.Response;
import com.cv4j.netdiscovery.core.downloader.Downloader;
import com.cv4j.netdiscovery.selenium.action.SeleniumAction;
import com.cv4j.netdiscovery.selenium.pool.WebDriverPool;
import com.safframework.tony.common.utils.Preconditions;
import io.reactivex.Maybe;
import io.reactivex.MaybeEmitter;
import io.reactivex.MaybeOnSubscribe;
import io.reactivex.functions.Function;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;

import java.util.LinkedList;
import java.util.List;

/**
 * Created by tony on 2018/1/28.
 */
public class SeleniumDownloader implements Downloader {

    private WebDriver webDriver;
    private List<SeleniumAction> actions = new LinkedList<>();

    public SeleniumDownloader() {

        this.webDriver = WebDriverPool.borrowOne(); // Get webDriver from connection pool
    }

    public SeleniumDownloader(SeleniumAction action) {

        this.webDriver = WebDriverPool.borrowOne(); // Get webDriver from connection pool
        this.actions.add(action);
    }

    public SeleniumDownloader(List<SeleniumAction> actions) {

        this.webDriver = WebDriverPool.borrowOne(); // Get webDriver from connection pool
        this.actions.addAll(actions);
    }

    @Override
    public Maybe<Response> download(Request request) {

        return Maybe.create(new MaybeOnSubscribe<String>(){

            @Override
            public void subscribe(MaybeEmitter emitter) throws Exception {

                if (webDriver != null) {
                    webDriver.get(request.getUrl());

                    if (Preconditions.isNotBlank(actions)) {

                        actions.forEach(
                                action-> action.perform(webDriver)
                        );
                    }

                    emitter.onSuccess(webDriver.getPageSource());
                }
            }
        }).map(new Function<String, Response>() {

            @Override
            public Response apply(String html) throws Exception {

                Response response = new Response();
                response.setContent(html.getBytes());
                response.setStatusCode(Constant.OK_STATUS_CODE);
                response.setContentType(getContentType(webDriver));
                return response;
            }
        });
    }

    /**
     * @param webDriver
     * @return
     */
    private String getContentType(final WebDriver webDriver) {

        if (webDriver instanceof JavascriptExecutor) {

            final JavascriptExecutor jsExecutor = (JavascriptExecutor) webDriver;
            // TODO: document.contentType may not be available in every browser.
            final Object ret = jsExecutor
                    .executeScript("return document.contentType;");
            if (ret != null) {
                return ret.toString();
            }
        }
        return "text/html";
    }

    @Override
    public void close() {

        if (webDriver != null) {
            WebDriverPool.returnOne(webDriver); // Return webDriver to the pool
        }
    }
}
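Stripped of Selenium and RxJava, the Downloader contract reduces to "given a URL, eventually produce a Response". A sketch using CompletableFuture in place of Maybe, with hypothetical minimal types rather than the framework's actual classes:

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;

// Hypothetical minimal Response, standing in for the framework's class.
class SimpleResponse {
    byte[] content;
    int statusCode;
}

// CompletableFuture plays the role of RxJava's Maybe<Response> here.
interface SimpleDownloader {
    CompletableFuture<SimpleResponse> download(String url);
    void close();
}

// A stub implementation; a Selenium-backed one would drive a browser instead.
class StubDownloader implements SimpleDownloader {

    @Override
    public CompletableFuture<SimpleResponse> download(String url) {
        SimpleResponse r = new SimpleResponse();
        r.content = ("<html>stub for " + url + "</html>").getBytes(StandardCharsets.UTF_8);
        r.statusCode = 200;
        return CompletableFuture.completedFuture(r);
    }

    @Override
    public void close() {
        // A Selenium-backed downloader would return its WebDriver to the pool here.
    }
}
```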

1.5 Some useful tool classes

In addition, the Selenium module has a utility class. It contains browser operations such as scrollTo, scrollBy, clickElement, and so on.

There are also features to take screenshots of the current web page, or to capture an area.

    public static void taskScreenShot(WebDriver driver,String pathName){

        // OutputType.file is passed to the getScreenshotAs() method, which means that the captured screen is returned as a FILE.
        File srcFile = ((TakesScreenshot)driver).getScreenshotAs(OutputType.FILE);
        // Use the copyFile() method of the IOUtils utility class to save the file object returned by getScreenshotAs().

        try {
            IOUtils.copyFile(srcFile, new File(pathName));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void taskScreenShot(WebDriver driver, WebElement element, String pathName) {

        // OutputType.file is passed to the getScreenshotAs() method, which means that the captured screen is returned as a FILE.
        File srcFile = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
        // Use the copyFile() method of the IOUtils utility class to save the file object returned by getScreenshotAs().

        try {
            // Get the position object of the element in the frame
            Point p = element.getLocation();
            // Get the width and height of the element
            int width = element.getSize().getWidth();
            int height = element.getSize().getHeight();
            // Rectangle image object
            Rectangle rect = new Rectangle(width, height);
            BufferedImage img = ImageIO.read(srcFile);
            BufferedImage dest = img.getSubimage(p.getX(), p.getY(), rect.width, rect.height);
            ImageIO.write(dest, "png", srcFile);
            IOUtils.copyFile(srcFile, new File(pathName));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Take a screenshot of a region
     *
     * @param driver
     * @param x
     * @param y
     * @param width
     * @param height
     * @param pathName
     */
    public static void taskScreenShot(WebDriver driver,int x,int y,int width,int height,String pathName) {

        // OutputType.file is passed to the getScreenshotAs() method, which means that the captured screen is returned as a FILE.
        File srcFile = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
        // Use the copyFile() method of the IOUtils utility class to save the file object returned by getScreenshotAs().

        try {
            // Rectangle image object
            Rectangle rect = new Rectangle(width, height);
            BufferedImage img = ImageIO.read(srcFile);
            BufferedImage dest = img.getSubimage(x, y, rect.width, rect.height);
            ImageIO.write(dest, "png", srcFile);
            IOUtils.copyFile(srcFile, new File(pathName));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
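The region capture above ultimately relies on BufferedImage.getSubimage to cut a rectangle out of the full screenshot. That step in isolation, stdlib-only:

```java
import java.awt.image.BufferedImage;

public class CropDemo {

    // Crop a rectangle out of a source image, as the region-screenshot helper
    // does after loading the screenshot file with ImageIO.read().
    static BufferedImage crop(BufferedImage src, int x, int y, int width, int height) {
        // getSubimage shares pixel data with the source; write it out with
        // ImageIO.write(...) if a standalone file is needed.
        return src.getSubimage(x, y, width, height);
    }

    public static void main(String[] args) {
        BufferedImage page = new BufferedImage(800, 600, BufferedImage.TYPE_INT_RGB);
        BufferedImage region = crop(page, 100, 50, 200, 150);
        System.out.println(region.getWidth() + "x" + region.getHeight()); // prints 200x150
    }
}
```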

1.6 Using the Selenium module example

Search for my new book, "RxJava 2.x Combat", on JD.com, sort the results by sales volume, and then fetch the information of the top ten products.

1.6.1 Create multiple Actions and execute them in sequence.

Step 1: Open your browser and enter your keywords

package com.cv4j.netdiscovery.example.jd;

import com.cv4j.netdiscovery.selenium.Utils;
import com.cv4j.netdiscovery.selenium.action.SeleniumAction;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

/**
 * Created by tony on 2018/6/12.
 */
public class BrowserAction extends SeleniumAction{

    @Override
    public SeleniumAction perform(WebDriver driver) {

        try {
            String searchText = "RxJava 2.x Practical";
            String searchInput = "//*[@id=\"keyword\"]";
            WebElement userInput = Utils.getWebElementByXpath(driver, searchInput);
            userInput.sendKeys(searchText);
            Thread.sleep(3000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        return null;
    }
}

Step 2: click the search button to run the search

package com.cv4j.netdiscovery.example.jd;

import com.cv4j.netdiscovery.selenium.Utils;
import com.cv4j.netdiscovery.selenium.action.SeleniumAction;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;

/**
 * Created by tony on 2018/6/12.
 */
public class SearchAction extends SeleniumAction {

    @Override
    public SeleniumAction perform(WebDriver driver) {

        try {
            String searchBtn = "/html/body/div[2]/form/input[4]";
            Utils.clickElement(driver, By.xpath(searchBtn));
            Thread.sleep(3000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        return null;
    }
}

Step 3: click "Sales" to sort the search results by sales volume

package com.cv4j.netdiscovery.example.jd;

import com.cv4j.netdiscovery.selenium.Utils;
import com.cv4j.netdiscovery.selenium.action.SeleniumAction;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;

/**
 * Created by tony on 2018/6/12.
 */
public class SortAction extends SeleniumAction{

    @Override
    public SeleniumAction perform(WebDriver driver) {

        try {
            String saleSortBtn = "//*[@id=\"J_filter\"]/div[1]/div[1]/a[2]";
            Utils.clickElement(driver, By.xpath(saleSortBtn));
            Thread.sleep(3000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        return null;
    }
}

1.6.2 Creating the Parsing class PriceParser

After the actions above have executed, parse the returned HTML and pass the parsed product information on to the Pipeline that follows.

package com.cv4j.netdiscovery.example.jd;

import com.cv4j.netdiscovery.core.domain.Page;
import com.cv4j.netdiscovery.core.parser.Parser;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

/**
 * Created by tony on 2018/6/12.
 */
public class PriceParser implements Parser{

    @Override
    public void process(Page page) {

        String pageHtml = page.getHtml().toString();
        Document document = Jsoup.parse(pageHtml);
        Elements elements = document.select("div[id=J_goodsList] li[class=gl-item]");
        page.getResultItems().put("goods_elements", elements);
    }
}

1.6.3 Creating the Pipeline class PricePipeline

Used to print information about the top ten best-selling items.

package com.cv4j.netdiscovery.example.jd;

import com.cv4j.netdiscovery.core.domain.ResultItems;
import com.cv4j.netdiscovery.core.pipeline.Pipeline;

import lombok.extern.slf4j.Slf4j;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Created by tony on 2018/6/12.
 */
@Slf4j
public class PricePipeline implements Pipeline {

    @Override
    public void process(ResultItems resultItems) {

        Elements elements = resultItems.get("goods_elements");
        if (elements != null && elements.size() >= 10) {
            for (int i = 0; i < 10; i++) {
                Element element = elements.get(i);
                String storeName = element.select("div[class=p-shop] a").first().text();
                String goodsName = element.select("div[class=p-name p-name-type-2] a em").first().text();
                String goodsPrice = element.select("div[class=p-price] i").first().text();
                log.info(storeName + " " + goodsName + "  ¥" + goodsPrice);
            }
        }
    }
}

1.6.4 The complete JDSpider

In this case, multiple actions are executed in sequence. SeleniumDownloader is used.

package com.cv4j.netdiscovery.example.jd;

import com.cv4j.netdiscovery.core.Spider;
import com.cv4j.netdiscovery.selenium.Browser;
import com.cv4j.netdiscovery.selenium.action.SeleniumAction;
import com.cv4j.netdiscovery.selenium.downloader.SeleniumDownloader;
import com.cv4j.netdiscovery.selenium.pool.WebDriverPool;
import com.cv4j.netdiscovery.selenium.pool.WebDriverPoolConfig;

import java.util.ArrayList;
import java.util.List;

/**
 * Created by tony on 2018/6/12.
 */
public class JDSpider {

    public static void main(String[] args) {
        
        WebDriverPoolConfig config = new WebDriverPoolConfig("example/chromedriver", Browser.CHROME); // Set the driver path and browser type; the driver must match the operating system
        WebDriverPool.init(config); // WebDriverPool must be initialized with init() before use

        List<SeleniumAction> actions = new ArrayList<>();
        actions.add(new BrowserAction());
        actions.add(new SearchAction());
        actions.add(new SortAction());

        SeleniumDownloader seleniumDownloader = new SeleniumDownloader(actions);

        String url = "https://search.jd.com/";

        Spider.create()
                .name("jd")
                .url(url)
                .downloader(seleniumDownloader)
                .parser(new PriceParser())
                .pipeline(new PricePipeline())
                .run();
    }
}

DSL module

This module is written in Kotlin and uses Kotlin's language features to build a DSL.

package com.cv4j.netdiscovery.dsl

import com.cv4j.netdiscovery.core.Spider
import com.cv4j.netdiscovery.core.downloader.Downloader
import com.cv4j.netdiscovery.core.parser.Parser
import com.cv4j.netdiscovery.core.pipeline.Pipeline
import com.cv4j.netdiscovery.core.queue.Queue

/**
 * Created by tony on 2018/5/27.
 */
class SpiderWrapper {

    var name: String? = null

    var parser: Parser? = null

    var queue: Queue? = null

    var downloader: Downloader? = null

    var pipelines:Set<Pipeline>? = null

    var urls:List<String>? = null

}

fun spider(init: SpiderWrapper.() -> Unit): Spider {

    val wrap = SpiderWrapper()

    wrap.init()

    return configSpider(wrap)
}

private fun configSpider(wrap: SpiderWrapper): Spider {

    val spider = Spider.create(wrap.queue)
            .name(wrap.name)

    wrap.urls?.let {
        spider.url(it)
    }

    spider.downloader(wrap.downloader)
            .parser(wrap.parser)

    wrap.pipelines?.forEach {
        spider.pipeline(it)
    }

    return spider
}

For example, using the DSL to create and run a crawler:

        val spider = spider {

            name = "tony"

            urls = listOf("http://www.163.com/", "https://www.baidu.com/")

            pipelines = setOf(ConsolePipeline())
        }

        spider.run()

It is equivalent to the Java code below

        Spider.create().name("tony")
                .url("http://www.163.com/", "https://www.baidu.com/")
                .pipeline(new ConsolePipeline())
                .run();
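The mechanism behind the spider { ... } block — a function that accepts a lambda, lets it mutate a wrapper, then builds the result — can be approximated in Java with a Consumer. A sketch with hypothetical names, not the framework's API:

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical wrapper mirroring SpiderWrapper: the lambda fills it in,
// then the factory method turns it into the configured object.
class SpiderConfig {
    String name;
    List<String> urls;
}

public class DslDemo {

    // Java counterpart of Kotlin's fun spider(init: SpiderWrapper.() -> Unit)
    static SpiderConfig spider(Consumer<SpiderConfig> init) {
        SpiderConfig config = new SpiderConfig();
        init.accept(config); // equivalent of wrap.init() in the Kotlin version
        return config;
    }

    public static void main(String[] args) {
        SpiderConfig s = spider(c -> {
            c.name = "tony";
            c.urls = List.of("http://www.163.com/", "https://www.baidu.com/");
        });
        System.out.println(s.name + " " + s.urls.size()); // prints tony 2
    }
}
```

Kotlin's receiver lambda makes the `this.` implicit, which is what gives the DSL its declarative look; the Java version has to thread the config object through explicitly.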

A DSL can simplify code, improve development efficiency, and model a problem more abstractly. On the other hand, DSLs have drawbacks: they are limited in what they can express and are not Turing-complete.

Conclusion

The framework's GitHub address: https://github.com/fengzhizi715/NetDiscovery

Updates have been less frequent recently because I am busy with projects at work, but I will try to ensure the quality of each release.

Later versions are mainly intended to integrate with cv4j, a real-time image processing framework, to further improve the crawler framework.

