Background

Some time ago I developed a function that fetched a page with HttpClient and retrieved its full HTML content, then displayed the crawled content on our own web pages. After a while, however, the source page was updated so that its images are loaded dynamically: HttpClient can no longer obtain the complete page content, and the lazy-loaded images fail to display correctly.

Solution

Approach

Since the static page can no longer be grabbed directly, we can try grabbing it dynamically. How do we crawl a dynamic page? This article uses Selenium. Selenium simulates a real browser opening and browsing a web page; once the page has been scrolled to the bottom and everything has loaded, we can grab it and obtain the complete content. Selenium WebDriver can drive different browsers. This article uses Chrome; for the other supported browsers, check out the WebDriver implementation classes.
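For example, switching browsers is just a matter of constructing a different WebDriver implementation. A minimal sketch, assuming the matching native driver binaries (chromedriver, geckodriver) are installed locally:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class DriverChoice {
    // Each supported browser has its own WebDriver implementation class;
    // the matching native driver binary must be available on the machine.
    public static WebDriver newDriver(String browser) {
        if ("firefox".equals(browser)) {
            return new FirefoxDriver(); // needs geckodriver
        }
        return new ChromeDriver();      // needs chromedriver
    }
}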

Coding

Add the Selenium library to the project as a Maven dependency:

	<dependency>
		<groupId>org.seleniumhq.selenium</groupId>
		<artifactId>selenium-java</artifactId>
		<version>3.141.59</version>
	</dependency>

Because we want to drive Chrome, we also need to download ChromeDriver. With that in place, the basics are done. Without further ado, let’s get right to the code.

public String selenium(String url) throws InterruptedException {
        // Set the location of the chromedriver executable
        System.getProperties().setProperty("webdriver.chrome.driver", "D:/lingy/chromedriver_win32/chromedriver.exe");
        // Run a headless browser so that no browser window pops up
        ChromeOptions chromeOptions = new ChromeOptions();
        chromeOptions.addArguments("--headless");
        Long scrollSize = 1000L;
        WebDriver webDriver = new ChromeDriver(chromeOptions);
        webDriver.get(url);
        // Set the width and height of the browser window
        Dimension dimension = new Dimension(1000, scrollSize.intValue());
        webDriver.manage().window().setSize(dimension);
        String html = "";
        // Get the JS executor, which can run JS code to control the page
        JavascriptExecutor driver_js = ((JavascriptExecutor) webDriver);
        // Get the page height
        Long scrollHeight = (Long) driver_js.executeScript("return document.body.scrollHeight");
        logger.info("article height is : {}", scrollHeight);
        // A fixed scroll step simulates mouse scrolling through the page;
        // the number of scrolls is derived from the total page height
        Long loopNum = scrollHeight / scrollSize;
        loopNum = loopNum + 1;
        logger.info("page need scroll times : {}", loopNum);
        for (int i = 0; i < loopNum; i++) {
            Long end = (i + 1) * scrollSize;
            // Scroll down one step at a time, pausing 1 second so the images can load
            driver_js.executeScript("window.scrollTo(0, " + end + ")");
            Thread.sleep(1000);
        }
        // Return the entire page
        html = (String) driver_js.executeScript("return document.documentElement.outerHTML");
        webDriver.quit(); // quit() also shuts down the chromedriver process
        return html;
}
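One caveat: if anything throws midway (a timeout, an interrupted sleep), the method above never reaches webDriver.quit() and headless Chrome processes pile up. A minimal defensive sketch of the same flow, with the cleanup moved into finally:

public String seleniumSafe(String url) throws InterruptedException {
    ChromeOptions chromeOptions = new ChromeOptions();
    chromeOptions.addArguments("--headless");
    WebDriver webDriver = new ChromeDriver(chromeOptions);
    try {
        webDriver.get(url);
        JavascriptExecutor js = (JavascriptExecutor) webDriver;
        Long scrollHeight = (Long) js.executeScript("return document.body.scrollHeight");
        for (long pos = 1000; pos <= scrollHeight + 1000; pos += 1000) {
            js.executeScript("window.scrollTo(0, " + pos + ")");
            Thread.sleep(1000);
        }
        return (String) js.executeScript("return document.documentElement.outerHTML");
    } finally {
        webDriver.quit(); // always runs, even when an exception is thrown
    }
}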

By now we can obtain the complete content of the whole page. However, such raw content may not display well on our own page. Below are two problems encountered in practice.

The first problem encountered

While displaying the captured content, images failed to show because of cross-domain restrictions and hotlink protection. My first solution was to proxy the image URLs through an Nginx reverse proxy, but some source images could not be proxied because of hotlink protection or other reasons. Next I tried converting each image to Base64 and replacing the original URL in the content. That solved the display problem but introduced a new one: a Base64-encoded image is considerably larger than the original (roughly a third larger, since every 3 bytes become 4 characters), so storing and serving the page consumed a lot of space and time. In the end I chose to download the images locally and serve them through Nginx.

		// Get the page's image elements
        List<WebElement> imgList = webDriver.findElements(By.tagName("img"));
        for (WebElement img : imgList) {
            String imgSrc = img.getAttribute("src");
            logger.info("img's src is : {}", imgSrc);
            // Only handle absolute http/https URLs; lazy-load placeholders
            // such as "data:image/gif;base64,..." are skipped
            String pattern = "^((http)|(https)).*";
            boolean imgSrcPattern = !StringUtils.isEmpty(imgSrc) && Pattern.matches(pattern, imgSrc);
            if (imgSrcPattern) {
                // Alternative: inline the image as Base64 (see changImgUrlToBase64 below)
                // String strNetImageToBase64 = changImgUrlToBase64(imgSrc);
                // driver_js.executeScript("arguments[0].setAttribute(arguments[1],arguments[2])", img, "src", "data:image/png;base64," + strNetImageToBase64);
                String imgUrlToImgFile = changImgUrlToImgFile(imgSrc);
                // Replace the image src in the page via JS
                driver_js.executeScript("arguments[0].setAttribute(arguments[1],arguments[2])", img, "src", imgUrlToImgFile);
            }
        }

The key line is driver_js.executeScript("arguments[0].setAttribute(arguments[1],arguments[2])", img, "src", imgUrlToImgFile), which rewrites each image’s src attribute inside the live page.

Code to convert an image to Base64:

	private String changImgUrlToBase64(String imgSrc){
        String strNetImageToBase64 = "";
        try {
            URL imgURL = new URL(imgSrc);
            final HttpURLConnection conn = (HttpURLConnection) imgURL.openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(3000);
            InputStream is = conn.getInputStream();
            ByteArrayOutputStream data = new ByteArrayOutputStream();
            // Read the contents into memory
            final byte[] by = new byte[1024];
            int len = -1;
            while ((len = is.read(by)) != -1) {
                data.write(by, 0, len);
            }
            // Base64-encode the byte array (java.util.Base64 is the supported
            // replacement for the JDK-internal sun.misc.BASE64Encoder)
            strNetImageToBase64 = Base64.getEncoder().encodeToString(data.toByteArray());
            // Close the stream
            is.close();
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return strNetImageToBase64;
    }
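Incidentally, this Base64 route is exactly where the storage blow-up mentioned earlier comes from. A tiny standalone check (the 300 KB figure is just an illustrative assumption):

import java.util.Base64;

public class Base64Growth {
    public static void main(String[] args) {
        byte[] fakeImage = new byte[300_000]; // assume a 300 KB image
        int encodedLen = Base64.getEncoder().encodeToString(fakeImage).length();
        // Base64 emits 4 chars for every 3 input bytes: roughly 33% growth
        System.out.printf("raw=%d bytes, base64=%d chars (+%.0f%%)%n",
                fakeImage.length, encodedLen,
                100.0 * (encodedLen - fakeImage.length) / fakeImage.length);
    }
}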

Code to download an image locally:

	private static String changImgUrlToImgFile(String imgSrc){
        // Guess the file suffix from the URL (defaults to .png)
        String suffix = ".png";
        if (imgSrc.indexOf("gif") != -1) {
            suffix = ".gif";
        } else if (imgSrc.indexOf("jpg") != -1) {
            suffix = ".jpg";
        } else if (imgSrc.indexOf("jpeg") != -1) {
            suffix = ".jpeg";
        } else if (imgSrc.indexOf("png") != -1) {
            suffix = ".png";
        }
        String dir = "E:/lingy/asmImg/";
        String fileName = System.currentTimeMillis() + suffix;
        try {
            URL imgURL = new URL(imgSrc);
            final HttpURLConnection conn = (HttpURLConnection) imgURL.openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(3000);
            InputStream is = conn.getInputStream();
            File imgFile = new File(dir + fileName);
            if (!imgFile.exists()) {
                imgFile.createNewFile();
            }
            // Manual copy loop, replaced by Commons IO below:
            // FileOutputStream fos = new FileOutputStream(imgFile);
            // final byte[] by = new byte[1024];
            // int len = -1;
            // while ((len = is.read(by)) != -1) {
            //     fos.write(by, 0, len);
            // }
            // is.close();
            // FileUtils.copyURLToFile(imgURL, imgFile);
            FileUtils.copyInputStreamToFile(is, imgFile);
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (ProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        String imagRootPath = "http://localhost/asmImg/";
        return imagRootPath + fileName;
    }
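A note on the suffix detection above: searching the URL string fails for links like .../img?id=123 that carry no extension. A more robust sketch derives the suffix from the response’s Content-Type header instead:

	// Sketch: derive the file suffix from the HTTP Content-Type header,
	// falling back to ".png" when the header is missing or not an image type.
	private static String suffixFromContentType(HttpURLConnection conn) {
        String contentType = conn.getContentType(); // e.g. "image/jpeg"
        if (contentType != null && contentType.startsWith("image/")) {
            return "." + contentType.substring("image/".length()); // "image/gif" -> ".gif"
        }
        return ".png";
    }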

The second problem encountered

Because grabbing a page means simulating scrolling through the whole of it, the capture is time-consuming, and the longer the page, the longer it takes. Where the front end has response-time requirements, I suggest making the page-capture process asynchronous.
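A minimal sketch of that asynchronous hand-off, assuming the selenium(String url) method shown earlier and a hypothetical saveHtml persistence hook:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: run the capture off the request thread so the front end responds
// immediately. selenium(url) is the capture method above; saveHtml(url, html)
// is a hypothetical persistence hook, not part of the original code.
private final ExecutorService capturePool = Executors.newFixedThreadPool(2); // headless Chrome is heavy; keep the pool small

public void captureAsync(String url) {
    CompletableFuture
            .supplyAsync(() -> {
                try {
                    return selenium(url);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                }
            }, capturePool)
            .thenAccept(html -> saveHtml(url, html));
}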

Conclusion

Selenium can do much more than this: automated testing, simulating browser behavior, manipulating page elements, and so on. Automated testing is a specialty of its own, so I won’t go into depth here; this article simply introduces how to capture a dynamic page. The key points are getting page elements and executing JS scripts, via webDriver.findElements() and driver_js.executeScript(). The above is only a simple attempt; other scenarios will be more complex. For example, some pages show their full content only after a click, which has to be triggered by invoking JS. This article is a review of how I solved the problem; I hope it brings you some ideas and help.
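For instance, expanding such click-to-reveal content before grabbing might look like this sketch (the .read-more selector is purely an assumption, not from any real site):

        // Sketch: click every hypothetical "read more" button via JS before
        // capturing the page; the CSS selector is an assumed placeholder.
        List<WebElement> buttons = webDriver.findElements(By.cssSelector(".read-more"));
        for (WebElement button : buttons) {
            ((JavascriptExecutor) webDriver).executeScript("arguments[0].click()", button);
        }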