“This is the fourth day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021.”

Preface

Hi, everyone. If you are a developer, you have probably heard the word “crawler” many times. Python was the first language to be widely used for crawlers: it has relatively mature libraries and plug-ins, while support in other languages was long less friendly. Crawler frameworks and libraries for other languages have since appeared here and there. This time we will build a simple crawler example in Java. The example is for learning only; do not use it for commercial or illegal projects.

Jsoup

A crawler is a program or script that automatically collects information from websites or other sources according to predefined rules.

This time, we will build a simple crawler demo in Java based on Jsoup. So first you need to understand what Jsoup is.

Jsoup is a Java-based HTML parser. It parses a page’s HTML and lets you address its content with CSS selectors. Jsoup provides a powerful and convenient API: you can manipulate HTML elements, attributes, and text, and use DOM traversal or CSS selectors to find and retrieve data. Getting started quickly with Jsoup does require some front-end basics.
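
As a minimal sketch of that API, the following parses an HTML string and reads link text and attributes through a CSS selector. The class name JsoupBasics and the sample markup are made up for illustration; only Jsoup.parse, select, text, attr, and eachText come from Jsoup itself.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupBasics {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<html><head><title>Demo</title></head><body>"
        + "<div class=\"item\"><a href=\"/a\">First</a></div>"
        + "<div class=\"item\"><a href=\"/b\">Second</a></div>"
        + "</body></html>";

    // Returns the text of every link inside an element with class "item".
    static List<String> linkTexts(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select(".item a").eachText();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(doc.title());
        // Each selected Element exposes its text and attributes.
        for (Element link : doc.select(".item a")) {
            System.out.println(link.text() + " -> " + link.attr("href"));
        }
    }
}
```

Parsing from a string like this needs no network access, which makes it a convenient way to try out selectors before pointing the crawler at a real page.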

Quick start

Adding the dependency to the pom

You have now had a brief look at what Jsoup does. Next we will build a simple crawler demo with Spring Boot and Jsoup. The Jsoup version is 1.11.3 and the Spring Boot version is 2.3.0.RELEASE. The Jsoup dependency is as follows:

        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.3</version>
        </dependency>

Getting the link

To extract information from a website, first search for the information you need, then take the URL of the results page. For example, some time ago Juejin (Nuggets) held a Mid-Autumn Festival event, and a simple query was run against the mooncake sales data of an e-commerce website. Only one page of results is extracted, including basic information such as price, shop, and product name. This is for learning and testing only; large-scale extraction is prohibited. The URL used this time is as follows:

https://search.****.com/search?keyword=%E6%9C%88%E9%A5%BC&qrst=1&spm=2.1.0&stock=1&pvid=9ac3cb4efb544d6f98239432761506f0&page=11&s=296&click=0
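
A search term in a query string has to be percent-encoded before it goes into the URL. The sketch below shows one way to build such a URL in Java 8. The class SearchUrlBuilder, the host search.example.com, and the reduced parameter set are assumptions for illustration, since the article masks the real domain.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SearchUrlBuilder {
    // Placeholder host: the real domain is masked in the article, so this one is made up.
    static final String BASE = "https://search.example.com/search";

    // Builds a search URL with a percent-encoded keyword and a page number.
    static String build(String keyword, int page) {
        try {
            String encoded = URLEncoder.encode(keyword, "UTF-8");
            return BASE + "?keyword=" + encoded + "&page=" + page;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every JVM, so this should never happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "\u6708\u997c" is the Chinese word for mooncake.
        System.out.println(build("\u6708\u997c", 11));
    }
}
```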

Establish a connection

Now that we have the URL of the page that holds the information, we need to establish a connection. To do so, we call Jsoup’s connect method, which creates a Connection object for that URL. The request itself is sent when get is called in the next step. The basic code is as follows:

 Connection connect = Jsoup.connect(url);
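
In practice it is common to set a user agent and a timeout on the Connection before sending anything. The following is a sketch under assumptions: the class ConnectionSetup, the user-agent string, the 10-second timeout, and the example.com URL are all illustrative choices, not values from the original demo.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class ConnectionSetup {
    // Creates a configured Connection. No request is sent here;
    // that only happens when get() or execute() is called on it.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; DemoCrawler/1.0)") // hypothetical UA string
                .timeout(10_000); // milliseconds
    }

    public static void main(String[] args) {
        Connection connect = prepare("https://example.com/");
        System.out.println(connect.request().url());
        System.out.println(connect.request().timeout());
    }
}
```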

Fetching the web page

Once the Connection has been created, call its get method to fetch the page and obtain its content as a Document.

 Document document = connect.get();

If you print the Document object, you can see that it contains a lot of HTML; the data can then be extracted from it.

Select the information object

Specific content can be obtained through the select method of the Document object, which requires some basic front-end knowledge, for example class selectors, element selectors, and wildcard selectors. Choose the selector according to the elements you need. Here we select the li elements inside the ul under the class goods-list-v2, which gives all the product entries on the page.

 Elements rootselect = document.select(".goods-list-v2 ul li");

Extracting the information

Once you have all the product entries, you can use CSS selectors to pick out the elements you need from each one. This time the price, shop name, and product name are extracted. The code is as follows.

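    // Iterate over each product <li> selected in the previous step
    for (Element ele : rootselect) {
        // Price
        Elements priceEls = ele.select(".p-price strong i");
        String price = priceEls.text();
        // Shop name: first matching link (first() returns null if nothing matched)
        Elements shopEls = ele.select(".J_im_icon a");
        String shop = shopEls.isEmpty() ? "" : shopEls.first().text();
        // Product name: last matching <em>
        Elements nameEls = ele.select(".p-name a em");
        String goodsName = nameEls.isEmpty() ? "" : nameEls.last().text();
    }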
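
The selection and extraction steps can be tried offline against a small HTML string. The sketch below reuses the article’s selectors, but the GoodsExtractor and Goods classes and the sample markup are invented for illustration; the real page structure may differ.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class GoodsExtractor {
    // Hypothetical value holder for one product entry.
    static class Goods {
        final String price;
        final String shop;
        final String name;

        Goods(String price, String shop, String name) {
            this.price = price;
            this.shop = shop;
            this.name = name;
        }
    }

    // Made-up sample markup that mimics the class names used in the selectors.
    static final String SAMPLE =
        "<div class=\"goods-list-v2\"><ul>"
        + "<li><div class=\"p-price\"><strong><i>59.90</i></strong></div>"
        + "<div class=\"p-name\"><a><em>Mooncake gift box</em></a></div>"
        + "<div class=\"J_im_icon\"><a>Shop A</a></div></li>"
        + "<li><div class=\"p-price\"><strong><i>128.00</i></strong></div>"
        + "<div class=\"p-name\"><a><em>Assorted mooncakes</em></a></div></li>"
        + "</ul></div>";

    static List<Goods> extract(String html) {
        Document document = Jsoup.parse(html);
        List<Goods> result = new ArrayList<>();
        for (Element ele : document.select(".goods-list-v2 ul li")) {
            String price = ele.select(".p-price strong i").text();
            // first() and last() return null when nothing matched, so check first.
            Elements shopLinks = ele.select(".J_im_icon a");
            String shop = shopLinks.isEmpty() ? "" : shopLinks.first().text();
            Elements names = ele.select(".p-name a em");
            String name = names.isEmpty() ? "" : names.last().text();
            result.add(new Goods(price, shop, name));
        }
        return result;
    }

    public static void main(String[] args) {
        for (Goods g : extract(SAMPLE)) {
            System.out.println(g.name + " | " + g.shop + " | " + g.price);
        }
    }
}
```

The second sample entry has no shop link on purpose, to show that a missing element yields an empty string instead of a NullPointerException.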

Data processing

The text is extracted from the Elements objects. After some simple data processing, the price, shop name, and product name of each item are obtained.
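
As a sketch of what such processing might look like, the scraped price text can be converted to a number so that items can be compared or sorted. The PriceParser class and its cleaning rule (keep only digits and the decimal point) are assumptions, not part of the original demo.

```java
import java.math.BigDecimal;

class PriceParser {
    // Converts scraped price text such as " 59.90 " or "1,299.00" into a BigDecimal.
    // Returns null when the text contains no parsable number.
    static BigDecimal parse(String raw) {
        if (raw == null) {
            return null;
        }
        // Drop everything except digits and the decimal point
        // (currency symbols, thousands separators, whitespace).
        String cleaned = raw.replaceAll("[^0-9.]", "");
        if (cleaned.isEmpty()) {
            return null;
        }
        try {
            return new BigDecimal(cleaned);
        } catch (NumberFormatException e) {
            return null; // e.g. "." or "1.2.3"
        }
    }

    public static void main(String[] args) {
        System.out.println(parse(" 59.90 "));
        System.out.println(parse("1,299.00"));
        System.out.println(parse("N/A"));
    }
}
```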

Conclusion

Well, that completes a simple crawler demo built with Jsoup. Quite simple, isn’t it? There is more to explore. This is for learning only: illegal use of data is prohibited, so follow the Internet security law and do not break the law.

Thank you for reading. I hope you liked it; if it was helpful, please like and bookmark it. If there are shortcomings, comments and corrections are welcome. See you next time.

About the author: [Little Ajie], a programmer who loves tinkering, and a Java developer and enthusiast. Maintainer of the official account [Java full stack architect]; you are welcome to follow, read, and get in touch.