Like before you read — make it a habit 👏👏

Hello everyone, I’m Llamy. Today I’ve brought you something genuinely practical.

Preface

When it comes to scraping web data, most people’s first reaction is Python, and Python is indeed a natural fit for the job. But many long-time Java developers don’t realize that Java can crawl the web too. The best-known option is the Jsoup HTML extraction framework.

How It Started

Years ago I built a precious-metals news site that needed to display the latest real-time prices for gold, silver, and so on. A third-party provider offered an API for exactly that data, but it cost money. After some searching I found Jsoup, located a site that published the prices, scraped the data from its pages, and rendered it in my own layout, saving a nice sum of money.

Last night, on a whim, I thought this would make a good article, so I went back to the official site for a refresher and decided to start with Beike (ke.com) 😏

A Quick Guide to Jsoup

Jsoup is so simple that a full development walkthrough almost feels unnecessary. Just visit the official site and you can pick up the API in ten minutes.

The official site has a guide and examples; go take a look.

Official website: jsoup.org/
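To give a feel for how little code it takes, here is a minimal, self-contained sketch. The HTML snippet and class name are mine, not from the Beike project; parsing an in-memory string with `Jsoup.parse` gives the same `Document` type and the same `select` calls you would use on a live page fetched with `Jsoup.connect(url).get()`:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupQuickStart {
    public static void main(String[] args) {
        // An in-memory snippet; a live page fetched with connect(url).get() parses identically
        String html = "<ul class=\"list\">"
                + "<li><a href=\"/a\">First</a></li>"
                + "<li><a href=\"/b\">Second</a></li>"
                + "</ul>";
        Document doc = Jsoup.parse(html);

        // CSS selectors, just like in a browser dev console
        Elements links = doc.select("ul.list li a");
        for (Element link : links) {
            System.out.println(link.text() + " -> " + link.attr("href"));
        }
        // prints:
        // First -> /a
        // Second -> /b
    }
}
```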

The Practical Part

For the hands-on part I’ll walk through a real project in detail. I believe that with this worked example you’ll pick up the technique easily.

1. Prepare the page to crawl

Beike (ke.com), Shenzhen, new homes: sz.fang.ke.com/loupan/pg

2. Create a Maven project

3. Add dependencies to Pom files

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>30.0-jre</version>
</dependency>
<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <version>1.18.18</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>easyexcel</artifactId>
    <version>2.2.10</version>
</dependency>

4. Create Pojo classes to map data to Excel tables

Since the data captured from the page will ultimately be saved to an Excel file, we use Alibaba’s EasyExcel. That’s why the dependency is declared in the POM above, and why we need a POJO class to map fields to columns in the exported file.

@Data
@Accessors(chain = true)
public class House {
    @ExcelProperty("Property Name")
    private String title;

    @ExcelProperty("Detail Page URL")
    private String detailPageUrl;

    @ExcelProperty("Property Image")
    private String imageUrl;

    @ExcelProperty("Address")
    private String address;

    @ExcelProperty("House Type")
    private String houseType;

    @ExcelProperty("Property Type")
    private String propertyType;

    @ExcelProperty("Sale Status")
    private String status;

    @ExcelProperty("Gross Floor Area")
    private String buildingArea;

    @ExcelProperty("Total Price")
    private String totalPrice;

    @ExcelProperty("Unit Price (Yuan/㎡)")
    private String singlePrice;

    @ExcelProperty("Tag")
    private String tag;
}

5. The main method: the crawling logic

A few things to note here:

  1. Beike triggers human verification when the same IP makes frequent requests in a short time, so we put the thread to sleep for a while on every page turn.
  2. Jsoup scrapes against the page’s HTML elements, so if Beike changes its page structure the program may stop extracting data.
  3. The program does not crawl each listing’s detail information; interested readers can extend it using the detail-page URL.
  4. If the target site requires login state or other special request parameters, Jsoup supports setting those too; see the official API docs.
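Note 4 deserves a concrete sketch before we dive into the main method. The user agent, header value, and cookie value below are placeholders I made up; copy the real ones from your browser’s developer tools for the site you target. The builder methods shown (`userAgent`, `header`, `cookie`, `timeout`) are all standard Jsoup `Connection` API:

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class AuthedFetch {
    public static void main(String[] args) throws IOException {
        // Build the request first; nothing is sent until get()/post() is called
        Connection conn = Jsoup.connect("https://sz.fang.ke.com/loupan/pg1")
                .userAgent("Mozilla/5.0")                      // present as a browser
                .header("Accept-Language", "zh-CN,zh;q=0.9")   // any extra request header
                .cookie("session_id", "your-session-cookie")   // placeholder login cookie
                .timeout(10_000);                              // 10-second timeout

        Document doc = conn.get();  // executes the HTTP GET
        System.out.println(doc.title());
    }
}
```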
@SneakyThrows
public static void main(String[] args) {
    AtomicInteger pageIndex = new AtomicInteger(1);
    int pageSize = 10;
    List<House> dataList = Lists.newArrayList();

    // Beike (ke.com) Shenzhen base URL
    String beikeUrl = "https://sz.fang.ke.com";

    // Shenzhen new-home listing page address (the page number is appended)
    String loupanUrl = "https://sz.fang.ke.com/loupan/pg";

    // Use Jsoup to fetch the complete page information of this address
    Document doc = Jsoup.connect(loupanUrl + pageIndex.get()).get();

    // Page title
    String pageTitle = doc.title();

    // Paging container
    Element pageContainer = doc.select("div.page-box").first();
    if (pageContainer == null) {
        return;
    }
    // Total number of listings
    int totalCount = Integer.parseInt(pageContainer.attr("data-total-count"));
    // Paginate; round up so the final partial page is included
    for (int i = 0; i < (totalCount + pageSize - 1) / pageSize; i++) {
        log.info("running get data, the current page is {}", pageIndex.get());
        // Beike uses human verification, so we can't hit it rapidly; sleep 10 seconds before each page
        Thread.sleep(10000);
        doc = Jsoup.connect(loupanUrl + pageIndex.getAndIncrement()).get();

        // Get the ul element of the listing
        Element list = doc.select("ul.resblock-list-wrapper").first();
        if (list == null) {
            continue;
        }

        // Get the li element of the listing
        Elements elements = list.select("li.resblock-list");
        elements.forEach(el -> {

            // Property introduction
            Element introduce = el.child(0);

            // Details page
            String detailPageUrl = beikeUrl + introduce.attr("href");

            // Real estate picture
            String imageUrl = introduce.select("img").attr("data-original");

            // Property details
            Element childDesc = el.select("div.resblock-desc-wrapper").first();
            Element childName = childDesc.child(0);

            // Property name
            String title = childName.child(0).text();

            // The property is for sale
            String status = childName.child(1).text();

            // Property type
            String propertyType = childName.child(2).text();

            // Address of the building
            String address = childDesc.child(1).text();

            // Room properties
            Element room = childDesc.child(2);

            // House type (bedrooms/halls)
            String houseType = "";

            // Set the size of the apartment
            Elements houseTypeSpans = room.getElementsByTag("span");
            if (CollectionUtils.isNotEmpty(houseTypeSpans)) {
                // Drop the first span (the leading label text)
                houseTypeSpans.remove(0);
                // Drop the last span (the trailing building-area text)
                houseTypeSpans.remove(houseTypeSpans.size() - 1);
                houseType = StringUtil.join(houseTypeSpans.stream().map(Element::text).collect(Collectors.toList()), "/");
            }

            // Floor area
            String buildingArea = room.select("span.area").text();

            // div - tag
            Element descTag = childDesc.select("div.resblock-tag").first();
            Elements tagSpans = descTag.getElementsByTag("span");
            String tag = "";
            if (CollectionUtils.isNotEmpty(tagSpans)) {
                tag = StringUtil.join(tagSpans.stream().map(Element::text).collect(Collectors.toList()), "");
            }

            // div - price
            Element descPrice = childDesc.select("div.resblock-price").first();
            String singlePrice = descPrice.select("span.number").text();
            String totalPrice = descPrice.select("div.second").text();

            dataList.add(new House().setTitle(title)
                    .setDetailPageUrl(detailPageUrl)
                    .setImageUrl(imageUrl)
                    .setSinglePrice(singlePrice)
                    .setTotalPrice(totalPrice)
                    .setStatus(status)
                    .setPropertyType(propertyType)
                    .setAddress(address)
                    .setHouseType(houseType)
                    .setBuildingArea(buildingArea)
                    .setTag(tag)
            );
        });
    }
    if (CollectionUtils.isEmpty(dataList)) {
        log.info("dataList is empty returned.");
        return;
    }
    log.info("dataList prepare finished, size = {}", dataList.size());
    
    // Call export logic to export data to excel file
    export(pageTitle, dataList);
}

6. EasyExcel export logic

/**
 * Write the crawled data to Excel
 * @param pageTitle
 * @param dataList
 */
private static void export(String pageTitle, List<House> dataList) {
    WriteCellStyle headWriteCellStyle = new WriteCellStyle();
    // Center the header horizontally
    headWriteCellStyle.setHorizontalAlignment(HorizontalAlignment.CENTER);
    // Content style
    WriteCellStyle contentWriteCellStyle = new WriteCellStyle();
    // Left-align the content
    contentWriteCellStyle.setHorizontalAlignment(HorizontalAlignment.LEFT);
    HorizontalCellStyleStrategy horizontalCellStyleStrategy = new HorizontalCellStyleStrategy(headWriteCellStyle, contentWriteCellStyle);
    // Note: the stream must be set not to auto-close here
    EasyExcelFactory.write("D:\\Shenzhen Real Estate Summary.xlsx", House.class)
            .autoCloseStream(Boolean.FALSE)
            .registerWriteHandler(horizontalCellStyleStrategy)
            .sheet(pageTitle)
            .doWrite(dataList);
}

7. Results

Give it a try yourself!

Source code: GitHub

Original content takes effort, so please give it plenty of likes. Thank you! 🙏🙏