Preface

I recently got a report from customer service: the province and city data in our system seemed inaccurate, and some regions were missing. After asking my colleagues, I learned that the data in the database had been copied over from an old project years ago, so no wonder some of it was stale. As it happened, I was integrating with MYbank at the time and noticed that it exposes a data interface for provinces and cities. Great, I thought, and quickly wrote a synchronization program against it.

During synchronization, though, I found that some of the bank's data didn't match our database either. So I quietly opened Taobao and JD.com, went to the add-shipping-address page, and checked who was wrong. The comparison turned up real differences, which was a pain: at that point it seemed nobody could be trusted except the state itself. So I went to the website of the Ministry of Civil Affairs of the PRC to check the abnormal records against the official list.

The comparison confirmed that the bank's data was indeed wrong. To their credit, Taobao and JD.com had both synchronized the latest data; unfortunately, I couldn't find a public interface for it. So, to correct our system's data, crawling the official list it is.

Locking the crawl target

The crawl address is as follows:

preview.www.mca.gov.cn/article/sj/…

The crawling principle is simple: parse the HTML, pick out the text values we need from the right elements, and save them. Since the project is in Java, I chose Jsoup for the job.

<!-- HTML parser -->
<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.13.1</version>
</dependency>

Web data analysis

Since we need to parse the HTML to get at the data, we first need to know which elements hold it. Open Chrome DevTools, pick a row of data with the element selector, and see where it lives.

The analysis shows that each row of data sits under a <tr> tag, and the area code and area name we need are in the first and second non-blank <td>. There are also many blank <td> tags that have to be filtered out in code.
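For reference, the relevant rows look roughly like this. This is a simplified sketch, not the page's exact markup (the real rows carry extra styling attributes); the sample values are the Hebei entries used later in this post:

```html
<!-- Simplified sketch of one data row on the MCA page; blank cells are padding -->
<tr>
  <td></td>
  <td>130000</td>
  <td>河北省</td>
  <td></td>
</tr>
```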

Defining the base code

Let’s define our crawl target and the Area entity class.

public class AreaSpider {

    // Crawl target
    private static final String TARGET = "http://preview.www.mca.gov.cn/article/sj/xzqh/2020/2020/202101041104.html";

    @Data
    @AllArgsConstructor
    private static class Area {

        // Area code
        private String code;

        // Area name
        private String name;

        // Parent area code
        private String parent;
    }
}

Crawler coding

public static void main(String[] args) throws IOException {
  // Request the page
  Jsoup.connect(TARGET).timeout(10000).get()
    // Select all tr tags
    .select("tr")
    // For each tr, select its td tags
    .forEach(tr -> tr.select("td")
      // Filter out td tags with blank text
      .stream().filter(td -> StringUtils.isNotBlank(td.text()))
      // Print the result
      .forEach(td -> System.out.println(td.text())));
}

Analytical results

Code optimization

With the code above we have already crawled the data on the page, but it isn't yet in the shape we want, so let's go a step further and convert it into Area entities.

public static void main(String[] args) throws IOException {
  // Request the page
  List<Area> areaList = Jsoup.connect(TARGET).timeout(10000).get()
    // Select all tr tags
    .select("tr")
    // For each tr, select its td tags
    .stream().map(tr -> tr.select("td")
      // Filter out td tags with blank text and collect the rest into a list
      .stream().filter(td -> StringUtils.isNotBlank(td.text())).collect(Collectors.toList()))
    // As mentioned earlier, the area code and area name sit in the first and
    // second td, so rows that don't have exactly two cells are filtered out
    .filter(e -> e.size() == 2)
    // Convert to an Area object
    .map(e -> new Area(e.get(0).text(), e.get(1).text(), "0")).collect(Collectors.toList());

  // Print the data
  areaList.forEach(area -> System.out.println(JSONUtil.toJsonStr(area)));
}

Analytical results

At this point we are still missing the parent area code, but it can be derived from the area code itself. Take Hebei as an example: Hebei Province is 130000, Shijiazhuang is 130100, and Chang'an District is 130102. The pattern: codes ending in 0000 are provinces, and codes ending in 00 are cities. So the code looks like this:

private static String calcParent(String areaCode) {
    // Province (also special-cases the header line)
    if (areaCode.endsWith("0000") || areaCode.equals("Administrative Division Code")) {
        return "0";
    // City
    } else if (areaCode.endsWith("00")) {
        return String.valueOf(Integer.parseInt(areaCode) / 10000 * 10000);
    // District
    } else {
        return String.valueOf(Integer.parseInt(areaCode) / 100 * 100);
    }
}

The final code

public class AreaSpider {

    // Crawl target
    private static final String TARGET = "http://preview.www.mca.gov.cn/article/sj/xzqh/2020/2020/202101041104.html";

    @Data
    @AllArgsConstructor
    private static class Area {

        // Area code
        private String code;

        // Area name
        private String name;

        // Parent area code
        private String parent;

    }

    public static void main(String[] args) throws IOException {
        // Request the page
        List<Area> areaList = Jsoup.connect(TARGET).timeout(10000).get()
                // Select all tr tags
                .select("tr")
                // For each tr, select its td tags
                .stream().map(tr -> tr.select("td")
                        // Filter out td tags with blank text and collect the rest into a list
                        .stream().filter(td -> StringUtils.isNotBlank(td.text())).collect(Collectors.toList()))
                // As mentioned earlier, the area code and area name sit in the first and
                // second td, so rows that don't have exactly two cells are filtered out
                .filter(e -> e.size() == 2)
                // Convert to an Area object
                .map(e -> new Area(e.get(0).text(), e.get(1).text(), calcParent(e.get(0).text()))).collect(Collectors.toList());

        // Remove the header line ("Administrative Division Code | Unit Name")
        areaList.remove(0);

        areaList.forEach(area -> System.out.println(JSONUtil.toJsonStr(area)));
    }

    private static String calcParent(String areaCode) {
        // Province (also special-cases the header line)
        if (areaCode.endsWith("0000") || areaCode.equals("Administrative Division Code")) {
            return "0";
        // City
        } else if (areaCode.endsWith("00")) {
            return String.valueOf(Integer.parseInt(areaCode) / 10000 * 10000);
        // District
        } else {
            return String.valueOf(Integer.parseInt(areaCode) / 100 * 100);
        }
    }
}

Data correction

Our product needs three-level province/city/district linkage, but municipalities directly under the central government have only two levels in the source data, so we manually insert a middle level. Take Beijing as an example: it becomes Beijing -> Beijing -> Dongcheng District, and the other municipalities are handled the same way.
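A minimal standalone sketch of that fix, assuming the four municipality codes 110000 (Beijing), 120000 (Tianjin), 310000 (Shanghai) and 500000 (Chongqing), and using province code + 100 (e.g. 110100 for Beijing) as the synthetic city code, since that is the value calcParent already assigns to municipal districts such as 110101. The class and method names here are made up for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class MunicipalityFix {

    // Minimal stand-in for the Area entity above
    static class Area {
        final String code, name, parent;
        Area(String code, String name, String parent) {
            this.code = code; this.name = name; this.parent = parent;
        }
    }

    // Codes of the four municipalities directly under the central government
    static final Set<String> MUNICIPALITIES = Set.of("110000", "120000", "310000", "500000");

    // After each municipality row, insert a synthetic same-named city row so the
    // three-level chain closes up: 110000 -> 110100 -> 110101
    static List<Area> insertCityLevel(List<Area> areas) {
        List<Area> result = new ArrayList<>();
        for (Area a : areas) {
            result.add(a);
            if (MUNICIPALITIES.contains(a.code)) {
                String cityCode = String.valueOf(Integer.parseInt(a.code) + 100);
                result.add(new Area(cityCode, a.name, a.code));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Area> fixed = insertCityLevel(List.of(
                new Area("110000", "Beijing", "0"),
                new Area("110101", "Dongcheng District", "110100")));
        fixed.forEach(a -> System.out.println(a.code + " " + a.name + " -> " + a.parent));
        // 110000 Beijing -> 0
        // 110100 Beijing -> 110000
        // 110101 Dongcheng District -> 110100
    }
}
```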

I've prepared the corrected data; help yourself if you need it:

  • JSON: 2020-11 codes of administrative divisions at or above the county level
  • SQL: 2020-11 codes of administrative divisions at or above the county level

You could also keep the municipalities at two levels; it mainly depends on what the product needs.

Conclusion

Overall, this crawler is fairly simple, just a few lines of code. After all, the website has no anti-crawling measures, so getting the data is very easy.

In closing

Heh heh, so what websites have you crawled?

If you found this helpful, please comment and like, or drop by my homepage, where there may be other articles you'll enjoy; feel free to follow me too. Thanks!

I am a different kind of tech nerd, making progress every day and experiencing a different life. See you next time!