I have recently been developing a real estate management system that needs to store administrative divisions in its database. I searched online for a long time, but the available data on China's administrative divisions was either outdated or incomplete and did not meet my needs. I then had the idea of writing a crawler to fetch the data from the website of the National Bureau of Statistics. Since I am still learning Python, I wrote a simple crawler in NodeJS, which I am more familiar with, to obtain the national administrative division information from the National Bureau of Statistics website (excluding Hong Kong, Macao and Taiwan).

Directory structure

  • Create a src folder as the storage directory for the generated administrative division data
  • Create index.js as the main file

Initialize the project

Initialize the project using NPM and enter the appropriate content (project name, version, description, etc.) as prompted

npm init

Install dependencies

  • cheerio: a fast, flexible and lean implementation of jQuery's core functionality, intended for server-side use where DOM manipulation is required
  • axios: a promise-based HTTP library (NodeJS's request or another HTTP library would also work)
  • iconv-lite: solves character encoding problems in NodeJS
  • async: not to be confused with ES async/await; introduced here for its mapLimit function, which is used to control the number of concurrent requests
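The four dependencies above can be installed in one step (a minimal example; pin versions as you see fit):

npm install cheerio axios iconv-lite async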

The following details the use of each library

Development

Analyze target websites

Go to the National Bureau of Statistics website and find the administrative division pages. The latest data is currently from 2020: www.stats.gov.cn/tjsj/tjbz/t…

By checking the source code of the page, we find that the site is relatively simple and the information we want is directly in the HTML, so we can use cheerio to read the element information and extract the data we need.

Analyzing the target site should always be the first step; the dependency libraries to use are then chosen based on the results of that analysis.

Start by defining a few variables

const HOST = 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/';
const headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36' };


HOST is the address of the 2020 administrative division data, and headers is the request header information sent when visiting the site. It is used here to simulate a browser environment and avoid problems; in fact, the site can also be requested without it.

  • Introduce the axios library to fetch site data

Import axios at the top of the file:

const axios = require('axios');

Then define a function that requests a URL and returns the corresponding data. According to the analysis of the site, the data is organized layer by layer and each link points to the next level of data: the home page lists all provinces, each province links to its cities, and the cities in turn link to districts and counties, streets, and neighborhood committees. So the request logic is encapsulated in a function, and the address to fetch is passed in as a parameter.

const fetchData = async (url) => {
  const res = await axios(url, { headers });
  return res.data;
};

Define a main function as the entry point of the program and call fetchData inside it:

const main = async () => {
  const provinceData = await fetchData(HOST);
  console.log(provinceData); // print and inspect the fetched page
};

main();

Run node index.js to fetch the page information, and use console.log to print and inspect it.

At this point you will notice that the Chinese text in the response is garbled. This is because the site is encoded in GB2312, so the fetched data needs to be decoded.

  • Introduce iconv-lite
const iconv = require('iconv-lite');

Modify the request function:

let data = null;
const res = await axios(url, { responseType: 'arraybuffer', headers });
// decode the gb2312 buffer with iconv-lite
data = iconv.decode(res.data, 'gb2312');

Note

axios converts response data to UTF-8 by default, so the response has to be fetched as a raw buffer (responseType: 'arraybuffer') and then decoded from GB2312 with iconv-lite.
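Putting the pieces together, the modified fetchData might look like this (a minimal sketch based on the snippets above; error handling is omitted):

const fetchData = async (url) => {
  // fetch the raw bytes so axios does not decode them as UTF-8
  const res = await axios(url, { responseType: 'arraybuffer', headers });
  // decode the GB2312 buffer into a JavaScript string
  return iconv.decode(res.data, 'gb2312');
};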

Once the data is decoded correctly, cheerio can be used to extract exactly what we need.

  • Introduce cheerio
const cheerio = require('cheerio');

Looking at the page source, the provincial information sits in a tr element with class="provincetr":

<tr class="provincetr">
	<td><a href="11.html">The Beijing municipal<br></a></td>
	<td><a href="12.html">tianjin<br></a></td>
	<td><a href="13.html">In hebei province<br></a></td>
	<td><a href="14.html">Shanxi Province<br></a></td>
	<td><a href="15.html">Inner Mongolia Autonomous Region<br></a></td>
	<td><a href="21.html">Liaoning province<br></a></td>
	<td><a href="22.html">Jilin province<br></a></td>
	<td><a href="23.html">Heilongjiang province<br></a></td>
</tr>

Once the pattern is clear, the code can be written:

const proviceStr = (html) => {
  const $ = cheerio.load(html);
  let result = [];
  $(".provincetr a").each(function (index, element) {
    let name = $(element).text().trim();
    let url = $(element).attr("href");
    let id = url.replace('.html', '');
    result.push({
      pid: '',
      id,
      name,
      url: HOST + url,
    });
  });
  return result;
};

This function iterates over the a elements under the provincetr class, reads the text content and href of each a tag, and derives the province ID by stripping '.html' from the link (the value in href is in fact the ID). The URL is also saved so that the next level of data can be fetched.

Print result and you should see the retrieved data.
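For reference, one entry in result would look roughly like this (values taken from the HTML sample above, with the URL built from the HOST constant):

// example entry in `result`
{ pid: '', id: '11', name: 'Beijing', url: 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/11.html' }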

Next you need to save the data to a file

  • Introduce the NodeJS fs module
const fs = require('fs');

Define the file paths for storage. For convenience, each level of information is stored in its own file:

const filePath = {
  province: 'src/province.json',
  city: 'src/city.json',
  country: 'src/country.json'
}

Modify the main function:

const main = async () => {
  const Index = joinUrl('index.html');
  const provinceData = await fetchData(Index, 'province', '');
  fs.writeFileSync(filePath.province, JSON.stringify(provinceData));
}
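The joinUrl helper used above (and the extra parameters now passed to fetchData) are not shown in this post; a minimal sketch of joinUrl, assuming it simply appends a path to HOST, could be:

// hypothetical joinUrl helper: joins a relative path onto the HOST base URL
const joinUrl = (path) => HOST + path;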

Open the src folder and you will see the newly generated province.json file, which is a minified JSON file.

The city-level data is analyzed next. The layout of the city page is similar to the provincial page, except that the rows are arranged vertically and the tr class is citytr. Districts and counties, streets, and neighborhood committees all have the same layout but with different class names, so the common code can be extracted into a single function (a sketch follows below).
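As a sketch of that idea (only the provincetr and citytr class names are mentioned in this post; the two-cell row layout, the parseLevel name, and the URL joining are assumptions for illustration):

// Hypothetical shared parser, parameterized by the tr class of each level.
// It assumes each row has the division code in the first cell and the name
// in the second, matching the vertically arranged citytr layout described above.
const parseLevel = (html, trClass, pid) => {
  const $ = cheerio.load(html);
  const result = [];
  $(`.${trClass}`).each(function () {
    const tds = $(this).find('td');
    const id = $(tds[0]).text().trim();        // division code
    const name = $(tds[1]).text().trim();      // division name
    const href = $(tds[0]).find('a').attr('href');
    result.push({
      pid,
      id,
      name,
      // leaf rows have no link; relative paths may need extra handling
      url: href ? HOST + href : null,
    });
  });
  return result;
};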

Note that the deeper the level you crawl, the larger the amount of data becomes; requests may then start to fail, and there is even a risk of the source site blocking your IP.

Since this crawler only goes down to the district and county level, not much was done about this beyond introducing the async library's mapLimit to control the number of concurrent requests (a sketch follows below).
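A minimal sketch of limiting concurrency with mapLimit (assuming async v3, where omitting the final callback returns a Promise; the limit of 5, crawlCities, and parseLevel are illustrative names, not the original code):

const async = require('async');

// Fetch the next level for each province with at most 5 requests in flight.
const crawlCities = async (provinces) => {
  const perProvince = await async.mapLimit(provinces, 5, async (province) => {
    const html = await fetchData(province.url);      // fetcher defined above
    return parseLevel(html, 'citytr', province.id);  // hypothetical shared parser
  });
  return perProvince.flat();                         // flatten into one city list
};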

  • Finally, configure package.json so that the project is started with an npm script instead of node index.js (see the example below)
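For example, a start script like the following (the script name is up to you) lets the project be launched with npm start:

{
  "scripts": {
    "start": "node index.js"
  }
}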

This was my first time using NodeJS to crawl a web page, and there are still many problems; I hope readers will point them out. I have since taken some time to improve the code, fix the errors that occur when the data volume is large, and fix the number of digits in the province IDs.

  • Project address

Github.com/imchaoyu/no…