Background

This article exists because, two days ago, our teacher handed the group a rather troublesome project: compiling statistics on journal articles.

A quick look at the 14-person group: there are old dogs like me who delayed graduation, and a core of second-year research students, four of whom have been doing projects with the teacher and are known as the kids who got into the lab as undergraduates (impressive). All very well, but when I saw that they wanted me to copy and paste 650+ articles, it felt a bit rough. Fortunately, xu quickly came up with a solution:

CODING!!!

Let's get to work!

First, settle the technology stack. As someone who mainly does front-end work and doesn't know much else, Node was the obvious choice for the main development language. And since the job is to collect article information, a moment's thought makes it clear: this requirement is just a crawler doing the copy-paste.

Puppeteer is a very handy automation tool in the Node.js world. Calling it a crawler actually undersells it, since it is widely used for automated testing as well. Check out this article.

Taking a page from my friend's article, first install it:

npm i -S puppeteer

Downloading Chromium may be a bit difficult for well-known reasons; pointing the download at a mirror usually helps. I already had puppeteer installed, so I'll leave that part for you to solve yourself.

Puppeteer is an automation library that essentially lets you program the Chrome browser itself, so writing this kind of code feels quite intuitive to me.

Examining the requirements

  • Obtain the title, authors and institutions, starting and ending page numbers, abstract, keywords, and other information for every academic article in the Transactions of the China Welding Institution from 2014 to 2015

  • Each author gets their own row, and authors must be matched to their institutions

  • Given the above, all other cells need to be merged across an article's rows.

    Article list page

    An example of a single article

The solution

  • The entry point is the CNKI journal article list page, which is generated from ASPX.
  • Some of the information is available right on the article list page
  • Abstracts and keywords require visiting each article's own page
  • Matching authors to their institutions requires the PDF **(incomplete)**
  • After fetching, export the data to Excel

As you can see, information is stored in three layers.
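Concretely, a sketch of which level supplies which fields:

// Level 1: the article list page → title, authors, start/end pages, article id
// Level 2: each article's page   → abstract, institution, keywords
// Level 3: the article PDF       → author-to-institution mapping (not automated here)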

Crawl the first-level information

First, some preparatory work: import the package and define constants:

const puppeteer = require('puppeteer');
const url = 'https://navi.cnki.net/knavi/JournalDetail?pcode=CJFD&pykm=HJXB';
// A uniform waiting time, so the target site doesn't flag us for operating too fast
const TIME = 3000;

Here is the main function:

// An immediately invoked async function
(async () => {
	const browser = await puppeteer.launch({
		// headless: false, // false launches the browser with a visible UI
		slowMo: 100, // Slow down browser execution to make observation easier
		args: [
			// Chrome launch flags
			'--no-sandbox',
			// '--window-size=1280,960',
		],
	});
	// Create a new page
	const page = await browser.newPage();
	// Go to the target page
	await page.goto(url, {
		// Network idle indicates the load is complete
		waitUntil: 'networkidle2',
	});
	console.log('Page loaded!');
})();

As the code above suggests, puppeteer is similar to Electron in that a main process creates and drives child (browser) processes.

Next, select the desired year and issue on the list page and iterate through them.

Puppeteer is a puppet, so just tell the browser what you want it to do:

First, the two util functions involved:

// The year buttons' ids on the page start with a digit, so a direct $() fails
// The characters have to be escaped as CSS Unicode code points
function getID(year) {
	let num = year - 2010;
	return `#\\0032\\0030\\0031\\003${num}\\005f\\0059\\0065\\0061\\0072\\005f\\0049\\0073\\0073\\0075\\0065`;
}

// Build the id selector for a given issue of a given year
function getNoDotID(year, num) {
	let _num = num < 10 ? `0${num}` : `${num}`;
	return `#yq${year}${_num}`;
}
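As a quick sanity check (assuming the DOM ids these two functions encode, i.e. a year button with id "2014_Year_Issue" and issue buttons like "yq201403"):

// The id "2014_Year_Issue" starts with a digit, which a raw selector can't express,
// so getID escapes every character to its CSS code point:
console.log(getID(2014)); // '#\0032\0030\0031\0034\005f\0059\0065\0061\0072\005f\0049\0073\0073\0075\0065'
// The CSS engine decodes that back to '#2014_Year_Issue'
console.log(getNoDotID(2014, 3)); // '#yq201403' (no escaping needed, it starts with a letter)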

Then the main loop:

// Select 2014 and click through each issue
// Year click event
let yearNum = 2014;
const yearBtn = await page.$(getID(yearNum));
await yearBtn.click();
await page.waitFor(TIME);
let accNum = 1;
// The output is a two-dimensional array
let output = [];
// Start from issue 1; the journal is monthly, so 12 issues
while (accNum < 13) {
	// Select the issue for the current iteration
	let NoDot = await page.$(getNoDotID(yearNum, accNum));
	await NoDot.click();

	// Wait for the issue's article list to load
	await page.waitFor(TIME);

	console.log('Select list... ' + accNum);
	const list = await page.$('#CataLogContent');
	const items = await list.$$('dd');

	const res = await page.evaluate(list => {
		// ...
	}, list);
	output.push(res);
	accNum++;
}
  • page.$() and page.$$() are similar to document.querySelector / document.querySelectorAll; they return element handles
  • page.evaluate(fn, node) runs fn in the browser context against the node handle selected above; fn receives the node as its argument
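A minimal round trip with the two, reusing the list container selector from above (a sketch that belongs inside the async function):

// Grab an element handle in Node, then read one of its properties in the browser context
const handle = await page.$('#CataLogContent');
const text = await page.evaluate(node => node.innerText, handle);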

Inside page.evaluate, we extract each article's information (title, start and end page numbers, etc.) together with its link and save it.

const res = await page.evaluate(list => {
	// Browser objects are available in here
	const itemList = list.querySelectorAll('dd');
	let arr = [];
	// console.log(itemList);
	for (let item of itemList) {
		// CNKI is an ASPX-based site,
		// and article URLs follow a pattern: they depend on the id after "filename"
		// (different years also live in different databases)
		const getPaperId = function(href) {
			let match = /filename=(\w+)&/i.exec(href);
			return match[1];
		}
		let paperHref = item.querySelector('.opts > .btn-view > a').href;
		let id = getPaperId(paperHref);
		// Save the entry's innerText and id as one string, to be parsed later
		let content = item.innerText + '&' + id;
		arr.push(content);
	}
	return arr;
}, list);

Run npm start and log the data. So far I just copied the console output by hand, though there are other ways.
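For example, a minimal sketch that writes the result straight to a file instead of copying console output, assuming it runs at the end of the async function where output is in scope:

const fs = require('fs');
// Dump the two-dimensional output array to disk once the loop finishes
fs.writeFileSync('data.txt', JSON.stringify(output), 'utf8');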

The resulting data.txt:

[["CO2 laser brazing process characteristics of 5052 aluminum alloy/galvanized steel coated with powder; Wenkai jiang; Shu-rong yu; Zhang jian. \n1-4+113&HJXB201401001"."Arc Behavior of Ultrasonic MIG Welding of Aluminum Alloy \n Fan Chenglei; Xie Weifeng; Chun-li Yang; KouYi; \n5-8+113&HJXB201401002". ] . ]Copy the code

Crawl abstracts, keywords and other information

At this point we have part of the information, but the abstract and keywords still have to be fetched from the second level, each article's own page.

Perform some pre-processing on the data

npm run analysis

This part processes the lists obtained above. First, flatten the two-dimensional arrays:

const out2014S = require('./output2014');
const out2015S = require('./output2015');
const fs = require('fs');

// Get mutable references
let out2014 = out2014S;
let out2015 = out2015S;
// Flatten until no nested arrays remain
while (out2014.some(Array.isArray)) {
	out2014 = [].concat(...out2014);
}
while (out2015.some(Array.isArray)) {
	out2015 = [].concat(...out2015);
}
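As an aside, on Node 11+ the built-in Array.prototype.flat collapses arbitrarily deep nesting in one call, which would replace both while loops:

// Equivalent to the while loops above
let flat2014 = out2014S.flat(Infinity);
let flat2015 = out2015S.flat(Infinity);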

The following is an example of the data obtained so far:

"CO2 laser brazing process characteristics of 5052 aluminum alloy/galvanized steel coated with powder; Wenkai jiang; Shu-rong yu; Zhang jian. \n1-4+113&HJXB201401001".Copy the code

To analyze this, define a split function:

function SecondeSplit(arr, year) {
	// Serialize the entry; the \n survives as a literal "\n" we can split on
	let str = JSON.stringify(arr);
	console.log('str' + str);
	let nArr = str.split('\\n');
	console.log('nArr' + nArr);
	// 0: title
	// 1: authors as one string
	// 2: pages and link
	let res = {};
	// Clean off the leading quote
	res.title = nArr[0].replace(/\"/i, '');
	let names = nArr[1].split('; ');
	res.name = names.slice(0, names.length - 1);
	// Some articles have no page numbers or link
	if (nArr[2]) {
		let linkArr = nArr[2].split('&');
		// Clean off the trailing quote
		let link = linkArr[1].replace(/\"/i, '');
		// The dbname differs slightly between the two years
		if (year === 2014) {
			res.link = `http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CJFD&filename=${link}&dbname=CJFD2014`;
		}
		if (year === 2015) {
			res.link = `http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CJFD&filename=${link}&dbname=CJFDLAST2015`;
		}
		let pages = linkArr[0].split('+');
		let pageArr = pages[0].split('-');
		res.start = pageArr[0];
		res.end = pageArr[1];
	}
	return res;
}

// Apply it to both years of data
let ret2014 = [];
out2014.forEach(i => {
	let tmp = SecondeSplit(i, 2014);
	ret2014.push(tmp);
});
// ... same for 2015

let ret = ret2014.concat(ret2015);

let jsonObj = {};
jsonObj.data = ret;
// The '\t' argument gives us pretty-printed JSON
let wObj = JSON.stringify(jsonObj, null, '\t');
fs.writeFile('data.json', wObj, err => {
	console.log(err);
});

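To make the parsing concrete, here is roughly what SecondeSplit returns for the second sample entry shown earlier (a sketch, assuming the raw string layout from data.txt):

// Input: "Arc Behavior of Ultrasonic MIG Welding of Aluminum Alloy\nFan Chenglei; Xie Weifeng; Chun-li Yang; KouYi; \n5-8+113&HJXB201401002"
// SecondeSplit(input, 2014) yields approximately:
// {
//   title: 'Arc Behavior of Ultrasonic MIG Welding of Aluminum Alloy',
//   name: ['Fan Chenglei', 'Xie Weifeng', 'Chun-li Yang', 'KouYi'],
//   link: 'http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CJFD&filename=HJXB201401002&dbname=CJFD2014',
//   start: '5',
//   end: '8'
// }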

Crawl abstracts, etc

npm run abstract

The idea is to keep driving puppeteer: for each link, open the article page and grab its abstract, institution, and keyword information.

This part of the puppeteer code isn't written as an async IIFE; chaining .then() is convenient too.

const obj = require('../data1.json');
const fs = require('fs');
const puppeteer = require('puppeteer');
// We want to mutate obj in place
let data = obj;
const len = data.data.length;
puppeteer
	.launch({
		headless: true,
	})
	.then(async browser => {
		for (let i = 0; i < len; i++) {
			if (data.data[i].link) {
				const res = await getAbstract(i, data.data[i].link, browser);
				// Log the keywords to check whether the fetch succeeded
				console.log(i + ':' + res.keywords);
				data.data[i].abstract = res.abstract;
				data.data[i].school = res.school;
				data.data[i].keywords = res.keywords;
			}
		}
	})
	.then(() => {
		console.log('Get the information done!');
		// console.log(data.data[0].abstract);
		// Save back to data1.json
		save(data);
	});
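save is not shown in the excerpt; presumably it is a small helper along these lines (my assumption, not the original code):

// Hypothetical helper: write the enriched object back to data1.json
function save(data) {
	fs.writeFile('data1.json', JSON.stringify(data, null, '\t'), err => {
		if (err) console.log(err);
	});
}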

getAbstract is the function that fetches the abstract; it takes the index, the link, and the browser instance:

async function getAbstract(num, link, browser) {
	const page = await browser.newPage();
	await page.goto(link);
	await page.waitFor(3000);
	// Abstract
	let abs = await page.$('#ChDivSummary');
	let abstract = await page.evaluate(abs => {
		return abs.innerText;
	}, abs);
	// Institution
	let schoolDOM = await page.$('.orgn');
	let school = await page.evaluate(schoolDOM => {
		let arr = schoolDOM.querySelectorAll('span > a');
		let res = '';
		arr.forEach(i => {
			res += i.text + ',';
		});
		// Join the names and drop the trailing comma
		return res.slice(0, res.length - 1);
	}, schoolDOM);
	// Keywords
	let keysDOM = await page.$('#catalog_KEYWORD');
	let keys = await page.evaluate(keysDOM => {
		// let arr = keysDOM.querySelectorAll('p')[2].querySelectorAll('a');
		// The line above is fragile: when the funding entry is missing,
		// the keywords are not necessarily in the third <p>
		// Instead, find the element whose id we know and walk its siblings
		let arr = keysDOM.parentNode.children;
		let res = '';
		for (let j = 1; j < arr.length; j++) {
			res += arr[j].text.replace(/ /g, '').replace(/\n/g, '');
		}
		return res;
	}, keysDOM);
	await page.waitFor(3000);
	// Close the page after each query to save memory
	await page.close();
	return {
		abstract: abstract,
		school: school,
		keywords: keys,
	};
}
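Two small design choices worth noting: the articles are fetched strictly one at a time (each await finishes before the next page opens), which keeps the request rate gentle, and each page is closed as soon as it has been scraped, so memory use stays flat across the 650+ articles.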

This gives you the complete data:

{
	"data": [
		{
			"title": "Process characteristics of CO2 laser brazing of 5052 aluminum alloy/Galvanized steel coated with powder",
			"name": ["FanDing", "Jiang wenkai", "Yu Shurong", "Zhang jian"],
			"link": "http://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CJFD&filename=HJXB201401001&dbname=CJFD2014",
			"start": "1",
			"end": "4",
			"abstract": "Using 5052 aluminum alloy and hot-dip galvanized ST04Z steel as the research object, the process test was carried out by precoated CO2 laser lap brazing method. The microstructure and mechanical properties of fusion brazing joints were studied by optical microscope, scanning electron microscope and tensile testing machine. The results show that the weld forming is improved obviously and the galvanizing layer is not burned. The maximum thickness of the transition layer is less than 10μm, and the acicular Al-Fe intermetallic compounds do not precipitate to the molten aluminum side. The joint has high mechanical properties and the maximum mechanical loading capacity is 208 MPa, which is about 95.41% of the tensile strength of 5052 aluminum alloy base material.",
			"school": "State Key Laboratory of Advanced Non-Ferrous Metal Materials of Gansu Province, Key Laboratory of Non-Ferrous Metal Alloys and Processing of Ministry of Education, Lanzhou University of Technology",
			"keywords": "Aluminum steel; Laser welding; Molten brazing; Powder;"
		},
		...
	]
}

Export the data to Excel

This step exports the data; the requirements are clear:

My idea: for each item, write as many rows as the author list is long, then merge every column except author and institution across those rows.

const Excel = require('exceljs');
const data = require('../data1.json');

// Data preprocessing
let input = [];
let obj = data.data;
obj.forEach((item, index) => {
	let len = item.name.length;

	let link = item.link;
	let reg = /HJXB201(4|5)([0-9]{2})/i;

	let year = -1;
	let juan = -1;
	let vol = -1;
	if (link) {
		year = link.substring(link.length - 4, link.length);
		// 2014 is volume 35, 2015 is volume 36
		juan = year == 2014 ? 35 : 36;
		// The issue number is in the link, as the second capture group
		vol = reg.exec(link)[2];
	}

	for (let i = 0; i < len; i++) {
		// Shape the data the way excelJS needs it
		input.push({
			index: index + 1,
			title: item.title,
			name: item.name[i],
			lang: 'Chinese',
			school: item.school,
			abstract: item.abstract,
			year: year,
			juan: juan,
			vol: vol,
			keyType: 'Key words',
			paperName: 'Transactions of the China Welding Institution',
			keywords: item.keywords,
			start: item.start,
			end: item.end,
		});
	}
});

Then use ExcelJS to create the worksheet:

// Excel processing
let workbook = new Excel.Workbook();

workbook.creator = 'xujx';

let sheet = workbook.addWorksheet('sheet 1');

sheet.columns = [
	{ header: 'Number', key: 'index', width: 10 },
	{ header: 'Unique identifier type', key: 'onlykey', width: 10 },
	{ header: 'Unique identifier', key: 'onlyid', width: 10 },
	{ header: 'Title', key: 'title', width: 15 },
	{ header: 'Text language', key: 'lang', width: 10 },
	{ header: 'Responsible person/Name', key: 'name', width: 15 },
	{ header: 'Responsible person/Responsible organization/Organization name', key: 'school', width: 15 },
	{ header: 'Abstract', key: 'abstract', width: 15 },
	{ header: 'Theme/Theme element type', key: 'keyType', width: 15 },
	{ header: 'Theme/Theme name', key: 'keywords', width: 15 },
	{ header: 'Journal name', key: 'paperName', width: 15 },
	{ header: 'Year of publication', key: 'year', width: 15 },
	{ header: 'Specification journal URI', key: 'URI', width: 15 },
	{ header: 'Volume', key: 'juan', width: 15 },
	{ header: 'Issue', key: 'vol', width: 15 },
	{ header: 'Starting page number', key: 'start', width: 15 },
	{ header: 'Ending page number', key: 'end', width: 15 },
	{ header: 'Inclusion information/Inclusion category code', key: 'typeCode', width: 15 },
];
sheet.addRows(input);

After that, merge the cells:

// Merge cells
// Record each item's author count in an array
let nameLength = [];
obj.forEach(item => {
	if (item.name.length) {
		nameLength.push(item.name.length);
	} else {
		nameLength.push(0);
	}
});

Merge the cells starting from the second row (the first row is the table header); columns F and G, author and institution, are deliberately left unmerged:

for (let j = 0; j < ret.length; j += 2) {
	sheet.mergeCells(`A${ret[j]}:A${ret[j + 1]}`);
	sheet.mergeCells(`B${ret[j]}:B${ret[j + 1]}`);
	sheet.mergeCells(`C${ret[j]}:C${ret[j + 1]}`);
	sheet.mergeCells(`D${ret[j]}:D${ret[j + 1]}`);
	sheet.mergeCells(`E${ret[j]}:E${ret[j + 1]}`);
	sheet.mergeCells(`H${ret[j]}:H${ret[j + 1]}`);
	sheet.mergeCells(`I${ret[j]}:I${ret[j + 1]}`);
	sheet.mergeCells(`J${ret[j]}:J${ret[j + 1]}`);
	sheet.mergeCells(`K${ret[j]}:K${ret[j + 1]}`);
	sheet.mergeCells(`L${ret[j]}:L${ret[j + 1]}`);
	sheet.mergeCells(`M${ret[j]}:M${ret[j + 1]}`);
	sheet.mergeCells(`N${ret[j]}:N${ret[j + 1]}`);
	sheet.mergeCells(`O${ret[j]}:O${ret[j + 1]}`);
	sheet.mergeCells(`P${ret[j]}:P${ret[j + 1]}`);
	sheet.mergeCells(`Q${ret[j]}:Q${ret[j + 1]}`);
	sheet.mergeCells(`R${ret[j]}:R${ret[j + 1]}`);
}

workbook.xlsx.writeFile('1.xlsx').then(function() {
	// done
	console.log('done');
});

The array ret used above holds the start and end row of each range to merge:

let ret = [];
// Start at row 2
ret.push(2);
// Walk through the author counts
for (let i = 0; i < nameLength.length; i++) {
	// The last value pushed so far
	let head = ret[ret.length - 1];
	// If the array length is even, the previous range is complete,
	// so push the start row of the next range
	if (ret.length % 2 === 0) {
		ret.push(head + 1);
		// This iteration didn't consume a nameLength entry, so undo the i++
		i--;
	} else {
		// Odd length: push the end row of the current range,
		// i.e. start row + (number of authors - 1)
		ret.push(head + nameLength[i] - 1);
	}
}
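A quick trace makes the pairing logic clearer. Suppose the first article has 4 authors and the second has 3:

// nameLength = [4, 3]
// ret starts as [2]
// i = 0: length 1 is odd  → push 2 + 4 - 1 = 5  → ret = [2, 5]
// i = 1: length 2 is even → push 5 + 1 = 6, i-- → ret = [2, 5, 6]
// i = 1: length 3 is odd  → push 6 + 3 - 1 = 8  → ret = [2, 5, 6, 8]
// So rows 2 to 5 are merged for article 1, and rows 6 to 8 for article 2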

That’s 99% done!

Unfinished parts

  • The requirements also say each author must be matched to their own institution, which means downloading the article PDFs for analysis.
  • I tried pdf2json for this, but without success; pressed for time, I switched on artificial-intelligence mode: doing it manually
  • A little tired indeed.

Source code

It's on GitHub; please leave a star 555