Motivation

In a previous project I ran into a requirement: the front end uploads a Word document, and the back end extracts the content at specified locations in the document and saves it. The back end was Node.js. When I first received this requirement I had no idea where to start, mainly because I had never processed Word documents before. For Excel there are ready-made libraries, and it is fairly simple.

Train of thought

After searching for a while, I found a package on npm called adm-zip, which can unzip Word documents. It turns out that a Word document can be unzipped, which I did not know before. The following code unzips a Word document so that its content can then be extracted

var admZip = require('adm-zip');
const zip = new admZip('test.docx');
// Unpack the docx to the specified folder result
zip.extractAllTo("./result", /*overwrite*/ true);

First we create a new DOCX document with the following contents

Then run the code above to unzip it and obtain the following files. As the figure below shows, several folders are generated. The content of the Word document is actually in the document.xml file in the word folder (the original source file is still there after decompression; it does not disappear).

The contents of the word folder

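Let's set the four characters in the test document to a bold, colored, italic font, as shown below

In document.xml these settings show up as tags on the text: <w:b/> for bold, <w:i/> for italic, and <w:color> for the font color.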
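To make these tags concrete, here is a minimal sketch that checks one piece of XML for them. The sample string and the helper name describeRun are made up for illustration; a real document.xml carries more attributes, and Word can also switch a property off with a w:val attribute, which this sketch ignores.

```javascript
// Hypothetical sample: one run (<w:r>) holding bold, italic, red text.
const sampleRun =
  '<w:r><w:rPr><w:b/><w:i/><w:color w:val="FF0000"/></w:rPr>' +
  '<w:t>Test</w:t></w:r>';

// describeRun is an illustrative helper, not part of any library.
// It only recognizes the bare <w:b/> and <w:i/> forms.
function describeRun(runXml) {
  const color = runXml.match(/<w:color\s[^>]*w:val="([^"]+)"/);
  return {
    bold: /<w:b\/>/.test(runXml),
    italic: /<w:i\/>/.test(runXml),
    color: color ? color[1] : null,
  };
}

console.log(describeRun(sampleRun));
```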

Extract the contents

As mentioned above, the XML is just a text representation of the document, so we can read the entire XML with the following code, which returns a string

var contentXml = zip.readAsText("word/document.xml");

How do we pull the content out of that string? The answer is regular expressions. First of all, we need to analyze the structure of a Word document. A Word document is made up of paragraphs (Paragraph)

If you look closely at the Word document, do you see the small arrows in the figure below? Each arrow is a Paragraph. There are 16 paragraphs in the figure below

In the XML, each paragraph corresponds to a <w:p></w:p> pair, so document.xml contains one <w:p> per paragraph.

We can extract the text of each paragraph and return an array, where each item is the content of one paragraph. The text inside a <w:p> is held in <w:t> tags, and a single <w:p> may contain several <w:t>.

First, all <w:p> blocks are extracted with a regular expression. Then, for each <w:p> block, a second regular expression extracts all <w:t> contents, which are spliced together to form the full text of the paragraph
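The two steps can be tried on a small string before touching a real file. This is only a sketch: the sample XML is a heavily simplified stand-in for word/document.xml, and extractParagraphs is a name I made up here, not the API of any package.

```javascript
// Hypothetical, heavily simplified stand-in for word/document.xml.
const sampleXml =
  '<w:body>' +
  '<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t>World</w:t></w:r></w:p>' +
  '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>' +
  '</w:body>';

// extractParagraphs is an illustrative name, not a library function.
function extractParagraphs(xml) {
  // Step 1: every <w:p>...</w:p> block (non-greedy, so blocks stay separate).
  const paragraphs = xml.match(/<w:p[ >].*?<\/w:p>/g) || [];
  return paragraphs.map(function (p) {
    // Step 2: capture the text inside each <w:t> and join the pieces.
    const pieces = Array.from(p.matchAll(/<w:t(?:\s[^>]*)?>(.*?)<\/w:t>/g));
    return pieces.map(function (m) { return m[1]; }).join('');
  });
}

console.log(extractParagraphs(sampleXml));
```

The first paragraph is split across two <w:t> tags on purpose, to show why the pieces have to be joined.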

Specific code

Below is the extraction code in full

var fs = require('fs');
var admZip = require('adm-zip');

// The second argument is a callback that is invoked when parsing is complete
var parser = function parseWordDocument(absoluteWordPath, callback){
	// Array of paragraph contents to return
	var resultList = [];
	// Check whether the file exists
	fs.exists(absoluteWordPath, function(exists){
		if(exists){
			// Unzip
			const zip = new admZip(absoluteWordPath);
			// Read document.xml as text
			var contentXml = zip.readAsText("word/document.xml");
			// Note: ? means non-greedy mode (as few characters as possible),
			// otherwise only one <w:p></w:p> would be matched
			var matchedWP = contentXml.match(/<w:p.*?>.*?<\/w:p>/gi);
			if(matchedWP){
				matchedWP.forEach(function(wpItem){
					var matchedWT = wpItem.match(/(<w:t>.*?<\/w:t>)|(<w:t\s.[^>]*?>.*?<\/w:t>)/gi);
					var textContent = '';
					if(matchedWT){
						matchedWT.forEach(function(wtItem){
							// If not in <w:t xml:space="preserve"> format
							if(wtItem.indexOf('xml:space') === -1){
								textContent += wtItem.slice(5, -6);
							}else{
								textContent += wtItem.slice(26, -6);
							}
						});
						resultList.push(textContent)
					}
				});
			}
			// Parsing is complete (called even when no paragraph matched)
			callback(resultList)
		}else{
			callback(resultList)
		}
	});
};
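The note about ? in the code above is easy to check on a small string. The two-paragraph snippet below is made up for the test; it only shows how the greedy and non-greedy forms of the same pattern differ.

```javascript
// Hypothetical two-paragraph snippet.
const xml = '<w:p><w:t>one</w:t></w:p><w:p><w:t>two</w:t></w:p>';

// Non-greedy: each match stops at the first </w:p> it can reach.
const lazy = xml.match(/<w:p>.*?<\/w:p>/g);
// Greedy: the match runs on to the last </w:p> in the string.
const greedy = xml.match(/<w:p>.*<\/w:p>/g);

console.log(lazy.length, greedy.length);
```

With the greedy form the whole string comes back as a single match, so all paragraphs would be merged into one array item.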

Note that if the text is preceded by a space, the <w:t> tag takes a different form, <w:t xml:space="preserve">, as follows
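The slice offsets 5, 26 and 6 in the parser are simply the lengths of the tags being cut off. The sketch below checks that, and also shows an alternative that uses a capture group instead of hard-coded offsets. The helper name innerText and the sample strings are my own for illustration.

```javascript
const plain = '<w:t>text</w:t>';
const preserved = '<w:t xml:space="preserve"> text</w:t>';

// Lengths of the opening tags and the closing tag.
const openPlain = '<w:t>'.length;
const openPreserved = '<w:t xml:space="preserve">'.length;
const close = '</w:t>'.length;
console.log(openPlain, openPreserved, close);

// Alternative sketch: let the regular expression capture the inner text,
// so any attribute on <w:t> is handled without counting characters.
function innerText(wt) {
  const m = wt.match(/^<w:t(?:\s[^>]*)?>(.*)<\/w:t>$/);
  return m ? m[1] : '';
}

console.log(JSON.stringify(innerText(plain)));
console.log(JSON.stringify(innerText(preserved)));
```

The capture group keeps working if Word writes a different attribute on <w:t>, whereas the fixed offset 26 only fits xml:space="preserve".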

The amount of code is actually very small; the key lies in writing the regular expressions. The output of extracting the docx document above is as follows

I ended up publishing the tool as an npm package; the address is here