How to highlight NLP-segmented text on a page


Requirements

The company has a project to build a text semantic analysis tool that uses an NLP word segmentation tool for model training. The front end has two main interactions: text highlighting and manual training.

Let's start with text highlighting.

Text highlighting

First, the text is organized into paragraphs, and there can be several of them.

Second, each paragraph contains multiple sentences.

For example: "Wuhan University is located at Luojiashan, Wuhan"

So the back end returns data in the following format. Note that startIndex and endIndex are character offsets into the original Chinese sentence, where "Wuhan University" is four characters long and the second "Wuhan" starts at index 7.

const data = [
  {
    paraId: 0,
    paraText: 'Wuhan University is located at Luojiashan, Wuhan',
    paraEntity: [
      { category: 'noun', labelText: 'Wuhan University', startIndex: 0, endIndex: 4, color: 'rgba(240, 215, 12, 0.5)' },
      { category: 'name', labelText: 'Wuhan', startIndex: 7, endIndex: 9, color: '#00baff' },
    ],
  },
];
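As a quick sanity check on this format, the sketch below verifies that each entity's offsets actually select its labelText. It assumes endIndex is exclusive (as the 0 to 4 span for a four-character term suggests) and uses offsets recomputed for the English sentence. `findBadEntities` and the sample values are hypothetical, not part of the original project.

```javascript
// Hypothetical sample in the article's format; offsets recomputed for the English text.
const text = 'Wuhan University is located at Luojiashan, Wuhan';
const sample = [
  {
    paraId: 0,
    paraText: text,
    paraEntity: [
      { category: 'noun', labelText: 'Wuhan University', startIndex: 0, endIndex: 16, color: 'rgba(240, 215, 12, 0.5)' },
      { category: 'name', labelText: 'Wuhan', startIndex: 43, endIndex: 48, color: '#00baff' },
    ],
  },
];

// Returns the entities whose [startIndex, endIndex) slice does not equal labelText.
function findBadEntities(paragraphs) {
  const bad = [];
  paragraphs.forEach((para) => {
    para.paraEntity.forEach((entity) => {
      const slice = para.paraText.slice(entity.startIndex, entity.endIndex);
      if (slice !== entity.labelText) {
        bad.push({ paraId: para.paraId, labelText: entity.labelText, found: slice });
      }
    });
  });
  return bad;
}

console.log(findBadEntities(sample)); // an empty array means every offset matches
```

Running a check like this when the data arrives makes offset mismatches visible before any markup is generated.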

How does the front end highlight the text?

The initial method:

Brute-force replace

Extract the data returned from the back end into an array of objects, then iterate over it and replace each word that needs highlighting with highlighted markup.

The problem is obvious: the same word cannot be given different parts of speech in different places.

For example:

Wuhan University is located at Luojiashan, Wuhan

"Wuhan University" is a noun and "Wuhan" is a place name, but with the method above every occurrence of "Wuhan" is marked as a place name (or as whichever part of speech is applied to it).
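The problem can be reproduced in a few lines. This is a minimal sketch of a replace-everything approach, not the project's actual code. `naiveHighlight` and the color value are made up for the example.

```javascript
// Naive approach: replace every occurrence of each label with a colored tag.
function naiveHighlight(text, entities) {
  let html = text;
  entities.forEach(({ labelText, color }) => {
    // split/join replaces all occurrences, wherever they appear in the text.
    html = html.split(labelText).join(`<em style="background: ${color};">${labelText}</em>`);
  });
  return html;
}

const html = naiveHighlight('Wuhan University is located at Luojiashan, Wuhan', [
  { labelText: 'Wuhan', color: '#00baff' }, // meant only for the place name at the end
]);

// Both occurrences are wrapped, including the one inside "Wuhan University".
console.log(html);
```

Because the replacement works on the word rather than on its position, the "Wuhan" inside "Wuhan University" cannot be told apart from the place name.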

So how can this be improved? I thought of splitting the text.

Split method

Iterate over each character in the text and wrap it in an em tag.

Once the text is split into em tags, each character can be given its own color, that is, its part of speech.

<em>W</em><em>u</em><em>h</em><em>a</em><em>n</em> ...

So how do we locate a given character?

Here we use the data attribute in HTML.

The data attribute

The data-* global attributes are a class of attributes called custom data attributes. They let us embed custom data on any HTML element and exchange proprietary information between the HTML and its DOM representation through scripts.

Using the getAttribute method in JS, we can locate the corresponding text from the start position and length supplied by the back end.
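Reading the position back might look like the sketch below. `readPosition` and `isInsideEntity` are hypothetical helpers, and the attribute names data-paranum and data-index are the ones used by the split step in this article. A stand-in object replaces the real DOM element so the snippet also runs outside a browser.

```javascript
// Reads the paragraph number and character index back from an element
// produced by the split step. `el` is anything with a getAttribute method,
// for example the <em> under the cursor in a click handler.
function readPosition(el) {
  return {
    paraNum: Number(el.getAttribute('data-paranum')),
    index: Number(el.getAttribute('data-index')),
  };
}

// True when the element's index lies inside an entity's [startIndex, endIndex) span.
function isInsideEntity(el, entity) {
  const { index } = readPosition(el);
  return index >= entity.startIndex && index < entity.endIndex;
}

// Stand-in for a DOM element so the sketch runs without a browser.
const fakeEm = {
  attrs: { 'data-paranum': '0', 'data-index': '2' },
  getAttribute(name) { return this.attrs[name]; },
};

console.log(readPosition(fakeEm)); // { paraNum: 0, index: 2 }
console.log(isInsideEntity(fakeEm, { startIndex: 0, endIndex: 4 }));
```

In a browser, the same attributes can also be used in a selector, for example `document.querySelector('em[data-paranum="0"][data-index="7"]')`, to go from a position to its element.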

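// Wrap every character of textStr in its own tag.
// k is the paragraph number; i is the character's index within the paragraph.
function createElement(textStr, tagName = 'em', k) {
  let newTextStr = '';
  for (let i = 0; i < textStr.length; i += 1) {
    newTextStr += `<${tagName} data-paranum="${k}" data-index="${i}">${textStr[i]}</${tagName}>`;
  }
  return newTextStr;
}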

Note that the text may contain special characters, such as book title marks and parentheses (), so it has to be escaped before it is used in a regular expression.

/* Escape regular-expression metacharacters in str */
function escapeStr(str) {
  let regStr = '';
  const specialArry = ['(', ')', '[', ']', '\\', '+', '*', '?', '.', '|'];
  for (let k = 0; k < str.length; k += 1) {
    if (specialArry.indexOf(str[k]) > -1) {
      regStr += `\\${str[k]}`;
    } else {
      regStr += str[k];
    }
  }
  return regStr;
}
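To see why the escaping matters, here is a small sketch that builds a RegExp from a label containing parentheses. `escapeForRegExp` is a hypothetical one-line variant covering the same character list as escapeStr, and the label and sentence are invented for the example.

```javascript
// Prefix the same metacharacters as escapeStr with a backslash.
function escapeForRegExp(str) {
  return str.replace(/[()[\]\\+*?.|]/g, '\\$&');
}

const label = 'Wuhan (city)';
const sentence = 'Luojiashan is in Wuhan (city).';

// Unescaped, the parentheses form a capture group, so the pattern
// looks for "Wuhan city" and misses the literal text.
const unsafe = new RegExp(label);
// Escaped, the parentheses are matched literally.
const safe = new RegExp(escapeForRegExp(label));

console.log(unsafe.test(sentence)); // false
console.log(safe.test(sentence));   // true
```

An unbalanced label such as "Wuhan (" is worse still: passed to RegExp unescaped, it throws a SyntaxError instead of merely failing to match.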

Result

Summary of the approach

1. First, split the paragraph text into characters, wrap each one in an em tag, and add data attributes (paragraph number and character position).

2. Loop through the data. The smallest unit is a single character, so there are three nested loops (paragraphs, entities, characters):

data.forEach((e, k) => { // k is the paragraph index
  for (let m = 0; m < e.paraEntity.length; m += 1) {
    for (let i = 0; i < e.paraEntity[m].labelText.length; i += 1) {
      // handle character i of entity m (see steps 3 and 4)
    }
  }
});

3. Inside the loops, build the unhighlighted string and the highlighted string.

Note: notHightStr and hightStr are variables declared inside forEach.

notHightStr = `<em data-paranum="${k}" data-index="${i + e.paraEntity[m].startIndex}">${e.paraEntity[m].labelText[i]}</em>`;
hightStr = `<em data-paranum="${k}" data-index="${i + e.paraEntity[m].startIndex}" style="background: ${e.paraEntity[m].color};">${e.paraEntity[m].labelText[i]}</em>`;

4. Replace the unhighlighted string with the highlighted one.

Note: emStr is declared outside the loop.

let emStr = createElement(fileText, 'em', k);

emStr = emStr.replace(notHightStr, hightStr);

5. Generate the concatenated, highlighted HTML.

markTxt += `<p>${emStr}</p>`;
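Put together, the five steps could look roughly like the sketch below. It is a self-contained approximation rather than the project's actual code. `wrapChars` and `buildMarkup` are hypothetical names, the colors and offsets are chosen for the English sentence, endIndex is assumed to be exclusive, and HTML-escaping of the text itself is left out.

```javascript
// Step 1: wrap every character in a tag carrying its paragraph number and index.
function wrapChars(text, paraNum, tagName = 'em') {
  let out = '';
  for (let i = 0; i < text.length; i += 1) {
    out += `<${tagName} data-paranum="${paraNum}" data-index="${i}">${text[i]}</${tagName}>`;
  }
  return out;
}

// Steps 2 to 5: swap the plain tag of every entity character for a colored one.
function buildMarkup(paragraphs) {
  let markTxt = '';
  paragraphs.forEach((para, k) => {
    let emStr = wrapChars(para.paraText, k);
    para.paraEntity.forEach((entity) => {
      for (let i = entity.startIndex; i < entity.endIndex; i += 1) {
        const ch = para.paraText[i];
        const plain = `<em data-paranum="${k}" data-index="${i}">${ch}</em>`;
        const colored = `<em data-paranum="${k}" data-index="${i}" style="background: ${entity.color};">${ch}</em>`;
        // A replacer function keeps "$" sequences in the text from being interpreted.
        emStr = emStr.replace(plain, () => colored);
      }
    });
    markTxt += `<p>${emStr}</p>`;
  });
  return markTxt;
}

const text = 'Wuhan University is located at Luojiashan, Wuhan';
const last = text.lastIndexOf('Wuhan');
const markup = buildMarkup([{
  paraId: 0,
  paraText: text,
  paraEntity: [
    { labelText: 'Wuhan University', startIndex: 0, endIndex: 16, color: '#ffd700' },
    { labelText: 'Wuhan', startIndex: last, endIndex: last + 5, color: '#00baff' },
  ],
}]);
console.log(markup);
```

Because every tag carries a unique data-index, each replace call hits exactly one character, so the two occurrences of "Wuhan" can receive different colors.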

The complete code

Github.com/642134542/H…

Final thoughts

In a real project the returned data varies from developer to developer: besides field names, the data structure and the way each word's position is reported may differ, so the code needs to be adjusted to the actual situation. In general, though, splitting the text makes positions easy to locate.

There are corresponding disadvantages, however: the number of HTML tags grows and the loops are deeply nested, both of which hurt performance.

Besides recognition by the semantic model, manual adjustment is also needed. The general interaction is to right-click a term to read its data attributes and text, then send them to the back end to apply the manual correction.