Article source | Turbine Cloud community (Turbine Cloud, an AI-industry platform focused on the shared workforce)

Original address | paper notes

Original author | Mathor


Whether I show up or not, the big shots just keep posting! So I'll keep faithfully reposting their work.

Main text begins:

Text augmentation is now widely used because it can improve the effectiveness of text classification; common methods include, but are not limited to, replacement, deletion, and insertion. Text augmentation generally improves final performance, but in some cases it makes things worse. You might suspect this happens because deletion or replacement removes important words from a sentence. But which words are important? Which words can safely be augmented, and which are best left alone?
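As a baseline for the discussion that follows, here is a minimal sketch of two of the indiscriminate augmentation operations mentioned above, random deletion and random swap (function names and parameters are my own, not from the paper):

```python
import random


def random_delete(tokens, p=0.1, rng=random):
    """Drop each token independently with probability p (keep at least one token)."""
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]


def random_swap(tokens, n_swaps=1, rng=random):
    """Swap n_swaps randomly chosen pairs of positions."""
    out = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(out)), rng.randrange(len(out))
        out[i], out[j] = out[j], out[i]
    return out
```

Note that these operations are blind to word importance, which is exactly the problem the paper studies.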

An ACL 2022 paper entitled "Roles of Words: What Should (n't) Be Augmented in Text Augmentation on Text Classification Tasks?" examines this problem and offers guidance. First, the authors trained a model on the FD News dataset and achieved 98.92% accuracy on the test set, indicating that the model fits the dataset very well. Then they manually entered several test samples, as shown below. Because the words "basketball" and "athletes" often appear in training samples of the "sport" category, the model predicts the "sport" category very accurately. However, on the second and fourth samples the model did not perform as well as expected: since "based on" and "team" often co-occur with sentences of the "sport" category in the training set, a model trained on this dataset is naturally somewhat "biased". And in the last example, the model failed to correctly recognize a sports-related term: "three-pointer".

The example above inspires us to look at each word in a sentence from two perspectives: "statistical correlation" and "semantic similarity". Specifically, we can assign a "role" to each word based on these two dimensions, and there are four roles in total:

  1. Common class-indicating words (CC-words): high statistical correlation and high semantic similarity
  2. Specific class-indicating words (SC-words): low statistical correlation and high semantic similarity
  3. Intermediate class-indicating words (IC-words): high statistical correlation and low semantic similarity
  4. Class-irrelevant words / other words (O-words): low statistical correlation and low semantic similarity

STATISTICAL CORRELATION & SEMANTIC SIMILARITY

The weighted log-likelihood ratio (WLLR) is used to measure the statistical correlation between each word in a sentence and the category. The WLLR score is calculated as follows:

$$\text{WLLR}(w, y) = p(w \mid y) \cdot \log \frac{p(w \mid y)}{p(w \mid \bar{y})}$$

where $w$ is a word, $y$ is a category, and $\bar{y}$ represents all the other categories. The greater $\text{WLLR}(w, y)$ is, the higher the statistical correlation between the word $w$ and the category $y$.
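The WLLR score can be computed directly from per-class word counts. A minimal sketch (the add-one smoothing and the helper names are my own, not from the paper):

```python
import math
from collections import Counter


def wllr(word, label, docs):
    """WLLR(w, y) = p(w|y) * log(p(w|y) / p(w|y_bar)).

    docs: list of (tokens, label) pairs; y_bar pools all other labels.
    Add-one smoothing (my addition) avoids log(0) and division by zero.
    """
    in_class, out_class = Counter(), Counter()
    for tokens, y in docs:
        (in_class if y == label else out_class).update(tokens)
    vocab = set(in_class) | set(out_class)
    p_in = (in_class[word] + 1) / (sum(in_class.values()) + len(vocab))
    p_out = (out_class[word] + 1) / (sum(out_class.values()) + len(vocab))
    return p_in * math.log(p_in / p_out)
```

A class-specific word like "basketball" gets a high positive score for "sport", while a function word like "the" scores near zero.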

To measure the semantic similarity between a word and a category, the most direct method is to compute the cosine similarity of their two vectors. The authors do not use a heavy BERT-based model to extract word vectors here, because that would require large computing resources; instead they directly use the simple Word2Vec method to obtain word vectors. The cosine similarity is calculated as follows:

$$\text{sim}(w, l) = \cos(v_w, v_l) = \frac{v_w \cdot v_l}{\lVert v_w \rVert \, \lVert v_l \rVert}$$

where $l$ represents the category, and $v_w, v_l$ represent the vector representations of the word and the category respectively.

Generally, categories have text descriptions, such as "sports", "computer", etc., and we directly use these descriptions as $l$.

After calculating the statistical correlation and cosine similarity of all words in a given sentence, we set thresholds to distinguish high (low) WLLR scores, $C_h$ ($C_l$), as well as high (low) cosine scores, $S_h$ ($S_l$).

The four roles are then assigned as follows:

$$\begin{aligned} W_{CC} &= \{\, w \mid \text{WLLR}(w,y) \in C_h,\ \text{sim}(w,l) \in S_h \,\} \\ W_{SC} &= \{\, w \mid \text{WLLR}(w,y) \in C_l,\ \text{sim}(w,l) \in S_h \,\} \\ W_{IC} &= \{\, w \mid \text{WLLR}(w,y) \in C_h,\ \text{sim}(w,l) \in S_l \,\} \\ W_{O} &= \{\, w \mid \text{WLLR}(w,y) \in C_l,\ \text{sim}(w,l) \in S_l \,\} \end{aligned}$$

where $W_{CC}, W_{SC}, W_{IC}, W_{O}$ denote the sets of CC-words, SC-words, IC-words, and O-words respectively. A real example looks like the following:
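The role assignment can be sketched as follows, using the median of each indicator as the threshold (which is what the paper's experiments do; the function and role labels here are my own shorthand):

```python
from statistics import median


def assign_roles(scores):
    """scores: {word: (wllr_score, cosine_score)} for one sentence.

    Uses the per-sentence median of each indicator as the high/low
    threshold, then maps each word to one of CC / SC / IC / O.
    """
    c_thr = median(c for c, _ in scores.values())
    s_thr = median(s for _, s in scores.values())
    roles = {}
    for w, (c, s) in scores.items():
        if c >= c_thr and s >= s_thr:
            roles[w] = "CC"   # high correlation, high similarity
        elif c < c_thr and s >= s_thr:
            roles[w] = "SC"   # low correlation, high similarity
        elif c >= c_thr and s < s_thr:
            roles[w] = "IC"   # high correlation, low similarity
        else:
            roles[w] = "O"    # low correlation, low similarity
    return roles
```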

RESULTS

The threshold used in the experiments is the median of each of the two indicators. First, the deletion experiments. According to the results, deleting CC-words causes a significant drop in performance, while deleting SC-words and IC-words brings more positive effects.

The first conclusion is easy to anticipate: because CC-words have both high statistical correlation and high semantic similarity with the label, deleting them inevitably hurts the model's accuracy. The second conclusion, however, runs against my intuition. At first I thought deleting O-words would work best, since O-words are barely related to the label and removing them should do little harm. In fact, deleting SC-words and IC-words works better. The paper's explanation: SC-words have low statistical correlation but high semantic similarity with the label, so deleting them forces the model to pay more attention to CC-words. IC-words have relatively high statistical correlation but low semantic similarity with the label; they are usually noisy, biased data, and deleting them helps the model avoid learning incorrect features for the category.
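Following this finding, a selective-deletion augmenter would drop only SC- and IC-words and leave CC-words intact. A minimal sketch (the function name and default parameters are mine, not the paper's):

```python
def selective_delete(tokens, roles, drop_roles=("SC", "IC")):
    """Keep only tokens whose assigned role is NOT in drop_roles.

    tokens: list of words in the sentence.
    roles:  {word: role} mapping, with roles in {"CC", "SC", "IC", "O"};
            unseen words default to "O".
    """
    return [t for t in tokens if roles.get(t, "O") not in drop_roles]
```

Swapping `drop_roles` lets you reproduce the other deletion settings the paper compares, such as deleting only O-words.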

Similarly, the authors ran the same role-based experiments for the insertion, replacement, and swap augmentation methods. The results are not listed here; interested readers can consult the original paper. The following table summarizes the authors' guidance for the four augmentation methods.

Personal summary

This paper proposes a selective text augmentation method. Specifically, it defines four roles, assigns each word to one of them, and applies each augmentation operation only to words whose roles suit it. This effectively avoids information loss and generates high-quality text data.