Simple requirements

Close to work, Xiao Ming finished today’s task, is ready to go home from work.

A message flashed up.

“Recently found that the public spelling check function is good, help users find typos, good experience. Make one for our system.”

Looking at the news, Xiao Ming silently greeting a sentence in his heart.

“I TND of can do this, go to somebody else headquarters to go to work directly, suffer your gas in this.”

“Ok,” Xiao Ming replied. “I’ll see first.”

Today, the king of Heaven came I also have to work, Jesus also can’t stay.

Xiao Ming thought, and went home.

Calm analysis

When it comes to spelling check, Xiao Ming actually knows.

I have never eaten pork, or seen pigs run.

Usually saw some public number big guy to share, said that the public number launched a spell check function, there will be no more typos.

Later, Xiao Ming still saw many typos in their articles. Then, there was no later.

Why not ask the all-powerful github?

Xiaoming opened Github and found that there seems to be no mature Java-related open source projects, and some of them have a few stars. It is not safe to use them.

NLP is supposed to do more python, Java to implement Chinese and English spelling check and error correction? But I only know how to write CRUD!

Xiao Ming silently picked up a thread…

The night outside the window is like water, and I can’t help thinking, where do I come from? Where to? What is the meaning of life?

Still the ashes of residual heat fell in the xiaoming east buy slippers, his mind runaway horse hot a clever.

Without any thought, without any clue, or first wash and sleep.

That night, Xiao Ming had a long dream. There were no typos, all the words were in the right place…

transit

The next day, Xiao Ming opened the search box, input spelling correct.

Gratifying is, found an English spelling correction algorithm explanation.

I have thought about it all day long, rather than learning it for a moment. Xiao Ming sighed and looked up.

Algorithm ideas

English words are mainly composed of 26 English letters, so there may be mistakes in spelling.

First, you can get the correct English words, excerpts are as follows:

apple,16192 applecart,41 applecarts,1 appledrain,1 appledrains,1 applejack,571 applejacks,4 appleringie,1 appleringies,1  apples,5914 applesauce,378 applesauces,1 applet,2Copy the code

Each line is separated by a comma, followed by the frequency of the word.

For example, if the user enters appL, if the word does not exist, you can insert/delete/replace it to find the closest word. (Essentially finding the word with the smallest edit distance)

If the entered word exists, it is correct and no processing is required.

Access to thesaurus

So where do you get an English thesaurus?

Xiao Ming thought about it, then went to various places to look up a circle, finally found a relatively perfect English word frequency word library, a total of 27W+ words.

Excerpts:

aa,1831
aah,45774
aahed,1
aahing,30
aahs,23
...
zythums,1
zyzzyva,2
zyzzyvas,1
zzz,76
zzzs,2
Copy the code

The core code

Get all possible cases of the user’s current input. The core code is as follows:

/** * Builds all possible error cases for the current word **@paramWord Enter the word *@returnReturn result *@since 0.0.1
 * @authorAn old horse whistles against the west wind
private List<String> edits(String word) {
    List<String> result = new LinkedList<>();
    for (int i = 0; i < word.length(); ++i) {
        result.add(word.substring(0, i) + word.substring(i + 1));
    }
    for (int i = 0; i < word.length() - 1; ++i) {
        result.add(word.substring(0, i) + word.substring(i + 1, i + 2) + word.substring(i, i + 1) + word.substring(i + 2));
    }
    for (int i = 0; i < word.length(); ++i) {
        for (char c = 'a'; c <= 'z'; ++c) {
            result.add(word.substring(0, i) + c + word.substring(i + 1)); }}for (int i = 0; i <= word.length(); ++i) {
        for (char c = 'a'; c <= 'z'; ++c) {
            result.add(word.substring(0, i) + c + word.substring(i)); }}return result;
}
Copy the code

Then compare them to the correct words in the thesaurus:

List<String> options = edits(formatWord);
List<CandidateDto> candidateDtos = new LinkedList<>();
for (String option : options) {
    if(wordDataMap.containsKey(option)) { CandidateDto dto = CandidateDto.builder() .word(option).count(wordDataMap.get(option)).build(); candidateDtos.add(dto); }}Copy the code

In the end, the result returned needs to be compared according to the frequency of the word, which is relatively simple overall.

Chinese spelling

A miss

The spelling of Chinese looks similar to English at first glance, but there is something very special about Chinese.

Because the spelling of all Chinese characters is fixed, there are no typo when users type, only other characters.

It makes no sense to say that a single word is another word. There must be a word, or a context.

That makes it much harder to correct.

Xiao Ming helplessly shook his head, Chinese culture, extensive and profound.

Algorithm ideas

There are many ways to correct Chinese characters:

(1) Confusion set.

For example, the commonly used other words, it is written as a mistake to change.

(2) N – “gramm

This is the context of a single word. The most widely used is the 2-gram. Sougou’s lab has the data.

In other words, when the first word is fixed, the second word will have the corresponding probability. The higher the probability, the more likely the user intended to input.

Like running really fast, actually running really fast might be the right thing to do.

Error correction

Of course, another difficulty with Chinese is that you can’t directly insert/delete/replace one character into another.

But similarly, there are many ways:

(1) Homophones/homonyms

(2) The shape of the word

(3) Synonyms

(4) Words and words out of order, words and words

Algorithm implementation

Due to the difficulty of implementation, Xiao Ming chose the simplest set of puzzles.

First, find a dictionary of common words. Here are some excerpts:

A crane, a bird of a feather, a bird of a feather, as usual, as usual. Dejected soul dejected soul stand to help and help and into the noise and into the dragon plate tiger according to the dragonCopy the code

The first is the other word, and the second is the correct usage.

Take the other characters as the dictionary, and then carry out the fast-forward participle of the Chinese text to get the corresponding correct form.

Of course, we can start simple, let the user input a fixed phrase, the implementation is to directly parse the corresponding map

public List<String> correctList(String word, int limit, IWordCheckerContext context) {
    final Map<String, List<String>> wordData = context.wordData().correctData();
    // Check whether it is incorrect
    if(isCorrect(word, context)) {
        return Collections.singletonList(word);
    }
    List<String> allList = wordData.get(word);
    final int minLimit = Math.min(allList.size(), limit);
    List<String> resultList = Guavas.newArrayList(minLimit);
    for(int i = 0; i < minLimit; i++) {
        resultList.add(allList.get(i));
    }
    return resultList;
}
Copy the code

Long mixed Chinese and English text

Algorithm ideas

Actual articles are generally mixed in Chinese and English.

To make it easier for users to use, you can’t just type one phrase at a time.

So what do we do?

The answer is participle, the input sentence, participle into a word. Then distinguish the Chinese and English, the corresponding processing.

For word segmentation, open source projects are recommended:

Github.com/houbb/segme…

Algorithm implementation

The modified core algorithm can be reused in Both Chinese and English.

@Override
public String correct(String text) {
    if(StringUtil.isEnglish(text)) {
        return text;
    }

    StringBuilder stringBuilder = new StringBuilder();
    final IWordCheckerContext zhContext = buildChineseContext();
    final IWordCheckerContext enContext = buildEnglishContext();

    // The first step is to execute the participle
    List<String> segments = commonSegment.segment(text);
    // Only when all is true is considered true.
    for(String segment : segments) {
        // If it is English
        if(StringUtil.isEnglish(segment)) {
            String correct = enWordChecker.correct(segment, enContext);
            stringBuilder.append(correct);
        } else if(StringUtil.isChinese(segment)) {
            String correct = zhWordChecker.correct(segment, zhContext);
            stringBuilder.append(correct);
        } else {
            // Ignore othersstringBuilder.append(segment); }}return stringBuilder.toString();
}
Copy the code

The default implementation of the participle is as follows:

import com.github.houbb.heaven.util.util.CollectionUtil;
import com.github.houbb.nlp.common.segment.ICommonSegment;
import com.github.houbb.nlp.common.segment.impl.CommonSegments;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** * Default mixed participle, support Chinese and English. * *@author binbin.hou
 * @since0.0.8 * /
public class DefaultSegment implements ICommonSegment {

    @Override
    public List<String> segment(String s) {
        // Separated by Spaces
        List<String> strings = CommonSegments.defaults().segment(s);
        if(CollectionUtil.isEmpty(strings)) {
            return Collections.emptyList();
        }

        List<String> results = new ArrayList<>();
        ICommonSegment chineseSegment = InnerCommonSegments.defaultChinese();
        for(String text : strings) {
            // Perform Chinese word segmentation
            List<String> segments = chineseSegment.segment(text);

            results.addAll(segments);
        }


        returnresults; }}Copy the code

First, the word segmentation is carried out for the blank space, and then the fast-forward word segmentation is done for Chinese with the other characters of the puzzle set.

Of course, it’s not hard to say.

It is really more troublesome to achieve, Xiao Ming to complete the implementation has been open source:

Github.com/houbb/word-…

If you feel helpful, you can fork/star a wave

Quick start

Word-checker is used to check the spelling of words. Support English word spelling detection, and Chinese spelling detection.

Without further ado, let’s go through the experience of using the utility class directly.

Features that

  • Can quickly determine if the current word is misspelled

  • You can return the best match

  • You can return a list of correct matches, with support for specifying the size of the returned list

  • Error message I18N is supported

  • Support case, full corner and half corner formatting

  • Support for custom thesaurus

  • Built-in 27W+ English thesaurus

  • Support basic Chinese spelling detection

Quick start

Maven is introduced into

<dependency>
     <groupId>com.github.houbb</groupId>
     <artifactId>word-checker</artifactId>
    <version>0.0.8</version>
</dependency>
Copy the code

The test case

Based on the input, the best corrective results are automatically returned.

final String speling = "speling";
Assert.assertEquals("spelling", EnWordCheckers.correct(speling));
Copy the code

Core API Introduction

The core API is under the EnWordCheckers utility class.

function methods parameter The return value note
Determine if the spelling is correct isCorrect(string) The words to be tested boolean
Returns the best corrective result correct(string) The words to be tested String If no word can be corrected is found, the word itself is returned
Determine if the spelling is correct correctList(string) The words to be tested List Returns a list of all matching corrections
Determine if the spelling is correct correctList(string, int limit) The word to be tested, returning the size of the list Returns a list of corrections of the specified size The size of the list is less than or equal to LIMIT

Test cases

See EnWordCheckerTest. Java

Is it spelled correctly

final String hello = "hello";
final String speling = "speling";
Assert.assertTrue(EnWordCheckers.isCorrect(hello));
Assert.assertFalse(EnWordCheckers.isCorrect(speling));
Copy the code

Returns the best match

final String hello = "hello";
final String speling = "speling";
Assert.assertEquals("hello", EnWordCheckers.correct(hello));
Assert.assertEquals("spelling", EnWordCheckers.correct(speling));
Copy the code

The match list is corrected by default

final String word = "goox";
List<String> stringList = EnWordCheckers.correctList(word);
Assert.assertEquals("[good, goo, goon, goof, gook, goop, goos, gox, goog, gool, goor]", stringList.toString());
Copy the code

Specifies the size of the correct match list

final String word = "goox";
final int limit = 2;
List<String> stringList = EnWordCheckers.correctList(word, limit);
Assert.assertEquals("[good, goo]", stringList.toString());
Copy the code

Chinese spelling correction

Core API

To reduce learning costs, the core API and ZhWordCheckers are consistent with English spelling detection.

Is it spelled correctly

final String right = "Right";
final String error = "Everything changes.";

Assert.assertTrue(ZhWordCheckers.isCorrect(right));
Assert.assertFalse(ZhWordCheckers.isCorrect(error));
Copy the code

Returns the best match

final String right = "Right";
final String error = "Everything changes.";

Assert.assertEquals("Right", ZhWordCheckers.correct(right));
Assert.assertEquals("All change is the same.", ZhWordCheckers.correct(error));
Copy the code

The match list is corrected by default

final String word = "Everything changes.";

List<String> stringList = ZhWordCheckers.correctList(word);
Assert.assertEquals("Every change is the same.", stringList.toString());
Copy the code

Specifies the size of the correct match list

final String word = "Everything changes.";
final int limit = 1;

List<String> stringList = ZhWordCheckers.correctList(word, limit);
Assert.assertEquals("Every change is the same.", stringList.toString());
Copy the code

Long text mixed in Chinese and English

scenario

For actual spelling correction, the best experience is when the user enters a long text, possibly in a mix of Chinese and English.

Then implement the corresponding functions described above.

Core method

The WordCheckers utility class provides autocorrect for long text mixed in Chinese and English.

function methods parameter The return value note
The spelling of the text is correct isCorrect(string) The text to be detected boolean Return true if all are true
Returns the best corrective result correct(string) The words to be tested String If no text can be corrected is found, it returns itself
Determine if the text is spelled correctly correctMap(string) The words to be tested Map Returns a list of all matching corrections
Determine if the text is spelled correctly correctMap(string, int limit) The text to be detected, returning the size of the list Returns a list of corrections of the specified size The size of the list is less than or equal to LIMIT

Is the spelling correct

final String hello = "Hello, hello";
final String speling = "Speling, you use poison to poison.";
Assert.assertTrue(WordCheckers.isCorrect(hello));
Assert.assertFalse(WordCheckers.isCorrect(speling));
Copy the code

Returns the best corrective result

final String hello = "Hello, hello";
final String speling = "Speling, you use poison to poison.";
Assert.assertEquals("Hello, hello", WordCheckers.correct(hello));
Assert.assertEquals("Hey, Spelling, let's fight fire with fire.", WordCheckers.correct(speling));
Copy the code

Determine if the text is spelled correctly

For each word, the corresponding corrective result.

final String hello = "Hello, hello";
final String speling = "Speling, you use poison to poison.";
Assert.assertEquals("{hello=[hello], =[], you =[you], good =[good]}", WordCheckers.correctMap(hello).toString());
Assert.assertEquals("{=[], speling=[spelling, spewing, sperling, spelding, spelding], you =[you], ok =[ok], Set fire to fire =[set fire to fire]}", WordCheckers.correctMap(speling).toString());
Copy the code

Determine if the text is spelled correctly

As above, specify the maximum number of returns.

final String hello = "Hello, hello";
final String speling = "Speling, you use poison to poison.";

Assert.assertEquals("{hello=[hello], =[], you =[you], good =[good]}", WordCheckers.correctMap(hello, 2).toString());
Assert.assertEquals("{= [], speling = [spelling, spewing], you = [you], good = [good], to poison poison = work [with]}", WordCheckers.correctMap(speling, 2).toString());
Copy the code

Formatting process

Sometimes the user input is varied, and the tool supports formatting.

case

Uppercase will be uniformly formatted as lowercase.

final String word = "stRing";

Assert.assertTrue(EnWordCheckers.isCorrect(word));
Copy the code

The Angle of half Angle

The full Angle will be uniformly formatted as a half Angle.

final String word = "string";

Assert.assertTrue(EnWordCheckers.isCorrect(word));
Copy the code

Custom English thesaurus

Configuration file

You can create a file resources/data/define_word_checker_en.txt in the project resources directory

The contents are as follows:

my-long-long-define-word,2
my-long-long-define-word-two
Copy the code

Different words stand on separate lines.

The first column of each line represents the word, and the second column represents the number of occurrences, separated by a comma.

The greater the number of times, the higher the priority to return when correcting, which defaults to 1.

User – defined thesaurus takes precedence over system – built thesaurus.

The test code

After we specify the corresponding word, the spelling test will take effect.

final String word = "my-long-long-define-word";
final String word2 = "my-long-long-define-word-two";

Assert.assertTrue(EnWordCheckers.isCorrect(word));
Assert.assertTrue(EnWordCheckers.isCorrect(word2));
Copy the code

Custom Chinese thesaurus

Configuration file

You can create a file resources/data/define_word_checker_zh.txt in the project resource directory

The contents are as follows:

Stick to the rulesCopy the code

Use the English space to separate, error followed by correct.

summary

The correction of Chinese and English spelling has always been a hot and difficult topic.

In recent years, commercial applications have been successful because of advances in NLP and artificial intelligence.

The main implementation is based on the traditional algorithm, the core in thesaurus.

Xiaoming put the complete implementation has been open source:

Github.com/houbb/word-…

Welcome to fork/star

subsequent

After several days of hard work, Xiao Ming finally finished a simple spell-checking tool.

“Last time and I said the public spelling check function still want?”

“No, YOU don’t say I have forgotten.” “The product appears somewhat surprised. It doesn’t matter whether we do that or not. We’ve been squeezing a bunch of business needs lately. You’re on top of it.”

“……”

“I recently saw a function on XXX is also very good, you give our system to do one.”

“……”