Simple requirements

Near work, Xiao Ming finished today’s task, is ready to go home from work.

A message flashed.

“Recently I found that the spell-check function of the public account is good, helping users find wrong words, and the experience is good. Make one for our system.”

Looking at this message, Xiaoming silently greeted in the heart.

“I TND of meeting do this, went directly to somebody else headquarters to go to work, suffer your gas in this.”

“Okay,” Ming replied, “I’ll have a look first.”

Today, the king came and I had to get off work, and Jesus couldn’t stay.

Xiao Ming thought and went home.

Calm analysis

When it comes to spell check, Xiao Ming knows it.

I haven’t eaten pork, but I’ve seen pigs run.

Usually read some public number big man to share, that is, the public number launched a spell check function, there will be no typos later.

Later, Xiao Ming saw many typos in their articles. Later, there was no later.

Why not ask the almighty GitHub?

Xiaoming opened GitHub and found that there is no mature JAVA related open source project, and some stars are not very reassuring to use.

NLP is estimated to do more Python, Java to achieve Chinese and English spelling check and error correction? But I can only write CRUD!

Xiao Ming silently placed a Hua Zi……

The night outside the window is like water. I can’t help thinking, where do I come from? Where to go? What is the meaning of life?

The ashes of the residual heat fell on the slippers that Xiaoming bought in the east, which made the runaway horse hot in his mind a clever one.

No train of thought, no clue, or first wash sleep.

That night, Xiao Ming had a long and sweet dream. There were no typos, all the words were in the right place…

transit

The next day, Xiao Ming opened the search box and typed “spelling correct”.

Fortunately, I found an English spelling correction algorithm to explain.

I think all day long, it is better to learn in a moment. Xiao Ming sighed 1, looked up.

Algorithm ideas

English words are mainly composed of 26 letters, so mistakes can be made when spelling.

First, you can get the correct English word. Here’s an excerpt:

apple,16192 applecart,41 applecarts,1 appledrain,1 appledrains,1 applejack,571 applejacks,4 appleringie,1 appleringies,1  apples,5914 applesauce,378 applesauces,1 applet,2

Each line is separated by a comma, followed by how often the word appears.

In the case of the user typing appl, if the word does not exist, you can insert/delete/replace it to find the closest word. (essentially finding the word with the smallest edit distance)

If the input word exists, it is correct and does not need to be processed.

Acquisition of thesaurus

So where do you get the English thesaurus?

Xiao Ming thought about it, and then went to various places to check a circle, and finally found a relatively perfect English word frequency thesaurus, a total of 27W+ words.

Here’s an excerpt:

aa,1831
aah,45774
aahed,1
aahing,30
aahs,23
...
zythums,1
zyzzyva,2
zyzzyvas,1
zzz,76
zzzs,2

The core code

Get all possible cases of the user’s current input. The core code is as follows:

/** * Construct all possible error cases for the current word ** @param word * @return * @since 0.0.1 * @author */ private List<String> edits(String word) { List<String> result = new LinkedList<>(); for (int i = 0; i < word.length(); ++i) { result.add(word.substring(0, i) + word.substring(i + 1)); } for (int i = 0; i < word.length() - 1; ++i) { result.add(word.substring(0, i) + word.substring(i + 1, i + 2) + word.substring(i, i + 1) + word.substring(i + 2)); } for (int i = 0; i < word.length(); ++i) { for (char c = 'a'; c <= 'z'; ++c) { result.add(word.substring(0, i) + c + word.substring(i + 1)); } } for (int i = 0; i <= word.length(); ++i) { for (char c = 'a'; c <= 'z'; ++c) { result.add(word.substring(0, i) + c + word.substring(i)); } } return result; }

Then compare it with the correct words in the thesaurus:

List<String> options = edits(formatWord); List<CandidateDto> candidateDtos = new LinkedList<>(); for (String option : options) { if (wordDataMap.containsKey(option)) { CandidateDto dto = CandidateDto.builder() .word(option).count(wordDataMap.get(option)).build(); candidateDtos.add(dto); }}

The results returned in the end need to be compared according to the frequency of words, which is relatively simple in general.

Chinese spelling

A miss

At first glance, the spelling of Chinese looks similar to that of English, but there is something very special about Chinese.

Because all Chinese spelling itself is fixed, there are no wrong words when users input, there are only other words.

It makes no sense to say that a single word is a different word, it has to have a word, or a context.

This makes correction much more difficult.

Xiao Ming helplessly shook his head, Chinese culture, extensive and profound.

Algorithm ideas

There are many ways to correct Chinese characters:

(1) Puzzle set.

For example, commonly used other words, Wanchangbujiao its mistake to write Wanchangbujiao.

(2) N – “gramm

That is the first word corresponding to the context, the more widely used is 2-gram. The corresponding corpus is available in Sougou’s lab.

That is, when the first word is fixed, there will be a corresponding probability of the second occurrence, the higher the probability, the more likely it is that the user intended to enter.

Like running fast, actually running fast might be the right thing to do.

Error correction

Of course, one of the difficulties in Chinese is that you can’t change one character to another by INSERT/DELETE/REPLACE.

But similarly, there are many ways:

(1) homophone words/homophone words

(2) form near the word

(3) Synonyms

(4) words out of order, words add and delete

Algorithm implementation

Forced by the difficulty of implementation, Xiao Ming chose the simplest puzzle set.

First, find a dictionary of common words. Here is an excerpt:

A stork of a hill and a raccoon dog will still do just as they used to do. Likely dispiritedly dispiritedly dispiritedly dispiritedly support each other to help drum manic and into the noise and into the dragon plate tiger according to the dragon panhuju

The first is the other word, the second is the correct usage.

With different characters as the dictionary, and then the Chinese text fast-forward segmentation, to get the corresponding correct form.

Of course, we can start off simple by asking the user to type in a phrase and then parsing the corresponding map directly

public List<String> correctList(String word, int limit, IWordCheckerContext context) { final Map<String, List<String>> wordData = context.wordData().correctData(); / / determine whether error if isCorrect (word, the context)) {return Collections. SingletonList (word); } List<String> allList = wordData.get(word); final int minLimit = Math.min(allList.size(), limit); List<String> resultList = Guavas.newArrayList(minLimit); for(int i = 0; i < minLimit; i++) { resultList.add(allList.get(i)); } return resultList; }

Mixed Chinese and English long text

Algorithm ideas

The actual text is usually a mixture of Chinese and English.

If you want to make it more user-friendly, you can’t just type one phrase at a time.

So what can we do?

The answer is word segmentation, the input of the sentence, word segmentation into a word. Then distinguish Chinese and English, corresponding processing.

For participles, open source projects are recommended:

https://github.com/houbb/segment

Algorithm implementation

The modified core algorithm can be reused in both Chinese and English implementations.

@Override public String correct(String text) { if(StringUtil.isEnglish(text)) { return text; } StringBuilder stringBuilder = new StringBuilder(); final IWordCheckerContext zhContext = buildChineseContext(); final IWordCheckerContext enContext = buildEnglishContext(); // List<String> Segments = CommonSegment. Segment (text); // All are true until they are true. for(String segment : Segments) {// if(StringUtil.isEnglish(segment)) {String correct = enwordChecker.correct (segment, enContext); stringBuilder.append(correct); } else if(StringUtil.isChinese(segment)) { String correct = zhWordChecker.correct(segment, zhContext); stringBuilder.append(correct); } else {// ignore StringBuilder. append(segment); } } return stringBuilder.toString(); }

The default implementation of word participles is as follows:

import com.github.houbb.heaven.util.util.CollectionUtil; import com.github.houbb.nlp.common.segment.ICommonSegment; import com.github.houbb.nlp.common.segment.impl.CommonSegments; import java.util.ArrayList; import java.util.Collections; import java.util.List; /** * Default mixed word segmentation, support Chinese and English. * * public class defaultSegment implements Segment {@Overpass public List<String> segment(String s) {List<String> strings = commonsegment.defaults (). Segment (s); if(CollectionUtil.isEmpty(strings)) { return Collections.emptyList(); } List<String> results = new ArrayList<>(); ICommonSegment chineseSegment = InnerCommonSegments.defaultChinese(); For (String text: strings) {List<String> segments = ChineseSegment (text); results.addAll(segments); } return results; }}

First of all, we participle for blank space, and then we do fast-forward participle for Chinese with different characters of puzzle set.

Of course, this is not hard to say.

It’s really troublesome to realize, but Xiaoming’s complete implementation has been open source:

https://github.com/houbb/word-checker

Fork /star a wave if you find it helpful

Quick start

Word -checker is used for word spell checking. Support English word spelling detection, and Chinese spelling detection.

Without further ado, let’s have a hands-on experience with this tool class.

Features that

  • You can quickly determine whether the current word is misspelled
  • Can return the best match result
  • You can return a list of corrected matches, with support for specifying the size of the return list
  • Error message supports i18n
  • Support case, full corner half corner formatting processing
  • Support for custom thesaurus
  • Built-in 27W+ English thesaurus
  • Support basic Chinese spelling check

Quick start

Maven is introduced into

< the dependency > < groupId > com. Making. Houbb < / groupId > < artifactId > word - the checker < / artifactId > < version > 0.0.8 < / version > </dependency>

The test case

Based on the input, it will automatically return the best correction result.

final String speling = "speling";
Assert.assertEquals("spelling", EnWordCheckers.correct(speling));

Introduction to the core API

The core API is under the EnwordCheckers utility class.

function methods parameter The return value note
Determine if the word is spelled correctly isCorrect(string) The word to be detected boolean
Returns the best correction result correct(string) The word to be detected String If no word is found that can be corrected, it returns itself
Determine if the word is spelled correctly correctList(string) The word to be detected List Returns a list of all matched corrections
Determine if the word is spelled correctly correctList(string, int limit) The word to be detected, returns the size of the list Returns a list of corrections of the specified size List size is less than or equal to limit

Test cases

see
EnWordCheckerTest.java

Whether you spell it correctly

final String hello = "hello";
final String speling = "speling";
Assert.assertTrue(EnWordCheckers.isCorrect(hello));
Assert.assertFalse(EnWordCheckers.isCorrect(speling));

Returns the best match result

final String hello = "hello";
final String speling = "speling";
Assert.assertEquals("hello", EnWordCheckers.correct(hello));
Assert.assertEquals("spelling", EnWordCheckers.correct(speling));

Correct the match list by default

final String word = "goox";
List<String> stringList = EnWordCheckers.correctList(word);
Assert.assertEquals("[good, goo, goon, goof, gook, goop, goos, gox, goog, gool, goor]", stringList.toString());

Specifies the size of the correction match list

final String word = "goox";
final int limit = 2;
List<String> stringList = EnWordCheckers.correctList(word, limit);
Assert.assertEquals("[good, goo]", stringList.toString());

Chinese spelling correction

Core API

To reduce learning costs, the core API and ZhWordCheckers are consistent with English spelling detection.

Whether you spell it correctly

Final String right = "Correct "; Final String error = "Everything changes "; Assert.assertTrue(ZhWordCheckers.isCorrect(right)); Assert.assertFalse(ZhWordCheckers.isCorrect(error));

Returns the best match result

Final String right = "Correct "; Final String error = "Everything changes "; Assert.assertequals (" correct ", zhwordcheckers.correct (right)); Assert.assertequals (" Everything is the same as it is ", zhwordcheckers.correct (error));

Correct the match list by default

Final String Word = "Everything changes "; List<String> stringList = ZhWordCheckers.correctList(word); Assert. AssertEquals (" ", stringList.toString());

Specifies the size of the correction match list

Final String Word = "Everything changes "; final int limit = 1; List<String> stringList = ZhWordCheckers.correctList(word, limit); Assert. AssertEquals (" ", stringList.toString());

Long text mixed in English and Chinese

scenario

For actual spelling correction, the best use experience is for the user to type a long text, possibly in a mix of English and Chinese.

Then realize the above corresponding functions.

Core method

The WordCheckers utility class provides auto-correct functionality for long text mixed in English and Chinese.

function methods parameter The return value note
Whether the text is spelled correctly isCorrect(string) The text to be detected boolean Returns true if all are correct
Returns the best correction result correct(string) The word to be detected String If no text is found that can be corrected, it returns itself
Determine if the text is spelled correctly correctMap(string) The word to be detected Map Returns a list of all matched corrections
Determine if the text is spelled correctly correctMap(string, int limit) Text to be detected, returns the size of the list Returns a list of corrections of the specified size List size is less than or equal to limit

Is your spelling correct?

Final String hello = "Hello Hello "; Final String speling = "Speling is good for you "; Assert.assertTrue(WordCheckers.isCorrect(hello)); Assert.assertFalse(WordCheckers.isCorrect(speling));

Returns the best correction result

Final String hello = "Hello Hello "; Final String speling = "Speling is good for you "; Assert.assertequals ("hello hello ", wordcheckers.correct (hello)); Assert.assertequals ("spelling is good for spelling ", wordcheckers.correct (speling));

Determine if the text is spelled correctly

For each word, the corresponding corrective result.

Final String hello = "Hello Hello "; Final String speling = "Speling is good for you "; Assert.assertQuals ("{hello=[hello], =[], you =[you], OK =[good]}", wordCheckers.correctMap (hello).toString()); Assert.assertEquals("{ =[ ], speling=[spelling, spewing, sperling, seeling, spieling, spiling, speeling, speiling, Spelding], you =[you], good =[good], poison for poison =[poison for poison]}", wordcheckers.correctMap (speling).toString());

Determine if the text is spelled correctly

As above, specify the maximum number of returns.

Final String hello = "Hello Hello "; Final String speling = "Speling is good for you "; Assert.assertQuals ("{hello=[hello], =[], you =[you], OK =[good]}", wordCheckers.correctMap (hello, 2).toString()); Assert.assertQuals ("{=[], speling=[spelling, spewing], you =[you], good =[good], "wordcheckers.correctMap (speling, speling, spewing) 2).toString());

Format processing

Sometimes the user input is various, this tool supports the processing of formatting.

case

Capitals are formatted uniformly as lowercase.

final String word = "stRing";

Assert.assertTrue(EnWordCheckers.isCorrect(word));

The Angle of half Angle

All angles are formatted uniformly as half angles.

Final String word = "String "; Assert.assertTrue(EnWordCheckers.isCorrect(word));

Custom English thesaurus

Configuration file

You can create the file Resources /data/define_word_checker_en.txt in the project resources directory

The contents are as follows:

my-long-long-define-word,2
my-long-long-define-word-two

Each word has its own line.

The first column of each line is the word, and the second column is the number of occurrentions, separated by a comma.

The greater the number of times, the higher the priority returned when corrected, and the default value is 1.

The user – defined thesaurus takes precedence over the system – built thesaurus.

The test code

Once we specify the corresponding word, spell checking will take effect.

final String word = "my-long-long-define-word";
final String word2 = "my-long-long-define-word-two";

Assert.assertTrue(EnWordCheckers.isCorrect(word));
Assert.assertTrue(EnWordCheckers.isCorrect(word2));

Custom Chinese thesaurus

Configuration file

You can create the file RESOURCES/DATA/DEFINE_WORD_CHECKER_ZH.txt in the project resources directory

The contents are as follows:

Get into a rut

Use English space to separate, before error, followed by correct.

summary

The correction of Chinese and English spelling has always been a hot and difficult topic.

In recent years, because of the progress of NLP and artificial intelligence, the application in business is also gradually successful.

The main implementation is based on the traditional algorithm, the core in thesaurus.

The complete implementation of Xiaoming has been open source:

https://github.com/houbb/word-checker

For those who find it helpful, please feel free to fork/star

subsequent

After several days of hard work, Xiao Ming finally finished a simple spell-check tool.

“The spelling check function of the public number that said last time with me still want?”

“No, you don’t say I have forgotten.” “, the product seemed surprised. It doesn’t matter if we do that or not. We’ve been squeezing a lot of business needs lately. You have to take a look at it first.”

“……”

“I recently saw a very good function on XXX, you can make one for our system.”

“……”