A simple requirement

It was almost quitting time. Xiao Ming had finished the day's tasks and was getting ready to head home.

A message popped up.

"I recently noticed that the official account platform's spell check feature is pretty good. It helps users catch typos, and the experience is nice. Build one for our system too."

Seeing the message, Xiao Ming cursed silently in his heart.

"If I damn well knew how to do that, I'd be working at headquarters already instead of putting up with you here."

"Okay," Xiao Ming replied, "I'll take a look first."

Today I am leaving on time even if the King of Heaven himself shows up. Not even Jesus can keep me here.

So Xiao Ming thought, and went home.

[Image: "Not even Jesus can keep me here"]

Calm analysis

Xiao Ming did know a little about spell checking.

He had never eaten pork, but he had at least seen pigs run.

He had read posts from official accounts announcing that the platform had launched a spell check feature and that typos would be a thing of the past.

Afterwards, Xiao Ming still saw plenty of typos in their articles. And that was the end of that.

Why not ask the almighty GitHub?

Xiao Ming opened GitHub and found that there seemed to be no mature open source Java project for this, and the few with some stars were not easy to use.

Presumably NLP people mostly work in Python? Implement Chinese and English spell checking and correction in Java? But I only know how to write CRUD!

Xiao Ming silently lit a cigarette...

The night outside the window was cool as water, and he could not help falling into contemplation. Where do I come from? Where am I going? What is the meaning of life?

[Image: the three philosophical questions]

The still-warm ash fell onto the slippers Xiao Ming had bought from a certain "Dong" e-commerce site, and the burn reined in the wild horse of his runaway thoughts.

With no ideas and no leads, he decided to wash up and go to sleep.

That night Xiao Ming had a long, sweet dream. In the dream there were no typos, and every word and sentence sat exactly where it belonged...

A turning point

The next day, Xiao Ming opened the search box and typed "spelling correct".

Luckily, he found an explanation of an English spelling correction algorithm.

"I once spent a whole day in thought, but it was worth less than a moment of study." Xiao Ming sighed and started reading.

Algorithm ideas

English words are built from the 26 letters of the alphabet, so a word can be misspelled by getting individual letters wrong.

First, obtain a list of correctly spelled English words. An excerpt:

apple,16192
applecart,41
applecarts,1
appledrain,1
appledrains,1
applejack,571
applejacks,4
appleringie,1
appleringies,1
apples,5914
applesauce,378
applesauces,1
applet,2

Each line holds a word and its frequency, separated by a comma.

Take the user input appl as an example. If the word is not in the list, apply operations such as insert, delete, and replace to it and look for the closest known word. (In essence, this finds the word with the smallest edit distance.)

If the entered word is in the list, it is considered correct and needs no processing.
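To make "smallest edit distance" concrete, here is a minimal sketch of the classic Levenshtein distance in Java. The class and method names are made up for illustration and are not part of word-checker. Note that plain Levenshtein counts only inserts, deletes, and replacements, so swapping two adjacent letters costs 2 here.

```java
final class EditDistanceSketch {

    /** Minimum number of single-character inserts, deletes, and replacements turning a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j characters of b from "" takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i characters of a to "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // the row just computed becomes the previous row of the next iteration
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

With the excerpt above, appl is one edit away from apple and two edits away from apples, so apple is the closer candidate.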

Getting the dictionary

So where can an English dictionary be found?

Xiao Ming thought for a moment, searched around, and eventually found a fairly complete English word-frequency dictionary with more than 270,000 words.

The excerpt is as follows:

aa,1831
aah,45774
aahed,1
aahing,30
aahs,23
...
zythums,1
zyzzyva,2
zyzzyvas,1
zzz,76
zzzs,2


Core code

First, generate every candidate that is one edit away from the user's input. The core code is as follows:

/**
 * Build all candidates that are one edit away from the given word.
 *
 * @param word the input word
 * @return the candidate list
 * @since 0.0.1
 * @author 老马啸西风
 */
private List<String> edits(String word) {
    List<String> result = new LinkedList<>();
    // Deletion: drop the character at position i
    for (int i = 0; i < word.length(); ++i) {
        result.add(word.substring(0, i) + word.substring(i + 1));
    }
    // Transposition: swap the characters at positions i and i + 1
    for (int i = 0; i < word.length() - 1; ++i) {
        result.add(word.substring(0, i) + word.substring(i + 1, i + 2) + word.substring(i, i + 1) + word.substring(i + 2));
    }
    // Replacement: replace the character at position i with a..z
    for (int i = 0; i < word.length(); ++i) {
        for (char c = 'a'; c <= 'z'; ++c) {
            result.add(word.substring(0, i) + c + word.substring(i + 1));
        }
    }
    // Insertion: insert a..z at position i
    for (int i = 0; i <= word.length(); ++i) {
        for (char c = 'a'; c <= 'z'; ++c) {
            result.add(word.substring(0, i) + c + word.substring(i));
        }
    }
    return result;
}

Then keep only the candidates that exist in the dictionary:

List<String> options = edits(formatWord);
List<CandidateDto> candidateDtos = new LinkedList<>();
for (String option : options) {
    if (wordDataMap.containsKey(option)) {
        CandidateDto dto = CandidateDto.builder()
                .word(option).count(wordDataMap.get(option)).build();
        candidateDtos.add(dto);
    }
}

Finally, the candidates are sorted by word frequency before being returned. Overall, it is fairly simple.
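The whole English flow (generate candidates, filter by the dictionary, rank by frequency) can be sketched in one self-contained class. This is an illustration under assumptions, not the word-checker source: the class name, the Map<String, Long> dictionary type, and the highest-frequency-first ordering are choices made here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FrequencyRankSketch {

    /** All strings one edit away from word; a set drops duplicate candidates. */
    static Set<String> edits(String word) {
        Set<String> result = new LinkedHashSet<>();
        for (int i = 0; i < word.length(); i++) {
            // delete the character at i
            result.add(word.substring(0, i) + word.substring(i + 1));
        }
        for (int i = 0; i < word.length() - 1; i++) {
            // swap the characters at i and i + 1
            result.add(word.substring(0, i) + word.charAt(i + 1) + word.charAt(i) + word.substring(i + 2));
        }
        for (char c = 'a'; c <= 'z'; c++) {
            for (int i = 0; i <= word.length(); i++) {
                if (i < word.length()) {
                    result.add(word.substring(0, i) + c + word.substring(i + 1)); // replace
                }
                result.add(word.substring(0, i) + c + word.substring(i)); // insert
            }
        }
        return result;
    }

    /** Known words are returned as-is; otherwise known candidates, most frequent first. */
    static List<String> correctList(String word, Map<String, Long> frequencies) {
        if (frequencies.containsKey(word)) {
            return Collections.singletonList(word);
        }
        List<String> candidates = new ArrayList<>();
        for (String option : edits(word)) {
            if (frequencies.containsKey(option)) {
                candidates.add(option);
            }
        }
        candidates.sort((x, y) -> Long.compare(frequencies.get(y), frequencies.get(x)));
        return candidates;
    }
}
```

Given the frequencies apple=16192, apples=5914, and applet=2 from the excerpt, the input applex yields apple, apples, applet in that order.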

Chinese spelling

Off by a hair

At first glance, Chinese spell checking looks similar to English, but Chinese has one very special property.

Because every Chinese character has a fixed written form, user input never contains a misspelled character, only a wrong character (a valid character used in the wrong place).

It is meaningless to call a single character wrong in isolation. There has to be a word or some context.

This makes correction harder.

Xiao Ming shook his head helplessly. The Chinese language is broad and profound.

Algorithm ideas

There are many ways to detect wrong characters in Chinese:

(1) Confusion set

This is a list of commonly confused forms. For example, 万变不离其宗 is often miswritten as 万变不离其中.

(2) N-gram

That is, the context a word appears in. The 2-gram is the most widely used. A corresponding corpus is available from Sogou Labs.

In other words, once the first word is fixed, each possible second word has a corresponding probability. The higher the probability, the more likely it is what the user meant to type.

For example, two phrases that both read as "runs fast" can differ in a single character, and the variant with the higher probability is more likely to be the correct one.
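A minimal sketch of the 2-gram idea: count adjacent token pairs in an already segmented corpus and estimate the probability of the second token given the first. Everything here is hypothetical (class name, method names, and the toy English tokens used in place of a real Chinese corpus), and a real system would also need smoothing for unseen pairs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BigramSketch {
    // how often a token appears with a successor, and how often each pair appears
    private final Map<String, Integer> contextCounts = new HashMap<>();
    private final Map<String, Integer> pairCounts = new HashMap<>();

    /** Count every adjacent token pair of one segmented sentence. */
    void train(List<String> tokens) {
        for (int i = 0; i + 1 < tokens.size(); i++) {
            contextCounts.merge(tokens.get(i), 1, Integer::sum);
            pairCounts.merge(key(tokens.get(i), tokens.get(i + 1)), 1, Integer::sum);
        }
    }

    /** Estimate P(second | first) = count(first, second) / count(first, anything). */
    double probability(String first, String second) {
        int context = contextCounts.getOrDefault(first, 0);
        if (context == 0) {
            return 0.0; // the first token was never seen with a successor
        }
        return pairCounts.getOrDefault(key(first, second), 0) / (double) context;
    }

    private static String key(String first, String second) {
        return first + "\t" + second; // assumes tokens never contain a tab
    }
}
```

After training on the toy sentences [he, runs, fast], [she, runs, fast], and [it, runs, far], probability("runs", "fast") is 2/3 and probability("runs", "far") is 1/3, so "fast" would be preferred after "runs".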

Error correction

Of course, another difficulty in Chinese is that you cannot turn one character directly into another through insert/delete/replace.

Still, there are several ways to generate candidates:

(1) Homophones and near-homophones

(2) Visually similar characters

(3) Synonyms

(4) Words in the wrong order, or with characters added or deleted


Algorithm implementation

Given how hard the other approaches are to implement, Xiao Ming chose the simplest one: the confusion set.

First, find a dictionary of commonly miswritten phrases. An excerpt:

一丘之鹤 一丘之貉
一仍旧惯 一仍旧贯
一付中药 一服中药
...
黯然消魂 黯然销魂
鼎立相助 鼎力相助
鼓躁而进 鼓噪而进
龙盘虎据 龙盘虎踞

The first column is the wrong form, and the second is the correct usage.

Use the wrong forms as the dictionary, then run fast-forward segmentation over the Chinese text to find them and look up the corresponding correct forms.

Of course, to begin with we can simply have the user enter a single phrase, in which case the implementation is a direct map lookup.

public List<String> correctList(String word, int limit, IWordCheckerContext context) {
    final Map<String, List<String>> wordData = context.wordData().correctData();
    // If the word is already correct, return it unchanged
    if(isCorrect(word, context)) {
        return Collections.singletonList(word);
    }
    List<String> allList = wordData.get(word);
    final int minLimit = Math.min(allList.size(), limit);
    List<String> resultList = Guavas.newArrayList(minLimit);
    for(int i = 0; i < minLimit; i++) {
        resultList.add(allList.get(i));
    }
    return resultList;
}
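For text longer than a single phrase, the dictionary lookup has to be combined with a scan over the text. Below is a minimal sketch that uses forward maximum matching, which is an assumption about what "fast-forward segmentation" means here. The class name and the plain Map<String, String> from wrong form to correct form are hypothetical and not the word-checker API.

```java
import java.util.Map;

final class ConfusionSetSketch {

    /** Scan left to right and, at each position, try the longest wrong form first. */
    static String correct(String text, Map<String, String> wrongToRight) {
        int maxLen = 0;
        for (String wrong : wrongToRight.keySet()) {
            maxLen = Math.max(maxLen, wrong.length());
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            boolean matched = false;
            for (int len = Math.min(maxLen, text.length() - i); len > 0; len--) {
                String right = wrongToRight.get(text.substring(i, i + len));
                if (right != null) {
                    out.append(right); // replace the wrong form with the correct one
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.append(text.charAt(i)); // no entry starts here: copy the character
                i++;
            }
        }
        return out.toString();
    }
}
```

With an entry mapping 以毒功毒 to 以毒攻毒, the input 你好以毒功毒 comes back as 你好以毒攻毒, while text without any dictionary entry is returned unchanged.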

Mixed long text in Chinese and English

Algorithm ideas

Real articles usually mix Chinese and English.

To make things convenient for users, we cannot restrict them to entering one phrase at a time.

What should we do?

The answer is word segmentation: split the input sentence into tokens, then tell Chinese from English and handle each accordingly.

For word segmentation, this open source project is recommended:

https://github.com/houbb/segment

Algorithm implementation

The revised core algorithm handles both Chinese and English.

@Override
public String correct(String text) {
    // Only empty input can be returned as-is; pure-English text still needs checking
    if(text == null || text.isEmpty()) {
        return text;
    }

    StringBuilder stringBuilder = new StringBuilder();
    final IWordCheckerContext zhContext = buildChineseContext();
    final IWordCheckerContext enContext = buildEnglishContext();

    // Step 1: segment the text
    List<String> segments = commonSegment.segment(text);
    // Step 2: correct each segment according to its language
    for(String segment : segments) {
        // English segment
        if(StringUtil.isEnglish(segment)) {
            String correct = enWordChecker.correct(segment, enContext);
            stringBuilder.append(correct);
        } else if(StringUtil.isChinese(segment)) {
            String correct = zhWordChecker.correct(segment, zhContext);
            stringBuilder.append(correct);
        } else {
            // Anything else is kept unchanged
            stringBuilder.append(segment);
        }
    }

    return stringBuilder.toString();
}

The default segmentation implementation is as follows:

import com.github.houbb.heaven.util.util.CollectionUtil;
import com.github.houbb.nlp.common.segment.ICommonSegment;
import com.github.houbb.nlp.common.segment.impl.CommonSegments;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Default mixed segmentation, supporting both Chinese and English.
 *
 * @author binbin.hou
 * @since 0.0.8
 */
public class DefaultSegment implements ICommonSegment {

    @Override
    public List<String> segment(String s) {
        // Split on spaces
        List<String> strings = CommonSegments.defaults().segment(s);
        if(CollectionUtil.isEmpty(strings)) {
            return Collections.emptyList();
        }

        List<String> results = new ArrayList<>();
        ICommonSegment chineseSegment = InnerCommonSegments.defaultChinese();
        for(String text : strings) {
            // Run Chinese segmentation
            List<String> segments = chineseSegment.segment(text);

            results.addAll(segments);
        }


        return results;
    }

}

The text is first split on spaces, and then fast-forward segmentation is applied using the wrong forms in the confusion set.
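To illustrate the "tell Chinese from English" step without the segment library, here is a rough sketch that splits text into runs of English letters, Han characters, and everything else, using only the JDK. The class, the enum, and the run-based splitting are assumptions made for this example. The real DefaultSegment delegates to CommonSegments and a Chinese segmenter instead.

```java
import java.util.ArrayList;
import java.util.List;

final class MixedTextSketch {
    enum Kind { ENGLISH, CHINESE, OTHER }

    /** Classify one code point as an ASCII letter, a Han character, or anything else. */
    static Kind kindOf(int codePoint) {
        if ((codePoint >= 'a' && codePoint <= 'z') || (codePoint >= 'A' && codePoint <= 'Z')) {
            return Kind.ENGLISH;
        }
        if (Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN) {
            return Kind.CHINESE;
        }
        return Kind.OTHER;
    }

    /** Split text into maximal runs whose characters all share the same kind. */
    static List<String> split(String text) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        int i = 0;
        Kind current = null;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            Kind kind = kindOf(cp);
            if (current != null && kind != current) {
                runs.add(text.substring(start, i)); // the kind changed: close the run
                start = i;
            }
            current = kind;
            i += Character.charCount(cp);
        }
        if (start < text.length()) {
            runs.add(text.substring(start));
        }
        return runs;
    }
}
```

Each run can then be sent to the English checker, the Chinese checker, or copied through unchanged, depending on the kind of its first character.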

Of course, none of this is hard to describe.

Actually implementing it is quite tedious. Xiao Ming has open sourced the complete implementation:

https://github.com/houbb/word-checker

If you find it helpful, feel free to fork or star it~

Quick start

word-checker checks the spelling of words. It supports spelling checks for both English and Chinese.

Without further ado, let's try the tool out.

Features

  • Quickly determines whether a word is misspelled
  • Returns the best match
  • Returns a list of correction candidates, with a configurable list size
  • Error messages support i18n
  • Supports case normalization and full-width/half-width normalization
  • Supports custom dictionaries
  • Built-in English dictionary with 270,000+ words
  • Supports basic Chinese spell checking

Quick start

Adding the Maven dependency

<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>word-checker</artifactId>
    <version>0.0.8</version>
</dependency>

Test Case

Given an input word, the best correction is returned automatically.

final String speling = "speling";
Assert.assertEquals("spelling", EnWordCheckers.correct(speling));

Core API

The core API lives in the EnWordCheckers utility class.

| Function | Method | Parameters | Return value | Remark |
|:---|:---|:---|:---|:---|
| Determine whether a word is spelled correctly | isCorrect(string) | Word to check | boolean | |
| Return the best correction | correct(string) | Word to check | String | Returns the word itself if no correction is found |
| Return the list of corrections | correctList(string) | Word to check | List | Returns the list of all matching corrections |
| Return the list of corrections, limited in size | correctList(string, int limit) | Word to check, size of the returned list | Correction list of the specified size | List size is less than or equal to limit |

Test examples

See EnWordCheckerTest.java

Check whether the spelling is correct

final String hello = "hello";
final String speling = "speling";
Assert.assertTrue(EnWordCheckers.isCorrect(hello));
Assert.assertFalse(EnWordCheckers.isCorrect(speling));

Return the best match result

final String hello = "hello";
final String speling = "speling";
Assert.assertEquals("hello", EnWordCheckers.correct(hello));
Assert.assertEquals("spelling", EnWordCheckers.correct(speling));

Return the correction list with the default size

final String word = "goox";
List<String> stringList = EnWordCheckers.correctList(word);
Assert.assertEquals("[good, goo, goon, goof, gook, goop, goos, gox, goog, gool, goor]", stringList.toString());

Specify the size of the correction list

final String word = "goox";
final int limit = 2;
List<String> stringList = EnWordCheckers.correctList(word, limit);
Assert.assertEquals("[good, goo]", stringList.toString());

Chinese spelling correction

Core API

To keep the learning cost low, the core API of ZhWordCheckers is consistent with that of the English spell checker.

Check whether the spelling is correct

final String right = "正确";
final String error = "万变不离其中";

Assert.assertTrue(ZhWordCheckers.isCorrect(right));
Assert.assertFalse(ZhWordCheckers.isCorrect(error));

Return the best match result

final String right = "正确";
final String error = "万变不离其中";

Assert.assertEquals("正确", ZhWordCheckers.correct(right));
Assert.assertEquals("万变不离其宗", ZhWordCheckers.correct(error));

Return the correction list with the default size

final String word = "万变不离其中";

List<String> stringList = ZhWordCheckers.correctList(word);
Assert.assertEquals("[万变不离其宗]", stringList.toString());

Specify the size of the correction list

final String word = "万变不离其中";
final int limit = 1;

List<String> stringList = ZhWordCheckers.correctList(word, limit);
Assert.assertEquals("[万变不离其宗]", stringList.toString());

Long text mixing Chinese and English

Scenario

In real-world spelling correction, the best user experience is to let the user enter a long text, which may mix Chinese and English.

The functions described above are then applied to that text.

Core methods

The WordCheckers utility class provides automatic correction for long text that mixes Chinese and English.

| Function | Method | Parameters | Return value | Remark |
|:---|:---|:---|:---|:---|
| Determine whether the text is spelled correctly | isCorrect(string) | Text to check | boolean | Returns true only if everything is correct |
| Return the best correction | correct(string) | Text to check | String | Returns the text itself if nothing can be corrected |
| Return the corrections for each segment | correctMap(string) | Text to check | Map | Returns the list of all matching corrections |
| Return the corrections for each segment, limited in size | correctMap(string, int limit) | Text to check, size of the returned list | Correction list of the specified size | List size is less than or equal to limit |

Check whether the spelling is correct

final String hello = "hello 你好";
final String speling = "speling 你好 以毒功毒";
Assert.assertTrue(WordCheckers.isCorrect(hello));
Assert.assertFalse(WordCheckers.isCorrect(speling));

Return the best corrected result

final String hello = "hello 你好";
final String speling = "speling 你好以毒功毒";
Assert.assertEquals("hello 你好", WordCheckers.correct(hello));
Assert.assertEquals("spelling 你好以毒攻毒", WordCheckers.correct(speling));

Return the corrections for each segment

Each segment maps to its list of corrections.

final String hello = "hello 你好";
final String speling = "speling 你好以毒功毒";
Assert.assertEquals("{hello=[hello],  =[ ], 你=[你], 好=[好]}", WordCheckers.correctMap(hello).toString());
Assert.assertEquals("{ =[ ], speling=[spelling, spewing, sperling, seeling, spieling, spiling, speeling, speiling, spelding], 你=[你], 好=[好], 以毒功毒=[以毒攻毒]}", WordCheckers.correctMap(speling).toString());

Return the corrections for each segment, with a limit

Same as above, but with a maximum number of corrections per segment.

final String hello = "hello 你好";
final String speling = "speling 你好以毒功毒";

Assert.assertEquals("{hello=[hello],  =[ ], 你=[你], 好=[好]}", WordCheckers.correctMap(hello, 2).toString());
Assert.assertEquals("{ =[ ], speling=[spelling, spewing], 你=[你], 好=[好], 以毒功毒=[以毒攻毒]}", WordCheckers.correctMap(speling, 2).toString());

Formatting

User input comes in many forms, so this tool normalizes its format before checking.

Case

Uppercase letters are normalized to lowercase.

final String word = "stRing";

Assert.assertTrue(EnWordCheckers.isCorrect(word));

Full-width and half-width

Full-width characters are normalized to half-width.

final String word = "stｒing"; // the ｒ here is a full-width character

Assert.assertTrue(EnWordCheckers.isCorrect(word));
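Both normalizations can be sketched with the JDK alone. The class and method names below are hypothetical and this is not how word-checker is known to implement it. The sketch relies on the fact that the full-width forms U+FF01 to U+FF5E sit exactly 0xFEE0 above their ASCII counterparts, and that U+3000 is the full-width space.

```java
final class FormatSketch {

    /** Map full-width ASCII variants to half-width, then lowercase every character. */
    static String normalize(String word) {
        StringBuilder sb = new StringBuilder(word.length());
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c >= '\uFF01' && c <= '\uFF5E') {
                c = (char) (c - 0xFEE0); // full-width form to plain ASCII
            } else if (c == '\u3000') {
                c = ' '; // full-width (ideographic) space
            }
            sb.append(Character.toLowerCase(c));
        }
        return sb.toString();
    }
}
```

After this step, stRing and a variant containing a full-width ｒ both become string, so a single dictionary lookup covers them.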

Custom English dictionary

File configuration

You can create the file resources/data/define_word_checker_en.txt

The content is as follows:

my-long-long-define-word,2
my-long-long-define-word-two

Each word goes on its own line.

The first column of each line is the word, and the second column is the number of occurrences. The two are separated by a comma (,).

The higher the count, the higher the word's priority when corrections are returned. The default count is 1.

The user-defined dictionary takes priority over the built-in dictionary.
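Reading such a file comes down to splitting each line on the comma and falling back to a count of 1. The following is a small sketch under assumptions: the class name and the List<String> input are made up here, and word-checker's actual loader may differ.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DefineWordSketch {

    /** Parse "word,count" lines; a line without a comma gets the default count of 1. */
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            int comma = line.lastIndexOf(',');
            if (comma < 0) {
                result.put(line, 1L);
            } else {
                String word = line.substring(0, comma).trim();
                // throws NumberFormatException if the count is not a whole number
                result.put(word, Long.parseLong(line.substring(comma + 1).trim()));
            }
        }
        return result;
    }
}
```

To give user-defined words priority, one option is to load the built-in dictionary first and then put these entries into the same map, so that they overwrite any built-in counts.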

Test code

Once the words are defined, the spell check picks them up.

final String word = "my-long-long-define-word";
final String word2 = "my-long-long-define-word-two";

Assert.assertTrue(EnWordCheckers.isCorrect(word));
Assert.assertTrue(EnWordCheckers.isCorrect(word2));

Custom Chinese dictionary

File configuration

You can create the file resources/data/define_word_checker_zh.txt

The content is as follows:

默守成规 墨守成规

Separate the two with a half-width (ASCII) space. The wrong form comes first and the correct form second.

Summary

Chinese and English spelling correction has long been both a popular and a difficult topic.

In recent years, thanks to progress in NLP and artificial intelligence, commercial applications have gradually become successful.

This implementation is based on traditional algorithms, and its core lies in the dictionaries.

Xiao Ming has open sourced the complete implementation:

https://github.com/houbb/word-checker

If you find it helpful, a fork or star is welcome~

Follow-up

After several days of hard work, Xiao Ming finally finished the simplest possible spell check tool.

"Do you still need that official-account-style spell check feature you mentioned last time?"

"No. I'd forgotten all about it until you brought it up." The product manager looked a little surprised. "It doesn't matter whether that one gets done. We have a pile of business requirements backed up lately, so take a look at those first."

“……”

"I recently saw another really nice feature on xxx. Build one for our system."

“……”


老马啸西风