Implementing Chinese and English spell checking and correction in Java? But I can only write CRUD!

A simple requirement

Near the end of the workday, Xiao Ming had finished the day's tasks and was getting ready to go home.

A message flashed on his screen.

"I recently noticed that the spell check feature of the official account platform is quite good. It helps users find typos, and the experience is great. Build one for our system too."

Reading the message, Xiao Ming silently cursed in his heart.

"If I damn well knew how to do that, I'd be working at their headquarters already instead of putting up with you here."

"Okay," Xiao Ming replied, "I'll take a look first."

Today I'm leaving on time no matter who shows up. Even Jesus couldn't keep me here.

So Xiao Ming thought, and went home.

Calm analysis

Xiao Ming actually knew a little about spell checking.

He had never eaten pork, but he had at least seen pigs run.

He had read posts from various official accounts announcing that the platform had launched a spell check feature and that typos would be a thing of the past.

Afterwards, Xiao Ming still saw plenty of typos in their articles. And that was the end of that.

Why not ask the almighty GitHub?

Xiao Ming opened GitHub and found that there seemed to be no mature Java open source project for this, and the few with some stars were not easy to use.

Presumably most NLP work is done in Python? Implementing Chinese and English spell checking and correction in Java? But I can only write CRUD!

Xiao Ming silently pulled out a Huazi cigarette...

The night outside the window was cool as water, and he could not help falling into contemplation. Where do I come from? Where am I going? What is the meaning of life?

The still-warm ash fell on the slippers Xiao Ming had bought from a certain Dong, and the sting reined in the wild horse of his runaway thoughts.

With no ideas and no leads, he decided to wash up and sleep first.

That night, Xiao Ming had a long, sweet dream. In the dream there were no typos, and every word and sentence sat in its proper place...

Turnaround

The next day, Xiao Ming opened the search box and typed "spelling correct".

Luckily, he found an article explaining an English spelling correction algorithm.

"I once spent a whole day thinking, and it was not worth a moment of study." Xiao Ming sighed and started reading.

Algorithm ideas

English words are built from 26 letters, so spelling errors are always possible.

First, obtain a list of correctly spelled English words. An excerpt:

apple,16192
applecart,41
applecarts,1
appledrain,1
appledrains,1
applejack,571
applejacks,4
appleringie,1
appleringies,1
apples,5914
applesauce,378
applesauces,1
applet,2

Each line holds a word and its frequency, separated by a comma.

Take the user input appl as an example. If the word does not exist in the lexicon, apply operations such as insert, delete, and replace to it to find the closest word. (In essence, this finds the word with the smallest edit distance.)

If the entered word exists in the lexicon, it is correct and needs no processing.
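
The "smallest edit distance" idea can be made concrete with a short sketch. The class below is only an illustration (the name `EditDistance` is hypothetical and not taken from the article's project). It computes the classic Levenshtein distance, which counts single-character inserts, deletes, and replaces.

```java
class EditDistance {
    /** Levenshtein distance: minimum number of single-character inserts, deletes, or replaces. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j characters of b from "" takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i characters of a to "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int replace = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), replace);
            }
            // The row just computed becomes the previous row for the next character of a
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

With this sketch, `EditDistance.distance("appl", "apple")` is 1, so apple is a closer candidate for appl than apples, whose distance is 2.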

Obtaining the word list

So where can an English word list be found?

Xiao Ming thought for a while, searched in various places, and finally found a fairly complete English word frequency list with more than 270,000 words (27W+).

The excerpt is as follows:

aa,1831
aah,45774
aahed,1
aahing,30
aahs,23
...
zythums,1
zyzzyva,2
zyzzyvas,1
zzz,76
zzzs,2

Core code

Generate every candidate string that is one edit away from the user's input. The core code is as follows:

/**
 * Builds all possible single-edit variants of the current word.
 *
 * @param word the input word
 * @return the list of candidate strings
 * @since 0.0.1
 * @author 老马啸西风
 */
private List<String> edits(String word) {
    List<String> result = new LinkedList<>();
    // Deletion: drop the character at position i
    for (int i = 0; i < word.length(); ++i) {
        result.add(word.substring(0, i) + word.substring(i + 1));
    }
    // Transposition: swap the characters at positions i and i + 1
    for (int i = 0; i < word.length() - 1; ++i) {
        result.add(word.substring(0, i) + word.substring(i + 1, i + 2) + word.substring(i, i + 1) + word.substring(i + 2));
    }
    // Replacement: replace the character at position i with each letter a-z
    for (int i = 0; i < word.length(); ++i) {
        for (char c = 'a'; c <= 'z'; ++c) {
            result.add(word.substring(0, i) + c + word.substring(i + 1));
        }
    }
    // Insertion: insert each letter a-z at position i
    for (int i = 0; i <= word.length(); ++i) {
        for (char c = 'a'; c <= 'z'; ++c) {
            result.add(word.substring(0, i) + c + word.substring(i));
        }
    }
    return result;
}

Then keep only the candidates that appear in the lexicon:

List<String> options = edits(formatWord);
List<CandidateDto> candidateDtos = new LinkedList<>();
for (String option : options) {
    if (wordDataMap.containsKey(option)) {
        CandidateDto dto = CandidateDto.builder()
                .word(option).count(wordDataMap.get(option)).build();
        candidateDtos.add(dto);
    }
}

Finally, the candidates need to be sorted by word frequency before they are returned. Overall, this is relatively simple.
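
To show how the pieces fit together, here is a minimal self-contained sketch of frequency-ranked correction. It assumes the lexicon has already been loaded into a `Map<String, Integer>`. The class name `SimpleEnCorrector` is hypothetical and exists only for illustration, so this is not the word-checker implementation itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SimpleEnCorrector {
    private final Map<String, Integer> freq; // word -> frequency

    SimpleEnCorrector(Map<String, Integer> freq) {
        this.freq = freq;
    }

    /** All strings one edit away: delete, transpose, replace, insert. */
    static Set<String> edits(String w) {
        Set<String> out = new LinkedHashSet<>();
        for (int i = 0; i < w.length(); i++) {
            out.add(w.substring(0, i) + w.substring(i + 1));
        }
        for (int i = 0; i < w.length() - 1; i++) {
            out.add(w.substring(0, i) + w.charAt(i + 1) + w.charAt(i) + w.substring(i + 2));
        }
        for (int i = 0; i < w.length(); i++) {
            for (char c = 'a'; c <= 'z'; c++) {
                out.add(w.substring(0, i) + c + w.substring(i + 1));
            }
        }
        for (int i = 0; i <= w.length(); i++) {
            for (char c = 'a'; c <= 'z'; c++) {
                out.add(w.substring(0, i) + c + w.substring(i));
            }
        }
        return out;
    }

    /** Known candidates, most frequent first; the word itself if it is already known. */
    List<String> correctList(String word) {
        String w = word.toLowerCase();
        if (freq.containsKey(w)) {
            return Collections.singletonList(w);
        }
        List<String> known = new ArrayList<>();
        for (String candidate : edits(w)) {
            if (freq.containsKey(candidate)) {
                known.add(candidate);
            }
        }
        // Higher frequency first
        known.sort((a, b) -> Integer.compare(freq.get(b), freq.get(a)));
        return known;
    }

    /** Best candidate, or the input itself when nothing matches. */
    String correct(String word) {
        List<String> list = correctList(word);
        return list.isEmpty() ? word : list.get(0);
    }
}
```

Using a set for the candidates avoids duplicates, since different edits can produce the same string. The sketch only looks one edit away, so a word that needs two edits to fix is returned unchanged.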

Chinese spelling

Missed

At first glance, Chinese spell checking looks similar to English, but Chinese has one very particular feature.

Because the written form of every Chinese character is fixed, a user cannot misspell a character while typing. The only possible error is typing a wrong character in place of the intended one.

Calling a single character wrong in isolation is meaningless. It can only be judged within a word or a context.

This makes correction considerably harder.

Xiao Ming shook his head helplessly. Chinese culture is broad and profound.

Algorithm ideas

There are several ways to detect wrong Chinese characters:

(1) Confusion set

For example, common wrong-character cases such as 万变不离其宗 miswritten as 万变不离其中.

(2) N-gram

That is, the context in which a word appears. The 2-gram is the most widely used, and a corresponding corpus is available from Sogou Labs.

In other words, once the first word is fixed, each possible second word has a corresponding probability. The higher the probability, the more likely it is what the user meant to type.

For example, the user may type one form of a phrase meaning "runs fast" when a variant that differs by a single character is actually the correct one.
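
The 2-gram idea can be sketched in a few lines. This is an illustrative assumption about how such a model might be organized, not code from the article: the class `BigramModel` and its methods are hypothetical, and the probability is a plain relative frequency without smoothing.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BigramModel {
    // counts.get(first).get(second) = how often "first second" was seen in the corpus
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    void add(String first, String second, int count) {
        counts.computeIfAbsent(first, k -> new HashMap<>()).merge(second, count, Integer::sum);
    }

    /** P(second | first) as a relative frequency; 0 when first was never seen. */
    double probability(String first, String second) {
        Map<String, Integer> row = counts.get(first);
        if (row == null) {
            return 0.0;
        }
        int total = 0;
        for (int c : row.values()) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        return row.getOrDefault(second, 0) / (double) total;
    }

    /** Picks the candidate most likely to follow first; ties keep the earlier candidate. */
    String best(String first, List<String> candidates) {
        String winner = candidates.get(0);
        for (String candidate : candidates) {
            if (probability(first, candidate) > probability(first, winner)) {
                winner = candidate;
            }
        }
        return winner;
    }
}
```

For instance, with made-up counts `add("以毒", "攻毒", 99)` and `add("以毒", "功毒", 1)`, calling `best("以毒", Arrays.asList("功毒", "攻毒"))` picks 攻毒, because it follows 以毒 far more often in this toy corpus.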

Error correction

Of course, another difficulty with Chinese is that you cannot turn one character into another through insert, delete, or replace operations.

Even so, there are still several ways to generate candidates:

(1) Homophones and near-homophones

(2) Visually similar characters

(3) Synonyms

(4) Words out of order, and added or deleted characters

Algorithm implementation

Given the implementation difficulty, Xiao Ming chose the simplest option, the confusion set.

First, find a dictionary of commonly miswritten phrases. An excerpt:

一丘之鹤 一丘之貉
一仍旧惯 一仍旧贯
一付中药 一服中药
...
黯然消魂 黯然销魂
鼎立相助 鼎力相助
鼓躁而进 鼓噪而进
龙盘虎据 龙盘虎踞

The first column is the wrong form, and the second is the correct usage.

Use the wrong forms as a dictionary, then run fast-forward segmentation over the Chinese text and look up the corresponding correct form for each match.

Of course, to start with, we can simply have the user enter a single phrase. The implementation then just looks it up in the corresponding map.

public List<String> correctList(String word, int limit, IWordCheckerContext context) {
    final Map<String, List<String>> wordData = context.wordData().correctData();
    // If the word is already correct, return it as is
    if(isCorrect(word, context)) {
        return Collections.singletonList(word);
    }
    List<String> allList = wordData.get(word);
    final int minLimit = Math.min(allList.size(), limit);
    List<String> resultList = Guavas.newArrayList(minLimit);
    for(int i = 0; i < minLimit; i++) {
        resultList.add(allList.get(i));
    }
    return resultList;
}
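
As a standalone illustration of the same lookup, here is a minimal sketch that does not depend on the project's context classes. The class `ConfusionSet` is hypothetical, and it assumes a phrase counts as correct whenever it is absent from the wrong-form dictionary.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ConfusionSet {
    // wrong form -> correct forms, best candidate first
    private final Map<String, List<String>> corrections = new HashMap<>();

    void put(String wrong, List<String> correctForms) {
        corrections.put(wrong, correctForms);
    }

    /** Assumption: anything not listed as a wrong form is treated as correct. */
    boolean isCorrect(String phrase) {
        return !corrections.containsKey(phrase);
    }

    /** Returns at most limit corrections, or the phrase itself when it is already correct. */
    List<String> correctList(String phrase, int limit) {
        List<String> all = corrections.get(phrase);
        if (all == null) {
            return Collections.singletonList(phrase);
        }
        int size = Math.max(0, Math.min(all.size(), limit));
        return new ArrayList<>(all.subList(0, size));
    }
}
```

The explicit null check means an unknown phrase can never cause a NullPointerException, even if `isCorrect` is later changed to use a different rule.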

Long mixed Chinese and English text

Algorithm ideas

Real articles usually mix Chinese and English.

To make this convenient for users, we cannot require them to enter one phrase at a time.

What should we do?

The answer is word segmentation. Split the input sentence into words, then distinguish Chinese from English and handle each accordingly.

For word segmentation, this open source project is recommended:

https://github.com/houbb/segment
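
To make the "distinguish Chinese from English" step concrete, here is a minimal sketch that splits text into runs of English letters, Chinese characters, and everything else. It uses only the JDK and is not how the segment project works internally. The class `ScriptSplitter` and its `Kind` enum are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class ScriptSplitter {
    enum Kind { ENGLISH, CHINESE, OTHER }

    static Kind kindOf(int codePoint) {
        if ((codePoint >= 'a' && codePoint <= 'z') || (codePoint >= 'A' && codePoint <= 'Z')) {
            return Kind.ENGLISH;
        }
        if (Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN) {
            return Kind.CHINESE;
        }
        return Kind.OTHER;
    }

    /** Splits text into maximal runs whose characters all share the same kind. */
    static List<String> split(String text) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        int i = 0;
        Kind current = null;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            Kind kind = kindOf(cp);
            if (current != null && kind != current) {
                // The kind changed, so the previous run ends here
                runs.add(text.substring(start, i));
                start = i;
            }
            current = kind;
            i += Character.charCount(cp);
        }
        if (start < text.length()) {
            runs.add(text.substring(start));
        }
        return runs;
    }
}
```

A string consisting of an English word, a space, and two Chinese characters comes back as three runs, one of each kind. Each run can then be routed to the English checker, the Chinese checker, or passed through unchanged.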

Algorithm implementation

The revised core algorithm handles both Chinese and English.

@Override
public String correct(String text) {
    // Nothing to correct for empty input
    if(StringUtil.isEmpty(text)) {
        return text;
    }

    StringBuilder stringBuilder = new StringBuilder();
    final IWordCheckerContext zhContext = buildChineseContext();
    final IWordCheckerContext enContext = buildEnglishContext();

    // Step 1: segment the text
    List<String> segments = commonSegment.segment(text);
    // Step 2: correct each segment according to its language
    for(String segment : segments) {
        // English segment
        if(StringUtil.isEnglish(segment)) {
            String correct = enWordChecker.correct(segment, enContext);
            stringBuilder.append(correct);
        } else if(StringUtil.isChinese(segment)) {
            String correct = zhWordChecker.correct(segment, zhContext);
            stringBuilder.append(correct);
        } else {
            // Anything else is kept unchanged
            stringBuilder.append(segment);
        }
    }

    return stringBuilder.toString();
}

The default implementation of word segmentation is as follows:

import com.github.houbb.heaven.util.util.CollectionUtil;
import com.github.houbb.nlp.common.segment.ICommonSegment;
import com.github.houbb.nlp.common.segment.impl.CommonSegments;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Default mixed segmentation, supporting Chinese and English.
 *
 * @author binbin.hou
 * @since 0.0.8
 */
public class DefaultSegment implements ICommonSegment {

    @Override
    public List<String> segment(String s) {
        // Split on whitespace first
        List<String> strings = CommonSegments.defaults().segment(s);
        if(CollectionUtil.isEmpty(strings)) {
            return Collections.emptyList();
        }

        List<String> results = new ArrayList<>();
        ICommonSegment chineseSegment = InnerCommonSegments.defaultChinese();
        for(String text : strings) {
            // Then apply Chinese segmentation to each piece
            List<String> segments = chineseSegment.segment(text);

            results.addAll(segments);
        }

        return results;
    }

}

The text is first split on spaces, and each piece is then run through fast-forward segmentation against the wrong-form entries of the confusion set.
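
The dictionary-driven step can be sketched as a greedy forward scan. This assumes that fast-forward segmentation behaves like longest-match-first scanning from left to right, which the article does not spell out. The class `ForwardMatcher` is a hypothetical name, and the sketch works on `char` units, so it does not handle characters outside the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class ForwardMatcher {
    /**
     * Greedy forward matching: at each position take the longest dictionary entry
     * that starts there; if none matches, emit a single character.
     */
    static List<String> segment(String text, Set<String> dict) {
        int maxLen = 0;
        for (String entry : dict) {
            maxLen = Math.max(maxLen, entry.length());
        }
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = i + 1; // fallback: a single character
            // Try the longest possible entry first, then shorter ones
            for (int len = Math.min(maxLen, text.length() - i); len > 1; len--) {
                if (dict.contains(text.substring(i, i + len))) {
                    end = i + len;
                    break;
                }
            }
            out.add(text.substring(i, end));
            i = end;
        }
        return out;
    }
}
```

With a dictionary containing only 以毒功毒, the text 你好以毒功毒 is split into 你, 好, and 以毒功毒, which matches the shape of the correctMap results in the mixed-text examples of this article.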

Of course, all of this is easy to describe.

Actually implementing it is quite tedious. Xiao Ming has open sourced the complete implementation:

https://github.com/houbb/word-checker

If you find it helpful, feel free to fork or star it~

Quick start

word-checker checks the spelling of words. It supports both English and Chinese spell checking.

Without further ado, let's try the tool out.

Features

Quickly determines whether a word is misspelled
Returns the best match
Returns a list of correction candidates, with a configurable list size
Error messages support i18n
Supports case and full-width/half-width normalization
Supports custom word lists
Built-in English word list with 270,000+ entries
Supports basic Chinese spell checking

Quick start

Maven dependency

<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>word-checker</artifactId>
    <version>0.0.8</version>
</dependency>

Test case

Given an input, the best correction is returned automatically.

final String speling = "speling";
Assert.assertEquals("spelling", EnWordCheckers.correct(speling));

Core API overview

The core API lives in the EnWordCheckers utility class.

Function	Method	Parameters	Return value	Remark
Check whether a word is spelled correctly	isCorrect(string)	Word to check	boolean
Return the best correction	correct(string)	Word to check	String	Returns the word itself if no correction is found
Return the list of corrections	correctList(string)	Word to check	List	Returns all matching corrections
Return a size-limited list of corrections	correctList(string, int limit)	Word to check, maximum list size	List of the specified size	List size is less than or equal to limit

Test examples

See EnWordCheckerTest.java

Check whether spelling is correct

final String hello = "hello";
final String speling = "speling";
Assert.assertTrue(EnWordCheckers.isCorrect(hello));
Assert.assertFalse(EnWordCheckers.isCorrect(speling));

Return the best match

final String hello = "hello";
final String speling = "speling";
Assert.assertEquals("hello", EnWordCheckers.correct(hello));
Assert.assertEquals("spelling", EnWordCheckers.correct(speling));

Default correction list

final String word = "goox";
List<String> stringList = EnWordCheckers.correctList(word);
Assert.assertEquals("[good, goo, goon, goof, gook, goop, goos, gox, goog, gool, goor]", stringList.toString());

Limit the size of the correction list

final String word = "goox";
final int limit = 2;
List<String> stringList = EnWordCheckers.correctList(word, limit);
Assert.assertEquals("[good, goo]", stringList.toString());

Chinese spelling correction

Core API

To keep the learning cost low, the core API of ZhWordCheckers is consistent with the English spell checker.

Check whether spelling is correct

final String right = "正确";
final String error = "万变不离其中";

Assert.assertTrue(ZhWordCheckers.isCorrect(right));
Assert.assertFalse(ZhWordCheckers.isCorrect(error));

Return the best match

final String right = "正确";
final String error = "万变不离其中";

Assert.assertEquals("正确", ZhWordCheckers.correct(right));
Assert.assertEquals("万变不离其宗", ZhWordCheckers.correct(error));

Default correction list

final String word = "万变不离其中";

List<String> stringList = ZhWordCheckers.correctList(word);
Assert.assertEquals("[万变不离其宗]", stringList.toString());

Limit the size of the correction list

final String word = "万变不离其中";
final int limit = 1;

List<String> stringList = ZhWordCheckers.correctList(word, limit);
Assert.assertEquals("[万变不离其宗]", stringList.toString());

Long mixed Chinese and English text

Scenario

In real spelling correction, the best user experience is to let the user enter a long text, which may mix Chinese and English.

The same functions described above are then applied to that text.

Core methods

The WordCheckers utility class provides automatic correction for long text that mixes Chinese and English.

Function	Method	Parameters	Return value	Remark
Check whether the text is spelled correctly	isCorrect(string)	Text to check	boolean	Returns true only if everything is correct
Return the best correction	correct(string)	Text to check	String	Returns the text itself if nothing can be corrected
Return the corrections for each word	correctMap(string)	Text to check	Map	Returns all matching corrections
Return size-limited corrections for each word	correctMap(string, int limit)	Text to check, maximum list size	Map	Each list size is less than or equal to limit

Check whether spelling is correct

final String hello = "hello 你好";
final String speling = "speling 你好 以毒功毒";
Assert.assertTrue(WordCheckers.isCorrect(hello));
Assert.assertFalse(WordCheckers.isCorrect(speling));

Return the best correction

final String hello = "hello 你好";
final String speling = "speling 你好以毒功毒";
Assert.assertEquals("hello 你好", WordCheckers.correct(hello));
Assert.assertEquals("spelling 你好以毒攻毒", WordCheckers.correct(speling));

Return the correction map

Each word is mapped to its correction results.

final String hello = "hello 你好";
final String speling = "speling 你好以毒功毒";
Assert.assertEquals("{hello=[hello],  =[ ], 你=[你], 好=[好]}", WordCheckers.correctMap(hello).toString());
Assert.assertEquals("{ =[ ], speling=[spelling, spewing, sperling, seeling, spieling, spiling, speeling, speiling, spelding], 你=[你], 好=[好], 以毒功毒=[以毒攻毒]}", WordCheckers.correctMap(speling).toString());

Return the correction map with a size limit

Same as above, with a maximum number of results per word.

final String hello = "hello 你好";
final String speling = "speling 你好以毒功毒";

Assert.assertEquals("{hello=[hello],  =[ ], 你=[你], 好=[好]}", WordCheckers.correctMap(hello, 2).toString());
Assert.assertEquals("{ =[ ], speling=[spelling, spewing], 你=[你], 好=[好], 以毒功毒=[以毒攻毒]}", WordCheckers.correctMap(speling, 2).toString());

Formatting

User input comes in many forms, so the tool normalizes it before checking.

Case

Uppercase letters are normalized to lowercase.

final String word = "stRing";

Assert.assertTrue(EnWordCheckers.isCorrect(word));

Full-width and half-width

Full-width characters are normalized to half-width.

final String word = "stｒing";

Assert.assertTrue(EnWordCheckers.isCorrect(word));
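
For readers curious how such normalization can work, here is a minimal sketch. It is not the library's code, and the class `WordFormat` is a hypothetical name. It relies on the fact that the full-width forms U+FF01 to U+FF5E sit exactly 0xFEE0 above their ASCII counterparts.

```java
class WordFormat {
    /** Lowercases the word and converts full-width ASCII variants to half-width. */
    static String normalize(String word) {
        StringBuilder sb = new StringBuilder(word.length());
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c >= '\uFF01' && c <= '\uFF5E') {
                // Full-width forms are offset from ASCII by 0xFEE0
                c = (char) (c - 0xFEE0);
            } else if (c == '\u3000') {
                // The ideographic space maps to a regular space
                c = ' ';
            }
            sb.append(Character.toLowerCase(c));
        }
        return sb.toString();
    }
}
```

Both inputs from the examples above, the mixed-case stRing and the variant with a full-width r, normalize to string under this sketch.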

Custom English word list

File configuration

You can create the file resources/data/define_word_checker_en.txt

The content is as follows:

my-long-long-define-word,2
my-long-long-define-word-two

Each word is on its own line.

The first column of each row is the word, and the second column is the number of occurrences. The two are separated by a comma (,).

The larger the count, the higher the word ranks when corrections are returned. The default count is 1.

The user-defined word list takes priority over the built-in one.
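
The file format described above can be parsed with a few lines of code. The sketch below is an assumption about one reasonable way to do it, not the library's loader. The class `LexiconLoader` is hypothetical, and "higher priority" is interpreted here as user-defined entries overriding built-in ones.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LexiconLoader {
    /** Parses "word,count" lines; a missing count defaults to 1. Blank lines are skipped. */
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> result = new HashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            int comma = line.lastIndexOf(',');
            if (comma < 0) {
                // No count column: use the default count of 1
                result.put(line, 1);
            } else {
                String word = line.substring(0, comma).trim();
                int count = Integer.parseInt(line.substring(comma + 1).trim());
                result.put(word, count);
            }
        }
        return result;
    }

    /** Merges two lexicons; user-defined entries override built-in ones (an assumption). */
    static Map<String, Integer> merge(Map<String, Integer> builtIn, Map<String, Integer> userDefined) {
        Map<String, Integer> merged = new HashMap<>(builtIn);
        merged.putAll(userDefined);
        return merged;
    }
}
```

Parsing the two sample lines above yields a count of 2 for my-long-long-define-word and the default count of 1 for my-long-long-define-word-two.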

Test code

Once the words are defined, the spell check recognizes them.

final String word = "my-long-long-define-word";
final String word2 = "my-long-long-define-word-two";

Assert.assertTrue(EnWordCheckers.isCorrect(word));
Assert.assertTrue(EnWordCheckers.isCorrect(word2));

Custom Chinese word list

File configuration

You can create the file resources/data/define_word_checker_zh.txt

The content is as follows:

默守成规 墨守成规

Separate the two forms with a half-width (English) space. The wrong form comes first and the correct form second.

Summary

Chinese and English spelling correction has always been a popular and difficult topic.

In recent years, thanks to advances in NLP and artificial intelligence, commercial applications have gradually succeeded.

This implementation is mainly based on traditional algorithms, and its core lies in the word lists.

Xiao Ming has open sourced the complete implementation:

https://github.com/houbb/word-checker

If you find it helpful, a fork or star is welcome~

Follow-up

After several days of hard work, Xiao Ming finally completed the simplest of spell checking tools.

"Do you still need the official account spell check feature we talked about last time?"

"No. I'd have forgotten about it if you hadn't mentioned it." The product manager looked a little surprised. "It doesn't matter whether that requirement gets done. We've recently piled up a batch of business requirements, so take a look at those first."

“……”

"I recently saw a feature on xxx that is also very good. Build one for our system."

“……”


老马啸西风
