Preface

To satisfy a product manager's request, Xiao Ming wrote a Chinese and English spelling correction tool: https://github.com/houbb/word-checker .

Xiao Ming thought the problem was solved once and for all, until yesterday, with some idle time, he stumbled on another open source project. Its description reads:

Spelling correction & Fuzzy search: 1 million times faster through Symmetric Delete spelling correction algorithm

The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. 

It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.

Xiao Ming could hardly believe his eyes. A million times faster? What a boast!


Adhering to the principle of neither believing nor spreading rumors, Xiao Ming began his journey of learning the algorithm.

Ideas behind word spelling correction algorithms

There are several algorithms for correcting the spelling of English words:

Real-time calculation of edit distance

Given two strings $s_1$ and $s_2$, the edit distance between them is the minimum number of edit operations required to convert $s_1$ to $s_2$.

The most common edit operations allowed for this purpose are: (i) inserting a character into the string; (ii) deleting a character from the string; and (iii) replacing a character in the string with another character. With these operations, the edit distance is also called the Levenshtein distance.

This algorithm is the easiest to think of, but computing the distance against every dictionary word in real time is very expensive, so it is generally not used in industry as-is.
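
To make the cost concrete, here is a minimal dynamic-programming sketch of the Levenshtein distance (illustrative only, not code from word-checker):

// Minimal Levenshtein distance: the minimum number of insertions,
// deletions and replacements needed to turn s1 into s2. O(n*m) time.
public static int levenshtein(String s1, String s2) {
    int n = s1.length(), m = s2.length();
    int[] prev = new int[m + 1];
    int[] curr = new int[m + 1];
    for (int j = 0; j <= m; j++) {
        prev[j] = j; // turning "" into s2[0..j) needs j insertions
    }
    for (int i = 1; i <= n; i++) {
        curr[0] = i; // turning s1[0..i) into "" needs i deletions
        for (int j = 1; j <= m; j++) {
            int cost = (s1.charAt(i - 1) == s2.charAt(j - 1)) ? 0 : 1;
            curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                        prev[j] + 1),     // deletion
                               prev[j - 1] + cost);       // replacement
        }
        int[] tmp = prev; prev = curr; curr = tmp;
    }
    return prev[m];
}

For example, levenshtein("good", "goox") returns 1. The cost problem is not one distance computation but running it against every dictionary term for every query.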

Peter Norvig's spelling algorithm

Generate all possible terms with an edit distance (deletes + transposes + replaces + inserts) from the query term, and search for them in the dictionary.

For a word of length $n$, an alphabet size $a$, and an edit distance $d=1$, there will be $n$ deletions, $n-1$ transpositions, $a \cdot n$ replacements, and $a \cdot (n+1)$ insertions, for a total of $2n + 2an + a - 1$ terms to search for.

This is the algorithm Xiao Ming originally chose, and its performance is far better than the previous approach.

But it is still expensive at search time (114,324 candidate terms for $n=9$, $a=36$, $d=2$) and language dependent (because the alphabet is used to generate the terms, and alphabets differ across languages).
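
For intuition, here is a Java sketch of this candidate generation, transcribing the idea of Peter Norvig's well-known edits1 function (the lowercase alphabet a-z is an assumption):

import java.util.HashSet;
import java.util.Set;

// All terms at edit distance 1 from word: deletions, transpositions,
// replacements and insertions over the alphabet 'a'..'z'.
public static Set<String> edits1(String word) {
    Set<String> result = new HashSet<>();
    for (int i = 0; i < word.length(); i++) {
        // deletion: remove the character at i
        result.add(word.substring(0, i) + word.substring(i + 1));
    }
    for (int i = 0; i < word.length() - 1; i++) {
        // transposition: swap adjacent characters i and i+1
        result.add(word.substring(0, i) + word.charAt(i + 1)
                + word.charAt(i) + word.substring(i + 2));
    }
    for (int i = 0; i < word.length(); i++) {
        for (char c = 'a'; c <= 'z'; c++) {
            // replacement: overwrite the character at i
            result.add(word.substring(0, i) + c + word.substring(i + 1));
        }
    }
    for (int i = 0; i <= word.length(); i++) {
        for (char c = 'a'; c <= 'z'; c++) {
            // insertion: insert before position i
            result.add(word.substring(0, i) + c + word.substring(i));
        }
    }
    return result;
}

For edit distance 2, this set is generated again for every element of edits1(word), which is where the candidate count explodes.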

An idea for performance improvement

If Xiao Ming were to improve this algorithm, one natural idea is to trade space for time.

The deletions + transpositions + replacements + insertions of every correct word are all generated in advance and then inserted into the dictionary.

But there is a big problem: the number of entries generated by this preprocessing is enormous, which is unacceptable.

So, can Xiao Ming have both the fish and the bear's paw, as the idiom goes?


Let us take a look at the protagonist of this article.

Symmetric Delete spelling correction (SymSpell)

Algorithm Description

Generate terms with an edit distance (deletes only) from each dictionary term, and add them to the dictionary together with the original term.

This must be performed only once in the pre-calculation step.

Generate terms with an edit distance (deletes only) from the input term and search for them in the dictionary.

For a word of length $n$, an alphabet size $a$, and an edit distance of 1, there will be only $n$ deletions, for a total of $n$ terms at search time. For $n=9$ and $a=36$, that is 9 candidates, versus the $2n+2an+a-1 = 701$ candidates of the previous approach.

The cost of this approach is the pre-calculation time and storage space for the $x$ deletions of each original dictionary entry, which is acceptable in most cases.

The number of deletions $x$ for a single dictionary entry depends on the maximum edit distance.

The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup by using deletes only, instead of deletes + transposes + replaces + inserts. It is six orders of magnitude faster (for edit distance = 3) and language independent.

Some notes

To make this easier to understand, the original author also added a few notes.

Note 1: In the pre-calculation process, different words in the dictionary may lead to the same deleted word: delete(sun,1)==delete(sin,1)==sn.

Although we generate only one new dictionary entry (sn), internally we need to store both original terms as spelling correction suggestions (sun, sin).

Note 2: There are four different types of comparison pairs:

dictionary entry==input entry,
delete(dictionary entry,p1)==input entry  // precomputed
dictionary entry==delete(input entry,p2)
delete(dictionary entry,p1)==delete(input entry,p2)

Only replace and transpose require the last comparison type.

But we need to check whether the suggested dictionary term really is a replacement or an adjacent transposition of the input term, to prevent false positives at higher edit distances (bank==bnak and bank==bink, but bank!=kanb and bank!=xban and bank!=baxn).

Note 3: We are using the search engine index itself, not a dedicated spelling dictionary.

This has several advantages:

It is dynamically updated. Each newly indexed word whose frequency exceeds a certain threshold will also be automatically used for spelling correction.

Since we need to search the index anyway, spelling correction requires almost no additional cost.

When indexing misspelled terms (that is, they are not marked as correct in the index), we will immediately correct the spelling and index the page for the correct term.

Note 4: We implemented query suggestion/completion in a similar way.

This is a good way to prevent spelling errors in the first place.

Each newly indexed word whose frequency exceeds a certain threshold is stored as a suggestion for all its prefixes (if they do not already exist, they will be created in the index).

Because we provide instant search anyway, looking up suggestions costs almost nothing extra. Multiple suggestions are sorted by the number of results stored in the index.
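
A minimal sketch of that prefix idea, with an illustrative indexSuggestions helper and a plain HashMap standing in for the search-engine index (the original sorts by result counts; word frequency is used here as a simple stand-in):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// For every indexed word above a frequency threshold, register the word
// as a completion candidate under each of its proper prefixes.
static void indexSuggestions(Map<String, Long> freqMap, long threshold,
                             Map<String, List<String>> prefixIndex) {
    for (Map.Entry<String, Long> entry : freqMap.entrySet()) {
        if (entry.getValue() < threshold) {
            continue;
        }
        String word = entry.getKey();
        for (int len = 1; len < word.length(); len++) {
            prefixIndex.computeIfAbsent(word.substring(0, len),
                    k -> new ArrayList<>()).add(word);
        }
    }
    // sort each candidate list by descending frequency
    for (List<String> list : prefixIndex.values()) {
        list.sort(Comparator.comparingLong(freqMap::get).reversed());
    }
}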

Reasoning

The SymSpell algorithm takes advantage of the fact that the edit distance between two terms is symmetric:

Either we generate all terms with an edit distance <2 from the query term (trying to reverse the error in the query term) and check them against all dictionary terms,

or we generate all terms with an edit distance <2 from each dictionary term (trying to reproduce the error in the query term) and check the query term against them.

By transforming the correct dictionary terms toward erroneous strings, and the erroneous input term toward correct strings, we can combine both and meet in the middle.

Because adding a character to a dictionary term is equivalent to deleting a character from the input string, and vice versa, we can restrict the transformation to deletes only on both sides.

Example

Reading this passage left Xiao Ming somewhat dazed, so here is an example to make it clearer.

For example, the user input is: goox

The correct thesaurus contains only: good

The corresponding edit distance is 1.

Then, by deletion, the preprocessed storage for good becomes: {good=good, ood=good, god=good, goo=good}

When processing the user input:

(1) goox itself does not exist in the dictionary

(2) Generate the deletions of goox:

oox, gox, goo

The deletion goo matches the precomputed entry goo=good, so good is found as the suggestion.

By now, friends, you must have noticed the ingenuity of this algorithm.


Through deletions alone, applied to both the dictionary and the input, we essentially obtain the effect of the deletions + insertions + replacements of the original algorithm.
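
Here is a tiny self-contained sketch of this example (illustrative only; the real project's API differs):

import java.util.HashMap;
import java.util.Map;

public class SymSpellDemo {
    public static void main(String[] args) {
        // Precompute: map every single-character deletion of each
        // dictionary word back to the original word.
        Map<String, String> index = new HashMap<>();
        String dictWord = "good";
        index.put(dictWord, dictWord);
        for (int i = 0; i < dictWord.length(); i++) {
            String deleted = dictWord.substring(0, i) + dictWord.substring(i + 1);
            index.put(deleted, dictWord); // ood, god, goo -> good
        }

        // Query: "goox" itself is absent, so try its deletions.
        String input = "goox";
        if (!index.containsKey(input)) {
            for (int i = 0; i < input.length(); i++) {
                String deleted = input.substring(0, i) + input.substring(i + 1);
                if (index.containsKey(deleted)) {
                    System.out.println(deleted + " -> " + index.get(deleted));
                    // prints: goo -> good
                }
            }
        }
    }
}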

Edit distance

We use variant 3 of the edit distance (as numbered in the original post), because delete-only transforms are language independent, and the cost is three orders of magnitude lower.
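
The verification step from Note 2 needs a real edit distance that also counts adjacent transpositions. As an illustrative sketch, here is the restricted Damerau-Levenshtein (optimal string alignment) distance; whether this matches the exact variant numbered 3 in the original post is our assumption:

// Restricted Damerau-Levenshtein (optimal string alignment) distance:
// Levenshtein plus adjacent transpositions.
public static int osaDistance(String s1, String s2) {
    int n = s1.length(), m = s2.length();
    int[][] d = new int[n + 1][m + 1];
    for (int i = 0; i <= n; i++) d[i][0] = i;
    for (int j = 0; j <= m; j++) d[0][j] = j;
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int cost = (s1.charAt(i - 1) == s2.charAt(j - 1)) ? 0 : 1;
            d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                        d[i][j - 1] + 1),     // insertion
                               d[i - 1][j - 1] + cost);       // replacement
            if (i > 1 && j > 1
                    && s1.charAt(i - 1) == s2.charAt(j - 2)
                    && s1.charAt(i - 2) == s2.charAt(j - 1)) {
                // adjacent transposition
                d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
            }
        }
    }
    return d[n][m];
}

With this, osaDistance("bank", "bnak") is 1, while osaDistance("bank", "kanb") is greater than 1, which is exactly the false-positive check described in Note 2.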

Where does the speed come from?

Pre-calculation, that is, generating and storing the possible spelling-error variants (deletes only) at indexing time, is the first prerequisite.

Fast index access at search time, using a hash table with an average search time complexity of O(1), is the second prerequisite.

But only the Symmetric Delete spelling correction on top of this brings O(1) speed to spell checking, because it dramatically reduces the number of spelling-error candidates to be pre-calculated (generated and indexed).

Applying pre-calculation to Norvig's approach is not feasible, because pre-calculating all possible deletes + transposes + replaces + inserts of all terms would cost huge amounts of time and space.

Computational complexity

The SymSpell algorithm is constant time (O(1)), that is, independent of the dictionary size (but dependent on the average term length and the maximum edit distance), because our index is based on a hash table with an average search time complexity of O(1).

Code

All talk and no practice is fake kung fu.

After reading it, Xiao Ming adjusted his original algorithm implementation overnight.

Thesaurus preprocessing

Previously, for the following thesaurus:

the,23135851162
of,13151942776
and,12997637966

only a freqMap from each word to its frequency needed to be built.
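
A minimal sketch of such a loader, assuming one word,count pair per line (the method name loadFreqMap is illustrative):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Load "word,frequency" lines into a frequency map.
static Map<String, Long> loadFreqMap(String path) throws IOException {
    Map<String, Long> freqMap = new HashMap<>();
    try (BufferedReader reader = Files.newBufferedReader(
            Paths.get(path), StandardCharsets.UTF_8)) {
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split(",");
            if (parts.length == 2) {
                freqMap.put(parts[0].trim(), Long.parseLong(parts[1].trim()));
            }
        }
    }
    return freqMap;
}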

Now we also need to index each word's deletions with edit distance = 1:

/**
 * Symmetric Delete spelling correction dictionary.
 * <p>
 * 1. If the word length is less than or equal to 1, it is not processed.
 * 2. Remove one letter at a time from the word and use the remaining
 * part as the key; the value is a list of the original CandidateDto entries.
 * 3. How can deduplication be done elegantly?
 * 4. How can sorting be done elegantly?
 * <p>
 * If custom dictionaries are not a concern, the dictionary could be fully
 * preprocessed offline, but that only reduces initialization time, so it
 * is of little significance.
 *
 * @param freqMap    frequency map
 * @param resultsMap result map
 * @since 0.1.0
 */
static synchronized void initSymSpellMap(Map<String, Long> freqMap,
                                         Map<String, List<CandidateDto>> resultsMap) {
    if (MapUtil.isEmpty(freqMap)) {
        return;
    }

    for (Map.Entry<String, Long> entry : freqMap.entrySet()) {
        String key = entry.getKey();
        Long count = entry.getValue();
        // length check
        int len = key.length();
        // can later be adjusted according to the edit distance
        if (len <= 1) {
            continue;
        }
        char[] chars = key.toCharArray();
        Set<String> tempSet = new HashSet<>(chars.length);
        for (int i = 0; i < chars.length; i++) {
            String text = buildString(chars, i);
            // skip duplicate deletion results
            if (tempSet.contains(text)) {
                continue;
            }
            List<CandidateDto> candidateDtos = resultsMap.get(text);
            if (candidateDtos == null) {
                candidateDtos = new ArrayList<>();
            }
            // store the original key as a value
            candidateDtos.add(new CandidateDto(key, count));
            // the text after deletion becomes the key
            resultsMap.put(text, candidateDtos);
            tempSet.add(text);
        }
    }
    // sort everything once, at the end
    for (Map.Entry<String, List<CandidateDto>> entry : resultsMap.entrySet()) {
        String key = entry.getKey();
        List<CandidateDto> list = entry.getValue();
        if (list.size() > 1) {
            // sort
            Collections.sort(list);
            resultsMap.put(key, list);
        }
    }
}
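
As a quick sanity check, a one-word dictionary would be expanded like this (assuming CandidateDto simply pairs a word with its frequency):

Map<String, Long> freqMap = new HashMap<>();
freqMap.put("good", 100L);

Map<String, List<CandidateDto>> resultsMap = new HashMap<>();
initSymSpellMap(freqMap, resultsMap);

// resultsMap now holds: ood -> [good], god -> [good], goo -> [good]
// (deleting either 'o' yields "god" twice; tempSet keeps only one)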

The implementation of constructing the deletion string is fairly simple:

/**
 * Build the string with one character removed.
 *
 * @param chars        character array
 * @param excludeIndex index of the character to exclude
 * @return the resulting string
 * @since 0.1.0
 */
public static String buildString(char[] chars, int excludeIndex) {
    StringBuilder stringBuilder = new StringBuilder(chars.length - 1);
    for (int i = 0; i < chars.length; i++) {
        if (i == excludeIndex) {
            continue;
        }
        stringBuilder.append(chars[i]);
    }
    return stringBuilder.toString();
}

There are a few points to note here:

(1) If the word length is less than or equal to the edit distance, no deletion is performed, because after deleting there would be nothing left.

(2) Take care to skip duplicate deletion results. For example, for good, deleting either o produces god twice.

(3) Sort once, up front. This is necessary to improve the performance of real-time queries.

Of course, Xiao Ming realized that if the lexicon is fixed, the preprocessed dictionary could be generated offline, which would greatly improve loading speed.

But that only saves initialization time, so the impact is not great.
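
One more helper is worth sketching: the core algorithm below needs the set of all single-deletion variants of the input word (InnerWordDataUtil.buildStringSet). The project's actual implementation may differ, but a plausible sketch reuses buildString from above:

import java.util.HashSet;
import java.util.Set;

/**
 * Build the set of all distinct strings obtained from the
 * character array by deleting exactly one character.
 */
public static Set<String> buildStringSet(char[] chars) {
    Set<String> set = new HashSet<>(chars.length);
    for (int i = 0; i < chars.length; i++) {
        // drop the character at index i
        set.add(buildString(chars, i));
    }
    return set;
}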

Adjustment of core algorithm

The core algorithm obtains the candidate list by querying directly according to the four comparison types given above.

freqData is the frequency information of the correct dictionary.

symSpellData is the deletion-based dictionary produced above.

/**
 * dictionary entry==input entry,
 * delete(dictionary entry,p1)==input entry  // precomputed
 * dictionary entry==delete(input entry,p2)
 * delete(dictionary entry,p1)==delete(input entry,p2)
 *
 * For performance, we return early here. Making this configurable could be
 * considered later; for now it is left as-is.
 *
 * @param word    the word
 * @param context the context
 * @return the candidate list
 * @since 0.1.0
 */
@Override
protected List<CandidateDto> getAllCandidateList(String word, IWordCheckerContext context) {
    IWordData wordData = context.wordData();
    Map<String, Long> freqData = wordData.freqData();
    Map<String, List<CandidateDto>> symSpellData = wordData.symSpellData();

    //0. the original dictionary contains the word
    if (freqData.containsKey(word)) {
        // return the original entry
        CandidateDto dto = CandidateDto.of(word, freqData.get(word));
        return Collections.singletonList(dto);
    }
    // if the length is 1
    if(word.length() <= 1) {
        CandidateDto dtoA = CandidateDto.of("a", 9081174698L);
        CandidateDto dtoI = CandidateDto.of("i", 3086225277L);
        return Arrays.asList(dtoA, dtoI);
    }

    List<CandidateDto> resultList = new ArrayList<>();
    //1. a symmetric deletion of a dictionary word equals the input word
    List<CandidateDto> symSpellList = symSpellData.get(word);
    if(CollectionUtil.isNotEmpty(symSpellList)) {
        resultList.addAll(symSpellList);
    }
    // the set of all single-deletion variants of the input
    Set<String> subWordSet = InnerWordDataUtil.buildStringSet(word.toCharArray());
    //2. a deletion of the input word exists in the original dictionary
    for(String subWord : subWordSet) {
        if(freqData.containsKey(subWord)) {
            CandidateDto dto = CandidateDto.of(subWord, freqData.get(subWord));
            resultList.add(dto);
        }
    }
    //3. a deletion of the input word exists in the symmetric-deletion dictionary
    for(String subWord : subWordSet) {
        if(symSpellData.containsKey(subWord)) {
            resultList.addAll(symSpellData.get(subWord));
        }
    }
    if(CollectionUtil.isNotEmpty(resultList)) {
        return resultList;
    }

    //4. perform replaces and transposes (one extra pass); this step could even be skipped
    // to keep the edit distance at 1, only the original dictionary is consulted
    List<String> edits = edits(word);
    for(String edit : edits) {
        if(freqData.containsKey(edit)) {
            CandidateDto dto = CandidateDto.of(edit, freqData.get(edit));
            resultList.add(dto);
        }
    }
    return resultList;
}

There are several points to note:

(1) If the original dictionary already contains the word, return it directly: the word is spelled correctly.

(2) If the length is 1, simply return the fixed candidates a and i.

(3) In each of the other cases, you can likewise return early if performance is a concern.

No server configuration upgrade will ever give you a 1000x improvement, but an algorithm can. (Your salary, alas, cannot.)


Summary

A good algorithm improves a program dramatically.

The learning will continue in the future.

The code in this article has been greatly simplified to make it easier to understand. Interested friends can read the real source code here:

https://github.com/houbb/word-checker

I am Old Ma (Lao Ma), and I look forward to seeing you again next time.


