An open-source implementation of an NLP similarity algorithm for visually similar Chinese characters

Project Description

nlp-hanzi-similar provides similarity calculations for Chinese characters.


Motivation

A friend told me he was doing research in the cognitive science of language. He had read my article on ideas for calculating the similarity of visually similar Chinese characters in NLP and wanted to know whether any source code or related material was available.

For text similarity in Chinese, open-source tools are fairly plentiful.

For the similarity between two individual Chinese characters, however, there is almost nothing. Reference material is scarce, both in China and abroad.

So I tidied up a similarity algorithm I had written earlier and open-sourced it, hoping it would help him.

The goal of this project is to provide a basic similarity calculation tool and make a small contribution to Chinese character NLP.

Features

Fluent API: one line of code does everything
Highly customizable: users can supply their own implementations
Customizable dictionaries that adapt to different application scenarios
A rich set of implementation strategies

By default, similarity is computed from the four-corner code, pinyin, character structure, radical, and stroke count.

Change log

change log

Quick start

Requirements

JDK 1.7+

Maven 3.x+

Maven dependency

<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>nlp-hanzi-similar</artifactId>
    <version>1.0.0</version>
</dependency>

Quick start

Basic usage

HanziSimilarHelper.similar returns the similarity of two Chinese characters.

double rate1 = HanziSimilarHelper.similar('末', '未');

The result is:

0.9629629629629629

Custom weight

By default, similarity is computed from the four-corner code, pinyin, character structure, radical, and stroke count.

If the default weights do not meet your needs, you can set your own:

double rate = HanziSimilarBs.newInstance()
                .jiegouRate(10)
                .sijiaoRate(8)
                .bushouRate(6)
                .bihuashuRate(2)
                .pinyinRate(1)
                .similar('末', '未');

Custom similarity

In some cases the computed result may not be what you need.

You can then define your own values in hanzi_similar_define.txt in the root directory.

入人 0.9
人入 0.9

With these entries in place, the user-defined value takes priority when the similarity of 人 and 入 is computed.

double rate = HanziSimilarHelper.similar('人', '入');

The result is now the user-defined value.
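A minimal sketch of how such a file could be read, assuming each line holds a two-character key and a score separated by whitespace, as in the sample entries. The names `UserDefineParser` and `parse` are hypothetical and are not taken from the library; the lookup key is assumed to be the two characters concatenated, which matches the core code shown later in this article.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of nlp-hanzi-similar.
class UserDefineParser {

    /** Parses lines such as "入人 0.9" into a key -> score map; blank or malformed lines are skipped. */
    static Map<String, Double> parse(List<String> lines) {
        Map<String, Double> result = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                continue; // not "<pair> <score>"
            }
            try {
                result.put(parts[0], Double.parseDouble(parts[1]));
            } catch (NumberFormatException e) {
                // Score is not a number: ignore this line.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> defined = parse(Arrays.asList("入人 0.9", "人入 0.9"));
        // The key is the two characters concatenated in lookup order.
        System.out.println(defined.get("人" + "入"));
    }
}
```

Because the key depends on the order of the two characters, both 入人 and 人入 are listed in the file.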

`Bootstrap class`

`Description`

To make customization easier, HanziSimilarBs lets users supply their own configuration.

The configurable properties of HanziSimilarBs are listed below:

No.	Property	Description
1	bihuashuRate	Stroke count weight
2	bihuashuData	Stroke count data
3	bihuashuSimilar	Stroke count similarity strategy
4	jiegouRate	Structure weight
5	jiegouData	Structure data
6	jiegouSimilar	Structure similarity strategy
7	bushouRate	Radical weight
8	bushouData	Radical data
9	bushouSimilar	Radical similarity strategy
10	sijiaoRate	Four-corner code weight
11	sijiaoData	Four-corner code data
12	sijiaoSimilar	Four-corner code similarity strategy
13	pinyinRate	Pinyin weight
14	pinyinData	Pinyin data
15	pinyinSimilar	Pinyin similarity strategy
16	hanziSimilar	Core strategy for Chinese character similarity
17	userDefineData	User-defined data

Every item can be customized by implementing the corresponding interface.
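To illustrate interface-based customization, the sketch below implements a stroke-count strategy against two local stand-in interfaces. `SimilarContext` and `SimilarStrategy` are hypothetical simplifications written for this example; the library's own `IHanziSimilar` and `IHanziSimilarContext` may expose different members. The scoring rule (smaller stroke count divided by larger, scaled by the weight) is also an assumption, not the library's documented behavior.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the nested types stand in for the library's interfaces.
class StrokeStrategySketch {

    /** Stand-in for the context a strategy receives. */
    interface SimilarContext {
        String charOne();
        String charTwo();
        double bihuashuRate();
        Map<String, Integer> bihuashuData();
    }

    /** Stand-in for a similarity strategy. */
    interface SimilarStrategy {
        double similar(SimilarContext context);
    }

    /** Scores two characters by how close their stroke counts are, scaled by the weight. */
    static class StrokeCountSimilar implements SimilarStrategy {
        @Override
        public double similar(SimilarContext context) {
            Integer one = context.bihuashuData().get(context.charOne());
            Integer two = context.bihuashuData().get(context.charTwo());
            if (one == null || two == null) {
                return 0.0; // character missing from the stroke data
            }
            int a = one.intValue();
            int b = two.intValue();
            if (a <= 0 || b <= 0) {
                return 0.0;
            }
            double ratio = (double) Math.min(a, b) / Math.max(a, b);
            return ratio * context.bihuashuRate();
        }
    }

    /** Builds a tiny context with a few stroke counts and runs the strategy. */
    static double demo(final String a, final String b, final double rate) {
        final Map<String, Integer> strokes = new HashMap<>();
        strokes.put("末", 5);
        strokes.put("未", 5);
        strokes.put("大", 3);
        strokes.put("天", 4);
        SimilarContext context = new SimilarContext() {
            @Override public String charOne() { return a; }
            @Override public String charTwo() { return b; }
            @Override public double bihuashuRate() { return rate; }
            @Override public Map<String, Integer> bihuashuData() { return strokes; }
        };
        return new StrokeCountSimilar().similar(context);
    }

    public static void main(String[] args) {
        System.out.println(demo("末", "未", 2)); // same stroke count: full weight
        System.out.println(demo("大", "天", 2)); // 3 vs 4 strokes: 0.75 of the weight
    }
}
```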

`Quick trial`

`Description`

If Java is not your main development language, you can try the tool quickly with the exe file below.

`Download link`

https://github.com/houbb/nlp-hanzi-similar/releases/download/exe/hanzi-similar.zip

After downloading, unzip the archive to get the executable hanzi-similar.exe.

`Running the program`

The interface is implemented with Java Swing, so as far as looks are concerned, I have given up entirely T_T.

It is packaged with exe4j.

Enter one Chinese character as character one and another as character two, then click Calculate to get their similarity.

`Drawbacks of a dictionary approach`

This project was open-sourced because a friend needed it, but he does not know Java.

At first I wanted to ship the project as a dictionary that maps each pair of characters to a similarity.

But there is a problem. Pairing 20,000 Chinese characters with 20,000 Chinese characters produces hundreds of millions of entries (20,000 × 20,000 = 400 million).

The space cost is too high, and it would also hurt lookup time.

So for now the similarity is computed in real time. When I have time, I may port the project to other languages :)
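A back-of-the-envelope check of the dictionary size, assuming 20,000 characters and 8 bytes (one `double`) per stored similarity, with keys and map overhead not counted. The class name is hypothetical.

```java
// Hypothetical estimate, not part of nlp-hanzi-similar.
class DictionarySizeEstimate {

    /** Number of ordered (a, b) pairs for n characters, identical pairs included. */
    static long orderedPairs(long n) {
        return n * n;
    }

    /** Number of unordered pairs of two distinct characters. */
    static long unorderedPairs(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        long n = 20_000L;        // assumed number of characters
        long bytesPerScore = 8L; // one double per pair

        System.out.println(orderedPairs(n));                 // 400000000
        System.out.println(unorderedPairs(n));               // 199990000
        System.out.println(orderedPairs(n) * bytesPerScore); // 3200000000 bytes, about 3.2 GB
    }
}
```

Even if each pair is stored only once, the table still holds roughly 200 million scores, which is why computing on demand is the more practical choice here.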

`Implementation principle`

`Approach`

Unlike text similarity, character similarity works on individual Chinese characters.

So the similarity is computed from the parts a character breaks down into: strokes, pinyin, radical, structure, and so on.
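As one illustration of comparing a decomposed feature, the sketch below measures how many positions two fixed-length codes share, which is one plausible way to compare four-corner codes. This is an assumption made for illustration and is not the library's confirmed strategy; the digit strings are made-up placeholders, not real four-corner codes, and the class and method names are hypothetical.

```java
// Hypothetical sketch, not part of nlp-hanzi-similar.
class CodeOverlapSketch {

    /** Fraction of positions at which two equal-length codes carry the same digit. */
    static double positionalOverlap(String codeOne, String codeTwo) {
        if (codeOne == null || codeTwo == null
                || codeOne.isEmpty() || codeOne.length() != codeTwo.length()) {
            return 0.0; // codes that cannot be aligned are treated as dissimilar
        }
        int same = 0;
        for (int i = 0; i < codeOne.length(); i++) {
            if (codeOne.charAt(i) == codeTwo.charAt(i)) {
                same++;
            }
        }
        return (double) same / codeOne.length();
    }

    public static void main(String[] args) {
        // Placeholder codes: 4 of 5 positions match.
        System.out.println(positionalOverlap("12345", "12340"));
        // Identical codes match at every position.
        System.out.println(positionalOverlap("12345", "12345"));
    }
}
```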

Recommended reading:

Ideas for calculating the similarity of visually similar Chinese characters in NLP

That article explains the principle, but the friend said he could not implement it himself, so this project was created.

`Core code`

The core implementation is shown below. It combines the individual similarity scores according to their weights.

/**
 * Similarity of two characters.
 *
 * @param context the context holding both characters, weights, data and strategies
 * @return the similarity
 * @since 1.0.0
 */
@Override
public double similar(final IHanziSimilarContext context) {
    final String charOne = context.charOne();
    final String charTwo = context.charTwo();

    // 1. Identical characters
    if(charOne.equals(charTwo)) {
        return 1.0;
    }

    // 2. User-defined value takes priority
    Map<String, Double> defineMap = context.userDefineData().dataMap();
    String defineKey = charOne+charTwo;
    if(defineMap.containsKey(defineKey)) {
        return defineMap.get(defineKey);
    }

    // 3. Compute the weighted score of each feature
    // 3.1 Four-corner code
    IHanziSimilar sijiaoSimilar = context.sijiaoSimilar();
    double sijiaoScore = sijiaoSimilar.similar(context);

    // 3.2 Structure
    IHanziSimilar jiegouSimilar = context.jiegouSimilar();
    double jiegouScore = jiegouSimilar.similar(context);

    // 3.3 Radical
    IHanziSimilar bushouSimilar = context.bushouSimilar();
    double bushouScore = bushouSimilar.similar(context);

    // 3.4 Stroke count
    IHanziSimilar bihuashuSimilar = context.bihuashuSimilar();
    double bihuashuScore = bihuashuSimilar.similar(context);

    // 3.5 Pinyin
    IHanziSimilar pinyinSimilar = context.pinyinSimilar();
    double pinyinScore = pinyinSimilar.similar(context);

    // 4. Total score
    double totalScore = sijiaoScore + jiegouScore + bushouScore + bihuashuScore + pinyinScore;
    // 4.1 Guard against floating-point comparison issues
    if(totalScore <= 0) {
        return 0;
    }

    // 4.2 Normalize by the sum of the weights
    double limitScore = context.sijiaoRate() + context.jiegouRate()
            + context.bushouRate() + context.bihuashuRate() + context.pinyinRate();

    return totalScore / limitScore;
}
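The same flow can be condensed into a self-contained sketch. It assumes each feature yields a raw score in [0, 1] that is multiplied by its weight before summing; the excerpt above divides by the summed weights, which suggests the library's strategies return weight-scaled scores, but that is not confirmed here. The class name, the placeholder characters 甲 and 乙, and the feature scores are made up for illustration; the weights are the ones from the custom-weight example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not part of nlp-hanzi-similar.
class WeightedScoreSketch {

    /** Same lookup order as the excerpt: identical characters, user-defined value, then weighted average. */
    static double similar(String one, String two, Map<String, Double> userDefined,
                          double[] rawScores, double[] weights) {
        if (one.equals(two)) {
            return 1.0;
        }
        Double defined = userDefined.get(one + two);
        if (defined != null) {
            return defined;
        }
        double total = 0.0;
        double limit = 0.0;
        for (int i = 0; i < rawScores.length; i++) {
            total += rawScores[i] * weights[i];
            limit += weights[i];
        }
        if (total <= 0 || limit <= 0) {
            return 0.0; // nothing matched, or all weights are zero
        }
        return total / limit;
    }

    public static void main(String[] args) {
        Map<String, Double> userDefined = new HashMap<>();
        userDefined.put("人入", 0.9);

        // Order: four-corner code, structure, radical, stroke count, pinyin.
        double[] weights = {8, 10, 6, 2, 1};
        // Made-up feature scores, not computed from real data.
        double[] rawScores = {1.0, 1.0, 1.0, 1.0, 0.0};

        System.out.println(similar("人", "人", userDefined, rawScores, weights)); // identical characters
        System.out.println(similar("人", "入", userDefined, rawScores, weights)); // user-defined value
        System.out.println(similar("甲", "乙", userDefined, rawScores, weights)); // 26 / 27 of the total weight
    }
}
```

With these weights, a pair that differs only in pinyin loses 1 of 27 weight units, which shows how strongly the structure and four-corner weights dominate the result.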

For further details, see the source code.

`Source repository`

To make it easier for everyone to learn from and use, this project has been open-sourced.

Repository:

https://github.com/houbb/nlp-hanzi-similar

Forks and stars are welcome and much appreciated~

`Advantages and disadvantages of the algorithm`

`Advantages`

The few existing papers rely mainly on the structure of Chinese characters.

This algorithm combines four-corner code, structure, radical, stroke count, and pinyin, which brings it closer to the intuition of Chinese readers.

`Shortcomings`

The radical component is in fact a weak point, owing to data limitations at the time.

The plan is to introduce a character-decomposition dictionary so that all components of a character are compared, rather than just a single radical as at present.

`Roadmap`

[ ] Richer similarity strategies
[ ] Optimize the default weights
[ ] Improve the exe interface

老马啸西风
