Project Description

nlp-hanzi-similar provides similarity calculations for Chinese characters.


Motivation

A friend who is doing research in language cognitive science read my article on approaches to computing the similarity of visually similar Chinese characters in NLP,

and asked whether there was any source code or related material.

For Chinese text similarity, open-source tools are fairly abundant.

However, there is almost nothing available for computing the similarity between two individual Chinese characters. Reference material is scarce in China, and just as scarce abroad.

So I tidied up a similarity algorithm I had written earlier and open-sourced it, hoping it would help this friend.

The purpose of this project is to provide a basic similarity calculation tool and to make a small contribution to Chinese character NLP.

Features

  • Fluent API: one line of code does everything
  • Highly customizable: users can supply their own implementations
  • Customizable dictionaries to suit different application scenarios
  • A rich set of implementation strategies

By default, similarity is computed from the four-corner code + pinyin + character structure + radical + stroke count.

Change log

See the change log in the project repository.

Quick start

Requirements

JDK 1.7+

Maven 3.x+

Maven dependency

<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>nlp-hanzi-similar</artifactId>
    <version>1.0.0</version>
</dependency>

Quick start

Basic usage

HanziSimilarHelper.similar returns the similarity of two Chinese characters.

double rate1 = HanziSimilarHelper.similar('末', '未');

The result is:

0.9629629629629629

Custom weight

By default, similarity is computed from the four-corner code + pinyin + character structure + radical + stroke count.

If the default weights do not meet your needs, you can adjust them with custom weights:

double rate = HanziSimilarBs.newInstance()
                .jiegouRate(10)
                .sijiaoRate(8)
                .bushouRate(6)
                .bihuashuRate(2)
                .pinyinRate(1)
                .similar('末', '未');

Custom similarity

In some cases the similarity computed by the system will not meet your needs.

You can then define your own values in the file hanzi_similar_define.txt in the root directory.

入人 0.9
人入 0.9

With this in place, when the similarity of 人 and 入 is computed, the user-defined value takes precedence.

double rate = HanziSimilarHelper.similar('人', '入');

The result at this time is a user-defined value.
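The lookup described above can be sketched as follows. This is only an illustration under stated assumptions: UserDefineSketch, parse and lookup are hypothetical names, and the parsing rules (one pair and one value per line, separated by whitespace) are inferred from the two sample lines, not taken from the library's loader.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of nlp-hanzi-similar.
final class UserDefineSketch {

    // Parses lines such as "入人 0.9" into a map from character pair to similarity.
    static Map<String, Double> parse(List<String> lines) {
        Map<String, Double> map = new HashMap<String, Double>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                continue; // skip blank or malformed lines
            }
            map.put(parts[0], Double.parseDouble(parts[1]));
        }
        return map;
    }

    // Returns the user-defined similarity, or null when the pair is not listed.
    // The key is order-sensitive: first character followed by second character.
    static Double lookup(Map<String, Double> map, char one, char two) {
        return map.get(String.valueOf(one) + two);
    }

    public static void main(String[] args) {
        Map<String, Double> map = parse(Arrays.asList("入人 0.9", "人入 0.9"));
        System.out.println(lookup(map, '人', '入'));
        System.out.println(lookup(map, '末', '未'));
    }
}
```

Because the key in this sketch is the two characters concatenated in order, a pair has to be listed in both orders (入人 and 人入) for the lookup to be symmetric, which is consistent with the sample file above.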

Bootstrap class

Description

To make customization easier, HanziSimilarBs lets users supply their own configuration.

The configuration items that can be customized in HanziSimilarBs are listed below:

No.   Attribute         Description
1     bihuashuRate      Stroke count weight
2     bihuashuData      Stroke count data
3     bihuashuSimilar   Stroke count similarity strategy
4     jiegouRate        Structure weight
5     jiegouData        Structure data
6     jiegouSimilar     Structure similarity strategy
7     bushouRate        Radical weight
8     bushouData        Radical data
9     bushouSimilar     Radical similarity strategy
10    sijiaoRate        Four-corner code weight
12    sijiaoData        Four-corner code data
13    sijiaoSimilar     Four-corner code similarity strategy
14    pinyinRate        Pinyin weight
15    pinyinData        Pinyin data
16    pinyinSimilar     Pinyin similarity strategy
17    hanziSimilar      Core Chinese character similarity strategy
18    userDefineData    User-defined data

Every configuration item is defined by an interface, so each one can be replaced with a custom implementation.

Quick trial

Description

If Java is not your main development language, you can try the tool quickly with the exe file below.

Download link

https://github.com/houbb/nlp-hanzi-similar/releases/download/exe/hanzi-similar.zip

After downloading, unzip the archive to get the executable hanzi-similar.exe.

Running it

The interface is implemented with Java Swing, so as far as looks go, I have completely given up T_T.

It is packaged with exe4j.

Enter one Chinese character in the first field and another in the second, then click Calculate to get their similarity.


Drawbacks of a dictionary approach

This project was open-sourced because a friend needed it, but he does not know Java.

At first I wanted to ship the project as a dictionary, with each pair of characters mapped to a similarity value.

But there is a problem: with 20,000 Chinese characters, a dictionary covering every pair needs 20,000 × 20,000 = 400 million entries.

The space cost is too high, and it would cause time cost problems as well.

So for now the similarity is computed in real time. When I have time, I may port the project to other languages :)
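A back-of-the-envelope check of the dictionary size, assuming 20,000 characters. DictionarySizeSketch and the figure of 8 bytes per entry (one double value, ignoring keys and container overhead) are illustrative assumptions, not measurements from the project.

```java
// Hypothetical arithmetic sketch; not part of nlp-hanzi-similar.
final class DictionarySizeSketch {

    // Every (first, second) combination, including a character paired with itself.
    static long orderedPairs(long n) {
        return n * n;
    }

    // Distinct pairs of different characters, if similarity is treated as symmetric.
    static long unorderedPairs(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        long n = 20_000L;
        long bytesPerEntry = 8L; // assumption: one double per pair, no keys or overhead
        System.out.println("ordered pairs:   " + orderedPairs(n));
        System.out.println("unordered pairs: " + unorderedPairs(n));
        System.out.println("bytes (ordered): " + orderedPairs(n) * bytesPerEntry);
    }
}
```

Even the symmetric case leaves roughly 200 million entries, so precomputing every pair is impractical compared with computing the similarity on demand.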

Implementation principle

Implementation idea

Unlike text similarity, the unit of Chinese character similarity is the single character.

So similarity is computed by breaking a character down into its features, such as stroke count, pinyin, radical and structure.
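As an illustration of a single feature, here is a minimal sketch of a stroke-count similarity. The formula (one minus the relative difference in stroke counts) is an assumption chosen for the example and is not necessarily what the library uses; StrokeCountSimilaritySketch is a hypothetical name, and stroke counts are passed in directly instead of being looked up from data.

```java
// Hypothetical per-feature similarity; not the library's implementation.
final class StrokeCountSimilaritySketch {

    // Returns a value in [0, 1]: 1.0 for equal stroke counts, lower as they diverge.
    static double similar(int strokesOne, int strokesTwo) {
        if (strokesOne <= 0 || strokesTwo <= 0) {
            throw new IllegalArgumentException("stroke counts must be positive");
        }
        int max = Math.max(strokesOne, strokesTwo);
        return 1.0 - (double) Math.abs(strokesOne - strokesTwo) / max;
    }

    public static void main(String[] args) {
        // 末 and 未 both have 5 strokes; 大 has 3 and 天 has 4.
        System.out.println(similar(5, 5));
        System.out.println(similar(3, 4));
    }
}
```

Each of the other features (pinyin, radical, structure, four-corner code) would get its own comparison of this kind, and the results are then combined by weight.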

Recommended reading:

Approaches to computing the similarity of visually similar Chinese characters in NLP

That article describes the principle, but readers reported that they could not implement it themselves, so this project was created.

Core code

The core implementation is shown below. It combines the individual similarity scores and normalizes them by the sum of the weights.

// Method excerpt; the enclosing class needs: import java.util.Map;
/**
 * Similarity
 *
 * @param context the context
 * @return the result
 * @since 1.0.0
 */
@Override
public double similar(final IHanziSimilarContext context) {
    final String charOne = context.charOne();
    final String charTwo = context.charTwo();

    //1. Are the two characters identical?
    if(charOne.equals(charTwo)) {
        return 1.0;
    }

    //2. Is there a user-defined value for this pair?
    Map<String, Double> defineMap = context.userDefineData().dataMap();
    String defineKey = charOne+charTwo;
    if(defineMap.containsKey(defineKey)) {
        return defineMap.get(defineKey);
    }

    //3. Compute the score from the weighted features
    //3.1 Four-corner code
    IHanziSimilar sijiaoSimilar = context.sijiaoSimilar();
    double sijiaoScore = sijiaoSimilar.similar(context);

    //3.2 Structure
    IHanziSimilar jiegouSimilar = context.jiegouSimilar();
    double jiegouScore = jiegouSimilar.similar(context);

    //3.3 Radical
    IHanziSimilar bushouSimilar = context.bushouSimilar();
    double bushouScore = bushouSimilar.similar(context);

    //3.4 Stroke count
    IHanziSimilar bihuashuSimilar = context.bihuashuSimilar();
    double bihuashuScore = bihuashuSimilar.similar(context);

    //3.5 Pinyin
    IHanziSimilar pinyinSimilar = context.pinyinSimilar();
    double pinyinScore = pinyinSimilar.similar(context);

    //4. Compute the total score
    double totalScore = sijiaoScore + jiegouScore + bushouScore + bihuashuScore + pinyinScore;
    //4.1 Avoid floating-point comparison issues
    if(totalScore <= 0) {
        return 0;
    }

    //4.2 Normalize by the sum of the weights
    double limitScore = context.sijiaoRate() + context.jiegouRate()
            + context.bushouRate() + context.bihuashuRate() + context.pinyinRate();

    return totalScore / limitScore;
}
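The normalization step can be illustrated with a small worked example. It is a sketch under an assumption inferred from the code above: since the scores are summed directly and divided by the sum of the weights, each strategy is taken to return a score already scaled by its weight (raw similarity in [0, 1] times weight). WeightedScoreSketch and the raw similarity values are hypothetical; the weights are the ones from the custom weight example.

```java
// Hypothetical worked example of the weighted normalization; not library code.
final class WeightedScoreSketch {

    // rawScores[i] is a similarity in [0, 1] for feature i; weights[i] is its weight.
    static double combine(double[] rawScores, double[] weights) {
        if (rawScores.length != weights.length) {
            throw new IllegalArgumentException("scores and weights must have the same length");
        }
        double totalScore = 0;
        double limitScore = 0;
        for (int i = 0; i < weights.length; i++) {
            totalScore += rawScores[i] * weights[i]; // weighted score of one feature
            limitScore += weights[i];                // maximum achievable total
        }
        if (totalScore <= 0) {
            return 0; // also covers the case where every weight is zero
        }
        return totalScore / limitScore;
    }

    public static void main(String[] args) {
        // Order: four-corner code, structure, radical, stroke count, pinyin.
        double[] weights = {8, 10, 6, 2, 1};
        // Assumed raw similarities: identical on every feature except pinyin.
        double[] rawScores = {1, 1, 1, 1, 0};
        System.out.println(combine(rawScores, weights)); // 26 / 27
    }
}
```

With these inputs the total is 26 out of a possible 27, which shows how a low-weight feature such as pinyin only slightly lowers the result.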

If you are interested in the details, you can read the source code.

Repository

To make it easier for everyone to learn from and use, this project has been open-sourced.

Repository address:

https://github.com/houbb/nlp-hanzi-similar

Forks and stars are welcome and encourage me~

Advantages and disadvantages of the algorithm

Advantages

The handful of existing papers compare characters based on their structure.

This algorithm adds the four-corner code + structure + radical + stroke count + pinyin, which brings the result closer to the intuition of Chinese readers.

Disadvantages

The radical component is a weak point, because of limitations in the data available at the time.

The plan is to introduce a character decomposition dictionary later, so that all components of a character are compared instead of just a single radical as at present.

Roadmap

  • [ ] Richer similarity strategies
  • [ ] Optimize the default weights
  • [ ] Improve the exe interface



老马啸西风