Project Description
nlp-hanzi-similar provides similarity calculations for Chinese characters.
Motivation
A friend who is researching language cognitive science asked, after reading the article on NLP Chinese character similarity calculation ideas, whether there was any source code or related material.
For text similarity, open source tools are relatively abundant in China.
However, tools for calculating the similarity between two Chinese characters are almost nonexistent. Reference material is scarce both in China and abroad.
So I cleaned up the similarity algorithm I had written earlier and open sourced it, hoping it helps this friend.
The goal of this project is to provide a basic similarity calculation tool and make a small contribution to Chinese character NLP.
Features
- Fluent API: one line of code does everything
- Highly customizable: users can plug in their own implementations
- Custom dictionaries to suit different application scenarios
- A rich set of implementation strategies
By default, similarity is computed from the four-corner code, pinyin, character structure, radical, and stroke count.
Change log
See the change log.
Quick start
Requirements
JDK 1.7+
Maven 3.x+
Maven dependency
<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>nlp-hanzi-similar</artifactId>
    <version>1.0.0</version>
</dependency>
Quick start
Basic usage
HanziSimilarHelper.similar
Get the similarity of two Chinese characters.
double rate1 = HanziSimilarHelper.similar('末', '未');
The result is:
0.9629629629629629
Custom weights
By default, similarity is computed from the four-corner code, pinyin, character structure, radical, and stroke count.
If the default weights do not meet your needs, you can set your own:
double rate = HanziSimilarBs.newInstance()
        .jiegouRate(10)
        .sijiaoRate(8)
        .bushouRate(6)
        .bihuashuRate(2)
        .pinyinRate(1)
        .similar('末', '未');
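To make the effect of the weights concrete, here is a minimal standalone sketch. It assumes the final score is a weighted average of per-feature similarities, which is how the core code later in this document appears to combine them. The class `WeightedScoreSketch`, the method `combine`, and the per-feature values are made up for illustration and are not part of the library.

```java
/** Hypothetical sketch: how per-feature weights change a normalized score. */
class WeightedScoreSketch {

    /** Returns sum(weights[i] * sims[i]) / sum(weights[i]), or 0 if the weights sum to 0. */
    static double combine(double[] sims, double[] weights) {
        if (sims.length != weights.length) {
            throw new IllegalArgumentException("sims and weights must have the same length");
        }
        double weighted = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < sims.length; i++) {
            weighted += weights[i] * sims[i];
            totalWeight += weights[i];
        }
        return totalWeight <= 0 ? 0.0 : weighted / totalWeight;
    }

    public static void main(String[] args) {
        // Made-up raw similarities in [0, 1], in the order:
        // structure, four-corner code, radical, stroke count, pinyin.
        double[] sims = {1.0, 0.75, 1.0, 1.0, 0.0};

        double[] equalWeights = {1, 1, 1, 1, 1};
        // Same values as the custom-weight snippet above.
        double[] customWeights = {10, 8, 6, 2, 1};

        System.out.println("equal weights:  " + combine(sims, equalWeights));
        System.out.println("custom weights: " + combine(sims, customWeights));
    }
}
```

With equal weights, the pinyin mismatch accounts for one fifth of the total. With the custom weights it accounts for only 1 part in 27, so two characters that look alike but sound different score higher.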
Custom similarity
In some cases the similarity computed by the system may not meet your needs.
You can define your own values in the file hanzi_similar_define.txt in the root directory.
入人 0.9
人入 0.9
With these entries, when the similarity of 人 and 入 is calculated, the user-defined value takes priority.
double rate = HanziSimilarHelper.similar('人', '入');
The result is then the user-defined value.
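The following sketch shows one way such a file could be read and queried. It is not the library's loader. The class `UserDefineSketch` and its methods are hypothetical, and the format is assumed to be one pair per line, with the two characters followed by whitespace and a number, as in the sample above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of loading user-defined similarities; not the library's loader. */
class UserDefineSketch {

    /** Parses lines such as "入人 0.9" into a map keyed by the two characters joined. */
    static Map<String, Double> parse(Iterable<String> lines) {
        Map<String, Double> map = new HashMap<String, Double>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] parts = trimmed.split("\\s+");
            if (parts.length != 2) {
                continue; // skip lines that are not "<pair> <value>"
            }
            map.put(parts[0], Double.parseDouble(parts[1]));
        }
        return map;
    }

    /** Returns the user-defined value for (one, two) in that order, or null if none exists. */
    static Double lookup(Map<String, Double> map, String one, String two) {
        return map.get(one + two);
    }

    public static void main(String[] args) {
        Map<String, Double> defined = parse(Arrays.asList("入人 0.9", "人入 0.9"));
        System.out.println(lookup(defined, "人", "入")); // 0.9
        System.out.println(lookup(defined, "末", "未")); // null: no user-defined entry
    }
}
```

Because the key is the two characters joined in order, a lookup for 人/入 only matches an entry written as 人入. The core code later in this document builds its key the same way (charOne + charTwo), which is presumably why the sample file lists both orders.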
Bootstrap class
Description
To make customization easier, HanziSimilarBs lets users customize its configuration.
The configuration items that can be customized in HanziSimilarBs are listed below:
No. | Property | Description |
---|---|---|
1 | bihuashuRate | Stroke count weight |
2 | bihuashuData | Stroke count data |
3 | bihuashuSimilar | Stroke count similarity strategy |
4 | jiegouRate | Structure weight |
5 | jiegouData | Structure data |
6 | jiegouSimilar | Structure similarity strategy |
7 | bushouRate | Radical weight |
8 | bushouData | Radical data |
9 | bushouSimilar | Radical similarity strategy |
10 | sijiaoRate | Four-corner code weight |
11 | sijiaoData | Four-corner code data |
12 | sijiaoSimilar | Four-corner code similarity strategy |
13 | pinyinRate | Pinyin weight |
14 | pinyinData | Pinyin data |
15 | pinyinSimilar | Pinyin similarity strategy |
16 | hanziSimilar | Core Chinese character similarity strategy |
17 | userDefineData | User-defined data |
Every configuration item can be customized by implementing the corresponding interface.
Quick trial
Description
If Java is not your main development language, you can try the tool quickly with the exe file below.
Download link
https://github.com/houbb/nlp-hanzi-similar/releases/download/exe/hanzi-similar.zip
After downloading, unzip the archive to get the executable hanzi-similar.exe.
Running it
The interface is implemented with Java Swing, so as far as looks go, I have completely given up T_T.
It is packaged with exe4j.
Enter one Chinese character in the first field and another in the second, then click Calculate to get their similarity.
Disadvantages of the dictionary approach
This project was open sourced because a friend needed it, but he does not know Java.
At first I wanted to design the project as a dictionary that maps each pair of characters to a similarity.
But there is a problem: with 20,000 Chinese characters compared against 20,000 Chinese characters, the dictionary would hold hundreds of millions of entries (20,000 × 20,000 = 400 million ordered pairs).
The space cost is too high, and it also causes time complexity problems.
So for now the similarity is computed in real time, and when I have time I will port it to other languages :)
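As a rough check on the size of such a dictionary, the sketch below counts the pairs for 20,000 characters. The class name and the assumed 8 bytes per stored value (one double, ignoring keys and map overhead) are illustrative assumptions, so the real footprint would be larger.

```java
/** Back-of-the-envelope size estimate for a precomputed pair dictionary. */
class PairDictionaryEstimate {

    /** Number of ordered pairs (a, b) over n characters, including a == b. */
    static long orderedPairs(long n) {
        return n * n;
    }

    /** Number of unordered pairs with a != b, enough if similarity is symmetric. */
    static long unorderedPairs(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        long n = 20_000L;
        long bytesPerValue = 8L; // assumption: one double per pair, no key or map overhead

        System.out.println(orderedPairs(n));                 // 400000000
        System.out.println(unorderedPairs(n));               // 199990000
        System.out.println(orderedPairs(n) * bytesPerValue); // 3200000000 bytes, about 3.2 GB
    }
}
```

Even when symmetry is used to halve the table, roughly 200 million values remain, which is why computing the similarity on demand is the more practical choice.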
Implementation principle
Implementation ideas
Unlike text similarity, the unit of Chinese character similarity is the single character.
So the similarity is computed by decomposing characters into features such as strokes, pinyin, radicals, and structure.
Recommended reading:
Ideas for calculating the similarity of visually similar Chinese characters in NLP
That article describes the principle, but my friend said he could not implement it from the description, so this project was created.
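The decomposition described above can be turned into per-feature scores in many ways. The sketch below shows three simple measures, each returning a value between 0 and 1. The class `FeatureSimilaritySketch`, its method names, and the formulas are illustrative assumptions, not the library's own strategies, and the inputs in `main` are made-up values rather than real character data.

```java
/** Hypothetical per-feature similarity measures; the library's own strategies may differ. */
class FeatureSimilaritySketch {

    /** Stroke-count similarity: 1 - |a - b| / max(a, b), for positive counts. */
    static double strokeSimilarity(int strokesOne, int strokesTwo) {
        if (strokesOne <= 0 || strokesTwo <= 0) {
            throw new IllegalArgumentException("stroke counts must be positive");
        }
        int max = Math.max(strokesOne, strokesTwo);
        return 1.0 - (double) Math.abs(strokesOne - strokesTwo) / max;
    }

    /** Code similarity: fraction of positions where two equal-length codes agree. */
    static double codeSimilarity(String codeOne, String codeTwo) {
        if (codeOne.isEmpty() || codeOne.length() != codeTwo.length()) {
            throw new IllegalArgumentException("codes must be non-empty and of equal length");
        }
        int same = 0;
        for (int i = 0; i < codeOne.length(); i++) {
            if (codeOne.charAt(i) == codeTwo.charAt(i)) {
                same++;
            }
        }
        return (double) same / codeOne.length();
    }

    /** Exact-match similarity for categorical features such as radical or structure. */
    static double exactSimilarity(String one, String two) {
        return one.equals(two) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        System.out.println(strokeSimilarity(5, 5));           // 1.0
        System.out.println(strokeSimilarity(4, 8));           // 0.5
        System.out.println(codeSimilarity("50900", "50904")); // 0.8
        System.out.println(exactSimilarity("left-right", "top-bottom")); // 0.0
    }
}
```

Scores like these would then be multiplied by their weights and normalized into a single value.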
Core code
The core implementation, shown below, combines the individual similarities by weight.
/**
 * Similarity.
 *
 * @param context the context
 * @return the result
 * @since 1.0.0
 */
@Override
public double similar(final IHanziSimilarContext context) {
    final String charOne = context.charOne();
    final String charTwo = context.charTwo();
    // 1. Identical characters
    if (charOne.equals(charTwo)) {
        return 1.0;
    }
    // 2. User-defined value
    Map<String, Double> defineMap = context.userDefineData().dataMap();
    String defineKey = charOne + charTwo;
    if (defineMap.containsKey(defineKey)) {
        return defineMap.get(defineKey);
    }
    // 3. Compute each weighted score
    // 3.1 Four-corner code
    IHanziSimilar sijiaoSimilar = context.sijiaoSimilar();
    double sijiaoScore = sijiaoSimilar.similar(context);
    // 3.2 Structure
    IHanziSimilar jiegouSimilar = context.jiegouSimilar();
    double jiegouScore = jiegouSimilar.similar(context);
    // 3.3 Radical
    IHanziSimilar bushouSimilar = context.bushouSimilar();
    double bushouScore = bushouSimilar.similar(context);
    // 3.4 Stroke count
    IHanziSimilar bihuashuSimilar = context.bihuashuSimilar();
    double bihuashuScore = bihuashuSimilar.similar(context);
    // 3.5 Pinyin
    IHanziSimilar pinyinSimilar = context.pinyinSimilar();
    double pinyinScore = pinyinSimilar.similar(context);
    // 4. Total score
    double totalScore = sijiaoScore + jiegouScore + bushouScore + bihuashuScore + pinyinScore;
    // 4.1 Avoid floating-point comparison issues
    if (totalScore <= 0) {
        return 0;
    }
    // 4.2 Normalize by the sum of the weights
    double limitScore = context.sijiaoRate() + context.jiegouRate()
            + context.bushouRate() + context.bihuashuRate() + context.pinyinRate();
    return totalScore / limitScore;
}
If you are interested in the details, you can read the source code.
Open source address
To make it easier for everyone to learn from and use, this project is open source.
Open source address:
https://github.com/houbb/nlp-hanzi-similar
Forks and stars are welcome and encourage me~
Advantages and disadvantages of the algorithm
Advantages
The few papers on this topic are based on the structure of Chinese characters.
This algorithm combines the four-corner code, structure, radical, stroke count, and pinyin, which better matches the intuition of users in China.
Disadvantages
Because of data problems at the time, the radical part is actually lacking.
The plan is to introduce a character decomposition dictionary later, so that all component parts of two characters are compared rather than just a single radical as at present.
Roadmap
- [ ] Richer similarity strategies
- [ ] Optimize the default weights
- [ ] Optimize the exe interface