A platform without sensitive words
Recently, the sordid private life of a rapper of Canadian nationality was exposed, and he is very likely involved in luring X underage girls, which would turn the rapper into a "raper".
Of course, whether it is true, and whether someone is a "Sea King" (slang for a serial womanizer), is clearly recorded in the WeChat and QQ chat logs. In a criminal case, Tencent (TX) can review all of the historical records. Tencent Video's termination of its contract with a certain "electric eel" is not necessarily unfounded; after all, its interests are at stake.
But during the whole process, I found two very noteworthy points:
(1) His rumored girlfriend, Xiaoyi, was abused online until she cleared all of her social media accounts. For a platform as big as X blog, is crashing its servers the only thing it can do? Has it never heard of sensitive-word filtering?
(2) The whistleblower, Du Meizhu, received a large number of gory photos. Don't all the platforms that brag about artificial intelligence every day have any filtering?
Of course, I don't know anything about artificial intelligence.
But for sensitive words, I recently wrote a small tool. It is open source, so if the major platforms need it, they are welcome to pick it up.
https://github.com/houbb/sensitive-word
At least it can mask "beautiful Chinese" like the following:
你 XXX 的,我 XXX 你 XXX,你 XXXX,XXX!!XXX!
Creation purpose
The tool is implemented on top of the DFA algorithm. The sensitive-word dictionary currently contains more than 60,000 entries (the source file had more than 180,000 before one round of pruning).
Going forward, I will keep optimizing and extending the dictionary and further improve the performance of the algorithm.
I would also like to classify sensitive words more finely, but the workload is fairly large, so that work has not started yet.
As for the vision: it is to be the number one practical tool for sensitive words.
Of course, being number one is always lonely.
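The DFA approach mentioned at the start of this section can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: it assumes Java 8+, and the class name DfaSketch and its methods are hypothetical. The dictionary is compiled into a character trie (each node is a DFA state), and the text is scanned once, taking the longest match at each position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of trie-based (DFA-style) matching; not the library's code. */
final class DfaSketch {
    /** One DFA state: outgoing transitions plus a flag marking the end of a word. */
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean end;
    }

    private final Node root = new Node();

    DfaSketch(List<String> words) {
        for (String word : words) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.end = true;
        }
    }

    /** Length of the longest dictionary word starting at index start, or 0 if none. */
    private int matchLength(String text, int start) {
        Node node = root;
        int longest = 0;
        for (int i = start; i < text.length(); i++) {
            node = node.next.get(text.charAt(i));
            if (node == null) {
                break;
            }
            if (node.end) {
                longest = i - start + 1;
            }
        }
        return longest;
    }

    /** Returns every matched word, scanning left to right without overlaps. */
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int len = matchLength(text, i);
            if (len > 0) {
                hits.add(text.substring(i, i + len));
                i += len;
            } else {
                i++;
            }
        }
        return hits;
    }

    /** Replaces every character of every matched word with the mask character. */
    String replace(String text, char mask) {
        StringBuilder out = new StringBuilder(text);
        int i = 0;
        while (i < text.length()) {
            int len = matchLength(text, i);
            for (int j = i; j < i + len; j++) {
                out.setCharAt(j, mask);
            }
            i += Math.max(len, 1);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        DfaSketch dfa = new DfaSketch(Arrays.asList("bad", "badword"));
        System.out.println(dfa.findAll("one badword here"));
        System.out.println(dfa.replace("one badword here", '*'));
    }
}
```

Lookup cost depends on the length of the text rather than on the number of dictionary words, which is presumably why this structure suits a dictionary of this size.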
Features
- Dictionary of 60,000+ words, continuously optimized and updated
- Based on the DFA algorithm, for better performance
- Fluent API, for elegant and concise usage
- Supports common operations on sensitive words: checking, returning, and masking
- Supports full-width/half-width conversion
- Supports case-insensitive English matching
- Supports conversion between common written forms of numbers
- Supports Traditional/Simplified Chinese conversion
- Supports conversion between common written forms of English letters
- Supports user-defined sensitive words and whitelists
- Supports dynamic data updates that take effect in real time
Quick start
Prerequisites
- JDK1.7+
- Maven 3.x+
Maven dependency
<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>sensitive-word</artifactId>
    <version>0.0.15</version>
</dependency>
API overview
SensitiveWordHelper is the utility class for sensitive words. Its core methods are as follows:
Method | Parameters | Return value | Description |
---|---|---|---|
contains(String) | String to check | boolean | Checks whether the string contains sensitive words |
findFirst(String) | String to check | String | Returns the first sensitive word in the string |
findAll(String) | String to check | List of strings | Returns all sensitive words in the string |
replace(String, char) | String to check; replacement char | String | Returns the string with sensitive words replaced by the specified char |
replace(String) | String to check | String | Returns the string with sensitive words replaced by * |
Usage examples
For all test cases, see SensitiveWordHelperTest.
Determine whether it contains sensitive words
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";
Assert.assertTrue(SensitiveWordHelper.contains(text));
Return the first sensitive word
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";
String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("五星红旗", word);
Return all sensitive words
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";
List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[五星红旗, 毛主席, 天安门]", wordList.toString());
Default replacement strategy
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";
String result = SensitiveWordHelper.replace(text);
Assert.assertEquals("****迎风飘扬,***的画像屹立在***前。", result);
Specify the replacement character
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";
String result = SensitiveWordHelper.replace(text, '0');
Assert.assertEquals("0000迎风飘扬,000的画像屹立在000前。", result);
More features
The features below mainly deal with the various tricks people use to evade filtering, to raise the hit rate for sensitive words as far as possible.
This is a long battle of attack and defense.
Ignore case
final String text = "fuCK the bad words.";
String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("fuCK", word);
Ignore full-width and half-width
final String text = "ｆｕｃｋ the bad words.";
String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("ｆｕｃｋ", word);
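Case and width insensitivity of the kind shown in the two examples above could be achieved by normalizing each character before it is looked up in the dictionary. The sketch below is an assumption about how such a step might look, not the library's implementation; WidthCaseSketch and its methods are hypothetical names. It relies on the fact that the full-width ASCII variants U+FF01 to U+FF5E sit exactly 0xFEE0 above their half-width counterparts.

```java
/** Hypothetical sketch of per-character width and case normalization. */
final class WidthCaseSketch {
    /** Maps full-width ASCII variants and the ideographic space to half-width. */
    static char toHalfWidth(char c) {
        if (c == '\u3000') {
            return ' ';
        }
        if (c >= '\uFF01' && c <= '\uFF5E') {
            return (char) (c - 0xFEE0);
        }
        return c;
    }

    /** Normalizes width first, then case; the result has the same length as the input. */
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            sb.append(Character.toLowerCase(toHalfWidth(text.charAt(i))));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("\uFF21b\uFF23 \uFF11\uFF12"));
    }
}
```

Because the mapping is one character to one character, match positions found in the normalized text are also valid in the original text, so the original spelling can be returned to the caller, as the assertions above expect.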
Ignore number styles
This converts between the common written forms of numbers.
final String text = "这个是我的微信:9⓿二肆⁹₈③⑸⒋➃㈤㊄";
List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[9⓿二肆⁹₈③⑸⒋➃㈤㊄]", wordList.toString());
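One way to picture the number-style conversion is a lookup table that maps each stylized digit to its ASCII digit. The sketch below is illustrative only: NumStyleSketch is a hypothetical name, and the table covers just a handful of styles, far fewer than the example above exercises.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: map stylized digits to ASCII digits via a lookup table. */
final class NumStyleSketch {
    // Each row lists the digits 0-9 in one writing style; illustrative, not exhaustive.
    private static final String[] STYLES = {
        "〇一二三四五六七八九",
        "零壹贰叁肆伍陆柒捌玖",
        "⓪①②③④⑤⑥⑦⑧⑨",
        "⁰¹²³⁴⁵⁶⁷⁸⁹",
        "₀₁₂₃₄₅₆₇₈₉",
        "０１２３４５６７８９"
    };

    private static final Map<Character, Character> TO_ASCII = new HashMap<>();

    static {
        for (String style : STYLES) {
            for (int d = 0; d < 10; d++) {
                TO_ASCII.put(style.charAt(d), (char) ('0' + d));
            }
        }
    }

    /** Replaces every known stylized digit with its ASCII form; other characters pass through. */
    static String normalizeDigits(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            Character mapped = TO_ASCII.get(c);
            sb.append(mapped != null ? mapped.charValue() : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalizeDigits("9二肆⁹₈③"));
    }
}
```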
Ignore Traditional and Simplified Chinese
final String text = "我爱我的祖国和五星紅旗。";
List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[五星紅旗]", wordList.toString());
Ignore English letter styles
final String text = "Ⓕⓤc⒦ the bad words";
List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[Ⓕⓤc⒦]", wordList.toString());
Ignore repeated characters
final String text = "ⒻⒻⒻfⓤuⓤ⒰cⓒ⒦ the bad words";
List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[ⒻⒻⒻfⓤuⓤ⒰cⓒ⒦]", wordList.toString());
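A naive way to approximate repeat handling is to collapse each run of identical characters into one before matching. This is only a sketch under that assumption (RepeatSketch is a hypothetical name); the library may well handle repeats during matching instead.

```java
/** Hypothetical sketch: collapse runs of identical characters. */
final class RepeatSketch {
    static String collapseRepeats(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Append only when the character differs from the last one kept.
            if (sb.length() == 0 || sb.charAt(sb.length() - 1) != c) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapseRepeats("baaad wooord"));
    }
}
```

Blind collapsing also rewrites legitimate double letters ("good" becomes "god"), so a real implementation would more likely skip repeated characters only while walking the dictionary. In the example above the repeats also come in different letter styles, so style normalization would have to run first.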
Email detection
final String text = "楼主好人,邮箱 sensitiveword@xx.com";
List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[sensitiveword@xx.com]", wordList.toString());
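Email detection is different in kind from dictionary lookup, since addresses cannot be enumerated in advance. A pattern-based check is one plausible approach; the regular expression below is a deliberately simple assumption, not the rule the library uses, and EmailCheckSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: find email-like substrings with a simple pattern. */
final class EmailCheckSketch {
    // local-part @ domain labels separated by dots; intentionally loose.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+");

    static List<String> findEmails(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findEmails("mail me: sensitiveword@xx.com"));
    }
}
```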
Feature configuration
Description
All of the features above are enabled by default, but sometimes a business needs to turn individual features on or off.
So v0.0.14 opened up these settings for configuration.
Configuration method
To keep usage elegant, configuration is defined uniformly through a fluent API.
Users can configure SensitiveWordBs as follows:
SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .ignoreCase(true)
        .ignoreWidth(true)
        .ignoreNumStyle(true)
        .ignoreChineseStyle(true)
        .ignoreEnglishStyle(true)
        .ignoreRepeat(true)
        .enableNumCheck(true)
        .enableEmailCheck(true)
        .enableUrlCheck(true)
        .init();
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";
Assert.assertTrue(wordBs.contains(text));
Configuration options
Each option is described below:
No. | Method | Description |
---|---|---|
1 | ignoreCase | Ignore case |
2 | ignoreWidth | Ignore full-width/half-width differences |
3 | ignoreNumStyle | Ignore number styles |
4 | ignoreChineseStyle | Ignore Chinese writing styles (Traditional/Simplified) |
5 | ignoreEnglishStyle | Ignore English letter styles |
6 | ignoreRepeat | Ignore repeated characters |
7 | enableNumCheck | Whether to enable number detection. By default, 8 consecutive digits are treated as a sensitive word |
8 | enableEmailCheck | Whether to enable email detection |
9 | enableUrlCheck | Whether to enable link detection |
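The number check in row 7 can be pictured as a search for long digit runs. The sketch below assumes "8 consecutive digits" means eight or more ASCII digits in a row, and that stylized digits have already been normalized; NumCheckSketch is a hypothetical name, not part of the library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: flag runs of eight or more consecutive digits. */
final class NumCheckSketch {
    private static final Pattern DIGIT_RUN = Pattern.compile("[0-9]{8,}");

    static List<String> findDigitRuns(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = DIGIT_RUN.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findDigitRuns("call 12345678 or ext 1234"));
    }
}
```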
Dynamic loading (user-defined)
Scenario description
Sometimes we want the sensitive-word list to be loaded dynamically, for example edited from an admin console, with changes taking effect in real time.
v0.0.13 supports this feature.
Interface description
To support this feature while staying compatible with earlier functionality, we defined two interfaces.
IWordDeny
The interface is as follows; you can provide your own implementation.
Every word in the returned list is treated as a sensitive word.
/**
 * Denied data: the returned content is treated as sensitive words
 * @author binbin.hou
 * @since 0.0.13
 */
public interface IWordDeny {
    /**
     * Get the result
     * @return the result
     * @since 0.0.13
     */
    List<String> deny();
}
For example:
public class MyWordDeny implements IWordDeny {
    @Override
    public List<String> deny() {
        return Arrays.asList("我的自定义敏感词");
    }
}
IWordAllow
The interface is as follows; you can provide your own implementation.
Every word in the returned list is treated as not sensitive.
/**
 * Allowed content: the returned content is not treated as sensitive words
 * @author binbin.hou
 * @since 0.0.13
 */
public interface IWordAllow {
    /**
     * Get the result
     * @return the result
     * @since 0.0.13
     */
    List<String> allow();
}
For example:
public class MyWordAllow implements IWordAllow {
    @Override
    public List<String> allow() {
        return Arrays.asList("五星红旗");
    }
}
Configure and use
Once an interface has a custom implementation, it of course has to be specified in order to take effect.
To keep usage elegant, we designed the bootstrap class SensitiveWordBs.
You can use wordDeny() to specify sensitive words, wordAllow() to specify non-sensitive words, and init() to initialize the sensitive-word dictionary.
System default configuration
SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .wordDeny(WordDenys.system())
        .wordAllow(WordAllows.system())
        .init();
final String text = "五星红旗迎风飘扬,毛主席的画像屹立在天安门前。";
Assert.assertTrue(wordBs.contains(text));
Note: init() builds the sensitive-word DFA, which is relatively time-consuming. It is generally recommended to call it only once, at application startup, rather than initializing repeatedly!
Specify your own implementation
We can test the custom implementation as follows:
String text = "这是一个测试,我的自定义敏感词。";
SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .wordDeny(new MyWordDeny())
        .wordAllow(new MyWordAllow())
        .init();
Assert.assertEquals("[我的自定义敏感词]", wordBs.findAll(text).toString());
Here only 我的自定义敏感词 is a sensitive word, while 测试 is not.
Of course, everything here comes from our custom implementations. In general, it is recommended to combine the system default configuration with your custom configuration.
You can do so as follows.
Configure multiple at the same time
- Multiple sensitive-word sources: the WordDenys.chains() method combines multiple implementations into a single IWordDeny.
- Multiple whitelists: the WordAllows.chains() method combines multiple implementations into a single IWordAllow.
Example:
String text = "这是一个测试。我的自定义敏感词。";
IWordDeny wordDeny = WordDenys.chains(WordDenys.system(), new MyWordDeny());
IWordAllow wordAllow = WordAllows.chains(WordAllows.system(), new MyWordAllow());
SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .wordDeny(wordDeny)
        .wordAllow(wordAllow)
        .init();
Assert.assertEquals("[我的自定义敏感词]", wordBs.findAll(text).toString());
Here, both the system default configuration and the custom configuration are used.
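Conceptually, chaining merges several word sources into one, and the whitelist then removes entries from the merged deny list. The sketch below illustrates that idea with a hypothetical WordSource interface standing in for IWordDeny and IWordAllow. It assumes Java 8+ and that the effective dictionary is "denied minus allowed", which is one plausible reading of the behavior described here, not a statement about the library's internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of chaining word sources and applying a whitelist. */
final class ChainSketch {
    /** Stand-in for IWordDeny / IWordAllow: anything that supplies a list of words. */
    interface WordSource {
        List<String> words();
    }

    /** Combines several sources into one; sources are re-read on every call. */
    static WordSource chain(final WordSource... sources) {
        return () -> {
            List<String> all = new ArrayList<>();
            for (WordSource source : sources) {
                all.addAll(source.words());
            }
            return all;
        };
    }

    /** Effective dictionary: denied words minus allowed words, duplicates dropped. */
    static Set<String> effective(WordSource deny, WordSource allow) {
        Set<String> result = new LinkedHashSet<>(deny.words());
        result.removeAll(allow.words());
        return result;
    }

    public static void main(String[] args) {
        WordSource system = () -> Arrays.asList("alpha", "beta");
        WordSource custom = () -> Arrays.asList("gamma");
        WordSource allow = () -> Arrays.asList("beta");
        System.out.println(effective(chain(system, custom), allow));
    }
}
```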
Spring integration
Background
In real-world use, you may want to, for example, edit the word lists from an admin page and have the changes take effect in real time.
The data is stored in a database. The following is pseudocode; for reference, see SpringSensitiveWordConfig.java.
Requirement: version v0.0.15 or later.
Custom data source
The simplified pseudocode is as follows; the data comes from the database.
MyDdWordAllow and MyDdWordDeny are custom implementation classes that use the database as their source.
@Configuration
public class SpringSensitiveWordConfig {

    @Autowired
    private MyDdWordAllow myDdWordAllow;

    @Autowired
    private MyDdWordDeny myDdWordDeny;

    /**
     * Initialize the bootstrap class
     * @return the initialized bootstrap class
     * @since 1.0.0
     */
    @Bean
    public SensitiveWordBs sensitiveWordBs() {
        SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
                .wordAllow(WordAllows.chains(WordAllows.system(), myDdWordAllow))
                .wordDeny(myDdWordDeny)
                // various other configuration options
                .init();
        return sensitiveWordBs;
    }
}
Initializing the sensitive-word dictionary is time-consuming. It is recommended to call init once when the program starts.
Dynamic changes
To make sure that changes to sensitive words take effect in real time while keeping the interface as simple as possible, no new add/remove methods were introduced.
Instead, calling sensitiveWordBs.init() rebuilds the sensitive-word dictionary from IWordDeny + IWordAllow.
Because initialization can take a while (on the order of seconds), it has been optimized so that the old dictionary keeps working until init completes; once it completes, the new dictionary takes over.
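@Component
public class SensitiveWordService {

    @Autowired
    private SensitiveWordBs sensitiveWordBs;

    /**
     * Refresh the dictionary
     *
     * Whenever the data in the database changes, first call the method that updates the sensitive-word dictionary in the database.
     * To make the change take effect, call this method.
     *
     * Note: re-initialization does not affect use of the old dictionary. Once initialization completes, the new one takes over.
     */
    public void refresh() {
        // Whenever the data in the database changes, first update the dictionary in the database, then call this method.
        sensitiveWordBs.init();
    }
}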
As shown above, when the dictionary in the database changes and the change needs to take effect, trigger a re-initialization by calling sensitiveWordBs.init();.
Everything else is used as before, and there is no need to restart the application.
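The "old dictionary keeps working until the new one is ready" behavior can be pictured as building the new structure off to the side and publishing it in a single atomic step. The sketch below shows that pattern with an AtomicReference. It is an assumption about the general technique, not the library's implementation; HotSwapSketch and its methods are hypothetical, and a plain set lookup stands in for the DFA.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical sketch: rebuild a dictionary and swap it in atomically. */
final class HotSwapSketch {
    // Hypothetical word source, for example a database query.
    private final Supplier<List<String>> source;
    private final AtomicReference<Set<String>> current =
        new AtomicReference<>(Collections.<String>emptySet());

    HotSwapSketch(Supplier<List<String>> source) {
        this.source = source;
    }

    /** Builds a fresh dictionary, then publishes it; readers see the old one until then. */
    HotSwapSketch init() {
        Set<String> fresh = new HashSet<>(source.get()); // the slow part
        current.set(Collections.unmodifiableSet(fresh)); // single atomic publish
        return this;
    }

    boolean isSensitive(String word) {
        return current.get().contains(word);
    }

    public static void main(String[] args) {
        List<String> db = new ArrayList<>(Arrays.asList("foo"));
        HotSwapSketch dict = new HotSwapSketch(() -> new ArrayList<>(db)).init();
        db.add("bar");
        System.out.println(dict.isSensitive("bar")); // checked before re-init
        dict.init();
        System.out.println(dict.isSensitive("bar")); // checked after re-init
    }
}
```

Readers never block and never observe a half-built dictionary, at the cost of briefly holding two dictionaries in memory during a rebuild.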
Brother Guanxi smiles faintly: if you want to get things done, first learn to be a decent person.
Further reading
Implementation ideas behind the sensitive-word tool
Optimization process of the sensitive-word dictionary
Summary
Once again: we rely on the law to defend ourselves, and we must never let certain people turn everything into entertainment, thinking that money can buy anything.
On this centenary, still less can we let the blood of our forebears have been shed in vain.
What's more, this is a "three-no" entertainer of Canadian nationality. I recommend dealing with him according to the law, and then (ノ`Д)ノ (beautiful Chinese)