Wu Xfan's rumored girlfriend Xiaoyi was abused into wiping her social media accounts? Do the major platforms not even have sensitive-word filtering?

Platforms without sensitive-word filtering

Recently, a rapper of Canadian nationality had his dissolute private life exposed, and he may well be implicated in luring X underage girls, which would turn a rapper into a raper.

Of course, as to whether it is true: whether someone is a "sea king" (a serial womanizer) is recorded clearly in the WeChat and QQ chat logs. In a criminal case, TX (Tencent) can review the full chat history. Tencent Video terminating its contract with a certain "electric eel" is therefore not necessarily unfounded; after all, its own interests are at stake.

But during the whole process, I found two very noteworthy points:

(1) His rumored girlfriend, Xiaoyi, was abused until she cleared all of her social media accounts. For a heavyweight like X blog, is crashing its own servers all it can do? Has it never heard of sensitive-word filtering?

(2) The whistleblower, Du Meizhu, received a large number of gory photos. Don't the platforms that brag about artificial intelligence every day have any filtering at all?

Of course, I don't know anything about artificial intelligence.

But as for sensitive words, I recently wrote a small tool. It is open source, and any major platform that needs it is welcome to take it.

https://github.com/houbb/sensitive-word

At the very least, it can mask "beautiful Chinese" like the following:

你 XXX 的，我 XXX 你 XXX，你 XXXX，XXX！！XXX！

Creation purpose

The tool is built on the DFA (deterministic finite automaton) algorithm. The sensitive-word dictionary currently contains 60,000+ entries (the source files held 180,000+ before one round of pruning).
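The DFA idea mentioned here can be pictured as a trie whose nodes act as automaton states. The following is only a minimal sketch under that assumption: the class DfaSketch and its methods are invented for illustration and are not the library's code, and the longest-match policy is a design choice made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative trie-shaped DFA; NOT the library's actual implementation. */
final class DfaSketch {

    /** One automaton state: outgoing transitions plus an accepting flag. */
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean end;
    }

    private final Node root = new Node();

    /** Adds one dictionary word, creating states along the way. */
    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            Node child = node.next.get(c);
            if (child == null) {
                child = new Node();
                node.next.put(c, child);
            }
            node = child;
        }
        node.end = true;
    }

    /** Scans left to right and returns the longest dictionary word at each position. */
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            Node node = root;
            int matchEnd = -1;
            for (int j = i; j < text.length(); j++) {
                node = node.next.get(text.charAt(j));
                if (node == null) {
                    break;              // no transition: stop extending this match
                }
                if (node.end) {
                    matchEnd = j + 1;   // remember the longest accepted prefix so far
                }
            }
            if (matchEnd > 0) {
                hits.add(text.substring(i, matchEnd));
                i = matchEnd;           // continue after the matched word
            } else {
                i++;
            }
        }
        return hits;
    }

    boolean contains(String text) {
        return !findAll(text).isEmpty();
    }

    public static void main(String[] args) {
        DfaSketch dfa = new DfaSketch();
        dfa.add("bad");
        dfa.add("badword");
        dfa.add("evil");
        System.out.println(dfa.findAll("a badword and evil")); // prints [badword, evil]
    }
}
```

Each character of the input is followed through at most one transition per state, which is why lookup cost depends on the text length rather than on the size of the dictionary.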

Going forward, the dictionary will keep being optimized and extended, and the algorithm's performance will be improved further.

I would also like to classify the sensitive words in finer detail, but the workload is fairly large, so this has not been started yet.

As for the vision: to be the number-one useful tool for sensitive words.

Of course, being number one is always lonely.

Young Master Emptiness (空虚公子)

Features

60,000+ word dictionary, continuously optimized and updated
Based on the DFA algorithm, with good performance
Implemented as a fluent API, elegant and concise to use
Supports common operations: checking for, returning, and masking sensitive words
Supports full-width and half-width interchange
Supports upper-case and lower-case interchange
Supports interchange of common number styles
Supports Traditional and Simplified Chinese interchange
Supports interchange of common English character styles
Supports user-defined sensitive words and whitelists
Supports dynamic data updates that take effect in real time

Quick start

Prerequisites

JDK1.7+
Maven 3.x+

Maven dependency

<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>sensitive-word</artifactId>
    <version>0.0.15</version>
</dependency>

API overview

SensitiveWordHelper is the utility class for sensitive words. Its core methods are as follows:

Method	Parameters	Return value	Description
contains(String)	String to be checked	Boolean	Check whether the string contains sensitive words
findFirst(String)	String to be checked	String	Return the first sensitive word in the string
findAll(String)	String to be checked	List of strings	Return all sensitive words in the string
replace(String, char)	String to be masked, replacement char	String	Return the string with sensitive words replaced by the given char
replace(String)	String to be masked	String	Return the string with sensitive words replaced by `*`

Usage examples

For all test cases see SensitiveWordHelperTest

Determine whether it contains sensitive words

final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前。";

Assert.assertTrue(SensitiveWordHelper.contains(text));

Return the first sensitive word

final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前。";

String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("五星红旗", word);

Return all sensitive words

final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前。";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[五星红旗, 毛主席, 天安门]", wordList.toString());

Default replacement strategy

final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前。";
String result = SensitiveWordHelper.replace(text);
Assert.assertEquals("****迎风飘扬，***的画像屹立在***前。", result);

Specify the replacement character

final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前。";
String result = SensitiveWordHelper.replace(text, '0');
Assert.assertEquals("0000迎风飘扬，000的画像屹立在000前。", result);

More features

The features below mainly deal with the many ways text can be disguised, in order to raise the hit rate for sensitive words as far as possible.

This is a long battle of attack and defense.

Ignore case

final String text = "fuCK the bad words.";

String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("fuCK", word);

Ignore full-width and half-width forms

final String text = "ｆｕｃｋ the bad words.";

String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("ｆｕｃｋ", word);
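One way such width-insensitive matching could work is to normalize each character before comparing it against the dictionary, while still reporting the original text (as the example above does). The sketch below assumes exactly that; WidthSketch and its methods are hypothetical names, not part of the library. It relies on the fact that the full-width forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from ASCII U+0021 to U+007E.

```java
/** Hypothetical helper: folds full-width ASCII variants and case for comparison only. */
final class WidthSketch {

    /** Maps a full-width character to its half-width ASCII counterpart, if it has one. */
    static char toHalfWidth(char c) {
        if (c == '\u3000') {
            return ' ';                      // ideographic space -> ASCII space
        }
        if (c >= '\uFF01' && c <= '\uFF5E') {
            return (char) (c - 0xFEE0);      // full-width forms mirror ASCII at a fixed offset
        }
        return c;
    }

    /** Returns a comparison key: half-width and lower-case. */
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            sb.append(Character.toLowerCase(toHalfWidth(text.charAt(i))));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Full-width "ＡＢ" followed by a plain "c".
        System.out.println(normalize("\uFF21\uFF22c")); // prints abc
    }
}
```

Because the mapping is one character to one character, positions in the normalized key line up with positions in the original string, so a match can be cut out of the untouched input.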

Ignore number styles

Common written forms of numbers are treated as equivalent here.

final String text = "这个是我的微信：9⓿二肆⁹₈③⑸⒋➃㈤㊄";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[9⓿二肆⁹₈③⑸⒋➃㈤㊄]", wordList.toString());

Ignore Traditional and Simplified Chinese

final String text = "我爱我的祖国和五星紅旗。";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[五星紅旗]", wordList.toString());

Ignore English writing format

final String text = "Ⓕⓤc⒦ the bad words";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[Ⓕⓤc⒦]", wordList.toString());

Ignore repeated characters

final String text = "ⒻⒻⒻfⓤuⓤ⒰cⓒ⒦ the bad words";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[ⒻⒻⒻfⓤuⓤ⒰cⓒ⒦]", wordList.toString());

Email detection

final String text = "楼主好人，邮箱 sensitiveword@xx.com";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[sensitiveword@xx.com]", wordList.toString());

Feature configuration

Description

All of the features above are enabled by default, but sometimes a business needs to switch individual features on or off.

For that reason, v0.0.14 exposes them as configuration options.

Configuration method

To keep usage elegant, configuration is done uniformly through a fluent API.

Users can configure SensitiveWordBs as follows:

SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .ignoreCase(true)
        .ignoreWidth(true)
        .ignoreNumStyle(true)
        .ignoreChineseStyle(true)
        .ignoreEnglishStyle(true)
        .ignoreRepeat(true)
        .enableNumCheck(true)
        .enableEmailCheck(true)
        .enableUrlCheck(true)
        .init();

final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前。";
Assert.assertTrue(wordBs.contains(text));

Configuration instructions

The description of each configuration is as follows:

No.	Method	Description
1	ignoreCase	Ignore case
2	ignoreWidth	Ignore full-width and half-width forms
3	ignoreNumStyle	Ignore number styles
4	ignoreChineseStyle	Ignore Chinese writing styles (Traditional and Simplified)
5	ignoreEnglishStyle	Ignore English character styles
6	ignoreRepeat	Ignore repeated characters
7	enableNumCheck	Whether to enable number detection. By default, 8 consecutive digits are treated as a sensitive word
8	enableEmailCheck	Whether to enable email detection
9	enableUrlCheck	Whether to enable URL detection
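To make the number check in row 7 concrete, here is a rough sketch of flagging runs of consecutive digits. It is an assumption about the behaviour, not the library's code: NumCheckSketch and findDigitRuns are hypothetical names, only ASCII digits are handled, and the threshold is passed in as a parameter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of "N consecutive digits count as sensitive". */
final class NumCheckSketch {

    /** Returns every run of ASCII digits whose length is at least minLen. */
    static List<String> findDigitRuns(String text, int minLen) {
        List<String> runs = new ArrayList<>();
        int start = -1;                                  // start index of the current run, or -1
        for (int i = 0; i <= text.length(); i++) {
            boolean digit = i < text.length()
                    && text.charAt(i) >= '0' && text.charAt(i) <= '9';
            if (digit) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                if (i - start >= minLen) {
                    runs.add(text.substring(start, i));  // run is long enough to flag
                }
                start = -1;
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(findDigitRuns("call 12345678 or 1234", 8)); // prints [12345678]
    }
}
```

Combined with number-style normalization, the same scan would also catch digits written in circled, superscript, or Chinese forms, since those would be folded to plain digits first.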

Dynamic loading (user-defined)

Scenario description

Sometimes we want sensitive words to be loaded dynamically, for example edited in an admin console and then taking effect in real time.

v0.0.13 supports this feature.

Interface Description

In order to achieve this feature and be compatible with previous functions, we have defined two interfaces.

IWordDeny

The interface is as follows; you can provide your own implementation.

Every word in the returned list is treated as a sensitive word.

/**
 * Denied data: the returned content is treated as sensitive words.
 * @author binbin.hou
 * @since 0.0.13
 */
public interface IWordDeny {

    /**
     * Get the result.
     * @return the result
     * @since 0.0.13
     */
    List<String> deny();

}

For example:

import java.util.Arrays;
import java.util.List;

public class MyWordDeny implements IWordDeny {

    @Override
    public List<String> deny() {
        // Words returned here are added to the deny list.
        return Arrays.asList("我的自定义敏感词");
    }

}

IWordAllow

The interface is as follows; you can provide your own implementation.

Every word in the returned list is treated as not sensitive.

/**
 * Allowed content: the returned content is not treated as sensitive words.
 * @author binbin.hou
 * @since 0.0.13
 */
public interface IWordAllow {

    /**
     * Get the result.
     * @return the result
     * @since 0.0.13
     */
    List<String> allow();

}

For example:

import java.util.Arrays;
import java.util.List;

public class MyWordAllow implements IWordAllow {

    @Override
    public List<String> allow() {
        // Words returned here are whitelisted.
        return Arrays.asList("五星红旗");
    }

}

Configure and use

Once the interfaces are implemented, they of course have to be registered to take effect.

To keep usage elegant, we designed the bootstrap class SensitiveWordBs.

You can use wordDeny() to specify sensitive words, wordAllow() to specify non-sensitive words, and init() to initialize the sensitive-word dictionary.

System default configuration

SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .wordDeny(WordDenys.system())
        .wordAllow(WordAllows.system())
        .init();

final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前。";
Assert.assertTrue(wordBs.contains(text));

Note: init() builds the sensitive-word DFA, which is relatively time-consuming. It is generally recommended to call it only once, at application startup, rather than initializing repeatedly!

Specify your own implementation

We can test the custom implementation as follows:

String text = "这是一个测试，我的自定义敏感词。";

SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .wordDeny(new MyWordDeny())
        .wordAllow(new MyWordAllow())
        .init();

Assert.assertEquals("[我的自定义敏感词]", wordBs.findAll(text).toString());

Here only 我的自定义敏感词 ("my custom sensitive word") counts as a sensitive word; 测试 ("test") does not.

Of course, everything here uses our own custom implementations. In general, it is recommended to combine the system default configuration with a custom one.

You can do so as follows.

Configuring multiple sources at once

Multiple deny lists

The WordDenys.chains() method merges multiple implementations into a single IWordDeny.

Multiple whitelists

The WordAllows.chains() method merges multiple implementations into a single IWordAllow.

Example:

String text = "这是一个测试。我的自定义敏感词。";

IWordDeny wordDeny = WordDenys.chains(WordDenys.system(), new MyWordDeny());
IWordAllow wordAllow = WordAllows.chains(WordAllows.system(), new MyWordAllow());

SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .wordDeny(wordDeny)
        .wordAllow(wordAllow)
        .init();

Assert.assertEquals("[我的自定义敏感词]", wordBs.findAll(text).toString());

Here, both the system default configuration and the custom configuration are used.
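Conceptually, chaining amounts to merging the word lists of several sources behind one interface. The sketch below shows that idea with a stand-in interface; WordSource, chain, and of are hypothetical names that merely mirror the shape of IWordDeny.deny(), and the de-duplication step is an assumption rather than documented library behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of combining several word sources into one. */
final class ChainSketch {

    /** Stand-in with the same shape as IWordDeny.deny() / IWordAllow.allow(). */
    interface WordSource {
        List<String> words();
    }

    /** A fixed, in-memory source, used here in place of a file or database. */
    static WordSource of(final String... words) {
        return new WordSource() {
            @Override
            public List<String> words() {
                return Arrays.asList(words);
            }
        };
    }

    /** Merges sources in order; duplicates are dropped (an assumption of this sketch). */
    static WordSource chain(final WordSource... sources) {
        return new WordSource() {
            @Override
            public List<String> words() {
                Set<String> merged = new LinkedHashSet<>();
                for (WordSource source : sources) {
                    merged.addAll(source.words());
                }
                return new ArrayList<>(merged);
            }
        };
    }

    public static void main(String[] args) {
        WordSource system = of("a", "b");
        WordSource custom = of("b", "c");
        System.out.println(chain(system, custom).words()); // prints [a, b, c]
    }
}
```

The merged source is only read when words() is called, so a database-backed source would be queried afresh each time the dictionary is rebuilt.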

Spring integration

Background

In real-world use, sensitive words are typically edited on an admin page and should then take effect in real time.

The data is stored in a database. The following is pseudocode; see SpringSensitiveWordConfig.java for reference.

Requirement: v0.0.15 or later.

Custom data source

The simplified pseudocode is as follows; the data comes from a database.

MyDdWordAllow and MyDdWordDeny are custom implementation classes that use the database as their source.

@Configuration
public class SpringSensitiveWordConfig {

    @Autowired
    private MyDdWordAllow myDdWordAllow;

    @Autowired
    private MyDdWordDeny myDdWordDeny;

    /**
     * Initialize the bootstrap class.
     * @return the initialized bootstrap class
     * @since 1.0.0
     */
    @Bean
    public SensitiveWordBs sensitiveWordBs() {
        SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
                .wordAllow(WordAllows.chains(WordAllows.system(), myDdWordAllow))
                .wordDeny(myDdWordDeny)
                // various other settings
                .init();

        return sensitiveWordBs;
    }

}

Initializing the sensitive-word dictionary is time-consuming. It is recommended to call init once when the program starts.

Dynamic changes

To let changes to sensitive words take effect in real time while keeping the interface as simple as possible, no new add/remove methods were introduced.

Instead, calling sensitiveWordBs.init() rebuilds the sensitive-word dictionary from IWordDeny and IWordAllow.

Because initialization can take a while (on the order of seconds), it has been optimized so that the old dictionary keeps working until init completes; once it completes, the new dictionary takes over.

@Component
public class SensitiveWordService {

    @Autowired
    private SensitiveWordBs sensitiveWordBs;

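    /**
     * Refresh the dictionary.
     *
     * Whenever the data in the database changes, first call the method that updates the sensitive words stored in the database.
     * Then call this method when the change should take effect.
     *
     * Note: re-initialization does not affect use of the old dictionary. Once it completes, the new dictionary takes over.
     */
    public void refresh() {
        // After each database change, first update the sensitive words in the database, then call this method.
        sensitiveWordBs.init();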
    }

}

Call sensitiveWordBs.init() whenever the dictionary in the database has changed and the change needs to take effect.

Other uses remain unchanged, no need to restart the application.
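The behaviour described above, where the old dictionary keeps serving requests until the rebuild finishes, is commonly achieved by building a fresh snapshot off to the side and publishing it in a single atomic step. The sketch below illustrates that pattern only; ReloadSketch is a hypothetical class, a plain set stands in for the DFA, and nothing here is taken from the library's internals.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical illustration of rebuild-then-swap; a Set stands in for the real DFA. */
final class ReloadSketch {

    /** The snapshot currently visible to readers; never modified after publication. */
    private final AtomicReference<Set<String>> current =
            new AtomicReference<Set<String>>(Collections.<String>emptySet());

    /** Builds a new snapshot without touching the old one, then swaps it in atomically. */
    void reload(Collection<String> latestWords) {
        Set<String> fresh = Collections.unmodifiableSet(new HashSet<>(latestWords));
        current.set(fresh);   // readers see either the old or the new snapshot, never a mix
    }

    /** Readers always work against whichever snapshot was published last. */
    boolean isSensitive(String word) {
        return current.get().contains(word);
    }

    public static void main(String[] args) {
        ReloadSketch dictionary = new ReloadSketch();
        dictionary.reload(Arrays.asList("x"));
        System.out.println(dictionary.isSensitive("x")); // prints true
        dictionary.reload(Arrays.asList("y"));
        System.out.println(dictionary.isSensitive("x")); // prints false
    }
}
```

Since the expensive work happens before the swap, readers are never blocked by a rebuild; the trade-off is that two snapshots exist in memory for a short time.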

Brother Guanxi smiles faintly: if you want to get things done, first learn to be a decent person.

Further reading

Implementation ideas for the sensitive-word tool

DFA algorithm explained

Optimization process of the sensitive-word dictionary

Notes on stop words

Summary

Once again: we defend ourselves with the law, and we must never let certain people turn everything into entertainment and believe that money can buy anything.

On this centenary, we must all the more not let the blood of our forebears have been shed in vain.

Besides, he is a Canadian entertainer with "three nos". I suggest he be dealt with according to the law, and then (ノ｀Д)ノ (beautiful Chinese)