Sensitive words are not the platform

Recently, a Canadian rapper was exposed for sexual misconduct, possibly including luring an underage girl.

Whether it is true, and whether someone really is a "Neptune" (slang for a serial womanizer), is recorded clearly in the WeChat and QQ chat logs. In a criminal case, Tencent (TX) can review the full chat history. Tencent Video and other partners cutting ties is not necessarily groundless either; after all, their interests are at stake.

But there were two things that I noticed in the process that were really remarkable:

(1) His rumored girlfriend, Xiao Yi, was abused so badly that she cleared all of her social media accounts. As the biggest home of celebrity gossip, can X blog do nothing but let its servers crash? Does it not know how to filter sensitive words?

(2) The whistleblower, Meizhu, received many blood-red photos. The big platforms brag about their artificial intelligence every day; do they not have a filter for this either?

Of course, I know nothing about artificial intelligence.

But when it comes to sensitive words, I recently wrote a small tool. It is open source, so if any major platform needs it, feel free to take it.

https://github.com/houbb/sensitive-word

At least it can mask "beautiful Chinese words" like this:

You * * *, I * * you * *, you * * *, * *! XXX!

Writing purpose

The tool is based on the DFA algorithm. The sensitive-word lexicon currently contains 60,000+ entries (the source file had 180,000+ entries before one round of pruning).
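To make the DFA idea concrete, here is a minimal Java sketch of a dictionary automaton: a trie that is walked one character at a time, preferring the longest match. It is an illustration only and not the tool's actual implementation; the class name DfaSketch and all of its methods are made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; NOT the sensitive-word library's real code.
class DfaSketch {

    // One automaton state: outgoing transitions plus an "end of word" flag.
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean end;
    }

    private final Node root = new Node();

    // Build the automaton once from the lexicon.
    DfaSketch(List<String> words) {
        for (String word : words) {
            Node node = root;
            for (char c : word.toCharArray()) {
                Node child = node.next.get(c);
                if (child == null) {
                    child = new Node();
                    node.next.put(c, child);
                }
                node = child;
            }
            node.end = true;
        }
    }

    // Length of the longest lexicon word starting at index start, or 0 if none.
    private int matchLength(String text, int start) {
        Node node = root;
        int longest = 0;
        for (int i = start; i < text.length(); i++) {
            node = node.next.get(text.charAt(i));
            if (node == null) {
                break;
            }
            if (node.end) {
                longest = i - start + 1;
            }
        }
        return longest;
    }

    // All matched words, in order of appearance.
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int len = matchLength(text, i);
            if (len > 0) {
                hits.add(text.substring(i, i + len));
                i += len;
            } else {
                i++;
            }
        }
        return hits;
    }

    boolean contains(String text) {
        return !findAll(text).isEmpty();
    }

    // Overwrite every matched character with the mask character.
    String replace(String text, char mask) {
        StringBuilder sb = new StringBuilder(text);
        int i = 0;
        while (i < text.length()) {
            int len = matchLength(text, i);
            if (len > 0) {
                for (int j = i; j < i + len; j++) {
                    sb.setCharAt(j, mask);
                }
                i += len;
            } else {
                i++;
            }
        }
        return sb.toString();
    }
}
```

The lookup cost at each position depends on the length of the longest word, not on the number of words in the lexicon, which is why this structure suits a lexicon with tens of thousands of entries.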

Going forward, the lexicon will continue to be optimized and extended, and the performance of the algorithm further improved.

I hope to refine the classification of sensitive words, but I feel it is too much work, so I haven’t done it yet.

A word about the vision: to become the number one tool for sensitive words.

Of course, first place is always vacant.

Features

  • 60,000+ word lexicon, continuously optimized and updated
  • Based on the DFA algorithm for good performance
  • Fluent API, elegant and simple to use
  • Supports common operations: checking for, returning, and masking sensitive words
  • Supports full-width/half-width conversion
  • Supports English upper/lower case conversion
  • Supports conversion between common ways of writing numbers
  • Supports traditional/simplified Chinese conversion
  • Supports conversion between common ways of writing English letters
  • Supports user-defined sensitive words and whitelists
  • Supports dynamic data updates that take effect in real time

Quick start

Prerequisites

  • JDK 1.7+
  • Maven 3.x+

Maven dependency

<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>sensitive-word</artifactId>
    <version>0.0.15</version>
</dependency>

API overview

SensitiveWordHelper, as a tool class for sensitive words, has the following core methods:

Method | Parameter | Return value | Description
contains(String) | The string to check | boolean | Checks whether the string contains sensitive words
findAll(String) | The string to check | List of strings | Returns all sensitive words in the string
replace(String, char) | Replaces sensitive words with the specified char | String | Returns the masked string
replace(String) | Replaces sensitive words with * | String | Returns the masked string

Usage examples

See SensitiveWordHelperTest for all test cases

Determine whether sensitive words are included

final String text = "The Five-Starred Red Flag is waving in the wind, and the portrait of Chairman Mao stands in front of Tian'anmen Square.";

Assert.assertTrue(SensitiveWordHelper.contains(text));

Returns the first sensitive word

final String text = "The Five-Starred Red Flag is waving in the wind, and the portrait of Chairman Mao stands in front of Tian'anmen Square.";

String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("Five-Starred Red Flag", word);

Returns all sensitive words

final String text = "The Five-Starred Red Flag is waving in the wind, and the portrait of Chairman Mao stands in front of Tian'anmen Square.";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[Five-Starred Red Flag, Chairman Mao, Tian'anmen Square]", wordList.toString());

The default replacement policy

final String text = "The Five-Starred Red Flag is waving in the wind, and the portrait of Chairman Mao stands in front of Tian'anmen Square.";

String result = SensitiveWordHelper.replace(text);
Assert.assertEquals("**** is waving in the wind, and the portrait of *** stands in front of ***.", result);

Specify the replacement character

final String text = "The Five-Starred Red Flag is waving in the wind, and the portrait of Chairman Mao stands in front of Tian'anmen Square.";

String result = SensitiveWordHelper.replace(text, '0');
Assert.assertEquals("0000 is waving in the wind, and the portrait of 000 stands in front of 000.", result);

More features

Most of the features that follow handle special cases of one kind or another, in order to raise the hit rate for sensitive words as far as possible.

This is a long battle of attack and defense.

Ignoring case

final String text = "fuCK the bad words.";

String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("fuCK", word);

Ignore full-width and half-width forms

final String text = "ｆｕｃｋ the bad words.";

String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("ｆｕｃｋ", word);
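Ignoring case and character width can both be handled by mapping every character to one canonical form before it is compared with the lexicon. The sketch below shows one way to do that in Java; it is an assumption about how such a feature could work, not the library's code, and the class name NormalizeSketch is hypothetical. It relies on the fact that the full-width forms U+FF01 to U+FF5E sit exactly 0xFEE0 above their ASCII counterparts.

```java
// Illustrative sketch only; NOT the sensitive-word library's real code.
class NormalizeSketch {

    // Map one character to its canonical form: half-width and lower case.
    static char normalize(char c) {
        if (c == '\u3000') {
            // Ideographic (full-width) space becomes a plain space.
            c = ' ';
        } else if (c >= '\uFF01' && c <= '\uFF5E') {
            // Full-width ASCII variants are offset by 0xFEE0 from ASCII.
            c = (char) (c - 0xFEE0);
        }
        return Character.toLowerCase(c);
    }

    // Normalize a whole string character by character.
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            sb.append(normalize(text.charAt(i)));
        }
        return sb.toString();
    }
}
```

Because the mapping is one character to one character, match positions found in the normalized text can be used to cut the hit out of the original text, which would explain why the examples return the word as it was originally written.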

Ignore the way the numbers are written

Here, the conversion of common forms of numbers is realized.

final String text = "This is my WeChat: 9⓿二肆⁹₈③⑸⒋➃㈤㊄";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[9⓿二肆⁹₈③⑸⒋➃㈤㊄]", wordList.toString());
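As a rough illustration of how different ways of writing a digit could be folded into one, the Java sketch below maps a few Unicode digit styles to plain ASCII digits. It covers only full-width digits, circled and parenthesized 1 to 9, and the basic Chinese numerals; the set of styles the tool really supports is not stated here, and the class name NumStyleSketch is hypothetical.

```java
// Illustrative sketch only; NOT the sensitive-word library's real code.
class NumStyleSketch {

    // Chinese numerals for 1..9 (yi, er, san, si, wu, liu, qi, ba, jiu).
    private static final String CHINESE_1_TO_9 =
            "\u4E00\u4E8C\u4E09\u56DB\u4E94\u516D\u4E03\u516B\u4E5D";

    // Map one stylized digit to its ASCII digit; other characters pass through.
    static char toAsciiDigit(char c) {
        if (c >= '\uFF10' && c <= '\uFF19') {
            return (char) ('0' + (c - '\uFF10')); // full-width 0-9
        }
        if (c >= '\u2460' && c <= '\u2468') {
            return (char) ('1' + (c - '\u2460')); // circled 1-9
        }
        if (c >= '\u2474' && c <= '\u247C') {
            return (char) ('1' + (c - '\u2474')); // parenthesized 1-9
        }
        int index = CHINESE_1_TO_9.indexOf(c);
        if (index >= 0) {
            return (char) ('1' + index);
        }
        return c;
    }

    static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            sb.append(toAsciiDigit(text.charAt(i)));
        }
        return sb.toString();
    }
}
```

Once every style is folded to ASCII, a single rule such as "a long run of digits" can catch a contact number no matter how it was disguised.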

Ignore traditional and simplified Chinese forms

final String text = "I love my motherland and the Five-Starred Red Flag.";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[Five-Starred Red Flag]", wordList.toString());

Ignore English letter styles

final String text = "Ⓕⓤc⒦ the bad words";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[Ⓕⓤc⒦]", wordList.toString());

Ignore repeated words

final String text = "ⒻⒻⒻfⓤuⓤ⒰cⓒ⒦ the bad words";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[ⒻⒻⒻfⓤuⓤ⒰cⓒ⒦]", wordList.toString());
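One simple way to defeat padding by repetition is to collapse runs of the same character before matching. The Java sketch below shows that single step; it is an assumption about the general technique, not the library's code, and the class name RepeatSketch is hypothetical. In practice it would be applied after the letter-style and case normalization, so that different renderings of the same letter count as repeats.

```java
// Illustrative sketch only; NOT the sensitive-word library's real code.
class RepeatSketch {

    // Keep a character only if it differs from the one immediately before it.
    static String collapseRepeats(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (i == 0 || c != text.charAt(i - 1)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Note the trade-off: collapsing also shortens ordinary words with double letters, so the lexicon entries would need the same treatment for the comparison to stay consistent.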

Email detection

final String text = "[email protected]";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[[email protected]]", wordList.toString());

Feature configuration

Description

All of the above features are enabled by default, but sometimes the business needs the flexibility to turn individual features on or off.

So property configuration was opened up in v0.0.14.

Configuration method

To keep usage elegant, configuration is defined uniformly through a fluent API.

Users can define SensitiveWordBs as follows:

SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .ignoreCase(true)
        .ignoreWidth(true)
        .ignoreNumStyle(true)
        .ignoreChineseStyle(true)
        .ignoreEnglishStyle(true)
        .ignoreRepeat(true)
        .enableNumCheck(true)
        .enableEmailCheck(true)
        .enableUrlCheck(true)
        .init();

final String text = "The Five-Starred Red Flag is waving in the wind, and the portrait of Chairman Mao stands in front of Tian'anmen Square.";
Assert.assertTrue(wordBs.contains(text));

Configuration instructions

The description of each configuration is as follows:

No. | Method | Description
1 | ignoreCase | Ignore case
2 | ignoreWidth | Ignore full-width and half-width forms
3 | ignoreNumStyle | Ignore the way numbers are written
4 | ignoreChineseStyle | Ignore the writing style of Chinese (traditional or simplified)
5 | ignoreEnglishStyle | Ignore the writing style of English letters
6 | ignoreRepeat | Ignore repeated words
7 | enableNumCheck | Whether number detection is enabled. By default, 8 consecutive digits are treated as a sensitive word
8 | enableEmailCheck | Whether email detection is enabled
9 | enableUrlCheck | Whether link (URL) detection is enabled
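For a sense of what a number check might look like, the Java sketch below flags runs of consecutive ASCII digits using a regular expression. The threshold of 8 comes from the description above; treating longer runs as hits too is an assumption, and the class name NumCheckSketch is hypothetical. This is not the library's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; NOT the sensitive-word library's real code.
class NumCheckSketch {

    // A run of 8 or more ASCII digits (assumed threshold, see lead-in).
    private static final Pattern DIGIT_RUN = Pattern.compile("[0-9]{8,}");

    // Return every digit run in the text that meets the threshold.
    static List<String> findDigitRuns(String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = DIGIT_RUN.matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }
}
```

A check like this only sees plain digits, so it is most useful after the number-style normalization has already run.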

Dynamic loading (user-defined)

Scenario

Sometimes we want sensitive words to be loaded dynamically, for example edited in an admin console and then applied in real time.

V0.0.13 supports this feature.

Interface specification

To implement this feature and to be compatible with previous functionality, we defined two interfaces.

IWordDeny

The interface is as follows, and you can customize your own implementation.

Every word in the returned list is treated as a sensitive word.

/**
 * @author binbin.hou
 * @since 0.0.13
 */
public interface IWordDeny {

    /**
     * Get the result
     * @return the result
     * @since 0.0.13
     */
    List<String> deny();

}

For example:

public class MyWordDeny implements IWordDeny {

    @Override
    public List<String> deny() {
        return Arrays.asList("My custom sensitive words");
    }

}

IWordAllow

The interface is as follows, and you can customize your own implementation.

Every word in the returned list is treated as not sensitive.

/**
 * @author binbin.hou
 * @since 0.0.13
 */
public interface IWordAllow {

    /**
     * Get the result
     * @return the result
     * @since 0.0.13
     */
    List<String> allow();

}

For example:

public class MyWordAllow implements IWordAllow {

    @Override
    public List<String> allow() {
        return Arrays.asList("flag");
    }

}

Applying the configuration

Once the interfaces are implemented, they of course need to be specified in order to take effect.

To keep usage elegant, we designed the bootstrap class SensitiveWordBs.

You specify sensitive words through wordDeny(), non-sensitive words through wordAllow(), and build the sensitive-word dictionary through init().

The default configuration of the system

SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .wordDeny(WordDenys.system())
        .wordAllow(WordAllows.system())
        .init();

final String text = "The Five-Starred Red Flag is waving in the wind, and the portrait of Chairman Mao stands in front of Tian'anmen Square.";
Assert.assertTrue(wordBs.contains(text));

Note: init() builds the sensitive-word DFA, which is time-consuming. It is generally recommended to initialize only once per application rather than repeatedly.

Specify your own implementation

We can test our custom implementation as follows:

String text = "This is a test. My custom sensitive words.";

SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .wordDeny(new MyWordDeny())
        .wordAllow(new MyWordAllow())
        .init();

Assert.assertEquals("[My custom sensitive words]", wordBs.findAll(text).toString());

Here only "My custom sensitive words" is a sensitive word, and "test" is not.

Of course, this uses only our custom implementations. In general it is recommended to combine the system default configuration with custom configuration.

You can do so as follows.

Configure multiple at once

  • Multiple sensitive word sources

The WordDenys.chains() method combines multiple implementations into one IWordDeny.

  • Multiple whitelists

The WordAllows.chains() method combines multiple implementations into one IWordAllow.

Example:

String text = "This is a test. My custom sensitive words.";

IWordDeny wordDeny = WordDenys.chains(WordDenys.system(), new MyWordDeny());
IWordAllow wordAllow = WordAllows.chains(WordAllows.system(), new MyWordAllow());

SensitiveWordBs wordBs = SensitiveWordBs.newInstance()
        .wordDeny(wordDeny)
        .wordAllow(wordAllow)
        .init();

Assert.assertEquals("[My custom sensitive words]", wordBs.findAll(text).toString());

Here both the system default configuration and the custom configuration are in use.
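The idea of chaining several word sources and then subtracting a whitelist can be sketched without the library at all. In the Java example below, WordSource is a hypothetical stand-in interface (it is not IWordDeny or IWordAllow), and the rule "effective words are the deny words minus the allow words" is an assumption about the intended semantics rather than a statement of how the library computes it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; NOT the sensitive-word library's real code.
class ChainSketch {

    // Hypothetical stand-in for a source of words.
    interface WordSource {
        List<String> words();
    }

    // A fixed source, handy for examples.
    static WordSource of(final String... words) {
        return new WordSource() {
            @Override
            public List<String> words() {
                return Arrays.asList(words);
            }
        };
    }

    // Combine several sources into one; each source is re-read on every call.
    static WordSource chain(final WordSource... sources) {
        return new WordSource() {
            @Override
            public List<String> words() {
                List<String> all = new ArrayList<>();
                for (WordSource source : sources) {
                    all.addAll(source.words());
                }
                return all;
            }
        };
    }

    // Assumed semantics: deny words with whitelisted words removed.
    static Set<String> effectiveWords(WordSource deny, WordSource allow) {
        Set<String> result = new LinkedHashSet<>(deny.words());
        result.removeAll(allow.words());
        return result;
    }
}
```

Because the chained source asks each underlying source again whenever it is read, a source backed by a database would naturally pick up new rows the next time the lexicon is rebuilt.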

Spring integration

Background

In real-world use we want, for example, to edit the configuration on a page and have it take effect in real time.

The data is stored in a database. The following is a pseudocode example; you can refer to SpringSensitiveWordConfig.java.

Requires version v0.0.15 or later.

Custom data source

The simplified pseudocode is as follows. The source of the data is the database.

MyDdWordAllow and MyDdWordDeny are custom implementation classes that use the database as their source.

@Configuration
public class SpringSensitiveWordConfig {

    @Autowired
    private MyDdWordAllow myDdWordAllow;

    @Autowired
    private MyDdWordDeny myDdWordDeny;

    /**
     * Initialize the bootstrap class
     * @return the initialized bootstrap class
     * @since 1.0.0
     */
    @Bean
    public SensitiveWordBs sensitiveWordBs() {
        SensitiveWordBs sensitiveWordBs = SensitiveWordBs.newInstance()
                .wordAllow(WordAllows.chains(WordAllows.system(), myDdWordAllow))
                .wordDeny(myDdWordDeny)
                // various other configuration
                .init();

        return sensitiveWordBs;
    }

}

Initializing the sensitive lexicon is time-consuming. It is recommended to run init once when the application starts.

Dynamic changes

In order to ensure that changes to sensitive words can take effect in real time and that the interface is as simple as possible, there is no add/remove method here.

Instead, whenever sensitiveWordBs.init() is called, the sensitive lexicon is rebuilt from IWordDeny and IWordAllow.

Because initialization can take a long time (on the order of seconds), init is optimized so that the old lexicon keeps working until the rebuild completes; once it completes, the new lexicon takes over.
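A common way to get "old data keeps working until the new data is ready" is to build the new structure on the side and then swap a single reference. The Java sketch below shows that pattern with a plain set of words; it is an assumption about the general technique, not a description of how the library does it internally, and the class name ReloadSketch is hypothetical.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; NOT the sensitive-word library's real code.
class ReloadSketch {

    // Readers always see either the old set or the new set, never a half-built one.
    private volatile Set<String> words = Collections.emptySet();

    // Rebuild from the source, then publish the result with one reference write.
    void init(Collection<String> source) {
        Set<String> fresh = new HashSet<>(source); // the slow part, done on a private copy
        words = Collections.unmodifiableSet(fresh); // the swap
    }

    boolean isSensitive(String word) {
        return words.contains(word);
    }
}
```

With this shape there is no need for separate add or remove methods: every change is expressed as "rebuild, then swap", which keeps the interface small.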

@Component
public class SensitiveWordService {

    @Autowired
    private SensitiveWordBs sensitiveWordBs;

    /**
     * Update the lexicon
     *
     * Whenever the database changes, first call the method that updates the sensitive lexicon in the database.
     * If the change needs to take effect, then call this method.
     *
     * Note: re-initialization does not affect the old lexicon while it runs. Once initialization completes, the new lexicon takes over.
     */
    public void refresh() {
        // Whenever the database changes, update the sensitive lexicon in the database first, then call this method.
        sensitiveWordBs.init();
    }

}

As shown above, when the lexicon in the database changes and the change needs to take effect, you can actively trigger a re-initialization with sensitiveWordBs.init().

Other uses remain the same without restarting the application.

The crown xi elder brother smiled slightly, want to work, first life.

Further reading

Sensitive word tool implementation ideas

DFA algorithm

Sensitive lexicon optimization process

Notes on handling stop words

Summary

Again, we use the law to defend ourselves, but we must not allow some people to turn everything into entertainment, thinking that money can buy everything.

On the occasion of the centenary, we must not let our ancestors’ blood flow in vain.

Not to mention that this is a "three-nos" Canadian entertainer. My suggestion is to deal with him according to law, and then (ノ ‘pas) (beautiful Chinese words)