Sensitive word filtering algorithm comparison, incidentally open source a tool library

Comparison of sensitive word algorithms

At present, there are two kinds of sensitive word algorithms in the community: DFA(Deterministic Finite Automaton) algorithm and AC(Aho-Corasick) algorithm. Two representative articles are found in the gold mining community: “Js implementation of sensitive word filtering algorithm” and “open source a JavaScript version of sensitive word filtering library”

I have looked at both codes and made a simple comparison from my perspective (the DFA algorithm is the version after some changes based on the original author).

The current computer tested is a MacBook Pro (Retina, 15-inch, Mid 2015) with a 2.2ghz Intel Core I7 CPU

Code implementation

AC algorithm is relatively complex, so its implementation is also more complex, without a pen and paper is really difficult to understand. The DFA algorithm, on the other hand, is relatively straightforward, and you can roughly build the entire implementation logic by looking at the code. So from the implementation of the code and readability, DFA algorithm is relatively deep in my heart

DFA algorithm dominates

Code function

AC algorithm, the author provides many functions, such as support for fast query, temporary support to add words and so on, and the first kind of algorithm in my improved currently only supports the quick query, so the function level AC algorithm is slightly better, but does not mean that these functions are not achieve the DFA algorithm, just see you need not to need, need I will constantly improve in the open source library.

AC algorithm dominates

Algorithms in time and space

Here is the benchmark script I used. The test data can be seen here: Portal. The script is as follows:

const { makeSensitiveMap, checkSensitiveWord } = require('.. /dist/help')
const FastScanner = require('fastscan')
const fs = require('fs')
const path  = require('path')

function loadOneHundredThousandSentences() {
  const data = fs.readFileSync(path.resolve(__dirname, './sensitiveWords.txt'), 'utf8')
  const wordsArray = data.trim().split('|')
  console.log(`Now we have sensitive words length is: [${wordsArray.length}] `)return wordsArray
}

function loadOneHundredThousandWords() {
  const data = fs.readFileSync(path.resolve(__dirname, './beCheckedWords.txt'), 'utf8')
  const words = data.trim()
  console.log(`Now we have checking words length is: [${words.length}] `)return words
}

function benchmarkForDFAalgorithm() {
  const wordsArray = loadOneHundredThousandSentences()
  const before = process.memoryUsage()
  console.time('DFA algorithm load sensitive map tree')
  const wordMaps = makeSensitiveMap(wordsArray)
  console.timeEnd('DFA algorithm load sensitive map tree')
  const after = process.memoryUsage()
  console.log("DFA algorithm build tree of %d words costs rss=%dM heapTotal=%dM heapUsed=%dM", wordsArray.length, (after.rss-before.rss) >> 20, (after.heapTotal - before.heapTotal) >> 20, (after.heapUsed - before.heapUsed) >> 20)

  const toBeCheckedWords = loadOneHundredThousandWords()
  console.time('DFA algorithm check one hundred thousand words')
  checkSensitiveWord(toBeCheckedWords, false, wordMaps)
  console.timeEnd('DFA algorithm check one hundred thousand words')}function benchmarkForACalgorithm() {
  const wordsArray = loadOneHundredThousandSentences()
  const before = process.memoryUsage()
  console.time('AC algorithm load sensitive map tree')
  const scanner = new FastScanner(wordsArray)
  console.timeEnd('AC algorithm load sensitive map tree')
  const after = process.memoryUsage()
  console.log("AC algorithm build tree of %d words costs rss=%dM heapTotal=%dM heapUsed=%dM", wordsArray.length, (after.rss-before.rss) >> 20, (after.heapTotal - before.heapTotal) >> 20, (after.heapUsed - before.heapUsed) >> 20)

  const toBeCheckedWords = loadOneHundredThousandWords()
  console.time('AC algorithm check one hundred thousand words')
  scanner.search(toBeCheckedWords)
  console.timeEnd('AC algorithm check one hundred thousand words'// benchmarkForDFAalgorithm() // benchmarkForACalgorithm()Copy the code

We directly build with 100000 words as sensitive words, and input 100000 words to search, and get the test result as follows :(the memory data is run separately, or the test may not be accurate due to GC problems)

Now we have sensitive words length is: [107888] DFA algorithm load sensitive map tree: 244.374 MS DFA algorithm build tree of 107888 words costs RSS =121M heapTotal=117M heapUsed=98M Now we have checking words Length is: [100000] DFA algorithm check one hundred thousand words: 98.768ms Now we have sensitive words Length is: [107888] AC algorithm load sensitive map tree: 1529.913 MS AC algorithm build tree of 107888 words costs RSS =174M heapTotal=161M heapUsed=136M Now we have checking Words Length is: [100000] AC algorithm check one hundred thousand words: 98.532msCopy the code

According to the above test results, AC test results are roughly consistent with those provided by the original author, and it can be seen that DFA algorithm is far better than AC algorithm in terms of word tree construction and memory occupancy, and the two algorithms are comparable in search execution, so DFA algorithm is dominant in this round

DFA algorithm dominates

conclusion

In conclusion, the conclusions are as follows:

If you want to judge a small number of words simply, you can use both
If your sensitive vocabulary is large, it is recommended that you use the DFA algorithm

Js Implementation of Sensitive Word Filtering AlgorithmCode changes to

The author of the original text has introduced a lot of ideas of DFA algorithm, so I will not repeat them here. Because the author divided the checkable function into two, it is easy to omit the second function filterSensitiveWord. So I want to integrate into a function, at first THOUGHT to use recursive tail call to implement, pseudo-code implementation is as follows:

checkSensitiveWord(sensitiveMap: wordMap, txt: string, index: number) {
  let currentMap = sensitiveMap;
  // const matchWords = new Map()
  letwordNum = 0; // Record filteringlet sensitiveWord = ' '; // Record the filtered sensitive words // recursion to the end, end recursionif (index === txt.length) {
    return
  }
  for (let i = index; i < txt.length; i++) {
    const word = txt.charAt(i);
    currentMap = currentMap.get(word);
    if (currentMap) {
      wordNum++;
      sensitiveWord += word;
      if (currentMap.get('laster') = = =true) {// indicates the end of the word, the sensitive word is stored in the stored code.... // Continue the recursionreturn this.checkSensitiveWord(sensitiveMap, txt, index + 1)
      }
    } else {
      return this.checkSensitiveWord(sensitiveMap, txt, index + 1)
    }
  }
}
Copy the code

As a result, nodeJS no longer supports recursive tail-call optimization starting with version 8, so this code works fine if it checks for a small amount of text, but when it gets large, it can easily overflow the stack.

So stick to the loop and add support for quick searches, and the final code looks like this

/** * check whether the text to be searched contains sensitive words * @param TXT text to be searched for sensitive words * @param sensitiveWordsMap sensitiveWordsMap structure, allowing you to define, MakeSensitiveMap generates * @param isQuickSearch with makeSensitiveMap, and returns the value foundtrueAnd vice isfalse* /export function checkSensitiveWord(
  txt: string,
  isQuickSearch = false, sensitiveWordsMap? : wordMap) { const defaultMap = () => { const data = fs.readFileSync(path.resolve(__dirname,'.. /utils/sensitiveWords.txt'), 'utf8')
    const wordsArray = data.trim().split('|')
    return makeSensitiveMap(wordsArray)
  }
  const _sensitiveWordsMap = sensitiveWordsMap ? sensitiveWordsMap : defaultMap()
  const matchWords = new Map()
  for (let i = 0; i < txt.length; i++) {
    let currentMap = _sensitiveWordsMap;
    let sensitiveWord = ' '; // Record the sensitive words filtered outfor (let j = i; j < txt.length; j++) {
      const word = txt.charAt(j);
      currentMap = currentMap.get(word);
      if (currentMap) {
        sensitiveWord += word;
        if (currentMap.get('isEnd') = = =true) {// If it is a quick search, do not care about the search results of sensitive words, find a return, suitable for normal input detectionif (isQuickSearch) {
            return true} const isExist = matchWords. Get (sensitiveWord)if (isExist) {
            isExist.push({ location: i })
            matchWords.set(sensitiveWord, isExist)
          } else {
            matchWords.set(sensitiveWord, [{ location: i }])
          }
          break}}else {
        break; }}} // If a quick query is not returned, the sensitive word is not foundif (isQuickSearch) {
    return false
  }
  return matchWords
}
Copy the code

For a complete example, see awesome-js

Incidentally, open source a tool library

These functions are not just pasted and copied directly, but belong to the awesome-js library, which contains many common functions that I use in business development. It is divided into four parts: Math tools, regular expression tools, accessibility tools, and HTTP processing tools.

How do I use this library?

download

npm install awesome-js --save

yarn add awesome-js
Copy the code

use

Use import {AwesomeHelp} from ‘awesome-js’

What functionality does it offer?

Thanks to the types definition of TS, we just need to check out the types file, so we can post it here for your convenience

exportinterface Deferred { resolve: (value? : any) => any reject: (reason? : any) => void promise: Promise<any> }type wordMap = Map<string, recursiveMap | boolean>

interface recursiveMap extends wordMap {}

exportNamespace AwesomeRegx {/** * @description Matches the mobile phone number */exportconst phoneNumber: RegExp; /** * @description matches the Emoji character */exportconst isEmoji: RegExp; /** * @description Private phone number, which will replace the middle four digits of the phone number with ** /exportconst privacyMobile: (mobile: string) => string; /** * @description name desensitization, replace all non-null characters except the first word with ** /exportconst privacyName: (name: string) => string; /** * @description Matches Chinese characters and full-angle characters */exportconst chineseAndfullWidthChar: RegExp; /** * @description matches the SRC attribute in the image tag, often used to remove the SRC HTTP */exportconst imgSrc: RegExp; /** * @description simple matching id number */exportconst simpleIdentityNo: RegExp; /** * @description matches the Chinese name */exportconst chineseName: RegExp; /** * @description matches a positive integer */exportconst positiveInteger: RegExp; /** * @description matches the integer */export const integer: RegExp; /** * @description matches a negative integer */exportconst negativeInteger: RegExp; /** * @description matches a non-negative integer */exportconst nonnegativeInteger: RegExp; /** * @description matches a non-positive integer */exportconst nonPostiveInterger: RegExp; /** * @description matches a positive floating point number */exportconst postiveFloat: RegExp; /** * @description matches a negative float */exportconst negativeFloat: RegExp; /** * @description matches a floating point */export const float: RegExp; /** * @description matches a non-negative float */exportconst nonNegativeFloat: RegExp; /** * @description matches a non-positive floating point number */exportconst nonPositiveFloat: RegExp; /** * @description Contains 26 letters */exportconst alphabat: RegExp; /** * @description Matches uppercase letters */exportconst upperAlpha: RegExp; /** * @description Matches lowercase letters */exportconst lowerAlpha: RegExp; /** * @description Matches letters and digits and underscores */exportconst alphaNumWithUnderline: RegExp; /** * @description matches double-byte characters */exportconst DBC: RegExp; /** * @description matches an empty line */exportconst emptyLine: RegExp; /** * @description matches the string */ with whitespace at the beginning or endexportconst emptyCharInStartAndEnd: RegExp; /** * @description Matches Chinese characters */exportconst chinese: RegExp; /** * @description Matches the mailbox */exportconst email: RegExp; /** * @description matches the url */exportconst url: RegExp; /** * @description Matches the IP address */exportconst ip: RegExp; /** * @description matches the phone landline */exportconst telPhone: RegExp; /** * @description matches the zip code */export const postalCode: RegExp;
}
exportNamespace AwesomeHelp {/** * @description Classifies array objects based on the values of some fields of the object * @param List The array objects to classify (must be an array) * @param fields Fields to classify (you must pass a function that supports multiple fields) */export functiongroupBySomeFields<T>(list: T[], fields: (item: T) => any[]): T[][] /** * @description extends Date, converts Date to String of specified format * @param Date converts Date to String of specified format * @param format Converts Date to last format, Such as YYYY - MM - DD * /export functionConvertDate (date: date, format: string): string /** * @description Add floating point numbers */export functionAddFloat (arg1: number, arg2: number): number /** * @description Floating point number minus */export functionMinusFloat (arg1: number, arg2: number): number /** * @description Floating point number divided by */export functionDivFloat (arg1: number, arg2: number): number /** * @description Float number multiplied */export function timesFloat(arg1: number, arg2: number): number

  export functionMakeDeferred (): Deferred /** * @description check if it is a generator */export functionIsGenerator (obj: any): Boolean /** * @description check whether the function isGenerator */export functionIsGeneratorFunction (obj: any): Boolean /** * @description Check whether it is a Promise */export functionIsPromise (obj: any): Boolean /** * @description 1000 count */export functionToThousands (num: number): string /** * To hide all the digits except for the specified digit, for example, to convert 100000 zeros to? Will do this, then call hiddenNumberExpectSpecified (100000, 0,'? ') = > 1??????? @param hiddenStr = @param hiddenStr = @param hiddenStr = @param hiddenStr = @param hiddenStr = @param hiddenStr = @param hiddenStr * /export functionhiddenNumberExpectSpecified(num: number, expected: number, hiddenStr: string): String /** * combines all sensitive words into a nested Map structure using the DFA data structure algorithm * @param sensitiveWordList */export functionmakeSensitiveMap(sensitiveWordList: string[]): WordMap /** * Check whether the text to be searched contains sensitive words * @param TXT Text to be searched for sensitive words * @param sensitiveWordsMap Map structure of sensitive words, allowing custom, MakeSensitiveMap; makeSensitiveMap; makeSensitiveMap; makeSensitiveMap; makeSensitiveMapfalseIf yes, the found value is returnedtrueAnd vice isfalse* /export functioncheckSensitiveWord( txt: string, isQuickSearch? : null, sensitiveWordsMap? : wordMap): Map<string, { location: number}[] >export functioncheckSensitiveWord( txt: string, isQuickSearch: boolean, sensitiveWordsMap? : wordMap): boolean }export namespace AwesomeMath {
  exportClass Region {constructor(points: number[][]) /** * @constructor (points: number[]) */ public centroid: () => {x: Number, y: number} /** * @description @param {number} lng1 latitude @param {number} lat1 latitude @param {number} lat1 latitude @param {number} lat1 latitude @param {number} Lng2 terminal latitude * @param {number} lat2 terminal latitude * @returns {number} The straight-line distance between two points, in meters */export functionGetDistance (lng1: number, lat1: number, lng2: number, lat2: number): number /** * Convert longitude or latitude to a map-aware format * @param origin */export functionDecodeLatLng (Origin: number): number /** ** @param Origin convert longitude or latitude to integer format */export function encodeLatLng(origin : number): number
}

exportNamespace AwesomeHttp {/** * @description Specifies a parameter in the query request to update the URL. This parameter can be used with replaceState to update the browser history @param key Key to be updated * @param value New value */export functionupdateQueryStringParam(baseUrl: string, key: string, value: any): String /** * @description parses queryObject and appends it to path */export function queryObject2String(path: string, queryObject: object): string
}
Copy the code

The last

Welcome to PR, add more functions, convenient for you and me. I will update the tool library constantly, welcome to Watch