This is the fourth day of my participation in the August Text Challenge.More challenges in August

The introduction

I’ve been planning on doing this for a long time, and I’ve had time to study it recently. A lot of information can be found on the net commonly, here simply say my understanding.

PS: Mobile phone number matching uses regular expression, which does not belong to the category of sensitive words, please note.

In order to shield some scalpers to promote advertising, similar to QQ, wechat, mobile phone number… Wait, we want to replace them all with *. For the sake of simplicity, using wechat as an example (not discrimination), we will encounter the following situations:

  • Chinese
    • Simplified characters: WeChat
    • Traditional Chinese: Wechat
    • Mars text (deformation or homonym) : 嶶 letter, prestige letter
  • With a special symbol in the middle
    • Half Angle special symbol(within ASCII) : *&! #@(){}[] etc., such as wechat, wechat && wechat, wechat _ letter
    • The Angle of(other than ASCII) : Punctuation marks in Chinese, some emotion, etc., such as wechat — letter, wechat 😊 letter.
  • acronym:
    • Half Angle: such as VX and WX.
    • Full Angle: VX, Wx.

There are so many variations and combinations that we need to make trade-offs when implementing them. What if we want to filter for sensitive words in simplified characters, special half-corners and acronyms?

String lookup KMP

If performance is not considered, just use the contians() function of String, which contains as many times as there are sensitive words. However, if there are thousands of words of text to be filtered and tens of thousands of sensitive words, this scheme is definitely not suitable.

Look at how the Trie tree solves this problem.

Trie tree

A Trie tree, or dictionary or prefix tree, is a tree structure. It is widely used to count and sort large numbers of strings (but not limited to strings), so it is often used for text word frequency statistics by search engine systems. It has the advantage of minimizing unnecessary string comparison and high query efficiency.

The core idea of Trie is to use the common prefix of the string to reduce the cost of query time to improve efficiency.

Basic properties of Trie trees

  1. The root node contains no characters, and each node except the root node contains only one character.
  2. From the root node to a node, the characters on the path are connected to the string corresponding to the node.
  3. All children of each node contain different characters
  4. Characters that are repeated continuously from the first character occupy only one node, such as to and ten above, and the repeated word t occupies only one node.

So each node should have two fields:

  1. End_flag logo: Indicates whether to end, namely, the leaf node.
  2. map<key, value>: Stores the value of the current node and the value of its children.

introductory

LeetCode 208. Implementing Trie (prefix tree)

Implement a Trie (prefix tree) with insert, search, and startsWith operations. Example:

Trie trie = new Trie(a); trie.insert("apple");
trie.search("apple");   / / return true
trie.search("app");     / / returns false
trie.startsWith("app"); / / return true
trie.insert("app");   
trie.search("app");     / / return true
Copy the code

Description:

  • You can assume that all input is made up of lowercase letters a-z.
  • Ensure that all inputs are non-empty strings.

Analysis and Implementation

Let’s start by drawing it as a trie tree (using the words apple and app2 in the title) :throughInsert (" apple ")andInsert (" app ")With 2 function calls, we build the trie tree shown in the figure above. We found that:

  • Apple and app share 1 prefix, no extra node is created, just an extra # end node.
  • There are 2 # nodes. As long as # is encountered, it represents the end of the match. There should be as many # nodes as there are keywords.

Of course, there is another benefit to adding an end node: we can determine whether it is startsWith or equal.

Data structure definition

Because root contains no characters, you can define a TrieNode to represent a node and a Trie to represent a tree.

TrieNode:

#include <string>
#include <unordered_map>

// represents a node
class TrieNode {
public:
  // Add child nodes
  void addSubNode(const char &c, TrieNode *subNode) { 
    subNodes_[c] = subNode; 
  }

  // Get the child node
  TrieNode *getSubNode(const char &c) { 
    return subNodes_[c];
  }

private:
  // Child node dictionary (key is the child character, value is the child node)
  std::unordered_map<char, TrieNode *> subNodes_;
};
Copy the code

Trie:

// represents a Trie tree
class Trie {
public:
  // Inserts a word into the trie.
  void insert(std::string word);

  // Returns if the word is in the trie.
  bool search(std::string word);

  // Returns if there is any word in the trie that starts with the given prefix.
  bool startsWith(std::string prefix);
  
private:
  TrieNode *root_; // Root node, does not store characters
};
Copy the code

Insert implementation

There is a root field, which represents the root node. Now look at the insert operation:

void Trie::insert(std::string word) {
  TrieNode *curNode = root_;
  // Iterate over the string with the character as the key (note that Chinese usually have 3 bytes, so it will become 3 nodes, but does not affect the match)
  for (int i = 0; i < word.length(a); i++) {char c = word[i];
    TrieNode *subNode = curNode->getSubNode(c);

    // If there is no such node, create a new one
    if (subNode == nullptr) {
      subNode = new TrieNode(a); curNode->addSubNode(c, subNode);
    }
    // Drill down to the child node
    curNode = subNode;
  }
  // Set the end flag
  curNode->addSubNode(' # ',new TrieNode());
}
Copy the code

Search implementation

bool Trie::search(std::string word) {
  TrieNode *curNode = root_;
  for (int i = 0; i < word.length(a); i++) { curNode = curNode->getSubNode(word[i]);
    if (curNode == nullptr)
      return false;
  }
  return curNode->getSubNode(The '#') != nullptr;
}
Copy the code

StartsWith implementation

bool Trie::startsWith(std::string prefix) {
  TrieNode *curNode = root_;
  for (int i = 0; i < prefix.length(a); i++) { curNode = curNode->getSubNode(prefix[i]);
    if (curNode == nullptr)
      return false;
  }
  // The difference with search is that it does not determine whether the next node is the end node
  return true;
}
Copy the code

If run, the following output is displayed:

int main(a) {
  Trie t;
  t.insert("apple");
  printf("%d \n", t.search("apple"));   / / return true
  printf("%d \n", t.search("app"));     / / returns false
  printf("%d \n", t.startsWith("app")); / / return true
  t.insert("app");
  printf("%d \n", t.search("app"));     / / return true
  printf("%d \n", t.search("this is apple")); // Return false, why?
}
Copy the code
$ ./main 
1 
0 
1 
1 
0 
Copy the code

So, we have a question: why does the last “this is apple” return false when it clearly contains “apple”?

It’s pretty simple, it’s all default sensitive words at the beginning of the string, we just add a pointer and keep searching.

Sensitive word search

Ideas:

  1. First, p1 points to root, and p2 and P3 point to the first character in the string.
  2. The algorithm starts with character T and detects whether there is a sensitive word prefixed with t. Here, it can directly determine whether there is a child node t in root. I don’t have one here, so I’m going to shift both P2 and P3 to the right.
  3. Keep moving p2 and P3 and find the presence of sensitive words prefixed with a, thenJust move right p3 to see if the string between p2 and p3 is a sensitive word (complete, of course). If a sensitive word is found in a string, it can be replaced with another string such as ***. Then keep looping until the entire string is iterated.

The code implementation is as follows:

bool Trie::search(std::string word) {
  // TrieNode *curNode = root_;
  // for (int i = 0; i < word.length(); i++) {
  // curNode = curNode->getSubNode(word[i]);
  // if (curNode == nullptr)
  // return false;
  // }
  // return curNode->getSubNode(kEndFlag) ! = nullptr;

  // Convert to lowercase
  transform(word.begin(), word.end(), word.begin(), ::tolower);
  bool is_contain = false;
  for (int p2 = 0; p2 < word.length(a); ++p2) {int wordLen = getSensitiveLength(word, p2);
    if (wordLen > 0) {
      is_contain = true;
      break; }}return is_contain;
}

// Add a function to return the position and length of sensitive words for easy substitution or detection logic
int Trie::getSensitiveLength(std::string word, int startIndex) {
  TrieNode *p1 = root_;
  int wordLen = 0;
  bool endFlag = false;

  for (int p3 = startIndex; p3 < word.length(a); ++p3) {const char &cur = word[p3];
    auto subNode = p1->getSubNode(cur);
    if (subNode == nullptr) {
      break;
    } else {
      ++wordLen;
      // The whole thing contains sensitive words until the tail is located
      if (subNode->getSubNode(kEndFlag)) {
        endFlag = true;
        break;
      } else{ p1 = subNode; }}}// Note that the tail is not found
  if(! endFlag) { wordLen =0;
  }
  return wordLen;
}
Copy the code

About time complexity:

  • The time complexity of building sensitive words is negligible because we can use them countless times after building them.
  • If the length of the string is N, the time complexity of each sensitive word search is O(n).

The performance test

For testing purposes, we add another function:

std::set<SensitiveWord> Trie::getSensitive(std::string word) {
  // Convert to lowercase
  transform(word.begin(), word.end(), word.begin(), ::tolower);
  std::set<SensitiveWord> sensitiveSet;

  for (int i = 0; i < word.length(a); ++i) {int wordLen = getSensitiveLength(word, i);
    if (wordLen > 0) {
      // Record the index and length of the sensitive words found
      std::string sensitiveWord = word.substr(i, wordLen);
      SensitiveWord wordObj;
      wordObj.word = sensitiveWord;
      wordObj.startIndex = i;
      wordObj.len = wordLen;

      // Insert into set
      sensitiveSet.insert(wordObj);
      i = i + wordLen - 1; }}return sensitiveSet;
}
Copy the code

The test code is as follows:

void test_time(Trie &t) {
  auto t1 = std::chrono::steady_clock::now(a);auto r = t.getSensitive("SHit, you you you you you are stupid force you, say you, you big idiot.");
  for (auto &&i : r) {
    std::cout << "[index=" << i.startIndex << ",len=" << i.len
              << ",word=" << i.word << "],";
  }
  std::cout << std::endl;

  // run code
  auto t2 = std::chrono::steady_clock::now(a);/ / millisecond
  double dr_ms = std::chrono::duration<double, std::milli>(t2 - t1).count(a); std::cout <<"Time (ms) :" << dr_ms << std::endl;
}

int main(a) {
  Trie t;

  t.insert("You're an idiot.");
  t.insert("You're an idiot.");
  t.insert("You're bad.");
  t.insert("You big idiot.");
  t.insert("I bought a watch last year.");
  t.insert("shit");

  test_time(t);
}
Copy the code

Output:

$./main [index=0,len=4,word=shit],[index=16,len=12,word= you idiot],[index=49,len=15,word= you idiot], time (ms) : 0.093765Copy the code

I’m curious to see how string implements the find function:

void test_time_by_find(a) {
  auto t1 = std::chrono::steady_clock::now(a); std::string origin ="SHit, you you you you you are stupid force you, say you, you big idiot.";
  std::vector<std::string> words;
  words.push_back("You're an idiot.");
  words.push_back("You're an idiot.");
  words.push_back("You're bad.");
  words.push_back("You big idiot.");
  words.push_back("I bought a watch last year.");
  words.push_back("shit");

  for (auto &&i : words) {
    size_t n = origin.find(i);
    if(n ! = std::string::npos) { std::cout <<"[index=" << n << ",len=" << i.length() < <",word=" << i
                << "],";
    }
  }
  std::cout << std::endl;

  // run code
  auto t2 = std::chrono::steady_clock::now(a);/ / millisecond
  double dr_ms = std::chrono::duration<double, std::milli>(t2 - t1).count(a); std::cout <<"Time (ms) :" << dr_ms << std::endl;
}
Copy the code

Output:

$$./main [index=0,len=4,word=shit],[index=16,len=12,word= you idiot],[index=49,len=15,word= you idiot] 0.113505 [index=16,len=12,word= shit], [index=16,len=15,word= shit], [index=49,len=15,word= shit], [index=0,len=4,word=shit], Time: 0.021829Copy the code

The find version of string takes 0.021829 milliseconds, which is 5 times faster. Why is that?

Chinese is replaced by the implementation of *

The getSensitive() function shows the location and length of sensitive words. How do we replace them with asterisks?

SHit, you, you, you are an idiot you, you, you big idiot.Copy the code
int main(a) {
    Trie t;
    t.insert("You're an idiot.");
    t.insert("You big idiot.");
    t.insert("shit");

    std::string origin = "SHit, you you you you you are stupid force you, say you, you big idiot.";
    auto r = t.getSensitive(origin);

    for (auto &&i : r) {
        std::cout << "[index=" << i.startIndex << ",len=" << i.len
                  << ",word=" << i.word.c_str() < <"]," << std::endl;
    }

    std::cout << t.replaceSensitive(origin) << std::endl;

    return 0;
}
Copy the code

After running, we get a set of sensitive word information, including starting position, length:

[index=0,len=4,word=shit], [index=16,len=12,word= you idiot], [index=49,len=15,word= you idiot],Copy the code

There is a problem here, because Linux uses UTF8, 1 Chinese character takes up 3 bytes, which results in * being twice as much if we simply iterate over the substitution.

# error, part of Chinese character * has been tripled****, you you ************ oh you, you, ***************.# hope****, you you **** oh you, you, *****.Copy the code

Before we solve this problem, let’s look at the relationship between Unicode, UTF8, and Chinese characters.

Unicode

Unicode is an industry standard in computer science, including character set, encoding scheme, etc. Unicode was created to overcome the limitations of traditional character encoding schemes. It provides a unified and unique binary encoding for each character in each language to meet the requirements of cross-language and cross-platform text conversion and processing. It was developed in 1990 and officially announced in 1994.

Let’s take a look at the Unicode encoding of It’s Zhihu daily.

I 0049
t 0074
'0027s 0073 0020 know 77E5 on 4E4E 65E5 report 62a5Copy the code

Each character corresponds to a hexadecimal number.

Computers only understand binary, so, strictly in Unicode fashion (UCS-2), they should be stored like this:

I 00000000 01001001
t 00000000 01110100
'00000000 00100111 s 00000000 01110011 00000000 00100000 know 01110111 11100101 about 01001110 01001110 date 01100101 11100101 reported 01100010, 10100101,Copy the code

This string takes up 18 bytes in total, but the first nine bits of English are 0! Waste. Waste of hard drives. Waste of traffic. How to do? UTF.

UTF8

Utf-8 (8-bit Unicode Transformation Format) is a Unicode variable length character encoding and prefix code. It is also called universal code. Founded by Ken Thompson in 1992. It can be used to represent any character in the Unicode standard, and the first byte of its encoding remains ASCII compatible, allowing original ASCII software to continue to be used with little or no modification. As a result, it is becoming the preferred encoding for E-mail, web pages and other applications that store or transmit text.

Utf-8 does this:

  1. The first byte is set to 0. For English text, UTF-8 takes only one byte, the same as ASCII.

  2. The first n bits of the first byte are set to 1, the n+1 bits are set to 0, and the first two bits of the following bytes are set to 10. The remainder of the n bytes will fill the unicode code, and the high bits will be filled with zeros.

This results in the following UTF-8 flag bits:

0xxxxxxx 110xxxxx 10xxxxxx 1110xxxx 10xxxxxx 10xxxxxx 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx ... .Copy the code

Thus, “It’s Zhihu Daily” became:

I 01001001
t 01110100
'00100111 s 01110011 00100000 know 11100111 10011111 10100101 about 11100100 10111001 10001110 date 11100110 10010111 10100101 reported 11100110 10001010 10100101Copy the code

Compared to the above scheme, English is shorter, but each Chinese character is one byte longer. But the entire string is only 17 bytes, a little short of 18.

Part of Chinese character coding range

Unicode symbol range (hexadecimal) Utf-8 encoding (binary)
0000 0000-0000 007F (1 byte) 0xxxxxxx
0000 0080-0000 07FF (2 bytes) 110xxxxx 10xxxxxx
0000 0800-0000 FFFF (3 bytes) 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF (4 bytes) 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The Chinese character “yan” is taken as an example to demonstrate how to implement UTF-8 encoding.

Given that the unicode for Yan is 4E25 (100111000100101), we can see from the table above that 4E25 is in the range of the third line (0000 0800-0000 FFFF), so the UTF-8 encoding for Yan requires three bytes. That is, the format is 1110XXXX 10XXXXXX 10XXXXXX.

Then, starting with the last binary bit of strict, the x in the format is filled in from back to front, and the extra bits are filled in with zeros. As a result, the UTF-8 code for “yan” is “11100100 10111000 10100101”, which translates into hexadecimal E4B8A5.

     100   111000   100101
1110xxxx 10xxxxxx 10xxxxxx # From back to front
1110x100 10111000 10100101 So this extra x is going to add 0
11100100 10111000 10100101
Copy the code

Implement the Replacement of the Chinese character with *

Method 1: Manually process the data by judging the Chinese characters and the number of bytes occupied

/** @fn * @brief In Linux, the Chinese version takes up three bytes, and the Windows version takes up two bytes. https://blog.csdn.net/taiyang1987912/article/details/49736349 * @ param [in] the STR: string * @ return * /
std::string chinese_or_english_append(const std::string &str) {
    std::string replacement;
    //char chinese[4] = {0};
    int chinese_len = 0;
    for (int i = 0; i < str.length(a); i++) {unsigned char chr = str[i];
        int ret = chr & 0x80;
        if(ret ! =0) { // chinese: the top is 1
            if (chr >= 0x80) {
                if (chr >= 0xFC && chr <= 0xFD) {
                    chinese_len = 6;
                } else if (chr >= 0xF8) {
                    chinese_len = 5;
                } else if (chr >= 0xF0) {
                    chinese_len = 4;
                } else if (chr >= 0xE0) {
                    chinese_len = 3;
                } else if (chr >= 0xC0) {
                    chinese_len = 2;
                } else {
                    throw std::exception();
                }
            }
            / / to skip
            i += chinese_len - 1;

            //chinese[0] = str[i];
            //chinese[1] = str[i + 1];
            //chinese[2] = str[i + 2];
        } else /** ascii **/ {
        }
        replacement.append("*");
    }
    return replacement;
}

std::string Trie::replaceSensitive(const std::string &word) {
    std::set<SensitiveWord> words = getSensitive(word);
    std::string ret;
    int last_index = 0;
    for (auto &item : words) {
        std::string substr = word.substr(item.startIndex, item.len);
        std::string replacement = chinese_or_english_append(substr);

        // Original content
        ret.append(word.substr(last_index, item.startIndex - last_index));

        // Replace the content
        ret.append(replacement);
        last_index = item.startIndex + item.len;
    }

    // append reset
    ret.append(word.substr(last_index, word.length() - last_index));

    return ret;
}
Copy the code

Method 2 (recommended) : Use wString instead of string. When traversing wstring, item is of type wchar_t (4 bytes), and wchar_t is directly used as the key of unordered_map, so there is no special need to deal with the substitution of * in Chinese.

Pause word realization

How do I continue to recognize special characters such as wechat, wechat – wechat, wechat && letter? It’s easy to just ignore the word and look it up:

int Trie::getSensitiveLength(std::string word, int startIndex) {
    TrieNode *p1 = root_;
    int wordLen = 0;
    bool endFlag = false;

    for (int p3 = startIndex; p3 < word.length(a); ++p3) {const char &cur = word[p3];
        auto subNode = p1->getSubNode(cur);
        if (subNode == nullptr) {
        		// When a pause word is encountered, skip the word and continue searching the string
            if (cur == '&' ||
                cur == The '-'||
                cur == ' ') {
                ++wordLen;
                continue;
            }

            break;
        } else {
            // ...}}// ...
}
Copy the code

Full Angle and half Angle implementation

Reference: C/C++ — Check the presence of Chinese and sensitiveWd-filter in the string

In addition to special symbols, there is a special case that will disable our filtering algorithm, and that is full-angle (except ASCII), as follows:

  • Full Angle of letters:
    • Uppercase: ABCDEFG
    • Lowercase: abcdefg
  • Symbol:
    • Chinese punctuation: [],. ? ! ()… -; ‘, etc.

In games, we often see the use of full corners to bypass the sensitive word filter, such as mixing half corners with full corners: Fuck, V, etc.

In fact, like case processing, we only need to convert the full corner letters and some Chinese symbols into the corresponding half corner. Other unconventional full-angle special symbols can be added directly to the pause word.

By looking at the ASCII encoding table and Unicode encoding table above, we find that there is a certain correspondence between them:

ASCII UNICODE
! (0 x21) ! (0 xff01)
“(0 x22) “(0 xff02)
# (0 x23) # (0 xff03)
. .
A (0 x41) A (0 xff21)
B (0 x42) B (0 xff22)
. .
~ (0 x7e) ~ (0 xff5e)

ASCII! The characters between ~ correspond to the Unicode characters 0xFF01 through 0xFF5E, so converting a full Angle to a half Angle only requires subtracting a fixed offset.

const wchar_t kSBCCharStart = 0xFF01; / / whole Angle!
const wchar_t kSBCCharEnd = 0xFF5E; / / the whole Angle ~
// The value of full-angle space, which does not comply with the relative offset from ASCII and must be processed separately
const wchar_t kSBCSpace = 0x508;    // Full space

// The relative offset of visible characters except whitespace from corresponding full-corner characters in the ASCII table
const wchar_t kConvertStep = kSBCCharEnd - kDBCCharEnd;

int SBCConvert::qj2bj(const wchar_t &src) {
    // offset to the corresponding ASCII half Angle
    if (src >= kSBCCharStart && src <= kSBCCharEnd) {
        return (wchar_t) (src - kConvertStep);
    } else if (src == kSBCSpace) { // if it is full space
        return kDBCSpace;
    }
    return src;
}
Copy the code

Why does it return an int?

In fact, the principle behind the text we see is just a string of numbers (encoded in Unicode)

// In Linux, string is utF8 encoding
int main(a) {
    std::wstring str = L "yan ~ A";
    for (wchar_t &item : str) {
        int unicode = static_cast<int>(item);
        std::cout << std::hex << unicode << std::endl;
    }
    return 0;
}
Copy the code
4e25
ff5e
41
Copy the code

Again, you can go back:

#include <string>
#include <locale>
#include <codecvt>

std::string ws2s(const std::wstring &wstr) {
    using convert_typeX = std::codecvt_utf8<wchar_t>;
    std::wstring_convert<convert_typeX, wchar_t> converterX;

    return converterX.to_bytes(wstr);
}

int main(a) {
    wchar_t ws[4] = {};
    ws[0] = 0x4e25;
    ws[1] = 0xff5e;
    ws[2] = 0x41;
    std::cout << ws2s(ws) << std::endl;
    return 0;
}
Copy the code
Yan ~ ACopy the code

Sensitive word library

  • Chinese sensitive thesaurus

  • mgck

  • Textfilter: several implementations of sensitive word filter + a 1W sensitive word library

  • Sensitive -stop-words: a glossary of commonly used sensitive Internet words

  • Keywordfilter: 5W illegal words

conclusion

Time complexity comparison of algorithms:

KMP Trie tree DFA AC automata
O(LenA + LenB) * m log(n) The unknown The unknown

Finally, the C++ implementation code please move:

  • Github-cpp-dirtyfilter

reference

Article:

  • KMP+Trie+AC automata Summary (String board)
  • AC automata algorithm in detail (diagram) and template
  • DFA Algorithm for Sensitive Word Filtering (Dictionary tree)
  • String pattern matching (KMP) algorithm
  • The next array of KMP is worth finding
  • AC automata Summary (Hyperdetailed notes)
  • sensitive-stop-words
  • sensitivewd-filter
  • Recognition of Chinese characters in Linux
  • What is the difference between Unicode and UTF-8?
  • C/C++ – Determines the presence of Chinese characters in the string
  • Explain the KMP algorithm

Video:

  • B station video: “day qin open class” KMP algorithm easy to understand version, comments: animation cattle force, but the speed of speech a little fast. Suitable for entry
  • B station video: KMP algorithm to calculate the next function value (textbook version, super simple!) Review the public prefix.
  • KMP string matching algorithm 1, comments: Introduce the concept of prefix, in-depth understanding of prefix suffix. The illustrations are very graphic and easy to understand and highly recommended.

Web site:

  • Online Unicode character table: unicode-table.com/