Preface.

Like most programmers, when this requirement first came up I reached for the usual magic tools, Google and Baidu, tried many approaches, and finally settled on this one:

+ (BOOL)stringContainsEmoji:(NSString *)string {
    // Filter all emoji. returnValue: NO means no emoji is present, YES means one is present.
    __block BOOL returnValue = NO;
    [string enumerateSubstringsInRange:NSMakeRange(0, [string length])
                               options:NSStringEnumerationByComposedCharacterSequences
                            usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
        const unichar hs = [substring characterAtIndex:0];
        // surrogate pair
        if (0xd800 <= hs && hs <= 0xdbff) {
            if (substring.length > 1) {
                const unichar ls = [substring characterAtIndex:1];
                const int uc = ((hs - 0xd800) * 0x400) + (ls - 0xdc00) + 0x10000;
                if (0x1d000 <= uc && uc <= 0x1f77f) {
                    returnValue = YES;
                }
            }
        } else if (substring.length > 1) {
            const unichar ls = [substring characterAtIndex:1];
            if (ls == 0x20e3) {
                returnValue = YES;
            }
        } else {
            // non surrogate
            if (0x2100 <= hs && hs <= 0x27ff) {
                returnValue = YES;
            } else if (0x2B05 <= hs && hs <= 0x2b07) {
                returnValue = YES;
            } else if (0x2934 <= hs && hs <= 0x2935) {
                returnValue = YES;
            } else if (0x3297 <= hs && hs <= 0x3299) {
                returnValue = YES;
            } else if (hs == 0xa9 || hs == 0xae || hs == 0x303d || hs == 0x3030 || hs == 0x2b55 || hs == 0x2b1c || hs == 0x2b1b || hs == 0x2b50) {
                returnValue = YES;
            }
        }
    }];
    return returnValue;
}
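The surrogate-pair arithmetic in the method above can be checked in isolation. The following is a minimal sketch in plain C (which also compiles inside an Objective-C file); the function names `decode_surrogate_pair` and `in_matched_supplementary_range` are hypothetical and only mirror the expressions used above.

```c
#include <stdint.h>

/* Combine a UTF-16 high surrogate (0xD800..0xDBFF) and low surrogate
 * (0xDC00..0xDFFF) into one code point, using the same formula as the
 * method above. The caller is assumed to pass a valid pair. */
static uint32_t decode_surrogate_pair(uint16_t hs, uint16_t ls)
{
    return ((uint32_t)(hs - 0xD800) * 0x400) + (uint32_t)(ls - 0xDC00) + 0x10000;
}

/* Mirrors the range test applied to surrogate pairs in the method above. */
static int in_matched_supplementary_range(uint32_t uc)
{
    return 0x1D000 <= uc && uc <= 0x1F77F;
}
```

For example, the pair 0xD83D 0xDE00 decodes to U+1F600, which the range test accepts, while 0xD83E 0xDD80 decodes to U+1F980, which lies above 0x1F77F and is therefore not matched by that range.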

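I was delighted, and busy thanking Google and Baidu profusely, when the dreaded testers showed up: input from the nine-key keyboard was being filtered too, presumably because the characters that keyboard inserts fall inside the broad 0x2100 to 0x27ff range matched above. Magical, isn't it?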

Background.

There is no need to go over the history of emoji. It is enough to know that emoji were added in a certain version of Unicode and were not placed in a single block, so their code points follow no regular pattern and can only be matched explicitly. But there are hundreds or even thousands of emoji, and matching them one by one would be foolish, so we need to narrow down what has to be matched. I assume everyone uses UTF-8 these days. It is a variable-length encoding, and any variable-length encoding needs a header that describes the length plus some number of content bytes; UTF-8 is no exception. If the first bit of a byte is 0, the byte is a single-byte character, and the seven bits after the 0 are its Unicode code point. If the first bit is 1, the byte belongs to a multi-byte character. If the second bit is then 0 (the byte starts with 10), it is a data byte that follows a lead byte. If the byte starts with two or more 1s, it is a lead byte, and the number of leading 1s is the number of bytes in the character (including the lead byte itself), for example:

110xxxxx // two bytes: must be followed by one data byte starting with 10, e.g. 110xxxxx 10xxxxxx
1110xxxx // three bytes: must be followed by two data bytes starting with 10, e.g. 1110xxxx 10xxxxxx 10xxxxxx
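The lead-byte rule described above can be sketched as a small C function. This is an illustration only; the name `utf8_sequence_length` is hypothetical, and it recognizes the one-byte to four-byte patterns of standard UTF-8.

```c
#include <stdint.h>

/* Returns how many bytes the UTF-8 character starting with `lead` occupies,
 * or 0 if `lead` is a data byte (10xxxxxx) or not a lead byte at all. */
static int utf8_sequence_length(uint8_t lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx: single byte (ASCII)    */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx: one data byte follows   */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx: two data bytes follow   */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx: three data bytes follow */
    return 0;                            /* 10xxxxxx or an unused pattern     */
}
```

Counting the leading 1 bits with masks like this keeps each case a single comparison, which is the same idea as the bit tests used in the implementation later in this article.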

Taken purely as a bit pattern, this scheme could describe a character of up to 7 bytes (a lead byte plus 6 data bytes), but standard UTF-8 caps a character at 4 bytes, which is enough for every Unicode code point up to U+10FFFF. Emoji occupy 2, 3, or 4 bytes in UTF-8, and some are sequences of several code points that run longer than 4 bytes. Most of the 2-byte ones are ordinary text symbols and can be let through, while everything of 4 bytes or more can simply be filtered. Ordinary text sits almost entirely within the 3-byte range, so the real work is filtering the 3-byte emoji. (3-byte emoji can already be stored in the database, but for a consistent experience they still need to be filtered.) Fortunately there are not many 3-byte emoji, so matching them explicitly is reasonable.
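To make the 2, 3, and 4 byte split concrete, here is a minimal sketch of how the UTF-8 length follows from the code point. The function name `utf8_encoded_length` is hypothetical, and the sketch does not reject the surrogate range 0xD800 to 0xDFFF, which is not encodable on its own.

```c
#include <stdint.h>

/* Number of bytes standard UTF-8 uses for code point `cp`,
 * or 0 if `cp` is beyond U+10FFFF. */
static int utf8_encoded_length(uint32_t cp)
{
    if (cp <= 0x7F)     return 1; /* 7 payload bits            */
    if (cp <= 0x7FF)    return 2; /* 5 + 6 payload bits        */
    if (cp <= 0xFFFF)   return 3; /* 4 + 6 + 6 payload bits    */
    if (cp <= 0x10FFFF) return 4; /* 3 + 6 + 6 + 6 payload bits */
    return 0;
}
```

With this, U+00A9 (the copyright sign) takes 2 bytes, U+2764 (heavy black heart) and U+4E2D (a common Chinese character) both take 3 bytes, and U+1F600 takes 4 bytes, which is why the 3-byte range is the one where emoji and ordinary text have to be told apart by value.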

Implementation.

1. Match the three-byte code points against the emoji information published on the official Unicode website

- (BOOL) emojiInUnicode:(unichar)code
{
    // unichar (unsigned 16-bit) rather than short, so code points from 0x8000 upward stay positive
    if (code == 0x0023 || code == 0x002A || (code >= 0x0030 && code <= 0x0039) ||
        code == 0x00A9 || code == 0x00AE || code == 0x203C || code == 0x2049 ||
        code == 0x2122 || code == 0x2139 || (code >= 0x2194 && code <= 0x2199) ||
        code == 0x21A9 || code == 0x21AA || code == 0x231A || code == 0x231B ||
        code == 0x2328 || code == 0x23CF || (code >= 0x23E9 && code <= 0x23F3) ||
        (code >= 0x23F8 && code <= 0x23FA) || code == 0x24C2 || code == 0x25AA ||
        code == 0x25AB || code == 0x25B6 || code == 0x25C0 ||
        (code >= 0x25FB && code <= 0x25FE) || (code >= 0x2600 && code <= 0x2604) ||
        code == 0x260E || code == 0x2611 || code == 0x2614 || code == 0x2615 ||
        code == 0x2618 || code == 0x261D || code == 0x2620 || code == 0x2622 ||
        code == 0x2623 || code == 0x2626 || code == 0x262A || code == 0x262E ||
        code == 0x262F || (code >= 0x2638 && code <= 0x263A) ||
        (code >= 0x2648 && code <= 0x2653) || code == 0x2660 || code == 0x2663 ||
        code == 0x2665 || code == 0x2666 || code == 0x2668 || code == 0x267B ||
        code == 0x267F || (code >= 0x2692 && code <= 0x2694) || code == 0x2696 ||
        code == 0x2697 || code == 0x2699 || code == 0x269B || code == 0x269C ||
        code == 0x26A0 || code == 0x26A1 || code == 0x26AA || code == 0x26AB ||
        code == 0x26B0 || code == 0x26B1 || code == 0x26BD || code == 0x26BE ||
        code == 0x26C4 || code == 0x26C5 || code == 0x26C8 || code == 0x26CE ||
        code == 0x26CF || code == 0x26D1 || code == 0x26D3 || code == 0x26D4 ||
        code == 0x26E9 || code == 0x26EA || (code >= 0x26F0 && code <= 0x26F5) ||
        (code >= 0x26F7 && code <= 0x26FA) || code == 0x26FD || code == 0x2702 ||
        code == 0x2705 || (code >= 0x2708 && code <= 0x270D) || code == 0x270F ||
        code == 0x2712 || code == 0x2714 || code == 0x2716 || code == 0x271D ||
        code == 0x2721 || code == 0x2728 || code == 0x2733 || code == 0x2734 ||
        code == 0x2744 || code == 0x2747 || code == 0x274C || code == 0x274E ||
        (code >= 0x2753 && code <= 0x2755) || code == 0x2757 || code == 0x2763 ||
        code == 0x2764 || (code >= 0x2795 && code <= 0x2797) ||
        code == 0x27A1 || code == 0x27B0 || code == 0x27BF || code == 0x2934 ||
        code == 0x2935 || (code >= 0x2B05 && code <= 0x2B07) || code == 0x2B1B ||
        code == 0x2B1C || code == 0x2B50 || code == 0x2B55 || code == 0x3030 ||
        code == 0x303D || code == 0x3297 || code == 0x3299 ||
        // second batch
        code == 0x23F0) {
        return YES;
    }
    return NO;
}
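As an alternative to one long condition, the same kind of explicit matching could be driven by a sorted range table. The following is only a sketch of that design choice, not the author's method: the type `CodeRange`, the table `kSampleRanges`, and the function `code_in_ranges` are hypothetical names, and the table holds a small illustrative subset of the values listed above, not the full list.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t first; /* first code point in the range (inclusive) */
    uint16_t last;  /* last code point in the range (inclusive)  */
} CodeRange;

/* Illustrative subset only. Must stay sorted by `first` and non-overlapping. */
static const CodeRange kSampleRanges[] = {
    {0x203C, 0x203C},
    {0x2194, 0x2199},
    {0x23E9, 0x23F3},
    {0x2600, 0x2604},
    {0x2648, 0x2653},
    {0x2B05, 0x2B07},
    {0x3297, 0x3297},
    {0x3299, 0x3299},
};

/* Binary search: returns 1 if `code` falls inside any range, otherwise 0. */
static int code_in_ranges(uint16_t code, const CodeRange *ranges, size_t count)
{
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (code < ranges[mid].first) {
            hi = mid;
        } else if (code > ranges[mid].last) {
            lo = mid + 1;
        } else {
            return 1;
        }
    }
    return 0;
}
```

A table like this is easier to update when the Unicode list changes, at the cost of having to keep the entries sorted.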

2. There is also a very old set of emoji (SoftBank) that uses the Unicode Private Use Area. It is almost never used now, but it is filtered as well

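- (BOOL) emojiInSoftBankUnicode:(unichar)code
{
    // code must be unsigned: as a signed short, 0xE000 and above would be negative
    // and the high-byte comparison below could never succeed
    return ((code >> 8) >= 0xE0 && (code >> 8) <= 0xE5 && (Byte)(code & 0xFF) < 0x60);
}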
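The private-area check depends on the code point being held in an unsigned 16-bit value: in a signed 16-bit type, values from 0xE000 upward would be negative on typical platforms, and the high-byte comparison could not succeed. A minimal C sketch of the same test, using the hypothetical name `is_softbank_private_use`:

```c
#include <stdint.h>

/* Same test as the method above: high byte in 0xE0..0xE5 and low byte
 * below 0x60. `code` is unsigned, so 0xE000 and above stay positive. */
static int is_softbank_private_use(uint16_t code)
{
    uint8_t high = (uint8_t)(code >> 8);
    uint8_t low  = (uint8_t)(code & 0xFF);
    return high >= 0xE0 && high <= 0xE5 && low < 0x60;
}
```

For instance, 0xE001 and 0xE537 pass the test, while 0xE060 (low byte too large) and 0xE600 (high byte too large) do not.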

3. Scan the input string: deal with the characters whose length is not 3 bytes first, then check the Unicode code point of each 3-byte character

- (BOOL) containEmoji
{
    // self is the NSString being checked (the method uses NSString's own methods on self)
    NSUInteger len = [self lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
    if (len < 3) { // only strings longer than 2 bytes need checking (some emoji are only 3 bytes long)
        return NO;
    }

    // Only 3-byte characters are examined individually; anything longer is treated as emoji
    NSData *data = [self dataUsingEncoding:NSUTF8StringEncoding];
    Byte *bts = (Byte *)[data bytes];
    Byte bt;
    unichar v; // unsigned 16-bit, so code points from 0x8000 upward stay positive
    for (NSUInteger i = 0; i < len; i++) {
        bt = bts[i];

        if ((bt | 0x7F) == 0x7F) { // 0xxxxxxx: ASCII
            continue;
        }
        if ((bt | 0x1F) == 0xDF) { // 110xxxxx: two-byte character
            i += 1;
            continue;
        }
        if ((bt | 0x0F) == 0xEF) { // 1110xxxx: three-byte character (the main filtering target)
            // compute the Unicode code point
            v = bt & 0x0F;
            v = v << 6;
            v |= bts[i + 1] & 0x3F;
            v = v << 6;
            v |= bts[i + 2] & 0x3F;

            // NSLog(@"%02X%02X", (Byte)(v >> 8), (Byte)(v & 0xFF));

            if ([self emojiInSoftBankUnicode:v] || [self emojiInUnicode:v]) {
                return YES;
            }

            i += 2;
            continue;
        }
        if ((bt | 0x3F) == 0xBF) { // 10xxxxxx: data byte, skip it
            continue;
        }

        return YES; // anything not covered above is longer than three bytes: treat it as emoji
    }
    return NO;
}
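The code point calculation in the three-byte branch above can be tried on its own. This is a minimal C sketch under the assumption that the caller passes a pointer to a well-formed three-byte sequence; the name `decode_utf8_three_bytes` is hypothetical.

```c
#include <stdint.h>

/* Decode one three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx)
 * into its 16-bit code point. No validation is performed here. */
static uint16_t decode_utf8_three_bytes(const uint8_t *p)
{
    uint16_t v = p[0] & 0x0F;                 /* 4 payload bits from the lead byte */
    v = (uint16_t)((v << 6) | (p[1] & 0x3F)); /* 6 bits from the first data byte   */
    v = (uint16_t)((v << 6) | (p[2] & 0x3F)); /* 6 bits from the second data byte  */
    return v;
}
```

For example, the bytes E2 9D A4 decode to 0x2764, which is in the list from step 1; E4 B8 AD decode to 0x4E2D, an ordinary Chinese character that passes through; and EE 80 81 decode to 0xE001, which falls in the private area from step 2.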

Thanks to author xoHome from Oscine for this article.