So much to learn from one little emoji?

Preface

The product review list page shows the details of each user's review. To protect user privacy, only the first and last characters of a nickname are displayed; everything in between is masked (shown as *** in this article).

For example, the input 🐳🐳🐠 should be displayed as 🐳***🐠

It looked like a routine requirement, so I didn't pay much attention to it. The server stores users' reviews in the DB, and the review list API simply returns the reviews of a product from the database, including the reviewer's nickname.

But!!!!! The testers found a problem: when a nickname contained emoji, the truncated data was displayed as question marks!!

Sample code that reproduces the problem:
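A minimal sketch of the kind of naive masking that triggers this, assuming the nickname is cut with String.substring on char indices. The class name NaiveMaskDemo and the method naiveMask are illustrative, not the production code:

```java
public class NaiveMaskDemo {

    // Hypothetical helper: keeps the first and last char and masks the rest.
    public static String naiveMask(String nickname) {
        if (nickname.length() < 2) {
            return nickname;
        }
        return nickname.substring(0, 1) + "***" + nickname.substring(nickname.length() - 1);
    }

    public static void main(String[] args) {
        // Three emoji (whale, whale, tropical fish), each stored as two chars
        String nickname = "\uD83D\uDC33\uD83D\uDC33\uD83D\uDC20";
        String masked = naiveMask(nickname);

        // Both cuts land in the middle of an emoji, leaving half of it behind
        System.out.println(Character.isHighSurrogate(masked.charAt(0)));                  // true
        System.out.println(Character.isLowSurrogate(masked.charAt(masked.length() - 1))); // true
        System.out.println(masked);
    }
}
```

The first and last char of the result are each only half of an emoji, so they cannot be rendered on their own.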

Output:

When I saw this output I was completely baffled. This was not what I wanted at all!!

These three fish had me stumped. Should I just brush the tester off by saying emoji are too special, and try to get away with it?

After thinking about it for a long time, I decided to face the problem and solve it! (After all, I'm still that smart little guy 🤔)

PS: This article was largely inspired by a Unicode talk given by a colleague at my previous company. A salute to my teacher! The following sections look step by step at how this problem arises and how to solve it.

Background knowledge

Solving this problem requires some background first. If you can't wait to see the code, jump to the solution at the end of this article.

utf8mb4

utf8mb4 is the character set we specify by default when creating tables in a database.

To store emoji in the database, the character set must be utf8mb4; otherwise the insert fails with an error. That is why many companies' DB conventions require utf8mb4 as the default character set.

But have you ever wondered why utf8 doesn't work while utf8mb4 does? What is the story behind it?

It involves a bit of Unicode, which we'll cover shortly, so keep reading.

Before MySQL 5.5, the utf8 character set supported only 1 to 3 bytes per character. Starting with MySQL 5.5.3, the utf8mb4 character set is available, in which a character can take up to 4 bytes, so more characters can be represented.

This table lists all emoji together with their Unicode code points and their UTF-8 encodings.

As the figure shows, emoji take 4 bytes in UTF-8, which is why a database using utf8 cannot store them.

We can also check in Java how many bytes an emoji takes:

We can also see that String.getBytes() encodes with the default charset, which is UTF-8 here:
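A minimal sketch of such a check. The class and method names are illustrative, and the charset is passed explicitly so the result does not depend on the platform default:

```java
import java.nio.charset.StandardCharsets;

public class Utf8ByteCount {

    // Number of bytes the string occupies when encoded as UTF-8
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));            // 1: ASCII letter
        System.out.println(utf8Length("\u4E2D"));       // 3: a common Chinese character
        System.out.println(utf8Length("\uD83D\uDC33")); // 4: the whale emoji
    }
}
```

A character set limited to 3 bytes per character can hold the first two strings but not the third.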

ASCII

Unicode came up in the introduction to utf8mb4, but before introducing it we need to mention our old friend ASCII.

ASCII (American Standard Code for Information Interchange) is a character encoding based on the Latin alphabet. It is mainly used to display modern English.

ASCII defines 128 characters, so modern English text fits in a single byte per character, which looks pretty good. Part of the mapping is as follows:
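A few entries of that mapping can be printed from Java, since casting a char to int yields its code. This is only a small sketch; the class name and the isAscii helper are illustrative:

```java
public class AsciiDemo {

    // ASCII covers the codes 0 to 127
    public static boolean isAscii(char c) {
        return c <= 127;
    }

    public static void main(String[] args) {
        for (char c : new char[] {'0', 'A', 'a', '~'}) {
            System.out.println(c + " -> " + (int) c);
        }
        System.out.println(isAscii('A'));      // true
        System.out.println(isAscii('\u4E2D')); // false: a Chinese character is outside ASCII
    }
}
```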

But ASCII only covers the basic Latin alphabet, which is obviously not enough.

Unicode

Obviously, computers did not stop at supporting English. ASCII can display only the 26 basic Latin letters, Arabic numerals, and English punctuation marks, so it is only good for modern American English.

If one character set contained all the characters in the world, and every character had a unique number in it, there would be no more garbled text. So Unicode was born.

Concept

Unicode, also known as Universal Code or Unified Code, is an industry standard in computing. It organizes and encodes most of the world's writing systems so that computers can present and process text in a simpler way.

Planes

Unicode first adopted ASCII's assignment of the integers 0 to 127 and then went on to assign 128 to 65535. With that many integers, a number can be assigned to each character of the world's various languages.

Later, the Unicode Consortium found that 65536 integers were not enough, so it reserved the next 16 blocks of 65536 integers at once, i.e. 65536 to 1114111, and called these 16 additional blocks planes. Together with the original plane covering 0 to 65535, Unicode has 17 planes in total. For example, plane 1 is 65536 to 131071. Of course, only seven planes have been allocated so far.

Plane 0 is the range U+0000 to U+FFFF, and the characters in this plane are the ones we use most often.

Most emoji are assigned above 65535; for example, 😺 is 128570 (\uD83D\uDE3A).
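The cat example can be checked in Java with the standard code point methods of String and Character. A small sketch (the class and helper names are illustrative):

```java
public class CodePointDemo {

    // Code point that starts at the beginning of the string
    public static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        String cat = "\uD83D\uDE3A"; // the cat emoji from the text
        int cp = firstCodePoint(cat);

        System.out.println(cp);                                     // 128570
        System.out.println(Integer.toHexString(cp));                // 1f63a
        System.out.println(Character.isSupplementaryCodePoint(cp)); // true: above 65535
        System.out.println(Character.charCount(cp));                // 2 chars are needed to store it
    }
}
```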

A recommended online encoding conversion site: ctf.ssleye.com/cencode.htm…

Range

Unicode code points range from U+0000 to U+10FFFF

That is 0x110000 values (0x10FFFF plus 1), or 17 planes of 0x10000 (65,536) each
Roughly 17 × 60,000, so about 1 million code points are available for mapping characters
The exact value is 1,114,112, a little over 1.11 million code points
As of Unicode 10.0 the standard contains 136,690 characters, still far short of 1 million
The Unicode Consortium says the current code space is sufficient and will not be expanded
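The arithmetic above can be double-checked against the constants Java exposes. A small sketch (the class, constant, and method names are illustrative; Character.MAX_CODE_POINT is the real JDK constant):

```java
public class CodeSpaceSize {

    public static final int PLANES = 17;
    public static final int PLANE_SIZE = 0x10000; // 65,536 code points per plane

    public static int totalCodePoints() {
        return PLANES * PLANE_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(totalCodePoints());                              // 1114112
        System.out.println(Integer.toHexString(Character.MAX_CODE_POINT));  // 10ffff
        System.out.println(Character.MAX_CODE_POINT + 1 == totalCodePoints()); // true
    }
}
```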

Implementation

How Unicode is encoded is not the same as how it is implemented. The Unicode code point of a character is fixed, but in actual storage and transmission, platforms are designed differently and want to save space, so the code point can be serialized in different ways. These implementations are called Unicode Transformation Formats (UTF).

The most popular ones are UTF-8, UTF-16, UCS-2, and UCS-4/UTF-32, with big-endian and little-endian variants if you break them down further.

In Java, since a char takes 2 bytes, we can infer that strings are stored in UTF-16.
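A short sketch of what that means for an emoji: the string holds one character as far as the user is concerned, but two UTF-16 code units as far as Java is concerned. The class and method names are illustrative:

```java
import java.nio.charset.StandardCharsets;

public class Utf16InJava {

    // Number of Unicode code points, as opposed to the number of chars
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String whale = "\uD83D\uDC33"; // one whale emoji

        System.out.println(Character.BYTES);  // 2: size of a char in bytes
        System.out.println(whale.length());   // 2: length() counts chars
        System.out.println(codePoints(whale)); // 1: but there is only one code point
        System.out.println(whale.getBytes(StandardCharsets.UTF_16BE).length); // 4 bytes in UTF-16
    }
}
```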

For encoding issues of all kinds, I recommend a good article: an in-depth analysis of Chinese encoding problems in Java (developer.ibm.com/zh/articles…).

Checking whether a string contains Chinese

Now that you know what Unicode is, what is it actually good for in practice?

Let's look at a small requirement: how do I tell whether a string contains Chinese?

You have probably run into this requirement too. Usually we search Baidu, find a regular expression that tests for Chinese characters, and happily consider the problem solved.

Coincidentally, our system has such a regex check as well, wrapped up by colleagues in the architecture group. Let's take a look:
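A sketch of what such a check typically looks like. The exact range is an assumption: U+4E00 to U+9FA5 is the range that contains exactly 20,902 code points, and the class and method names are illustrative rather than the architecture group's actual code:

```java
import java.util.regex.Pattern;

public class ChineseCheck {

    // Assumed range: U+4E00..U+9FA5 (20,902 code points)
    private static final Pattern CHINESE = Pattern.compile("[\\u4e00-\\u9fa5]");

    public static boolean containsChinese(String s) {
        return s != null && CHINESE.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsChinese("abc"));       // false
        System.out.println(containsChinese("abc\u4E2D")); // true: contains a common Chinese character
        System.out.println(0x9FA5 - 0x4E00 + 1);          // 20902 characters in the range
    }
}
```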

Clearly, it works by matching a Unicode range. Any problems with that?

The range used here is the CJK Unified Ideographs block (Chinese, Japanese and Korean), but in its 1993 version, which contains most commonly used Chinese characters, 20,902 in total. Later versions added many more characters, so the check we use today is bound to miss them:

Take CJK Unified Ideographs Extension A, added in 2000, as an example:

It adds a lot of rare characters, none of which I even recognize. Let's use the second row of data to verify:
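A sketch of such a verification. It assumes the range check is U+4E00 to U+9FA5 and uses U+3400, the first character of Extension A, as the sample (not necessarily the character from the original table). For comparison it also shows a script-based check with Character.UnicodeScript, which is an alternative and not the method used in our system:

```java
import java.util.regex.Pattern;

public class ExtensionACheck {

    // Assumed range-based check: U+4E00..U+9FA5
    private static final Pattern BASIC_RANGE = Pattern.compile("[\\u4e00-\\u9fa5]");

    public static boolean matchesBasicRange(String s) {
        return BASIC_RANGE.matcher(s).find();
    }

    // Alternative: ask the JDK which script each code point belongs to
    public static boolean containsHan(String s) {
        return s.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    public static void main(String[] args) {
        String rare = "\u3400"; // first character of CJK Unified Ideographs Extension A

        System.out.println(matchesBasicRange(rare)); // false: the range check misses it
        System.out.println(containsHan(rare));       // true: the script check recognizes it
    }
}
```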

Surprised? Ready to shout that there is a bug here? Haha.

Strictly speaking, this regex check is not buggy; it depends on whether our requirements call for recognizing every rare character. Given how users actually type, the chance of entering these rare characters is low, so this regex is not as bad as my colleagues' feedback suggested.

Solving the emoji truncation problem

Anyway, we still have to answer the question raised at the beginning: how do we correctly truncate a string that contains emoji? Let's start with UTF-16.

UTF-16

UTF-16 specifies how Unicode characters are stored as 16-bit code units. Every character in the Basic Multilingual Plane takes exactly one code unit (two bytes), so for such text one char is one character, which greatly simplifies string manipulation and is one of the reasons Java uses UTF-16 as its in-memory string format. UTF-16 is not fixed-length, though: characters outside the Basic Multilingual Plane take two code units.

In the Basic Multilingual Plane (code points U+0000 to U+FFFF), UTF-16 uses a single code unit whose value equals the code point (no conversion required); ordinary Chinese characters fall into this case. Code points in the supplementary planes (U+10000 to U+10FFFF) are encoded as a pair of 16-bit code units (32 bits, 4 bytes) called a surrogate pair. The first code unit of the pair is the lead surrogate, in the range 0xD800-0xDBFF, and the second is the trail surrogate, in the range 0xDC00-0xDFFF.
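The conversion from a supplementary code point to its surrogate pair is simple arithmetic: subtract 0x10000, put the upper 10 bits into the lead surrogate and the lower 10 bits into the trail surrogate. A sketch of it, with an illustrative class and method name, compared against the JDK's own Character.toChars:

```java
import java.util.Arrays;

public class SurrogateMath {

    // Encodes a supplementary code point (U+10000..U+10FFFF) as a surrogate pair
    public static char[] toSurrogatePair(int codePoint) {
        int offset = codePoint - 0x10000;                // 20-bit value
        char lead = (char) (0xD800 + (offset >> 10));    // upper 10 bits
        char trail = (char) (0xDC00 + (offset & 0x3FF)); // lower 10 bits
        return new char[] {lead, trail};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogatePair(0x1F433); // the whale emoji
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);  // D83D DC33
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F433))); // true
    }
}
```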

Surrogate

Surrogate is not a concept of the Java language; it comes from UTF-16, one of the Unicode encodings. For details see: zh.wikipedia.org/wiki/UTF-16

In short, Java encodes characters internally in UTF-16. A char is a 16-bit type, so it can hold 65536 different values, each of which can stand for one character. Unicode, however, contains far more than 65536 characters. How can a code point above 65535 be expressed with 16-bit units? The Unicode standard designers set aside 2048 of the 65536 values as "surrogates" and pair them up to represent code points above 65535.

More specifically, the code points U+D800 to U+DBFF are the "High Surrogates", 1024 in all, and the code points U+DC00 to U+DFFF are the "Low Surrogates", also 1024. Paired together, they can represent 1024 × 1024 = 1,048,576 additional characters.
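These numbers can be verified with a few lines of Java. The sketch below (class and method names are illustrative) counts the possible pairs and uses Character.toCodePoint to show that the smallest and largest pairs map exactly to the ends of the supplementary range:

```java
public class SurrogateRange {

    // Number of distinct (high, low) surrogate combinations
    public static int pairCount() {
        int highs = 0xDBFF - 0xD800 + 1; // 1024 high surrogates
        int lows = 0xDFFF - 0xDC00 + 1;  // 1024 low surrogates
        return highs * lows;
    }

    public static void main(String[] args) {
        System.out.println(pairCount()); // 1048576

        // Smallest and largest pair decode to the first and last supplementary code point
        System.out.println(Integer.toHexString(Character.toCodePoint('\uD800', '\uDC00'))); // 10000
        System.out.println(Integer.toHexString(Character.toCodePoint('\uDBFF', '\uDFFF'))); // 10ffff
    }
}
```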

Why emoji truncation goes wrong

That was all fairly conceptual. If it feels confusing, let's go back to the code:

We can break the emoji down as follows:

🐳 -> \uD83D\uDC33

🐠 -> \uD83D\uDC20
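This breakdown can be reproduced by printing every char of the nickname in hex. A minimal sketch with an illustrative class and method name:

```java
public class CharDump {

    // Hex value of every char (UTF-16 code unit) in the string, separated by spaces
    public static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // whale, whale, tropical fish
        System.out.println(dump("\uD83D\uDC33\uD83D\uDC33\uD83D\uDC20"));
        // D83D DC33 D83D DC33 D83D DC20
    }
}
```

Three emoji occupy six chars, so taking the first char alone yields only D83D.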

The code points of these emoji are above 65535, so each one is stored as a pair of a High Surrogate and a Low Surrogate.

From what we now know about UTF-16, the cause of the garbled output is clear: each emoji is a surrogate pair in a supplementary plane, and cutting off a single char splits the pair, leaving half a character behind.

To detect this, we can use the static methods isHighSurrogate and isLowSurrogate of the Character class. A single emoji is a high surrogate followed by a low surrogate, so the surrogate pair must be dropped or kept as a whole.

The source of the isHighSurrogate method is as follows:

public static final char MIN_HIGH_SURROGATE = '\uD800';

public static final char MAX_HIGH_SURROGATE = '\uDBFF';

public static boolean isHighSurrogate(char ch) {
    return ch >= MIN_HIGH_SURROGATE && ch < (MAX_HIGH_SURROGATE + 1);
}

This check simply tests for the High Surrogates range, which can be rewritten as:

U+D800 <= ch <= U+DBFF

Similarly, isLowSurrogate works the same way:

U+DC00 <= ch <= U+DFFF
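With these two methods it is easy to tell whether a given cut position would separate a pair. A small sketch, where splitsPair is a hypothetical helper and not part of the JDK:

```java
public class SurrogateCut {

    // True if cutting s at index would separate a high surrogate from its low surrogate
    public static boolean splitsPair(String s, int index) {
        return index > 0 && index < s.length()
                && Character.isHighSurrogate(s.charAt(index - 1))
                && Character.isLowSurrogate(s.charAt(index));
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDC33a"; // one whale emoji followed by the letter a

        System.out.println(splitsPair(s, 1)); // true: index 1 is inside the emoji
        System.out.println(splitsPair(s, 2)); // false: index 2 is between the emoji and the letter
    }
}
```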

Solving the problem

Let's run the code and see the result:

The implementation is as follows:

public class EmojiNicknameMask {

    public static void main(String[] args) {
        // The user nickname is 🐳🐳🐠; the expected result is 🐳***🐠
        String context = "\uD83D\uDC33\uD83D\uDC33\uD83D\uDC20";
        int realNameLength = realStringLength(context);
        String namePrefix = subString(context, 1, 0);
        String nameSuffix = subString(context, realNameLength - 1, 1);
        context = String.format("%s%s%s", namePrefix, "***", nameSuffix);
        System.out.println(context);
    }

    /**
     * Substring method for strings that contain emoji.
     *
     * @param str  original string
     * @param len  number of characters to count, where an emoji counts as one
     * @param type 0 returns the prefix, any other value returns the suffix
     */
    private static String subString(String str, int len, int type) {
        if (len < 0) {
            return str;
        }
        int count = 0;
        for (int i = 0; i < str.length(); i++) {
            if (count == len) {
                // type = 0 for prefix, other for suffix
                if (type == 0) {
                    return str.substring(0, i);
                }
                return str.substring(i);
            }
            if (startsSurrogatePair(str, i)) {
                i++; // skip the low surrogate so the pair counts once
            }
            count++;
        }
        return str;
    }

    /** Length of the string with each surrogate pair counted as one character. */
    private static int realStringLength(String str) {
        int count = 0;
        for (int i = 0; i < str.length(); i++) {
            if (startsSurrogatePair(str, i)) {
                i++;
            }
            count++;
        }
        return count;
    }

    /** True if a well-formed surrogate pair starts at index i. */
    private static boolean startsSurrogatePair(String str, int i) {
        return Character.isHighSurrogate(str.charAt(i))
                && i + 1 < str.length()
                && Character.isLowSurrogate(str.charAt(i + 1));
    }
}
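For comparison, the JDK's code point methods can do the same bookkeeping. The sketch below is an alternative to the hand-written loops, not the implementation used in this article: it relies on String.codePointCount and String.offsetByCodePoints, the class and method names are illustrative, and returning single-character nicknames unchanged is an assumption about the desired behavior.

```java
public class CodePointMask {

    // Keeps the first and last code point of the nickname and masks the rest
    public static String mask(String nickname) {
        int count = nickname.codePointCount(0, nickname.length());
        if (count < 2) {
            return nickname; // assumption: nothing to mask
        }
        // char index just after the first code point
        int firstEnd = nickname.offsetByCodePoints(0, 1);
        // char index where the last code point starts
        int lastStart = nickname.offsetByCodePoints(nickname.length(), -1);
        return nickname.substring(0, firstEnd) + "***" + nickname.substring(lastStart);
    }

    public static void main(String[] args) {
        // whale, whale, tropical fish
        String masked = mask("\uD83D\uDC33\uD83D\uDC33\uD83D\uDC20");
        System.out.println(masked.equals("\uD83D\uDC33***\uD83D\uDC20")); // true
        System.out.println(mask("abcd"));                                 // a***d
    }
}
```

Because both cut positions are computed in code points, a surrogate pair is never split.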

Conclusion

There really is a lot to learn behind one small emoji. For lack of space I have left out many things, such as the details of how UTF-8 and UTF-16 encode code points, which is a big topic in itself.

I hope this article serves as a starting point that inspires you to explore further.

Author: a flower is not romantic

Reprinted from: wechat official account

The original link: mp.weixin.qq.com/s/CcC2VUYdC…
