Preface

Almost every programmer, with the possible exception of English-speaking ones, runs into garbled characters early in their programming life. At that point we are usually told to use UTF-8; the gibberish disappears and the world is at peace again. UTF-8 stands for 8-bit Unicode Transformation Format. So what is the relationship between Unicode and UTF-8? What is the difference between UTF-8 and the UTF-16 you often hear about? Does a Chinese character really take up 2 bytes? What is Modified UTF-8 in JNI? What is the Unicode surrogate area? Character encoding is quite basic, yet almost every programming book skims over it, leaving a cloud of mojibake swirling overhead. Here is a brief account of my understanding of character encoding; with it, the problems you run into later should at least not leave you completely baffled.

ASCII

American Standard Code for Information Interchange, an encoding scheme known to every programmer, needs little explanation: one byte encodes one character, although only seven bits are actually used, giving 128 characters. Then came EASCII (Extended ASCII), which uses all eight bits for 256 characters, adding a batch of rather odd symbols on top of ASCII. ASCII is probably the simplest encoding scheme and the first one developers meet, so I won't dwell on it here.

Traditional encoding schemes

When ASCII was designed, it did not take the characters of other countries such as China, Japan and Korea into account. A single byte can encode only 256 characters, which is nowhere near enough for many languages. As computers spread, country after country put forward its own encoding scheme. These schemes follow a similar routine: stay compatible with ASCII and use two bytes to encode the additional characters, as GB2312 and BIG5 do. But they share an obvious flaw: each one fends only for itself. They are compatible with ASCII and can write the native language while ignoring everyone else's, so they only let a computer handle a bilingual environment (Latin letters plus the native language), not a multilingual one (a mixture of languages). GB2312, for example, does not support Arabic; if a line of text contains Chinese, English and Arabic, GB2312 simply cannot encode it. Unicode was created to address the limitations of these traditional encoding schemes. Before we can talk about Unicode, we need to understand the modern encoding model.

Modern encoding model

Modern encoding models divide the concept of character encoding into several layers.

Abstract Character Repertoire

This is the collection of all abstract characters a system supports. For example, the Chinese character "中" is one character; whether it is displayed in bold or in a Song typeface, and at what size, is not the concern of the abstract character repertoire. Some characters cannot be printed at all, such as the familiar "\n". The abstract character repertoire only describes which abstract characters the system supports. A repertoire can be closed, meaning no new symbols may be added, as with ASCII, whereas the Unicode repertoire is open and new symbols keep being added.

Coded Character Set

A coded character set maps each character in a repertoire to a coordinate or a non-negative integer; a character repertoire plus such a mapping is called a coded character set, and Unicode is one. Because characters are now mapped onto something, the notion of an encoding space arises: put simply, it is the extent of the table that holds all the characters, and it can be described in terms of integers, coordinates, or rows, columns and planes. A single position in the encoding space is called a code point.
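
For instance (a minimal Java sketch of my own, not from the article), the code point a character is mapped to can be read off directly:

    // Unicode, as a coded character set, maps "中" to the code point U+4E2D.
    System.out.printf("U+%04X%n", "中".codePointAt(0)); // prints U+4E2D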

Character Encoding Form

This layer converts the code points of the coded character set into sequences of code units, integer values of a fixed bit length. For a fixed-length encoding such as UCS-2 the mapping is trivial, one code point to one code unit, but for a variable-length encoding such as UTF-8 it is more involved: some code points map to a sequence of several code units. The simplest character encoding form just picks a code unit large enough that every value in the coded character set can be encoded directly (one code unit per code point). That is reasonable for coded character sets that fit in 8 bits, and workable for those that fit in 16 bits, such as early versions of Unicode. But as the coded character set grew (today's Unicode needs at least 21 bits to represent every code point), this direct representation became increasingly wasteful, and larger code units are hard to fit onto existing computer systems.
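
To make code units concrete, a tiny Java sketch (mine, not the article's): the same code point becomes one 16-bit code unit under UTF-16 but three 8-bit code units under UTF-8:

    String zhong = "中"; // code point U+4E2D
    System.out.println(zhong.toCharArray().length); // 1 (one 16-bit code unit)
    System.out.println(zhong.getBytes(java.nio.charset.StandardCharsets.UTF_8).length); // 3 (three 8-bit code units)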

Character Encoding Scheme

This layer maps code units to 8-bit byte sequences so that the encoded data can be stored in files or transmitted over a network. In the case of Unicode, a simple character encoding scheme just specifies whether the byte order is big-endian or little-endian (UTF-8 has no byte order to specify).
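
A small Java sketch (my own) showing the scheme layer at work: the single 16-bit code unit for "中" (0x4E2D) is serialized to different byte sequences depending on the byte order chosen:

    byte[] be = "中".getBytes(java.nio.charset.StandardCharsets.UTF_16BE);
    byte[] le = "中".getBytes(java.nio.charset.StandardCharsets.UTF_16LE);
    for (byte b : be) System.out.printf("%02X ", b); // 4E 2D
    System.out.println();
    for (byte b : le) System.out.printf("%02X ", b); // 2D 4E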

Transfer Encoding Syntax

This layer processes the byte sequences produced by the character encoding scheme of the previous level. Its functions generally come in two kinds: one maps the byte values into a more restricted range to satisfy the limits of a transmission environment, for example Base64, which re-encodes 8-bit bytes as 7-bit data for email transmission; the other compresses the byte sequence with lossless techniques such as LZW or run-length encoding.
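
An illustration I added (not from the article): Base64 as a transfer encoding layered on top of UTF-8 bytes:

    byte[] utf8 = "中文".getBytes(java.nio.charset.StandardCharsets.UTF_8);
    String base64 = java.util.Base64.getEncoder().encodeToString(utf8);
    System.out.println(base64); // only ASCII characters, safe for 7-bit channels
    byte[] decoded = java.util.Base64.getDecoder().decode(base64);
    System.out.println(new String(decoded, java.nio.charset.StandardCharsets.UTF_8)); // 中文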

Unicode

When Unicode was first proposed, it was thought that 2 bytes would be enough to hold all modern characters, but in practice a great many more characters ended up being encoded, so the Unicode encoding space is now 0x0 – 0x10FFFF, which needs at least 21 bits, a little short of 3 bytes. I occasionally see the claim that "a Unicode character takes 4 bytes". In fact, Unicode as a coded character set has no concept of how many bytes a character takes: all it does is map characters to code point values between 0x0 and 0x10FFFF. How many bytes a character occupies has nothing to do with Unicode itself, only with the character encoding actually in use; in UCS-4, for example, a character occupies 4 bytes. Unicode divides the encoding space into 17 planes, numbered 0 to 16. Plane 0 is called the Basic Multilingual Plane (BMP); the other 16 planes are collectively called the supplementary planes (SP).

Plane          Range                  Name
Plane 0        U+0000 – U+FFFF        Basic Multilingual Plane (BMP)
Plane 1        U+10000 – U+1FFFF      Supplementary Multilingual Plane (SMP)
Plane 2        U+20000 – U+2FFFF      Supplementary Ideographic Plane (SIP)
Plane 3        U+30000 – U+3FFFF      Tertiary Ideographic Plane (TIP, not yet in official use)
Planes 4 to 13 U+40000 – U+DFFFF      Not yet used
Plane 14       U+E0000 – U+EFFFF      Supplementary Special-purpose Plane (SSP)
Plane 15       U+F0000 – U+FFFFF      Private Use Area-A (PUA-A)
Plane 16       U+100000 – U+10FFFF    Private Use Area-B (PUA-B)

Although the BMP has only 2^16 = 65536 code points, which may not sound like many, most text in practice falls inside it. Many scripts you might think of as exotic, such as the minority scripts of India or right-to-left Arabic, live entirely in the BMP.
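
As a quick check (my own sketch), the plane of a code point is simply its value divided by 0x10000, i.e. shifted right by 16 bits:

    System.out.println(0x4E2D >>> 16);  // 0: "中" lies in the BMP
    System.out.println(0x1F600 >>> 16); // 1: U+1F600 lies in the SMP (plane 1)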

Unicode defines two families of mappings: the Unicode Transformation Format (UTF) encodings and the Universal Coded Character Set (UCS) encodings. The most common of these are UTF-8 and UTF-16. UCS-2 is the predecessor of UTF-16, while UTF-32 and UCS-4 are functionally equivalent.

UCS-2

Let's start with a simple one: UCS-2, which is really just a character encoding form. UCS-2 uses fixed-length 16-bit code units and simply maps each BMP code point directly to a 16-bit value equal to the code point. That also means UCS-2 cannot encode characters in the supplementary planes.

UCS-4 / UTF-32

Since UCS-2 cannot encode all Unicode characters, the most straightforward fix is to widen the code unit to 32 bits and again map the Unicode code points directly. UTF-32 is a crude, space-hungry encoding, and I don't know of anyone who uses it.
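
For illustration, a sketch of mine (note that "UTF-32" is not one of the charsets the Java specification guarantees, although common JDKs ship it):

    java.nio.charset.Charset utf32be = java.nio.charset.Charset.forName("UTF-32BE");
    byte[] bytes = "A中".getBytes(utf32be);
    System.out.println(bytes.length); // 8: every character takes 4 bytes
    // 00 00 00 41  00 00 4E 2D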

UTF-16

Notepad on Windows lets you save text files in a choice of encodings, including one labelled "Unicode", which is actually UTF-16. UTF-16 uses the same 16-bit code units as UCS-2. Android/Java developers may think UTF-16 is far away from them, but it is not: strings in Java, as represented inside the virtual machine, use UTF-16. What evidence supports this? In Java, char is 2 bytes long. I used to wonder why a char can hold a Chinese character and why a char is 2 bytes; we now know that what String.charAt actually returns is a UTF-16 code unit. But if UTF-16 uses the same 16-bit code units as UCS-2, why can it encode the entire Unicode character set? Two reasons: first, UTF-16 is a variable-length encoding; second, there is a special region of the BMP, the surrogate area. The Unicode standard states that the values U+D800 – U+DFFF do not correspond to any character.
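
A small Java sketch (mine, not the article's) that makes the UTF-16 representation visible, using the supplementary-plane character U+1F600 as an example:

    String s = "中" + new String(Character.toChars(0x1F600));
    System.out.println(s.length());                       // 3 UTF-16 code units (1 + a surrogate pair)
    System.out.println(s.codePointCount(0, s.length()));  // 2 code points, i.e. 2 actual characters
    System.out.printf("%04X%n", (int) s.charAt(1));       // D83D: a leading (high) surrogate, not a character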

For the BMP (U+0000 – U+FFFF), UTF-16 encodes exactly like UCS-2: code points map directly to 16-bit code units. For SP code points (U+10000 – U+10FFFF), the code units are computed as follows: first subtract 0x10000, leaving a 20-bit value (0x00000 – 0xFFFFF). The high 10 bits (range 0x000 – 0x3FF) plus 0xD800 give the first 16-bit code unit, in the range 0xD800 – 0xDBFF, known as the leading surrogate. The low 10 bits (also in the range 0x000 – 0x3FF) plus 0xDC00 give the second 16-bit code unit, in the range 0xDC00 – 0xDFFF, known as the trailing surrogate. A surrogate pair, consisting of a leading surrogate followed by a trailing surrogate, represents one SP code point.
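
A minimal Java sketch (my own) of the algorithm above, which can be checked against the JDK's own conversion:

    static char[] encodeSurrogatePair(int codePoint) {
        int v = codePoint - 0x10000;                // 20-bit value
        char lead  = (char) (0xD800 + (v >>> 10));  // high 10 bits -> leading surrogate
        char trail = (char) (0xDC00 + (v & 0x3FF)); // low 10 bits  -> trailing surrogate
        return new char[] { lead, trail };
    }

    // encodeSurrogatePair(0x1F600) yields { 0xD83D, 0xDE00 },
    // the same result as Character.toChars(0x1F600).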

As you can see, the leading-surrogate and trailing-surrogate ranges together make up exactly the surrogate area of the BMP. Notice also that the ranges of valid BMP characters, leading surrogates, and trailing surrogates do not overlap, so when reading UTF-16 data we can tell unambiguously from the value of a single 16-bit code unit which of the three it is.

If it is not obvious what the algorithm is doing: SP contains 0x10FFFF - 0x10000 + 0x1 = 0x100000 = 2^20 code points. To map those 2^20 code points onto a pair of coordinates, each coordinate needs 2^10 = 1024 possible values. The leading-surrogate range 0xD800 – 0xDBFF is exactly 1024 code points, and the trailing-surrogate range 0xDC00 – 0xDFFF is exactly 1024 code points, so each SP code point is encoded as one coordinate pair (leading surrogate, trailing surrogate).

So Android/Java developers need to be aware that String.charAt may hand back a leading surrogate or a trailing surrogate, not necessarily a complete BMP character. Also note that under UTF-16 it is not just Chinese characters: every character in the BMP takes two bytes.
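
A small sketch of mine showing how to iterate a string safely by code points rather than by char:

    String s = "a中" + new String(Character.toChars(0x1F600));
    // char-by-char iteration would see 4 chars, two of which are surrogates.
    s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    // U+0061, U+4E2D, U+1F600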

The JNI string functions whose names do not contain "UTF" operate on UTF-16, for example:

    const jchar* GetStringChars(JNIEnv *env, jstring string, jboolean *isCopy);
    jstring NewString(JNIEnv *env, const jchar *unicodeChars, jsize len);

NewString constructs a jstring from a stream of UTF-16 encoded jchar values.

UTF-8

UTF-8 uses 8-bit code units and is a variable-length encoding. It is also very simple: just encode according to the table below.

Bits for code points Code point range Byte 1 Byte 2 Byte 3 Byte 4
7 U+0000 – U+007F 0xxxxxxx
11 U+0080 – U+07FF 110xxxxx 10xxxxxx
16 U+0800 – U+FFFF 1110xxxx 10xxxxxx 10xxxxxx
21 U+10000 – U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

In UTF-8, the Chinese characters we commonly use (at least all the ones I have seen) take three bytes, because they fall in the range U+0800 – U+FFFF.
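
A minimal Java sketch (my own) that encodes a code point by hand following the table above; the result matches the JDK's UTF-8 encoder:

    static byte[] encodeUtf8(int cp) {
        if (cp <= 0x7F)   return new byte[] { (byte) cp };
        if (cp <= 0x7FF)  return new byte[] { (byte) (0xC0 | (cp >>> 6)),
                                              (byte) (0x80 | (cp & 0x3F)) };
        if (cp <= 0xFFFF) return new byte[] { (byte) (0xE0 | (cp >>> 12)),
                                              (byte) (0x80 | ((cp >>> 6) & 0x3F)),
                                              (byte) (0x80 | (cp & 0x3F)) };
        return new byte[] { (byte) (0xF0 | (cp >>> 18)),
                            (byte) (0x80 | ((cp >>> 12) & 0x3F)),
                            (byte) (0x80 | ((cp >>> 6) & 0x3F)),
                            (byte) (0x80 | (cp & 0x3F)) };
    }

    // encodeUtf8(0x4E2D) -> E4 B8 AD, the same bytes as "中".getBytes(StandardCharsets.UTF_8).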

BOM

BOM (byte-order mark) is another term you will run into; for example, source files are often required to be saved as UTF-8 without BOM, otherwise they might not compile or something odd might happen. BOM is in fact the name of the Unicode character at code point U+FEFF. For encodings whose code units are wider than 8 bits, such as UTF-16/UCS-2 and UTF-32/UCS-4, the encoded data inevitably has a byte-order problem when it is stored or transmitted. If U+FEFF appears at the very start of a byte stream, it identifies the byte order of that stream: each encoding scheme encodes U+FEFF in its own way, and placing it at the head marks the byte order used for the rest of the stream. For example, suppose we know an incoming byte stream is UTF-16 but not which byte order, and the first two bytes we read are 0xFF 0xFE. Since U+FFFE is not mapped to any character in Unicode, these two bytes must be an encoding of U+FEFF, so the stream must be little-endian, i.e. UTF-16 LE.

UTF-8, with its 8-bit code units, has no byte-order problem, and adding a BOM at the head is not recommended because it can trip up some tools, so UTF-8 without BOM is the mainstream.

The BOM byte sequences of the different encodings are:

Encoding (BE: big-endian, LE: little-endian)    BOM bytes (hex)
UTF-8 EF BB BF
UTF-16 BE FE FF
UTF-16 LE FF FE
UTF-32 BE 00 00 FE FF
UTF-32 LE FF FE 00 00
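
Based on the table above, a minimal Java sketch (my own; the helper name is made up for illustration) that sniffs the BOM at the head of a byte stream:

    static String detectBom(byte[] b) {
        // UTF-32 must be checked before UTF-16, since UTF-32 LE also starts with FF FE.
        if (b.length >= 4 && b[0] == 0 && b[1] == 0
                && (b[2] & 0xFF) == 0xFE && (b[3] & 0xFF) == 0xFF) return "UTF-32 BE";
        if (b.length >= 4 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE
                && b[2] == 0 && b[3] == 0)                         return "UTF-32 LE";
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF)                          return "UTF-8";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) return "UTF-16 BE";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) return "UTF-16 LE";
        return "no BOM";
    }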

Modified UTF-8

If your Android app handles strings in C++ code, you may run into a crash complaining that the "input is not valid Modified UTF-8". The crash comes from mixing two encodings: code outside normally uses standard UTF-8, while the JNI string functions deal with a UTF-8 variant. Although both carry "UTF-8" in their names and are partially compatible, they are essentially two different encodings. In JNI, every String-related function with "UTF" in its name actually works with MUTF-8 (Modified UTF-8), for example:

    jstring NewStringUTF(JNIEnv *env, const char *bytes);
    const char* GetStringUTFChars(JNIEnv *env, jstring string, jboolean *isCopy);

GetStringUTFChars returns the MUTF-8 encoded bytes of a jstring.

On the Java side, String.getBytes("UTF-8") returns the standard UTF-8 encoding of a String, and new String(bytes, "UTF-8") constructs a String from a standard UTF-8 byte stream.

Many developers are unaware of this difference between UTF-8 at the Java layer and at the JNI layer, so problems caused by the encoding mismatch leave them baffled.

There are two differences between MUTF-8 and standard UTF-8. First, the null character ("\0", U+0000) is encoded as two bytes in MUTF-8, 0xC0 0x80 (i.e. 11000000 10000000), whereas standard UTF-8 encodes it as the single byte 0x00. Second, a character in SP is first encoded in UTF-16 as a leading surrogate and a trailing surrogate, and the two surrogates are then each encoded as if they were ordinary code points. Apart from \0, BMP characters are encoded identically in MUTF-8 and standard UTF-8. Since SP covers U+10000 – U+10FFFF, standard UTF-8 encodes an SP code point as 4 bytes, while the surrogate code points U+D800 – U+DFFF take 3 bytes each, so MUTF-8 encodes an SP code point as 3 + 3 = 6 bytes, 2 bytes more than standard UTF-8. Strings in dex files are encoded with MUTF-8.
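
Java's DataOutputStream.writeUTF happens to emit modified UTF-8 (that is what the DataInput/DataOutput documentation specifies), which makes the difference easy to observe; a small sketch of mine:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.nio.charset.StandardCharsets;

    public class Mutf8Demo {
        public static void main(String[] args) throws Exception {
            String s = "\0" + new String(Character.toChars(0x1F600)); // NUL + an SP character

            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            System.out.println(utf8.length); // 5: 1 byte for NUL + 4 bytes for U+1F600

            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            // writeUTF prepends a 2-byte length; the payload that follows is modified UTF-8.
            System.out.println(bos.size() - 2); // 8: 2 bytes (C0 80) for NUL + 3 + 3 for the surrogate pair
        }
    }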
