The background of Unicode

Preliminary knowledge:

  1. Computers only understand binary: everything is ultimately stored as bits, such as 010101
  2. Generally speaking, the smallest addressable storage unit in a computer is an 8-bit binary number, i.e. 1 byte
  3. For a computer to recognize a character, such as the English letter A, the character must be encoded as a number
  4. The American Standard Code for Information Interchange (ASCII) is an encoding that maps characters to numbers using a storage length of one byte

Why Unicode?

ASCII uses the highest bit b7 of the 8 bits (b0–b7) as a parity bit to ensure reliable transmission, so ASCII defines a character set of only 2^7 = 128 characters.

The so-called parity check is a method for detecting errors during code transmission. It comes in two forms: odd check and even check. Odd check requires that a correct code contain an odd number of 1s per byte: if the data bits already hold an odd number of 1s, the parity bit b7 is set to 0, otherwise to 1. Even check requires that a correct code contain an even number of 1s per byte, with b7 set accordingly.

Problems with ASCII encoding

ASCII is a coding standard developed in the United States. It can represent the characters used in English, but it is not enough for other languages such as Chinese and French. So that computers could recognize Chinese, China formulated the GB2312 encoding standard, which uses two bytes to represent one Chinese character; two bytes can distinguish up to 65536 values, although GB2312 itself assigns far fewer.

The tendency was for each country or region to create its own character code for its own language, which led to chaos: the same bytes could mean different characters under different encodings.

The birth of Unicode

Unicode was born to solve this problem by organizing and encoding most of the world’s text.

In fact, there have historically been two separate attempts to create a single character set: one by the International Organization for Standardization (ISO) and one by the Unicode Consortium, a consortium of multilingual software manufacturers. The former developed the ISO/IEC 10646 project, the latter the Unicode project, so at first two different standards were being developed.

Around 1991, participants in both projects realized that the world did not need two incompatible character sets, so they began to merge their work and create a single code table together. Starting with Unicode 2.0, Unicode uses the same character repertoire and code assignments as ISO 10646-1; ISO in turn promised that ISO 10646 will not assign UCS-4 code positions beyond U+10FFFF, to keep the two consistent.

Both projects still exist and publish their standards independently. However, the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the two standards compatible and to closely coordinate any future extensions.

Unicode generally illustrates a character with its most common glyph at the time of release, while ISO 10646 generally uses the Century typeface whenever possible. (From Baidu Baike: https://baike.baidu.com/item/Unicode)

Unicode encoding

The Unicode encoding space is divided into 17 planes, each containing 2^16 (65536) code points.

Together the 17 planes contain 17 × 65536 = 1114112 code points, ranging from U+0000 to U+10FFFF. The first plane is called the Basic Multilingual Plane (BMP), or Plane 0; the other planes are called supplementary planes.

Within the Basic Multilingual Plane, the code point range U+D800 to U+DFFF is permanently reserved and never mapped to characters, so there are 1114112 − 2048 = 1112064 valid code points.
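These counts are easy to verify with a short Python calculation (a sketch; the 2048 reserved values are the U+D800–U+DFFF range just mentioned):

```python
planes = 17
codepoints_per_plane = 2 ** 16        # 65536 code points per plane
reserved = 0xE000 - 0xD800            # 2048 reserved code points (U+D800-U+DFFF)

total = planes * codepoints_per_plane
print(total)               # 1114112 total code points
print(total - reserved)    # 1112064 valid code points
```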

Why define planes at all? Why divide the space into a basic plane and supplementary planes? And why is a segment of the basic plane reserved? Keep these questions in mind.

Computer implementation

Unicode is an encoding of characters to numbers, and there are many computer implementations built on top of it. These implementations differ in how they store Unicode code points; a computer implementation of Unicode can be regarded as a storage encoding for Unicode.

So there are two levels of encoding here: Unicode itself is the character-to-number encoding scheme, and the computer implementations of Unicode are storage encoding schemes for those numbers.

Why does a computer implementation need to re-encode Unicode at all?

Let’s discuss this issue by introducing different Unicode computer implementations.

We should recognize that characters do not occur with equal probability. In daily life we constantly use common words such as “你好” (hello) and “早上” (morning), but we rarely use characters such as those in “耄耋” (màodié, extreme old age) or “饕餮” (tāotiè, a mythical beast).

Based on these facts, if we use shorter storage codes for high-probability characters like “你好” and “早上”, and longer storage codes for characters that are rarely used, the average storage cost per character drops.

Definition: Assume n characters c1, …, cn, where character ci occurs with probability pi and occupies storage si. Then the average storage per character is T = p1·s1 + … + pn·sn.
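As a minimal sketch of this formula, the helper below computes T for a toy alphabet; the probabilities and sizes are made-up illustrative values, not real corpus data:

```python
def average_storage(probabilities, sizes_in_bits):
    """Expected storage per character: T = p1*s1 + ... + pn*sn."""
    assert abs(sum(probabilities) - 1.0) < 1e-9  # probabilities must sum to 1
    return sum(p * s for p, s in zip(probabilities, sizes_in_bits))

# Two frequent characters stored in 8 bits each, one rare character in 32 bits:
T = average_storage([0.45, 0.45, 0.10], [8, 8, 32])
print(T)  # 10.4 bits on average, versus 32 bits if everything used 4 bytes
```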

Let’s calculate the average character storage space of different encoding schemes.

UTF-32

The easiest computer implementation to think of is to store every Unicode code point in four bytes (32 bits); this is UTF-32. UTF-32 is the simplest implementation to program: no conversion is required, since each 32-bit unit corresponds one-to-one to a Unicode code point.

Benefits: No conversion, fast

Cons: Wasted storage space

T = 32 bits
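A quick sketch in Python shows the one-to-one correspondence: under UTF-32 every character, common or rare, occupies exactly 4 bytes, and the bytes are simply the code point itself.

```python
# Big-endian UTF-32 without a byte-order mark, so the bytes are readable directly.
for ch in ["A", "破", "😀"]:
    encoded = ch.encode("utf-32-be")
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} bytes)")
# Every line reports 4 bytes, and the hex is just the zero-padded code point.
```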

UTF-8

UTF-8 is a variable-length encoding that uses 1 to 4 bytes per Unicode character. The correspondence between Unicode code points and UTF-8 encoding is:

Unicode range         UTF-8 encoding (binary)
U+0000 – U+007F       0xxxxxxx
U+0080 – U+07FF       110xxxxx 10xxxxxx
U+0800 – U+FFFF       1110xxxx 10xxxxxx 10xxxxxx
U+10000 – U+10FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

A one-byte UTF-8 sequence represents Unicode code points in the range 0x00 to 0x7F.

A two-byte UTF-8 sequence represents Unicode code points in the range 0x80 to 0x7FF.

A three-byte UTF-8 sequence represents Unicode code points in the range 0x800 to 0xFFFF.

A four-byte UTF-8 sequence represents Unicode code points in the range 0x10000 to 0x10FFFF.

This is considerably more complex, but it saves storage space and is backward compatible with the older ASCII encoding.
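The four byte-length classes above can be observed directly in Python; the sample characters below are illustrative picks, one from each range:

```python
# Expected UTF-8 byte counts, one sample per range from the table above:
samples = {"A": 1,    # U+0041, ASCII range
           "ß": 2,    # U+00DF, two-byte range
           "破": 3,   # U+7834, three-byte range
           "😀": 4}   # U+1F600, four-byte range
for ch, expected in samples.items():
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} bytes)")
    assert len(encoded) == expected
```

Note that the ASCII character "A" encodes to the single byte 0x41, identical to its ASCII code, which is exactly the backward compatibility described above.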

T=???

There is no character-probability data at hand at present; I will update this after I look it up.

UTF-16

UTF-16 is also a variable-length encoding, using one or two 16-bit code units per Unicode character.

Basic Multilingual Plane (code point range U+0000–U+FFFF): within the BMP, UTF-16 uses a single code unit whose value is equal to the Unicode code point (no conversion required). For example:

Unicode    Character    UTF-16 (code unit)    UTF-16LE (bytes)    UTF-16BE (bytes)
U+0041     A            0x0041                0x41 0x00           0x00 0x41
U+7834     破           0x7834                0x34 0x78           0x78 0x34
U+6653     晓           0x6653                0x53 0x66           0x66 0x53

Supplementary planes (code point range U+10000–U+10FFFF): code points in the supplementary planes are encoded in UTF-16 as a pair of 16-bit code units (32 bits, 4 bytes), called a surrogate pair. The first unit of the pair is called the lead surrogate and lies in the range 0xD800–0xDBFF; the second is called the trail surrogate and lies in the range 0xDC00–0xDFFF.

The specific conversion process is:

  1. Subtract 0x10000 from the Unicode code point, so the supplementary-plane range becomes 0x00000–0xFFFFF, at most 20 bits in total

  2. Split the 20 bits into the high 10 bits and the low 10 bits: high 10 bits | 0xD800 gives the lead surrogate, and low 10 bits | 0xDC00 gives the trail surrogate

This is why (U+D800 – U+DFFF) is reserved in the Basic Multilingual Plane: surrogate values can then never be confused with ordinary BMP characters.
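The two steps above can be sketched as a small function; `to_surrogate_pair` is a name of my own choosing, and the result is cross-checked against Python's built-in UTF-16 encoder:

```python
def to_surrogate_pair(code_point):
    """Encode a supplementary-plane code point as a UTF-16 surrogate pair."""
    assert 0x10000 <= code_point <= 0x10FFFF
    offset = code_point - 0x10000        # step 1: subtract 0x10000, leaving 20 bits
    lead = 0xD800 | (offset >> 10)       # step 2a: high 10 bits | 0xD800
    trail = 0xDC00 | (offset & 0x3FF)    # step 2b: low 10 bits | 0xDC00
    return lead, trail

lead, trail = to_surrogate_pair(0x1F600)  # 😀
print(hex(lead), hex(trail))  # 0xd83d 0xde00

# Cross-check against Python's own UTF-16 big-endian encoder:
assert chr(0x1F600).encode("utf-16-be") == (
    lead.to_bytes(2, "big") + trail.to_bytes(2, "big"))
```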

UTF-16 saves storage space while preserving parsing speed; it is a compromise between UTF-8 and UTF-32.

T = ??? There is no character-probability data at hand at present; I will update this after I look it up.