Character and encoding

In this world, we rely on written words to communicate. English letters are characters, Chinese ideographs are characters, and even the emoji 🐶 is a character.

English has 26 letters and 10 digits. Chinese has thousands of characters in common use. And there are many other languages besides.

But computers speak only binary, zeros and ones, so how do you represent these natural languages in terms of zeros and ones?

ASCII encoding and extension

After the computer was invented and put into use, there was an urgent need to represent real-world characters. For example, how do you store the character A?

So computer scientists first invented an encoding to store these letters, digits and common symbols.

  • Because these common symbols are few, a single byte is enough to represent a character
  • The ASCII code defines 128 code points (a small sketch follows this list)
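For instance, a minimal Python check (Python is used for the sketches in this article purely for illustration) shows that the letter A is stored as the single byte 65 (0x41), well inside ASCII's 128 code points:

# A quick illustration of ASCII: every code point fits within one byte (only 7 bits used).
print(ord("A"))                               # 65, the ASCII code of 'A'
print("A".encode("ascii"))                    # b'A' -> a single byte, 0x41
print(bytes(range(65, 91)).decode("ascii"))   # the 26 uppercase letters A-Z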

But ASCII only recognizes English, not Chinese or other languages.

Later, ASCII was extended to express some European languages.

Latin-1 is an alias for ISO-8859-1 (written latin-1 in some environments). ISO-8859-1 is a single-byte encoding, backward compatible with ASCII. Its range is 0x00-0xFF: 0x00-0x7F is identical to ASCII, 0x80-0x9F holds control characters, and 0xA0-0xFF holds printable symbols. Beyond the ASCII characters, ISO-8859-1 covers the Western European languages; other members of the ISO-8859 family cover Greek, Thai, Arabic, Hebrew, and more.

In addition, other countries likewise defined their own 256-code-point, single-byte encodings that remain backward compatible with ASCII.

Chinese encodings: the GB series

Extending ASCII for Chinese is a different story, because there are so many characters: more than 3,000 are in everyday use. So new encodings were designed for Chinese, representing a character in two bytes:

  • GB2312 contains more than 7,000 characters in total, including more than 6,000 Chinese characters and several hundred symbols, and is compatible with ASCII
  • Big5 is used for traditional Chinese characters
  • GBK contains more than 20,000 characters and is compatible with GB2312
  • But since these encodings are compatible with ASCII, they still contain single-byte characters, so how do you tell which bytes belong to a single-byte character? (see the sketch after this list)
    • The lead byte of a Chinese character always has its highest bit set to 1
    • ASCII uses only 7 bits, so its highest bit is always 0
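A small Python sketch of this high-bit rule (the gbk codec ships with Python):

# GB2312/GBK stay ASCII compatible: ASCII bytes keep their highest bit 0,
# while the bytes of the two-byte Chinese character '中' (GBK: d6 d0) have it set to 1.
data = "A中".encode("gbk")        # b'A\xd6\xd0'
for b in data:
    kind = "ASCII, single byte" if b < 0x80 else "part of a multi-byte character"
    print(f"{b:#04x}  highest bit = {b >> 7}  -> {kind}")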

Unicode

As more people and countries began to use computers, more and more encodings appeared, which caused plenty of problems:

  • Each encoding only covers its own natural language; they are mutually unintelligible
    • Latin-1 cannot represent Chinese; a GB-family encoding must be installed to display Chinese
    • GBK does not cover the languages of other countries; the corresponding code sets must be installed

So is there one encoding that recognizes all the characters in the world, all the natural languages? That would be hard for individuals to accomplish, but not for the computing community as a whole.

A specification was developed to unify all symbols and languages into one coded character set: Unicode.

Character set

A character set is the mapping between characters and their binary codes. As described above, we already have:

  • The ASCII character set
  • Extended ASCII encoding
  • GB character set

Unicode likewise specifies the mapping between the characters of every language, special symbols, and binary code points. Unicode has two character sets:

  • UCS-2 (Universal Character Set 2) uses two bytes per character and can represent 65,536 characters in total, which is usually enough.
  • UCS-4 (Universal Character Set 4) uses four bytes per character, is compatible with UCS-2, and can represent many more symbols, such as emoji (see the sketch after this list).
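A brief Python illustration of the difference: ordinary Chinese characters fit in the two-byte UCS-2 range, while an emoji such as 🐶 has a code point above 0xFFFF and therefore needs UCS-4:

# Code points at or below 0xFFFF fit in UCS-2; larger ones need UCS-4.
for ch in ("A", "中", "🐶"):
    cp = ord(ch)
    print(ch, hex(cp), "fits in UCS-2" if cp <= 0xFFFF else "needs UCS-4")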

UTF-16 encoding

UTF-16 is an encoding implementation of the UCS-2 character set.

UTF-16 uses two bytes to represent a character, and as with any data type wider than one byte, the question of big-endian versus little-endian storage arises.

The figure below shows the two ways to store the number 1537 in two bytes: the high byte is 00000110 (0x06) and the low byte is 00000001 (0x01).

We read memory in the direction of increasing addresses: the low address first, then the high address. A quick sketch of the two byte orders follows.
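The following minimal Python snippet (illustrative only) packs 1537 both ways with the struct module:

import struct

# 1537 = 0x0601: big-endian stores the high byte first, little-endian the low byte first.
print(struct.pack(">H", 1537).hex())  # '0601'  big-endian
print(struct.pack("<H", 1537).hex())  # '0106'  little-endian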

  • UTF-16 big-endian is marked by prefixing the text with the byte order mark FE FF, which matches the way humans write numbers
    • When we say a number such as 1537, we always say the high-order part first.
  • UTF-16 little-endian is marked by the prefix FF FE

Similarly, UTF-32 is an encoding implementation of the UCS-4 character set. Each character takes four bytes of storage, and a byte order mark is likewise used to distinguish big-endian from little-endian.
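The marks are easy to see with Python's built-in codecs (a small sketch; the generic "utf-16"/"utf-32" codecs prepend a BOM, while the -be/-le variants do not):

# The BOM written by the generic codecs reveals the byte order in use.
print("A".encode("utf-16").hex())     # 'fffe4100' on little-endian machines: BOM ff fe, then 41 00
print("A".encode("utf-16-be").hex())  # '0041'  big-endian, no BOM
print("A".encode("utf-16-le").hex())  # '4100'  little-endian, no BOM
print("A".encode("utf-32").hex())     # 'fffe000041000000' on little-endian machines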

UTF-8 encoding

ASCII characters need only one byte, yet UTF-16 spends two bytes on them, which wastes storage and network bandwidth.

So a variable-length encoding of Unicode was designed: UTF-8. It uses 1 to 4 bytes per symbol, with the length varying according to the symbol.

The UTF-8 encoding rules are simple:

1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. So for English letters, UTF-8 is identical to ASCII.

2) For a symbol that needs n bytes (n > 1), the first n bits of the first byte are set to 1, the (n+1)-th bit is set to 0, and the first two bits of every following byte are set to 10. The remaining bits, not mentioned above, are filled with the symbol's Unicode code point.
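To make rule 2 concrete, here is a minimal, illustrative Python function (not how real codecs are implemented; the name encode_utf8 is made up for this sketch) that applies exactly these bit rules:

def encode_utf8(code_point):
    # Encode one Unicode code point into UTF-8 bytes using the rules above.
    if code_point < 0x80:                      # rule 1: 0xxxxxxx
        return bytes([code_point])
    elif code_point < 0x800:                   # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    elif code_point < 0x10000:                 # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    else:                                      # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])

print(encode_utf8(ord("A")).hex())    # '41'        same as ASCII
print(encode_utf8(0x77E5).hex())      # 'e79fa5'    the character 知
print(encode_utf8(0x1F436).hex())     # 'f09f90b6'  🐶 needs four bytes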

For example, take the UTF-8 encoding of the Unicode character 知 ("know"):

  • Unicode code point: U+77E5

  • Corresponding binary: 01110111 11100101

  • But what is actually stored is:

# echo 知 | hexdump -C
00000000  e7 9f a5 0a                                       |....|
00000004

The 0a is the trailing newline, so the bytes actually stored are e7 9f a5, not 77 e5. Why is that?

Because U+77E5 encoded with the UTF-8 rules above becomes e7 9f a5: the code point 0111 0111 1110 0101 needs three bytes, laid out as 11100111 10011111 10100101.
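The same result can be confirmed with Python's built-in codec (a quick check, not one of the article's original commands):

# U+77E5 (知): the built-in UTF-8 codec produces exactly the bytes hexdump showed.
ch = "\u77e5"
print(hex(ord(ch)))              # 0x77e5
print(ch.encode("utf-8").hex())  # 'e79fa5'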

And because UTF-8 is a byte-oriented encoding, it does not need a byte order mark.

UTF-8 is the most widely used character encoding on computers and networks today.

The environments in which characters are used

We use characters in many different parts of a computer system, and each environment has details worth paying attention to.

Operating system and shell

Generally speaking, the default encoding on Linux and macOS is UTF-8, while on Chinese Windows it is GBK.

  • The text we type through the keyboard and input method is converted to bytes using the operating system's encoding.
  • Generally speaking, the terminal and the shell default to the same encoding as the operating system.
  • Besides the encoding, the operating system also has a locale property, which is related to the encoding (a small sketch follows this list)
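Here is a brief Python sketch (illustrative only) of how a program can ask which encoding the OS and terminal prefer:

import locale
import sys

# The encoding preferred by the OS locale, and the encoding of the attached terminal.
print(locale.getpreferredencoding())  # typically 'UTF-8' on Linux/macOS, 'cp936' (GBK) on Chinese Windows
print(sys.stdout.encoding)            # encoding used when printing to the terminal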

Let’s do a little experiment

Run the date command under two different locale settings:

LANG=zh_CN.UTF-8
# date
2021年 12月 18日 星期六 07:00:40 CST
LC_ALL=en_US.UTF-8
LANG=zh_CN.UTF-8
# date
Sat Dec 18 07:00:56 CST 2021
  • In the first case, only LANG=zh_CN.UTF-8 is set
  • In the second case, LC_ALL=en_US.UTF-8 is set in addition to LANG=zh_CN.UTF-8
  • zh_CN and en_US are locale identifiers for the language and the country or region.
  • A locale captures the language habits, cultural conventions, and daily customs of a region, including formats for time, currency, symbols, and words
  • For Chinese users of Simplified Chinese, the first output, Saturday, December 18, 2021, 07:00:40 CST (printed in Chinese), better matches their habits
  • For Americans using English, the output Sat Dec 18 07:00:56 CST 2021 better matches theirs

Setting the locale means setting 12 categories of locale attributes, that is, the 12 LC_* variables. Besides these 12 variables, two more exist for convenience: LC_ALL and LANG. They have a priority order: LC_ALL > LC_* > LANG. In other words, LC_ALL is the highest-priority, mandatory setting, while LANG supplies the default.
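The experiment above can also be scripted; a minimal sketch, assuming the zh_CN.UTF-8 and en_US.UTF-8 locales are installed on the system (they may not be):

import os
import subprocess

# LC_ALL, when set, overrides LANG for every locale category.
base = {k: v for k, v in os.environ.items() if not k.startswith(("LC_", "LANG"))}
cases = [
    {"LANG": "zh_CN.UTF-8"},                           # only LANG: Chinese-style output
    {"LANG": "zh_CN.UTF-8", "LC_ALL": "en_US.UTF-8"},  # LC_ALL wins: English-style output
]
for overrides in cases:
    result = subprocess.run(["date"], env={**base, **overrides},
                            capture_output=True, text=True)
    print(overrides, "->", result.stdout.strip())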

Files and program code

Viewing the encoding

The file command

The file command can be used to inspect the type of a file, including text files:

# file test.js
test.js: ASCII text
# file XIAOYUANGUI/*
XIAOYUANGUI/IP.png:            PNG image data, 287 x 68, 8-bit/color RGBA, non-interlaced
XIAOYUANGUI/fs.excalidraw:     JSON data
  • As you can see, file does not always report the character encoding, and even when it does, the result is not necessarily exact.
    • test.js is reported as ASCII text
    • but test.js is actually UTF-8 encoded (a UTF-8 file that contains only ASCII characters is indistinguishable from plain ASCII)
  • Although file names usually consist of a base name and a suffix, suffixes exist mainly to help humans guess a file's content and type; many Linux programs do not require them
  • In fact, there is such a thing as a magic number: the first byte or few bytes of a file, from whose value a program can determine the file type

You can use a hexadecimal editor to open the image file and then check the magic number table to determine the file type.

# hexdump -C Documents/XIAOYUANGUI/fs6.png | head
00000000  89 50 4e 47 0d 0a 1a 0a 00 00 00 0d 49 48 44 52  |.PNG........IHDR|

The PNG magic number is 89 50 4e 47.
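The same check can be done programmatically; a small sketch (the path is taken from the example above, so substitute any PNG file you have):

# Identify a PNG by its magic number: the first bytes of the file.
PNG_MAGIC = bytes.fromhex("89504e470d0a1a0a")  # \x89PNG\r\n\x1a\n, the full 8-byte signature

with open("Documents/XIAOYUANGUI/fs6.png", "rb") as f:
    header = f.read(8)

print(header.hex(), "-> PNG" if header == PNG_MAGIC else "-> not a PNG")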


Encoding conversion

Because different systems use different encodings, conversion is sometimes necessary to avoid garbled text.

Some commands can help with file encoding conversion:

  • The iconv command: iconv -f UTF-8 -t GBK a.txt > test2.txt (see the sketch after this list)
    • -f specifies the source encoding, -t specifies the target encoding
    • Opening test2.txt with vim shows that fileencoding has changed to gb18030
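Programs can perform the same conversion themselves; here is a minimal Python equivalent of the iconv command above (file names taken from that example):

# Re-encode a.txt from UTF-8 to GBK, the same conversion as:
#   iconv -f UTF-8 -t GBK a.txt > test2.txt
with open("a.txt", encoding="utf-8") as src:
    text = src.read()

with open("test2.txt", "w", encoding="gbk") as dst:
    dst.write(text)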

vim

To display a file's encoding in Vim, enter the command :set fileencoding; it prints, for example, fileencoding=utf-8.

  • Besides fileencoding, Vim has related options such as encoding and termencoding
    • encoding is the character encoding Vim uses internally
    • termencoding is the character encoding used to display text in the terminal, usually the same as encoding
    • fileencodings is the list of encodings Vim tries when decoding a file

When Vim opens a file, it detects the fileencoding automatically according to the fileencodings setting. fileencodings is a comma-separated list in which each item is an encoding name. When a file is opened, Vim tries to decode it with each encoding in fileencodings in turn:

  • If decoding succeeds, that encoding is used and fileencoding is set to it;
  • if it fails, Vim moves on to the next encoding in the list.
  • The figure shows that files encoded in GB18030 can be decoded; files in an encoding that is not in this list cannot be displayed correctly by Vim.

Vim encoding related processing is shown in the figure below:

  • If the file is actually UTF-8, the first probe succeeds, fileencoding is set to utf-8, and the file content is handled inside Vim using encoding
  • If the file is actually GB18030, the third probe succeeds, fileencoding is set to gb18030, and the content is converted to Vim's internal encoding (utf-8)
  • ucs-bom matches Unicode-encoded files that carry a BOM, such as UTF-16, UTF-32, or UTF-8 with a BOM; each has its own distinctive mark
    • The BOM (Byte Order Mark) is a marker at the beginning of a text file, used by the Unicode encoding standards to identify the file's format
  • Be sure to put strict encodings such as UTF-8 (which has strict formatting requirements) near the front,
  • and leave loose encodings for the end. latin1, for example, is extremely loose: any byte sequence can be decoded as latin1 without error, although the result is quite likely gibberish (a sketch of this detection logic follows the list).
    • So if latin1 is placed first in fileencodings, opening any Chinese file will show garbled characters.
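The detection logic can be sketched in a few lines of Python (a rough illustration of the idea, not Vim's actual implementation); note how latin1 at the end never fails, which is exactly why it must come last:

# Try each encoding in order, like Vim walking through 'fileencodings'.
FILEENCODINGS = ["utf-8", "gb18030", "latin1"]  # strict first, loose last

def detect(raw):
    for enc in FILEENCODINGS:
        try:
            return enc, raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this probe failed, try the next encoding

print(detect("中文".encode("utf-8")))     # detected as utf-8
print(detect("中文".encode("gb18030")))   # utf-8 fails, gb18030 succeeds
print(detect(b"\xff\xfe\x00A"))           # neither matches, latin1 swallows it anyway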

Source code encoding

Source code is just text, and compilers and interpreters usually read source files as UTF-8.

  • Formats such as XML, and languages such as Python, can also declare the file's character encoding explicitly in the file itself.
  • Note, however, that UTF-8 files produced on Windows often carry a BOM, which differs from Linux and macOS.

The following is a screenshot of the File Encodings settings of IntelliJ IDEA on macOS:

  • As you can see, the source code uses UTF-8 encoding
  • and by default UTF-8 files are created without a BOM

Encoding in network protocols

The HTTP protocol

HTTP is a text-based protocol: its headers are ASCII, and the body's character encoding is declared by the charset parameter of the Content-Type header, most commonly UTF-8.
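A brief sketch of reading the declared charset using only Python's standard library (example.com is a placeholder host, and network access is assumed):

from urllib.request import urlopen

# The body's encoding is announced in the Content-Type header,
# e.g. "Content-Type: text/html; charset=UTF-8".
with urlopen("http://example.com/") as resp:
    charset = resp.headers.get_content_charset() or "utf-8"  # fall back if not declared
    body = resp.read().decode(charset)

print(charset, len(body))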

URL encoding

RFC 1738 specifies that only letters and digits [0-9a-zA-Z], the special characters $-_.+!*'(), (double quotes excluded), and certain reserved characters may be used unencoded in a URL. Any other character must therefore be percent-encoded: the character is first encoded as bytes (for example with UTF-8), and each byte is written as % followed by two hexadecimal digits.
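For example, percent-encoding a Chinese character with the standard library (a small sketch):

from urllib.parse import quote, unquote

# Each UTF-8 byte of a non-ASCII character becomes %XX in the URL.
encoded = quote("知", encoding="utf-8")
print(encoded)           # '%E7%9F%A5' -- the UTF-8 bytes e7 9f a5, percent-encoded
print(unquote(encoded))  # '知'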

MySQL encoding

MySQL has a utf8 character set for storing text, but there is a pitfall here: MySQL's utf8 is not an implementation of what we normally call UTF-8.

  • MySQL's utf8 stores at most three bytes per character,
  • whereas real UTF-8 uses up to four bytes per character. Emoji take four bytes, as do some rare, complex, and traditional characters, so MySQL's utf8 cannot store emoji.
  • If such an emoji is written to a utf8 column, the conversion fails and so does the write.

To support these additional symbols, you need to use MySQL's utf8mb4 character set (a small sketch follows).
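The root cause is easy to verify: an emoji needs four UTF-8 bytes, one more than MySQL's three-byte utf8 allows per character. A quick Python check (the SQL statement in the comment only sketches the fix; the table name messages is made up):

# An emoji is one character but four UTF-8 bytes -- too long for MySQL's
# 3-byte utf8, which is why a utf8 column rejects it.
emoji = "😅"
print(hex(ord(emoji)))              # 0x1f605
print(len(emoji.encode("utf-8")))   # 4 bytes
print(len("知".encode("utf-8")))     # 3 bytes: this one fits in MySQL utf8
# Fix (hypothetical table): ALTER TABLE messages CONVERT TO CHARACTER SET utf8mb4;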

emoji

😅 (Unicode: U+1F605) is an emoji, and the Unicode code point of 🐶 is U+1F436.

For example, here are the emoji inputs supported by the iPhone:

Conclusion

This article has walked through the history of character encodings and their unification, as well as the encoding problems some applications run into, especially tools such as MySQL and Vim. Unicode supports more and more characters, even emoji and "Martian script" (stylized internet slang). Whenever possible, use UTF-8: it supports all of these characters and saves storage space.