Original: Coding Diary (WeChat official account: Codelogs). Feel free to share; please keep this attribution when reposting.

Introduction

Modern programming languages abstract away the concept of a string. Note that it is a high-level abstraction: when computers actually represent information they use bytes, so there must be a mechanism for converting between strings and bytes. That mechanism is character encoding, such as GBK or UTF-8. You can understand the relationship between strings and character encodings as follows:

  1. A string is an abstraction, such as the String class in Java, which is conceptually encoding-independent. It holds a sequence of characters, and you do not need to care what encoding represents it in memory (even though some encoding is, of course, needed to store the string in memory).
  2. Bytes are where encoding matters. When you save a string to a file or send it over the network, you need a character encoding to convert the string into bytes, because at the lowest level the computer only deals with bytes.

Common character encoding schemes

ASCII

ASCII (American Standard Code for Information Interchange) is a standard for encoding English characters. Each character occupies one byte, using only the lower 7 bits of the byte; the highest bit is always 0.

ISO8859-1

ISO8859-1 uses all 8 bits of a byte, so it can represent 2^8 = 256 characters. When the highest bit is 0, the encoding is identical to ASCII, so ISO8859-1 is ASCII compatible.

GBK

Its full name is the Chinese Internal Code Extension Specification, published in 1995 and used for encoding Chinese characters. Each Chinese character is encoded in two bytes. GBK is compatible with the ASCII encoding scheme: English characters in the ASCII range are still encoded in one byte.

Unicode

Unicode is a universal character set that defines all characters in the world, avoiding the incompatibilities between the local character sets designed by different countries. Unicode is sometimes called UCS, because another organization defined a similar scheme named UCS that was later merged with Unicode. Note that Unicode is a character set, not a character encoding. To understand what Unicode is, you need to understand the relationship between character sets and character encodings. In general, a character set defines the correspondence between characters and code points, while a character encoding defines the correspondence between code points and bytes. For example, the ASCII character set specifies that A is represented by 65; it does not care what bytes represent 65 on a computer. The ASCII character encoding then defines that 65 should be represented by a single byte, 01000001, or 0x41 in hexadecimal, which is the one and only implementation of the ASCII character set. Unicode, as a character set, does not specify how Unicode characters should be encoded into bytes; UTF-16, UTF-32, and UTF-8 are Unicode character encoding schemes, which specify how to convert Unicode characters into the corresponding bytes.
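
To make the distinction concrete, here is a minimal Java sketch (the class name is invented for illustration) that prints a character's code point and then its bytes under several Unicode encodings, using only standard-library classes; the code point stays the same while the bytes differ for each encoding.

import java.nio.charset.Charset;

public class CodePointVsEncoding {
    public static void main(String[] args) {
        String s = "好";
        // The character set (Unicode) defines the code point: U+597D
        System.out.printf("code point: U+%04X%n", s.codePointAt(0));
        // The character encodings define the bytes, so each encoding gives different bytes
        for (String name : new String[]{"UTF-8", "UTF-16BE", "UTF-32BE"}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : s.getBytes(Charset.forName(name))) {
                hex.append(String.format("%02x", b));
            }
            System.out.println(name + ": " + hex);  // e5a5bd / 597d / 0000597d
        }
    }
}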

UTF-32

UTF-32, also known as UCS-4, is the most direct encoding of Unicode. It uses four bytes to represent a Unicode code point; for example, the four bytes corresponding to the letter A are 0x00000041. It is also the only fixed-length encoding in the UTF-* family, which makes it fast to locate the nth character and convenient for pointer arithmetic. But four bytes per character is a big waste of space for English text.

UTF-16

UTF-16, which evolved from UCS-2, uses at least two bytes to represent a code point; for example, the two bytes corresponding to the letter A are 0x0041. Note that UTF-16 is a variable-length encoding, although code points up to 65535 need only 2 bytes. Many historical code bases implemented UTF-16 as if every character were stored in exactly 2 bytes, which causes problems when handling characters whose code points are above 65535. In addition, UTF-16 wastes up to twice the storage space for pure English text.
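
The pitfall with code points above 65535 is easy to see in Java, whose String stores UTF-16 code units; the sketch below (class name invented for illustration) uses the emoji code point U+1F600, which needs a surrogate pair of two code units.

public class SurrogatePairDemo {
    public static void main(String[] args) {
        // U+1F600 is above 65535, so UTF-16 represents it with a surrogate pair
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(emoji.length());                            // 2 UTF-16 code units
        System.out.println(emoji.codePointCount(0, emoji.length()));   // 1 actual character
        // Code that assumes one char == one character breaks on such input
    }
}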

BOM

Computers with different byte orders store the same value as bytes in a different order. For example, the code point U+4E2D can be stored in UTF-16 as 4E 2D or as 2D 4E, depending on whether the computer is big-endian or little-endian; UTF-32 behaves similarly. To solve this problem, UTF-32 and UTF-16 both introduce the BOM mechanism, which places a special character, the BOM (U+FEFF), at the beginning of the file. A UTF-16 encoded file that starts with FF FE is little-endian, and one that starts with FE FF is big-endian. Accordingly, UTF-16 is divided into UTF-16BE (big-endian) and UTF-16LE (little-endian), and the same applies to UTF-32.
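
One way to see the BOM and byte order in practice is to compare Java's "UTF-16", "UTF-16BE", and "UTF-16LE" charsets; in the JDK, plain "UTF-16" writes a big-endian BOM (FE FF) when encoding, while the BE/LE variants write no BOM. A small sketch (class and method names are just for illustration):

import java.nio.charset.Charset;

public class BomDemo {
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x ", b));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "中";  // U+4E2D
        System.out.println(hex(s.getBytes(Charset.forName("UTF-16"))));    // fe ff 4e 2d (BOM + big-endian)
        System.out.println(hex(s.getBytes(Charset.forName("UTF-16BE"))));  // 4e 2d
        System.out.println(hex(s.getBytes(Charset.forName("UTF-16LE"))));  // 2d 4e
    }
}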

Unicode notation

We often see things of the form U+XXXX or \uXXXX. This is a way of writing Unicode characters, commonly called Unicode notation, where XXXX is the hexadecimal representation of the code point; for example, U+0041 or \u0041 denotes the letter A in Unicode. At first glance this looks a bit like UTF-16, but note that it is a way of referring to a Unicode character with a plain text label; it is not a character encoding, which turns a Unicode character into a sequence of bytes.
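
In Java source code, \uXXXX is a compile-time escape, so the literal "\u597d" is exactly the same string as "好"; the sketch below (class name invented here) also shows formatting a character's code point back into U+XXXX notation.

public class UnicodeNotationDemo {
    public static void main(String[] args) {
        // The compiler resolves \u597d into the character it names, so the two literals are equal
        System.out.println("\u597d".equals("好"));                          // true
        // Going the other way: format a code point as U+XXXX notation
        System.out.println(String.format("U+%04X", "好".codePointAt(0)));   // U+597D
    }
}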

UTF-8

Because UTF-16 uses two bytes even for English characters, it wastes space for pure English text, so Ken Thompson, the father of Unix, invented another Unicode character encoding, UTF-8. For characters in the ASCII range, UTF-8 is exactly the same as ASCII; other characters are stored in two, three, or even four bytes, so UTF-8 is a variable-length encoding. Common Chinese characters take 3 bytes in UTF-8.
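
UTF-8's variable length is easy to verify: an ASCII letter takes 1 byte, a common Chinese character 3 bytes, and an emoji 4 bytes. A minimal sketch (class name is just for illustration):

import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    public static void main(String[] args) {
        String ascii = "A";                                     // U+0041
        String chinese = "好";                                   // U+597D
        String emoji = new String(Character.toChars(0x1F600));  // U+1F600
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 1
        System.out.println(chinese.getBytes(StandardCharsets.UTF_8).length);  // 3
        System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length);    // 4
    }
}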

Inclusion diagram (figure omitted)

What causes garbled text?

Garbled text is essentially the result of the encoding program and the decoding program using different character encodings. For example, when one program (the encoder) stores a string into a file using UTF-8 and another program (the decoder) reads it back using GBK, garbled characters appear.
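
This mismatch can be reproduced in a few lines of Java: encode a string with UTF-8, then decode the same bytes with GBK, and garbled text comes out (a sketch with an invented class name; the exact garbled output depends on the characters involved).

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "好";
        // Encoder side: string -> bytes using UTF-8 (e5 a5 bd)
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        // Decoder side: bytes -> string using GBK, the wrong encoding
        String garbled = new String(utf8Bytes, Charset.forName("GBK"));
        System.out.println(garbled);  // prints garbled characters instead of 好
    }
}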

Practice – Java

String.getBytes() and new String(bytes)

String str = "Good";
// String to byte, using UTF-8
byte[] bytes = str.getBytes("UTF-8");       
//' good 'is encoded in UTF-8 as 3 bytes e5a5bd
System.out.println(Hex.encodeHexString(bytes));  
// Convert bytes to strings, using UTF-8
System.out.println(new String(bytes, "UTF-8"));  

// String to byte, no character encoding, default to use the operating system encoding, MY development machine is Windows, default encoding GBK
bytes = str.getBytes();            
//' good 'encoded in GBK is 2 bytes bac3
System.out.println(Hex.encodeHexString(bytes));  
// Byte to string, also using my current operating system default encoding GBK
System.out.println(new String(bytes));           

Java's String.getBytes() and new String(bytes) convert between strings and bytes, but the versions that take a charset are recommended, i.e. getBytes("UTF-8") and new String(bytes, "UTF-8"). When no character encoding is specified, the operating system's default encoding is used; on Windows the default is usually GBK, which leads to programs that work perfectly well on Linux or macOS but produce garbled text on Windows. Similarly, InputStreamReader and OutputStreamWriter also come in versions with and without a charset parameter, and it is best to use the charset versions there too.

// InputStreamReader and OutputStreamWriter behave the same way: if no character
// encoding is specified, the operating system's default encoding is used
InputStreamReader isr = new InputStreamReader(in, "UTF-8");
OutputStreamWriter osw = new OutputStreamWriter(out, "UTF-8");

You can also add the JVM startup parameter -Dfile.encoding=UTF-8, which sets the JVM's default encoding to UTF-8 and keeps the application from inheriting the operating system's encoding. Alternatively, set the encoding to UTF-8 manually on the first line of your program, as follows, so that no garbled text appears even if someone forgets to set the JVM parameter:

// Set the current JVM's default character encoding to UTF-8, to avoid inheriting the OS encoding
System.setProperty("file.encoding", "UTF-8");

Practice – Linux

od and xxd

od and xxd are tools for viewing bytes in hexadecimal, octal, binary, or decimal form, as follows:

# echo's output is UTF-8 encoded
$ echo -n 好 | xxd
00000000: e5a5 bd
# the -b option outputs binary (0s and 1s)
$ echo -n 好 | xxd -b
00000000: 11100101 10100101 10111101
# od can also output hexadecimal
$ echo -n 好 | od -t x1
0000000 e5 a5 bd
# view the ASCII table on Linux
$ man ascii
$ printf "%0.2X" {0..127} | xxd -r -ps | od -t x1d1c

iconv

iconv is a handy tool for converting between character encodings, as follows:

# iconv converts the UTF-8 bytes output by echo into GBK bytes
$ echo -n 好 | iconv -f UTF-8 -t GBK | xxd
00000000: bac3
# the UTF-16 encoding of a Chinese character is usually 2 bytes
$ echo -n 好 | iconv -f UTF-8 -t UTF-16BE | xxd
00000000: 597d
# the UTF-32 encoding is 4 bytes, with the first 2 bytes being 0
$ echo -n 好 | iconv -f UTF-8 -t UTF-32BE | xxd
00000000: 0000 597d

Other useful tools

# Unicode notation to string
$ echo -e '\u597d'
好
# string to Unicode notation
$ echo -n 好 | iconv -f utf-8 -t ucs-2be | od -A n -t x2 --endian=big | sed 's/\x20/\\u/g'
\u597d
# guess a file's character encoding
$ enca -L zh_CN -g -i file.txt
UTF-8
# convert the file to UTF-8
$ enca -L zh_CN -c -x UTF-8 file.txt

What is utf8mb4?

UTF-8, as a Unicode character encoding scheme, can encode every character in Unicode. However, MySQL's early implementation of UTF-8 limited it to a maximum of 3 bytes per character, which is why it is also called utf8mb3. As a result, the emoji that are everywhere today cannot be stored, because encoding an emoji takes four bytes. MySQL later introduced utf8mb4 to make up for this.

Conclusion

Thoroughly understanding character encodings is not easy, mainly because computer books rarely cover them systematically. When I first started working I kept running into all kinds of mojibake problems; a quick online search would turn up some setting that made the symptom go away, but I never really understood why, until I learned the iconv command and finally worked it all out clearly.

Previous posts

Linux text command tips (1): awk is the real power tool