This is my first getting-started article.

In Java web programs you often run into garbled characters, and if you don't understand character sets and encodings they can cause a lot of headaches.

Character set

A character set is the collection of all abstract characters supported by a system. "Character" here is the general term for letters, national scripts, punctuation marks, graphic symbols, digits, and so on. A character set is essentially a code table: it specifies a one-to-one correspondence between characters and numbers, and it has no necessary connection to how the computer represents those characters internally.

The ASCII character set, for example, specifies which characters the 128 numbers from 0 to 127 correspond to. It is a dictionary table: looking up 0x41 gives the uppercase letter 'A', and 0x61 gives the lowercase letter 'a'. These mappings are spelled out in the ASCII code table.
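
In Java a char is just such a number, so the dictionary lookup is easy to see. A small sketch:

// looking up the ASCII code table from Java
public class AsciiLookup {
    public static void main(String[] args) {
        System.out.println((char) 0x41); // A  (number -> character)
        System.out.println((char) 0x61); // a
        System.out.println((int) 'A');   // 65, i.e. 0x41  (character -> number)
    }
}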

This article focuses on only one character set, the Unicode character set. Unicode is designed to be able to express any character in any language. It uses numbers of up to four bytes to represent each letter, symbol, or ideograph, and each number identifies a unique symbol used in at least one language; characters shared by several languages are usually assigned the same number. Each character corresponds to a number, and each number corresponds to a character.
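
A small sketch of this character-to-number correspondence in Java (the sample string is just an arbitrary choice):

public class CodePoints {
    public static void main(String[] args) {
        String s = "A中😀";
        // every character maps to exactly one Unicode code point (a number)
        s.codePoints().forEach(cp ->
            System.out.printf("U+%04X -> %s%n",
                              cp, new String(Character.toChars(cp))));
        // prints U+0041 -> A, U+4E2D -> 中, U+1F600 -> 😀
    }
}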

Character encoding

Character encoding: a set of rules for pairing the characters of a natural language (such as an alphabet or a syllabary) with something else, typically numbers. Establishing a correspondence between a symbol set and a number system is a basic technique of information processing. People usually express information with symbols, while computer-based information systems store and process information as combinations of component states, and those combinations represent numbers. Character encoding is therefore the conversion of symbols into numbers the computer can accept, known as digital codes.
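
To make the distinction concrete, the sketch below (standard-library calls only) encodes the same abstract character with two different rule sets and gets two different byte sequences:

import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    public static void main(String[] args) {
        String s = "中"; // one abstract character from the Unicode character set
        // the encoding decides which numbers (bytes) the character becomes
        for (byte b : s.getBytes(StandardCharsets.UTF_8))
            System.out.printf("%02X ", b);     // E4 B8 AD
        System.out.println();
        for (byte b : s.getBytes(StandardCharsets.UTF_16BE))
            System.out.printf("%02X ", b);     // 4E 2D
        System.out.println();
    }
}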

Let's look at UTF-8. The UTF-8 encoding scheme is relatively simple and can be described as follows:

  1. Values from 0 to 127 are represented in UTF-8 by a single byte whose value is the code point itself. For example, 0x61 is encoded as the single byte 0x61.

  2. Two-byte UTF-8 sequences have the form 110XXXXX 10XXXXXX. Once a value exceeds 127 it can no longer be encoded in one byte, so the encoding expands to two bytes. In a two-byte sequence five bits are fixed (the 110 and 10 prefixes), leaving 11 bits available for the code point.

The first three bits of the first byte are 110 and the first two bits of the second byte are 10. Take 128, whose binary is 1000 0000: its UTF-8 encoding must be two bytes. Put the low six bits (000000) into the second byte and the remaining bits (10) into the first byte, and you get 11000010 10000000.

  3. Three-byte UTF-8 sequences have the format 1110XXXX 10XXXXXX 10XXXXXX and can encode 16 bits, which covers most commonly used Chinese characters. (The sketch right after this list checks these byte patterns.)
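
The byte patterns above can be verified from Java; this is just a verification sketch, with 0x80 and '中' as the example characters:

import java.nio.charset.StandardCharsets;

public class Utf8Patterns {
    public static void main(String[] args) {
        // code point 128 (0x80): two bytes, 11000010 10000000
        dump(new String(Character.toChars(0x80)));
        // '中' (U+4E2D): three bytes, 1110xxxx 10xxxxxx 10xxxxxx
        dump("中");
    }

    static void dump(String s) {
        for (byte b : s.getBytes(StandardCharsets.UTF_8))
            System.out.print(Integer.toBinaryString(b & 0xFF) + " ");
        System.out.println();
    }
}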

When a UTF-8-encoded Chinese character is read from the console with a byte stream, you get its raw bytes, typically three of them, exactly as UTF-8 encoded them. But if you read it with a character stream, you get its Unicode code. Clearly something is going on in between.
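
A minimal sketch of the difference (reading from an in-memory byte array rather than the console, so it is self-contained):

import java.io.*;
import java.nio.charset.StandardCharsets;

public class StreamVsReader {
    public static void main(String[] args) throws IOException {
        byte[] utf8 = "中".getBytes(StandardCharsets.UTF_8);

        // byte stream: we see the three raw UTF-8 bytes
        try (InputStream in = new ByteArrayInputStream(utf8)) {
            int b;
            while ((b = in.read()) != -1)
                System.out.printf("%02X ", b);          // E4 B8 AD
        }
        System.out.println();

        // character stream: the reader decodes UTF-8 back into a Unicode char
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int c = r.read();
            System.out.printf("U+%04X%n", c);           // U+4E2D
        }
    }
}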

The decoding happens in the JDK's UTF-8 decoder (sun.nio.cs.UTF_8). The source, abridged and with elided checks marked by comments, is as follows:

public int decode(byte[] sa, int sp, int len, char[] da) {
    final int sl = sp + len;
    int dp = 0;
    int dlASCII = Math.min(len, da.length);
    ByteBuffer bb = null; // only necessary if malformed

    // ASCII only optimized loop
    while (dp < dlASCII && sa[sp] >= 0)
        da[dp++] = (char) sa[sp++];

    while (sp < sl) {
        int b1 = sa[sp++];
        if (b1 >= 0) {
            // 1 byte, 7 bits: 0xxxxxxx
            da[dp++] = (char) b1;
        } else if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) {
            // 2 bytes, 11 bits: 110xxxxx 10xxxxxx
            if (sp < sl) {
                int b2 = sa[sp++];
                // ... there is an illegal check here (elided) ...
                da[dp++] = (char) (((b1 << 6) ^ b2) ^
                                   (((byte) 0xC0 << 6) ^
                                    ((byte) 0x80 << 0)));
                continue;
            }
            if (malformedInputAction() != CodingErrorAction.REPLACE)
                return -1;
            da[dp++] = replacement().charAt(0);
            return dp;
        } else if ((b1 >> 4) == -2) {
            // 3 bytes, 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
            if (sp + 1 < sl) {
                int b2 = sa[sp++];
                int b3 = sa[sp++];
                if (isMalformed3(b1, b2, b3)) {
                    // ... (elided) ...
                } else {
                    char c = (char) ((b1 << 12) ^
                                     (b2 << 6) ^
                                     (b3 ^
                                      (((byte) 0xE0 << 12) ^
                                       ((byte) 0x80 << 6) ^
                                       ((byte) 0x80 << 0))));
                    // ... (elided) ...
                }
                continue;
            }
            if (malformedInputAction() != CodingErrorAction.REPLACE)
                return -1;
            if (sp < sl && isMalformed3_2(b1, sa[sp])) {
                da[dp++] = replacement().charAt(0);
                continue;
            }
            da[dp++] = replacement().charAt(0);
            return dp;
        } else if ((b1 >> 3) == -2) {
            // 4 bytes, 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            if (sp + 2 < sl) {
                int b2 = sa[sp++];
                int b3 = sa[sp++];
                int b4 = sa[sp++];
                int uc = ((b1 << 18) ^
                          (b2 << 12) ^
                          (b3 << 6) ^
                          (b4 ^
                           (((byte) 0xF0 << 18) ^
                            ((byte) 0x80 << 12) ^
                            ((byte) 0x80 << 6) ^
                            ((byte) 0x80 << 0))));
                if (isMalformed4(b2, b3, b4) ||
                    // shortest form check
                    !Character.isSupplementaryCodePoint(uc)) {
                    // ... (elided) ...
                } else {
                    da[dp++] = Character.highSurrogate(uc);
                    da[dp++] = Character.lowSurrogate(uc);
                }
                continue;
            }
            if (malformedInputAction() != CodingErrorAction.REPLACE)
                return -1;
            b1 &= 0xff;
            if (b1 > 0xf4 ||
                sp < sl && isMalformed4_2(b1, sa[sp] & 0xff)) {
                da[dp++] = replacement().charAt(0);
                continue;
            }
            sp++;
            if (sp < sl && isMalformed4_3(sa[sp])) {
                da[dp++] = replacement().charAt(0);
                continue;
            }
            da[dp++] = replacement().charAt(0);
            return dp;
        } else {
            if (malformedInputAction() != CodingErrorAction.REPLACE)
                return -1;
            da[dp++] = replacement().charAt(0);
        }
    }
    return dp;
}

Apart from the many checks for illegal (malformed) input, what the code really does is strip the prefix bits and reassemble the payload bits with shifts.
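
The same shift-and-assemble idea, reduced to a single three-byte sequence, written with masks instead of the XOR trick the JDK uses, and with all validity checks dropped:

public class ShiftAssemble {
    public static void main(String[] args) {
        // the three UTF-8 bytes of '中' (U+4E2D)
        byte b1 = (byte) 0xE4, b2 = (byte) 0xB8, b3 = (byte) 0xAD;

        // strip the 1110 / 10 / 10 prefixes and splice the payload bits:
        // (b1 & 0x0F) << 12 | (b2 & 0x3F) << 6 | (b3 & 0x3F)
        char c = (char) (((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F));
        System.out.println(c);                  // 中
        System.out.printf("U+%04X%n", (int) c); // U+4E2D
    }
}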

Conclusion

An encoding depends on a character set, just as an implementation class in code depends on its interface; and one character set can have multiple encodings (Unicode has UTF-8, UTF-16, and UTF-32), just as one interface can have multiple implementation classes.