Recently, while using Java to develop session storage for enterprise WeChat, I ran into a string encoding problem. Both the development environment and the database use UTF-8, and everything works on Linux. Windows, however, uses GBK by default, so after deploying there, Chinese conversations came out garbled when they were parsed.

Once I had found the cause, I assumed it would be enough to convert the GBK-encoded strings directly to UTF-8. After searching online and implementing the conversion, though, I found that the Chinese text lost data after transcoding. Another round of searching and a closer look at the GBK and UTF-8 formats showed that the loss in GBK-to-UTF-8 conversion comes from the characteristics of the two encoding formats.

GBK represents each Chinese character with 2 bytes. For example, the GBK encoding of the two-character word 内部 ("inside") is C4 DA B2 BF in hexadecimal.

UTF-8 represents each Chinese character with 3 bytes. For example, the UTF-8 encoding of the same word 内部 ("inside") is E5 86 85 E9 83 A8 in hexadecimal.
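The two byte sequences can be checked with a few lines of Java. This is only a sketch: the class name `EncodedBytesDemo` and the helper `toHex` are made up for illustration, and it assumes the JDK in use ships the GBK charset (full JDK builds normally do). The word is written with Unicode escapes so that the result does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodedBytesDemo {

    // Formats a byte array as space-separated uppercase hex, e.g. "C4 DA".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+5185 U+90E8 is the two-character word for "inside".
        String word = "\u5185\u90e8";
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK

        System.out.println(toHex(word.getBytes(gbk)));                    // C4 DA B2 BF
        System.out.println(toHex(word.getBytes(StandardCharsets.UTF_8))); // E5 86 85 E9 83 A8
    }
}
```

The same String yields 4 bytes in GBK and 6 bytes in UTF-8, which is the size difference described above.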

Obviously, GBK bytes cannot simply be relabeled as UTF-8: every 2 bytes have to become 3, and nothing in the GBK bytes themselves says what the extra bits should be.
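To see where data actually disappears, here is a small sketch that decodes GBK bytes with the wrong charset, which is roughly what happens when the runtime default does not match the data. The class and method names are hypothetical, and the GBK charset is assumed to be available. For comparison, `transcode` shows the case where the original GBK bytes are still at hand and are decoded as GBK before being re-encoded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WrongCharsetDemo {

    static final Charset GBK = Charset.forName("GBK"); // assumption: GBK is available

    // Decodes GBK bytes with the wrong charset (UTF-8), as a mismatched default would.
    static String decodeWrongly(byte[] gbkBytes) {
        return new String(gbkBytes, StandardCharsets.UTF_8);
    }

    // Decodes GBK bytes as GBK, then re-encodes the resulting String as UTF-8.
    static byte[] transcode(byte[] gbkBytes) {
        return new String(gbkBytes, GBK).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // GBK bytes of U+5185 U+90E8 ("inside").
        byte[] gbkBytes = {(byte) 0xC4, (byte) 0xDA, (byte) 0xB2, (byte) 0xBF};

        // Byte sequences that are invalid as UTF-8 are replaced with U+FFFD,
        // so the original bytes cannot be recovered from this String.
        String garbled = decodeWrongly(gbkBytes);
        System.out.println(garbled.contains("\uFFFD"));

        String restored = new String(transcode(gbkBytes), StandardCharsets.UTF_8);
        System.out.println(restored.equals("\u5185\u90e8"));
    }
}
```

Once a String contains U+FFFD replacement characters, no later conversion can bring the original text back, so the decoding step is the one to watch.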

There’s a problem. Can we fix it?

After some research I found several solutions, but none of them is perfect so far.

The first is to use a lossy conversion:
  1. Read each character of the string as a 16-bit binary value. The code below uses charAt() for this, so each value is one Java char.
  2. According to the UTF-8 encoding rules for Chinese characters, the first byte starts with 1110, and the second and third bytes each start with 10. Insert these flag bits into the original binary string; its length grows from 16 to 16 + 4 + 2 + 2 = 24 bits.
  3. In practice the conversion has to consider more cases. For example, a string that mixes Chinese characters and digits needs the digits identified and handled separately.
```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GbkToUtf8 {

    public static String getUTF8StringFromGBKString(String gbkStr) {
        return new String(getUTF8BytesFromGBKString(gbkStr), StandardCharsets.UTF_8);
    }

    public static byte[] getUTF8BytesFromGBKString(String gbkStr) {
        int n = gbkStr.length();
        // Worst case: every char needs three bytes.
        byte[] utfBytes = new byte[3 * n];
        int k = 0;
        for (int i = 0; i < n; i++) {
            int m = gbkStr.charAt(i);
            if (m < 0x80) {
                // ASCII (digits, Latin letters): one byte, copied as-is.
                utfBytes[k++] = (byte) m;
            } else if (m < 0x800) {
                // Two-byte form: 110xxxxx 10xxxxxx.
                utfBytes[k++] = (byte) (0xc0 | (m >> 6));
                utfBytes[k++] = (byte) (0x80 | (m & 0x3f));
            } else {
                // Three-byte form used for Chinese characters: 1110xxxx 10xxxxxx 10xxxxxx.
                // Surrogate pairs (characters outside the BMP) are not handled here.
                utfBytes[k++] = (byte) (0xe0 | (m >> 12));
                utfBytes[k++] = (byte) (0x80 | ((m >> 6) & 0x3f));
                utfBytes[k++] = (byte) (0x80 | (m & 0x3f));
            }
        }
        // Trim the unused tail of the buffer.
        return Arrays.copyOf(utfBytes, k);
    }
}
```
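As a worked example of step 2, the sketch below spreads one 16-bit value over three bytes and prints them in binary. The class name `BitInsertionDemo` and the method `toThreeBytes` are hypothetical, and the input is assumed to be 0x800 or larger, which holds for Chinese characters.

```java
final class BitInsertionDemo {

    // Spreads a 16-bit value (0x800 or larger) over three bytes
    // using the 1110 / 10 / 10 prefixes.
    static byte[] toThreeBytes(char c) {
        return new byte[] {
            (byte) (0xE0 | (c >> 12)),         // 1110 + top 4 bits
            (byte) (0x80 | ((c >> 6) & 0x3F)), // 10 + middle 6 bits
            (byte) (0x80 | (c & 0x3F))         // 10 + low 6 bits
        };
    }

    public static void main(String[] args) {
        char c = '\u5185'; // 0101 0001 1000 0101
        for (byte b : toThreeBytes(c)) {
            // Print each byte as 8 binary digits, left-padded with zeros.
            String bits = Integer.toBinaryString(b & 0xFF);
            System.out.println("00000000".substring(bits.length()) + bits);
        }
        // Prints 11100101, 10000110, 10000101, i.e. E5 86 85.
    }
}
```

The 16 payload bits 0101 000110 000101 are unchanged; only the 4 + 2 + 2 prefix bits are added, giving 24 bits in total.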
The second method is the most primitive one: a character mapping table that maps each GBK character to its UTF-8 counterpart. There is currently a C++ implementation on GitHub; see the link Github.com/DavidLiRemi… This method still has problems, however: the UTF-8 character set is much larger than GBK, so some characters in UTF-8 have no counterpart in GBK, and a few characters are lost.
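Java's built-in charsets already contain such a mapping table, so one way to find out ahead of time whether a string would lose characters in GBK is to ask the encoder. This is a sketch rather than a replacement for the linked C++ project: `GbkCoverageCheck` and `fitsInGbk` are hypothetical names, and the GBK charset is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class GbkCoverageCheck {

    // Returns true if every character of the text has a mapping in GBK.
    static boolean fitsInGbk(String text) {
        // A fresh encoder per call; CharsetEncoder instances are not thread-safe.
        CharsetEncoder encoder = Charset.forName("GBK").newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(fitsInGbk("\u5185\u90e8")); // Chinese word, covered by GBK
        System.out.println(fitsInGbk("\uD83D\uDE00")); // U+1F600, an emoji outside GBK
    }
}
```

A check like this could be run before writing text to a GBK-only destination, so that unmappable characters are reported instead of being silently replaced.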
The third is to change the environment directly so that the encoding is consistent everywhere the code runs. Linux uses a UTF-8 environment by default, so there you only need to make sure that the character set of the code matches that of the database tables. Windows, however, uses a GBK environment by default, which leaves two options:
  1. Switch the code and the database to GBK. This requires the code to detect whether it is running on Linux or Windows and switch accordingly, which adds maintenance work.
  2. Switch the Java runtime environment on Windows to UTF-8. Several ways to do this are summarized here:
    1. For programs run from a jar package, add -Dfile.encoding=UTF-8 to the VM parameters.
    2. To fix garbled log output, modify the logback configuration in Spring and remove the following parameter: <charset>utf8</charset>
    3. In the CMD console, run the chcp command to change the code page: chcp 936 (the system default) switches to Simplified Chinese (GBK, which Windows labels GB2312), and chcp 65001 switches to UTF-8.
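To confirm that the VM parameter took effect, the default charset can be printed at startup. The sketch below uses hypothetical names (`DefaultCharsetCheck`, `decodeUtf8`). It also shows the complementary habit of passing the charset explicitly, which keeps the result independent of the platform default. Note that on JDK 18 and later the default charset is UTF-8 on every platform, so the output depends on the JDK version as well as on the operating system.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultCharsetCheck {

    // Decodes bytes as UTF-8 explicitly, so the result does not depend
    // on the platform default charset.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The charset the JVM uses when none is passed to String, Reader or Writer APIs.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

        // UTF-8 bytes of U+5185; decoded the same way on Linux and Windows.
        byte[] utf8 = {(byte) 0xE5, (byte) 0x86, (byte) 0x85};
        System.out.println(decodeUtf8(utf8).equals("\u5185"));
    }
}
```

Running it once with and once without -Dfile.encoding=UTF-8 on the Windows machine shows whether the parameter is actually being picked up.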