On the encoding format problem when saving files with Windows Notepad

When Notepad saves a text file under Windows, you can choose among several encoding formats, and the bytes written to disk differ depending on the format chosen. For example:

Type 123 in Notepad and save it with the default encoding, ANSI, which maps to the system's default code page; on the Simplified Chinese version of Windows that is GBK. Opening the file in a binary (hex) viewer shows:

31 32 33
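As a quick check (this is a verification sketch, not Notepad's own code; the class name is just for illustration), the bytes GBK produces for "123" can be printed from Java:

package test;

public class GbkBytes {
    public static void main(String[] args) throws Exception {
        // '1', '2', '3' are ASCII characters, so GBK (the default ANSI code page
        // on Simplified Chinese Windows) encodes each one as a single byte.
        for (byte b : "123".getBytes("GBK")) {
            System.out.printf("%02X ", b); // prints: 31 32 33
        }
        System.out.println();
    }
}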

Notepad's "Unicode" option is something of a misnomer: Unicode is a character set, not an encoding scheme. What Notepad actually writes is UTF-16LE, with a little-endian byte order mark (FF FE) inserted at the beginning of the text. So its bytes are:

FF FE 31 00 32 00 33 00

The "Unicode big endian" option writes UTF-16BE, which inserts the big-endian BOM (FE FF) at the beginning of the text, so its bytes are:

FE FF 00 31 00 32 00 33

Standard UTF-8 does not require a BOM, but Notepad's UTF-8 option writes a UTF-8 BOM (EF BB BF) at the start of the file, so the bytes are:

EF BB BF 31 32 33
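All three BOM variants above can be reproduced from Java with a small sketch. Note that Java's "UTF-16" charset writes the big-endian BOM on its own, while for UTF-16LE and UTF-8 the BOM character U+FEFF has to be prepended by hand; the class name is just for illustration:

package test;

public class BomBytes {
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // Java's "UTF-16" charset emits the big-endian BOM by itself.
        System.out.println(hex("123".getBytes("UTF-16")));         // FE FF 00 31 00 32 00 33
        // Prepending U+FEFF reproduces Notepad's little-endian "Unicode" format.
        System.out.println(hex("\uFEFF123".getBytes("UTF-16LE"))); // FF FE 31 00 32 00 33 00
        // Prepending U+FEFF to UTF-8 yields the EF BB BF prefix Notepad writes.
        System.out.println(hex("\uFEFF123".getBytes("UTF-8")));    // EF BB BF 31 32 33
    }
}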

Here is an example of a problem caused by a BOM header:

package test;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;

public class Test1 {
    public static void main(String[] args) throws IOException {
        String myString = "";
        byte[] bytes = new byte[10];
        int readCount = 0;
        // Write FE FF 00 31 00 32 00 33: "123" in UTF-16BE, preceded by the big-endian BOM.
        try (FileOutputStream outputStream = new FileOutputStream("D:\\test\\hello.txt")) {
            outputStream.write(new byte[] { -2, -1, 0, 0x31, 0, 0x32, 0, 0x33 });
            outputStream.flush();
        } catch (Exception e) {
            e.printStackTrace();
        }
        // Read the bytes back and decode them as UTF-16BE.
        try (FileInputStream reader = new FileInputStream("D:\\test\\hello.txt")) {
            while ((readCount = reader.read(bytes, 0, 10)) != -1) {
                myString += new String(bytes, 0, readCount, "UTF-16BE");
                System.out.println(Arrays.toString(bytes));
                System.out.println(myString);
                System.out.println(Integer.parseInt(myString));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

In this example we write binary data programmatically:

FE FF 00 31 00 32 00 33

and then read it back in UTF-16BE mode. When the decoded string is parsed as a number, an error occurs. The output of the program is as follows:

[-2, -1, 0, 49, 0, 50, 0, 51, 0, 0]
123
java.lang.NumberFormatException: For input string: "123"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.parseInt(Integer.java:615)
    at test.Test1.main(Test1.java:24)

The real cause is the BOM: decoding FE FF as UTF-16BE produces the invisible character U+FEFF at the front of the string. The error is hard to spot because the printed string still looks like "123", no different from a normal string. In cases like this, inspecting the binary (hex) representation of the data is the quickest way to find the problem.
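A simple way to guard against this, sketched below with a hypothetical stripBom helper, is to drop a leading U+FEFF before parsing. Alternatively, decoding the bytes with Java's "UTF-16" charset instead of "UTF-16BE" lets the decoder consume the BOM while detecting the byte order.

package test;

public class BomStrip {
    // Remove a leading U+FEFF (BOM) if the decoded text starts with one.
    static String stripBom(String s) {
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        String withBom = "\uFEFF123"; // what the earlier UTF-16BE read produced
        System.out.println(Integer.parseInt(stripBom(withBom))); // prints 123
    }
}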

Finally, note that the BOM is itself a Unicode character with the value U+FEFF; the different byte sequences (EF BB BF, FE FF, FF FE) are simply that one character encoded in UTF-8, UTF-16BE, and UTF-16LE respectively.

The byte order mark (BOM) is the Unicode character U+FEFF, whose appearance as a magic number at the start of a text stream can signal several things to a program consuming the text.

Code as proof:

package test;

import java.util.Arrays;

public class Main {
    public static void main(String[] args) throws Exception {
        // The same character, U+FEFF, encoded in three different ways.
        byte[] a = new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF }; // UTF-8 BOM
        byte[] b = new byte[] { (byte) 0xFE, (byte) 0xFF };              // UTF-16BE BOM
        byte[] c = new byte[] { (byte) 0xFF, (byte) 0xFE };              // UTF-16LE BOM
        String aString = new String(a, 0, 3, "UTF-8");
        String bString = new String(b, 0, 2, "UTF-16BE");
        String cString = new String(c, 0, 2, "UTF-16LE");
        // All three decode to the same one-character string, so re-encoding
        // them as UTF-8 yields identical byte arrays.
        System.out.println(Arrays.toString(aString.getBytes("UTF-8")));
        System.out.println(Arrays.toString(bString.getBytes("UTF-8")));
        System.out.println(Arrays.toString(cString.getBytes("UTF-8")));
    }
}

Output result:

[-17, -69, -65]
[-17, -69, -65]
[-17, -69, -65]