This article analyzes the String source code and draws on the character-encoding material in Unicode Tutorials – Herong's Tutorial Examples. That English material is very detailed, and I recommend reading it.
Is String really Immutable
Since JDK 9, Java stores the contents of a String as a byte array, encoded either as Latin1 (when every character is at most 0xFF) or as UTF-16:
private final byte[] value;
In general, immutable means that the final byte array is never changed after a String is initialized: every String operation creates a new copy instead of modifying the original array.
But the array elements can in theory still be modified, for example by changing the string constant "abc" to "Abc" via reflection:
import java.lang.reflect.Field;

public static void main(String[] args) {
    setFirstValueToA("abc");
    // The literal "abc" is interned, and new String(String) shares its byte array
    String replaced = new String("abc");
    System.out.println(replaced); // Abc
}

private static void setFirstValueToA(String str) {
    try {
        Field value = String.class.getDeclaredField("value");
        // On JDK 16+ this requires --add-opens java.base/java.lang=ALL-UNNAMED
        value.setAccessible(true);
        byte[] bytes = (byte[]) value.get(str);
        bytes[0] = 0x41; // 'A'
    } catch (NoSuchFieldException | IllegalAccessException e) {
        e.printStackTrace();
    }
}
How is a string saved as a byte array
Test several strings with the following code:
public static void main(String[] args) {
printString("abc");
printString("Chinese");
printString("ABC" in Chinese);
printString("ABC 😊");
}
private static void printString(String str) {
System.out.println("= = = = = = >" + str);
// return the UTF-16 char[] size
System.out.println("length: " + str.length());
// Use default Encoding (UTF-8)
System.out.println("getBytes: " + str.getBytes().length);
// Convert UTF-16 char[] to char
System.out.println("codePointCount: " + str.codePointCount(0, str.length()));
// Get the UTF-16 char[]
System.out.println("toCharArray: " + str.toCharArray().length);
// The UTF-16 char[] to bytes
System.out.println("internal value: " + getStringInternalValueLength(str));
}
The results are as follows:
“abc” | “Chinese” | “ABC” in Chinese | “ABC 😊” | |
---|---|---|---|---|
str.length | 3 | 2 | 5 | 5 |
str.getBytes().length | 3 | 6 | 9 | 7 |
str.codePointCount | 3 | 2 | 5 | 4 |
str.toCharArray().length | 3 | 2 | 5 | 5 |
str.value.length | 3 | 4 | 10 | 10 |
internal value
The value field of a String is encoded as follows:
- When all characters are at most 0xFF, the code points are stored in Latin1, one byte per character, as in "abc".
- Otherwise the whole string is stored in UTF-16, where each character takes two or four bytes.
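The selection rule above can be checked without reflection. The following is a minimal sketch, assuming compact strings are enabled (the JDK 9+ default). `expectedValueLength` is a hypothetical helper, not a JDK method, and it only predicts the size of the internal array rather than reading it. Non-ASCII characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
final class CompactStringSketch {
    // Hypothetical helper: predicts value.length under the Latin1/UTF-16 rule.
    static int expectedValueLength(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                // One char does not fit in a byte, so every char takes 2 bytes.
                return s.length() * 2;
            }
        }
        // Every char fits in a single byte (Latin1).
        return s.length();
    }

    public static void main(String[] args) {
        System.out.println(expectedValueLength("abc"));             // 3
        System.out.println(expectedValueLength("\u4E2D\u6587"));    // 4
        System.out.println(expectedValueLength("abc\u4E2D\u6587")); // 10
        System.out.println(expectedValueLength("abc\uD83D\uDE0A")); // 10
    }
}
```

The predicted values match the str.value.length row of the table.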
Unicode is a Coded Character Set that maps almost all human characters to code point symbols, usually in the format U+ XXXX, where XXXX is a hexadecimal integer and ranges from U+0000 to U+10FFFF. The code point symbol is a normalized symbol for text, but it must be saved as an array of bytes. These are Character Encoding, such as UTF-8, and UTF-16 used internally in Java Strings.
Utf-16 is an encoding that expresses Unicode code points as character arrays. For U+0000 to U+FFFF, utF-16 is an encoding that stores Unicode code points as 2-byte arrays. U+10000 ~ U+10FFFF are first converted into a pair of Code Points (Surrogate Pair) within the range of U+D800 ~ U+DFFF, and saved in accordance with the preceding rules. This range was chosen because the Unicode range has not yet been assigned a valid character and is therefore distinguishable from the previous rule.
The Unicode code point for “Chinese” is U+4E2d and U+6587, which are larger than 0xFF, so the byte length is 4. There are characters in ABC Chinese “that do not meet the condition, so all characters are saved in UTF-16. They are 2 bytes, so the length is 10.
The Unicode code point of 😊 is U+1F60A. According to utF-16 specifications, U+10000 ~ U+10FFFF need to be converted into surrogate pair and then saved into Byte, and U+D83D and U+DE0A. So “ABC 😊” has a length of 10 bytes.
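The surrogate-pair conversion can be worked through by hand. This is a minimal sketch of the standard UTF-16 arithmetic; `toSurrogates` is a hypothetical helper written for illustration, and `Character.toChars` is the JDK method it is compared against.

```java
import java.util.Arrays;

final class SurrogateSketch {
    // Splits a supplementary code point (U+10000..U+10FFFF) into its surrogate pair.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;              // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));  // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF)); // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F60A);
        System.out.printf("U+%04X U+%04X%n", (int) pair[0], (int) pair[1]); // U+D83D U+DE0A
        // The JDK's own conversion gives the same pair.
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F60A))); // true
    }
}
```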
toCharArray()
The size of a Char in Java is 2 bytes, which is just enough to represent a Unicode symbol from U+0000 to U+FFFF.
When Latin1 is encoded, the char array is the padding of the byte array, and the high byte is 0. Utf-16 encoding is equivalent to the Unicode encoding array after conversion of surrogate pair, in which characters within 0xD800 to 0xDFFF are surrogate characters.
“ABC” is Latin1 encoded, so the char array is equal to the bytes array; The “ABC Chinese” encoding is UTF-16, so the char array is half the size of the bytes array.
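To see what toCharArray returns for a string with a supplementary character, the code units can be printed one by one. This is a small sketch; `describe` is a hypothetical helper, and the emoji is written as Unicode escapes to avoid depending on the source file encoding.

```java
final class CharArraySketch {
    // Formats each UTF-16 code unit as 4 hex digits, marking surrogates with '*'.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) c));
            if (Character.isSurrogate(c)) {
                sb.append('*');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("abc"));             // 0061 0062 0063
        System.out.println(describe("abc\uD83D\uDE0A")); // 0061 0062 0063 D83D* DE0A*
    }
}
```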
codePointCount()
The array returned by toCharArray contains surrogate pairs, so its length may be greater than the number of characters. codePointCount removes that effect and returns the number of code points, counting each surrogate pair once.
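The counting rule can be written out as a loop. This is a minimal sketch using only `String.codePointAt` and `Character.charCount`; `countCodePoints` is a hypothetical helper that should agree with the built-in `codePointCount`.

```java
final class CodePointSketch {
    // Counts code points by stepping over one or two chars at a time.
    static int countCodePoints(String s) {
        int count = 0;
        int i = 0;
        while (i < s.length()) {
            int codePoint = s.codePointAt(i);
            i += Character.charCount(codePoint); // 2 for a surrogate pair, otherwise 1
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "abc\uD83D\uDE0A"; // "abc" followed by U+1F60A
        System.out.println(s.length());                      // 5
        System.out.println(countCodePoints(s));              // 4
        System.out.println(s.codePointCount(0, s.length())); // 4
    }
}
```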
str.length()
This is the length of the toCharArray array. Because a surrogate pair counts as two chars, it can be greater than the number of characters.
str.getBytes().length
A String's internal bytes are Latin1 or UTF-16. getBytes re-encodes them into a target charset; if none is specified, the platform default charset is used (UTF-8 in this test, and the default since JDK 18), so the internal bytes are converted to UTF-8. Each code point takes 1 to 4 bytes in UTF-8.
// Assumes: import static java.nio.charset.StandardCharsets.UTF_8;
System.out.println("abc".getBytes(UTF_8).length); // 3
System.out.println("中".getBytes(UTF_8).length); // 3
System.out.println("文".getBytes(UTF_8).length); // 3
System.out.println("😊".getBytes(UTF_8).length); // 4
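Because the result of getBytes depends on the charset, passing one explicitly avoids relying on the platform default. The sketch below compares a few standard charsets for the same string; `byteLength` is a hypothetical helper, and the string is written with Unicode escapes so it does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class GetBytesSketch {
    // Length of the string after encoding with the given charset.
    static int byteLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String s = "abc\uD83D\uDE0A"; // "abc" followed by U+1F60A
        System.out.println(byteLength(s, StandardCharsets.UTF_8));    // 7  (1 + 1 + 1 + 4)
        System.out.println(byteLength(s, StandardCharsets.UTF_16BE)); // 10 (5 code units * 2)
        System.out.println(byteLength(s, StandardCharsets.UTF_16));   // 12 (adds a 2-byte byte order mark)
    }
}
```

The UTF_16BE result equals the internal value length from the table, since the internal UTF-16 form carries no byte order mark.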