This article was first published on wechat public account: Programmer Georgi

```java
public class testT {
    public static void main(String[] args) {
        String A = "hi, you are Jorghry ";
        System.out.println(A.length());
    }
}
```

The above program prints 20: the string contains 20 characters, counting the trailing space.

Place the cursor on String.length() and press Ctrl+B (the IntelliJ IDEA shortcut on Windows) to jump to the definition of String.length():

```java
/**
 * Returns the length of this string.
 * The length is equal to the number of <a href="Character.html#unicode">Unicode
 * code units</a> in the string.
 *
 * @return  the length of the sequence of characters represented by this
 *          object.
 */
public int length() {
    return value.length;
}
```

In other words, length() returns the number of Unicode code units in the string, not necessarily the number of characters.

Xiaomeng: So what does that mean, Georgi? Georgi: I covered this in an article the other day, "When the interviewer asks you about character encoding, just throw this at him!" Here is how Java handles character encoding:

In Java, there is a distinction between internal code and external code:

  • Internal code: the encoding used for char and String values in memory.
  • External code: any encoding other than the internal one (including the encoding of class files).

Java's internal code is Unicode, encoded as UTF-16. So length() returns the number of UTF-16 code units in the string.
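The internal/external split can be seen directly: the same two-character string always has length 2 in memory, while its byte length depends on which external encoding you ask for. A minimal sketch (the class name is made up for illustration):

```java
import java.nio.charset.StandardCharsets;

public class InternalVsExternal {
    public static void main(String[] args) {
        String s = "你好"; // two Chinese characters, both in the BMP

        // Internal code: each BMP character is one UTF-16 code unit in memory.
        System.out.println(s.length()); // 2

        // External code: byte length depends on the chosen charset.
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 6 (3 bytes each)
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 4 (2 bytes each)
    }
}
```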

A code unit is the smallest building block of a UTF encoding, so an encoded character always consists of an integer number of code units. The number X in UTF-X is the number of bits in one code unit.

UTF-16 therefore has code units of 16 bits, i.e. two bytes each. A UTF-16 character consists of either one or two code units, corresponding to two or four bytes. Java's String operates on UTF-16 one code unit at a time.
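The standard library exposes this one-or-two-unit rule directly: Character.charCount(int) reports how many UTF-16 code units a given code point needs. A quick sketch:

```java
public class CodeUnitCount {
    public static void main(String[] args) {
        // A BMP character (code point <= U+FFFF) needs one UTF-16 code unit.
        System.out.println(Character.charCount(0x4F60));  // 1  (你)

        // A supplementary character (> U+FFFF) needs two code units.
        System.out.println(Character.charCount(0x1D11E)); // 2  (the musical note)
    }
}
```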

Do you remember the Unicode knowledge from my earlier article, "The interviewer asked me about Unicode, I talked for 3 seconds, and he said you really know your stuff"? It mentioned that UTF-16 encodes characters in the range U+0000 to U+FFFF with two bytes, and characters whose code point is above U+FFFF with four bytes. The former is one code unit; the latter is two code units!

The Unicode value of the character in the example below is U+1D11E. That is well above U+FFFF, so UTF-16 encodes this character with four bytes, which is two code units!

That is why length() can report 2 for a String that holds only one visible character!

Let’s see an example!

```java
public class testStringLength {
    public static void main(String[] args) {
        String B = "𝄞"; // the musical note character (it may not display on every page)
        String C = "\uD834\uDD1E"; // the same character written as its UTF-16 surrogate pair
        System.out.println(C);
        System.out.println(B.length());
        System.out.println(B.codePointCount(0, B.length()));
    }
}
```

You can see that codePointCount() reports the musical note as a single character!
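Putting the two counts side by side makes the difference concrete. In this sketch the string mixes BMP characters with the note character, so length() and codePointCount() disagree:

```java
public class CodePointIteration {
    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb"; // 'a', the note 𝄞 (a surrogate pair), 'b'

        System.out.println(s.length());                      // 4 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 code points

        // Iterate by code point, not by char, to keep surrogate pairs intact.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```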

A few questions:

  0. What does codePointCount mean?
  1. Why is the UTF-16 form "\uD834\uDD1E" rather than "U+1D11E"?
  2. What are the code units of UTF-32 and UTF-8?

Let's answer them one at a time:

Question 0:

codePointCount() counts code points: each character corresponds to one code point, no matter how many code units it occupies.

For example, if you write String str = "\u1D11E", you do not get the note character. Java's \u escape consumes exactly four hex digits, so the machine sees the code point \u1D11 followed by the letter 'E'. That gives 2 code points and 2 code units.

But if you write str = "\uD834\uDD1E", the machine recognizes a surrogate pair: two code units, but one code point (the note character). So length() returns 2 code units, while codePointCount() returns 1 code point.
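Java can tell you directly that the two code units form a surrogate pair and recombine them into the single code point. A small sketch:

```java
public class SurrogatePairDemo {
    public static void main(String[] args) {
        String s = "\uD834\uDD1E"; // the note character as a surrogate pair
        char high = s.charAt(0);
        char low  = s.charAt(1);

        System.out.println(Character.isHighSurrogate(high)); // true
        System.out.println(Character.isLowSurrogate(low));   // true

        // The two code units combine back into the single code point U+1D11E.
        int cp = Character.toCodePoint(high, low);
        System.out.println(Integer.toHexString(cp));         // 1d11e
    }
}
```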

Question 1:

Here is the conversion rule:

  • First, subtract: U+1D11E - U+10000 = 0xD11E
  • Then write 0xD11E as 20 bits of binary: 0000 1101 0001 0001 1110. The first 10 bits are 0000 1101 00 and the last 10 bits are 01 0001 1110
  • Then take the template: 110110yyyyyyyyyy 110111xxxxxxxxxx
  • Fill the binary of 0xD11E into the template from left to right: 110110 0000 1101 00 110111 01 0001 1110
  • The resulting binary converted to hexadecimal is D834 DD1E, which is the UTF-16 encoding you see

Question 2:

  • By the same logic, UTF-32 has 32-bit code units, so every character is exactly one code unit, i.e. four bytes.
  • UTF-8 has 8-bit code units; a character takes one, two, three, or four code units, corresponding to one, two, three, or four bytes.
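The one-to-four-unit range of UTF-8 is easy to observe by encoding characters from different code point ranges. A minimal sketch:

```java
import java.nio.charset.StandardCharsets;

public class Utf8ByteLengths {
    public static void main(String[] args) {
        // UTF-8 code units are 8 bits; a character takes 1 to 4 of them.
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);  // 1 (ASCII)
        System.out.println("é".getBytes(StandardCharsets.UTF_8).length);  // 2 (U+00E9)
        System.out.println("你".getBytes(StandardCharsets.UTF_8).length); // 3 (U+4F60)
        System.out.println("\uD834\uDD1E".getBytes(StandardCharsets.UTF_8).length); // 4 (U+1D11E)
    }
}
```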


