Thoroughly understand UTF-8, Unicode, wide characters, locale

Conclusion

Wide character type wchar_t

locale

Why wide character types

Multibyte strings and wide strings are converted to each other

I have been using the wchar_t type recently, so I decided to explore it in detail. I did not expect the topic to run this deep. Most of the material on the Internet is copied and pasted, stating a conclusion with no verification. This article records both the process and the conclusions of my investigation; if anything is wrong, please correct me.

Unicode, UCS

The Universal Character Set (UCS) is essentially a character set.

Unicode was developed in conjunction with ISO/IEC 10646, the Universal Character Set (UCS). Unicode assigns characters the same way as ISO/IEC 10646, but the Unicode Standard contains more detailed implementation information and covers finer-grained topics such as bitwise encoding, collation, and rendering. (Source: Unicode)

Unicode and UCS are both character sets.

A UCS code is 31 bits long, so it fits in 4 bytes and can represent 2^31 characters. If two characters share the same upper bits and differ only in the lower 16 bits, they belong to the same plane, so a plane consists of 2^16 characters. Most characters in current use are located in the first plane, called the BMP (Basic Multilingual Plane). A BMP code is usually written in the form U+XXXX, where each X is a hexadecimal digit. Note that the Unicode standard today limits code points to the range U+0000 to U+10FFFF, which is 17 planes.

For example, the UCS code of the Chinese character "你" (you) is U+4F60, and the UCS code of "好" (good) is U+597D. More Chinese characters can be looked up in the Unicode code charts.

With UCS, every character is assigned a number, called a code point, that fits in at most four bytes.

UTF-8

Given the UCS character set, does every character really need to be stored as four bytes (UTF-32) on a computer?

The answer is no. On the one hand, storing every character in four bytes wastes space: most characters are in the BMP, so only the lower 16 bits carry information and the upper 16 bits are all 0. On the other hand, it is incompatible with C, where a 0 byte marks the end of a string and functions such as strlen() rely on this. Text stored as UTF-32 contains many 0 bytes that do not mark the end of a string.

Ken Thompson, together with Rob Pike, designed the UTF-8 encoding, which solves both problems well. The mapping between Unicode code points and UTF-8 is shown below:

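Start value  End value    Bytes  Byte1     Byte2     Byte3     Byte4     Byte5     Byte6
U+0000       U+007F       1      0xxxxxxx
U+0080       U+07FF       2      110xxxxx  10xxxxxx
U+0800       U+FFFF       3      1110xxxx  10xxxxxx  10xxxxxx
U+10000      U+1FFFFF     4      11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
U+200000     U+3FFFFFF    5      111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
U+4000000    U+7FFFFFFF   6      1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx

Note: RFC 3629 restricts UTF-8 to the range U+0000 to U+10FFFF, so the 5-byte and 6-byte forms are no longer valid in modern UTF-8. They are listed here because they were part of the original design.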

The highest bit of the first byte is either 0 (ASCII) or 1. When it is 1, the number of 1s that follow the highest bit tells how many subsequent bytes also belong to the current character. For example, in 111110xx there are four 1s after the highest bit, so the next four bytes belong to the current character. Every subsequent byte starts with the two bits 10, which distinguishes it from a first byte. The x bits carry the UCS code. So UTF-8 is like a train: the first byte is the locomotive, which says how many of the following bytes belong to the current train, and the following bytes are the carriages, which carry the UCS code.

Take the Chinese character "你" (you) as an example. Its Unicode code point is U+4F60, which is 0100 1111 0110 0000 in binary. Encoding it according to the three-byte row of the table gives 11100100 10111101 10100000 (0xE4 0xBD 0xA0).

Conclusion

Unicode is essentially a character set in which every character is assigned a code point that fits in four bytes.

UTF-8 is an encoding rule that maps the code point of any character in the Unicode character set to a byte sequence. UTF-8 is only one such encoding rule; others include UTF-16 and UTF-32.

Wide character type wchar_t

Before introducing wide characters, let us take a look at locale, because the conversion between multi-byte strings and wide strings depends on the locale.

locale

What is a locale

A locale, also known as a "localization policy set" or "local environment", is a set of software settings that describes the region of the program's user. Run the locale command on Linux to view the current locale settings:

ubuntu@VM-0-16-ubuntu:~$ locale
LANG=zh_CN.UTF-8
LANGUAGE=
LC_CTYPE="zh_CN.UTF-8"
LC_NUMERIC="zh_CN.UTF-8"
LC_TIME="zh_CN.UTF-8"
LC_COLLATE="zh_CN.UTF-8"
LC_MONETARY="zh_CN.UTF-8"
LC_MESSAGES="zh_CN.UTF-8"
LC_PAPER="zh_CN.UTF-8"
LC_NAME="zh_CN.UTF-8"
LC_ADDRESS="zh_CN.UTF-8"
LC_TELEPHONE="zh_CN.UTF-8"
LC_MEASUREMENT="zh_CN.UTF-8"
LC_IDENTIFICATION="zh_CN.UTF-8"
LC_ALL=

You can think of a locale as a set of environment variables. The value of a locale environment variable has the format language_area.charset. Language is the language, such as English or Chinese; area is the region where the language is spoken, such as the United States or mainland China; charset is the character encoding, such as UTF-8 or GBK.

These environment variables affect date formatting, number formatting, currency formatting, character processing, and more.

References:

locale wiki

Environment Variables

How do I set the default locale of the system

Edit the /etc/default/locale configuration file (this is the location used on Ubuntu, as in the example above). For example, to set the locale to zh_CN.UTF-8, add the following line: LANG=zh_CN.UTF-8

Locale environment variables

Take LC_TIME as an example. It affects functions such as strftime(), whose prototype is:

size_t strftime(char *str, size_t maxsize, const char *format, const struct tm *timeptr);

strftime() formats the time held in the timeptr structure according to the rules given in format and stores the result in str.

#include <locale.h>
#include <stdio.h>
#include <time.h>

int main()
{
    time_t currtime;
    struct tm *timer;
    char buffer[80];

    time(&currtime);
    timer = localtime(&currtime);

    /* Format the same time under three different LC_TIME settings. */
    printf("Locale is: %s\n", setlocale(LC_TIME, "en_US.iso88591"));
    strftime(buffer, 80, "%c", timer);
    printf("Date is: %s\n", buffer);

    printf("Locale is: %s\n", setlocale(LC_TIME, "zh_CN.UTF-8"));
    strftime(buffer, 80, "%c", timer);
    printf("Date is: %s\n", buffer);

    /* "" selects the locale configured in the environment. */
    printf("Locale is: %s\n", setlocale(LC_TIME, ""));
    strftime(buffer, 80, "%c", timer);
    printf("Date is: %s\n", buffer);

    return (0);
}

Compiling and running the program produces the following:

Locale is: en_US.iso88591
Date is: Sun 07 Jul 2019 04:08:39 PM CST
Locale is: zh_CN.UTF-8
Date is: 16:08:39 on Sunday, 07 July 2019 (printed in Chinese in the original output)
Locale is: zh_CN.UTF-8
Date is: 16:08:39 on Sunday, 07 July 2019 (printed in Chinese in the original output)

You can see that calling strftime() yields different results with different values for LC_TIME.

The function char *setlocale(int category, const char *locale); sets the locale of the current program.

category: specifies which group of settings is affected. LC_CTYPE affects character classification and character conversion, LC_TIME affects date and time formats, and LC_ALL affects everything.

locale: specifies the locale name. The preceding example uses "en_US.iso88591", "zh_CN.UTF-8", and the empty string ""; the empty string means that the locale is taken from the environment, that is, the current default locale of the operating system.

References:

setlocale()

Why wide character types

The Unicode code points of "你好" (hello) are U+4F60 and U+597D, and their UTF-8 encodings are 0xE4 0xBD 0xA0 and 0xE5 0xA5 0xBD respectively.

Multi-byte strings are stored in UTF-8 encoding in compiled executables

#include <stdio.h>
#include <string.h>

int main(void)
{
    char s[] = "你好";
    size_t len = strlen(s);
    printf("len = %d\n", (int)len);
    printf("%s\n", s);
    return 0;
}

The output is as follows:

len = 6

你好

Running od on the compiled executable shows that "你好" is stored in UTF-8, namely the six bytes 0xE4 0xBD 0xA0 and 0xE5 0xA5 0xBD. (This assumes the source file itself is saved as UTF-8, so the compiler copies the bytes through unchanged.)

The strlen() function does not care what the string contains, so len is 6, which is the number of bytes in the UTF-8 encoding of "你好".

printf("%s\n", s); is equivalent to writing the six bytes 0xE4 0xBD 0xA0 0xE5 0xA5 0xBD to the device file of the current terminal. If the terminal driver understands UTF-8, the Chinese characters are displayed; if it does not, they are not.

Wide strings are stored in Unicode in compiled executables

#include <wchar.h>
#include <stdio.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_ALL, "zh_CN.UTF-8"); // set locale
    wchar_t s[] = L"你好";
    size_t len = wcslen(s);
    printf("len = %d\n", (int)len);
    printf("%ls\n", s);
    return 0;
}

The output is as follows:

len = 2

你好

Run the od command on the compiled executable to find the following bytes:

0003020 001  \0 002  \0   `   O  \0  \0   }   Y  \0  \0  \n  \0  \0  \0
        00020001 00004f60 0000597d 0000000a

00004f60 is the Unicode code point of "你" and 0000597d is the Unicode code point of "好", so wide strings are stored in the executable as Unicode code points.

wchar_t is the wide character type. Prefixing a character constant or string literal with L makes it a wide character constant or wide string. On Linux with glibc, wchar_t is 4 bytes wide, so each character occupies exactly one wchar_t. That is why len is 2.

wcslen(), unlike strlen(), does not stop at the first zero byte; it stops at the first wide character whose value is zero.

Wide characters are held in memory as Unicode code points, but to write them to a terminal they still have to be output in a multi-byte encoding that the terminal driver can recognize. So printf() internally converts the wide string to a multi-byte string and then writes it out. This conversion is affected by the locale: setlocale(LC_ALL, "zh_CN.UTF-8"); sets LC_ALL of the current process to zh_CN.UTF-8, so printf() converts the code points to multi-byte UTF-8 and writes that to the terminal device. If setlocale(LC_ALL, "zh_CN.UTF-8"); is changed to setlocale(LC_ALL, "en_US.iso88591");, "你好" will not be printed, because these characters cannot be represented in that encoding.

Generally, programs use wide characters for in-memory computation and a multi-byte encoding for disk storage or network transmission.

Multibyte strings and wide strings are converted to each other

The C standard library provides functions for converting between multi-byte strings and wide strings.

#include <stdlib.h>
size_t mbstowcs(wchar_t *dest, const char *src, size_t n);
size_t wcstombs(char *dest, const wchar_t *src, size_t n);

mbstowcs() converts a multi-byte string to a wide string.

wcstombs() converts a wide string to a multi-byte string.

Consider the following example:

#include <locale.h>
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <wchar.h>
#include <string.h>

/* Convert a multi-byte string to a newly allocated wide string. */
wchar_t *str2wstr(const char *s)
{
    const size_t buffer_size = strlen(s) + 1;
    wchar_t *dst_wstr = (wchar_t *)malloc(buffer_size * sizeof(wchar_t));
    wmemset(dst_wstr, 0, buffer_size);
    mbstowcs(dst_wstr, s, buffer_size);
    return dst_wstr;
}

/* Print len bytes starting at s in hexadecimal. */
void printBytes(const unsigned char *s, int len)
{
    for (int i = 0; i < len; i++) {
        printf("0x%02x ", *(s + i));
    }
    printf("\n");
}

int main()
{
    char s[10] = "你好";       // 0xe4 0xbd 0xa0 0xe5 0xa5 0xbd 0x00
    wchar_t ws[10] = L"你好";  // 0x60 0x4f 0x00 0x00 0x7d 0x59 0x00 0x00 0x00 0x00 0x00 0x00
    printf("Locale is: %s\n", setlocale(LC_ALL, "zh_CN.UTF-8")); // Locale is: zh_CN.UTF-8
    printBytes((const unsigned char *)s, 7);    // 0xe4 0xbd 0xa0 0xe5 0xa5 0xbd 0x00
    printBytes((const unsigned char *)ws, 12);  // 0x60 0x4f 0x00 0x00 0x7d 0x59 0x00 0x00 0x00 0x00 0x00 0x00
    wchar_t *converted = str2wstr(s);
    printBytes((const unsigned char *)converted, 12); // 0x60 0x4f 0x00 0x00 0x7d 0x59 0x00 0x00 0x00 0x00 0x00 0x00
    free(converted);
    return (0);
}

After compiling, the execution result is as follows:

Locale is: zh_CN.UTF-8
0xe4 0xbd 0xa0 0xe5 0xa5 0xbd 0x00
0x60 0x4f 0x00 0x00 0x7d 0x59 0x00 0x00 0x00 0x00 0x00 0x00
0x60 0x4f 0x00 0x00 0x7d 0x59 0x00 0x00 0x00 0x00 0x00 0x00

The second line of output confirms that the multi-byte string is stored in memory as UTF-8: "0xe4 0xbd 0xa0 0xe5 0xa5 0xbd" is the UTF-8 encoding of "你好".

The third line of output confirms that the wide string is stored in memory as Unicode code points: "0x60 0x4f 0x00 0x00 0x7d 0x59 0x00 0x00 0x00 0x00 0x00 0x00" is U+4F60 and U+597D followed by the terminating zero character, each stored as a 4-byte little-endian value. This is the in-memory form of the wide string L"你好".

setlocale(LC_ALL, "zh_CN.UTF-8") sets the locale, so the program decodes multi-byte strings as UTF-8. After the call to mbstowcs(), the fourth line shows that the UTF-8 encoding of "你好", "0xe4 0xbd 0xa0 0xe5 0xa5 0xbd 0x00", has indeed been converted to the corresponding code points, "0x60 0x4f 0x00 0x00 0x7d 0x59 0x00 0x00 0x00 0x00 0x00 0x00".

If you change setlocale(LC_ALL, "zh_CN.UTF-8") to setlocale(LC_ALL, "en_US.iso88591"), the last line of output will be different, because the bytes are then no longer decoded as UTF-8 sequences.
