Copyright notice: this original article may be reproduced, provided the reproduction links back to the original source. Otherwise, legal liability may be pursued. http://www.cnblogs.com/jiangxueqiao/p/7446229.html


Links to the "Clearing the fog of character encoding" series:

  1. Clearing the fog of character encoding – Character encoding overview
  2. Clearing the fog of character encoding – How the compiler handles file encodings
  3. Clearing the fog of character encoding – Character encoding conversion
  4. Clearing the fog of character encoding – MySQL database character encoding

Why does parsing the JSON {"data":"breeze"} fail? Why does Korean text show up garbled in the interface? What is the difference between ASCII and ANSI? Many developers have stumbled over character encoding. This article briefly explains the character encoding knowledge needed in day-to-day development; I hope it helps.

1. ASCII and its extension

1.1 What is an ASCII character set

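A character set is a collection of characters, each of which is assigned a number. The ASCII character set is such a character mapping, standardized by the body now known as the American National Standards Institute (ANSI); it was first published in 1963, and the 1968 edition is the one commonly cited.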

ASCII uses 7 bits to represent a character, so it can represent 128 characters in total (2^7; binary 000 0000 to 111 1111, decimal 0 to 127).

Each number in the ASCII character set corresponds to exactly one character; for example, 65 is the letter A and 48 is the digit 0.

Because this correspondence is so simple, no special encoding rules are needed. Strictly speaking, ASCII is therefore a character set rather than a character encoding, but by habit we use the names ASCII, ASCII character set, and ASCII encoding interchangeably.
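The mapping described above can be explored directly. The following is a minimal sketch in Python (the language choice and the variable names are assumptions made for illustration, not part of the original article); it prints a few characters with their ASCII numbers and checks that every ASCII value leaves the highest bit of a byte at 0.

```python
# Every ASCII character is simply a number in the range 0..127.
for ch in ["A", "a", "0", " ", "~"]:
    code = ord(ch)
    print(f"{ch!r}: decimal {code:3d}, binary {code:07b}")

# chr() is the inverse mapping: number -> character.
letter = chr(65)

# All 128 ASCII values fit in 7 bits, so the top bit of each byte stays 0.
ascii_bytes = bytes(range(128))
all_high_bits_clear = all((b & 0x80) == 0 for b in ascii_bytes)

# Decoding needs no rules beyond "one byte, one character".
decoded = ascii_bytes.decode("ascii")
```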

1.2 Extensions of ASCII

1.2.1 Highest-bit extension – ISO/IEC 8859

The ASCII character set was designed by Americans and tailored to English. As computers spread to European countries such as France and Germany, Europeans found that ASCII could not express everything they needed, because many European languages use letters beyond the 128 characters in the ASCII table. ASCII uses only the lowest seven bits of a byte, so the highest bit, which is always 0 in ASCII, was free. European countries put that bit to use, which added another 128 code values (128 to 255) for their own characters. Because needs differed from country to country, each designed its own scheme.

To end this confusion, the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) jointly developed a series of standards for 8-bit character sets, collectively known as ISO 8859 (in full, ISO/IEC 8859). Note that this is the collective name for a family of character sets, not a single one. For example, ISO/IEC 8859-1 (Latin-1) covers Western European languages, and ISO/IEC 8859-4 (Latin-4) covers Nordic languages.

The full list is as follows (from Baidu Baike):

  ISO/IEC 8859-1 (Latin-1) – Western European languages
  ISO/IEC 8859-2 (Latin-2) – Central European languages
  ISO/IEC 8859-3 (Latin-3) – Southern European languages; Esperanto can also be written with this character set
  ISO/IEC 8859-4 (Latin-4) – Nordic languages
  ISO/IEC 8859-5 (Cyrillic) – Slavic languages
  ISO/IEC 8859-6 (Arabic) – Arabic
  ISO/IEC 8859-7 (Greek) – Greek
  ISO/IEC 8859-8 (Hebrew) – Hebrew (visual order)
  ISO 8859-8-I – Hebrew (logical order)
  ISO/IEC 8859-9 (Latin-5 or Turkish) – Latin-1 with the Icelandic letters replaced by Turkish letters
  ISO/IEC 8859-10 (Latin-6 or Nordic) – Nordic languages; intended to replace Latin-4
  ISO/IEC 8859-11 (Thai) – Thai; derived from the Thai standard TIS-620
  ISO/IEC 8859-13 (Latin-7 or Baltic Rim) – Baltic languages
  ISO/IEC 8859-14 (Latin-8 or Celtic) – Celtic languages
  ISO/IEC 8859-15 (Latin-9) – Western European languages; adds the Finnish and French letters missing from Latin-1, as well as the euro sign
  ISO/IEC 8859-16 (Latin-10) – South-Eastern European languages; used mainly for Romanian, and includes the euro sign

Latin-1, Latin-2, Latin-5, Latin-7, and the like are the language-specific extended ASCII character sets mentioned above.
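To see why these 8-bit character sets are incompatible with each other, here is a small Python sketch (the sample byte 0xE9 and the variable names are chosen purely for illustration). It decodes one and the same byte, whose highest bit is 1, with three different parts of ISO/IEC 8859, and then does the same with an ASCII byte.

```python
raw = bytes([0xE9])  # a single byte with the highest bit set

# The same byte means a different character in each part of ISO/IEC 8859.
as_latin1 = raw.decode("iso-8859-1")    # Western European
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic
as_greek = raw.decode("iso-8859-7")     # Greek

for name, ch in [("Latin-1", as_latin1), ("Cyrillic", as_cyrillic), ("Greek", as_greek)]:
    print(f"0xE9 in {name}: {ch} (U+{ord(ch):04X})")

# Bytes below 0x80 are plain ASCII in every part, which is the shared core.
ascii_views = {enc: b"A".decode(enc) for enc in ("iso-8859-1", "iso-8859-5", "iso-8859-7")}
```

Only the upper half (0x80 to 0xFF) differs between the parts, which matches the description of the highest bit being the extension point.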

1.2.2 Multi-byte extension – GB series

As described above, European countries extended the ASCII character set by using the previously unused highest bit. What they did not anticipate (and had no reason to) was that across the ocean there is a nation with five thousand years of history and many thousands of Chinese characters, far more than one byte can express. So when computers were introduced to China, the State Bureau of Technical Supervision designed the GB series of encoding schemes (GB stands for Guo Biao, "national standard"). A GB encoding uses two bytes to express a Chinese character. To stay compatible with ASCII, GB2312 requires the highest bit of both bytes to be 1, which avoids any clash with ASCII characters, whose highest bit is 0. (GBK later relaxed this rule for the second byte, which may also fall below 0x80; the first byte still always has its highest bit set.)

The GB series has gone through several stages of development:

Code name | Release time | Bytes per character | Characters covered
GB2312 | 1980 | Variable (ASCII 1 byte, Chinese characters 2 bytes) | 6763 Chinese characters
GB13000 | First published in 1993 | Variable (ASCII 1 byte, Chinese characters 2 bytes) | 20902 Chinese characters
GBK | 1995 (shipped with Windows 95) | Variable (ASCII 1 byte, Chinese characters 2 bytes) | 21886 Chinese characters and graphic symbols (including all characters in GB2312 and BIG5)
GB18030 | First published in 2000 | Variable (ASCII 1 byte, Chinese characters 2 or 4 bytes) | 27484 Chinese characters

Each iteration supports more characters and keeps the encodings of the characters supported by earlier versions, so every version is backward compatible with its predecessors.
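The compatibility between the GB encodings can be checked with Python's built-in codecs. This is only a sketch; the sample character and the variable names are illustrative assumptions. It encodes the same Chinese character with GB2312, GBK, and GB18030 and compares the resulting bytes, then confirms that ASCII text is stored unchanged.

```python
text = "\u4e2d"  # the Chinese character 中

# A character from GB2312 keeps the same two bytes in the later encodings.
encoded = {name: text.encode(name) for name in ("gb2312", "gbk", "gb18030")}
for name, data in encoded.items():
    print(f"{name:8s} {data.hex(' ').upper()}")

# In GB2312 both bytes have the highest bit set, so they cannot be
# mistaken for ASCII characters.
high_bits_set = all((b & 0x80) != 0 for b in encoded["gb2312"])

# ASCII characters are still stored as a single, unchanged byte.
ascii_part = "A".encode("gb2312")
```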

1.2.3 Full-width and half-width

On screen a Chinese character is twice as wide as an English character, so mixing the two does not look tidy in typesetting. For this reason the GB encodings include not only Chinese characters but also their own copies of the digits, punctuation marks, and letters found in ASCII. These GB-encoded copies are displayed twice as wide as their ASCII counterparts, so the former are called full-width characters and the latter half-width characters.
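A short Python sketch of the difference follows, using the letter A as an arbitrary example. It relies on the standard `unicodedata` module, which is an addition for illustration and not something the original article uses. The half-width letter takes one byte in GB2312, the full-width copy takes two, and Unicode records their display widths as different properties.

```python
import unicodedata

half = "A"        # U+0041, the ordinary ASCII letter
full = "\uff21"   # U+FF21, FULLWIDTH LATIN CAPITAL LETTER A

# The half-width letter is one ASCII byte; the full-width copy is a
# two-byte GB character.
half_bytes = half.encode("gb2312")
full_bytes = full.encode("gb2312")

# Unicode's East Asian Width property: "Na" = narrow, "F" = fullwidth.
widths = (unicodedata.east_asian_width(half), unicodedata.east_asian_width(full))

# NFKC normalization folds the full-width form back to the ASCII letter,
# which is handy when comparing user input.
folded = unicodedata.normalize("NFKC", full)
```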

2. ANSI

2.1 ANSI and code pages

The ASCII extension schemes described above (ISO/IEC 8859 in Europe, the GB series in China, and so on) share one trait: each is compatible with ASCII, but they are incompatible with one another. Microsoft refers to these encoding schemes collectively as ANSI encodings. ANSI therefore does not name one specific encoding; it stands for a concrete encoding only once the country and language are known. On Windows, files are saved in the ANSI encoding by default. So how does the operating system know which encoding ANSI means: GBK, ASCII, or EUC-KR? Windows uses a "code page" to decide the system's default encoding. The default code page of Simplified Chinese Windows is 936, which means that ANSI is GBK there. The Windows code page for GB18030 is CP54936.
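The following Python sketch shows what goes wrong when text saved under one ANSI code page is read under another. The sample string and the choice of code pages 1252 (Western European) and 949 (Korean) are assumptions made for illustration.

```python
# Text saved on a Simplified Chinese system (ANSI = code page 936).
data = "\u4e2d\u6587".encode("cp936")   # the two characters 中文
print(data.hex(" ").upper())

# Read back with the same code page: correct.
as_chinese = data.decode("cp936")

# Read back on a system whose ANSI code page is 1252: every byte is
# treated as a separate Western European character, giving garbled text.
as_western = data.decode("cp1252")

# Read back under the Korean code page 949: different garbling again.
as_korean = data.decode("cp949", errors="replace")

print(as_chinese, as_western, as_korean)
```

The bytes never change; only the code page used to interpret them does, which is why garbled text usually points to a mismatch between the writer's and the reader's ANSI code page.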

You can run the chcp command without arguments in a command prompt to view the code page currently in use (for example, 936 on Simplified Chinese Windows).

The Chinese character "𤭢" is included only in GB18030; it is not in GB2312, GB13000, or GBK. By default, Visual Studio saves code files in CP936 (GBK), but if you type "𤭢" into a code file, Visual Studio shows a prompt asking the user to choose another code page, because the character cannot be represented in CP936.
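Whether a character can be saved under a given code page can be tested by simply trying to encode it. A minimal Python sketch, in which the helper name `can_encode` is hypothetical:

```python
rare = "\U00024B62"  # the character 𤭢

def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

support = {name: can_encode(rare, name) for name in ("gb2312", "cp936", "gb18030")}
print(support)

# In GB18030 the character needs the four-byte form.
gb18030_length = len(rare.encode("gb18030"))
```

An editor that saves in CP936 by default faces exactly this failure, which is why it has to ask for a different code page.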



2.2 Change the default code page

2.2.1 chcp command

You can use the chcp command to change the code page. For example, chcp 437 switches to code page 437 (United States). Note that this affects only the current console window, not the system-wide default.

2.2.2 Control Panel

You can change the system default code page in Control Panel -> Region and Language -> Change system locale.



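2.2.3 Changing it in code

A program can also select the locale, and with it the code page used by the C runtime, by calling setlocale. This affects only the calling program, not the system default:

/* declared in <locale.h> */
char *setlocale(
   int category,
   const char *locale );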
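As a rough illustration of how the call behaves, here is a sketch that uses Python's standard `locale` module, which wraps the C function of the same name. Using Python rather than C is a choice made for this example only. It queries the current setting, switches to the portable "C" locale, and restores the previous value.

```python
import locale

# Passing no locale string only queries the current setting.
previous = locale.setlocale(locale.LC_CTYPE)

# Switch the character-handling category to the minimal "C" locale.
# The call returns the name of the locale that is now in effect.
applied = locale.setlocale(locale.LC_CTYPE, "C")
print("was:", previous, "now:", applied)

# Restore the earlier setting. Only this process was ever affected.
restored = locale.setlocale(locale.LC_CTYPE, previous)
```

Other locale names are platform specific, so this sketch sticks to "C", which is available everywhere.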

3. Unicode

3.1 Unicode generation background

Different countries used different encoding rules, and although all of them were ASCII compatible, they were incompatible with one another. Suppose Jack in France writes a letter named "love_you.txt" and sends it to his friend Rose in Germany. To open the file correctly on her Windows system, Rose needs to know which encoding Jack's system used (Latin-1 in this example) and make sure that encoding is installed on her computer. If that much is tolerable, consider the web: when you get a document from the Internet, you probably do not know which country it came from or which encoding it uses. This is why email was so often garbled in its early days; the sender and recipient might have been using different encodings.

Then came the Unicode Standard, first published by the Unicode Consortium in 1991 and popularly known as the Unicode character set. Like ASCII, the Unicode character set is only a set of characters with a mapping between characters and numbers (code points); it does not contain any encoding rules or schemes. Unlike ASCII, it is not confined to one or two bytes' worth of values: code points range from U+0000 to U+10FFFF, which leaves room for more than a million characters (see the Unicode Standard for details).

The common belief that every Unicode character occupies exactly two bytes is wrong. For example, the code point of 𤭢 is U+24B62, which does not fit in two bytes; in UTF-16 it is encoded as the four bytes D852 DF62.

So how are Unicode characters encoded as bytes in memory? That is the job of the Unicode Transformation Formats (UTF); the most common ones are UTF-8 and UTF-16.
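The distinction between a code point and its encoded bytes can be made concrete with a short Python sketch (variable names are illustrative). It takes the character 𤭢, prints its code point, and encodes it with three transformation formats; it also recomputes the UTF-16 surrogate pair by hand using the standard UTF-16 arithmetic.

```python
ch = "\U00024B62"            # the character 𤭢
code_point = ord(ch)         # the number Unicode assigns to it
print(f"code point: U+{code_point:X}")

# One code point, three different byte sequences.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")
for name, data in [("UTF-8", utf8), ("UTF-16BE", utf16), ("UTF-32BE", utf32)]:
    print(f"{name:9s} {data.hex(' ').upper()}")

# UTF-16 stores code points above U+FFFF as a surrogate pair:
# subtract 0x10000, then split the remaining 20 bits into two halves.
offset = code_point - 0x10000
high_surrogate = 0xD800 + (offset >> 10)
low_surrogate = 0xDC00 + (offset & 0x3FF)
```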

Simplified Chinese Windows uses CP936 (GBK) by default, in which a Chinese character takes 2 bytes. The code points of most commonly used Unicode characters also fit in 2 bytes, so many people mistakenly assume that the bytes of a string in memory are the Unicode values of its characters. They are not: the GBK bytes of a character and its Unicode code point are different numbers. You can look up the Unicode code point of a Chinese character with an online Unicode lookup tool.
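A quick Python check of this point, using 中 as an arbitrary sample character: the bytes stored under CP936 differ from the Unicode code point even though both happen to be two bytes long.

```python
ch = "\u4e2d"                         # the Chinese character 中

code_point = ord(ch)                  # Unicode code point
gbk_bytes = ch.encode("cp936")        # what a CP936 string holds in memory
utf16_bytes = ch.encode("utf-16-be")  # what a UTF-16 string holds in memory

print(f"code point U+{code_point:04X}")
print("CP936 bytes ", gbk_bytes.hex(" ").upper())
print("UTF-16 bytes", utf16_bytes.hex(" ").upper())

# Both are two bytes long, but the values are unrelated.
same_length = len(gbk_bytes) == len(utf16_bytes) == 2
same_value = int.from_bytes(gbk_bytes, "big") == code_point
```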

3.3 Difference between character set and character encoding

Schemes such as ASCII, GB2312, GBK, GB18030, Big5, and Latin-1 each define a single mapping between a character and its binary data: within one scheme, a character has exactly one representation. For them it makes little difference whether you call GB2312 a character set or a character encoding. Unicode is different. As a character set, Unicode can be encoded in several ways, such as UTF-8, UTF-16, and UTF-32. So since the advent of Unicode, character set and character encoding must be clearly distinguished.

3.4 Disadvantages of UTF-16 encoding

UTF-16 encodes every character in two or four bytes. ASCII characters keep their values, but the original 7 bits are extended to 16 bits, so the top 9 bits are always 0. Take the character 'A':

ASCII: 100 0001
UTF-16: 0000 0000 0100 0001

As you can see, UTF-16 doubles the storage needed for ASCII characters, and at the byte level UTF-16 is not compatible with ASCII. For Western countries, where the ASCII character set already met their needs, this was pure overhead. Worse, an ASCII character encoded in UTF-16 always has a high byte of zero, which many C functions (such as strcpy and strlen) treat as the string terminator '\0', so they produce wrong results. UTF-16 also has a byte-order problem: the UTF-16 encoding of "𤭢" is stored as D8 52 DF 62 in big-endian order and as 52 D8 62 DF in little-endian order. For these reasons UTF-16 was resisted in many Western countries when it was first introduced, which hampered the adoption of Unicode. UTF-8 was then designed to solve these problems.
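These drawbacks are easy to reproduce. The sketch below is written in Python; the helper `c_strlen` is a hypothetical stand-in that mimics how C's strlen stops at the first zero byte, not a real library function. It measures the size of "ABC" in ASCII and UTF-16, shows what a zero-terminated string function would see, and prints both byte orders of 𤭢.

```python
text = "ABC"
ascii_bytes = text.encode("ascii")
utf16_be = text.encode("utf-16-be")
utf16_le = text.encode("utf-16-le")

def c_strlen(buf: bytes) -> int:
    """Mimic C strlen: count the bytes before the first zero byte."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end

print(len(ascii_bytes), len(utf16_be))   # storage doubles
print(c_strlen(ascii_bytes))             # the real length
print(c_strlen(utf16_be))                # stops at the leading 00
print(c_strlen(utf16_le))                # stops right after 41

# Byte order changes the stored bytes of every 16-bit unit.
rare_be = "\U00024B62".encode("utf-16-be")
rare_le = "\U00024B62".encode("utf-16-le")
print(rare_be.hex(" ").upper(), "|", rare_le.hex(" ").upper())
```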

3.5 UTF-8, the most common encoding of the Unicode character set

3.5.1 UTF-8 overview

UTF-8 is the most widely used encoding of the Unicode character set on the Internet. Its smallest unit is 8 bits (1 byte), and it uses one to four bytes to represent a Unicode character. In addition, UTF-8 is fully compatible with ASCII, as the encoding rules below show.

3.5.2 UTF-8 encoding rules

The UTF-8 encoding rules are simple:

(1) ASCII characters (single-byte characters) are encoded exactly as in ASCII: one byte whose highest bit is 0.

(2) For a multi-byte character that takes n bytes (1 < n <= 4), the first n bits of the first byte are set to 1 and bit n+1 is set to 0. Each of the following n-1 bytes starts with the two bits 10. All remaining bits of all the bytes are filled with the character's Unicode code point.
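The two rules can be turned into a small encoder. The following Python function is a sketch written for this explanation (the name `utf8_encode` is hypothetical); in practice you would call the built-in `str.encode("utf-8")`. The code point ranges that decide how many bytes are needed are the standard UTF-8 ranges, which the rules above leave implicit.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point following the two rules above."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")

    # Rule (1): ASCII stays a single byte with the highest bit 0.
    if code_point < 0x80:
        return bytes([code_point])

    # Number of bytes n, from the standard UTF-8 ranges.
    if code_point < 0x800:
        n = 2
    elif code_point < 0x10000:
        n = 3
    else:
        n = 4

    # Rule (2): each of the n-1 trailing bytes is 10xxxxxx and carries
    # six bits of the code point, filled from the low end.
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6

    # First byte: n one-bits, a zero bit, then the remaining high bits.
    prefix = (0xFF << (8 - n)) & 0xFF
    return bytes([prefix | code_point] + tail[::-1])


for cp in (0x41, 0xE9, 0x4E2D, 0x24B62):
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ').upper()}")
```

Comparing the result with Python's own encoder is a convenient way to check an implementation like this one.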

3.5.3 UTF-8 BOM

BOM stands for Byte Order Mark. It exists because UTF-16 and UTF-32 use two-byte or four-byte units and therefore have a byte-order problem. To distinguish big-endian (BE) from little-endian (LE), a marker is placed at the very start of the data: FE FF for UTF-16 big-endian and FF FE for UTF-16 little-endian. For example, the UTF-16 code units of the string "ABC" are 0041 0042 0043, and the corresponding byte sequences are:

UTF-16 BE with BOM: FE FF 00 41 00 42 00 43
UTF-16 LE with BOM: FF FE 41 00 42 00 43 00

UTF-8 uses single bytes as its unit, so it has no byte-order problem. However, UTF-8, ASCII, and the ANSI encodings are all plain byte sequences that are hard to tell apart, so Microsoft uses a BOM (the 3 bytes EF BB BF) to mark UTF-8 encoded text. The UTF-8 BOM is used mostly on Windows and rarely on other platforms. Linux, for example, uses UTF-8 throughout, so no marker is needed; and in HTTP, a header such as Content-Type: text/html; charset=utf-8 states the encoding, so no marker is needed either.
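A reader can use the BOM to pick the right decoder. Below is a minimal Python sketch built on the BOM constants of the standard `codecs` module. The function name `sniff_bom` is hypothetical, and for brevity the sketch ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs

# Known BOMs and the codec to use for the bytes that follow them.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

samples = [
    codecs.BOM_UTF8 + "ABC".encode("utf-8"),
    codecs.BOM_UTF16_BE + "ABC".encode("utf-16-be"),
    codecs.BOM_UTF16_LE + "ABC".encode("utf-16-le"),
    b"ABC",  # no BOM: the encoding has to be known some other way
]
for data in samples:
    encoding, skip = sniff_bom(data)
    text = data[skip:].decode(encoding) if encoding else None
    print(data.hex(" ").upper(), "->", encoding, text)
```

Python also ships a codec named "utf-8-sig" that strips a leading UTF-8 BOM on decoding, which is useful for files produced on Windows.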





Thank you for reading! As the saying goes, among any three people there is one I can learn from, so corrections to this article are welcome.

Classification: C++, character encoding
Labels: Character encoding, ASCII, ANSI, Unicode, UTF-8, UTF-16, BOM