Garbled text is a perennial problem for Web beginners. To avoid it, you need to understand what causes it. I’ll start with a few common encodings.

UTF-8 (8-bit Unicode Transformation Format)

UTF-8 is a variable-length character encoding for Unicode, and it is also a prefix code. It can represent any character in the Unicode standard, and its single-byte encodings are identical to ASCII, so software written for ASCII can continue to be used with little or no modification. As a result, it has become the preferred encoding for e-mail, web pages, and other applications that store or transmit text.

UTF-8 encodes each character using one to four bytes:

  • The 128 US-ASCII characters need only one byte (Unicode range U+0000 to U+007F).
  • Latin letters with diacritics and the Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, and some other alphabets need two bytes (Unicode range U+0080 to U+07FF).
  • The remaining characters of the Basic Multilingual Plane (BMP), which contains most characters in common use, need three bytes (Unicode range U+0800 to U+FFFF).
  • Characters in the rarely used supplementary planes need four bytes (Unicode range U+10000 to U+10FFFF).
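The four length classes above are easy to check by hand. The following is a minimal sketch in Python (chosen here only for brevity, not because the article uses it); the sample characters and the helper name utf8_length are my own choices for illustration.

```python
# One sample character from each UTF-8 length class.
# Escape sequences are used so this snippet itself stays pure ASCII.
SAMPLES = [
    ("\u0041", "U+0041 LATIN CAPITAL LETTER A"),
    ("\u00e9", "U+00E9 LATIN SMALL LETTER E WITH ACUTE"),
    ("\u4e2d", "U+4E2D CJK ideograph (BMP)"),
    ("\U0001F600", "U+1F600 emoji (supplementary plane)"),
]


def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


for ch, label in SAMPLES:
    # Print the label, the byte count, and the encoded bytes in hex.
    print(label, utf8_length(ch), ch.encode("utf-8").hex())
```

The byte counts come out as 1, 2, 3, and 4, matching the four ranges in the list.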





GB2312

GB2312 is the national standard character set for simplified Chinese in the People’s Republic of China. Its full name is “Chinese Coded Character Set for Information Interchange · Basic Set”, and it is also known as GB0. It was issued by China’s national standards administration and took effect on May 1, 1981. GB2312 is widely used in mainland China, and Singapore and other places use it as well. Almost all Chinese-language systems and internationalized software in mainland China support GB2312.

Programs that use GB2312 usually store it in the EUC form, in order to stay compatible with ASCII. The “GB2312” entry in a browser’s encoding menu usually refers to this “EUC-CN” representation.

Each Chinese character and symbol is represented by two bytes. The first byte is called the “high byte” and the second byte is called the “low byte”.

The high byte ranges from 0xA1 to 0xF7 (0xA0 plus the row number, 01 to 87), and the low byte ranges from 0xA1 to 0xFE (0xA0 plus the cell number, 01 to 94). Because the Level-1 Chinese characters start at row 16, the high byte in the Chinese character area ranges from 0xB0 to 0xF7 and the low byte from 0xA1 to 0xFE, so the area holds 72*94=6768 code points. Five of them, 0xD7FA to 0xD7FE, are unassigned.

For example, most programs store the character “啊” as two bytes, 0xB0 (the first byte) and 0xA1 (the second byte). Compared with its row-cell code 1601 (row 16, cell 1): 0xB0=0xA0+16, 0xA1=0xA0+1.
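The arithmetic above can be reproduced in a few lines. This is a small sketch in Python, assuming the standard gb2312 codec that ships with Python; the function name quwei_to_euc_cn is hypothetical and exists only for this illustration.

```python
def quwei_to_euc_cn(row: int, cell: int) -> bytes:
    """Convert a GB2312 row/cell pair to its two EUC-CN bytes.

    Both bytes are formed by adding 0xA0, as described above.
    """
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must both be in 1..94")
    return bytes([0xA0 + row, 0xA0 + cell])


encoded = quwei_to_euc_cn(16, 1)          # row 16, cell 1
print(encoded.hex())                      # b0a1

decoded = encoded.decode("gb2312")        # decode with Python's gb2312 codec
print(hex(ord(decoded)))                  # 0x554a, the code point of the character

hanzi_rows = 0xF7 - 0xB0 + 1              # rows in the Chinese character area
print(hanzi_rows * 94)                    # 6768
```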




To avoid garbled characters in Web pages, it is best to use one string encoding everywhere:


1. When creating the database, select UTF-8 encoding.

2. When creating PHP files, save them in UTF-8 encoding.

3. After connecting to the database, set the connection encoding:

mysql_query('set names utf8');

4. Declare the encoding in the pages you send:

header("Content-Type: text/html; charset=utf-8");


The following are the common causes of garbled text and how to fix them:


1. The database uses UTF-8 while the page declares GB2312. This is the most common cause of garbled text. Call mysql_query("SET NAMES GBK"); to set the MySQL connection encoding, and make sure the encoding declared by the page matches the connection encoding set here (GBK is an extension of GB2312). If the page is encoded in UTF-8, use mysql_query("SET NAMES UTF8"); note that it is UTF8, not UTF-8. If the page declares the same encoding that the database uses internally, the connection encoding can usually be left unset.
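To see why this mismatch produces garbage, it helps to replay it outside the browser. The sketch below, in Python, assumes the database hands back UTF-8 bytes and the page is decoded as GBK; the sample string is my own and no real database is involved.

```python
text = "\u4e2d\u6587"                      # two Chinese characters

# What the database sends when the connection encoding is UTF-8.
stored = text.encode("utf-8")

# What a browser shows if the page is declared as GB2312/GBK:
# the same bytes, interpreted with the wrong table.
garbled = stored.decode("gbk", errors="replace")

# What it shows when the declared encoding matches the bytes.
correct = stored.decode("utf-8")

print(len(stored))        # 6 (three bytes per character)
print(garbled == text)    # False
print(correct == text)    # True
```

The bytes never change; only the table used to read them does, which is why aligning the connection encoding with the page declaration fixes the problem.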


2. The encoding declared by the page differs from the encoding the file is actually saved in. This is rare at creation time, because the designer would see garbled characters in the browser while building the page. More often it happens when someone fixes a minor bug after release, opens the page with the wrong encoding, and saves it. It can also happen when files are edited online through FTP software such as CuteFTP whose encoding setting is wrong, so the file is converted to the wrong encoding.


3. When Ajax receives responseText or responseXML, it decodes the value as UTF-8 by default, so if the server sends data in another encoding without declaring it, the received value may be garbled. The solution is to make sure the data sent by the server is encoded in UTF-8 and declared as such.

For example:

response.setContentType("text/html; charset=utf-8"); (JSP)

header("Content-Type: text/html; charset=utf-8"); (PHP)
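The effect of that header on the receiving side can be sketched as follows. This is a simplified Python model of a client, not how any particular browser is implemented: the function decode_body is hypothetical, and the UTF-8 fallback is an assumption that mirrors the Ajax behavior described above.

```python
from email.message import Message


def decode_body(body: bytes, content_type: str, default: str = "utf-8") -> str:
    """Decode a response body using the charset named in Content-Type.

    Falls back to `default` when the header names no charset.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or default
    return body.decode(charset, errors="replace")


text = "\u4e2d\u6587"
gbk_body = text.encode("gbk")

# Declared correctly: the client picks the right decoder.
print(decode_body(gbk_body, "text/html; charset=gbk") == text)   # True

# Not declared: the client falls back to UTF-8 and the text is garbled.
print(decode_body(gbk_body, "text/html") == text)                # False

# UTF-8 data with a UTF-8 declaration always round-trips.
print(decode_body(text.encode("utf-8"), "text/html; charset=utf-8") == text)  # True
```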


4. When downloading files, Chinese file names come out garbled in IE or Firefox. This is because the browser URL-decodes the name of the downloaded file, so the name needs to be URL-encoded before it is sent. In PHP you can use urlencode($filename);
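To show what URL-encoding does to a Chinese file name, here is a rough Python equivalent using urllib.parse.quote. It is an illustration only, with a made-up file name; note that PHP’s urlencode encodes a space as “+” while quote produces “%20”, so the two are not byte-for-byte identical for names that contain spaces.

```python
from urllib.parse import quote, unquote

filename = "\u4e2d\u6587.txt"        # a Chinese file name (hypothetical)

# Percent-encode the UTF-8 bytes of the name; ASCII letters and "." are kept.
encoded = quote(filename)
print(encoded)                       # %E4%B8%AD%E6%96%87.txt

# Decoding the result gives back the original name.
print(unquote(encoded) == filename)  # True
```

The encoded name is plain ASCII, so it survives transport regardless of which encoding the browser assumes.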