Integrated Oracle data into MaxCompute, garbled code problem analysis; Why, while Oracle data is not garbled, integration into MaxCompute is garbled? What’s the problem?

1.1 Garbled characters

DataWorks’ data offline integration (DataX) integrates Oracle data into MaxCompute, but the source library is not garbled.

Symptom: [Oracle; tool: plsqL-dev]

[MaxCompute, DataWorks]

select OP_USER from test.mdtsb where

uuid=’161A45E75BC88040E053441074848040′;

1.2 Problem Analysis

Use two Oracle functions:

【 DUMP; CONVERT;

Docs.oracle.com/database/12…

Docs.oracle.com/database/12…

In order to avoid the correct transcoding because the real character set of the column with garbled characters is the same as that of the operating system, the Linux environment is selected, the client character set is SET to UTF8, and SQLplus is used for query analysis.

1. Client environment:

SQL> select userenv(‘language’) from dual;

USERENV(‘LANGUAGE’)

—————————————————-

SIMPLIFIED CHINESE_CHINA.AL32UTF8

2. Use the dump function to print the encoding of the text:

SQL> select dump(OP_USER,1016) from test.mdtsb where

uuid=’161A45E75BC88040E053441074848040′ ;

DUMP(OP_USER,1016)

—————————————————-

Typ=1 Len=6CharacterSet=AL32UTF8: c0,ee,be,b0,ea,bb

This function outputs three messages

1. The length is 6 bytes

2. Character set is AL32UTF8

3. The character encoding is C0, EE,be, B0, EA,bb

Based on this information, we guess that the original character set may be GBK or GB18030 or GB2312. Because three characters, if it’s UTF8 their encoding is mostly one character for three bytes. The encoding of the GBK character set is double-byte.

3. View the GBK characters corresponding to C0, EE,be,b0, EA, and BB

www.qqxiuzi.cn/zh/hanzi-gb…

c0,ee

be,b0

ea,bb

Therefore, it can be identified that this character encoding belongs to GBK class.

4. Corresponding to the Man’s GBK class code

www.qqxiuzi.cn/bianma/zifu…

li

GB2312 code: C0EE

BIG5 code: A7F5

GBK code: C0EE

GB18030 code: C0EE

Unicode encoding: 674E

scene

GB2312 code: BEB0

BIG5 code: B4BA

GBK code: BEB0

GB18030 code: BEB0

Unicode encoding: 666F

son

GB2312 Code: EABB

BIG5 code: A9FE

GBK code: EABB

GB18030 Code: EABB

Unicode encoding: 660A

As you can see, the possible codes for these three characters are GB2312, GB18030, GBK.

5. Convert the string encoding to AL32UTF8 using the convert function

SQL> select convert(OP_USER,’AL32UTF8′,’ZHS16GBK’) from test.mdtsb where

uuid=’161A45E75BC88040E053441074848040′ ;

CONVERT(OP_USER,’ AL32UTF8′,’ZHS16GBK’)

——————————————————-

Li Jinghao

This result verifies our guess that the character set should be the GBK character set, but because GBK and GB18030 and GB2312 are not completely parent-child set relations, this can only be one of them. However, the GBK character set is currently more commonly used.

SQL> select convert(OP_USER,’AL32UTF8′,’ZHS16GBK’) from test.mdtsb where

uuid=’161A45E75BC88040E053441074848040′ ;

CONVERT(OP_USER,’ AL32UTF8′,’ZHS16CGB231280′)

——————————————————-

Li Jinghao

SQL> select convert(OP_USER,’AL32UTF8′,’ZHS16GBK’) from test.mdtsb where

uuid=’161A45E75BC88040E053441074848040′ ;

CONVERT(OP_USER,’ AL32UTF8′,’ZHS32GB18030′)

——————————————————-

Li Jinghao

6. Query the string encoding of UTF8 corresponding to these three Characters

Website: www.qqxiuzi.cn/bianma/Unic…

li

Unicode encoding: 0000674E

UTF8 encoding: E69D8E

UTF16: FEFF674E

UTF32:0000FEFF0000674E

scene

Unicode encoding: 0000666F

UTF8 encoding: E699AF

UTF16 code: FEFF666F

UTF32:0000FEFF0000666F

son

Unicode code: 0000660A

UTF8 encoding: E6988A

UTF16 code: FEFF660A

UTF32:0000FEFF0000660A

So, if the original string is encoding UTF8, theoretically we should through the dump function to obtain the string coding for: e6, 9 d, e, the e6, 99, af, e6, 98, 8 a

1.3 Troubleshooting

Convert from:

SQL> select dump(convert(OP_USER,’AL32UTF8′,’ZHS16GBK’),1016) from test.mdtsb

where uuid=’161A45E75BC88040E053441074848040′ ;

DUMP(CONVERT(OP_USER,’ AL32UTF8′,’ZHS16GBK’),1016)

——————————————————-

Typ = 1 9 characterset Len = = AL32UTF8: e6, 9 d, e, the e6, 99, af, e6, 98, 8 a

The UTF8 character set should display the encoding:

E6, 9 d, 8 e, the e6, 99, af, e6, 98, 8 a.

Conclusion:

The actual encoding of the column stored in the Oracle database is the GBK character set, which is the same as the character encoding in the client environment, so it just shows up. But because Oracle stores the character set as UTF8, it should actually store the character set as UTF8. Therefore, the code of the annotation is inconsistent with the actual code – garbled code.

To deal with:

The convert(OP_USER,’AL32UTF8′,’ZHS16GBK’) function is used to convert the real character set obtained by testing the character set on the corresponding column of the data synchronization task. After converting to ODPS, the data will no longer be garbled.

SELECT convert(KEY1,’ZHS16GBK’,’UTF8′) FROM MATDOC;

[MaxCompute, DataWorks]

select OP_USER from MaxCompute.MDTSB where uuid=’161A45E75BC88040E053441074848040′;

1.4 Target end identification method

In MaxCompute, there is a function that can parse and convert character encoding, but MaxCompute can only store one character set, “UTF-8”, so it is best to convert to UTF-8 at the source or during transmission. The following two functions are character-encoding related functions for MaxCompute.

ENCODE,

The command format

binary encode(string , string )

The command that

Encode STR in charset format.

Parameters that

STR: Mandatory. The STRING type. The string to be recoded.

Charset: Mandatory. The STRING type. Encoding format. The value can be UTF-8, UTF-16, UTF-16LE, UTF-16BE, ISO-8859-1, or US-ASCII.

Return Value Description

Return BINARY. If any input parameter is NULL, NULL is returned.

The sample

Example 1: Encode the string ABC in UTF-8 format. The following command is used as an example.

select encode(“abc”, “UTF-8”);

— return ABC.

· IS_ENCODING

The command format

boolean is_encoding(string , string , string )

The command that

Determines whether the input string STR can be converted from the specified character set from_encoding to another character set to_encoding. It can also be used to determine whether the input is garbled. In general, you can set from_encoding to UTF-8 and to_encoding to GBK.

Parameters that

STR: Mandatory. The STRING type. An empty string can be considered to belong to any character set.

From_encoding, TO_encoding: mandatory. STRING, source and target character sets.

Return Value Description

Return BOOLEAN type. Return True if STR can be successfully converted, False otherwise. If either input parameter is NULL, NULL is returned.

The sample

Example 1: Determine whether a character test or test can be converted from the UTF-8 character set to the GBK character set. The following command is used as an example.

Select is_encoding(‘ test ‘, ‘UTF-8 ‘,’ GBK ‘);

— Returns true.

Select is_encoding(‘ test ‘, ‘UTF-8 ‘,’ GBK ‘);

— Returns true.

Help.aliyun.com/document\_d…

1.5 the appendix

The Convert function

Purpose

CONVERT converts a character string from one character set to another. The datatype of the returned value is VARCHAR2.

  • The

    char

    argument is the value to be converted. It can be any of the datatypes CHAR, VARCHAR2, NCHAR, NVARCHAR2, CLOB, or NCLOB.

  • The

    dest_char_set

    argument is the name of the character set to which

    char

    is converted.

  • The

    source_char_set

    argument is the name of the character set in which

    char

    is stored in the database. The default value is the database character set.

Both the destination and source character set arguments can be either literals or columns containing the name of the character set.

For complete correspondence in character conversion, it is essential that the destination character set contains a representation of all the characters defined in the source character set. Where a character does not exist in the destination character set, a replacement character appears. Replacement characters can be defined as part of a character set definition.

Examples

The following example illustrates character set conversion by converting a Latin-1 string to ASCII. The result is the same as importing the same string from a WE8ISO8859P1 database to a US7ASCII database.

SELECT CONVERT(‘ A E I O ø A B C D E ‘, ‘US7ASCII’, ‘WE8ISO8859P1’)

FROM DUAL;

CONVERT (‘ AEIO Ø ABCDE ‘

———————

A E I ? ? A B C D E ?

‘US7ASCII’ is the current Oracle database character set, and ‘WE8ISO8859P1’ is the converted character set.

Common character sets include:

· US7ASCII: US 7-bit ASCII character set

· WE8DEC: West European 8-bit character set

· F7DEC: Digital French 7-bit character set

· WE8EBCDIC500: IBM West European EBCDIC Code Page 500

· WE8ISO8859P1: ISO 8859-1 West European 8-bit character set

· UTF8: Unicode 4.0 UTF-8 Universal character set, CESU-8 compliant

· AL32UTF8: Unicode 4.0 UTF-8 Universal character set

The original link

This article is the original content of Aliyun and shall not be reproduced without permission.