One day, while browsing Juejin, I came across an article titled "Using Android source code to easily convert Chinese characters to pinyin". It caught my interest, so I spent a little over two hours reading the post and the code until I understood the principle. Then I wondered whether I could port it from Java to JavaScript.

This post records what I learned from that reading and tinkering. It shows that, with the APIs built into modern browsers, a few hundred lines of code are enough to convert Chinese characters to pinyin.


2017/05/12 update:

tiny-pinyin, a library for converting Chinese characters to pinyin based on this post, is now published. It is about 300 lines of code and easy to read. The online demo is at creeperyang.github.io/pinyin; feel free to try it.

I. The current state of converting Chinese characters to pinyin

First of all, converting Chinese characters to pinyin is a common requirement: sorting or filtering contacts alphabetically by pinyin, for example, or grouping destinations by initial letter (typical when booking flights). Yet I had not heard of any clever implementation, especially in the browser; a huge dictionary always seemed to be required.

For JavaScript, search GitHub and npm: the most popular libraries for converting Chinese characters to pinyin are pinyin and pinyinjs, and both ship with huge dictionaries. These dictionaries run to tens or hundreds of kilobytes (or even megabytes), so it takes some courage to use them in a browser. When we are asked to convert Chinese characters to pinyin, it is no surprise that our first reaction is to push back on the requirement (or to do it on the server).

Now, would you believe me if I told you that Chinese characters can be converted to pinyin in the browser with about 300 lines of code?

II. Starting from the Android 4.2.2 Contacts code

Once more, the post that started all this is "Using Android source code to easily convert Chinese characters to pinyin". In its own words:

Today we share an approach, taken from the Android system source code, for converting Chinese characters to pinyin. A single class of a little over 560 lines is all you need, with no third-party dependencies.

Does that challenge your assumptions? Is there some powerful algorithm that does away with the dictionary?

After reading the post for the first time, I was a little disappointed: there was no analysis of the algorithm, just a few hundred lines of code lifted from the Android source. On the second pass I read the code with a JavaScript port in mind, worked out the principle, and started porting.

The source code is in the Android Git repositories if you are interested.

III. Step by step: converting Chinese characters to pinyin in 300 lines of code

First, straight to the core question: why do we take it for granted that converting Chinese characters to pinyin requires a huge dictionary?

Because the order of Chinese characters in Unicode has nothing to do with pinyin. In the range \u4e00-\u9fff, for example, one character may be read "ha" and the next one "ze", so a character's code point tells you nothing about its pinyin. The only option is a large dictionary that records the pinyin of every character (or every common one).

However, suppose we could sort all Chinese characters by pinyin, in the order ‘A’, ‘AI’, ‘AN’, ‘ANG’, ‘AO’, ‘BA’, and so on. Then we would only need to remember the first character of each run of characters that share the same pinyin. The dictionary would be small: one entry per pinyin syllable, and there are not many of those.
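
As a rough illustration of this idea, the sketch below takes a list that is already sorted by pinyin and keeps only the first character of each run. The function name buildBoundaryTable and the five sample entries are made up for this example; they are not taken from the Android source or from tiny-pinyin.

```javascript
// Hypothetical helper: collapse a pinyin-sorted list into a boundary table.
// `sorted` is assumed to be an array of { hanzi, pinyin } already in pinyin order.
function buildBoundaryTable(sorted) {
  const boundaries = []
  let previous = null
  for (const { hanzi, pinyin } of sorted) {
    // A new pinyin starts here, so this character is a boundary.
    if (pinyin !== previous) {
      boundaries.push({ hanzi, pinyin })
      previous = pinyin
    }
  }
  return boundaries
}

// Illustrative sample only; a real list would cover every character in the range.
const sample = [
  { hanzi: '阿', pinyin: 'A' },
  { hanzi: '锕', pinyin: 'A' },
  { hanzi: '哎', pinyin: 'AI' },
  { hanzi: '爱', pinyin: 'AI' },
  { hanzi: '安', pinyin: 'AN' }
]

const table = buildBoundaryTable(sample)
console.log(table.map(e => `${e.pinyin}:${e.hanzi}`).join(' '))
// prints "A:阿 AI:哎 AN:安"
```

Five entries shrink to three here. Over the whole range, roughly twenty thousand characters shrink to one entry per pinyin syllable.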

Now the difficulty is sorting the characters by pinyin. Fortunately, ICU and the localization API provide exactly this kind of sorting (without a convenient sort/comparison method, this article probably would not exist).

So this is why 300 lines are enough to convert Chinese characters to pinyin:

  1. The Intl.Collator API: Intl.Collator implements locale-aware string sorting internally. With Intl.Collator.prototype.compare we can sort all Chinese characters "basically" by pinyin.
  2. The boundary character table: it records the boundary points of the sorted list. Each character in the table is the first character with a given pinyin (as the source comment puts it: "Each unihans is the first one within the same pinyin when collator is zh_CN").
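
To show how the two pieces fit together, here is a minimal sketch of a lookup. It assumes a heavily truncated boundary table (six entries, chosen for illustration) and uses a binary search to find the last boundary that sorts at or before the given character. The names BOUNDARIES and lookupPinyin are hypothetical, and the result of the Intl.Collator call depends on the ICU data available in the runtime.

```javascript
// Hypothetical, truncated boundary table for illustration only.
// Each entry is assumed to be the first character of its pinyin run
// under the zh-Hans-CN collation.
const BOUNDARIES = [
  { hanzi: '阿', pinyin: 'A' },
  { hanzi: '哎', pinyin: 'AI' },
  { hanzi: '安', pinyin: 'AN' },
  { hanzi: '肮', pinyin: 'ANG' },
  { hanzi: '凹', pinyin: 'AO' },
  { hanzi: '八', pinyin: 'BA' }
]

// Return the pinyin of the last boundary that sorts at or before `ch`,
// or null if `ch` sorts before the first boundary.
function lookupPinyin(ch, boundaries, compare) {
  let lo = 0
  let hi = boundaries.length - 1
  let found = -1
  while (lo <= hi) {
    const mid = (lo + hi) >> 1
    if (compare(boundaries[mid].hanzi, ch) <= 0) {
      found = mid // candidate; look for a later boundary
      lo = mid + 1
    } else {
      hi = mid - 1
    }
  }
  return found === -1 ? null : boundaries[found].pinyin
}

// Because the table is truncated, anything sorting after '八' also maps to 'BA'.
const pinyinCollator = new Intl.Collator(['zh-Hans-CN'])
console.log(lookupPinyin('爱', BOUNDARIES, pinyinCollator.compare))
```

The comparator is passed in as a parameter, so the same search works with any ordering; only the collator knows anything about Chinese.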

At this point things may still be a little unclear, so let’s go straight to the code:

```js
/**
 * Writes sortedHanzi.json in the following format:
 * [
 *   { "hanzi": "阿", "unicode": "\u963f", "index": 0 },  // pinyin: a
 *   { "hanzi": "锕", "unicode": "\u9515", "index": 1 },  // pinyin: a
 *   ...
 *   { "hanzi": "鿿", "unicode": "\u9fff", "index": 20991 }
 * ]
 */
const fs = require('fs')

const FIRST_PINYIN_UNIHAN = 19968 // \u4e00
const LAST_PINYIN_UNIHAN = 40959 // \u9fff

function listAllHanziInOrder() {
  // Collect every character in the range \u4e00-\u9fff.
  const arr = []
  for (let i = FIRST_PINYIN_UNIHAN; i <= LAST_PINYIN_UNIHAN; i++) {
    arr.push(String.fromCharCode(i))
  }
  // Sort with the Chinese collator, which orders characters basically by pinyin.
  const COLLATOR = new Intl.Collator(['zh-Hans-CN'])
  arr.sort(COLLATOR.compare)
  console.log(arr.length)
  fs.writeFileSync(`${__dirname}/sortedHanzi.json`, JSON.stringify(
    arr.map((v, i) => {
      return {
        hanzi: v,
        unicode: `\\u${v.charCodeAt(0).toString(16)}`,
        index: i
      }
    }), null, ' '
  ))
  console.log('done')
}

listAllHanziInOrder()
```

If you are interested, save the code above as a .js file and run it with node --icu-data-dir=node_modules/full-icu to get the list of all Chinese characters sorted by pinyin.

Here are a few things to note:

  1. I said "basically" sorted by pinyin, and I stress it again: the list we get is not sorted strictly by pinyin. Characters with a different pinyin occasionally turn up in the middle of a run, and this has to be handled when building the boundary table.
  2. The script above produces the ordering of all Chinese characters. In places it differs from the table in Android's HanziToPinyin.java, so that table needs to be updated. (The biggest pitfall, and most of the effort, in going from Java to JavaScript was correcting the boundary table.)
  3. You will have spotted the core line: const COLLATOR = new Intl.Collator(['zh-Hans-CN']). Specifying the Chinese locale (zh-Hans-CN) for Intl.Collator is the key to sorting Chinese characters by pinyin. Intl.Collator is the API for ordering strings according to a specific locale.
  4. Before running the script, run npm i full-icu. This dependency installs the missing Chinese support and prints instructions on how to point Node.js at the ICU data file when running the script.

1. ICU

ICU stands for International Components for Unicode and provides Unicode and internationalization support for applications.

ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.

ICU also provides a locale-aware string comparison service (the Unicode Collation Algorithm plus locale-specific comparison rules):

Collation: Compare strings according to the conventions and standards of a particular language, region or country. ICU’s collation is based on the Unicode Collation Algorithm plus locale-specific comparison rules from the Common Locale Data Repository, a comprehensive source for this type of data.

For more information, see site.icu-project.org/. All we need to know here is that Node.js, Chrome and others implement internationalization on top of ICU, including sorting characters according to local conventions and rules.

In modern browsers, the bundled ICU typically includes support for the user’s language, so we can use it directly.

In Node.js, however, the bundled ICU usually contains only a subset of locales (typically English), so we have to add Chinese support ourselves. In most cases npm install full-icu provides the missing data (see node --icu-data-dir=node_modules/full-icu above).
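
One way to tell whether a Node.js process has more than the English-only subset loaded is to format a date in a non-English locale. The check below is adapted from the Node.js internationalization documentation; the wrapper name hasFullICU is my own. Official Node.js binaries from v13 onward ship with the full ICU data set, so this mostly matters on older or custom builds.

```javascript
// Returns true when ICU data beyond the English subset is available.
function hasFullICU() {
  try {
    const january = new Date(9e8)
    const spanish = new Intl.DateTimeFormat('es', { month: 'long' })
    // With Spanish locale data present, January formats as "enero".
    return spanish.format(january) === 'enero'
  } catch (err) {
    return false
  }
}

// process.versions.icu is undefined when Node.js was built without Intl support.
console.log('ICU version:', process.versions.icu)
console.log('full ICU data:', hasFullICU())
```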

For more on full-icu, see its npm page, as well as the discussion in nodejs/node#3460.

More information about ICU in Node.js can be found at github.com/nodejs/node… .

2. Intl API

The previous section covered the basics of internationalization/localization. This section adds some notes on using the built-in APIs.

How to check the user's language and whether the runtime supports it

Intl.Collator.supportedLocalesOf(array | string) returns an array of the locales that are supported without falling back to the default locale. The argument can be an array or a string and gives the locales (BCP 47 language tags) you want to test.
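
A minimal sketch of that check follows. The list of requested tags and the hasChinese guard are illustrative choices of mine, and the returned array depends on the ICU data in the runtime.

```javascript
// Ask the runtime which of the requested locales its collator supports.
const requested = ['zh-Hans-CN', 'en-US']
const supported = Intl.Collator.supportedLocalesOf(requested)
console.log(supported)

// Hypothetical guard: only rely on pinyin ordering if a Chinese locale is supported.
const hasChinese = supported.some(tag => tag.toLowerCase().startsWith('zh'))
console.log(hasChinese ? 'Chinese collation available' : 'Chinese collation missing')
```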

Constructing a Collator object and sorting strings

With Intl.Collator.prototype.compare we can sort strings in the order defined by the specified language. For Chinese, that order happens to be basically by pinyin: ‘A’, ‘AI’, ‘AN’, ‘ANG’, ‘AO’, ‘BA’, ‘BAI’, ‘BAN’, ‘BANG’, ‘BAO’, ‘BEI’, ‘BEN’, ‘BENG’, ‘BI’, ‘BIAN’, ‘BIAO’, ‘BIE’, ‘BIN’, ‘BING’, ‘BO’, ‘BU’, ‘CA’, ‘CAI’, ‘CAN’, … This is the key to converting Chinese characters to pinyin.
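
For a quick feel of the comparison, the sketch below sorts five characters I picked for illustration. With full Chinese ICU data I would expect them to come out in pinyin order (a, ai, an, ao, ba); on an English-only build the order may differ.

```javascript
const zhCollator = new Intl.Collator(['zh-Hans-CN'])

// compare is a bound function, so it can be passed straight to Array.prototype.sort.
const chars = ['八', '安', '阿', '凹', '哎']
const sortedChars = [...chars].sort(zhCollator.compare)
console.log(sortedChars.join(' '))

// Like any comparator, it returns a negative number, zero, or a positive number.
console.log(zhCollator.compare('阿', '八'))
console.log(zhCollator.compare('阿', '阿'))
```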

IV. Correcting the boundary table

Using the same boundary table as the Android code and testing it against the default set of common Chinese characters (6000+) gives the following results:

Clearly this boundary table has problems and needs to be corrected.

We can see that most of the characters were converted to QING, so the boundary character that corresponds to the pinyin qing must be wrong.

  1. Find that boundary character: it is '\u72c5' / '狅'. Together with its neighbours in the table, one before and one after, that gives ['\u4eb2', '\u72c5', '\u828e'] / ['亲', '狅', '芎'].
  2. Looking it up, '\u72c5' / '狅' can be read qing, but nowadays it is more often read kuang. That is presumably the cause of the error.
  3. According to the sorted list of all characters generated earlier, the first character with the pinyin qing is '\u9751' / '靑'.
  4. After this change, only 104 characters failed to convert.

The whole correction process is as described above: test repeatedly, find the wrong boundary characters, and fix them.
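
The test-and-fix loop can be sketched roughly as follows. This is not the actual tiny-pinyin test suite: findMismatches, the stand-in converter and the three-entry reference dictionary are all hypothetical. The tally of wrong outputs is what makes a single bad boundary (such as everything turning into QING) stand out.

```javascript
// Hypothetical harness: compare a converter against a reference dictionary
// and count how often each wrong output appears.
function findMismatches(convert, reference) {
  const failures = []
  const tally = {}
  for (const [hanzi, expected] of Object.entries(reference)) {
    const actual = convert(hanzi)
    if (actual !== expected) {
      failures.push({ hanzi, expected, actual })
      tally[actual] = (tally[actual] || 0) + 1
    }
  }
  return { failures, tally }
}

// Stand-in converter that maps everything to QING, mimicking a bad boundary.
const brokenConvert = () => 'QING'
const reference = { '青': 'QING', '穷': 'QIONG', '秋': 'QIU' }

const report = findMismatches(brokenConvert, reference)
console.log(report.failures.length, report.tally)
// prints: 2 { QING: 2 }
```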

The tiny-pinyin commit history shows a large number of dictionary corrections. Along the way I also fixed many entries in the pinyin dictionary of common characters that is used for testing. All of this took about a full working day, which was hard going.

In addition, all tests pass on Node.js 7.x/6.x, but on 5.x/4.x some characters are converted to the wrong pinyin. This can be solved by correcting the dictionary for specific versions of Node.js.
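
One possible shape for such version-specific corrections is sketched below. Everything in it is an assumption made for illustration: the PATCHES structure, the patchBoundaries function, and the placeholder characters do not come from tiny-pinyin and are not real corrections.

```javascript
// Hypothetical overrides keyed by Node.js major version.
// The index and character are placeholders, not real corrections.
const PATCHES = {
  4: [{ index: 1, hanzi: 'X' }],
  5: [{ index: 1, hanzi: 'X' }]
}

// Return a copy of the boundary table with the overrides for this version applied.
function patchBoundaries(boundaries, nodeVersion, patches) {
  const major = parseInt(nodeVersion.split('.')[0], 10)
  const patched = boundaries.slice()
  for (const { index, hanzi } of patches[major] || []) {
    patched[index] = hanzi
  }
  return patched
}

const baseTable = ['甲', '乙', '丙'] // placeholder table
console.log(patchBoundaries(baseTable, process.versions.node, PATCHES))
```

Keeping the overrides separate from the base table means the corrected table for current versions stays untouched, and each old version only carries its own differences.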

Finally, I hope you now understand the principle behind converting Chinese characters to pinyin. Questions and issues on tiny-pinyin are welcome.