Introduction

Computers were first developed in the West. Given the limited performance of early machines and the characters commonly used there, the first widely used character encoding was ASCII. ASCII can represent only a small set of characters, so as computers spread around the world, a standard able to represent the characters of every language was needed. That standard is Unicode.

Before Unicode appeared, individual countries and regions had already developed their own encoding standards to cover their own characters. These standards were local by design and did not apply to the rest of the world, so none of them became universal.

Today we will discuss sorting and regex matching of Unicode encoded characters.

Sorting of ASCII characters

ASCII stands for American Standard Code for Information Interchange. To this day, ASCII defines only 128 characters. The composition of the ASCII character set will not be discussed in detail here. For those who are interested, check out my previous article on Unicode.

ASCII includes the 26 letters of the English alphabet. Let’s look at how ASCII strings are sorted in JavaScript:

const words = ['Boy', 'Apple', 'Bee', 'Cat', 'Dog'];
words.sort();
// [ 'Apple', 'Bee', 'Boy', 'Cat', 'Dog' ]

As you can see, the strings are sorted in dictionary order, which is what we want.

But if we replace the strings with Chinese characters and sort them, we don’t get the desired result:

const words = ['爱', '我', '中', '华'];
words.sort();
// [ '中', '华', '我', '爱' ]

Why is that?

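By default, sort() converts the elements to strings and compares them by their UTF-16 code unit values. For Chinese characters, this order simply follows the Unicode code points and has nothing to do with how the characters are ordered in the local language.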
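To make the comparison visible, here is a minimal sketch that prints the code unit of each character before sorting. It assumes the four example strings are the single characters 爱, 我, 中 and 华; the variable names are only illustrative.

```javascript
// Assumption: these four characters stand in for the example strings.
const words = ['爱', '我', '中', '华'];

// charCodeAt(0) returns the UTF-16 code unit that the default sort compares.
const codeUnits = words.map((w) => w.charCodeAt(0).toString(16));
console.log(codeUnits); // [ '7231', '6211', '4e2d', '534e' ]

// Copy before sorting so the original array is left untouched.
const sorted = [...words].sort();
console.log(sorted); // [ '中', '华', '我', '爱' ]
```

The sorted order matches ascending code unit values (4e2d, 534e, 6211, 7231), not pinyin order.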

Sorting local characters

Since the default sort can’t order Chinese the way we want, what we really need is to sort Chinese characters by their pinyin, in alphabetical order.

So for “爱我中华” (“love my China”) above, we actually want to compare the pinyin “ai”, “wo”, “zhong” and “hua”.

Is there an easy way to compare?

Some browsers provide two ways to compare local characters: Intl.Collator and String.prototype.localeCompare.

For example, in Chrome 91.0:

Intl.Collator gives the expected result, while String.prototype.localeCompare does not.

Take a look at Firefox 89.0:

The results are consistent with Chrome.

The following is the result in Node.js v12.13.1:

You can see that in Node.js, no locale-aware conversion or sorting takes place.

Therefore, the behavior of these two methods depends on the environment, that is, it is implementation-specific. We can’t trust them completely.

So sorting strings is full of pitfalls!
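As a rough sketch of how the two APIs are called, the snippet below passes an explicit locale to both and checks whether the runtime reports Chinese collation support. The locale tag 'zh-Hans-CN' and the variable names are my assumptions, not something the original specifies, and the actual order still depends on the locale data shipped with the runtime.

```javascript
// Assumption: these four characters stand in for the example strings.
const words = ['爱', '我', '中', '华'];

// Passing an explicit locale avoids depending on the runtime's default locale.
const collator = new Intl.Collator('zh-Hans-CN');
const byCollator = [...words].sort(collator.compare);

// localeCompare accepts the same locale argument.
const byLocaleCompare = [...words].sort((a, b) => a.localeCompare(b, 'zh-Hans-CN'));

// An empty result here means the runtime has no collation data for this locale.
const hasChinese = Intl.Collator.supportedLocalesOf(['zh-Hans-CN']).length > 0;

console.log(hasChinese, byCollator, byLocaleCompare);
```

When Chinese collation data is available, the expectation is pinyin order (ai, hua, wo, zhong). When it is not, the runtime falls back to another locale and the order may differ, which is why checking supportedLocalesOf first is a reasonable precaution.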

Why not use Unicode for sorting

So why not use Unicode for sorting?

First, average users know nothing about Unicode. All they want is for strings to be sorted in the dictionary order of their own language.

Second, even sorting by local language rules is difficult, because browsers need to support localized sorting for every language. That is a huge amount of work.

Matching emoji with regular expressions

Finally, let’s talk about matching emoji with regular expressions.

Emoji are pictographic symbols that are encoded in Unicode. There are a great many of them, about 3,521, so matching them by listing every one requires a regular expression like this:

(?:\ud83e\uddd1\ud83c\udffb\u200d\u2764\ufe0f\u200d\ud83d\udc8b\u200d\ud83e\uddd1\ud83c\udffc|\ud83e\uddd1\ud83c\udffb\u200d\u2764\ufe0f\u200d\ud83d [...

The following image gives a sense of how many emoji there are:

With so many emojis, is there an easy way to match them? The answer is yes.

Unicode property escapes started as a TC39 proposal and are now part of the ECMAScript standard (ES2018). With them, a regular expression that uses the u flag can match emoji with \p{Emoji_Presentation}.

\p{Emoji_Presentation}

Isn’t that easy?
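Here is a small sketch of the property escape in use. The sample text and variable names are made up for illustration. Note that \p{Emoji_Presentation} matches one code point at a time, so a multi-code-point sequence such as the one in the long pattern above would need more than this single escape.

```javascript
// The u flag is required for \p{...} property escapes.
const emojiPattern = /\p{Emoji_Presentation}/gu;

// Hypothetical sample text.
const text = 'Hello 😀 world 🚀';
const found = text.match(emojiPattern);
console.log(found); // [ '😀', '🚀' ]

// Plain letters and digits do not match.
const plainMatches = /\p{Emoji_Presentation}/u.test('abc123');
console.log(plainMatches); // false
```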

Conclusion

This article briefly introduced locale-aware sorting of characters and matching emoji with regular expressions. I hope it helps you in your day-to-day work.

This article is available at www.flydean.com/04-unicode-…
