Introduction

Computers first emerged outside China. Because of the limited performance of machines at the time and the characters commonly used in those countries, early computers used ASCII. However, ASCII can represent only a limited set of characters. As computers spread around the world, an encoding that could represent the characters of every language was needed. That encoding is Unicode.

Before Unicode emerged, of course, many countries and regions had defined their own encoding standards to meet their own character needs. These standards were local, however, and did not work for the rest of the world, so none of them was adopted globally.

Today we will discuss sorting Unicode strings and matching Unicode characters with regular expressions.

Sorting of ASCII characters

ASCII stands for American Standard Code for Information Interchange. It defines only 128 characters. Its composition is not discussed in detail here; interested readers can refer to the article I wrote about Unicode.

ASCII includes the 26 letters of the English alphabet in both upper and lower case. Let's see how to sort ASCII strings in JavaScript:

const words = ['Boy', 'Apple', 'Bee', 'Cat', 'Dog'];
words.sort();
// [ 'Apple', 'Bee', 'Boy', 'Cat', 'Dog' ]

As you can see, the words are sorted in the dictionary order we want.

But if we replace these words with Chinese characters and sort again, we do not get the result we want:

const words = ['爱', '我', '中', '华'];
words.sort();
// [ '中', '华', '我', '爱' ]

Why is this?

In fact, by default sort() converts the elements to strings and compares them by their UTF-16 code unit values, not by any language-specific rule. For Chinese characters, this means the order follows the characters' code points rather than the order a Chinese dictionary would use.
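To make this concrete, here is a small sketch that prints the UTF-16 code unit of each character from the example above. The variable names are only illustrative.

```javascript
const chars = ['爱', '我', '中', '华'];

// charCodeAt(0) returns the first UTF-16 code unit. These characters are all
// in the Basic Multilingual Plane, so that value is also their code point.
const codeUnits = chars.map((ch) => ch.charCodeAt(0).toString(16));
console.log(codeUnits); // [ '7231', '6211', '4e2d', '534e' ]

// The default sort orders the strings by those numeric values, smallest first.
const byCodeUnit = [...chars].sort();
console.log(byCodeUnit); // [ '中', '华', '我', '爱' ]
```

The sorted order simply follows the numbers 4e2d, 534e, 6211, 7231, which has nothing to do with pronunciation.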

Sorting of local characters

Since code point order is meaningless for Chinese characters, what we really want is to convert each character to its pinyin and sort by the pinyin letters.

So for the characters '爱', '我', '中', and '华' above, we actually want to compare the pinyin "ai", "wo", "zhong", and "hua", which gives the order '爱', '华', '我', '中'.

Is there an easy way to do this comparison?

Some browsers provide two APIs for locale-aware comparison: Intl.Collator and String.prototype.localeCompare.
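A minimal sketch of how the two APIs might be called, assuming we pass the locale 'zh' explicitly. The output depends on the locale data available in the runtime, so none is shown here.

```javascript
const words = ['爱', '我', '中', '华'];

// Intl.Collator: build a collator for Chinese and hand its compare
// function to sort().
const collator = new Intl.Collator('zh');
const withCollator = [...words].sort(collator.compare);

// String.prototype.localeCompare: the locale is the second argument.
const withLocaleCompare = [...words].sort((a, b) => a.localeCompare(b, 'zh'));

console.log(withCollator);
console.log(withLocaleCompare);
```

When the locale argument is omitted, both APIs fall back to the runtime's default locale, so the same code may sort differently on different machines.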

For example, in Chrome 91.0:

Intl.Collator gives the expected result, but String.prototype.localeCompare does not.

Now look at Firefox 89.0:

The result is consistent with Chrome.

The following is the result in Node.js v12.13.1:

As you can see, Node.js does no locale-aware conversion or sorting here. Official Node.js builds before version 13 include only English ICU data by default, which is consistent with this result.

So both methods depend on the browser, that is, on the specific implementation, and we cannot fully rely on them.
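One way to cope with this is to check for locale support before relying on it. The sketch below uses Intl.Collator.supportedLocalesOf for that check; getCollatorCompare is a hypothetical helper written for this example, not a built-in function.

```javascript
// Hypothetical helper: returns a locale-aware compare function when the
// runtime has collation data for the locale, otherwise null.
function getCollatorCompare(locale) {
  if (typeof Intl !== 'object' || typeof Intl.Collator !== 'function') {
    return null;
  }
  const supported = Intl.Collator.supportedLocalesOf([locale]);
  return supported.length > 0 ? new Intl.Collator(locale).compare : null;
}

const compareZh = getCollatorCompare('zh');
const input = ['爱', '我', '中', '华'];

// Fall back to the default code unit order when no locale data is available.
const sortedInput = compareZh ? [...input].sort(compareZh) : [...input].sort();
console.log(compareZh ? 'locale-aware order:' : 'code unit order (fallback):', sortedInput);
```

The fallback still returns a deterministic order, just not the pinyin order, so the caller can decide whether that is acceptable.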

So sorting strings is far less straightforward than it looks!

Why not use Unicode for sorting

So why not simply sort by Unicode code point?

First, ordinary users know nothing about Unicode. All they want is for strings to be sorted in the dictionary order of their own language.

Second, even locale-aware sorting is hard, because browsers have to support localized sorting rules for many different languages, which is a huge amount of work.
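As an illustration of why a single order cannot serve every language, the sketch below sorts the same letters with German and Swedish collators. It assumes the runtime ships locale data for 'de' and 'sv'. With that data, German places 'ä' next to 'a', while Swedish places it after 'z'.

```javascript
const letters = ['z', 'ä', 'a'];

// The same input sorted under two different locales.
const german = [...letters].sort(new Intl.Collator('de').compare);
const swedish = [...letters].sort(new Intl.Collator('sv').compare);

// Default sort: plain UTF-16 code unit order, independent of any language.
const byCodeUnitOrder = [...letters].sort();

console.log('de:', german);
console.log('sv:', swedish);
console.log('code units:', byCodeUnitOrder);
```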

Matching emoji with regular expressions

To close the article, let's talk about matching emoji with regular expressions.

Emoji are a family of pictographic characters that can be represented in Unicode. There are a great many of them, roughly 3,521, so matching them by listing every one requires a regular expression like the following:

(?:\ud83e\uddd1\ud83c\udffb\u200d\u2764\ufe0f\u200d\ud83d\udc8b\u200d\ud83e\uddd1\ud83c\udffc|\ud83e\uddd1\ud83c\udffb\u200d\u2764\ufe0f\u200d\ud83d
[... much more omitted]

With so many emoji, is there an easier way to match them? The answer is yes.

Unicode property escapes were added to ECMAScript through a TC39 proposal and standardized in ES2018, so in a regular expression that has the u flag we can use \p{Emoji_Presentation} to match emoji:

\p{Emoji_Presentation}

Simple, isn't it?
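A short sketch of the property escape in use. The sample strings and variable names are made up for illustration.

```javascript
// The u flag is required for \p{...}; the g flag makes match() return
// every occurrence instead of only the first.
const emojiPattern = /\p{Emoji_Presentation}/gu;

const text = 'Launch 🚀 at noon 😀';
const found = text.match(emojiPattern);
console.log(found); // [ '🚀', '😀' ]

// A string without emoji yields null from match().
console.log('no emoji here'.match(emojiPattern)); // null
```

Note that Emoji_Presentation is a property of individual code points. A multi-code-point sequence, such as the ones joined with \u200d in the long pattern above, is matched piece by piece rather than as one unit, and characters that default to text presentation, such as \u2764, are not matched at all.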

Summary

This article briefly introduced the sorting rules for local characters and the matching of emoji with regular expressions. I hope it helps you in your actual work.

This article has been included in http://www.flydean.com/04-unicode-sorting/

flydean