The story begins with a confession: I was afraid of Unicode for a long time. Whenever a programming task required Unicode knowledge, I looked for a hackable solution without understanding in detail what I was doing.
My avoidance continued until I faced a problem that required detailed Unicode knowledge, and there was no way around it.
After some effort and a bunch of articles, I found that, surprisingly, it is not difficult to understand. Well... some articles needed to be read at least 3 times.
As it turns out, Unicode is a universal and elegant standard. It can be hard to learn, though, because it relies on a lot of abstract terms that are difficult to keep straight.
If you have a gap in understanding Unicode, now is the time to face it! It's not that difficult. Make yourself a cup of delicious tea or coffee ☕, and let's dive into this abstract and wonderful world.
This article explains the basic concepts of Unicode, which creates the necessary foundation.
It then clarifies how JavaScript works with Unicode and which pitfalls you may encounter.
You will also learn how to apply the new ECMAScript 2015 features to solve some of the difficulties.
Are you ready? Let's get started!
1. The idea behind Unicode
Let's start with a basic question. How are you able to read and understand this article? Simply put: because you know what the letters mean, and what words mean as groups of letters.
Why can you understand the meaning of the letters? Simply put: because you (the reader) and I (the author) have an agreement on the association between a graphical symbol (what you see on the screen) and an English letter (its meaning).
The same is true for computers. The difference is that computers don't understand the meaning of letters: to them, letters are just bytes.
Imagine a scenario where User 1 sends the message "hello" to User 2 over the network.
User 1's computer does not know the meaning of the letters. Therefore, it converts "hello" into the sequence of numbers 0x68 0x65 0x6C 0x6C 0x6F, where each letter uniquely corresponds to a number: h is 0x68, e is 0x65, and so on. These numbers are sent to User 2's computer.
When User 2's computer receives the number sequence 0x68 0x65 0x6C 0x6C 0x6F, it uses the same letter-to-number correspondence to restore the message. Then it displays the correct message: "hello".
The agreement between the two computers on the correspondence between letters and numbers is what Unicode standardizes.
In Unicode terms, h is an abstract character named LATIN SMALL LETTER H. This character has the corresponding number 0x68, which is its code point, written as U+0068.
The role of Unicode is to provide a list of abstract characters (a character set) and to assign each character a unique identifier, its code point (a coded character set).
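As a rough illustration of the exchange described above, the following sketch converts a message into numbers and back. The helper names toNumbers and fromNumbers are made up for this example, and the sketch assumes every character fits into a single 16-bit value, which holds for "hello".

```javascript
// Hypothetical helpers for illustration only.
// Assumption: each character of the message is one 16-bit value.
function toNumbers(message) {
  return Array.from(message, (ch) => ch.charCodeAt(0));
}

function fromNumbers(numbers) {
  return String.fromCharCode(...numbers);
}

const sent = toNumbers('hello');
console.log(sent.map((n) => '0x' + n.toString(16).toUpperCase()).join(' '));
// => '0x68 0x65 0x6C 0x6C 0x6F'
console.log(fromNumbers(sent)); // => 'hello'
```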
2. The basic terminology of Unicode
www.unicode.org states: "Unicode provides a unique number for each character," regardless of platform, program, and language.
Unicode is a universal character set used to define lists of characters in most writing systems, and to associate each character with a unique number (code point).
Unicode includes characters from most languages today, punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, emoticons, etc.
The first version, Unicode 1.0, was released in October 1991 and had 7,161 characters. The latest version, 9.0 (released in June 2016), provides codes for 128,172 characters.
The universality and inclusiveness of Unicode solved a major problem that existed before, when vendors implemented many character sets and encodings that were difficult to handle.
Creating an application that supports all character sets and encodings is very complicated.
If you think Unicode is difficult, then programming without Unicode will be even more difficult.
I still remember opening files only to find garbled characters in their contents. Picking the right character set and encoding felt like buying a lottery ticket!
2.1 Characters and code points
"Abstract characters (or characters) are information units used to organize, control, or represent text data."
Unicode treats characters as abstract terms. Each abstract character has an associated name, such as LATIN SMALL LETTER A. The rendered form (glyph) of this character is a.
"A code point is a number assigned to a single character."
Code points range from U+0000 to U+10FFFF.
U+<hex> is the format of a code point, where U+ is a prefix meaning Unicode and <hex> is a hexadecimal number. For example, U+0041 and U+2603.
Remember that a code point is a simple number, and that is how you should think about it: a code point is like the index of an element in an array.
The magic happens because Unicode associates a code point with a character. For example, U+0041 corresponds to the character named LATIN CAPITAL LETTER A (rendered as A), and U+2603 corresponds to the character named SNOWMAN (rendered as ☃).
Not all code points have corresponding characters. 1114112 code points are available (range from U+0000 to U+10FFFF), but only 137929 code points have characters assigned (as of May 2019).
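To make the link between characters and code points concrete, here is a minimal sketch that looks up a code point and formats it in the U+<hex> notation. The function name toCodePointLabel is an assumption made for this example. It relies on codePointAt() and String.fromCodePoint(), both added in ES2015, and on padStart(), which is a later addition.

```javascript
// Hypothetical helper: formats the first code point of a string as U+<hex>.
function toCodePointLabel(symbol) {
  const hex = symbol.codePointAt(0).toString(16).toUpperCase();
  return 'U+' + hex.padStart(4, '0');
}

console.log(toCodePointLabel('A'));        // => 'U+0041'
console.log(toCodePointLabel('\u2603'));   // => 'U+2603' (the snowman)
console.log(String.fromCodePoint(0x0041)); // => 'A'
```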
2.2 Unicode planes
"A plane is a continuous range of 65,536 Unicode code points from U+n0000 to U+nFFFF, where n ranges from 0x0 to 0x10."
Planes divide the Unicode code points into 17 equal groups:
- Plane 0 contains code points U+0000 to U+FFFF
- Plane 1 contains code points U+10000 to U+1FFFF
- ...
- Plane 16 contains code points U+100000 to U+10FFFF
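Since every plane holds exactly 0x10000 code points, the plane number can be read directly from the upper bits of a code point. A small sketch of that arithmetic follows. The function name planeOf is made up for this example.

```javascript
// Hypothetical helper: returns the plane number (0 to 16) of a code point.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a Unicode code point: ' + codePoint);
  }
  // Dropping the lower 16 bits is the same as Math.floor(codePoint / 0x10000).
  return codePoint >> 16;
}

console.log(planeOf(0x0041));   // => 0
console.log(planeOf(0x1F600));  // => 1
console.log(planeOf(0x10FFFF)); // => 16
```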
Basic Multilingual Plane
Plane 0 is a special plane called the Basic Multilingual Plane, or BMP for short. It contains characters from most modern languages (Basic Latin, Cyrillic, Greek, etc.) and a large number of symbols.
As mentioned above, the code points of the Basic Multilingual Plane range from U+0000 to U+FFFF and have up to 4 hexadecimal digits.
Developers usually deal with characters from the BMP, since it contains most of the characters they need.
Some characters from the BMP:
- e is U+0065, named LATIN SMALL LETTER E
- | is U+007C, named VERTICAL LINE
- ■ is U+25A0, named BLACK SQUARE
- ☂ is U+2602, named UMBRELLA
Astral planes
The 16 planes beyond the BMP (plane 1, plane 2, ..., plane 16) are called astral planes or supplementary planes.
Code points in the astral planes are called astral code points. They range from U+10000 to U+10FFFF.
An astral code point has 5 or 6 hexadecimal digits: U+ddddd or U+dddddd.
Some characters from the astral planes:
- 𝄞 is U+1D11E, named MUSICAL SYMBOL G CLEF
- 𝐁 is U+1D401, named MATHEMATICAL BOLD CAPITAL B
- 🀵 is U+1F035, named DOMINO TILE HORIZONTAL-00-04
- 😀 is U+1F600, named GRINNING FACE
2.3 Code Unit
Computers do not store code points or abstract characters in memory. They need a physical way to represent Unicode code points: code units.
"A code unit is a sequence of bits used to encode each character in a given encoding form."
A character encoding converts abstract code points into physical bits: code units.
In other words, a character encoding translates each Unicode code point into a unique sequence of code units.
Popular encodings are UTF-8, UTF-16 and UTF-32.
Most JavaScript engines use UTF-16 encoding. This affects the way JavaScript works with Unicode. From now on, let us focus on UTF-16.
UTF-16 (long name: 16-bit Unicode Transformation Format) is a variable-length encoding:
- Code points from the BMP are encoded using a single 16-bit code unit
- Code points from the astral planes are encoded using two 16-bit code units
Let's look at a few examples.
Suppose you want to save the character LATIN SMALL LETTER A (a) to the hard drive. Unicode tells you that LATIN SMALL LETTER A maps to the code point U+0061.
Now let's ask the UTF-16 encoding how U+0061 should be converted. The encoding specification says that for a BMP code point, you take its hexadecimal number U+0061 and store it in one 16-bit code unit: 0x0061.
As you can see, code points from the BMP fit into a single 16-bit code unit.
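The 16-bit code units of a JavaScript string can be inspected with charCodeAt(). The sketch below lists them for a few BMP characters. The helper name codeUnitsOf is hypothetical.

```javascript
// Hypothetical helper: lists the 16-bit code units of a string as hex values.
function codeUnitsOf(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    const hex = str.charCodeAt(i).toString(16).toUpperCase();
    units.push('0x' + hex.padStart(4, '0'));
  }
  return units;
}

console.log(codeUnitsOf('a'));   // => ['0x0061']
console.log(codeUnitsOf('abc')); // => ['0x0061', '0x0062', '0x0063']
```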
2.4 Surrogate pairs
Now let's study a more complex case. Suppose you want to save an astral code point (from an astral plane): the smiley face 😀. This character maps to the code point U+1F600.
Since an astral code point needs up to 21 bits to store, UTF-16 says that you need two 16-bit code units. The code point U+1F600 is split into a so-called surrogate pair: 0xD83D (high-surrogate code unit) and 0xDE00 (low-surrogate code unit).
"A surrogate pair is a representation of a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit and the second value is a low-surrogate code unit."
An astral code point requires two code units: a surrogate pair. As you saw in the previous example, to encode U+1F600 (😀) in UTF-16, the surrogate pair 0xD83D 0xDE00 is used.
console.log('\uD83D\uDE00'); // => '😀'
A high-surrogate code unit takes values from the range 0xD800 to 0xDBFF. A low-surrogate code unit takes values from the range 0xDC00 to 0xDFFF.
The algorithms for converting an astral code point to a surrogate pair, and vice versa, are as follows:
// Astral code point -> [high surrogate, low surrogate]
function getSurrogatePair(astralCodePoint) {
  let highSurrogate =
    Math.floor((astralCodePoint - 0x10000) / 0x400) + 0xD800;
  let lowSurrogate = (astralCodePoint - 0x10000) % 0x400 + 0xDC00;
  return [highSurrogate, lowSurrogate];
}
getSurrogatePair(0x1F600); // => [0xD83D, 0xDE00]
// High surrogate + low surrogate -> astral code point
function getAstralCodePoint(highSurrogate, lowSurrogate) {
  return (highSurrogate - 0xD800) * 0x400
    + lowSurrogate - 0xDC00 + 0x10000;
}
getAstralCodePoint(0xD83D, 0xDE00); // => 0x1F600
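The two functions above can be cross-checked against what the engine itself produces. This is a small sketch, assuming an ES2015 environment where String.fromCodePoint() and codePointAt() are available. It repeats the same arithmetic inline so that it runs on its own.

```javascript
const codePoint = 0x1F600;

// Let the engine build the surrogate pair for us.
const fromEngine = String.fromCodePoint(codePoint);

// Same formulas as above, applied by hand.
const high = Math.floor((codePoint - 0x10000) / 0x400) + 0xD800;
const low = (codePoint - 0x10000) % 0x400 + 0xDC00;

console.log(fromEngine.length);                 // => 2
console.log(fromEngine.charCodeAt(0) === high); // => true
console.log(fromEngine.charCodeAt(1) === low);  // => true

// And back: two code units -> one code point.
console.log(String.fromCharCode(high, low).codePointAt(0) === codePoint); // => true
```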
Dealing with surrogate pairs is not comfortable. When handling strings in JavaScript, you have to treat them as special cases, as described later in this article.
However, UTF-16 is memory efficient: 99% of the characters processed come from the BMP, and these characters need only one code unit.
2.5 Combining marks
A grapheme, or symbol, is the minimally distinctive unit of writing in the context of a particular writing system.
A grapheme is how a user thinks about a character. A concrete image of a grapheme displayed on the screen is called a glyph.
In most cases, a single Unicode character represents a single grapheme. For example, U+0066 LATIN SMALL LETTER F represents the English letter f.
In some cases, a grapheme is made up of a sequence of characters.
For example, å is an atomic grapheme in the Danish writing system. It is displayed using U+0061 LATIN SMALL LETTER A (rendered as a) combined with the special character U+030A COMBINING RING ABOVE (rendered as ◌̊).
U+030A modifies the preceding character and is called a combining mark.
console.log('\u0061\u030A'); // => 'å'
console.log('\u0061'); // => 'a'
"A combining mark is a character that applies to the preceding base character to create a grapheme."
Combining marks include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.
Combining marks are normally not used in isolation, that is, without a base character. You should avoid displaying them on their own.
Like surrogate pairs, combining marks are tricky to handle in JavaScript.
A combining character sequence (base character + combining mark) is perceived by the user as a single symbol (for example, '\u0061\u030A' is 'å'). But the developer must keep in mind that two code points, U+0061 and U+030A, are used to construct å.
3. Unicode in JavaScript
The ES2015 specification mentions that source code text is expressed using Unicode (version 5.1 and later). The source text is a sequence of code points from U+0000 to U+10FFFF. How the source code is stored or exchanged is outside the scope of the ECMAScript specification, but it is usually encoded in UTF-8 (the preferred encoding for the web).
I recommend keeping the source code text within the characters of the Basic Latin Unicode block (or ASCII). Characters outside ASCII should be escaped. This will ensure fewer encoding problems.
Internally, at the language level, ECMAScript 2015 provides a clear definition of what a string in JavaScript is:
The String type is the set of all ordered sequences of zero or more 16-bit unsigned integer values ("elements") up to a maximum length of 2^53 - 1 elements. The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a UTF-16 code unit value.
Each element of a string is interpreted by the engine as a code unit. The way a string is rendered does not tell you which code units (representing code points) it contains. See the following example:
console.log('cafe\u0301'); // => 'café'
console.log('café'); // => 'café'
The literals 'cafe\u0301' and 'café' have slightly different code units, but both are rendered as the same sequence of symbols: café.
The length of a String is the number of elements (i.e., 16-bit values) within it. [...] Where ECMAScript operations interpret String values, each element is interpreted as a single UTF-16 code unit.
As you know from the sections on surrogate pairs and combining marks above, some symbols need 2 or more code units to be represented. So take care when counting the number of characters or accessing characters by index:
const smile = '\uD83D\uDE00';
console.log(smile); // => '😀'
console.log(smile.length); // => 2
const letter = 'e\u0301';
console.log(letter); // => 'é'
console.log(letter.length); // => 2
The smile string contains 2 code units: \uD83D (high surrogate) and \uDE00 (low surrogate). Since a string is a sequence of code units, smile.length evaluates to 2, even though the rendered smile has only one symbol, '😀'.
The same goes for the letter string. The combining mark U+0301 is applied to the preceding character e, and the rendered result is a single symbol, 'é'. But letter contains 2 code units, so letter.length is 2.
My advice: always think of a string in JavaScript as a sequence of code units. The way a string is rendered does not say clearly which code units it contains.
Astral symbols and combining character sequences require 2 or more code units to be encoded, yet they are perceived as a single grapheme.
If a string contains surrogate pairs or combining marks, developers who do not keep this in mind get confused when computing the string length or accessing characters by index.
Most JavaScript string methods are not Unicode-aware. If your string contains compound Unicode characters, take precautions when using myString.slice(), myString.substring(), and similar methods.
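To see why precautions are needed, consider what slice() does to a string with an astral symbol. The following is only a sketch. The variable names are made up, and the "safe" variant relies on the ES2015 string iterator, which walks a string symbol by symbol rather than code unit by code unit.

```javascript
const text = 'a\u{1F600}b'; // 'a😀b': 3 symbols, but 4 code units

// slice() counts code units, so it cuts the surrogate pair in half.
const broken = text.slice(0, 2);
console.log(broken.length);                     // => 2
console.log(broken.charCodeAt(1).toString(16)); // => 'd83d' (a lone high surrogate)

// Splitting into symbols first keeps the pair together.
const safe = [...text].slice(0, 2).join('');
console.log(safe === 'a\u{1F600}');             // => true
```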
3.1 Escape sequences
Escape sequences in JavaScript strings are used to express code units based on code point numbers. JavaScript has 3 escape types, one of which was introduced in ECMAScript 2015.
Let's look at them in more detail.
Hexadecimal escape sequence
The shortest form is the hexadecimal escape sequence: \x<hex>, where \x is a prefix followed by a hexadecimal number <hex> with a fixed length of 2 digits.
For example, '\x30' (symbol '0') or '\x5B' (symbol '[').
A hexadecimal escape sequence in a string literal or a regular expression looks like this:
const str = '\x4A\x61vaScript';
console.log(str); // => 'JavaScript'
const reg = /\x4A\x61va.*/;
console.log(reg.test('JavaScript')); // => true
A hexadecimal escape sequence can escape code points in a limited range, from U+00 to U+FF, because only 2 digits are allowed. But the hexadecimal escape is nice because it is short.
Unicode escape sequence
If you want to escape code points from the entire BMP, use a Unicode escape sequence. The escape format is \u<hex>, where \u is a prefix followed by a hexadecimal number <hex> with a fixed length of 4 digits. For example, '\u0051' (symbol 'Q') or '\u222B' (integral symbol '∫').
Let's use Unicode escape sequences:
const str = 'I\u0020learn \u0055nicode';
console.log(str); // => 'I learn Unicode'
const reg = /\u0055ni.*/;
console.log(reg.test('Unicode')); // => true
A Unicode escape sequence can escape code points in a limited range, from U+0000 to U+FFFF (all BMP code points), because only 4 digits are allowed. In most cases, this is enough to represent commonly used symbols.
To represent a symbol from an astral plane in a JavaScript literal, use two joined Unicode escape sequences (a high surrogate and a low surrogate), which together create a surrogate pair:
const str = 'My face \uD83D\uDE00';
console.log(str); // => 'My face 😀'
Code point escape sequence
ECMAScript 2015 provides escape sequences that represent code points from the entire Unicode space: U+0000 to U+10FFFF, that is, the BMP and the astral planes.
The new format is called the code point escape sequence: \u{<hex>}, where <hex> is a hexadecimal number with a variable length of 1 to 6 digits.
For example, '\u{7A}' (symbol 'z') or '\u{1F639}' (cat symbol 😹).
const str = 'Funny cat \u{1F639}';
console.log(str); // => 'Funny cat 😹'
const reg = /\u{1F639}/u;
console.log(reg.test('Funny cat 😹')); // => true
Note that the regular expression /\u{1F639}/u has the special flag u, which enables additional Unicode features (for details, see 3.5 Regular expression matching).
I prefer code point escape sequences over surrogate pairs for representing astral symbols.
Let's escape the SMILING FACE WITH HALO symbol 😇, whose code point is U+1F607.
const niceEmoticon = '\u{1F607}';
console.log(niceEmoticon); // => '😇'
const spNiceEmoticon = '\uD83D\uDE07';
console.log(spNiceEmoticon); // => '😇'
console.log(niceEmoticon === spNiceEmoticon); // => true
The string literal assigned to the variable niceEmoticon has the code point escape '\u{1F607}', which represents the astral code point U+1F607. Under the hood, this escape creates a surrogate pair (2 code units). As you can see, spNiceEmoticon, which is created from a surrogate pair written with the Unicode escapes '\uD83D\uDE07', is equal to niceEmoticon.
When a regular expression is created with the RegExp constructor, you have to replace each \ with \\ in the string literal so that the escape reaches the regular expression intact. The following regular expression objects are equivalent:
const reg1 = /\x4A \u0020 \u{1F639}/u; // the u flag is needed for \u{...} to mean a code point
const reg2 = new RegExp('\\x4A \\u0020 \\u{1F639}', 'u');
console.log(reg1.source === reg2.source); // => true
3.2 String comparison
Strings in JavaScript are sequences of code units, so it is reasonable to expect that string comparison works by checking whether the code units of the two strings match.
This approach is fast and effective. It works well for "simple" strings:
const firstStr = 'hello';
const secondStr = '\u0068ell\u006F';
console.log(firstStr === secondStr); // => true
firstStr and secondStr have the same sequence of code units, so they are equal.
Now suppose you want to compare two strings that look the same when rendered but contain different sequences of code units. Then you may get an unexpected result, because strings that look identical are not equal in the comparison.
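A minimal sketch of such a comparison, assuming the same two strings that the normalization example later in this section uses (the precomposed ç is written as \u00E7 here to make the difference visible in the source):

```javascript
const str1 = '\u00E7a va bien';  // 'ç' as a single code point, U+00E7
const str2 = 'c\u0327a va bien'; // 'c' followed by the combining mark U+0327

console.log(str1);          // rendered as 'ça va bien'
console.log(str2);          // rendered as 'ça va bien' as well
console.log(str1 === str2); // => false
console.log(str1.length, str2.length); // => 10 11
```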
When rendered, str1 and str2 look the same, but they have different code units.
This happens because the grapheme ç can be constructed in two ways:
- Using U+00E7 LATIN SMALL LETTER C WITH CEDILLA
- Or using a combining character sequence: U+0063 LATIN SMALL LETTER C plus the combining mark U+0327 COMBINING CEDILLA
How do you handle this situation and compare such strings correctly? The answer is string normalization.
Normalization
Normalization is the conversion of a string to a canonical representation, which ensures that canonically equivalent (and/or compatibility-equivalent) strings have a unique representation.
In other words, when a string has a complex structure with combining character sequences or other compound constructs, you can normalize it to a canonical form. Normalized strings are painless to compare and to use in string operations such as text search.
Unicode Standard Annex #15 has interesting details about the normalization process.
In JavaScript, to normalize a string, call the myString.normalize([normForm]) method, available since ES2015. normForm is an optional parameter (defaults to 'NFC') and can take one of the following normalization forms:
- 'NFC': Normalization Form Canonical Composition
- 'NFD': Normalization Form Canonical Decomposition
- 'NFKC': Normalization Form Compatibility Composition
- 'NFKD': Normalization Form Compatibility Decomposition
Let's improve the previous example by applying string normalization, which will allow the strings to be compared correctly:
const str1 = 'ça va bien';
const str2 = 'c\u0327a va bien';
console.log(str1 === str2.normalize()); // => true
console.log(str1 === str2); // => false
'ç' and 'c\u0327' are canonically equivalent.
When str2.normalize() is called, it returns the canonical version of str2 ('c\u0327' is replaced with 'ç'). So the comparison str1 === str2.normalize() returns true, as expected.
str1 is not affected by normalization, because it is already in canonical form.
It seems reasonable to normalize both of the compared strings, so that both operands are in canonical form.
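Normalizing both operands can be wrapped in a small helper. This is a sketch under the assumption that the default form, NFC, is what you want. The function name equalNormalized is invented for this example.

```javascript
// Hypothetical helper: compares two strings after normalizing both to NFC.
function equalNormalized(first, second) {
  return first.normalize() === second.normalize();
}

console.log(equalNormalized('\u00E7a va bien', 'c\u0327a va bien')); // => true
console.log(equalNormalized('c\u0327a va bien', 'c\u0327a va bien')); // => true
console.log(equalNormalized('ca va bien', 'c\u0327a va bien'));       // => false
```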
3.3 String length
The common way to determine the length of a string is, of course, to access the myString.length property. This property indicates the number of code units the string has.
For strings that contain only BMP code points, computing the length usually works as expected:
const color = 'Green';
console.log(color.length); // => 5
Each code unit in color corresponds to a single grapheme. The expected length of the string is 5.
Length and surrogate pair
The situation becomes tricky when a string contains surrogate pairs, which represent astral code points. Since each surrogate pair contains 2 code units (a high surrogate and a low surrogate), the length property is greater than expected.
Take a look at an example:
const str = 'cat\u{1F639}';
console.log(str); // => 'cat😹'
console.log(str.length); // => 5
When the str string is rendered, it contains 4 symbols: cat😹. However, str.length evaluates to 5, because U+1F639 is an astral code point encoded with 2 code units (a surrogate pair).
Unfortunately, a native and performant way to fix this problem is not available at the moment.
At least ECMAScript 2015 introduced algorithms that are aware of astral symbols. An astral symbol is counted as a single character, even though it is encoded with 2 code units.
The string iterator String.prototype[@@iterator]() is Unicode-aware. You can combine a string with the spread operator [...str] or the Array.from(str) function (both consume the string iterator), and then count the number of symbols in the returned array.
Note that this solution may cause slight performance issues when used extensively.
Let's improve the above example with the spread operator:
const str = 'cat\u{1F639}';
console.log(str); // => 'cat😹'
console.log([...str]); // => ['c', 'a', 't', '😹']
console.log([...str].length); // => 4
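If the intermediate array is a concern, the same string iterator can be consumed with a for...of loop instead. The following sketch counts symbols that way. The function name countCodePoints is an assumption, and whether it is measurably faster in a given engine is not something this example claims.

```javascript
// Hypothetical helper: counts code points by walking the string iterator,
// without building an array of symbols first.
function countCodePoints(str) {
  let count = 0;
  for (const _symbol of str) {
    count++;
  }
  return count;
}

console.log(countCodePoints('cat\u{1F639}')); // => 4
console.log('cat\u{1F639}'.length);           // => 5
```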
Length and combination marks
What about combining character sequences? Because each combining mark is a separate code unit, you run into the same difficulties.
This problem can be solved by normalizing the string. If you are lucky, the combining character sequence is normalized to a single character. Let's try:
const drink = 'cafe\u0301';
console.log(drink); // => 'café'
console.log(drink.length); // => 5
console.log(drink.normalize()) // => 'café'
console.log(drink.normalize().length); // => 4
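The drink string contains 5 code units (so drink.length is 5), even though it is rendered as 4 symbols.
After normalization, the combining character sequence 'e\u0301' is replaced by the single character 'é', so drink.normalize().length returns the expected 4.
Unfortunately, normalization is not a universal solution. Long combining character sequences do not always have a canonical equivalent in a single symbol. Let's look at such a case: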
const drink = 'cafe\u0327\u0301';
console.log(drink); // => 'cafȩ́'
console.log(drink.length); // => 6
console.log(drink.normalize()); // => 'cafȩ́'
console.log(drink.normalize().length); // => 5
drink has 6 code units, so drink.length evaluates to 6. However, drink has 4 symbols.
The normalization drink.normalize() converts the combining sequence 'e\u0327\u0301' to the canonical form of two characters, 'ȩ\u0301' (removing only one combining mark). Unfortunately, drink.normalize().length evaluates to 5, which still does not indicate the correct number of symbols.
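For simple cases like this one, a rough workaround is to skip combining marks while counting. The sketch below is deliberately simplified: it only recognizes marks from the Combining Diacritical Marks block (U+0300 to U+036F), so it is an approximation and not a general grapheme counter. Both function names are made up for this example.

```javascript
// Assumption: only marks in the block U+0300..U+036F are treated as combining.
function isCombiningDiacritic(symbol) {
  const codePoint = symbol.codePointAt(0);
  return codePoint >= 0x0300 && codePoint <= 0x036F;
}

// Hypothetical helper: counts code points that are not combining diacritics.
function countVisibleSymbols(str) {
  let count = 0;
  for (const symbol of str) {
    if (!isCombiningDiacritic(symbol)) {
      count++;
    }
  }
  return count;
}

console.log(countVisibleSymbols('cafe\u0327\u0301')); // => 4
console.log(countVisibleSymbols('cat\u{1F639}'));     // => 4
```

Combining marks from other blocks and multi-code-point emoji sequences are not handled by this sketch.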
3.4 Character positioning
Since a string is a sequence of code units, accessing the characters of a string by index is also tricky.
When a string contains only BMP characters (excluding the high surrogates U+D800 to U+DBFF and the low surrogates U+DC00 to U+DFFF), character positioning works without problems.
const str = 'hello';
console.log(str[0]); // => 'h'
console.log(str[4]); // => 'o'
Each symbol is encoded with a single code unit, so accessing string characters by index works correctly.
Character positioning and surrogate pairs
The situation changes when the string contains astral symbols.
An astral symbol is encoded with 2 code units (a surrogate pair). So accessing string characters by index may return a lone high surrogate or low surrogate, which are invalid symbols.
The following example accesses the code units of an astral symbol by index:
const omega = '\u{1D6C0} is omega';
console.log(omega); // => '𝛀 is omega'
console.log(omega[0]); // => '' (unprintable symbol)
console.log(omega[1]); // => '' (unprintable symbol)
Because U+1D6C0 MATHEMATICAL BOLD CAPITAL OMEGA is an astral character, it is encoded with a surrogate pair of 2 code units. omega[0] accesses the high-surrogate code unit and omega[1] accesses the low-surrogate one, which breaks the surrogate pair apart.
There are 2 ways to access astral symbols in a string correctly:
- Use the Unicode-aware string iterator and generate an array of symbols: [...str][index]
- Use number = myString.codePointAt(index) to get the code point number, then convert the number to a symbol with String.fromCodePoint(number) (the recommended option)
Let's apply both options:
const omega = '\u{1D6C0} is omega';
console.log(omega); // => '𝛀 is omega'
// Option 1
console.log([...omega][0]); // => '𝛀'
// Option 2
const number = omega.codePointAt(0);
console.log(number.toString(16)); // => '1d6c0'
console.log(String.fromCodePoint(number)); // => '𝛀'
[...omega] returns an array of the symbols that the omega string contains. Surrogate pairs are evaluated correctly, so accessing the first character works as expected: [...omega][0] is '𝛀'.
The omega.codePointAt(0) method call is Unicode-aware, so it returns the astral code point number 0x1D6C0 of the first character in the omega string. The function String.fromCodePoint(number) returns the symbol based on the code point number: '𝛀'. Keep in mind that the index passed to codePointAt() is still a code unit position, not a symbol position.
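When you need the symbol at an arbitrary position, and not just the first one, the string iterator can be walked until the wanted position is reached. This is a sketch of that idea. The function name symbolAt is hypothetical, and positions are counted in code points, so combining marks still count as separate symbols.

```javascript
// Hypothetical helper: returns the symbol at a position counted in code points,
// or undefined when the position is out of range.
function symbolAt(str, symbolIndex) {
  let current = 0;
  for (const symbol of str) {
    if (current === symbolIndex) {
      return symbol;
    }
    current++;
  }
  return undefined;
}

const sample = '\u{1D6C0} is omega';
console.log(symbolAt(sample, 0) === '\u{1D6C0}'); // => true
console.log(symbolAt(sample, 2));                 // => 'i'
console.log(sample[2]);                           // => ' ' (index 2 in code units)
```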
Character positioning and combining marks
Character positioning in strings with combining marks has the same problem as the string length described above.
Accessing characters by index in a string means accessing code units. However, a combining character sequence should be accessed as a whole, not split into separate code units.
The following example demonstrates this problem:
const drink = 'cafe\u0301';
console.log(drink); // => 'café'
console.log(drink.length); // => 5
console.log(drink[3]); // => 'e'
console.log(drink[4]); // => ◌́
drink[3] accesses only the base character e, without the combining mark U+0301 COMBINING ACUTE ACCENT (rendered as ◌́). drink[4] accesses the isolated combining mark ◌́.
In such cases, apply string normalization. The combining character sequence U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE ACCENT has a canonical equivalent: U+00E9 LATIN SMALL LETTER E WITH ACUTE (é). Let's improve the previous code sample:
const drink = 'cafe\u0301';
console.log(drink.normalize()); // => 'café'
console.log(drink.normalize().length); // => 4
console.log(drink.normalize()[3]); // => 'é'
Note that not all combining character sequences have a canonical equivalent in a single symbol, so the normalization approach is not universal.
Fortunately, it should work in most cases for European and North American languages.
3.5 Regular expression matching
Regular expressions, like strings, work in terms of code units. As in the scenarios described previously, this creates difficulties when processing surrogate pairs and combining character sequences with regular expressions.
BMP characters match as expected, because a single code unit represents a symbol:
const greetings = 'Hi!';
const regex = /.{3}/;
console.log(regex.test(greetings)); // => true
greetings has 3 symbols encoded with 3 code units. The regular expression /.{3}/, which expects 3 code units, is a match.
When matching astral symbols (which are encoded with a surrogate pair of 2 code units), you may run into difficulties:
const smile = '😀';
const regex = /^.$/;
console.log(regex.test(smile)); // => false
The smile string contains the astral symbol U+1F600 GRINNING FACE. U+1F600 is encoded with the surrogate pair 0xD83D 0xDE00.
However, the regular expression /^.$/ expects exactly one code unit, so the match fails.
It gets even worse when defining character classes with astral symbols. JavaScript throws an error:
const regex = /[😀-😎]/;
// => SyntaxError: Invalid regular expression: /[😀-😎]/:
// Range out of order in character class
Astral code points are encoded as surrogate pairs, so JavaScript represents the regular expression using the code units /[\uD83D\uDE00-\uD83D\uDE0E]/. Each code unit is treated as a separate element of the pattern, so the regular expression ignores the concept of a surrogate pair.
The \uDE00-\uD83D part of the character class is invalid, because \uDE00 is greater than \uD83D. As a result, an error is thrown.
Regular expression u flag
Fortunately, ECMAScript 2015 introduced the useful u flag, which makes regular expressions Unicode-aware. With this flag, astral symbols are processed correctly.
You can use code point escape sequences in regular expressions: /\u{1F600}/u. This escape is shorter than spelling out the high-surrogate and low-surrogate pair /\uD83D\uDE00/.
Let's apply the u flag and see how the . operator (and the quantifiers ?, +, * and {3}, {3,}, {2,3}) matches an astral symbol:
const smile = '😀';
const regex = /^.$/u;
console.log(regex.test(smile)); // => true
Thanks to the u flag, the regular expression /^.$/u is Unicode-aware and now matches the astral symbol 😀.
The u flag also enables correct processing of astral symbols in character classes:
const smile = '😀';
const regex = /[😀-😎]/u;
const regexEscape = /[\u{1F600}-\u{1F60E}]/u;
const regexSpEscape = /[\uD83D\uDE00-\uD83D\uDE0E]/u;
console.log(regex.test(smile)); // => true
console.log(regexEscape.test(smile)); // => true
console.log(regexSpEscape.test(smile)); // => true
[😀-😎] is now evaluated as a range of astral symbols, and /[😀-😎]/u matches '😀'.
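The effect of the u flag on the . operator and on quantifiers can also be seen by counting matches. Here is a short sketch with a made-up message string:

```javascript
const message = 'Hi\u{1F600}!'; // 'Hi😀!': 4 symbols, 5 code units

// Without the u flag, each half of the surrogate pair matches on its own.
console.log(message.match(/./g).length);  // => 5
// With the u flag, the surrogate pair is matched as one symbol.
console.log(message.match(/./gu).length); // => 4

console.log(/^.{4}$/.test(message));  // => false
console.log(/^.{4}$/u.test(message)); // => true
```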
Regular expressions and combining marks
Unfortunately, with or without the u flag, regular expressions treat combining marks as separate code units.
If you need to match a combining character sequence, you have to match the base character and the combining mark separately.
Take a look at the following example:
const drink = 'cafe\u0301';
const regex1 = /^.{4}$/;
const regex2 = /^.{5}$/;
console.log(drink); // => 'café'
console.log(regex1.test(drink)); // => false
console.log(regex2.test(drink)); // => true
The string is rendered as 4 symbols: café.
Nevertheless, the regular expression matches 'cafe\u0301' as a sequence of 5 elements, with /^.{5}$/.
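One way to match the base character and the combining mark separately is to spell both out in the pattern, optionally next to the precomposed form. The pattern below is only an illustration for this particular word. The variable names are made up.

```javascript
// Matches 'café' written either with e + U+0301 or with the precomposed U+00E9.
const cafePattern = /^caf(?:e\u0301|\u00E9)$/;

console.log(cafePattern.test('cafe\u0301')); // => true
console.log(cafePattern.test('caf\u00E9'));  // => true
console.log(cafePattern.test('cafe'));       // => false
```

Normalizing the input first, as described in the normalization section, is an alternative when a canonical single-symbol form exists.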
4. Summary
Probably the most important concept about Unicode in JavaScript is to treat strings as sequences of code units, because that is what they really are.
Confusion arises when a developer thinks of strings as being composed of graphemes (or symbols) and ignores the concept of a code unit sequence.
This leads to misunderstandings when processing strings that contain surrogate pairs or combining character sequences:
- Getting the string length
- Character positioning
- Regular expression matching
Note that most string methods in JavaScript are not fully Unicode-aware, such as myString.indexOf(), myString.slice(), etc.
ECMAScript 2015 introduced nice features such as the code point escape sequence \u{1F600} in strings and regular expressions.
The new regular expression flag u enables Unicode-aware matching, which makes it easier to match astral symbols.
The string iterator String.prototype[@@iterator]() is Unicode-aware. You can use the spread operator [...str] or Array.from(str) to create an array of symbols, and then compute the string length or access characters by index without breaking surrogate pairs. Note that these operations have some performance impact.
If you need a more robust way to process Unicode characters, you can use the punycode library or a library that generates specialized regular expressions.
I hope this article helps you master Unicode!