This article introduces several new Unicode-related functions in ES6.
String.prototype.codePointAt
Function type:
(index?: number)=> number|undefined
`codePointAt` is a prototype function. It returns the code point value of the character at position `index` in the string. This method can recognize code points that take 4 bytes in UTF-16, so it covers a wider range than the prototype function `charCodeAt`, which only recognizes 2-byte characters in the Basic Multilingual Plane (BMP). In addition, when `index` is out of bounds, `codePointAt` returns `undefined`, whereas `charCodeAt` returns `NaN`.
Apart from these two differences, `codePointAt` and `charCodeAt` behave essentially the same: the `index` parameter defaults to `0` in both, and for a character in the BMP both return the same result.
const str = 'abc';
//the character 'a' is in the Basic Multilingual Plane
console.log(str.codePointAt(0)); //97
//index defaults to 0
console.log(str.codePointAt()); //97
//an out-of-range index returns undefined
console.log(str.codePointAt(5)); //undefined
console.log(str.charCodeAt(0)); //97
//index defaults to 0
console.log(str.charCodeAt()); //97
//an out-of-range index returns NaN
console.log(str.charCodeAt(5)); //NaN
When the character is in a supplementary plane, `codePointAt` recognizes it correctly and returns the code point of the whole character, whereas `charCodeAt` cannot and returns only the 2-byte code unit at the given position. For example, the treble clef 𝄞 is a supplementary-plane character, represented in UTF-16 by two 2-byte code units (a surrogate pair), `0xd834` and `0xdd1e`. When we call `charCodeAt` on 𝄞, we only get the code unit at the given position.
const str = '\ud834\udd1e'; //supplementary-plane character: treble clef 𝄞
console.log(str.charCodeAt(0).toString(16)); //d834
console.log(str.charCodeAt(1).toString(16)); //dd1e
When we use `codePointAt`, we get the code point `0x1d11e` of **𝄞**.
```js
console.log(str.codePointAt(0).toString(16)); //1d11e
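//At index 1, '\udd1e' has no following code unit to pair with, so it is treated as a single 2-byte code unit rather than as part of a pair; only the code point of '\udd1e' is returned, not that of '\ud834\udd1e'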
console.log(str.codePointAt(1).toString(16)); //dd1e
```
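Because `codePointAt(1)` lands on the trailing half of the pair, walking a string one index at a time mixes real code points with lone surrogates. The following is a minimal sketch of stepping through a string by code point instead. The helper name `toCodePoints` is hypothetical and does not come from the original article.

```js
// Hypothetical helper: collect the code points of a string,
// skipping the second code unit of each surrogate pair.
function toCodePoints(s) {
  const result = [];
  for (let i = 0; i < s.length; i++) {
    const cp = s.codePointAt(i);
    result.push(cp);
    // Code points above 0xffff occupy two UTF-16 code units.
    if (cp > 0xffff) {
      i++;
    }
  }
  return result;
}

const sample = 'a\ud834\udd1eb'; // 'a', treble clef, 'b'
console.log(toCodePoints(sample).map((cp) => cp.toString(16)));
// [ '61', '1d11e', '62' ]
```

The extra `i++` is the whole trick: whenever the value returned is larger than `0xffff`, the next index holds the trailing surrogate and is skipped.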
String.fromCodePoint
Function type:
(...codePoints: number[])=> string
The static function `fromCodePoint` returns the string that corresponds to the Unicode code points passed in. Unlike `fromCharCode`, it accepts supplementary-plane code point values directly. Taking the treble clef 𝄞 as the example again, `fromCodePoint` can take the code point value `0x1d11e` directly, whereas `fromCharCode` needs the two values `0xd834` and `0xdd1e`.
console.log(String.fromCodePoint(0x1d11e)); //𝄞
console.log(String.fromCodePoint(0xd834, 0xdd1e)); //𝄞
console.log(String.fromCharCode(0x1d11e)); //턞 not recognized correctly: the value is truncated to 16 bits (0xd11e)
console.log(String.fromCharCode(0xd834, 0xdd1e)); //𝄞
For BMP characters, `fromCodePoint` and `fromCharCode` behave the same.
console.log(String.fromCodePoint(97)); //'a'
console.log(String.fromCodePoint(97, 98)); //'ab'
console.log(String.fromCodePoint()); //''
console.log(String.fromCharCode(97)); //'a'
console.log(String.fromCharCode(97, 98)); //'ab'
console.log(String.fromCharCode()); //''
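To see where the two values `0xd834` and `0xdd1e` come from, it may help to compute the surrogate pair by hand. The sketch below applies the standard UTF-16 encoding arithmetic. The function name `toSurrogatePair` is made up for illustration, and the code assumes the input lies in the supplementary range `0x10000` to `0x10ffff`.

```js
// Hypothetical helper: split a supplementary-plane code point
// into its UTF-16 high and low surrogate code units.
// Assumes 0x10000 <= codePoint <= 0x10ffff; no validation is done.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10); // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1d11e);
console.log(high.toString(16), low.toString(16)); // d834 dd1e
console.log(String.fromCharCode(high, low) === String.fromCodePoint(0x1d11e)); // true
```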
String.prototype.normalize
Function type:
(form?:'NFC'|'NFD'|'NFKC'|'NFKD')=>string
The prototype function `normalize` accepts a `form` parameter that specifies the normalization form (NFC, NFD, NFKC, or NFKD). `form` defaults to `'NFC'` (Normalization Form Canonical Composition: decompose by canonical equivalence, then recompose by canonical equivalence), and the function returns the string normalized to that form.
Unicode provides two representations for composite characters (letters with diacritics, tone marks, and so on). One uses a single code point for the precomposed character; the other combines the base letter with a combining mark and therefore uses two code points. For example, ń is a composite character. We can represent it either with the single code point `0x0144` or with the two code points `0x006e` and `0x0301`.
const str1 = '\u0144'; //ń
const str2 = '\u006e\u0301'; //ń
console.log({
str1,
str2,
});//{ str1: 'ń', str2: 'ń' }
These two representations are visually and semantically the same; they are canonically equivalent. At the code level, however, they differ: `str1` is one code point and `str2` is two code points, which can easily cause problems.
console.log(str1.length, str2.length);//1 2
console.log(str1 === str2);//false
The `normalize` function solves this problem: once both strings have been normalized with `normalize`, the mismatch no longer occurs.
let str1 = '\u0144'; //ń
let str2 = '\u006e\u0301'; //ń
//normalize
str1 = str1.normalize();
str2 = str2.normalize();
console.log({
str1,
str2,
}); //{ str1: 'ń', str2: 'ń' }
console.log(str1.length, str2.length); //1 1
console.log(str1 === str2); //true
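In practice, normalization is often applied at comparison time rather than stored back into the variables. A minimal sketch, assuming both inputs should be compared under the default NFC form; `sameText` is a hypothetical helper name. It also shows that the `'NFD'` form goes the other way and decomposes ń into two code points.

```js
// Hypothetical helper: compare two strings after NFC normalization.
function sameText(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

const composed = '\u0144'; // ń as one code point
const decomposed = '\u006e\u0301'; // n + combining acute accent

console.log(composed === decomposed); // false
console.log(sameText(composed, decomposed)); // true

// NFD decomposes instead of composing.
console.log(composed.normalize('NFD').length); // 2
console.log(composed.normalize('NFD') === decomposed); // true
```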
New Unicode representation
Earlier, we wrote Unicode characters as `\u` followed by a four-digit hexadecimal code point. ES6 adds a new form: `\u{code point}`.
The difference between the two is easy to guess. `\u{code point}` can express the 4-byte code points of the supplementary planes, while `\u` followed by four hex digits only covers the 2-byte code points of the BMP.
//for 2-byte BMP code points, the two forms are equivalent
const str1 = '\u{0144}';
const str2 = '\u0144';
console.log(str1 === str2); //true
//treble clef
const str3 = '\u{1d11e}';
//wrong form: parsed as two characters, \u1d11 and e
const str4 = '\u1d11e';
console.log(str4, str3 === str4); //ᴑe false
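The brace form and an explicit surrogate pair describe the same string when they name the same character. A small sketch, reusing the treble clef from the earlier sections; the variable names are illustrative only.

```js
const braces = '\u{1d11e}'; // code point escape
const pair = '\ud834\udd1e'; // surrogate pair escape

console.log(braces === pair); // true
console.log(braces.length); // 2, length counts UTF-16 code units
console.log([...braces].length); // 1, string iteration walks code points
console.log(braces.codePointAt(0).toString(16)); // 1d11e
```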
Unicode really is a headache. If you are not very familiar with Unicode, leave a message in the comment area, and I will post another article that introduces Unicode and JS in detail.