
This article introduces several new Unicode-related features added in ES6.

  1. String.prototype.codePointAt
    Function type:

    (index?: number)=> number|undefined

codePointAt is a prototype method. It returns the code point value of the character at the given index in the string. It can recognize 4-byte code points in UTF-16 (surrogate pairs), so it covers a wider range than the prototype method charCodeAt, which only recognizes 2-byte characters in the Basic Multilingual Plane (BMP). In addition, when index is out of bounds, codePointAt returns undefined, while charCodeAt returns NaN.

Except for these two points, codePointAt and charCodeAt are basically the same:

  • The index parameter defaults to 0 in both.
  • When the character is in the Basic Multilingual Plane (BMP), both return the same result.

    const str = 'abc'; // the character 'a' is in the BMP
    console.log(str.codePointAt(0)); // 97
    // index defaults to 0
    console.log(str.codePointAt()); // 97
    // returns undefined when index is out of bounds
    console.log(str.codePointAt(5)); // undefined
    console.log(str.charCodeAt(0)); // 97
    // index defaults to 0
    console.log(str.charCodeAt()); // 97
    // returns NaN when index is out of bounds
    console.log(str.charCodeAt(5)); // NaN
  • When the character is in a supplementary plane, codePointAt recognizes it correctly and returns the code point of the character. charCodeAt cannot recognize it, and only returns the 2-byte code unit at the current position.

    For example, the treble clef 𝄞 is a supplementary-plane character. It is represented by two 2-byte code units, 0xd834 and 0xdd1e.

    When we call charCodeAt on 𝄞, we only get the code unit at the given position.

    const str = '\ud834\udd1e'; // supplementary-plane character: treble clef 𝄞
    console.log(str.charCodeAt(0).toString(16)); //d834 
    console.log(str.charCodeAt(1).toString(16)); //dd1e
 When we use `codePointAt`, we get the code point of **𝄞**, `0x1d11e`.

 ```js
 console.log(str.codePointAt(0).toString(16)); //1d11e
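 // When index is 1, no other code unit follows '\udd1e', so it is treated as a
 // single 2-byte character rather than as part of a pair. Only the value of
 // '\udd1e' is returned, not the code point of '\ud834\udd1e'.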
 console.log(str.codePointAt(1).toString(16)); //dd1e
 ```
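 Because a supplementary-plane character occupies two code units, walking a string by code point means advancing the index by 2 whenever codePointAt returns a value above 0xffff. The following is a minimal sketch of that idea. The helper name `toCodePoints` is hypothetical and not part of the original article.

```javascript
// Hypothetical helper (not from the article): collect every code point in a string.
function toCodePoints(str) {
  const result = [];
  let i = 0;
  while (i < str.length) {
    const cp = str.codePointAt(i);
    result.push(cp);
    // Code points above 0xffff occupy two UTF-16 code units, so skip both.
    i += cp > 0xffff ? 2 : 1;
  }
  return result;
}

const codePoints = toCodePoints('a\ud834\udd1eb');
console.log(codePoints.map((cp) => cp.toString(16))); // [ '61', '1d11e', '62' ]
```

 Indexing with charCodeAt over the same string would instead yield four values, one per code unit.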
  2. String.fromCodePoint

    Function type:

    (...codePoints: number[])=> string

The static function fromCodePoint returns the string corresponding to the Unicode code points passed in. Compared with fromCharCode, it supports passing supplementary-plane code point values directly. Taking the treble clef 𝄞 as the example again, fromCodePoint can take the code point value 0x1d11e directly, whereas fromCharCode needs to be passed 0xd834 and 0xdd1e.

console.log(String.fromCodePoint(0x1d11e)); //𝄞
console.log(String.fromCodePoint(0xd834, 0xdd1e)); //𝄞
console.log(String.fromCharCode(0x1d11e)); //턞 not recognized correctly, garbled output
console.log(String.fromCharCode(0xd834, 0xdd1e)); //𝄞

For BMP characters, fromCodePoint and fromCharCode behave the same.

console.log(String.fromCodePoint(97)); //'a'
console.log(String.fromCodePoint(97, 98)); //'ab'
console.log(String.fromCodePoint()); //''
console.log(String.fromCharCode(97)); //'a'
console.log(String.fromCharCode(97, 98)); //'ab'
console.log(String.fromCharCode()); //''
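
The two values that fromCharCode needs for 𝄞 follow from the standard UTF-16 surrogate-pair encoding. As a rough sketch (the helper name `toSurrogatePair` is hypothetical, and it assumes the input lies in the supplementary range 0x10000 to 0x10ffff), the split can be computed by hand:

```javascript
// Hypothetical helper (not from the article): split a supplementary-plane
// code point into its UTF-16 high and low surrogate code units.
// Assumes 0x10000 <= codePoint <= 0x10ffff; no validation is done here.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10); // top 10 bits of the offset
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits of the offset
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1d11e);
console.log(high.toString(16), low.toString(16)); // d834 dd1e
console.log(String.fromCharCode(high, low) === String.fromCodePoint(0x1d11e)); // true
```

In practice fromCodePoint does this work internally, so the manual split is only needed to understand what fromCharCode expects.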
  3. String.prototype.normalize

    Function type:

    (form?: 'NFC'|'NFD'|'NFKC'|'NFKD')=> string

The prototype function normalize accepts a form parameter that specifies the normalization form: NFC, NFD, NFKC, or NFKD. form defaults to 'NFC' (Normalization Form Canonical Composition: decompose by canonical equivalence, then recompose by canonical equivalence). The function returns the string normalized to the given form.

For composite characters (letters with diacritics such as tone marks), Unicode provides two representations. One uses a single code point for the precomposed character. The other combines the letter with an additional combining mark, using two code points. For example, ń is a composite character. We can represent it either with the single code point 0x0144, or with the two code points 0x006e and 0x0301.

const str1 = '\u0144'; //ń
const str2 = '\u006e\u0301'; //ń
console.log({
    str1,
    str2,
});//{ str1: 'ń', str2: 'ń' }

These two representations are visually and semantically the same; they are canonically equivalent. However, they differ at the code level: str1 is one code point and str2 is two code points, which is likely to cause problems.

console.log(str1.length, str2.length);//1 2
console.log(str1 === str2);//false

The normalize function exists to solve this problem. Once both strings have been normalized with normalize, the problem no longer occurs.

let str1 = '\u0144'; //ń
let str2 = '\u006e\u0301'; //ń
// normalize both strings
str1 = str1.normalize();
str2 = str2.normalize();
console.log({
    str1,
    str2,
}); //{ str1: 'ń', str2: 'ń' }

console.log(str1.length, str2.length); //1 1
console.log(str1 === str2); //true
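
When strings come from different sources, the normalization step can be folded into the comparison itself. The sketch below assumes a hypothetical helper named `equalsNormalized`, which is not part of the original article, and also shows that 'NFD' goes the other way and decomposes the single code point.

```javascript
// Hypothetical helper (not from the article): compare two strings after
// normalizing both to the same form (NFC by default).
function equalsNormalized(a, b, form = 'NFC') {
  return a.normalize(form) === b.normalize(form);
}

const composed = '\u0144'; // ń as a single code point
const decomposed = '\u006e\u0301'; // n followed by a combining acute accent

console.log(composed === decomposed); // false
console.log(equalsNormalized(composed, decomposed)); // true
// NFD decomposes the single code point into letter + combining mark.
console.log(composed.normalize('NFD').length); // 2
```

Which form to use matters less than using the same form on both sides of the comparison.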
  4. New Unicode representation method

    So far we have written Unicode characters as \u followed by the code point. ES6 adds a new notation: \u{code point}.

    The difference between the two is easy to guess. \u{code point} supports the 4-byte code points of the supplementary planes, while \u followed by a code point only supports 2-byte code points in the BMP, written as exactly four hexadecimal digits.

    // for 2-byte code points in the BMP, the two notations are equivalent
    const str1 = '\u{0144}';
    const str2 = '\u0144';
    console.log(str1 === str2); //true
    // treble clef
    const str3 = '\u{1d11e}';
    // wrong notation: parsed as the two characters \u1d11 and e
    const str4 = '\u1d11e';
    console.log(str4, str3 === str4); //ᴑe false
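
    The braced escape only changes how the literal is written, not how the string is stored. As a small illustrative check (the variable names here are made up for the example), the braced form produces the same two code units as the surrogate pair written with two \u escapes:

```javascript
const braced = '\u{1d11e}'; // ES6 code point escape
const pair = '\ud834\udd1e'; // the same character written as a surrogate pair

console.log(braced === pair); // true
// The string still holds two UTF-16 code units.
console.log(braced.length); // 2
console.log(braced.codePointAt(0).toString(16)); // 1d11e
```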
Unicode really can be a headache. If you are not very familiar with Unicode, leave a message in the comment area, and I will post another article that covers Unicode and JS in detail.

forceddd