Do you know what biangbiang noodles have to do with encoding? A 10,000-word long read that walks you through character encoding step by step

Crotch Trio (裤裆三重奏)
Chinese

Encoding

Preface

This article grew out of what I learned while studying file uploads, because uploading files involves file encoding and related topics. As a front-end developer I rarely deal with these things, but I cannot let go of a problem until I fully understand it. I am the kind of person who gets dragged into a pile of problems by one small problem, and then each question brings new questions (no more nesting dolls, please!).

It took a long time to write this article. If you get something out of it after reading, a Star ✨ is the best encouragement for me.

ps: This article will not dwell on less important information, such as who proposed a certain standard and in what year. Details like these are mostly forgotten after one reading, so I will extract only the important information to share with everyone.

Prerequisites

Before we start, let's go over the background knowledge we need. A rough impression of each item is enough.

Common unit conversions

Binary digit, bit, b: these all name the smallest unit of computer storage, the "bit", which is a binary 0 or 1.

1B = 8b

Byte, B: these both mean "byte". 1 byte is equal to 8 bits.

1KB = 1024B

KB and MB should be very familiar: you see these units whenever you look at a file size.

1MB = 1024KB

Unit conversion
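As a rough sketch of the conversions above, the helper below turns a byte count into B, KB or MB using steps of 1024. The name `formatBytes` is hypothetical and not part of the original article.

```javascript
// Hypothetical helper: express a byte count in B / KB / MB, where each step is 1024.
function formatBytes(bytes) {
  const units = ["B", "KB", "MB"];
  let value = bytes;
  let index = 0;
  while (value >= 1024 && index < units.length - 1) {
    value /= 1024;
    index += 1;
  }
  return `${value} ${units[index]}`;
}

console.log(formatBytes(16 / 8)); // 16 bits are 2 bytes: "2 B"
console.log(formatBytes(1024)); // "1 KB"
console.log(formatBytes(1024 * 1024)); // "1 MB"
```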

Base conversion in the front end

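// Hexadecimal to decimal (pass a string: parseInt(0x0f, 16) would parse "15" as hex and return 21)
parseInt("0f", 16); // 15
// Hexadecimal to binary
(0x0f).toString(2); // "1111"

// Decimal to hexadecimal
(15).toString(16); // "f"
// Decimal to binary
(15).toString(2); // "1111"

// Binary to hexadecimal
parseInt("1111", 2).toString(16); // "f"
// Binary to decimal
parseInt("1111", 2); // 15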

The early days of computing: messy, inconsistent encoding schemes

To make things easier to follow, the article uses the appearance of UNICODE as the dividing line of its timeline. UNICODE is the axe with which Pangu split heaven and earth.

The arrival of ASCII

As we all know, all data storage and computation in a computer must be represented in binary, because the computer uses high and low voltage levels to represent 1 and 0. So a group of people decided to use 8 binary bits (1 byte) to hold one code point, enough to represent all the characters they needed.

With 8 binary bits, each of which can be 0 or 1, there are 2^8 = 256 code points in total, which can represent 256 characters. ASCII itself uses only the lower half, 0000 0000 ~ 0111 1111, and divides its characters into control characters, communication special characters and displayable characters.

Control characters and communication special characters together occupy 0000 0000 ~ 0001 1111 plus 0111 1111, a total of 33 code points, that is, 33 characters. For example, 0000 0111 means "bell": a computer of that era would ring a bell when it encountered 0000 0111.

The displayable characters occupy 0010 0000 ~ 0111 1110, 95 in total. For example, 0011 0000 represents the digit 0, and 0100 1010 represents the capital letter J.
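The ranges above can be checked with a few lines of code. This is only a sketch: `classifyAscii` is a hypothetical name, and it merges control characters and communication special characters into one "control" group, as the text does when counting them.

```javascript
// Classify a code in the 7-bit ASCII range using the ranges described above.
function classifyAscii(code) {
  if (code < 0 || code > 0x7f) return "not ASCII";
  if (code <= 0x1f || code === 0x7f) return "control";
  return "displayable";
}

// Count how many codes fall into each group.
const counts = { control: 0, displayable: 0 };
for (let code = 0; code <= 0x7f; code++) {
  counts[classifyAscii(code)] += 1;
}

console.log(counts); // { control: 33, displayable: 95 }
console.log(classifyAscii(0b00000111)); // "control" (the bell character)
console.log(String.fromCharCode(0b00110000)); // "0"
console.log(String.fromCharCode(0b01001010)); // "J"
```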

ASCII

Because computers were at first used only in the United States, this worked well for everyone. The table that lists these characters in a prescribed order is the ASCII table that we learned about in C.

ps: Education is getting intense these days. A kid I know is learning programming in elementary school and already knows ASCII codes 0.0

The extended ASCII table and the GBK encoding scheme

Later, as computers developed, some Western countries began to add their own letters and box-drawing symbols after the ASCII code table. This is the extended ASCII table, which uses 1000 0000 ~ 1111 1111 to represent those characters.

ps: The code table of extended ASCII differs between system configurations. I won't go into it here; students who want to know more can refer to this document.

Later still, computers spread farther and farther and reached the third world. We found that the ASCII code table could not possibly hold our tens of thousands of Chinese characters. 256 positions are not enough for us. What should we do?

The clever Chinese simply dropped the odd symbols after position 127 (that is, the content of the extended ASCII table) and stipulated that a byte no greater than 127 keeps its original meaning, while two bytes greater than 127 placed together represent one Chinese character. The first byte (high byte) ranges over 0xa1 ~ 0xf7, and the second byte (low byte) over 0xa1 ~ 0xfe. (From here on, the article no longer writes code points in binary. It is too long to write, and a pile of 0s and 1s is hard to read. I will convert binary to the corresponding hexadecimal; the 0x prefix marks a hexadecimal number.)

This way, the English letters of the ASCII table never turn into garbled text, and the combinations give more than 8,000 positions for our own characters, mathematical symbols, Japanese kana and so on. We even re-encoded the punctuation marks that already existed in the ASCII table as two-byte characters. These are what we usually call full-width characters, while the original symbols below 127 are called half-width characters. (Do you see the difference between "," and ","? The former is a half-width character, and the latter is a full-width character.)

GB

This scheme worked well, and we named it GB2312. The GB prefix stands for national standard (Guobiao).

But we have too many Chinese characters, and more than 8,000 positions were still not enough. So we simply stopped requiring the low byte to be greater than 127: as long as the high byte is greater than 127, it marks the start of a two-byte character. This encoding scheme is called GBK. GBK includes everything in GB2312 and adds many more Chinese characters, including traditional characters.
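The rule "a byte above 127 starts a two-byte character" can be sketched as a small splitter. This follows the simplified description in the text only; real GBK also restricts which values the two bytes may take, and `splitDoubleByte` is a hypothetical name. The sample bytes d6 d0 and ce c4 are the GB2312/GBK encodings of 中 and 文.

```javascript
// Split a byte sequence into characters using the simplified rule from the text:
// a byte <= 127 stands alone; a byte > 127 starts a two-byte character.
function splitDoubleByte(bytes) {
  const chars = [];
  let i = 0;
  while (i < bytes.length) {
    const width = bytes[i] > 0x7f ? 2 : 1;
    chars.push(bytes.slice(i, i + width));
    i += width;
  }
  return chars;
}

// "A" followed by the bytes of "中文" in GB2312/GBK.
const sample = [0x41, 0xd6, 0xd0, 0xce, 0xc4];
console.log(splitDoubleByte(sample)); // [ [ 65 ], [ 214, 208 ], [ 206, 196 ] ]

// Number of two-byte positions in the GB2312 ranges 0xa1 ~ 0xf7 and 0xa1 ~ 0xfe.
console.log((0xf7 - 0xa1 + 1) * (0xfe - 0xa1 + 1)); // 8178
```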

Later, ethnic minorities also needed to use computers and wanted their scripts added, so the scheme was expanded again, from GBK to GB18030.

Interested students can go to this website to look up the code points that the national standards assign to each character.

Does it sound perfect? GB18030 covers almost every Chinese character you will ever see, but it only solves encoding for Chinese and cannot display the characters of other countries or regions. For example, there was also an encoding scheme called Big5 at the time, popular in traditional Chinese regions such as Taiwan, Hong Kong and Macau. The character sets of the ETen Chinese System and of traditional Chinese Windows were both based on Big5.

You see, even mainland China and Hong Kong, Macau and Taiwan did not share an encoding, not to mention other countries. Everyone was building behind closed doors. To read documents from another country, you had to install and switch to that country's encoding scheme.

The advent of UNICODE

This chaos could not go on. What was needed was a character set that includes characters from all over the world. Two organizations set out to build such a unified character set: the International Organization for Standardization (ISO) developed the ISO 10646 character set, and the Unicode Consortium developed the UNICODE character set. Later they realized that the goal was one unified standard, not a repeat of the old fragmentation, and the result is the UNICODE character set we use today. The ISO 10646 character set still exists alongside UNICODE, and by agreement each code point means the same character in both.

Because we are more familiar with UNICODE, I will focus on UNICODE in the following content.

UNICODE planes and code point ranges

At the beginning, UNICODE was very simple. If the 1 byte (8 bits) of ASCII was not enough, then use 2 bytes (16 bits): 2^16 = 65536 code points. Surely that would always be enough?

It turned out to be a slap in the face. After being overrun by the characters of every country, 65536 was not enough, and something had to change. So two regions of UNICODE, 0xd800 ~ 0xdbff (high surrogates) and 0xdc00 ~ 0xdfff (low surrogates), were set aside to be combined into new code points. Each surrogate region has 1024 code points, so 1024^2 = 1048576 code points were added. Together with the original 2^16 = 65536 code points, UNICODE has a total of 1048576 + 65536 = 1114112 code points.
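The arithmetic above is easy to verify. The variable names below are only illustrative.

```javascript
// Size of each surrogate region.
const highSurrogates = 0xdbff - 0xd800 + 1; // 1024
const lowSurrogates = 0xdfff - 0xdc00 + 1; // 1024

// Every (high, low) combination addresses one additional code point.
const supplementary = highSurrogates * lowSurrogates; // 1048576
const bmp = 2 ** 16; // 65536
const total = supplementary + bmp; // 1114112

console.log(total); // 1114112
// The largest code point is therefore total - 1.
console.log((total - 1).toString(16)); // "10ffff"
```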

Now we have ordinary 2-byte code points and 4-byte code points composed of a high surrogate and a low surrogate. How do we tell them apart? This is where the concept of planes comes in: the code points are divided into planes. The 2-byte code points form the BMP, and the 4-byte code points form the auxiliary (supplementary) planes.

Plane 0 is called the BMP (Basic Multilingual Plane), and its range is 0x0000 ~ 0xffff .
The 1st auxiliary plane is called the SMP (Supplementary Multilingual Plane), and its range is 0x10000 ~ 0x1ffff .
The 2nd auxiliary plane is called the SIP (Supplementary Ideographic Plane), and its range is 0x20000 ~ 0x2ffff .
The 3rd auxiliary plane is called the TIP (Tertiary Ideographic Plane), and its range is 0x30000 ~ 0x3ffff .
The 4th to 13th auxiliary planes are not yet in use.
The 14th auxiliary plane is called the SSP (Supplementary Special-purpose Plane), and its range is 0xe0000 ~ 0xeffff .
The 15th auxiliary plane has a range of 0xf0000 ~ 0xfffff .
The 16th auxiliary plane has a range of 0x100000 ~ 0x10ffff .
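Since every plane in the list above spans exactly 0x10000 code points, the plane number can be read off by shifting the code point right by 16 bits. A minimal sketch, with `planeOf` as a hypothetical name:

```javascript
// Return the plane number (0 ~ 16) that a UNICODE code point belongs to.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError("not a UNICODE code point");
  }
  // Each plane holds 0x10000 (2^16) code points.
  return codePoint >> 16;
}

console.log(planeOf(0x0041)); // 0 (BMP)
console.log(planeOf(0x1ffff)); // 1
console.log(planeOf(0x20000)); // 2
console.log(planeOf(0x10ffff)); // 16
```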

In fact, planes are just names given to code point ranges of equal length. The characters we commonly use are all on the BMP, 0x0000 ~ 0xffff , which has 65536 code points in total. The high and low surrogates inside it are combined to address the auxiliary planes.

BMP

In addition, interested students can go to this website to view all the characters of UNICODE.

Wait a minute! Doesn't the high surrogate and low surrogate trick look familiar? Students who read carefully will have spotted it right away: yes, it is just like the GBK we built on top of ASCII. There, we combined the bytes above 127 into tens of thousands of code points for Chinese characters, and UNICODE does the same by combining two surrogate regions into the auxiliary planes. So why does UNICODE get about a million extra code points while GBK gets only tens of thousands? Simply because an ASCII unit is 1 byte and a UNICODE unit is 2 bytes, nothing more.

Once we had the UNICODE character set, were all the problems solved? Not really. UNICODE made no attempt at first to be compatible with any earlier encoding, so its adoption did not go smoothly.

At this point, some careful students may ask: in the picture above, doesn't UNICODE start with the ASCII characters? The characters at each of those code points are the same, so why is it incompatible?

Well spotted. But ASCII uses 1 byte, while the BMP of UNICODE uses 2 bytes. For the same English letter A , ASCII stores 01000001 and UNICODE stores 00000000 01000001 , with a row of useless 0s in front. So another reason UNICODE adoption stalled is that storing the same English document takes twice the space, to say nothing of other text.

UNICODE vs ASCII

Solving compatibility and space: the clever UTF-8

To be compatible with ASCII, UTF-8 appeared. It is an encoding scheme for UNICODE. It is not only compatible with ASCII; because it is a variable-length encoding, it also halves the space needed for pure English documents compared with 2 bytes per character.

For Chinese documents, however, UTF-8 takes more space than GBK (3 bytes per character instead of 2). GBK packs tens of thousands of code points into 2 bytes, but in UTF-8 even the Chinese characters on the BMP need 3 bytes. Why? Let's look at an example: how UTF-8 encodes 裤 (kù), the first character of 裤裆 (kùdāng).

First, here is a table of UTF-8 templates. The templates follow a pattern: if the first byte starts with 0, the character occupies one byte; if it starts with 1, the number of consecutive leading 1s is the number of bytes.

Hexadecimal range      UTF-8 template
0x0000 ~ 0x007f        0xxxxxxx
0x0080 ~ 0x07ff        110xxxxx 10xxxxxx
0x0800 ~ 0xffff        1110xxxx 10xxxxxx 10xxxxxx
0x10000 ~ 0x10ffff     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Attentive classmates may ask again (can't you find another way to introduce a question!?): look at the last template. If every x is set to 1 , the value is clearly greater than 0x10ffff . Why is the range limited to 0x10ffff ? This is a consequence of the surrogates. UNICODE made the mistake of fixing characters at 2 bytes at the beginning and later patched it with surrogates, which caps the number of code points at 0x10ffff , even though the UTF-8 design could express and accommodate more characters.

// Get the UNICODE code point of "裤"
const code = "裤".charCodeAt(0); // 35044
// Get the hexadecimal representation of 35044
// Looking it up, 0x88e4 falls in row 3 of the table above
const hex = code.toString(16); // "88e4"
// Get the binary representation of 35044
const binary = code.toString(2); // "1000100011100100" (1000 1000 1110 0100)

//          10001000 11100100   |   binary sequence of "裤"
//     1000   100011   100100   |   regrouped as 4 + 6 + 6 bits
// 1110xxxx 10xxxxxx 10xxxxxx   |   row 3 of the template table
// --------------------------   |   fill the template from low bits to high bits
// 11101000 10100011 10100100   |   binary sequence after encoding
//       e8       a3       a4   |   binary converted to hexadecimal

// The final encoding result is e8 a3 a4

// Verify with node Buffer
const buffer = Buffer.from("裤", "utf-8"); // <Buffer e8 a3 a4>

UTF-8

As shown above, a Chinese character takes 3 bytes in UTF-8, while GBK needs only 2. So if your website or documents only circulate domestically, could you use GBK? It would cut document size considerably.

Next, let's also work out how many bytes an English letter takes after UTF-8 encoding.

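// Get the UNICODE code point of A
const code = "A".charCodeAt(0); // 65
// Get the hexadecimal representation of 65; 0x41 falls in row 1 of the table above
const hex = code.toString(16); // "41"
// Get the binary representation of 65
const binary = code.toString(2); // "1000001"

//  1000001   |   binary sequence of A
// 0xxxxxxx   |   row 1 of the template table
// --------   |   fill the template from low bits to high bits
// 01000001   |   binary sequence after encoding
//       41   |   binary converted to hexadecimal

// The final encoding result is 41

// Verify with node Buffer
const buffer = Buffer.from("A", "utf-8"); // <Buffer 41>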

As you can see, for an English letter 00000000 01000001 becomes 01000001 , which halves the space used.
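The two walkthroughs above generalize to any code point. The sketch below fills the four templates with bit operations; `encodeUtf8` and `toHex` are hypothetical names, and for brevity the function does not reject surrogate code points (0xd800 ~ 0xdfff), which a strict encoder would.

```javascript
// Encode one UNICODE code point into UTF-8 bytes by filling the templates
// from the table, 6 bits per continuation byte.
function encodeUtf8(codePoint) {
  if (codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError("code point out of range");
  }
  if (codePoint <= 0x7f) {
    return [codePoint]; // 0xxxxxxx
  }
  if (codePoint <= 0x7ff) {
    // 110xxxxx 10xxxxxx
    return [0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f)];
  }
  if (codePoint <= 0xffff) {
    // 1110xxxx 10xxxxxx 10xxxxxx
    return [
      0xe0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3f),
      0x80 | (codePoint & 0x3f),
    ];
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xf0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3f),
    0x80 | ((codePoint >> 6) & 0x3f),
    0x80 | (codePoint & 0x3f),
  ];
}

const toHex = (bytes) => bytes.map((b) => b.toString(16).padStart(2, "0")).join(" ");

console.log(toHex(encodeUtf8(0x41))); // 41
console.log(toHex(encodeUtf8(0x88e4))); // e8 a3 a4
```

Comparing the output with `Buffer.from(String.fromCodePoint(cp), "utf8")` is a quick way to check other code points.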

The UTF-16 encoding scheme

Having covered UTF-8, let's talk about UTF-16 and UCS-2, and after that UTF-32 and UCS-4. Remember the ISO organization mentioned above? UCS is the encoding scheme it proposed, and UCS-2 can be considered the predecessor of UTF-16.

UTF-16 and UTF-8 look like siblings, but they are not.

First, for characters on the BMP, UTF-16 uses 2 bytes directly, English letters included.

Second, remember the auxiliary planes formed by combining high surrogates with low surrogates? Students who have forgotten can scroll up. The surrogates exist specifically for UTF-16: an auxiliary plane character is represented in UTF-16 by 4 bytes, a high surrogate followed by a low surrogate.

After reading for so long, you must be hungry. Let's talk about something delicious: biangbiang noodles, a specialty of Shaanxi.

biangbiang noodles

The character biang looks like this.

The character biang

The code point of biang in UNICODE is 0x30ede . You can open the website and search for 0x30ede to see where biang sits in the UNICODE character set. Although this character cannot be typed in the browser, we can use it to walk through how UTF-16 encoding works.

const biangCharCodeHex = "0x30ede";
// The code point 200414 already shows that biang is not on the BMP,
// because the BMP has only 65536 code points
parseInt(biangCharCodeHex, 16); // 200414

// Next, the UTF-16 encoding process

// First get the binary representation of 200414
(200414).toString(2); // "110000111011011110" (11 0000 1110 1101 1110)

//            11 0000 1110 1101 1110   |   binary of the code point
//             1 0000 0000 0000 0000   |   subtract 0x10000
//            10 0000 1110 1101 1110   |   resulting binary
//          0010 0000 1110 1101 1110   |   pad with leading 0s to 20 bits
//       0010000011       1011011110   |   regroup into two 10-bit halves for readability
// 1101100000000000 1101110000000000   |   0xd800 on the left, 0xdc00 on the right
// ---------------------------------   |   remember the ranges of the high and low surrogates?
//                                     |   high surrogates run 0xd800 ~ 0xdbff, we start from 0xd800
//                                     |   low surrogates run 0xdc00 ~ 0xdfff, we start from 0xdc00
// 1101100010000011 1101111011011110   |   each 10-bit half replaces the trailing 0s of its surrogate
//             d883             dede   |   convert to hexadecimal to get the UTF-16 encoding

// Verify with node Buffer
// node only supports little-endian UTF-16
// http://nodejs.cn/api/buffer.html#buffer_buffers_and_character_encodings
// so each 16-bit unit is written low byte first: d883 becomes 83 d8, and dede stays de de
// also, "\u{30ede}" is the ES6 syntax for writing an auxiliary plane character
const buffer = Buffer.from("\u{30ede}", "utf16le"); // <Buffer 83 d8 de de>

UTF-16

With that, we have completed the UTF-16 encoding by hand. UNICODE@3.0 also gives a conversion formula for auxiliary plane characters:

// charCode is the code point of an auxiliary plane character; here we plug in biang
const charCode = 0x30ede;
const High = Math.floor((charCode - 0x10000) / 0x400) + 0xd800;
const Low = ((charCode - 0x10000) % 0x400) + 0xdc00;

// The result matches the manual encoding above
High.toString(16); // "d883"
Low.toString(16); // "dede"
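The formula can be wrapped into a function, and it can also be run backwards to recover the code point from a surrogate pair. Both function names below are hypothetical; the inverse simply undoes the two steps of the formula.

```javascript
// Forward direction: auxiliary plane code point -> [high surrogate, low surrogate].
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = Math.floor(offset / 0x400) + 0xd800;
  const low = (offset % 0x400) + 0xdc00;
  return [high, low];
}

// Reverse direction: [high surrogate, low surrogate] -> code point.
function fromSurrogatePair(high, low) {
  return (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
}

const [high, low] = toSurrogatePair(0x30ede);
console.log(high.toString(16), low.toString(16)); // d883 dede
console.log(fromSurrogatePair(high, low).toString(16)); // 30ede
```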

The example above also shows that UTF-16 comes in big-endian and little-endian flavors, which is the byte order (endianness) problem. We have to tell the program whether each pair of bytes should be read from the left or from the right.

For example, the encoding result of 奠 (diàn) is 59 60 . Read in the opposite order, the same two bytes become 60 59 , which is a different character. If the reading direction is not indicated, the displayed result will be wrong.
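A small sketch of the problem, reading the same two bytes in both orders with node's Buffer:

```javascript
// The same two bytes, 59 60, interpreted in both byte orders.
const bytes = Buffer.from([0x59, 0x60]);

// Big-endian: the first byte is the high byte.
const bigEndian = bytes.readUInt16BE(0);
// Little-endian: the first byte is the low byte.
const littleEndian = bytes.readUInt16LE(0);

console.log(bigEndian.toString(16)); // "5960"
console.log(littleEndian.toString(16)); // "6059"

// The two readings produce different characters.
console.log(String.fromCharCode(bigEndian) === String.fromCharCode(littleEndian)); // false
```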

Therefore, UTF-16 has two variants, UTF-16BE (big-endian) and UTF-16LE (little-endian). A file encoded in UTF-16 gets a byte order mark (BOM) at the very beginning: UTF-16BE files start with 0xfeff and UTF-16LE files with 0xfffe . That is why a file saved as UTF-16 takes 2 extra bytes.
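Detecting the mark takes only a comparison of the first two bytes. This is a simplified sketch: `detectUtf16ByteOrder` is a hypothetical name, and it ignores the fact that a UTF-32LE mark also begins with ff fe.

```javascript
// Inspect the first two bytes of a buffer and report the UTF-16 byte order.
function detectUtf16ByteOrder(buffer) {
  if (buffer.length >= 2) {
    if (buffer[0] === 0xfe && buffer[1] === 0xff) return "UTF-16BE";
    if (buffer[0] === 0xff && buffer[1] === 0xfe) return "UTF-16LE";
  }
  return "unknown";
}

// Build the content of a little-endian UTF-16 file by hand: mark + encoded text.
const file = Buffer.concat([Buffer.from([0xff, 0xfe]), Buffer.from("abc", "utf16le")]);

console.log(detectUtf16ByteOrder(file)); // "UTF-16LE"
console.log(file.length); // 8 (2 bytes for the mark + 3 letters * 2 bytes)
```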

We know that UTF-16 can combine high and low surrogates to form new code points, and this is exactly where UCS-2 differs. UCS-2 also uses 2 bytes for each character of the BMP, but it cannot represent the auxiliary planes. And because ISO must stay consistent with UNICODE, the code points 0xd800 ~ 0xdbff and 0xdc00 ~ 0xdfff are left empty in UCS-2. You can think of UCS-2 as a subset of UTF-16.
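JavaScript strings make the difference visible, because `length` and `charCodeAt` count 16-bit units (the UCS-2 view) while `codePointAt` and string iteration combine surrogate pairs (the UTF-16 view). A short illustration using the code point 0x30ede:

```javascript
// A character from an auxiliary plane, built from its code point.
const biang = String.fromCodePoint(0x30ede);

// Counting 16-bit units: the character occupies two of them.
console.log(biang.length); // 2
console.log(biang.charCodeAt(0).toString(16)); // "d883" (high surrogate)
console.log(biang.charCodeAt(1).toString(16)); // "dede" (low surrogate)

// Combining the surrogate pair: one code point.
console.log(biang.codePointAt(0).toString(16)); // "30ede"
console.log(Array.from(biang).length); // 1
```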

The UTF-32 encoding scheme

Now let's go through UTF-32 and UCS-4 more quickly. Both use 4 bytes for every character. When UCS-4 was first proposed, it had 4 bytes, 32 bits in total, but in computers the highest bit is usually treated as the sign bit (I am not sure this is really the reason), so UCS-4 has 2^31 = 2147483648 code points. Because UCS-4 must comply with the UNICODE standard, only the code points up to 0x10ffff may be used, and so UTF-32 was proposed. It only represents the code points 0x000000 ~ 0x10ffff .
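Because every character takes exactly 4 bytes, a UTF-32 encoder is just "write the code point as a 32-bit integer". A minimal big-endian sketch, with `encodeUtf32BE` as a hypothetical name (node's Buffer has no built-in UTF-32 encoding):

```javascript
// Encode one code point as 4 big-endian bytes.
function encodeUtf32BE(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError("UTF-32 only covers 0x000000 ~ 0x10ffff");
  }
  const buffer = Buffer.alloc(4);
  buffer.writeUInt32BE(codePoint, 0);
  return buffer;
}

console.log(encodeUtf32BE(0x41)); // <Buffer 00 00 00 41>
console.log(encodeUtf32BE(0x30ede)); // <Buffer 00 03 0e de>
```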

By now, have you noticed the trap that surrogates left behind? UTF-8 and UTF-32 could clearly represent more characters, but because of surrogates the upper limit of UNICODE is capped at 0x10ffff . There are still plenty of unused code points, and UNICODE is nicknamed the "universal code", but if we ever need a cosmic code, will we have to start a new set of encodings, or be forced to communicate by brain waves? (doge)

Interim summary of character sets and encoding schemes

At this point, character sets and encoding schemes come to a pause, so let's summarize.

  1. At the beginning, the United States proposed the ASCII code table, the encoding scheme of early computers. With 8 bits to 1 byte there are 2^8 = 256 code points, of which ASCII itself defines the first 128.
  2. Later, computers spread to other countries, and extended ASCII tables appeared, filling up the code points after 127.
  3. Later, computers came to China. We combined two bytes according to specific rules to encode Chinese characters. Popular encoding schemes of the time were GB2312, GBK, GB18030, Big5, etc.
  4. Finally, to stop every country from creating its own encoding scheme, UNICODE was created (alongside ISO 10646 from the International Organization for Standardization), trying to include characters from all over the world.
  5. UNICODE initially stipulated 2 bytes for every character and did not consider compatibility with any existing encoding scheme, ASCII included, so adoption was slow until the UTF-8 encoding scheme arrived.
  6. UNICODE code points run from 0x0000 to 0x10ffff and are divided into 17 planes. The one we usually use is the BMP.
  7. The first 128 code points of the BMP are copied directly from ASCII, and 0xd800 ~ 0xdbff and 0xdc00 ~ 0xdfff are reserved as high and low surrogates, which combine to address the code points of the auxiliary planes.
  8. UTF-8 is a variable-length encoding scheme whose result is 1 to 4 bytes. It keeps the encoding of English characters identical to the original ASCII, and it can encode every UNICODE code point through specific rules.
  9. UTF-16 is also a variable-length encoding scheme whose result is 2 or 4 bytes. Characters of the BMP take 2 bytes, and characters of the auxiliary planes take 4 bytes, a high surrogate combined with a low surrogate. Because it has no special rules like those of UTF-8, it has an endianness problem.
  10. UCS-2 is the predecessor of UTF-16. Its result is always 2 bytes, and it cannot represent characters of the auxiliary planes. Its coding space is 0x0000 ~ 0xffff .
  11. UTF-32 is a fixed-length encoding scheme whose result is always 4 bytes. Like UTF-16, it has an endianness problem.
  12. UCS-4 is the predecessor of UTF-32, and its result is always 4 bytes. Its coding space is 0x00000000 ~ 0x7fffffff .
  13. The coding space of the UTF family of encoding schemes is 0x000000 ~ 0x10ffff .

Verification: you will remember it once you actually see it

After so much theory, some students may have forgotten it already, so let's demonstrate the effect of different encodings on file size.

First the English document: create 3 files that all contain abc , and then compare the sizes of the three files.

abc

From the picture above, we can see:

GBK and UTF-8 both take 3 bytes for three English letters: GBK keeps the 1-byte encoding for the letters below 127, and the UTF-8 rules also give English letters 1 byte each.

UTF-16 takes 8 bytes. As mentioned above, UTF-16 adds 2 bytes at the beginning of the file to indicate the byte order. Remove those 2 bytes, and each English letter takes 2 bytes in UTF-16.

Next is the Chinese document: again 3 files, each containing 我爱你 ("I love you"). Compare the sizes of the three files.

我爱你

From the picture above, we can see:

GBK takes 6 bytes, because the GBK encoding scheme combines 2 bytes into one code point for Chinese, so each Chinese character takes 2 bytes.

UTF-16 takes 8 bytes. First remove the 2 bytes at the head of the file that indicate the byte order. Then, because UTF-16 uses 2 bytes for characters on the BMP, each Chinese character takes 2 bytes in UTF-16.

UTF-8 takes 9 bytes. As mentioned above, each Chinese character on the BMP takes 3 bytes after UTF-8 encoding.
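The UTF-8 and UTF-16 numbers can be reproduced without creating files. In this sketch, `encodedSizes` and `BOM_LENGTH` are hypothetical names, and the 2-byte mark is added by hand because `Buffer.byteLength` counts only the encoded text. GBK is left out because node's Buffer has no built-in GBK encoding.

```javascript
// Size of the byte order mark that editors put at the head of a UTF-16 file.
const BOM_LENGTH = 2;

// Compute how many bytes a text takes in UTF-8 and in UTF-16 (with the mark).
function encodedSizes(text) {
  return {
    utf8: Buffer.byteLength(text, "utf8"),
    utf16WithBom: Buffer.byteLength(text, "utf16le") + BOM_LENGTH,
  };
}

console.log(encodedSizes("abc")); // { utf8: 3, utf16WithBom: 8 }
console.log(encodedSizes("我爱你")); // { utf8: 9, utf16WithBom: 8 }
```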

References & Suggestions

As a developer, I don't have much time for extracurricular reading when development work is busy, but now and then I have to pull myself out of the code. Writing code that gets a feature working is very rewarding, but stepping into a world you have never set foot in is also one of the great joys of life.

Reference article

Web page coding is that thing
ASCII code table
Code Charts
What is the difference between Unicode and UTF-8?
Character encoding notes: ASCII, Unicode and UTF-8
In detail: Unicode, UTF-8, UTF-16, UTF-32, UCS-2, UCS-4

Suggested reading

What is ANSI code?
Chinese character encoding: GB2312, GBK, GB18030, Big5
In detail: Unicode, UTF-8, UTF-16, UTF-32, UCS-2, UCS-4
UTF-8, UTF-16, UTF-32 & BOM
JavaScript’s internal character encoding: UCS-2 or UTF-16?

Preview of upcoming articles

Of course, UNICODE is not as simple as one code point per character. Take the Hindi word नमस्ते, which means hello.

Also, the length of kùdāng in the browser environment is 8 instead of 6, which involves the encoding format of JavaScript.
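Both teasers can be poked at in a console. The sketch below is only one possible reading: it assumes the pinyin string was written with combining tone marks (a base letter followed by a separate accent), which the article does not confirm. The escape sequences spell out the exact code points so the result does not depend on how the text was typed.

```javascript
// The Hindi word for hello, written as its six code points.
const namaste = "\u0928\u092e\u0938\u094d\u0924\u0947";
console.log(Array.from(namaste).length); // 6 code points, though it is read as fewer visible characters

// "kudang" with combining grave (U+0300) and combining macron (U+0304).
const decomposed = "ku\u0300da\u0304ng";
// NFC normalization merges each letter + mark into one precomposed code point.
const composed = decomposed.normalize("NFC");

console.log(decomposed.length); // 8
console.log(composed.length); // 6
```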

There is too much material on this topic. I need to sort it out first and then write an article based on my own understanding.

Footer

Code is life, and I am happy with it.

Technology keeps changing
Mind is always online
The front end is long
See you next time

by --- Crotch Trio

I am here: GitHub @jsjzh. Everyone is welcome to come and play. If you gained something from the article, a Star ✨ is the best encouragement for me.

Friends are welcome to add me on WeChat directly. I will pull you into the group so we can build things together. Remember to mention where you saw the article.

ps: If the picture does not load, you can add me on WeChat: kimimi_king

<div align="center">
<image src="https://p1-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/53fb3e16b1f64ebbb8aee73734371257~tplv-k3u1fbpfcp-watermark.image" />
</div>
