Preface
This article grew out of a side quest while I was studying file uploads, since uploading files inevitably touches on file encoding and related topics. As a front-end developer I rarely deal with these things, and I can't let a problem go until I fully understand it. I'm the kind of person who starts with one small question and ends up buried under a pile of new ones, each question spawning the next (no nesting dolls, please!).
This article took a long time to write. If you get something out of it, a Star ✨ is the best encouragement you can give me.
ps: The article won't dwell on less important details, such as who proposed a particular standard and in what year; those facts are forgotten as soon as you've read them. I'll extract the important information and share that instead.
Prerequisites
Before we start, let's sort out the background knowledge we need. A rough impression of these concepts is enough.
Common computer unit conversions
- Bit (binary bit, b): the smallest unit of computer storage, a binary `0` or `1`.
- Byte (B): 1 byte equals 8 bits, i.e. `1B = 8b`.
- KB and MB: you see these units whenever you check a file's size; `1KB = 1024B` and `1MB = 1024KB`.
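If an upload API reports a file size in bytes, the conversions above can be applied directly. A tiny sketch (the 3 MB figure is just an illustration):

```js
// Converting a byte count into the units above
const sizeInBytes = 3 * 1024 * 1024; // a hypothetical 3 MB file
console.log(sizeInBytes * 8); // 25165824 -> size in bits (1B = 8b)
console.log(sizeInBytes / 1024); // 3072 -> size in KB
console.log(sizeInBytes / 1024 / 1024); // 3 -> size in MB
```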
Base conversion in the front end
// Hexadecimal to decimal
parseInt("0x0f", 16); // 15
// Hexadecimal to binary
(0x0f).toString(2); // "1111"
// Decimal to hexadecimal
(15).toString(16); // "f"
// Decimal to binary
(15).toString(2); // "1111"
// Binary to hexadecimal
parseInt("1111", 2).toString(16); // "f"
// Binary to decimal
parseInt("1111", 2); // 15
The early days of computing: a mess of inconsistent encoding schemes
To make this easier to follow, the article uses the appearance of UNICODE as the dividing line in the timeline. UNICODE is the axe with which Pangu split open the primordial chaos.
The arrival of ASCII
As we all know, all data in a computer is stored and operated on in binary, because the hardware uses high and low voltage levels to represent `1` and `0`. So a group of people decided to use 8 binary bits as one code point to represent all characters.
With 8 binary bits, each position can be `0` or `1`, giving `2^8 = 256` code points in total, enough for 256 characters. These characters are divided into control characters, special communication characters and printable characters.
The control characters and special communication characters together occupy `0000 0000 ~ 0001 1111`, plus `0111 1111`, 33 code points in all, i.e. 33 characters. For example, `0000 0111` is the bell character: a computer of that era would ring its bell when it encountered `0000 0111`. The printable characters occupy `0010 0000 ~ 0111 1110`, 95 in total. For example, `0011 0000` represents the digit `0`, and `0100 1010` represents the capital letter `J`.
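You can check these code points for yourself in the browser console; a quick sketch:

```js
// Verifying the ASCII examples above in JavaScript
String.fromCharCode(0b00000111); // "\u0007", the bell (BEL) control character
String.fromCharCode(0b00110000); // "0"
String.fromCharCode(0b01001010); // "J"
"J".charCodeAt(0).toString(2); // "1001010"
```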
Because computers were only used in the United States at first, everyone got along just fine with this. The table formed by arranging these characters in a prescribed order is the ASCII table we learned about in C.
ps: Education is intense these days; a kid I know is learning programming in elementary school and already knows the ASCII table 0.0
The extended ASCII table and the GBK encoding scheme
Later, as computers developed, some Western countries began to append their own national characters and symbols after the ASCII table. This is the extended ASCII table, which uses the code points `1000 0000 ~ 1111 1111` to represent their own characters.
ps: The contents of the extended ASCII table differ between system configurations; I won't go into that here. Students who want to know more can refer to this document.
Later still, computers spread farther and farther and reached the third world. We found that the ASCII table could not possibly hold our tens of thousands of Chinese characters; 256 positions are nowhere near enough for us. What to do?
The clever Chinese simply threw out the odd symbols after position 127 (that is, the contents of the extended ASCII table) and stipulated that a byte below 127 keeps its original meaning, while two consecutive bytes above 127 together represent one Chinese character: the first byte (the high byte) ranges over `0xa1 ~ 0xf7`, and the second byte (the low byte) over `0xa1 ~ 0xfe`. (From here on the article stops writing code points in binary; it gets far too long, and a pile of `0`s and `1`s is hard on the eyes. I'll use the corresponding hexadecimal instead, with the `0x` prefix marking hexadecimal.)
This way we not only guarantee that the English letters in the ASCII table won't turn into garbled characters, but also gain more than 8,000 combined positions for our own text, mathematical symbols, Japanese kana and so on. We even re-encoded the punctuation marks that already existed in the ASCII table as two-byte-wide characters; these are what we usually call full-width characters, while the symbols below 127 are called half-width characters. (Can you see the difference between "," and "，"? The former is half-width, the latter full-width.)
This scheme worked well, and we gave it the name GB2312; the GB prefix stands for Guo Biao, "national standard".
But we have far too many Chinese characters, and those 8,000-plus positions were still not enough. So we simply dropped the requirement that the low byte also be above 127 and stipulated that as long as the high byte is greater than 127, the pair represents a character of this scheme. This encoding scheme is called GBK. GBK includes everything in GB2312 and adds many more Chinese characters, including traditional ones.
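The byte-range rules above can be written down directly in code. A rough sketch for illustration only (`isGb2312Pair` and `isGbkLeadByte` are my own made-up helper names, not part of any standard library, and this is not an exhaustive check of the real specifications):

```js
// GB2312: high byte 0xa1 ~ 0xf7, low byte 0xa1 ~ 0xfe
function isGb2312Pair(high, low) {
  return high >= 0xa1 && high <= 0xf7 && low >= 0xa1 && low <= 0xfe;
}

// GBK only requires the high (lead) byte to be greater than 127
function isGbkLeadByte(byte) {
  return byte > 0x7f;
}

isGb2312Pair(0xb0, 0xa1); // true  -> a valid GB2312 two-byte pair
isGb2312Pair(0x41, 0xa1); // false -> 0x41 is just the ASCII letter "A"
isGbkLeadByte(0xb0); // true
```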
Later, ethnic minorities also needed to use computers and wanted their characters added, so the scheme was expanded once more: GBK grew into GB18030.
Interested students can go to the website to look up the national-standard code points of the corresponding characters.
Sounds perfect, right? GB18030 covers almost every Chinese character you'll ever see, but it only solves Chinese encoding and cannot display characters from other countries or regions. For example, there was also an encoding scheme called Big5 at the time, popular in traditional-Chinese regions such as Taiwan, Hong Kong and Macau; the character sets of the Yitian (ETen) Chinese system and of traditional-Chinese Windows were all based on Big5.
You see, even mainland China and Hong Kong, Macau and Taiwan did not share an encoding, to say nothing of other countries. Everyone was working behind closed doors: if you needed to read another country's documents, you had to install and switch to that country's encoding scheme.
The advent of UNICODE
If that chaos had continued, it would never have worked. At this point two organizations set out to build a unified character set containing characters from all over the world: the ISO 10646 character set developed by the International Organization for Standardization (ISO), and the UNICODE character set developed by the Unicode Consortium. Later they realized that what they should do was unify the standard rather than repeat the old mistakes, and in the end we got the UNICODE character set we use today. Of course, the ISO 10646 character set still exists and coexists with UNICODE, and by agreement the characters at their respective code points have the same meaning.
Because we are more familiar with UNICODE, I will focus on UNICODE in the following content.
The meaning of UNICODE planes and their code point ranges
At the beginning, UNICODE's idea was simple: if ASCII's 1 byte (8 bits) is not enough, use 2 bytes (16 bits), which gives `2^16 = 65536` code points. Surely that is enough?
Reality delivered a slap in the face: after being put through the wringer by the characters of every country, 2 bytes were not enough after all, and something had to change. So it was decided to take two ranges inside UNICODE, `0xd800 ~ 0xdbff` (the high surrogates) and `0xdc00 ~ 0xdfff` (the low surrogates), and combine them to form new code points. Each surrogate range has 1024 code points, so `1024^2 = 1048576` code points were added; together with the original `2^16 = 65536`, UNICODE logically has `1048576 + 65536 = 1114112` code points in total.
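A quick sanity check of this arithmetic in the console:

```js
// Re-deriving the code point counts mentioned above
const bmp = Math.pow(2, 16); // 65536 code points in the original 2-byte range
const highSurrogates = 0xdbff - 0xd800 + 1; // 1024
const lowSurrogates = 0xdfff - 0xdc00 + 1; // 1024
const supplementary = highSurrogates * lowSurrogates; // 1048576 combinations
console.log(bmp + supplementary); // 1114112, i.e. 0x110000
```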
Now we have ordinary 2-byte code points and 4-byte code points composed of a high surrogate plus a low surrogate. How do we tell them apart? This is where the concept of planes comes in: the code points are divided into planes, the 2-byte ones forming the BMP and the 4-byte ones forming the auxiliary planes.
- Plane 0 is called the BMP (Basic Multilingual Plane); its range is `0x0000 ~ 0xffff`.
- The 1st auxiliary plane is called the SMP, the Supplementary Multilingual Plane; its range is `0x10000 ~ 0x1ffff`.
- The 2nd auxiliary plane is called the SIP, the Supplementary Ideographic Plane; its range is `0x20000 ~ 0x2ffff`.
- The 3rd auxiliary plane is called the TIP, the Tertiary Ideographic Plane; its range is `0x30000 ~ 0x3ffff`.
- The 4th to 13th auxiliary planes are not yet in use.
- The 14th auxiliary plane is called the SSP, the Supplementary Special-purpose Plane; its range is `0xe0000 ~ 0xeffff`.
- The 15th auxiliary plane's range is `0xf0000 ~ 0xfffff`.
- The 16th auxiliary plane's range is `0x100000 ~ 0x10ffff`.
In fact, a plane can be understood as a name for a code point range of a fixed length. The characters we commonly use all live on the BMP: its `0x0000 ~ 0xffff` range holds 65536 code points, among which the high and low surrogates combine to address the auxiliary planes.
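To make the idea of planes concrete, here is a tiny sketch (`planeOf` is a made-up helper name, written only for illustration) that maps a code point to its plane number by dividing by the plane size `0x10000`:

```js
// Which plane does a code point live on? Each plane holds 0x10000 code points.
function planeOf(codePoint) {
  return Math.floor(codePoint / 0x10000);
}

planeOf("A".codePointAt(0)); // 0 -> BMP
planeOf(0x1f600); // 1 -> SMP (where emoji such as 😀 live)
planeOf(0x30ede); // 3 -> TIP (we will meet this code point again below)
```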
In addition, interested students can go to this website to view all the characters of UNICODE.
Wait a moment! Doesn't this high/low surrogate trick look familiar? Students who have been reading carefully should react right away: yes, it is exactly what GBK did on top of ASCII. We used ASCII bytes above 127 and combined them into tens of thousands of code points for Chinese characters, and UNICODE does the same thing, taking two surrogate ranges and combining them into the auxiliary planes. So why does UNICODE end up with millions of code points while our GBK only gets tens of thousands? Simply because ASCII is 1 byte and UNICODE is 2 bytes, nothing more.
So, with the UNICODE character set in hand, were all the problems solved? Not really. Since UNICODE never intended to be compatible with any earlier character set, its adoption did not go smoothly.
At this point some careful students may ask: in the picture above, didn't UNICODE copy the first half of the ASCII characters directly? The characters at each of those code points are the same, so why is it incompatible?
Good observation, but ASCII is 1 byte while UNICODE's BMP is 2 bytes. The same English letter `A` is `01000001` in ASCII but `00000000 01000001` in UNICODE, with a run of useless `0`s in front. So another reason UNICODE's adoption stalled is that an English document with identical content takes twice the storage space, to say nothing of other text.
Solving the compatibility and space problems: the clever UTF-8
To be compatible with ASCII, UTF-8 appeared. It is an encoding scheme for UNICODE: in its implementation it is not only compatible with ASCII but, being a variable-length encoding, it also halves the space a pure-English document takes compared with 2-byte UNICODE.
For Chinese documents, however, the space grows by about half compared with GBK, because GBK packs tens of thousands of code points into 2 bytes, while in UTF-8 even a Chinese character on the BMP needs 3 bytes. Why? Let's look directly at the following example of how UTF-8 encodes the character 裤 (the kù in "pants").
First, here is a table of UTF-8 templates. The templates follow a pattern: if the leading bit is 0, the character occupies one byte; if it starts with 1, the number of consecutive leading 1s tells you how many bytes the character occupies.
| Code point range (hex) | UTF-8 template |
| --- | --- |
| 0x0000 ~ 0x007f | 0xxxxxxx |
| 0x0080 ~ 0x07ff | 110xxxxx 10xxxxxx |
| 0x0800 ~ 0xffff | 1110xxxx 10xxxxxx 10xxxxxx |
| 0x10000 ~ 0x10ffff | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
Attentive students may ask again (can't you find another way to ask a question!?): look at the last template. If every `x` position is filled with `1`, the value is clearly greater than `0x10ffff`, so why is the range capped at `0x10ffff`? That is the legacy of the surrogates: UNICODE's initial misstep of fixing characters at 2 bytes was patched with surrogates, and that remedy caps the number of code points at `0x10ffff`, even though UTF-8's design could express and accommodate more characters.
// Get the UNICODE code point of "裤"
const code = "裤".charCodeAt(); // 35044
// Get the hexadecimal representation of 35044;
// looking it up, 0x88e4 falls in the 3rd row of the table above
const hex = code.toString(16); // "88e4"
// Get the binary representation of 35044
const binary = code.toString(2); // 1000 1000 1110 0100
// 10001000 11100100          | binary sequence of "裤"
// 1000 100011 100100         | the same bits regrouped as 4 + 6 + 6
// 1110xxxx 10xxxxxx 10xxxxxx | the 3rd template row
// -------------------------- | fill the template from the low bits upward
// 11101000 10100011 10100100 | the encoded binary sequence
// e8 a3 a4                   | the binary sequence in hexadecimal
// The final encoding result is e8 a3 a4
// Verify with a Node Buffer
const buffer = Buffer.from("裤", "utf-8"); // <Buffer e8 a3 a4>
As we can see above, a Chinese character becomes 3 bytes after UTF-8 encoding, while GBK represents it in 2 bytes. So if your website or document only needs to circulate domestically, could you use GBK instead? It would shave quite a bit off the document size.
Let's also work out how many bytes an English letter takes after UTF-8 encoding.
// Get the UNICODE code point of A
const code = "A".charCodeAt(); // 65
// Get the hexadecimal representation of 65; 0x41 falls in the 1st row of the table above
const hex = code.toString(16); // "41"
// Get the binary representation of 65
const binary = code.toString(2); // 1000001
// 1000001  | binary sequence of A
// 0xxxxxxx | the 1st template row
// -------- | fill the template from the low bits upward
// 01000001 | the encoded binary sequence
// 41       | the binary sequence in hexadecimal
// The final encoding result is 41
// Verify with a Node Buffer
const buffer = Buffer.from("A", "utf-8"); // <Buffer 41>
As you can see, for English letters `00000000 01000001` becomes `01000001`, halving the space used.
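To tie the two walkthroughs together, here is a minimal UTF-8 encoder sketch that mechanically applies the template table above. `encodeUtf8` is my own illustrative helper; in real code you would reach for `TextEncoder` or `Buffer.from(str, "utf-8")` instead:

```js
// Encode one UNICODE code point into UTF-8 bytes by filling the templates
function encodeUtf8(codePoint) {
  if (codePoint <= 0x7f) {
    return [codePoint]; // 0xxxxxxx
  }
  if (codePoint <= 0x7ff) {
    return [0b11000000 | (codePoint >> 6), 0b10000000 | (codePoint & 0b111111)];
  }
  if (codePoint <= 0xffff) {
    return [
      0b11100000 | (codePoint >> 12),
      0b10000000 | ((codePoint >> 6) & 0b111111),
      0b10000000 | (codePoint & 0b111111),
    ];
  }
  return [
    0b11110000 | (codePoint >> 18),
    0b10000000 | ((codePoint >> 12) & 0b111111),
    0b10000000 | ((codePoint >> 6) & 0b111111),
    0b10000000 | (codePoint & 0b111111),
  ];
}

encodeUtf8("A".codePointAt(0)).map((b) => b.toString(16)); // [ '41' ]
encodeUtf8("裤".codePointAt(0)).map((b) => b.toString(16)); // [ 'e8', 'a3', 'a4' ]
```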
UTF-16 encoding scheme
Having covered UTF-8, let's talk about UTF-16 and UCS-2, and then UTF-32 and UCS-4. Remember the ISO organization mentioned above? UCS is the encoding scheme it proposed, and UCS-2 can be considered the predecessor of UTF-16.
Although UTF-16 and UTF-8 look like twins, they are quite different.
First, UTF-16 represents every BMP character, English letters included, directly with 2 bytes.
Also, remember the auxiliary planes formed by combining high and low surrogates? Students who have forgotten can scroll back up. The surrogates exist precisely for UTF-16: for auxiliary-plane characters, UTF-16 uses 4 bytes, a high surrogate combined with a low surrogate.
After reading for this long you must be getting hungry, so let's talk about something delicious: the Shaanxi specialty, biangbiang noodles.
The word biang, as shown below.
The code point of biang in UNICODE is `0x30ede`. You can open the website and search for `0x30ede` to see where the biang character sits in the UNICODE character set. Although this character is hard to type in a browser, we can use it as the lead-in to how UTF-16 encoding is done.
const biangCharCodeHex = "0x30ede";
// The code point 200414 already tells us that biang is not on the BMP,
// because the BMP only has 65536 code points
parseInt(biangCharCodeHex); // 200414
// Next, the UTF-16 encoding process
// First get the binary representation of 200414
(200414).toString(2); // 11 0000 1110 1101 1110
// 11 0000 1110 1101 1110            | binary of the code point
//  1 0000 0000 0000 0000            | subtract 0x10000
// 10 0000 1110 1101 1110            | the result
// 0010 0000 1110 1101 1110          | left-pad the result with 0 up to 20 bits
// 0010000011 1011011110             | regroup into two 10-bit halves for readability
// 1101100000000000 1101110000000000 | left is 0xd800, right is 0xdc00
// --------------------------------- | remember the surrogate ranges?
//                                   | high surrogates run 0xd800 ~ 0xdbff, we take 0xd800
//                                   | low surrogates run 0xdc00 ~ 0xdfff, we take 0xdc00
// 1101100010000011 1101111011011110 | drop the two 10-bit halves into the trailing zeros of the surrogates
// d883 dede                         | convert to hexadecimal: the final UTF-16 encoding
// Verify with a Node Buffer
// Node only supports UTF-16 little-endian
// http://nodejs.cn/api/buffer.html#buffer_buffers_and_character_encodings
// so each 16-bit code unit is stored low byte first: d883 dede becomes 83 d8 de de
// Also, "\u{30ede}" is the ES6 escape syntax for an auxiliary-plane character
const buffer = Buffer.from("\u{30ede}", "utf16le"); // <Buffer 83 d8 de de>
With that, we have completed the UTF-16 encoding by hand. UNICODE 3.0 also gives a conversion formula for auxiliary-plane characters:
High = Math.floor((charCode - 0x10000) / 0x400) + 0xd800;
Low = ((charCode - 0x10000) % 0x400) + 0xdc00;
// Plug the code point of biang into the formulas
High = (Math.floor((0x30ede - 0x10000) / 0x400) + 0xd800).toString(16); // d883
Low = (((0x30ede - 0x10000) % 0x400) + 0xdc00).toString(16); // dede
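Wrapping those two formulas into a single helper gives a small sketch of a general UTF-16 encoder (`toUtf16CodeUnits` is an illustrative name of my own, not a built-in):

```js
// BMP characters are their own code unit; auxiliary-plane characters become a surrogate pair
function toUtf16CodeUnits(codePoint) {
  if (codePoint <= 0xffff) {
    return [codePoint];
  }
  const offset = codePoint - 0x10000;
  const high = Math.floor(offset / 0x400) + 0xd800;
  const low = (offset % 0x400) + 0xdc00;
  return [high, low];
}

toUtf16CodeUnits(0x30ede).map((u) => u.toString(16)); // [ 'd883', 'dede' ]
toUtf16CodeUnits("A".codePointAt(0)).map((u) => u.toString(16)); // [ '41' ]
```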
The example above also shows that UTF-16 comes in big-endian and little-endian flavors, that is, it has a byte-order problem: we have to tell the program whether each code unit should be read from the left or from the right.
For example, the UTF-16 encoding of 奠 (diàn) is `59 60`; read in the opposite byte order it becomes `60 59`, a completely different result. If the reading direction is not indicated, what gets displayed will be wrong.
Therefore UTF-16 has two variants, UTF-16BE (big-endian) and UTF-16LE (little-endian), and a file encoded as UTF-16 carries a byte-order mark (BOM) at its head: UTF-16BE puts `0xfeff` there, UTF-16LE puts `0xfffe`. That is why a file saved as UTF-16 takes up 2 extra bytes.
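As a small illustration of how a program might use that mark, here is a BOM-detection sketch. `detectUtf16Bom` is my own helper name, and note that Node's `utf16le` encoding does not write a BOM by itself, so we prepend one manually for the demonstration:

```js
// Check the first two bytes of a buffer for a UTF-16 byte-order mark
function detectUtf16Bom(buffer) {
  if (buffer[0] === 0xfe && buffer[1] === 0xff) return "UTF-16BE";
  if (buffer[0] === 0xff && buffer[1] === 0xfe) return "UTF-16LE";
  return "no BOM";
}

const bom = Buffer.from([0xff, 0xfe]); // the little-endian mark
const body = Buffer.from("abc", "utf16le"); // Node writes no BOM on its own
detectUtf16Bom(Buffer.concat([bom, body])); // "UTF-16LE"
detectUtf16Bom(body); // "no BOM"
```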
We know that UTF-16 can combine a high and a low surrogate into a new code point, and this is exactly where UCS-2 differs. UCS-2 also uses 2 bytes for BMP characters, but it cannot represent the auxiliary planes, and because ISO must stay consistent with UNICODE, the `0xd800 ~ 0xdbff` and `0xdc00 ~ 0xdfff` code points are simply left empty in UCS-2. You can think of UCS-2 as a subset of UTF-16.
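You can feel this difference directly in JavaScript, whose strings behave like sequences of UTF-16 code units: the older, UCS-2-era `charCodeAt` only sees the surrogates, while `codePointAt` sees the whole auxiliary-plane code point.

```js
const biang = "\u{30ede}"; // the auxiliary-plane character from the example above
biang.length; // 2 -> two UTF-16 code units (a surrogate pair)
biang.charCodeAt(0).toString(16); // "d883" -> the high surrogate
biang.charCodeAt(1).toString(16); // "dede" -> the low surrogate
biang.codePointAt(0).toString(16); // "30ede" -> the real code point
```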
UTF-32 encoding scheme
Next let's dispatch UTF-32 and UCS-4 more quickly. Both use a fixed 4 bytes per character. When UCS-4 was first proposed, 4 bytes meant 32 bits, but the highest bit was treated as reserved (whether to call it a sign bit is debatable), so UCS-4 had `2^31 = 2147483648` code points. Because UCS-4 must comply with the UNICODE standard, however, only code points up to `0x10ffff` may be used, so UTF-32 was proposed; it only represents the code points `0x000000 ~ 0x10ffff`.
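Since every code point simply becomes its own value padded to 4 bytes, a UTF-32 encoder is almost a one-liner. `encodeUtf32BE` below is my own illustrative helper (Node's Buffer has no built-in UTF-32 support):

```js
// Fixed-length encoding: each code point becomes exactly 4 bytes (big-endian here)
function encodeUtf32BE(codePoint) {
  return [
    (codePoint >> 24) & 0xff,
    (codePoint >> 16) & 0xff,
    (codePoint >> 8) & 0xff,
    codePoint & 0xff,
  ];
}

encodeUtf32BE("A".codePointAt(0)).map((b) => b.toString(16)); // [ '0', '0', '0', '41' ]
encodeUtf32BE(0x30ede).map((b) => b.toString(16)); // [ '0', '3', 'e', 'de' ]
```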
Do you see the pit dug by the surrogates now? UTF-8 and UTF-32 could clearly represent more characters, but because of the surrogate mechanism the upper limit of UNICODE code points is pinned at `0x10ffff`. A huge number of code points remain unused, and UNICODE calls itself the universal code, but if a "cosmic code" is ever needed, will we have to start over with a new encoding, or be forced to switch to brain-wave transmission? (dog-head emoji)
An interim summary of character sets and encoding schemes
This wraps up character sets and encoding schemes; let's summarize.
- In the beginning, the United States proposed the ASCII table, which was also the encoding scheme of early computers; it stipulated that 8 bits make 1 byte, giving `2^8 = 256` code points in total.
- Later, computers spread to other countries, so the extended ASCII table appeared, filling up the code points after 127.
- Later still, computers reached China. We combined two ASCII bytes by specific rules to encode Chinese characters; the popular schemes of the time were GB2312, GBK, GB18030, Big5 and so on.
- Finally, the standards bodies put an end to every country inventing its own encoding scheme and created UNICODE, aiming to take in characters from all over the world.
- UNICODE initially stipulated 2 bytes for every character and did not consider compatibility with any existing scheme, ASCII included, so adoption was very difficult until the advent of the UTF-8 encoding scheme.
- UNICODE has `1114112` code points in total (`0x0 ~ 0x10ffff`), divided into 17 planes; the characters we usually use live on the BMP.
- The first 128 code points (0 ~ 127) of UNICODE's BMP are copied directly from ASCII, and `0xd800 ~ 0xdbff` and `0xdc00 ~ 0xdfff` are designated as the high and low surrogates, which combine to produce the code points of the auxiliary planes.
- UTF-8 is a variable-length encoding scheme whose results are 1 to 4 bytes; it keeps the encoding of English characters identical to the original ASCII while being able to encode every UNICODE code point through its template rules.
- UTF-16 is also a variable-length encoding scheme, producing 2 or 4 bytes: BMP characters take 2 bytes, and auxiliary-plane characters take 4 bytes via a high/low surrogate pair. Because it has no self-describing byte rules like UTF-8's, it has a byte-order problem.
- UCS-2 is the predecessor of UTF-16; its results are 2 bytes and it cannot represent auxiliary-plane characters. Its coding space is `0x0000 ~ 0xffff`.
- UTF-32 is a fixed-length encoding scheme whose results are always 4 bytes; like UTF-16, it has a byte-order problem.
- UCS-4 is the predecessor of UTF-32, also a fixed 4 bytes, with a coding space of `0x00000000 ~ 0x7fffffff`.
- The coding space of the UTF family of encoding schemes is `0x000000 ~ 0x10ffff`.
Verification: you'll remember it once you see it for yourself
Having said so much, some students may already have forgotten half of it, so let's actually demonstrate the impact of different encodings on file size.
First, the English document: create 3 files, each containing `abc` but saved as GBK, UTF-8 and UTF-16 respectively, then compare the sizes of the three files.
From the picture above, we can see:
GBK and UTF-8 both use 3 bytes for the three English letters, because GBK keeps the 1-byte encoding for characters below 127, and UTF-8's rules also give 1 byte per English letter.
UTF-16 occupies 8 bytes. As mentioned above, UTF-16 adds 2 bytes at the start of the file to indicate the reading order; remove those 2 bytes and each English letter occupies 2 bytes in UTF-16.
Next, the Chinese document: again 3 files with the same three encodings, each containing 我爱你 ("I love you"); compare the sizes of the three files.
From the picture above, we can see:
GBK occupies 6 bytes, because GBK combines 2 extended-ASCII bytes into one new code point for Chinese, so each Chinese character takes 2 bytes.
UTF-16 occupies 8 bytes: first remove the 2-byte byte-order mark at the file head; then, since UTF-16 uses 2 bytes for BMP characters, each Chinese character takes 2 bytes in UTF-16.
UTF-8 occupies 9 bytes; as mentioned above, a BMP Chinese character becomes 3 bytes after UTF-8 encoding.
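If you don't feel like creating the files by hand, a rough equivalent can be checked in Node with `Buffer.byteLength`. GBK is not built into Node, so that row would need a third-party library such as iconv-lite (assumed, not shown); the numbers below cover UTF-8 and UTF-16 only, plus the 2-byte BOM an editor adds when saving UTF-16 to disk:

```js
// Byte counts of the two test strings in different encodings
const en = "abc";
const zh = "我爱你";

Buffer.byteLength(en, "utf-8"); // 3 -> 1 byte per English letter
Buffer.byteLength(en, "utf16le"); // 6 -> 2 bytes per letter (+2 for the BOM in a saved file)
Buffer.byteLength(zh, "utf-8"); // 9 -> 3 bytes per BMP Chinese character
Buffer.byteLength(zh, "utf16le"); // 6 -> 2 bytes per character (+2 for the BOM in a saved file)
```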
References & Suggestions
As a developer I don't have much time for extracurricular reading when development tasks pile up, but occasionally I have to pull myself out of the code. Writing code that ships features is rewarding, but getting the chance to step into a world you have never set foot in is also one of life's great joys.
Reference article
Web page coding is that thing
ASCII code table
Code Charts
What is the difference between Unicode and UTF-8?
character encoding notes: ASCII, Unicode and UTF-8
detail: Unicode, UTF-8, UTF-16, UTF-32, UCS-2, UCS-4
Suggested reading
What is ANSI code?
Chinese character encoding: GB2312, GBK, GB18030, Big5
detail: Unicode, UTF-8, UTF-16, UTF-32, UCS-2, UCS-4
UTF-8, UTF-16, UTF-32 & BOM
JavaScript’s internal character encoding: UCS-2 or UTF-16?
Preview of the series of articles
Of course, UNICODE is not as simple as one code point per character; take the Hindi नमस्ते, which means "hello", for example.
In addition, the length of kùdāng in the browser environment comes out as 8 instead of 6, which involves JavaScript's internal encoding format.
There is too much material in this area; I still need to sort it out, and will eventually write it up in another article based on my own understanding.
Footer
Code is life, and I am happy with it.
Technology keeps changing,
the mind stays always online,
the road of the front end is long,
see you next time. by --- Crotch Trio
I'm at gayhub@jsjzh; welcome everyone to come and play with me. If you got something out of this article, a Star ✨ is the best encouragement for me.
Friends are welcome to add me on WeChat (vx) directly; I'll pull you into the group so we can build things together. Remember to note where you saw this article.
ps: If the image doesn't load, you can add me on WeChat: kimimi_king
<div align="center">
<image src="https://p1-juejin.byteimg.com/tos-cn-i-k3u1fbpfcp/53fb3e16b1f64ebbb8aee73734371257~tplv-k3u1fbpfcp-watermark.image" />
</div>