
1. Introduction

Suppose someone shows you the following picture. If you can clearly explain the relationships among these terms and what each one is for, your understanding of character encoding passes the test.

If the picture above leaves you confused, don't worry. This article tries to sort out these isolated terms and explain how they connect to one another...

2. What you need to know about character encoding

2.1 ASCII (when one standard ruled)

Inside a computer, all information is ultimately stored as binary values. Each binary digit (bit) has two states, 0 and 1, and 8 bits make up 1 byte. One byte is enough to encode the state of every key on a keyboard (as shown in the figure below).

Eight bits can represent 256 different values, which is more than enough. Spaces, punctuation marks, digits, and uppercase and lowercase letters were assigned consecutive byte values, up to number 127, so a computer could store English text using one byte per character. This is the famous ASCII (American Standard Code for Information Interchange). At the time, all computers in the world used the same ASCII scheme to save English text.
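A minimal Java sketch of this one-byte-per-character idea: an English letter fits in a single byte, and its ASCII value is simply its character number (class and variable names here are illustrative).

```java
import java.nio.charset.StandardCharsets;

public class AsciiDemo {
    public static void main(String[] args) {
        // 'A' is number 65 (0x41) in ASCII; one byte is enough
        byte[] bytes = "A".getBytes(StandardCharsets.US_ASCII);
        System.out.println(bytes.length);   // 1
        System.out.println(bytes[0]);       // 65
        // The printable ASCII range ends just below 127 ('~' is 126)
        System.out.println((int) '~');      // 126
    }
}
```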

2.2 Non-ASCII encoding (development of Chinese character encoding)

With the rise of the Internet and the development of computer technology, computers came into use all over the world. But many countries do not use English, and many of the letters they need are not in ASCII.

In order to store their own text on computers, they decided to use the unused values after number 127 to represent these new letters and symbols. They also added many shapes needed for drawing tables, such as horizontal lines, vertical lines, and crosses, extending the numbering all the way to 255.

The characters from 128 to 255 are called the "extended character set". After that, there were no free values left to use.

As computers became popular in China, there were no byte values left to represent Chinese characters, yet more than 6,000 commonly used Chinese characters needed to be stored. This did not stump anyone: the odd symbols after 127 were simply dropped, and a new rule was laid down. A value less than 127 keeps its original meaning, but two consecutive bytes that are both greater than 127 together represent one Chinese character. The first byte (called the high byte) ranges from 0xA1 to 0xF7, and the second byte (the low byte) from 0xA1 to 0xFE, which allows about 7,000 simplified Chinese characters to be encoded.

These codes also cover mathematical symbols and Roman and Greek letters; even the digits, punctuation, and letters that already exist in ASCII were re-encoded as two-byte codes. These are the "full-width" characters we often speak of, while the original single-byte ones below 127 are called "half-width" characters. This Chinese character scheme was named "GB2312". GB2312 is a Chinese extension of ASCII.
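A quick sketch of the two-bytes-above-0xA1 rule, assuming the JRE provides the "GB2312" charset (the JDK ships it as an alias of EUC-CN, but availability can in principle vary):

```java
import java.nio.charset.Charset;

public class Gb2312Demo {
    public static void main(String[] args) {
        // "中" encodes to two bytes, both in the high range (>= 0xA1)
        byte[] bytes = "中".getBytes(Charset.forName("GB2312"));
        for (byte b : bytes) {
            System.out.printf("%02X ", b & 0xFF);   // D6 D0
        }
        System.out.println();
        // ASCII text is unchanged: one byte per "half-width" character
        System.out.println("abc".getBytes(Charset.forName("GB2312")).length); // 3
    }
}
```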

(The picture comes from the Internet)

But China has a great many Chinese characters, and it was soon discovered that many people's names still could not be typed. So the code points that GB2312 had left unused were pressed into service without ceremony. Later, even that was not enough, so the low byte was no longer required to be greater than 127: as long as the first byte is greater than 127, it always marks the beginning of a Chinese character, regardless of whether the following byte falls in the extended character set.

The resulting expanded scheme is called the GBK standard. GBK includes all of **GB2312** and adds nearly 20,000 new Chinese characters (including traditional characters) and symbols. Later, ethnic minorities also needed to use computers, so thousands of new minority-script characters were added, and GBK was extended into GB18030.

2.3 A hundred flowers bloom: problems caused by competing encoding standards

At that time, countries around the world developed their own encoding standards, just as China did. As a result, nobody understood or supported anyone else's encoding. Even the mainland and Taiwan, only 150 nautical miles apart and sharing the same language, used different DBCS (double-byte character set) encoding schemes.

Back then, if Chinese users wanted a computer to display Chinese characters, they had to install a "Chinese character system" to handle their display and input. And to run, say, a fortune-telling program written in Taiwan, they had to install yet another system that supported the BIG5 encoding.

A program had to know which encoding a Chinese character system such as "Yitian" used; if the wrong character system was installed, the display became a garbled mess! What could be done?

Moreover, among the world's nations there were still peoples with no access to computers at all. What about their scripts?

(The picture comes from Wikipedia)

2.4 Unicode

The world was in chaos; someone had to step in and fix it.

ISO (the International Organization for Standardization) decided to tackle this problem. Its approach was very simple: abolish all regional encoding schemes and build a new code that includes every script, letter, and symbol on Earth!

They called it the "Universal Multiple-Octet Coded Character Set", UCS for short, commonly known as "Unicode". Unicode acts as an abstraction layer, giving each character a unique code point.

The numbers 0x000000 to 0x10FFFF are used to correspond to all the world's languages, formulas, and symbols. These numbers are divided into 17 planes: the commonly used characters sit in 0x0000-0xFFFF, i.e. two bytes, called the Basic Multilingual Plane (BMP); the range 0x010000-0x10FFFF is divided into the other 16 planes.
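A short sketch of BMP versus supplementary planes as Java exposes them: a BMP character is one code point and one `char`, while a character beyond U+FFFF is still one code point but occupies two `char`s (the class name is illustrative).

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // "知" lives in the BMP: one code point, one Java char
        String bmp = "知";
        System.out.printf("U+%04X%n", bmp.codePointAt(0));   // U+77E5
        System.out.println(bmp.length());                    // 1

        // An emoji lives in a supplementary plane (> U+FFFF):
        // still one code point, but two Java chars (a surrogate pair)
        String emoji = "😀";
        System.out.printf("U+%04X%n", emoji.codePointAt(0)); // U+1F600
        System.out.println(emoji.length());                  // 2
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1
    }
}
```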

(The picture comes from Wikipedia)

Example: "v维"

If the "v dimension" string is placed in the memory, it will be 0x767ef4. The question is, how does the computer know that a few bytes represent a character? Is it 0x76? Or is it 0x7ef4? Or is it 0x767ef4?

Unicode only encodes the information source: it digitizes the character set and solves the mapping from character to number. The next problem to solve is storage and transmission.

3. Transmission and storage

In terms of communication theory:

Unicode is the source coding: it digitizes the character set;

UTF-32, UTF-16, and UTF-8 are channel codings, for better storage and transmission.

3.1 UTF-32

UTF-32 encoding is simple and direct: the code point value is stored in memory as-is, in four bytes. Its disadvantage is just as obvious: the letter A needs only 1 byte, yet UTF-32 stores it in 4 bytes, most of which are zero.

Question: Why should we save so many zeros?

3.2 UTF-16

Question : "Bright" The code point value is 20142, which is 0x4eae when replaced with hexadecimal. The memory is addressed by byte. So do we save 4e first? Or ae?

A Unicode code point is just a number. For characters in the Basic Multilingual Plane, UTF-16 stores that number directly in memory as two bytes.

In UTF-16 encoded files, in order to distinguish big-endian from little-endian, two extra bytes, FF and FE, are placed at the beginning: FE FF means big-endian, and FF FE means little-endian.

(BOM in Notepad)

Tips: FE FF and FF FE are also called a BOM (byte order mark), which distinguishes the different encodings. The smallest code unit of UTF-16 is two bytes, so byte order matters, and a BOM is added to tell big-endian from little-endian apart.
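The endianness and the BOM can both be seen with the standard charsets (a sketch; Java's plain "UTF-16" encoder happens to write a big-endian BOM):

```java
import java.nio.charset.StandardCharsets;

public class Utf16Demo {
    public static void main(String[] args) {
        String s = "亮"; // code point U+4EAE
        print(s.getBytes(StandardCharsets.UTF_16BE)); // 4E AE (no BOM)
        print(s.getBytes(StandardCharsets.UTF_16LE)); // AE 4E (no BOM)
        print(s.getBytes(StandardCharsets.UTF_16));   // FE FF 4E AE (BOM + big-endian)
    }
    static void print(byte[] bytes) {
        for (byte b : bytes) System.out.printf("%02X ", b & 0xFF);
        System.out.println();
    }
}
```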

3.3 UTF-8

UTF-8, as the name suggests, is a variable-length encoding whose code unit is 8 bits. One code point is encoded into 1 to 4 bytes, and those bytes are then stored in order.

Tips: UTF-8 comes in two flavors, with or without a BOM. The UTF-8 BOM is EF BB BF. UTF-8 itself has no endianness problem, because its smallest code unit is a single byte.

Since UTF-8 needs no big-endian/little-endian distinction, it needs no BOM. If a BOM is added anyway, some read operations may treat it as an ordinary character and cause errors. So when saving UTF-8 encoded files, it is usually best to save without a BOM.

Chestnut: "Knowing"

According to the UTF-8 encoding rules (1 byte: 0xxxxxxx; 2 bytes: 110xxxxx 10xxxxxx; 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx; 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx), the code point U+77E5 of "知" falls in the three-byte range:

U+77E5 is 0111 0111 1110 0101 in binary; filling those 16 bits into the x positions of 1110xxxx 10xxxxxx 10xxxxxx gives 11100111 10011111 10100101, i.e. the byte sequence E7 9F A5. This is the process of encoding U+77E5 into UTF-8, and decoding is simply the reverse.
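The round trip above can be checked in a few lines (a sketch; class name illustrative):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    public static void main(String[] args) {
        // U+77E5 splits into 4 + 6 + 6 bits:
        // 1110xxxx 10xxxxxx 10xxxxxx -> 11100111 10011111 10100101
        byte[] bytes = "知".getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) System.out.printf("%02X ", b & 0xFF); // E7 9F A5
        System.out.println();
        // Decoding is the exact reverse
        System.out.println(new String(bytes, StandardCharsets.UTF_8).equals("知")); // true
    }
}
```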

3.4 ANSI

"ANSI" refers to the legacy code page corresponding to the current system locale.

"ANSI" in Windows actually means Windows code pages: the system picks a specific legacy encoding based on the current locale. For example, a Simplified Chinese locale uses GBK. Calling these code pages "ANSI" is a Windows misnomer; they all agree with ASCII within the ASCII range.

3.5 Extended thinking

Q: Can a char variable in Java store a Chinese character, and why?

Answer: Java uses the Unicode character set internally (this by itself commits to no specific encoding form; each symbol is assigned a number, and the set currently contains more than a million symbols). Chinese characters are included in Unicode, and Java's char type occupies two bytes, holding one UTF-16 code unit, so a char can store any Chinese character in the Basic Multilingual Plane. (A character outside the BMP needs two chars, a surrogate pair.)
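This can be verified directly (a sketch; class name illustrative):

```java
public class CharDemo {
    public static void main(String[] args) {
        // A BMP Chinese character fits in one char (one UTF-16 code unit)
        char c = '知';
        System.out.printf("U+%04X%n", (int) c);               // U+77E5
        System.out.println(Character.isBmpCodePoint(0x77E5)); // true
        // A supplementary code point does not fit in one char
        System.out.println(Character.charCount(0x1F600));     // 2
    }
}
```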

Q: Tomcat's default encoding is ISO-8859-1. How do we solve the garbled-text problem in web projects?

Answer:

Method 1: modify Tomcat's conf/server.xml file

 <Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" URIEncoding="UTF-8" useBodyEncodingForURI="true"/>
  • URIEncoding="UTF-8": tells Tomcat (default ISO-8859-1) to decode URI request parameters as UTF-8.
  • useBodyEncodingForURI="true": means the URI query string is decoded with the same encoding as the request body.

Method 2:

1) When using a character stream to send page content to the browser, the ISO-8859-1 code table is consulted by default.

  • Setting 1: response.setCharacterEncoding("UTF-8")
  • Setting 2: response.setContentType("text/html;charset=UTF-8")

2) Solving garbled Chinese characters in requests the client sends to the server

  • POST request method: the form parameters are submitted in whatever encoding the browser page currently uses.

    Server-side handling:

    request.setCharacterEncoding("utf-8");

  • GET request method:

    String name = request.getParameter("name"); // first, get the raw parameter value
    // Tomcat decoded it as ISO-8859-1; recover the original bytes,
    // then decode them again as UTF-8 to get the correct value
    name = new String(name.getBytes("iso-8859-1"), "utf-8");

Note: Tomcat's Java EE implementation uses ISO-8859-1 by default when decoding parameters submitted in a form body, i.e. via POST. For GET requests the parameters come from the query string, which Tomcat processes differently from the POST body.

4. Summary

Going back to the question in the preface and working through the picture below: I wonder whether you now have a clearer understanding of character encoding...

In terms of communication theory:

Unicode is the source coding: it digitizes the character set;

UTF-32, UTF-16, and UTF-8 are channel codings, for better storage and transmission.


Author: vivo internet server team-Zhu Wenjin
