By running a few simple experiments in the operating system, we can deepen our understanding and memory of the basics of Unicode encoding.

On Windows 10, create a new Notepad file and enter 123ABCabc.

The default encoding is UTF-8:

Open the Notepad file with WinHex, a hex editor:

WinHex shows the bytes 31 32 33 41 42 43 61 62 63. What do these numbers mean?

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, and its single-byte encodings are identical to ASCII, so software that was written to handle ASCII text can continue to be used with little or no modification.
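To make "variable-length" concrete, here is a minimal Python sketch that counts the UTF-8 bytes of a few characters. The four sample characters are my own illustrative picks (one for each possible length), not something taken from the experiment above.

```python
# Sample characters are illustrative choices, one per UTF-8 length class.
# Escapes are used so the code points are unambiguous:
# U+0041 (A), U+00E9 (e with acute), U+6C6A (a Chinese character), U+1F600 (an emoji).
samples = ["\u0041", "\u00e9", "\u6c6a", "\U0001f600"]


def utf8_length(ch: str) -> int:
    """Return how many bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


lengths = {ch: utf8_length(ch) for ch in samples}

for ch, n in lengths.items():
    # Print only the code point and the hex bytes, so the output is safe
    # on consoles that cannot display every character.
    print(f"U+{ord(ch):04X} -> {n} byte(s): {ch.encode('utf-8').hex(' ').upper()}")
```

Only the first sample, which is plain ASCII, fits in a single byte.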

ASCII stands for American Standard Code for Information Interchange and was designed for American English text. It defines 128 characters: uppercase and lowercase letters, the digits 0-9, punctuation, and non-printing control characters (line feed, tab, backspace, bell, etc.).

An ASCII table can be obtained from this link.

The UTF-8 (ASCII) encodings of the digits 1, 2, and 3 are hexadecimal 31, 32, and 33, respectively:

The UTF-8 (ASCII) encoding of uppercase ABC is 41 42 43, and that of lowercase abc is 61 62 63:
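The byte values seen in the hex editor can be reproduced without Notepad. The following is a small Python sketch that encodes the same sample text and prints it the way a hex editor would; the variable names are only for illustration.

```python
# Encode the sample text from the experiment in both UTF-8 and ASCII.
text = "123ABCabc"
utf8_bytes = text.encode("utf-8")
ascii_bytes = text.encode("ascii")

# bytes.hex(sep) (Python 3.8+) renders each byte as two hex digits.
hex_dump = utf8_bytes.hex(" ").upper()
print(hex_dump)

# For pure ASCII text the two encodings produce identical bytes.
print(utf8_bytes == ascii_bytes)
```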

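Change the encoding to ANSI:

The content in WinHex is unchanged, because every character in the file is ASCII, and ANSI code pages encode ASCII characters with the same single bytes as UTF-8.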
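Why nothing changes can be sketched in Python. Note that "ANSI" in Notepad means the system's legacy code page, which differs between machines; `cp1252` (Western European) and `gbk` (Simplified Chinese) below are assumed examples, not necessarily the code page used in the original experiment.

```python
text = "123ABCabc"

# Assumed examples of Windows "ANSI" code pages; your system may use another.
ansi_candidates = ["cp1252", "gbk"]

utf8_bytes = text.encode("utf-8")

# Compare each code page's output with the UTF-8 bytes.
same_as_utf8 = {name: text.encode(name) == utf8_bytes for name in ansi_candidates}
print(same_as_utf8)
```

The comparison only holds because the text is pure ASCII; characters outside ASCII would generally be encoded differently.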

After changing Notepad's encoding to UTF-8 with BOM:

Three extra bytes, EF BB BF, appear at the beginning of the file content in WinHex.

BOM stands for byte order mark. It was designed for UTF-16 and UTF-32, where it marks the byte order. Microsoft also uses the BOM in UTF-8 files because it clearly distinguishes UTF-8 from other encodings such as ASCII.

The sequence EF BB BF can be understood as a special marker that explicitly indicates that the file is encoded in UTF-8:

https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding
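The three marker bytes can also be produced in Python, whose `utf-8-sig` codec corresponds to "UTF-8 with BOM". This is a minimal sketch and the variable names are my own.

```python
import codecs

text = "123ABCabc"

with_bom = text.encode("utf-8-sig")   # Python's codec for UTF-8 with a BOM
without_bom = text.encode("utf-8")    # plain UTF-8, no BOM

# The first three bytes are the BOM; the rest is the ordinary UTF-8 data.
print(with_bom[:3].hex(" ").upper())
has_bom = with_bom.startswith(codecs.BOM_UTF8)

# The BOM is simply the code point U+FEFF encoded in UTF-8.
bom_from_code_point = "\ufeff".encode("utf-8")

# Decoding with utf-8-sig strips the marker again.
decoded = with_bom.decode("utf-8-sig")
print(decoded)
```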

Correspondingly, after changing the encoding to UTF-16 BE in Notepad, the file begins with FE FF, and the previous 31 32 33 become the two-byte sequences 00 31 00 32 00 33:
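The same layout can be rebuilt by hand in Python. The sketch below prepends the big-endian BOM to the `utf-16-be` encoding of "123", which is how I assume Notepad assembles the file.

```python
import codecs

text = "123"

# utf-16-be emits no BOM by itself, so add the big-endian BOM explicitly.
file_bytes = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

hex_dump = file_bytes.hex(" ").upper()
print(hex_dump)
```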

Now try Chinese.

Enter the Chinese character 汪 (Wang) in Notepad and save it as UTF-8:

E6 B1 AA is the three-byte UTF-8 encoding of the Chinese character 汪 (Wang), according to the website.

AA occupies one byte, 8 bits: 1010 1010
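To see where bits such as 1010 1010 come from, here is a hedged Python sketch of the three-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx) applied to U+6C6A, the code point of 汪 (Wang). The helper `encode_utf8_3byte` is a hypothetical name for illustration and skips details such as rejecting surrogates; real code should just call `str.encode`.

```python
def encode_utf8_3byte(code_point: int) -> bytes:
    """Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles three-byte code points")
    byte1 = 0b11100000 | (code_point >> 12)          # 1110xxxx: top 4 bits
    byte2 = 0b10000000 | ((code_point >> 6) & 0x3F)  # 10xxxxxx: middle 6 bits
    byte3 = 0b10000000 | (code_point & 0x3F)         # 10xxxxxx: low 6 bits
    return bytes([byte1, byte2, byte3])


encoded = encode_utf8_3byte(0x6C6A)

# Show each byte as 8 binary digits to expose the prefix bits.
bit_groups = [format(b, "08b") for b in encoded]
print(encoded.hex(" ").upper(), bit_groups)
```

In the last byte, the leading `10` is the continuation prefix and the remaining `101010` are the low six bits of the code point.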

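In UTF-16 LE, the same character is stored as the two bytes 6A 6C, which is the code point U+6C6A in little-endian byte order.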
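The byte order can be checked with a short Python sketch that encodes U+6C6A, the code point of 汪 (Wang), in both UTF-16 byte orders. The variable names are only illustrative.

```python
ch = "\u6c6a"  # the Chinese character used in the experiment

le_bytes = ch.encode("utf-16-le")  # least significant byte first
be_bytes = ch.encode("utf-16-be")  # most significant byte first

print(le_bytes.hex(" ").upper())
print(be_bytes.hex(" ").upper())

# Reading the little-endian bytes back as an integer recovers the code point.
code_point = int.from_bytes(le_bytes, "little")
print(f"U+{code_point:04X}")
```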


The byte 3A stands for a colon:

The byte 22 stands for a quotation mark:
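Both punctuation bytes can be looked up in the other direction as well. This minimal Python sketch decodes the two hex values back into characters.

```python
# bytes.fromhex accepts space-separated hex digits.
raw = bytes.fromhex("3A 22")

# Both values are below 0x80, so plain ASCII decoding is enough.
decoded = raw.decode("ascii")
print(decoded)

# And forward again: each character maps to its hex code.
codes = [format(ord(c), "02X") for c in decoded]
print(codes)
```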

More of Jerry's original articles can all be found at "Wang Zixi":

