Introduction

Files on a computer fall into two broad types: text files, which a person can read directly, and binary files, which cannot be read that way. Opening a binary file in a text editor generally shows garbled characters, and some storage and transmission channels handle only text. Is there a way to convert a binary file into text for transmission or storage? The answer is yes.

That conversion method is Base64 encoding, the subject of this article.

Base64 and its encoding principle

Base64 is a way of converting binary data into text. Binary data is a sequence of 0s and 1s, usually grouped into bytes of 8 bits each, where each bit is either 0 or 1.

There are many text encodings. One of the earliest and simplest is ASCII, the American Standard Code for Information Interchange. It covers the English letters, the digits, common punctuation, and a set of control characters.

ASCII code points range from 0x00 to 0x7F, which is 0 to 127 in decimal: 128 characters in total, exactly the range that 7 bits can represent.

The ASCII encoding contains 33 control characters and 95 printable characters, as follows:

Hex   Dec  Binary   Character                        Hex   Dec  Binary   Character
0x00  0    0000000  NUL (null)                       0x40  64   1000000  @
0x01  1    0000001  SOH (start of heading)           0x41  65   1000001  A
0x02  2    0000010  STX (start of text)              0x42  66   1000010  B
0x03  3    0000011  ETX (end of text)                0x43  67   1000011  C
0x04  4    0000100  EOT (end of transmission)        0x44  68   1000100  D
0x05  5    0000101  ENQ (enquiry)                    0x45  69   1000101  E
0x06  6    0000110  ACK (acknowledge)                0x46  70   1000110  F
0x07  7    0000111  BEL (bell)                       0x47  71   1000111  G
0x08  8    0001000  BS (backspace)                   0x48  72   1001000  H
0x09  9    0001001  HT (horizontal tab)              0x49  73   1001001  I
0x0A  10   0001010  LF (line feed)                   0x4A  74   1001010  J
0x0B  11   0001011  VT (vertical tab)                0x4B  75   1001011  K
0x0C  12   0001100  FF (form feed)                   0x4C  76   1001100  L
0x0D  13   0001101  CR (carriage return)             0x4D  77   1001101  M
0x0E  14   0001110  SO (shift out)                   0x4E  78   1001110  N
0x0F  15   0001111  SI (shift in)                    0x4F  79   1001111  O
0x10  16   0010000  DLE (data link escape)           0x50  80   1010000  P
0x11  17   0010001  DC1 (device control 1)           0x51  81   1010001  Q
0x12  18   0010010  DC2 (device control 2)           0x52  82   1010010  R
0x13  19   0010011  DC3 (device control 3)           0x53  83   1010011  S
0x14  20   0010100  DC4 (device control 4)           0x54  84   1010100  T
0x15  21   0010101  NAK (negative acknowledge)       0x55  85   1010101  U
0x16  22   0010110  SYN (synchronous idle)           0x56  86   1010110  V
0x17  23   0010111  ETB (end of transmission block)  0x57  87   1010111  W
0x18  24   0011000  CAN (cancel)                     0x58  88   1011000  X
0x19  25   0011001  EM (end of medium)               0x59  89   1011001  Y
0x1A  26   0011010  SUB (substitute)                 0x5A  90   1011010  Z
0x1B  27   0011011  ESC (escape)                     0x5B  91   1011011  [
0x1C  28   0011100  FS (file separator)              0x5C  92   1011100  \
0x1D  29   0011101  GS (group separator)             0x5D  93   1011101  ]
0x1E  30   0011110  RS (record separator)            0x5E  94   1011110  ^
0x1F  31   0011111  US (unit separator)              0x5F  95   1011111  _
0x20  32   0100000  (space)                          0x60  96   1100000  `
0x21  33   0100001  !                                0x61  97   1100001  a
0x22  34   0100010  "                                0x62  98   1100010  b
0x23  35   0100011  #                                0x63  99   1100011  c
0x24  36   0100100  $                                0x64  100  1100100  d
0x25  37   0100101  %                                0x65  101  1100101  e
0x26  38   0100110  &                                0x66  102  1100110  f
0x27  39   0100111  '                                0x67  103  1100111  g
0x28  40   0101000  (                                0x68  104  1101000  h
0x29  41   0101001  )                                0x69  105  1101001  i
0x2A  42   0101010  *                                0x6A  106  1101010  j
0x2B  43   0101011  +                                0x6B  107  1101011  k
0x2C  44   0101100  ,                                0x6C  108  1101100  l
0x2D  45   0101101  -                                0x6D  109  1101101  m
0x2E  46   0101110  .                                0x6E  110  1101110  n
0x2F  47   0101111  /                                0x6F  111  1101111  o
0x30  48   0110000  0                                0x70  112  1110000  p
0x31  49   0110001  1                                0x71  113  1110001  q
0x32  50   0110010  2                                0x72  114  1110010  r
0x33  51   0110011  3                                0x73  115  1110011  s
0x34  52   0110100  4                                0x74  116  1110100  t
0x35  53   0110101  5                                0x75  117  1110101  u
0x36  54   0110110  6                                0x76  118  1110110  v
0x37  55   0110111  7                                0x77  119  1110111  w
0x38  56   0111000  8                                0x78  120  1111000  x
0x39  57   0111001  9                                0x79  121  1111001  y
0x3A  58   0111010  :                                0x7A  122  1111010  z
0x3B  59   0111011  ;                                0x7B  123  1111011  {
0x3C  60   0111100  <                                0x7C  124  1111100  |
0x3D  61   0111101  =                                0x7D  125  1111101  }
0x3E  62   0111110  >                                0x7E  126  1111110  ~
0x3F  63   0111111  ?                                0x7F  127  1111111  DEL (delete)
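
The counts quoted for the table can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are made up for the example.

```python
# Classify all 128 ASCII code points as control or printable.
control = [c for c in range(128) if c < 0x20 or c == 0x7F]
printable = [c for c in range(128) if 0x20 <= c <= 0x7E]

print(len(control), len(printable))  # 33 95

# The largest code point, 127, needs exactly 7 bits.
print((127).bit_length())  # 7
```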

Base64 selects 64 characters from ASCII and maps each 6-bit group of the binary data to one of them; that is the meaning of the 64 in Base64. Why ASCII? Because ASCII is one of the earliest encodings and is supported by virtually all computer systems, and its printable characters generally pass through transmission channels unchanged, which makes it a safe choice.

Base64 itself comes in several variants. In MIME, for example, Base64 uses A-Z, a-z, and 0-9, 62 characters in total, plus two more characters (which differ between variants) to make up the 64.

Each of the 64 characters stands for 6 bits, whereas binary data is normally handled in bytes of 8 bits. So the question is: how do we represent 8-bit bytes with 6-bit Base64 characters?

It is simple: concatenate three 8-bit bytes into 24 bits, which can then be represented by exactly four Base64 characters (4 x 6 = 24).
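
As a rough sketch of this regrouping, the snippet below splits three bytes into four 6-bit values. The function name split_into_sextets and the sample bytes are assumptions made for illustration only.

```python
def split_into_sextets(chunk: bytes) -> list[int]:
    """Split exactly 3 bytes (24 bits) into four 6-bit values."""
    if len(chunk) != 3:
        raise ValueError("expected exactly 3 bytes")
    n = int.from_bytes(chunk, "big")  # one 24-bit integer
    # Take the bits 6 at a time, most significant group first.
    return [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]

print(split_into_sextets(bytes([77, 97, 110])))  # [19, 22, 5, 46]
```

Each of the four values lies between 0 and 63, so each one can serve as an index into a 64-character alphabet.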

Why convert binary to text at all? Because some Internet protocols support only specific character sets. A common case is email attachments: SMTP was originally designed to carry 7-bit ASCII text, so to transmit an arbitrary file we need to encode it first.

Another use of Base64 is embedding images directly in HTML pages (as data URIs), so that an image can be displayed without a separate file.

Useful as it is, Base64 has a drawback: each output character carries only 6 bits of data but occupies a full 8-bit byte, so the encoded output is about one third (33%) larger than the original binary.
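
A small check of the 7-bit property, using Python's standard base64 module. The payload here is just every possible byte value, chosen for illustration.

```python
import base64

payload = bytes(range(256))          # includes values that do not fit in 7 bits
encoded = base64.b64encode(payload)  # bytes made up of Base64 characters only

print(max(payload))                      # 255
print(all(b < 0x80 for b in encoded))    # True: every output byte is 7-bit ASCII
```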
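
A minimal sketch of how such an embedded image could be produced, assuming the data URI form data:<mime type>;base64,<data>. The helper name to_data_uri is hypothetical, and the 8-byte PNG signature stands in for a real image file.

```python
import base64

def to_data_uri(data: bytes, mime_type: str) -> str:
    """Build a data URI that carries the bytes as Base64 text."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

# Stand-in bytes: the 8-byte signature every PNG file starts with.
png_signature = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(png_signature, "image/png")

print(f'<img src="{uri}">')
```

In practice the bytes would come from reading a real image file in binary mode.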
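
The size growth can be estimated directly: every 3 input bytes (or a final partial group) become 4 output characters. The sketch below assumes padded output and compares the formula with Python's base64 module; encoded_length is a name invented for this example.

```python
import base64

def encoded_length(n: int) -> int:
    """Length of padded Base64 output for n input bytes."""
    return 4 * ((n + 2) // 3)  # 4 characters per (possibly partial) 3-byte group

for n in (1, 2, 3, 300, 1000):
    actual = len(base64.b64encode(bytes(n)))
    print(n, encoded_length(n), actual)
```

For 300 bytes the output is 400 characters, which is the one-third overhead.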

Variants of Base64

Base64 is simply a mapping from 6-bit values to characters, so more than one mapping is possible. Let's look at the variants of Base64 encoding. Generally speaking, the first 62 characters are the same; the differences lie in the last two characters and in the padding character (which is mandatory in some protocols and omitted in others).

The following table lists common Base64 variants:

Encoding name                                            Char 62  Char 63  Padding
RFC 1421: Base64 for Privacy-Enhanced Mail (deprecated)  +        /        = mandatory
RFC 2045: Base64 transfer encoding for MIME              +        /        = mandatory
RFC 2152: Base64 for UTF-7                               +        /        none
RFC 3501: Base64 encoding for IMAP mailbox names         +        ,        none
RFC 4648: base64 (standard)                              +        /        = optional
RFC 4648: base64url (URL- and filename-safe standard)    -        _        = optional
RFC 4880: Radix-64 for OpenPGP                           +        /        = mandatory
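
The difference between variants can be seen with Python's base64 module. This sketch only illustrates the alphabets: the altchars call swaps in the IMAP pair of characters but does not implement the rest of the IMAP mailbox-name encoding, and decode_unpadded is a hypothetical helper.

```python
import base64

# Bytes chosen so that the four 6-bit groups are 62, 63, 63, 62.
data = bytes([0xFB, 0xFF, 0xFE])

standard = base64.b64encode(data).decode("ascii")
url_safe = base64.urlsafe_b64encode(data).decode("ascii")
custom = base64.b64encode(data, altchars=b"+,").decode("ascii")

print(standard)  # +//+
print(url_safe)  # -__-
print(custom)    # +,,+

def decode_unpadded(text: str) -> bytes:
    """Restore the optional '=' padding before decoding base64url text."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))

print(decode_unpadded("TWE"))  # b'Ma'
```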

Base64 encoding details

In the previous section, we talked about the basic principles of Base64 encoding and some common variants, so how exactly is it mapped?

In this section, we will take the standard form of Base64, RFC 4648, as an example to explain in detail.

RFC 4648 uses the two characters + and / for indexes 62 and 63 of the alphabet, and uses = as the padding character.

Let's first observe the mapping table of RFC 4648:

Index  Binary  Char  Index  Binary  Char  Index  Binary  Char  Index  Binary  Char
0      000000  A     16     010000  Q     32     100000  g     48     110000  w
1      000001  B     17     010001  R     33     100001  h     49     110001  x
2      000010  C     18     010010  S     34     100010  i     50     110010  y
3      000011  D     19     010011  T     35     100011  j     51     110011  z
4      000100  E     20     010100  U     36     100100  k     52     110100  0
5      000101  F     21     010101  V     37     100101  l     53     110101  1
6      000110  G     22     010110  W     38     100110  m     54     110110  2
7      000111  H     23     010111  X     39     100111  n     55     110111  3
8      001000  I     24     011000  Y     40     101000  o     56     111000  4
9      001001  J     25     011001  Z     41     101001  p     57     111001  5
10     001010  K     26     011010  a     42     101010  q     58     111010  6
11     001011  L     27     011011  b     43     101011  r     59     111011  7
12     001100  M     28     011100  c     44     101100  s     60     111100  8
13     001101  N     29     011101  d     45     101101  t     61     111101  9
14     001110  O     30     011110  e     46     101110  u     62     111110  +
15     001111  P     31     011111  f     47     101111  v     63     111111  /
Padding: =
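
The whole table follows a simple pattern (uppercase letters, lowercase letters, digits, then + and /), so it can be rebuilt in a couple of lines. This is an illustrative sketch; the name ALPHABET is an assumption of the example.

```python
import string

# Index 0-25: A-Z, 26-51: a-z, 52-61: 0-9, 62: '+', 63: '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

for index in (0, 25, 26, 51, 52, 61, 62, 63):
    print(index, format(index, "06b"), ALPHABET[index])
```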

Let's take the word Man as an example to observe the encoding process of Base64.

The letters M, a, and n are 77, 97, and 110 in ASCII, which are 01001101, 01100001, and 01101110 in binary.

Concatenating the three bytes gives 010011010110000101101110, 24 bits in total. Splitting these into four 6-bit groups (010011, 010110, 000101, 101110) gives the indexes 19, 22, 5, and 46, and looking them up in the table above yields T, W, F, and u. So Man becomes TWFu after Base64 encoding.
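
The same steps can be traced in code for the bytes 77, 97, 110. This is a sketch of the walk-through rather than a production encoder; ALPHABET is a name chosen for the example.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

data = bytes([77, 97, 110])

# Step 1: write every byte as 8 bits and join them.
bits = "".join(format(byte, "08b") for byte in data)
print(bits)  # 010011010110000101101110

# Step 2: cut the bit string into 6-bit groups.
groups = [bits[i:i + 6] for i in range(0, len(bits), 6)]
print(groups)  # ['010011', '010110', '000101', '101110']

# Step 3: use each group as an index into the alphabet.
encoded = "".join(ALPHABET[int(group, 2)] for group in groups)
print(encoded)  # TWFu
```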

In the example above, the input is exactly 3 characters, that is, 24 bits, which Base64 can represent without any remainder. What if we only have the two characters Ma?

As before, the bytes of Ma are 01001101 and 01100001 in binary, and concatenated they give 0100110101100001.

These are only 16 bits. Since each Base64 character covers 6 bits, three characters (18 bits) are needed; the input is two bits short, so two 0 bits are appended:

0100110101100001 + 00 = 010011010110000100

Converting 010011010110000100 with the table gives TWE. Because Base64 output is produced in groups of 4 characters, the last position is filled with =, so Ma becomes TWE= after Base64 encoding.
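
Putting zero-filling and = padding together, a complete encoder can be sketched as follows. The function name b64encode_manual is hypothetical, and the loop at the end only cross-checks the sketch against Python's base64 module.

```python
import base64
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def b64encode_manual(data: bytes) -> str:
    """Encode bytes with the RFC 4648 alphabet, including '=' padding."""
    bits = "".join(format(byte, "08b") for byte in data)
    bits += "0" * (-len(bits) % 6)  # zero-fill the last 6-bit group
    encoded = "".join(
        ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)
    )
    return encoded + "=" * (-len(encoded) % 4)  # pad to a multiple of 4 characters

print(b64encode_manual(bytes([77, 97])))  # TWE=
print(b64encode_manual(bytes([77])))      # TQ==

# Cross-check against the standard library for a few input lengths.
for n in range(10):
    sample = bytes(range(n))
    assert b64encode_manual(sample) == base64.b64encode(sample).decode("ascii")
```

A single input byte leaves four bits to fill and needs two = characters, while two input bytes need only one.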

Summary

The above covers the basic meaning and conversion rules of Base64. The scheme is quite simple: treat the data as a stream of bits, split it into 6-bit groups, map each group to a character through the conversion table, and pad the result where necessary.

This article is also published at http://www.flydean.com/18-base64-encoding/

flydean