Introduction
Files on a computer fall broadly into two types: text files, which people can read directly, and binary files, which they cannot. Opening a binary file in a text editor usually shows garbled characters, and binary and text files are also stored and transmitted differently. Is there a way to convert a binary file into text so that it can be stored or transmitted as text? Yes.
That encoding is Base64, the topic of this article.
Base64 and its encoding principle
Base64 is a way of converting binary data into a text encoding. Binary data is made up of 0s and 1s and is usually handled in units of one byte, which is 8 bits, where each bit is either 0 or 1.
There are many text encodings. One of the earliest and simplest is ASCII, the American Standard Code for Information Interchange. It covers the English letters, digits, common punctuation, and a set of control characters.
ASCII code points run from 0x00 to 0x7F, or 0 to 127 in decimal: 128 characters in total, which is exactly the range that 7 bits can represent.
The ASCII encoding contains 33 control characters and 95 printable characters, as follows:
hex | decimal | binary | meaning | hex | decimal | binary | meaning |
---|---|---|---|---|---|---|---|
0x00 | 0 | 0 | NUL (null) | 0x40 | 64 | 1000000 | @ |
0x01 | 1 | 1 | SOH (start of heading) | 0x41 | 65 | 1000001 | A |
0x02 | 2 | 10 | STX (start of text) | 0x42 | 66 | 1000010 | B |
0x03 | 3 | 11 | ETX (end of text) | 0x43 | 67 | 1000011 | C |
0x04 | 4 | 100 | EOT (end of transmission) | 0x44 | 68 | 1000100 | D |
0x05 | 5 | 101 | ENQ (enquiry) | 0x45 | 69 | 1000101 | E |
0x06 | 6 | 110 | ACK (acknowledge) | 0x46 | 70 | 1000110 | F |
0x07 | 7 | 111 | BEL (bell) | 0x47 | 71 | 1000111 | G |
0x08 | 8 | 1000 | BS (backspace) | 0x48 | 72 | 1001000 | H |
0x09 | 9 | 1001 | HT (horizontal tab) | 0x49 | 73 | 1001001 | I |
0x0A | 10 | 1010 | LF (line feed) | 0x4A | 74 | 1001010 | J |
0x0B | 11 | 1011 | VT (vertical tab) | 0x4B | 75 | 1001011 | K |
0x0C | 12 | 1100 | FF (form feed) | 0x4C | 76 | 1001100 | L |
0x0D | 13 | 1101 | CR (carriage return) | 0x4D | 77 | 1001101 | M |
0x0E | 14 | 1110 | SO (shift out) | 0x4E | 78 | 1001110 | N |
0x0F | 15 | 1111 | SI (shift in) | 0x4F | 79 | 1001111 | O |
0x10 | 16 | 10000 | DLE (data link escape) | 0x50 | 80 | 1010000 | P |
0x11 | 17 | 10001 | DC1 (device control 1) | 0x51 | 81 | 1010001 | Q |
0x12 | 18 | 10010 | DC2 (device control 2) | 0x52 | 82 | 1010010 | R |
0x13 | 19 | 10011 | DC3 (device control 3) | 0x53 | 83 | 1010011 | S |
0x14 | 20 | 10100 | DC4 (device control 4) | 0x54 | 84 | 1010100 | T |
0x15 | 21 | 10101 | NAK (negative acknowledge) | 0x55 | 85 | 1010101 | U |
0x16 | 22 | 10110 | SYN (synchronous idle) | 0x56 | 86 | 1010110 | V |
0x17 | 23 | 10111 | ETB (end of transmission block) | 0x57 | 87 | 1010111 | W |
0x18 | 24 | 11000 | CAN (cancel) | 0x58 | 88 | 1011000 | X |
0x19 | 25 | 11001 | EM (end of medium) | 0x59 | 89 | 1011001 | Y |
0x1A | 26 | 11010 | SUB (substitute) | 0x5A | 90 | 1011010 | Z |
0x1B | 27 | 11011 | ESC (escape) | 0x5B | 91 | 1011011 | [ |
0x1C | 28 | 11100 | FS (file separator) | 0x5C | 92 | 1011100 | \ |
0x1D | 29 | 11101 | GS (group separator) | 0x5D | 93 | 1011101 | ] |
0x1E | 30 | 11110 | RS (record separator) | 0x5E | 94 | 1011110 | ^ |
0x1F | 31 | 11111 | US (unit separator) | 0x5F | 95 | 1011111 | _ |
0x20 | 32 | 100000 | (space) | 0x60 | 96 | 1100000 | ` |
0x21 | 33 | 100001 | ! | 0x61 | 97 | 1100001 | a |
0x22 | 34 | 100010 | " | 0x62 | 98 | 1100010 | b |
0x23 | 35 | 100011 | # | 0x63 | 99 | 1100011 | c |
0x24 | 36 | 100100 | $ | 0x64 | 100 | 1100100 | d |
0x25 | 37 | 100101 | % | 0x65 | 101 | 1100101 | e |
0x26 | 38 | 100110 | & | 0x66 | 102 | 1100110 | f |
0x27 | 39 | 100111 | ' | 0x67 | 103 | 1100111 | g |
0x28 | 40 | 101000 | ( | 0x68 | 104 | 1101000 | h |
0x29 | 41 | 101001 | ) | 0x69 | 105 | 1101001 | i |
0x2A | 42 | 101010 | * | 0x6A | 106 | 1101010 | j |
0x2B | 43 | 101011 | + | 0x6B | 107 | 1101011 | k |
0x2C | 44 | 101100 | , | 0x6C | 108 | 1101100 | l |
0x2D | 45 | 101101 | - | 0x6D | 109 | 1101101 | m |
0x2E | 46 | 101110 | . | 0x6E | 110 | 1101110 | n |
0x2F | 47 | 101111 | / | 0x6F | 111 | 1101111 | o |
0x30 | 48 | 110000 | 0 | 0x70 | 112 | 1110000 | p |
0x31 | 49 | 110001 | 1 | 0x71 | 113 | 1110001 | q |
0x32 | 50 | 110010 | 2 | 0x72 | 114 | 1110010 | r |
0x33 | 51 | 110011 | 3 | 0x73 | 115 | 1110011 | s |
0x34 | 52 | 110100 | 4 | 0x74 | 116 | 1110100 | t |
0x35 | 53 | 110101 | 5 | 0x75 | 117 | 1110101 | u |
0x36 | 54 | 110110 | 6 | 0x76 | 118 | 1110110 | v |
0x37 | 55 | 110111 | 7 | 0x77 | 119 | 1110111 | w |
0x38 | 56 | 111000 | 8 | 0x78 | 120 | 1111000 | x |
0x39 | 57 | 111001 | 9 | 0x79 | 121 | 1111001 | y |
0x3A | 58 | 111010 | : | 0x7A | 122 | 1111010 | z |
0x3B | 59 | 111011 | ; | 0x7B | 123 | 1111011 | { |
0x3C | 60 | 111100 | < | 0x7C | 124 | 1111100 | \| |
0x3D | 61 | 111101 | = | 0x7D | 125 | 1111101 | } |
0x3E | 62 | 111110 | > | 0x7E | 126 | 1111110 | ~ |
0x3F | 63 | 111111 | ? | 0x7F | 127 | 1111111 | DEL (delete) |
Base64 selects 64 characters from ASCII and uses each one to represent 6 bits of binary data. Since 6 bits can take 64 different values, this is where the 64 in Base64 comes from. Why ASCII? Because it is one of the earliest encodings and is supported by almost every computer application, so these characters pass through data transmission without being converted or altered, which makes them a safe choice.
Base64 also comes in several variants. In MIME, for example, the alphabet consists of A-Z, a-z, and 0-9, 62 characters in total, plus two additional characters to make 64.
Each of the 64 characters carries 6 bits, while binary data is normally handled one byte, or 8 bits, at a time. So how can 8-bit bytes be represented with 6-bit Base64 characters?
The answer is simple: join three 8-bit bytes into 24 bits, which can then be represented by exactly four Base64 characters.
Why convert binary data at all? Because some Internet protocols only support specific character sets. Email attachments are a common example: SMTP was originally designed to carry 7-bit ASCII characters, so a file has to be encoded as text before it can be sent.
Another use of Base64 is embedding images directly in the HTML of a web page, so that the image is displayed without a separate file.
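Embedding is usually done with a `data:` URI, which carries the Base64 text inline. The following is a minimal sketch, not a complete implementation: the helper name `to_data_uri` is hypothetical, and the 8-byte PNG signature stands in for the contents of a real image file.

```python
import base64

def to_data_uri(data: bytes, mime_type: str = "image/png") -> str:
    """Build a data: URI that embeds `data` as Base64 text (hypothetical helper)."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

# Placeholder input: the 8-byte PNG file signature, not a complete image.
png_signature = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(png_signature)

# The URI can be used wherever an image URL is expected.
print(f'<img src="{uri}">')
```

With a real image, the bytes would typically come from reading the file in binary mode.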
Useful as it is, Base64 has a drawback: each character carries only 6 bits of data but takes a full 8-bit byte to store, so the encoded output is about one third larger than the original binary file.
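The size increase can be checked with a little arithmetic: every 3 input bytes, or part thereof, become 4 output characters when padding is used. The sketch below assumes padded output; the function name `encoded_length` is made up for illustration, and the result is compared against Python's standard `base64` module.

```python
import base64

def encoded_length(n_bytes: int) -> int:
    """Length of padded Base64 output for n_bytes of input (illustrative helper)."""
    # Round the number of 3-byte groups up, then emit 4 characters per group.
    return 4 * ((n_bytes + 2) // 3)

for n in (1, 2, 3, 1000):
    actual = len(base64.b64encode(bytes(n)))  # bytes(n) is n zero bytes
    print(n, encoded_length(n), actual)
```

For 1000 input bytes this gives 1336 characters, an increase of roughly 33%.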
Variants of Base64
Base64 is simply a mapping from bit patterns to characters, so more than one mapping is possible. Let's look at the variants of Base64. In general the first 62 characters are the same; the differences lie in the last two characters and in the padding character, which is mandatory in some protocols and omitted in others.
The following table lists common Base64 variants:
encoding name | character at index 62 | character at index 63 | padding |
---|---|---|---|
RFC 1421: Base64 for Privacy-Enhanced Mail (deprecated) | + | / | = mandatory |
RFC 2045: Base64 transfer encoding for MIME | + | / | = mandatory |
RFC 2152: Base64 for UTF-7 | + | / | No |
RFC 3501: Base64 encoding for IMAP mailbox names | + | , | No |
RFC 4648: base64 (standard) | + | / | = optional |
RFC 4648: base64url (URL- and filename-safe standard) | - | _ | = optional |
RFC 4880: Radix-64 for OpenPGP | + | / | = mandatory |
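The difference between the two RFC 4648 alphabets is easy to see with Python's standard `base64` module. In this sketch the input bytes 0xFB 0xFF are an arbitrary choice, picked only because their encoding uses both index 62 and index 63. The last lines show one common way to handle variants that omit padding.

```python
import base64

raw = bytes([0xFB, 0xFF])  # chosen so the output uses indices 62 and 63

standard = base64.b64encode(raw).decode("ascii")
url_safe = base64.urlsafe_b64encode(raw).decode("ascii")
print(standard)  # +/8=
print(url_safe)  # -_8=

# For variants without padding: strip '=' after encoding,
# then restore it to a multiple of 4 characters before decoding.
unpadded = url_safe.rstrip("=")
restored = unpadded + "=" * (-len(unpadded) % 4)
decoded = base64.urlsafe_b64decode(restored)
print(decoded == raw)  # True
```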
Base64 encoding details
The previous sections covered the basic principle of Base64 and some common variants. So how exactly does the mapping work?
This section uses the standard form of Base64, RFC 4648, as a detailed example.
RFC 4648 uses + and / as the characters at indices 62 and 63, and = as the padding character.
Here is the RFC 4648 mapping table:
index | binary | char | index | binary | char | index | binary | char | index | binary | char |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 000000 | A | 16 | 010000 | Q | 32 | 100000 | g | 48 | 110000 | w |
1 | 000001 | B | 17 | 010001 | R | 33 | 100001 | h | 49 | 110001 | x |
2 | 000010 | C | 18 | 010010 | S | 34 | 100010 | i | 50 | 110010 | y |
3 | 000011 | D | 19 | 010011 | T | 35 | 100011 | j | 51 | 110011 | z |
4 | 000100 | E | 20 | 010100 | U | 36 | 100100 | k | 52 | 110100 | 0 |
5 | 000101 | F | 21 | 010101 | V | 37 | 100101 | l | 53 | 110101 | 1 |
6 | 000110 | G | 22 | 010110 | W | 38 | 100110 | m | 54 | 110110 | 2 |
7 | 000111 | H | 23 | 010111 | X | 39 | 100111 | n | 55 | 110111 | 3 |
8 | 001000 | I | 24 | 011000 | Y | 40 | 101000 | o | 56 | 111000 | 4 |
9 | 001001 | J | 25 | 011001 | Z | 41 | 101001 | p | 57 | 111001 | 5 |
10 | 001010 | K | 26 | 011010 | a | 42 | 101010 | q | 58 | 111010 | 6 |
11 | 001011 | L | 27 | 011011 | b | 43 | 101011 | r | 59 | 111011 | 7 |
12 | 001100 | M | 28 | 011100 | c | 44 | 101100 | s | 60 | 111100 | 8 |
13 | 001101 | N | 29 | 011101 | d | 45 | 101101 | t | 61 | 111101 | 9 |
14 | 001110 | O | 30 | 011110 | e | 46 | 101110 | u | 62 | 111110 | + |
15 | 001111 | P | 31 | 011111 | f | 47 | 101111 | v | 63 | 111111 | / |

Padding character: =
Let's take the word Man as an example and follow the Base64 encoding process.
The characters M, a, and n are 77, 97, and 110 in ASCII, which is 01001101, 01100001, and 01101110 in binary.
Joined together, the three bytes form the 24-bit sequence 010011010110000101101110. Splitting it into four 6-bit groups gives 010011, 010110, 000101, and 101110, which are indices 19, 22, 5, and 46. Looking these up in the table above gives T, W, F, and u, so Man becomes TWFu after Base64 encoding.
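The same walk-through can be reproduced step by step in Python. This is only an illustrative sketch for the 3-byte input "Man" (capital M, ASCII 77), using bit strings for readability; the name `ALPHABET` is introduced here for the RFC 4648 standard character table.

```python
import string

# RFC 4648 standard alphabet: A-Z, a-z, 0-9, '+', '/'
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

# Write each byte as 8 bits and join them into one 24-bit string.
bits = "".join(f"{byte:08b}" for byte in b"Man")
# Cut the bit string into 6-bit groups and look each one up in the table.
groups = [bits[i:i + 6] for i in range(0, len(bits), 6)]
indices = [int(group, 2) for group in groups]
encoded = "".join(ALPHABET[i] for i in indices)

print(bits)     # 010011010110000101101110
print(groups)   # ['010011', '010110', '000101', '101110']
print(indices)  # [19, 22, 5, 46]
print(encoded)  # TWFu
```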
In the example above, Man is exactly 3 characters, or 24 bits, so it maps cleanly onto four Base64 characters. What if we only have the two characters Ma?
As before, M and a are 01001101 and 01100001 in binary, which combine into 0100110101100001.
That is only 16 bits. Since each Base64 character covers 6 bits, three characters are needed, which cover 18 bits, so the input is 2 bits short and is filled with zeros:
0100110101100001 + 00 = 010011010110000100.
The groups 010011, 010110, and 000100 map to T, W, and E. Because Base64 output comes in blocks of 4 characters, the last position is filled with =, which means that Ma becomes TWE= after Base64 encoding.
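The padding rule generalizes to input of any length. Below is a minimal sketch of an encoder for the standard RFC 4648 alphabet; the function name `b64_encode` is made up for this example, and in practice Python's built-in `base64.b64encode` should be used instead. It zero-fills a short final chunk to a full 24 bits, which yields the same leading characters as the 2-bit fill described above, and then replaces the characters that contain only filler with =.

```python
import base64
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def b64_encode(data: bytes) -> str:
    """Encode bytes with the standard Base64 alphabet and '=' padding."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        missing = 3 - len(chunk)  # 0, 1, or 2 bytes short of a full group
        # Pack the chunk into a 24-bit integer, zero-filling missing bytes.
        n = int.from_bytes(chunk + b"\x00" * missing, "big")
        # Extract four 6-bit values, most significant first.
        chars = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Characters built purely from filler bits are replaced by '='.
        if missing:
            chars[4 - missing:] = ["="] * missing
        out.extend(chars)
    return "".join(out)

print(b64_encode(b"Man"))  # TWFu
print(b64_encode(b"Ma"))   # TWE=
print(b64_encode(b"M"))    # TQ==
```

Comparing the result with `base64.b64encode(data).decode("ascii")` for a few inputs is a quick way to check the sketch.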
Summary
That covers the basic meaning and conversion rules of Base64. The scheme itself is very simple: treat the data as a sequence of bits, split it into 6-bit groups, then map and pad it according to the conversion table.
This article has been included in http://www.flydean.com/18-base64-encoding/