Learning the iconv extension's functions in PHP

Most of us have used the iconv extension at some point. It ships with PHP by default, has been around for a long time, and is what we usually reach for when converting character encodings. But besides iconv() itself, do you know any of its other functions? Today, let's look at the rest of what the iconv extension offers.

Setting and getting iconv configuration

First, we can set the default input and output character encodings used by the iconv extension.

iconv_set_encoding("internal_encoding", "UTF-8");
// Deprecated: iconv_set_encoding(): Use of iconv.internal_encoding is deprecated
iconv_set_encoding("output_encoding", "ISO-8859-1");
// Deprecated: iconv_set_encoding(): Use of iconv.output_encoding is deprecated
var_dump(iconv_get_encoding());
// array(3) {
//     ["input_encoding"]=>
//     string(5) "UTF-8"
//     ["output_encoding"]=>
//     string(10) "ISO-8859-1"
//     ["internal_encoding"]=>
//     string(5) "UTF-8"
//   }

iconv_set_encoding() takes two parameters: the type of setting to change and the encoding to assign to it. The type can be internal_encoding, input_encoding, or output_encoding, which stand for the internal, input, and output encodings respectively. In this test code we set internal_encoding to UTF-8 and output_encoding to ISO-8859-1, then call iconv_get_encoding() to print the current iconv settings. As you can see, input_encoding, which we did not touch, is also UTF-8 by default in this environment.

Note, however, that iconv_set_encoding() is effectively deprecated: setting any of these three types triggers a deprecation warning, as the comments in the code above show. The recommended approach now is to set default_charset in php.ini instead.
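As a minimal sketch of that recommendation, the snippet below sets default_charset at runtime with ini_set() (an assumption made for illustration; the article itself suggests php.ini) and then prints what iconv reports. It assumes the iconv.* ini settings are left empty, in which case iconv is expected to fall back to default_charset.

```php
<?php
// Set the charset once, instead of calling the deprecated iconv_set_encoding().
ini_set('default_charset', 'UTF-8');

// Ask iconv which encodings it currently reports for all three types.
$encodings = iconv_get_encoding('all');

foreach ($encodings as $name => $value) {
    echo $name, ' => ', $value, PHP_EOL;
}
```

No deprecation warning should appear here, because none of the iconv.* settings is written to.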

Getting length, finding positions, and extracting substrings by encoding

When working with Chinese strings, the default strlen() and similar functions return the wrong character count because they count bytes, not characters. A Chinese character normally takes three bytes in UTF-8 and two bytes in GBK, so strlen() returns 3 for a single Chinese character in a UTF-8 environment. In most cases we use the mbstring extension to handle this, but iconv also provides several functions for string manipulation.

echo iconv_strlen("测试长度测试长度"), PHP_EOL; // 8
echo iconv_strlen("测试长度测试长度", 'ISO-8859-1'), PHP_EOL; // 24
echo iconv_strlen("测试长度测试长度", 'GBK'), PHP_EOL; // 12

echo '======', PHP_EOL;

echo iconv_strpos("测试长度测试长度", "长"), PHP_EOL; // 2
echo iconv_strpos("测试长度测试长度", "长", 0, 'ISO-8859-1'), PHP_EOL; // 6
echo iconv_strpos("测试长度测试长度", "长", 0, 'GBK'), PHP_EOL; // (prints nothing: the call returns false)

echo '======', PHP_EOL;

echo iconv_strrpos("测试长度测试长度", "长"), PHP_EOL; // 6
echo iconv_strrpos("测试长度测试长度", "长", 'ISO-8859-1'), PHP_EOL; // 18

echo '======', PHP_EOL;

echo iconv_substr("测试长度测试长度", 2, 4), PHP_EOL; // 长度测试
echo iconv_substr("测试长度测试长度", 6, 12, 'ISO-8859-1'), PHP_EOL; // 长度测试
echo iconv_substr("测试长度测试长度", 3, 6, 'GBK'), PHP_EOL; // 长度测试

iconv_strlen() returns the length of a string in characters. If the second parameter is omitted, the default character set is used. The test code shows that the same eight Chinese characters produce a different count under each encoding: 8 as UTF-8, 24 as ISO-8859-1 (one character per byte), and 12 as GBK. The GBK result looks as if each Chinese character took 1.5 bytes, but the string itself is still UTF-8. The likely explanation is that its 24 bytes are read two at a time as GBK double-byte characters, so the 12 is not a meaningful character count.
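To check these counts on a string that really is GBK-encoded, the sketch below writes "测试头" as raw GBK bytes (copied from the Q-encoded header =B2=E2=CA=D4=CD=B7 that appears in the MIME section of this article) and measures it both ways. The variable names are only illustrative.

```php
<?php
// "测试头" written out as GBK bytes: two bytes per Chinese character.
$gbk = "\xB2\xE2\xCA\xD4\xCD\xB7";

// The same text converted to UTF-8: three bytes per Chinese character.
$utf8 = iconv('GBK', 'UTF-8', $gbk);

$gbkBytes  = strlen($gbk);                 // byte count of the GBK string
$gbkChars  = iconv_strlen($gbk, 'GBK');    // character count, read as GBK
$utf8Bytes = strlen($utf8);                // byte count of the UTF-8 string
$utf8Chars = iconv_strlen($utf8, 'UTF-8'); // character count, read as UTF-8

echo "GBK:   {$gbkBytes} bytes, {$gbkChars} characters", PHP_EOL;
echo "UTF-8: {$utf8Bytes} bytes, {$utf8Chars} characters", PHP_EOL;
```

When the encoding passed to iconv_strlen() matches the actual encoding of the bytes, both strings report three characters, even though their byte lengths differ.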

iconv_strpos() and iconv_strrpos() work like strpos() and strrpos(): the former returns the position of the first occurrence of the needle, searching from left to right, and the latter returns the position of its last occurrence. The third parameter of iconv_strpos() is the offset, in characters, at which the search starts; iconv_strrpos() has no offset parameter, so its third parameter is the encoding. The GBK call prints nothing because it returns false: the haystack and needle are really UTF-8 bytes, and reading them as GBK double-byte characters means the needle can no longer be located.

iconv_substr() extracts part of a string. Its offset and length are also counted in characters of the given encoding, which is why the three calls above need different numbers to return the same text.
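The following sketch repeats the position and substring calls with the encoding passed explicitly, so the results do not depend on the configured default. It assumes the source file is saved as UTF-8; the variable names are illustrative only.

```php
<?php
$text = "测试长度测试长度";

// First occurrence, searching from character 0.
$first = iconv_strpos($text, "长", 0, 'UTF-8');

// Same search, but starting at character 3, so the first match is skipped.
$next = iconv_strpos($text, "长", 3, 'UTF-8');

// Last occurrence; note there is no offset argument here.
$last = iconv_strrpos($text, "长", 'UTF-8');

// Four characters starting at character 2.
$slice = iconv_substr($text, 2, 4, 'UTF-8');

// A needle that does not occur yields false, so compare with ===.
$missing = iconv_strpos($text, "无", 0, 'UTF-8');

echo $first, PHP_EOL;  // 2
echo $next, PHP_EOL;   // 6
echo $last, PHP_EOL;   // 6
echo $slice, PHP_EOL;  // 长度测试
var_dump($missing);    // bool(false)
```

Because a match at position 0 and a failed search both look falsy, the result should be compared against false with the strict operator rather than tested with a plain if.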

Converting character encodings with iconv()

Next comes the star of the extension, the iconv() function itself. There is not much to say about it: it converts a string from one encoding to another. I expect everyone is already familiar with it.

$phone = file_get_contents('https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=13888888888');

print_r($phone);
// __GetZoneResult_ = {
//     mts:'1388888',
//     province:'����',
//     catName:'�й��ƶ�',
//     telString:'13888888888',
//         areaVid:'30515',
//         ispVid:'3236139',
//         carrier:'�����ƶ�'
// }

print_r(iconv('GBK', 'UTF-8', $phone));
// __GetZoneResult_ = {
//     mts:'1388888',
//     province:'云南',
//     catName:'中国移动',
//     telString:'13888888888',
//         areaVid:'30515',
//         ispVid:'3236139',
//         carrier:'云南移动'
// }

print_r(iconv('GBK', 'ISO-8859-1//IGNORE', $phone));
// __GetZoneResult_ = {
//     mts:'1388888',
//     province:'',
//     catName:'',
//     telString:'13888888888',
//         areaVid:'30515',
//         ispVid:'3236139',
//         carrier:''
// }

This is a public Taobao interface for looking up information about a mobile phone number, and the data it returns happens to be GBK-encoded. Printing the result directly in a UTF-8 environment produces garbled text. With iconv() we can convert it to UTF-8 and print it correctly. In the third test we appended //IGNORE to the target encoding, which tells iconv to skip characters that cannot be represented in the target. As a result, when we convert to the unsuitable ISO-8859-1, the Chinese text disappears: it cannot be converted, so it is dropped.
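Since iconv() returns false when a conversion fails, it can be worth wrapping it. The sketch below is one possible approach, not part of the original test code: to_utf8() is a hypothetical helper name, and the GBK bytes are "测试头" written out by hand so that the example does not need a network request.

```php
<?php
/**
 * Hypothetical helper: convert $bytes from $from to UTF-8.
 * Throws instead of returning false when the conversion fails.
 */
function to_utf8(string $bytes, string $from): string
{
    // @ silences the notice iconv() raises on illegal input;
    // the false return value is checked explicitly below.
    $converted = @iconv($from, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from {$from} to UTF-8");
    }
    return $converted;
}

// "测试头" as GBK bytes, converted successfully.
echo to_utf8("\xB2\xE2\xCA\xD4\xCD\xB7", 'GBK'), PHP_EOL;

// 0xFF can never appear in valid UTF-8, so this input is rejected.
try {
    to_utf8("\xFF", 'UTF-8');
} catch (RuntimeException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```

Throwing makes a bad encoding visible at the call site, whereas //IGNORE silently drops whatever could not be converted.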

MIME header operations

Finally, let's look at something that is rarely used: iconv can also decode and encode the content of MIME headers. MIME headers describe a message or a piece of content; the MIME type we check when validating an uploaded file is one familiar example. They are also used extensively in email. If you have done any development around sending and receiving email and captured the traffic, you have certainly seen content like the following.

$headers_string = <<<EOF
Subject: =?UTF-8?B?UHLDvGZ1bmcgUHLDvGZ1bmc=?=
To: example@example.com
Date: Thu, 1 Jan 1970 00:00:00 +0000
Message-Id: <example@example.com>
Received: from localhost (localhost [127.0.0.1]) by localhost
    with SMTP id example for <example@example.com>;
    Thu, 1 Jan 1970 00:00:00 +0000 (UTC)
    (envelope-from example-return-0000-example=example.com@example.com)
Received: (qmail 0 invoked by uid 65534); 1 Thu 2003 00:00:00 +0000
EOF;

The Subject field is the title of the email, and To is the address of the recipient. Here we mainly care about Subject. Its value starts with =?UTF-8?B?, which names the character set (UTF-8) and the encoding scheme (B), followed by a run of unreadable characters. That run is simply Base64-encoded text, and decoding it with the stated character set reveals the original content. However, we can also let iconv decode it directly.
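To make the manual route concrete, here is a small sketch that takes such a value apart by hand. decode_b_word() is a hypothetical helper written for this illustration; it handles only a single encoded word with the B scheme and is not a replacement for the iconv MIME functions.

```php
<?php
/**
 * Hypothetical helper: decode one Base64 ("B") encoded word by hand.
 * Returns null when $word is not of the form =?charset?B?text?=
 * or when decoding fails.
 */
function decode_b_word(string $word, string $target = 'UTF-8'): ?string
{
    // Capture the charset name and the Base64 payload.
    if (!preg_match('/^=\?([^?]+)\?[Bb]\?([^?]*)\?=$/', $word, $m)) {
        return null;
    }
    $raw = base64_decode($m[2], true);
    if ($raw === false) {
        return null;
    }
    // Convert from the declared charset to the requested one.
    $converted = @iconv($m[1], $target, $raw);
    return $converted === false ? null : $converted;
}

echo decode_b_word('=?UTF-8?B?UHLDvGZ1bmcgUHLDvGZ1bmc=?='), PHP_EOL; // Prüfung Prüfung
```

Real headers can mix plain text with several encoded words and can use the Q scheme as well, which is where the dedicated iconv functions save effort.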

$headers =  iconv_mime_decode_headers($headers_string, 0, "ISO-8859-1");
var_dump($headers);
// array(5) {
//     ["Subject"]=>
//     string(15) "Pr�fung Pr�fung"
//     ["To"]=>
//     string(19) "example@example.com"
//     ["Date"]=>
//     string(30) "Thu, 1 Jan 1970 00:00:00 +0000"
//     ["Message-Id"]=>
//     string(21) "<example@example.com>"
//     ["Received"]=>
//     array(2) {
//       [0]=>
//       string(204) "from localhost (localhost [127.0.0.1]) by localhost with SMTP id example for <example@example.com>; Thu, 1 Jan 1970 00:00:00 +0000 (UTC) (envelope-from example-return-0000-example=example.com@example.com)"
//       [1]=>
//       string(57) "(qmail 0 invoked by uid 65534); 1 Thu 2003 00:00:00 +0000"
//     }
//   }

Did you see that? It not only decoded the content but also turned the MIME headers into a PHP array. Note that this test converts perfectly good UTF-8 content to ISO-8859-1, so the ü shows up garbled when printed in a UTF-8 environment. Let's take a look at an example of a Chinese email.

$headers_string = <<<EOF
Return-Path: <bluesky7810@163.com>
Delivered-To: bhw98@sina.com
Received: (qmail 75513 invoked by alias); 20 May 2002 02:19:53 -0000
Received: from unknown (HELO bluesky) (61.155.118.135)
    by 202.106.187.143 with SMTP; 20 May 2002 02:19:53 -0000
Message-ID: <007f01c3111c$742fec00$0100007f@bluesky>
From: "=?gb2312?B?wLbAtrXEzOwNCg==?=" <bluesky7810@163.com>
To: "bhw98" <bhw98@sina.com>
Cc: <bhwang@jlonline.com>
Subject: =?gb2312?B?ztK1xLbgtK6/2rPM0PI=?=
Date: Sat, 20 May 2002 10:03:36 +0800
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----=_NextPart_000_007A_01C3115F.80DFC5E0"

EOF;
$headers =  iconv_mime_decode_headers($headers_string, 0, "UTF-8");
var_dump($headers);
// array(11) {
//     ["Return-Path"]=>
//     string(21) "<bluesky7810@163.com>"
//     ["Delivered-To"]=>
//     string(14) "bhw98@sina.com"
//     ["Received"]=>
//     array(2) {
//       [0]=>
//       string(58) "(qmail 75513 invoked by alias); 20 May 2002 02:19:53 -0000"
//       [1]=>
//       string(101) "from unknown (HELO bluesky) (61.155.118.135) by 202.106.187.143 with SMTP; 20 May 2002 02:19:53 -0000"
//     }
//     ["Message-ID"]=>
//     string(40) "<007f01c3111c$742fec00$0100007f@bluesky>"
//     ["From"]=>
//     string(38) ""蓝蓝的天
//   " <bluesky7810@163.com>"
//     ["To"]=>
//     string(24) ""bhw98" <bhw98@sina.com>"
//     ["Cc"]=>
//     string(21) "<bhwang@jlonline.com>"
//     ["Subject"]=>
//     string(21) "我的多串口程序"
//     ["Date"]=>
//     string(31) "Sat, 20 May 2002 10:03:36 +0800"
//     ["MIME-Version"]=>
//     string(3) "1.0"
//     ["Content-Type"]=>
//     string(16) "multipart/mixed;"
//   }

The From and Subject headers of this Chinese email declare gb2312. With iconv_mime_decode_headers() we convert the whole set of headers to UTF-8, and everything displays correctly. Of course, we can also decode a single MIME field.

echo iconv_mime_decode("Subject: =?gb2312?B?ztK1xLbgtK6/2rPM0PI=?=", 0, 'UTF-8'), PHP_EOL; // Subject: 我的多串口程序
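The second argument of iconv_mime_decode() is a mode bitmask. As a brief sketch, the call below decodes the earlier German subject line to UTF-8 and passes ICONV_MIME_DECODE_CONTINUE_ON_ERROR, which asks the function to keep going past malformed parts of a header rather than give up; whether that is the right choice depends on how strict you need to be.

```php
<?php
$subject = 'Subject: =?UTF-8?B?UHLDvGZ1bmcgUHLDvGZ1bmc=?=';

// Decode to UTF-8 this time, so the ü survives in a UTF-8 environment.
$decoded = iconv_mime_decode($subject, ICONV_MIME_DECODE_CONTINUE_ON_ERROR, 'UTF-8');

echo $decoded, PHP_EOL; // Subject: Prüfung Prüfung
```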

Besides decoding received headers, we can also encode content ourselves for sending.

$preferences = array(
    "input-charset" => "UTF-8",
    "output-charset" => "GBK",
    "line-length" => 76,
    "line-break-chars" => "\n"
);
$preferences["scheme"] = "Q";
echo iconv_mime_encode("Subject", "测试头", $preferences), PHP_EOL;
// Subject: =?GBK?Q?=B2=E2=CA=D4=CD=B7?=
$preferences["scheme"] = "B";
echo iconv_mime_encode("Subject", "测试头", $preferences), PHP_EOL;
// Subject: =?GBK?B?suLK1M23?=

The iconv_mime_encode() function encodes a MIME header field. The first parameter is the field name, the second is the field value, and the third is an array of encoding options. The option names are self-explanatory: the charset to convert from, the charset to convert to, the maximum line length, and the line-break characters. There is also a scheme option that selects the encoding of the result: Q produces quoted-printable output, and B produces Base64 output.

Summary

Did you pick up a few more odd little tricks? I certainly did: before going through the documentation, iconv() was the only function I knew. It was only while studying this material that I found out email headers are encoded this way, and I feel like I have learned a lot. Okay, enough chatter, go and try it yourself!
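A quick way to see both schemes at work is a round trip: encode a value, then decode it again and compare. The sketch below does this with UTF-8 as both input and output charset (an assumption chosen for simplicity; the test code above targets GBK). mime_round_trip() is a hypothetical helper name.

```php
<?php
/**
 * Hypothetical helper: encode $value as a Subject header with the given
 * scheme, then decode it again. Returns [encoded header, decoded header].
 */
function mime_round_trip(string $value, string $scheme): array
{
    $preferences = [
        'input-charset'    => 'UTF-8',
        'output-charset'   => 'UTF-8',
        'scheme'           => $scheme, // "B" = Base64, "Q" = quoted-printable
        'line-length'      => 76,
        'line-break-chars' => "\r\n",
    ];

    $encoded = iconv_mime_encode('Subject', $value, $preferences);
    if ($encoded === false) {
        throw new RuntimeException("Encoding failed for scheme {$scheme}");
    }

    $decoded = iconv_mime_decode($encoded, 0, 'UTF-8');
    return [$encoded, $decoded];
}

foreach (['B', 'Q'] as $scheme) {
    [$encoded, $decoded] = mime_round_trip('测试头', $scheme);
    echo $encoded, PHP_EOL;
    echo $decoded, PHP_EOL;
}
```

For a short value like this one, the decoded header should match the original field regardless of which scheme was used.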

Test code:

https://github.com/zhangyue0503/dev-blog/blob/master/php/202011/source/2. Learn iconv extension related functions in PHP. php

Reference documents:

https://www.cnblogs.com/onelikeone/p/7865596.html

https://www.php.net/manual/zh/book.iconv.php

Search for [Hardcore Project Manager] on the major media platforms.
