PHP's internationalization (intl) functions are very rich and include many useful things we may not know about, such as the character sorting and comparison functions introduced today.

Sorting

Normally, when we sort the strings in an array, they are ordered by the byte values of their characters, which for English text is the order of the ASCII table. That is fine for English, but for Chinese the sorted result looks arbitrary.

$arr = ['我','是','硬','核','项', '目', '经', '理'];
sort($arr);
var_dump( $arr );
// array(8) {
//     [0]=>
//     string(3) "我"
//     [1]=>
//     string(3) "是"
//     [2]=>
//     string(3) "核"
//     [3]=>
//     string(3) "理"
//     [4]=>
//     string(3) "目"
//     [5]=>
//     string(3) "硬"
//     [6]=>
//     string(3) "经"
//     [7]=>
//     string(3) "项"
//   }

We would normally expect Chinese characters to be sorted by their pinyin. At this point many people write their own sorting algorithm or look for a suitable Composer package. In fact, PHP already provides an object for exactly this kind of problem.

$coll = new Collator( 'zh_CN' );

$coll->sort($arr);
var_dump( $arr );
// array(8) {
//     [0]=>
//     string(3) "核"
//     [1]=>
//     string(3) "经"
//     [2]=>
//     string(3) "理"
//     [3]=>
//     string(3) "目"
//     [4]=>
//     string(3) "是"
//     [5]=>
//     string(3) "我"
//     [6]=>
//     string(3) "项"
//     [7]=>
//     string(3) "硬"
//   }

Yes, it is the Collator class. It takes a locale when it is instantiated. Here we pass zh_CN, the locale for Chinese as used in mainland China. With that in place, its sort() method sorts the Chinese characters by pinyin.

$coll->sort($arr, Collator::SORT_NUMERIC );
var_dump( $arr );
// array(8) {
//     [0]=>
//     string(3) "核"
//     [1]=>
//     string(3) "经"
//     [2]=>
//     string(3) "理"
//     [3]=>
//     string(3) "目"
//     [4]=>
//     string(3) "是"
//     [5]=>
//     string(3) "我"
//     [6]=>
//     string(3) "项"
//     [7]=>
//     string(3) "硬"
//   }

$coll->sort($arr, Collator::SORT_STRING );
var_dump( $arr );
// array(8) {
//     [0]=>
//     string(3) "核"
//     [1]=>
//     string(3) "经"
//     [2]=>
//     string(3) "理"
//     [3]=>
//     string(3) "目"
//     [4]=>
//     string(3) "是"
//     [5]=>
//     string(3) "我"
//     [6]=>
//     string(3) "项"
//     [7]=>
//     string(3) "硬"
//   }

The sort() method of the Collator object also accepts a second parameter that specifies whether the values are compared as strings (Collator::SORT_STRING) or as numbers (Collator::SORT_NUMERIC); the default is Collator::SORT_REGULAR. For pure Chinese content, this makes no difference.

In addition to the sort() method, it also has an asort() method, which does the same job as the ordinary asort() function, except that it is locale-aware.

$arr = [
    'a' => '100',
    'b' => '7',
    'c' => '50'
];
$coll->asort($arr, Collator::SORT_NUMERIC );
var_dump( $arr );
// array(3) {
//     ["b"]=>
//     string(1) "7"
//     ["c"]=>
//     string(2) "50"
//     ["a"]=>
//     string(3) "100"
//   }

$coll->asort($arr, Collator::SORT_STRING );
var_dump( $arr );
// array(3) {
//     ["a"]=>
//     string(3) "100"
//     ["c"]=>
//     string(2) "50"
//     ["b"]=>
//     string(1) "7"
//   }

$arr = [
    '中' => '100',
    '的' => '7',
    '文' => '50'
];
$coll->asort($arr, Collator::SORT_NUMERIC );
var_dump( $arr );
// array (
//     '的' => '7',
//     '文' => '50',
//     '中' => '100',
//   )

$coll->asort($arr, Collator::SORT_STRING );
var_dump( $arr );
// array (
//     '中' => '100',
//     '文' => '50',
//     '的' => '7',
//   )

The asort() method sorts by value while keeping each key attached to its value, so specifying SORT_STRING or SORT_NUMERIC has an obvious effect here. With SORT_NUMERIC the values are ordered as numbers (7, 50, 100); with SORT_STRING they are ordered as strings, character by character ("100", "50", "7").

Both sort() and asort() are essentially the same as the sort() and asort() functions that PHP provides by default. They just add locale awareness.

In addition, the Collator object also provides a sortWithSortKeys() method, which is not available in ordinary PHP sort functions.

$arr = ['我','是','硬','核','项', '目', '经', '理'];
$coll->sortWithSortKeys($arr);
var_dump( $arr );
// array (
//     0 => '核',
//     1 => '经',
//     2 => '理',
//     3 => '目',
//     4 => '是',
//     5 => '我',
//     6 => '项',
//     7 => '硬',
//   )

It is similar to the sort() method, but uses ucol_getSortKey() to generate the ICU sort key, which is faster on large arrays.

ICU stands for International Components for Unicode. It provides Unicode and globalization support such as collation and formatting, and it is the foundation on which operating systems and many programming languages build their internationalization capabilities.

Comparison

Next is string comparison. We all know that "a" is greater than "A" by default, because in the ASCII table "A" is 65 and "a" is 97. That is only the default byte comparison, though. When we compare with the Collator object, strings are compared according to the collation rules of the locale. For Chinese, that is basically pinyin order.

var_dump($coll->compare('Hello', 'hello')); // int(1)
var_dump($coll->compare('你好', '您好')); // int(-1)

The compare() method does the comparison. It returns 0 if the two strings are equal, 1 if the first string is greater than the second, and -1 otherwise. From the code we can see that "Hello" is greater than "hello", and "你好" is less than "您好" (because the pinyin of "您", nín, has an extra n compared with "你", nǐ).
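Because compare() returns -1, 0 or 1, it can also be used as the callback for usort() when the strings live inside larger records. The following is only a minimal sketch: the $people records and their id field are made up for illustration, and it assumes the intl extension is loaded.

```php
<?php
// Assumes the intl extension is available.
$coll = new Collator('zh_CN');

// Hypothetical sample data: records that should be ordered by 'name'.
$people = [
    ['name' => '我', 'id' => 1],
    ['name' => '是', 'id' => 2],
    ['name' => '硬', 'id' => 3],
    ['name' => '核', 'id' => 4],
];

// compare() returns -1, 0 or 1 (or false on error, which usort() treats as 0),
// so it can drive usort() directly.
usort($people, function (array $a, array $b) use ($coll) {
    return $coll->compare($a['name'], $b['name']);
});

// Extract the names in their new order.
$names = array_column($people, 'name');
var_dump($names);
```

The resulting order of the names should match what $coll->sort() gives for the same strings.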

Attribute settings

The Collator object also has attributes that can be set.

$coll->setAttribute(Collator::CASE_FIRST, Collator::UPPER_FIRST);
var_dump($coll->getAttribute(Collator::CASE_FIRST)); // int(25)
var_dump($coll->compare('Hello', 'hello')); // int(-1)

$coll->setAttribute(Collator::CASE_FIRST, Collator::LOWER_FIRST);
var_dump($coll->getAttribute(Collator::CASE_FIRST)); // int(24)
var_dump($coll->compare('Hello', 'hello')); // int(1)

$coll->setAttribute(Collator::CASE_FIRST, Collator::OFF);
var_dump($coll->getAttribute(Collator::CASE_FIRST)); // int(16)
var_dump($coll->compare('Hello', 'hello')); // int(1)

Here we specify the CASE_FIRST attribute for the object. The attribute value can specify uppercase first, lowercase first, etc. For English characters, this can affect the sorting and comparison results.
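The same attribute also changes the result of sort(). Here is a small sketch of that, assuming the intl extension is loaded; the en_US locale and the two sample words are chosen only for illustration.

```php
<?php
// Assumes the intl extension is available.
$coll  = new Collator('en_US');
$words = ['hello', 'Hello'];

// Uppercase first: "Hello" should sort before "hello".
$coll->setAttribute(Collator::CASE_FIRST, Collator::UPPER_FIRST);
$upperFirst = $words;
$coll->sort($upperFirst);

// Lowercase first: "hello" should sort before "Hello".
$coll->setAttribute(Collator::CASE_FIRST, Collator::LOWER_FIRST);
$lowerFirst = $words;
$coll->sort($lowerFirst);

var_dump($upperFirst, $lowerFirst);
```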

In addition, we can also obtain the current regional language information through a method.

var_dump($coll->getLocale(Locale::VALID_LOCALE)); // string(10) "zh_Hans_CN"
var_dump($coll->getLocale(Locale::ACTUAL_LOCALE)); // string(2) "zh"

The two constants return the valid locale (the most specific locale that ICU supports for the one we requested) and the actual locale (the locale whose collation data is really used).

Sorting information

Of course, we can also look at the concrete sorting information, that is, the sort key the Collator generates for a string.

var_dump(bin2hex($coll->getSortKey('Hello'))); // string(20) "b6b0bebec4010901dc08"
var_dump(bin2hex($coll->getSortKey('hello'))); // string(18) "b6b0bebec401090109"
var_dump(bin2hex($coll->getSortKey('你好'))); // string(16) "7b9b657301060106"
var_dump(bin2hex($coll->getSortKey('您好'))); // string(16) "7c33657301060106"

$coll = collator_create( 'en_US' );

var_dump($coll->compare('Hello', 'hello')); // int(1)
var_dump($coll->compare('你好', '您好')); // int(-1)

var_dump($coll->getLocale(Locale::VALID_LOCALE)); // string(5) "en_US"
var_dump($coll->getLocale(Locale::ACTUAL_LOCALE)); // string(4) "root"

var_dump(bin2hex($coll->getSortKey('Hello'))); // string(20) "3832404046010901dc08"
var_dump(bin2hex($coll->getSortKey('hello'))); // string(18) "383240404601090109"
var_dump(bin2hex($coll->getSortKey('你好'))); // string(20) "fb0b8efb649401060106"
var_dump(bin2hex($coll->getSortKey('您好'))); // string(20) "fba5f8fb649401060106"

We can see that getSortKey() returns different sort keys under different locales. The keys are binary strings (shown here in hexadecimal through bin2hex()), and they are completely different from the default ASCII codes.
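Sort keys are designed to be compared byte by byte, which is the idea behind sortWithSortKeys(). The sketch below is not the extension's actual implementation; it only shows the principle: compute each key once with getSortKey(), then sort with plain strcmp(). It assumes the intl extension is loaded.

```php
<?php
// Assumes the intl extension is available.
$coll = new Collator('zh_CN');
$arr  = ['我', '是', '硬', '核', '项', '目', '经', '理'];

// Compute the sort key of every string exactly once.
$keyed = [];
foreach ($arr as $s) {
    $keyed[] = ['key' => $coll->getSortKey($s), 'value' => $s];
}

// Sort keys are binary strings, so a plain byte-wise strcmp() is enough.
usort($keyed, function (array $a, array $b) {
    return strcmp($a['key'], $b['key']);
});

// Pull the original strings back out in sorted order.
$sorted = array_column($keyed, 'value');
var_dump($sorted);
```

The point of this design is that the expensive locale-aware work happens once per string instead of once per comparison.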

Error message

$coll = new Collator( 'en_US' );
$coll->compare( 'y', 'k' );
var_dump($coll->getErrorCode()); // int(0)
var_dump($coll->getErrorMessage()); // string(12) "U_ZERO_ERROR"

Use getErrorCode() to get the error code, and use getErrorMessage() to get the error message. U_ZERO_ERROR is ICU's status code for "no error" (its value is 0), so the output above simply means that the last operation succeeded.
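In real code we usually want to check the status rather than print it. A minimal sketch, assuming the intl extension is loaded, using intl_is_failure() and intl_error_name() from the same extension:

```php
<?php
// Assumes the intl extension is available.
$coll   = new Collator('en_US');
$result = $coll->compare('y', 'k');
$code   = $coll->getErrorCode();

// compare() returns false on failure; intl_is_failure() tells real errors
// apart from success and warning codes.
if ($result === false || intl_is_failure($code)) {
    echo 'Collation failed: ', $coll->getErrorMessage(), PHP_EOL;
} else {
    echo 'OK, result = ', $result, ' (', intl_error_name($code), ')', PHP_EOL;
}
```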

Sorting rule strength

In addition, the Collator object has a collation strength setting, but in my tests it had no visible effect on the sort result.

$arr  = array( 'a', 'à' ,'A');
$coll = new Collator( 'de_DE' );

$coll->sort($arr);
var_dump($coll->getStrength()); // int(2)
var_dump( $arr );
// array(3) {
//     [0]=>
//     string(1) "a"
//     [1]=>
//     string(1) "A"
//     [2]=>
//     string(2) "à"
//   }

$coll->setStrength(Collator::IDENTICAL);
var_dump($coll->getStrength()); // int(15)
$coll->sort($arr);
var_dump( $arr );

$coll->setStrength(Collator::QUATERNARY);
var_dump($coll->getStrength()); // int(3)
$coll->sort($arr);
var_dump( $arr );

$coll->setStrength(Collator::PRIMARY);
var_dump($coll->getStrength()); // int(0)
$coll->sort($arr );
var_dump( $arr );

$coll->setStrength(Collator::TERTIARY);
var_dump($coll->getStrength()); // int(2)
$coll->sort($arr );
var_dump( $arr );

$coll->setStrength(Collator::SECONDARY);
var_dump($coll->getStrength()); // int(1)
$coll->sort($arr );
var_dump( $arr );

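In the test code in the official documentation, different strengths give a different sort order, but in my own tests the results were all the same. A likely reason is that $arr is already sorted after the first call: at a weak strength such as PRIMARY the three strings simply compare as equal, so sorting leaves them where they are, and at stronger strengths the order a, A, à is already the correct one. The strength setting probably shows up more clearly through compare() than through sort() on such a small array. If you know this area well, please leave a comment so we can learn together!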
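To see the strength levels actually doing something, we can look at compare() instead of sort(). This is only a sketch, assuming the intl extension is loaded; the expected values in the comments follow the usual ICU rules, where an accent is a secondary difference and letter case is a tertiary difference.

```php
<?php
// Assumes the intl extension is available.
$coll   = new Collator('de_DE');
$aGrave = "\u{00E0}"; // "à" as a single precomposed character

$levels = [
    'PRIMARY'   => Collator::PRIMARY,
    'SECONDARY' => Collator::SECONDARY,
    'TERTIARY'  => Collator::TERTIARY,
];

$results = [];
foreach ($levels as $name => $level) {
    $coll->setStrength($level);
    $results[$name] = [
        'a vs A' => $coll->compare('a', 'A'),
        'a vs à' => $coll->compare('a', $aGrave),
    ];
}

var_dump($results);
// PRIMARY:   a vs A => 0,  a vs à => 0   (base letters only)
// SECONDARY: a vs A => 0,  a vs à => -1  (accents count)
// TERTIARY:  a vs A => -1, a vs à => -1  (case counts as well)
```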

Summary

This is a very interesting object. It also supports a procedural style, and the sample code above includes a procedural call (collator_create()). Sorting and comparing by pinyin are genuinely useful in real development, so give them a try!

Test code:

https://github.com/zhangyue0503/dev-blog/blob/master/php/202011/source/3. Internationalized string comparison object in

Reference documents:

https://www.php.net/manual/zh/class.collator.php

Searchable on their respective media platforms [Hardcore Project Manager]

