For databases, appropriate character sets and rules can greatly improve the efficiency of user operation, maintenance and analysis. TiDB supports new collation rules from v4.0, and has been updated in TiDB 6.0. This article will deeply interpret the changes and applications of Collation rules in TiDB 6.0.
Ps. This article has been selected for TiDB 6.0 Book Rush . If you also want to share the experience of using 6.0 and get rich rewards, welcome to join us!
lead
The "introduction" here has two meanings. The first layer is the " introduction ". As you can see from the TiDB v6.0 release notes , TiDB 6.0 introduces many new features and new releases. Model , this article will take a look at the new features of TiDB 6.0.
The second layer of meaning is "throw a brick to attract jade". The power of the open source community is endless. I hope more people can participate in open source. So how to participate in open source, in fact, there are far more ways than submitting code. For example, in AskTUG community asks, answers, and interacts. Another example is finding bugs or incomplete information in TiDB official documents , and proposing issues and solutions. Another example, participating in TiDB 6.0 Book Rush! activities, doing version evaluations, case articles, etc.
New Collation rules enabled by default
TiDB supports new collation rules from v4.0, which are consistent with MySQL's behavior in case-insensitive, accent-insensitive, and padding rules.
TiDB 6.0 introduces new Collation rules and will enable the new Collation framework by default. The new Collation rule has been introduced in TiDB 4.0, but it has always been disabled by default and can only be changed when the cluster is initialized. The setting of the variable value can be seen through the system table.
TiDB [(none)] 18:45:27> select * from mysql.tidb where variable_name = 'new_collation_enabled';
+-----------------------+----------------+----------------------------------------------------+
| VARIABLE_NAME | VARIABLE_VALUE | COMMENT |
+-----------------------+----------------+----------------------------------------------------+
| new_collation_enabled | True | If the new collations are enabled. Do not edit it. |
+-----------------------+----------------+----------------------------------------------------+
1 row in set (0.003 sec)
Looking at the I\_S.collations table, you can see that gbk_bin
6.0 already supports 11 rules. Compared with the previous version without the new collation framework, 5 new rules have been gbk_chinese_ci
. gbk_chinese_ci
, utf8_general_ci
, utf8_unicode_ci
, utf8mb4_unicode_ci
.
Since many old systems use GBK character set, it is particularly important and practical to support GBK character set when doing system reconstruction projects, especially when data migration is involved. Of course, for new projects, UTF8mb4 is recommended.
TiDB [(none)] 18:45:51> select version()\G
*************************** 1. row ***************************
version(): 5.7.25-TiDB-v6.0.0
1 row in set (0.001 sec)
TiDB [(none)] 18:46:00> SELECT * FROM information_schema.collations;
+--------------------+--------------------+------+------------+-------------+---------+
| COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
+--------------------+--------------------+------+------------+-------------+---------+
| ascii_bin | ascii | 65 | Yes | Yes | 1 |
| binary | binary | 63 | Yes | Yes | 1 |
| gbk_bin | gbk | 87 | | Yes | 1 | (#28645)
| gbk_chinese_ci | gbk | 28 | Yes | Yes | 1 | (#28645)
| latin1_bin | latin1 | 47 | Yes | Yes | 1 |
| utf8_bin | utf8 | 83 | Yes | Yes | 1 |
| utf8_general_ci | utf8 | 33 | | Yes | 1 |
| utf8_unicode_ci | utf8 | 192 | | Yes | 1 | (#18678)
| utf8mb4_bin | utf8mb4 | 46 | Yes | Yes | 1 |
| utf8mb4_general_ci | utf8mb4 | 45 | | Yes | 1 |
| utf8mb4_unicode_ci | utf8mb4 | 224 | | Yes | 1 | (#18678)
+--------------------+--------------------+------+------------+-------------+---------+
11 rows in set (0.001 sec)
TiDB [test] 19:05:14> SHOW CHARACTER SET;
+---------+-------------------------------------+-------------------+--------+
| Charset | Description | Default collation | Maxlen |
+---------+-------------------------------------+-------------------+--------+
| ascii | US ASCII | ascii_bin | 1 |
| binary | binary | binary | 1 |
| gbk | Chinese Internal Code Specification | gbk_chinese_ci | 2 |
| latin1 | Latin1 | latin1_bin | 1 |
| utf8 | UTF-8 Unicode | utf8_bin | 3 |
| utf8mb4 | UTF-8 Unicode | utf8mb4_bin | 4 |
+---------+-------------------------------------+-------------------+--------+
6 rows in set (0.001 sec)
In TiDB v5.4, the result of not enabling the new Collation is:
TiDB-v5.4 [test] 10:17:22> select version()\G
*************************** 1. row ***************************
version(): 5.7.25-TiDB-v5.4.0
1 row in set (0.001 sec)
TiDB-v5.4 [test] 10:19:39> SELECT * FROM information_schema.collations;
+----------------+--------------------+------+------------+-------------+---------+
| COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN |
+----------------+--------------------+------+------------+-------------+---------+
| utf8mb4_bin | utf8mb4 | 46 | Yes | Yes | 1 |
| latin1_bin | latin1 | 47 | Yes | Yes | 1 |
| binary | binary | 63 | Yes | Yes | 1 |
| ascii_bin | ascii | 65 | Yes | Yes | 1 |
| utf8_bin | utf8 | 83 | Yes | Yes | 1 |
| gbk_bin | gbk | 87 | Yes | Yes | 1 |
+----------------+--------------------+------+------------+-------------+---------+
6 rows in set (0.001 sec)
New Collation Notes
For versions prior to TiDB 6.0, the default value of this configuration item is always false, but it can be changed before the cluster is initialized, so that the new collation framework can be used after the cluster is initialized.
service_configs:
tidb:
new_collations_enabled_on_first_bootstrap: true
However, it should be emphasized here that when the TiDB cluster is upgraded across major versions, the configuration items need to be checked. In order to avoid the situation that the upstream and downstream cluster character verification rules are inconsistent, the data is not synchronized or the query results are inconsistent. In addition, when using BR for data backup and recovery, you also need to pay attention to the Collation settings to ensure that the cluster settings before backup and after recovery are the same to prevent errors due to different configuration items new_collations_enabled_on_first_bootstrap
.
Collation bug fix
In TiDB 6.0, two bugs about Collation have been fixed, which are related to the comparison function and JSON respectively. Here are two small cases to test them.
1. Fix the problem that the result of the greaterst or least function with collation is wrong #31789
Test case:
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (
c1 char(20) CHARACTER SET utf8 COLLATE utf8_bin,
c2 char(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin);
INSERT INTO t1 VALUES ('UUtJeaV','snRXXCZHBPW');
SET names utf8mb4 collate utf8mb4_bin;
SELECT greatest( c1, c2 ) as expr1 FROM t1;
SELECT least( c1, c2 ) as expr1 FROM t1;
SET names utf8mb4 collate utf8mb4_general_ci;
SELECT greatest( c1, c2 ) as expr1 FROM t1;
SELECT least( c1, c2 ) as expr1 FROM t1;
Test Results:
2. Fixed the problem that the json type deduces the collation error in builtin-func #31320
The main code to fix this problem is as follows, the expected behavior is consistent with MySQL, and the utf8mb4\_bin rule should always be used when using internal methods of the JSON type.
// The collation of JSON is always utf8mb4_bin in builtin-func which is same as MySQL
// see details https://github.com/pingcap/tidb/issues/31320#issuecomment-1010599311
if isJSON {
dstCharset, dstCollation = charset.CharsetUTF8MB4, charset.CollationUTF8MB4
}
Test case:
DROP TABLE IF EXISTS t2;
CREATE TABLE t2 (c1 json);
INSERT INTO t2 VALUES ('{\"测试\": \"你好\"}');
SELECT collation(c1), collation(upper(c1)), collation(elt(1, c1, 0x12)) FROM t2;
Test Results:
The test results in TiDB v5.4 are:
Added built-in function CHARSET()
TiDB 6.0 adds a new built-in function to determine the character set of the input parameter, which is related to Collation, so it is demonstrated together with an example. Note: From the Issue #3931 record, this requirement was proposed as early as 2017, but it was not merged into the main code until 6.0.
TiDB [test] 23:54:42> select version()\G
*************************** 1. row ***************************
version(): 5.7.25-TiDB-v6.0.0
1 row in set (0.001 sec)
TiDB [test] 00:03:51> set names utf8mb4;
Query OK, 0 rows affected (0.000 sec)
TiDB [test] 00:03:58> select charset(1);
+------------+
| charset(1) |
+------------+
| binary |
+------------+
1 row in set (0.001 sec)
TiDB [test] 00:04:03> select charset('1');
+--------------+
| charset('1') |
+--------------+
| utf8mb4 |
+--------------+
1 row in set (0.001 sec)
Documentation Supplements
As mentioned at the beginning, there are many ways to participate in open source. We benefit from open source, and naturally we have to give back to the community.
To give a practical example, when I checked the documentation for this article, I found that the query result set in the v6.0 (DMR) version of the Collations section was inconsistent with reality, so I gave feedback.
The issue was submitted on GitHub, and the processing speed was also very fast. The initial modification was completed the next day, and it is now in a state of waiting for Merge to master.
The relevant Issue link is: https://github.com/pingcap/docs/pull/8364
end
This article summarizes and explains the changes of the Collation feature in TiDB 6.0 and gives examples. Currently, several Collations provided by TiDB can already support most business scenarios and most system migration requirements. It is recommended to use universal character sets and rules at the beginning of project design. After all, efficiency is one of the important factors for improving productivity.
Upcoming Events
The newly released TiDB 6.0 greatly enhances the manageability of TiDB as an enterprise-level product, has more mature HTAP and disaster recovery capabilities, and adds a number of basic features required by cloud-native databases. Be the first to try a variety of new products and services such as TiDB 6.0, TiFlash, TiEM, and PingCAP Clinic , participate in Book Rush contributions, and have the opportunity to enjoy a number of rights including "TiDB Honorary Experience Officer"!
**粗体** _斜体_ [链接](http://example.com) `代码` - 列表 > 引用
。你还可以使用@
来通知其他用户。