Author: Mo Shan
Senior DBA at an Internet company.
Source: original contribution.
*The original content is produced by the Aikesheng open source community and may not be used without authorization. For reprints, please contact the editor and cite the source.
I. Introduction
Our production TiDB clusters all run 2.1.x (2.1.5/2.1.7/2.1.8/2.1.17). Because the version is so old, it brings many problems: difficult management, hotspot issues, bad execution plans, performance bottlenecks, and other known/unknown issues that cannot be solved on this version. We therefore needed to upgrade to 4.0.13. Investigation showed that an in-place upgrade would require multiple hops (2.1 -> 3.0 -> 4.0). I was worried that an in-place upgrade might hit an irreversible failure, and even more worried that such a failure could not be resolved and would affect the business, so after testing and evaluation we finally chose to upgrade by migrating the data.
Because relatively few users run 2.1, and those who do have not upgraded, there are almost no migration or upgrade documents to refer to. We hit many problems and stepped into many pits during the upgrade. This article organizes the upgrade procedure and notes, for each step, the problems that are easy to run into and their solutions; take it as shared experience and a pit-avoidance guide. All contents and commands in this article are for reference only.
- Because 5.0 is based on the MySQL 8.0 protocol, I was worried about incompatibility with the business; and because 5.0+ still had few minor releases at the time, I was also worried about its stability, so 5.0 was not considered. 4.0.13 was the latest 4.0 release at that time, so that version was chosen.
- In total, 24 TiDB clusters have been upgraded from 2.1 to 4.0.13.
II. Environment introduction
1. Introduction to the old cluster environment
- Existing components
Role | Quantity | Port |
---|---|---|
pd | 3 | 5017 |
tidb | 3 | 4000 |
tikv | 3 | 20117 |
alertmanager | 1 | 9093 |
prometheus | 1 | 9100 |
grafana | 1 | 3000 |
vip | 192.168.1.100 | 4000 |
dns | old.tdb.com | 4000 |
- Components not listed are not deployed. For historical reasons, the cluster did not have the pump component enabled.
- There was no rule for port planning.
- Components planned to be added
Role | Quantity | Port |
---|---|---|
pump | 3 | 23001 |
drainer | 1 | 24001 |
2. Old cluster access information
dns | old.tdb.com | |
---|---|---|
vip | 192.168.1.100:4000 | rs : 192.168.1.1:4000 192.168.1.2:4000 192.168.1.3:4000 |
3. Introduction to the new cluster environment
Role | Quantity | Port |
---|---|---|
pd | 3 | 13002 |
tidb | 3 | 15002 |
tikv | 3 | 17002 |
ticdc | 3 | 33002 |
alertmanager | 1 | 21002 |
prometheus | 1 | 19002 |
grafana | 1 | 20002 |
vip | 192.168.1.100 | 15002 |
dns | new.tdb.com | 15002 |
- Ports follow a 2+3 format: the first two digits identify the component and the last three digits identify the cluster. For example, in 15002 the component number is 15 (tidb) and the cluster number is 002. Ports sharing the same last three digits belong to the same cluster, and ports sharing the same first two digits belong to the same component.
4. New cluster access information
dns | new.tdb.com | |
---|---|---|
vip | 192.168.1.100:15002 | rs : 192.168.1.1:15002 192.168.1.2:15002 192.168.1.3:15002 |
III. Process introduction
- 1. The DBA prints the list of IPs currently connected to tidb and asks the main business owner to confirm whether any IP belongs to another business, making sure every business using the cluster is involved.
- 2. The DBA confirms with the business whether a reconnection mechanism exists (enabling binlog requires restarting the tidb component).
- 3. The DBA enables binlog. This step requires a rolling restart of the tidb component, so a time window must be negotiated with the business.
- 4. The DBA deploys the 4.0 environment and imports the full data.
- 5. The DBA synchronizes the incremental data.
- 6. The DBA verifies data consistency between the new and old clusters.
- 7. The DBA delivers the new environment and provides the new domain name + port.
- 8. The DBA provides a read-only account; the business tests and verifies its scenarios (read only, no writes).
- 9. The DBA synchronizes permissions.
- 10. Switch traffic.
IV. Upgrade operation
1. Print the old cluster access list
ansible # /opt/soft/mysql57/bin/mysql -u root -h old.tdb1 -P 4000 -ppassword
mysql> select distinct host from information_schema.processlist;
ansible # /opt/soft/mysql57/bin/mysql -u root -h old.tdb2 -P 4000 -ppassword
mysql> select distinct host from information_schema.processlist;
ansible # /opt/soft/mysql57/bin/mysql -u root -h old.tdb3 -P 4000 -ppassword
mysql> select distinct host from information_schema.processlist;
Log in to every tidb node, append each node's output to a single file, then sort and deduplicate it to get the list of client IPs, for example:
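A minimal sketch of collecting and deduplicating the client IPs from the three tidb nodes (the host names and the output file path are assumptions; the host column of processlist is in ip:port form, so the port is stripped with awk):
for h in old.tdb1 old.tdb2 old.tdb3; do
    /opt/soft/mysql57/bin/mysql -u root -h "${h}" -P 4000 -ppassword -NBe \
        "select distinct host from information_schema.processlist;" \
        | awk -F: '{print $1}' >> /tmp/tidb_client_hosts.txt   # keep only the client ip
done
sort -u /tmp/tidb_client_hosts.txt                             # deduplicated client ip list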
2. Confirm whether there is a reconnection mechanism
(Omitted.)
3. Enable binlog and take a full backup
This step is performed on the ansible management machine.
(1) Edit the configuration file
ansible # vim /path/github/tidb-ansible-2.1.8/inventory.ini
- Add monitoring for the pump components
[monitored_servers]
monitor-pump1 ansible_host=xxxx deploy_dir=/path/tidb-data/pump-23001
monitor-pump2 ansible_host=xxxx deploy_dir=/path/tidb-data/pump-23001
monitor-pump3 ansible_host=xxxx deploy_dir=/path/tidb-data/pump-23001
- Add pump component
[pump_servers]
# The three entries below are the pump machines; to enable the pump component you also need to set enable_binlog = True
pump1 ansible_host=xxxx deploy_dir=/path/tidb-data/pump-23001
pump2 ansible_host=xxxx deploy_dir=/path/tidb-data/pump-23001
pump3 ansible_host=xxxx deploy_dir=/path/tidb-data/pump-23001
- Pump port settings and enable binlog parameters
## Global variables
[all:vars]
pump_port = 23001
## binlog trigger
enable_binlog = True
#enable_binlog = False
If enable_binlog = True is not set, the pump nodes are ignored during deployment. Also note that the tidb configuration must not be reloaded/restarted until the pump is able to provide service, otherwise business writes will fail.
(2) Edit the configuration of the pump
ansible # vim /path/github/tidb-ansible-2.1.8/conf/pump.yml
- Modify the binlog retention period
global:
# a integer value to control expiry date of the binlog data, indicates for how long (in days) the binlog data would be stored.
# must bigger than 0
gc: 14
Change it to 14 days to avoid losing incremental data (the binlog being purged) because the full import takes a long time.
(3) Log in to the target machine to create a directory
Log in to each pump node to create directories and change permissions
ansible # ssh pump1
pump1 # mkdir -p /path/tidb-data/pump-23001
pump1 # chown -R tidb. /path/tidb-data/pump-23001
ansible # ssh pump2
pump2 # mkdir -p /path/tidb-data/pump-23001
pump2 # chown -R tidb. /path/tidb-data/pump-23001
ansible # ssh pump3
pump3 # mkdir -p /path/tidb-data/pump-23001
pump3 # chown -R tidb. /path/tidb-data/pump-23001
(4) Deploy pump and monitor on the ansible management machine
ansible # ansible-playbook deploy.yml -l monitor-pump1,monitor-pump2,monitor-pump3,pump1,pump2,pump3 -i inventory.ini
(5) Start pump and monitor on the ansible management machine
ansible # ansible-playbook start.yml -l monitor-pump1,monitor-pump2,monitor-pump3,pump1,pump2,pump3 -i inventory.ini
(6) Log in to tidb to see if the pump is deployed
ansible # /opt/soft/mysql57/bin/mysql -u root -h old.tdb.com -P 4000 -ppassword
mysql> show pump status;
+------------+------------+--------+--------------------+---------------------+
| NodeID | Address | State | Max_Commit_Ts | Update_Time |
+------------+------------+--------+--------------------+---------------------+
| xxxx:23001 | xxxx:23001 | online | 427138948355850245 | 2021-08-20 04:42:57 |
| xxxx:23001 | xxxx:23001 | online | 427138948395171844 | 2021-08-20 04:42:57 |
| xxxx:23001 | xxxx:23001 | online | 427138948408279045 | 2021-08-20 04:42:57 |
+------------+------------+--------+--------------------+---------------------+
3 rows in set (0.00 sec)
mysql>
Note that <font color='red'>versions earlier than 2.1.6 do not support this query; you need to check the pump status with binlogctl instead, as in the following example.</font>
ansible # /path/binlogctl -pd-urls=http://pd_host:pd_port -cmd pumps
INFO[0000] pump: {NodeID: xxxx:23001, Addr: xxxx:23001, State: online, MaxCommitTS: 432180280017551379, UpdateTime: 2021-08-20 04:45:57 +0800 CST}
INFO[0000] pump: {NodeID: xxxx:23001, Addr: xxxx:23001, State: online, MaxCommitTS: 432180280004444167, UpdateTime: 2022-03-30 18:45:14 +0800 CST}
INFO[0000] pump: {NodeID: xxxx:23001, Addr: xxxx:23001, State: online, MaxCommitTS: 432180280017551372, UpdateTime: 2022-03-30 18:45:14 +0800 CST}
(7) Rolling restart the tidb node on the ansible management machine
<font color='red'>Before performing this operation, make sure that the pump component is working properly. </font>
ansible # ansible-playbook rolling_update.yml -t tidb -i inventory.ini
Note that this operation may fail to start or stop a node through ansible (it hangs until it times out). If this happens, you can log in to the target machine and start or stop it manually. Reference commands:
- Start: cd /path/tidb/scripts && sudo -u tidb bash start_tidb.sh
- Stop: cd /path/tidb/scripts && sudo -u tidb bash stop_tidb.sh
(8) Log in to tidb to check whether binlog is enabled
ansible # /opt/soft/mysql57/bin/mysql -u root -h old.tdb.com -P 4000 -ppassword
mysql> show variables like 'log_bin';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| log_bin | 1 |
+---------------+-------+
1 row in set (0.01 sec)
- Note that in versions earlier than <font color='red'>2.1.6, log_bin always shows 0</font>; even with enable_binlog = True, show variables like 'log_bin' still returns 0, but the pump does record binlog.
- It is recommended to check each tidb node one by one, for example:
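A small sketch of that per-node check (the host names are assumptions; each node should report log_bin = 1, or 0 on pre-2.1.6 versions as noted above):
for h in old.tdb1 old.tdb2 old.tdb3; do
    echo "== ${h} =="
    /opt/soft/mysql57/bin/mysql -u root -h "${h}" -P 4000 -ppassword \
        -e "show variables like 'log_bin';"
done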
(9) Update monitoring on the ansible management machine
ansible # ansible-playbook rolling_update_monitor.yml -t prometheus -i inventory.ini
(10) Create a full backup
ansible # /path/mydumper -u user -p pass -h old.tdb.com -P 4000 -t 2 -F 32 --skip-tz-utc -o /backup_path/4000 -B db_name
Notes on the backup:
- Tool download: https://docs.pingcap.com/zh/tidb/v2.1/backup-and-restore
- Back up during off-peak hours, otherwise the NIC may be saturated (especially in architectures where tidb has a 10 Gigabit NIC but tikv only has a Gigabit NIC).
- The backup may fail because the GC time is too short (solved by increasing the GC life time; see the example after this list).
- The backup may fail because the memory allocated to tidb is too small (solved by increasing tidb's memory).
- After the backup completes, it is recommended to check the table-creation statement files for illegal time values ("0000-00-00"); if present they will cause errors when importing into the new cluster, and you need to discuss changing the default value with the business.
- mydumper does not support rate-limited backup. Backing up to a machine with poor disk performance, or to network storage such as cfs, can limit the rate to some extent.
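For the GC issue mentioned above, a minimal sketch of stretching the GC life time on the old cluster before the backup (the 720h value is an assumption; remember to restore the original value after the backup finishes):
mysql> select VARIABLE_VALUE from mysql.tidb where VARIABLE_NAME = 'tikv_gc_life_time';
mysql> update mysql.tidb set VARIABLE_VALUE = '720h' where VARIABLE_NAME = 'tikv_gc_life_time';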
4. Deploy the 4.0 environment and import full data
Pay attention to the pessimistic transaction model. Although <font color='red'>4.0 supports pessimistic transactions and the new cluster enables them by default, there are still restrictions on when an operation actually uses pessimistic locking: it must be a non-autocommit transaction. For details see section 6.2.3.2</font> of https://book.tidb.io/session1/chapter6/pessimistic-txn.html
(1) Install tiup
ansible # curl --proto '=https' --tlsv1.2 -sSf https://tiup-mirrors.pingcap.com/install.sh | sh
ansible # . /root/.bash_profile
ansible # tiup --version
ansible # tiup update --self
(2) Prepare the topology file
ansible # vim topology-oltp-xxx.yaml
global:
user: "tidb"
ssh_port: 22
deploy_dir: "/tidb-deploy"
data_dir: "/tidb-data"
monitored:
node_exporter_port: 11000
blackbox_exporter_port: 12000
pd_servers:
- host: 10.0.1.4
- host: 10.0.1.5
- host: 10.0.1.6
tidb_servers:
- host: 10.0.1.1
- host: 10.0.1.2
- host: 10.0.1.3
tikv_servers:
- host: 10.0.1.7
- host: 10.0.1.8
- host: 10.0.1.9
cdc_servers:
- host: 10.0.1.7
- host: 10.0.1.8
- host: 10.0.1.9
monitoring_servers:
- host: 10.0.1.10
grafana_servers:
- host: 10.0.1.10
alertmanager_servers:
- host: 10.0.1.10
The above is the configuration template provided officially; please modify it according to your actual situation.
- It is recommended to deploy ticdc (or pump) so that, if a rollback is required, the incremental data does not have to be chased back after the fact.
- A separate machine for each component is recommended.
(3) Check whether the ssh channel from the tiup management machine to each node is normal
(Omitted.)
(4) Deploy the cluster
ansible # tiup cluster check topology-oltp-xxx.yaml
ansible # tiup cluster deploy tidb-oltp-xxx-v4.0.13 v4.0.13 topology-oltp-xxx.yaml
ansible # tiup cluster start tidb-oltp-xxx-v4.0.13
ansible # tiup cluster display tidb-oltp-xxx-v4.0.13
- The check may report many exceptions; they can be fixed according to the prompts, and many can also be ignored (or auto-repaired, see the note below). Please refer to https://docs.pingcap.com/en/tidb/v4.0/tiup-component-cluster-check#tiup-cluster-check
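If you want tiup to try to repair the failed check items automatically, the check subcommand also accepts an --apply option; this is a sketch, so verify the flag against your tiup version before relying on it:
ansible # tiup cluster check topology-oltp-xxx.yaml --apply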
(5) Permission maintenance
ansible # /opt/soft/mysql57/bin/mysql -u root -h new.tdb.com -P 15002 -p
mysql> create user if not exists root@"192.168.1.%" IDENTIFIED BY 'password';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'192.168.1.%' WITH GRANT OPTION;
mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'pd1' WITH GRANT OPTION;
mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'pd2' WITH GRANT OPTION;
mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'pd3' WITH GRANT OPTION;
- At this point you can still log in with an empty password.
- Authorization for the pd nodes must be added here, and it must be for the root user (we simply granted ALL; testing showed that SELECT alone is not enough, and we did not do more fine-grained permission testing), otherwise the dashboard cannot be used normally.
ansible # /opt/soft/mysql57/bin/mysql -u root -h new.tdb.com -P 15002 -ppassword
mysql> drop user if exists root@"%";
It is recommended to delete the empty-password user root@'%'.
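A quick sanity check of the remaining accounts after the cleanup (a sketch, run on the new cluster):
mysql> select user, host from mysql.user order by user, host;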
(6) Import full data
ansible # /path/loader -d /backup_path/4000 -h new.tdb.com -u user -p password -P 15002 -t 2 -status-addr ":9299"
Notes on the restore:
- Tool download: https://docs.pingcap.com/zh/tidb/v2.1/backup-and-restore
- It is recommended to restore during off-peak hours.
- The loader may fail because of emoji. If you run into this, you can try Dumpling.
- When running multiple loader tasks, it is recommended to avoid the default status port, otherwise they may fail because of port conflicts.
5. Synchronize incremental data
This step is performed on the ansible management machine
(1) Obtain the backup point on the backup machine (in this example, the ansible management machine was used for the backup)
View the metadata file in the backup directory:
ansible # cd /backup_path/xxx
ansible # cat metadata
Started dump at: 2021-08-29 15:34:30
SHOW MASTER STATUS:
Log: tidb-binlog
Pos: 425971435565482001
GTID:
Finished dump at: 2021-08-29 15:34:33
ansible #
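The Pos value above is the TSO that will be used as initial_commit_ts for the drainer below. A small sketch to extract it (assuming the backup directory above):
ansible # awk '/Pos:/ {print $2}' /backup_path/xxx/metadata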
(2) Modify the configuration file
ansible # vim /path/github/tidb-ansible-2.1.8/inventory.ini
- Add monitoring for the drainer component
[monitored_servers]
monitor-drainer1 ansible_host=xxxx deploy_dir=/path/tidb-data/drainer-24001
- Add drainer component
[drainer_servers]
drainer1 ansible_host=xxxx deploy_dir=/path/tidb-data/drainer-24001 initial_commit_ts="425971435565482001"
- drainer port settings
## Global variables
[all:vars]
drainer_port = 24001
(3) Prepare the drainer configuration file
ansible # vim /path/github/tidb-ansible-2.1.8/conf/drainer1_drainer.toml
The configuration file must be named according to the rule [alias_drainer.toml] (here the alias is drainer1), otherwise the custom configuration file will not be found during deployment.
# drainer Configuration.
# the interval time (in seconds) of detect pumps' status
detect-interval = 10
# syncer Configuration.
[syncer]
# disable sync these schema
ignore-schemas = "INFORMATION_SCHEMA,PERFORMANCE_SCHEMA,mysql"
# number of binlog events in a transaction batch
txn-batch = 2000
# work count to execute binlogs
worker-count = 32
disable-dispatch = false
# safe mode will split update to delete and insert
safe-mode = false
# downstream storage, equal to --dest-db-type
# valid values are "mysql", "pb", "tidb", "flash", "kafka"
db-type = "tidb"
# the downstream MySQL protocol database
[syncer.to]
host = "new.tdb.com"
user = "user"
password = "xxxx"
port = 15002
txn-batch and worker-count default to 1 in the configuration file. It is recommended to increase them according to the actual workload; if they are too small, incremental replication may never catch up.
(4) Deploy drainer and monitor
ansible # ansible-playbook deploy_drainer.yml -i inventory.ini -l drainer1
ansible # ansible-playbook deploy.yml -i inventory.ini -l monitor-drainer1
(5) Log in to the tidb of the new cluster and authorize the drainer node
ansible # /opt/soft/mysql57/bin/mysql -u root -h new.tdb.com -P 15002 -ppassword
mysql> create user if not exists user@"drainer_host" IDENTIFIED BY 'xxxx';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'user'@'drainer_host';
Note: the new cluster (4.0) must authorize the host where the drainer runs, otherwise the drainer will report an error on startup. For convenience in this demonstration, ALL privileges are granted directly.
(6) Start drainer and monitor
Before starting the drainer, it is recommended to check whether the tidb_binlog schema already exists in the target cluster. If it exists and you need to start incremental replication from the backup point, you must delete it manually first; otherwise the drainer will resume replication from the checkpoint stored there. (This usually happens when a full import fails, the full import is redone, and the tidb_binlog schema is forgotten and left behind.)
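A minimal sketch of that check on the new cluster (drop the schema only when you are sure replication should restart from initial_commit_ts rather than from the stored checkpoint):
mysql> show databases like 'tidb_binlog';
mysql> select * from tidb_binlog.checkpoint;   -- the drainer checkpoint, if the schema exists
mysql> drop database tidb_binlog;              -- only if you really want to restart from the backup point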
ansible # ansible-playbook start_drainer.yml -i inventory.ini -l drainer1
ansible # ansible-playbook start.yml -i inventory.ini -l monitor-drainer1
(7) Log in to tidb to check the drainer status
ansible # /opt/soft/mysql57/bin/mysql -u root -h old.tdb.com -P 4000 -ppassword
mysql> show drainer status;
+------------+------------+--------+--------------------+---------------------+
| NodeID | Address | State | Max_Commit_Ts | Update_Time |
+------------+------------+--------+--------------------+---------------------+
| xxxx:24001 | xxxx:24001 | online | 431972431138127904 | 2021-08-25 16:42:57 |
+------------+------------+--------+--------------------+---------------------+
1 row in set (0.00 sec)
mysql>
Note that <font color='red'>versions earlier than 2.1.6 do not support this query; you need to check the drainer status with binlogctl instead, as in the following example.</font>
ansible # /path/binlogctl -pd-urls=http://pd_host:pd_port -cmd drainers
INFO[0000] drainer: {NodeID: xxxx:24001, Addr: xxxx:24001, State: online, MaxCommitTS: 432180589478543384, UpdateTime: 2021-08-25 16:45:57 +0800 CST}
(8) Update monitoring
ansible # ansible-playbook rolling_update_monitor.yml -t prometheus -i inventory.ini
(9) Log in to grafana to view the synchronization progress
Note: if the replication lag is large, you can first silence the drainer alerts in alertmanager.
6. Verify data consistency between old and new clusters
(1) Download tool
ansible # git clone https://gitee.com/mo-shan/check_data_for_mysql.git
ansible # cd check_data_for_mysql
(2) Modify the configuration
- Edit configuration file
ansible # cd /path/check_data_for_mysql
ansible # vim conf/check.conf
mysql_user="xxxx"
mysql_passwd="xxxx"
mysql_port1="6666"
mysql_port2="6666"
mysql_host1="192.168.1.1"
mysql_host2="192.168.1.2"
max_count=10000
threads=5
max_threads_running=30
mysql_path="/opt/soft/mysql57/bin/mysql"
log_partition="/dev/sda3"
log_par_size="10"
skip_check_table=""
skip_check_db="INFORMATION_SCHEMA,METRICS_SCHEMA,PERFORMANCE_SCHEMA,mysql,sys,tidb_binlog,test,tidb_loader,dm_meta" # not recommended to change
Please fill in these settings according to the comments and your actual situation.
- Modify the work path
ansible # sed -i 's#^work_dir=.*#work_dir=\"/check_data_for_mysql_path\"#g' start.sh # replace check_data_for_mysql_path with the absolute path of the check_data_for_mysql home directory
(3) Test case
- <font color='red'>The log directory is forcibly cleared every time a verification task runs, so back up the verification results.</font>
- When running a verification task, <font color='red'>it is strongly recommended to use screen.</font>
- If NIC monitoring is needed, <font color='red'>it is also strongly recommended to run the monitoring script in a separate screen.</font>
Step 1: open a screen to monitor the network
ansible # screen -S check_net_4000
ansible # bash manager.sh -a start
[ 2022-01-18 11:55:34 ] [ 1000 Mb/s ] [ RX : 2 MB/S ] [ TX : 2 MB/S ]
[ 2022-01-18 11:55:35 ] [ 1000 Mb/s ] [ RX : 2 MB/S ] [ TX : 4 MB/S ]
[ 2022-01-18 11:55:36 ] [ 1000 Mb/s ] [ RX : 2 MB/S ] [ TX : 2 MB/S ]
[ 2022-01-18 11:55:37 ] [ 1000 Mb/s ] [ RX : 2 MB/S ] [ TX : 3 MB/S ]
[ 2022-01-18 11:55:38 ] [ 1000 Mb/s ] [ RX : 1 MB/S ] [ TX : 2 MB/S ]
[ 2022-01-18 11:55:39 ] [ 1000 Mb/s ] [ RX : 1 MB/S ] [ TX : 2 MB/S ]
[ 2022-01-18 11:55:41 ] [ 1000 Mb/s ] [ RX : 1 MB/S ] [ TX : 2 MB/S ]
[ 2022-01-18 11:55:42 ] [ 1000 Mb/s ] [ RX : 2 MB/S ] [ TX : 8 MB/S ]
Step 2: open another screen to run the verification task
ansible # screen -S check_data_4000
ansible # bash start.sh -d dba -t dbatest1 -f true
[ 2022-01-17 20:32:19 ] [ OK ] [ 192.168.1.1 ] [ start.sh/start.sh ] [ f_prepare:130 ] [ this data consistency check starts ]
[ 2022-01-17 20:32:19 ] [ WARN ] [ 192.168.1.1 ] [ start.sh/start.sh ] [ f_main:185 ] [ this check will cover the following databases : [dba] ]
[ 2022-01-17 20:32:19 ] [ OK ] [ 192.168.1.1 ] [ start.sh/start.sh ] [ f_main:203 ] [ checking database dba ]
[ 2022-01-17 20:32:19 ] [ OK ] [ 192.168.1.1 ] [ func/f_check_diff_for_mysql.sh ] [ f_check_diff_for_mysql:249 ] [ dba.dbatest1 ] [ table schemas are consistent ]
[ 2022-01-17 20:32:19 ] [ OK ] [ 192.168.1.1 ] [ func/f_check_diff_for_mysql.sh ] [ f_check_diff_for_mysql:491 ] [ dba.dbatest1 ] [ 1,1 ] [ 00 d 00 h 00 m 00 s ] [ 9.09%, (0:0)/1 ] [ data is consistent ]
[ 2022-01-17 20:32:19 ] [ OK ] [ 192.168.1.1 ] [ func/f_check_diff_for_mysql.sh ] [ f_check_diff_for_mysql:491 ] [ dba.dbatest1 ] [ 2,11 ] [ 00 d 00 h 00 m 00 s ] [ 100.00%, (0:0)/1 ] [ data is consistent ]
[ 2022-01-17 20:32:19 ] [ OK ] [ 192.168.1.1 ] [ func/f_check_diff_for_mysql.sh ] [ f_check_diff_for_mysql:504 ] [ dba.dbatest1 ] [ check finished ]
[ 2022-01-17 20:32:19 ] [ OK ] [ 192.168.1.1 ] [ start.sh/start.sh ] [ f_main:242 ] [ this data consistency check completed ] [ passed ]
ansible #
When the check completes it will report that the check passed; otherwise it reports that the check failed.
- For the tool implementation logic, please refer to https://mp.weixin.qq.com/s/PPGSnPL-2FgRWftDzqxAOA
7. Deliver the new environment
The DBA provides the new domain name and port to the business. Here, a read-only account is also provided for the business.
ansible # /opt/soft/mysql57/bin/mysql -u root -h new.tdb.com -P 15002 -ppassword
mysql> create user if not exists read_only@"host" IDENTIFIED BY 'xxxx';
mysql> GRANT SELECT ON *.* TO 'read_only'@'host';
It is important to note that the permissions table (mysql.user) should not be synchronized before delivering the new environment.
8. Business verification
Please have the business verify its scenarios fully.
9. Sync permission table
The permission table schemas of TiDB 2.1 and 4.0 are different, so permissions cannot be synchronized by simply exporting and importing, and testing showed that the pt tools cannot be used either. The following script synchronizes permissions; it has been tested from 2.1 to 4.0, other versions have not been tested.
#!/bin/bash
# Dump the users and grants of the old cluster and print the equivalent
# create user / grant statements so they can be replayed on the new cluster.
port=4000
mysql_comm="/opt/soft/mysql57/bin/mysql -u root -h old.tdb.com -P ${port} -ppassword"

for user in $(${mysql_comm} -NBe "select concat(user,'@','\"',host,'\"',':::',password) from mysql.user;" 2>/dev/null)
do
    # user_tmp is user@"host", pass_tmp is the stored password hash
    user_tmp="$(awk -F::: '{print $1}' <<< "${user}")"
    pass_tmp="$(awk -F::: '{print $2}' <<< "${user}")"
    create_user="create user if not exists ${user_tmp} IDENTIFIED BY PASSWORD '${pass_tmp}';"
    drop_user="drop user if exists ${user_tmp};"
    if grep -q "^root@" <<< "${user_tmp}"; then
        # keep the root accounts created earlier on the new cluster and replay their grants
        grant_user="$(${mysql_comm} -NBe "show grants for ${user_tmp}" 2>/dev/null | sed 's/$/ WITH GRANT OPTION;/g')"
    else
        # non-root accounts are dropped and recreated with the old password hash
        grant_user="$(${mysql_comm} -NBe "show grants for ${user_tmp}" 2>/dev/null | sed 's/$/;/g')"
        echo "${drop_user}"
    fi
    echo "${create_user}"
    echo "${grant_user}"
done
The script prints the permissions of the old cluster; after they are confirmed, they can be written into the new cluster.
ansible # bash show_grant.sh | /opt/soft/mysql57/bin/mysql -u root -h new.tdb.com -P 15002 -ppassword
- After permissions have been synchronized, <font color='red'>the business should not do any authorization on its own</font>; if a new host needs to be authorized or a new user created, ask the DBA for assistance.
- <font color='red'>Please do not run DDL operations</font> unless necessary; if necessary, ask the DBA for assistance.
10. Switch traffic
- Before the business switches traffic, it is recommended to restart the new cluster's tidb nodes one by one to release the cached auto_increment IDs (see the example after this list). After the restart, check the status and lag of the drainer task, and contact the business to switch only when there is no lag.
- If you do not restart, then after switching to the new cluster a large number of [Duplicate entry '' for key 'PRIMARY'] errors may be reported on <font color='red'>auto-increment primary keys</font>.
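A sketch of the one-by-one restart with tiup (the cluster name comes from the deployment above; the node IDs are assumptions, take the real host:port values from tiup cluster display):
ansible # tiup cluster display tidb-oltp-xxx-v4.0.13
ansible # tiup cluster restart tidb-oltp-xxx-v4.0.13 -N 10.0.1.1:15002
ansible # tiup cluster restart tidb-oltp-xxx-v4.0.13 -N 10.0.1.2:15002
ansible # tiup cluster restart tidb-oltp-xxx-v4.0.13 -N 10.0.1.3:15002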
This process is actually very simple: either replace the rs list behind the existing vip with the new cluster's tidb hosts directly, or apply for a new vip and point the original domain name at it. However, because of historical issues the original domain name and tidb port did not meet our management standards, so the business needs to access tidb through the new domain name/port.
Note: re-pointing the domain name to a new vip only affects newly established connections.
Given the particularity of this environment, the DBA offers two schemes for moving the business to the new domain name and port.
Whichever scheme is adopted, it is not recommended to decommission the old domain name immediately after switching traffic. The recommended practice is to remove the rs list behind the vip of the old domain name and attach the new cluster's tidb nodes to that vip instead (note that the new tidb port may differ from the old vip port). This avoids breaking any business that was missed; watch the DNS logs for a few days and take the old domain name offline only after confirming no business still uses it.
(1) Crude scheme
The business directly modifies the connection information and uses new.tdb.com:15002 to connect to tidb.
Because multiple businesses may use this database, and each business may have multiple application servers, it is impossible for all of them to switch to the new cluster at the same moment, so the following situations will occur:
1) Writes go to the new cluster while reads go to the old cluster: the reads cannot see the new data, because there is no replication link from the new cluster back to the old one.
2) Writes go to the old cluster while reads go to the new cluster: the reads may not see the data yet, because there is replication lag from the old cluster to the new one.
Double writing cannot be avoided, which may lead to the following problems.
- A. Two connections update the same row within a very short interval (shorter than the replication delay from the old cluster to the new one), one executing on the old cluster and one on the new cluster. In that case the final state of the row is not determined by whichever executed last. For example, first execute [update t set name = 1 where id = 2;] on the old cluster, then execute [update t set name = 2 where id = 2;] on the new cluster. In theory the row should end up with name=2, but because of the replication lag between the clusters, it may be overwritten by the old cluster's data and become name=1. In the reverse order (new cluster first, then old cluster) the data is not affected.
- B. Two connections insert the same row (same primary key or same unique key), one into the new cluster and one into the old cluster. If the new cluster is written first and the old cluster afterwards, both commits succeed from the business's point of view, but replication from the old cluster to the new one then fails, because the row written on the old cluster conflicts with the row already present on the new cluster (primary key/unique key conflict); at this point the DBA must intervene manually to repair it. In the reverse order (old cluster first, then new cluster, regardless of replication lag), the session writing to the new cluster gets an error, and the data is not affected.
For the above situations, <font color='red'>the business needs to evaluate fully. If they are unacceptable, you can suggest that the business use the smooth scheme below, which has less impact.</font>
(2) Smooth scheme
The business continues to use old.tdb.com:4000 to connect to tidb.
The DBA needs to add the new cluster's tidb nodes to the rs list behind the old cluster's vip. To avoid writing to the old and new clusters at the same time, the old rs must be removed from the vip first, and only then are the new cluster's tidb IPs added to the vip's rs list.
Two actions are involved here:
- Clear the rs list of the old cluster's vip (take the rs offline). It is recommended to actively release the connections (restart/stop the old cluster's tidb), otherwise (depending on how the vip is implemented) existing connections may not be released after the rs is removed.
- Add the new cluster's tidb IPs to the rs list of the old cluster's vip.
These two operations must be confirmed with the business, because there is a gap (expected to be within 30s) between removing the rs and adding the new ones, and <font color='red'>the cluster is unavailable during this window.</font>
After completing the above operations, the old cluster's access information becomes the following table:
dns | old.tdb.com | |
---|---|---|
vip | 192.168.1.100:4000 | rs : 192.168.1.1:15002 192.168.1.2:15002 192.168.1.3:15002 |
After that, the business needs to update its configuration service by service, replacing the old domain name and port with the new ones (old.tdb.com:4000 must be replaced with new.tdb.com:15002) and then restarting; done this way, the impact is relatively small.
V. Closing remarks
This document is only experience sharing and a pit-avoidance guide. Because usage scenarios and environments differ, other problems may be encountered during migration. <font color='red'>If you need to operate on a production environment, please test fully in a test environment first.</font>