Author: Mo Shan
Senior DBA at an Internet company.
Source: original contribution.
*The original content is produced by the Aikesheng open source community and may not be used without authorization. For reprints, please contact the editor and cite the source.
I. Introduction
Our production TiDB clusters all run 2.1.x (2.1.5/2.1.7/2.1.8/2.1.17). Because the version is so old, it brings many problems: difficult management, hotspot issues, bad execution plans, performance bottlenecks, and other known/unknown issues that cannot be solved on this version. We therefore needed to upgrade to 4.0.13. Investigation showed that an in-place upgrade would require multiple hops (2.1 -> 3.0 -> 4.0). I was worried that an in-place upgrade might hit an irreversible failure, and even more worried that such a failure could not be resolved and would affect the business, so after testing and evaluation we finally chose to upgrade by migrating the data.
Because relatively few users run 2.1, and those who do have not upgraded, there are almost no migration or upgrade documents to refer to. We hit many problems and stepped into many pits during the upgrade. This article organizes the upgrade procedure and notes, for each step, the problems that are easy to run into and their solutions; take it as shared experience and a pit-avoidance guide. All contents and commands in this article are for reference only.
- Because 5.0 is based on the MySQL 8.0 protocol, I was worried about incompatibility with the business; and because 5.0+ still had few minor releases at the time, I was also worried about its stability, so 5.0 was not considered. 4.0.13 was the latest 4.0 release at that time, so that version was chosen.
- In total, 24 TiDB clusters have been upgraded from 2.1 to 4.0.13.
II. Environment introduction
1. Introduction to the old cluster environment
- Existing components
Role | Quantity | Port |
---|---|---|
pd | 3 | 5017 |
tidb | 3 | 4000 |
tikv | 3 | 20117 |
alertmanager | 1 | 9093 |
prometheus | 1 | 9100 |
grafana | 1 | 3000 |
vip | 192.168.1.100 | 4000 |
dns | old.tdb.com | 4000 |
- Components not listed are not deployed. For historical reasons, the cluster did not have the pump component enabled.
- There was no rule for port planning.
- Components planned to be added
Role | Quantity | Port |
---|---|---|
pump | 3 | 23001 |
drainer | 1 | 24001 |
2. Old cluster access information
dns | old.tdb.com | |
---|---|---|
vip | 192.168.1.100:4000 | rs : 192.168.1.1:4000 192.168.1.2:4000 192.168.1.3:4000 |
3. Introduction to the new cluster environment
Role | Quantity | Port |
---|---|---|
pd | 3 | 13002 |
tidb | 3 | 15002 |
tikv | 3 | 17002 |
ticdc | 3 | 33002 |
alertmanager | 1 | 21002 |
prometheus | 1 | 19002 |
grafana | 1 | 20002 |
vip | 192.168.1.100 | 15002 |
dns | new.tdb.com | 15002 |
- Ports follow a 2+3 format: the first two digits identify the component and the last three digits identify the cluster. For example, in 15002 the component number is 15 (tidb) and the cluster number is 002. Ports sharing the same last three digits belong to the same cluster, and ports sharing the same first two digits belong to the same component.
4. New cluster access information
dns | new.tdb.com | |
---|---|---|
vip | 192.168.1.100:15002 | rs : 192.168.1.1:15002 192.168.1.2:15002 192.168.1.3:15002 |
III. Process introduction
- 1. The DBA prints the list of IPs currently connected to tidb and asks the main business owner to confirm whether any IP belongs to another business, making sure every business using the cluster is involved.
- 2. The DBA confirms with the business whether a reconnection mechanism exists (enabling binlog requires restarting the tidb component).
- 3. The DBA enables binlog. This step requires a rolling restart of the tidb component, so a time window must be negotiated with the business.
- 4. The DBA deploys the 4.0 environment and imports the full data.
- 5. The DBA synchronizes the incremental data.
- 6. The DBA verifies data consistency between the new and old clusters.
- 7. The DBA delivers the new environment and provides the new domain name + port.
- 8. The DBA provides a read-only account; the business tests and verifies its scenarios (read only, no writes).
- 9. The DBA synchronizes permissions.
- 10. Switch traffic.
IV. Upgrade operation
1. Print the old cluster access list
ansible # /opt/soft/mysql57/bin/mysql -u root -h old.tdb1 -P 4000 -ppassword
mysql> select distinct host from information_schema.processlist;
ansible # /opt/soft/mysql57/bin/mysql -u root -h old.tdb2 -P 4000 -ppassword
mysql> select distinct host from information_schema.processlist;
ansible # /opt/soft/mysql57/bin/mysql -u root -h old.tdb3 -P 4000 -ppassword
mysql> select distinct host from information_schema.processlist;
Log in to every tidb node, append each node's output to a single file, then sort and deduplicate it to get the list of client IPs, for example:
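A minimal sketch of collecting and deduplicating the client IPs from the three tidb nodes (the host names and the output file path are assumptions; the host column of processlist is in ip:port form, so the port is stripped with awk):
for h in old.tdb1 old.tdb2 old.tdb3; do
    /opt/soft/mysql57/bin/mysql -u root -h "${h}" -P 4000 -ppassword -NBe \
        "select distinct host from information_schema.processlist;" \
        | awk -F: '{print $1}' >> /tmp/tidb_client_hosts.txt   # keep only the client ip
done
sort -u /tmp/tidb_client_hosts.txt                             # deduplicated client ip list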
2. Confirm whether there is a reconnection mechanism
(Omitted.)
3. Enable binlog and take a full backup
This step is performed on the ansible management machine.
(1) Edit the configuration file
ansible # vim /path/github/tidb-ansible-2.1.8/inventory.ini
- Add monitoring for the pump components
[monitored_servers]
monitor-pump1 ansible_host=xxxx deploy_dir=/path/tidb-data/pump-23001
monitor-pump2 ansible_host=xxxx deploy_dir=/path/tidb-data/pump-23001
monitor-pump3 ansible_host=xxxx deploy_dir=/path/tidb-data/pump-23001
- Add pump component
[pump_servers]
# The three entries below are the pump machines; to enable the pump component you also need to set enable_binlog = True
pump1 ansible_host=xxxx deploy_dir=/path/tidb-data/pump-23001
pump2 ansible_host=xxxx deploy_dir=/path/tidb-data/pump-23001
pump3 ansible_host=xxxx deploy_dir=/path/tidb-data/pump-23001
- Pump port settings and enable binlog parameters
## Global variables
[all:vars]
pump_port = 23001
## binlog trigger
enable_binlog = True
#enable_binlog = False
If enable_binlog = True is not set, the pump nodes are ignored during deployment. Also note that the tidb configuration must not be reloaded/restarted until the pump is able to provide service, otherwise business writes will fail.
(2) Edit the configuration of the pump
ansible # vim /path/github/tidb-ansible-2.1.8/conf/pump.yml
- Modify the binlog retention period
global:
# a integer value to control expiry date of the binlog data, indicates for how long (in days) the binlog data would be stored.
# must bigger than 0
gc: 14
Change it to 14 days to avoid losing incremental data (the binlog being purged) because the full import takes a long time.
(3) Log in to the target machine to create a directory
Log in to each pump node to create directories and change permissions
ansible # ssh pump1
pump1 # mkdir -p /path/tidb-data/pump-23001
pump1 # chown -R tidb. /path/tidb-data/pump-23001
ansible # ssh pump2
pump2 # mkdir -p /path/tidb-data/pump-23001
pump2 # chown -R tidb. /path/tidb-data/pump-23001
ansible # ssh pump3
pump3 # mkdir -p /path/tidb-data/pump-23001
pump3 # chown -R tidb. /path/tidb-data/pump-23001
(4) Deploy pump and monitor on the ansible management machine
ansible # ansible-playbook deploy.yml -l monitor-pump1,monitor-pump2,monitor-pump3,pump1,pump2,pump3 -i inventory.ini
(5) Start pump and monitor on the ansible management machine
ansible # ansible-playbook start.yml -l monitor-pump1,monitor-pump2,monitor-pump3,pump1,pump2,pump3 -i inventory.ini
(6) Log in to tidb to see if the pump is deployed
ansible # /opt/soft/mysql57/bin/mysql -u root -h old.tdb.com -P 4000 -ppassword
mysql> show pump status;
+------------+------------+--------+--------------------+---------------------+
| NodeID | Address | State | Max_Commit_Ts | Update_Time |
+------------+------------+--------+--------------------+---------------------+
| xxxx:23001 | xxxx:23001 | online | 427138948355850245 | 2021-08-20 04:42:57 |
| xxxx:23001 | xxxx:23001 | online | 427138948395171844 | 2021-08-20 04:42:57 |
| xxxx:23001 | xxxx:23001 | online | 427138948408279045 | 2021-08-20 04:42:57 |
+------------+------------+--------+--------------------+---------------------+
3 rows in set (0.00 sec)
mysql>
Note that <font color='red'>versions earlier than 2.1.6 do not support this query; you need to check the pump status with binlogctl instead, as in the following example.</font>
ansible # /path/binlogctl -pd-urls=http://pd_host:pd_port -cmd pumps
INFO[0000] pump: {NodeID: xxxx:23001, Addr: xxxx:23001, State: online, MaxCommitTS: 432180280017551379, UpdateTime: 2021-08-20 04:45:57 +0800 CST}
INFO[0000] pump: {NodeID: xxxx:23001, Addr: xxxx:23001, State: online, MaxCommitTS: 432180280004444167, UpdateTime: 2022-03-30 18:45:14 +0800 CST}
INFO[0000] pump: {NodeID: xxxx:23001, Addr: xxxx:23001, State: online, MaxCommitTS: 432180280017551372, UpdateTime: 2022-03-30 18:45:14 +0800 CST}
(7) Rolling restart the tidb node on the ansible management machine
<font color='red'>Before performing this operation, make sure that the pump component is working properly. </font>
ansible # ansible-playbook rolling_update.yml -t tidb -i inventory.ini
Note that this operation may fail to start or stop a node through ansible (it hangs until it times out). If this happens, you can log in to the target machine and start or stop it manually. Reference commands:
- Start: cd /path/tidb/scripts && sudo -u tidb bash start_tidb.sh
- Stop: cd /path/tidb/scripts && sudo -u tidb bash stop_tidb.sh
(8) Log in to tidb to check whether binlog is enabled
ansible # /opt/soft/mysql57/bin/mysql -u root -h old.tdb.com -P 4000 -ppassword
mysql> show variables like 'log_bin';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| log_bin | 1 |
+---------------+-------+
1 row in set (0.01 sec)
- Note that in versions earlier than <font color='red'>2.1.6, log_bin always shows 0</font>; even with enable_binlog = True, show variables like 'log_bin' still returns 0, but the pump does record binlog.
- It is recommended to check each tidb node one by one, for example:
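A small sketch of that per-node check (the host names are assumptions; each node should report log_bin = 1, or 0 on pre-2.1.6 versions as noted above):
for h in old.tdb1 old.tdb2 old.tdb3; do
    echo "== ${h} =="
    /opt/soft/mysql57/bin/mysql -u root -h "${h}" -P 4000 -ppassword \
        -e "show variables like 'log_bin';"
done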
(9) Update monitoring on the ansible management machine
ansible # ansible-playbook rolling_update_monitor.yml -t prometheus -i inventory.ini
(10) Create a full backup
ansible # /path/mydumper -u user -p pass -h old.tdb.com -P 4000 -t 2 -F 32 --skip-tz-utc -o /backup_path/4000 -B db_name
Notes on the backup:
- Tool download: https://docs.pingcap.com/zh/tidb/v2.1/backup-and-restore
- Back up during off-peak hours, otherwise the NIC may be saturated (especially in architectures where tidb has a 10 Gigabit NIC but tikv only has a Gigabit NIC).
- The backup may fail because the GC time is too short (solved by increasing the GC life time; see the example after this list).
- The backup may fail because the memory allocated to tidb is too small (solved by increasing tidb's memory).
- After the backup completes, it is recommended to check the table-creation statement files for illegal time values ("0000-00-00"); if present they will cause errors when importing into the new cluster, and you need to discuss changing the default value with the business.
- mydumper does not support rate-limited backup. Backing up to a machine with poor disk performance, or to network storage such as cfs, can limit the rate to some extent.
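For the GC issue mentioned above, a minimal sketch of stretching the GC life time on the old cluster before the backup (the 720h value is an assumption; remember to restore the original value after the backup finishes):
mysql> select VARIABLE_VALUE from mysql.tidb where VARIABLE_NAME = 'tikv_gc_life_time';
mysql> update mysql.tidb set VARIABLE_VALUE = '720h' where VARIABLE_NAME = 'tikv_gc_life_time';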
4. Deploy the 4.0 environment and import full data
Pay attention to the pessimistic transaction model. Although <font color='red'>4.0 supports pessimistic transactions and the new cluster enables them by default, there are still restrictions on when an operation actually uses pessimistic locking: it must be a non-autocommit transaction. For details see section 6.2.3.2</font> of https://book.tidb.io/session1/chapter6/pessimistic-txn.html
(1) Install tiup
ansible # curl --proto '=https' --tlsv1.2 -sSf https://tiup-mirrors.pingcap.com/install.sh | sh
ansible # . /root/.bash_profile
ansible # tiup --version
ansible # tiup update --self
(2) Prepare the topology file
ansible # vim topology-oltp-xxx.yaml
global:
user: "tidb"
ssh_port: 22
deploy_dir: "/tidb-deploy"
data_dir: "/tidb-data"
monitored:
node_exporter_port: 11000
blackbox_exporter_port: 12000
pd_servers:
- host: 10.0.1.4
- host: 10.0.1.5
- host: 10.0.1.6
tidb_servers:
- host: 10.0.1.1
- host: 10.0.1.2
- host: 10.0.1.3
tikv_servers:
- host: 10.0.1.7
- host: 10.0.1.8
- host: 10.0.1.9
cdc_servers:
- host: 10.0.1.7
- host: 10.0.1.8
- host: 10.0.1.9
monitoring_servers:
- host: 10.0.1.10
grafana_servers:
- host: 10.0.1.10
alertmanager_servers:
- host: 10.0.1.10
The above is the configuration template provided officially; please modify it according to your actual situation.
- It is recommended to deploy ticdc (or pump) so that, if a rollback is required, the incremental data does not have to be chased back after the fact.
- A separate machine for each component is recommended.
(3) Check whether the ssh channel from the tiup management machine to each node is normal
(Omitted.)
(4) Deploy the cluster
ansible # tiup cluster check topology-oltp-xxx.yaml
ansible # tiup cluster deploy tidb-oltp-xxx-v4.0.13 v4.0.13 topology-oltp-xxx.yaml
ansible # tiup cluster start tidb-oltp-xxx-v4.0.13
ansible # tiup cluster display tidb-oltp-xxx-v4.0.13
- The check may report many exceptions; they can be fixed according to the prompts, and many can also be ignored (or auto-repaired, see the note below). Please refer to https://docs.pingcap.com/en/tidb/v4.0/tiup-component-cluster-check#tiup-cluster-check
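If you want tiup to try to repair the failed check items automatically, the check subcommand also accepts an --apply option; this is a sketch, so verify the flag against your tiup version before relying on it:
ansible # tiup cluster check topology-oltp-xxx.yaml --apply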
(5) Permission maintenance
ansible # /opt/soft/mysql57/bin/mysql -u root -h new.tdb.com -P 15002 -p
mysql> create user if not exists root@"192.168.1.%" IDENTIFIED BY 'password';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'192.168.1.%' WITH GRANT OPTION;
mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'pd1' WITH GRANT OPTION;
mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'pd2' WITH GRANT OPTION;
mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'pd3' WITH GRANT OPTION;
- At this point you can still log in with an empty password.
- Authorization for the pd nodes must be added here, and it must be for the root user (we simply granted ALL; testing showed that SELECT alone is not enough, and we did not do more fine-grained permission testing), otherwise the dashboard cannot be used normally.
ansible # /opt/soft/mysql57/bin/mysql -u root -h new.tdb.com -P 15002 -ppassword
mysql> drop user if exists root@"%";
It is recommended to delete the empty-password user root@'%'.
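A quick sanity check of the remaining accounts after the cleanup (a sketch, run on the new cluster):
mysql> select user, host from mysql.user order by user, host;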
(6) Import full data
ansible # /path/loader -d /backup_path/4000 -h new.tdb.com -u user -p password -P 15002 -t 2 -status-addr ":9299"
Notes on the restore:
- Tool download: https://docs.pingcap.com/zh/tidb/v2.1/backup-and-restore
- It is recommended to restore during off-peak hours.
- The loader may fail because of emoji. If you run into this, you can try Dumpling.
- When running multiple loader tasks, it is recommended to avoid the default status port, otherwise they may fail because of port conflicts.
5. Synchronize incremental data
This step is performed on the ansible management machine
(1) Obtain the backup point on the backup machine (in this example, the ansible management machine was used for the backup)
View the metadata file in the backup directory:
ansible # cd /backup_path/xxx
ansible # cat metadata
Started dump at: 2021-08-29 15:34:30
SHOW MASTER STATUS:
Log: tidb-binlog
Pos: 425971435565482001
GTID:
Finished dump at: 2021-08-29 15:34:33
ansible #
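The Pos value above is the TSO that will be used as initial_commit_ts for the drainer below. A small sketch to extract it (assuming the backup directory above):
ansible # awk '/Pos:/ {print $2}' /backup_path/xxx/metadata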
(2) Modify the configuration file
ansible # vim /path/github/tidb-ansible-2.1.8/inventory.ini
- Add monitoring for the drainer component
[monitored_servers]
monitor-drainer1 ansible_host=xxxx deploy_dir=/path/tidb-data/drainer-24001
- Add drainer component
[drainer_servers]
drainer1 ansible_host=xxxx deploy_dir=/path/tidb-data/drainer-24001 initial_commit_ts="425971435565482001"
- drainer port settings
## Global variables
[all:vars]
drainer_port = 24001
(3) Prepare the drainer configuration file
ansible # vim /path/github/tidb-ansible-2.1.8/conf/drainer1_drainer.toml
The configuration file must be named according to the rule [alias_drainer.toml] (here the alias is drainer1), otherwise the custom configuration file will not be found during deployment.
# drainer Configuration.
# the interval time (in seconds) of detect pumps' status
detect-interval = 10
# syncer Configuration.
[syncer]
# disable sync these schema
ignore-schemas = "INFORMATION_SCHEMA,PERFORMANCE_SCHEMA,mysql"
# number of binlog events in a transaction batch
txn-batch = 2000
# work count to execute binlogs
worker-count = 32
disable-dispatch = false
# safe mode will split update to delete and insert
safe-mode = false
# downstream storage, equal to --dest-db-type
# valid values are "mysql", "pb", "tidb", "flash", "kafka"
db-type = "tidb"
# the downstream MySQL protocol database
[syncer.to]
host = "new.tdb.com"
user = "user"
password = "xxxx"
port = 15002
txn-batch and worker-count default to 1 in the configuration file. It is recommended to increase them according to the actual workload; if they are too small, incremental replication may never catch up.
(4) Deploy drainer and monitor
ansible # ansible-playbook deploy_drainer.yml -i inventory.ini -l drainer1
ansible # ansible-playbook deploy.yml -i inventory.ini -l monitor-drainer1
(5) Log in to the tidb of the new cluster and authorize the drainer node
ansible # /opt/soft/mysql57/bin/mysql -u root -h new.tdb.com -P 15002 -ppassword
mysql> create user if not exists user@"drainer_host" IDENTIFIED BY 'xxxx';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'user'@'drainer_host';
Note: the new cluster (4.0) must authorize the host where the drainer runs, otherwise the drainer will report an error on startup. For convenience in this demonstration, ALL privileges are granted directly.
(6) Start drainer and monitor
Before starting the drainer, it is recommended to check whether the tidb_binlog schema already exists in the target cluster. If it exists and you need to start incremental replication from the backup point, you must delete it manually first; otherwise the drainer will resume replication from the checkpoint stored there. (This usually happens when a full import fails, the full import is redone, and the tidb_binlog schema is forgotten and left behind.)
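A minimal sketch of that check on the new cluster (drop the schema only when you are sure replication should restart from initial_commit_ts rather than from the stored checkpoint):
mysql> show databases like 'tidb_binlog';
mysql> select * from tidb_binlog.checkpoint;   -- the drainer checkpoint, if the schema exists
mysql> drop database tidb_binlog;              -- only if you really want to restart from the backup point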
ansible # ansible-playbook start_drainer.yml -i inventory.ini -l drainer1
ansible # ansible-playbook start.yml -i inventory.ini -l monitor-drainer1
(7) Log in to tidb to check the drainer status
ansible # /opt/soft/mysql57/bin/mysql -u root -h old.tdb.com -P 4000 -ppassword
mysql> show drainer status;
+------------+------------+--------+--------------------+---------------------+
| NodeID | Address | State | Max_Commit_Ts | Update_Time |
+------------+------------+--------+--------------------+---------------------+
| xxxx:24001 | xxxx:24001 | online | 431972431138127904 | 2021-08-25 16:42:57 |
+------------+------------+--------+--------------------+---------------------+
1 row in set (0.00 sec)
mysql>
Note that <font color='red'>versions earlier than 2.1.6 do not support this query; you need to check the drainer status with binlogctl instead, as in the following example.</font>
ansible # /path/binlogctl -pd-urls=http://pd_host:pd_port -cmd drainers
INFO[0000] drainer: {NodeID: xxxx:24001, Addr: xxxx:24001, State: online, MaxCommitTS: 432180589478543384, UpdateTime: 2021-08-25 16:45:57 +0800 CST}
(8) Update monitoring
ansible # ansible-playbook rolling_update_monitor.yml -t prometheus -i inventory.ini
(9) Log in to grafana to view the synchronization progress
Note: if the replication lag is large, you can first silence the drainer alerts in alertmanager.
6. Verify data consistency between old and new clusters
(1) Download tool
ansible # git clone https://gitee.com/mo-shan/check_data_for_mysql.git
ansible # cd check_data_for_mysql
(2) Modify the configuration
- Edit configuration file
ansible # cd /path/check_data_for_mysql
ansible # vim conf/check.conf
mysql_user="xxxx"
mysql_passwd="xxxx"
mysql_port1="6666"
mysql_port2="6666"
mysql_host1="192.168.1.1"
mysql_host2="192.168.1.2"
max_count=10000
threads=5
max_threads_running=30
mysql_path="/opt/soft/mysql57/bin/mysql"
log_partition="/dev/sda3"
log_par_size="10"
skip_check_table=""
skip_check_db="INFORMATION_SCHEMA,METRICS_SCHEMA,PERFORMANCE_SCHEMA,mysql,sys,tidb_binlog,test,tidb_loader,dm_meta" # not recommended to change
Please fill in these settings according to the comments and your actual situation.
- Modify the work path
ansible # sed -i 's#^work_dir=.*#work_dir=\"/check_data_for_mysql_path\"#g' start.sh # replace check_data_for_mysql_path with the absolute path of the check_data_for_mysql home directory
(3) Test case
- <font color='red'>The log directory is forcibly cleared every time a verification task runs, so back up the verification results.</font>
- When running a verification task, <font color='red'>it is strongly recommended to use screen.</font>
- If NIC monitoring is needed, <font color='red'>it is also strongly recommended to run the monitoring script in a separate screen.</font>
Step 1: open a screen to monitor the network
ansible # screen -S check_net_4000
ansible # bash manager.sh -a start
[ 2022-01-18 11:55:34 ] [ 1000 Mb/s ] [ RX : 2 MB/S ] [ TX : 2 MB/S ]
[ 2022-01-18 11:55:35 ] [ 1000 Mb/s ] [ RX : 2 MB/S ] [ TX : 4 MB/S ]
[ 2022-01-18 11:55:36 ] [ 1000 Mb/s ] [ RX : 2 MB/S ] [ TX : 2 MB/S ]
[ 2022-01-18 11:55:37 ] [ 1000 Mb/s ] [ RX : 2 MB/S ] [ TX : 3 MB/S ]
[ 2022-01-18 11:55:38 ] [ 1000 Mb/s ] [ RX : 1 MB/S ] [ TX : 2 MB/S ]
[ 2022-01-18 11:55:39 ] [ 1000 Mb/s ] [ RX : 1 MB/S ] [ TX : 2 MB/S ]
[ 2022-01-18 11:55:41 ] [ 1000 Mb/s ] [ RX : 1 MB/S ] [ TX : 2 MB/S ]
[ 2022-01-18 11:55:42 ] [ 1000 Mb/s ] [ RX : 2 MB/S ] [ TX : 8 MB/S ]
Step 2: open another screen to run the verification task
ansible # screen -S check_data_4000
ansible # bash start.sh -d dba -t dbatest1 -f true
[ 2022-01-17 20:32:19 ] [ OK ] [ 192.168.1.1 ] [ start.sh/start.sh ] [ f_prepare:130 ] [ this data consistency check starts ]
[ 2022-01-17 20:32:19 ] [ WARN ] [ 192.168.1.1 ] [ start.sh/start.sh ] [ f_main:185 ] [ this check will cover the following databases : [dba] ]
[ 2022-01-17 20:32:19 ] [ OK ] [ 192.168.1.1 ] [ start.sh/start.sh ] [ f_main:203 ] [ checking database dba ]
[ 2022-01-17 20:32:19 ] [ OK ] [ 192.168.1.1 ] [ func/f_check_diff_for_mysql.sh ] [ f_check_diff_for_mysql:249 ] [ dba.dbatest1 ] [ table schemas are consistent ]
[ 2022-01-17 20:32:19 ] [ OK ] [ 192.168.1.1 ] [ func/f_check_diff_for_mysql.sh ] [ f_check_diff_for_mysql:491 ] [ dba.dbatest1 ] [ 1,1 ] [ 00 d 00 h 00 m 00 s ] [ 9.09%, (0:0)/1 ] [ data is consistent ]
[ 2022-01-17 20:32:19 ] [ OK ] [ 192.168.1.1 ] [ func/f_check_diff_for_mysql.sh ] [ f_check_diff_for_mysql:491 ] [ dba.dbatest1 ] [ 2,11 ] [ 00 d 00 h 00 m 00 s ] [ 100.00%, (0:0)/1 ] [ data is consistent ]
[ 2022-01-17 20:32:19 ] [ OK ] [ 192.168.1.1 ] [ func/f_check_diff_for_mysql.sh ] [ f_check_diff_for_mysql:504 ] [ dba.dbatest1 ] [ check finished ]
[ 2022-01-17 20:32:19 ] [ OK ] [ 192.168.1.1 ] [ start.sh/start.sh ] [ f_main:242 ] [ this data consistency check completed ] [ passed ]
ansible #
When the check completes it will report that the check passed; otherwise it reports that the check failed.
- For the tool implementation logic, please refer to https://mp.weixin.qq.com/s/PPGSnPL-2FgRWftDzqxAOA
7. Deliver the new environment
The DBA provides the new domain name and port to the business. Here, a read-only account is also provided for the business.
ansible # /opt/soft/mysql57/bin/mysql -u root -h new.tdb.com -P 15002 -ppassword
mysql> create user if not exists read_only@"host" IDENTIFIED BY 'xxxx';
mysql> GRANT SELECT ON *.* TO 'read_only'@'host';
It is important to note that the permissions table (mysql.user) should not be synchronized before delivering the new environment.
8. Business verification
Please have the business verify its scenarios fully.
9. Sync permission table
The permission table schemas of TiDB 2.1 and 4.0 are different, so permissions cannot be synchronized by simply exporting and importing, and testing showed that the pt tools cannot be used either. The following script synchronizes permissions; it has been tested from 2.1 to 4.0, other versions have not been tested.
#!/bin/bash
# Dump the users and grants of the old cluster and print the equivalent
# create user / grant statements so they can be replayed on the new cluster.
port=4000
mysql_comm="/opt/soft/mysql57/bin/mysql -u root -h old.tdb.com -P ${port} -ppassword"

for user in $(${mysql_comm} -NBe "select concat(user,'@','\"',host,'\"',':::',password) from mysql.user;" 2>/dev/null)
do
    # user_tmp is user@"host", pass_tmp is the stored password hash
    user_tmp="$(awk -F::: '{print $1}' <<< "${user}")"
    pass_tmp="$(awk -F::: '{print $2}' <<< "${user}")"
    create_user="create user if not exists ${user_tmp} IDENTIFIED BY PASSWORD '${pass_tmp}';"
    drop_user="drop user if exists ${user_tmp};"
    if grep -q "^root@" <<< "${user_tmp}"; then
        # keep the root accounts created earlier on the new cluster and replay their grants
        grant_user="$(${mysql_comm} -NBe "show grants for ${user_tmp}" 2>/dev/null | sed 's/$/ WITH GRANT OPTION;/g')"
    else
        # non-root accounts are dropped and recreated with the old password hash
        grant_user="$(${mysql_comm} -NBe "show grants for ${user_tmp}" 2>/dev/null | sed 's/$/;/g')"
        echo "${drop_user}"
    fi
    echo "${create_user}"
    echo "${grant_user}"
done
The script prints the permissions of the old cluster; after they are confirmed, they can be written into the new cluster.
ansible # bash show_grant.sh | /opt/soft/mysql57/bin/mysql -u root -h new.tdb.com -P 15002 -ppassword
- After permissions have been synchronized, <font color='red'>the business should not do any authorization on its own</font>; if a new host needs to be authorized or a new user created, ask the DBA for assistance.
- <font color='red'>Please do not run DDL operations</font> unless necessary; if necessary, ask the DBA for assistance.
10. Switch traffic
- Before the business switches traffic, it is recommended to restart the new cluster's tidb nodes one by one to release the cached auto_increment IDs (see the example after this list). After the restart, check the status and lag of the drainer task, and contact the business to switch only when there is no lag.
- If you do not restart, then after switching to the new cluster a large number of [Duplicate entry '' for key 'PRIMARY'] errors may be reported on <font color='red'>auto-increment primary keys</font>.
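A sketch of the one-by-one restart with tiup (the cluster name comes from the deployment above; the node IDs are assumptions, take the real host:port values from tiup cluster display):
ansible # tiup cluster display tidb-oltp-xxx-v4.0.13
ansible # tiup cluster restart tidb-oltp-xxx-v4.0.13 -N 10.0.1.1:15002
ansible # tiup cluster restart tidb-oltp-xxx-v4.0.13 -N 10.0.1.2:15002
ansible # tiup cluster restart tidb-oltp-xxx-v4.0.13 -N 10.0.1.3:15002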
This process is actually very simple: either replace the rs list behind the existing vip with the new cluster's tidb hosts directly, or apply for a new vip and point the original domain name at it. However, because of historical issues the original domain name and tidb port did not meet our management standards, so the business needs to access tidb through the new domain name/port.
Note: re-pointing the domain name to a new vip only affects newly established connections.
Given the particularity of this environment, the DBA offers two schemes for moving the business to the new domain name and port.
Whichever scheme is adopted, it is not recommended to decommission the old domain name immediately after switching traffic. The recommended practice is to remove the rs list behind the vip of the old domain name and attach the new cluster's tidb nodes to that vip instead (note that the new tidb port may differ from the old vip port). This avoids breaking any business that was missed; watch the DNS logs for a few days and take the old domain name offline only after confirming no business still uses it.
(1) Crude scheme
The business directly modifies the connection information and uses new.tdb.com:15002 to connect to tidb.
Because multiple businesses may use this database, and each business may have multiple application servers, it is impossible for all of them to switch to the new cluster at the same moment, so the following situations will occur:
1) Writes go to the new cluster while reads go to the old cluster: the reads cannot see the new data, because there is no replication link from the new cluster back to the old one.
2) Writes go to the old cluster while reads go to the new cluster: the reads may not see the data yet, because there is replication lag from the old cluster to the new one.
Double writing cannot be avoided, which may lead to the following problems.
- A. Two connections update the same row within a very short interval (shorter than the replication delay from the old cluster to the new one), one executing on the old cluster and one on the new cluster. In that case the final state of the row is not determined by whichever executed last. For example, first execute [update t set name = 1 where id = 2;] on the old cluster, then execute [update t set name = 2 where id = 2;] on the new cluster. In theory the row should end up with name=2, but because of the replication lag between the clusters, it may be overwritten by the old cluster's data and become name=1. In the reverse order (new cluster first, then old cluster) the data is not affected.
- B. Two connections insert the same row (same primary key or same unique key), one into the new cluster and one into the old cluster. If the new cluster is written first and the old cluster afterwards, both commits succeed from the business's point of view, but replication from the old cluster to the new one then fails, because the row written on the old cluster conflicts with the row already present on the new cluster (primary key/unique key conflict); at this point the DBA must intervene manually to repair it. In the reverse order (old cluster first, then new cluster, regardless of replication lag), the session writing to the new cluster gets an error, and the data is not affected.
For the above situations, <font color='red'>the business needs to evaluate fully. If they are unacceptable, you can suggest that the business use the smooth scheme below, which has less impact.</font>
(2) Smooth scheme
The business continues to use old.tdb.com:4000 to connect to tidb.
The DBA needs to add the new cluster's tidb nodes to the rs list behind the old cluster's vip. To avoid writing to the old and new clusters at the same time, the old rs must be removed from the vip first, and only then are the new cluster's tidb IPs added to the vip's rs list.
Two actions are involved here:
- Clear the rs list of the old cluster's vip (take the rs offline). It is recommended to actively release the connections (restart/stop the old cluster's tidb), otherwise (depending on how the vip is implemented) existing connections may not be released after the rs is removed.
- Add the new cluster's tidb IPs to the rs list of the old cluster's vip.
These two operations must be confirmed with the business, because there is a gap (expected to be within 30s) between removing the rs and adding the new ones, and <font color='red'>the cluster is unavailable during this window.</font>
After completing the above operations, the old cluster's access information becomes the following table:
dns | old.tdb.com | |
---|---|---|
vip | 192.168.1.100:4000 | rs : 192.168.1.1:15002 192.168.1.2:15002 192.168.1.3:15002 |
After that, the business needs to update its configuration service by service, replacing the old domain name and port with the new ones (old.tdb.com:4000 must be replaced with new.tdb.com:15002) and then restarting; done this way, the impact is relatively small.
V. Closing remarks
This document is only experience sharing and a pit-avoidance guide. Because usage scenarios and environments differ, other problems may be encountered during migration. <font color='red'>If you need to operate on a production environment, please test fully in a test environment first.</font>