Foreword <br>While supporting the migration of business systems within JD.com Group and of external JD Cloud customers onto JD Public Cloud, JD Private Cloud, and JD Government Cloud, the JD Technology - JD Cloud Business Group - Technical Service Group has accumulated management and technical experience in migrating business system data. We share some of it here in the form of cases, in the hope that it helps with your own business migration work.

Preparations before migration <br>Migrating business data to the cloud involves a wide variety of data. The main types include:

Database: relational databases such as MySQL, PG, Oracle, etc.
Object storage: object storage with a standard S3 interface.
Middleware data: ES, MongoDB, Redis, etc.
File storage: unstructured data such as documents and pictures.
Big data: HBase, HDFS files, etc.

During the cloud migrations of JD Cloud's internal and external customers, most businesses involve the data types above. Based on the experience accumulated from related migration cases, at least the following preparations should be made before a data migration starts.
1. An executable data migration technical solution has been formulated, with clear migration steps (pre-migration preparation, migration operations, and post-migration verification), executors, and confirmers.
2. A migration emergency plan and a switch-back plan have been formulated, the responsibility matrix is clear, and the decision conditions and decision-makers for abnormal situations have been confirmed.
3. The data security level has been confirmed, the compliance and security of the migration plan have been confirmed, and the plan has passed review by the relevant business security department.
4. The migration duration and the cutover data synchronization window have been evaluated (based on POC verification data), and a fallback option for each business and data migration has been confirmed.
5. The network bandwidth and quality have been confirmed to meet the migration requirements.

Case studies <br>The following are several cases involving different data migration scenarios.

1. Relational databases
MySQL:
MySQL is the most common database in migrations, and mature migration tools and solutions exist for it, including official and open-source tools such as mysqldump. Each cloud vendor also has its own DTS migration tool.

DTS tools:
DTS services abstract the transmission, synchronization, and data verification steps to a certain degree and provide a relatively friendly interactive interface, and multiple tasks can run in parallel. For scenarios that require a smooth migration, DTS has the advantage of automation and saves a lot of manpower, and some DTS tools can perform cross-version migration.
The limitations of DTS are:
(1) The source database and the target database must be able to communicate with the DTS management service IPs over the network and have a stable network connection.
(2) The database must meet certain prerequisites before post-migration incremental synchronization can work. These are usually permission requirements, such as REPLICATION SLAVE and REPLICATION CLIENT (see the sketch after this list). In full + incremental scenarios, stored procedures and functions are usually not included, and during the full migration stage DDL operations such as ALTER TABLE and DROP TABLE are not supported. The restrictions of different vendors' tools may differ; read the product documentation carefully and verify the functionality through a POC.
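As a reference for the permission prerequisites above, the following is a minimal sketch of creating a migration account on a self-managed MySQL source and checking the binlog/GTID settings that incremental synchronization typically depends on. The account name, host, and exact variable names are illustrative and depend on the source's MySQL/MariaDB version and the vendor's DTS documentation.

    # Create a migration account and grant the privileges commonly required by DTS-style tools
    mysql -h <source-host> -uroot -p -e "
      CREATE USER 'dts_migrator'@'%' IDENTIFIED BY '<password>';
      GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'dts_migrator'@'%';
      FLUSH PRIVILEGES;"

    # Incremental sync usually also expects binlog enabled in ROW format, with GTID where supported
    mysql -h <source-host> -uroot -p -e "
      SHOW VARIABLES WHERE Variable_name IN ('log_bin','binlog_format','gtid_mode');"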

mysqldump tool:
Appropriate scenarios: if there is no good network connection (or no connection at all) between the source and destination databases, and a certain amount of business interruption is allowed, completing the data export and import within the shutdown window is a more suitable solution (if host-level management and control capability is available, migrating the database host as a whole as an image is also a feasible method).
mysqldump export and import are faster than DTS (they run locally and, compared with DTS, involve fewer intermediate links), but extra time is needed to compress the data files and transfer them over the network or via removable media.
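The following is a minimal sketch of this export, compress, transfer, and import flow during a downtime window. Host names, database name, and paths are placeholders, and the exact mysqldump options should follow the target RDS vendor's import requirements.

    # Export and compress on the source side
    mysqldump -h <source-host> -u<user> -p --single-transaction --databases <appdb> | gzip > /data/appdb.sql.gz

    # Transfer over the network (or removable media) to a host that can reach the target RDS
    scp /data/appdb.sql.gz <jd-cloud-host>:/data/

    # Import into the target RDS
    gunzip -c /data/appdb.sql.gz | mysql -h <target-rds> -u<user> -p
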
Other open-source and commercial tools, such as StreamSets, can synchronize MySQL to heterogeneous databases; they are more powerful but also come with more restrictions.

Estimated migration time:
During a service cutover, migrating and synchronizing the business data is an important step before the cutover itself. It is also the part of the cutover that takes the most time, is prone to errors, and can delay or fail the cutover, so a reliable estimate of the time needed for data migration and synchronization is required.
DTS migrates the full data with table-level concurrency, so the migration duration must be evaluated based on factors such as the actual data types, the number of rows, network bandwidth, network latency, synchronization instance specifications, the number of tables, and the size of individual tables.

For example, suppose a database is 500 GB in total with 5 tables, one of which is 400 GB while the remaining 4 are 25 GB each. Because the single 400 GB table is relatively large, the migration takes longer; if instead there were five 100 GB tables, the migration time would be shorter.
Before production data is officially migrated, a POC migration in the test environment is usually performed to verify the process and estimate the cutover duration for the production environment.

The JD Cloud DTS data migration and synchronization architecture is shown in the architecture diagram (omitted here).

Case 1 <br>A migration from another vendor's public cloud to JD Cloud's public cloud, in which a source-side binlog problem forced a switch from DTS migration to a manual migration method.

Project conditions: the business had an 8-hour shutdown window, so both DTS and manual database export/import were viable options. Given DTS's non-stop-service and incremental synchronization features, we chose to start the JD Cloud DTS service before the business was stopped, synchronizing the historical data and enabling DTS incremental synchronization. Based on the downtime window, we allotted 4 hours for the online database migration and incremental synchronization. The DTS service does not affect the online business, and the estimate was based on migration experience and evaluation in the test environment.
On the afternoon before the shutdown, to leave enough time buffer for the migration, we started the DTS task for the main database in advance, and the database migration proceeded normally, with an estimated duration of 4 hours. Then a source-side binlog issue caused a fatal error and the DTS task failed.

Migration Task Run Error
ERROR 1236 (HY000): The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.
Region: cn-north-1
ClusterGID: dts-r1rroa

Because of the binlog error (part of the binlog had been purged), the DTS task could not be recovered, and the 4 hours already spent on the DTS transfer were wasted. Because the migration is a systematic project and the other data migration work was proceeding according to plan, nobody had time at that point to analyze the specific cause.
Because the customers had already been notified that the service would stop that night, the other data migrations and the debugging of the migrated services had already started.

We therefore decided to export the data with mysqldump. The local export was very fast (20 MB/s), and compressing the export file reduced its size and thus the network transfer time. The file was transferred over the network to a cloud host on the JD Cloud side and then imported into RDS from there. The whole export, transfer, and import process took less than 2 hours.

After the MySQL data was imported, the migrated data was verified according to the migration process, using the checksum_table tool to compare the source and destination databases.

Source library information: --src-host sourceIP --src-user user --src-pass pass
Target library information: --dest-host targetURL --dest-user user --dest-pass pass
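If a dedicated comparison tool is not available, a rough cross-check can also be done with MySQL's built-in CHECKSUM TABLE statement on both sides; the hosts, database, and table names below are placeholders.

    # Run the same checksum on the source and the target and compare the results per table
    mysql -h <source-host> -u<user> -p -e "CHECKSUM TABLE <db>.<table1>, <db>.<table2>;"
    mysql -h <target-rds> -u<user> -p -e "CHECKSUM TABLE <db>.<table1>, <db>.<table2>;"
    # Matching checksums suggest consistency; mismatches need row-level investigation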

During verification, some tables were found to be inconsistent. The business side confirmed that the source side had not fully stopped its services after the migration started and data was still being written, because the business side had not checked the message production of MQ and Kafka according to the migration specification and had only stopped part of the services. The business and R&D teams then checked the newly written data, cleaned up part of it, and completed the database migration.

According to project experience, DTS failures caused by binlog problems like this one are not isolated cases.
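A quick pre-check of the source side's binlog retention before starting a long-running DTS task can reduce the risk of hitting the "purged binary logs" error above. The commands below are a minimal sketch (host and credentials are placeholders; which retention variable exists depends on the MySQL version).

    # How long does the source keep its binlogs, and which binlog files are still on disk?
    mysql -h <source-host> -u<user> -p -e "
      SHOW VARIABLES LIKE 'expire_logs_days';
      SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';
      SHOW BINARY LOGS;"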

Preparations: (1) Prepare an alternative database migration method and contingency plans in advance.
(2) Confirm in advance the decision conditions and decision-makers for abnormal situations, so that decisions can be made in time during implementation.

Migration of vendor-modified (non-native) databases:
Some cloud vendors' database versions include customized development on top of standard products such as MariaDB and PG to meet special business needs of their customers. Such databases are deeply vendor-bound; when doing business migration or disaster-recovery data synchronization for them, customized migration and synchronization solutions have to be designed for the actual scenario, and most of them require customized configuration and operations at the R&D level.

Case 2 <br>A financial-industry user whose original system ran on vendor T's financial cloud using a customized RDS service. Due to the business and data disaster-recovery regulations of the financial industry, remote disaster recovery was required; the goal was business-level disaster recovery, with the disaster-recovery system running on the JD Finance Cloud platform.

To migrate from T Cloud's customized TDSQL to JD Cloud, a detailed investigation of the source database was carried out. Because the source is a customized distributed database with automatic horizontal sharding and a shared-nothing architecture, JD Cloud's DTS tool is not suitable for this scenario. At the same time, the data in the two environments must be kept essentially synchronized in real time to meet the business disaster-recovery requirement.

Formulating a solution <br>When formulating the data synchronization plan, we also investigated the solutions of traditional disaster-recovery vendors. Because those solutions are mostly based on host-level data and I/O analysis or log analysis, they require installing intrusive agents and cannot be adapted to an RDS-on-cloud scenario. The relevant vendors also said they were moving toward cloud disaster recovery but had no mature product yet (the adaptation is difficult). The final solution therefore adopts GTID-based master-slave replication to synchronize the database across heterogeneous clouds, which shields the problems caused by the architectural differences.
Note: Parts of business information and underlying operations have been hidden.

  • First, adjust the permissions on the source side:
    GRANT SELECT, RELOAD, SHOW DATABASES, EXECUTE, REPLICATION SLAVE, REPLICATION CLIENT, SHOW VIEW ON *.* TO 'user'@'192.168.%';
  • Make a full logical backup of the source:
    mysqldump -h xx -uusername -p --databases nx_db -f --single-transaction --master-data=2 --skip-lock-tables > /data1/bs.dmp
    Note that there must be gtid information in the export file.
  • Disaster recovery side import:
    mysql -h xx -f -uusername -p < bs.dmp
  • Copy configuration in the background:
    SET GLOBAL gtid_slave_pos='0-13381-1xxxx06';
    CHANGE MASTER TO MASTER_HOST='sourceIP',MASTER_USER='username',MASTER_PASSWORD='*',MASTER_USE_GTID=slave_pos;
    START SLAVE;
  • Synchronization verification (a replication health-check sketch follows this list)
  • Data verification: (1) Stop write operations on the business side and verify the key tables by running checksum table tablename on the T Cloud side and the JD Cloud side respectively.
    (2) On both the T Cloud and JD Cloud sides, check the number of tables and views:
    select count(1) from information_schema.tables where table_schema='nx_db';
    select count(1) from information_schema.views where table_schema='nx_db';
    The business test verifies that DDL/DML statements are replicated normally by creating a test table and performing insert, delete, update, and select operations.
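As referenced above, the following is a minimal sketch of checking replication health on the disaster-recovery side (MariaDB-style replication matching the CHANGE MASTER configuration above; host and credentials are placeholders, and the exact status fields depend on the database build).

    # Both replication threads should report "Yes" and the lag should stay close to 0
    mysql -h <dr-host> -u<user> -p -e "SHOW SLAVE STATUS\G" | \
      grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'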

After the basic functional verification is completed, it is also necessary to verify the source database's master-slave switchover and the impact of network interruptions on database synchronization. For the source database's log configuration, a local binlog retention requirement (no less than 48 hours) should be put forward, to avoid the logs expiring during a long network interruption and breaking the database synchronization.
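Where the source side allows it (on a managed instance this is usually configured through the vendor's console instead), the retention requirement can be expressed roughly as in the sketch below; the variable name and value are illustrative and depend on the MySQL/MariaDB version.

    # Keep binlogs for 3 days, comfortably beyond the 48-hour requirement
    mysql -h <source-host> -u<user> -p -e "
      SET GLOBAL expire_logs_days = 3;
      SHOW VARIABLES LIKE 'expire_logs_days';"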

To ensure the reliability of data and business disaster recovery, real-time monitoring and alerting must be configured for the leased line. When a network problem occurs, the alarm must be received in time and handled as soon as possible, so that an interruption of the leased line does not affect the availability of the disaster-recovery service.
During the POC, a network interruption was tested: after a 2-hour interruption, follow-up observation showed that the data could still synchronize and catch up normally, so the impact of a possible network interruption in actual operation can be tolerated.

Protection of the source database <br>In this heterogeneous-cloud disaster-recovery case, network communication with the source cloud goes over a leased line and the source database is accessed by IP. After the application hosts were migrated as a whole to JD Cloud, they could still reach the source database over the network, so writes to the source database were still possible. To block access to the source database, a host security group can be used to block the hosts' access to external port 3306, or subnet-level ACLs can restrict access to specific ports on specified network segments. After the application configuration is adjusted and the database connection point is switched, the security group entries or ACL policies are adjusted again to release the corresponding access.

Because of the subnet planning of some databases, using ACLs might affect database synchronization. In this case, therefore, an additional security group was created for the service hosts with a policy blocking port 3306 to protect the source database from unintended writes. After the business adjustment was completed, the security group was removed and business data could be written normally.

2. ES migration
ES is used more and more widely, and ES data migration has become an important part of data migration in business migrations.
ES migration can be performed either with the service stopped or without stopping the service. Non-stop migration places many requirements on the source and destination networks and services and currently still has many practical limitations, so in most cases the migration is done with the service stopped.

Usually, the reindex, snapshot, or logstash methods can be chosen as the migration path. For each method, refer to the official version requirements and choose the one that satisfies them.

Snapshot method:

In the snapshot method, data snapshots are created from the source ES cluster and then restored into the target ES cluster. A repository must be created before creating a snapshot, and one repository can contain multiple snapshot files. Repositories support S3 and shared file systems, among others. A self-built ES cluster can use shared file storage (considering speed, cost, and other factors, this is usually the best choice); when using a public-cloud ES service, object storage that supports the S3 protocol is recommended.

In terms of speed and efficiency, the snapshot method is better than reindex. When no changes need to be made on the source side and the network and storage conditions allow it, snapshots are the preferred way to migrate ES.
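As an illustration of the snapshot path, the sketch below registers an S3-backed repository and moves one snapshot between clusters via the standard Elasticsearch snapshot APIs. It assumes the repository-s3 plugin is installed on both clusters; host names, bucket, and endpoint are placeholders.

    # Register the repository (run against both the source and the target cluster)
    curl -XPUT 'http://<es-host>:9200/_snapshot/migrate_repo' -H 'Content-Type: application/json' -d '{
      "type": "s3",
      "settings": { "bucket": "<bucket>", "endpoint": "<s3-endpoint>" }
    }'
    # Create a snapshot on the source cluster
    curl -XPUT 'http://<source-es>:9200/_snapshot/migrate_repo/snap_1?wait_for_completion=true'
    # Restore it on the target cluster
    curl -XPOST 'http://<target-es>:9200/_snapshot/migrate_repo/snap_1/_restore'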

reindex is an API provided by Elasticsearch that can import data from a source ES cluster into the current ES cluster to achieve data migration. reindex is suitable for migrations with large data volumes, for cases where the index needs to be adjusted or shared storage cannot be attached, and for scenarios where only part of the data needs to be migrated.
The reindex method requires that the target end can access the service port of the source ES cluster.

Case 3 <br>A customer's business was migrated from another vendor's cloud to JD Cloud. The source ES was a self-built service on a K8S cluster exposed in NodePort mode; for security reasons, access was restricted to internal business hosts and the service was not open to the Internet.
Choosing the migration solution: the self-built source ES did not have the S3 plug-in installed. Since a snapshot migration would require installing the S3 plug-in on the source, and the business deployed as Pods would need a new image and an application update, this was not the best choice in terms of time and workload, so reindex was used to migrate the business data.

To let the JD Cloud ES pull data from the source, an nginx reverse proxy was configured on the source side to expose the internal ES interface over the public network, with a whitelist restricting access to the public IP of the JD Cloud NAT gateway egress, ensuring data access security.

On the JD Cloud side, the production subnet has no public-network egress. To pull data temporarily for the migration, the route table was adjusted: a detailed route for the source public IP was added to the route table of the corresponding subnet, pointing to the NAT gateway, temporarily opening public-network connectivity. Through the NAT gateway, the source ES data could be pulled, and the source's public IP was whitelisted in the ES service. Note that the whitelisting operation restarts the ES service.

To meet the network communication requirement, the detailed route for the ES subnet was configured temporarily; it must be deleted after the data migration is completed.

Before migration, confirm that the relevant migration conditions are met:

  • Both the source and the JD Cloud ES service have the corresponding indices created. Confirm that the indices on the cloud side are newly created. The mappings of the source and destination can be the same or different; through reindex, the field types in the mapping can be changed.
  • The JD Cloud ES can reach the service port of the source ES. This can be verified with telnet or with curl -XGET http://<source-es-host>:9200 .

Create an index on the source:
Source ES cluster: 1. create an index (example); 2. write data (example). (Screenshots omitted.)
Target ES configuration. (Screenshots omitted.)
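In place of the screenshots, the following is a hedged sketch of the same steps using the standard Elasticsearch REST APIs; the index name, fields, and addresses are made up, and reindex-from-remote additionally requires the source address to be listed in reindex.remote.whitelist in the target cluster's elasticsearch.yml.

    # 1. On the source cluster: create a test index and write a document
    curl -XPUT 'http://<source-es>:9200/test_index' -H 'Content-Type: application/json' -d '{
      "settings": { "number_of_shards": 3, "number_of_replicas": 1 }
    }'
    curl -XPOST 'http://<source-es>:9200/test_index/_doc' -H 'Content-Type: application/json' -d '{
      "name": "demo", "created_at": "2022-01-01"
    }'
    # 2. On the JD Cloud cluster: create the target index, then pull the data with reindex-from-remote
    curl -XPOST 'http://<target-es>:9200/_reindex' -H 'Content-Type: application/json' -d '{
      "source": { "remote": { "host": "http://<source-public-ip>:9200" }, "index": "test_index" },
      "dest":   { "index": "test_index" }
    }'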

3. Migration of object storage <br>For migrating object-storage data compatible with the S3 protocol, every public-cloud vendor (including some traditional disaster-recovery vendors) has migration tools or scripts, and the migration technology itself is not difficult. However, different vendors' object storage in different regions may differ in underlying versions and configurations.
Therefore, the same tool or script may run into file-access problems when processing object-storage data in different regions. Before and after the migration, the integrity and availability of the migrated data must be checked.

General sequence for object storage migration:
1. Configure mirror back-to-source on the target side, so that a read that returns 404 falls back to the source to fetch the data.
2. Use the migration tool to migrate the historical data from the source side.
3. Verify the data after synchronization (a count-comparison sketch follows this list).
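For step 3, a rough post-migration check is to compare object counts on both sides, assuming both ends expose S3-compatible APIs and the aws CLI is configured with the corresponding credentials; bucket names and endpoints below are placeholders, and equal counts are a necessary rather than sufficient signal (spot-check sizes and checksums of key files as well).

    # Count objects in the source and target buckets
    aws s3 ls s3://<source-bucket> --recursive --endpoint-url https://<source-s3-endpoint> | wc -l
    aws s3 ls s3://<target-bucket> --recursive --endpoint-url https://<target-s3-endpoint> | wc -l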

In actual migrations, incremental data synchronization is also involved (the migration tool supports the transfer.coverFile parameter, which controls whether files are overwritten and can also achieve incremental copying). The migration process and technical solution should therefore be chosen based on the project's actual data volume, business characteristics, business downtime window, and other information.

Case 4

A business migration from another vendor's cloud to JD Cloud involved object storage with roughly 10 million source files. Before the migration, a file list was generated for the source object storage, and the migration tool's list data was checked against the actual object-storage data to confirm that they matched; the migration was then performed with the migration tool. Because the number of files was large and the business uploaded new data every day, all files had to be synchronized correctly. The historical and incremental synchronization scheme was therefore to perform a full migration one week in advance and then synchronize the newly added files through mirror back-to-source.
Full migration one week before the cutover:
1. One week before the cutover, use JD Cloud's osstransfer migration tool to perform a full migration.
2. After the migration, a .list.txt file named after the source OSS bucket is generated; it contains a list of all files in the source bucket.
3. Migration logs are generated in the log directory of the migration toolkit (the log files are very important; make an off-site backup after the migration is completed). All migrated files are recorded in audit-0.log, and successfully migrated files are recorded in audit.success; the command grep "1$" audit-0.log can be used to find the files that failed to migrate. Compare the number of files in the generated source OSS manifest (.list.txt) with the number of entries in audit.success: if the numbers are equal, the migration was entirely successful.

File list retrieval configuration example (screenshot omitted).


Incremental migration on the day of the cutover, after the full migration:
1. Use the osstransfer tool to generate the source OSS manifest file list again.
2. From the file lists, find the files added between the full migration and the incremental migration.
3. Enable OSS mirror back-to-source.
4. Use curl to access the newly added files (against the target OSS) so that they are pulled from the source through mirror back-to-source (see the sketch after this list).
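Step 4 can be scripted roughly as below; the bucket domain and the delta-list file name are placeholders, and a 200 response means the object has been fetched and persisted on the target side through back-to-source.

    # Request each newly added object from the target bucket to trigger mirror back-to-source
    while read -r object_key; do
      curl -s -o /dev/null -w "%{http_code} ${object_key}\n" "https://<target-bucket-domain>/${object_key}"
    done < new_files.list
    # Re-check any non-200 responses afterwards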

The following problem was encountered during the actual migration:
In the POC stage, when migrating data in the test environment, this solution was verified and everything went smoothly. During the production cutover, however, a problem occurred: a loop error appeared in the file list needed to determine the incremental files, so the list task kept running and the list contained a large amount of duplicate content.

The version of the migration software was the same as the one used in the test environment, and in the production environment the full synchronization a week earlier had also been normal, with the data correctly synchronized to JD Cloud's object storage. What the cutover required was to obtain the files added within that week through the back-to-source method; with an incorrect list file, the incremental data could not be synchronized.

Problem handling <br>The time for the business cutover was limited, so the problem was escalated quickly and fed back to the R&D colleagues responsible for the tool, who immediately started investigating (it was already early morning; credit to the professionalism of the JD R&D team). The investigation found that on the source cloud, the object storage used by the test environment and that used by the production environment were located in different zones, and the OSS interface of the zone hosting the production object storage had been adjusted that week, which caused the original tool's list operation to produce errors.
The R&D colleagues urgently provided an updated toolkit to the colleagues at the migration site, which fixed the loop error in the object-storage file list. The file-list check was completed successfully, the list of newly added files was obtained by comparing the lists generated before and after, mirror back-to-source was configured, and the week's newly added files were synchronized through a script. After verification, the business application was configured to use the object storage, and business startup and verification continued normally.

4. Redis migration <br>There are two scenarios for using Redis in a business. In one, Redis is used only as a cache and no data is persisted; in this scenario, after the business is deployed in the new environment, it is simply pointed at the new Redis instance. In the other, data is persisted; when such a business is migrated to the cloud, Redis data migration must be performed according to the business requirements.

Redis has two persistence schemes: RDB (point-in-time snapshots) and AOF. RDB is a binary format, similar to a snapshot, and restoring it directly overwrites existing data; AOF saves the commands in text format, similar to an append-only log. If the existing data on the target Redis needs to be kept, the AOF method can be used, but it requires stopping writes on the source Redis. Redis restores data from RDB faster than from AOF, but note that older Redis versions are not compatible with RDB files produced by newer versions, so RDB is not suitable for migrating or restoring to a lower version.
During business migration, choose the appropriate Redis migration tool according to the Redis usage scenario, the source and target version requirements, the amount of stored data, and the network conditions.

Migration with RDB and AOF is described in detail in the official documentation (bgrewriteaof/bgsave/redis-dump, etc.) and is relatively simple to use, so it is not covered in this article.

JD Cloud has developed a Redis migration tool, redissyncer (currently supporting the migration of self-built Redis services), which implements Redis migration and synchronization by simulating the Redis replication protocol.
redissyncer is deployed via docker:
git clone https://github.com/TraceNature/redissyncer.git
Enter the directory and run docker-compose up -d
Download the client software:
wget https://github.com/TraceNature/redissyncercli/releases/download/v0.1.0/redissyncer-cli-0.1.0-linux-amd64.tar.gz
Adjust the configuration file .config.yaml:
syncserver: http://xxxx:8080 (the docker service address)
token: xxx
Note that the token value needs to be confirmed inside the container.
After editing the configuration file, start the service, and configure the task to run by writing the task JSON to be executed.
{
  "sourcePassword": "xxxx",
  "sourceRedisAddress": "10.0.1.101:6379",
  "targetRedisAddress": "10.0.1.102:6379",
  "targetPassword": "xxxx",
  "taskName": "testtask",
  "targetRedisVersion": 4.0,
  "autostart": true,
  "afresh": true,
  "batchSize": 100
}
redissyncer-cli -i
redissyncer-cli > task create source ./task.json
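After the synchronization task has run, a quick sanity check is to compare key counts on both instances with redis-cli; the addresses match the task JSON above, the passwords are placeholders, and per-key spot checks are still recommended.

    # Compare the number of keys on the source and target Redis instances
    redis-cli -h 10.0.1.101 -p 6379 -a '<source-password>' dbsize
    redis-cli -h 10.0.1.102 -p 6379 -a '<target-password>' dbsize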

5. The importance of data backup <br>There are many lessons on this point, yet incidents in which neglected data backups cause problems during migration are still common.

The problem may come from the customer, from our implementation team, or from the ISV or other teams and individuals who handle the data. Some problems are caused by insufficient communication between the parties responsible for the migration, inadequate notification, or weak technical skills, and some are caused by misoperation.
In actual scenarios, the pressure or resistance against performing data backups mainly comes from insufficient storage space and the possible performance impact of the backup process.

In addition to the storage space required for the backup data files themselves, verifying the availability and consistency of the database files also temporarily occupies a large amount of storage space. In an actual migration, the storage requirement may therefore be as much as twice the size of the existing database data (on both the source and the destination). For important business scenarios, the storage space needed for data backup must be evaluated before the migration and the cost of that backup space taken into account.

Case 5 <br>A government bureau's business was migrated from a VMware environment to the government cloud. Before the migration, the author made full-machine backups of all the hosts to be migrated (exported to external storage outside VMware) for the customer. Facts proved that these steps (and the extra cost of preparing the storage environment, communicating with the VMware operations team, and the time spent exporting the data) were worth it.
The migration itself went very smoothly. About one month after the business had been migrated to the government cloud and handed over, the author received a call from the customer, who hoped to restore data from the hosts backed up before the migration.

Cause of the problem <br>An ISV had assigned an inexperienced newcomer to a business upgrade in the new environment, and the newcomer reinstalled the database software and deleted the original data. After the customer was helped to restore the data from the backup image (the historical data; the new data was re-entered by the ISV), the customer purchased the disaster-recovery service provided by the government cloud and began to make regular full and incremental backups of important hosts and data, using the cloud disaster-recovery service to avoid or reduce losses caused by business errors or misoperation.

6. Summary of business data migration
1. Make backups in advance. With backup data available, the pressure during the migration is lower, and a relatively relaxed migration atmosphere is very beneficial to the implementation.
2. Choose migration technologies and tools carefully. There are more and more data migration tools and every major vendor has its own, but their limitations and compatibility differ, so they must be selected and verified according to the nature of the business.
3. Prepare a rollback plan and do POC verification. A POC can uncover problems early so that solutions can be prepared in advance.
4. Prepare the process manual, clarify who is responsible for each operation, and contact the relevant departments in advance to stand by during the migration transition phase, so that product and service problems can always be escalated to someone who can support.
5. Clarify the responsibility matrix and communicate comprehensively. Communication can uncover problems that are hard to find at the technical level. The sooner a migration organization is established and an effective communication mechanism is formed, the better for the smooth implementation of the migration.

