I have previously written about the implementation details of partitioning by time. Today I will continue with a real-world case of time-based partitioning: the table tuning process of an Internet company's database system.

Problem and background:

  1. The single table is too large: about 100,000 records are generated every day, roughly 36.5 million per year.
  2. 95% of queries on this table filter on a single day or a few days, and the widest filter range never exceeds one month — for example, between March 1, 2019 and April 20, 2019, or between May 1, 2019 and May 5, 2019. Cross-month and cross-year queries do occur, but very rarely.
  3. Records are kept for 10 years, which means about 360 million records in the single table. A table that large is inconvenient to manage, and if it becomes corrupted later it will be difficult to repair.
  4. Query performance on the single table is very poor, and deleting historical data is also very slow.

Based on the above analysis, the following conclusions are drawn:

  1. The query filter ranges are concentrated rather than scattered, and the performance of purging expired data must also be considered.
  2. Split the table into 10 new tables: one current table and 9 historical archive tables. The current table stores the most recent two years of data; at the end of each year, old data is migrated into a historical table for archiving, and expired historical data is cleaned up.
  3. Use a MySQL partitioned table for the current table — ideal for the 95% of queries that filter on date — and use partition exchange to move data into the historical tables.
  4. The partitioned table brings several advantages: first, query performance improves; second, it is easy to manage, and expired data can be purged quickly; third, it is transparent to the application, so the application code does not need to change for now.

Next, let's look at the optimization process for this table.

For privacy reasons, it is inconvenient to paste the original table structure, so a simplified example table is used here. The original table is pt_old; the number of fields is reduced to 3 and the record count is reduced tenfold to 36.5 million, or 3.65 million per year (the customer's original table has 30 fields and 360 million records). The records span 2011 through 2020 — exactly ten years of data.

(localhost:ytt)<mysql>show create table pt_old\G
*************************** 1. row ***************************
       Table: pt_old
Create Table: CREATE TABLE `pt_old` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `r1` int DEFAULT NULL,
  `log_date` date DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_log_date` (`log_date`)
) ENGINE=InnoDB AUTO_INCREMENT=64306811 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)

(localhost:ytt)<mysql>select min(log_date),max(log_date),count(*) from pt_old;
+---------------+---------------+----------+
| min(log_date) | max(log_date) | count(*) |
+---------------+---------------+----------+
| 2011-01-01    | 2020-12-31    | 36500000 |
+---------------+---------------+----------+
1 row in set (21.14 sec)

First export the original table's data (one file per year, 10 files in total), to be imported directly into the new tables later. Execute the following script:

root@ytt-unbuntu:/home/ytt/scripts# cat pt_export 
#!/bin/sh
for i in `seq 2011 2020` 
do 
    {
        mysql -D ytt -e "select * from pt_old where log_date between '$i-01-01' and '$i-12-31' into outfile '/var/lib/mysql-files/pt_$i.csv' fields terminated by ',' " 
    } &
done
wait
root@ytt-unbuntu:/home/ytt/scripts# ./pt_export
root@ytt-unbuntu:/var/lib/mysql-files# ls -sihl
total 788M
5767677 79M -rw-r----- 1 mysql mysql 79M Feb  4 15:39 pt_2011.csv
5775332 79M -rw-r----- 1 mysql mysql 79M Feb  4 15:42 pt_2012.csv
5775334 79M -rw-r----- 1 mysql mysql 79M Feb  4 15:42 pt_2013.csv
5774596 79M -rw-r----- 1 mysql mysql 79M Feb  4 15:42 pt_2014.csv
5775335 79M -rw-r----- 1 mysql mysql 79M Feb  4 15:42 pt_2015.csv
5775333 79M -rw-r----- 1 mysql mysql 79M Feb  4 15:42 pt_2016.csv
5775329 79M -rw-r----- 1 mysql mysql 79M Feb  4 15:42 pt_2017.csv
5775330 79M -rw-r----- 1 mysql mysql 79M Feb  4 15:42 pt_2018.csv
5775336 79M -rw-r----- 1 mysql mysql 79M Feb  4 15:42 pt_2019.csv
5775331 79M -rw-r----- 1 mysql mysql 79M Feb  4 15:42 pt_2020.csv

Create 10 tables, one per year; of these, pt_2020 will be converted into the partitioned table:

root@ytt-unbuntu:/home/ytt/scripts# for i in `seq 2011 2020`;do mysql -e"use ytt;create table pt_$i like pt_old;";done;

MySQL partitioned tables have a hard rule: the partitioning key must be the primary key or part of it, so the time field is added to the primary key:

(localhost:ytt)<mysql>alter table pt_2020 drop primary key, add primary key (id,log_date);
Query OK, 0 rows affected (0.29 sec)
Records: 0  Duplicates: 0  Warnings: 0

Add partitions to the table pt_2020. It may need to store both the current year's and the previous year's data, so it is partitioned by day across two years; that way, old data can be migrated out directly when the new year arrives. The stored procedure from the earlier article is modified as follows:

DELIMITER $$

USE `ytt`$$

DROP PROCEDURE IF EXISTS `sp_add_partition_pt_current`$$

CREATE DEFINER=`root`@`%` PROCEDURE `sp_add_partition_pt_current`(
IN f_year_start YEAR,
IN f_year_end YEAR,
IN f_tbname VARCHAR(64)
)
BEGIN
DECLARE v_days INT UNSIGNED DEFAULT 365;
DECLARE v_year DATE DEFAULT '2011-01-01';
DECLARE v_partition_name VARCHAR(64) DEFAULT '';
DECLARE v_log_date DATE;
DECLARE i,j INT UNSIGNED DEFAULT 1;
  SET @stmt = '';
  SET @stmt_begin = CONCAT('ALTER TABLE ',f_tbname,' PARTITION BY RANGE COLUMNS(log_date)(');
  SET i = f_year_start;
  WHILE i <= f_year_end DO
    SET v_year = CONCAT(i,'-01-01');
    SET v_days = DATEDIFF(DATE_ADD(v_year,INTERVAL 1 YEAR),v_year);
    SET j = 1;
    WHILE j <= v_days DO
      SET v_log_date = DATE_ADD(v_year,INTERVAL j DAY);
      SET v_partition_name = CONCAT('p',i,'_',LPAD(j,3,'0'));
      SET @stmt = CONCAT(@stmt,'PARTITION ',v_partition_name,' VALUES LESS THAN(''',v_log_date,'''),');
      SET j = j + 1;
    END WHILE;
    SET i = i + 1;
  END WHILE;
  SET @stmt_end = 'PARTITION p_max VALUES LESS THAN (maxvalue))';
  SET @stmt = CONCAT(@stmt_begin,@stmt,@stmt_end);
  PREPARE s1 FROM @stmt;
  EXECUTE s1;
  DROP PREPARE s1;
  SELECT NULL,NULL,NULL INTO @stmt,@stmt_begin,@stmt_end;
END$$

DELIMITER ;


(localhost:ytt)<mysql>call sp_add_partition_pt_current(2020,2021,'pt_2020');
Query OK, 1 row affected (42.11 sec)

Import the exported data: the 2020 data goes into the table pt_2020, and the rest into the historical tables pt_2011 through pt_2019.

root@ytt-unbuntu:/home/ytt/scripts# cat pt_import 
#!/bin/sh
for i in `seq 2011 2020` 
do 
    {
        mysql -D ytt -e "load data infile '/var/lib/mysql-files/pt_$i.csv' into table pt_$i fields terminated by ',' " 
    } &
done
wait
root@ytt-unbuntu:/home/ytt/scripts# ./pt_import 

Rename the table pt_2020 to become the current table:

(localhost:ytt)<mysql>alter table pt_2020 rename to pt_current;
Query OK, 0 rows affected (0.12 sec)

Next, we need to verify that the transformed table performs as expected:

First, the partitioned table should have an advantage in query performance.

Second, management and maintenance efficiency should also improve accordingly.

If both points meet the requirements, the partitioned table can simply be renamed to the original table name and the original table dropped.
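That final cutover can be done atomically with RENAME TABLE. A minimal sketch — the backup table name pt_old_bak is an assumption for illustration, not part of the original tuning:

```sql
-- Hypothetical cutover sketch: swap the partitioned table in under the
-- original name in one atomic statement, keeping the old table as a backup.
RENAME TABLE pt_old     TO pt_old_bak,
             pt_current TO pt_old;

-- After a verification period, the backup could be dropped:
-- DROP TABLE pt_old_bak;
```

Because both renames happen in a single statement, queries never see a moment where the table name is missing.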

Let's first verify whether query performance has improved.

Query 1: retrieve the records for the single day '2020-03-01'.

Each query is executed twice, to compare cold (uncached) and warm runs. On the original table pt_old, the first run takes 1 minute 1.7 seconds and the second 0.03 seconds; on the partitioned table pt_current, the first run takes 0.02 seconds and the second 0.01 seconds. Comparing cold runs, the partitioned table is dramatically faster; the warm runs are similar, but the partitioned table still leads.

(localhost:ytt)<mysql>select * from pt_old where log_date = '2020-03-01';
...
9593 rows in set (1 min 1.70 sec)
-- second run
9593 rows in set (0.03 sec)


(localhost:ytt)<mysql>select * from pt_current where log_date = '2020-03-01';
...
9593 rows in set (0.02 sec)
-- second run
9593 rows in set (0.01 sec)

Query 2: retrieve the records of the last 5 days of 2020.

Again, each query is run twice. On pt_old, the first run takes 2 minutes 42.21 seconds and the second 0.13 seconds; on pt_current, the first run takes 0.07 seconds and the second 0.01 seconds. Both results show a significant improvement for the partitioned table.

(localhost:ytt)<mysql>select * from pt_old where log_date in ('2020-12-27','2020-12-28','2020-12-29','2020-12-30','2020-12-31');
...
30097 rows in set (2 min 42.21 sec)
...
-- second run
30097 rows in set (0.13 sec)


(localhost:ytt)<mysql>select * from pt_current where log_date in ('2020-12-27','2020-12-28','2020-12-29','2020-12-30','2020-12-31');
...
30097 rows in set (0.07 sec)
...
-- second run
30097 rows in set (0.01 sec)

Now let's see whether management and maintenance operations have also improved.

Using a partitioned table raises a tricky question: how do we extend the partition definitions at the end of each year to accommodate newly added records? MySQL has no automatic mechanism for this, but the catch-all partition p_max can be reorganized manually to extend capacity.
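The expansion relies on MySQL's REORGANIZE PARTITION, which splits p_max in place without touching the other partitions. In miniature (the partition name here is illustrative):

```sql
-- Illustrative only: carve one new daily partition out of the
-- catch-all p_max, leaving a fresh p_max behind.
ALTER TABLE pt_current REORGANIZE PARTITION p_max INTO (
    PARTITION p2022_001 VALUES LESS THAN ('2022-01-02'),
    PARTITION p_max     VALUES LESS THAN (MAXVALUE)
);
```

The stored procedure below does exactly this, but generates one new partition per day for an entire year.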

Take a look at the per-year row counts of pt_current's partitions:

(localhost:ytt)<mysql>select left(partition_name,5) p,sum(table_rows) cnt from information_schema.partitions where table_name = 'pt_current' group by left(partition_name,5);
+-------+---------+
| p     | cnt     |
+-------+---------+
| p2020 | 3641722 |
| p2021 |       0 |
| p_max |       0 |
+-------+---------+
3 rows in set (0.02 sec)

At present only 2020 has data, and 2021 is empty. Once 2021 ends, new records would start landing in the catch-all partition p_max. Therefore, before the early morning of January 1, 2022, the full year of 2020 data should be moved out to pt_2020, and partition definitions for 2022 should be added.

Based on this analysis, here is a stored procedure that automatically extends the partitions; it can be scheduled with an OS cron job or a MySQL event:

DELIMITER $$

USE `ytt`$$

DROP PROCEDURE IF EXISTS `sp_autoextend_partition_pt_current`$$

CREATE DEFINER=`root`@`%` PROCEDURE `sp_autoextend_partition_pt_current`(
IN f_year YEAR
)
BEGIN
    DECLARE v_days INT UNSIGNED DEFAULT 365;
    DECLARE v_days_interval DATE DEFAULT '2018-12-31';
    DECLARE i INT UNSIGNED DEFAULT 1;
    SET @stmt = '';
    SET v_days =  DATEDIFF(CONCAT(f_year+1,'-01-01'),CONCAT(f_year,'-01-01'));
     
    SET @stmt_begin = 'ALTER TABLE pt_current REORGANIZE PARTITION p_max into(';
        WHILE i <= v_days DO
           SET v_days_interval = DATE_ADD(CONCAT(f_year,'-01-01'),INTERVAL i DAY);
           SET @stmt = CONCAT(@stmt,'PARTITION p',f_year,'_',LPAD(i,3,"0"),' VALUES LESS THAN (''',v_days_interval,'''),');     
           SET i = i + 1;        
        END WHILE;    
    SET @stmt_end = 'PARTITION p_max VALUES LESS THAN (maxvalue))';
        SET @stmt = CONCAT(@stmt_begin,@stmt,@stmt_end);
        PREPARE s1 FROM @stmt;
        EXECUTE s1;
        DROP PREPARE s1;
        SET @stmt = NULL;
        SET @stmt_begin = NULL;
        SET @stmt_end = NULL;    
    END$$

DELIMITER ;

Now extend the partitions for 2022:

(localhost:ytt)<mysql>call sp_autoextend_partition_pt_current(2022);
Query OK, 0 rows affected (14.55 sec)

From now on, the year-end routine is: move out the expired year's data, drop its old partition definitions, and add partition definitions for the coming year.

Suppose it is now 2022. First insert the 2021 data into pt_current (in a real environment this data would already exist):

(localhost:ytt)<mysql>insert into pt_current (r1,log_date) select r1,date_add(log_date,interval 1 year) from pt_current;
Query OK, 3641722 rows affected (2 min 28.75 sec)
Records: 3641722  Duplicates: 0  Warnings: 0

(localhost:ytt)<mysql>select left(partition_name,5) p,sum(table_rows) cnt from information_schema.partitions where table_name = 'pt_current' group by left(partition_name,5);
+-------+---------+
| p     | cnt     |
+-------+---------+
| p2020 | 3641722 |
| p2021 | 3641726 |
| p2022 |       0 |
| p_max |       0 |
+-------+---------+
4 rows in set (0.02 sec)

Then move the 2020 data to the history table. (Each year contains many partitions, so for ease of writing, the partition-exchange feature is not used here.)

(localhost:ytt)<mysql>create table pt_2020 like pt_old;
Query OK, 0 rows affected (0.05 sec)

(localhost:ytt)<mysql>insert into pt_2020 select * from pt_current where log_date between '2020-01-01' and '2020-12-31';
Query OK, 3641722 rows affected (1 min 12.54 sec)
Records: 3641722  Duplicates: 0  Warnings: 0
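For reference, the partition-exchange approach that was skipped here would look roughly like this for a single day's partition. This is a sketch, not part of the original tuning; the staging table name pt_stage is an assumption:

```sql
-- Sketch of partition exchange for one daily partition.
-- The staging table must match pt_current's structure exactly
-- (including the (id, log_date) primary key) and must not be partitioned.
CREATE TABLE pt_stage LIKE pt_current;
ALTER TABLE pt_stage REMOVE PARTITIONING;

-- Swap the contents of partition p2020_001 with the (empty) staging table;
-- this is a metadata operation, so no rows are physically copied.
ALTER TABLE pt_current EXCHANGE PARTITION p2020_001 WITH TABLE pt_stage;
```

Each EXCHANGE PARTITION statement handles exactly one partition, so migrating a whole year of daily partitions would need roughly 365 such statements — which is why the plain INSERT ... SELECT above is simpler to write.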

Delete the expired data: generate DROP PARTITION statements for all 2020 partitions, then execute the generated script:

(localhost:ytt)<mysql>SELECT CONCAT('alter table ytt.pt_current drop partition ',partition_name,';') FROM information_schema.`PARTITIONS`  WHERE table_schema = 'ytt' AND table_name = 'pt_current'  AND partition_name like 'p2020%' into outfile '/var/lib/mysql-files/drop_expire_partition_2020.sql';
Query OK, 366 rows affected (0.00 sec)

mysql> \. /var/lib/mysql-files/drop_expire_partition_2020.sql
Query OK, 0 rows affected (0.83 sec)
Records: 0  Duplicates: 0  Warnings: 0

...

Query OK, 0 rows affected (0.82 sec)
Records: 0  Duplicates: 0  Warnings: 0

...

One thing to note: partition names must follow a regular pattern. This is what makes later cleanup of expired data straightforward — the LIKE 'p2020%' filter above relies on it.


Is there anything else about MySQL you would like to know? Leave a message and let me know!


ActionSky Open Source Community (爱可生开源社区)

Founded in 2017, the community is committed to open-sourcing high-quality operations tools, regularly sharing practical technical content, and running nationwide community activities. Its current open-source products include the SQL audit tool SQLE, the distributed middleware DBLE, and the data transfer component DTLE.