1

The scene where partitions or tables are most applicable is still splitting based on the time field. In this section, we will talk about how to better split based on the time field. Respectively according to the realization method of several dimensions of year, month, day and some details of attention.

first, split by year

The choice of the granularity of date field splitting is closely related to the business search request. For example, if you keep 10 years of data, and each query is based on a specific year as a filter condition, it is definitely best to split it by year. For example, the following SQL:

select * from ytt_pt1 where log_date >='2018-01-01' and log_date < '2019-01-01';

Let's take a look at a practical example of splitting separately by year: the table ytt_pt1 contains 1000W records, and the partition table is created with the granularity of the year.

mysql> create table ytt_pt1(id bigint, log_date date);
Query OK, 0 rows affected (0.18 sec)

mysql> insert into ytt_pt1 select id,log_date from ytt_p1 limit 10000000;
Query OK, 10000000 rows affected (3 min 49.53 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE ytt_pt1 PARTITION BY RANGE (year(log_date))
    -> (
    -> PARTITION p0001 VALUES LESS THAN (2012),
    -> PARTITION p0002 VALUES LESS THAN (2013),
    -> PARTITION p0003 VALUES LESS THAN (2014),
    -> PARTITION p0004 VALUES LESS THAN (2015),
    -> PARTITION p0005 VALUES LESS THAN (2016),
    -> PARTITION p0006 VALUES LESS THAN (2017),
    -> PARTITION p0007 VALUES LESS THAN (2018),
    -> PARTITION p0008 VALUES LESS THAN (2019),
    -> PARTITION p0009 VALUES LESS THAN (2020),
    -> PARTITION p0010 VALUES LESS THAN (2021),
    -> PARTITION p_max VALUES LESS THAN (maxvalue)
    -> );
Query OK, 10000000 rows affected (2 min 33.31 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

Take a look at the query effect on an annual basis: the following SQL directly uses partition p0008, and the query time is 0.91 seconds. This time is not short, and filter conditions can be added later to reduce the query time.

mysql> select count(*) from ytt_pt1 where log_date >='2018-01-01' and log_date < '2019-01-01';
+----------+
| count(*) |
+----------+
|  1000204 |
+----------+
1 row in set (0.91 sec)

mysql> explain  select count(*) from ytt_pt1 where log_date >='2018-01-01' and log_date < '2019-01-01'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ytt_pt1
   partitions: p0008
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 998002
     filtered: 11.11
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

Note: The query can only be filtered directly based on the field. If the filter is based on the field expression, MySQL will not be sure which partition to go to, and will scan all partitions. The processing method is the same as that of a single-table query. For example, the statement:

select count(*) from ytt_pt1 where year(log_date) = '2018' ;

Look at the execution: MySQL scans all partitions, and the query execution time is more than 9 seconds.

mysql> select count(*) from ytt_pt1 where year(log_date) = '2018' ;
+----------+
| count(*) |
+----------+
|  1000204 |
+----------+
1 row in set (9.19 sec)

mysql> explain select count(*) from ytt_pt1 where year(log_date) = '2018' \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ytt_pt1
   partitions: p0001,p0002,p0003,p0004,p0005,p0006,p0007,p0008,p0009,p0010,p_max
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 9982648
     filtered: 100.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

If you do not insist on this way of writing, you can give the optimizer a hint, specifically to specify the partition to retrieve data, or create a virtual column based on the field expression:

mysql> select count(*) from ytt_pt1 partition(p0008) where year(log_date) = '2018' ;
+----------+
| count(*) |
+----------+
|  1000204 |
+----------+
1 row in set (0.84 sec)

If the query is filtered frequently by month as a dimension, it is definitely best to split by month; for example, it is necessary to retrieve some records of the current month of 2020 for subsequent data processing. The approximate SQL is as follows:

select * from ytt_pt1_按月拆分表 where log_date in ('2020-01-01','2020-01-02',...)
second, split
Split according to the month, there are two ways of writing:

The first type: Disassemble 12 partitions directly according to the month: The partition type of the following table ytt_pt1_month1 is LIST, which is directly calculated based on the function month.

mysql> show create table ytt_pt1_month1\G
*************************** 1. row ***************************
       Table: ytt_pt1_month1
Create Table: CREATE TABLE `ytt_pt1_month1` (
  `id` bigint DEFAULT NULL,
  `log_date` date DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
/*!50100 PARTITION BY LIST (month(`log_date`))
(PARTITION p0001 VALUES IN (1) ENGINE = InnoDB,
 PARTITION p0002 VALUES IN (2) ENGINE = InnoDB,
 PARTITION p0003 VALUES IN (3) ENGINE = InnoDB,
 PARTITION p0004 VALUES IN (4) ENGINE = InnoDB,
 PARTITION p0005 VALUES IN (5) ENGINE = InnoDB,
 PARTITION p0006 VALUES IN (6) ENGINE = InnoDB,
 PARTITION p0007 VALUES IN (7) ENGINE = InnoDB,
 PARTITION p0008 VALUES IN (8) ENGINE = InnoDB,
 PARTITION p0009 VALUES IN (9) ENGINE = InnoDB,
 PARTITION p0010 VALUES IN (10) ENGINE = InnoDB,
 PARTITION p0011 VALUES IN (11) ENGINE = InnoDB,
 PARTITION p0012 VALUES IN (12) ENGINE = InnoDB) */
1 row in set (0.00 sec)

For example, to query the records in the first half of 2020: the query is limited to the partition p0001, but the time is not ideal, it takes 0.66 seconds.

mysql> select count(*) from ytt_pt1_month1 where log_date in ('2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-05','2020-01-06','2020-01-07','2020-01-08','2020-01-09','2020-01-10','2020-01-11','2020-01-12','2020-01-13','2020-01-14','2020-01-15');
+----------+
| count(*) |
+----------+
|    41540 |
+----------+
1 row in set (0.66 sec)

mysql> explain select count(*) from ytt_pt1_month1 where log_date in ('2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-05','2020-01-06','2020-01-07','2020-01-08','2020-01-09','2020-01-10','2020-01-11','2020-01-12','2020-01-13','2020-01-14','2020-01-15')\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ytt_pt1_month1
   partitions: p0001
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 848224
     filtered: 50.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

The second type: For each year's data, 12 partitions are separately divided, that is, according to the joint dimension of year and month, there are 144 partitions, and each partition corresponds to the data of a specific year and month.

Adding this partition is a little more troublesome. Clone the table ytt_pt1_month1 to ytt_pt1_month2. Here is a stored procedure to add partition information:

DELIMITER $$

USE `ytt`$$

DROP PROCEDURE IF EXISTS `sp_add_partition_ytt_pt1_month2`$$

CREATE DEFINER=`root`@`%` PROCEDURE `sp_add_partition_ytt_pt1_month2`()
BEGIN
    DECLARE i,j INT UNSIGNED DEFAULT 1;
    DECLARE v_tmp_date DATE;
    SET @stmt = '';
    SET @stmt_begin = 'ALTER TABLE ytt_pt1_month2 PARTITION BY RANGE COLUMNS (log_date)(';
        SET i = 2010;        
        WHILE i <= 2020 DO
          SET j = 1;
          WHILE j <= 12 DO
            SET v_tmp_date = CONCAT(i,'-01-01');
            SET @stmt = CONCAT(@stmt,'PARTITION p',i,'_',LPAD(j,2,"0"),' VALUES LESS THAN (''',DATE_ADD(v_tmp_date,INTERVAL j MONTH),'''),');
            SET j = j + 1;
          END WHILE;
          SET i = i + 1;
        END WHILE;    
    SET @stmt_end = 'PARTITION p_max VALUES LESS THAN (maxvalue))';
        SET @stmt = CONCAT(@stmt_begin,@stmt,@stmt_end);
        PREPARE s1 FROM @stmt;
        EXECUTE s1;
        DROP PREPARE s1;

        SET @stmt = NULL;
        SET @stmt_begin = NULL;
        SET @stmt_end = NULL;    
    END$$

DELIMITER ;

mysql> call sp_add_partition_ytt_pt1_month2;
Query OK, 0 rows affected (2 min 20.48 sec)

The result is similar to this:

PARTITION p2010_01 VALUES LESS THAN ('2010-02-01') ENGINE = InnoDB,
 ...
 PARTITION p2010_12 VALUES LESS THAN ('2011-01-01') ENGINE = InnoDB,
 PARTITION p2011_01 VALUES LESS THAN ('2011-02-01') ENGINE = InnoDB,
...
 PARTITION p2011_12 VALUES LESS THAN ('2012-01-01') ENGINE = InnoDB,
 ...
 PARTITION p2020_12 VALUES LESS THAN ('2021-01-01') ENGINE = InnoDB,
 PARTITION p_max VALUES LESS THAN (MAXVALUE) ENGINE = InnoDB)

After adding the partitions, let's observe the query that just obtained the records of the first half of 2020:

mysql> select count(*) from ytt_pt1_month2 where log_date in ('2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-05','2020-01-06','2020-01-07','2020-01-08','2020-01-09','2020-01-10','2020-01-11','2020-01-12','2020-01-13','2020-01-14','2020-01-15');
+----------+
| count(*) |
+----------+
|    41540 |
+----------+
1 row in set (0.06 sec)

mysql> explain   select count(*) from ytt_pt1_month2 where log_date in ('2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-05','2020-01-06','2020-01-07','2020-01-08','2020-01-09','2020-01-10','2020-01-11','2020-01-12','2020-01-13','2020-01-14','2020-01-15')\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ytt_pt1_month2
   partitions: p2020_01
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 85498
     filtered: 50.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

The query time is 0.06 seconds, which is 10 times faster than the first split method.

Does that mean that partitioning based on the union of the year and month must be more optimized than partitioning based on the single month? Not necessarily. If the query sentence filter condition contains a fixed month of the partition, such as every December, it is definitely better to partition by month than by year and month.

third, split

Filter frequently-frequent queries based on the day dimension, and split it by day is best. For example, if you only want to query the data on January 1, 2020, the approximate SQL is as follows:

select * from ytt_pt1 where log_date = '2020-01-01'

Similar to the joint dimension of year and month, write a script or stored procedure to add partitions. The only thing to note here is that the number of MySQL partition tables is limited, the maximum is 8192, so if you partition by day, store 10 years of data, the number of partitions The number is 3650, which is also within the limit.

Modify the previous stored procedure, the code is as follows:

DELIMITER $$

USE `ytt`$$

DROP PROCEDURE IF EXISTS `sp_add_partition_ytt_pt1_day`$$

CREATE DEFINER=`root`@`%` PROCEDURE `sp_add_partition_ytt_pt1_day`(
IN f_year_start YEAR,
IN f_year_end YEAR
)
BEGIN
    DECLARE v_days INT UNSIGNED DEFAULT 365;
    DECLARE v_year DATE DEFAULT '2010-01-01';
    DECLARE v_partition_name VARCHAR(64) DEFAULT '';
    DECLARE v_log_date DATE;
    DECLARE i,j INT UNSIGNED DEFAULT 1;
    SET @stmt = '';
    SET @stmt_begin = 'ALTER TABLE ytt_pt1_day PARTITION BY RANGE COLUMNS (log_date)(';
    SET i = f_year_start;
    WHILE i <= f_year_end DO 
      SET v_year = CONCAT(i,'-01-01');
      SET v_days = DATEDIFF(DATE_ADD(v_year,INTERVAL 1 YEAR),v_year);            
      SET j = 1;
      WHILE j <= v_days DO
        SET v_log_date = DATE_ADD(v_year,INTERVAL j DAY);
       SET v_partition_name = CONCAT('p',i,'_',LPAD(j,3,'0'));
       SET @stmt = CONCAT(@stmt,'PARTITION ',v_partition_name,' VALUES LESS THAN (''',v_log_date,'''),');
       SET j = j + 1;        
      END WHILE;
      SET i = i + 1;    
    END WHILE;
    SET @stmt_end = 'PARTITION p_max VALUES LESS THAN (maxvalue))';
    SET @stmt = CONCAT(@stmt_begin,@stmt,@stmt_end);
    PREPARE s1 FROM @stmt;
    EXECUTE s1;
    DROP PREPARE s1;
    SELECT NULL,NULL,NULL INTO @stmt,@stmt_begin,@stmt_end;
END$$

DELIMITER ;

mysql> CALL sp_add_partition_ytt_pt1_day('2010','2020');
Query OK, 1 row affected (14 min 13.69 sec)

Next, the query time in days must be the shortest, only 0.01 seconds.

mysql> select count(*) from ytt_pt1_day where log_date = '2020-01-01';
+----------+
| count(*) |
+----------+
|     2675 |
+----------+
1 row in set (0.01 sec)

At this time, if such a query is based on year or month, the performance is definitely not optimal.

mysql> select count(*) from ytt_pt1 where log_date = '2020-01-01';
+----------+
| count(*) |
+----------+
|     2675 |
+----------+
1 row in set (0.68 sec)

mysql> select count(*) from ytt_pt1_month1 where log_date = '2020-01-01';
+----------+
| count(*) |
+----------+
|     2675 |
+----------+
1 row in set (0.87 sec)

mysql> select count(*) from ytt_pt1_month2 where log_date = '2020-01-01';
+----------+
| count(*) |
+----------+
|     2675 |
+----------+
1 row in set (0.09 sec)

It can be seen that this type of query based on other methods takes significantly longer time to partition than by day.

To summarize:

This article mainly describes the most commonly used data splitting method in daily work: splitting by time. The time field is often implemented in three different dimensions of year, month, and day, and the specific implementation method and applicable scenarios are described in detail.


What else do you want to know about the technical content of MySQL? Hurry up and leave a message to tell the editor!


爱可生开源社区
426 声望209 粉丝

成立于 2017 年,以开源高质量的运维工具、日常分享技术干货内容、持续的全国性的社区活动为社区己任;目前开源的产品有:SQL审核工具 SQLE,分布式中间件 DBLE、数据传输组件DTLE。