2

Mysql百万数据查询之时间查询

在校期间,一直没有接触过百万级的数据,一直没有百万数据下查询的感觉。
近期做毕业设计,需要用到较大的数据来进行图表分析和订单推荐。
一直疑惑如何设计表才能让普通的sql更高效,因此便以下尝试并记录于此。
本篇文章着重于表的设计对查询sql的影响
小弟不才,文章仅做记录,有更好的想法或有误请留言指出,必当积极回应。

优化设想

  1. 时间以int时间戳的形式保存,并建立相关索引
  2. 时间拆分,以year、month、day进行保存,并以year+month和year+month+day做索引(空间换时间)
  3. 以时间戳保存,并按月进行分区

测试环境

- 表引擎:innoDB
- 字符集:utf8mb4
- 数据库版本:5.7.18
数据库引擎为默认的innoDB、因为实际业务考虑事务,因此就默认innoDB
为了不受其他环节影响,用的是`腾讯云mysql基础版1核1000MB/50GB`

建表

创建相关Order表

-- 创建order表
CREATE TABLE tb_order (
    `id` BIGINT ( 20 ) NOT NULL AUTO_INCREMENT,
    `item_name` VARCHAR ( 255 ) NOT NULL,
    `item_price` INT ( 11 ) UNSIGNED NOT NULL,
    `item_state` TINYINT ( 1 ) NOT NULL,
    `create_time` INT ( 11 ) UNSIGNED NOT NULL,
    `time_year` CHAR ( 4 ) NOT NULL,
    `time_month` CHAR ( 2 ) NOT NULL,
    `time_day` CHAR ( 2 ) NOT NULL,
    PRIMARY KEY ( id ) 
) ENGINE = INNODB DEFAULT CHARSET = utf8mb4;
CREATE INDEX idx_order_ctime ON tb_order ( create_time );
CREATE INDEX idx_order_state ON tb_order ( `item_state` );
CREATE INDEX idx_order_day ON tb_order ( `time_year`, `time_month` );
CREATE INDEX idx_order_dmonth ON tb_order ( `time_year`, `time_month`, `time_day`);

创建Mock存储过程

-- 随机字符串函数
CREATE DEFINER=`root`@`%` FUNCTION `rand_string`(n INT) RETURNS varchar(255) CHARSET utf8
BEGIN 
    DECLARE chars_str varchar(100) DEFAULT 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'; 
    DECLARE return_str varchar(255) DEFAULT '' ;
    DECLARE i INT DEFAULT 0; 
WHILE i < n DO 
    SET return_str = concat(return_str,substring(chars_str , FLOOR(1 + RAND()*62 ),1)); 
    SET i = i +1; 
END WHILE; 
    RETURN return_str; 
END

-- 随机范围时间函数
CREATE DEFINER=`root`@`%` FUNCTION `rand_date`(`startDate` date,`endDate` date) RETURNS datetime
BEGIN
    #Routine body goes here...
    DECLARE sec INT DEFAULT 0;
    DECLARE ret DATETIME;
    SET sec = ABS(UNIX_TIMESTAMP(endDate) - UNIX_TIMESTAMP(startDate));
    SET ret = DATE_ADD(startDate, INTERVAL FLOOR( 1+RAND ( ) * ( sec-1))SECOND);
    RETURN ret;
END

-- 模拟订单存储过程
CREATE DEFINER=`root`@`%` PROCEDURE `mock_order`(IN `size` int UNSIGNED,IN `sd` date,IN `ed` date)
BEGIN
    #Routine body goes here...
    DECLARE i int UNSIGNED DEFAULT 1;
    DECLARE randOrderName VARCHAR(10);
    DECLARE randOrderTime DATETIME;
    WHILE i<= size DO
        SELECT rand_string(10) INTO randOrderName;
        SELECT rand_date(sd,ed) INTO randOrderTime;
        INSERT INTO tb_order(`item_name`,`item_price`,`item_state`,`create_time`,`time_year`,`time_month`,`time_day`)
        VALUES(randOrderName,RAND()*100,ROUND(RAND()),UNIX_TIMESTAMP(randOrderTime),DATE_FORMAT(randOrderTime,'%Y'),DATE_FORMAT(randOrderTime,'%m'),DATE_FORMAT(randOrderTime,'%d'));
        SET i = i+1;

    END WHILE;
END

-- 执行存储过程
CALL mock_order(1200000,'2020-03-01','2020-12-31')

-- CALL mock_order(FLOOR(RAND() * 9999999),'2020-03-01','2020-12-31')

执行结果

数据总量:

SELECT count(*) FROM tb_order
> OK
> 时间: 0.607s

执行结果.jpg)

数据分布:

SELECT 
    count(*) as count,
    DATE_FORMAT(FROM_UNIXTIME(create_time),'%Y-%m') as date
FROM
    tb_order
GROUP BY 
    date desc
> OK
> 时间: 5.007s

数据分布:

结果分布

测试过程

  1. 统计月订单数量
  2. 统计月订单总额
  3. 根据月份统计订单状态
  4. 查询日订单
  5. 统计日订单总额
  6. 分页查询月订单

时间戳

1. 统计月订单数量
SELECT
    count( 1 ) 
FROM
    tb_order 
WHERE
    item_state = 1 
    AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
> OK
> 时间: 6.348s

20200308204156.png

插曲:因为不确定 使用 UNIX_TIMESTAMP() 函数对原本的查询结果是否有影响,决定多测测量使用UNIX_TIMESTAMP() 与直接使用时间戳的时间对比
次数 使用转换函数 直接使用时间戳
1 6.105s 5.768s
2 6.239s 6.199s
3 5.681s 6.161s
4 5.687s 5.605s
5 6.118s 5.621s
平均值 5.966s 5.870s
结果显示 相差并不是很多 0.09s 而且单纯的执行 select UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 需要0.045s 所以下列均使用UNIX_TIMESTAMP() 进行测试
2. 统计月订单总额
SELECT
    sum(item_price) as total
FROM
    tb_order 
WHERE
    item_state = 1 
    AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
> OK
> 时间: 5.901s

20200308205618.png

3. 根据月份统计订单状态
SELECT
    count( * ) AS count,
    item_state AS state 
FROM
    tb_order 
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' ) 
GROUP BY
    item_state
> OK
> 时间: 12.126s

20200308205805.png

时间慢的有点不可接受
4. 查询日订单
SELECT
    * 
FROM
    tb_order 
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-1 23:59:59' )
> OK
> 时间: 0.54s

20200308212150.png

5. 统计日订单总额
SELECT
    SUM(item_price) as total
FROM
    tb_order 
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-1 23:59:59' )
> OK
> 时间: 0.154s

20200308212401.png

6. 分页查询月订单
SELECT
    * 
FROM
    tb_order 
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' ) 
    LIMIT 10000,20
> OK
> 时间: 0.107s

20200308211153.png

时间拆分的形式

1. 统计月订单数量
SELECT
    count(1) 
FROM
    tb_order 
WHERE
    item_state = 1 
    AND time_year = '2020' and time_month = '04'
> OK
> 时间: 1.329s

20200308211318.png

2. 统计月订单总额
SELECT
    sum( item_price ) AS total 
FROM
    tb_order 
WHERE
    item_state = 1 
    AND time_year = '2020' 
    AND time_month = '04'
> OK
> 时间: 1.23s

20200308211634.png

3. 根据月份统计订单状态
SELECT
    count( * ) AS count,
    item_state AS state 
FROM
    tb_order 
WHERE
    time_year = '2020' 
    AND time_month = '04'
GROUP BY
    item_state
> OK
> 时间: 1.429s

20200308211725.png

4. 查询日订单
SELECT
    * 
FROM
    tb_order 
WHERE
    time_year = '2020' and time_month = '04' and time_day = '01'
> OK
> 时间: 0.663s

20200308212225.png

5. 统计日订单总额
SELECT
    SUM(item_price) as total
FROM
    tb_order 
WHERE
    time_year = '2020' AND time_month = '04' AND time_day = '01'
> OK
> 时间: 0.091s

20200308212430.png

6. 分页查询月订单
SELECT
    * 
FROM
    tb_order 
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' ) 
    LIMIT 10000,20
> OK
> 时间: 0.107s

20200308212613.png

由于 根据 时间戳来查询的结果是按照时间戳索引排序的,因此是从小到大
而单纯的拆分查询并没有有序
因此 添加一个order by 来进行排序
SELECT
    * 
FROM
    tb_order 
WHERE
    time_year = '2020' 
    AND time_month = '04' 
ORDER BY
    create_time  
    LIMIT 10000,
    20
> OK
> 时间: 1.599s

20200308213113.png

时间分区

因为添加分区会改变表的文件结构,因此copy一个表
1. 新建主键
因为range分区建立的字段必须为主键或唯一键,因此需要删除原先主键并新建主键
ALTER TABLE tb_order_2 CHANGE COLUMN id id BIGINT(20) UNSIGNED NOT NULL;
alter table tb_order_2 DROP PRIMARY key;
alter table tb_order_2 add PRIMARY key(id,create_time);
2. 新建表分区
ALTER table tb_order_2 PARTITION BY RANGE(create_time)(
PARTITION  p_2020_01  VALUES LESS THAN (1580486400),
PARTITION  p_2020_02  VALUES LESS THAN (1582992000),
PARTITION  p_2020_03  VALUES LESS THAN (1585670400),
PARTITION  p_2020_04  VALUES LESS THAN (1588262400),
PARTITION  p_2020_05  VALUES LESS THAN (1590940800),
PARTITION  p_2020_06  VALUES LESS THAN (1593532800),
PARTITION  p_2020_07  VALUES LESS THAN (1596211200),
PARTITION  p_2020_08  VALUES LESS THAN (1598889600),
PARTITION  p_2020_09  VALUES LESS THAN (1601481600),
PARTITION  p_2020_10  VALUES LESS THAN (1604160000),
PARTITION  p_2020_11  VALUES LESS THAN (1606752000),
PARTITION  p_2020_12  VALUES LESS THAN MAXVALUE
)

20200312105829.png

3. 查看表分区结构
SELECT
    partition_name part,
    partition_expression expr,
    partition_description descr,
    table_rows 
FROM
    information_schema.PARTITIONS 
WHERE
    table_schema = SCHEMA () 
    AND table_name = 'tb_order_2';

20200312105949.png

4. 统计月订单
SELECT
    count( 1 ) 
FROM
    tb_order_2
WHERE
    item_state = 1 
    AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
> OK
> 时间: 0.133s
可以看出时间相比之间查询时间戳快很多。explain一下
20200312110146.png
用了分区进行查询
5. 统计月订单总额
SELECT
    sum(item_price) as total
FROM
    tb_order_2
WHERE
    item_state = 1 
    AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
> OK
> 时间: 0.526s

20200312110252.png

6. 根据月份统计订单状态
SELECT
    count( * ) AS count,
    item_state AS state 
FROM
    tb_order_2
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' ) 
GROUP BY
    item_state
> OK
> 时间: 0.273s

20200312112123.png

7. 查询日订单
SELECT
    * 
FROM
    tb_order_2
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-1 23:59:59' )
> OK
> 时间: 0.587s
全部查询的话 并 差距并不大
20200312112257.png
8. 统计日订单总额
SELECT
    SUM(item_price) as total
FROM
    tb_order_2
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-1 23:59:59' )
> OK
> 时间: 0.068s

20200312112342.png

9. 分页查询月订单
SELECT
    * 
FROM
    tb_order_2
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' ) 
    LIMIT 10000,20
> OK
> 时间: 0.073s

20200312112415.png

图表

测试类型 统计月订单数 统计月订单总额 根据月份统计订单状态 查询日订单 统计日订单总额 分页查询月订单
时间戳 6.348s 5.901s 12.126s 0.54s 0.154s 0.107s
时间拆分 1.329s 1.23s 1.429s 0.663s 0.091s 0.107s
分区 0.133s 0.526s 0.273s 0.587s 0.068s 0.073s

总结

总体看来,就时间查询来说,效率是表分区>时间拆分>时间戳,时间戳的表结构为int结构,因为之间看过其他文章有对比过效率 int > datetime > timestamp,因此我在这里设计表的时候就采用了int的形式。

其实怎么选择都应该按需求而定,当我们的表不大,数量级还没上去,哪种结构查询相差并不是很大,先按时间戳的形式查询,待表结构太大可以进行表的分区,结构若还是很大,可以进行分表。

当然这所有都是我个人(菜鸟)的想法,统计次数也较少,无法完全代表真实情况。希望大家多提出想法


Bt4Grisl
13 声望0 粉丝