Mysql百万数据查询之时间查询
在校期间,一直没有接触过百万级的数据,一直没有百万数据下查询的感觉。
近期做毕业设计,需要用到较大的数据来进行图表分析和订单推荐。
一直疑惑如何设计表才能让普通的sql更高效,因此便以下尝试并记录于此。
本篇文章着重于表的设计对查询sql的影响
小弟不才,文章仅做记录,有更好的想法或有误请留言指出,必当积极回应。
优化设想
- 时间以int时间戳的形式保存,并建立相关索引
- 时间拆分,以year、month、day进行保存,并以year+month和year+month+day做索引(空间换时间)
- 以时间戳保存,并按月进行分区
测试环境
- 表引擎:innoDB
- 字符集:utf8mb4
- 数据库版本:5.7.18
数据库引擎为默认的innoDB、因为实际业务考虑事务,因此就默认innoDB
为了不受其他环节影响,用的是`腾讯云mysql基础版1核1000MB/50GB`
建表
创建相关Order表
-- 创建order表
CREATE TABLE tb_order (
`id` BIGINT ( 20 ) NOT NULL AUTO_INCREMENT,
`item_name` VARCHAR ( 255 ) NOT NULL,
`item_price` INT ( 11 ) UNSIGNED NOT NULL,
`item_state` TINYINT ( 1 ) NOT NULL,
`create_time` INT ( 11 ) UNSIGNED NOT NULL,
`time_year` CHAR ( 4 ) NOT NULL,
`time_month` CHAR ( 2 ) NOT NULL,
`time_day` CHAR ( 2 ) NOT NULL,
PRIMARY KEY ( id )
) ENGINE = INNODB DEFAULT CHARSET = utf8mb4;
CREATE INDEX idx_order_ctime ON tb_order ( create_time );
CREATE INDEX idx_order_state ON tb_order ( `item_state` );
CREATE INDEX idx_order_day ON tb_order ( `time_year`, `time_month` );
CREATE INDEX idx_order_dmonth ON tb_order ( `time_year`, `time_month`, `time_day`);
创建Mock存储过程
-- 随机字符串函数
CREATE DEFINER=`root`@`%` FUNCTION `rand_string`(n INT) RETURNS varchar(255) CHARSET utf8
BEGIN
DECLARE chars_str varchar(100) DEFAULT 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
DECLARE return_str varchar(255) DEFAULT '' ;
DECLARE i INT DEFAULT 0;
WHILE i < n DO
SET return_str = concat(return_str,substring(chars_str , FLOOR(1 + RAND()*62 ),1));
SET i = i +1;
END WHILE;
RETURN return_str;
END
-- 随机范围时间函数
CREATE DEFINER=`root`@`%` FUNCTION `rand_date`(`startDate` date,`endDate` date) RETURNS datetime
BEGIN
#Routine body goes here...
DECLARE sec INT DEFAULT 0;
DECLARE ret DATETIME;
SET sec = ABS(UNIX_TIMESTAMP(endDate) - UNIX_TIMESTAMP(startDate));
SET ret = DATE_ADD(startDate, INTERVAL FLOOR( 1+RAND ( ) * ( sec-1))SECOND);
RETURN ret;
END
-- 模拟订单存储过程
CREATE DEFINER=`root`@`%` PROCEDURE `mock_order`(IN `size` int UNSIGNED,IN `sd` date,IN `ed` date)
BEGIN
#Routine body goes here...
DECLARE i int UNSIGNED DEFAULT 1;
DECLARE randOrderName VARCHAR(10);
DECLARE randOrderTime DATETIME;
WHILE i<= size DO
SELECT rand_string(10) INTO randOrderName;
SELECT rand_date(sd,ed) INTO randOrderTime;
INSERT INTO tb_order(`item_name`,`item_price`,`item_state`,`create_time`,`time_year`,`time_month`,`time_day`)
VALUES(randOrderName,RAND()*100,ROUND(RAND()),UNIX_TIMESTAMP(randOrderTime),DATE_FORMAT(randOrderTime,'%Y'),DATE_FORMAT(randOrderTime,'%m'),DATE_FORMAT(randOrderTime,'%d'));
SET i = i+1;
END WHILE;
END
-- 执行存储过程
CALL mock_order(1200000,'2020-03-01','2020-12-31')
-- CALL mock_order(FLOOR(RAND() * 9999999),'2020-03-01','2020-12-31')
执行结果
数据总量:
SELECT count(*) FROM tb_order
> OK
> 时间: 0.607s
.jpg)
数据分布:
SELECT
count(*) as count,
DATE_FORMAT(FROM_UNIXTIME(create_time),'%Y-%m') as date
FROM
tb_order
GROUP BY
date desc
> OK
> 时间: 5.007s
结果分布
测试过程
- 统计月订单数量
- 统计月订单总额
- 根据月份统计订单状态
- 查询日订单
- 统计日订单总额
- 分页查询月订单
时间戳
1. 统计月订单数量
SELECT
count( 1 )
FROM
tb_order
WHERE
item_state = 1
AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' )
AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
> OK
> 时间: 6.348s
插曲:因为不确定 使用UNIX_TIMESTAMP()
函数对原本的查询结果是否有影响,决定多测测量使用UNIX_TIMESTAMP()
与直接使用时间戳的时间对比
次数 | 使用转换函数 | 直接使用时间戳 |
---|---|---|
1 | 6.105s | 5.768s |
2 | 6.239s | 6.199s |
3 | 5.681s | 6.161s |
4 | 5.687s | 5.605s |
5 | 6.118s | 5.621s |
平均值 | 5.966s | 5.870s |
结果显示 相差并不是很多 0.09s 而且单纯的执行select UNIX_TIMESTAMP( '2020-04-01 00:00:00' )
需要0.045s 所以下列均使用UNIX_TIMESTAMP()
进行测试
2. 统计月订单总额
SELECT
sum(item_price) as total
FROM
tb_order
WHERE
item_state = 1
AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' )
AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
> OK
> 时间: 5.901s
3. 根据月份统计订单状态
SELECT
count( * ) AS count,
item_state AS state
FROM
tb_order
WHERE
create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' )
AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
GROUP BY
item_state
> OK
> 时间: 12.126s
时间慢的有点不可接受
4. 查询日订单
SELECT
*
FROM
tb_order
WHERE
create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' )
AND UNIX_TIMESTAMP( '2020-04-1 23:59:59' )
> OK
> 时间: 0.54s
5. 统计日订单总额
SELECT
SUM(item_price) as total
FROM
tb_order
WHERE
create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' )
AND UNIX_TIMESTAMP( '2020-04-1 23:59:59' )
> OK
> 时间: 0.154s
6. 分页查询月订单
SELECT
*
FROM
tb_order
WHERE
create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' )
AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
LIMIT 10000,20
> OK
> 时间: 0.107s
时间拆分的形式
1. 统计月订单数量
SELECT
count(1)
FROM
tb_order
WHERE
item_state = 1
AND time_year = '2020' and time_month = '04'
> OK
> 时间: 1.329s
2. 统计月订单总额
SELECT
sum( item_price ) AS total
FROM
tb_order
WHERE
item_state = 1
AND time_year = '2020'
AND time_month = '04'
> OK
> 时间: 1.23s
3. 根据月份统计订单状态
SELECT
count( * ) AS count,
item_state AS state
FROM
tb_order
WHERE
time_year = '2020'
AND time_month = '04'
GROUP BY
item_state
> OK
> 时间: 1.429s
4. 查询日订单
SELECT
*
FROM
tb_order
WHERE
time_year = '2020' and time_month = '04' and time_day = '01'
> OK
> 时间: 0.663s
5. 统计日订单总额
SELECT
SUM(item_price) as total
FROM
tb_order
WHERE
time_year = '2020' AND time_month = '04' AND time_day = '01'
> OK
> 时间: 0.091s
6. 分页查询月订单
SELECT
*
FROM
tb_order
WHERE
create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' )
AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
LIMIT 10000,20
> OK
> 时间: 0.107s
由于 根据 时间戳来查询的结果是按照时间戳索引排序的,因此是从小到大
而单纯的拆分查询并没有有序
因此 添加一个order by
来进行排序
SELECT
*
FROM
tb_order
WHERE
time_year = '2020'
AND time_month = '04'
ORDER BY
create_time
LIMIT 10000,
20
> OK
> 时间: 1.599s
时间分区
因为添加分区会改变表的文件结构,因此copy一个表
1. 新建主键
因为range分区建立的字段必须为主键或唯一键,因此需要删除原先主键并新建主键
ALTER TABLE tb_order_2 CHANGE COLUMN id id BIGINT(20) UNSIGNED NOT NULL;
alter table tb_order_2 DROP PRIMARY key;
alter table tb_order_2 add PRIMARY key(id,create_time);
2. 新建表分区
ALTER table tb_order_2 PARTITION BY RANGE(create_time)(
PARTITION p_2020_01 VALUES LESS THAN (1580486400),
PARTITION p_2020_02 VALUES LESS THAN (1582992000),
PARTITION p_2020_03 VALUES LESS THAN (1585670400),
PARTITION p_2020_04 VALUES LESS THAN (1588262400),
PARTITION p_2020_05 VALUES LESS THAN (1590940800),
PARTITION p_2020_06 VALUES LESS THAN (1593532800),
PARTITION p_2020_07 VALUES LESS THAN (1596211200),
PARTITION p_2020_08 VALUES LESS THAN (1598889600),
PARTITION p_2020_09 VALUES LESS THAN (1601481600),
PARTITION p_2020_10 VALUES LESS THAN (1604160000),
PARTITION p_2020_11 VALUES LESS THAN (1606752000),
PARTITION p_2020_12 VALUES LESS THAN MAXVALUE
)
3. 查看表分区结构
SELECT
partition_name part,
partition_expression expr,
partition_description descr,
table_rows
FROM
information_schema.PARTITIONS
WHERE
table_schema = SCHEMA ()
AND table_name = 'tb_order_2';
4. 统计月订单
SELECT
count( 1 )
FROM
tb_order_2
WHERE
item_state = 1
AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' )
AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
> OK
> 时间: 0.133s
可以看出时间相比之间查询时间戳快很多。explain一下
用了分区进行查询
5. 统计月订单总额
SELECT
sum(item_price) as total
FROM
tb_order_2
WHERE
item_state = 1
AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' )
AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
> OK
> 时间: 0.526s
6. 根据月份统计订单状态
SELECT
count( * ) AS count,
item_state AS state
FROM
tb_order_2
WHERE
create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' )
AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
GROUP BY
item_state
> OK
> 时间: 0.273s
7. 查询日订单
SELECT
*
FROM
tb_order_2
WHERE
create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' )
AND UNIX_TIMESTAMP( '2020-04-1 23:59:59' )
> OK
> 时间: 0.587s
全部查询的话 并 差距并不大![]()
8. 统计日订单总额
SELECT
SUM(item_price) as total
FROM
tb_order_2
WHERE
create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' )
AND UNIX_TIMESTAMP( '2020-04-1 23:59:59' )
> OK
> 时间: 0.068s
9. 分页查询月订单
SELECT
*
FROM
tb_order_2
WHERE
create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' )
AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
LIMIT 10000,20
> OK
> 时间: 0.073s
图表
测试类型 | 统计月订单数 | 统计月订单总额 | 根据月份统计订单状态 | 查询日订单 | 统计日订单总额 | 分页查询月订单 |
---|---|---|---|---|---|---|
时间戳 | 6.348s | 5.901s | 12.126s | 0.54s | 0.154s | 0.107s |
时间拆分 | 1.329s | 1.23s | 1.429s | 0.663s | 0.091s | 0.107s |
分区 | 0.133s | 0.526s | 0.273s | 0.587s | 0.068s | 0.073s |
总结
总体看来,就时间查询来说,效率是表分区>时间拆分>时间戳,时间戳的表结构为int结构,因为之间看过其他文章有对比过效率 int > datetime > timestamp,因此我在这里设计表的时候就采用了int的形式。其实怎么选择都应该按需求而定,当我们的表不大,数量级还没上去,哪种结构查询相差并不是很大,先按时间戳的形式查询,待表结构太大可以进行表的分区,结构若还是很大,可以进行分表。
当然这所有都是我个人(菜鸟)的想法,统计次数也较少,无法完全代表真实情况。希望大家多提出想法
**粗体** _斜体_ [链接](http://example.com) `代码` - 列表 > 引用
。你还可以使用@
来通知其他用户。