Goal
This lesson shows how to do some standard SQL analysis in Apache Drill: for example, summarizing data by using simple aggregate functions and joining data sources. Note that Apache Drill provides ANSI SQL support, not a "SQL-like" interface.
Queries in This Lesson
Now that you know what the data sources look like in their raw form from the select * queries, try running some simple but more meaningful queries on each data source. These queries demonstrate how Drill supports ANSI SQL constructs and also how Drill can combine data from different data sources in a single SELECT statement.
Show an aggregate query on a single file or table, using the GROUP BY, WHERE, HAVING, and ORDER BY clauses.
Perform joins between the Hive, MapR-DB, and file system data sources.
Use table and column aliases.
Create a Drill view.
Aggregation
Set the schema to hive:
0: jdbc:drill:> use hive.`default`;
+-------+-------------------------------------------+
| ok | summary |
+-------+-------------------------------------------+
| true | Default schema changed to [hive.default] |
+-------+-------------------------------------------+
1 row selected
Return sales totals by month:
0: jdbc:drill:> select `month`, sum(order_total) from orders group by `month` order by 2 desc;
+------------+---------+
| month | EXPR$1 |
+------------+---------+
| June | 950481 |
| May | 947796 |
| March | 836809 |
| April | 807291 |
| July | 757395 |
| October | 676236 |
| August | 572269 |
| February | 532901 |
| September | 373100 |
| January | 346536 |
+------------+---------+
10 rows selected
Drill supports SQL aggregate functions such as SUM, MAX, AVG, and MIN. Standard SQL clauses work the same way in Drill queries as they do in relational database queries.
Note that back ticks are required around the `month` column in this query only because "month" is a reserved word in SQL.
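To see the effect of the back ticks, compare the quoted and unquoted forms of the same query. This is a sketch; the exact parse-error message varies by Drill version:

```sql
-- Fails to parse: MONTH is a reserved word in Drill's SQL grammar
select month from orders limit 1;

-- Works: back ticks quote the identifier so it is treated as a column name
select `month` from orders limit 1;
```

The same quoting applies to any column whose name collides with a reserved word (for example `date` or `year`).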
Return the top 20 sales totals by month and state:
0: jdbc:drill:> select `month`, state, sum(order_total) as sales from orders group by `month`, state order by 3 desc limit 20;
+-----------+--------+---------+
| month | state | sales |
+-----------+--------+---------+
| May | ca | 119586 |
| June | ca | 116322 |
| April | ca | 101363 |
| March | ca | 99540 |
| July | ca | 90285 |
| October | ca | 80090 |
| June | tx | 78363 |
| May | tx | 77247 |
| March | tx | 73815 |
| August | ca | 71255 |
| April | tx | 68385 |
| July | tx | 63858 |
| February | ca | 63527 |
| June | fl | 62199 |
| June | ny | 62052 |
| May | fl | 61651 |
| May | ny | 59369 |
| October | tx | 55076 |
| March | fl | 54867 |
| March | ny | 52101 |
+-----------+--------+---------+
20 rows selected
Note the alias for the result of the SUM function. Drill supports both column aliases and table aliases.
HAVING Clause
This query uses the HAVING clause to constrain an aggregate result.
Set the workspace to dfs.clicks:
0: jdbc:drill:> use dfs.clicks;
+-------+-----------------------------------------+
| ok | summary |
+-------+-----------------------------------------+
| true | Default schema changed to [dfs.clicks] |
+-------+-----------------------------------------+
1 row selected
Return the total number of clicks for devices that indicate high click-throughs:
0: jdbc:drill:> select t.user_info.device, count(*) from `clicks/clicks.json` t group by t.user_info.device having count(*) > 1000;
+---------+---------+
| EXPR$0 | EXPR$1 |
+---------+---------+
| IOS5 | 11814 |
| AOS4.2 | 5986 |
| IOS6 | 4464 |
| IOS7 | 3135 |
| AOS4.4 | 1562 |
| AOS4.3 | 3039 |
+---------+---------+
6 rows selected
The aggregate is a count of the records for each different mobile device in the clickstream data. Only the devices that registered more than 1000 transactions qualify for the result set.
UNION Operator
Use the same workspace as before (dfs.clicks).
Combine click activity from before and after the marketing campaign:
0: jdbc:drill:> select t.trans_id transaction, t.user_info.cust_id customer from `clicks/clicks.campaign.json` t union all select u.trans_id, u.user_info.cust_id from `clicks/clicks.json` u limit 5;
+-------------+------------+
| transaction | customer |
+-------------+------------+
| 35232 | 18520 |
| 31995 | 17182 |
| 35760 | 18228 |
| 37090 | 17015 |
| 37838 | 18737 |
+-------------+------------+
This UNION ALL query returns rows that exist in both files (including any duplicate rows from those files): clicks.campaign.json and clicks.json.
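If duplicates are not wanted, recent Drill releases also accept plain UNION, which deduplicates the combined rows at extra cost. A minimal sketch against the same two files (support for UNION without ALL depends on your Drill version):

```sql
-- UNION removes duplicate rows across the two inputs;
-- UNION ALL (as above) keeps every row and is cheaper to execute
select t.trans_id, t.user_info.cust_id from `clicks/clicks.campaign.json` t
union
select u.trans_id, u.user_info.cust_id from `clicks/clicks.json` u;
```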
Subqueries
Set the workspace to hive:
0: jdbc:drill:> use hive.`default`;
+-------+-------------------------------------------+
| ok | summary |
+-------+-------------------------------------------+
| true | Default schema changed to [hive.default] |
+-------+-------------------------------------------+
1 row selected
Compare order totals across states:
0: jdbc:drill:> select ny_sales.cust_id, ny_sales.total_orders, ca_sales.total_orders
from
(select o.cust_id, sum(o.order_total) as total_orders from hive.orders o where state = 'ny' group by o.cust_id) ny_sales
left outer join
(select o.cust_id, sum(o.order_total) as total_orders from hive.orders o where state = 'ca' group by o.cust_id) ca_sales
on ny_sales.cust_id = ca_sales.cust_id
order by ny_sales.cust_id
limit 20;
+------------+------------+------------+
| cust_id | ny_sales | ca_sales |
+------------+------------+------------+
| 1001 | 72 | 47 |
| 1002 | 108 | 198 |
| 1003 | 83 | null |
| 1004 | 86 | 210 |
| 1005 | 168 | 153 |
| 1006 | 29 | 326 |
| 1008 | 105 | 168 |
| 1009 | 443 | 127 |
| 1010 | 75 | 18 |
| 1012 | 110 | null |
| 1013 | 19 | null |
| 1014 | 106 | 162 |
| 1015 | 220 | 153 |
| 1016 | 85 | 159 |
| 1017 | 82 | 56 |
| 1019 | 37 | 196 |
| 1020 | 193 | 165 |
| 1022 | 124 | null |
| 1023 | 166 | 149 |
| 1024 | 233 | null |
+------------+------------+------------+
This example demonstrates Drill's support for subqueries.
CAST Function
Use the maprdb workspace:
0: jdbc:drill:> use maprdb;
+-------+-------------------------------------+
| ok | summary |
+-------+-------------------------------------+
| true | Default schema changed to [maprdb] |
+-------+-------------------------------------+
1 row selected (0.088 seconds)
Return customer data with appropriate data types:
0: jdbc:drill:> select cast(row_key as int) as cust_id, cast(t.personal.name as varchar(20)) as name,
cast(t.personal.gender as varchar(10)) as gender, cast(t.personal.age as varchar(10)) as age,
cast(t.address.state as varchar(4)) as state, cast(t.loyalty.agg_rev as dec(7,2)) as agg_rev,
cast(t.loyalty.membership as varchar(20)) as membership
from customers t limit 5;
+----------+----------------------+-----------+-----------+--------+----------+-------------+
| cust_id | name | gender | age | state | agg_rev | membership |
+----------+----------------------+-----------+-----------+--------+----------+-------------+
| 10001 | "Corrine Mecham" | "FEMALE" | "15-20" | "va" | 197.00 | "silver" |
| 10005 | "Brittany Park" | "MALE" | "26-35" | "in" | 230.00 | "silver" |
| 10006 | "Rose Lokey" | "MALE" | "26-35" | "ca" | 250.00 | "silver" |
| 10007 | "James Fowler" | "FEMALE" | "51-100" | "me" | 263.00 | "silver" |
| 10010 | "Guillermo Koehler" | "OTHER" | "51-100" | "mn" | 202.00 | "silver" |
+----------+----------------------+-----------+-----------+--------+----------+-------------+
Note the following features of this query:
The CAST function is required for every column in the table. This function returns the binary MapR-DB/HBase data as readable integers and strings. Alternatively, you can use the CONVERT_TO/CONVERT_FROM functions to decode the string columns. CONVERT_TO/CONVERT_FROM are more efficient than CAST in most cases. Use CONVERT_TO only to convert binary types to types other than VARCHAR.
The row_key column functions as the primary key of the table (the customer ID in this case).
The table alias t is required; otherwise the column family names would be parsed as table names and the query would return an error.
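As a sketch of the CONVERT_FROM alternative mentioned above, the binary columns from the customers example could be decoded like this (the encoding names are assumptions to verify against your data; 'UTF8' decodes binary strings, and encodings such as 'INT_BE' decode big-endian integers):

```sql
-- Hypothetical sketch: decode MapR-DB/HBase binary values with CONVERT_FROM
select convert_from(row_key, 'UTF8') as cust_id,
       convert_from(t.personal.name, 'UTF8') as name,
       convert_from(t.address.state, 'UTF8') as state
from customers t limit 5;
```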
Remove the quotes from the strings:
You can use the regexp_replace function to remove the quotes around the strings in the query results. For example, this query returns the state name va, rather than "va", for Virginia:
0: jdbc:drill:> select cast(row_key as int), regexp_replace(cast(t.address.state as varchar(10)),'"','')
from customers t limit 1;
+------------+------------+
| EXPR$0 | EXPR$1 |
+------------+------------+
| 10001 | va |
+------------+------------+
1 row selected
CREATE VIEW Command
0: jdbc:drill:> use dfs.views;
+-------+----------------------------------------+
| ok | summary |
+-------+----------------------------------------+
| true | Default schema changed to [dfs.views] |
+-------+----------------------------------------+
1 row selected
Use a mutable workspace:
A mutable (or writable) workspace is a workspace that is enabled for "write" operations. This attribute is part of the storage plugin configuration. You can create Drill views and tables in mutable workspaces.
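As a sketch, the relevant part of a dfs storage plugin configuration might look like the fragment below; the `writable` property is what makes the workspace mutable (the location path here is illustrative):

```json
{
  "type": "file",
  "workspaces": {
    "views": {
      "location": "/mapr/demo.mapr.com/data/views",
      "writable": true,
      "defaultInputFormat": null
    }
  }
}
```

You edit this configuration on the Storage tab of the Drill Web UI.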
Create a view on a MapR-DB table:
0: jdbc:drill:> create or replace view custview as select cast(row_key as int) as cust_id,
cast(t.personal.name as varchar(20)) as name,
cast(t.personal.gender as varchar(10)) as gender,
cast(t.personal.age as varchar(10)) as age,
cast(t.address.state as varchar(4)) as state,
cast(t.loyalty.agg_rev as dec(7,2)) as agg_rev,
cast(t.loyalty.membership as varchar(20)) as membership
from maprdb.customers t;
+-------+-------------------------------------------------------------+
| ok | summary |
+-------+-------------------------------------------------------------+
| true | View 'custview' created successfully in 'dfs.views' schema |
+-------+-------------------------------------------------------------+
1 row selected
Drill provides CREATE OR REPLACE VIEW syntax, similar to relational databases, for creating views. Using the OR REPLACE option makes it easier to update a view later without having to remove it first. Note that the FROM clause in this example must refer to maprdb.customers; the MapR-DB tables are not directly visible to the dfs.views workspace.
Unlike traditional databases, where views are typically DBA/developer-driven operations, file-system-based views in Drill are very lightweight. A view is simply a special file with a specific extension (.drill). You can store views in your local file system or point them to a specific workspace. You can specify any query against any Drill data source in the body of the CREATE VIEW statement.
Drill provides a decentralized metadata model. Drill can query metadata defined in data sources such as Hive, HBase, and the file system. Drill also supports the creation of metadata in the file system.
Query data from the view:
0: jdbc:drill:> select * from custview limit 1;
+----------+-------------------+-----------+----------+--------+----------+-------------+
| cust_id | name | gender | age | state | agg_rev | membership |
+----------+-------------------+-----------+----------+--------+----------+-------------+
| 10001 | "Corrine Mecham" | "FEMALE" | "15-20" | "va" | 197.00 | "silver" |
+----------+-------------------+-----------+----------+--------+----------+-------------+
1 row selected
Once users know what data is available by exploring it directly from the file system, views can be used to read the data into downstream tools such as Tableau and MicroStrategy for analysis and visualization. To these tools, a view simply appears as a "table" with selectable "columns".
Query Across Data Sources
Continue using dfs.views for these queries.
Join the customers view and the orders table:
0: jdbc:drill:> select membership, sum(order_total) as sales from hive.orders, custview
where orders.cust_id=custview.cust_id
group by membership order by 2;
+------------+------------+
| membership | sales |
+------------+------------+
| "basic" | 380665 |
| "silver" | 708438 |
| "gold" | 2787682 |
+------------+------------+
3 rows selected
In this query, we are reading data from the MapR-DB table (represented by custview) and combining it with the orders table information in Hive. When doing cross data source queries such as this, you need to use fully qualified table/view names. For example, the orders table is prefixed by "hive," which is the name of the storage plugin registered with Drill. We do not use any prefix for custview because we explicitly switched to the dfs.views workspace where custview is stored.
Note: If the results of any of your queries appear to be truncated because the rows are wide, set the maximum display width to 10000:
This command does not use a semicolon.
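The width setting is a sqlline shell command rather than a SQL statement, which is why it takes no semicolon. A sketch of the command (standard sqlline; verify the option name in your sqlline version):

```
0: jdbc:drill:> !set maxwidth 10000
```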
Join the customers, orders, and clickstream data:
0: jdbc:drill:> select custview.membership, sum(orders.order_total) as sales from hive.orders, custview,
dfs.`/mapr/demo.mapr.com/data/nested/clicks/clicks.json` c
where orders.cust_id=custview.cust_id and orders.cust_id=c.user_info.cust_id
group by custview.membership order by 2;
+------------+------------+
| membership | sales |
+------------+------------+
| "basic" | 372866 |
| "silver" | 728424 |
| "gold" | 7050198 |
+------------+------------+
3 rows selected
This three-way join selects from three different data sources in one query:
the hive.orders table
custview (a view of the HBase customers table)
the clicks.json file
The join column for both sets of join conditions is the cust_id column. The views workspace is used for this query so that custview can be accessed. The hive.orders table is also visible to the query.
However, note that the JSON file is not directly visible from the views workspace, so the query specifies the full path to the file:
dfs.`/mapr/demo.mapr.com/data/nested/clicks/clicks.json`
Lesson 3: Run Queries on Complex Data Types →