目标

这节课展示了如何在Apache Drill中做一些标准的SQL分析:例如,总结数据通过使用简单集合函数和连接数据源。注意,Apache Drill提供ANSI SQL支持,不是一个“类似sql”接口(界面)。

查询在这节课中

现在你知道在他们的原始形式数据源是什么样子的,使用 select *查询,尝试运行一些简单但更有意义的查询在每个数据源。这些查询演示Drill如何支持ANSI SQL构造和钻还如何把来自不同数据源的数据结合在一个单一SELECT声明。

  • 在一个单一文件或者表上显示一个合并查询。 使用GROUP BY,WHERE,HAVING,ORDER BY子句。

  • 执行hive之间的连接、MapR-DB和文件系统的数据源。

  • 用表和列别名。

  • 创建一个钻视图。

聚合

设置hive模式:

0: jdbc:drill:> use hive.`default`;
+-------+-------------------------------------------+
|  ok   |                  summary                  |
+-------+-------------------------------------------+
| true  | Default schema changed to [hive.default]  |
+-------+-------------------------------------------+
1 row selected 

返回月销量总额:

0: jdbc:drill:> select `month`, sum(order_total) from orders group by `month` order by 2 desc;
+------------+---------+
|   month    | EXPR$1  |
+------------+---------+
| June       | 950481  |
| May        | 947796  |
| March      | 836809  |
| April      | 807291  |
| July       | 757395  |
| October    | 676236  |
| August     | 572269  |
| February   | 532901  |
| September  | 373100  |
| January    | 346536  |
+------------+---------+
10 rows selected 

Drill支持SQL聚合函数,比如SUM,MAX,AVG和MIN。标准的SQL句子做为关系数据库以同样的方式在钻查询中工作。

需要注意,back ticks“month”的列查询,只是因为“month” 在SQL中是保留字。

返回月和州销售总额的前20名:

0: jdbc:drill:> select `month`, state, sum(order_total) as sales from orders group by `month`, state order by 3 desc limit 20;
+-----------+--------+---------+
|   month   | state  |  sales  |
+-----------+--------+---------+
| May       | ca     | 119586  |
| June      | ca     | 116322  |
| April     | ca     | 101363  |
| March     | ca     | 99540   |
| July      | ca     | 90285   |
| October   | ca     | 80090   |
| June      | tx     | 78363   |
| May       | tx     | 77247   |
| March     | tx     | 73815   |
| August    | ca     | 71255   |
| April     | tx     | 68385   |
| July      | tx     | 63858   |
| February  | ca     | 63527   |
| June      | fl     | 62199   |
| June      | ny     | 62052   |
| May       | fl     | 61651   |
| May       | ny     | 59369   |
| October   | tx     | 55076   |
| March     | fl     | 54867   |
| March     | ny     | 52101   |
+-----------+--------+---------+
20 rows selected 

SUM函数结果要注意别名。钻支持列别名和表别名。

HAVING条款

这个查询使用HAVING条款去束缚一个聚合结果。

设置dfs.clicks的工作区间:

0: jdbc:drill:> use dfs.clicks;
+-------+-----------------------------------------+
|  ok   |                 summary                 |
+-------+-----------------------------------------+
| true  | Default schema changed to [dfs.clicks]  |
+-------+-----------------------------------------+
1 row selected

返回显示高点击的设备总数:

0: jdbc:drill:> select t.user_info.device, count(*) from `clicks/clicks.json` t  group by t.user_info.device having count(*) > 1000;
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| IOS5    | 11814   |
| AOS4.2  | 5986    |
| IOS6    | 4464    |
| IOS7    | 3135    |
| AOS4.4  | 1562    |
| AOS4.3  | 3039    |
+---------+---------+
6 rows selected

聚合是一种在点击流数据中的计数,为每个不同的移动设备记录。只有设备注册更多的超过1000的交易活动获得结果集。

UNION操作符★★★

像上面一样使用相同的工作区(dfs.clicks)。

在营销活动前后结合点击活动

0: jdbc:drill:> select t.trans_id transaction, t.user_info.cust_id customer from `clicks/clicks.campaign.json` t union all select u.trans_id, u.user_info.cust_id  from `clicks/clicks.json` u limit 5;
+-------------+------------+
| transaction |  customer  |
+-------------+------------+
| 35232       | 18520      |
| 31995       | 17182      |
| 35760       | 18228      |
| 37090       | 17015      |
| 37838       | 18737      |
+-------------+------------+  

UNION ALL查询返回所有存在的两个文件行中的行(包括任何从这些文件重复的行): clicks.campaign.json 和 clicks.json。

子查询

设置hive工作区:

0: jdbc:drill:> use hive.`default`;
+-------+-------------------------------------------+
|  ok   |                  summary                  |
+-------+-------------------------------------------+
| true  | Default schema changed to [hive.default]  |
+-------+-------------------------------------------+
1 row selected

比较跨洲的订单总量:

0: jdbc:drill:> select ny_sales.cust_id, ny_sales.total_orders, ca_sales.total_orders
from
(select o.cust_id, sum(o.order_total) as total_orders from hive.orders o where state = 'ny' group by o.cust_id) ny_sales
left outer join
(select o.cust_id, sum(o.order_total) as total_orders from hive.orders o where state = 'ca' group by o.cust_id) ca_sales
on ny_sales.cust_id = ca_sales.cust_id
order by ny_sales.cust_id
limit 20;
+------------+------------+------------+
|  cust_id   |  ny_sales  |  ca_sales  |
+------------+------------+------------+
| 1001       | 72         | 47         |
| 1002       | 108        | 198        |
| 1003       | 83         | null       |
| 1004       | 86         | 210        |
| 1005       | 168        | 153        |
| 1006       | 29         | 326        |
| 1008       | 105        | 168        |
| 1009       | 443        | 127        |
| 1010       | 75         | 18         |
| 1012       | 110        | null       |
| 1013       | 19         | null       |
| 1014       | 106        | 162        |
| 1015       | 220        | 153        |
| 1016       | 85         | 159        |
| 1017       | 82         | 56         |
| 1019       | 37         | 196        |
| 1020       | 193        | 165        |
| 1022       | 124        | null       |
| 1023       | 166        | 149        |
| 1024       | 233        | null       |
+------------+------------+------------+

这个例子演示了Drill支持子查询。

CAST函数★★

使用maprdb工作区:

0: jdbc:drill:> use maprdb;
+-------+-------------------------------------+
|  ok   |               summary               |
+-------+-------------------------------------+
| true  | Default schema changed to [maprdb]  |
+-------+-------------------------------------+
1 row selected (0.088 seconds)

用适当的数据类型返回客户数据:

0: jdbc:drill:> select cast(row_key as int) as cust_id, cast(t.personal.name as varchar(20)) as name, 
cast(t.personal.gender as varchar(10)) as gender, cast(t.personal.age as varchar(10)) as age,
cast(t.address.state as varchar(4)) as state, cast(t.loyalty.agg_rev as dec(7,2)) as agg_rev, 
cast(t.loyalty.membership as varchar(20)) as membership
from customers t limit 5;
+----------+----------------------+-----------+-----------+--------+----------+-------------+
| cust_id  |         name         |  gender   |    age    | state  | agg_rev  | membership  |
+----------+----------------------+-----------+-----------+--------+----------+-------------+
| 10001    | "Corrine Mecham"     | "FEMALE"  | "15-20"   | "va"   | 197.00   | "silver"    |
| 10005    | "Brittany Park"      | "MALE"    | "26-35"   | "in"   | 230.00   | "silver"    |
| 10006    | "Rose Lokey"         | "MALE"    | "26-35"   | "ca"   | 250.00   | "silver"    |
| 10007    | "James Fowler"       | "FEMALE"  | "51-100"  | "me"   | 263.00   | "silver"    |
| 10010    | "Guillermo Koehler"  | "OTHER"   | "51-100"  | "mn"   | 202.00   | "silver"    |
+----------+----------------------+-----------+-----------+--------+----------+-------------+

注意这个查询的以下特点:

  • CAST函数所需表中每一列。这个函数返回MapR-DB / HBase二进制可读整型和字符串数据。或者,您可以使用CONVERT_TO / CONVERT_FROM函数译码字符串列。在大多数情况下CONVERT_TO / CONVERT_FROM比CAST更有效。只使用CONVERT_TO转换二进制类型到VARCHAR以外的其他任何类型。☆♫

  • row_key列函数作为表的主键(在这种情况下客户ID)。

  • 表别名t是必需的;否则列族名称将被解析为表名,查询将返回一个错误。★▲

删除字符串引号:

你可以使用regexp_replace函数来删除在查询结果字符串中的引号。例如,返回弗吉尼亚州一个州的名字va而不是“va”:

0: jdbc:drill:> select cast(row_key as int), regexp_replace(cast(t.address.state as varchar(10)),'"','')
from customers t limit 1;
+------------+------------+
|   EXPR$0   |   EXPR$1   |
+------------+------------+
| 10001      | va         |
+------------+------------+
1 row selected  

创建视图命令

0: jdbc:drill:> use dfs.views;
+-------+----------------------------------------+
|  ok   |                summary                 |
+-------+----------------------------------------+
| true  | Default schema changed to [dfs.views]  |
+-------+----------------------------------------+
1 row selected

使用可变工作区:

一个可变的(或可写)工作空间是一个支持“写” 操作工作区。这个属性是配置存储插件的一部分。 你可以在可变工作区创建钻视图和表。

在MapR-DB表创建视图:

0: jdbc:drill:> create or replace view custview as select cast(row_key as int) as cust_id,
cast(t.personal.name as varchar(20)) as name, 
cast(t.personal.gender as varchar(10)) as gender, 
cast(t.personal.age as varchar(10)) as age, 
cast(t.address.state as varchar(4)) as state,
cast(t.loyalty.agg_rev as dec(7,2)) as agg_rev,
cast(t.loyalty.membership as varchar(20)) as membership
from maprdb.customers t;
+-------+-------------------------------------------------------------+
|  ok   |                           summary                           |
+-------+-------------------------------------------------------------+
| true  | View 'custview' created successfully in 'dfs.views' schema  |
+-------+-------------------------------------------------------------+
1 row selected

钻提供了类似于关系数据库创建(CREAT)或替换视图(REPLACE VIEW)的语法创建视图。使用或替换选项更容易查看更新视图后没有先删除它。注意,在FROM子句中这个例子必须参考maprdb.customers。MapR-DB表到dfs.views工作区并不直接可见。

不像传统数据库视图位置,代表性的是DBA /开发者驱动操作,在drill中基于文件系统的视图非常无足轻重。一个视图只是一个特殊的文件与一个特定的(.drill)扩展。您可以存储视图在您的本地文件系统或指向一个特定的工作区。你可以指定任何查询而不是钻数据源在创建视图体内声明。

钻提供了分散的元数据模型。钻能够查询定义在数据源的元数据如hive,HBase和文件系统。钻也支持在文件系统中的创建元数据。

查询的视图中数据:

0: jdbc:drill:> select * from custview limit 1;
+----------+-------------------+-----------+----------+--------+----------+-------------+
| cust_id  |       name        |  gender   |   age    | state  | agg_rev  | membership  |
+----------+-------------------+-----------+----------+--------+----------+-------------+
| 10001    | "Corrine Mecham"  | "FEMALE"  | "15-20"  | "va"   | 197.00   | "silver"    |
+----------+-------------------+-----------+----------+--------+----------+-------------+
1 row selected

通过直接从文件系统中探测一旦用户知道哪些数据可用,视图可以用来读取数据到下游工具例如Tableau和MicroStrategy进行分析和可视化。 这些工具,出现一个视图只是作为一个“表”和一个可选择的“列”。

通过数据源查询

继续使用 dfs.views查询。

★★连结客户视图和订单表:

0: jdbc:drill:> select membership, sum(order_total) as sales from hive.orders, custview
where orders.cust_id=custview.cust_id
group by membership order by 2;
+------------+------------+
| membership |   sales    |
+------------+------------+
| "basic"    | 380665     |
| "silver"   | 708438     |
| "gold"     | 2787682    |
+------------+------------+
3 rows selected

★在这个查询中,我们从MapR-DB表(由custview代表)读取数据,并与订单表信息合并在hive中。 当做像这样的跨数据源查询,你需要使用完全限定 表/视图名称。例如,通过“hive”orders表是有前缀的,那是存储插件通过drill注册的名称。我们不能为“custview”使用任何前缀,因为我们明确地转换了dfs.views custview被存储在哪的工作区。

注意:如果你的任何查询的结果似乎被截断,因为 行宽,设置显示的最大宽度10000:

这组命令不使用分号。

★★★加入客户、订单和点击流数据:

0: jdbc:drill:> select custview.membership, sum(orders.order_total) as sales from hive.orders, custview,
dfs.`/mapr/demo.mapr.com/data/nested/clicks/clicks.json` c 
where orders.cust_id=custview.cust_id and orders.cust_id=c.user_info.cust_id 
group by custview.membership order by 2;
+------------+------------+
| membership |   sales    |
+------------+------------+
| "basic"    | 372866     |
| "silver"   | 728424     |
| "gold"     | 7050198    |
+------------+------------+
3 rows selected

★这三方结合选择从三个不同的数据源在一次查询查询:

  • hive.orders表

  • custview(HBase客户表的视图)

  • clicks.json文件

两组的连接列加入条件是cust_id列。视图工作区用于这个查询以至于custview可以访问。hive.orders表也轻松查询。

然而,注意JSON文件不是从视图工作区中直接可见的,所以查询指定了文件的完整路径:

dfs.`/mapr/demo.mapr.com/data/nested/clicks/clicks.json`

Lesson 3: Run Queries on Complex Data Types →


眼泪成诗hocc
91 声望13 粉丝

forget-me-not.


引用和评论

0 条评论