今天的数据是动态的和被驱动的应用。一个商业应用驱动的的新时代的发展由动态产业如网络、社会、移动和物联网等与新数据类型和新的数据模型形成数据集。这些应用程序是迭代的,相关的数据模型通常是半结构化的,非模式化和不断进化的。半结构化数据模型可以被复杂/嵌套,非模式化,能够在每一行和不同领域持续发展达到频繁添加和删除以满足业务需求。
本教程向您展示了如何查询本地动态数据集,如JSON和在几分钟内查询来自任何类型的数据派生的洞察力。被使用的数据集的例子是Yelp签到数据集,它具有以下结构:
check-in
{
'type': 'checkin',
'business_id': (encrypted business id),
'checkin_info': {
'0-0': (number of checkins from 00:00 to 01:00 on all Sundays),
'1-0': (number of checkins from 01:00 to 02:00 on all Sundays),
...
'14-4': (number of checkins from 14:00 to 15:00 on all Thursdays),
...
'23-6': (number of checkins from 23:00 to 00:00 on all Saturdays)
}, # if there was no checkin for a hour-day block it will not be in the dataset
}
值得重复的评论这个代码片段的底部:
# if there was no checkin for a hour-day block it will not be in the dataset.
你在checkin_info看到的元素名称是未知的前期并且每一行可能会有所不同。数据,虽然简单,但是高度动态的数据。分析数据不需要先在相关平面结构表示这个数据集,因为您将在Hadoop使用任何其他SQL技术。
第1步:首先下载钻,如果你还没有这样做,到您的机器中
http://drill.apache.org/download/
tar -xvf apache-drill-0.9.0.tar
在本地桌面安装钻(嵌入模式)。你不需要Hadoop。
♣第二步:开始钻壳。
bin/drill-embedded
第三步:开始使用SQL分析数据
首先,我们先看一下数据集:
0: jdbc:drill:zk=local> SELECT * FROM dfs.`/home/zj/software/drill/yelp/yelp_academic_dataset_checkin.json` limit 2;
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+------------------------+
| checkin_info | type | business_id |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+------------------------+
| {"3-4":1,"13-5":1,"6-6":1,"14-5":1,"14-6":1,"14-2":1,"14-3":1,"19-0":1,"11-5":1,"13-2":1,"11-6":2,"11-3":1,"12-6":1,"6-5":1,"5-5":1,"9-2":1,"9-5":1,"9-6":1,"5-2":1,"7-6":1,"7-5":1,"7-4":1,"17-5":1,"8-5":1,"10-2":1,"10-5":1,"10-6":1} | checkin | JwUE5GmEO-sH1FuwJgKBlQ |
| {"6-6":2,"6-5":1,"7-6":1,"7-5":1,"8-5":2,"10-5":1,"9-3":1,"12-5":1,"15-3":1,"15-5":1,"15-6":1,"16-3":1,"10-0":1,"15-4":1,"10-4":1,"8-2":1} | checkin | uGykseHzyS5xAMWoN6YUqA |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+------------------------+
请注意:
本文档将钻例如输出的目的。在这种情况下钻输出不对齐。
你直接查询JSON文件中的数据。定义模式在hive仓库并不是必须的。内部元素的名称 checkin_info列第一行和第二行之间是不同的。
▲▲▲▲▲钻提供了一个名为KVGEN的函数(键值生成器)在处理复杂的数据时是有用的,其中包含任意地组成的动态图和未知的元素名称例如checkin_info。KVGEN转动态映射到一个数组的键值对,其中键代表动态元素名称。
让我们在checkin_info元素应用KVGEN生成键-值对。
0: jdbc:drill:zk=local> SELECT KVGEN(checkin_info) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` LIMIT 2;
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| checkins |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [{"key":"3-4","value":1},{"key":"13-5","value":1},{"key":"6-6","value":1},{"key":"14-5","value":1},{"key":"14-6","value":1},{"key":"14-2","value":1},{"key":"14-3","value":1},{"key":"19-0","value":1},{"key":"11-5","value":1},{"key":"13-2","value":1},{"key":"11-6","value":2},{"key":"11-3","value":1},{"key":"12-6","value":1},{"key":"6-5","value":1},{"key":"5-5","value":1},{"key":"9-2","value":1},{"key":"9-5","value":1},{"key":"9-6","value":1},{"key":"5-2","value":1},{"key":"7-6","value":1},{"key":"7-5","value":1},{"key":"7-4","value":1},{"key":"17-5","value":1},{"key":"8-5","value":1},{"key":"10-2","value":1},{"key":"10-5","value":1},{"key":"10-6","value":1}] |
| [{"key":"6-6","value":2},{"key":"6-5","value":1},{"key":"7-6","value":1},{"key":"7-5","value":1},{"key":"8-5","value":2},{"key":"10-5","value":1},{"key":"9-3","value":1},{"key":"12-5","value":1},{"key":"15-3","value":1},{"key":"15-5","value":1},{"key":"15-6","value":1},{"key":"16-3","value":1},{"key":"10-0","value":1},{"key":"15-4","value":1},{"key":"10-4","value":1},{"key":"8-2","value":1}] |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
钻提供了另一个函数来操作复杂的数据称为“Flatten”,打破“KVGen”造成的键值对列表到单独的行进一步应用分析功能。
0: jdbc:drill:zk=local> SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` LIMIT 20;
+--------------------------+
| checkins |
+--------------------------+
| {"key":"3-4","value":1} |
| {"key":"13-5","value":1} |
| {"key":"6-6","value":1} |
| {"key":"14-5","value":1} |
| {"key":"14-6","value":1} |
| {"key":"14-2","value":1} |
| {"key":"14-3","value":1} |
| {"key":"19-0","value":1} |
| {"key":"11-5","value":1} |
| {"key":"13-2","value":1} |
| {"key":"11-6","value":2} |
| {"key":"11-3","value":1} |
| {"key":"12-6","value":1} |
| {"key":"6-5","value":1} |
| {"key":"5-5","value":1} |
| {"key":"9-2","value":1} |
| {"key":"9-5","value":1} |
| {"key":"9-6","value":1} |
| {"key":"5-2","value":1} |
| {"key":"7-6","value":1} |
+--------------------------+
可以从数据中快速的获得值,通过在数据集应用KVGEN和FLATTEN函数动态数据集在fly——no需要耗时的模式定义和数据存储中间格式。
夷为平地的数据输出,您可以使用标准SQL功能,如过滤器、聚合、排序。 让我们看看几个例子。
▲▲▲▲▲在Yelp数据集获取签到记录的总数
0: jdbc:drill:zk=local> SELECT SUM(checkintbl.checkins.`value`) AS TotalCheckins FROM (
. . . . . . . . . . . > SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` ) checkintbl
. . . . . . . . . . . > ;
+---------------+
| TotalCheckins |
+---------------+
| 4713811 |
+---------------+
▲▲▲▲▲获取专门为周日午夜的签到的数量
0: jdbc:drill:zk=local> SELECT SUM(checkintbl.checkins.`value`) AS SundayMidnightCheckins FROM (
. . . . . . . . . . . > SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` ) checkintbl WHERE checkintbl.checkins.key='23-0';
+------------------------+
| SundayMidnightCheckins |
+------------------------+
| 8575 |
+------------------------+
▲▲▲▲▲获取每星期几签到的数量
0: jdbc:drill:zk=local> SELECT `right`(checkintbl.checkins.key,1) WeekDay,sum(checkintbl.checkins.`value`) TotalCheckins from (
. . . . . . . . . . . > select flatten(kvgen(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` ) checkintbl GROUP BY `right`(checkintbl.checkins.key,1) ORDER BY TotalCheckins;
+------------+---------------+
| WeekDay | TotalCheckins |
+------------+---------------+
| 1 | 545626 |
| 0 | 555038 |
| 2 | 555747 |
| 3 | 596296 |
| 6 | 735830 |
| 4 | 788073 |
| 5 | 937201 |
+------------+---------------+
▲▲▲▲▲获取一天中每小时的签到数量
0: jdbc:drill:zk=local> SELECT SUBSTR(checkintbl.checkins.key,1,strpos(checkintbl.checkins.key,'-')-1) AS HourOfTheDay ,SUM(checkintbl.checkins.`value`) TotalCheckins FROM (
. . . . . . . . . . . > SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.`/users/nrentachintala/Downloads/yelp/yelp_academic_dataset_checkin.json` ) checkintbl GROUP BY SUBSTR(checkintbl.checkins.key,1,strpos(checkintbl.checkins.key,'-')-1) ORDER BY TotalCheckins;
+--------------+---------------+
| HourOfTheDay | TotalCheckins |
+--------------+---------------+
| 3 | 20357 |
| 4 | 21076 |
| 2 | 28116 |
| 5 | 33842 |
| 1 | 45467 |
| 6 | 54174 |
| 0 | 74127 |
| 7 | 96329 |
| 23 | 102009 |
| 8 | 130091 |
| 22 | 140338 |
| 9 | 162913 |
| 21 | 211949 |
| 10 | 220687 |
| 15 | 261384 |
| 14 | 276188 |
| 16 | 292547 |
| 20 | 293783 |
| 13 | 328373 |
| 11 | 338675 |
| 17 | 374186 |
| 19 | 385381 |
| 12 | 399797 |
| 18 | 422022 |
+--------------+---------------+
总结
在本教程中,您将冲浪结构化和半结构化数据没有任何前期或ETL模式管理。
**粗体** _斜体_ [链接](http://example.com) `代码` - 列表 > 引用
。你还可以使用@
来通知其他用户。