1. Data Lake Analytics(DLA)简介

关于Data Lake的概念,更多阅读可以参考:
https://en.wikipedia.org/wiki...

以及AWS和Azure关于Data Lake的解读:
https://amazonaws-china.com/b...
https://azure.microsoft.com/e...

终于,阿里云现在也有了自己的数据湖分析产品:https://www.aliyun.com/produc...

可以点击申请使用(目前公测阶段还属于邀测模式,我们会尽快审批申请),体验本教程的TPC-H CSV数据格式的数据分析之旅。

产品文档:https://help.aliyun.com/produ...

  1. 开通Data Lake Analytics与OSS服务

如果您已经开通,可以跳过该步骤。如果没有开通,可以参考:https://help.aliyun.com/docum...
进行产品开通服务申请。

  1. 下载TPC-H测试数据集

可以从这下载TPC-H 100MB的数据集:
https://public-datasets-cn-ha...

  1. 上传数据文件到OSS

登录阿里云官网的OSS控制台:https://oss.console.aliyun.co...
规划您要使用的OSS bucket,创建或选择好后,点击“文件管理”,因为有8个数据文件,为每个数据文件创建对应的文件目录:

图片描述

创建好8个目录如下:

图片描述

点击进入目录,上传相应的数据文件,例如,customer目录,则上传customer.tbl文件。

图片描述

上传好后,如下图。然后,依次把其他7个数据文件也上传到对应的目录下。

图片描述

至此,8个数据文件都上传到了您的OSS bucket中:

oss://xxx/tpch_100m/customer/customer.tbl
oss://xxx/tpch_100m/lineitem/lineitem.tbl
oss://xxx/tpch_100m/nation/nation.tbl
oss://xxx/tpch_100m/orders/orders.tbl
oss://xxx/tpch_100m/part/part.tbl
oss://xxx/tpch_100m/partsupp/partsupp.tbl
oss://xxx/tpch_100m/region/region.tbl
oss://xxx/tpch_100m/supplier/supplier.tbl

  1. 登录Data Lake Analytics控制台

https://openanalytics.console...
点击“登录数据库”,输入开通服务时分配的用户名和密码,登录Data Lake Analytics控制台。

  1. 创建Schema和Table

输入创建SCHEMA的语句,点击“同步执行”。

CREATE SCHEMA tpch_100m with DBPROPERTIES(
LOCATION = 'oss://test-bucket-julian-1/tpch_100m/',
catalog='oss'
);
(注意:目前在同一个阿里云region,Data Lake Analytics的schema名全局唯一,建议schema名尽量根据业务定义,已有重名schema,在创建时会提示报错,则请换一个schema名字。)

Schema创建好后,在“数据库”的下拉框中,选择刚刚创建的schema。然后在SQL文本框中输入建表语句,点击同步执行。
建表语句语法参考:https://help.aliyun.com/docum...

image.png | left

TPC-H对应的8个表的建表语句如下,分别贴入文档框中执行(LOCATION子句中的数据文件位置请根据您的实际OSS bucket目录相应修改)。(注意:目前控制台中还不支持多个SQL语句执行,请单条语句执行。)

CREATE EXTERNAL TABLE nation (

N_NATIONKEY INT, 
N_NAME STRING,
   N_ID STRING,
N_REGIONKEY INT, 
N_COMMENT STRING

)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/nation';

CREATE EXTERNAL TABLE lineitem (

L_ORDERKEY INT, 
L_PARTKEY INT, 
L_SUPPKEY INT, 
L_LINENUMBER INT, 
L_QUANTITY DOUBLE, 
L_EXTENDEDPRICE DOUBLE, 
L_DISCOUNT DOUBLE, 
L_TAX DOUBLE, 
L_RETURNFLAG STRING, 
L_LINESTATUS STRING, 
L_SHIPDATE DATE, 
L_COMMITDATE DATE, 
L_RECEIPTDATE DATE, 
L_SHIPINSTRUCT STRING, 
L_SHIPMODE STRING, 
L_COMMENT STRING

)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/lineitem';

CREATE EXTERNAL TABLE orders (

O_ORDERKEY INT, 
O_CUSTKEY INT, 
O_ORDERSTATUS STRING, 
O_TOTALPRICE DOUBLE, 
O_ORDERDATE DATE, 
O_ORDERPRIORITY STRING, 
O_CLERK STRING, 
O_SHIPPRIORITY INT, 
O_COMMENT STRING

)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/orders';

CREATE EXTERNAL TABLE supplier (

S_SUPPKEY INT, 
S_NAME STRING, 
S_ADDRESS STRING, 
S_NATIONKEY INT, 
S_PHONE STRING, 
S_ACCTBAL DOUBLE, 
S_COMMENT STRING

)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/supplier';

CREATE EXTERNAL TABLE partsupp (

PS_PARTKEY INT, 
PS_SUPPKEY INT, 
PS_AVAILQTY INT, 
PS_SUPPLYCOST DOUBLE, 
PS_COMMENT STRING

)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/partsupp';

CREATE EXTERNAL TABLE customer (

C_CUSTKEY INT, 
C_NAME STRING, 
C_ADDRESS STRING, 
C_NATIONKEY INT, 
C_PHONE STRING, 
C_ACCTBAL DOUBLE, 
C_MKTSEGMENT STRING, 
C_COMMENT STRING

)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/customer';

CREATE EXTERNAL TABLE part (

P_PARTKEY INT, 
P_NAME STRING, 
P_MFGR STRING, 
P_BRAND STRING, 
P_TYPE STRING, 
P_SIZE INT, 
P_CONTAINER STRING, 
P_RETAILPRICE DOUBLE, 
P_COMMENT STRING

)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/part';

CREATE EXTERNAL TABLE region (

R_REGIONKEY INT, 
R_NAME STRING, 
R_COMMENT STRING

)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'oss://test-bucket-julian-1/tpch_100m/region';
建表完毕后,刷新页面,在左边导航条中能看到schema下的8张表。

image.png | left

  1. 执行TPC-H查询

TPC-H总共22条查询,如下:
Q1:

SELECT l_returnflag,

     l_linestatus,
     Sum(l_quantity)                                       AS sum_qty,
     Sum(l_extendedprice)                                  AS sum_base_price,
     Sum(l_extendedprice * (1 - l_discount))               AS sum_disc_price,
     Sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
     Avg(l_quantity)                                       AS avg_qty,
     Avg(l_extendedprice)                                  AS avg_price,
     Avg(l_discount)                                       AS avg_disc,
     Count(*)                                              AS count_order

FROM lineitem
WHERE l_shipdate <= date '1998-12-01' - INTERVAL '93' day
GROUP BY l_returnflag,

     l_linestatus

ORDER BY l_returnflag,

     l_linestatus

LIMIT 1;
Q2:

SELECT s_acctbal,

     s_name,
     n_name,
     p_partkey,
     p_mfgr,
     s_address,
     s_phone,
     s_comment

FROM part,

     supplier,
     partsupp,
     nation,
     region

WHERE p_partkey = ps_partkey
AND s_suppkey = ps_suppkey
AND p_size = 35
AND p_type LIKE '%NICKEL'
AND s_nationkey = n_nationkey
AND n_regionkey = r_regionkey
AND r_name = 'MIDDLE EAST'
Q3:

SELECT l_orderkey,

     Sum(l_extendedprice * (1 - l_discount)) AS revenue,
     o_orderdate,
     o_shippriority

FROM customer,

     orders,
     lineitem

WHERE c_mktsegment = 'AUTOMOBILE'
AND c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND o_orderdate < date '1995-03-31'
AND l_shipdate > date '1995-03-31'
GROUP BY l_orderkey,

     o_orderdate,
     o_shippriority

ORDER BY revenue DESC,

     o_orderdate

LIMIT 10;
Q4:

SELECT o_orderpriority,

     Count(*) AS order_count

FROM orders,

     lineitem

WHERE o_orderdate >= date '1997-10-01'
AND o_orderdate < date '1997-10-01' + INTERVAL '3' month
AND l_orderkey = o_orderkey
AND l_commitdate < l_receiptdate
GROUP BY o_orderpriority
ORDER BY o_orderpriority
LIMIT 1;
Q5:

SELECT n_name,

     Sum(l_extendedprice * (1 - l_discount)) AS revenue

FROM customer,

     orders,
     lineitem,
     supplier,
     nation,
     region

WHERE c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND l_suppkey = s_suppkey
AND c_nationkey = s_nationkey
AND s_nationkey = n_nationkey
AND n_regionkey = r_regionkey
AND r_name = 'ASIA'
AND o_orderdate >= date '1995-01-01'
AND o_orderdate < date '1995-01-01' + INTERVAL '1' year
GROUP BY n_name
ORDER BY revenue DESC
LIMIT 1;
Q6:

SELECT sum(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= date '1995-01-01'
AND l_shipdate < date '1995-01-01' + interval '1' year
AND l_discount between 0.04 - 0.01 AND 0.04 + 0.01
AND l_quantity < 24
LIMIT 1;
Q7:

SELECT supp_nation,

     cust_nation,
     l_year,
     Sum(volume) AS revenue

FROM (

            SELECT n1.n_name                          AS supp_nation,
                   n2.n_name                          AS cust_nation,
                   Extract(year FROM l_shipdate)      AS l_year,
                   l_extendedprice * (1 - l_discount) AS volume
            FROM   supplier,
                   lineitem,
                   orders,
                   customer,
                   nation n1,
                   nation n2
            WHERE  s_suppkey = l_suppkey
            AND    o_orderkey = l_orderkey
            AND    c_custkey = o_custkey
            AND    s_nationkey = n1.n_nationkey
            AND    c_nationkey = n2.n_nationkey
            AND    ( (
                                 n1.n_name = 'GERMANY'
                          AND    n2.n_name = 'INDIA')
                   OR     (
                                 n1.n_name = 'INDIA'
                          AND    n2.n_name = 'GERMANY') )
            AND    l_shipdate BETWEEN date '1995-01-01' AND    date '1996-12-31' ) AS shipping

GROUP BY supp_nation,

     cust_nation,
     l_year

ORDER BY supp_nation,

     cust_nation,
     l_year

LIMIT 1;
Q8:

SELECT o_year,

     Sum(
     CASE
              WHEN nation = 'INDIA' THEN volume
              ELSE 0
     end) / Sum(volume) AS mkt_share

FROM (

            SELECT Extract(year FROM o_orderdate)     AS o_year,
                   l_extendedprice * (1 - l_discount) AS volume,
                   n2.n_name                          AS nation
            FROM   part,
                   supplier,
                   lineitem,
                   orders,
                   customer,
                   nation n1,
                   nation n2,
                   region
            WHERE  p_partkey = l_partkey
            AND    s_suppkey = l_suppkey
            AND    l_orderkey = o_orderkey
            AND    o_custkey = c_custkey
            AND    c_nationkey = n1.n_nationkey
            AND    n1.n_regionkey = r_regionkey
            AND    r_name = 'ASIA'
            AND    s_nationkey = n2.n_nationkey
            AND    o_orderdate BETWEEN date '1995-01-01' AND    date '1996-12-31'
            AND    p_type = 'STANDARD ANODIZED STEEL' ) AS all_nations

GROUP BY o_year
ORDER BY o_year
LIMIT 1;
Q9:

SELECT nation,

     o_year,
     Sum(amount) AS sum_profit

FROM (

            SELECT n_name                                                          AS nation,
                   Extract(year FROM o_orderdate) AS o_year,
                   l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount
            FROM   part,
                   supplier,
                   lineitem,
                   partsupp,
                   orders,
                   nation
            WHERE  s_suppkey = l_suppkey
            AND    ps_suppkey = l_suppkey
            AND    ps_partkey = l_partkey
            AND    p_partkey = l_partkey
            AND    o_orderkey = l_orderkey
            AND    s_nationkey = n_nationkey
            AND    p_name LIKE '%aquamarine%' ) AS profit

GROUP BY nation,

     o_year

ORDER BY nation,

     o_year DESC

LIMIT 1;
Q10:

SELECT c_custkey,

     c_name,
     Sum(l_extendedprice * (1 - l_discount)) AS revenue,
     c_acctbal,
     n_name,
     c_address,
     c_phone,
     c_comment

FROM customer,

     orders,
     lineitem,
     nation

WHERE c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND o_orderdate >= date '1994-08-01'
AND o_orderdate < date '1994-08-01' + INTERVAL '3' month
AND l_returnflag = 'R'
AND c_nationkey = n_nationkey
GROUP BY c_custkey,

     c_name,
     c_acctbal,
     c_phone,
     n_name,
     c_address,
     c_comment

ORDER BY revenue DESC
LIMIT 20;
Q11:

SELECT ps_partkey,

     Sum(ps_supplycost * ps_availqty) AS value

FROM partsupp,

     supplier,
     nation

WHERE ps_suppkey = s_suppkey
AND s_nationkey = n_nationkey
AND n_name = 'PERU'
GROUP BY ps_partkey
HAVING Sum(ps_supplycost * ps_availqty) >
(
SELECT Sum(ps_supplycost ps_availqty) 0.0001000000 as sum_value
FROM partsupp,

   supplier,
   nation

WHERE ps_suppkey = s_suppkey
AND s_nationkey = n_nationkey
AND n_name = 'PERU'
)
ORDER BY value DESC
LIMIT 1;
Q12:

SELECT l_shipmode, sum(case when o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH' then 1

else 0

end) AS high_line_count, sum(case when o_orderpriority <> '1-URGENT' and o_orderpriority <> '2-HIGH' then 1
else 0
end) AS low_line_count
FROM orders,

 lineitem

WHERE o_orderkey = l_orderkey
AND l_shipmode in ('MAIL', 'TRUCK')
AND l_commitdate < l_receiptdate
AND l_shipdate < l_commitdate
AND l_receiptdate >= date '1996-01-01'
AND l_receiptdate < date '1996-01-01' + interval '1' year
GROUP BY l_shipmode
ORDER BY l_shipmode
LIMIT 1;
Q13:

SELECT c_count, count(*) AS custdist
FROM (

SELECT c_custkey, count(o_orderkey) AS c_count
FROM customer,
     orders
WHERE c_custkey = o_custkey
AND o_comment NOT LIKE '%pending%accounts%'
GROUP BY c_custkey ) AS c_orders

GROUP BY c_count
ORDER BY custdist DESC, c_count DESC
LIMIT 1;
Q14:

SELECT 100.00 sum(case when p_type like 'PROMO%' then l_extendedprice (1 - l_discount)

else 0

end) / sum(l_extendedprice * (1 - l_discount)) AS promo_revenue
FROM lineitem,

 part

WHERE l_partkey = p_partkey
AND l_shipdate >= date '1996-01-01'
AND l_shipdate < date '1996-01-01' + interval '1' month
LIMIT 1;
Q15:

WITH revenue0 AS
(
SELECT l_suppkey AS supplier_no, sum(l_extendedprice * (1 - l_discount)) AS total_revenue
FROM lineitem
WHERE l_shipdate >= date '1993-01-01'
AND l_shipdate < date '1993-01-01' + interval '3' month
GROUP BY l_suppkey
)
SELECT s_suppkey, s_name, s_address, s_phone, total_revenue
FROM supplier, revenue0
WHERE s_suppkey = supplier_no
AND total_revenue IN (

SELECT max(total_revenue)
FROM revenue0 )

ORDER BY s_suppkey;
Q16:

SELECT p_brand, p_type, p_size, count(distinct ps_suppkey) AS supplier_cnt
FROM partsupp,

 part

WHERE p_partkey = ps_partkey
AND p_brand <> 'Brand#23'
AND p_type NOT LIKE 'PROMO BURNISHED%'
AND p_size IN (1, 13, 10, 28, 21, 35, 31, 11)
AND ps_suppkey NOT IN (

SELECT s_suppkey
FROM supplier
WHERE s_comment LIKE '%Customer%Complaints%' )

GROUP BY p_brand, p_type, p_size
ORDER BY supplier_cnt DESC, p_brand, p_type, p_size
LIMIT 1;
Q17:

SELECT

sum(l_extendedprice) / 7.0 AS avg_yearly

FROM

lineitem,
part

WHERE p_partkey = l_partkey

AND p_brand = 'Brand#44'
AND p_container = 'WRAP PKG'
AND l_quantity < (
    SELECT
        0.2 * avg(l_quantity)
    FROM
        lineitem, part
    WHERE
        l_partkey = p_partkey
);

Q18:

SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity)
FROM customer,

 orders,
 lineitem

WHERE o_orderkey IN (

  SELECT l_orderkey
  FROM lineitem
  GROUP BY l_orderkey
  HAVING sum(l_quantity) > 315 )

AND c_custkey = o_custkey
AND o_orderkey = l_orderkey
GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
ORDER BY o_totalprice DESC, o_orderdate
LIMIT 100;
Q19:

SELECT sum(l_extendedprice* (1 - l_discount)) AS revenue
FROM lineitem,

 part

WHERE ( p_partkey = l_partkey and p_brand = 'Brand#12'

    and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
    and l_quantity >= 6 and l_quantity <= 6 + 10
    and p_size between 1 and 5
    and l_shipmode in ('AIR', 'AIR REG')
    and l_shipinstruct = 'DELIVER IN PERSON' )
or ( p_partkey = l_partkey and p_brand = 'Brand#13'
    and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
    and l_quantity >= 10 and l_quantity <= 10 + 10
    and p_size between 1 and 10
    and l_shipmode in ('AIR', 'AIR REG')
    and l_shipinstruct = 'DELIVER IN PERSON' )
or ( p_partkey = l_partkey and p_brand = 'Brand#24'
    and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
    and l_quantity >= 21 and l_quantity <= 21 + 10
    and p_size between 1 and 15
    and l_shipmode in ('AIR', 'AIR REG')
    and l_shipinstruct = 'DELIVER IN PERSON' )

LIMIT 1;
Q20:

with temp_table as
(
select 0.5 * sum(l_quantity) as col1
from lineitem,

  partsupp

where l_partkey = ps_partkey and l_suppkey = ps_suppkey
and l_shipdate >= date '1993-01-01'
and l_shipdate < date '1993-01-01' + interval '1' year
)
select s_name, s_address
from supplier,

 nation

where s_suppkey in (

select ps_suppkey
from partsupp,
     temp_table
where ps_partkey in (
    select p_partkey
    from part
    where p_name like 'dark%' )
    and ps_availqty > temp_table.col1 )
and s_nationkey = n_nationkey and n_name = 'JORDAN'

order by s_name
limit 1;
Q21:

select

s_name,
count(*) as numwait

from

supplier,
lineitem l1,
orders,
nation

where

s_suppkey = l1.l_suppkey
and o_orderkey = l1.l_orderkey
and o_orderstatus = 'F'
and l1.l_receiptdate > l1.l_commitdate
and exists (
    select
        *
    from
        lineitem l2
    where
        l2.l_orderkey = l1.l_orderkey
        and l2.l_suppkey <> l1.l_suppkey
)
and not exists (
    select
        *
    from
        lineitem l3
    where
        l3.l_orderkey = l1.l_orderkey
        and l3.l_suppkey <> l1.l_suppkey
        and l3.l_receiptdate > l3.l_commitdate
)
and s_nationkey = n_nationkey
and n_name = 'SAUDI ARABIA'

group by

s_name

order by

numwait desc,
s_name

limit 100;
Q22:

with temp_table_1 as
(
select avg(c_acctbal) as avg_value
from customer
where c_acctbal > 0.00 and substring(c_phone from 1 for 2)
in ('33', '29', '37', '35', '25', '27', '43')
),
temp_table_2 as
(
select count(*) as count1
from orders, customer
where o_custkey = c_custkey
)
select cntrycode, count(*) as numcust, sum(c_acctbal) as totacctbal
from (

select substring(c_phone from 1 for 2) as cntrycode, c_acctbal
from customer, temp_table_1, temp_table_2
where substring(c_phone
    from 1
    for 2) in ('33', '29', '37', '35', '25', '27', '43')
    and c_acctbal > temp_table_1.avg_value
    and temp_table_2.count1 = 0) as custsale

group by cntrycode
order by cntrycode
limit 1;

  1. 异步执行查询

Data Lake Analytics支持“同步执行”模式和“异步执行”模式。“同步执行”模式下,控制台界面等待执行结果返回;“异步执行”模式下,立刻返回查询任务的ID。

image.png | left

点击“执行状态”,可以看到该异步查询任务的执行状态,主要分为:“RUNNING”,“SUCCESS”,“FAILURE”。

image.png | left

点击“刷新”,当STATUS变为“SUCCESS”时,表示查询成功,同时可查看查询耗时“ELAPSE_TIME”和查询扫描的数据字节数“SCANNED_DATA_BYTES”。

image.png | left

  1. 查看查询历史

点击“执行历史”,可以看到您执行的查询的历史详细信息,包括:
1)查询语句;
2)查询耗时与执行具体时间;
3)查询结果返回行数;
4)查询状态;
5)查询扫描的字节数;
6)结果集回写到的目标OSS文件(Data Lake Analytics会将查询结果集保存用户的bucket中)。

image.png | left

查询结果文件自动上传到用户同region的OSS bucket中,其中包括结果数据文件和结果集元数据描述文件。

{QueryLocation}/{query_name}|Unsaved}/{yyyy}/{mm}/{dd}/{query_id}/xxx.csv
{QueryLocation}/{query_name}|Unsaved}/{yyyy}/{mm}/{dd}/{query_id}/xxx.csv.metadata
其中QueryLocation为:

aliyun-oa-query-results-<your_account_id>-<oss_region>
image.png | left

  1. 后续

至此,本教程一步一步教您如何利用Data Lake Analytics云产品分析您OSS上的CSV格式的数据文件。除了CSV文件外,Data Lake Analytics还支持Parquet、ORC、json、RCFile、AVRO等多种格式文件的数据分析能力。特别是Parquet、ORC,相比CSV文件,有极大的性能和成本优势(同样内容的数据集,拥有更小的存储空间、更快的查询性能,这也意味着更低的分析成本)。
后续陆续会有更多教程和文章,手把手教您轻松使用Data Lake Analytics进行数据湖上数据分析和探索,开启您的云上低成本、即存即用的数据分析和探索之旅。


暖忆
375 声望39 粉丝