Running dbt under poetry (qbit)
Preface
- dbt is short for Data Build Tool. It is currently developed and maintained by dbt Labs, a company formerly known as Fishtown Analytics.
- dbt mainly handles the T (transform) in ETL, and it comes in two parts:
  - dbt Core: a CLI tool
  - dbt Cloud: a cloud service
- dbt Core's GitHub repository: https://github.com/dbt-labs/d...
- qbit personally finds the official dbt Core getting-started tutorial unfriendly: right from the start it requires Google's BigQuery service, a GitHub account, and so on, even though dbt Core can be used entirely offline.
- This article mainly follows: 【大数据架构之旅】2 从零起步学 dbt (A Journey Through Big Data Architecture, Part 2: Learning dbt from Scratch)
Tech stack
Windows 10
Python 3.8.10
poetry 1.3.1
git 2.35.1.windows.2
PostgreSQL 15.1
poetry project configuration
.../test_dbt/pyproject.toml
```toml
[tool.poetry]
name = "test-dbt"
version = "0.1.0"
description = ""
authors = ["qbit <q@bit.cn>"]
readme = "README.md"
packages = [{include = "test_dbt"}]

[[tool.poetry.source]]
name = "aliyun"
url = "https://mirrors.aliyun.com/pypi/simple/"
default = true

[tool.poetry.dependencies]
python = "^3.8"
dbt-core = "~1.3.1"
dbt-postgres = "~1.3.1"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```
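The pyproject.toml above only declares the dependencies; a minimal sketch of installing them with poetry (standard poetry 1.3 commands, run inside the poetry project folder):

```shell
# Creates the project virtualenv and installs dbt-core / dbt-postgres as pinned above.
poetry install

# Optional sanity check that dbt resolves from the project's virtualenv.
poetry run dbt --version
```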
- In this article, the *poetry project folder* refers to `.../test_dbt/`
- In this article, the *dbt project folder* refers to `.../test_dbt/dbt_demo/`
Create the database and initialize the project
- Create a new database named `dbt_demo` in PostgreSQL (a psql sketch follows below).
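A minimal sketch of creating the database from psql; the host, port, and user below mirror the `dbt debug` output later in this post, so substitute your own connection details:

```shell
# Create the target database; adjust -h/-p/-U to your own PostgreSQL instance.
psql -h 192.168.1.52 -p 5432 -U postgres -c "CREATE DATABASE dbt_demo;"
```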
Initialize the dbt project by running the following command in the `.../test_dbt/` directory:

```shell
poetry run dbt init dbt_demo
```
After following the prompts, the directory structure of `.../test_dbt/` looks like this:

```shell
$ tree
.
├── dbt_demo
│   ├── analyses
│   ├── dbt_project.yml
│   ├── macros
│   ├── models
│   │   └── example
│   │       ├── my_first_dbt_model.sql
│   │       ├── my_second_dbt_model.sql
│   │       └── schema.yml
│   ├── README.md
│   ├── seeds
│   ├── snapshots
│   └── tests
├── logs
│   └── dbt.log
├── poetry.lock
└── pyproject.toml
```
Find the file `~/.dbt/profiles.yml` in your home directory (on qbit's machine the absolute path is `C:\Users\qbit\.dbt\profiles.yml`) and paste the following content into it, filling in the bracketed values for your own environment:

```yaml
dbt_demo:
  outputs:
    dev:
      type: postgres
      threads: 1
      host: [host]
      port: [port]
      user: [dev_username]
      pass: [dev_password]
      dbname: dbt_demo
      schema: dev_schema
    prod:
      type: postgres
      threads: 1
      host: [host]
      port: [port]
      user: [prod_username]
      pass: [prod_password]
      dbname: dbt_demo
      schema: prod_schema
  target: dev
```
- Unless stated otherwise, the commands below are run in the `.../test_dbt/dbt_demo/` directory.
Built-in example models
Change into the `.../test_dbt/dbt_demo/` directory and run the following command to check the environment:

```shell
$ poetry run dbt debug
06:41:09  Running with dbt=1.3.1
dbt version: 1.3.1
python version: 3.8.10
python path: D:\Python3Project\test_dbt\.venv\Scripts\python.exe
os info: Windows-10-10.0.19045-SP0
Using profiles.yml file at C:\Users\qbit\.dbt\profiles.yml
Using dbt_project.yml file at D:\Python3Project\test_dbt\dbt_demo\dbt_project.yml

Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]

Required dependencies:
 - git [OK found]

Connection:
  host: 192.168.1.52
  port: 5432
  user: postgres
  database: dbt_demo
  schema: dev_schema
  search_path: None
  keepalives_idle: 0
  sslmode: None
  Connection test: [OK connection ok]

All checks passed!
```
Build the built-in example models:
```shell
$ poetry run dbt run
06:47:04  Running with dbt=1.3.1
06:47:04  Partial parse save file not found. Starting full parse.
06:47:04  Found 2 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
06:47:04
06:47:05  Concurrency: 1 threads (target='dev')
06:47:05
06:47:05  1 of 2 START sql table model dev_schema.my_first_dbt_model ..................... [RUN]
06:47:05  1 of 2 OK created sql table model dev_schema.my_first_dbt_model ................ [SELECT 2 in 0.15s]
06:47:05  2 of 2 START sql view model dev_schema.my_second_dbt_model ..................... [RUN]
06:47:05  2 of 2 OK created sql view model dev_schema.my_second_dbt_model ................ [CREATE VIEW in 0.09s]
06:47:05
06:47:05  Finished running 1 table model, 1 view model in 0 hours 0 minutes and 0.53 seconds (0.53s).
06:47:05
06:47:05  Completed successfully
06:47:05
06:47:05  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
```
- Check the PostgreSQL database: you should see a table named `my_first_dbt_model` and a view named `my_second_dbt_model`.
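A quick way to verify is to query the new relations from psql; the connection details below are a sketch that mirrors the `dbt debug` output above, so adjust them (and the quoting, depending on your Windows shell) to your environment:

```shell
# List the contents of the table and the view created by `dbt run`.
psql -h 192.168.1.52 -p 5432 -U postgres -d dbt_demo -c "select * from dev_schema.my_first_dbt_model;"
psql -h 192.168.1.52 -p 5432 -U postgres -d dbt_demo -c "select * from dev_schema.my_second_dbt_model;"
```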
External data
Create the base tables
- Download the bike sharing dataset from UC Irvine and put `hour.csv` and `day.csv` into the `.../test_dbt/dbt_demo/seeds` folder.

In the `.../test_dbt/dbt_demo/seeds` directory, create a file named `bike_share.yml` with the following content:

```yaml
version: 2

seeds:
  - name: hour
    config:
      column_types:
        dteday: date
  - name: day
    config:
      column_types:
        dteday: date
```
- The `date` type is specified here to keep dbt from inferring the column type incorrectly.
- For more on column types, see the official documentation on `column_types`.
Run the following command to load the base data:
```shell
$ poetry run dbt seed
07:19:26  Running with dbt=1.3.1
07:19:26  Found 2 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics
07:19:26
07:19:26  Concurrency: 1 threads (target='dev')
07:19:26
07:19:26  1 of 2 START seed file dev_schema.day .......................................... [RUN]
07:19:31  1 of 2 OK loaded seed file dev_schema.day ...................................... [INSERT 731 in 4.54s]
07:19:31  2 of 2 START seed file dev_schema.hour ......................................... [RUN]
07:21:24  2 of 2 OK loaded seed file dev_schema.hour ..................................... [INSERT 17379 in 112.83s]
07:21:24
07:21:24  Finished running 2 seeds in 0 hours 1 minutes and 57.65 seconds (117.65s).
07:21:24
07:21:24  Completed successfully
07:21:24
07:21:24  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
```
- Check the PostgreSQL database: you should see two newly created tables, `hour` and `day`.
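To double-check the seed load from psql (same illustrative connection details as above), the row counts should match the `dbt seed` log, and `dteday` should have been created as `date`:

```shell
# Expect 731 rows in day and 17379 rows in hour, per the dbt seed output above.
psql -h 192.168.1.52 -p 5432 -U postgres -d dbt_demo -c "select count(*) from dev_schema.day;"
psql -h 192.168.1.52 -p 5432 -U postgres -d dbt_demo -c "select count(*) from dev_schema.hour;"
# dteday should report type 'date', thanks to the column_types config in bike_share.yml.
psql -h 192.168.1.52 -p 5432 -U postgres -d dbt_demo -c "select column_name, data_type from information_schema.columns where table_schema = 'dev_schema' and table_name = 'day' and column_name = 'dteday';"
```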
Custom models
- Create the model directory `.../test_dbt/dbt_demo/models/bike_share`. In the SQL files below, `ref('day')` references the `day` table loaded above.

In the `.../models/bike_share` directory, create `weather_trend.sql` with the following content:

```sql
/* weather trend table */
{{ config(materialized='table') }}

with weather_trend as (
    select
        weathersit,
        min(temp*47+8) as min_temp,
        max(temp*47+8) as max_temp,
        avg(temp*47+8) as mean_temp,
        sum(casual) as total_casual,
        sum(registered) as total_registered,
        sum(cnt) as total_cnt
    from {{ ref('day') }}
    group by weathersit
)

select * from weather_trend
```
In the `.../models/bike_share` directory, create `season_trend.sql` with the following content:

```sql
/* season trend table */
{{ config(materialized='table') }}

with season_trend as (
    select
        season,
        min(temp*47+8) as min_temp,
        max(temp*47+8) as max_temp,
        avg(temp*47+8) as mean_temp,
        sum(casual) as total_casual,
        sum(registered) as total_registered,
        sum(cnt) as total_cnt
    from {{ ref('day') }}
    group by season
)

select * from season_trend
```
In the `.../models/bike_share` directory, create `month_trend.sql` with the following content:

```sql
/* monthly trend table */
{{ config(materialized='table') }}

with month_trend as (
    select
        mnth,
        min(temp*47+8) as min_temp,
        max(temp*47+8) as max_temp,
        avg(temp*47+8) as mean_temp,
        sum(casual) as total_casual,
        sum(registered) as total_registered,
        sum(cnt) as total_cnt
    from {{ ref('day') }}
    group by mnth
)

select * from month_trend
```
In the `.../models/bike_share` directory, create `weekday_trend.sql` with the following content:

```sql
/* day-of-week trend table */
{{ config(materialized='table') }}

with weekday_trend as (
    select
        weekday,
        min(temp*47+8) as min_temp,
        max(temp*47+8) as max_temp,
        avg(temp*47+8) as mean_temp,
        sum(casual) as total_casual,
        sum(registered) as total_registered,
        sum(cnt) as total_cnt
    from {{ ref('day') }}
    group by weekday
)

select * from weekday_trend
```
Run the following command to build the models:
```shell
$ poetry run dbt run
07:43:59  Running with dbt=1.3.1
07:43:59  Found 6 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics
07:43:59
07:43:59  Concurrency: 1 threads (target='dev')
07:43:59
07:43:59  1 of 6 START sql table model dev_schema.month_trend ............................ [RUN]
07:44:00  1 of 6 OK created sql table model dev_schema.month_trend ....................... [SELECT 12 in 0.15s]
07:44:00  2 of 6 START sql table model dev_schema.my_first_dbt_model ..................... [RUN]
07:44:00  2 of 6 OK created sql table model dev_schema.my_first_dbt_model ................ [SELECT 2 in 0.09s]
07:44:00  3 of 6 START sql table model dev_schema.season_trend ........................... [RUN]
07:44:00  3 of 6 OK created sql table model dev_schema.season_trend ...................... [SELECT 4 in 0.07s]
07:44:00  4 of 6 START sql table model dev_schema.weather_trend .......................... [RUN]
07:44:00  4 of 6 OK created sql table model dev_schema.weather_trend ..................... [SELECT 3 in 0.08s]
07:44:00  5 of 6 START sql table model dev_schema.weekday_trend .......................... [RUN]
07:44:00  5 of 6 OK created sql table model dev_schema.weekday_trend ..................... [SELECT 7 in 0.07s]
07:44:00  6 of 6 START sql view model dev_schema.my_second_dbt_model ..................... [RUN]
07:44:00  6 of 6 OK created sql view model dev_schema.my_second_dbt_model ................ [CREATE VIEW in 0.07s]
07:44:00
07:44:00  Finished running 5 table models, 1 view model in 0 hours 0 minutes and 0.78 seconds (0.78s).
07:44:00
07:44:00  Completed successfully
07:44:00
07:44:00  Done. PASS=6 WARN=0 ERROR=0 SKIP=0 TOTAL=6
```
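To spot-check one of the freshly built tables from psql (same illustrative connection details as earlier), `season_trend` should contain 4 rows, matching the `SELECT 4` in the run output above:

```shell
# One row per season, aggregated from the day seed table.
psql -h 192.168.1.52 -p 5432 -U postgres -d dbt_demo -c "select * from dev_schema.season_trend order by season;"
```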
Generating and viewing documentation
Run the following command to generate the documentation:
```shell
$ poetry run dbt docs generate
07:49:09  Running with dbt=1.3.1
07:49:09  Found 6 models, 4 tests, 0 snapshots, 0 analyses, 289 macros, 0 operations, 2 seed files, 0 sources, 0 exposures, 0 metrics
07:49:09
07:49:09  Concurrency: 1 threads (target='dev')
07:49:09
07:49:09  Done.
07:49:10  Building catalog
07:49:10  Catalog written to D:\Python3Project\test_dbt\dbt_demo\target\catalog.json
```
Run the following command to start an HTTP server and open the documentation in your default browser:
```shell
$ poetry run dbt docs serve
07:49:53  Running with dbt=1.3.1
07:49:53  Serving docs at 0.0.0.0:8080
07:49:53  To access from your browser, navigate to:  http://localhost:8080
07:49:53
07:49:53
07:49:53  Press Ctrl+C to exit.
127.0.0.1 - - [28/Dec/2022 15:49:54] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [28/Dec/2022 15:49:54] "GET /manifest.json?cb=1672213794801 HTTP/1.1" 200 -
127.0.0.1 - - [28/Dec/2022 15:49:54] "GET /catalog.json?cb=1672213794801 HTTP/1.1" 200 -
```
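If port 8080 is already taken, the docs server can be started on a different port; 8001 below is just an arbitrary example:

```shell
# --port is supported by `dbt docs serve` in dbt 1.x.
poetry run dbt docs serve --port 8001
```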
- The web UI opens in the browser.
- Click the button in the bottom-right corner to view the lineage graph.
Directory structure
Finally, take one more look at the directory structure:
```shell
$ tree ./dbt_demo/ -L 3
./dbt_demo/
├── analyses
├── dbt_packages
├── dbt_project.yml
├── logs
│   └── dbt.log
├── macros
├── models
│   ├── bike_share
│   │   ├── month_trend.sql
│   │   ├── season_trend.sql
│   │   ├── weather_trend.sql
│   │   └── weekday_trend.sql
│   └── example
│       ├── my_first_dbt_model.sql
│       ├── my_second_dbt_model.sql
│       └── schema.yml
├── README.md
├── seeds
│   ├── bike_share.yml
│   ├── day.csv
│   └── hour.csv
├── snapshots
├── target
│   ├── catalog.json
│   ├── compiled
│   │   └── dbt_demo
│   ├── graph.gpickle
│   ├── index.html
│   ├── manifest.json
│   ├── partial_parse.msgpack
│   ├── run
│   │   └── dbt_demo
│   └── run_results.json
└── tests
```
This article is from qbit snap.