dbt (data build tool) is a popular open source data transformation tool. It lets you express transformations in SQL, turns those SELECT statements into tables or views, and thereby improves the productivity of data analysts. The TiDB community recently released the dbt-tidb plugin, which makes TiDB usable as a dbt backend. This article walks through a simple example of analyzing data in TiDB with dbt.
dbt's job is to transform data that already lives in a database or data warehouse: in the ELT workflow of Extract, Load, and Transform, it handles only the Transform step. With the dbt-tidb plugin, data analysts working on TiDB can describe their data directly in SQL, without managing the CREATE TABLE or CREATE VIEW statements themselves, and can clearly see how data flows between models. They also get dbt features such as Jinja-templated SQL, testing, and package management, which greatly improves productivity.
(Image source: https://blog.getdbt.com/what-exactly-is-dbt/ )
Next, I will use dbt's official tutorial project as an example to show how TiDB and dbt work together.
The software used in this example and the required versions:
TiDB 5.3 or later
dbt 1.0.1 or later
dbt-tidb 1.0.0
Installation
Besides the local CLI, dbt also offers dbt Cloud (which currently only supports adapters officially maintained by dbt Labs), and the local CLI itself can be installed in several ways. Here we install dbt and the dbt-tidb plugin directly from PyPI.
Installing both takes a single command, because dbt is pulled in as a dependency of dbt-tidb.
$ pip install dbt-tidb
You can also install dbt on its own; see the official installation guide for the available methods.
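If you want to confirm that the installation worked and that the tidb adapter was picked up, one quick check (not part of the original tutorial) is dbt's version command:
$ dbt --version    # prints the installed dbt core version and the adapter plugins it found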
Create project: jaffle_shop
jaffle_shop is a demo project provided by dbt Labs to showcase dbt's features; you can get it directly from GitHub.
$ git clone https://github.com/dbt-labs/jaffle_shop
$ cd jaffle_shop
Let's take a look at the files in the jaffle_shop project directory.
dbt_project.yml is the dbt project's configuration file. It stores the project name, the profile that points to the database connection configuration, the various path settings, and so on.
The models directory holds the project's SQL models and table constraints. Note that this is the part written by the data analyst.
The seeds directory stores CSV files. Such files can come from database export tools; for example, TiDB can export table data to CSV with Dumpling (a sample export command follows the directory tree below). In the jaffle_shop project, these CSV files serve as the raw data to be processed.
I will explain each of these files and directories in more detail as they come into use.
ubuntu@ubuntu:~/jaffle_shop$ tree
.
├── dbt_project.yml
├── etc
│ ├── dbdiagram_definition.txt
│ └── jaffle_shop_erd.png
├── LICENSE
├── models
│ ├── customers.sql
│ ├── docs.md
│ ├── orders.sql
│ ├── overview.md
│ ├── schema.yml
│ └── staging
│ ├── schema.yml
│ ├── stg_customers.sql
│ ├── stg_orders.sql
│ └── stg_payments.sql
├── README.md
└── seeds
├── raw_customers.csv
├── raw_orders.csv
└── raw_payments.csv
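As an aside, if your raw data lives in TiDB rather than in ready-made CSV files, exporting it with Dumpling looks roughly like this; the database and table names below are placeholders, not part of the jaffle_shop project:
$ tiup dumpling -u root -P 4000 -h 127.0.0.1 \
    --filetype csv -o /tmp/seed-export \
    -T mydb.raw_orders    # export a single table; omit -T to export all non-system databases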
Configuration
1. Global configuration
dbt has a default global configuration file at ~/.dbt/profiles.yml. First, create this file in your home directory and configure the TiDB connection information.
$ vi ~/.dbt/profiles.yml
jaffle_shop_tidb:                  # project name
  target: dev                      # default target
  outputs:
    dev:
      type: tidb                   # adapter type
      server: 127.0.0.1            # host address
      port: 4000                   # port
      schema: analytics            # database name
      username: root               # user name
      password: ""                 # password
2. Project configuration
The jaffle_shop project directory contains the project's own configuration file, dbt_project.yml. Change its profile entry to jaffle_shop_tidb, the profile name we defined in profiles.yml, so that the project looks up its database connection settings in ~/.dbt/profiles.yml.
$ cat dbt_project.yml
name: 'jaffle_shop'
config-version: 2
version: '0.1'

profile: 'jaffle_shop_tidb'    # note the change here

model-paths: ["models"]        # model path
seed-paths: ["seeds"]          # seed path
test-paths: ["tests"]
analysis-paths: ["analysis"]
macro-paths: ["macros"]

target-path: "target"
clean-targets:
    - "target"
    - "dbt_modules"
    - "logs"

require-dbt-version: [">=1.0.0", "<2.0.0"]

models:
  jaffle_shop:
      materialized: table      # *.sql under models/ is materialized as tables
      staging:
        materialized: view     # *.sql under models/staging/ is materialized as views
3. Verify the configuration
You can use the following command to check whether the database and project configuration is correct.
$ dbt debug
06:59:18 Running with dbt=1.0.1
dbt version: 1.0.1
python version: 3.8.10
python path: /usr/bin/python3
os info: Linux-5.4.0-97-generic-x86_64-with-glibc2.29
Using profiles.yml file at /home/ubuntu/.dbt/profiles.yml
Using dbt_project.yml file at /home/ubuntu/jaffle_shop/dbt_project.yml
Configuration:
profiles.yml file [OK found and valid]
dbt_project.yml file [OK found and valid]
Required dependencies:
- git [OK found]
Connection:
server: 127.0.0.1
port: 4000
database: None
schema: analytics
user: root
Connection test: [OK connection ok]
All checks passed!
Load CSV
Load the CSV data, materializing each CSV file as a table in the target database. Note: in a typical dbt project this step is not needed, because the data you want to transform is already in the database.
$ dbt seed
07:03:24 Running with dbt=1.0.1
07:03:24 Partial parse save file not found. Starting full parse.
07:03:25 Found 5 models, 20 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
07:03:25
07:03:25 Concurrency: 1 threads (target='dev')
07:03:25
07:03:25 1 of 3 START seed file analytics.raw_customers............................. [RUN]
07:03:25 1 of 3 OK loaded seed file analytics.raw_customers......................... [INSERT 100 in 0.19s]
07:03:25 2 of 3 START seed file analytics.raw_orders................................ [RUN]
07:03:25 2 of 3 OK loaded seed file analytics.raw_orders............................ [INSERT 99 in 0.14s]
07:03:25 3 of 3 START seed file analytics.raw_payments.............................. [RUN]
07:03:26 3 of 3 OK loaded seed file analytics.raw_payments.......................... [INSERT 113 in 0.24s]
07:03:26
07:03:26 Finished running 3 seeds in 0.71s.
07:03:26
07:03:26 Completed successfully
07:03:26
07:03:26 Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
The output shows that three tasks were executed, loading the tables analytics.raw_customers, analytics.raw_orders, and analytics.raw_payments respectively.
Next, go to the TiDB database and see what's going on.
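Any MySQL-compatible client works for this; with the connection settings we put in profiles.yml above, the connection command would be something like:
$ mysql -h 127.0.0.1 -P 4000 -u root analytics    # TiDB listens on port 4000; root has an empty password here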
We find an extra analytics database, which is the project database dbt created for us.
mysql> show databases;
| Database |
| INFORMATION_SCHEMA |
| METRICS_SCHEMA |
| PERFORMANCE_SCHEMA |
| analytics |
| mysql |
| test |
6 rows in set (0.00 sec)
There are three tables in the analytics database, corresponding to the results of the above three tasks.
mysql> show tables;
| Tables_in_analytics |
| raw_customers |
| raw_orders |
| raw_payments |
3 rows in set (0.00 sec)
What is a model?
Before going further, we need to understand what role a model plays in dbt.
In dbt, a model describes the structure of a set of tables or views and consists mainly of two kinds of files: SQL and YML. Also note that in the jaffle_shop project, according to the materialized configuration above, models in the models/ directory are built as tables and models in the models/staging/ directory are built as views.
Take models/orders.sql as an example. It is a SQL query that uses Jinja templating; the next command will create the orders table from this SQL.
$ cat models/orders.sql
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}

with orders as (
    select * from {{ ref('stg_orders') }}
),

payments as (
    select * from {{ ref('stg_payments') }}
),

order_payments as (
    select
        order_id,
        {% for payment_method in payment_methods -%}
        sum(case when payment_method = '{{ payment_method }}' then amount else 0 end) as {{ payment_method }}_amount,
        {% endfor -%}
        sum(amount) as total_amount
    from payments
    group by order_id
),

final as (
    select
        orders.order_id,
        orders.customer_id,
        orders.order_date,
        orders.status,
        {% for payment_method in payment_methods -%}
        order_payments.{{ payment_method }}_amount,
        {% endfor -%}
        order_payments.total_amount as amount
    from orders
    left join order_payments
        on orders.order_id = order_payments.order_id
)

select * from final
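If you are curious what this Jinja expands into before anything is executed, dbt can render the model without running it; the compiled SQL ends up under the target/compiled/ directory. For example:
$ dbt compile --select orders    # renders ref() calls and Jinja loops into plain SQL without touching the database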
The constraint information that goes with this SQL lives in the models/schema.yml file.
schema.yml registers all the models in the current directory. Each model entry describes its columns and their properties, and the tests entries declare constraints on a column that can be checked with the dbt test command (an example invocation appears after the models have been built below). For more details, see the official documentation.
$ cat models/schema.yml
version: 2
...
  - name: orders
    description: This table has basic information about orders, as well as some derived facts based on payments
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
        description: This is a unique identifier for an order
      - name: customer_id
        description: Foreign key to the customers table
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: customer_id
      - name: order_date
        description: Date (UTC) that the order was placed
      - name: status
        description: '{{ doc("orders_status") }}'
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
      - name: amount
        description: Total amount (AUD) of the order
        tests:
          - not_null
      - name: credit_card_amount
        description: Amount of the order (AUD) paid for by credit card
        tests:
          - not_null
      - name: coupon_amount
        description: Amount of the order (AUD) paid for by coupon
        tests:
          - not_null
      - name: bank_transfer_amount
        description: Amount of the order (AUD) paid for by bank transfer
        tests:
          - not_null
      - name: gift_card_amount
        description: Amount of the order (AUD) paid for by gift card
        tests:
          - not_null
Now run the models with dbt run. The output shows that three views (analytics.stg_customers, analytics.stg_orders, analytics.stg_payments) and two tables (analytics.customers, analytics.orders) are created successfully.
$ dbt run
07:28:43 Running with dbt=1.0.1
07:28:43 Unable to do partial parsing because profile has changed
07:28:43 Unable to do partial parsing because a project dependency has been added
07:28:44 Found 5 models, 20 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
07:28:44
07:28:44 Concurrency: 1 threads (target='dev')
07:28:44
07:28:44 1 of 5 START view model analytics.stg_customers............................ [RUN]
07:28:44 1 of 5 OK created view model analytics.stg_customers....................... [SUCCESS 0 in 0.12s]
07:28:44 2 of 5 START view model analytics.stg_orders............................... [RUN]
07:28:44 2 of 5 OK created view model analytics.stg_orders.......................... [SUCCESS 0 in 0.08s]
07:28:44 3 of 5 START view model analytics.stg_payments............................. [RUN]
07:28:44 3 of 5 OK created view model analytics.stg_payments........................ [SUCCESS 0 in 0.07s]
07:28:44 4 of 5 START table model analytics.customers............................... [RUN]
07:28:44 4 of 5 OK created table model analytics.customers.......................... [SUCCESS 0 in 0.16s]
07:28:44 5 of 5 START table model analytics.orders.................................. [RUN]
07:28:45 5 of 5 OK created table model analytics.orders............................. [SUCCESS 0 in 0.12s]
07:28:45
07:28:45 Finished running 3 view models, 2 table models in 0.64s.
07:28:45
07:28:45 Completed successfully
07:28:45
07:28:45 Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
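As a side note, after a small change you rarely need to rebuild everything; dbt's node selection lets you build a single model, optionally together with everything downstream of it. A couple of illustrative invocations:
$ dbt run --select customers     # rebuild only the customers model
$ dbt run --select customers+    # rebuild customers plus all models that depend on it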
Go back to the TiDB database to verify that everything was really created.
The result shows five new tables and views, including customers, and the data in them has been transformed. Only part of the customers data is shown here.
mysql> show tables;
| Tables_in_analytics |
| customers |
| orders |
| raw_customers |
| raw_orders |
| raw_payments |
| stg_customers |
| stg_orders |
| stg_payments |
8 rows in set (0.00 sec)
mysql> select * from customers;
| customer_id | first_name | last_name | first_order | most_recent_order | number_of_orders | customer_lifetime_value |
| 1 | Michael | P. | 2018-01-01 | 2018-02-10 | 2 | 33.0000 |
| 2 | Shawn | M. | 2018-01-11 | 2018-01-11 | 1 | 23.0000 |
| 3 | Kathleen | P. | 2018-01-02 | 2018-03-11 | 3 | 65.0000 |
| 4 | Jimmy | C. | NULL | NULL | NULL | NULL |
| 5 | Katherine | R. | NULL | NULL | NULL | NULL |
| 6 | Sarah | R. | 2018-02-19 | 2018-02-19 | 1 | 8.0000 |
| 7 | Martin | M. | 2018-01-14 | 2018-01-14 | 1 | 26.0000 |
| 8 | Frank | R. | 2018-01-29 | 2018-03-12 | 2 | 45.0000 |
....
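At this point, the constraints declared in models/schema.yml can be checked against the freshly built tables and views. The exact number of passing tests depends on the project, but the commands themselves are simply:
$ dbt test                     # run all tests declared in schema.yml (unique, not_null, relationships, accepted_values)
$ dbt test --select orders     # or test a single model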
Generate documentation
dbt can also generate browsable documentation; the commands are as follows.
1. Generate documentation
$ dbt docs generate
07:33:59 Running with dbt=1.0.1
07:33:59 Found 5 models, 20 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
07:33:59
07:33:59 Concurrency: 1 threads (target='dev')
07:33:59
07:33:59 Done.
07:33:59 Building catalog
07:33:59 Catalog written to /home/ubuntu/jaffle_shop/target/catalog.json
2. Start the documentation server
$ dbt docs serve
07:43:01 Running with dbt=1.0.1
07:43:01 Serving docs at 0.0.0.0:8080
07:43:01 To access from your browser, navigate to: http://localhost:8080
07:43:01
07:43:01
07:43:01 Press Ctrl+C to exit.
The documentation, which can be viewed through a browser, contains the overall structure of the jaffle_shop project and descriptions of all tables and views.
Summary
Using TiDB with dbt mainly involves the following steps:
Install dbt and dbt-tidb
Configure the project
Write SQL and YML files
Run the project
Currently, dbt works with TiDB 4.0 and above, but according to the dbt-tidb project documentation, older TiDB versions still have some limitations when combined with dbt: for example, temporary tables and temporary views are not supported, and the WITH (common table expression) syntax is not supported. For a smooth dbt experience, TiDB 5.3 or later is recommended, as it supports all of dbt's functionality.