dbt (data build tool) is a popular open-source data transformation tool that lets you transform data with SQL, compiling SELECT statements into tables or views and improving the productivity of data analysts. The TiDB community recently released the dbt-tidb plugin, which makes dbt work with TiDB. This article walks through a simple example of analyzing data in TiDB with dbt.

dbt's core job is to transform data that is already in a database or data warehouse: in the Extract-Load-Transform (ELT) workflow, it is responsible only for the Transform step. With the dbt-tidb plugin, data analysts working with TiDB can create tables and transform data directly with SQL, without managing table or view creation themselves, and can see the flow of data visually. They can also use dbt's Jinja-templated SQL, testing, and package-management features, which greatly improve productivity.

(Image source: https://blog.getdbt.com/what-exactly-is-dbt/ )
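To make the Jinja templating mentioned above concrete, here is a minimal sketch of how a dbt-style `{{ ref('...') }}` reference resolves to a schema-qualified table name. This is an illustration, not dbt's real compiler, and the hard-coded schema name is an assumption for the example.

```python
import re

# Illustrative only: resolve {{ ref('name') }} into schema.name,
# the way dbt rewrites model references at compile time.
SCHEMA = "analytics"  # assumed schema name for this sketch

def render_refs(model_sql: str, schema: str = SCHEMA) -> str:
    """Replace each {{ ref('name') }} with a schema-qualified name."""
    return re.sub(r"\{\{\s*ref\('(\w+)'\)\s*\}\}",
                  lambda m: f"{schema}.{m.group(1)}", model_sql)

print(render_refs("select * from {{ ref('stg_orders') }}"))
# select * from analytics.stg_orders
```

Because models reference each other through `ref()` rather than hard-coded names, dbt can infer the dependency graph and run models in the right order.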

Next, I will use the official dbt tutorial as an example to show how TiDB and dbt work together.

The relevant software used in this example and its version requirements:

TiDB 5.3 or later
dbt 1.0.1 or later
dbt-tidb 1.0.0
Install
In addition to the local CLI tool, dbt also supports dbt Cloud (at the time of writing, dbt Cloud only supports adapters officially maintained by dbt Labs). There are several ways to install the local CLI tool; here we install dbt and dbt-tidb directly from PyPI.

Installing dbt and dbt-tidb takes a single command, because dbt is installed as a dependency of dbt-tidb.

$ pip install dbt-tidb
You can also install dbt separately; see the official installation guide for the details.

Create project: jaffle_shop
jaffle_shop is a demo project provided by dbt Labs to showcase dbt's features, and you can get it directly from GitHub.

$ git clone https://github.com/dbt-labs/jaffle_shop

$ cd jaffle_shop
The files in the jaffle_shop project directory are listed below.

dbt_project.yml is the dbt project's configuration file; it stores the project name, the path to the database configuration file, and so on.
The models directory stores the project's SQL models and table constraints. Note that this part is written by the data analyst.
The seeds directory stores CSV files. Such files are typically produced by database export tools; for example, TiDB can export table data to CSV files with Dumpling. In the jaffle_shop project, these CSV files serve as the raw data to be processed.
I will explain each of these files and directories in more detail as we use them.

ubuntu@ubuntu:~/jaffle_shop$ tree
.
├── dbt_project.yml
├── etc
│   ├── dbdiagram_definition.txt
│   └── jaffle_shop_erd.png
├── LICENSE
├── models
│   ├── customers.sql
│   ├── docs.md
│   ├── orders.sql
│   ├── overview.md
│   ├── schema.yml
│   └── staging
│       ├── schema.yml
│       ├── stg_customers.sql
│       ├── stg_orders.sql
│       └── stg_payments.sql
├── README.md
└── seeds
    ├── raw_customers.csv
    ├── raw_orders.csv
    └── raw_payments.csv

Configuration
1. Global configuration

dbt has a default global configuration file, ~/.dbt/profiles.yml. First create this file in your home directory and configure the TiDB connection information.

$ vi ~/.dbt/profiles.yml

jaffle_shop_tidb:                    # project name
  target: dev                        # default target
  outputs:
    dev:
      type: tidb                     # adapter type
      server: 127.0.0.1              # host address
      port: 4000                     # port
      schema: analytics              # database name
      username: root                 # user name
      password: ""                   # password

2. Project configuration

In the jaffle_shop project directory there is a configuration file for this project named dbt_project.yml. Change the profile configuration item to jaffle_shop_tidb, the project name used in profiles.yml. This way, the project looks up its database connection configuration in ~/.dbt/profiles.yml.

$ cat dbt_project.yml
name: 'jaffle_shop'
config-version: 2
version: '0.1'
profile: 'jaffle_shop_tidb'          # note the change here
model-paths: ["models"]              # model path
seed-paths: ["seeds"]                # seed path
test-paths: ["tests"]
analysis-paths: ["analysis"]
macro-paths: ["macros"]
target-path: "target"
clean-targets:
  - "target"
  - "dbt_modules"
  - "logs"
require-dbt-version: [">=1.0.0", "<2.0.0"]
models:
  jaffle_shop:
    materialized: table              # *.sql under models/ are materialized as tables
    staging:
      materialized: view             # *.sql under models/staging/ are materialized as views

3. Verify the configuration

You can use the following command to check whether the database and project configuration are correct.

$ dbt debug
06:59:18 Running with dbt=1.0.1
dbt version: 1.0.1
python version: 3.8.10
python path: /usr/bin/python3
os info: Linux-5.4.0-97-generic-x86_64-with-glibc2.29
Using profiles.yml file at /home/ubuntu/.dbt/profiles.yml
Using dbt_project.yml file at /home/ubuntu/jaffle_shop/dbt_project.yml

Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]

Required dependencies:
  - git [OK found]

Connection:
  server: 127.0.0.1
  port: 4000
  database: None
  schema: analytics
  user: root
  Connection test: [OK connection ok]

All checks passed!
Load CSV
Load the CSV data, materializing each CSV as a table in the target database. Note: in general this step is not needed in a dbt project, because the data to be processed would already be in the database.
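Conceptually, a seed load reads a CSV and materializes it as a table. The sketch below imitates that against an in-memory SQLite database with made-up sample rows; `dbt seed` does the equivalent against TiDB.

```python
import csv
import io
import sqlite3

# Illustrative only: made-up CSV content standing in for seeds/raw_customers.csv.
raw_customers_csv = """id,first_name,last_name
1,Michael,P.
2,Shawn,M.
"""

# Parse the CSV into dict rows, then materialize them as a table.
rows = list(csv.DictReader(io.StringIO(raw_customers_csv)))
conn = sqlite3.connect(":memory:")
conn.execute("create table raw_customers (id int, first_name text, last_name text)")
conn.executemany("insert into raw_customers values (?, ?, ?)",
                 [(r["id"], r["first_name"], r["last_name"]) for r in rows])
print(conn.execute("select count(*) from raw_customers").fetchone()[0])  # 2
```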

$ dbt seed
07:03:24 Running with dbt=1.0.1
07:03:24 Partial parse save file not found. Starting full parse.
07:03:25 Found 5 models, 20 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
07:03:25
07:03:25 Concurrency: 1 threads (target='dev')
07:03:25
07:03:25 1 of 3 START seed file analytics.raw_customers................. [RUN]
07:03:25 1 of 3 OK loaded seed file analytics.raw_customers.............. [INSERT 100 in 0.19s]
07:03:25 2 of 3 START seed file analytics.raw_orders.................... [RUN]
07:03:25 2 of 3 OK loaded seed file analytics.raw_orders................. [INSERT 99 in 0.14s]
07:03:25 3 of 3 START seed file analytics.raw_payments.................. [RUN]
07:03:26 3 of 3 OK loaded seed file analytics.raw_payments............... [INSERT 113 in 0.24s]
07:03:26
07:03:26 Finished running 3 seeds in 0.71s.
07:03:26
07:03:26 Completed successfully
07:03:26
07:03:26 Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
From the output you can clearly see that three tasks ran, loading the tables analytics.raw_customers, analytics.raw_orders, and analytics.raw_payments.

Next, connect to TiDB and see what happened.

You will find a new analytics database, which dbt created for the project.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| INFORMATION_SCHEMA |
| METRICS_SCHEMA     |
| PERFORMANCE_SCHEMA |
| analytics          |
| mysql              |
| test               |
+--------------------+
6 rows in set (0.00 sec)
There are three tables in the analytics database, corresponding to the results of the above three tasks.

mysql> show tables;
+---------------------+
| Tables_in_analytics |
+---------------------+
| raw_customers       |
| raw_orders          |
| raw_payments        |
+---------------------+
3 rows in set (0.00 sec)
What is a model?
Before moving on, it's worth understanding what role a model plays in dbt.

A model in dbt describes the structure of a set of tables or views, using two kinds of files: SQL and YAML. Note that in the jaffle_shop project, according to the materialized configuration, models under models/ are materialized as tables and models under models/staging/ as views.

Take models/orders.sql as an example. It is a SQL query that uses Jinja syntax; in a later command, the orders table will be created from this SQL.

$ cat models/orders.sql
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}

with orders as (
    select * from {{ ref('stg_orders') }}
),

payments as (
    select * from {{ ref('stg_payments') }}
),

order_payments as (
    select
        order_id,
        {% for payment_method in payment_methods -%}
        sum(case when payment_method = '{{ payment_method }}' then amount else 0 end) as {{ payment_method }}_amount,
        {% endfor -%}
        sum(amount) as total_amount
    from payments
    group by order_id
),

final as (
    select
        orders.order_id,
        orders.customer_id,
        orders.order_date,
        orders.status,
        {% for payment_method in payment_methods -%}
        order_payments.{{ payment_method }}_amount,
        {% endfor -%}
        order_payments.total_amount as amount
    from orders
    left join order_payments
        on orders.order_id = order_payments.order_id
)

select * from final
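The Jinja for-loop in this model expands into one pivoted sum(...) column per payment method. The sketch below uses plain Python string building to stand in for dbt's Jinja rendering and show roughly what the order_payments CTE compiles to.

```python
# Stand-in for the Jinja loop in orders.sql: generate one conditional
# sum per payment method, then append the grand total.
payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card']

columns = [
    f"sum(case when payment_method = '{m}' then amount else 0 end) as {m}_amount"
    for m in payment_methods
]
sql = ("select\n    order_id,\n    "
       + ",\n    ".join(columns)
       + ",\n    sum(amount) as total_amount\nfrom payments\ngroup by order_id")
print(sql)
```

Adding a new payment method to the list automatically adds a new pivot column, which is exactly the maintainability benefit the Jinja loop provides.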
The constraint information that goes with this SQL lives in the models/schema.yml file.

schema.yml is a registry of all models in the current directory. Models are organized in a tree structure that records a description and attributes for each field. The tests entries declare constraints on a field, which the dbt test command can check. For more information, see the official documentation.

$ cat models/schema.yml
version: 2
...
  - name: orders
    description: This table has basic information about orders, as well as some derived facts based on payments
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
        description: This is a unique identifier for an order
      - name: customer_id
        description: Foreign key to the customers table
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: customer_id
      - name: order_date
        description: Date (UTC) that the order was placed
      - name: status
        description: '{{ doc("orders_status") }}'
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
      - name: amount
        description: Total amount (AUD) of the order
        tests:
          - not_null
      - name: credit_card_amount
        description: Amount of the order (AUD) paid for by credit card
        tests:
          - not_null
      - name: coupon_amount
        description: Amount of the order (AUD) paid for by coupon
        tests:
          - not_null
      - name: bank_transfer_amount
        description: Amount of the order (AUD) paid for by bank transfer
        tests:
          - not_null
      - name: gift_card_amount
        description: Amount of the order (AUD) paid for by gift card
        tests:
          - not_null
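A `unique` or `not_null` test roughly compiles to a query that counts constraint-violating rows; the test passes when that count is zero. The sketch below shows the idea against an in-memory SQLite database with made-up rows; dbt runs the equivalent queries against TiDB.

```python
import sqlite3

# Made-up sample data standing in for the orders table.
conn = sqlite3.connect(":memory:")
conn.execute("create table orders (order_id int, status text)")
conn.executemany("insert into orders values (?, ?)",
                 [(1, 'placed'), (2, 'shipped'), (3, 'completed')])

# unique test on order_id: any value appearing more than once is a failure.
dup = conn.execute("""
    select count(*) from (
        select order_id from orders group by order_id having count(*) > 1
    )""").fetchone()[0]

# not_null test on order_id: any NULL value is a failure.
nulls = conn.execute(
    "select count(*) from orders where order_id is null").fetchone()[0]

print(dup, nulls)  # 0 0  -> both tests pass
```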

Now run the models with dbt run. The output shows that three views (analytics.stg_customers, analytics.stg_orders, analytics.stg_payments) and two tables (analytics.customers, analytics.orders) were created successfully.

$ dbt run
07:28:43 Running with dbt=1.0.1
07:28:43 Unable to do partial parsing because profile has changed
07:28:43 Unable to do partial parsing because a project dependency has been added
07:28:44 Found 5 models, 20 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics
07:28:44
07:28:44 Concurrency: 1 threads (target='dev')
07:28:44
07:28:44 1 of 5 START view model analytics.stg_customers................ [RUN]
07:28:44 1 of 5 OK created view model analytics.stg_customers............ [SUCCESS 0 in 0.12s]
07:28:44 2 of 5 START view model analytics.stg_orders................... [RUN]
07:28:44 2 of 5 OK created view model analytics.stg_orders............... [SUCCESS 0 in 0.08s]
07:28:44 3 of 5 START view model analytics.stg_payments................. [RUN]
07:28:44 3 of 5 OK created view model analytics.stg_payments............. [SUCCESS 0 in 0.07s]
07:28:44 4 of 5 START table model analytics.customers................... [RUN]
07:28:44 4 of 5 OK created table model analytics.customers............... [SUCCESS 0 in 0.16s]
07:28:44 5 of 5 START table model analytics.orders...................... [RUN]
07:28:45 5 of 5 OK created table model analytics.orders.................. [SUCCESS 0 in 0.12s]
07:28:45
07:28:45 Finished running 3 view models, 2 table models in 0.64s.
07:28:45
07:28:45 Completed successfully
07:28:45
07:28:45 Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
Connect to TiDB to verify that everything was really created.

There are now five more tables and views, including customers, and the data in them has been transformed. Only part of the customers data is shown here.

mysql> show tables;
+---------------------+
| Tables_in_analytics |
+---------------------+
| customers           |
| orders              |
| raw_customers       |
| raw_orders          |
| raw_payments        |
| stg_customers       |
| stg_orders          |
| stg_payments        |
+---------------------+
8 rows in set (0.00 sec)

mysql> select * from customers;
+-------------+------------+-----------+-------------+-------------------+------------------+-------------------------+
| customer_id | first_name | last_name | first_order | most_recent_order | number_of_orders | customer_lifetime_value |
+-------------+------------+-----------+-------------+-------------------+------------------+-------------------------+
|           1 | Michael    | P.        | 2018-01-01  | 2018-02-10        |                2 |                 33.0000 |
|           2 | Shawn      | M.        | 2018-01-11  | 2018-01-11        |                1 |                 23.0000 |
|           3 | Kathleen   | P.        | 2018-01-02  | 2018-03-11        |                3 |                 65.0000 |
|           4 | Jimmy      | C.        | NULL        | NULL              |             NULL |                    NULL |
|           5 | Katherine  | R.        | NULL        | NULL              |             NULL |                    NULL |
|           6 | Sarah      | R.        | 2018-02-19  | 2018-02-19        |                1 |                  8.0000 |
|           7 | Martin     | M.        | 2018-01-14  | 2018-01-14        |                1 |                 26.0000 |
|           8 | Frank      | R.        | 2018-01-29  | 2018-03-12        |                2 |                 45.0000 |
....
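The derived columns above (first_order, most_recent_order, number_of_orders, customer_lifetime_value) come from a per-customer aggregation over orders. The sketch below reproduces that aggregation for a single made-up customer in an in-memory SQLite database; the real customers model runs the equivalent SQL over the staging views in TiDB.

```python
import sqlite3

# Made-up orders for one customer, chosen to mirror row 1 above.
conn = sqlite3.connect(":memory:")
conn.execute("create table orders (customer_id int, order_date text, amount real)")
conn.executemany("insert into orders values (?, ?, ?)",
                 [(1, '2018-01-01', 10.0), (1, '2018-02-10', 23.0)])

# Aggregate the per-customer facts the customers model derives.
row = conn.execute("""
    select min(order_date)  as first_order,
           max(order_date)  as most_recent_order,
           count(*)         as number_of_orders,
           sum(amount)      as customer_lifetime_value
    from orders
    where customer_id = 1""").fetchone()

print(row)  # ('2018-01-01', '2018-02-10', 2, 33.0)
```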
Generate documentation
dbt can also generate visual documentation; the commands are as follows.

1. Generate documentation

$ dbt docs generate

07:33:59 Running with dbt=1.0.1

07:33:59 Found 5 models, 20 tests, 0 snapshots, 0 analyses, 172 macros, 0 operations, 3 seed files, 0 sources, 0 exposures, 0 metrics

07:33:59

07:33:59 Concurrency: 1 threads (target='dev')

07:33:59

07:33:59 Done.

07:33:59 Building catalog

07:33:59 Catalog written to /home/ubuntu/jaffle_shop/target/catalog.json
2. Start the documentation server

$ dbt docs serve

07:43:01 Running with dbt=1.0.1

07:43:01 Serving docs at 0.0.0.0:8080

07:43:01 To access from your browser, navigate to: http://localhost:8080

07:43:01

07:43:01

07:43:01 Press Ctrl+C to exit.
The documentation, viewable in a browser, contains the overall structure of the jaffle_shop project along with descriptions of all tables and views.


Summary
Using TiDB with dbt mainly involves the following steps:

Install dbt and dbt-tidb
Configure the project
Write SQL and YML files
Run the project

Currently, TiDB supports dbt from version 4.0 onwards, but according to the dbt-tidb project documentation, lower TiDB versions still have some limitations when combined with dbt: for example, temporary tables and temporary views are not supported, and the WITH (common table expression) syntax is not supported. For a smooth experience, it is recommended to use TiDB 5.3 or later, which supports all dbt features.


PingCAP