Data Engineering  /  dbt

πŸ”„ dbt β€” Data Build Tool 23 guides Β· updated 2026

Analytics engineering with SQL β€” models, tests, sources, and Jinja macros that turn raw warehouse tables into trustworthy, documented data products.

dbt: Build and Test Data Models Without the Chaos

Raw data is rarely analysis-ready. Column names clash, duplicates sneak in, data types disagree β€” and that’s before you even get to business logic. dbt (Data Build Tool) was built specifically to solve this problem. It gives analytics engineers a disciplined way to write, test, and document SQL transformations that run inside your data warehouse.

This guide walks through the full picture: what dbt is, why teams adopt it, and how to wire up a working transformation pipeline from scratch.


Why dbt Became the Standard for Data Transformation

Traditional ETL workflows required dedicated infrastructure to transform data before it landed in a warehouse. That worked when storage was expensive and compute was slow. Cloud warehouses flipped that equation. Snowflake, BigQuery, Redshift, and Databricks can process enormous datasets on demand β€” so it made more sense to load raw data first and transform it inside the warehouse.

dbt was designed for exactly this ELT pattern. Instead of orchestrating separate transform jobs, you write plain SQL files, and dbt compiles and runs them against your warehouse. No Java, no Spark clusters, no custom pipeline code.

ELT Pattern with dbt
---------------------
[Source Systems] --> [Load Raw Data] --> [Warehouse] --> [dbt Transforms] --> [Analytics Layer]
APIs Fivetran BigQuery .sql models BI Tools
DBs Airbyte Snowflake tests + docs Dashboards
Files custom Redshift lineage graph

The result is a codebase of SQL models that are version-controlled, peer-reviewed, tested, and documented β€” the same engineering discipline applied to data that software teams use for application code.


What You Need Before You Start

dbt does not extract or load data. It assumes your raw data already lives in a warehouse. Before you build anything, make sure you have:

Install dbt with the adapter that matches your warehouse:

Terminal window
pip install dbt-snowflake # for Snowflake
pip install dbt-bigquery # for BigQuery
pip install dbt-redshift # for Redshift
pip install dbt-postgres # for PostgreSQL
pip install dbt-duckdb # for DuckDB (great for local dev)

Setting Up a dbt Project

Initialize a new project with a single command:

Terminal window
dbt init my_project
cd my_project

This creates a folder structure like this:

my_project/
β”œβ”€β”€ dbt_project.yml # project config
β”œβ”€β”€ models/ # your SQL transformation files
β”‚ β”œβ”€β”€ staging/ # clean raw sources
β”‚ └── marts/ # business-ready tables
β”œβ”€β”€ seeds/ # static CSV reference data
β”œβ”€β”€ snapshots/ # slowly changing dimension tracking
β”œβ”€β”€ tests/ # custom data tests
β”œβ”€β”€ macros/ # reusable Jinja snippets
└── analyses/ # one-off SQL explorations

Open profiles.yml (usually stored at ~/.dbt/profiles.yml) and add your warehouse connection details. This file stays off version control to protect credentials.


Defining Raw Data Sources

Before you transform anything, you tell dbt where your raw data lives. Create a sources.yml file inside the models/ folder:

version: 2
sources:
- name: ecommerce
schema: raw
tables:
- name: orders
description: "Raw order records loaded by Fivetran"
- name: customers
description: "Customer records from the CRM"

Now you can reference these tables in models using {{ source('ecommerce', 'orders') }}. dbt tracks this reference and includes it in the lineage graph automatically.


Writing a Staging Model

Staging models are your first transformation layer. They clean and standardize raw data without applying any business logic. Create models/staging/stg_orders.sql:

with source as (
select * from {{ source('ecommerce', 'orders') }}
),
renamed as (
select
order_id,
customer_id,
cast(order_date as date) as order_date,
status,
amount as order_amount_usd
from source
where order_id is not null
)
select * from renamed

Run just this model to check it:

Terminal window
dbt run --select stg_orders

Building a Mart Model

Mart models sit on top of staging models and answer specific business questions. Create models/marts/monthly_revenue.sql:

with orders as (
select * from {{ ref('stg_orders') }}
where status = 'completed'
),
aggregated as (
select
date_trunc('month', order_date) as month,
count(distinct order_id) as total_orders,
sum(order_amount_usd) as total_revenue_usd
from orders
group by 1
)
select * from aggregated
order by month

The ref() function is how models connect to each other. dbt uses these references to build a DAG (directed acyclic graph) that determines the correct execution order.

DAG Example
-----------
[source: orders] [source: customers]
| |
[stg_orders] [stg_customers]
\ /
[fct_orders_enriched]
|
[monthly_revenue]

Adding Tests

dbt has two types of tests: generic tests (built-in) and singular tests (custom SQL).

Generic tests go in a YAML file alongside your models:

version: 2
models:
- name: stg_orders
columns:
- name: order_id
tests:
- unique
- not_null
- name: status
tests:
- accepted_values:
values: ['pending', 'completed', 'cancelled', 'refunded']
- name: customer_id
tests:
- relationships:
to: ref('stg_customers')
field: customer_id

Run all tests:

Terminal window
dbt test

Or target a specific model:

Terminal window
dbt test --select stg_orders

In 2025, dbt also supports unit tests β€” you define expected outputs for a model given a mocked input, letting you test transformation logic independently of live data.


Running and Deploying

Build everything in dependency order:

Terminal window
dbt build # runs models, seeds, snapshots, and tests together

Run only models that have changed since the last run (using dbt’s state comparison):

Terminal window
dbt build --select state:modified+

For production, most teams deploy through dbt Cloud, which provides a scheduler, a browser-based IDE, CI/CD integration, and a hosted documentation site. Teams that prefer self-hosting wire dbt jobs into Airflow, Prefect, or Dagster.


Documentation and Lineage

Generate and serve docs locally:

Terminal window
dbt docs generate
dbt docs serve

This opens a browser with a searchable catalog of every model, source, column, and test β€” plus an interactive lineage graph. In dbt Cloud, this is hosted automatically and stays current with every deployment.


Key dbt Concepts at a Glance

ConceptWhat It Does
ModelA .sql file that defines one transformation
SourceA reference to a raw table in your warehouse
RefLinks one model to another, builds the DAG
TestValidates data quality (unique, not_null, custom SQL)
SeedA static CSV file loaded into the warehouse
SnapshotTracks row-level changes over time (SCD Type 2)
MacroReusable Jinja code block
MaterializationHow a model is stored: table, view, incremental, ephemeral

What Changed in 2025-2026

dbt has moved quickly over the past year. A few things worth knowing:

dbt Fusion β€” A new Rust-based compiler introduced in 2025 that dramatically speeds up project compilation. Large projects that took minutes now compile in seconds.

Python models β€” dbt now supports .py model files for transformations that are awkward in SQL β€” think ML feature engineering, pandas-style reshaping, or calling external APIs mid-pipeline.

Unit testing β€” Formally added to the dbt spec, letting you write deterministic tests for model logic without touching production data.

MetricFlow integration β€” Semantic layer definitions now live inside dbt projects, letting BI tools query metrics consistently regardless of which tool is used.

dbt Mesh β€” Cross-project references that let large organizations split a monolithic dbt project into domain-owned sub-projects that still share data contracts.


dbt has become the default choice for transformation work in modern data stacks because it applies software engineering practices to SQL β€” version control, testing, documentation, and CI/CD β€” without requiring analytics engineers to learn a new language. If your team is still managing transformation logic in ad hoc scripts or stored procedures, dbt is worth evaluating seriously.