Data Engineering  /  dbt

๐Ÿ”„ dbt โ€” Data Build Tool 23 guides ยท updated 2026

Analytics engineering with SQL โ€” models, tests, sources, and Jinja macros that turn raw warehouse tables into trustworthy, documented data products.

A Practical dbt Workflow for Data Ingestion and Transformation

Every data team eventually hits the same wall: raw data piles up in a warehouse, analysts write one-off queries to clean it, and nobody agrees on which numbers are correct. dbt solves this by turning transformation logic into version-controlled SQL models that anyone on the team can read, test, and improve.

This guide walks through a complete dbt ingestion and transformation workflow โ€” from raw source tables to tested, documented, business-ready datasets.


How dbt Fits Into a Modern Data Stack

dbt does not move data. That job belongs to ingestion tools like Fivetran, Airbyte, or custom loaders. dbt takes over once raw data lands in your warehouse and handles everything that comes after.

Modern ELT Stack with dbt
--------------------------
[Source Systems]
APIs, DBs, Files
|
v
[Ingestion Layer]
Fivetran / Airbyte / custom loaders
|
v
[Raw Layer in Warehouse]
Snowflake / BigQuery / Redshift / DuckDB
|
v
[dbt Transformation Layer]
staging models --> mart models --> metric views
|
v
[Analytics Layer]
BI tools, notebooks, APIs

This separation keeps concerns clean. Ingestion teams own the raw layer; analytics engineers own the transformation layer.


What You Need Before Starting

Install the adapter for your warehouse:

Terminal window
pip install dbt-snowflake # Snowflake
pip install dbt-bigquery # BigQuery
pip install dbt-duckdb # DuckDB (great for local dev and testing)

Step 1: Initialize Your dbt Project

Terminal window
dbt init analytics_project
cd analytics_project

Your project structure will look like this:

analytics_project/
โ”œโ”€โ”€ dbt_project.yml
โ”œโ”€โ”€ models/
โ”‚ โ”œโ”€โ”€ staging/ # one model per raw source table
โ”‚ โ””โ”€โ”€ marts/ # business-domain aggregations
โ”œโ”€โ”€ seeds/ # static CSV reference data
โ”œโ”€โ”€ snapshots/ # SCD Type 2 history tracking
โ”œโ”€โ”€ macros/ # reusable Jinja functions
โ””โ”€โ”€ tests/ # custom singular tests

Configure your warehouse connection in ~/.dbt/profiles.yml. Keep this file out of version control.


Step 2: Define Your Raw Data Sources

Tell dbt where your raw data lives. In models/staging/sources.yml:

version: 2
sources:
- name: ecommerce
schema: raw
description: "Raw tables loaded by Fivetran from the ecommerce platform"
freshness:
warn_after: {count: 12, period: hour}
error_after: {count: 24, period: hour}
tables:
- name: orders
description: "One row per customer order"
loaded_at_field: _fivetran_synced
- name: customers
description: "Customer master records"
loaded_at_field: _fivetran_synced

The freshness block is worth adding early. It lets you run dbt source freshness to detect stale ingestion before it breaks downstream models.


Step 3: Build a Staging Model

Staging models do one thing: take a raw table and clean it up. No business logic, just renaming, casting, filtering obvious garbage, and deduplicating.

models/staging/stg_orders.sql:

with source as (
select * from {{ source('ecommerce', 'orders') }}
),
cleaned as (
select
order_id,
customer_id,
cast(order_date as date) as order_date,
lower(trim(status)) as status,
coalesce(amount_usd, 0) as order_amount_usd,
_fivetran_synced as loaded_at
from source
where order_id is not null
and order_date >= '2020-01-01'
)
select * from cleaned

Run it:

Terminal window
dbt run --select stg_orders

Step 4: Build a Mart Model

Mart models answer real business questions. They reference staging models using ref() so dbt knows the execution order.

models/marts/fct_monthly_revenue.sql:

with orders as (
select * from {{ ref('stg_orders') }}
where status = 'completed'
),
monthly as (
select
date_trunc('month', order_date) as revenue_month,
count(distinct order_id) as order_count,
count(distinct customer_id) as unique_customers,
sum(order_amount_usd) as total_revenue_usd,
avg(order_amount_usd) as avg_order_value_usd
from orders
group by 1
)
select * from monthly
order by revenue_month

The DAG that dbt builds from these refs:

[source: ecommerce.orders]
|
[stg_orders]
|
[fct_monthly_revenue]

Step 5: Add Tests to Catch Bad Data

Tests go in YAML files alongside your models. models/staging/stg_orders.yml:

version: 2
models:
- name: stg_orders
description: "Cleaned and standardized order records"
columns:
- name: order_id
description: "Unique identifier for each order"
tests:
- unique
- not_null
- name: customer_id
tests:
- not_null
- name: status
tests:
- accepted_values:
values: ['pending', 'completed', 'cancelled', 'refunded']
- name: order_amount_usd
tests:
- not_null

And for the mart model, models/marts/fct_monthly_revenue.yml:

version: 2
models:
- name: fct_monthly_revenue
description: "Monthly revenue aggregated from completed orders"
columns:
- name: revenue_month
tests:
- unique
- not_null
- name: total_revenue_usd
tests:
- not_null

Run tests:

Terminal window
dbt test
dbt test --select stg_orders # test one model only

Step 6: Use dbt Build to Run Everything Together

Instead of running dbt run and dbt test separately, use dbt build which runs models, seeds, snapshots, and tests in the correct order:

Terminal window
dbt build

To run only models that changed since your last production deployment:

Terminal window
dbt build --select state:modified+

This is especially useful in CI pipelines โ€” you only pay to run what actually changed.


Step 7: Deploy and Automate

dbt Cloud is the easiest path for most teams. You set up a job with a schedule (cron), point it at your repo, and dbt handles the rest. Each run produces logs, test results, and updated docs.

For teams that self-host orchestration, dbt integrates cleanly with:

A typical production schedule might look like:

06:00 UTC -- Fivetran syncs raw data from source systems
07:00 UTC -- dbt build runs staging + marts + tests
07:30 UTC -- Failure alerts sent to Slack if tests fail
08:00 UTC -- BI dashboards refresh with updated data

2025-2026 Workflow Improvements Worth Knowing

Incremental models by default โ€” Most production mart models should use incremental materialization to avoid reprocessing full history on every run. The merge strategy is now the recommended default over insert_overwrite for most warehouses.

dbt unit tests โ€” Introduced in dbt Core 1.8, unit tests let you define expected model outputs for mocked inputs. You can now test transformation logic without touching production data.

dbt Mesh for large teams โ€” If your project grows beyond a few hundred models, dbt Mesh lets you split into domain-owned sub-projects that share models through public contracts. Teams own their slice and publish stable interfaces for others to consume.

Catalog integration โ€” dbt now integrates with data catalogs like Atlan, DataHub, and Alation through metadata APIs, so documentation generated by dbt docs generate flows into your broader governance tooling automatically.


Common Mistakes to Avoid

Skipping staging models โ€” Jumping straight to mart models makes it hard to reuse logic and track lineage. Always clean before you aggregate.

No freshness checks โ€” Without source freshness, a broken ingestion job can silently serve stale data to dashboards for hours.

Giant SQL files โ€” If a model exceeds 100 lines, consider breaking it into CTEs or intermediate models. Smaller models are easier to test and debug.

Hardcoding schema names โ€” Use {{ source() }} and {{ ref() }} everywhere. Hardcoded schema names break when you switch between dev and prod environments.


A well-structured dbt workflow turns raw warehouse tables into a reliable, tested data layer that the whole organization can trust. The pattern scales from a single analyst to a team of fifty โ€” the folder structure stays consistent, the testing philosophy stays the same, and every transformation lives in version control where it belongs.