A Practical dbt Workflow for Data Ingestion and Transformation
Every data team eventually hits the same wall: raw data piles up in a warehouse, analysts write one-off queries to clean it, and nobody agrees on which numbers are correct. dbt solves this by turning transformation logic into version-controlled SQL models that anyone on the team can read, test, and improve.
This guide walks through a complete dbt ingestion and transformation workflow โ from raw source tables to tested, documented, business-ready datasets.
How dbt Fits Into a Modern Data Stack
dbt does not move data. That job belongs to ingestion tools like Fivetran, Airbyte, or custom loaders. dbt takes over once raw data lands in your warehouse and handles everything that comes after.
Modern ELT Stack with dbt--------------------------[Source Systems] APIs, DBs, Files | v[Ingestion Layer] Fivetran / Airbyte / custom loaders | v[Raw Layer in Warehouse] Snowflake / BigQuery / Redshift / DuckDB | v[dbt Transformation Layer] staging models --> mart models --> metric views | v[Analytics Layer] BI tools, notebooks, APIsThis separation keeps concerns clean. Ingestion teams own the raw layer; analytics engineers own the transformation layer.
What You Need Before Starting
- Raw data already loaded into a warehouse (Snowflake, BigQuery, Redshift, PostgreSQL, DuckDB)
- Python 3.9+ installed
- SQL knowledge โ dbt is entirely SQL-driven for most use cases
- A git repository for your dbt project
Install the adapter for your warehouse:
pip install dbt-snowflake # Snowflakepip install dbt-bigquery # BigQuerypip install dbt-duckdb # DuckDB (great for local dev and testing)Step 1: Initialize Your dbt Project
dbt init analytics_projectcd analytics_projectYour project structure will look like this:
analytics_project/โโโ dbt_project.ymlโโโ models/โ โโโ staging/ # one model per raw source tableโ โโโ marts/ # business-domain aggregationsโโโ seeds/ # static CSV reference dataโโโ snapshots/ # SCD Type 2 history trackingโโโ macros/ # reusable Jinja functionsโโโ tests/ # custom singular testsConfigure your warehouse connection in ~/.dbt/profiles.yml. Keep this file out of version control.
Step 2: Define Your Raw Data Sources
Tell dbt where your raw data lives. In models/staging/sources.yml:
version: 2
sources: - name: ecommerce schema: raw description: "Raw tables loaded by Fivetran from the ecommerce platform" freshness: warn_after: {count: 12, period: hour} error_after: {count: 24, period: hour} tables: - name: orders description: "One row per customer order" loaded_at_field: _fivetran_synced - name: customers description: "Customer master records" loaded_at_field: _fivetran_syncedThe freshness block is worth adding early. It lets you run dbt source freshness to detect stale ingestion before it breaks downstream models.
Step 3: Build a Staging Model
Staging models do one thing: take a raw table and clean it up. No business logic, just renaming, casting, filtering obvious garbage, and deduplicating.
models/staging/stg_orders.sql:
with source as ( select * from {{ source('ecommerce', 'orders') }}),
cleaned as ( select order_id, customer_id, cast(order_date as date) as order_date, lower(trim(status)) as status, coalesce(amount_usd, 0) as order_amount_usd, _fivetran_synced as loaded_at from source where order_id is not null and order_date >= '2020-01-01')
select * from cleanedRun it:
dbt run --select stg_ordersStep 4: Build a Mart Model
Mart models answer real business questions. They reference staging models using ref() so dbt knows the execution order.
models/marts/fct_monthly_revenue.sql:
with orders as ( select * from {{ ref('stg_orders') }} where status = 'completed'),
monthly as ( select date_trunc('month', order_date) as revenue_month, count(distinct order_id) as order_count, count(distinct customer_id) as unique_customers, sum(order_amount_usd) as total_revenue_usd, avg(order_amount_usd) as avg_order_value_usd from orders group by 1)
select * from monthlyorder by revenue_monthThe DAG that dbt builds from these refs:
[source: ecommerce.orders] | [stg_orders] | [fct_monthly_revenue]Step 5: Add Tests to Catch Bad Data
Tests go in YAML files alongside your models. models/staging/stg_orders.yml:
version: 2
models: - name: stg_orders description: "Cleaned and standardized order records" columns: - name: order_id description: "Unique identifier for each order" tests: - unique - not_null - name: customer_id tests: - not_null - name: status tests: - accepted_values: values: ['pending', 'completed', 'cancelled', 'refunded'] - name: order_amount_usd tests: - not_nullAnd for the mart model, models/marts/fct_monthly_revenue.yml:
version: 2
models: - name: fct_monthly_revenue description: "Monthly revenue aggregated from completed orders" columns: - name: revenue_month tests: - unique - not_null - name: total_revenue_usd tests: - not_nullRun tests:
dbt testdbt test --select stg_orders # test one model onlyStep 6: Use dbt Build to Run Everything Together
Instead of running dbt run and dbt test separately, use dbt build which runs models, seeds, snapshots, and tests in the correct order:
dbt buildTo run only models that changed since your last production deployment:
dbt build --select state:modified+This is especially useful in CI pipelines โ you only pay to run what actually changed.
Step 7: Deploy and Automate
dbt Cloud is the easiest path for most teams. You set up a job with a schedule (cron), point it at your repo, and dbt handles the rest. Each run produces logs, test results, and updated docs.
For teams that self-host orchestration, dbt integrates cleanly with:
- Airflow โ use the
DbtRunOperatorfrom the official provider - Prefect โ native dbt task library
- Dagster โ first-class dbt integration with asset-based orchestration
A typical production schedule might look like:
06:00 UTC -- Fivetran syncs raw data from source systems07:00 UTC -- dbt build runs staging + marts + tests07:30 UTC -- Failure alerts sent to Slack if tests fail08:00 UTC -- BI dashboards refresh with updated data2025-2026 Workflow Improvements Worth Knowing
Incremental models by default โ Most production mart models should use incremental materialization to avoid reprocessing full history on every run. The merge strategy is now the recommended default over insert_overwrite for most warehouses.
dbt unit tests โ Introduced in dbt Core 1.8, unit tests let you define expected model outputs for mocked inputs. You can now test transformation logic without touching production data.
dbt Mesh for large teams โ If your project grows beyond a few hundred models, dbt Mesh lets you split into domain-owned sub-projects that share models through public contracts. Teams own their slice and publish stable interfaces for others to consume.
Catalog integration โ dbt now integrates with data catalogs like Atlan, DataHub, and Alation through metadata APIs, so documentation generated by dbt docs generate flows into your broader governance tooling automatically.
Common Mistakes to Avoid
Skipping staging models โ Jumping straight to mart models makes it hard to reuse logic and track lineage. Always clean before you aggregate.
No freshness checks โ Without source freshness, a broken ingestion job can silently serve stale data to dashboards for hours.
Giant SQL files โ If a model exceeds 100 lines, consider breaking it into CTEs or intermediate models. Smaller models are easier to test and debug.
Hardcoding schema names โ Use {{ source() }} and {{ ref() }} everywhere. Hardcoded schema names break when you switch between dev and prod environments.
A well-structured dbt workflow turns raw warehouse tables into a reliable, tested data layer that the whole organization can trust. The pattern scales from a single analyst to a team of fifty โ the folder structure stays consistent, the testing philosophy stays the same, and every transformation lives in version control where it belongs.