A Practical dbt Workflow for Data Ingestion and Transformation

Every data team eventually hits the same wall: raw data piles up in a warehouse, analysts write one-off queries to clean it, and nobody agrees on which numbers are correct. dbt solves this by turning transformation logic into version-controlled SQL models that anyone on the team can read, test, and improve.

This guide walks through a complete dbt ingestion and transformation workflow — from raw source tables to tested, documented, business-ready datasets.

How dbt Fits Into a Modern Data Stack

dbt does not move data. That job belongs to ingestion tools like Fivetran, Airbyte, or custom loaders. dbt takes over once raw data lands in your warehouse and handles everything that comes after.

Modern ELT Stack with dbt
--------------------------
[Source Systems]
  APIs, DBs, Files
       |
       v
[Ingestion Layer]
  Fivetran / Airbyte / custom loaders
       |
       v
[Raw Layer in Warehouse]
  Snowflake / BigQuery / Redshift / DuckDB
       |
       v
[dbt Transformation Layer]
  staging models --> mart models --> metric views
       |
       v
[Analytics Layer]
  BI tools, notebooks, APIs

This separation keeps concerns clean. Ingestion teams own the raw layer; analytics engineers own the transformation layer.

What You Need Before Starting

Raw data already loaded into a warehouse (Snowflake, BigQuery, Redshift, PostgreSQL, DuckDB)
Python 3.9+ installed
SQL knowledge — dbt is entirely SQL-driven for most use cases
A git repository for your dbt project

Install the adapter for your warehouse:

pip install dbt-snowflake    # Snowflake
pip install dbt-bigquery     # BigQuery
pip install dbt-duckdb       # DuckDB (great for local dev and testing)

Step 1: Initialize Your dbt Project

dbt init analytics_project
cd analytics_project

Your project structure will look like this:

analytics_project/
├── dbt_project.yml
├── models/
│   ├── staging/         # one model per raw source table
│   └── marts/           # business-domain aggregations
├── seeds/               # static CSV reference data
├── snapshots/           # SCD Type 2 history tracking
├── macros/              # reusable Jinja functions
└── tests/               # custom singular tests

Configure your warehouse connection in ~/.dbt/profiles.yml. Keep this file out of version control.

Step 2: Define Your Raw Data Sources

Tell dbt where your raw data lives. In models/staging/sources.yml:

version: 2

sources:
  - name: ecommerce
    schema: raw
    description: "Raw tables loaded by Fivetran from the ecommerce platform"
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
        description: "One row per customer order"
        loaded_at_field: _fivetran_synced
      - name: customers
        description: "Customer master records"
        loaded_at_field: _fivetran_synced

The freshness block is worth adding early. It lets you run dbt source freshness to detect stale ingestion before it breaks downstream models.

Step 3: Build a Staging Model

Staging models do one thing: take a raw table and clean it up. No business logic, just renaming, casting, filtering obvious garbage, and deduplicating.

models/staging/stg_orders.sql:

with source as (
    select * from {{ source('ecommerce', 'orders') }}
),

cleaned as (
    select
        order_id,
        customer_id,
        cast(order_date as date)      as order_date,
        lower(trim(status))           as status,
        coalesce(amount_usd, 0)       as order_amount_usd,
        _fivetran_synced              as loaded_at
    from source
    where order_id is not null
      and order_date >= '2020-01-01'
)

select * from cleaned

Run it:

dbt run --select stg_orders

Step 4: Build a Mart Model

Mart models answer real business questions. They reference staging models using ref() so dbt knows the execution order.

models/marts/fct_monthly_revenue.sql:

with orders as (
    select * from {{ ref('stg_orders') }}
    where status = 'completed'
),

monthly as (
    select
        date_trunc('month', order_date)   as revenue_month,
        count(distinct order_id)          as order_count,
        count(distinct customer_id)       as unique_customers,
        sum(order_amount_usd)             as total_revenue_usd,
        avg(order_amount_usd)             as avg_order_value_usd
    from orders
    group by 1
)

select * from monthly
order by revenue_month

The DAG that dbt builds from these refs:

[source: ecommerce.orders]
          |
    [stg_orders]
          |
  [fct_monthly_revenue]

Step 5: Add Tests to Catch Bad Data

Tests go in YAML files alongside your models. models/staging/stg_orders.yml:

version: 2

models:
  - name: stg_orders
    description: "Cleaned and standardized order records"
    columns:
      - name: order_id
        description: "Unique identifier for each order"
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'completed', 'cancelled', 'refunded']
      - name: order_amount_usd
        tests:
          - not_null

And for the mart model, models/marts/fct_monthly_revenue.yml:

version: 2

models:
  - name: fct_monthly_revenue
    description: "Monthly revenue aggregated from completed orders"
    columns:
      - name: revenue_month
        tests:
          - unique
          - not_null
      - name: total_revenue_usd
        tests:
          - not_null

Run tests:

dbt test
dbt test --select stg_orders    # test one model only

Step 6: Use dbt Build to Run Everything Together

Instead of running dbt run and dbt test separately, use dbt build which runs models, seeds, snapshots, and tests in the correct order:

dbt build

To run only models that changed since your last production deployment:

dbt build --select state:modified+

This is especially useful in CI pipelines — you only pay to run what actually changed.

Step 7: Deploy and Automate

dbt Cloud is the easiest path for most teams. You set up a job with a schedule (cron), point it at your repo, and dbt handles the rest. Each run produces logs, test results, and updated docs.

For teams that self-host orchestration, dbt integrates cleanly with:

Airflow — use the DbtRunOperator from the official provider
Prefect — native dbt task library
Dagster — first-class dbt integration with asset-based orchestration

A typical production schedule might look like:

06:00 UTC  -- Fivetran syncs raw data from source systems
07:00 UTC  -- dbt build runs staging + marts + tests
07:30 UTC  -- Failure alerts sent to Slack if tests fail
08:00 UTC  -- BI dashboards refresh with updated data

2025-2026 Workflow Improvements Worth Knowing

Incremental models by default — Most production mart models should use incremental materialization to avoid reprocessing full history on every run. The merge strategy is now the recommended default over insert_overwrite for most warehouses.

dbt unit tests — Introduced in dbt Core 1.8, unit tests let you define expected model outputs for mocked inputs. You can now test transformation logic without touching production data.

dbt Mesh for large teams — If your project grows beyond a few hundred models, dbt Mesh lets you split into domain-owned sub-projects that share models through public contracts. Teams own their slice and publish stable interfaces for others to consume.

Catalog integration — dbt now integrates with data catalogs like Atlan, DataHub, and Alation through metadata APIs, so documentation generated by dbt docs generate flows into your broader governance tooling automatically.

Common Mistakes to Avoid

Skipping staging models — Jumping straight to mart models makes it hard to reuse logic and track lineage. Always clean before you aggregate.

No freshness checks — Without source freshness, a broken ingestion job can silently serve stale data to dashboards for hours.

Giant SQL files — If a model exceeds 100 lines, consider breaking it into CTEs or intermediate models. Smaller models are easier to test and debug.

Hardcoding schema names — Use {{ source() }} and {{ ref() }} everywhere. Hardcoded schema names break when you switch between dev and prod environments.

A well-structured dbt workflow turns raw warehouse tables into a reliable, tested data layer that the whole organization can trust. The pattern scales from a single analyst to a team of fifty — the folder structure stays consistent, the testing philosophy stays the same, and every transformation lives in version control where it belongs.