source and ref Functions in dbt

In the data transformation landscape, dbt (data build tool) has emerged as a powerful tool that simplifies how organizations manage their data models and analytics workflows. Two fundamental functions in dbt, source and ref, play a crucial role in structuring and maintaining scalable, efficient, and well-documented data pipelines.

source Function – References raw data tables that are loaded into the warehouse outside of dbt.
ref Function – Enables referencing models within a dbt project to ensure modularity and dependency management.

By the end of this guide, you will understand how source and ref improve dbt projects, where to use them, and how to implement them in real-world scenarios.


1. Understanding source and ref Functions in dbt

1.1 What is the source Function in dbt?

The source function references raw tables that already exist in your data warehouse (e.g., Snowflake, BigQuery, Redshift), typically loaded there by an ingestion tool outside of dbt. Declaring these tables as sources documents where raw data originates and makes them visible in dbt's lineage graph.

Example Usage:

SELECT * FROM {{ source('ecommerce', 'orders') }}

✔ Refers to the orders table in the source named ecommerce. The schema the source points to is set in its YAML definition and defaults to the source name.
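
To make the mapping concrete, here is a rough sketch of what dbt compiles this call to, assuming the source is declared with schema: raw_data (as in Example 1) and a database named analytics. The database name is a hypothetical placeholder, and exact quoting and casing depend on the adapter.

```sql
-- What you write in the model:
--   SELECT * FROM {{ source('ecommerce', 'orders') }}
-- Roughly what dbt compiles it to
-- (database name `analytics` is an assumption for illustration):
SELECT * FROM analytics.raw_data.orders
```

Because the physical location lives in one YAML file, moving the raw table to another schema means changing a single line rather than every model that reads it.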

1.2 What is the ref Function in dbt?

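The ref function references other dbt models within a project. dbt uses ref calls to infer the dependency graph, so models are built in the correct order, and it resolves each reference to the right schema for the current environment. This keeps the codebase modular and reusable.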

Example Usage:

SELECT * FROM {{ ref('customer_orders') }}

✔ Refers to the transformed customer_orders model within dbt.
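
As a sketch of what happens at compile time, the ref call is replaced with the fully qualified name of the relation that dbt built for that model. The database and schema names below (analytics, dbt_dev) are hypothetical and come from whichever target is active in your profile.

```sql
-- What you write in the model:
--   SELECT * FROM {{ ref('customer_orders') }}
-- Roughly what dbt compiles it to in a development target
-- (`analytics` and `dbt_dev` are assumed names for illustration):
SELECT * FROM analytics.dbt_dev.customer_orders
```

Running the same model against a production target would resolve to that target's schema instead, with no change to the SQL.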


2. Key Benefits of source and ref in dbt

Feature | source | ref
References external raw tables | Yes | No
References transformed dbt models | No | Yes
Improves data lineage tracking | Yes | Yes
Ensures modular and reusable queries | No | Yes
Works with automated documentation | Yes | Yes
Helps in dependency management | No | Yes

Using source ensures clear documentation of raw data origins.
Using ref helps maintain model relationships dynamically.


3. Practical Examples of source and ref in dbt

Example 1: Defining a source for Orders Data in an E-commerce Store

Scenario: An e-commerce business loads raw order data into a warehouse (Snowflake/BigQuery). We want to reference it correctly using the source function.

Step 1: Define the source in sources.yml

version: 2

sources:
  - name: ecommerce
    description: "Raw e-commerce data"
    schema: raw_data
    tables:
      - name: orders
        description: "Contains all order transactions"
        columns:
          - name: order_id
            description: "Unique order identifier"
          - name: customer_id
            description: "ID of the customer who placed the order"

Step 2: Use the source function in a dbt model

SELECT 
    order_id, 
    customer_id, 
    total_amount, 
    order_date
FROM {{ source('ecommerce', 'orders') }}

✔ References orders from the ecommerce source.
✔ Ensures proper documentation and clear lineage.


Example 2: Creating a Transformed Model from a source

Scenario: We need to create a cleaned customer orders dataset from the raw orders table.

Step 1: Create a dbt model (customer_orders.sql) using source

WITH orders AS (
    SELECT * FROM {{ source('ecommerce', 'orders') }}
)

SELECT 
    order_id,
    customer_id,
    total_amount,
    order_date,
    CASE 
        WHEN total_amount > 100 THEN 'VIP'
        ELSE 'Regular'
    END AS customer_type
FROM orders

✔ Uses source to reference raw orders.
✔ Categorizes the customer on each order as VIP (total_amount over 100) or Regular.


Example 3: Building a Sales Aggregation Model Using ref

Scenario: The finance team needs a report summarizing monthly revenue.

Step 1: Create monthly_revenue.sql Model Using ref

WITH customer_orders AS (
    SELECT * FROM {{ ref('customer_orders') }}
)

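-- DATE_TRUNC('month', col) is the Snowflake/Redshift form;
-- on BigQuery use DATE_TRUNC(order_date, MONTH) instead.
SELECT 
    DATE_TRUNC('month', order_date) AS month,
    SUM(total_amount) AS monthly_revenue
FROM customer_orders
GROUP BY month
ORDER BY month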

✔ Uses ref to pull transformed customer orders.
✔ Computes monthly revenue for financial analysis.


Example 4: Linking Customers with Orders Using ref

Scenario: A marketing team wants to combine customer details with order history for targeted campaigns.

Step 1: Create a dbt model (customer_order_summary.sql)

WITH customers AS (
    SELECT * FROM {{ source('ecommerce', 'customers') }}
)

, orders AS (
    SELECT * FROM {{ ref('customer_orders') }}
)

SELECT 
    customers.customer_id,
    customers.full_name,
    COUNT(orders.order_id) AS total_orders,
    SUM(orders.total_amount) AS total_spent
FROM customers
LEFT JOIN orders ON customers.customer_id = orders.customer_id
GROUP BY customers.customer_id, customers.full_name

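✔ Uses source for customers and ref for orders.
✔ Links orders and customer data for personalized marketing.
✔ Requires a customers table to be declared under the ecommerce source in sources.yml (Example 1 declares only orders).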
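
For the source('ecommerce', 'customers') call to compile, the customers table has to be declared in the source definition. A minimal sketch of the extended sources.yml follows, assuming the raw customers table sits in the same raw_data schema and has the two columns the model selects. The description text is illustrative.

```yaml
version: 2

sources:
  - name: ecommerce
    description: "Raw e-commerce data"
    schema: raw_data
    tables:
      - name: orders
        description: "Contains all order transactions"
      # Assumed addition: the raw customers table used in Example 4
      - name: customers
        description: "One row per customer"
        columns:
          - name: customer_id
            description: "Unique customer identifier"
          - name: full_name
            description: "Customer's full name"
```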


Example 5: Data Quality Check with source and ref

Scenario: The data team wants to validate data completeness by counting records in sources vs. models.

Step 1: Create a dbt model (data_quality_check.sql)

WITH source_orders AS (
    SELECT COUNT(*) AS source_count FROM {{ source('ecommerce', 'orders') }}
)

, transformed_orders AS (
    SELECT COUNT(*) AS model_count FROM {{ ref('customer_orders') }}
)

SELECT 
    source_orders.source_count, 
    transformed_orders.model_count,
    CASE 
        WHEN source_orders.source_count = transformed_orders.model_count THEN 'Valid'
        ELSE 'Data Mismatch'
    END AS data_check_status
FROM source_orders
CROSS JOIN transformed_orders

✔ Compares raw vs. transformed row counts.
✔ Flags potential data loss or transformation issues.
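
The model above only reports a status. If you want dbt test to fail on a mismatch, one option is a singular test: a SQL file in the project's tests/ directory that fails whenever its query returns rows. A minimal sketch, with a hypothetical file name:

```sql
-- tests/assert_order_counts_match.sql (file name is an assumption)
-- dbt treats any returned row as a test failure, so this query
-- returns a row only when the two counts differ.
WITH source_orders AS (
    SELECT COUNT(*) AS source_count FROM {{ source('ecommerce', 'orders') }}
)

, transformed_orders AS (
    SELECT COUNT(*) AS model_count FROM {{ ref('customer_orders') }}
)

SELECT
    source_orders.source_count,
    transformed_orders.model_count
FROM source_orders
CROSS JOIN transformed_orders
WHERE source_orders.source_count <> transformed_orders.model_count
```

This fits here because customer_orders applies no filter, so equal counts are the expected outcome. A model that intentionally drops rows would need a different check.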


4. When and Where to Use source and ref

When to Use source

✔ When referencing raw tables from external databases.
✔ When defining data lineage and documentation.
✔ When ensuring data validation and transparency.

When to Use ref

✔ When referencing other dbt models within a project.
✔ When ensuring dependencies are correctly handled.
✔ When maintaining modular and reusable SQL code.
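
A short contrast may help when deciding between a hard-coded table name and these functions. The sketch below reuses the models from the examples above. The hard-coded relation name in the comment is hypothetical.

```sql
-- Avoid: a hard-coded relation name. dbt cannot see this dependency,
-- and the name breaks when the target schema differs between environments.
--   SELECT order_id, total_amount FROM analytics.dbt_dev.customer_orders

-- Prefer: ref() for anything dbt builds (source() plays the same role
-- for raw tables that are loaded outside of dbt).
SELECT
    order_id,
    total_amount
FROM {{ ref('customer_orders') }}
```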


5. How to Use source and ref in Your dbt Project

Step 1: Install dbt

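Install dbt-core together with the adapter for your warehouse (for example dbt-snowflake, dbt-bigquery, or dbt-redshift):

pip install dbt-core dbt-snowflake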

Step 2: Configure dbt Connection (profiles.yml)

Define database credentials for Snowflake, BigQuery, or Redshift.
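
As an illustration, a Snowflake profile might look like the sketch below. The profile name, role, database, warehouse, and schema are all hypothetical placeholders. Credentials are read from environment variables with dbt's env_var function so they stay out of the file.

```yaml
# ~/.dbt/profiles.yml
my_dbt_project:        # assumed name; must match `profile:` in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: transformer          # placeholder
      database: analytics        # placeholder
      warehouse: transforming    # placeholder
      schema: dbt_dev            # schema where ref() models are built
      threads: 4
```

BigQuery and Redshift profiles use the same overall layout with adapter-specific connection fields.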

Step 3: Define sources.yml to Document Raw Data

Create a sources file and specify raw tables.

Step 4: Use source and ref in dbt Models

Write SQL transformations using source (for raw data) and ref (for transformed models).

Step 5: Run dbt Models

dbt run

Step 6: Test Data Quality

dbt test
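
dbt test runs whatever tests are declared in the project, so it needs at least one to be useful. A minimal sketch that attaches the built-in unique and not_null tests to the orders source from Example 1 is shown below. Which columns to test is an assumption about the data. Newer dbt versions also accept data_tests: as the key name in place of tests:.

```yaml
version: 2

sources:
  - name: ecommerce
    schema: raw_data
    tables:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique      # assumes one row per order
              - not_null
          - name: customer_id
            tests:
              - not_null    # assumes every order has a customer
```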

Mastering source and ref in dbt ensures clean, structured, and scalable data transformations. These functions simplify data management, improve model dependencies, and enhance transparency in modern analytics workflows.

Start using source and ref today to build robust, scalable, and well-documented dbt projects! 🚀