Data Build Tool (dbt)
Core Foundations of dbt
- What is dbt
- ELT vs ETL
- dbt Project Structure
- dbt_project.yml
- profiles.yml
- dbt Core vs dbt Cloud
- dbt Environment Management
Models and Materializations
ELT vs ETL: Understanding the Core Difference in Modern Data Pipelines
In the world of data engineering, two acronyms define how data moves from sources to analytics: ETL and ELT.
Both processes involve three key steps:
- Extract: Pull data from sources.
- Transform: Clean and format data.
- Load: Store it for analysis.
The difference lies in the order of these steps.
Definition Summary
| Concept | Full Form | Order | Where Transformation Happens |
|---|---|---|---|
| ETL | Extract → Transform → Load | Before loading | In external servers/tools |
| ELT | Extract → Load → Transform | After loading | Inside the data warehouse |
Simple Explanation
- In ETL, data is cleaned and transformed before being loaded into the data warehouse.
- In ELT, raw data is loaded first, and transformation happens inside the warehouse using SQL or tools like dbt.
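The ordering difference can be sketched in a few lines of plain Python. This is purely illustrative: lists stand in for the warehouse, and the row fields are made up.

```python
# Illustrative only: the same three steps, applied in different orders.
raw = [{"qty": 2, "price": 5.0}, {"qty": None, "price": 3.0}]

def transform(rows):
    # Drop incomplete rows and compute a total, mimicking a cleaning step.
    return [dict(r, total=r["qty"] * r["price"]) for r in rows if r["qty"] is not None]

# ETL: transform first, then "load" into the list acting as the warehouse.
warehouse_etl = transform(raw)

# ELT: load the raw rows first, then transform them "inside the warehouse".
warehouse_elt = list(raw)            # load as-is
warehouse_elt = transform(warehouse_elt)  # transform where the data lives

print(warehouse_etl == warehouse_elt)
```

Both paths end with the same cleaned data; only where and when the transformation runs differs.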
Real-World Analogy
Imagine you're baking cookies.
- ETL: You mix ingredients (transform) before putting them in the oven (load).
- ELT: You put all ingredients into the oven first (load), then bake and shape them inside (transform).
Result: ELT leverages the "oven's" power (your cloud warehouse) to do the heavy lifting.
**ETL Process**
Explanation: ETL performs transformations outside the warehouse using ETL tools (e.g., Informatica, Talend, Pentaho).
**ELT Process**
Explanation: ELT performs transformations inside the warehouse using its compute power.
ETL Process: Step-by-Step
- Extract: Gather data from sources like APIs, databases, and flat files.
- Transform: Apply cleaning, deduplication, and enrichment logic before loading.
- Load: Store the processed data into the data warehouse.
ELT Process: Step-by-Step
- Extract: Pull data from various systems.
- Load: Dump raw data directly into the data warehouse.
- Transform: Perform transformations inside the warehouse using SQL or dbt.
Comparison Table: ETL vs ELT
| Feature | ETL | ELT |
|---|---|---|
| Transformation Location | Outside warehouse | Inside warehouse |
| Speed | Slower | Faster (uses cloud compute) |
| Scalability | Limited by ETL server | Highly scalable |
| Maintenance | Complex | Easier |
| Real-time Processing | Harder | Easier |
| Data Lake Compatibility | Limited | Excellent |
| Tools | Informatica, Talend, SSIS | dbt, Snowflake, BigQuery |
| Best for | Legacy on-premise systems | Modern cloud data warehouses |
Example 1: ETL Python Program (Traditional Approach)
Here's a simple ETL example using Python and pandas:
```python
import pandas as pd
from sqlalchemy import create_engine

# Extract
orders = pd.read_csv('orders.csv')

# Transform
orders_clean = orders.dropna()
orders_clean['total_price'] = orders_clean['quantity'] * orders_clean['price']

# Load
engine = create_engine('postgresql://user:pass@localhost:5432/analytics')
orders_clean.to_sql('orders_cleaned', engine, if_exists='replace', index=False)
```

Explanation: All transformations are performed before loading into the database.
Example 2: ELT Using SQL (Modern Approach)
In ELT, data is loaded as-is, and transformations happen later in SQL inside the warehouse.
```sql
-- Load step (raw data already loaded)
SELECT * FROM raw.orders;

-- Transform step inside the warehouse
CREATE OR REPLACE TABLE analytics.orders_cleaned AS
SELECT
    order_id,
    customer_id,
    quantity * price AS total_price,
    order_date
FROM raw.orders
WHERE status = 'shipped';
```

Explanation: Data is first loaded raw, then transformed within the warehouse (e.g., Snowflake, BigQuery).
Example 3: ELT with dbt
dbt helps automate the transformation step inside the warehouse.
```sql
-- models/orders_cleaned.sql
SELECT
    order_id,
    customer_id,
    quantity * price AS total_price,
    order_date
FROM {{ source('raw', 'orders') }}
WHERE status = 'shipped';
```

Then run:

```bash
dbt run --select orders_cleaned
```

Result: dbt creates the transformed table within the data warehouse.
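The `{{ source('raw', 'orders') }}` reference above assumes the source is declared in a YAML file in the project. A minimal sketch of that declaration might look like this (the file path, schema, and table names are illustrative):

```yaml
# models/sources.yml (illustrative)
version: 2

sources:
  - name: raw            # referenced as source('raw', ...)
    schema: raw          # schema where the loader landed the data
    tables:
      - name: orders     # referenced as source('raw', 'orders')
```

Declaring sources this way also lets dbt track freshness and document the raw layer alongside the models.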
**Side-by-Side Comparison**
Key takeaway: ELT uses warehouse compute power, making it more efficient and scalable.
How to Remember the Difference
| Trick | Description |
|---|---|
| Order Trick | ETL = "T before L": transform first; ELT = "L before T": load first. |
| Location Trick | ETL = Transform outside; ELT = Transform inside warehouse. |
| Analogy | ETL = Kitchen outside home; ELT = Kitchen inside your home. |
| Keyword Mnemonic | ETL = Extract, Then Load later; ELT = Extract, Load, Then Transform. |
Memory Hook:
"ELT = Modern Cloud, ETL = Legacy Ground."
Example 4: ETL with Airflow + Python
Airflow orchestrates ETL jobs with external transformations.
```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def extract():
    print("Extracting data...")

def transform():
    print("Transforming data...")

def load():
    print("Loading data into warehouse...")

with DAG('etl_pipeline', start_date=datetime(2025, 1, 1)) as dag:
    t1 = PythonOperator(task_id='extract', python_callable=extract)
    t2 = PythonOperator(task_id='transform', python_callable=transform)
    t3 = PythonOperator(task_id='load', python_callable=load)

    t1 >> t2 >> t3
```

Explanation: Each task runs sequentially, so transformations happen before loading.
Example 5: ELT Using dbt + Snowflake
```bash
# Load raw data using Fivetran
fivetran sync

# Then transform data using dbt
dbt run --models stg_orders
dbt test --models stg_orders
```

Explanation: Fivetran loads data into Snowflake (L), and dbt transforms it inside (T).
Example 6: ELT Using SQL Inside the Warehouse
```sql
-- Step 1: Raw data already loaded
CREATE OR REPLACE TABLE raw.sales AS
SELECT * FROM external_source.sales;

-- Step 2: Transform inside the warehouse
CREATE OR REPLACE TABLE analytics.sales_summary AS
SELECT
    region,
    SUM(amount) AS total_sales
FROM raw.sales
GROUP BY region;
```

Result: All transformations occur inside the data warehouse: classic ELT.
**Data Flow in ELT (Modern Data Stack)**
Explanation: Modern ELT stacks combine Fivetran (Load) + dbt (Transform) seamlessly.
Why ELT Is the Future
| Feature | ETL | ELT |
|---|---|---|
| Scalability | Limited | Auto-scaling in cloud |
| Maintenance | Complex pipelines | Simplified SQL models |
| Speed | Dependent on ETL servers | High-speed compute |
| Storage Costs | Expensive | Cost-efficient in cloud |
| Data Volume | Small-medium | Big data-ready |
| Tooling | On-premise | Cloud-native (dbt, Snowflake) |
Reason: ELT uses the power of modern cloud warehouses that separate storage and compute, enabling faster, cheaper, and more flexible transformations.
Importance of Learning ELT vs ETL
1. Core Data Engineering Concept
Every data professional must understand how data moves, transforms, and loads.
2. Modern Stack Alignment
Companies are moving from ETL to ELT + dbt workflows.
3. Better Optimization
Knowing both helps you design efficient, maintainable pipelines.
4. Interview Relevance
Questions like:
- "What's the difference between ETL and ELT?"
- "Why do modern teams prefer ELT?"
- "How does dbt fit into ELT?"
are extremely common in data engineering interviews.
How to Remember for Exams & Interviews
Mnemonic:
"ETL cleans before the party, ELT cleans after the party."
Flashcard Method:
- Front: "Where does transformation happen in ELT?"
- Back: "Inside the warehouse."
Visual Trick:
Picture ETL = old factory (manual) and ELT = smart factory (automated cloud).
**Historical Evolution**
Summary: The rise of Snowflake, BigQuery, and Databricks made ELT the new standard.
Interview Q&A Cheatsheet
| Question | Answer |
|---|---|
| What is ETL? | Process where data is transformed before loading into storage. |
| What is ELT? | Process where data is loaded raw and transformed in the warehouse. |
| Why is ELT faster? | It leverages the warehouse's compute power. |
| Which is best for cloud? | ELT |
| Example of ELT tools? | dbt, Fivetran, Airbyte, Snowflake |
| Example of ETL tools? | Talend, Informatica, SSIS |
Best Practices for ELT
- Use SQL transformations inside your warehouse.
- Leverage dbt for testing, version control, and documentation.
- Store raw data unchanged for auditing.
- Automate data loads using Fivetran or Airbyte.
- Use incremental models to optimize performance.
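The last point can be sketched as a dbt incremental model. This is a minimal sketch: the model name, columns, and `order_date` watermark are illustrative, and a real model would pick a key and filter suited to the data.

```sql
-- models/orders_incremental.sql (illustrative)
{{ config(materialized='incremental', unique_key='order_id') }}

SELECT
    order_id,
    customer_id,
    quantity * price AS total_price,
    order_date
FROM {{ source('raw', 'orders') }}

{% if is_incremental() %}
  -- On incremental runs, only process rows newer than what the table already holds
  WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
```

On the first run dbt builds the full table; on later runs the `is_incremental()` block limits the scan to new rows, which is where the performance win comes from.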
Future of Data Transformation
- ETL → ELT → Reverse ETL
- Reverse ETL sends warehouse data back to tools like Salesforce or HubSpot.
- The transformation layer (dbt) is now the core of the data stack.
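The Reverse ETL idea can be sketched in Python. Everything here is hypothetical: the field names, the "CRM" payload shape, and the segmentation rule are illustrative, and a real sync would go through the vendor's API client.

```python
# Illustrative Reverse ETL step: shape warehouse rows into CRM-ready payloads.
# Field names and the target payload schema are hypothetical.

def to_crm_payload(warehouse_rows):
    """Map warehouse customer rows to the format a CRM API might expect."""
    return [
        {
            "external_id": row["customer_id"],
            "properties": {
                "lifetime_value": row["total_spent"],
                "segment": "high" if row["total_spent"] >= 1000 else "standard",
            },
        }
        for row in warehouse_rows
    ]

rows = [
    {"customer_id": 1, "total_spent": 1500.0},
    {"customer_id": 2, "total_spent": 200.0},
]
payloads = to_crm_payload(rows)
# A Reverse ETL tool would then POST each payload to the CRM's API.
```

The point is that the warehouse, not the operational tool, is the source of truth: analytics-grade fields like lifetime value flow back out to drive workflows.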
**Modern ELT + Reverse ETL**
Explanation: ELT not only enables analytics but also drives operational workflows.
Conclusion
ETL was ideal for the on-premise world where compute and storage were expensive. ELT, however, is the hero of the cloud era, where transformation is faster, cheaper, and more scalable.
ETL prepares data before storage; ELT transforms it after storage, inside the warehouse.
Understanding this distinction helps data professionals design efficient, future-proof pipelines.
Final Thought:
"ETL is history. ELT is the future. Learn ELT to speak the language of modern data."