Advanced SQL Concepts
Once you’re comfortable with SELECT, JOIN, GROUP BY, and basic subqueries, a second tier of SQL capabilities opens up — one that’s expected of data engineers, analytics engineers, and senior analysts. This guide covers the patterns and techniques that separate intermediate SQL users from those who can solve genuinely complex analytical problems.
Window Functions: Analytics Without Collapsing Rows
Window functions perform calculations across a set of rows related to the current row — without collapsing results the way GROUP BY does. This makes them ideal for ranking, running totals, moving averages, and comparing rows to their neighbors.
Basic syntax:
function_name(expression) OVER ( PARTITION BY column -- reset the window for each group ORDER BY column -- order within the window ROWS/RANGE frame_spec -- optional: define the window size)Ranking functions:
SELECT employee_id, department, salary, RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank, DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_dense_rank, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_numFROM employees;The difference:
ROW_NUMBER()— always unique, no tiesRANK()— ties get the same number, next number skips (1, 1, 3)DENSE_RANK()— ties get the same number, next number doesn’t skip (1, 1, 2)
Running totals and moving averages:
SELECT order_date, daily_revenue, SUM(daily_revenue) OVER (ORDER BY order_date) AS running_total, AVG(daily_revenue) OVER ( ORDER BY order_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW ) AS seven_day_moving_avgFROM daily_sales;Accessing adjacent rows:
SELECT product_id, sale_date, revenue, LAG(revenue, 1) OVER (PARTITION BY product_id ORDER BY sale_date) AS prev_day_revenue, revenue - LAG(revenue, 1) OVER (PARTITION BY product_id ORDER BY sale_date) AS day_over_day_changeFROM product_sales;Common Table Expressions (CTEs)
CTEs use the WITH clause to define named intermediate result sets. They make complex queries readable by breaking them into logical steps.
WITHcustomer_orders AS ( SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_spent, MAX(order_date) AS last_order_date FROM orders GROUP BY customer_id),customer_segments AS ( SELECT customer_id, total_spent, CASE WHEN total_spent >= 5000 THEN 'Platinum' WHEN total_spent >= 1000 THEN 'Gold' WHEN total_spent >= 250 THEN 'Silver' ELSE 'Bronze' END AS segment FROM customer_orders)SELECT cs.segment, COUNT(*) AS customer_count, AVG(co.total_spent) AS avg_spend, AVG(co.order_count) AS avg_ordersFROM customer_segments csJOIN customer_orders co USING (customer_id)GROUP BY cs.segmentORDER BY avg_spend DESC;CTEs can reference each other in sequence, and in most databases multiple CTEs can be defined in the same WITH block, separated by commas.
Recursive CTEs
Recursive CTEs handle hierarchical or graph data — organizational trees, category hierarchies, bill-of-materials, network paths. They’re one of the most powerful but least-used features in SQL.
-- Walk an org chart from a given manager downWITH RECURSIVE org_tree AS ( -- Anchor: start with the given manager SELECT employee_id, name, manager_id, 0 AS depth FROM employees WHERE employee_id = 100 -- starting manager
UNION ALL
-- Recursive: find direct reports of each person in the tree SELECT e.employee_id, e.name, e.manager_id, ot.depth + 1 FROM employees e JOIN org_tree ot ON e.manager_id = ot.employee_id)SELECT name, depthFROM org_treeORDER BY depth, name;Structure:
- Anchor member — the base case (starting rows)
UNION ALL- Recursive member — joins back to the CTE itself, referencing rows from the previous iteration
Lateral Joins
A lateral join (called CROSS APPLY / OUTER APPLY in SQL Server) allows the subquery on the right side to reference columns from the table on the left — something a regular subquery cannot do.
-- For each customer, get their 3 most recent ordersSELECT c.customer_id, c.name, recent.order_id, recent.amount, recent.order_dateFROM customers cCROSS JOIN LATERAL ( SELECT order_id, amount, order_date FROM orders o WHERE o.customer_id = c.customer_id -- references c from outer query ORDER BY order_date DESC LIMIT 3) recent;Lateral joins are particularly useful for:
- Top-N per group without window functions
- Calling set-returning functions row by row
- Unnesting arrays or JSON into rows per record
Advanced Filtering Patterns
FILTER clause on aggregates (cleaner than CASE WHEN):
SELECT COUNT(*) AS total_orders, COUNT(*) FILTER (WHERE status = 'completed') AS completed, COUNT(*) FILTER (WHERE status = 'refunded') AS refunded, SUM(amount) FILTER (WHERE status = 'completed') AS completed_revenueFROM orders;Conditional aggregation with CASE WHEN (works in all databases):
SELECT SUM(CASE WHEN status = 'completed' THEN amount ELSE 0 END) AS completed_revenue, SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END) AS refunded_revenueFROM orders;EXISTS vs IN for subqueries:
-- EXISTS is typically faster when the subquery result is largeSELECT customer_id, nameFROM customers cWHERE EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND o.amount > 1000);String and JSON Functions
Modern SQL databases handle semi-structured data natively.
PostgreSQL JSON:
-- Extract fields from JSON columnSELECT id, event_data->>'user_id' AS user_id, (event_data->>'amount')::numeric AS amount, event_data->'metadata'->>'source' AS sourceFROM eventsWHERE event_data->>'type' = 'purchase';
-- Query arrays in JSONSELECT id, jsonb_array_elements_text(event_data->'tags') AS tagFROM events;String manipulation:
SELECT LOWER(TRIM(email)) AS normalized_email, SPLIT_PART(email, '@', 2) AS domain, LEFT(phone, 3) AS area_code, REGEXP_REPLACE(description, '[^a-zA-Z0-9 ]', '', 'g') AS clean_descriptionFROM users;Query Performance and Optimization
Writing a query that returns the right answer is necessary but not sufficient. At scale, performance matters.
Use EXPLAIN ANALYZE to see what the database actually does:
EXPLAIN ANALYZESELECT c.name, SUM(o.amount)FROM customers cJOIN orders o ON c.customer_id = o.customer_idWHERE o.order_date >= '2025-01-01'GROUP BY c.customer_id, c.nameORDER BY SUM(o.amount) DESC;Look for:
- Seq Scan on large tables — usually means a missing index
- Hash Join vs Nested Loop — hash join is better for large result sets
- High rows estimated vs rows actual discrepancy — stale statistics
Common performance anti-patterns:
-- BAD: function on indexed column prevents index useWHERE YEAR(order_date) = 2025
-- GOOD: range comparison uses the indexWHERE order_date >= '2025-01-01' AND order_date < '2026-01-01'
-- BAD: leading wildcard prevents index useWHERE name LIKE '%smith%'
-- GOOD: leading literal uses indexWHERE name LIKE 'smith%'
-- BAD: SELECT * when you need 2 columns from a 50-column tableSELECT * FROM orders WHERE status = 'pending'
-- GOOD: only fetch what you needSELECT order_id, amount FROM orders WHERE status = 'pending'Pivoting Data
SQL doesn’t have native PIVOT syntax in PostgreSQL, but conditional aggregation achieves the same result:
-- Sales by product, pivoted to months as columnsSELECT product_name, SUM(CASE WHEN EXTRACT(MONTH FROM sale_date) = 1 THEN amount END) AS jan, SUM(CASE WHEN EXTRACT(MONTH FROM sale_date) = 2 THEN amount END) AS feb, SUM(CASE WHEN EXTRACT(MONTH FROM sale_date) = 3 THEN amount END) AS mar, SUM(CASE WHEN EXTRACT(MONTH FROM sale_date) = 4 THEN amount END) AS aprFROM salesWHERE EXTRACT(YEAR FROM sale_date) = 2025GROUP BY product_name;Snowflake and SQL Server have native PIVOT operators; BigQuery uses PIVOT as well (added in 2021).
Set Operations
UNION, INTERSECT, and EXCEPT (or MINUS in Oracle) operate on entire result sets:
-- All customers and suppliers in one listSELECT name, 'customer' AS type FROM customersUNIONSELECT company_name, 'supplier' AS type FROM suppliers;
-- Customers who are also registered as suppliersSELECT email FROM customersINTERSECTSELECT contact_email FROM suppliers;
-- Customers who have NOT placed an order in 2025SELECT customer_id FROM customersEXCEPTSELECT DISTINCT customer_id FROM orders WHERE order_date >= '2025-01-01';UNION deduplicates (like DISTINCT). UNION ALL keeps all rows including duplicates — and is faster because it skips the deduplication step.
Date and Time Operations
Date math is database-specific. Common patterns:
-- PostgreSQLSELECT NOW() AS current_timestamp, CURRENT_DATE AS today, CURRENT_DATE - INTERVAL '30 days' AS thirty_days_ago, DATE_TRUNC('week', order_date) AS week_start, EXTRACT(DOW FROM order_date) AS day_of_week, -- 0=Sunday, 6=Saturday order_date + INTERVAL '1 month' AS next_monthFROM orders;
-- Calculate age / days between datesSELECT customer_id, CURRENT_DATE - first_order_date AS days_since_first_order, AGE(last_order_date, first_order_date) AS order_lifespanFROM customer_summary;The jump from basic SQL to advanced SQL is largely a matter of knowing these tools exist and practicing them on real data. Window functions and CTEs will change how you think about analytical problems — once you know them, you’ll find uses for them constantly.