SQL DISTINCT
DISTINCT eliminates duplicate rows from query results. It’s used in SELECT to return only unique combinations of the selected columns.
Basic Usage
-- All unique city valuesSELECT DISTINCT city FROM customers;
-- All unique status valuesSELECT DISTINCT status FROM orders;Without DISTINCT, these queries return one row per customer or order — with many duplicates if multiple customers are in the same city.
DISTINCT Applies to the Entire Row
DISTINCT evaluates the combination of all selected columns, not just the first one:
-- All unique city + country combinationsSELECT DISTINCT city, countryFROM customersORDER BY country, city;
-- A row is only excluded if BOTH city AND country are identical to another rowThere’s no way to apply DISTINCT to only one column in a multi-column SELECT while keeping all other columns — use GROUP BY or a window function for that.
COUNT DISTINCT
Count unique values in a column:
SELECT COUNT(*) AS total_orders, COUNT(DISTINCT customer_id) AS unique_customers, COUNT(DISTINCT product_id) AS unique_products_sold, COUNT(DISTINCT DATE_TRUNC('day', order_date)) AS active_daysFROM ordersWHERE status = 'completed';COUNT(DISTINCT column) ignores NULLs — it counts only unique non-null values.
DISTINCT vs GROUP BY
DISTINCT and GROUP BY both deduplicate, but they serve different purposes:
-- DISTINCT: just deduplicate (no aggregation)SELECT DISTINCT city FROM customers;
-- GROUP BY: deduplicate AND aggregateSELECT city, COUNT(*) AS customer_countFROM customersGROUP BY city;
-- Both return unique cities, but GROUP BY adds the countWhen you only need unique values (no aggregation), DISTINCT is simpler. When you need aggregates, use GROUP BY.
Performance: On large datasets with good indexes, both are similar in cost. Without indexes, both require sorting or hashing to find duplicates.
DISTINCT in Practice: Deduplicating Data
A common use case is finding what distinct values exist in a dimension column:
-- What categories do we have?SELECT DISTINCT category FROM products ORDER BY category;
-- What countries are our customers from?SELECT DISTINCT country FROM customers ORDER BY country;
-- What status values exist in our orders?SELECT DISTINCT status FROM orders ORDER BY status;When DISTINCT Masks a Problem
DISTINCT is sometimes used to “fix” a query that returns duplicates because of a bad join:
-- This returns duplicates if a customer has multiple ordersSELECT DISTINCT c.customer_id, c.nameFROM customers cJOIN orders o ON c.customer_id = o.customer_id;
-- DISTINCT hides the root cause: this should be:SELECT customer_id, name FROM customersWHERE customer_id IN (SELECT customer_id FROM orders);-- Or: check if the join is correctIf you’re adding DISTINCT to remove duplicates and aren’t sure why duplicates appear, investigate the join logic first — DISTINCT may be masking a deeper issue.
DISTINCT ON (PostgreSQL)
PostgreSQL’s DISTINCT ON keeps one row per group of the specified column(s), choosing which row by the ORDER BY clause. This is a PostgreSQL extension — not standard SQL:
-- Get the most recent order for each customerSELECT DISTINCT ON (customer_id) customer_id, order_id, amount, created_atFROM ordersORDER BY customer_id, created_at DESC;DISTINCT ON (customer_id) keeps the first row per customer in the ORDER BY sequence. Since we order by created_at DESC, we get the most recent order.
The equivalent with a window function (standard SQL):
SELECT customer_id, order_id, amount, created_atFROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) AS rn FROM orders) tWHERE rn = 1;DISTINCT in Subqueries
-- Products that have been sold (without duplicates from multiple orders)SELECT DISTINCT product_id FROM order_items;
-- Count of products ever soldSELECT COUNT(DISTINCT product_id) AS products_sold FROM order_items;
-- Customers who ordered in more than one monthSELECT customer_idFROM ordersGROUP BY customer_idHAVING COUNT(DISTINCT DATE_TRUNC('month', order_date)) > 1;Performance Considerations
DISTINCT requires comparing all selected columns across all rows to find duplicates. This involves either:
- Sorting: Sort all rows, then remove consecutive duplicates
- Hashing: Hash rows into buckets, discard duplicates
Both are expensive on large result sets. If you find yourself using DISTINCT on a large table frequently:
- Make sure an index exists on the columns
- Consider whether the query structure could return fewer duplicates before
DISTINCT - Check if
EXISTSor a JOIN restructure could avoid the deduplication entirely