SQL DISTINCT

DISTINCT eliminates duplicate rows from query results. It’s used in SELECT to return only unique combinations of the selected columns.

Basic Usage

-- All unique city values
SELECT DISTINCT city FROM customers;

-- All unique status values
SELECT DISTINCT status FROM orders;

Without DISTINCT, these queries return one row per customer or order — with many duplicates if multiple customers are in the same city.

DISTINCT Applies to the Entire Row

DISTINCT evaluates the combination of all selected columns, not just the first one:

-- All unique city + country combinations
SELECT DISTINCT city, country
FROM customers
ORDER BY country, city;

-- A row is only excluded if BOTH city AND country are identical to another row

There’s no way to apply DISTINCT to only one column in a multi-column SELECT while keeping all other columns — use GROUP BY or a window function for that.

COUNT DISTINCT

Count unique values in a column:

SELECT
    COUNT(*) AS total_orders,
    COUNT(DISTINCT customer_id) AS unique_customers,
    COUNT(DISTINCT product_id) AS unique_products_sold,
    COUNT(DISTINCT DATE_TRUNC('day', order_date)) AS active_days
FROM orders
WHERE status = 'completed';

COUNT(DISTINCT column) ignores NULLs — it counts only unique non-null values.

DISTINCT vs GROUP BY

DISTINCT and GROUP BY both deduplicate, but they serve different purposes:

-- DISTINCT: just deduplicate (no aggregation)
SELECT DISTINCT city FROM customers;

-- GROUP BY: deduplicate AND aggregate
SELECT city, COUNT(*) AS customer_count
FROM customers
GROUP BY city;

-- Both return unique cities, but GROUP BY adds the count

When you only need unique values (no aggregation), DISTINCT is simpler. When you need aggregates, use GROUP BY.

Performance: On large datasets with good indexes, both are similar in cost. Without indexes, both require sorting or hashing to find duplicates.

DISTINCT in Practice: Deduplicating Data

A common use case is finding what distinct values exist in a dimension column:

-- What categories do we have?
SELECT DISTINCT category FROM products ORDER BY category;

-- What countries are our customers from?
SELECT DISTINCT country FROM customers ORDER BY country;

-- What status values exist in our orders?
SELECT DISTINCT status FROM orders ORDER BY status;

When DISTINCT Masks a Problem

DISTINCT is sometimes used to “fix” a query that returns duplicates because of a bad join:

-- This returns duplicates if a customer has multiple orders
SELECT DISTINCT c.customer_id, c.name
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;

-- DISTINCT hides the root cause: this should be:
SELECT customer_id, name FROM customers
WHERE customer_id IN (SELECT customer_id FROM orders);
-- Or: check if the join is correct

If you’re adding DISTINCT to remove duplicates and aren’t sure why duplicates appear, investigate the join logic first — DISTINCT may be masking a deeper issue.

DISTINCT ON (PostgreSQL)

PostgreSQL’s DISTINCT ON keeps one row per group of the specified column(s), choosing which row by the ORDER BY clause. This is a PostgreSQL extension — not standard SQL:

-- Get the most recent order for each customer
SELECT DISTINCT ON (customer_id)
    customer_id,
    order_id,
    amount,
    created_at
FROM orders
ORDER BY customer_id, created_at DESC;

DISTINCT ON (customer_id) keeps the first row per customer in the ORDER BY sequence. Since we order by created_at DESC, we get the most recent order.

The equivalent with a window function (standard SQL):

SELECT customer_id, order_id, amount, created_at
FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) AS rn
    FROM orders
) t
WHERE rn = 1;

DISTINCT in Subqueries

-- Products that have been sold (without duplicates from multiple orders)
SELECT DISTINCT product_id FROM order_items;

-- Count of products ever sold
SELECT COUNT(DISTINCT product_id) AS products_sold FROM order_items;

-- Customers who ordered in more than one month
SELECT customer_id
FROM orders
GROUP BY customer_id
HAVING COUNT(DISTINCT DATE_TRUNC('month', order_date)) > 1;

Performance Considerations

DISTINCT requires comparing all selected columns across all rows to find duplicates. This involves either:

Sorting: Sort all rows, then remove consecutive duplicates
Hashing: Hash rows into buckets, discard duplicates

Both are expensive on large result sets. If you find yourself using DISTINCT on a large table frequently:

Make sure an index exists on the columns
Consider whether the query structure could return fewer duplicates before DISTINCT
Check if EXISTS or a JOIN restructure could avoid the deduplication entirely