Azure Cosmos DB: Multi-Model Globally Distributed Database With 99.999% SLA

Azure Cosmos DB is Microsoft’s planet-scale, fully managed database service. It replicates data synchronously or asynchronously across any number of Azure regions, serves reads and writes from the region nearest to the client, and carries an SLA of under 10 milliseconds at the 99th percentile for point reads and writes. The 99.999% availability SLA for multi-region write configurations is among the strongest availability commitments in the cloud database market.

Cosmos DB is not a single database type. It exposes multiple wire-compatible APIs, so teams can use familiar drivers and query languages: SQL queries over JSON documents (the API for NoSQL, formerly the SQL API), MongoDB drivers, Cassandra CQL, Gremlin for graphs, and the Azure Table Storage API. All of these sit on the same underlying ARS (Atom-Record-Sequence) storage engine.

Real-World Scenario

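An e-commerce platform operates in North America, Europe, and Asia Pacific. Product catalogue reads happen from all three regions; writes originate from a central operations team. The team configures Cosmos DB with three regional replicas (East US, West Europe, Southeast Asia) using Session consistency, with East US as the single write region. Users in Tokyo read from the Southeast Asia replica, the closest of the three, so the database itself answers in under 10 ms and only the network hop to that region adds latency. Product updates from the operations team write to East US and replicate to all regions within seconds. When East US has a brief outage, service-managed failover (if enabled on the account) promotes another region to take writes while reads continue from the surviving replicas. Note that the 99.999% SLA covers reads in this single-write-region setup; the same figure for writes requires enabling multi-region writes.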

Global Distribution Architecture

Cosmos DB Global Distribution
-------------------------------
[Write Region: East US]
      |
  Replication (async or sync depending on consistency)
      |
+-----+------+
|            |
[Read Region: West Europe]   [Read Region: Southeast Asia]

Client in Tokyo -> resolves to Southeast Asia endpoint
Client in London -> resolves to West Europe endpoint
Client in New York -> resolves to East US endpoint

Multi-region writes (multi-master):
  All three regions accept writes simultaneously
  Conflicts resolved by LWW (Last Write Wins) or custom policy

Adding a new region is a portal toggle or CLI command — Cosmos DB provisions the replica and backfills data automatically.
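
The Last Write Wins policy named in the diagram can be illustrated with a small, self-contained sketch. This is not the service's implementation: the `resolve_lww` helper and the sample documents are hypothetical, and the sketch only assumes that each conflicting version carries a numeric `_ts` timestamp and that the highest value wins.

```python
from typing import Iterable


def resolve_lww(versions: Iterable[dict], path: str = "_ts") -> dict:
    """Return the conflicting version with the highest value at `path`.

    Hypothetical helper: mimics a Last Write Wins policy in plain Python.
    Ties keep the first version seen, which is an arbitrary choice here.
    """
    candidates = list(versions)
    if not candidates:
        raise ValueError("no versions to resolve")
    return max(candidates, key=lambda doc: doc[path])


# Two regions updated the same product at nearly the same time.
east_us = {"id": "SKU-1", "price": 19.99, "_ts": 1700000100, "region": "East US"}
west_europe = {"id": "SKU-1", "price": 17.99, "_ts": 1700000105, "region": "West Europe"}

winner = resolve_lww([east_us, west_europe])
print(winner["region"], winner["price"])
```

The losing write is simply discarded in this model, which is why Last Write Wins suits data where the newest value is always acceptable and merge logic is not needed.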

Consistency Levels

Cosmos DB offers five consistency levels, trading between data freshness and performance/cost:

Strongest          STRONG
                     |  All reads guaranteed to see latest write
                     |  Write latency: about 2x round-trip time between the two farthest regions
                     |
                   BOUNDED STALENESS
                     |  Reads lag behind writes by at most K versions or T seconds
                     |  Good for globally consistent reads with bounded lag
                     |
                   SESSION  (default, most popular)
                     |  Within a single client session: reads see own writes
                     |  Best balance of consistency and performance
                     |
                   CONSISTENT PREFIX
                     |  Reads never see out-of-order writes
                     |  No guarantee on how far behind they are
                     |
Weakest            EVENTUAL
                     |  No ordering or freshness guarantee
                     |  Highest throughput, lowest cost

Session consistency is the right choice for most OLTP applications. Each client session sees its own writes immediately (read-your-own-writes guarantee), while different clients may briefly see stale data.
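
The read-your-own-writes guarantee can be pictured with a toy model. The sketch below is an illustration only, loosely modeled on the idea of a session token: the `Replica` and `SessionClient` classes are hypothetical, sequence numbers stand in for the token, and real replication and routing are far more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy replica: a key-value store plus the last sequence number applied."""
    name: str
    lsn: int = 0
    data: dict = field(default_factory=dict)

    def apply(self, lsn: int, key: str, value: str) -> None:
        self.data[key] = value
        self.lsn = lsn


@dataclass
class SessionClient:
    """Toy client that remembers the sequence number of its latest write."""
    primary: Replica
    nearby: Replica
    session_token: int = 0

    def write(self, key: str, value: str) -> None:
        new_lsn = self.primary.lsn + 1
        self.primary.apply(new_lsn, key, value)
        self.session_token = new_lsn  # remember how fresh our reads must be

    def read(self, key: str):
        # Use the nearby replica only if it has caught up to our own writes.
        caught_up = self.nearby.lsn >= self.session_token
        source = self.nearby if caught_up else self.primary
        return source.name, source.data.get(key)


primary = Replica("East US")
nearby = Replica("West Europe")
writer = SessionClient(primary, nearby)
other = SessionClient(primary, nearby)  # a different session, no token yet

writer.write("ORD-1", "confirmed")
own_read = writer.read("ORD-1")        # falls back to the up-to-date replica
stale_read = other.read("ORD-1")       # other session may see stale data
nearby.apply(1, "ORD-1", "confirmed")  # replication catches up
later_read = writer.read("ORD-1")      # now served by the nearby replica
print(own_read, stale_read, later_read)
```

The writer never observes a state older than its own write, while the second session, which holds no token, briefly reads the lagging replica.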

Request Units (RU/s)

Cosmos DB abstracts compute as Request Units. One RU is approximately the cost of a point read: fetching a 1 KB document by its id and partition key. Writes cost more (roughly 5-10x a read), and cross-partition queries can cost hundreds of RUs.

RU Cost Examples (approximate)
--------------------------------
Point read (1 KB document)           1 RU
Point write (1 KB document)          5 RU
Query with index hit (10 results)   10-20 RU
Cross-partition query (1000 results) 100-500 RU
Stored procedure (complex)          50-200 RU
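
As a rough worked example, the approximate costs in the table can be turned into a throughput estimate. The sketch below is back-of-the-envelope arithmetic, not a sizing tool: the `required_ru_per_second` helper and the workload numbers are hypothetical, and rounding up to a multiple of 100 RU/s is an assumption about how manual throughput is usually set.

```python
import math

# Approximate per-operation costs taken from the table above (1 KB items).
RU_COST = {"point_read": 1, "point_write": 5, "indexed_query": 20}


def required_ru_per_second(ops_per_second: dict) -> int:
    """Sum cost * rate per operation type, rounded up to the next 100 RU/s."""
    total = sum(RU_COST[op] * rate for op, rate in ops_per_second.items())
    return math.ceil(total / 100) * 100


# Hypothetical steady-state workload, in operations per second.
workload = {"point_read": 400, "point_write": 50, "indexed_query": 10}
print(required_ru_per_second(workload))  # 400*1 + 50*5 + 10*20 = 850 -> 900
```

Real charges depend on item size, indexing policy, and query shape, so measured request charges from the service should replace these table values once they are available.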

Throughput can be provisioned manually at the database or container level, or set to autoscale, which scales between 10% and 100% of a configured maximum RU/s and bills each hour for the highest RU/s reached in that hour.
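
The autoscale billing rule can be sketched as a small function. This is a simplified model under stated assumptions: `billable_ru` and the sample values are hypothetical, the hour is billed on its peak RU/s, and the result is clamped between 10% of the maximum and the maximum itself. Prices are deliberately left out.

```python
def billable_ru(max_ru: int, ru_samples: list) -> int:
    """RU/s billed for one hour under autoscale (simplified model).

    Assumes billing on the highest RU/s reached in the hour, never below
    10% of the configured maximum and never above the maximum itself.
    """
    floor = max_ru // 10
    peak = max(ru_samples, default=0)
    return min(max_ru, max(floor, peak))


MAX_RU = 4000                     # hypothetical autoscale maximum
quiet_hour = [50, 120, 90]        # stays under the 10% floor
busy_hour = [700, 2600, 1800]     # peaks at 2600 RU/s
print(billable_ru(MAX_RU, quiet_hour), billable_ru(MAX_RU, busy_hour))
```

A single short spike sets the bill for the whole hour in this model, which is worth keeping in mind when comparing autoscale against manual throughput for spiky workloads.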

Partitioning

Every Cosmos DB container has a partition key. All data with the same partition key value lives in the same logical partition. Physical partitions group multiple logical partitions and are managed automatically.

Container: Orders
Partition Key: /customerId

Logical Partition "CUST-001"
  {"id": "ORD-A1", "customerId": "CUST-001", "total": 49.99}
  {"id": "ORD-A2", "customerId": "CUST-001", "total": 120.00}

Logical Partition "CUST-002"
  {"id": "ORD-B1", "customerId": "CUST-002", "total": 75.50}

Physical partitions (managed by Cosmos DB):
  [P1: CUST-001, CUST-005, CUST-009...]
  [P2: CUST-002, CUST-006, CUST-010...]
  ...

Good partition key:
  - High cardinality (many unique values)
  - Evenly distributed writes
  - Appears in most queries (avoids cross-partition fan-out)

A poor partition key choice (e.g., a boolean flag) creates hot partitions where one value receives most of the traffic. The maximum logical partition size is 20 GB.
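
The effect of key cardinality on load distribution can be seen with a short simulation. It is a sketch only: the SHA-256 hash with a modulo is a stand-in chosen for illustration and is not the hashing scheme Cosmos DB uses, and the function names and the count of four partitions are hypothetical.

```python
import hashlib
from collections import Counter


def physical_partition(key_value: str, partition_count: int) -> int:
    """Map a partition key value to a partition with a stable hash (illustrative)."""
    digest = hashlib.sha256(key_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def load_per_partition(key_values, partition_count: int = 4) -> Counter:
    """Count how many writes land on each partition."""
    return Counter(physical_partition(v, partition_count) for v in key_values)


# 10,000 writes keyed by customer id (2,000 distinct values) ...
by_customer = load_per_partition(f"CUST-{i % 2000:04d}" for i in range(10_000))
# ... versus the same number of writes keyed by a boolean flag (two values).
by_flag = load_per_partition(str(i % 10 == 0) for i in range(10_000))

print("customerId:", sorted(by_customer.values()))
print("boolean flag:", sorted(by_flag.values()))
```

With the boolean key, at most two partitions ever receive traffic and one of them takes at least 90% of the writes, however many partitions exist; the customer id spreads the same writes across all of them.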

Change Feed

The change feed is a log of inserts and updates to a Cosmos DB container, ordered within each logical partition (order across different partition key values is not guaranteed). Every write is appended to the feed and can be read by one or more consumers. Deletes are not captured by default (use soft delete with a TTL field to work around this).

Change Feed Use Cases
----------------------
Materialized views:
  Product updates in ProductCatalog container
  -> Change feed consumer reads changes
  -> Writes denormalised view to SearchIndex container

Event sourcing:
  All order state changes appended to change feed
  -> Multiple microservices consume independently
  -> Each builds its own projection (email service, analytics, inventory)

Cache invalidation:
  Price updates in ProductDB
  -> Change feed consumer detects price change
  -> Invalidates corresponding Redis cache keys

The change feed is consumed via Azure Functions (the Cosmos DB trigger), the change feed processor in the .NET and Java SDKs, or the pull model exposed by the SDKs (for example, query_items_change_feed in the Python SDK).
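
The materialized-view and soft-delete patterns can be sketched without any Azure dependency. In the toy model below a plain Python list stands in for the change feed and an integer index stands in for the consumer's checkpoint; the `ViewBuilder` class and the `deleted` marker field are hypothetical names, and a real consumer would read changes through one of the mechanisms listed above.

```python
from dataclasses import dataclass, field


@dataclass
class ViewBuilder:
    """Toy change feed consumer that maintains a denormalised view."""
    checkpoint: int = 0                      # index of the next unread change
    view: dict = field(default_factory=dict)

    def poll(self, feed: list) -> int:
        """Apply changes after the checkpoint; return how many were processed."""
        pending = feed[self.checkpoint:]
        for doc in pending:
            if doc.get("deleted"):
                # Soft delete: the marker arrives as an ordinary update.
                self.view.pop(doc["id"], None)
            else:
                self.view[doc["id"]] = {"name": doc["name"], "price": doc["price"]}
        self.checkpoint += len(pending)
        return len(pending)


feed = [
    {"id": "P1", "name": "Phone X", "price": 999},
    {"id": "P2", "name": "Case", "price": 19},
    {"id": "P1", "name": "Phone X", "price": 899},  # price update
]
consumer = ViewBuilder()
consumer.poll(feed)

# Soft delete: flag the item and give it a TTL so the store expires it later.
feed.append({"id": "P2", "name": "Case", "price": 19, "deleted": True, "ttl": 3600})
consumer.poll(feed)
print(consumer.view)
```

Because the consumer tracks its own checkpoint, several independent consumers could read the same feed at different speeds, which is the property the event-sourcing use case relies on.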

Working With Cosmos DB (Python, SQL API)

from azure.cosmos import CosmosClient, PartitionKey, exceptions

endpoint = "https://myaccount.documents.azure.com:443/"
key      = "<primary_key>"

client    = CosmosClient(endpoint, key)
db        = client.create_database_if_not_exists("ecommerce")
container = db.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/customerId"),
    offer_throughput=1000   # 1000 RU/s provisioned
)

# Insert a document
order = {
    "id": "ORD-5001",
    "customerId": "CUST-042",
    "items": [{"sku": "PHONE-X", "qty": 1, "price": 999}],
    "total": 999,
    "status": "confirmed"
}
container.upsert_item(order)

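# Point read (cheapest operation: about 1 RU for a 1 KB item)
item = container.read_item(item="ORD-5001", partition_key="CUST-042")
print(item["status"])

# Cross-partition query (more expensive: it fans out to every partition)
# The value is passed as a parameter instead of being inlined in the query text.
for confirmed in container.query_items(
    query="SELECT * FROM c WHERE c.status = @status",
    parameters=[{"name": "@status", "value": "confirmed"}],
    enable_cross_partition_query=True
):
    print(confirmed["id"])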

Key Interview Points

RU/s is not the same as IOPS: RU/s accounts for document size, indexing cost, and query complexity, not just I/O operations. A point read of a 1 KB document costs about 1 RU regardless of whether the disk did one I/O or many.
Consistency changes at runtime: Unlike most databases where consistency is a deployment choice, Cosmos DB allows setting consistency per request using override headers — weaker than the account default, never stronger.
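Multi-master conflict resolution: In multi-region write mode, two clients in different regions can write to the same document simultaneously. Last-Write-Wins compares the _ts (server timestamp) field by default, or another numeric property you designate, and keeps the version with the highest value. A custom policy, registered as a merge stored procedure, allows application-defined merge logic.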
TTL (Time-to-Live): Set defaultTtl on a container to auto-expire documents after N seconds. Useful for session data, temporary data, and implementing soft-delete patterns visible in the change feed.
Analytical store (HTAP): Cosmos DB supports an analytical store — a column-oriented copy of the data maintained automatically, separate from the transactional store — usable by Synapse Link for zero-ETL analytics without impacting OLTP performance.
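
The per-request override rule from the consistency point above (weaker than the account default is allowed, stronger is not) reduces to an ordering check. The sketch below is a hypothetical client-side helper, not an SDK call; the level names are written in a camel-case style chosen for this example.

```python
# Consistency levels ordered from strongest to weakest.
LEVELS = ["Strong", "BoundedStaleness", "Session", "ConsistentPrefix", "Eventual"]


def is_allowed_override(account_default: str, requested: str) -> bool:
    """True if `requested` is the same as or weaker than the account default."""
    return LEVELS.index(requested) >= LEVELS.index(account_default)


print(is_allowed_override("Session", "Eventual"))  # weaker than default
print(is_allowed_override("Session", "Strong"))    # stronger than default
```

A check like this can catch a misconfigured request early, before it is sent, instead of relying on the service to reject or ignore it.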

Best Practices

Choose a partition key with high cardinality and uniform write distribution — the choice is permanent and changing it requires a data migration.
Enable autoscale throughput rather than manual provisioned RU/s for workloads with unpredictable traffic patterns.
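Use point reads (read_item) instead of queries whenever you know the id and partition key. A point read of a 1 KB item costs about 1 RU, while an equivalent query costs several RUs and a cross-partition query can cost tens to hundreds.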
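Monitor normalized RU consumption per partition in Azure Monitor. One partition running consistently near 100% while the others stay low indicates a hot partition that needs a different key design; uniformly high values across all partitions point to under-provisioned throughput instead.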
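For high-throughput imports, use the SDK’s bulk support (the successor to the bulk executor library in the .NET and Java SDKs); it batches operations to make full use of provisioned throughput rather than lowering the RU charge per document. For read-heavy workloads with repeated lookups, the integrated cache (served through a dedicated gateway) can cut RU consumption because cache hits are not charged RUs.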