Introduction to NoSQL Databases

NoSQL (“Not Only SQL”) describes databases that store and retrieve data using models other than the relational table-and-row structure. They typically trade some relational guarantees (joins, ACID transactions, strict schemas) for flexibility, horizontal scalability, and performance on specific access patterns.

The important framing: NoSQL isn’t a replacement for SQL. It’s a set of different tools for different problems. Many production systems use both.

Why NoSQL Emerged

Relational databases were built in the 1970s for structured business data. The constraints they impose — fixed schemas, rows and columns, vertical scaling — became pain points at internet scale:

Schema rigidity: Adding a field to a table with 100 million rows can require an ALTER TABLE that locks or rewrites the table, depending on the engine and the kind of change
Horizontal scaling: Relational databases scale up (bigger servers) more naturally than out (more servers)
Semi-structured data: JSON documents with variable fields don’t map cleanly to fixed columns
High write throughput: ACID guarantees add coordination overhead (locking, synchronous logging) that can limit write speed

NoSQL databases solve these specific problems — often by relaxing one or more of these constraints.

The Four Main NoSQL Types

1. Document Stores

Store data as self-contained documents — typically JSON or BSON. Each document can have a different structure, and documents are grouped into collections (analogous to tables).

Examples: MongoDB, CouchDB, Firestore, Amazon DocumentDB

{
  "_id": "prod_8821",
  "name": "Mechanical Keyboard",
  "category": "electronics",
  "price": 149.99,
  "specs": {
    "switch_type": "Cherry MX Blue",
    "layout": "TKL",
    "backlit": true
  },
  "tags": ["gaming", "mechanical", "rgb"],
  "inventory": { "us": 142, "uk": 38, "de": 0 }
}

Good for: Product catalogs, user profiles, CMS content, event data — anything with variable structure per entity.

2. Key-Value Stores

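The simplest NoSQL model: a key maps to a value that the database treats as opaque. Reads and writes by key are extremely fast. In the pure model the database doesn’t understand the value’s structure, although some stores (Redis, for example) add typed values such as hashes and lists.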

Examples: Redis, DynamoDB (in its simplest usage), Memcached

SET session:user_12345 '{"user_id": 12345, "role": "admin", "expires": "2025-07-01"}' EX 3600
GET session:user_12345
TTL session:user_12345

HSET user:12345 name "Alice" email "alice@example.com"
LPUSH queue:jobs '{"task": "send_email", "to": "user@example.com"}'

Good for: Session caching, rate limiting, leaderboards, message queues, real-time counters. Redis is one of the most widely used caching layers for high-traffic web applications.
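
The set/get/expire behavior that makes key-value stores a natural fit for sessions can be sketched in a few lines of Python. `TinyKV` is a hypothetical in-memory toy, not a client for Redis or any other store. It assumes lazy expiry (keys are dropped when next read) and takes an injectable clock so the example does not need to sleep.

```python
import time
from typing import Callable, Optional


class TinyKV:
    """Toy in-memory key-value store with per-key expiry (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # key -> (opaque value, absolute expiry time or None)
        self._data: dict[str, tuple[str, Optional[float]]] = {}

    def set(self, key: str, value: str, ex: Optional[float] = None) -> None:
        """Store an opaque value; `ex` is an optional lifetime in seconds."""
        expires_at = None if ex is None else self._clock() + ex
        self._data[key] = (value, expires_at)

    def get(self, key: str) -> Optional[str]:
        """Return the value, or None if the key is missing or expired."""
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazy expiry on access
            return None
        return value

    def ttl(self, key: str) -> Optional[float]:
        """Seconds remaining, or None if the key is missing or has no expiry."""
        if self.get(key) is None:
            return None
        expires_at = self._data[key][1]
        return None if expires_at is None else expires_at - self._clock()


# Usage with a fake clock so time can be advanced by hand.
now = [0.0]
store = TinyKV(clock=lambda: now[0])
store.set("session:user_12345", '{"role": "admin"}', ex=3600)
print(store.get("session:user_12345"))  # session is still live
now[0] = 3601.0
print(store.get("session:user_12345"))  # session has expired
```

The store never inspects the stored string, which is the "opaque value" property: any structure inside it is the application's concern.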

3. Column-Family Stores (Wide Column)

Store data in rows, but each row can have different columns. Designed for massive write throughput and time-series workloads.

Examples: Apache Cassandra, HBase, Google Bigtable

-- Cassandra CQL
CREATE TABLE sensor_readings (
    sensor_id UUID,
    recorded_at TIMESTAMP,
    temperature FLOAT,
    humidity FLOAT,
    PRIMARY KEY (sensor_id, recorded_at)
) WITH CLUSTERING ORDER BY (recorded_at DESC);

SELECT * FROM sensor_readings
WHERE sensor_id = ? LIMIT 1000;

Good for: IoT sensor data, time-series metrics, activity feeds, audit logs — high write volume, query by a primary partition key.
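
Why "query by partition key" is cheap can be illustrated with a toy Python model: rows are grouped by partition key and kept sorted by a clustering column, so reading the newest rows for one sensor is a lookup plus a slice. `SensorTable` and `Reading` are hypothetical names, timestamps are simplified to integers, and the real storage engines behind these databases are far more involved.

```python
import bisect
from collections import defaultdict
from typing import NamedTuple


class Reading(NamedTuple):
    recorded_at: int      # epoch seconds; plays the role of the clustering column
    temperature: float
    humidity: float


class SensorTable:
    """Toy model: partition key -> rows kept sorted by the clustering column."""

    def __init__(self) -> None:
        self._partitions: dict[str, list[Reading]] = defaultdict(list)

    def insert(self, sensor_id: str, reading: Reading) -> None:
        rows = self._partitions[sensor_id]
        # Keep rows newest-first by searching on the negated timestamp.
        # Rebuilding `keys` is O(n); acceptable for a sketch, not for production.
        keys = [-r.recorded_at for r in rows]
        rows.insert(bisect.bisect_left(keys, -reading.recorded_at), reading)

    def latest(self, sensor_id: str, limit: int) -> list[Reading]:
        """One partition lookup plus a slice; no other partition is touched."""
        return self._partitions.get(sensor_id, [])[:limit]


table = SensorTable()
table.insert("sensor-a", Reading(100, 21.5, 40.0))
table.insert("sensor-a", Reading(300, 22.1, 41.0))
table.insert("sensor-a", Reading(200, 21.9, 40.5))
table.insert("sensor-b", Reading(250, 18.0, 55.0))
print([r.recorded_at for r in table.latest("sensor-a", limit=2)])
```

A query that filters on something other than the partition key (for example, all readings above a temperature) would have to visit every partition, which is why access patterns need to be known upfront for this model.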

4. Graph Databases

Store data as nodes (entities) and edges (relationships). Optimized for queries that traverse relationships at depth.

Examples: Neo4j, Amazon Neptune, ArangoDB

// Find mutual friends between Alice and Bob (Neo4j Cypher)
MATCH (alice:User {name: "Alice"})-[:FRIENDS_WITH]->(friend:User)
      <-[:FRIENDS_WITH]-(bob:User {name: "Bob"})
RETURN friend.name AS mutual_friend;

Good for: Social networks, recommendation engines, fraud detection, knowledge graphs.
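
The node-and-edge model can be sketched in Python with adjacency sets. This is a minimal illustration with made-up sample data, and `add_friendship` and `mutual_friends` are hypothetical helpers. A graph database adds persistence, indexing, and a query planner on top of the same basic idea.

```python
from collections import defaultdict

# Directed FRIENDS_WITH edges, stored as adjacency sets (hypothetical sample data).
friends_with: dict[str, set[str]] = defaultdict(set)


def add_friendship(src: str, dst: str) -> None:
    """Record a directed edge src -[:FRIENDS_WITH]-> dst."""
    friends_with[src].add(dst)


for src, dst in [("Alice", "Carol"), ("Alice", "Dave"),
                 ("Bob", "Carol"), ("Bob", "Erin")]:
    add_friendship(src, dst)


def mutual_friends(a: str, b: str) -> set[str]:
    """Users that both `a` and `b` point to via FRIENDS_WITH."""
    return friends_with.get(a, set()) & friends_with.get(b, set())


print(sorted(mutual_friends("Alice", "Bob")))
```

Each hop is a direct lookup from a node to its neighbors, so the cost depends on how many edges are traversed rather than on the total size of the dataset.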

NoSQL vs SQL: When to Use Which

Use SQL (relational) when:
  - Data is structured with well-defined relationships
  - You need ACID transactions across multiple tables
  - Queries are complex and vary (reporting, ad-hoc analysis)

Use NoSQL when:
  - Schema is flexible or evolves rapidly
  - You need to scale writes horizontally
  - Access patterns are simple and known upfront
  - You are working with document, graph, or time-series data
  - Extreme read/write performance is required (caching, sessions)

Consistency Trade-offs: CAP Theorem

The CAP theorem states that a distributed system cannot guarantee all three of consistency, availability, and partition tolerance at once:

C — Consistency: all nodes see the same data at the same time
A — Availability: every request gets a response (not necessarily current)
P — Partition Tolerance: system works despite network splits

Network partitions cannot be ruled out in a distributed system, so the real trade-off is C vs A while a partition lasts:

CP databases: HBase, ZooKeeper, MongoDB (default)
AP databases: Cassandra, CouchDB, DynamoDB (default)
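
The C-versus-A choice can be made concrete with a deliberately simplified Python model of two replicas. `TwoReplicaStore` and `Unavailable` are hypothetical names, and the model ignores quorums, leader election, and conflict resolution. It only shows what each mode gives up while the replicas cannot talk to each other.

```python
class Unavailable(Exception):
    """Raised when a CP-style store refuses a request during a partition."""


class TwoReplicaStore:
    """Toy two-replica register; `mode` is 'CP' or 'AP' (illustration only)."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.partitioned = False
        self.replicas = [None, None]  # value held by replica 0 and replica 1

    def write(self, replica: int, value: str) -> None:
        if not self.partitioned:
            self.replicas = [value, value]   # healthy network: both sides updated
        elif self.mode == "CP":
            raise Unavailable("cannot reach the other replica")
        else:
            self.replicas[replica] = value   # AP: accept locally, replicas diverge

    def read(self, replica: int):
        if self.partitioned and self.mode == "CP":
            raise Unavailable("cannot confirm the latest value")
        return self.replicas[replica]


ap = TwoReplicaStore("AP")
ap.write(0, "v1")
ap.partitioned = True
ap.write(0, "v2")
print(ap.read(0), ap.read(1))  # both reads succeed, but the replicas disagree

cp = TwoReplicaStore("CP")
cp.write(0, "v1")
cp.partitioned = True
try:
    cp.write(0, "v2")
except Unavailable as exc:
    print("CP store rejected the write:", exc)
```

The AP store keeps answering but returns a stale value from one side, while the CP store stays consistent by refusing requests until the partition heals.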

SQL and NoSQL Together

Most production systems use both — each serving its strongest use case:

Component	Technology	Reason
User sessions	Redis (key-value)	Fast reads, TTL expiry
Product catalog	MongoDB (document)	Variable product attributes
Orders, inventory	PostgreSQL (relational)	ACID transactions required
Analytics	Snowflake / BigQuery	Complex SQL queries
Activity feed	Cassandra	High write throughput

NoSQL in Data Engineering

Data engineers encounter NoSQL in two main contexts:

As sources: NoSQL databases (MongoDB, DynamoDB) are common operational sources that need ingesting into a warehouse. Change data capture (CDC) via Debezium or AWS DMS is a standard pipeline pattern.
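
As a rough sketch of what the consuming end of a CDC pipeline does, the Python below replays change events into a keyed copy of the source data. The event shape (`op`, `before`, `after`) is a simplified assumption loosely modeled on common CDC envelopes. Actual payloads vary by tool, source database, and configuration, and `apply_change` is a hypothetical helper.

```python
from typing import Any


def apply_change(table: dict[str, dict[str, Any]], event: dict[str, Any]) -> None:
    """Apply one change event to `table`, keyed by the document's `_id`."""
    op = event["op"]
    if op in ("c", "r", "u"):      # create, snapshot read, update -> upsert
        row = event["after"]
        table[row["_id"]] = row
    elif op == "d":                # delete -> remove if present
        table.pop(event["before"]["_id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


# Hypothetical event stream, in the order the source emitted it.
events = [
    {"op": "c", "before": None, "after": {"_id": "u1", "plan": "free"}},
    {"op": "c", "before": None, "after": {"_id": "u2", "plan": "free"}},
    {"op": "u", "before": {"_id": "u1", "plan": "free"},
     "after": {"_id": "u1", "plan": "pro"}},
    {"op": "d", "before": {"_id": "u2", "plan": "free"}, "after": None},
]

warehouse_copy: dict[str, dict[str, Any]] = {}
for event in events:
    apply_change(warehouse_copy, event)

print(warehouse_copy)
```

Because creates and updates are applied as upserts and deletes tolerate a missing key, replaying the same ordered stream twice leaves the copy in the same state, which is useful when a pipeline restarts partway through.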

As infrastructure: Redis for pipeline state caching, Kafka (a distributed, append-only log) for event streaming. Core SQL skills still transfer: Snowflake, BigQuery, and Redshift are relational and use SQL for analytics.