Introduction to NoSQL Databases
NoSQL (“Not Only SQL”) describes databases that store and retrieve data using models other than the relational table-and-row structure. They typically trade some relational guarantees (joins, ACID transactions, strict schemas) for flexibility, horizontal scalability, and performance on specific access patterns.
The important framing: NoSQL isn’t a replacement for SQL. It’s a set of different tools for different problems. Many production systems use both.
Why NoSQL Emerged
Relational databases were built in the 1970s for structured business data. The constraints they impose — fixed schemas, rows and columns, vertical scaling — became pain points at internet scale:
- Schema rigidity: Adding a field to a table with 100 million rows can require a slow or table-locking ALTER TABLE, depending on the engine and version
- Horizontal scaling: Relational databases scale up (bigger servers) more naturally than out (more servers)
- Semi-structured data: JSON documents with variable fields don’t map cleanly to fixed columns
- High write throughput: The coordination behind ACID guarantees (locking, durable commits) comes at the cost of write speed
NoSQL databases target these specific problems, often by relaxing one or more of the relational constraints above.
The Four Main NoSQL Types
1. Document Stores
Store data as self-contained documents — typically JSON or BSON. Each document can have a different structure, and documents are grouped into collections (analogous to tables).
Examples: MongoDB, CouchDB, Firestore, Amazon DocumentDB

    {
      "_id": "prod_8821",
      "name": "Mechanical Keyboard",
      "category": "electronics",
      "price": 149.99,
      "specs": {
        "switch_type": "Cherry MX Blue",
        "layout": "TKL",
        "backlit": true
      },
      "tags": ["gaming", "mechanical", "rgb"],
      "inventory": { "us": 142, "uk": 38, "de": 0 }
    }

Good for: Product catalogs, user profiles, CMS content, event data. In short, anything with variable structure per entity.
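To make "each document can have a different structure" concrete, here is a minimal Python sketch that treats a list of dicts as a collection and filters on a nested field. The helpers `get_path` and `find`, and the second product, are hypothetical illustrations and not the API of MongoDB or any other document store.

```python
from typing import Any, Callable

# Hypothetical in-memory "collection": a list of dicts standing in for documents.
products = [
    {
        "_id": "prod_8821",
        "name": "Mechanical Keyboard",
        "category": "electronics",
        "price": 149.99,
        "specs": {"switch_type": "Cherry MX Blue", "layout": "TKL", "backlit": True},
        "tags": ["gaming", "mechanical", "rgb"],
        "inventory": {"us": 142, "uk": 38, "de": 0},
    },
    # Made-up second document with a different shape: no specs, no tags, an extra field.
    {
        "_id": "prod_9001",
        "name": "Databases in Practice",
        "category": "books",
        "price": 39.0,
        "author": "A. Writer",
        "inventory": {"us": 12},
    },
]


def get_path(doc: dict, path: str, default: Any = None) -> Any:
    """Resolve a dotted path such as 'inventory.us' against a nested document."""
    current: Any = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def find(collection: list, path: str, predicate: Callable[[Any], bool]) -> list:
    """Return documents whose value at `path` exists and satisfies `predicate`."""
    missing = object()  # sentinel so a stored None is not confused with "absent"
    results = []
    for doc in collection:
        value = get_path(doc, path, missing)
        if value is not missing and predicate(value):
            results.append(doc)
    return results


# Documents that lack the field are skipped rather than raising an error.
in_stock_uk = find(products, "inventory.uk", lambda qty: qty > 0)
backlit = find(products, "specs.backlit", lambda flag: flag is True)
print([doc["_id"] for doc in in_stock_uk])
```

The point of the sketch is that a missing field is an ordinary case to handle at read time, which is the price of not enforcing a schema at write time.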
2. Key-Value Stores
The simplest NoSQL model: a key maps to an opaque value. Extremely fast reads and writes. In the pure model, the database doesn’t interpret the value’s structure; Redis extends it with typed values such as hashes and lists.
Examples: Redis, DynamoDB (in its simplest usage), Memcached

    SET session:user_12345 '{"user_id": 12345, "role": "admin", "expires": "2025-07-01"}'
    GET session:user_12345
    TTL session:user_12345

    HSET user:12345 name "Alice" email "alice@example.com"
    LPUSH queue:jobs '{"task": "send_email", "to": "user@example.com"}'

Good for: Session caching, rate limiting, leaderboards, message queues, real-time counters. Redis is widely used as a cache in high-traffic web applications.
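The session use case depends on keys that expire by themselves. The following is a toy Python sketch of a key-value store with per-key expiry, loosely mirroring SET, GET, and TTL. The `TTLStore` class and its injectable `clock` are assumptions made for illustration, not how Redis is implemented; one simplification is that `ttl` returns `None` where Redis would return -1 (no expiry) or -2 (no such key).

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLStore:
    """Toy key-value store with per-key expiry (hypothetical, in-memory only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # key -> (opaque value, absolute expiry time or None for "no expiry")
        self._data: Dict[str, Tuple[str, Optional[float]]] = {}

    def set(self, key: str, value: str, ttl_seconds: Optional[float] = None) -> None:
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazy expiry: evict the key when it is read
            return None
        return value

    def ttl(self, key: str) -> Optional[float]:
        """Seconds remaining; None if the key is missing, expired, or has no expiry."""
        if self.get(key) is None:
            return None
        expires_at = self._data[key][1]
        return None if expires_at is None else expires_at - self._clock()


# Usage with a fake clock so the example does not need to sleep.
now = [0.0]
store = TTLStore(clock=lambda: now[0])
store.set("session:user_12345", '{"user_id": 12345, "role": "admin"}', ttl_seconds=1800)
remaining = store.ttl("session:user_12345")

now[0] = 1801.0  # jump past the expiry time
expired = store.get("session:user_12345")
print(remaining, expired)
```

The store never parses the session JSON: the value stays an opaque string, which is what keeps the model simple and fast.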
3. Column-Family Stores (Wide Column)
Store data in rows, but each row can have different columns. Designed for massive write throughput and time-series workloads.
Examples: Apache Cassandra, HBase, Google Bigtable

    -- Cassandra CQL
    CREATE TABLE sensor_readings (
      sensor_id UUID,
      recorded_at TIMESTAMP,
      temperature FLOAT,
      humidity FLOAT,
      PRIMARY KEY (sensor_id, recorded_at)
    ) WITH CLUSTERING ORDER BY (recorded_at DESC);

    SELECT * FROM sensor_readings
    WHERE sensor_id = ?
    LIMIT 1000;

Good for: IoT sensor data, time-series metrics, activity feeds, audit logs. These are high-write-volume workloads queried by partition key.
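The primary key in the table above has two parts: `sensor_id` selects a partition, and `recorded_at` orders rows inside it. Here is a rough Python sketch of that layout. The `SensorReadings` class is hypothetical, it uses integer epoch seconds in place of TIMESTAMP, and it keeps rows ascending and reverses them on read, whereas the CQL table stores them newest-first.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Tuple

# (recorded_at epoch seconds, temperature, humidity)
Row = Tuple[int, float, float]


class SensorReadings:
    """Toy model of a partition key plus a clustering key (not Cassandra's storage engine)."""

    def __init__(self) -> None:
        # partition key (sensor_id) -> rows sorted by clustering key (recorded_at)
        self._partitions: Dict[str, List[Row]] = defaultdict(list)

    def insert(self, sensor_id: str, recorded_at: int,
               temperature: float, humidity: float) -> None:
        # Keep each partition ordered so "latest N" reads are a contiguous slice.
        bisect.insort(self._partitions[sensor_id], (recorded_at, temperature, humidity))

    def latest(self, sensor_id: str, limit: int) -> List[Row]:
        """Newest-first rows from one partition, similar in spirit to DESC order plus LIMIT."""
        rows = self._partitions.get(sensor_id, [])
        return rows[::-1][:limit]


table = SensorReadings()
table.insert("sensor-a", 100, 21.5, 40.0)
table.insert("sensor-a", 300, 22.1, 41.0)  # arrives out of order
table.insert("sensor-a", 200, 21.9, 40.5)
table.insert("sensor-b", 150, 18.0, 55.0)

newest_two = table.latest("sensor-a", limit=2)
print(newest_two)
```

Every read names exactly one partition, which is why the access pattern has to be known when the table is designed.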
4. Graph Databases
Store data as nodes (entities) and edges (relationships). Optimized for queries that traverse relationships at depth.
Examples: Neo4j, Amazon Neptune, ArangoDB

    // Find mutual friends between Alice and Bob (Neo4j Cypher)
    MATCH (alice:User {name: "Alice"})-[:FRIENDS_WITH]->(friend:User)
          <-[:FRIENDS_WITH]-(bob:User {name: "Bob"})
    RETURN friend.name AS mutual_friend;

Good for: Social networks, recommendation engines, fraud detection, knowledge graphs.
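For intuition about what the mutual-friends pattern computes, here is a small Python sketch over an adjacency-set graph. The data and the function names `mutual_friends` and `friends_of_friends` are made up for illustration; a graph database would answer these queries through its own storage and query planner rather than Python sets.

```python
from typing import Dict, Set

# Directed FRIENDS_WITH edges: user -> users they point at (hypothetical data).
friends_with: Dict[str, Set[str]] = {
    "Alice": {"Carol", "Dave", "Erin"},
    "Bob": {"Dave", "Erin", "Frank"},
    "Carol": {"Alice"},
    "Dave": {"Frank"},
}


def mutual_friends(graph: Dict[str, Set[str]], a: str, b: str) -> Set[str]:
    """Users that both `a` and `b` have an outgoing FRIENDS_WITH edge to."""
    return graph.get(a, set()) & graph.get(b, set())


def friends_of_friends(graph: Dict[str, Set[str]], start: str) -> Set[str]:
    """Two-hop traversal: users reachable via a friend, excluding `start` and direct friends."""
    direct = graph.get(start, set())
    two_hop: Set[str] = set()
    for friend in direct:
        two_hop |= graph.get(friend, set())
    return two_hop - direct - {start}


print(sorted(mutual_friends(friends_with, "Alice", "Bob")))
print(sorted(friends_of_friends(friends_with, "Alice")))
```

Each extra hop is another loop over neighbours here, and another self-join in a relational schema, which is the kind of query graph databases are built to make cheap.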
NoSQL vs SQL: When to Use Which
Use SQL (relational) when:
- Data is structured with well-defined relationships
- You need ACID transactions across multiple tables
- Queries are complex and vary (reporting, ad-hoc analysis)
Use NoSQL when:
- Schema is flexible or evolves rapidly
- You need to scale writes horizontally
- Access patterns are simple and known upfront
- Working with document, graph, or time-series data
- Extreme read/write performance is required (caching, sessions)
Consistency Trade-offs: CAP Theorem
The CAP theorem says a distributed data store cannot guarantee all three of consistency, availability, and partition tolerance at the same time:
- C (Consistency): all nodes see the same data at the same time
- A (Availability): every request gets a response (not necessarily the most recent data)
- P (Partition Tolerance): the system keeps working despite network splits
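A toy simulation can show how these properties pull against each other once the network splits. The sketch below models two replicas and two policies: one that prefers consistency (commonly labelled CP) and refuses writes it cannot replicate, and one that prefers availability (AP) and accepts them locally. The `Cluster` class is a deliberate simplification for illustration, not a real replication protocol; actual systems use quorums, leader election, and conflict resolution.

```python
from typing import Any, Dict, List, Optional


class Replica:
    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}


class Cluster:
    """Two replicas with a 'CP' or 'AP' policy (hypothetical toy model)."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas: List[Replica] = [Replica(), Replica()]
        self.partitioned = False  # True while the replicas cannot reach each other

    def write(self, replica_index: int, key: str, value: Any) -> bool:
        """Return True if the write was accepted."""
        if not self.partitioned:
            for replica in self.replicas:  # healthy network: replicate everywhere
                replica.data[key] = value
            return True
        if self.mode == "CP":
            return False  # give up availability: refuse what cannot be replicated
        self.replicas[replica_index].data[key] = value  # give up consistency
        return True

    def read(self, replica_index: int, key: str) -> Optional[Any]:
        return self.replicas[replica_index].data.get(key)


cp, ap = Cluster("CP"), Cluster("AP")
for cluster in (cp, ap):
    cluster.write(0, "stock", 5)  # replicated to both nodes
    cluster.partitioned = True    # the network now splits

cp_accepted = cp.write(0, "stock", 4)
ap_accepted = ap.write(0, "stock", 4)
cp_views = (cp.read(0, "stock"), cp.read(1, "stock"))
ap_views = (ap.read(0, "stock"), ap.read(1, "stock"))
print(cp_accepted, cp_views, ap_accepted, ap_views)
```

In this model the consistency-first cluster rejects the write and both replicas still agree, while the availability-first cluster accepts it and the replicas disagree until the partition heals and they reconcile.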
Network partitions cannot be ruled out in a distributed system, so P is effectively required and the real trade-off during a partition is C vs A:
- CP databases: HBase, ZooKeeper, MongoDB (default)
- AP databases: Cassandra, CouchDB, DynamoDB (default)
SQL and NoSQL Together
Many production systems use both, each for the workload it handles best:
| Component | Technology | Reason |
|---|---|---|
| User sessions | Redis (key-value) | Fast reads, TTL expiry |
| Product catalog | MongoDB (document) | Variable product attributes |
| Orders, inventory | PostgreSQL (relational) | ACID transactions required |
| Analytics | Snowflake / BigQuery (warehouse) | Complex SQL queries |
| Activity feed | Cassandra (wide column) | High write throughput |
NoSQL in Data Engineering
Data engineers encounter NoSQL in two main contexts:
As sources: NoSQL databases (MongoDB, DynamoDB) are common operational sources that need ingesting into a warehouse. Change data capture (CDC) via Debezium or AWS DMS is a standard pipeline pattern.
As infrastructure: Redis for pipeline state caching, Kafka (a log-structured store) for event streaming. Core SQL skills still transfer — Snowflake, BigQuery, and Redshift are relational and use standard SQL for analytics.
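For the "as sources" case, a CDC pipeline boils down to replaying change events against a target table. The Python sketch below assumes a simplified event envelope with `op`, `before`, and `after` fields, loosely modeled on Debezium-style change events; real connector payloads differ by source and configuration. The helpers `flatten` and `apply_change` and the sample events are hypothetical.

```python
from typing import Any, Dict, List


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into column-like keys: {'a': {'b': 1}} -> {'a_b': 1}.

    Lists are left as-is; a real pipeline would need a policy for them.
    """
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


def apply_change(table: Dict[str, Dict[str, Any]], event: Dict[str, Any]) -> None:
    """Apply one change event to `table`, keyed by the document's _id."""
    op = event["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update: all become upserts
        after = event["after"]
        table[after["_id"]] = flatten(after)
    elif op == "d":  # delete: assumes the envelope carries the key in `before`
        table.pop(event["before"]["_id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


# Made-up event stream for two product documents.
events: List[Dict[str, Any]] = [
    {"op": "c", "before": None,
     "after": {"_id": "prod_8821", "price": 149.99, "inventory": {"us": 142}}},
    {"op": "u", "before": None,
     "after": {"_id": "prod_8821", "price": 129.99, "inventory": {"us": 140}}},
    {"op": "c", "before": None,
     "after": {"_id": "prod_9001", "price": 39.0, "inventory": {"us": 12}}},
    {"op": "d", "before": {"_id": "prod_9001"}, "after": None},
]

warehouse: Dict[str, Dict[str, Any]] = {}
for event in events:
    apply_change(warehouse, event)
print(warehouse)
```

Treating creates and updates as the same upsert keeps the replay idempotent per key, which helps when a pipeline redelivers events after a restart.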