Skip to content

Database Fundamentals

From AWS Re/Start (August 2025) and apprenticeship coursework.


Ultra-Short Summary

Databases organise data for efficient storage and retrieval. The two main families are relational (structured tables, SQL, ACID transactions) and non-relational (flexible schemas, optimised for specific access patterns). Knowing when to use which — and why — is one of the most recurring decision points in cloud architecture.


Relational vs Non-Relational

Relational (SQL)
  → Data in tables with rows and columns
  → Relationships via foreign keys
  → Schema defined upfront
  → ACID transactions (Atomicity, Consistency, Isolation, Durability)
  → Best for: structured data, complex queries, financial records

Non-Relational (NoSQL)
  → Flexible schema — each record can have different fields
  → Optimised for specific access patterns
  → Trades ACID for speed/scale in most cases
  → Types: key-value, document, wide-column, graph
  → Best for: high-throughput, variable structure, caching

SQL Fundamentals

-- Basic query
SELECT name, email FROM users WHERE created_at > '2025-01-01';

-- Join two tables
SELECT orders.id, users.name
FROM orders
JOIN users ON orders.user_id = users.id;

-- Insert
INSERT INTO users (name, email) VALUES ('Cain', 'cain@example.com');

-- Update
UPDATE users SET email = 'new@email.com' WHERE id = 42;

-- Delete
DELETE FROM users WHERE id = 42;

ACID Transactions

The guarantee that makes relational databases safe for financial and critical data:

Property What It Means Example
Atomicity All-or-nothing — if any part fails, the whole transaction rolls back Bank transfer: debit AND credit must both succeed
Consistency Database always goes from one valid state to another A constraint (e.g. balance >= 0) is never violated
Isolation Concurrent transactions don't interfere with each other Two people booking the last seat can't both succeed
Durability Once committed, data survives crashes Written to disk, not just memory

Database Types

Type Structure AWS Service Best For
Relational Tables, SQL RDS, Aurora Transactions, reporting, structured data
Key-Value Key → Value DynamoDB, ElastiCache Session data, caching, leaderboards
Document JSON-like documents DynamoDB, DocumentDB User profiles, product catalogs
Wide-Column Rows with many columns Keyspaces (Cassandra) Time-series, IoT
Graph Nodes and edges Neptune Social networks, fraud detection
In-Memory RAM-based ElastiCache (Redis/Memcached) Caching, session state, real-time
Data Warehouse Analytics, columnar Redshift Business intelligence, large-scale queries
Time-Series Timestamped metrics Timestream IoT, monitoring, metrics

The Hashing Connection

Hashing appears in databases too:

Password storage:
  NEVER store plaintext password
  Hash it: bcrypt("password123") → $2b$12$7HVD...
  On login: hash the input, compare hashes

Index hashing:
  Hash indexes for fast equality lookups
  B-tree indexes for range queries

Content addressing:
  S3 ETags = MD5 hash of the object
  Used to check if content changed

AWS Database Services

RDS (Relational Database Service)

Managed relational database — AWS handles patching, backups, failover.

Supported engines: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, Db2

Key features:

  • Multi-AZ — synchronous standby replica in another AZ for HA and automatic failover
  • Read Replicas — asynchronous copies for read scaling (up to 15 for Aurora)
  • Automated backups — daily snapshot + transaction logs → point-in-time recovery
  • Encryption — at rest (KMS) and in transit (SSL/TLS)

Aurora

AWS's own relational engine — MySQL and PostgreSQL compatible but designed for cloud:

Aurora architecture:
  Shared distributed storage layer (6 copies across 3 AZs)
  Up to 15 read replicas
  Writer endpoint → points to primary
  Reader endpoint → load-balances across read replicas
  Auto-scales storage from 10GB to 128TB
  Up to 5x faster than MySQL RDS

Aurora Serverless v2 — scales capacity automatically (no fixed instance size).

DynamoDB

AWS's fully managed NoSQL service — key-value and document:

Table → Items (like rows) → Attributes (like columns, but flexible)
Every item needs a Partition Key
Optional Sort Key for range queries within a partition

Key concepts:

Concept What It Is
Partition Key Determines which partition stores the item (choose high-cardinality values)
Sort Key Secondary key for ordering within a partition
Global Secondary Index (GSI) Query on non-key attributes
DAX In-memory cache for DynamoDB — microsecond response
DynamoDB Streams Change data capture — trigger Lambda on record changes

ElastiCache

Managed in-memory caching:

Engine Best For
Redis Persistent cache, pub/sub, sorted sets, geospatial, Lua scripts
Memcached Simple cache, multi-threaded, no persistence

Pattern: App queries ElastiCache first → if cache miss → query RDS/DynamoDB → store result in cache.

Redshift

Data warehouse for analytics — columnar storage, petabyte scale, SQL interface.

OLTP (transactions) → RDS / Aurora / DynamoDB
OLAP (analytics)    → Redshift

Don't use RDS for "how much revenue did we make last year across all orders"
— use Redshift or Athena (S3 query engine)

The Sakila Database

A sample MySQL database used throughout Re/Start for SQL practice. Models a fictional DVD rental company: film, actor, customer, rental, payment tables with realistic relationships.

Good for practicing: JOINs, aggregations, subqueries, stored procedures.


Mental Model

Relational DB = spreadsheet with rules
  → Rows are related across tables via IDs
  → Transactions ensure nothing is "half-done"
  → Good when data structure is known and stable

NoSQL DB = optimised filing cabinet
  → Fast because it's designed for your specific access pattern
  → Flexible because schema is not enforced
  → Bad for complex queries across many different attributes

Cache = whiteboard by your desk
  → Fastest possible access
  → Temporary — wiped when power goes off (or TTL expires)
  → Only store what you read frequently

SAA Patterns

Scenario Answer
Transactional app, need ACID RDS (MySQL/PostgreSQL) or Aurora
Best performance relational DB, MySQL compatible Aurora
Serverless NoSQL, millisecond latency DynamoDB
Reduce RDS read load RDS Read Replicas
Session storage, sub-millisecond ElastiCache Redis
Analytics on terabytes of data Redshift
RDS Multi-AZ vs Read Replicas Multi-AZ = HA/failover, Read Replica = read scaling
Social graph queries Neptune (graph)

Self-Quiz

  1. What does ACID stand for and why does "Atomicity" matter for a bank transfer?
  2. What's the difference between RDS Multi-AZ and Read Replicas?
  3. When would you choose DynamoDB over RDS?
  4. Why do you never store plaintext passwords? What do you store instead?
  5. What's the Partition Key in DynamoDB and why should it be high-cardinality?
  6. What's the difference between OLTP and OLAP workloads?
  7. Why is columnar storage faster for analytics than row storage?
  8. What does ElastiCache solve that RDS can't on its own?