Database Fundamentals¶

From AWS Re/Start (August 2025) and apprenticeship coursework.

Ultra-Short Summary¶

Databases organise data for efficient storage and retrieval. The two main families are relational (structured tables, SQL, ACID transactions) and non-relational (flexible schemas, optimised for specific access patterns). Knowing when to use which — and why — is one of the most recurring decision points in cloud architecture.

Relational vs Non-Relational¶

Relational (SQL)
  → Data in tables with rows and columns
  → Relationships via foreign keys
  → Schema defined upfront
  → ACID transactions (Atomicity, Consistency, Isolation, Durability)
  → Best for: structured data, complex queries, financial records

Non-Relational (NoSQL)
  → Flexible schema — each record can have different fields
  → Optimised for specific access patterns
  → Trades ACID for speed/scale in most cases
  → Types: key-value, document, wide-column, graph
  → Best for: high-throughput, variable structure, caching

SQL Fundamentals¶

-- Basic query
SELECT name, email FROM users WHERE created_at > '2025-01-01';

-- Join two tables
SELECT orders.id, users.name
FROM orders
JOIN users ON orders.user_id = users.id;

-- Insert
INSERT INTO users (name, email) VALUES ('Cain', 'cain@example.com');

-- Update
UPDATE users SET email = 'new@email.com' WHERE id = 42;

-- Delete
DELETE FROM users WHERE id = 42;

ACID Transactions¶

The guarantee that makes relational databases safe for financial and critical data:

Property	What It Means	Example
Atomicity	All-or-nothing — if any part fails, the whole transaction rolls back	Bank transfer: debit AND credit must both succeed
Consistency	Database always goes from one valid state to another	A constraint (e.g. balance >= 0) is never violated
Isolation	Concurrent transactions don't interfere with each other	Two people booking the last seat can't both succeed
Durability	Once committed, data survives crashes	Written to disk, not just memory

Database Types¶

Type	Structure	AWS Service	Best For
Relational	Tables, SQL	RDS, Aurora	Transactions, reporting, structured data
Key-Value	Key → Value	DynamoDB, ElastiCache	Session data, caching, leaderboards
Document	JSON-like documents	DynamoDB, DocumentDB	User profiles, product catalogs
Wide-Column	Rows with many columns	Keyspaces (Cassandra)	Time-series, IoT
Graph	Nodes and edges	Neptune	Social networks, fraud detection
In-Memory	RAM-based	ElastiCache (Redis/Memcached)	Caching, session state, real-time
Data Warehouse	Analytics, columnar	Redshift	Business intelligence, large-scale queries
Time-Series	Timestamped metrics	Timestream	IoT, monitoring, metrics

The Hashing Connection¶

Hashing appears in databases too:

Password storage:
  NEVER store plaintext password
  Hash it: bcrypt("password123") → $2b$12$7HVD...
  On login: hash the input, compare hashes

Index hashing:
  Hash indexes for fast equality lookups
  B-tree indexes for range queries

Content addressing:
  S3 ETags = MD5 hash of the object
  Used to check if content changed

AWS Database Services¶

RDS (Relational Database Service)¶

Managed relational database — AWS handles patching, backups, failover.

Supported engines: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, Db2

Key features:

Multi-AZ — synchronous standby replica in another AZ for HA and automatic failover
Read Replicas — asynchronous copies for read scaling (up to 15 for Aurora)
Automated backups — daily snapshot + transaction logs → point-in-time recovery
Encryption — at rest (KMS) and in transit (SSL/TLS)

Aurora¶

AWS's own relational engine — MySQL and PostgreSQL compatible but designed for cloud:

Aurora architecture:
  Shared distributed storage layer (6 copies across 3 AZs)
  Up to 15 read replicas
  Writer endpoint → points to primary
  Reader endpoint → load-balances across read replicas
  Auto-scales storage from 10GB to 128TB
  Up to 5x faster than MySQL RDS

Aurora Serverless v2 — scales capacity automatically (no fixed instance size).

DynamoDB¶

AWS's fully managed NoSQL service — key-value and document:

Table → Items (like rows) → Attributes (like columns, but flexible)
Every item needs a Partition Key
Optional Sort Key for range queries within a partition

Key concepts:

Concept	What It Is
Partition Key	Determines which partition stores the item (choose high-cardinality values)
Sort Key	Secondary key for ordering within a partition
Global Secondary Index (GSI)	Query on non-key attributes
DAX	In-memory cache for DynamoDB — microsecond response
DynamoDB Streams	Change data capture — trigger Lambda on record changes

ElastiCache¶

Managed in-memory caching:

Engine	Best For
Redis	Persistent cache, pub/sub, sorted sets, geospatial, Lua scripts
Memcached	Simple cache, multi-threaded, no persistence

Pattern: App queries ElastiCache first → if cache miss → query RDS/DynamoDB → store result in cache.

Redshift¶

Data warehouse for analytics — columnar storage, petabyte scale, SQL interface.

OLTP (transactions) → RDS / Aurora / DynamoDB
OLAP (analytics)    → Redshift

Don't use RDS for "how much revenue did we make last year across all orders"
— use Redshift or Athena (S3 query engine)

The Sakila Database¶

A sample MySQL database used throughout Re/Start for SQL practice. Models a fictional DVD rental company: film, actor, customer, rental, payment tables with realistic relationships.

Good for practicing: JOINs, aggregations, subqueries, stored procedures.

Mental Model¶

Relational DB = spreadsheet with rules
  → Rows are related across tables via IDs
  → Transactions ensure nothing is "half-done"
  → Good when data structure is known and stable

NoSQL DB = optimised filing cabinet
  → Fast because it's designed for your specific access pattern
  → Flexible because schema is not enforced
  → Bad for complex queries across many different attributes

Cache = whiteboard by your desk
  → Fastest possible access
  → Temporary — wiped when power goes off (or TTL expires)
  → Only store what you read frequently

SAA Patterns¶

Scenario	Answer
Transactional app, need ACID	RDS (MySQL/PostgreSQL) or Aurora
Best performance relational DB, MySQL compatible	Aurora
Serverless NoSQL, millisecond latency	DynamoDB
Reduce RDS read load	RDS Read Replicas
Session storage, sub-millisecond	ElastiCache Redis
Analytics on terabytes of data	Redshift
RDS Multi-AZ vs Read Replicas	Multi-AZ = HA/failover, Read Replica = read scaling
Social graph queries	Neptune (graph)

Self-Quiz¶

What does ACID stand for and why does "Atomicity" matter for a bank transfer?
What's the difference between RDS Multi-AZ and Read Replicas?
When would you choose DynamoDB over RDS?
Why do you never store plaintext passwords? What do you store instead?
What's the Partition Key in DynamoDB and why should it be high-cardinality?
What's the difference between OLTP and OLAP workloads?
Why is columnar storage faster for analytics than row storage?
What does ElastiCache solve that RDS can't on its own?