AWS Solutions Architect — Domain 2: Design Resilient Architectures

Domain 2 Overview

What You Need to Know

Task 2.1 — Scalable & Loosely Coupled

Auto Scaling groups & launch templates
Horizontal vs. vertical scaling
SQS, SNS, EventBridge — decoupling patterns
Serverless: Lambda, Fargate, API Gateway
Step Functions for orchestration

Task 2.2 — Highly Available & Fault Tolerant

Elastic Load Balancing (ALB, NLB, CLB, GWLB)
Multi-AZ deployment patterns
Route 53 health checks & routing policies
Health checks & automatic failover
Global Accelerator for global HA

Task 2.3 — Decoupled Architectures

SQS queue types (Standard vs. FIFO)
SNS fan-out patterns
Kinesis for streaming data
EventBridge event-driven patterns
Async vs. sync communication patterns

Task 2.4 — Resilient Storage & Databases

S3 storage classes & lifecycle policies
EBS, EFS, FSx — block & file storage
RDS, Aurora, DynamoDB HA patterns
ElastiCache for read scaling
Backup, PITR, and snapshot strategies

Task 2.1 — Auto Scaling

EC2 Auto Scaling

Launch Templates · Scaling Policies · Lifecycle

ASG Components

Launch Template: AMI, instance type, SGs, user data — the "what to launch"
Min / Desired / Max: Capacity bounds for the group
AZ Distribution: Spread instances across ≥2 AZs for HA
Health Checks: EC2 status or ELB health checks trigger replacement
Lifecycle Hooks: Pause launch/terminate for custom actions

Scaling Policy Types

Target Tracking: Keep CPU at 50% — simplest, recommended
Step Scaling: Add N instances per alarm threshold breach
Simple Scaling: One action per alarm (cooldown period)
Scheduled Scaling: Predictable load (e.g., 9am–6pm weekdays)
Predictive Scaling: ML-based forecast + proactive scaling

Horizontal vs. Vertical Scaling

Horizontal (Scale Out): Add more instances. Requires stateless app design. Works with ELB. No downtime. Preferred for web tiers.
Vertical (Scale Up): Increase instance size. Requires downtime (stop/start). Has an upper limit. Use for databases that can't distribute horizontally.

Warm Pools

Pre-initialized instances waiting in a stopped state. Reduces cold-start latency for scale-out events. Useful for apps with long initialization.

Target Tracking is the recommended default policy — it adjusts scaling continuously, not just at alarm thresholds. Use Scheduled Scaling for known traffic patterns. Never rely on Simple Scaling alone.
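The proportional idea behind Target Tracking can be sketched in a few lines. This is a simplified model for building intuition, not the algorithm AWS actually runs; the function name, the round-up rule, and the sample numbers are all assumptions made for illustration.

```python
import math

def target_tracking_desired(current_capacity, metric_value, target_value,
                            min_size, max_size):
    """Simplified proportional model of target tracking (illustrative only).

    Scales capacity by metric/target so the per-instance average moves
    toward the target, then clamps the result to the group's min/max bounds.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = current_capacity * metric_value / target_value
    desired = math.ceil(raw)  # round up so the group is not under-provisioned
    return max(min_size, min(max_size, desired))

# 4 instances at 80% CPU with a 50% target: scale out
scale_out = target_tracking_desired(4, 80.0, 50.0, min_size=2, max_size=10)
# 4 instances at 20% CPU: scale in, but never below the minimum
scale_in = target_tracking_desired(4, 20.0, 50.0, min_size=2, max_size=10)
# 8 instances at 100% CPU: the maximum caps the result
capped = target_tracking_desired(8, 100.0, 50.0, min_size=2, max_size=10)
print(scale_out, scale_in, capped)
```

The clamping step is why Min / Desired / Max matter: a policy can only move Desired within the bounds set on the group.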

Auto Scaling Groups are fundamental to resilient architecture. Key exam concepts: Launch Templates are preferred over Launch Configurations — they support versioning, multiple instance types, and Spot instances. For the ASG to function across AZ failures, always span at least two AZs and let the ASG rebalance automatically. Health checks: by default ASG uses EC2 health checks (is the instance running?). Enable ELB health checks so the ASG also replaces instances that are running but returning errors (unhealthy at app level). Cooldown period (Simple Scaling): ASG waits before another scaling action. Target Tracking manages this automatically. Lifecycle Hooks are powerful: you can pause an instance entering "InService" to run custom configuration scripts, or pause a terminating instance to drain connections or collect logs.

Task 2.1 — Serverless

Serverless Compute Options

Lambda · Fargate · API Gateway · Step Functions

AWS Lambda: Event-driven functions. Max 15 min execution, up to 10 GB memory. Scales to thousands of concurrent executions automatically. Triggered by S3, SQS, DynamoDB streams, API Gateway, EventBridge, and more. Billed per request plus compute duration, metered in 1 ms increments.

AWS Fargate: Serverless container compute for ECS and EKS. No EC2 instances to manage. Define CPU/memory per task. Scales automatically. Use when workloads need containers but you want to avoid cluster management overhead.

Amazon API Gateway: Fully managed API layer. REST, HTTP, and WebSocket APIs. Integrates directly with Lambda, Step Functions, and AWS services. Handles throttling, auth (Cognito, Lambda authorizer), caching, and TLS termination. Regional or Edge-optimized deployments.

AWS Step Functions: Serverless workflow orchestration using state machines. Coordinates Lambda, ECS, DynamoDB, SNS, SQS, and more. Standard workflows (1-year max) for audit trails; Express workflows for high-volume, short-duration tasks. Handles retries, error catching, parallel branches.

Lambda concurrency limit is 1,000 per region by default (can be raised). Reserved concurrency guarantees capacity for critical functions; provisioned concurrency eliminates cold starts. Fargate is EC2-free containers; Lambda is function-level serverless.

Serverless options provide inherent scalability and resilience without managing infrastructure. Lambda key limits: 15-minute max execution time (not suitable for long-running jobs), 10 GB memory, 10 GB ephemeral storage (/tmp), 6 MB payload size for synchronous invocations. For Lambda cold starts: provisioned concurrency keeps execution environments warm — use for latency-sensitive APIs. Reserved concurrency prevents a function from consuming the entire account concurrency limit. API Gateway + Lambda is the canonical serverless API pattern. API Gateway can also integrate directly with DynamoDB, S3, SQS, and Step Functions without Lambda in the middle. Step Functions is the answer when a question mentions "orchestrate multiple services," "retry on failure," "conditional branching," or "audit trail of workflow execution." Don't confuse it with SQS — Step Functions orchestrates, SQS decouples.
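To make "retry on failure" and error catching concrete, here is a minimal sketch of a Step Functions definition in Amazon States Language, built as a Python dict. The state names, the Lambda ARNs, and the retry numbers are hypothetical placeholders, not values from these notes.

```python
import json

# Hypothetical Lambda ARNs; replace with real function ARNs.
CHARGE_ARN = "arn:aws:lambda:us-east-1:123456789012:function:charge-card"
NOTIFY_ARN = "arn:aws:lambda:us-east-1:123456789012:function:notify-failure"

state_machine = {
    "Comment": "Illustrative workflow with retry and error catching",
    "StartAt": "ChargeCard",
    "States": {
        "ChargeCard": {
            "Type": "Task",
            "Resource": CHARGE_ARN,
            # Retry failed attempts with exponential backoff
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 2,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            # Once retries are exhausted, branch to a failure handler
            "Catch": [{
                "ErrorEquals": ["States.ALL"],
                "Next": "NotifyFailure",
            }],
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": NOTIFY_ARN,
            "End": True,
        },
    },
}

def retry_delays(rule):
    """Seconds waited before each retry: interval * backoff^attempt."""
    return [rule["IntervalSeconds"] * rule["BackoffRate"] ** attempt
            for attempt in range(rule["MaxAttempts"])]

definition_json = json.dumps(state_machine, indent=2)
delays = retry_delays(state_machine["States"]["ChargeCard"]["Retry"][0])
print(delays)
```

The retry and catch logic lives in the workflow definition rather than in function code, which is what separates orchestration from simply queuing work in SQS.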

Task 2.2 — Load Balancing

Elastic Load Balancer Types

ALB · NLB · CLB · GWLB — choosing the right load balancer

Load Balancer	Layer	Protocols	Best For	Key Feature
ALB Application	Layer 7	HTTP, HTTPS, WebSocket, gRPC	Web apps, microservices, containers	Path & host-based routing, Lambda targets, WAF integration
NLB Network	Layer 4	TCP, UDP, TLS	Ultra-low latency, static IP, gaming, IoT	Static Elastic IP per AZ, millions of req/sec, PrivateLink
CLB Classic	L4 + L7	HTTP, HTTPS, TCP, SSL	Legacy applications only	Previous generation; migrate to ALB or NLB
GWLB Gateway	Layer 3	IP (GENEVE)	Third-party virtual appliances (firewalls, IDS)	Bump-in-the-wire; transparent traffic inspection

ALB for HTTP/S web traffic and microservices routing. NLB for TCP/UDP, static IPs, or extreme performance. GWLB for routing traffic through security appliances. CLB is legacy — always migrate away.

Load balancer selection is a high-frequency exam topic. The key discriminators: ALB: the default choice for web applications. Supports path-based routing (/api → service A, /web → service B), host-based routing (app.example.com → service A, api.example.com → service B), and can target Lambda functions directly. NLB: chosen when you need a static IP address that doesn't change (for whitelisting), ultra-low latency (sub-millisecond), or need to handle millions of concurrent connections. Also required to expose services via AWS PrivateLink. GWLB: relatively new, but appearing on exams. It transparently routes traffic to a fleet of virtual appliances (like Palo Alto or Check Point firewalls) for inspection, then sends traffic to its destination. The key word is "third-party virtual appliance." All ELBs: distribute traffic across targets in multiple AZs. Enable cross-zone load balancing (on by default for ALB) so traffic is distributed evenly across AZs regardless of instance count per AZ.

Task 2.2 — ALB

ALB Routing, Target Groups & Sticky Sessions

Advanced ALB patterns for microservices and containers

Routing Rules

Path-based: /api/* → API service, /images/* → media service
Host-based: app.example.com → app servers, api.example.com → API servers
Header-based: Route by HTTP header value
Query string: Route by URL query params
Source IP: Route by CIDR range

Target Group Types

Instances: EC2 instances (by ID)
IP addresses: Any IP (including on-premises)
Lambda: Invoke Lambda per request
ALB: An ALB as a target (NLB target groups only; lets an NLB front an ALB)

Sticky Sessions (Session Affinity)

ALB-generated cookie (AWSALB) or app-based cookie
Routes same user to same target for session duration
Can cause uneven load distribution
Avoid when possible — prefer stateless apps with ElastiCache for session state

Connection Draining (Deregistration Delay)

ALB waits for in-flight requests before deregistering a target
Default: 300 seconds (5 min)
Reduce for short-lived requests to speed up deployments and scale-in (does not apply to Lambda targets)
Ensures graceful scale-in without dropped connections

"Route different URL paths to different microservices behind one ALB" → path-based routing with multiple target groups. "Users lose session state when hitting different instances" → add ElastiCache for distributed session storage, not sticky sessions.

ALB routing rules are commonly tested in microservices architecture scenarios. Path-based routing is the canonical answer for "one ALB, multiple backend services." The listener evaluates its rules in priority order and forwards the request to the target group of the first rule whose conditions match; unmatched requests fall through to the default rule. Sticky sessions are a trap answer in many exam questions. The question will describe a stateful application that loses session data; the naive answer is "enable sticky sessions," but the architecturally correct answer is "move session state to ElastiCache Redis, then remove sticky sessions." This makes the app horizontally scalable. Connection draining lets in-flight requests finish during scale-in or deployments instead of being dropped. The 300-second default is appropriate for most web apps; lower it when requests are short-lived. The deregistration delay applies to instance and IP targets, not to Lambda targets.
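Priority-ordered rule evaluation can be sketched as a small simulation. This is a toy model, not the ALB implementation: the rule tuples, target group names, and hostnames are hypothetical, and Python's fnmatch is only an approximation of ALB wildcard matching.

```python
from fnmatch import fnmatchcase

# (priority, host pattern or None, path pattern or None, target group)
RULES = [
    (10, None, "/api/*", "tg-api"),
    (20, "admin.example.com", None, "tg-admin"),
    (30, None, "/images/*", "tg-media"),
]
DEFAULT_TARGET_GROUP = "tg-web"  # stands in for the listener's default rule

def route(host, path, rules=RULES, default=DEFAULT_TARGET_GROUP):
    """Return the target group of the first matching rule, lowest priority first."""
    for _priority, host_pat, path_pat, target_group in sorted(rules, key=lambda r: r[0]):
        host_ok = host_pat is None or fnmatchcase(host.lower(), host_pat)
        path_ok = path_pat is None or fnmatchcase(path, path_pat)
        if host_ok and path_ok:
            return target_group
    return default

print(route("app.example.com", "/api/orders"))
print(route("admin.example.com", "/dashboard"))
print(route("admin.example.com", "/api/users"))
print(route("app.example.com", "/"))
```

Note the third call: the request matches both the path rule and the host rule, and the lower priority number wins.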

Task 2.2 — Global Routing

Global Accelerator vs. CloudFront

Two very different global distribution services

AWS Global Accelerator

2 static anycast IP addresses globally
Routes traffic via AWS private backbone (not public internet)
Works with TCP & UDP (non-HTTP too)
Fast failover to a healthy endpoint (~30 sec)
Targets: ALB, NLB, EC2, Elastic IPs
No caching — pure routing acceleration
Best for: gaming, IoT, VoIP, non-HTTP workloads, static IP requirement

Amazon CloudFront

400+ edge locations worldwide
Content caching at the edge (CDN)
HTTP/S only
Reduces origin load by serving cached content
Integrates with WAF, Shield, ACM, S3
Custom cache behaviors by path
Best for: web content, API acceleration, static asset delivery, S3 distribution

CloudFront caches content at the edge — reduces origin load and latency for repeat requests. Global Accelerator does NOT cache — it just routes traffic faster via AWS backbone. Static IP requirement → Global Accelerator. HTTP caching → CloudFront.

This is one of the most commonly confused comparisons on the exam. Both improve global performance, but through completely different mechanisms. CloudFront is a CDN: it stores cached copies of content at edge locations. When a user requests a cached object, CloudFront serves it from the nearest edge without hitting the origin. Great for static assets, cacheable API responses, and S3 content. Global Accelerator doesn't cache anything. It takes user traffic in at the nearest AWS edge location and routes it over AWS's private global network to the closest healthy application endpoint. It reduces latency by avoiding the public internet's variable routing. The two static anycast IPs make it easy to whitelist in firewalls. Exam trigger phrases: "static IP addresses for a global application" → Global Accelerator. "Cache images and videos at edge locations" → CloudFront. "Accelerate TCP/UDP traffic globally" → Global Accelerator (CloudFront is HTTP/S only).

Task 2.3 — SQS

Amazon SQS — Standard vs. FIFO

The foundational decoupling service

Feature	Standard Queue	FIFO Queue
Throughput	Nearly unlimited	300 msg/sec; 3,000 with batching
Ordering	Best-effort (may vary)	Strict first-in, first-out
Delivery	At least once (duplicates possible)	Exactly-once processing (deduplication ID)
Deduplication	Consumer must handle duplicates	5-minute deduplication window
Message groups	Not supported	Message Group ID for parallel ordering
Use case	High throughput; order not critical	Financial transactions, order processing

Key SQS Parameters

Visibility Timeout: Time a message is hidden after being received (default 30s, max 12h). Extend if processing takes longer.
Message Retention: 1 min to 14 days (default 4 days)
Max Message Size: 256 KB (use S3 + pointer for larger)
Long Polling: Wait up to 20s for messages — reduces empty receives and cost

Dead Letter Queues (DLQ)

Catches messages that fail processing repeatedly
Configure maxReceiveCount threshold (e.g., 3 attempts)
Enables debugging without losing failed messages
DLQ must be same type as source queue (FIFO DLQ for FIFO queue)
Set up CloudWatch alarm on DLQ depth

SQS is the most tested messaging service on the SAA exam. Know these concepts deeply. Visibility Timeout is the most commonly tested parameter. When a consumer receives a message, it's hidden from other consumers for the visibility timeout period. If the consumer doesn't delete it before the timeout expires, it becomes visible again and another consumer can pick it up. This is how SQS provides at-least-once delivery. Set the timeout long enough to comfortably complete processing. Dead Letter Queues: if a message fails processing more than maxReceiveCount times, SQS moves it to the DLQ. This prevents a bad message from blocking the queue indefinitely. Long polling is almost always better than short polling — it waits up to 20 seconds for messages to arrive, reducing the number of empty API calls and lowering costs. Enable it by setting ReceiveMessageWaitTimeSeconds > 0. FIFO queues: use when order matters (e.g., financial transactions must be processed in submission order) or when exactly-once processing is required (e.g., charge a credit card exactly once).
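The interaction between visibility timeout and maxReceiveCount is easier to see in a toy simulation. The class below is an in-memory model written for illustration only; it does not call SQS, and its names and the integer clock are assumptions.

```python
class SimulatedQueue:
    """In-memory toy model of SQS visibility timeout and DLQ redrive."""

    def __init__(self, visibility_timeout=30, max_receive_count=3):
        self.visibility_timeout = visibility_timeout
        self.max_receive_count = max_receive_count
        self.messages = {}     # id -> {"body", "visible_at", "receives"}
        self.dead_letter = []  # bodies moved to the dead letter queue
        self._next_id = 0

    def send(self, body, now=0):
        self._next_id += 1
        self.messages[self._next_id] = {"body": body, "visible_at": now, "receives": 0}
        return self._next_id

    def receive(self, now):
        """Return (id, body) of one visible message, or None if none is visible."""
        for msg_id, msg in list(self.messages.items()):
            if msg["visible_at"] > now:
                continue  # in flight: hidden from other consumers
            if msg["receives"] >= self.max_receive_count:
                # Received the maximum number of times without a delete
                self.dead_letter.append(msg["body"])
                del self.messages[msg_id]
                continue
            msg["receives"] += 1
            msg["visible_at"] = now + self.visibility_timeout
            return msg_id, msg["body"]
        return None

    def delete(self, msg_id):
        self.messages.pop(msg_id, None)

queue = SimulatedQueue(visibility_timeout=30, max_receive_count=3)
queue.send("ok")
queue.send("poison")

first_id, first_body = queue.receive(now=0)
queue.delete(first_id)                  # consumer succeeded and deleted it

attempt1 = queue.receive(now=0)         # consumer fails, never deletes
while_hidden = queue.receive(now=10)    # still inside the visibility timeout
attempt2 = queue.receive(now=30)        # timeout expired: visible again
attempt3 = queue.receive(now=60)
after_max = queue.receive(now=90)       # exceeds maxReceiveCount: moved to DLQ
print(queue.dead_letter)
```

The message that is never deleted reappears after every timeout, which is the at-least-once behavior in action; the DLQ is what stops that loop.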

Task 2.3 — SNS

Amazon SNS & the Fan-out Pattern

Pub/Sub messaging and one-to-many delivery

SNS Core Concepts

Pub/Sub: publishers send to topics; subscribers receive
Push-based delivery (unlike SQS pull)
Up to 12.5M subscriptions per topic
Message filtering by attribute
FIFO topics available (ordered + deduplication)

SNS Subscribers

SQS queues (most common)
Lambda functions
HTTP/S endpoints
Email / SMS / Mobile push
Kinesis Data Firehose

SNS + SQS Fan-out Pattern

The canonical multi-consumer architecture

Publisher → SNS Topic → fan-out to:

SQS Queue A → Email Service
SQS Queue B → Analytics
Lambda → Real-time

One publish → delivered to all subscribers in parallel
Each SQS queue buffers for its own consumer independently
Failure in one consumer doesn't affect others
Add new consumers without changing publisher

"Send one event to multiple downstream services" → SNS + SQS fan-out. SNS alone doesn't buffer — add SQS between SNS and slow consumers to absorb bursts. Message filtering avoids creating separate topics per subscriber.

The SNS + SQS fan-out pattern is one of the most important patterns on the exam. The problem it solves: you have one event (e.g., "order placed") that needs to trigger multiple independent workflows (send confirmation email, update inventory, notify fulfillment, trigger analytics). If you call each service synchronously, they're tightly coupled — a slow analytics service holds up the email. The fan-out solution: publish once to SNS, which delivers to multiple SQS queues simultaneously. Each queue has its own consumer processing at its own pace. Completely decoupled. SNS message filtering: each subscriber can set filter policies so they only receive messages with specific attributes. This means you don't need separate topics for different event types — one topic with filtered subscribers. SNS is push-based: it pushes to subscribers immediately. SQS is pull-based: consumers poll for messages. The fan-out combines both — SNS pushes into SQS, then consumers pull from SQS at their own rate.
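A minimal in-memory sketch of fan-out with filter policies follows. It models only exact string matching on message attributes; the class, subscriber names, and attribute names are hypothetical, and nothing here calls SNS or SQS.

```python
from collections import defaultdict

class Topic:
    """Toy pub/sub topic: one publish is copied to every matching subscriber."""

    def __init__(self):
        self.subscriptions = []          # (subscriber name, filter policy or None)
        self.queues = defaultdict(list)  # subscriber name -> buffered messages

    def subscribe(self, name, filter_policy=None):
        self.subscriptions.append((name, filter_policy))

    @staticmethod
    def _matches(policy, attributes):
        # No policy means "receive everything". Otherwise every key in the
        # policy must be present with one of the allowed values.
        if policy is None:
            return True
        return all(attributes.get(key) in allowed for key, allowed in policy.items())

    def publish(self, message, attributes=None):
        attributes = attributes or {}
        delivered = []
        for name, policy in self.subscriptions:
            if self._matches(policy, attributes):
                self.queues[name].append(message)
                delivered.append(name)
        return delivered

topic = Topic()
topic.subscribe("email-queue")
topic.subscribe("analytics-queue")
topic.subscribe("fulfillment-queue", {"event_type": ["order_placed"]})

placed = topic.publish("order 1001", {"event_type": "order_placed"})
cancelled = topic.publish("order 1001 cancelled", {"event_type": "order_cancelled"})
print(placed)
print(cancelled)
```

The publisher never changes when a subscriber is added, and the filtered subscriber skips events it does not care about without needing a separate topic.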

Task 2.3 — Event Streaming

EventBridge & Kinesis

Event-driven routing and real-time data streaming

Amazon EventBridge

Serverless event bus for AWS services, SaaS, custom apps
Routes events via rules to targets (Lambda, SQS, SNS, Step Functions, etc.)
Schema registry: discover and validate event shapes
Event Archive: replay past events for debugging
Scheduled rules: cron/rate expressions (replaces CloudWatch Events)
Pipes: point-to-point event enrichment and filtering

EventBridge for complex event routing logic across services. SQS for simple point-to-point queuing. SNS for fan-out pub/sub.

Amazon Kinesis Family

Kinesis Data Streams: Real-time streaming; custom consumers; 1–365 day retention; ordered within shard; replay capability
Kinesis Data Firehose: Fully managed delivery to S3, Redshift, OpenSearch, Splunk; near-real-time (60s buffer min); no consumer code
Kinesis Data Analytics: SQL or Apache Flink on streaming data; real-time aggregations
MSK (Kafka): Managed Apache Kafka for existing Kafka workloads

Kinesis Streams = custom real-time processing with replay. Firehose = managed delivery to data stores, no code. SQS = general queuing, not streaming.

EventBridge vs SNS vs SQS is a common exam comparison. Here's the mental model: SQS: simple queue, one consumer group pulls from it. Great for decoupling and load leveling. SNS: push notifications to multiple subscribers simultaneously. Fan-out. EventBridge: sophisticated event routing based on event content. Can filter by event source, detail type, and content. Connects AWS services, SaaS providers, and custom apps. The modern replacement for CloudWatch Events. Kinesis vs SQS: Kinesis is for real-time streaming data (clickstreams, logs, IoT telemetry, financial transactions) where order matters and you need to process data in real-time or replay it. SQS is for task queuing between services. Key Kinesis Streams concept: shards provide parallelism. Each shard processes 1 MB/s inbound, 2 MB/s outbound. Records within a shard are ordered. Scale by adding shards (resharding). Consumers can read from checkpoints and replay. Kinesis Firehose: zero code required. Just configure the source and destination. It buffers data (by size or time) and delivers in batches. Near-real-time means ~60-second minimum latency.
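The per-shard limits translate into a quick sizing calculation. This is a rough sketch for provisioned-mode streams with shared-throughput consumers; the 1,000 records per second write limit is a per-shard figure not stated in the notes above, and the function name and sample workloads are made up.

```python
import math

INGRESS_MB_PER_SHARD = 1.0      # MB/s written per shard
EGRESS_MB_PER_SHARD = 2.0       # MB/s read per shard
RECORDS_PER_SHARD = 1000        # records/s written per shard (not in the notes above)

def shards_needed(ingress_mb_s, egress_mb_s, records_per_s):
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,
        math.ceil(ingress_mb_s / INGRESS_MB_PER_SHARD),
        math.ceil(egress_mb_s / EGRESS_MB_PER_SHARD),
        math.ceil(records_per_s / RECORDS_PER_SHARD),
    )

write_bound = shards_needed(ingress_mb_s=3.5, egress_mb_s=5.0, records_per_s=2000)
read_bound = shards_needed(ingress_mb_s=0.5, egress_mb_s=6.0, records_per_s=500)
record_bound = shards_needed(ingress_mb_s=0.2, egress_mb_s=0.2, records_per_s=4500)
print(write_bound, read_bound, record_bound)
```

Whichever limit is hit first determines the shard count, so a stream of many tiny records can need more shards than its byte volume suggests.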

Task 2.4 — S3

S3 Storage Classes & Lifecycle Policies

Matching access patterns to cost-optimal storage tiers

Storage Class	Availability	Min Duration	Retrieval	Use Case
S3 Standard	99.99% (3+ AZs)	None	Instant	Frequently accessed data
S3 Intelligent-Tiering	99.9%	None	Instant / async	Unknown or changing access patterns
S3 Standard-IA	99.9% (3+ AZs)	30 days	Instant	Infrequently accessed; backups, DR copies
S3 One Zone-IA	99.5% (1 AZ)	30 days	Instant	Re-creatable infrequent data; lower cost
S3 Glacier Instant Retrieval	99.9%	90 days	Instant (ms)	Archive accessed once per quarter
S3 Glacier Flexible Retrieval	99.99%	90 days	Minutes–hours	Archival with flexible retrieval time
S3 Glacier Deep Archive	99.99%	180 days	12–48 hours	Long-term regulatory compliance archives

Lifecycle policies automate transitions: Standard → Standard-IA (30+ days) → Glacier (90+ days) → Deep Archive (180+ days). Minimum storage duration charges apply even if an object is deleted early. Standard-IA has per-GB retrieval fees, so it costs more than Standard if data is accessed frequently.

S3 storage classes appear in cost-optimization AND resilience questions. Key scenarios: "Store backups that may need quick recovery" → Standard-IA (instant retrieval, lower cost than Standard). "Archive compliance records accessed maybe once a year, must retrieve within hours" → Glacier Flexible Retrieval. "7-year regulatory archive, retrieval time of 24 hours is acceptable" → Glacier Deep Archive (cheapest storage tier). "Unknown access pattern — sometimes frequent, sometimes months with no access" → Intelligent-Tiering (automatically moves objects between tiers based on access; no retrieval fee for frequent/infrequent tiers). One Zone-IA stores data in a single AZ — lower cost, but if that AZ is destroyed, data is lost. Only use for data that can be recreated (e.g., thumbnails generated from source images stored in Standard). Lifecycle policies are rules: transition to a cheaper class after N days, or expire (delete) objects after N days. Critical for cost management of log archives.
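A lifecycle rule for a log archive might look like the sketch below, written in the dictionary shape that S3 lifecycle configuration APIs accept. The rule ID, prefix, and day counts are hypothetical, and the helper function is only a local model of which class an object would be in at a given age.

```python
# Hypothetical rule: tier down logs over time, expire after about 7 years.
lifecycle_configuration = {
    "Rules": [{
        "ID": "archive-logs",
        "Status": "Enabled",
        "Filter": {"Prefix": "logs/"},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},        # Glacier Flexible Retrieval
            {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
        ],
        "Expiration": {"Days": 2555},
    }]
}

def storage_class_at(age_days, rule):
    """Local model: storage class of an object of the given age, None if expired."""
    if age_days >= rule["Expiration"]["Days"]:
        return None
    current = "STANDARD"
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current

rule = lifecycle_configuration["Rules"][0]
timeline = [storage_class_at(age, rule) for age in (10, 45, 120, 400, 3000)]
print(timeline)
```

One rule covers the whole lifetime of an object: tiering down as it ages, then deleting it when the retention requirement ends.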

Task 2.4 — Storage Types

EBS · EFS · FSx — Block & File Storage

Choosing the right persistent storage for compute workloads

Amazon EBS: Block storage for a single EC2 instance. AZ-scoped (same AZ as the instance). Snapshot to S3 for backup. Types: gp3 (general, recommended), io2 Block Express (highest IOPS, up to 256,000), st1 (throughput HDD), sc1 (cold HDD). Encrypt with KMS. Multi-Attach for io1/io2 (up to 16 Nitro instances, same AZ).

Amazon EFS: Managed NFS file system. Multi-AZ, multi-instance access simultaneously. Grows and shrinks automatically (no provisioning). Linux only (NFSv4). Performance modes: General Purpose and Max I/O. Throughput modes: Bursting, Provisioned, Elastic. EFS Infrequent Access tier for cost savings.

Amazon FSx for Windows: Managed Windows file server (SMB protocol). Active Directory integration. Supports DFS namespaces, NTFS, ACLs. Multi-AZ deployment available. Use for Windows workloads that need shared file storage (home directories, SQL Server backups).

Amazon FSx for Lustre: High-performance parallel file system for HPC, ML, financial modeling. Sub-millisecond latency, hundreds of GB/s throughput. Can link to S3 as a data repository. Ideal for compute-intensive workloads needing fast shared storage.

EBS = one instance, one AZ, block storage. EFS = many Linux instances, multi-AZ, file storage. FSx for Windows = Windows SMB, AD-integrated. FSx for Lustre = HPC / ML high-performance. If you see "shared file system for Linux EC2 fleet" → EFS.

Storage service selection is fundamental to Domain 2 questions. The decision tree: Single EC2 instance needs fast persistent storage → EBS (gp3 for most workloads). Multiple Linux EC2 instances need shared file storage → EFS. Windows workloads need shared SMB file storage with AD integration → FSx for Windows File Server. HPC, ML training, genomics workloads need ultra-fast parallel file system → FSx for Lustre. Object storage (images, backups, logs, static web) → S3. EBS volumes are AZ-scoped: you cannot attach an EBS volume to an EC2 instance in a different AZ. To move data, take a snapshot (which goes to S3) and create a new volume in the target AZ from the snapshot. EFS automatically scales storage capacity up and down. You pay for what you use. No need to pre-provision. Great for content management systems, development environments, and web serving where multiple instances share files.
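The decision tree above can be written as a small function, which makes the order of the checks explicit. This is a study aid under simplifying assumptions: the parameter names are invented, and real workloads involve more factors than these four flags.

```python
def choose_storage(*, object_storage=False, hpc=False, shared=False, os_family="linux"):
    """Map a coarse workload description to the service the notes recommend."""
    if object_storage:
        return "S3"                           # images, backups, logs, static web
    if hpc:
        return "FSx for Lustre"               # HPC, ML training, genomics
    if shared and os_family == "windows":
        return "FSx for Windows File Server"  # SMB with AD integration
    if shared:
        return "EFS"                          # shared NFS for Linux fleets
    return "EBS (gp3)"                        # single instance, block storage

print(choose_storage(shared=True))
print(choose_storage(shared=True, os_family="windows"))
print(choose_storage(hpc=True))
print(choose_storage())
```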

Task 2.4 — Databases

RDS High Availability & Read Scaling

Multi-AZ · Read Replicas · RDS Proxy · Aurora

RDS Multi-AZ

Purpose: High availability (not read scaling)
Replication: Synchronous — standby is always in sync
Failover: Automatic, ~1–2 minutes. DNS endpoint flips.
Standby: Not readable (no traffic served)
Cost: 2× instance cost (active + standby)

RDS Read Replicas

Purpose: Read scaling + cross-region DR
Replication: Asynchronous — small replication lag
Failover: Manual promotion — not automatic
Readable: Yes — serve SELECT queries from replica
Cross-region: Yes — enables cross-region DR
Up to 5 replicas per RDS instance (15 for Aurora)

Amazon Aurora

MySQL/PostgreSQL compatible; AWS cites up to 5× the throughput of MySQL and 3× that of PostgreSQL
Shared storage: 6 copies across 3 AZs automatically
Up to 15 Aurora Replicas (sub-10ms replica lag)
Auto-scales storage 10GB → 128TB
Aurora Serverless v2: scale to zero (dev/test)
Aurora Global Database: ~1s cross-region replication
Failover to replica: <30 seconds

RDS Proxy

Connection pooling for Lambda → RDS patterns
Reduces connection overhead during burst
Automatic failover routing (faster than DNS TTL)
Stores database credentials in Secrets Manager; supports IAM authentication

Multi-AZ = HA (sync, automatic failover, standby not readable). Read Replica = scale reads (async, manual failover, replica is readable). Never use Multi-AZ for read scaling — use Read Replicas. Aurora for anything needing >5 read replicas, global DR, or faster failover.

The Multi-AZ vs. Read Replica distinction is one of the most tested concepts in the entire exam. Multi-AZ: strictly for high availability. The standby is not accessible for queries. It exists solely to take over if the primary fails. Failover is automatic — DNS endpoint for the RDS instance points to the new primary automatically. Read Replicas: for scaling read-heavy workloads. They are readable databases. If the primary fails, you must manually promote a replica to be the new primary — it doesn't happen automatically. Can you combine them? Yes! You can have Multi-AZ enabled (for HA) AND have Read Replicas (for read scaling). Aurora completely changes the game: its storage layer automatically maintains 6 copies across 3 AZs. Failover to an Aurora Replica takes under 30 seconds. Aurora Global Database replicates to a secondary region in under 1 second — this is the answer for "near-zero RPO across regions." RDS Proxy is especially important for Lambda-to-RDS scenarios. Lambda can spin up thousands of concurrent executions, each wanting its own database connection. Without a proxy, this overwhelms RDS's max connection limit. RDS Proxy pools and reuses connections.

Task 2.4 — NoSQL & Caching

DynamoDB & ElastiCache Resilience Patterns

NoSQL HA · Global Tables · Redis vs. Memcached

DynamoDB Resilience

Built-in Multi-AZ: Data stored across 3 AZs by default
Global Tables: Multi-Region Active/Active; last-writer-wins; <1s replication
PITR: Point-in-time recovery; restore to any second in last 35 days
On-demand backups: Full table backup anytime; no performance impact
Streams: Ordered change log; triggers Lambda; 24h retention
DAX: In-memory DynamoDB accelerator; microsecond reads

ElastiCache: Redis vs. Memcached

Redis: Persistence, replication, pub/sub, Lua scripting, complex data structures (sorted sets, lists). Multi-AZ with automatic failover. Global Datastore for cross-region. Use for sessions, leaderboards, pub/sub, distributed locks.
Memcached: Simple key-value, multi-threaded, no persistence, no replication. Pure caching, simpler ops. Use when you only need a dumb cache and don't need any Redis features.

Caching Patterns

Lazy Loading (Cache-Aside): Check cache first; miss → load from DB → write to cache
Write-Through: Write to cache AND DB simultaneously; always fresh
TTL: Time-to-live prevents stale data

"Reduce RDS read load" → ElastiCache (lazy loading pattern). "Global users need <1ms reads" → DynamoDB + DAX. "Multi-region Active/Active NoSQL" → DynamoDB Global Tables. Redis = persistence + features. Memcached = simple cache only.

DynamoDB and ElastiCache questions often appear together in read-scaling scenarios. DynamoDB is inherently resilient — you don't configure Multi-AZ, it's just how it works. Your design decisions are around capacity mode (on-demand vs provisioned), GSIs for query flexibility, and Global Tables for multi-region. DynamoDB DAX: a write-through cache for DynamoDB. Reduces read latency from milliseconds to microseconds. API-compatible with DynamoDB — just change the endpoint in your code. Use for extremely read-heavy tables. ElastiCache Redis vs Memcached: Redis is almost always the better answer unless the question specifically says "simple caching, no persistence, no replication needed." Redis supports persistence (RDB snapshots, AOF logs), replication with automatic failover (like RDS Multi-AZ), and advanced data structures. Lazy Loading: cache miss → fetch from DB → cache it. Stale data is possible; set a TTL to limit staleness. Write-Through: every write goes to cache and DB simultaneously — cache is always fresh, but you cache data even if never read.
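Lazy loading, write-through, and TTL can be sketched together with plain dictionaries standing in for the cache and the database. The class and key names are hypothetical and the clock is a simple integer passed by the caller; a real deployment would use a Redis or Memcached client instead.

```python
class CachedStore:
    """Toy cache in front of a dict 'database' with TTL-based expiry."""

    def __init__(self, database, ttl_seconds=60):
        self.database = database   # dict standing in for the database
        self.ttl_seconds = ttl_seconds
        self.cache = {}            # key -> (value, expires_at)
        self.db_reads = 0

    def get(self, key, now):
        """Lazy loading (cache-aside): read the cache first, fall back to the DB."""
        entry = self.cache.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]        # cache hit
        self.db_reads += 1         # miss or expired entry: go to the database
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = (value, now + self.ttl_seconds)
        return value

    def put_write_through(self, key, value, now):
        """Write-through: update the database and the cache together."""
        self.database[key] = value
        self.cache[key] = (value, now + self.ttl_seconds)

store = CachedStore({"user:1": "Ada"}, ttl_seconds=60)
first = store.get("user:1", now=0)          # miss: loaded from the database
second = store.get("user:1", now=30)        # hit
store.database["user:1"] = "Ada L."         # DB changed behind the cache's back
stale = store.get("user:1", now=45)         # hit, but stale
fresh = store.get("user:1", now=61)         # TTL expired: reloaded
store.put_write_through("user:2", "Grace", now=61)
after_write = store.get("user:2", now=62)   # hit without a database read
print(first, stale, fresh, after_write, store.db_reads)
```

The stale read at t=45 is the lazy-loading tradeoff, and the TTL is what bounds how long it can last.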

Task 2.4 — Backup & Recovery

AWS Backup & Recovery Strategies

Centralized backup management and DR patterns

AWS Backup Service

Centralized backup policy management across: EBS, RDS, Aurora, DynamoDB, EFS, FSx, Storage Gateway, S3
Backup plans: schedules, retention rules, lifecycle to cold storage
Cross-region backup copies for DR
Cross-account backups via AWS Organizations
Backup Vault Lock: WORM protection for backups
Tags-based resource selection

Snapshot Strategy by Service

EBS: Snapshots to S3 (incremental); schedule via DLM or AWS Backup
RDS/Aurora: Automated daily backups + manual snapshots; retain 0–35 days
DynamoDB: PITR (35 days) + on-demand backups (indefinite)
S3: Versioning + CRR; no snapshots needed

DR Strategy ↔ AWS Patterns

Backup & Restore: AWS Backup to S3 / cross-region. High RTO (hours).
Pilot Light: RDS read replica in DR region; EC2 AMIs ready to launch. Scale up on disaster.
Warm Standby: Scaled-down ASG + DB in DR region running. Fast scale-up on failover.
Active/Active: Route 53 latency routing; Aurora Global DB; DynamoDB Global Tables. Near-zero RTO/RPO.

AWS Backup is the single-pane-of-glass answer for "centrally manage backups across services and accounts." Data Lifecycle Manager (DLM) covers only EBS snapshots and EBS-backed AMIs. For a cross-region RPO of minutes or less: Aurora Global DB + Route 53 failover routing.

AWS Backup simplifies compliance by centralizing backup policies. Instead of configuring backup schedules in RDS, DynamoDB, EFS, and EBS separately, you define one backup plan in AWS Backup and apply it across all supported services via tags or resource IDs. Backup Vault Lock (WORM) prevents anyone — including root — from deleting backups before the retention period expires. Critical for financial and healthcare compliance. Data Lifecycle Manager (DLM) is the older service for EBS snapshot scheduling. AWS Backup is preferred as it covers more services. For the DR strategies: the question will give you an RTO and RPO requirement and ask which pattern fits. Memorize the spectrum: Backup/Restore (highest RTO, lowest cost) → Pilot Light → Warm Standby → Active/Active (lowest RTO/RPO, highest cost).

Quick Review

Exam Checklist — Domain 2

Can you answer these?

Task 2.1 — Scalable & Loosely Coupled

Target Tracking vs. Step vs. Scheduled scaling policies
Why horizontal scaling is preferred over vertical
Lambda limits: 15 min max, concurrency, cold starts
Fargate (containers) vs. Lambda (functions) tradeoff
Step Functions for workflow orchestration with retry logic

Task 2.2 — Highly Available

ALB (L7, HTTP/S, path routing) vs. NLB (L4, static IP, TCP/UDP)
GWLB for third-party virtual security appliances
CloudFront (CDN, caching) vs. Global Accelerator (routing, no cache)
ALB sticky sessions vs. ElastiCache for distributed session state
Cross-zone load balancing behavior per ELB type

Task 2.3 — Decoupled Architectures

SQS Standard (high throughput, at-least-once) vs. FIFO (ordered, exactly-once)
Visibility timeout and Dead Letter Queues
SNS + SQS fan-out pattern for multi-consumer events
Kinesis Streams (replay, custom consumers) vs. Firehose (managed delivery)
EventBridge for event routing + scheduled rules (cron)

Task 2.4 — Storage & Databases

S3 storage class selection: Standard → IA → Glacier tiers
EBS (one instance) vs. EFS (multi-instance Linux NFS) vs. FSx
RDS Multi-AZ (sync HA) vs. Read Replica (async, readable)
Aurora advantages: 15 replicas, Global DB, storage auto-scale
ElastiCache Redis (persistence + features) vs. Memcached (simple)

Quick Reference

Service → Scenario Quick Map

Scaling & Compute

EC2 ASG → auto scale compute
Lambda → event-driven, short tasks
Fargate → containers, no EC2 mgmt
Step Functions → workflow orchestration
API Gateway → managed API front-end

Load Balancing

ALB → HTTP/S, path routing, WAF
NLB → TCP/UDP, static IP, PrivateLink
GWLB → virtual firewall appliances
Global Accelerator → static IP, TCP/UDP global
CloudFront → CDN, HTTP caching at edge

Decoupling

SQS Standard → high-throughput queuing
SQS FIFO → ordered, exactly-once
SNS → pub/sub fan-out
EventBridge → event routing + scheduling
Kinesis → real-time streaming

Storage

S3 → objects, backups, static web
EBS → single EC2 block storage
EFS → shared Linux NFS file system
FSx Windows → SMB, AD-integrated
FSx Lustre → HPC / ML high perf

Databases

RDS Multi-AZ → sync HA, auto failover
RDS Read Replica → read scaling + DR
Aurora → 15 replicas, fast failover
Aurora Global DB → cross-region <1s RPO
DynamoDB Global Tables → Active/Active

Caching & Backup

ElastiCache Redis → sessions, leaderboards
ElastiCache Memcached → simple cache
DAX → DynamoDB microsecond reads
AWS Backup → centralized backup policy
S3 Lifecycle → archive & expiry rules

You're ready for Domain 2
