AWS Certified Solutions Architect — Associate · SAA-C03

Design Resilient Architectures

Domain 2 — Comprehensive Study Guide
Task Statements 2.1 · 2.2 · 2.3 · 2.4

📋 26% of Exam Score — Second Highest-Weight Domain
Domain 2 Overview

What You Need to Know

Task 2.1 — Scalable & Loosely Coupled
  • Auto Scaling groups & launch templates
  • Horizontal vs. vertical scaling
  • SQS, SNS, EventBridge — decoupling patterns
  • Serverless: Lambda, Fargate, API Gateway
  • Step Functions for orchestration
Task 2.2 — Highly Available & Fault Tolerant
  • Elastic Load Balancing (ALB, NLB, CLB, GWLB)
  • Multi-AZ deployment patterns
  • Route 53 health checks & routing policies
  • Health checks & automatic failover
  • Global Accelerator for global HA
Task 2.3 — Decoupled Architectures
  • SQS queue types (Standard vs. FIFO)
  • SNS fan-out patterns
  • Kinesis for streaming data
  • EventBridge event-driven patterns
  • Async vs. sync communication patterns
Task 2.4 — Resilient Storage & Databases
  • S3 storage classes & lifecycle policies
  • EBS, EFS, FSx — block & file storage
  • RDS, Aurora, DynamoDB HA patterns
  • ElastiCache for read scaling
  • Backup, PITR, and snapshot strategies
2.1

Scalable & Loosely Coupled Architectures

Auto Scaling · Horizontal vs. Vertical · Serverless Compute · Orchestration

Task 2.1 — Auto Scaling

EC2 Auto Scaling

Launch Templates · Scaling Policies · Lifecycle
ASG Components
  • Launch Template: AMI, instance type, SGs, user data — the "what to launch"
  • Min / Desired / Max: Capacity bounds for the group
  • AZ Distribution: Spread instances across ≥2 AZs for HA
  • Health Checks: EC2 status or ELB health checks trigger replacement
  • Lifecycle Hooks: Pause launch/terminate for custom actions
Scaling Policy Types
  • Target Tracking: Hold a metric at a target value (e.g., average CPU at 50%); simplest, recommended
  • Step Scaling: Add N instances per alarm threshold breach
  • Simple Scaling: One action per alarm (cooldown period)
  • Scheduled Scaling: Predictable load (e.g., 9am–6pm weekdays)
  • Predictive Scaling: ML-based forecast + proactive scaling
Horizontal vs. Vertical Scaling
  • Horizontal (Scale Out): Add more instances. Requires stateless app design. Works with ELB. No downtime. Preferred for web tiers.
  • Vertical (Scale Up): Increase instance size. Requires downtime (stop/start). Has an upper limit. Use for databases that can't distribute horizontally.
Warm Pools

Pre-initialized instances kept ready for scale-out, usually in a stopped state (running and hibernated are also options). Reduces launch latency for scale-out events. Useful for apps with long initialization.

Target Tracking is the recommended default policy — it adjusts scaling continuously, not just at alarm thresholds. Use Scheduled Scaling for known traffic patterns. Never rely on Simple Scaling alone.
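The intuition behind target tracking can be sketched as a proportional calculation. This is only an approximation for study purposes: the actual algorithm is managed by AWS, and the function name and parameters below are hypothetical.

```python
import math


def target_tracking_desired(current_capacity, metric_value, target_value,
                            min_size, max_size):
    """Estimate the capacity a target tracking policy would aim for.

    Assumption: load spreads evenly, so the average metric scales
    inversely with the number of instances.
    """
    if current_capacity <= 0 or target_value <= 0:
        raise ValueError("capacity and target must be positive")
    # 4 instances at 75% CPU against a 50% target -> 4 * 75 / 50 = 6 instances.
    raw = math.ceil(current_capacity * metric_value / target_value)
    # The group's Min/Max bounds always limit what the policy can request.
    return max(min_size, min(max_size, raw))


print(target_tracking_desired(4, 75.0, 50.0, min_size=2, max_size=10))
```

The clamp at the end is why Min / Desired / Max still matter with target tracking: the policy only moves Desired within those bounds.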
Task 2.1 — Serverless

Serverless Compute Options

Lambda · Fargate · API Gateway · Step Functions
AWS Lambda: Event-driven functions. Max 15 min execution, up to 10 GB memory. Scales to thousands of concurrent executions automatically. Triggered by S3, SQS, DynamoDB Streams, API Gateway, EventBridge, and more. Billed per request and per millisecond of execution time.
AWS Fargate: Serverless container compute for ECS and EKS. No EC2 instances to manage. Define CPU/memory per task. Scales automatically. Use when workloads need containers but you want to avoid cluster management overhead.
Amazon API Gateway: Fully managed API layer. REST, HTTP, and WebSocket APIs. Integrates directly with Lambda, Step Functions, and AWS services. Handles throttling, auth (Cognito, Lambda authorizer), caching, and TLS termination. Regional or Edge-optimized deployments.
AWS Step Functions: Serverless workflow orchestration using state machines. Coordinates Lambda, ECS, DynamoDB, SNS, SQS, and more. Standard workflows (1-year max) for audit trails; Express workflows for high-volume, short-duration tasks. Handles retries, error catching, parallel branches.
Lambda concurrency limit is 1,000 per region by default (can be raised). Reserved concurrency sets aside capacity for a critical function and also caps that function at the reserved number; provisioned concurrency keeps execution environments initialized to avoid cold starts. Fargate is EC2-free containers; Lambda is function-level serverless.
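To reason about the concurrency limit, a common rule of thumb is concurrency ≈ requests per second × average duration in seconds. The sketch below applies that estimate against the 1,000 default; the function names and the reserved-concurrency figures are hypothetical.

```python
import math

DEFAULT_REGIONAL_LIMIT = 1000  # default per-region concurrency from the tip above


def estimated_concurrency(requests_per_second, avg_duration_seconds):
    """Rule-of-thumb estimate: in-flight executions = arrival rate x duration."""
    return math.ceil(requests_per_second * avg_duration_seconds)


def unreserved_remaining(regional_limit, reserved_by_function):
    """Concurrency left for all other functions after reservations."""
    return regional_limit - sum(reserved_by_function.values())


# 200 req/s at 0.5 s each needs about 100 concurrent executions.
print(estimated_concurrency(200, 0.5))
# 2,500 req/s at 0.5 s each would exceed the default limit.
print(estimated_concurrency(2500, 0.5) > DEFAULT_REGIONAL_LIMIT)
# Hypothetical reservations: every reserved unit is taken from the shared pool.
print(unreserved_remaining(DEFAULT_REGIONAL_LIMIT, {"checkout": 200, "auth": 100}))
```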
2.2

Highly Available & Fault-Tolerant Architectures

Elastic Load Balancing · Multi-AZ · Global Accelerator · Health Checks

Task 2.2 — Load Balancing

Elastic Load Balancer Types

ALB · NLB · CLB · GWLB — choosing the right load balancer
Load Balancer | Layer | Protocols | Best For | Key Feature
ALB (Application) | Layer 7 | HTTP, HTTPS, WebSocket, gRPC | Web apps, microservices, containers | Path & host-based routing, Lambda targets, WAF integration
NLB (Network) | Layer 4 | TCP, UDP, TLS | Ultra-low latency, static IP, gaming, IoT | Static Elastic IP per AZ, millions of req/sec, PrivateLink
CLB (Classic) | Layer 4 + Layer 7 | HTTP, HTTPS, TCP, SSL | Legacy applications only | Previous generation; migrate to ALB or NLB
GWLB (Gateway) | Layer 3 | IP (GENEVE) | Third-party virtual appliances (firewalls, IDS) | Bump-in-the-wire; transparent traffic inspection
ALB for HTTP/S web traffic and microservices routing. NLB for TCP/UDP, static IPs, or extreme performance. GWLB for routing traffic through security appliances. CLB is legacy — always migrate away.
Task 2.2 — ALB

ALB Routing, Target Groups & Sticky Sessions

Advanced ALB patterns for microservices and containers
Routing Rules
  • Path-based: /api/* → API service, /images/* → S3/media service
  • Host-based: app.example.com → app servers, api.example.com → API servers
  • Header-based: Route by HTTP header value
  • Query string: Route by URL query params
  • Source IP: Route by CIDR range
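The path-based rules above can be pictured as a priority-ordered match. The following is a local simulation only, not an AWS API call; the target group names and priorities are hypothetical.

```python
from fnmatch import fnmatchcase

# (priority, path pattern, target group); all names are hypothetical.
RULES = [
    (10, "/api/*", "tg-api"),
    (20, "/images/*", "tg-media"),
]
DEFAULT_TARGET_GROUP = "tg-web"  # stands in for the listener's default rule


def route(path, rules=RULES, default=DEFAULT_TARGET_GROUP):
    """Return the target group for a request path.

    Rules are checked in priority order, lowest number first; the first
    match wins and the default applies when nothing matches.
    """
    for _priority, pattern, target_group in sorted(rules):
        if fnmatchcase(path, pattern):
            return target_group
    return default


print(route("/api/orders/42"))
print(route("/images/logo.png"))
print(route("/index.html"))
```

One ALB with several target groups replaces one load balancer per microservice, which is the point of the path-based pattern.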
Target Group Types
  • Instances: EC2 instances (by ID)
  • IP addresses: Any IP (including on-premises)
  • Lambda: Invoke Lambda per request
  • ALB: An ALB registered as a target of an NLB (available in NLB target groups only)
Sticky Sessions (Session Affinity)
  • ALB-generated cookie (AWSALB) or app-based cookie
  • Routes same user to same target for session duration
  • Can cause uneven load distribution
  • Avoid when possible — prefer stateless apps with ElastiCache for session state
Connection Draining (Deregistration Delay)
  • ALB waits for in-flight requests before deregistering a target
  • Default: 300 seconds (5 min)
  • Reduce for short-lived requests or fast-cycling targets (the setting does not apply to Lambda targets)
  • Ensures graceful scale-in without dropped connections
"Route different URL paths to different microservices behind one ALB" → path-based routing with multiple target groups. "Users lose session state when hitting different instances" → add ElastiCache for distributed session storage, not sticky sessions.
Task 2.2 — Global Routing

Global Accelerator vs. CloudFront

Two very different global distribution services
AWS Global Accelerator
  • 2 static anycast IP addresses globally
  • Routes traffic via AWS private backbone (not public internet)
  • Works with TCP & UDP (non-HTTP too)
  • Fast failover to a healthy endpoint (~30 sec)
  • Targets: ALB, NLB, EC2, Elastic IPs
  • No caching — pure routing acceleration
  • Best for: gaming, IoT, VoIP, non-HTTP workloads, static IP requirement
Amazon CloudFront
  • 400+ edge locations worldwide
  • Content caching at the edge (CDN)
  • HTTP/S only
  • Reduces origin load by serving cached content
  • Integrates with WAF, Shield, ACM, S3
  • Custom cache behaviors by path
  • Best for: web content, API acceleration, static asset delivery, S3 distribution
CloudFront caches content at the edge — reduces origin load and latency for repeat requests. Global Accelerator does NOT cache — it just routes traffic faster via AWS backbone. Static IP requirement → Global Accelerator. HTTP caching → CloudFront.
2.3

Decoupled & Event-Driven Architectures

SQS · SNS · EventBridge · Kinesis · Fan-out Patterns

Task 2.3 — SQS

Amazon SQS — Standard vs. FIFO

The foundational decoupling service
Feature | Standard Queue | FIFO Queue
Throughput | Nearly unlimited | 300 msg/sec; 3,000 with batching
Ordering | Best-effort (may vary) | Strict first-in, first-out
Delivery | At least once (duplicates possible) | Exactly once (deduplication ID)
Deduplication | Consumer must handle duplicates | 5-minute deduplication window
Message groups | Not supported | Message Group ID for parallel ordering
Use case | High throughput; order not critical | Financial transactions, order processing
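The 5-minute deduplication window in the FIFO column can be illustrated with a small in-memory model. This is a teaching sketch, not how SQS is implemented; the class name is hypothetical and time is passed in explicitly to keep the example deterministic.

```python
DEDUP_WINDOW_SECONDS = 5 * 60  # the 5-minute window from the table above


class FifoDedupSketch:
    """Toy model of FIFO deduplication by deduplication ID."""

    def __init__(self):
        self._first_seen = {}  # deduplication ID -> time it was first accepted
        self.accepted = []     # message bodies in arrival order

    def send(self, dedup_id, body, now):
        """Return True if the message is enqueued, False if it is a duplicate."""
        seen_at = self._first_seen.get(dedup_id)
        if seen_at is not None and now - seen_at < DEDUP_WINDOW_SECONDS:
            return False  # same ID inside the window: not enqueued again
        self._first_seen[dedup_id] = now
        self.accepted.append(body)
        return True


queue = FifoDedupSketch()
print(queue.send("order-1", "charge card", now=0))    # first send is accepted
print(queue.send("order-1", "charge card", now=120))  # retry inside the window
print(queue.send("order-1", "charge card", now=301))  # window has expired
```

A producer that retries after a timeout can therefore resend safely within the window, which is what makes FIFO suitable for payment-style workloads.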
Key SQS Parameters
  • Visibility Timeout: Time a message is hidden after being received (default 30s, max 12h). Extend if processing takes longer.
  • Message Retention: 1 min to 14 days (default 4 days)
  • Max Message Size: 256 KB (use S3 + pointer for larger)
  • Long Polling: Wait up to 20s for messages — reduces empty receives and cost
Dead Letter Queues (DLQ)
  • Catches messages that fail processing repeatedly
  • Configure maxReceiveCount threshold (e.g., 3 attempts)
  • Enables debugging without losing failed messages
  • DLQ must be same type as source queue (FIFO DLQ for FIFO queue)
  • Set up CloudWatch alarm on DLQ depth
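A DLQ is attached through the source queue's RedrivePolicy attribute, which is a JSON string. The sketch below only builds and checks that string locally; the ARN, queue names, and helper names are hypothetical, and the call that would apply it to a queue is left out.

```python
import json


def build_redrive_policy(dlq_arn, max_receive_count=3):
    """Build the JSON string for a source queue's RedrivePolicy attribute."""
    if max_receive_count < 1:
        raise ValueError("maxReceiveCount must be at least 1")
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    })


def same_queue_type(source_queue_name, dlq_name):
    """FIFO queue names end in '.fifo'; source and DLQ must be the same type."""
    return source_queue_name.endswith(".fifo") == dlq_name.endswith(".fifo")


# Hypothetical ARN for illustration only.
policy = build_redrive_policy("arn:aws:sqs:us-east-1:123456789012:orders-dlq.fifo")
print(policy)
print(same_queue_type("orders.fifo", "orders-dlq.fifo"))
```

After maxReceiveCount failed receives, the message moves to the DLQ instead of cycling through the visibility timeout indefinitely.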
Task 2.3 — SNS

Amazon SNS & the Fan-out Pattern

Pub/Sub messaging and one-to-many delivery
SNS Core Concepts
  • Pub/Sub: publishers send to topics; subscribers receive
  • Push-based delivery (unlike SQS pull)
  • Up to 12.5M subscriptions per topic
  • Message filtering by attribute
  • FIFO topics available (ordered + deduplication)
SNS Subscribers
  • SQS queues (most common)
  • Lambda functions
  • HTTP/S endpoints
  • Email / SMS / Mobile push
  • Kinesis Data Firehose
SNS + SQS Fan-out Pattern

The canonical multi-consumer architecture

📥 Publisher → SNS Topic
↓ fan-out ↓
SQS Queue A (Email Service) · SQS Queue B (Analytics) · Lambda (Real-time)
  • One publish → delivered to all subscribers in parallel
  • Each SQS queue buffers for its own consumer independently
  • Failure in one consumer doesn't affect others
  • Add new consumers without changing publisher
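Fan-out combined with attribute filtering can be modeled in a few lines. This is a simplified local sketch that supports exact string matches only; the subscriber names and attribute keys are hypothetical.

```python
def matches(filter_policy, attributes):
    """Every key in the policy must be present with one of the allowed values."""
    return all(attributes.get(key) in allowed
               for key, allowed in filter_policy.items())


def fan_out(subscriptions, attributes):
    """Return the subscribers that would receive a message with these attributes."""
    return [name for name, policy in subscriptions.items()
            if matches(policy, attributes)]


# Hypothetical subscriptions; an empty policy means "receive everything".
SUBSCRIPTIONS = {
    "email-queue": {"event_type": ["order_placed"]},
    "analytics-queue": {},
    "fraud-lambda": {"event_type": ["order_placed"], "region": ["eu"]},
}

print(fan_out(SUBSCRIPTIONS, {"event_type": "order_placed", "region": "us"}))
print(fan_out(SUBSCRIPTIONS, {"event_type": "order_cancelled"}))
```

The publisher sends once to a single topic; adding a consumer means adding a subscription entry, not changing publisher code.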
"Send one event to multiple downstream services" → SNS + SQS fan-out. SNS alone doesn't buffer — add SQS between SNS and slow consumers to absorb bursts. Message filtering avoids creating separate topics per subscriber.
Task 2.3 — Event Streaming

EventBridge & Kinesis

Event-driven routing and real-time data streaming
Amazon EventBridge
  • Serverless event bus for AWS services, SaaS, custom apps
  • Routes events via rules to targets (Lambda, SQS, SNS, Step Functions, etc.)
  • Schema registry: discover and validate event shapes
  • Event Archive: replay past events for debugging
  • Scheduled rules: cron/rate expressions (EventBridge is the successor to CloudWatch Events)
  • Pipes: point-to-point event enrichment and filtering
EventBridge for complex event routing logic across services. SQS for simple point-to-point queuing. SNS for fan-out pub/sub.
Amazon Kinesis Family
  • Kinesis Data Streams: Real-time streaming; custom consumers; 1–365 day retention; ordered within shard; replay capability
  • Kinesis Data Firehose: Fully managed delivery to S3, Redshift, OpenSearch, Splunk; near-real-time (60s buffer min); no consumer code
  • Kinesis Data Analytics: SQL or Apache Flink on streaming data; real-time aggregations
  • MSK (Kafka): Managed Apache Kafka for existing Kafka workloads
Kinesis Streams = custom real-time processing with replay. Firehose = managed delivery to data stores, no code. SQS = general queuing, not streaming.
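"Ordered within shard" follows from how records are placed: Kinesis Data Streams hashes the partition key with MD5 and maps the 128-bit result to a shard's hash key range. The sketch below assumes shards split the range evenly, which holds for a freshly created stream but not necessarily after resharding; the function name is hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit hash key


def shard_for(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    # hash_key is below 2**128, so the result is always in [0, shard_count).
    return hash_key * shard_count // HASH_SPACE


# The same key always lands on the same shard, which preserves per-key order.
print(shard_for("device-42", 4) == shard_for("device-42", 4))
```

Choosing a high-cardinality partition key spreads records across shards; a single hot key sends everything to one shard.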
2.4

Resilient Storage & Databases

S3 Storage Classes · EBS · EFS · RDS HA · Aurora · ElastiCache · Backup

Task 2.4 — S3

S3 Storage Classes & Lifecycle Policies

Matching access patterns to cost-optimal storage tiers
Storage Class | Availability | Min Duration | Retrieval | Use Case
S3 Standard | 99.99% (3+ AZs) | None | Instant | Frequently accessed data
S3 Intelligent-Tiering | 99.9% | None | Instant / async | Unknown or changing access patterns
S3 Standard-IA | 99.9% (3+ AZs) | 30 days | Instant | Infrequently accessed; backups, DR copies
S3 One Zone-IA | 99.5% (1 AZ) | 30 days | Instant | Re-creatable infrequent data; lower cost
S3 Glacier Instant Retrieval | 99.9% | 90 days | Instant (ms) | Archive accessed once per quarter
S3 Glacier Flexible Retrieval | 99.99% | 90 days | Minutes–hours | Archival with flexible retrieval time
S3 Glacier Deep Archive | 99.99% | 180 days | 12–48 hours | Long-term regulatory compliance archives
Lifecycle Policies automate transitions: Standard → Standard-IA (30+ days) → Glacier (90+ days) → Deep Archive (180+ days). Minimum storage duration charges apply even if an object is deleted early. Standard-IA has per-GB retrieval fees, so it costs more than Standard if the data is accessed frequently.
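The transition chain above corresponds to a lifecycle configuration along these lines. The dictionary follows the shape the S3 lifecycle API expects, but the rule ID and prefix are hypothetical, and the helper is a local illustration rather than an AWS call.

```python
# Hypothetical rule: ID and prefix are placeholders for illustration.
LIFECYCLE_CONFIGURATION = {
    "Rules": [
        {
            "ID": "archive-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},        # Glacier Flexible Retrieval
                {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }
    ]
}


def storage_class_at(age_days, rule=LIFECYCLE_CONFIGURATION["Rules"][0]):
    """Return the storage class an object would be in at a given age."""
    current = "STANDARD"
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


for age in (10, 45, 120, 365):
    print(age, storage_class_at(age))
```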
Task 2.4 — Storage Types

EBS · EFS · FSx — Block & File Storage

Choosing the right persistent storage for compute workloads
Amazon EBS: Block storage for a single EC2 instance. AZ-scoped (same AZ as the instance). Snapshot to S3 for backup. Types: gp3 (general, recommended), io2 Block Express (highest IOPS, up to 256,000), st1 (throughput HDD), sc1 (cold HDD). Encrypt with KMS. Multi-Attach for io1/io2 (up to 16 Nitro instances, same AZ).
Amazon EFS: Managed NFS file system. Multi-AZ, multi-instance access simultaneously. Grows and shrinks automatically (no provisioning). Linux only (NFSv4). Performance modes: General Purpose and Max I/O. Throughput modes: Bursting, Provisioned, Elastic. EFS Infrequent Access tier for cost savings.
Amazon FSx for Windows: Managed Windows file server (SMB protocol). Active Directory integration. Supports DFS namespaces, NTFS, ACLs. Multi-AZ deployment available. Use for Windows workloads that need shared file storage (home directories, SQL Server backups).
Amazon FSx for Lustre: High-performance parallel file system for HPC, ML, financial modeling. Sub-millisecond latency, hundreds of GB/s throughput. Can link to S3 as a data repository. Ideal for compute-intensive workloads needing fast shared storage.
EBS = one instance, one AZ, block storage. EFS = many Linux instances, multi-AZ, file storage. FSx for Windows = Windows SMB, AD-integrated. FSx for Lustre = HPC / ML high-performance. If you see "shared file system for Linux EC2 fleet" → EFS.
Task 2.4 — Databases

RDS High Availability & Read Scaling

Multi-AZ · Read Replicas · RDS Proxy · Aurora
RDS Multi-AZ
  • Purpose: High availability (not read scaling)
  • Replication: Synchronous — standby is always in sync
  • Failover: Automatic, ~1–2 minutes. DNS endpoint flips.
  • Standby: Not readable (no traffic served)
  • Cost: 2× instance cost (active + standby)
RDS Read Replicas
  • Purpose: Read scaling + cross-region DR
  • Replication: Asynchronous — small replication lag
  • Failover: Manual promotion — not automatic
  • Readable: Yes — serve SELECT queries from replica
  • Cross-region: Yes — enables cross-region DR
  • Up to 5 replicas per RDS instance (15 for Aurora)
Amazon Aurora
  • MySQL/PostgreSQL compatible; AWS cites up to 5× the throughput of MySQL and 3× that of PostgreSQL
  • Shared storage: 6 copies across 3 AZs automatically
  • Up to 15 Aurora Replicas (replica lag typically well under 100 ms)
  • Auto-scales storage 10GB → 128TB
  • Aurora Serverless v2: scale to zero (dev/test)
  • Aurora Global Database: ~1s cross-region replication
  • Failover to replica: <30 seconds
RDS Proxy
  • Connection pooling for Lambda → RDS patterns
  • Reduces connection overhead during burst
  • Automatic failover routing (faster than DNS TTL)
  • Stores database credentials in Secrets Manager; supports IAM authentication
Multi-AZ = HA (sync, automatic failover, standby not readable). Read Replica = scale reads (async, manual failover, replica is readable). Never use Multi-AZ for read scaling — use Read Replicas. Aurora for anything needing >5 read replicas, global DR, or faster failover.
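On the application side, read scaling means sending writes to the primary and reads to replicas. The router below is a deliberately naive sketch: the endpoints are hypothetical, and classifying statements by a SELECT prefix is an assumption made only to keep the example short.

```python
import itertools


class ReadWriteRouter:
    """Send reads to replicas (round-robin) and everything else to the writer."""

    def __init__(self, writer_endpoint, replica_endpoints):
        self.writer = writer_endpoint
        # With no replicas configured, reads fall back to the writer.
        self._replicas = (itertools.cycle(replica_endpoints)
                          if replica_endpoints else None)

    def endpoint_for(self, sql):
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.writer


# Hypothetical endpoints for illustration only.
router = ReadWriteRouter(
    "writer.example.internal",
    ["replica-1.example.internal", "replica-2.example.internal"],
)
print(router.endpoint_for("SELECT * FROM orders"))
print(router.endpoint_for("INSERT INTO orders VALUES (1)"))
```

Because replication to read replicas is asynchronous, a read issued right after a write may return stale data; reads that must see the latest write belong on the writer.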
Task 2.4 — NoSQL & Caching

DynamoDB & ElastiCache Resilience Patterns

NoSQL HA · Global Tables · Redis vs. Memcached
DynamoDB Resilience
  • Built-in Multi-AZ: Data stored across 3 AZs by default
  • Global Tables: Multi-Region Active/Active; last-writer-wins; <1s replication
  • PITR: Point-in-time recovery; restore to any second in last 35 days
  • On-demand backups: Full table backup anytime; no performance impact
  • Streams: Ordered change log; triggers Lambda; 24h retention
  • DAX: In-memory DynamoDB accelerator; microsecond reads
ElastiCache: Redis vs. Memcached
  • Redis: Persistence, replication, pub/sub, Lua scripting, complex data structures (sorted sets, lists). Multi-AZ with automatic failover. Global Datastore for cross-region. Use for sessions, leaderboards, pub/sub, distributed locks.
  • Memcached: Simple key-value, multi-threaded, no persistence, no replication. Pure caching, simpler ops. Use when you only need a dumb cache and don't need any Redis features.
Caching Patterns
  • Lazy Loading (Cache-Aside): Check cache first; miss → load from DB → write to cache
  • Write-Through: Write to cache AND DB simultaneously; always fresh
  • TTL: Time-to-live prevents stale data
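Lazy loading with a TTL can be sketched with a plain dictionary standing in for the cache. The class and loader names are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
class LazyLoadingCache:
    """Cache-aside: read the cache first, load from the database on a miss."""

    def __init__(self, load_from_db, ttl_seconds):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key, now):
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]  # cache hit, still within TTL
        value = self._load(key)  # miss or expired: go to the database
        self._entries[key] = (value, now + self._ttl)
        return value


db_reads = []  # records every trip to the (fake) database


def fake_db_lookup(key):
    db_reads.append(key)
    return key.upper()


cache = LazyLoadingCache(fake_db_lookup, ttl_seconds=60)
cache.get("user:1", now=0)    # miss: loads from the database
cache.get("user:1", now=30)   # hit: served from the cache
cache.get("user:1", now=61)   # expired: loads again
print(len(db_reads))
```

Only requested keys are ever cached, which keeps the cache small; the cost is that the first read of each key, and the first read after expiry, pays the database latency.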
"Reduce RDS read load" → ElastiCache (lazy loading pattern). "Global users need <1ms reads" → DynamoDB + DAX. "Multi-region Active/Active NoSQL" → DynamoDB Global Tables. Redis = persistence + features. Memcached = simple cache only.
Task 2.4 — Backup & Recovery

AWS Backup & Recovery Strategies

Centralized backup management and DR patterns
AWS Backup Service
  • Centralized backup policy management across: EBS, RDS, Aurora, DynamoDB, EFS, FSx, Storage Gateway, S3
  • Backup plans: schedules, retention rules, lifecycle to cold storage
  • Cross-region backup copies for DR
  • Cross-account backups via AWS Organizations
  • Backup Vault Lock: WORM protection for backups
  • Tags-based resource selection
Snapshot Strategy by Service
  • EBS: Snapshots to S3 (incremental); schedule via DLM or AWS Backup
  • RDS/Aurora: Automated daily backups with 0–35 day retention (0 disables them on RDS; Aurora requires at least 1), plus manual snapshots kept until deleted
  • DynamoDB: PITR (35 days) + on-demand backups (indefinite)
  • S3: Versioning + CRR; no snapshots needed
DR Strategy ↔ AWS Patterns
  • Backup & Restore: AWS Backup to S3 / cross-region. High RTO (hours).
  • Pilot Light: RDS read replica in DR region; EC2 AMIs ready to launch. Scale up on disaster.
  • Warm Standby: Scaled-down ASG + DB in DR region running. Fast scale-up on failover.
  • Active/Active: Route 53 latency routing; Aurora Global DB; DynamoDB Global Tables. Near-zero RTO/RPO.
AWS Backup is the single-pane-of-glass answer for "centrally manage backups across services and accounts." Data Lifecycle Manager (DLM) covers only EBS snapshots and EBS-backed AMIs. For a cross-region RPO of seconds: Aurora Global DB + Route 53 failover routing.
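A backup plan with a schedule, cold-storage lifecycle, and cross-region copy might look like the dictionary below. The field names follow the AWS Backup plan document as best understood here; the plan name, vault names, and ARN are hypothetical, and the validation helper is a local check rather than an AWS call.

```python
# Hypothetical plan: names and the destination vault ARN are placeholders.
BACKUP_PLAN = {
    "BackupPlanName": "daily-with-dr-copy",
    "Rules": [
        {
            "RuleName": "daily",
            "TargetBackupVaultName": "primary-vault",
            "ScheduleExpression": "cron(0 5 ? * * *)",  # every day at 05:00 UTC
            "Lifecycle": {
                "MoveToColdStorageAfterDays": 30,
                "DeleteAfterDays": 365,
            },
            "CopyActions": [
                {
                    "DestinationBackupVaultArn":
                        "arn:aws:backup:us-west-2:123456789012:backup-vault:dr-vault",
                }
            ],
        }
    ],
}


def lifecycle_is_valid(lifecycle):
    """Backups moved to cold storage must stay there at least 90 days."""
    cold = lifecycle.get("MoveToColdStorageAfterDays")
    delete = lifecycle.get("DeleteAfterDays")
    if cold is None or delete is None:
        return True
    return delete >= cold + 90


print(all(lifecycle_is_valid(rule["Lifecycle"]) for rule in BACKUP_PLAN["Rules"]))
```

The copy action to a vault in another Region is what turns a plain backup schedule into the Backup & Restore DR pattern listed above.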
Quick Review

Exam Checklist — Domain 2

Can you answer these?
Task 2.1 — Scalable & Loosely Coupled
  • Target Tracking vs. Step vs. Scheduled scaling policies
  • Why horizontal scaling is preferred over vertical
  • Lambda limits: 15 min max, concurrency, cold starts
  • Fargate (containers) vs. Lambda (functions) tradeoff
  • Step Functions for workflow orchestration with retry logic
Task 2.2 — Highly Available
  • ALB (L7, HTTP/S, path routing) vs. NLB (L4, static IP, TCP/UDP)
  • GWLB for third-party virtual security appliances
  • CloudFront (CDN, caching) vs. Global Accelerator (routing, no cache)
  • ALB sticky sessions vs. ElastiCache for distributed session state
  • Cross-zone load balancing behavior per ELB type
Task 2.3 — Decoupled Architectures
  • SQS Standard (high throughput, at-least-once) vs. FIFO (ordered, exactly-once)
  • Visibility timeout and Dead Letter Queues
  • SNS + SQS fan-out pattern for multi-consumer events
  • Kinesis Streams (replay, custom consumers) vs. Firehose (managed delivery)
  • EventBridge for event routing + scheduled rules (cron)
Task 2.4 — Storage & Databases
  • S3 storage class selection: Standard → IA → Glacier tiers
  • EBS (one instance) vs. EFS (multi-instance Linux NFS) vs. FSx
  • RDS Multi-AZ (sync HA) vs. Read Replica (async, readable)
  • Aurora advantages: 15 replicas, Global DB, storage auto-scale
  • ElastiCache Redis (persistence + features) vs. Memcached (simple)
Quick Reference

Service → Scenario Quick Map

Scaling & Compute
  • EC2 ASG → auto scale compute
  • Lambda → event-driven, short tasks
  • Fargate → containers, no EC2 mgmt
  • Step Functions → workflow orchestration
  • API Gateway → managed API front-end
Load Balancing
  • ALB → HTTP/S, path routing, WAF
  • NLB → TCP/UDP, static IP, PrivateLink
  • GWLB → virtual firewall appliances
  • Global Accelerator → static IP, TCP/UDP global
  • CloudFront → CDN, HTTP caching at edge
Decoupling
  • SQS Standard → high-throughput queuing
  • SQS FIFO → ordered, exactly-once
  • SNS → pub/sub fan-out
  • EventBridge → event routing + scheduling
  • Kinesis → real-time streaming
Storage
  • S3 → objects, backups, static web
  • EBS → single EC2 block storage
  • EFS → shared Linux NFS file system
  • FSx Windows → SMB, AD-integrated
  • FSx Lustre → HPC / ML high perf
Databases
  • RDS Multi-AZ → sync HA, auto failover
  • RDS Read Replica → read scaling + DR
  • Aurora → 15 replicas, fast failover
  • Aurora Global DB → cross-region <1s RPO
  • DynamoDB Global Tables → Active/Active
Caching & Backup
  • ElastiCache Redis → sessions, leaderboards
  • ElastiCache Memcached → simple cache
  • DAX → DynamoDB microsecond reads
  • AWS Backup → centralized backup policy
  • S3 Lifecycle → archive & expiry rules

Domain 2 Complete

You're ready for Domain 2

26% of SAA-C03 · Design Resilient Architectures
Good luck on the exam!

2.1 Scalable & Loosely Coupled · 2.2 Highly Available & Fault Tolerant · 2.3 Decoupled Architectures · 2.4 Resilient Storage & Databases