SAP-C02 Continuous Improvement: Well-Architected Reviews at Scale

Step 3: Continuous Improvement

Every architecture you’ve designed so far in this track was greenfield: a blank canvas. Most of the real work of a Professional-level architect happens after that canvas already has three years of production traffic on it, a backlog of “we’ll fix that later” decisions, and a team that’s afraid to touch half of it. This step is about that harder, messier job: looking at something that already works and deciding what to change, in what order, without breaking it.


The Well-Architected Framework as a Diagnostic Tool, Not a Checklist

You met the six pillars early in the Associate track. At Professional level, the pillars stop being something you memorize and start being something you apply as a structured interrogation of an existing system. AWS formalizes this with the Well-Architected Tool and, for deeper engagements, Well-Architected Reviews run against workload-specific and industry lenses (Serverless Lens, SaaS Lens, Data Analytics Lens) layered on top of the general framework.

A review produces two buckets of findings, and the exam consistently distinguishes them:

HIGH RISK ISSUE (HRI)                     │ MEDIUM RISK ISSUE (MRI)
──────────────────────────────────────────┼────────────────────────────────────────
Single points of failure in prod          │ Missing tags on non-critical resources
No tested backup/restore process          │ Slightly oversized instances
Hardcoded credentials in application code │ Missing CloudWatch dashboards
Public database with no encryption        │ No lifecycle policy on log bucket

The Professional exam does not ask you to fix everything found in a review simultaneously; it asks you to prioritize. High-risk issues that threaten data loss, security exposure, or total outage get remediated first, generally regardless of remediation cost, because the downside is asymmetric: the cost of a fix is bounded, while the cost of a breach or an unrecoverable outage is not. This prioritization logic (impact and likelihood before cost) is the actual skill being tested, more than familiarity with any specific pillar.
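The ordering rule can be sketched in a few lines of Python. This is only an illustration: the `Finding` type, the `effort_days` estimates, and the `RISK_ORDER` ranking are hypothetical, not something the Well-Architected Tool exports in this form. The point is that risk level is the primary sort key and cost only breaks ties within a level.

```python
from dataclasses import dataclass

# Hypothetical ranking: a lower number means "remediate sooner".
RISK_ORDER = {"HRI": 0, "MRI": 1}


@dataclass(frozen=True)
class Finding:
    title: str
    risk: str          # "HRI" or "MRI", as classified by the review
    effort_days: int   # illustrative remediation cost estimate


def prioritize(findings):
    """Sort by risk level first; cost only breaks ties within a level."""
    return sorted(findings, key=lambda f: (RISK_ORDER[f.risk], f.effort_days))


findings = [
    Finding("Missing tags on non-critical resources", "MRI", 1),
    Finding("No tested backup/restore process", "HRI", 10),
    Finding("Hardcoded credentials in application code", "HRI", 3),
    Finding("No lifecycle policy on log bucket", "MRI", 2),
]

for f in prioritize(findings):
    print(f"{f.risk}  {f.effort_days:>2}d  {f.title}")
```

Note that the one-day tagging fix still sorts below the ten-day backup work: cheap does not outrank risky.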


Finding the Bottleneck Before You Fix It

A recurring failure mode in real reviews (and in exam distractor answers) is treating a symptom as the root cause. A system that’s “slow” gets more compute thrown at it, when the actual constraint was a database connection pool, a synchronous downstream call, or a single-threaded queue consumer. Professional-level troubleshooting starts by tracing the request path end-to-end and instrumenting each hop.

Client ──▶ CloudFront ──▶ ALB ──▶ ECS Service ──▶ RDS (single AZ, db.m5.large)
                                       │
                                       └──▶ External payment API (synchronous call, 800ms p99)

In this shape, the ECS service can autoscale all day and the p99 latency won’t improve, because every request blocks on an 800ms synchronous call to a third party. The fix isn’t more compute; it’s decoupling: accept the order, return quickly, and process the payment confirmation asynchronously via a queue and webhook or polling. X-Ray is the tool that reveals this in production rather than in theory. Its service map shows where time is spent across a distributed request, and Professional-level scenario questions frequently describe symptoms (“high latency, low CPU utilization across the fleet”) that only make sense once you recognize that the bottleneck is I/O-bound waiting, not compute-bound processing.
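A minimal sketch of the decoupled shape, using only the Python standard library. Everything here is a stand-in chosen for illustration: the in-memory `queue.Queue` plays the role of a durable queue, the `orders` dict plays the role of a datastore, and `call_payment_api` is a hypothetical placeholder for the slow third-party call.

```python
import queue
import threading
import uuid

orders = {}                     # order_id -> status (stand-in for a datastore)
payment_queue = queue.Queue()   # stand-in for a durable message queue


def call_payment_api(order_id):
    """Hypothetical placeholder for the slow third-party call."""
    return "CONFIRMED"


def accept_order(cart):
    """Request path: record the order, enqueue the payment, return at once."""
    order_id = str(uuid.uuid4())
    orders[order_id] = "PENDING"
    payment_queue.put(order_id)
    return {"order_id": order_id, "status": "PENDING"}


def payment_worker():
    """Background path: the slow call happens here, off the request path."""
    while True:
        order_id = payment_queue.get()
        if order_id is None:            # sentinel value: stop the worker
            payment_queue.task_done()
            break
        orders[order_id] = call_payment_api(order_id)
        payment_queue.task_done()


worker = threading.Thread(target=payment_worker, daemon=True)
worker.start()

response = accept_order({"sku": "example", "qty": 1})
payment_queue.join()                    # demo only: wait for the worker to catch up
final_status = orders[response["order_id"]]

payment_queue.put(None)                 # shut the worker down cleanly
worker.join()
```

The caller gets `PENDING` back without waiting on the payment provider; the confirmed status arrives later and would be surfaced to the client by polling or a webhook.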

CloudWatch Application Insights and Compute Optimizer round out the instrumentation picture. Application Insights automatically surfaces anomalies for common application frameworks without hand-built dashboards, and Compute Optimizer recommends resource right-sizing based on actual utilization history rather than guesswork, which matters when a review finds instances provisioned for a peak load that happens twice a year.
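To make the “utilization history rather than guesswork” idea concrete, here is a toy heuristic in Python. It is emphatically not Compute Optimizer’s algorithm; the 95th-percentile cut-off, the 40% threshold, and the sample data are all assumptions picked for the example.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile; pct is a number in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def looks_oversized(cpu_samples, pct=95, threshold=40.0):
    """Flag an instance whose pct-th percentile CPU stays below threshold."""
    return percentile(cpu_samples, pct) < threshold


# Illustrative history: mostly idle, with two rare peak readings.
cpu_samples = [10.0] * 98 + [90.0] * 2

print("p95 CPU:", percentile(cpu_samples, 95))
print("max CPU:", max(cpu_samples))
print("candidate for right-sizing:", looks_oversized(cpu_samples))
```

Sizing by the maximum would keep the large instance forever; sizing by a high percentile exposes the gap, and the rare peak then becomes a question for scaling policy rather than for baseline instance size.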


Automating Operational Excellence: Self-Healing Systems

The Operational Excellence pillar’s most testable idea is this: humans responding to alarms manually does not scale, and it introduces the exact delay and error rate you’re trying to design out. A mature architecture treats common failure classes as automatable, not escalatable.

Failure Signal                  │ Manual Response (avoid)                   │ Automated Response (prefer)
────────────────────────────────┼───────────────────────────────────────────┼──────────────────────────────────────────────────
EC2 instance fails health check │ On-call engineer investigates and reboots │ Auto Scaling replaces the instance automatically
Lambda function throwing errors │ Engineer reads logs, redeploys            │ CloudWatch Alarm triggers rollback via CodeDeploy
Disk approaching capacity       │ Ticket filed to increase volume           │ EventBridge rule triggers Lambda to expand EBS volume
Unusual account activity        │ SOC analyst manually investigates         │ GuardDuty finding triggers automated Lambda remediation (isolate instance, revoke credentials)

Systems Manager Automation documents and runbooks let you codify the remediation steps once and trigger them from EventBridge rules or CloudWatch Alarms, turning “the wiki page says to do these seven steps” into “this runs automatically at 3 a.m. without waking anyone up.” The exam frequently frames this as “reduce mean time to recovery without increasing headcount.” That phrasing is a strong signal that the expected answer involves EventBridge plus Systems Manager Automation or a Lambda-based remediation, not a bigger on-call rotation.
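A rough sketch of the alarm-to-runbook wiring, written as plain Python so it runs without AWS credentials. The event pattern follows the shape EventBridge uses for CloudWatch alarm state changes; the `matches` function is a deliberately simplified local stand-in for EventBridge’s own matching, and the alarm names, the `RUNBOOKS` mapping, and the `Custom-ExpandEbsVolume` runbook name are hypothetical.

```python
# Pattern an EventBridge rule could use to match alarms entering ALARM state.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}

# Hypothetical mapping from alarm name to the runbook that remediates it.
RUNBOOKS = {
    "web-instance-unhealthy": "AWS-RestartEC2Instance",
    "data-volume-nearly-full": "Custom-ExpandEbsVolume",  # illustrative name
}


def matches(pattern, event):
    """Simplified matcher: handles exact-value lists and nested dicts only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


def plan_remediation(event):
    """Return the runbook to start, or None to escalate to a human."""
    if not matches(ALARM_PATTERN, event):
        return None
    return RUNBOOKS.get(event["detail"]["alarmName"])


sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "web-instance-unhealthy", "state": {"value": "ALARM"}},
}
print(plan_remediation(sample_event))
```

In a real deployment the rule would target the Automation runbook directly, or a Lambda function would start it through the Systems Manager API; returning `None` for unknown alarms keeps unrecognized failures on the human escalation path instead of guessing.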

AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) pushes this further, into chaos engineering. Rather than waiting for production to fail and finding out whether your self-healing actually works, FIS deliberately injects failure (terminating instances, throttling API calls, introducing latency, simulating an AZ outage) inside guardrails you define: stop conditions tied to CloudWatch Alarms, so that an experiment aborts if it causes real customer impact. A well-run continuous improvement program schedules these experiments regularly rather than treating resilience as something you built once and can assume still works after eighteen months of unrelated changes.
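The guardrail idea can be illustrated with a small simulation. This is a toy model, not how FIS is implemented or configured: the step names, the `metrics` dict, and the 5% error-rate threshold are assumptions, and the managed service evaluates its stop conditions itself rather than between steps of a Python loop.

```python
def run_experiment(steps, alarm_in_alarm):
    """Apply fault-injection steps in order; abort once the guardrail fires.

    steps: list of (name, inject) pairs, where inject takes no arguments.
    alarm_in_alarm: callable returning True when the stop condition is met.
    """
    completed = []
    for name, inject in steps:
        if alarm_in_alarm():
            return {"state": "stopped", "completed": completed}
        inject()
        completed.append(name)
    state = "stopped" if alarm_in_alarm() else "completed"
    return {"state": state, "completed": completed}


# Simulated customer-facing metric that each injected fault degrades.
metrics = {"error_rate": 0.0}

steps = [
    ("terminate-one-instance", lambda: metrics.update(error_rate=0.01)),
    ("add-latency", lambda: metrics.update(error_rate=0.08)),
    ("simulate-az-outage", lambda: metrics.update(error_rate=0.50)),
]

# Stop condition: the (simulated) alarm fires above a 5% error rate.
result = run_experiment(steps, lambda: metrics["error_rate"] > 0.05)
print(result)
```

The second step pushes the error rate past the threshold, so the most destructive step never runs. That is the property worth testing for: an experiment should fail safe before it fails big.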


Modernization: Monolith to Microservices, Without a Big Bang

Rewriting a monolith from scratch is the answer nobody should give, and the exam consistently presents it as a distractor. The tested pattern is the strangler fig: build the new service alongside the old monolith, route a slice of traffic to it, and only decommission the corresponding monolith code once the new path is proven.

                      ┌────────────────────┐
Client ───▶ API GW ──▶│    Routing Rule    │
                      └──────────┬─────────┘
         ┌───────────────────────┴───────────────────────┐
         │                                               │
┌────────▼─────────┐                          ┌──────────▼──────────┐
│ Legacy Monolith  │                          │ New Microservice    │
│ (still handles   │                          │ (handles the        │
│ everything else) │                          │ "checkout" path)    │
└──────────────────┘                          └─────────────────────┘

API Gateway (or an ALB using path-based rules, with weighted target groups for the percentage split) sits in front of both, and the routing rule moves traffic incrementally (5%, then 25%, then 100%) to the new service as confidence builds. This is the same mechanism as a canary deployment, applied at the architecture level instead of the deployment level. That overlap is worth internalizing: modernization and safe deployment are the same underlying technique used at different time scales.
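The routing decision itself is simple enough to sketch. The following Python is an illustration of the idea only; managed routing layers implement their own weighting, and the `/checkout` prefix, the `route` function, and the hash-bucketing scheme are assumptions made for the example.

```python
import hashlib


def route(request_id, path, new_service_percent):
    """Decide which backend serves a request during a strangler migration.

    Only the carved-out path is eligible for the new service; everything
    else stays on the monolith. Hashing the request ID into one of 100
    buckets gives a stable split of roughly new_service_percent percent.
    """
    if not path.startswith("/checkout"):
        return "monolith"
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "new-service" if bucket < new_service_percent else "monolith"


request_ids = [f"req-{i}" for i in range(10_000)]
for percent in (5, 25, 100):
    hits = sum(route(r, "/checkout", percent) == "new-service" for r in request_ids)
    print(f"{percent:>3}% target: {hits} of {len(request_ids)} checkout requests on the new service")
```

Because the bucket depends only on the request ID, raising the percentage only ever adds traffic to the new service: a request routed there at 5% is still routed there at 25%. Hashing on a user or session ID instead would keep each user on one side of the split.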


Database Modernization

Databases resist the strangler pattern more than application code does, because data has gravity: you can’t easily run “half” a database. The common professional-level path is a staged one: first move off a commercial engine onto a compatible open-source-based engine (commonly using DMS with schema conversion tooling to handle the engine translation), then only afterward consider decomposing a single large relational database into per-service databases as part of a broader microservices effort.

Trying to do both simultaneously (engine migration and schema decomposition in one step) is a common exam distractor precisely because it maximizes risk in a single cutover. The tested best practice sequences these efforts:

  1. Assess and convert schema/application compatibility issues (schema conversion tooling flags stored procedures, proprietary functions, and other engine-specific dependencies)
  2. Migrate data with continuous replication so cutover downtime is minimal
  3. Validate functional and performance parity against the source engine
  4. Only then, as a separate initiative, evaluate decomposing shared tables into per-service ownership if a broader microservices migration is underway
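As one illustration of the validation step (step 3), the sketch below compares per-table fingerprints between source and target. It is a simplified, hypothetical check on in-memory rows, not a substitute for the validation built into migration tooling; in practice, type differences between engines (decimals, timestamps, character sets) would need normalizing before hashing.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row count, order-independent digest) for a table's rows."""
    row_digests = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest() for row in rows
    )
    combined = hashlib.sha256("".join(row_digests).encode("ascii")).hexdigest()
    return len(row_digests), combined


def mismatched_tables(source, target):
    """Names of tables whose fingerprints differ or that exist on one side only."""
    names = sorted(set(source) | set(target))
    return [
        name
        for name in names
        if table_fingerprint(source.get(name, [])) != table_fingerprint(target.get(name, []))
    ]


# Illustrative data: same orders in a different row order, one changed customer.
source_db = {"orders": [(1, "a"), (2, "b")], "customers": [(1, "x")]}
target_db = {"orders": [(2, "b"), (1, "a")], "customers": [(1, "y")]}

print(mismatched_tables(source_db, target_db))
```

Sorting the per-row digests makes the comparison insensitive to row order, which matters because two engines rarely return rows in the same order without an explicit sort.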

Exam Focus: What Questions Test From This Step