Step 2: New Solutions Design
Give ten Professional-level candidates the same greenfield requirements and you'll get ten different architecture diagrams, and the exam is fine with that, because it isn't grading a single correct diagram. It's grading whether your choices trace back to the stated constraints: latency budget, consistency requirements, failure tolerance, cost ceiling. This step works through the recurring building blocks you'll assemble differently depending on which constraint is dominant.
Compute Selection Is a Tradeoff Table, Not a Checklist
At Associate level, "when do I use Lambda vs EC2" has a reasonably short answer. At Professional level, the question arrives buried inside a paragraph of business requirements, and you have to extract it yourself. Frame every compute decision along four axes: control, operational overhead, cost model, and startup latency.
| Compute Option | Control Level | Ops Overhead | Cold Start Concern | Best Fit |
|---|---|---|---|---|
| EC2 (self-managed) | Full OS access | High | None | Licensing-bound software, custom kernels |
| ECS/EKS on EC2 | Container orchestration | Medium | Low | Existing container investment, need for GPU/specialized instances |
| Fargate | None below task | Low | Low-medium | Containerized workloads without capacity planning |
| Lambda | None | Lowest | Can matter at scale | Event-driven, spiky, sub-15-minute execution |
A pattern that shows up repeatedly in scenario questions: a company migrating from EC2 wants "less operational burden" but also needs GPU instances for an ML inference workload. Fargate doesn't support arbitrary GPU types the way EC2-backed ECS/EKS does, so the "least operational overhead" answer isn't always the right one: the constraint (GPU) eliminates it. Always let the hard constraint prune the option list before you optimize for the soft one (operational simplicity).
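The prune-then-optimize order can be sketched in a few lines of Python. This is only an illustration: the `ComputeOption` type, the `ops_overhead` ranking numbers, and the simplified yes/no `supports_gpu` flag are assumptions made for the example, loosely following the table above, not AWS-published values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ComputeOption:
    name: str
    ops_overhead: int                     # lower is better; illustrative ranking only
    supports_gpu: bool                    # simplified yes/no flag (assumption)
    max_runtime_minutes: Optional[float]  # None means no platform-imposed limit


# Hypothetical values loosely following the tradeoff table.
OPTIONS = [
    ComputeOption("EC2 (self-managed)", 4, True, None),
    ComputeOption("ECS/EKS on EC2", 3, True, None),
    ComputeOption("Fargate", 2, False, None),
    ComputeOption("Lambda", 1, False, 15),
]


def choose_compute(needs_gpu: bool, runtime_minutes: float) -> str:
    """Prune by hard constraints first, then pick the lowest ops overhead."""
    survivors = [
        o for o in OPTIONS
        if (o.supports_gpu or not needs_gpu)
        and (o.max_runtime_minutes is None or runtime_minutes <= o.max_runtime_minutes)
    ]
    if not survivors:
        raise ValueError("no compute option satisfies the hard constraints")
    return min(survivors, key=lambda o: o.ops_overhead).name


print(choose_compute(needs_gpu=True, runtime_minutes=5))    # ECS/EKS on EC2
print(choose_compute(needs_gpu=False, runtime_minutes=5))   # Lambda
print(choose_compute(needs_gpu=False, runtime_minutes=45))  # Fargate
```

The GPU requirement removes Fargate and Lambda before operational overhead is ever compared, which is why the first call does not return the lowest-overhead option overall.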
Designing for Global Applications
A single-Region deployment is the default assumption at Associate level. At Professional level, "customers in Tokyo, Frankfurt, and São Paulo all need sub-100ms reads" is a normal opening sentence, and it forces a genuinely different topology.
                     Route 53 (latency-based routing)
                                     |
            +------------------------+------------------------+
            |                        |                        |
    +-------v------+         +-------v------+        +--------v-------+
    |  us-east-1   |         | eu-central-1 |        | ap-northeast-1 |
    |     ALB      |         |     ALB      |        |      ALB       |
    |     ECS      |         |     ECS      |        |      ECS       |
    +-------+------+         +-------+------+        +--------+-------+
            |                        |                        |
    +-------v------------------------v------------------------v-------+
    |     DynamoDB Global Table (multi-active, last-writer-wins)      |
    +-----------------------------------------------------------------+

DynamoDB Global Tables replicate a table across Regions with active-active writes in every Region: any Region can accept a write, and replication catches the others up within roughly a second under normal conditions. The tradeoff the exam wants you to articulate: this buys you low local write latency everywhere at the cost of eventual consistency and the possibility of conflicting writes resolved by last-writer-wins. If your application cannot tolerate that (financial ledgers, inventory counts that must never go negative), Global Tables is the wrong answer regardless of how attractive the latency profile looks.
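To see why last-writer-wins is a problem for inventory counts, consider a minimal sketch in plain Python. It is a toy model, not DynamoDB's replication code: the `Version` type, the timestamps, and the tie-break on Region name are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: int        # stock count written by this Region
    timestamp: float  # write time in seconds (hypothetical clock)
    region: str


def last_writer_wins(a: Version, b: Version) -> Version:
    """Keep the version with the later timestamp.

    Ties are broken by Region name here; that tie-break is an assumption.
    """
    return max(a, b, key=lambda v: (v.timestamp, v.region))


# Both Regions read stock = 1 and each sells one unit before replication arrives,
# so each writes stock = 0 locally.
starting_stock = 1
units_sold = 2
tokyo = Version(value=0, timestamp=100.20, region="ap-northeast-1")
frankfurt = Version(value=0, timestamp=100.35, region="eu-central-1")

winner = last_writer_wins(tokyo, frankfurt)
print(winner.region, winner.value)   # eu-central-1 0
print(starting_stock - units_sold)   # -1
```

Both Regions converge on a stock of 0, yet two units were sold against a starting stock of 1. Conflict resolution picked a winner; it did not enforce the business invariant.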
Aurora Global Database takes the opposite shape: one primary Region handles all writes, and several secondary Regions get physical-layer replication with typically sub-second lag, providing fast local reads everywhere and a disaster-recovery target with a low RPO. It does not give you multi-Region writes. When a scenario says "reporting queries in three Regions, but all order processing happens centrally," Aurora Global Database is the fit; when it says "each Region needs to accept writes independently," you're back to Global Tables or a custom conflict-resolution scheme.
Global Accelerator solves a different layer of the problem entirely: it's not a data replication tool, it's a network entry point. It anycasts two static IPs from the AWS global network edge, then routes user traffic over AWS's backbone rather than the public internet, improving both latency and jitter, and it fails traffic over to a healthy Region automatically if an endpoint group becomes unhealthy. Compare that against CloudFront, which is built for caching content at edge locations. Global Accelerator is for TCP/UDP traffic that isn't cacheable, like gaming, VoIP, or API traffic that needs consistent low-latency routing rather than content caching.
Event-Driven Architecture at Enterprise Scale
Once you have more than a handful of services, direct service-to-service calls create a dependency graph nobody can reason about. Event-driven design decouples producers from consumers through an intermediary, and the Professional exam expects fluency in choosing the right intermediary for the traffic shape.
    Order Service --event--> EventBridge Bus --rule--> Inventory Service
                                             |
                                             +--rule--> Notification Service (SQS)
                                             |
                                             +--rule--> Analytics Pipeline (Kinesis Firehose)

SNS fans a single message out to multiple subscribers immediately; think alerting, or triggering several independent Lambda functions off one event. SQS buffers work for a single consumer group and lets it process at its own pace, with visibility timeouts and dead-letter queues protecting against poison messages. EventBridge goes further than either: it's a schema-aware event bus with content-based routing rules, native integration with dozens of AWS services and SaaS partners, and support for archiving and replaying events, which is useful when you stand up a new consumer service six months later and need to backfill it against historical events. Kinesis Data Streams is for when order matters and throughput is continuous rather than discrete messages: clickstream data, IoT telemetry, anything you'll process with multiple independent consumer applications reading the same ordered stream at their own checkpoint position.
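Content-based routing is easier to picture with a toy model. The sketch below is plain Python, not the EventBridge API: `matches` implements only a small subset of what event patterns can express (exact-value lists and nested fields), and the rule names, targets, and event fields are hypothetical.

```python
from typing import Any


def matches(pattern: dict[str, Any], event: dict[str, Any]) -> bool:
    """Return True if every field in the pattern is satisfied by the event.

    A list in the pattern means "the event value must be one of these";
    a nested dict means "recurse into that field".
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical rules: target name -> pattern.
RULES = {
    "inventory-service": {"detail-type": ["OrderPlaced"]},
    "notification-queue": {
        "detail-type": ["OrderPlaced"],
        "detail": {"channel": ["email", "sms"]},
    },
    "analytics-firehose": {"source": ["orders"]},
}


def route(event: dict[str, Any]) -> list[str]:
    """List every target whose rule matches; the producer never names a target."""
    return [target for target, pattern in RULES.items() if matches(pattern, event)]


placed = {"source": "orders", "detail-type": "OrderPlaced",
          "detail": {"orderId": "o-1", "channel": "email"}}
shipped = {"source": "orders", "detail-type": "OrderShipped",
           "detail": {"orderId": "o-1"}}

print(route(placed))   # ['inventory-service', 'notification-queue', 'analytics-firehose']
print(route(shipped))  # ['analytics-firehose']
```

Adding a new consumer in this model means adding one entry to `RULES`; the code that emits `placed` and `shipped` does not change.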
A frequent scenario trap: a question describes "many independent teams need to react to the same business event, and new consumers get added regularly without changing the producer." That phrase, new consumers added without touching the producer, is the signature of EventBridge or SNS, not a direct API call or a tightly coupled SQS queue per consumer.
Microservices Decomposition and Data Ownership
Breaking a monolith into services is as much a data design problem as a compute design problem, and Professional-level questions increasingly hinge on the data side. The rule that trips people up: each microservice owns its data exclusively, and other services never reach directly into that data store. They ask through an API or react to an event.
    Monolith:
        [ Single App ] ---- [ Single Shared Database ]

    Microservices:
        [ Orders Service ]    -- owns -- [ Orders DB ]
        [ Inventory Service ] -- owns -- [ Inventory DB ]
        [ Shipping Service ]  -- owns -- [ Shipping DB ]
            |                                   |
            +------ events via EventBridge -----+

This is why a service mesh or API Gateway alone doesn't solve microservices decomposition: the exam sometimes offers "put API Gateway in front of the monolith" as a distractor answer for a question that's actually asking about decomposing the data layer, not just the request routing layer. Fronting a monolith with API Gateway changes nothing about coupling if every "service" still writes to the same tables.
The saga pattern handles transactions that span multiple services, since you can no longer rely on a single database transaction. Orchestration-based sagas (a Step Functions state machine coordinating each step and compensating on failure) are generally easier to reason about and test than choreography-based sagas (each service reacting to the previous service's event with no central coordinator), and the exam tends to favor Step Functions orchestration as the "well-architected" answer when a workflow has more than three or four steps or requires visible compensating transactions.
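The orchestration idea, run each step in order and undo the completed ones in reverse when a later step fails, can be sketched in plain Python. This stands in for what a state machine would coordinate; it does not use the Step Functions API, and the step names and the in-memory `state` dictionary are hypothetical.

```python
from typing import Callable

# (step name, action, compensating action)
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: list[str] = []
    completed: list[tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"done:{name}")
            completed.append((name, compensate))
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            break
    return log


# Hypothetical order workflow backed by an in-memory dictionary.
state = {"reserved": False, "charged": False}


def reserve() -> None:
    state["reserved"] = True


def release() -> None:
    state["reserved"] = False


def charge() -> None:
    state["charged"] = True


def refund() -> None:
    state["charged"] = False


def ship() -> None:
    raise RuntimeError("carrier unavailable")


def cancel_shipment() -> None:
    pass


log = run_saga([
    ("reserve_inventory", reserve, release),
    ("charge_payment", charge, refund),
    ("schedule_shipping", ship, cancel_shipment),
])
print(log)
print(state)  # {'reserved': False, 'charged': False}
```

Because one coordinator holds the list of completed steps, the compensation order is explicit and testable, which is the property the orchestration style is usually chosen for.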
Elasticity and Scalability Tradeoffs
Scaling isn't free, and Professional-level design has to reckon with the failure modes of scaling itself, not just the mechanism.
| Pattern | Scales Fast? | Cost Behavior | Risk |
|---|---|---|---|
| EC2 Auto Scaling (target tracking) | Minutes | Pay for running capacity | Scaling lag during sudden spikes |
| Application Auto Scaling on ECS/Fargate | Minutes | Pay for running tasks | Same lag, smaller blast radius per task |
| Lambda concurrency | Seconds | Pay per invocation | Downstream systems (RDS connections) can be overwhelmed by sudden concurrency |
| DynamoDB on-demand | Instant | Pay per request | Cost can spike unexpectedly under sustained high load |
The recurring exam scenario: Lambda scales so quickly that it saturates a downstream RDS connection pool during a traffic spike, causing errors that look like a database problem but are actually a compute-scaling problem. The fix is RDS Proxy, which pools and multiplexes connections so thousands of concurrent Lambda invocations don't each open a direct database connection. Recognizing "Lambda + RDS + connection exhaustion" as the setup for "the answer is RDS Proxy" is one of the more reliable pattern-matches on this exam.
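A back-of-the-envelope model shows why pooling changes the outcome. The sketch below is a deliberately simplified assumption, not a description of how RDS Proxy is implemented: it supposes each invocation is inside a database transaction for only a fixed percentage of its runtime, and the `MAX_CONNECTIONS` limit and workload numbers are hypothetical.

```python
def peak_connections_direct(concurrency: int) -> int:
    """Without pooling, every concurrent invocation holds its own connection."""
    return concurrency


def peak_connections_pooled(concurrency: int, busy_percent: int) -> int:
    """With multiplexing, a connection is needed only while a transaction runs.

    Uses integer ceiling division so the result is exact.
    """
    return (concurrency * busy_percent + 99) // 100


MAX_CONNECTIONS = 500  # hypothetical database connection limit
concurrency = 2000     # hypothetical concurrent invocations during a spike

direct = peak_connections_direct(concurrency)
pooled = peak_connections_pooled(concurrency, busy_percent=5)

print(direct, direct > MAX_CONNECTIONS)  # 2000 True
print(pooled, pooled > MAX_CONNECTIONS)  # 100 False
```

Under these assumed numbers the direct approach asks for four times the connection limit, while the pooled approach stays well under it even though the compute concurrency is identical.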
Exam Focus: What Questions Test From This Step
- Choosing DynamoDB Global Tables (multi-active writes, eventual consistency) versus Aurora Global Database (single-writer Region, fast cross-Region reads, DR target)
- Global Accelerator versus CloudFront: network-layer routing for non-cacheable traffic versus edge content caching
- Matching SNS (fan-out), SQS (buffered single-consumer), EventBridge (schema-aware routing, replay), and Kinesis (ordered continuous streams) to the described traffic pattern
- Recognizing "new consumers added without changing the producer" as an EventBridge/SNS signal
- Microservices data ownership: no shared database across service boundaries
- Saga pattern: Step Functions orchestration versus event-driven choreography, and when each is preferred
- Lambda-to-RDS connection exhaustion under scale, and RDS Proxy as the fix
- Pruning compute options by hard constraints (GPU, licensing, execution duration) before optimizing for operational simplicity