Step 2 — New Solutions Design

Give ten Professional-level candidates the same greenfield requirements and you’ll get ten different architecture diagrams — and the exam is fine with that, because it isn’t grading a single correct diagram. It’s grading whether your choices trace back to the stated constraints: latency budget, consistency requirements, failure tolerance, cost ceiling. This step works through the recurring building blocks you’ll assemble differently depending on which constraint is dominant.

Compute Selection Is a Tradeoff Table, Not a Checklist

At Associate level, “when do I use Lambda vs EC2” has a reasonably short answer. At Professional level, the question arrives buried inside a paragraph of business requirements, and you have to extract it yourself. Frame every compute decision along four axes: control, operational overhead, cost model, and startup latency.

Compute Option	Control Level	Ops Overhead	Cold Start Concern	Best Fit
EC2 (self-managed)	Full OS access	High	None	Licensing-bound software, custom kernels
ECS/EKS on EC2	Container orchestration	Medium	Low	Existing container investment, need for GPU/specialized instances
Fargate	None below task	Low	Low-medium	Containerized workloads without capacity planning
Lambda	None	Lowest	Can matter at scale	Event-driven, spiky, sub-15-minute execution

A pattern that shows up repeatedly in scenario questions: a company migrating from EC2 wants “less operational burden” but also needs GPU instances for an ML inference workload. Fargate does not provide GPU-backed capacity, whereas ECS/EKS on EC2 can run on GPU instance types, so the “least operational overhead” answer isn’t always the right one: the hard constraint (GPU) eliminates it. Always let the hard constraint prune the option list before you optimize for the soft one (operational simplicity).
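The prune-then-optimize order can be sketched in a few lines of Python. This is a minimal illustration, not an AWS API: the `ComputeOption` type, the `choose_compute` function, and the numeric `ops_overhead` scores are all hypothetical, and the capability flags are a simplification of the table above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ComputeOption:
    name: str
    supports_gpu: bool                    # simplified capability flag (assumption)
    max_runtime_minutes: Optional[float]  # None = no platform-imposed limit
    ops_overhead: int                     # illustrative score: lower = less ops work


# Hypothetical scores that mirror the ordering in the tradeoff table.
OPTIONS: List[ComputeOption] = [
    ComputeOption("EC2 (self-managed)", True, None, 4),
    ComputeOption("ECS/EKS on EC2", True, None, 3),
    ComputeOption("Fargate", False, None, 2),
    ComputeOption("Lambda", False, 15, 1),
]


def choose_compute(needs_gpu: bool, runtime_minutes: float) -> str:
    # Step 1: hard constraints remove options outright.
    viable = [
        o for o in OPTIONS
        if (o.supports_gpu or not needs_gpu)
        and (o.max_runtime_minutes is None or runtime_minutes <= o.max_runtime_minutes)
    ]
    if not viable:
        raise ValueError("no compute option satisfies the hard constraints")
    # Step 2: only now optimize for the soft preference (least ops overhead).
    return min(viable, key=lambda o: o.ops_overhead).name


gpu_choice = choose_compute(needs_gpu=True, runtime_minutes=5)
spiky_choice = choose_compute(needs_gpu=False, runtime_minutes=5)
long_job_choice = choose_compute(needs_gpu=False, runtime_minutes=60)
```

With these assumed values, the GPU requirement removes Lambda and Fargate before overhead is even considered, which is the reasoning the scenario question rewards.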

Designing for Global Applications

A single-Region deployment is the default assumption at Associate level. At Professional level, “customers in Tokyo, Frankfurt, and São Paulo all need sub-100ms reads” is a normal opening sentence, and it forces a genuinely different topology.

                        Route 53 (latency-based routing)
                                        │
          ┌─────────────────────────────┼─────────────────────────────┐
          │                             │                             │
 ┌────────▼────────┐           ┌────────▼────────┐           ┌────────▼────────┐
 │    us-east-1    │           │  eu-central-1   │           │ ap-northeast-1  │
 │       ALB       │           │       ALB       │           │       ALB       │
 │       ECS       │           │       ECS       │           │       ECS       │
 └────────┬────────┘           └────────┬────────┘           └────────┬────────┘
          │                             │                             │
 ┌────────▼─────────────────────────────▼─────────────────────────────▼────────┐
 │           DynamoDB Global Table (multi-active, last-writer-wins)            │
 └─────────────────────────────────────────────────────────────────────────────┘

DynamoDB Global Tables replicate a table across Regions with active-active writes in every Region — any Region can accept a write, and replication catches the others up within roughly a second under normal conditions. The tradeoff the exam wants you to articulate: this buys you low local write latency everywhere at the cost of eventual consistency and the possibility of conflicting writes resolved by last-writer-wins. If your application cannot tolerate that (financial ledgers, inventory counts that must never go negative), Global Tables is the wrong answer regardless of how attractive the latency profile looks.
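To see why last-writer-wins is a problem for inventory counts, consider a toy model of two Regions selling the last unit at nearly the same moment. This is only a sketch of the conflict-resolution idea: the `Write` type, the timestamps, and the `last_writer_wins` function are hypothetical, and DynamoDB's internal reconciliation is not exposed this way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    region: str
    value: int        # stock count this Region wrote after its local sale
    timestamp: float  # hypothetical write time in seconds


def last_writer_wins(a: Write, b: Write) -> Write:
    # The later write replaces the earlier one wholesale; nothing is merged.
    return a if a.timestamp >= b.timestamp else b


starting_stock = 1

# Each Region reads stock = 1 locally, sells the unit, and writes stock = 0
# before the other Region's write has replicated.
tokyo = Write("ap-northeast-1", value=0, timestamp=100.20)
frankfurt = Write("eu-central-1", value=0, timestamp=100.45)

winner = last_writer_wins(tokyo, frankfurt)
units_sold = 2
true_stock = starting_stock - units_sold  # what a serialized ledger would show
```

The table converges on `winner.value`, which is 0, while two units were actually sold. The replicas agree with each other but not with reality, which is why the scenario wording about "must never go negative" rules this design out.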

Aurora Global Database takes the opposite shape: one primary Region handles all writes, and one or more secondary Regions receive physical-layer replication with typically sub-second lag. That gives you fast local reads everywhere and a disaster-recovery target with a low RPO. It does not give you multi-Region writes. When a scenario says “reporting queries in three Regions, but all order processing happens centrally,” Aurora Global Database is the fit; when it says “each Region needs to accept writes independently,” you’re back to Global Tables or a custom conflict-resolution scheme.

Global Accelerator solves a different layer of the problem entirely — it’s not a data replication tool, it’s a network entry point. It anycasts two static IPs from the AWS global network edge, then routes user traffic over AWS’s backbone rather than the public internet, improving both latency and jitter, and it fails traffic over to a healthy Region automatically if an endpoint group becomes unhealthy. Compare that against CloudFront, which is built for caching content at edge locations — Global Accelerator is for TCP/UDP traffic that isn’t cacheable, like gaming, VoIP, or API traffic that needs consistent low-latency routing rather than content caching.
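The routing behavior described here (send each user to the closest healthy Region, and fall back when one becomes unhealthy) can be modeled with a small function. This is a toy model under assumed latency numbers, not how Global Accelerator or Route 53 is implemented; `pick_region` and the sample values are hypothetical.

```python
from typing import Dict, Set


def pick_region(latency_ms: Dict[str, float], healthy: Set[str]) -> str:
    # Unhealthy Regions are excluded first; latency only ranks what is left.
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy Region available")
    return min(candidates, key=lambda region: candidates[region])


# Assumed round-trip latencies for a user near Tokyo.
tokyo_user = {"ap-northeast-1": 8.0, "us-east-1": 160.0, "eu-central-1": 230.0}

normal = pick_region(tokyo_user, healthy={"ap-northeast-1", "us-east-1", "eu-central-1"})
failover = pick_region(tokyo_user, healthy={"us-east-1", "eu-central-1"})
```

The same selection logic applies whether the entry point is DNS-based or anycast; what differs in practice is how quickly clients observe the change.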

Event-Driven Architecture at Enterprise Scale

Once you have more than a handful of services, direct service-to-service calls create a dependency graph nobody can reason about. Event-driven design decouples producers from consumers through an intermediary, and the Professional exam expects fluency in choosing the right intermediary for the traffic shape.

Order Service ──event──▶ EventBridge Bus ──rule──▶ Inventory Service
                              │
                              ├──rule──▶ Notification Service (SQS)
                              │
                              └──rule──▶ Analytics Pipeline (Kinesis Firehose)

SNS fans a single message out to multiple subscribers immediately: think alerting, or triggering several independent Lambda functions off one event. SQS buffers work for a single consumer group and lets it process at its own pace, with visibility timeouts and dead-letter queues protecting against poison messages. EventBridge goes further than either: it’s a schema-aware event bus with content-based routing rules, native integration with dozens of AWS services and SaaS partners, and support for archiving and replaying events, which is useful when you stand up a new consumer service six months later and need to backfill it against historical events. Kinesis Data Streams fits when ordering matters (it is preserved within each shard) and the data arrives as a continuous stream rather than as discrete messages: clickstream data, IoT telemetry, anything that multiple independent consumer applications will read from the same ordered stream, each at its own checkpoint position.

A frequent scenario trap: a question describes “many independent teams need to react to the same business event, and new consumers get added regularly without changing the producer.” That phrase — new consumers added without touching the producer — is the signature of EventBridge or SNS, not a direct API call or a tightly coupled SQS queue per consumer.
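A small sketch makes the "add a consumer without touching the producer" property concrete. The rule patterns below follow the general shape of EventBridge event patterns (each leaf is a list of allowed values), but the `matches` function implements only exact-value matching, a small subset of the real pattern language. The source name, detail types, and rule names are hypothetical.

```python
from typing import Dict, List


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every pattern key must exist in the event;
    nested dicts recurse, and leaf lists hold the allowed exact values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Rules live on the bus, not in the producer.
RULES: Dict[str, dict] = {
    "inventory-service": {"source": ["orders.service"], "detail-type": ["OrderPlaced"]},
    "notification-queue": {
        "source": ["orders.service"],
        "detail-type": ["OrderPlaced", "OrderCancelled"],
    },
}


def route(event: dict) -> List[str]:
    return sorted(name for name, pattern in RULES.items() if matches(pattern, event))


placed = {
    "source": "orders.service",
    "detail-type": "OrderPlaced",
    "detail": {"orderId": "o-123", "priority": "express"},
}
cancelled = {
    "source": "orders.service",
    "detail-type": "OrderCancelled",
    "detail": {"orderId": "o-456"},
}

targets_before = route(placed)

# A new team subscribes by adding a rule; the producer's event is unchanged.
RULES["express-analytics"] = {"detail": {"priority": ["express"]}}
targets_after = route(placed)
```

The producer emits the same event before and after the new rule is added; only the bus configuration changed.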

Microservices Decomposition and Data Ownership

Breaking a monolith into services is as much a data design problem as a compute design problem, and Professional-level questions increasingly hinge on the data side. The rule that trips people up: each microservice owns its data exclusively, and other services never reach directly into that data store. They ask through an API or react to an event.

Monolith:  [ Single App ] ──── [ Single Shared Database ]

Microservices:
  [ Orders Service ] ── owns ── [ Orders DB ]
  [ Inventory Service ] ── owns ── [ Inventory DB ]
  [ Shipping Service ] ── owns ── [ Shipping DB ]
        │                              │
        └──────── events via EventBridge ────────┘

This is why a service mesh or API Gateway alone doesn’t solve microservices decomposition — the exam sometimes offers “put API Gateway in front of the monolith” as a distractor answer for a question that’s actually asking about decomposing the data layer, not just the request routing layer. Fronting a monolith with API Gateway changes nothing about coupling if every “service” still writes to the same tables.

The saga pattern handles transactions that span multiple services, since you can no longer rely on a single database transaction. Orchestration-based sagas (a Step Functions state machine coordinating each step and compensating on failure) are generally easier to reason about and test than choreography-based sagas (each service reacting to the previous service’s event with no central coordinator). The exam tends to favor Step Functions orchestration as the “well-architected” answer when a workflow has more than three or four steps or requires visible compensating transactions.
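The orchestration flavor reduces to a simple control loop: run steps in order, and on failure run the compensations for the completed steps in reverse. The sketch below shows that loop in plain Python rather than as a Step Functions state machine; `run_saga`, the step names, and the stub actions are hypothetical stand-ins for real service calls.

```python
from typing import Callable, List, Tuple

# (step name, forward action, compensating action)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    completed: List[Step] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Undo already-completed steps, most recent first.
            for done_name, _, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False
        log.append(f"done:{name}")
        completed.append((name, action, compensate))
    return True


def ok() -> None:
    pass  # stand-in for a successful service call or its compensation


def carrier_down() -> None:
    raise RuntimeError("carrier unavailable")


log: List[str] = []
succeeded = run_saga(
    [
        ("reserve_inventory", ok, ok),
        ("charge_payment", ok, ok),
        ("book_shipping", carrier_down, ok),
    ],
    log,
)
```

Because one coordinator owns the sequence, the full history (including which compensations ran) sits in a single log, which is the "visible compensating transactions" property that favors orchestration.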

Elasticity and Scalability Tradeoffs

Scaling isn’t free, and Professional-level design has to reckon with the failure modes of scaling itself, not just the mechanism.

Pattern	Scales Fast?	Cost Behavior	Risk
EC2 Auto Scaling (target tracking)	Minutes	Pay for running capacity	Scaling lag during sudden spikes
Application Auto Scaling on ECS/Fargate	Minutes	Pay for running tasks	Same lag, smaller blast radius per task
Lambda concurrency	Seconds	Pay per invocation	Downstream systems (RDS connections) can be overwhelmed by sudden concurrency
DynamoDB on-demand	Instant	Pay per request	Cost can spike unexpectedly under sustained high load

The recurring exam scenario: Lambda scales so quickly that it saturates a downstream RDS connection pool during a traffic spike, causing errors that look like a database problem but are actually a compute-scaling problem. The fix is RDS Proxy, which pools and multiplexes connections so thousands of concurrent Lambda invocations don’t each open a direct database connection. Recognizing “Lambda + RDS + connection exhaustion” as the setup for “the answer is RDS Proxy” is one of the more reliable pattern-matches on this exam.
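The effect of putting a pool between a burst of invocations and the database can be simulated locally. In this sketch, threads stand in for concurrent Lambda invocations and a bounded semaphore stands in for the proxy's connection pool. It is a rough model only: RDS Proxy's actual multiplexing is more sophisticated, and `ConnectionCounter`, `peak_connections`, and the sizes used are hypothetical.

```python
import threading
import time
from typing import Optional


class ConnectionCounter:
    """Stands in for the database: tracks how many connections are open at once."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.open_now = 0
        self.peak = 0

    def __enter__(self) -> "ConnectionCounter":
        with self._lock:
            self.open_now += 1
            self.peak = max(self.peak, self.open_now)
        return self

    def __exit__(self, *exc_info) -> None:
        with self._lock:
            self.open_now -= 1


def peak_connections(concurrency: int, pool_size: Optional[int] = None) -> int:
    db = ConnectionCounter()
    # With a pool, at most pool_size callers hold a database connection at a time.
    gate = threading.BoundedSemaphore(pool_size) if pool_size else None

    def invocation() -> None:
        if gate is None:
            with db:                 # direct connection per invocation
                time.sleep(0.005)    # simulated query time
        else:
            with gate:               # wait for a pooled connection
                with db:
                    time.sleep(0.005)

    threads = [threading.Thread(target=invocation) for _ in range(concurrency)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return db.peak


direct_peak = peak_connections(concurrency=50)
pooled_peak = peak_connections(concurrency=50, pool_size=5)
```

Without the gate, the peak tracks the burst size; with it, the database never sees more than `pool_size` connections, at the cost of some invocations waiting their turn.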

Exam Focus: What Questions Test From This Step

Choosing DynamoDB Global Tables (multi-active writes, eventual consistency) versus Aurora Global Database (single-writer Region, fast cross-Region reads, DR target)
Global Accelerator versus CloudFront — network-layer routing for non-cacheable traffic versus edge content caching
Matching SNS (fan-out), SQS (buffered single-consumer), EventBridge (schema-aware routing, replay), and Kinesis (ordered continuous streams) to the described traffic pattern
Recognizing “new consumers added without changing the producer” as an EventBridge/SNS signal
Microservices data ownership — no shared database across service boundaries
Saga pattern: Step Functions orchestration versus event-driven choreography, and when each is preferred
Lambda-to-RDS connection exhaustion under scale, and RDS Proxy as the fix
Pruning compute options by hard constraints (GPU, licensing, execution duration) before optimizing for operational simplicity

Written by NPBlue Cloud Team: Cloud & Platform Engineers who run production workloads on AWS daily and write from real deployment experience, not the docs alone.

Reviewed for technical accuracy.