Lambda Concurrency: Reserved, Provisioned, and How to Avoid Throttling

Lambda concurrency is the number of function instances handling requests at the same time. When your function receives 1 request, 1 execution environment is active. When it receives 500 simultaneous requests, Lambda runs 500 environments in parallel — provided the limits allow it.

Understanding concurrency limits is critical. Exceed them and your function starts returning 429 TooManyRequestsException errors. Misconfigure reserved concurrency and you can accidentally throttle your own critical functions.

Account-Level Concurrency Limit

Each AWS account has a regional concurrency limit — defaulting to 1,000 concurrent executions. This is a shared pool across all Lambda functions in that region.

Account concurrency pool (default 1,000):
┌────────────────────────────────────────────────────────────────┐
│                   1,000 concurrent executions                  │
│                                                                │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────────┐  │
│  │  api-handler  │  │ image-resizer │  │  data-pipeline    │  │
│  │  (300 used)   │  │  (200 used)   │  │  (400 used)       │  │
│  └───────────────┘  └───────────────┘  └───────────────────┘  │
│                                                                │
│  Unreserved pool remaining: 100                                │
└────────────────────────────────────────────────────────────────┘

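If data-pipeline spikes to 800 concurrent executions while api-handler needs 300 and image-resizer holds 200, total demand is 1,300. The pool stops at 1,000, so once it is exhausted any function that needs a new execution environment is throttled, including the API handler.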

You can request a limit increase through AWS Service Quotas. Increases well beyond the default are possible, but each request is approved case by case and approval can take time.
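Before requesting an increase, it helps to check what the account currently has. The sketch below is minimal and assumes boto3 is installed and that credentials and a region are configured. The Lambda `get_account_settings()` call returns the regional limit under `AccountLimit`, and the hypothetical helper `summarize_limits` reduces it to three numbers. The parser is kept separate from the live call so it can be tried on a sample response.

```python
from typing import Optional


def summarize_limits(settings: dict) -> dict:
    """Pull concurrency figures out of a GetAccountSettings response."""
    limit = settings["AccountLimit"]
    total = limit["ConcurrentExecutions"]
    unreserved = limit["UnreservedConcurrentExecutions"]
    return {
        "total": total,
        "unreserved": unreserved,
        # Capacity no longer in the shared pool (e.g. reserved by functions)
        "set_aside": total - unreserved,
    }


def fetch_limits(region: Optional[str] = None) -> dict:
    """Live call; boto3 is imported here so the parser above needs no AWS access."""
    import boto3

    client = boto3.client("lambda", region_name=region)
    return summarize_limits(client.get_account_settings())
```

For an account with 150 slots reserved, `summarize_limits({"AccountLimit": {"ConcurrentExecutions": 1000, "UnreservedConcurrentExecutions": 850}})` returns `{'total': 1000, 'unreserved': 850, 'set_aside': 150}`.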

The Three Concurrency Types

Unreserved Concurrency

Every function starts with unreserved concurrency — it pulls from the account pool without any dedicated allocation. Functions with unreserved concurrency compete for whatever the pool has left after reserved functions take their share.

Unreserved concurrency is the default. There is nothing to configure; it just means “use whatever is available.”

Problem with unreserved concurrency: A noisy function (batch job, runaway recursion, traffic spike) can consume most of the pool and starve other functions. In a multi-function account, this is a real failure mode.

# Monitor peak concurrent executions account-wide (no --dimensions).
# Add --dimensions Name=FunctionName,Value=api-handler to see one function.
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name ConcurrentExecutions \
  --statistics Maximum \
  --period 60 \
  --start-time 2024-01-15T00:00:00Z \
  --end-time 2024-01-15T01:00:00Z
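The Python sketch below runs the same query and assumes boto3 with configured credentials. The hypothetical `concurrency_query` helper builds `get_metric_statistics` arguments for a rolling window instead of fixed timestamps. It can also add the `FunctionName` dimension, so one function's peak can be compared with the account-wide figure; this assumes CloudWatch publishes the metric for that function.

```python
from datetime import datetime, timedelta, timezone


def concurrency_query(function_name=None, hours=1, period=60):
    """Arguments for CloudWatch get_metric_statistics on ConcurrentExecutions.

    Without function_name the metric is account-wide; with it, the query is
    filtered by the FunctionName dimension.
    """
    end = datetime.now(timezone.utc)
    params = {
        "Namespace": "AWS/Lambda",
        "MetricName": "ConcurrentExecutions",
        "Statistics": ["Maximum"],
        "Period": period,
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
    }
    if function_name:
        params["Dimensions"] = [{"Name": "FunctionName", "Value": function_name}]
    return params


def peak(datapoints):
    """Highest per-period Maximum, or 0 when CloudWatch returned no data."""
    return max((p["Maximum"] for p in datapoints), default=0)


def fetch_peak(function_name=None, hours=1):
    """Live call; boto3 is imported here so the helpers above run offline."""
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**concurrency_query(function_name, hours))
    return peak(response["Datapoints"])
```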

Reserved Concurrency

Reserved concurrency dedicates a fixed number of concurrent executions to a specific function, carved out of the account pool.

Account pool: 1,000

Reserved allocations:
  payment-processor: 100 reserved → always has at least 100 slots
  auth-service: 50 reserved       → always has at least 50 slots

Unreserved pool: 850 (shared by everything else)
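You can check this arithmetic before applying any reservations. The sketch below uses a hypothetical `unreserved_after` helper. It also enforces a rule from the AWS documentation: at least 100 units of concurrency must stay unreserved, so you can reserve at most the account limit minus 100.

```python
MIN_UNRESERVED = 100  # AWS keeps at least this much concurrency unreserved


def unreserved_after(account_limit, reservations):
    """Unreserved pool left once `reservations` (name -> slots) are applied.

    Raises ValueError if the plan would leave less than MIN_UNRESERVED.
    """
    reserved = sum(reservations.values())
    remaining = account_limit - reserved
    if remaining < MIN_UNRESERVED:
        raise ValueError(
            f"reserving {reserved} leaves {remaining} unreserved; "
            f"at least {MIN_UNRESERVED} must remain"
        )
    return remaining
```

With the allocations above, `unreserved_after(1000, {"payment-processor": 100, "auth-service": 50})` returns 850.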

Reserved concurrency serves two purposes:

Guarantee: The function always has capacity up to its reserved value, even if the rest of the account pool is saturated.

Cap: The function can never exceed its reserved value. This matters for functions that open database connections. Suppose your database allows 100 connections and Lambda could spawn 1,000 concurrent environments, each holding at least one connection: you can exhaust the database's connection limit. Setting reserved concurrency to 80 keeps Lambda below that limit with headroom to spare.

# Set reserved concurrency
aws lambda put-function-concurrency \
  --function-name payment-processor \
  --reserved-concurrent-executions 100

# Check current setting
aws lambda get-function-concurrency \
  --function-name payment-processor

# Remove reserved concurrency (return to unreserved pool)
aws lambda delete-function-concurrency \
  --function-name payment-processor

Setting reserved concurrency to 0 effectively disables a function. Events can still arrive, but every invocation is throttled immediately. This makes it useful as an emergency shutoff.
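The shutoff overwrites any existing reservation, so it helps to record the old value first. The sketch below assumes a boto3 Lambda client is passed in. `disable` and `restore` are hypothetical helper names built on the real `get_function_concurrency`, `put_function_concurrency` and `delete_function_concurrency` calls. It assumes `get_function_concurrency` returns no `ReservedConcurrentExecutions` key when none is set, and treats that case as "restore to unreserved".

```python
def disable(client, function_name):
    """Throttle every invocation by reserving 0; return the previous setting.

    Returns None if the function had no reserved concurrency before.
    """
    previous = client.get_function_concurrency(FunctionName=function_name)
    client.put_function_concurrency(
        FunctionName=function_name, ReservedConcurrentExecutions=0
    )
    return previous.get("ReservedConcurrentExecutions")


def restore(client, function_name, previous):
    """Undo disable(): reapply the old reservation, or remove it entirely."""
    if previous is None:
        # No reservation before the shutoff: return to the unreserved pool
        client.delete_function_concurrency(FunctionName=function_name)
    else:
        client.put_function_concurrency(
            FunctionName=function_name, ReservedConcurrentExecutions=previous
        )
```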

Provisioned Concurrency

Provisioned concurrency pre-initialises execution environments so they are ready before any requests arrive. The environments complete the INIT phase (runtime startup, init code) and sit in a warm state.

Without provisioned concurrency (requests arriving one after another):
  Request 1 → cold start (300ms init + 50ms handler) = 350ms
  Request 2 → warm (50ms handler) = 50ms
  Request 3 → warm (50ms handler) = 50ms

With provisioned concurrency (5 pre-warmed, 6 requests arriving at once):
  Requests 1-5 → warm (50ms handler) = 50ms each
  Request 6 → exceeds provisioned, cold start = 350ms

The first 5 requests hit pre-warmed environments. The 6th request has to create a new environment on demand (a cold start). Once that request finishes, the new environment stays warm for later requests until Lambda reclaims it.
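The second scenario can be reproduced with a toy model. It assumes, as that scenario does, that every request arrives at the same instant, so no environment is freed for reuse. The 300 ms and 50 ms defaults are the illustrative values above, not measurements, and the helper name is hypothetical.

```python
def burst_latencies(concurrent_requests, provisioned, init_ms=300, handler_ms=50):
    """Latency of each request when all arrive at the same instant.

    The first `provisioned` requests land on pre-initialised environments;
    every request beyond that needs a new environment and pays init_ms.
    """
    latencies = []
    for i in range(concurrent_requests):
        cold = i >= provisioned
        latencies.append(handler_ms + (init_ms if cold else 0))
    return latencies
```

`burst_latencies(6, 5)` gives five 50 ms requests followed by one 350 ms cold start. Real traffic is usually spread out, and environments are reused as requests complete, so this model overstates cold starts for steady load.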

# Create a function version (required for provisioned concurrency)
VERSION=$(aws lambda publish-version \
  --function-name api-handler \
  --query 'Version' --output text)

# Enable provisioned concurrency on that version
aws lambda put-provisioned-concurrency-config \
  --function-name api-handler \
  --qualifier $VERSION \
  --provisioned-concurrent-executions 10

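You can also configure provisioned concurrency on an alias instead of a version, which is the recommended approach for production. Use one approach or the other. A version can have only one provisioned concurrency configuration, set either on the version itself or on a single alias that points to it.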

# Create an alias pointing to the version
aws lambda create-alias \
  --function-name api-handler \
  --name prod \
  --function-version $VERSION

# Set provisioned concurrency on the alias
aws lambda put-provisioned-concurrency-config \
  --function-name api-handler \
  --qualifier prod \
  --provisioned-concurrent-executions 10
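Provisioned concurrency is not usable the moment the command returns. Lambda first allocates and initialises the environments, and `get_provisioned_concurrency_config` reports progress in its `Status` field (`IN_PROGRESS`, `READY` or `FAILED`). The polling sketch below assumes a boto3 Lambda client is passed in. `wait_until_ready` and its timeout defaults are hypothetical.

```python
import time


def wait_until_ready(client, function_name, qualifier, timeout_s=600, poll_s=15):
    """Poll until provisioned concurrency for `qualifier` reports READY.

    Returns the allocated count; raises RuntimeError on FAILED and
    TimeoutError if READY is not reached within timeout_s.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        cfg = client.get_provisioned_concurrency_config(
            FunctionName=function_name, Qualifier=qualifier
        )
        status = cfg["Status"]
        if status == "READY":
            return cfg["AllocatedProvisionedConcurrentExecutions"]
        if status == "FAILED":
            raise RuntimeError(cfg.get("StatusReason", "allocation failed"))
        if time.monotonic() >= deadline:
            raise TimeoutError(f"provisioned concurrency still {status} after {timeout_s}s")
        time.sleep(poll_s)
```

For example, after the alias command above, `wait_until_ready(boto3.client("lambda"), "api-handler", "prod")` blocks until the 10 environments are allocated.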

Cost: Provisioned concurrency has its own billing dimension. You pay per GB-second for as long as the environments are provisioned, even when they are idle. This comes on top of request and duration charges for the invocations they serve. Ten environments at 256 MB for 24 hours come to 10 × 0.25 GB × 86,400 s = 216,000 GB-seconds per day. Only use provisioned concurrency where cold start latency causes user-facing problems.
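The GB-second arithmetic can be sketched as follows. The price per GB-second is deliberately a parameter rather than a constant, because rates differ by region and architecture and change over time; look up the current figure. The helper names are hypothetical.

```python
def provisioned_gb_seconds(environments, memory_mb, hours):
    """GB-seconds billed for keeping environments provisioned, busy or idle."""
    memory_gb = memory_mb / 1024  # Lambda bills memory in GB of 1,024 MB
    return environments * memory_gb * hours * 3600


def provisioned_cost(environments, memory_mb, hours, price_per_gb_second):
    """Idle-capacity charge only; request and duration charges come on top.

    price_per_gb_second is supplied by the caller (look up the current rate
    for your region and architecture); no price is hard-coded here.
    """
    return provisioned_gb_seconds(environments, memory_mb, hours) * price_per_gb_second
```

`provisioned_gb_seconds(10, 256, 24)` returns 216000.0 for the 10 × 256 MB × 24 hours example. Multiply by the current regional rate to get a daily figure.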

Scaling Behaviour

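Lambda’s scaling rate controls how fast concurrency can increase. AWS replaced the older burst model in late 2023; under the current rule, each function scales independently:

Scaling rate: up to 1,000 additional concurrent executions every 10 seconds, per function
Ceiling: the account concurrency limit, or the function’s reserved concurrency if set

Concurrency growth for one function after a sudden spike (account limit raised to 5,000):
  0s:  0 → 1,000
  10s: 1,000 → 2,000
  20s: 2,000 → 3,000
  ...until the account limit

Older documentation describes an initial regional burst of 500-3,000 (varying by region) followed by +500 per minute; that model no longer applies. When a spike outpaces the scaling rate, Lambda throttles the excess requests until the allowed concurrency catches up.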
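The small model below estimates how long a spike stays throttled under a step-wise scaling rule. Treat the defaults as an assumption to verify against current AWS documentation. They encode a per-function rate of 1,000 new environments every 10 seconds, while `initial=3000, step=500, interval_s=60` approximates the older burst-then-per-minute behaviour. The helper name is hypothetical.

```python
def seconds_to_reach(target, account_limit, initial=1000, step=1000, interval_s=10):
    """Seconds before a step-wise scaling rule allows `target` concurrent executions.

    Returns None when target exceeds account_limit (it is never reached).
    """
    if target > account_limit:
        return None
    if target <= initial:
        return 0  # covered by the initial allowance
    remaining = target - initial
    steps = -(-remaining // step)  # ceiling division on integers
    return steps * interval_s
```

Under the defaults, `seconds_to_reach(3500, 5000)` is 30 seconds. With the older parameters, `seconds_to_reach(4000, 10000, initial=3000, step=500, interval_s=60)` is 120 seconds.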

Throttling: What Happens and How to Handle It

When Lambda throttles a request, the behaviour depends on the invocation type:

Synchronous (API Gateway, direct SDK):

Returns 429 TooManyRequestsException
Caller receives the error immediately
No automatic retry by Lambda

Asynchronous (S3, SNS, EventBridge):

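Lambda queues the event and retries throttled invocations with exponential backoff for up to the maximum event age (6 hours by default)
If the function is still throttled when the event reaches that age, the event is dropped
Configure a dead-letter queue or an on-failure destination to capture dropped events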

Polling (SQS, Kinesis, DynamoDB Streams):

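Lambda backs off polling until concurrency is available
For Kinesis and DynamoDB Streams, processing falls behind (iterator age grows), and records can expire if the lag exceeds the stream's retention period
For SQS, messages stay in the queue. A batch that could not be processed becomes visible again after the visibility timeout. Messages remain until they are processed, moved to a dead-letter queue by the redrive policy, or deleted when the retention period expires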

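# Handling throttling in client code
import random
import time

import boto3
from botocore.exceptions import ClientError

# Note: botocore may already retry throttling errors internally (depending on
# the configured retry mode); this loop adds application-level retries on top.
lambda_client = boto3.client('lambda')


def invoke_with_retry(function_name, payload, max_attempts=3):
    """Invoke synchronously, retrying only when Lambda throttles.

    payload must be a JSON string or bytes, as Invoke expects.
    """
    for attempt in range(max_attempts):
        try:
            return lambda_client.invoke(
                FunctionName=function_name,
                Payload=payload,
            )
        except ClientError as e:
            code = e.response['Error']['Code']
            # Re-raise non-throttle errors, and the throttle on the final attempt
            if code != 'TooManyRequestsException' or attempt == max_attempts - 1:
                raise
            # Exponential backoff with full jitter: random wait in [0, 2^attempt] s
            wait = random.uniform(0, 2 ** attempt)
            print(f"Throttled, waiting {wait:.2f}s (attempt {attempt + 1}/{max_attempts})")
            time.sleep(wait)
    raise ValueError("max_attempts must be at least 1")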
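For asynchronous sources, the retry window and the destination for dropped events are set per function with `put_function_event_invoke_config`. The sketch below assumes a boto3 Lambda client is passed in. The helper name, the one-hour default and the example ARN are hypothetical. `MaximumEventAgeInSeconds` (60 to 21,600) bounds how long throttled events are retried, while `MaximumRetryAttempts` (0 to 2) applies to function errors.

```python
def configure_async_failure_handling(client, function_name, on_failure_arn,
                                     max_event_age_s=3600, max_retry_attempts=2):
    """Bound async retries and route events that still fail to a destination."""
    # Throttled async events are retried until they reach this age
    if not 60 <= max_event_age_s <= 21600:
        raise ValueError("MaximumEventAgeInSeconds must be between 60 and 21600")
    # Retries after function errors (not throttles) are capped separately
    if not 0 <= max_retry_attempts <= 2:
        raise ValueError("MaximumRetryAttempts must be between 0 and 2")
    return client.put_function_event_invoke_config(
        FunctionName=function_name,
        MaximumEventAgeInSeconds=max_event_age_s,
        MaximumRetryAttempts=max_retry_attempts,
        DestinationConfig={"OnFailure": {"Destination": on_failure_arn}},
    )
```

For example, `configure_async_failure_handling(boto3.client("lambda"), "image-resizer", "arn:aws:sqs:us-east-1:123456789012:resizer-failures")` uses a hypothetical SQS queue ARN as the failure destination.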

Auto Scaling Provisioned Concurrency

Application Auto Scaling can automatically adjust provisioned concurrency based on metrics:

# Register scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace lambda \
  --resource-id function:api-handler:prod \
  --scalable-dimension lambda:function:ProvisionedConcurrency \
  --min-capacity 2 \
  --max-capacity 20

# Create target tracking policy (holds utilisation near 70%)
aws application-autoscaling put-scaling-policy \
  --service-namespace lambda \
  --resource-id function:api-handler:prod \
  --scalable-dimension lambda:function:ProvisionedConcurrency \
  --policy-name pc-utilisation-policy \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 0.7,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
    }
  }'

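Target tracking holds utilisation near 70%. It adds provisioned environments when more than 70% of them are actively handling requests, and it removes environments when utilisation stays below the target. Capacity always stays within the 2-20 range set on the scalable target.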
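The simplified proportional model below builds intuition for what a 0.7 target means. It resizes the pool so the current number of busy environments would sit at the target utilisation, clamped to the scalable target's bounds. This is an approximation for illustration, not Application Auto Scaling's actual algorithm, which also applies cooldowns and more conservative scale-in. The helper name is hypothetical.

```python
import math


def desired_provisioned(current, utilisation, target=0.7, min_cap=2, max_cap=20):
    """Proportional approximation of one target-tracking adjustment.

    current: provisioned environments now; utilisation: fraction busy (0-1).
    """
    busy = current * utilisation
    # Size the pool so the same busy count would sit at the target utilisation
    desired = math.ceil(busy / target)
    # Respect the scalable target's min/max capacity
    return max(min_cap, min(max_cap, desired))
```

Under this model, 10 environments at 90% utilisation would grow to 13, and idle capacity would shrink to the minimum of 2.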

Concurrency Limits and Database Connections

The most practical reason to set reserved concurrency is protecting downstream systems that have connection limits:

RDS database max connections: 100

Without reserved concurrency:
  Lambda scales to 500 concurrent → 500 DB connections → database rejects connections

With reserved concurrency = 80:
  Lambda capped at 80 concurrent → 80 DB connections → database stays healthy

Better solution: RDS Proxy
  Lambda scales further → RDS Proxy pools and reuses connections → database sees a small, bounded set of connections (e.g. ~20, capped by the proxy's MaxConnectionsPercent setting)

RDS Proxy is the recommended solution for Lambda-to-RDS connectivity because it pools connections. This largely removes the need to limit Lambda concurrency purely for connection management.
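When you do cap Lambda directly, the cap of 80 in the example comes from leaving headroom under the database limit. The sketch below shows that sizing rule with a hypothetical helper. The 80% headroom default and the `connections_per_env` parameter (for code that opens more than one connection per environment) are assumptions to adjust for your workload.

```python
import math


def max_safe_concurrency(db_max_connections, connections_per_env=1,
                         reserved_for_others=0, headroom=0.8):
    """Largest reserved concurrency that keeps Lambda inside a DB connection budget.

    reserved_for_others: connections kept free for admins, migrations or
    other services. headroom: fraction of the remaining budget Lambda may use.
    """
    budget = (db_max_connections - reserved_for_others) * headroom
    # Each execution environment holds connections_per_env open connections
    return max(0, math.floor(budget / connections_per_env))
```

`max_safe_concurrency(100)` returns 80. You would then apply that value with `aws lambda put-function-concurrency`.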

Common Interview Questions

Q: What is the default concurrency limit per region? 1,000 concurrent executions per region per account by default. This can be increased via a Service Quota request.

Q: What is the difference between reserved and provisioned concurrency? Reserved concurrency dedicates a number of slots to a function and caps its maximum concurrent executions. Provisioned concurrency pre-warms a number of execution environments so they are ready without a cold start. They serve different purposes: reserved is for capacity isolation and protection; provisioned is for latency.

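Q: If you set reserved concurrency to 50, what happens when the 51st request arrives? If 50 invocations are already running, the 51st concurrent request is throttled. Synchronous callers get a 429 error, while asynchronous events are queued and retried. Reserved concurrency acts as both a minimum guarantee and a maximum cap.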

Q: When should you use provisioned concurrency? When your function is synchronously invoked (user is waiting), cold starts add noticeable latency, and the function runs frequently enough that warm environments from prior invocations are not reliably available. Typically used for customer-facing API handlers with strict p99 latency requirements.