fix: add per-environment reserved concurrency via CLI flags #2158
alinarublea wants to merge 9 commits into main
Conversation
Move awsReservedConcurrency to deploy script CLI flags for per-environment differentiation (dev=10, stage=15, prod=25). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This PR will trigger a patch release when merged.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
solaris007 left a comment
Hey @alinarublea,
Thanks for the quick response to the 2026-03-16 incident. Deploying concurrency caps across the fleet is the right instinct, and the CLI flag approach (--aws-reserved-concurrency=N per deploy script) is the correct helix-deploy mechanism: it achieves per-environment differentiation via deploy/deploy-stage/deploy-dev. That said, two pre-merge gates need to be cleared before any of these PRs ship.
Strengths
- Minimal, focused diffs: each PR touches only the three deploy scripts and nothing else. Right scope for incident response.
- Per-environment tiering (dev=10, stage=15, prod=25): graduated limits let throttling behavior surface in lower environments before prod.
- Consistent mechanism across all 10 repos: the same CLI flag pattern makes the configuration greppable, auditable, and easy to update together.
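For context, the pattern under review presumably looks something like the following in each repo's package.json. This is a sketch, not the actual diff: the `hedy` invocation and the rest of each script body are assumptions; only the flag name and the per-environment values come from the PRs.

```json
{
  "scripts": {
    "deploy": "hedy -v --deploy --aws-reserved-concurrency=25",
    "deploy-stage": "hedy -v --deploy --aws-reserved-concurrency=15",
    "deploy-dev": "hedy -v --deploy --aws-reserved-concurrency=10"
  }
}
```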
Issues
Critical (Must Fix)
1. Prod limit of 25 not validated against actual concurrency - risk of immediate production degradation
package.json (prod deploy line)
The limit of 25 is applied uniformly across services with fundamentally different throughput profiles. Audit-worker (100+ audit types) and import-worker (ETL engine, 20+ import types) likely sustain peak concurrency well above 25 during normal operations. If any service's p99 ConcurrentExecutions exceeds 25, these PRs will cause throttling and SQS message buildup under normal load the moment they deploy to prod - the exact incident class they are designed to prevent.
Fix before merging: pull 30-day ConcurrentExecutions (p99 and max) from CloudWatch for each of the 10 Lambda functions. Share the data in each PR. Adjust limits for any service where p99 exceeds the proposed limit. Content-processor (the incident's source) likely justifies 25. Audit-worker and import-worker may need higher values.
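The validation step above can be sketched as a small script. It assumes you have already exported 30-day ConcurrentExecutions samples per function (e.g. via `aws cloudwatch get-metric-statistics`); the service names and sample values below are made up for illustration.

```python
import math

def p99(samples):
    """Nearest-rank 99th percentile of a list of metric samples."""
    ranked = sorted(samples)
    rank = math.ceil(0.99 * len(ranked))
    return ranked[rank - 1]

def validate(limits, observed):
    """Return services whose observed p99 meets or exceeds the proposed cap."""
    return {
        svc: (p99(obs), limits[svc])
        for svc, obs in observed.items()
        if p99(obs) >= limits[svc]
    }

# Hypothetical data: audit-worker bursts well past the uniform cap of 25.
limits = {"content-processor": 25, "audit-worker": 25}
observed = {
    "content-processor": [3, 5, 8, 12, 20],
    "audit-worker": [10, 22, 31, 47, 60],
}
print(validate(limits, observed))  # flags audit-worker only
```

Any service this flags needs its limit raised (or its workload profiled) before merge.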
2. SQS maxReceiveCount interaction - concurrency cap may introduce silent data loss
package.json (all deploy scripts)
When Lambda is throttled by reserved concurrency, SQS increments ApproximateReceiveCount on messages that time out. If any of these services' queues have maxReceiveCount=1 in their SQS redrive policy (noted as a known risk in the content-processor PR description), a single throttle event routes the message straight to the DLQ - without it ever being processed. The concurrency cap becomes a data loss mechanism during the exact bursts it is designed to contain.
Fix before merging: audit all 10 services' SQS queue configurations in spacecat-infrastructure. Confirm maxReceiveCount >= 3 on every queue. If any queue has maxReceiveCount=1, increase it before enabling any concurrency cap.
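The audit is mechanical once the queue attributes are dumped (e.g. `aws sqs get-queue-attributes --attribute-names RedrivePolicy`); note SQS returns RedrivePolicy as a JSON-encoded string. A sketch, with made-up queue names:

```python
import json

def unsafe_queues(queue_attrs, minimum=3):
    """Return queues whose redrive maxReceiveCount is below `minimum`."""
    unsafe = {}
    for queue, attrs in queue_attrs.items():
        policy = attrs.get("RedrivePolicy")
        if policy is None:
            continue  # no DLQ configured: throttled messages retry until retention expires
        count = int(json.loads(policy)["maxReceiveCount"])
        if count < minimum:
            unsafe[queue] = count
    return unsafe

queue_attrs = {
    "spacecat-content-processor-jobs": {
        "RedrivePolicy": '{"deadLetterTargetArn":"arn:aws:sqs:...","maxReceiveCount":"1"}'
    },
    "spacecat-audit-jobs": {
        "RedrivePolicy": '{"deadLetterTargetArn":"arn:aws:sqs:...","maxReceiveCount":"5"}'
    },
}
print(unsafe_queues(queue_attrs))  # only the maxReceiveCount=1 queue is flagged
```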
Important (Should Fix)
3. helix-deploy version not verifiable - change could be a silent no-op
The PR description states "Requires helix-deploy >= 13.5.1." CI passes, but that does not prove the version requirement is met. The critical unknown: if the installed helix-deploy version is below 13.5.1, does it error on --aws-reserved-concurrency (in which case CI would have caught it) or silently ignore it (in which case no concurrency limit is applied and the team believes they are protected but is not)?
Fix: confirm helix-deploy's behavior with unknown flags, or verify the installed version in each repo's package-lock.json. A comment linking to the helix-deploy changelog entry for this feature would give reviewers confidence.
4. IAM permission not verified - first prod deploy after merge may fail
PutFunctionConcurrencyCommand requires lambda:PutFunctionConcurrency on the deploy role. If this permission is missing from spacecat-role-lambda-generic, the first deploy after merge will fail. The branch deploy test plan already asks you to verify via aws lambda get-function-concurrency - that check also implicitly confirms the IAM permission. Recommend doing this on dev before other environments.
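For a pre-deploy check, the role's policy document can be scanned offline for the action. Statement shape follows the standard IAM policy JSON; the example statement and resource ARN below are assumptions, not the real spacecat-role-lambda-generic policy:

```python
def allows_action(policy_doc, action):
    """True if any Allow statement covers `action`, including wildcards."""
    for stmt in policy_doc.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        for a in actions:
            # Handle wildcards like "lambda:*" or "*"
            if a == action or a == "*" or (a.endswith("*") and action.startswith(a[:-1])):
                return True
    return False

policy_doc = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["lambda:UpdateFunctionConfiguration", "lambda:PutFunctionConcurrency"],
         "Resource": "arn:aws:lambda:*:*:function:spacecat-services--*"}
    ],
}
print(allows_action(policy_doc, "lambda:PutFunctionConcurrency"))  # True
```

Note the same check applies to lambda:DeleteFunctionConcurrency, which the rollback path needs.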
5. Rollback procedure uses placeholder function names
The PR description documents the rollback (aws lambda delete-function-concurrency) but uses <function-name> as a placeholder. Under incident pressure, on-call should not need to look up the Lambda function name. Populate the concrete function name for this service (likely following the spacecat-services--<service-name> naming convention), or link to a runbook mapping services to function names.
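One way to close this gap is to generate the concrete commands up front. This sketch assumes the spacecat-services--&lt;service-name&gt; convention mentioned above (which the review itself flags as unverified) and uses a made-up profile name and service subset:

```python
SERVICES = ["content-processor", "audit-worker", "import-worker"]  # illustrative subset

def rollback_commands(services, profile="spacecat-prod"):
    """Render ready-to-paste rollback commands for on-call."""
    return [
        f"AWS_PROFILE={profile} aws lambda delete-function-concurrency "
        f"--function-name spacecat-services--{svc}"
        for svc in services
    ]

for cmd in rollback_commands(SERVICES):
    print(cmd)
```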
6. Reserved concurrency is the blunter primitive for SQS-backed services
AWS offers ScalingConfig.MaximumConcurrency on Lambda SQS event source mappings (GA since 2023). This caps concurrency for the SQS trigger only, does not reserve from the account-wide Lambda pool, and does not affect other invocation paths (direct invoke, Step Functions). Reserved concurrency caps ALL invocation paths and permanently removes units from the shared pool. With 10 services x 25 = 250 reserved units, this is a meaningful draw on the default 1,000-unit account limit.
This is not a blocker for the incident response - reserved concurrency does solve the immediate problem. Track switching to SQS ESM MaximumConcurrency as a follow-up for the more surgical long-term primitive.
Minor (Nice to Have)
7. No CloudWatch alarms on Throttles metric
Once the limit is live, throttling becomes expected behavior during bursts. Without alarms on the Throttles metric per function, there is no signal to distinguish healthy burst throttling from sustained degradation indicating the limit is set too low.
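One shape this alarm could take, expressed as the parameter dict you would pass to CloudWatch PutMetricAlarm. The threshold, period, and evaluation count are illustrative, not tuned; the function name follows the assumed naming convention:

```python
def throttle_alarm_params(function_name, threshold=50, period=300):
    """Build PutMetricAlarm parameters for sustained Lambda throttling."""
    return {
        "AlarmName": f"{function_name}-sustained-throttles",
        "Namespace": "AWS/Lambda",
        "MetricName": "Throttles",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": period,
        "EvaluationPeriods": 3,  # 3 consecutive periods: distinguishes bursts from sustained
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no throttles metric = healthy
    }

params = throttle_alarm_params("spacecat-services--content-processor")
print(params["AlarmName"])
```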
8. Account-level concurrency budget
10 services x 25 (prod) = 250 reserved units removed from the shared account pool. If other SpaceCat services also use reserved concurrency, the remaining unreserved pool may be smaller than expected. Verify total account reserved concurrency against the account limit before prod deploy.
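A back-of-envelope check of the budget described above. The 1,000-unit default and the per-function cap of 25 come from the review; "other_reserved" is a placeholder for whatever other services already reserve, and the 100-unit floor is AWS's minimum unreservable pool:

```python
ACCOUNT_LIMIT = 1000   # default account concurrency limit (from the review)
MIN_UNRESERVED = 100   # Lambda keeps at least 100 units unreservable

def unreserved_pool(n_services, per_service, other_reserved=0):
    """Remaining unreserved concurrency after these PRs deploy."""
    total_reserved = n_services * per_service + other_reserved
    remaining = ACCOUNT_LIMIT - total_reserved
    if remaining < MIN_UNRESERVED:
        raise ValueError(f"reserving {total_reserved} leaves only {remaining} unreserved")
    return remaining

print(unreserved_pool(10, 25))  # 750 units left for everything else
```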
Recommendations
- Root cause: reserved concurrency is supply-side throttling. The structural fix is right-sizing PostgREST's connection pool relative to total consumer concurrency (or adding PgBouncer). These PRs are a valid compensating control, not the fix. Track separately.
- After deployment, tune per-service limits based on observed ConcurrentExecutions rather than keeping the uniform 25 permanently.
- Consider rolling out sequentially - content-processor first (the cause), then lower-traffic services, then high-throughput services last - rather than merging all 10 simultaneously.
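The connection-pool point can be made concrete with a worst-case bound: if every consumer at its concurrency cap holds some number of DB connections, total demand must fit the PostgREST pool. All numbers below are illustrative, not measured:

```python
def pool_headroom(pool_size, service_limits, conns_per_invocation=1):
    """Pool capacity minus worst-case demand; negative means caps alone can't protect the DB."""
    demand = sum(service_limits.values()) * conns_per_invocation
    return pool_size - demand

limits = {f"service-{i}": 25 for i in range(10)}  # 10 services at the prod cap
print(pool_headroom(pool_size=200, service_limits=limits))  # -50: still oversubscribed
```

This is why the caps are a compensating control: even with every limit honored, a hypothetical 200-connection pool is oversubscribed by 50, so the pool (or a PgBouncer tier) still needs right-sizing.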
Assessment
Ready to merge? No - with two specific pre-merge gates.
The approach is correct and the implementation is clean. Two gates must be cleared before merge:
- Pull CloudWatch ConcurrentExecutions p99/max for all 10 functions and confirm no service normally exceeds its proposed limit.
- Verify maxReceiveCount >= 3 on all 10 services' SQS queues.
Once confirmed (and limits adjusted for any over-the-cap services), these PRs are ready to ship.
Computed Reserved Concurrency Values
Based on CloudWatch. Stage is set to match prod values (100). Values computed from …
Summary
- --aws-reserved-concurrency CLI flags added to deploy scripts with per-environment values (dev=10, stage=15, prod=25)
- Limit applied via PutFunctionConcurrencyCommand at deploy time
Background
On 2026-03-16, an unbounded SQS burst on content-processor spun up hundreds of concurrent Lambda instances, saturating PostgREST DB connection pool and cascading 503s to other services. Adding reserved concurrency caps to all SQS-backed services prevents recurrence.
Rollback
Removing --aws-reserved-concurrency from the deploy scripts and redeploying does not remove the Lambda reserved concurrency setting. To fully roll back, run aws lambda delete-function-concurrency for each environment (switching AWS_PROFILE/account accordingly).
Test plan
- aws lambda get-function-concurrency

🤖 Generated with Claude Code