diff --git a/agents/aws-incident-triage.agent.md b/agents/aws-incident-triage.agent.md
new file mode 100644
index 00000000..da6f839b
--- /dev/null
+++ b/agents/aws-incident-triage.agent.md
@@ -0,0 +1,118 @@
+---
+name: AWS Incident Triage
+description: On-call SRE agent that drives structured CloudWatch-based incident investigation from alarms through root-cause hypothesis.
+---
+
+# AWS Incident Triage Agent
+
+You are a senior Site Reliability Engineer on call for a production AWS environment. Your job is to drive a structured, time-bounded investigation when an alarm fires or an anomaly is reported. You think in evidence, not hunches. Every claim you make is backed by a metric, log line, or trace span.
+
+## Persona
+
+- Calm, methodical, and concise under pressure.
+- Default to read-only operations. Never mutate infrastructure without explicit approval.
+- Prefer narrowing scope over broadening it. Start wide, then zoom in.
+- Communicate findings as they emerge; do not wait for a complete picture.
+- Time-box each investigation phase. If a phase yields nothing after two attempts, document what was tried and move on.
+
+## Investigation Protocol
+
+### Phase 1: Alarm Context (< 2 minutes)
+
+1. Retrieve the firing alarm(s) using `get_active_alarms`.
+2. For each alarm, pull alarm history to understand state transitions and recent threshold breaches.
+3. Record: alarm name, metric namespace, dimensions, threshold, current value, time entered ALARM state.
+4. **Decision point:** If multiple alarms fired within a 5-minute window, group them by service/account and treat as a correlated incident.
+
+### Phase 2: Blast Radius Assessment (< 3 minutes)
+
+Apply the "narrow the blast radius" decision tree:
+
+```
+Account → Region → Service → Operation → Resource
+```
+
+1. Identify which account(s) are affected (check alarm dimensions or cross-account dashboards).
+2. Confirm the region(s) — do not assume us-east-1.
+3. Identify the service (Lambda, ECS, API Gateway, RDS, etc.) from the alarm's namespace.
+4. Narrow to the specific operation or API action showing degradation.
+5. Identify the specific resource (function name, cluster, DB instance).
+
+**Decision point:** If blast radius spans multiple services, declare a multi-service incident and investigate the shared dependency (network, IAM, deployment) first.
+
+### Phase 3: Metric Anomaly Detection (< 5 minutes)
+
+1. Query the primary metric from the alarm with 1-minute granularity over the last 2 hours.
+2. Query correlated metrics:
+   - For Lambda: Duration p99, Errors, Throttles, ConcurrentExecutions
+   - For ECS: CPUUtilization, MemoryUtilization, RunningTaskCount
+   - For API Gateway: 5XXError, Latency p99, Count
+   - For RDS: DatabaseConnections, ReadLatency, FreeableMemory, CPUUtilization
+3. Look for inflection points — when did the metric first deviate from baseline?
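+4. Correlate the inflection time with deployment events (check CloudTrail for `UpdateFunctionCode`, `UpdateService`, `CreateDeployment` within +/- 15 minutes). Lambda events may be recorded with a date/version suffix, e.g. `UpdateFunctionCode20150331v2`, so match both forms.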
+
+**Decision point:** If a deployment correlates with the anomaly onset, flag it as the probable cause and use Phases 4 and 5 to confirm it rather than to search broadly. Otherwise continue to Phase 4.
+
+### Phase 4: Log Investigation (< 5 minutes)
+
+1. Identify the relevant log group(s) from the affected resource.
+2. Run targeted Logs Insights queries (use templates from the aws-cloudwatch-investigation skill):
+   - Error spike query filtered to the incident time window.
+   - If latency-related: p99 latency breakdown by operation.
+   - If memory-related: OOM detection query.
+3. Extract the top 3-5 most frequent error messages with counts.
+4. For each unique error, pull one full log event for context (request ID, stack trace, upstream dependency).
+
+**Decision point:** If logs reveal a clear upstream dependency failure (timeout to another service, connection refused, auth error), pivot investigation to that dependency.
+
+### Phase 5: Trace Sampling (< 3 minutes)
+
+1. If X-Ray or distributed tracing is available, pull 3-5 traces from the incident window that exhibit the failure mode.
+2. Identify the span where latency spikes or errors originate.
+3. Note the downstream service, operation, and error code from the failing span.
+4. Compare with a healthy trace from before the incident window.
+
+**Decision point:** If traces confirm a single downstream bottleneck, you have a root cause candidate. If traces show distributed failures, suspect a shared resource (network, DNS, IAM token vending).
+
+### Phase 6: Root-Cause Hypothesis (< 2 minutes)
+
+Synthesize findings into a structured hypothesis:
+
+```
+## Root-Cause Hypothesis
+
+**Summary:** [One sentence description]
+
+**Confidence:** [High / Medium / Low]
+
+**Evidence chain:**
+1. [Alarm] — what fired and when
+2. [Metric] — what changed and the inflection point
+3. [Log] — specific error messages with counts
+4. [Trace/Deploy] — corroborating evidence
+
+**Blast radius:** [Account / Region / Service / Resources affected]
+
+**Timeline:**
+- T+0: [First anomaly detected]
+- T+N: [Alarm fired]
+- T+M: [Current state]
+
+**Suggested mitigation:**
+- [Immediate action, e.g., rollback deploy, scale out, circuit-break]
+- [Follow-up action for permanent fix]
+
+**What this does NOT explain:**
+- [Any contradictory evidence or open questions]
+```
+
+## Operating Rules
+
+1. **Never skip phases** — even if you think you know the answer after Phase 1, confirm with metrics and logs.
+2. **Cite everything** — reference specific metric data points, log event timestamps, trace IDs.
+3. **Time-box strictly** — if a phase is blocked (permissions, missing data), document the blocker and proceed.
+4. **Escalation triggers:**
+   - Data loss suspected → escalate immediately
+   - Blast radius growing → escalate immediately
+   - No hypothesis after all phases → escalate with investigation summary
+5. **Post-incident:** Recommend specific monitors or dashboards to add for future detection.
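
The Phase 1 decision point in the agent file above (group alarms that fired within a 5-minute window by service/account) could be sketched roughly as follows. This is a minimal illustration, not part of the agent: the dict keys `name`, `account`, `service`, and `fired_at` are hypothetical, and "within a 5-minute window" is interpreted here as "within 5 minutes of the first alarm in the cluster".

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def group_correlated_alarms(alarms, window=timedelta(minutes=5)):
    """Group alarms by (account, service), then split each group into
    clusters whose members fired within `window` of the cluster's first alarm.

    Each alarm is a dict with hypothetical keys:
    name, account, service, fired_at (timezone-aware datetime).
    """
    by_scope = defaultdict(list)
    for alarm in alarms:
        by_scope[(alarm["account"], alarm["service"])].append(alarm)

    incidents = []
    for scope, members in by_scope.items():
        members.sort(key=lambda a: a["fired_at"])
        cluster = [members[0]]
        for alarm in members[1:]:
            if alarm["fired_at"] - cluster[0]["fired_at"] <= window:
                cluster.append(alarm)
            else:
                # Gap exceeds the window: close this cluster, start a new one.
                incidents.append({"scope": scope, "alarms": cluster})
                cluster = [alarm]
        incidents.append({"scope": scope, "alarms": cluster})
    return incidents
```

A cluster with more than one alarm would be treated as a single correlated incident; a one-alarm cluster is investigated on its own.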
diff --git a/docs/README.agents.md b/docs/README.agents.md
index baf0e7d6..147e24b7 100644
--- a/docs/README.agents.md
+++ b/docs/README.agents.md
@@ -42,6 +42,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-agents) for guidelines on how to
| [Atlassian Requirements to Jira](../agents/atlassian-requirements-to-jira.agent.md)
[](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fatlassian-requirements-to-jira.agent.md)
[](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fatlassian-requirements-to-jira.agent.md) | Transform requirements documents into structured Jira epics and user stories with intelligent duplicate detection, change management, and user-approved creation workflow. | |
| [AVM Owner Triage](../agents/azure-verified-modules-owner-triage.agent.md)
[](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fazure-verified-modules-owner-triage.agent.md)
[](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fazure-verified-modules-owner-triage.agent.md) | Triage open GitHub issues across the Azure Verified Modules (AVM) repos an owner maintains. Splits the backlog into a Copilot-delegatable pile and a human pile, produces a report with a delegation ratio, and never comments or assigns without explicit user approval. | |
| [Aws Cloud Expert](../agents/aws-cloud-expert.agent.md)
[](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Faws-cloud-expert.agent.md)
[](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Faws-cloud-expert.agent.md) | AWS Cloud Expert provides deep, hands-on guidance for designing, building, and operating AWS workloads. Covers the full AWS ecosystem — serverless, containers, databases, networking, IaC, security, and cost optimization — grounded in the AWS Well-Architected Framework. | |
+| [AWS Incident Triage](../agents/aws-incident-triage.agent.md)
[](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Faws-incident-triage.agent.md)
[](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Faws-incident-triage.agent.md) | On-call SRE agent that drives structured CloudWatch-based incident investigation from alarms through root-cause hypothesis. | |
| [Aws Principal Architect](../agents/aws-principal-architect.agent.md)
[](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Faws-principal-architect.agent.md)
[](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Faws-principal-architect.agent.md) | Provide expert AWS Principal Architect guidance using AWS Well-Architected Framework principles and AWS best practices. | |
| [Aws Serverless Architect](../agents/aws-serverless-architect.agent.md)
[](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Faws-serverless-architect.agent.md)
[](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Faws-serverless-architect.agent.md) | Provide expert AWS Serverless Architect guidance focusing on event-driven architectures, Lambda, API Gateway, and serverless best practices. | |
| [Azure AVM Bicep mode](../agents/azure-verified-modules-bicep.agent.md)
[](https://aka.ms/awesome-copilot/install/agent?url=vscode%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fazure-verified-modules-bicep.agent.md)
[](https://aka.ms/awesome-copilot/install/agent?url=vscode-insiders%3Achat-agent%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fagents%2Fazure-verified-modules-bicep.agent.md) | Create, update, or review Azure IaC in Bicep using Azure Verified Modules (AVM). | |
diff --git a/docs/README.skills.md b/docs/README.skills.md
index 6c170693..789e7eca 100644
--- a/docs/README.skills.md
+++ b/docs/README.skills.md
@@ -59,6 +59,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to
| [audit-integrity](../skills/audit-integrity/SKILL.md)
`gh skills install github/awesome-copilot audit-integrity` | Shared audit integrity framework for all AppSec agents — enforces output quality, intellectual honesty, and continuous improvement through anti-rationalization guards, self-critique loops, retry protocols, non-negotiable behaviors, self-reflection quality gates (1-10 scoring, ≥8 threshold), and a self-learning system with lesson/memory governance for security analysis agents. | `references/anti-rationalization-guard.md`
`references/clarification-protocol.md`
`references/non-negotiable-behaviors.md`
`references/retry-protocol.md`
`references/self-critique-loop.md`
`references/self-learning-system.md`
`references/self-reflection-quality-gate.md` |
| [automate-this](../skills/automate-this/SKILL.md)
`gh skills install github/awesome-copilot automate-this` | Analyze a screen recording of a manual process and produce targeted, working automation scripts. Extracts frames and audio narration from video files, reconstructs the step-by-step workflow, and proposes automation at multiple complexity levels using tools already installed on the user machine. | None |
| [autoresearch](../skills/autoresearch/SKILL.md)
`gh skills install github/awesome-copilot autoresearch` | Autonomous iterative experimentation loop for any programming task. Guides the user through defining goals, measurable metrics, and scope constraints, then runs an autonomous loop of code changes, testing, measuring, and keeping/discarding results. Inspired by Karpathy's autoresearch. USE FOR: autonomous improvement, iterative optimization, experiment loop, auto research, performance tuning, automated experimentation, hill climbing, try things automatically, optimize code, run experiments, autonomous coding loop. DO NOT USE FOR: one-shot tasks, simple bug fixes, code review, or tasks without a measurable metric. | None |
+| [AWS CloudWatch Investigation](../skills/aws-cloudwatch-investigation/SKILL.md)
`gh skills install github/awesome-copilot aws-cloudwatch-investigation` | Reusable investigation patterns for AWS CloudWatch: Logs Insights query templates, alarm-to-deployment correlation, blast-radius narrowing decision tree, and PromQL-style metric query patterns for structured incident triage. | None |
| [aws-cdk-python-setup](../skills/aws-cdk-python-setup/SKILL.md)
`gh skills install github/awesome-copilot aws-cdk-python-setup` | Setup and initialization guide for developing AWS CDK (Cloud Development Kit) applications in Python. This skill enables users to configure environment prerequisites, create new CDK projects, manage dependencies, and deploy to AWS. | None |
| [aws-cost-optimize](../skills/aws-cost-optimize/SKILL.md)
`gh skills install github/awesome-copilot aws-cost-optimize` | Analyze AWS resources used in the app (IaC files and/or resources in a target account/region) and optimize costs - creating GitHub issues for identified optimizations. | None |
| [aws-resource-health-diagnose](../skills/aws-resource-health-diagnose/SKILL.md)
`gh skills install github/awesome-copilot aws-resource-health-diagnose` | Analyze AWS resource health, diagnose issues from CloudWatch logs and metrics, and create a remediation plan for identified problems. | None |
diff --git a/skills/aws-cloudwatch-investigation/SKILL.md b/skills/aws-cloudwatch-investigation/SKILL.md
new file mode 100644
index 00000000..1b25503f
--- /dev/null
+++ b/skills/aws-cloudwatch-investigation/SKILL.md
@@ -0,0 +1,333 @@
+---
+name: AWS CloudWatch Investigation
+description: >
+ Reusable investigation patterns for AWS CloudWatch: Logs Insights query templates,
+ alarm-to-deployment correlation, blast-radius narrowing decision tree, and
+ PromQL-style metric query patterns for structured incident triage.
+---
+
+# AWS CloudWatch Investigation Skill
+
+Reusable patterns for investigating production incidents using CloudWatch Logs, Metrics, and Alarms. These patterns are designed to be composed together during incident triage.
+
+---
+
+## Pattern 1: Logs Insights Query Templates
+
+### Error Spike Detection
+
+Find where errors concentrate in a time window, counted per log stream in 5-minute bins:
+
+```
+fields @timestamp, @message, @logStream
+| filter @message like /(?i)(error|exception|fatal|critical)/
+| stats count(*) as errorCount by bin(5m), @logStream
+| sort errorCount desc
+| limit 20
+```
+
+### P99 Latency Breakdown by Operation
+
+Identify which operations are driving latency spikes:
+
+```
+fields @timestamp, @duration, operation
+| filter ispresent(@duration)
+| stats avg(@duration) as avgMs,
+        pct(@duration, 50) as p50Ms,
+        pct(@duration, 95) as p95Ms,
+        pct(@duration, 99) as p99Ms,
+        count(*) as invocations
+  by operation
+| sort p99Ms desc
+| limit 15
+```
+
+### Lambda Cold Start Detection
+
+Quantify cold start impact during an incident:
+
+```
+fields @timestamp, @duration, @initDuration, @memorySize, @maxMemoryUsed
+| filter ispresent(@initDuration)
+| stats count(*) as coldStarts,
+        avg(@initDuration) as avgInitMs,
+        max(@initDuration) as maxInitMs,
+        avg(@duration) as avgDurationMs
+  by bin(5m)
+| sort @timestamp desc
+```
+
+### Out-of-Memory (OOM) Detection
+
+Find Lambda functions or containers killed by memory pressure:
+
+```
+fields @timestamp, @message, @logStream, @memorySize, @maxMemoryUsed
+| filter @message like /Runtime exited|out of memory|OOMKilled|Cannot allocate memory|MemoryError/
+| stats count(*) as oomEvents by @logStream, bin(10m)
+| sort oomEvents desc
+| limit 10
+```
+
+For memory utilization trending before OOM:
+
+```
+fields @timestamp, @maxMemoryUsed, @memorySize
+| filter ispresent(@maxMemoryUsed)
+| stats max(@maxMemoryUsed / @memorySize * 100) as peakMemPct,
+        avg(@maxMemoryUsed / @memorySize * 100) as avgMemPct
+  by bin(5m)
+| sort @timestamp desc
+```
+
+### Timeout Detection
+
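+Find invocations that hit the configured timeout. The `28000` ms threshold below assumes a 30-second timeout; adjust it to sit just under your function's configured value: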
+
+```
+fields @timestamp, @duration, @logStream, @requestId
+| filter @message like /Task timed out/ or @duration > 28000
+| stats count(*) as timeouts by @logStream, bin(5m)
+| sort timeouts desc
+```
+
+---
+
+## Pattern 2: Alarm History to Deploy-Event Correlation
+
+### Process
+
+1. **Get alarm transition time** — note the exact timestamp when the alarm entered ALARM state.
+2. **Query CloudTrail** for deployment-related events in a window of [alarm_time - 30min, alarm_time]:
+
+```
+-- CloudTrail Lake query for deployment events (replace the <...> placeholders)
+SELECT eventTime, eventName, userIdentity.arn, requestParameters
+FROM <event_data_store_id>
+WHERE eventTime > '<alarm_time minus 30 minutes>'
+  AND eventTime < '<alarm_time>'
+  AND eventName IN (
+    'UpdateFunctionCode', 'UpdateFunctionCode20150331v2',
+    'UpdateFunctionConfiguration', 'UpdateFunctionConfiguration20150331v2',
+    'UpdateService', 'CreateDeployment', 'RegisterTaskDefinition',
+    'CreateChangeSet', 'ExecuteChangeSet', 'StartPipelineExecution', 'PutImage'
+  )
+ORDER BY eventTime DESC
+```
+
+3. **Correlation criteria** — a deploy is "correlated" if:
+   - It targets the same service/resource as the alarm
+   - It completed within 15 minutes before the alarm transition
+   - The deployer identity matches a CI/CD role (not a human applying a hotfix)
+
+4. **Strengthening the correlation:**
+   - Check if the same alarm was healthy in the previous deployment cycle
+   - Verify no other environmental changes (scaling events, config changes) in the same window
+   - Look for canary/synthetic monitor failures that started at the same time
+
+### Output Format
+
+```
+Deploy Correlation:
+ Event: UpdateFunctionCode
+ Time: 2024-03-15T14:23:07Z (12 min before alarm)
+ Actor: arn:aws:sts::123456789012:assumed-role/github-actions-deploy/session
+ Resource: arn:aws:lambda:us-east-1:123456789012:function:payment-processor
+ Correlation: STRONG — same resource, CI/CD actor, alarm was OK prior cycle
+```
+
+---
+
+## Pattern 3: Narrow the Blast Radius Decision Tree
+
+Use this tree to systematically scope an incident from broadest to most specific:
+
+```
+START
+  |
+  v
+[1] ACCOUNT: Which account(s) show the alarm?
+  |   - Check: Are alarms firing in multiple accounts?
+  |   - If yes → suspect shared service (SSO, networking, shared deployment pipeline)
+  |   - If no → proceed to Region
+  v
+[2] REGION: Which region(s) are affected?
+  |   - Check: Same alarm in other regions?
+  |   - If multi-region → suspect global service (IAM, Route53, S3 global)
+  |   - If single-region → proceed to Service
+  v
+[3] SERVICE: Which service namespace shows degradation?
+  |   - Check CloudWatch namespace: AWS/Lambda, AWS/ECS, AWS/ApiGateway, etc.
+  |   - If multiple services → suspect shared dependency (VPC, NAT, DNS, IAM)
+  |   - If single service → proceed to Operation
+  v
+[4] OPERATION: Which API action or function is failing?
+  |   - For Lambda: which function name?
+  |   - For ECS: which service/task definition?
+  |   - For API GW: which stage/resource/method?
+  |   - If all operations → suspect service-level issue (throttling, quota)
+  |   - If specific operation → proceed to Resource
+  v
+[5] RESOURCE: Which specific resource instance?
+      - Function ARN, Task ID, DB instance identifier
+      - This is your investigation target
+      - Proceed to log and trace analysis scoped to this resource
+```
+
+### Shared Dependency Investigation
+
+When blast radius spans multiple services, investigate in this order:
+
+1. **VPC/Networking** — NAT Gateway ErrorPortAllocation, packet drops, DNS resolution failures
+2. **IAM/STS** — ThrottlingException on AssumeRole, token vending latency
+3. **Downstream dependency** — shared database, cache, or external API
+4. **Deployment pipeline** — simultaneous deploys across services from same pipeline run
+5. **AWS service event** — check AWS Health Dashboard and Service Health for the region
+
+---
+
+## Pattern 4: PromQL-Style Metric Query Patterns
+
+These patterns are PromQL-like in spirit only: they are written as CloudWatch metric math for GetMetricData, not in PromQL syntax. Use them to build composite signals for dashboards or programmatic retrieval.
+
+### Error Rate as Percentage
+
+```
+MetricDataQueries:
+  - Id: errors
+    MetricStat:
+      Metric:
+        Namespace: AWS/Lambda
+        MetricName: Errors
+        Dimensions: [{Name: FunctionName, Value: TARGET}]
+      Period: 60
+      Stat: Sum
+  - Id: invocations
+    MetricStat:
+      Metric:
+        Namespace: AWS/Lambda
+        MetricName: Invocations
+        Dimensions: [{Name: FunctionName, Value: TARGET}]
+      Period: 60
+      Stat: Sum
+  - Id: error_rate
+    Expression: "errors / invocations * 100"
+    Label: "Error Rate %"
+```
+
+### Latency Anomaly Detection (Compare to Baseline)
+
+```
+MetricDataQueries:
+  - Id: current_p99
+    MetricStat:
+      Metric:
+        Namespace: AWS/Lambda
+        MetricName: Duration
+        Dimensions: [{Name: FunctionName, Value: TARGET}]
+      Period: 300
+      Stat: p99
+  # GetMetricData applies a single StartTime/EndTime to every query in a
+  # request, so the incident window and the baseline window cannot share
+  # one call. Send the query above twice:
+  #   1. StartTime/EndTime = incident window        -> current_p99
+  #   2. StartTime/EndTime = same window last week   -> baseline_p99
+  #
+  # Pair datapoints by their offset from each window's start, then compute
+  # the ratio client-side:
+  #   anomaly_ratio = current_p99 / baseline_p99
+  # Skip pairs where the baseline datapoint is missing or zero.
+  #
+  # Label: "Latency vs Baseline (ratio > 2 = anomaly)"
+```
+
+### Throttling Pressure Score
+
+Combine multiple throttling signals into a single pressure metric:
+
+```
+MetricDataQueries:
+  - Id: lambda_throttles
+    MetricStat:
+      Metric: {Namespace: AWS/Lambda, MetricName: Throttles}
+      Period: 60
+      Stat: Sum
+  - Id: api_gw_4xx  # 4XXError counts every 4xx response, not only 429 throttles
+    MetricStat:
+      Metric: {Namespace: AWS/ApiGateway, MetricName: 4XXError, Dimensions: [{Name: ApiName, Value: TARGET}]}
+      Period: 60
+      Stat: Sum
+  - Id: dynamo_throttles
+    MetricStat:
+      Metric: {Namespace: AWS/DynamoDB, MetricName: ThrottledRequests, Dimensions: [{Name: TableName, Value: TARGET}]}
+      Period: 60
+      Stat: Sum
+  - Id: throttle_pressure
+    Expression: "lambda_throttles + api_gw_4xx + dynamo_throttles"
+    Label: "Combined Throttle Pressure"
+```
+
+### Concurrent Execution Headroom
+
+```
+MetricDataQueries:
+  - Id: concurrent
+    MetricStat:
+      Metric: {Namespace: AWS/Lambda, MetricName: ConcurrentExecutions}
+      Period: 60
+      Stat: Maximum
+  - Id: headroom
+    Expression: "1000 - concurrent"  # 1000 is a common default quota; substitute your account's actual limit
+    Label: "Remaining Concurrency (account limit 1000)"
+```
+
+---
+
+## Pattern 5: Incident Timeline Reconstruction
+
+### Process
+
+Reconstruct a precise timeline by merging data from multiple sources:
+
+1. **Collect timestamps:**
+
+| Source | Query | Yields |
+|--------|-------|--------|
+| CloudWatch Alarms | Alarm history API | State transition times |
+| CloudWatch Metrics | GetMetricData with 1-min period | First anomaly point |
+| CloudWatch Logs | Logs Insights with `earliest(@timestamp)` | First error occurrence |
+| CloudTrail | LookupEvents filtered by time | Deployment/change events |
+| AWS Health | DescribeEvents | AWS-side incidents |
+
+2. **Build the timeline:**
+
+```
+fields @timestamp, @message
+| filter @message like /ERROR|WARN|timeout|refused|denied/
+| stats earliest(@timestamp) as firstSeen, latest(@timestamp) as lastSeen, count(*) as occurrences
+ by @message
+| sort firstSeen asc
+| limit 20
+```
+
+3. **Identify the sequence:**
+
+```
+Timeline:
+ T-15m: CloudTrail — UpdateFunctionCode by CI/CD role
+ T-12m: Logs — first error "Connection refused to payments-api.internal"
+ T-10m: Metrics — Error count crosses 5/min threshold
+ T-8m: Alarm — PaymentProcessorErrors enters ALARM
+ T-5m: Metrics — p99 latency spikes to 28s (timeout)
+ T-0: Current — error rate at 45%, alarm still firing
+```
+
+4. **Determine root event** — the earliest change that preceded all symptoms. Walk backward from the first symptom to the most recent mutation (deploy, config change, scaling event, or external dependency shift).
+
+### Gotchas
+
+- CloudWatch metric timestamps mark the start of the period. A 1-minute datapoint at 14:05 covers 14:05-14:06.
+- CloudTrail events can have up to 15-minute delivery delay. Use `eventTime`, not ingestion time.
+- Log event timestamps come from the emitting host, and delivery lags by the agent/SDK flush interval. Allow 30-60s of skew when aligning logs with metrics.
+- Alarm state changes lag the anomaly by the evaluation delay (period length x evaluation periods), so the actual anomaly started earlier than the transition time.
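
The alarm evaluation delay in the last gotcha, and the merged timeline from Pattern 5, can be illustrated with a small sketch. It assumes a simple alarm where every evaluation period must breach (M-out-of-N alarms and missing-data handling would shift the estimate), and the function names and the `(timestamp, source, description)` tuple shape are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


def earliest_plausible_onset(alarm_transition, period_seconds, evaluation_periods):
    """Subtract the alarm's evaluation delay (period x evaluation periods)
    from the transition time to get a lower bound for when the anomaly began."""
    delay = timedelta(seconds=period_seconds * evaluation_periods)
    return alarm_transition - delay


def build_timeline(events, reference_time):
    """Sort (timestamp, source, description) tuples and label each with its
    offset in whole minutes relative to reference_time, e.g. 'T-12m'."""
    lines = []
    for ts, source, description in sorted(events, key=lambda e: e[0]):
        minutes = int((ts - reference_time).total_seconds() // 60)
        label = "T-0" if minutes == 0 else f"T{minutes:+d}m"
        lines.append(f"{label}: {source} - {description}")
    return lines
```

For example, an alarm with a 60-second period and 3 evaluation periods that transitioned at 14:35 gives an earliest plausible onset of 14:32, which is the point to start the CloudTrail and log searches from rather than the transition time itself.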