Automated incident investigation
Alerts fire. Annie investigates. Root cause and remediation land in the incident channel before the on-call engineer has finished joining the bridge. The investigation correlates infrastructure changes, monitoring data, and dependencies across AWS, GCP, Kubernetes, Terraform, and your monitoring stack. No console-hopping. No tab-jumping.
Slack channel registration
Register Annie on your incident channel
In your Slack incident channel, run the registration command, for example:
/register_annie_on_call @Datadog
Wait for an alert
When an alert fires in your channel, Annie automatically picks it up and starts the investigation.
Alert ingestion sources
Annie investigates incidents originating from four channels:
PagerDuty
Automatic RCA when incidents are created. Results posted as incident notes.
Incident.io
Webhook integration for automatic RCA. Results posted as comments.
Slack
Mention @Annie with incident details for on-demand investigation.
MCP Tools
Trigger RCA from your IDE during development.
How it works
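The investigation this section describes comes down to filtering a versioned change graph by time window and by dependency reachability. The following is a minimal sketch of that idea only. The `Change` record, the `depends_on` map, and all sample resources are hypothetical and do not reflect Annie's actual data model or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """One recorded change event (hypothetical shape)."""
    resource: str
    kind: str  # e.g. "terraform_apply", "helm_rollout", "iam_update"
    at: datetime


def upstream(resource, depends_on):
    """Return the resource plus everything it transitively depends on."""
    seen, stack = set(), [resource]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, ()))
    return seen


def recent_changes(target, changes, depends_on, now, window=timedelta(hours=24)):
    """Changes inside the window that touch the target's dependency chain."""
    chain = upstream(target, depends_on)
    hits = (c for c in changes if c.resource in chain and now - window <= c.at <= now)
    return sorted(hits, key=lambda c: c.at)


# Hypothetical sample data, for illustration only.
depends_on = {
    "payment-service": ["payments-db", "payment-iam-role"],
    "payments-db": ["sg-prod-db"],
}
now = datetime(2024, 1, 2, 12, 0)
changes = [
    Change("sg-prod-db", "terraform_apply", datetime(2024, 1, 2, 9, 0)),
    Change("payment-iam-role", "iam_update", datetime(2023, 12, 30, 9, 0)),
    Change("search-service", "helm_rollout", datetime(2024, 1, 2, 10, 0)),
]

hits = recent_changes("payment-service", changes, depends_on, now)
print([c.resource for c in hits])  # ['sg-prod-db']
```

In this sample only the security group change qualifies: the IAM update falls outside the 24-hour window, and the search-service rollout is not on the payment-service dependency chain.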
Investigation runs against the versioned knowledge graph of your stack, where every IAM update, Helm rollout, Terraform apply, and merged commit is a node. “What changed in the last 24 hours that touches the payment-service deployment chain?” resolves as a graph diff, not a manual hunt across CloudTrail, kubectl, and git logs. When an alert fires, Annie posts the result back to the incident in about 30 to 90 seconds.
Postmortem-ready output
When Annie completes an RCA, you receive:
Executive Summary
A concise summary suitable for stakeholder communication:
“The checkout API latency spike was caused by DynamoDB read throttling after a 5x traffic increase from the marketing campaign. Immediate mitigation: increase read capacity to 500 RCU.”
Timeline of Events
Chronological sequence leading to the incident:
- 09:55 - Marketing campaign email sent
- 10:02 - Traffic increases 5x
- 10:05 - DynamoDB throttling begins
- 10:07 - P99 latency exceeds threshold
- 10:08 - Alert fires
Root Cause Details
Technical details with evidence from your systems:
- What happened
- Why it happened
- Supporting evidence from logs, metrics, and configuration history
Affected Resources
List of impacted infrastructure with the specific impact on each.
Remediation Steps
Actionable fixes organized by urgency:
- Immediate: Resolve the incident now
- Short-term: Prevent recurrence this sprint
- Long-term: Systemic improvements
Real-world examples
Database Connection Failures
“RDS connection timeout on prod-api service”
Root Cause: Security group sg-prod-db was modified at 14:32, removing the inbound rule for the application subnet (10.0.1.0/24).
Evidence:
- Security group change detected 15 minutes before alert
- No changes to RDS instance itself
- Application logs show “connection refused” starting at 14:35
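The security group evidence is essentially a before/after comparison of inbound rules. A minimal sketch using Python's standard `ipaddress` module follows. The rule lists are invented for illustration (only the 10.0.1.0/24 subnet comes from the example above), and real rules would also carry ports and protocols.

```python
import ipaddress


def subnet_allowed(subnet, inbound_cidrs):
    """True if some inbound CIDR fully covers the given IPv4 subnet."""
    net = ipaddress.ip_network(subnet)
    return any(net.subnet_of(ipaddress.ip_network(cidr)) for cidr in inbound_cidrs)


app_subnet = "10.0.1.0/24"

# Hypothetical inbound rules before and after the 14:32 modification.
rules_before = ["10.0.1.0/24", "10.0.9.0/24"]
rules_after = ["10.0.9.0/24"]

print(subnet_allowed(app_subnet, rules_before))  # True
print(subnet_allowed(app_subnet, rules_after))   # False
```

A subnet that was allowed before the change and is not allowed after it matches the pattern in this example: connections fail while the RDS instance itself is unchanged.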
Kubernetes Pods CrashLooping
“Pod restarts exceeding threshold for payment-service”
Root Cause: Deployment payment-service:v2.3.0 was deployed 1 hour ago and has a memory leak. Pods are being OOMKilled.
Evidence:
- New image deployed at 10:00
- Memory usage increased from ~300Mi to 600Mi under load
- Pod memory limit is 512Mi (see Kubernetes resource limits)
- OOMKilled events in Kubernetes
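The memory figures in this evidence can be checked directly: usage under load (600Mi) exceeds the pod limit (512Mi), which is the condition under which a container is OOMKilled. A small sketch of that comparison, assuming quantities use only the binary suffixes Ki, Mi, and Gi (the helper names are hypothetical, and other Kubernetes quantity formats are not handled):

```python
# Binary suffixes used by Kubernetes memory quantities (subset only).
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def to_bytes(quantity):
    """Convert a quantity such as '512Mi' to bytes; plain integers pass through."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def exceeds_limit(usage, limit):
    """True if observed usage is above the configured memory limit."""
    return to_bytes(usage) > to_bytes(limit)


print(exceeds_limit("300Mi", "512Mi"))  # False: usage before the new image
print(exceeds_limit("600Mi", "512Mi"))  # True: usage under load with v2.3.0
```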
API Latency Spike
“P99 latency > 2s on checkout API”
Root Cause: DynamoDB table checkout-sessions is throttling due to exceeded read capacity. A marketing campaign at 10:00 AM increased traffic 5x.
Evidence:
- Traffic increased from 100 req/s to 500 req/s at 10:00
- DynamoDB throttled requests spiked at 10:05
- Provisioned RCU (100) is insufficient (see DynamoDB provisioned throughput)
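The capacity shortfall follows from simple arithmetic. In DynamoDB, one read capacity unit covers one strongly consistent read per second, or two eventually consistent reads per second, for an item up to 4 KB. The sketch below assumes one read per request and items of at most 4 KB. Both are assumptions for illustration, not details given in the example.

```python
import math


def required_rcu(reads_per_sec, item_kb=4.0, strongly_consistent=True):
    """Estimate the provisioned RCU needed for a steady read rate."""
    units_per_read = max(1, math.ceil(item_kb / 4))  # each 4 KB block costs one unit
    rcu = reads_per_sec * units_per_read
    # Eventually consistent reads cost half as much.
    return rcu if strongly_consistent else math.ceil(rcu / 2)


provisioned = 100
print(required_rcu(100))  # 100: baseline traffic fits the provisioned capacity
print(required_rcu(500))  # 500: the 5x traffic needs five times the capacity
print(required_rcu(500) > provisioned)  # True: reads are throttled
```

Under these assumptions the 500 RCU estimate lines up with the immediate mitigation quoted in the executive summary.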
Get Started
Create Account
Sign up for Anyshift
Request Demo
See RCA in action
Related
Proactive
The flip side of RCA. Annie finds and predicts issues before an alert fires.
Annie Knowledge
Ask follow-up questions about any incident in plain language.
Knowledge Graph
The versioned graph Annie traverses to find “what changed”.
Change Management
Replay your stack’s state at any point in the last 7 days.