Automated incident investigation

Alerts fire. Annie investigates. Root cause and remediation land in the incident channel before the on-call engineer has finished joining the bridge. The investigation correlates infrastructure changes, monitoring data, and dependencies across AWS, GCP, Kubernetes, Terraform, and your monitoring stack. No console-hopping. No tab-jumping.

Mean time to resolution (MTTR)

Google’s SRE postmortem culture frames MTTR as the dominant operational metric. Industry-wide MTTR for production incidents typically runs 2 to 6 hours, with Atlassian’s incident handbook calling for sub-hour targets on SEV-1. Annie compresses the investigation phase of MTTR. Customers running PagerDuty or incident.io alongside Annie report MTTR reductions of 85% or more on incidents where “what changed?” is the blocking question.

Slack channel registration

1. Register Annie on your incident channel

In your Slack incident channel, run:
/register_annie_on_call @your_bot_name
Example: /register_annie_on_call @Datadog

2. Wait for an alert

When an alert fires in your channel, Annie automatically picks it up and starts the investigation.

3. Get root cause and fix

Annie analyzes the alert, correlates it with your infrastructure, and provides the root cause with actionable remediation steps.
Want to customize how Annie responds to specific alerts? See Annie Instructions.

Alert ingestion sources

Annie investigates incidents originating from four sources:

PagerDuty

Automatic RCA when incidents are created. Results posted as incident notes.

incident.io

Webhook integration for automatic RCA. Results posted as comments.

Slack

Mention @Annie with incident details for on-demand investigation.

MCP Tools

Trigger RCA from your IDE during development.

PagerDuty webhook trigger

The PagerDuty webhook fires on incident creation. Annie receives the event, queries the versioned infrastructure graph, and posts the investigation back as an incident note. End-to-end latency runs about 30 to 90 seconds depending on graph size.
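
For readers wiring up a comparable receiver, the sketch below shows how a PagerDuty webhook delivery can be verified and filtered for incident creation. It is not Annie's implementation. It assumes the PagerDuty V3 webhook format (an `X-PagerDuty-Signature` header carrying `v1=`-prefixed HMAC-SHA256 signatures, and an `incident.triggered` event type). The function names, the signing secret, and the sample payload are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Optional


def signature_is_valid(secret: str, body: bytes, header: str) -> bool:
    """Check a raw request body against an X-PagerDuty-Signature header value."""
    expected = "v1=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # The header may carry several comma-separated signatures (secret rotation).
    candidates = [part.strip() for part in header.split(",")]
    return any(hmac.compare_digest(expected, c) for c in candidates)


def incident_id_if_triggered(secret: str, body: bytes, header: str) -> Optional[str]:
    """Return the incident ID for a verified incident.triggered event, else None."""
    if not signature_is_valid(secret, body, header):
        return None
    event = json.loads(body).get("event", {})
    if event.get("event_type") != "incident.triggered":
        return None
    return event.get("data", {}).get("id")


# Hypothetical delivery, signed locally so the example is self-contained.
SECRET = "example-signing-secret"
BODY = json.dumps(
    {
        "event": {
            "event_type": "incident.triggered",
            "data": {"id": "PEXAMPLE", "title": "P99 latency > 2s on checkout API"},
        }
    }
).encode()
HEADER = "v1=" + hmac.new(SECRET.encode(), BODY, hashlib.sha256).hexdigest()

print(incident_id_if_triggered(SECRET, BODY, HEADER))
```

Verifying the signature against the raw bytes before parsing the JSON keeps unauthenticated payloads away from the rest of the handler.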

Versioned graph traversal

Investigation runs against an indexed model of the production stack. Every IAM update, Helm rollout, Terraform apply, and merged commit is recorded as a versioned node. “What changed in the last 24 hours that touches the payment-service deployment chain?” resolves as a graph diff rather than a manual hunt across CloudTrail, kubectl, and git logs. Cloudflare’s November 2025 cascading outage is the canonical example of why this matters. Monitoring caught the failures in minutes. Tracing them to the originating internal change took hours, because the dependency chain was not queryable data.
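
The "what changed in the last 24 hours" question can be pictured as a dependency traversal plus a time filter. The sketch below illustrates that idea under stated assumptions and is not Annie's data model: the `DEPENDS_ON` map, the `CHANGES` list, the timestamps, and every resource name other than payment-service are hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta

# Hypothetical dependency edges: resource -> resources it depends on.
DEPENDS_ON = {
    "payment-service": ["payment-helm-release", "payment-iam-role"],
    "payment-helm-release": ["payment-terraform-module"],
    "payment-iam-role": [],
    "payment-terraform-module": [],
    "search-service": ["search-helm-release"],
    "search-helm-release": [],
}

# Hypothetical change records: (resource, kind of change, timestamp).
CHANGES = [
    ("payment-iam-role", "IAM update", datetime(2025, 1, 10, 3, 0)),
    ("payment-terraform-module", "Terraform apply", datetime(2025, 1, 8, 12, 0)),
    ("search-helm-release", "Helm rollout", datetime(2025, 1, 10, 9, 0)),
    ("payment-helm-release", "Helm rollout", datetime(2025, 1, 10, 9, 30)),
]


def dependency_chain(root):
    """Return root plus everything reachable from it (breadth-first)."""
    seen, queue = {root}, deque([root])
    while queue:
        for dep in DEPENDS_ON.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def recent_changes(root, now, window=timedelta(hours=24)):
    """Changes inside the window that touch root's dependency chain, oldest first."""
    chain = dependency_chain(root)
    hits = [c for c in CHANGES if c[0] in chain and now - window <= c[2] <= now]
    return sorted(hits, key=lambda c: c[2])


now = datetime(2025, 1, 10, 10, 0)
for resource, kind, when in recent_changes("payment-service", now):
    print(f"{when:%Y-%m-%d %H:%M}  {kind:<16} {resource}")
```

With this sample data, the Terraform apply falls outside the 24-hour window and the search-service rollout falls outside the dependency chain, so only the IAM update and the payment Helm rollout are returned.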

Postmortem-ready output

When Annie completes an RCA, you receive:
A concise summary suitable for stakeholder communication:
“The checkout API latency spike was caused by DynamoDB read throttling after a 5x traffic increase from the marketing campaign. Immediate mitigation: increase read capacity to 500 RCU.”
Chronological sequence leading to the incident:
  • 09:55 - Marketing campaign email sent
  • 10:02 - Traffic increases 5x
  • 10:05 - DynamoDB throttling begins
  • 10:07 - P99 latency exceeds threshold
  • 10:08 - Alert fires
Technical details with evidence from your systems:
  • What happened
  • Why it happened
  • Supporting evidence from logs, metrics, and configuration history
List of impacted infrastructure with the specific impact on each.
Actionable fixes organized by urgency:
  • Immediate: Resolve the incident now
  • Short-term: Prevent recurrence this sprint
  • Long-term: Systemic improvements
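
One way to picture the shape of this output is as a small record type. The sketch below is an assumption for illustration, not Annie's output schema: the `RcaReport` class, its field names, and the `minutes_between` helper are hypothetical. The timeline entries and the 500 RCU mitigation are copied from the example above.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class RcaReport:
    summary: str
    timeline: list  # (HH:MM, event) tuples in chronological order
    remediation: dict = field(default_factory=dict)  # urgency -> list of fixes

    def minutes_between(self, first_event: str, second_event: str) -> int:
        """Minutes elapsed between two named timeline events."""
        times = {event: datetime.strptime(t, "%H:%M") for t, event in self.timeline}
        return int((times[second_event] - times[first_event]).total_seconds() // 60)


report = RcaReport(
    summary="Checkout API latency spike caused by DynamoDB read throttling.",
    timeline=[
        ("09:55", "Marketing campaign email sent"),
        ("10:02", "Traffic increases 5x"),
        ("10:05", "DynamoDB throttling begins"),
        ("10:07", "P99 latency exceeds threshold"),
        ("10:08", "Alert fires"),
    ],
    remediation={
        "immediate": ["Increase read capacity to 500 RCU"],
        "short-term": [],  # left empty: the source gives no example here
        "long-term": [],   # left empty: the source gives no example here
    },
)

print(report.minutes_between("DynamoDB throttling begins", "Alert fires"))     # 3
print(report.minutes_between("Marketing campaign email sent", "Alert fires"))  # 13
```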

Agentic Context Engineering

The methodology behind Annie’s investigation loop is documented in Agentic Context Engineering, a paper authored with researchers at Stanford and SambaNova Systems and accepted at ICLR 2026. The technique has been live in production since October 2025. It has cut root-cause-analysis time by 30% on real customer incidents.

Operational benefits

Reduce MTTR

Cut mean time to resolution from hours to minutes by automating the investigation.

Less On-Call Stress

Engineers get root cause and fix suggestions immediately instead of scrambling through dashboards.

Consistent Investigation

Every incident gets the same thorough analysis, regardless of who’s on call.

Actionable Fixes

Annie provides specific commands and code changes, not just diagnoses.

Real-world examples

“RDS connection timeout on prod-api service”
Root Cause: Security group sg-prod-db was modified at 14:32, removing the inbound rule for the application subnet (10.0.1.0/24).
Evidence:
  • Security group change detected 15 minutes before alert
  • No changes to RDS instance itself
  • Application logs show “connection timed out” starting at 14:35
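
The evidence in this example reduces to a coverage check: does any remaining inbound rule still admit the application subnet? A minimal sketch with Python's standard `ipaddress` module follows. The rule lists are hypothetical (the example names only the 10.0.1.0/24 subnet), and this is not a description of how Annie evaluates security groups.

```python
import ipaddress


def subnet_is_allowed(subnet: str, inbound_cidrs: list) -> bool:
    """True if at least one inbound CIDR fully contains the given subnet."""
    target = ipaddress.ip_network(subnet)
    return any(target.subnet_of(ipaddress.ip_network(c)) for c in inbound_cidrs)


APP_SUBNET = "10.0.1.0/24"
rules_before = ["10.0.1.0/24", "10.0.9.0/24"]  # hypothetical state before 14:32
rules_after = ["10.0.9.0/24"]                  # hypothetical state after the change

print(subnet_is_allowed(APP_SUBNET, rules_before))  # True
print(subnet_is_allowed(APP_SUBNET, rules_after))   # False
```
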
“Pod restarts exceeding threshold for payment-service”
Root Cause: Deployment payment-service:v2.3.0 was deployed 1 hour ago and has a memory leak. Pods are being OOMKilled.
Evidence:
  • New image deployed at 10:00
  • Memory usage increased from ~300Mi to 600Mi under load
  • Pod memory limit is 512Mi (see Kubernetes resource limits)
  • OOMKilled events in Kubernetes
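
The comparison behind this finding is a unit-aware check of observed memory against the pod limit. The sketch below is illustrative only: `to_bytes` and `exceeds_limit` are hypothetical helpers, and the parser handles just the binary suffixes (Ki, Mi, Gi) used in this example, not the full Kubernetes quantity syntax.

```python
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def to_bytes(quantity: str) -> int:
    """Convert a binary memory quantity such as '512Mi' to bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: treat as a plain byte count


def exceeds_limit(usage: str, limit: str) -> bool:
    """True if observed usage is above the configured memory limit."""
    return to_bytes(usage) > to_bytes(limit)


print(exceeds_limit("300Mi", "512Mi"))  # False: usage before the new image
print(exceeds_limit("600Mi", "512Mi"))  # True: usage under load after v2.3.0
```
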
“P99 latency > 2s on checkout API”
Root Cause: DynamoDB table checkout-sessions is throttling due to exceeded read capacity. A marketing campaign at 10:00 AM increased traffic 5x.
Evidence:
  • Traffic increased from 100 req/s to 500 req/s at 10:00
  • DynamoDB throttled requests spiked at 10:05
  • Provisioned RCU (100) is insufficient (see DynamoDB provisioned throughput)
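
The capacity arithmetic in this example can be checked with DynamoDB's provisioned-throughput rule: one RCU covers one strongly consistent read per second of an item up to 4 KB, or two eventually consistent reads. The sketch below applies that rule. It assumes one read per request and items no larger than 4 KB, neither of which the example states, and `required_rcu` is a hypothetical helper.

```python
import math


def required_rcu(reads_per_second: float, item_size_kb: float = 4.0,
                 strongly_consistent: bool = True) -> int:
    """Estimate provisioned RCU needed for a steady read rate."""
    # Each read consumes one unit per 4 KB of item size, rounded up.
    units_per_read = math.ceil(item_size_kb / 4.0)
    if not strongly_consistent:
        units_per_read /= 2  # eventually consistent reads cost half
    return math.ceil(reads_per_second * units_per_read)


print(required_rcu(100))  # 100: matches the provisioned capacity
print(required_rcu(500))  # 500: matches the suggested mitigation
print(required_rcu(500, strongly_consistent=False))  # 250
```

Under these assumptions, the 5x jump from 100 to 500 requests per second needs 500 RCU, five times the provisioned 100, which is consistent with the throttling described.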
