
# Root Cause Analysis

> When incidents occur, Annie automatically correlates alerts, logs, metrics, and infrastructure changes to pinpoint root causes and suggest actionable fixes.

<iframe width="560" height="315" src="https://www.youtube.com/embed/ppgrX1fligI" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

## Automated incident investigation

Alerts fire. Annie investigates. Root cause and remediation land in the incident channel before the on-call engineer has finished joining the bridge.

The investigation correlates infrastructure changes, monitoring data, and dependencies across AWS, GCP, Kubernetes, Terraform, and your monitoring stack. No console-hopping. No tab-jumping.

## Slack channel registration

<Steps>
  <Step title="Register Annie on your incident channel">
    In your Slack incident channel, run the command below, replacing `@your_bot_name` with the bot that posts alerts to the channel:

    ```bash theme={null}
    /register_annie_on_call @your_bot_name
    ```

    **Example:** `/register_annie_on_call @Datadog`
  </Step>

  <Step title="Wait for an alert">
    When an alert fires in your channel, Annie automatically picks it up and starts the investigation.
  </Step>

  <Step title="Get root cause and fix">
    Annie analyzes the alert, correlates it with your infrastructure, and posts the root cause with actionable remediation steps.
  </Step>
</Steps>

<Tip>
  Want to customize how Annie responds to specific alerts? See [Automation](/pages/product/customization/instructions).
</Tip>

## Alert ingestion sources

Annie investigates incidents that originate from four sources:

<CardGroup cols={2}>
  <Card title="PagerDuty" icon="bell" href="/pages/integration/pagerduty">
    Automatic RCA when incidents are created. Results posted as incident notes.
  </Card>

  <Card title="Incident.io" icon="triangle-exclamation" href="/pages/integration/incident-io">
    Webhook integration for automatic RCA. Results posted as comments.
  </Card>

  <Card title="Slack" icon="slack" href="/pages/product/integration/slack">
    Mention @Annie with incident details for on-demand investigation.
  </Card>

  <Card title="MCP Tools" icon="robot" href="/pages/product/integration/remote_mcp">
    Trigger RCA from your IDE during development.
  </Card>
</CardGroup>

## How it works

Investigation runs against the versioned [knowledge graph](/pages/overview/knowledge_graph) of your stack, where every IAM update, Helm rollout, Terraform apply, and merged commit is a node. "What changed in the last 24 hours that touches the payment-service deployment chain?" resolves as a graph diff, not a manual hunt across CloudTrail, kubectl, and git logs. When an alert fires, Annie posts the result back to the incident in about 30 to 90 seconds.
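
The "what changed in the last 24 hours" question can be pictured as a dependency traversal followed by a time filter. The sketch below is only an illustration under stated assumptions: the `Change` shape, the `depends_on` mapping, and the resource names are hypothetical, and the traversal stands in for the graph diff rather than reproducing how Annie or the knowledge graph is implemented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """One change event attached to a resource (hypothetical shape)."""
    resource: str
    kind: str  # e.g. "terraform_apply", "helm_rollout"
    at: datetime


def upstream_closure(resource: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return the resource plus everything it transitively depends on."""
    seen: set[str] = set()
    stack = [resource]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(depends_on.get(node, []))
    return seen


def recent_changes(resource, depends_on, changes, now, window=timedelta(hours=24)):
    """Changes inside the window that touch the resource's dependency chain, newest first."""
    chain = upstream_closure(resource, depends_on)
    hits = [c for c in changes if c.resource in chain and now - window <= c.at <= now]
    return sorted(hits, key=lambda c: c.at, reverse=True)


# Toy data: every name below is made up for the example.
depends_on = {
    "payment-service": ["payments-db", "payment-service-iam-role"],
    "payments-db": ["sg-prod-db"],
}
now = datetime(2024, 1, 2, 12, 0)
changes = [
    Change("sg-prod-db", "terraform_apply", datetime(2024, 1, 2, 9, 30)),
    Change("checkout-api", "helm_rollout", datetime(2024, 1, 2, 10, 0)),      # not in the chain
    Change("payment-service", "helm_rollout", datetime(2023, 12, 30, 8, 0)),  # outside the window
]

result = recent_changes("payment-service", depends_on, changes, now)
for c in result:
    print(c.at.isoformat(), c.kind, c.resource)
```

Only the `sg-prod-db` change survives both filters here: it sits two hops upstream of `payment-service` and falls inside the 24-hour window.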

## Postmortem-ready output

When Annie completes an RCA, you receive:

<AccordionGroup>
  <Accordion icon="file-lines" title="Executive Summary">
    A concise summary suitable for stakeholder communication:

    > "The checkout API latency spike was caused by DynamoDB read throttling after a 5x traffic increase from the marketing campaign. Immediate mitigation: increase read capacity to 500 RCU."
  </Accordion>

  <Accordion icon="clock" title="Timeline of Events">
    Chronological sequence leading to the incident:

    * 09:55 - Marketing campaign email sent
    * 10:02 - Traffic increases 5x
    * 10:05 - DynamoDB throttling begins
    * 10:07 - P99 latency exceeds threshold
    * 10:08 - Alert fires
  </Accordion>

  <Accordion icon="magnifying-glass" title="Root Cause Details">
    Technical details with evidence from your systems:

    * What happened
    * Why it happened
    * Supporting evidence from logs, metrics, and configuration history
  </Accordion>

  <Accordion icon="server" title="Affected Resources">
    List of impacted infrastructure with the specific impact on each.
  </Accordion>

  <Accordion icon="wrench" title="Remediation Steps">
    Actionable fixes organized by urgency:

    * **Immediate:** Resolve the incident now
    * **Short-term:** Prevent recurrence this sprint
    * **Long-term:** Systemic improvements

    For hypotheses that map to a concrete code change, the **Propose Fix** button on the hypothesis card opens a pull request with the diff. See [Propose Fix](/pages/product/propose_fix).
  </Accordion>
</AccordionGroup>
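
If you want to carry these sections into your own postmortem tooling, one way to model them is a small record type. This is a sketch only: the class and field names are assumptions that mirror the five sections above, not Annie's actual output schema.

```python
from dataclasses import dataclass, field


@dataclass
class TimelineEvent:
    time: str  # zero-padded "HH:MM", same day
    description: str


@dataclass
class RcaReport:
    # Hypothetical field names, one per section listed above.
    executive_summary: str
    timeline: list[TimelineEvent] = field(default_factory=list)
    root_cause_details: str = ""
    affected_resources: list[str] = field(default_factory=list)
    remediation: dict[str, list[str]] = field(default_factory=dict)  # urgency -> steps


def render_timeline(report: RcaReport) -> str:
    """Render events chronologically as a Markdown list.

    Sorting the strings works because the times are zero-padded and on the same day.
    """
    ordered = sorted(report.timeline, key=lambda e: e.time)
    return "\n".join(f"* {e.time} - {e.description}" for e in ordered)


report = RcaReport(
    executive_summary="Checkout API latency spike caused by DynamoDB read throttling.",
    timeline=[
        TimelineEvent("10:08", "Alert fires"),
        TimelineEvent("09:55", "Marketing campaign email sent"),
        TimelineEvent("10:05", "DynamoDB throttling begins"),
    ],
    remediation={"Immediate": ["Increase read capacity to 500 RCU"]},
)
rendered = render_timeline(report)
print(rendered)
```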

## Real-world examples

<AccordionGroup>
  <Accordion icon="database" title="Database Connection Failures">
    *"RDS connection timeout on prod-api service"*

    > **Root Cause**: Security group `sg-prod-db` was modified at 14:32, removing the inbound rule for the application subnet (10.0.1.0/24).
    >
    > **Evidence**:
    >
    > * Security group change detected 15 minutes before alert
    > * No changes to RDS instance itself
    > * Application logs show "connection timed out" starting at 14:35
  </Accordion>

  <Accordion icon="cubes" title="Kubernetes Pods CrashLooping">
    *"Pod restarts exceeding threshold for payment-service"*

    > **Root Cause**: Image `payment-service:v2.3.0` was rolled out 1 hour ago and has a memory leak. Pods exceed their memory limit and are being OOMKilled.
    >
    > **Evidence**:
    >
    > * New image deployed at 10:00
    > * Memory usage increased from \~300Mi to 600Mi under load
    > * Pod memory limit is 512Mi (see [Kubernetes resource limits](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/))
    > * OOMKilled events in Kubernetes
  </Accordion>

  <Accordion icon="gauge-high" title="API Latency Spike">
    *"P99 latency > 2s on checkout API"*

    > **Root Cause**: DynamoDB table `checkout-sessions` is throttling because read demand exceeds its provisioned capacity. A marketing campaign email sent at 09:55 increased traffic 5x.
    >
    > **Evidence**:
    >
    > * Traffic increased from 100 req/s to 500 req/s at 10:02
    > * DynamoDB throttled requests spiked at 10:05
    > * Provisioned RCU (100) is insufficient (see [DynamoDB provisioned throughput](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ProvisionedThroughput.html))
  </Accordion>
</AccordionGroup>
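
The capacity figures in the API latency example can be sanity-checked with DynamoDB's read-capacity rule: one RCU covers one strongly consistent read per second (or two eventually consistent reads) of an item up to 4 KB. The sketch below assumes one read per request and items no larger than 4 KB. Neither assumption comes from the example itself, and `required_rcu` is a hypothetical helper, not part of any SDK.

```python
import math


def required_rcu(reads_per_second: float, item_size_kb: float,
                 strongly_consistent: bool = True) -> int:
    """Estimate the provisioned RCU needed for a steady read rate."""
    # Each read costs one unit per 4 KB of item size, rounded up (at least one).
    units_per_read = max(1, math.ceil(item_size_kb / 4))
    rcu = reads_per_second * units_per_read
    if not strongly_consistent:
        rcu /= 2  # eventually consistent reads cost half as much
    return math.ceil(rcu)


provisioned = 100  # RCU provisioned in the example
before = required_rcu(100, item_size_kb=4)  # traffic before the campaign
after = required_rcu(500, item_size_kb=4)   # traffic after the 5x increase
print(before, after, after > provisioned)   # 100 500 True
```

Under these assumptions the table was sized exactly for the pre-campaign load, which is why throttling starts as soon as traffic rises and why 500 RCU is the suggested mitigation.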

## Get started

<CardGroup cols={2}>
  <Card title="Create Account" icon="user-plus" href="https://app.anyshift.io/">
    Sign up for Anyshift
  </Card>

  <Card title="Request Demo" icon="phone" href="https://calendly.com/roxane-fischer/30-zoom-meeting?back=1">
    See RCA in action
  </Card>
</CardGroup>

## Related

<CardGroup cols={2}>
  <Card title="Proactive" icon="radar" href="/pages/product/proactive_annie">
    The flip side of RCA. Annie finds and predicts issues before an alert fires.
  </Card>

  <Card title="Annie Knowledge" icon="brain" href="/pages/product/annie_knowledge">
    Ask follow-up questions about any incident in plain language.
  </Card>

  <Card title="Knowledge Graph" icon="database" href="/pages/overview/knowledge_graph">
    The versioned graph Annie traverses to find "what changed".
  </Card>

  <Card title="Change Management" icon="clock-rotate-left" href="/pages/product/time_travel">
    Replay your stack's state at any point in the last 7 days.
  </Card>
</CardGroup>
