
Overview

Annie automatically learns from your connected integrations (AWS, Terraform, Datadog, etc.), but some knowledge isn’t stored in those systems: team conventions, business context, runbook locations, on-call contacts, and other tribal knowledge. Custom Knowledge lets you supply that context so Annie can give more accurate and relevant responses.

Knowledge Categories

Document your infrastructure topology:
  • Service dependencies and ownership
  • Critical paths and SLAs
  • Environment layouts (prod, staging, dev)
  • Architecture decisions and trade-offs

Define common log patterns and their meanings:
  • Error signatures and what they indicate
  • Warning patterns to watch for
  • Success patterns for verification
  • How to find relevant logs in your monitoring tools

Document how your team uses monitoring and debugging tools:
  • Key Datadog dashboards to check
  • Important CloudWatch metrics
  • Runbook locations
  • How to access specific environments

Connect business concepts to technical components:
  • Which services power which features
  • Revenue-critical paths
  • Customer-facing vs internal services
  • Feature flags and their impact

Define your team’s standards:
  • Naming conventions for resources
  • Tagging standards
  • Deployment patterns
  • On-call rotation and escalation paths

Examples

# Payment Service

## Overview
Handles all financial transactions for checkout and subscriptions.
Critical path: API Gateway → Payment Lambda → Stripe API → DynamoDB

## Dependencies
- Stripe API (external) - payment processing
- DynamoDB table: payments-prod - transaction records
- SQS queue: payment-events - async processing
- Redis cluster: payment-cache - rate limiting

## On-Call Contacts
- Primary: @payments-team in Slack
- Escalation: [email protected]

## Common Issues
1. Stripe timeouts: Check Stripe status page first
2. DynamoDB throttling: Scale up RCU or check for hot partitions
3. Lambda cold starts: Check concurrent execution limits

## Key Dashboards
- Datadog: "Payment Service Overview"
- CloudWatch alarm: payment-lambda-errors
- Business metric: checkout_success_rate

# Authentication Service

## Overview
Handles user login, SSO, and session management.
Critical for all user-facing applications.

## Dependencies
- Auth0 (external) - identity provider
- Redis cluster: session-cache - session storage
- PostgreSQL: users-db - user profiles
- Kafka: auth-events - audit logging

## On-Call Contacts
- Primary: @platform-team in Slack
- Security issues: [email protected] (immediate escalation)

## Common Issues
1. Auth0 rate limits: Check Auth0 dashboard, may need to request limit increase
2. Session cache misses: Usually Redis memory pressure, check eviction rate
3. Login failures spike: Often caused by downstream service issues, not auth itself

## Key Dashboards
- Datadog: "Auth Service Health"
- Auth0 Dashboard: https://manage.auth0.com/
- Metric to watch: login_success_rate (alert if < 99%)

# Database Operations

## Production Databases
| Database | Type | Primary Use | Owner |
|----------|------|-------------|-------|
| users-db | PostgreSQL | User profiles | @platform-team |
| orders-db | PostgreSQL | Order history | @commerce-team |
| analytics-db | ClickHouse | Reporting | @data-team |

## Connection Strings
- Production: Use AWS Secrets Manager, secret name: prod/db/credentials
- Staging: Use AWS Secrets Manager, secret name: staging/db/credentials

## Common Issues
1. Connection pool exhaustion: Check active connections in RDS console
   - Normal: < 80% of max_connections
   - Alert: > 90% of max_connections
   
2. Slow queries: Check pg_stat_statements for queries > 1s
   - Runbook: https://wiki.company.com/db-slow-queries

3. Replication lag: Check CloudWatch ReplicaLag metric
   - Normal: < 100ms
   - Alert: > 1s
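
The Normal and Alert thresholds above can be read as a simple three-way classification. The following is a minimal sketch, not part of Annie or any AWS API: the function names are hypothetical, and the "elevated" label for values between the two thresholds is an assumption, since this entry does not say how to treat that range.

```python
def classify(value, normal_below, alert_above):
    """Map a reading onto the thresholds documented in this entry."""
    if value > alert_above:
        return "alert"
    if value < normal_below:
        return "normal"
    # Assumption: the entry does not name the range between the thresholds.
    return "elevated"


def connection_pool_status(active, max_connections):
    """Normal below 80% of max_connections, alert above 90%."""
    used_percent = 100.0 * active / max_connections
    return classify(used_percent, normal_below=80.0, alert_above=90.0)


def replication_lag_status(lag_ms):
    """Normal below 100 ms, alert above 1 s (1000 ms)."""
    return classify(lag_ms, normal_below=100.0, alert_above=1000.0)


if __name__ == "__main__":
    print(connection_pool_status(active=95, max_connections=100))
    print(replication_lag_status(lag_ms=50))
```

`replication_lag_status` expects milliseconds, so convert first if your metric source reports lag in another unit.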

## Maintenance Windows
- Production: Sundays 2-4 AM UTC
- Staging: No restrictions

# Naming Conventions

## Resource Naming Pattern
{env}-{service}-{resource-type}-{identifier}

Examples:
- prod-payment-lambda-processor
- staging-auth-rds-primary
- dev-api-ec2-worker-01
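
A name can be checked against this pattern with a short script. The sketch below is illustrative only: `parse_resource_name` is a hypothetical helper, and it assumes that env, service, and resource type never contain hyphens, so that only the identifier may (as in `worker-01`). The pattern itself does not state this, so adjust the expression if your names differ.

```python
import re

# Environment names taken from this entry.
ENVIRONMENTS = {"prod", "staging", "dev"}

# {env}-{service}-{resource-type}-{identifier}
# Assumption: only the trailing identifier may contain further hyphens.
NAME_RE = re.compile(
    r"^(?P<env>[a-z0-9]+)"
    r"-(?P<service>[a-z0-9]+)"
    r"-(?P<resource_type>[a-z0-9]+)"
    r"-(?P<identifier>[a-z0-9]+(?:-[a-z0-9]+)*)$"
)


def parse_resource_name(name):
    """Return the name's components as a dict, or None if it does not conform."""
    match = NAME_RE.match(name)
    if match is None or match.group("env") not in ENVIRONMENTS:
        return None
    return match.groupdict()


if __name__ == "__main__":
    for candidate in ("prod-payment-lambda-processor", "dev-api-ec2-worker-01", "qa-api"):
        print(candidate, "->", parse_resource_name(candidate))
```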

## Environments
- prod: Production (us-east-1, eu-west-1)
- staging: Pre-production testing (us-east-1)
- dev: Development (us-west-2)

## Tagging Standards
All resources MUST have:
- Environment: prod | staging | dev
- Service: service name (e.g., payment, auth, api)
- Owner: team email (e.g., [email protected])
- CostCenter: finance code (e.g., CC-1234)
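
The required tags can be verified the same way. This is a minimal sketch under stated assumptions: `tag_problems` is a hypothetical helper, tags are passed as a plain dictionary, and only the Environment value is validated, because that is the only tag for which this entry lists allowed values.

```python
REQUIRED_TAGS = ("Environment", "Service", "Owner", "CostCenter")
ALLOWED_ENVIRONMENTS = {"prod", "staging", "dev"}


def tag_problems(tags):
    """Return a list of human-readable problems; an empty list means compliant."""
    # A tag that is absent or empty counts as missing.
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    env = tags.get("Environment")
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid Environment: {env}")
    return problems


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(tag_problems({"Environment": "qa", "Service": "auth"}))
```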

## Terraform Module Locations
- Infrastructure: github.com/company/terraform-infra
- Modules: github.com/company/terraform-modules
- Environments: github.com/company/terraform-envs/{env}

# Incident Response

## Severity Levels
- SEV1: Customer-facing outage, all hands on deck
- SEV2: Degraded service, on-call + team lead
- SEV3: Minor issue, on-call only

## Escalation Path
1. On-call engineer (PagerDuty)
2. Team lead (after 15 min for SEV1/SEV2)
3. Engineering manager (after 30 min for SEV1)
4. VP Engineering (after 1 hour for SEV1)
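
The escalation path can also be expressed as data, which makes it easy to answer "who should be involved by now?" for a given incident. The sketch below is hypothetical and not tied to PagerDuty or any other tool. It assumes that "after N min" includes the Nth minute itself, and that SEV3 incidents never escalate beyond the on-call engineer.

```python
# (role, minutes after incident start, severities the step applies to)
ESCALATION_STEPS = [
    ("On-call engineer", 0, {"SEV1", "SEV2", "SEV3"}),
    ("Team lead", 15, {"SEV1", "SEV2"}),
    ("Engineering manager", 30, {"SEV1"}),
    ("VP Engineering", 60, {"SEV1"}),
]


def who_is_engaged(severity, minutes_elapsed):
    """List every role that should be involved at this point in the incident."""
    return [
        role
        for role, after_minutes, severities in ESCALATION_STEPS
        if severity in severities and minutes_elapsed >= after_minutes
    ]


if __name__ == "__main__":
    print(who_is_engaged("SEV1", 45))
    print(who_is_engaged("SEV2", 120))
```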

## Communication Channels
- Incidents: #incidents (Slack)
- War room: #incident-{id} (created automatically)
- Status page: https://status.company.com

## Post-Incident
- Blameless postmortem within 48h for SEV1/SEV2
- Template: https://wiki.company.com/postmortem-template
- Review meeting: Thursdays 2 PM

## Key Contacts
- On-call: Check PagerDuty schedule
- Security: [email protected]
- Legal (data breach): [email protected]

Get Started