Custom Knowledge - Starter Kit

Overview

Annie automatically learns from your connected integrations (AWS, Terraform, Datadog, etc.), but some knowledge isn’t available in those systems: team conventions, business context, runbook locations, on-call contacts, and tribal knowledge. Custom Knowledge lets you supply this missing context so that Annie can give more accurate and relevant responses.

Knowledge Categories

Infrastructure Architecture

Document your infrastructure topology:

Service dependencies and ownership
Critical paths and SLAs
Environment layouts (prod, staging, dev)
Architecture decisions and trade-offs

Log Patterns

Define common log patterns and their meanings:

Error signatures and what they indicate
Warning patterns to watch for
Success patterns for verification
How to find relevant logs in your monitoring tools

Tool Usage

Document how your team uses monitoring and debugging tools:

Key Datadog dashboards to check
Important CloudWatch metrics
Runbook locations
How to access specific environments

Business-Technical Mapping

Connect business concepts to technical components:

Which services power which features
Revenue-critical paths
Customer-facing vs internal services
Feature flags and their impact

Schemas & Conventions

Define your team’s standards:

Naming conventions for resources
Tagging standards
Deployment patterns
On-call rotation and escalation paths

Examples

Payment Service

# Payment Service

## Overview
Handles all financial transactions for checkout and subscriptions.
Critical path: API Gateway → Payment Lambda → Stripe API → DynamoDB

## Dependencies
- Stripe API (external) - payment processing
- DynamoDB table: payments-prod - transaction records
- SQS queue: payment-events - async processing
- Redis cluster: payment-cache - rate limiting

## On-Call Contacts
- Primary: @payments-team in Slack
- Escalation: payments-oncall@company.com

## Common Issues
1. Stripe timeouts: Check Stripe status page first
2. DynamoDB throttling: Scale up RCU/WCU or check for hot partitions
3. Lambda cold starts: Check provisioned concurrency and concurrent execution limits

## Key Dashboards
- Datadog: "Payment Service Overview"
- CloudWatch alarm: payment-lambda-errors
- Business metric: checkout_success_rate

Authentication Service

# Authentication Service

## Overview
Handles user login, SSO, and session management.
Critical for all user-facing applications.

## Dependencies
- Auth0 (external) - identity provider
- Redis cluster: session-cache - session storage
- PostgreSQL: users-db - user profiles
- Kafka: auth-events - audit logging

## On-Call Contacts
- Primary: @platform-team in Slack
- Security issues: security@company.com (immediate escalation)

## Common Issues
1. Auth0 rate limits: Check Auth0 dashboard, may need to request limit increase
2. Session cache misses: Usually Redis memory pressure, check eviction rate
3. Login failures spike: Often caused by downstream service issues, not auth itself

## Key Dashboards
- Datadog: "Auth Service Health"
- Auth0 Dashboard: https://manage.auth0.com/
- Metric to watch: login_success_rate (alert if < 99%)

Database Runbook

# Database Operations

## Production Databases
| Database | Type | Primary Use | Owner |
|----------|------|-------------|-------|
| users-db | PostgreSQL | User profiles | @platform-team |
| orders-db | PostgreSQL | Order history | @commerce-team |
| analytics-db | ClickHouse | Reporting | @data-team |

## Connection Strings
- Production: Use AWS Secrets Manager, secret name: prod/db/credentials
- Staging: Use AWS Secrets Manager, secret name: staging/db/credentials

## Common Issues
1. Connection pool exhaustion: Check active connections in RDS console
   - Normal: < 80% of max_connections
   - Alert: > 90% of max_connections
   
2. Slow queries: Check pg_stat_statements for queries > 1s
   - Runbook: https://wiki.company.com/db-slow-queries

3. Replication lag: Check CloudWatch ReplicaLag metric
   - Normal: < 100ms
   - Alert: > 1s

## Maintenance Windows
- Production: Sundays 2-4 AM UTC
- Staging: No restrictions
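
The connection-pool and replication-lag thresholds in this runbook can also be written down as a small check, which makes the "normal" and "alert" boundaries unambiguous. The following is a minimal sketch in Python, not something Annie requires or ships. The function names `classify_pool_usage` and `classify_replica_lag` are hypothetical, and the "warning" band for values that fall between the two documented thresholds is an assumption, since the runbook only defines the two ends.

```python
def classify_pool_usage(active: int, max_connections: int) -> str:
    """Classify connection usage against the runbook thresholds.

    Normal: < 80% of max_connections, Alert: > 90% of max_connections.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    ratio = active / max_connections
    if ratio > 0.90:
        return "alert"
    if ratio < 0.80:
        return "normal"
    # Assumption: the runbook does not name the 80-90% band.
    return "warning"


def classify_replica_lag(lag_seconds: float) -> str:
    """Classify replication lag. Normal: < 100 ms, Alert: > 1 s."""
    if lag_seconds > 1.0:
        return "alert"
    if lag_seconds < 0.100:
        return "normal"
    # Assumption: the runbook does not name the 100 ms-1 s band.
    return "warning"


if __name__ == "__main__":
    print(classify_pool_usage(active=95, max_connections=100))
    print(classify_replica_lag(lag_seconds=0.05))
```

If your team treats the in-between band differently, state that explicitly in the runbook text so the thresholds leave no gap.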

Infrastructure Naming Conventions

# Naming Conventions

## Resource Naming Pattern
{env}-{service}-{resource-type}-{identifier}

Examples:
- prod-payment-lambda-processor
- staging-auth-rds-primary
- dev-api-ec2-worker-01

## Environments
- prod: Production (us-east-1, eu-west-1)
- staging: Pre-production testing (us-east-1)
- dev: Development (us-west-2)

## Tagging Standards
All resources MUST have:
- Environment: prod | staging | dev
- Service: service name (e.g., payment, auth, api)
- Owner: team email (e.g., platform@company.com)
- CostCenter: finance code (e.g., CC-1234)

## Terraform Module Locations
- Infrastructure: github.com/company/terraform-infra
- Modules: github.com/company/terraform-modules
- Environments: github.com/company/terraform-envs/{env}
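
A naming pattern and a required-tag list like the ones above are easy to check mechanically. The sketch below is one possible way to do so in Python. The helper names `parse_resource_name` and `missing_tags` are hypothetical, and the regular expression assumes that the service and resource-type segments are lowercase alphanumerics without hyphens, while the identifier may contain hyphens (as in `worker-01`). Adjust these assumptions to your own convention.

```python
import re

ENVIRONMENTS = {"prod", "staging", "dev"}
REQUIRED_TAGS = ("Environment", "Service", "Owner", "CostCenter")

# {env}-{service}-{resource-type}-{identifier}
# Assumption: only the identifier segment may itself contain hyphens.
NAME_PATTERN = re.compile(
    r"^(?P<env>[a-z]+)"
    r"-(?P<service>[a-z0-9]+)"
    r"-(?P<resource_type>[a-z0-9]+)"
    r"-(?P<identifier>[a-z0-9-]+)$"
)


def parse_resource_name(name: str) -> dict:
    """Split a resource name into its four parts, or raise ValueError."""
    match = NAME_PATTERN.match(name)
    if match is None or match.group("env") not in ENVIRONMENTS:
        raise ValueError(f"name does not follow the convention: {name!r}")
    return match.groupdict()


def missing_tags(tags: dict) -> list:
    """Return the required tag keys that are absent or empty."""
    return [key for key in REQUIRED_TAGS if not tags.get(key)]


if __name__ == "__main__":
    print(parse_resource_name("dev-api-ec2-worker-01"))
    print(missing_tags({"Environment": "prod", "Service": "payment"}))
```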

Incident Response Playbook

# Incident Response

## Severity Levels
- SEV1: Customer-facing outage, all hands on deck
- SEV2: Degraded service, on-call + team lead
- SEV3: Minor issue, on-call only

## Escalation Path
1. On-call engineer (PagerDuty)
2. Team lead (after 15 min for SEV1/SEV2)
3. Engineering manager (after 30 min for SEV1)
4. VP Engineering (after 1 hour for SEV1)

## Communication Channels
- Incidents: #incidents (Slack)
- War room: #incident-{id} (created automatically)
- Status page: https://status.company.com

## Post-Incident
- Blameless postmortem within 48h for SEV1/SEV2
- Template: https://wiki.company.com/postmortem-template
- Review meeting: Thursdays 2 PM

## Key Contacts
- On-call: Check PagerDuty schedule
- Security: security@company.com
- Legal (data breach): legal@company.com
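
The escalation path in this playbook is effectively a lookup on severity and elapsed time, so it can be sketched as data plus one function. This is an illustrative Python sketch only. The name `roles_to_engage` is hypothetical, the sketch follows the timings in the Escalation Path list rather than the Severity Levels summary, and it assumes "after 15 min" means "once 15 minutes have elapsed" (inclusive).

```python
SEVERITIES = {"SEV1", "SEV2", "SEV3"}

# (role, minutes after which the role is engaged, severities it applies to)
ESCALATION_STEPS = [
    ("On-call engineer", 0, {"SEV1", "SEV2", "SEV3"}),
    ("Team lead", 15, {"SEV1", "SEV2"}),
    ("Engineering manager", 30, {"SEV1"}),
    ("VP Engineering", 60, {"SEV1"}),
]


def roles_to_engage(severity: str, minutes_elapsed: int) -> list[str]:
    """Return every role that should be involved by now, in escalation order."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    return [
        role
        for role, after_minutes, applies_to in ESCALATION_STEPS
        if severity in applies_to and minutes_elapsed >= after_minutes
    ]


if __name__ == "__main__":
    print(roles_to_engage("SEV1", 45))
    print(roles_to_engage("SEV2", 120))
```

Writing the path out this way also surfaces ambiguities worth resolving in the playbook text, such as whether the team lead joins a SEV2 immediately or only after 15 minutes.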

Get Started

Create Account

Request Demo
