
Ops Copilot

What if your runbooks could answer the phone at 2am?

RAG over Markdown runbooks + an approval-gated AWS remediation pipeline. Answers cite their source. Nothing touches production without a human in the loop.

Python · FAISS · sentence-transformers · Streamlit · Ollama · Bedrock · Lambda · Step Functions · SNS · ECS · SAM

// incident.log — 02:17:43

The problem

[ALARM] ALB-5xx-High — ALARM state

paging on-call engineer...

you: what does this alarm even mean

you: opening Confluence...

you: searching Slack history...

you: was it the deploy or the traffic spike?

10 minutes pass. service still degrading.

The runbook exists. The answer is in there. But finding the right section under pressure, at 2am, with a degrading service, is its own problem — separate from the technical one.

I wanted retrieval that's faster than search. Not an AI that guesses — but one that reads the docs your team already wrote and surfaces the exact section that matches the alarm, with the filename attached so you can verify in under 30 seconds.

The second half answers a different question: how do you let automation act on a recommendation without removing the human from dangerous operations? The answer is a gate that never opens on its own. The Step Functions pipeline proposes, pauses, and waits. The ECS rollback runs only if you say so.
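The gate itself is standard Step Functions machinery. A minimal sketch of the approval state in Amazon States Language (state names and parameters are illustrative, not copied from the repo's state_machine_definition.json): the `.waitForTaskToken` suffix makes the state block until a callback returns the token.

```json
{
  "AwaitApproval": {
    "Type": "Task",
    "Resource": "arn:aws:states:::sns:publish.waitForTaskToken",
    "Parameters": {
      "TopicArn": "${ApprovalTopicArn}",
      "Message.$": "States.Format('Approve rollback of {}? Token: {}', $.service, $$.Task.Token)"
    },
    "HeartbeatSeconds": 3600,
    "Next": "RollbackLambda"
  }
}
```

`$$.Task.Token` injects the execution's task token into the email; the state stays frozen until SendTaskSuccess or SendTaskFailure arrives with that token, and HeartbeatSeconds provides the configurable expiry mentioned under Guardrails.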

What this solves

  • 10-minute context gap on every page
  • Inconsistent triage across engineers
  • No audit trail for manual rollbacks
  • High cognitive load at incident start
  • Guesswork over which ECS task def to revert to

Who it fits

  • Teams with Markdown or Confluence runbooks
  • ECS services that break under load
  • Regulated envs needing approval chains
  • On-call rotations with painful handoffs
  • Orgs avoiding $$$$ commercial AIOps

01 · Local RAG pipeline

Offline runbook indexing and retrieval. No external API calls required when running with Ollama.

Indexing path

data/*.md (runbook files) → ingest.py (chunk + clean) → embed_local.py (sentence-transformers) → build_index.py (FAISS IndexFlatL2) → index/ dir (.faiss + .pkl)

The index binary and chunk pickle are written to disk. Re-run only when runbooks change.
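The H2 split can be sketched in a few lines (a simplification of what ingest.py presumably does; the key detail is that every chunk keeps its source filename for citation):

```python
import re

def chunk_on_h2(markdown_text: str, source: str) -> list[dict]:
    """Split a runbook on '## ' headings, keeping each section intact."""
    # Zero-width split before every H2 heading; the first part is any preamble.
    sections = re.split(r"(?m)^(?=## )", markdown_text)
    return [
        {"source": source, "text": s.strip()}
        for s in sections if s.strip()
    ]

chunks = chunk_on_h2(
    "# ALB 5xx\n\n## Symptoms\nSpikes\n\n## Rollback\nRevert", "alb-5xx.md"
)
# Each chunk carries the filename so answers can cite their source.
```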

Query path

User question (Streamlit input) → query.py (embed + FAISS) → Top-K chunks (ranked by distance) → llm_generate.py (Ollama or Bedrock) → Grounded answer + source citations

Answer always includes the runbook filename so engineers can verify the source in under 30 seconds.

// indexing

  • data/*.md: Runbooks are plain Markdown files
  • H2 chunks: Split on headings to preserve context
  • all-MiniLM-L6-v2: 384-dim embeddings, local
  • IndexFlatL2: Exact L2 search, no approximation
  • index/ dir: .faiss + .pkl written to disk
  • re-index: Single script call, seconds on typical sets
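IndexFlatL2 is exhaustive exact search, not an approximation. What it computes is equivalent to this plain NumPy (illustrative only; the project uses FAISS itself):

```python
import numpy as np

def top_k_l2(index_vectors: np.ndarray, query: np.ndarray, k: int = 5):
    """Exhaustive exact L2 search — the computation behind FAISS IndexFlatL2."""
    # Squared L2 distance from the query to every stored embedding.
    dists = ((index_vectors - query) ** 2).sum(axis=1)
    order = np.argsort(dists)[:k]  # nearest first
    return order, dists[order]

rng = np.random.default_rng(0)
vecs = rng.standard_normal((100, 384)).astype("float32")  # 384-dim like MiniLM
ids, dists = top_k_l2(vecs, vecs[7], k=3)
# A stored vector is its own nearest neighbour at distance 0.
```

Exact search is fine at runbook scale (hundreds of chunks); approximate indexes only pay off far beyond that.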

// query

  • same model: Encodes question at query time
  • top-K=5: Nearest chunks (configurable)
  • system prompt: Chunks injected as context
  • eval harness: Tests ALB 5xx and RDS CPU scenarios
  • Titan Express: Bedrock path model
  • Ollama: llama2, mistral, or any local model
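The grounding step is mostly prompt assembly: retrieved chunks become context, each tagged with its filename so the model can cite it. A sketch (llm_generate.py's actual template may differ):

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Inject top-K chunks into a system prompt, keeping filenames for citation."""
    context = "\n\n".join(
        f"[source: {c['source']}]\n{c['text']}" for c in chunks
    )
    return (
        "Answer using ONLY the runbook excerpts below. "
        "Cite the source filename for every claim.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt(
    "How do I fix ALB 5xx?",
    [{"source": "alb-5xx.md", "text": "## Rollback\nRevert the task def."}],
)
```

The same string goes to either backend; only the transport differs (Ollama's local HTTP API vs. the Bedrock InvokeModel call).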

02 · AWS incident pipeline

CloudWatch alarm → ECS rollback, with a mandatory human approval gate between triage and action.

CloudWatch (alarm JSON input) → Step Functions (orchestration state machine) → Triage Lambda (classify alarm, extract ECS context) → SNS Gate (email approval, blocks state machine) → API Callback (operator clicks link; API Gateway + Lambda) → Rollback Lambda (ECS task def revert, APPROVE only)

Triage Lambda

  • input: Alarm name + state from Step Functions
  • routing: ALB/ECS alarms → rollback path
  • context: Extracts cluster, service, task def ARN
  • output: Context dict to state machine
  • non-ECS: Completes without triggering rollback
  • logs: Structured output every execution
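The routing decision reduces to matching on the alarm name and forwarding ECS context. A sketch (the real Lambda reads richer CloudWatch alarm JSON; field names here follow the demo input):

```python
def triage(event: dict) -> dict:
    """Classify a CloudWatch alarm and extract ECS rollback context."""
    name = event.get("alarmName", "")
    is_ecs_path = any(tag in name for tag in ("ALB", "ECS"))
    context = {
        "alarmName": name,
        "route": "rollback" if is_ecs_path else "no-op",
    }
    if is_ecs_path:
        # Context the rollback Lambda needs; missing keys trip its guard.
        context.update(
            cluster=event.get("cluster"),
            service=event.get("service"),
            previousTaskDefArn=event.get("previousTaskDefArn"),
        )
    return context

out = triage({"alarmName": "ALB-5xx-High", "alarmState": "ALARM",
              "cluster": "prod", "service": "web-api"})
```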

SNS Gate

  • wait: Pauses at waitForTaskToken
  • email: APPROVE + REJECT links via SNS
  • callback: Links hit API Gateway with token
  • resume: SendTaskSuccess or SendTaskFailure
  • blocking: State machine frozen until response
  • timeout: Configurable token expiry
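The callback Lambda's core logic is pure: parse the query string, then issue SendTaskSuccess or SendTaskFailure with the token. A sketch with the decision separated from the boto3 call (an illustrative shape, not the repo's exact handler):

```python
import json

def decide(query_params: dict) -> tuple[str, str]:
    """Map the operator's click to a Step Functions callback action."""
    token = query_params["token"]
    action = query_params.get("action", "REJECT").upper()
    return ("SendTaskSuccess" if action == "APPROVE" else "SendTaskFailure", token)

def handler(event, context):
    """API Gateway entry point: resume the waiting state machine."""
    import boto3  # deferred so decide() is testable without AWS
    sfn = boto3.client("stepfunctions")
    call, token = decide(event.get("queryStringParameters") or {})
    if call == "SendTaskSuccess":
        sfn.send_task_success(taskToken=token,
                              output=json.dumps({"approved": True}))
    else:
        sfn.send_task_failure(taskToken=token, error="Rejected",
                              cause="Operator rejected rollback")
    return {"statusCode": 200, "body": f"{call} sent"}
```

Anything other than an explicit APPROVE maps to SendTaskFailure, so a malformed or tampered link can never trigger the rollback path.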

Rollback Lambda

  • input: Cluster, service, prev task def ARN
  • action: ECS update_service → prior task def
  • guard: Exits cleanly if context is missing
  • trigger: APPROVE signal only
  • trace: Full execution log in Step Functions
  • iam: Scoped to target ECS service only
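The guard is the load-bearing part: if any context is missing, exit cleanly and touch nothing. A sketch with an injectable client so the guard is testable without AWS (field names are assumptions matching the triage output above):

```python
REQUIRED = ("cluster", "service", "previousTaskDefArn")

def rollback(ctx: dict, ecs_client=None) -> dict:
    """Revert an ECS service to its previous task definition, guard first."""
    missing = [k for k in REQUIRED if not ctx.get(k)]
    if missing:
        # Exit cleanly rather than guessing — nothing touches ECS.
        return {"status": "skipped", "missing": missing}
    if ecs_client is None:
        import boto3  # only reached with full context
        ecs_client = boto3.client("ecs")
    ecs_client.update_service(
        cluster=ctx["cluster"],
        service=ctx["service"],
        taskDefinition=ctx["previousTaskDefArn"],
    )
    return {"status": "rolled_back", "taskDefinition": ctx["previousTaskDefArn"]}

# Usage with a stand-in client (no AWS needed):
class _FakeECS:
    def __init__(self):
        self.calls = []
    def update_service(self, **kwargs):
        self.calls.append(kwargs)

fake = _FakeECS()
result = rollback(
    {"cluster": "prod", "service": "web-api",
     "previousTaskDefArn": "arn:aws:ecs:us-east-1:123456789:task-definition/web-api:41"},
    ecs_client=fake,
)
```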

Deploy with AWS SAM

# Build and deploy the full stack
sam build
sam deploy --guided

# Parameters prompted:
# ApproverEmail     — SNS subscription address
# TriageFunctionName
# CallbackApiStageName

Demo execution

# Start execution with simulated CloudWatch alarm
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:us-east-1:123456789:stateMachine:IncidentStateMachine \
  --input '{"alarmName":"ALB-5xx-High","alarmState":"ALARM","cluster":"prod","service":"web-api"}'

# Check execution status
aws stepfunctions describe-execution \
  --execution-arn <execution-arn>

# Approve rollback via API callback (sent in SNS email)
curl "https://<api-gw-id>.execute-api.us-east-1.amazonaws.com/prod/approve?token=<task-token>&action=APPROVE"

Repository structure

ops-copilot-bedrock/
  data/                          # Markdown runbooks (ALB, RDS, ECS)
  src/
    rag/
      ingest.py                  # Load and chunk runbook Markdown files
      embed_local.py             # Embed chunks with sentence-transformers
      build_index.py             # Build and write FAISS index to disk
      query.py                   # Embed question, search index, return top-K chunks
      llm_generate.py            # Build prompt, call Ollama or Bedrock, return answer
    aws/
      triage_lambda/             # Classify alarm, extract ECS context
      callback_lambda/           # Receive API callback, send Step Functions task token
      rollback_lambda/           # Call ECS update_service to revert task definition
  ui/
    app.py                       # Streamlit UI for RAG query interface
  infra/
    template.yaml                # AWS SAM template (Step Functions, Lambda, API GW, SNS)
    state_machine_definition.json
  index/                         # Generated: FAISS binary + chunk pickle (gitignored)
  tests/
    eval_harness.py              # Scenario-based evaluation for ALB 5xx and RDS CPU
  run_all.py                     # One-command local runner (index + Streamlit)
  requirements.txt
  .env.example

Full tech stack

Component      | Technology                              | Purpose
Frontend       | Streamlit                               | Conversational query UI, real-time answer streaming
Embedding      | sentence-transformers all-MiniLM-L6-v2  | 384-dim local embedding, no API key required
Vector store   | FAISS IndexFlatL2                       | Exact nearest-neighbor search over chunked runbooks
Local LLM      | Ollama                                  | Runs any locally pulled model, fully air-gapped capable
Cloud LLM      | Amazon Bedrock (Titan)                  | Cloud inference path for production deployments
Orchestration  | AWS Step Functions                      | State machine with waitForTaskToken approval step
Triage         | AWS Lambda (Python)                     | Alarm classification and ECS context extraction
Approval gate  | Amazon SNS                              | Email with approve/reject links, task token callback
Callback       | API Gateway + Lambda                    | Receives operator click, resumes state machine
Remediation    | AWS Lambda + Amazon ECS                 | Reverts ECS service to previous task definition
IaC            | AWS SAM                                 | Single template for all Lambda, Step Functions, API GW, SNS
Evaluation     | Custom harness (Python)                 | Scenario-based retrieval and generation quality tests

Quickstart

# Clone and install dependencies
git clone https://github.com/Sankartk/ops-copilot-bedrock
cd ops-copilot-bedrock
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env:
# LLM_BACKEND=ollama          # or "bedrock"
# OLLAMA_MODEL=mistral        # any model you have pulled
# AWS_REGION=us-east-1
# BEDROCK_MODEL_ID=amazon.titan-text-express-v1

# Add your runbooks to data/ as Markdown files
# then index once:
python src/rag/build_index.py

# Launch the Streamlit UI
python run_all.py             # builds index + starts Streamlit in one command

Guardrails

  • No rollback without APPROVE click
  • Full state trace in Step Functions
  • Lambda exits cleanly on missing context
  • IAM scoped to target ECS service
  • Token timeout = safe expiry
  • RAG answers cite source files

Known limits

  • FAISS has no incremental index updates
  • Only ECS rollback implemented
  • SNS path needs inbound HTTPS
  • Ollama quality varies — use mistral+
  • No chunk deduplication across files
  • Eval harness is scenario-based only

Next up

  • Slack approval instead of email
  • RDS + Lambda error routing rules
  • Persistent audit history export
  • Chunk overlap + deduplication tuning
  • Reranker before LLM context window
  • RAGAS scoring for retrieval quality