Ops Copilot
What if your runbooks could answer the phone at 2am?
RAG over Markdown runbooks + an approval-gated AWS remediation pipeline. Answers cite their source. Nothing touches production without a human in the loop.
// incident.log — 02:17:43
The problem
[ALARM] ALB-5xx-High — ALARM state
paging on-call engineer...
you: what does this alarm even mean
you: opening Confluence...
you: searching Slack history...
you: was it the deploy or the traffic spike?
10 minutes pass. service still degrading.
The runbook exists. The answer is in there. But finding the right section under pressure, at 2am, with a degrading service, is its own problem — separate from the technical one.
I wanted retrieval that's faster than search. Not an AI that guesses — but one that reads the docs your team already wrote and surfaces the exact section that matches the alarm, with the filename attached so you can verify in under 30 seconds.
The second half answers a different question: how do you let automation act on a recommendation without taking the human out of the loop for dangerous operations? The answer is a gate that never opens on its own. The Step Functions pipeline proposes, pauses, and waits. The ECS rollback only runs if you say so.
What this solves
- 10-minute context gap on every page
- Inconsistent triage across engineers
- No audit trail for manual rollbacks
- High cognitive load at incident start
- Uncertainty over which ECS task definition to revert to
Who it fits
- Teams with Markdown or Confluence runbooks
- ECS services that break under load
- Regulated environments needing approval chains
- On-call rotations with painful handoffs
- Orgs avoiding expensive commercial AIOps suites
Local RAG pipeline
Offline runbook indexing and retrieval. No external API calls required when running with Ollama.
Indexing path
The index binary and chunk pickle are written to disk. Re-run indexing only when the runbooks change.
Query path
Answers always include the runbook filename so engineers can verify the source in under 30 seconds.
// indexing
- data/*.md: Runbooks are plain Markdown files
- H2 chunks: Split on headings to preserve context
- all-MiniLM-L6-v2: 384-dim embeddings, local
- IndexFlatL2: Exact L2 search, no approximation
- index/ dir: .faiss + .pkl written to disk
- re-index: Single script call, seconds on typical sets
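The H2-chunking step can be sketched in a few lines of plain Python. This is a minimal illustration, not the repo's `ingest.py`; the function name and the `source`/`heading`/`text` keys are hypothetical:

```python
def chunk_on_h2(markdown_text, filename):
    """Split a runbook into chunks at H2 ('## ') headings, keeping each
    heading with its body so every chunk carries its own context.
    Returns a list of {"source": ..., "heading": ..., "text": ...} dicts.
    """
    chunks = []
    current_heading, current_lines = "preamble", []
    for line in markdown_text.splitlines():
        if line.startswith("## "):
            # A new H2 starts: flush the chunk accumulated so far.
            if current_lines:
                chunks.append({
                    "source": filename,
                    "heading": current_heading,
                    "text": "\n".join(current_lines).strip(),
                })
            current_heading, current_lines = line[3:].strip(), [line]
        else:
            current_lines.append(line)
    if current_lines:  # flush the final chunk
        chunks.append({
            "source": filename,
            "heading": current_heading,
            "text": "\n".join(current_lines).strip(),
        })
    return chunks
```

Keeping the filename on every chunk is what lets the final answer cite its source.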
// query
- same model: Encodes question at query time
- top-K=5: Nearest chunks (configurable)
- system prompt: Chunks injected as context
- eval harness: Tests ALB 5xx and RDS CPU scenarios
- Titan Express: Model for the Bedrock path
- Ollama: llama2, mistral, or any local model
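"Exact L2 search, no approximation" is easy to state precisely: squared L2 distance against every stored vector, smallest k win. A dependency-free sketch of what IndexFlatL2 computes (FAISS does this in optimized C; `top_k_l2` and its signature are illustrative):

```python
def top_k_l2(query_vec, index_vecs, k=5):
    """Exact nearest-neighbor search: squared L2 distance from the
    query to every stored vector, return the indices of the k closest.
    Brute force is fine at runbook scale (hundreds of chunks)."""
    def sq_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    scored = sorted((sq_l2(query_vec, v), i) for i, v in enumerate(index_vecs))
    return [i for _, i in scored[:k]]
```

The returned indices map back to chunks, which are then injected into the system prompt as context.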
AWS incident pipeline
CloudWatch alarm → ECS rollback, with a mandatory human approval gate between triage and action.
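The pause is Step Functions' `.waitForTaskToken` service integration. A state of roughly this shape publishes the task token to SNS and freezes until a callback arrives; this is an illustrative fragment, not the repo's actual `infra/state_machine_definition.json`, and the state names and substitution variables are assumptions:

```json
{
  "AwaitApproval": {
    "Type": "Task",
    "Resource": "arn:aws:states:::sns:publish.waitForTaskToken",
    "Parameters": {
      "TopicArn": "${ApprovalTopicArn}",
      "Message": {
        "alarm.$": "$.alarmName",
        "taskToken.$": "$$.Task.Token"
      }
    },
    "TimeoutSeconds": 3600,
    "Next": "Rollback"
  }
}
```

Until `SendTaskSuccess` or `SendTaskFailure` is called with that token (or `TimeoutSeconds` expires), the execution goes nowhere.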
Triage Lambda
- input: Alarm name + state from Step Functions
- routing: ALB/ECS alarms → rollback path
- context: Extracts cluster, service, task def ARN
- output: Context dict to state machine
- non-ECS: Completes without triggering rollback
- logs: Structured output every execution
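A minimal sketch of the triage routing, assuming the field names from the demo input shown later in this page; the real triage Lambda would look up the previous task definition via the ECS API rather than reading it off the event:

```python
def triage_handler(event, context=None):
    """Classify the alarm and, for ALB/ECS alarms, build the rollback
    context dict passed on to the state machine. Field names here
    mirror the demo input and are illustrative, not the exact schema."""
    alarm = event.get("alarmName", "")
    if event.get("alarmState") != "ALARM":
        return {"route": "noop", "reason": "alarm not in ALARM state"}
    if not (alarm.startswith("ALB-") or alarm.startswith("ECS-")):
        # Non-ECS alarms complete without triggering a rollback.
        return {"route": "noop", "reason": "no rollback path for " + alarm}
    return {
        "route": "rollback",
        "cluster": event.get("cluster"),
        "service": event.get("service"),
        "previousTaskDefArn": event.get("previousTaskDefArn"),
    }
```

The `noop` branch is what lets unrelated alarms flow through the state machine without ever reaching the approval gate.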
SNS Gate
- wait: Pauses at waitForTaskToken
- email: APPROVE + REJECT links via SNS
- callback: Links hit API Gateway with token
- resume: SendTaskSuccess or SendTaskFailure
- blocking: State machine frozen until response
- timeout: Configurable token expiry
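Resuming comes down to one of two Step Functions API calls. A sketch of the callback handler with the boto3 client injected so the branch logic can be exercised offline (in production it would be `boto3.client("stepfunctions")`); the event shape mirrors an API Gateway proxy request, and the handler name is hypothetical:

```python
import json

def callback_handler(event, sfn_client):
    """Turn an operator's APPROVE/REJECT click into SendTaskSuccess
    or SendTaskFailure on the paused execution."""
    params = event.get("queryStringParameters") or {}
    token = params.get("token")
    action = params.get("action", "").upper()
    if not token:
        return {"statusCode": 400, "body": "missing task token"}
    if action == "APPROVE":
        # Unfreezes the state machine and lets the rollback step run.
        sfn_client.send_task_success(
            taskToken=token,
            output=json.dumps({"approved": True}),
        )
        return {"statusCode": 200, "body": "rollback approved"}
    # Anything else fails the task: the execution ends without acting.
    sfn_client.send_task_failure(
        taskToken=token, error="Rejected", cause="operator rejected rollback"
    )
    return {"statusCode": 200, "body": "rollback rejected"}
```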
Rollback Lambda
- input: Cluster, service, prev task def ARN
- action: ECS update_service → prior task def
- guard: Exits cleanly if context is missing
- trigger: APPROVE signal only
- trace: Full execution log in Step Functions
- IAM: Scoped to target ECS service only
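The rollback itself is one ECS API call plus the guard. A sketch, again with the client injected for offline testing (`boto3.client("ecs")` in production); field names are illustrative:

```python
def rollback_handler(event, ecs_client):
    """Revert the ECS service to the previous task definition,
    but only when the full rollback context is present."""
    cluster = event.get("cluster")
    service = event.get("service")
    prev_arn = event.get("previousTaskDefArn")
    if not all([cluster, service, prev_arn]):
        # Guard: exit cleanly rather than act on partial context.
        return {"status": "skipped", "reason": "incomplete rollback context"}
    # ECS rolls the service forward onto the prior task definition.
    ecs_client.update_service(
        cluster=cluster, service=service, taskDefinition=prev_arn
    )
    return {"status": "rolled_back", "taskDefinition": prev_arn}
```

Because the Lambda's IAM role is scoped to the one target service, even a bad input can't widen the blast radius.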
Deploy with AWS SAM
# Build and deploy the full stack
sam build
sam deploy --guided

# Parameters prompted:
#   ApproverEmail: SNS subscription address
#   TriageFunctionName
#   CallbackApiStageName
Demo execution
# Start execution with simulated CloudWatch alarm
aws stepfunctions start-execution \
--state-machine-arn arn:aws:states:us-east-1:123456789:stateMachine:IncidentStateMachine \
--input '{"alarmName":"ALB-5xx-High","alarmState":"ALARM","cluster":"prod","service":"web-api"}'
# Check execution status
aws stepfunctions describe-execution \
--execution-arn <execution-arn>
# Approve rollback via API callback (sent in SNS email)
curl "https://<api-gw-id>.execute-api.us-east-1.amazonaws.com/prod/approve?token=<task-token>&action=APPROVE"

Repository structure
ops-copilot-bedrock/
data/ # Markdown runbooks (ALB, RDS, ECS)
src/
rag/
ingest.py # Load and chunk runbook Markdown files
embed_local.py # Embed chunks with sentence-transformers
build_index.py # Build and write FAISS index to disk
query.py # Embed question, search index, return top-K chunks
llm_generate.py # Build prompt, call Ollama or Bedrock, return answer
aws/
triage_lambda/ # Classify alarm, extract ECS context
callback_lambda/ # Receive API callback, send Step Functions task token
rollback_lambda/ # Call ECS update_service to revert task definition
ui/
app.py # Streamlit UI for RAG query interface
infra/
template.yaml # AWS SAM template (Step Functions, Lambda, API GW, SNS)
state_machine_definition.json
index/ # Generated: FAISS binary + chunk pickle (gitignored)
tests/
eval_harness.py # Scenario-based evaluation for ALB 5xx and RDS CPU
run_all.py # One-command local runner (index + Streamlit)
requirements.txt
.env.example

Full tech stack
| Component | Technology | Purpose |
|---|---|---|
| Frontend | Streamlit | Conversational query UI, real-time answer streaming |
| Embedding | sentence-transformers all-MiniLM-L6-v2 | 384-dim local embedding, no API key required |
| Vector store | FAISS IndexFlatL2 | Exact nearest-neighbor search over chunked runbooks |
| Local LLM | Ollama | Runs any locally pulled model, fully air-gapped capable |
| Cloud LLM | Amazon Bedrock (Titan) | Cloud inference path for production deployments |
| Orchestration | AWS Step Functions | State machine with waitForTaskToken approval step |
| Triage | AWS Lambda (Python) | Alarm classification and ECS context extraction |
| Approval gate | Amazon SNS | Email with approve/reject links, task token callback |
| Callback | API Gateway + Lambda | Receives operator click, resumes state machine |
| Remediation | AWS Lambda + Amazon ECS | Reverts ECS service to previous task definition |
| IaC | AWS SAM | Single template for all Lambda, Step Functions, API GW, SNS |
| Evaluation | Custom harness (Python) | Scenario-based retrieval and generation quality tests |
Quickstart
# Clone and install dependencies
git clone https://github.com/Sankartk/ops-copilot-bedrock
cd ops-copilot-bedrock
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env:
#   LLM_BACKEND=ollama      # or "bedrock"
#   OLLAMA_MODEL=mistral    # any model you have pulled
#   AWS_REGION=us-east-1
#   BEDROCK_MODEL_ID=amazon.titan-text-express-v1

# Add your runbooks to data/ as Markdown files, then index once:
python src/rag/build_index.py

# Launch the Streamlit UI
python run_all.py    # builds index + starts Streamlit in one command
Guardrails
- No rollback without an APPROVE click
- Full state trace in Step Functions
- Lambda exits cleanly on missing context
- IAM scoped to target ECS service
- Token timeout gives safe expiry
- RAG answers cite source files
Known limits
- FAISS has no incremental index updates
- Only ECS rollback is implemented
- SNS approval path needs inbound HTTPS
- Ollama answer quality varies by model; use mistral or better
- No chunk deduplication across files
- Eval harness is scenario-based only
Next up
- Slack approval instead of email
- RDS + Lambda error routing rules
- Persistent audit history export
- Chunk overlap + deduplication tuning
- Reranker before the LLM context window
- RAGAS scoring for retrieval quality