Ops Copilot
What if your runbooks could answer the phone at 2am?
RAG over Markdown runbooks + an approval-gated AWS remediation pipeline. Answers cite their source. Nothing touches production without a human in the loop.
// incident.log — 02:17:43
The problem
[ALARM] ALB-5xx-High — ALARM state
paging on-call engineer...
you: what does this alarm even mean
you: opening Confluence...
you: searching Slack history...
you: was it the deploy or the traffic spike?
10 minutes pass. service still degrading.
The runbook exists. The answer is in there. But finding the right section under pressure, at 2am, with a degrading service, is its own problem — separate from the technical one.
I wanted retrieval that's faster than search. Not an AI that guesses — but one that reads the docs your team already wrote and surfaces the exact section that matches the alarm, with the filename attached so you can verify in under 30 seconds.
The second half answers a different question: how do you let automation act on a recommendation without taking the human out of the loop for dangerous operations? The answer is a gate that never opens on its own. The Step Functions pipeline proposes, pauses, and waits. The ECS rollback only runs if you say so.
What this solves
- 10-minute context gap on every page
- Inconsistent triage across engineers
- No audit trail for manual rollbacks
- High cognitive load at incident start
- Uncertainty over which ECS task definition to revert to
Who it fits
- Teams with Markdown or Confluence runbooks
- ECS services that break under load
- Regulated environments needing approval chains
- On-call rotations with painful handoffs
- Orgs avoiding expensive commercial AIOps suites
Local RAG pipeline
Offline runbook indexing and retrieval. No external API calls required when running with Ollama.
Indexing path
The index binary and chunk pickle are written to disk. Re-run indexing only when the runbooks change.
Query path
Answers always include the runbook filename so engineers can verify the source in under 30 seconds.
// indexing
- data/*.md: Runbooks are plain Markdown files
- H2 chunks: Split on headings to preserve context
- all-MiniLM-L6-v2: 384-dim embeddings, local
- IndexFlatL2: Exact L2 search, no approximation
- index/ dir: .faiss + .pkl written to disk
- re-index: Single script call, seconds on typical sets
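The H2-chunking step can be sketched in a few lines of plain Python. This is a minimal illustration, not the repo's `ingest.py`; the function name and the `source`/`heading`/`text` keys are hypothetical:

```python
def chunk_on_h2(markdown_text, filename):
    """Split a runbook into chunks at H2 ('## ') headings, keeping each
    heading with its body so every chunk carries its own context.
    Returns a list of {"source": ..., "heading": ..., "text": ...} dicts.
    """
    chunks = []
    current_heading, current_lines = "preamble", []
    for line in markdown_text.splitlines():
        if line.startswith("## "):
            # A new H2 starts: flush the chunk accumulated so far.
            if current_lines:
                chunks.append({
                    "source": filename,
                    "heading": current_heading,
                    "text": "\n".join(current_lines).strip(),
                })
            current_heading, current_lines = line[3:].strip(), [line]
        else:
            current_lines.append(line)
    if current_lines:  # flush the final chunk
        chunks.append({
            "source": filename,
            "heading": current_heading,
            "text": "\n".join(current_lines).strip(),
        })
    return chunks
```

Keeping the filename on every chunk is what lets the final answer cite its source.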
// query
- same model: Encodes question at query time
- top-K=5: Nearest chunks (configurable)
- system prompt: Chunks injected as context
- eval harness: Tests ALB 5xx and RDS CPU scenarios
- Titan Express: Model for the Bedrock path
- Ollama: llama2, mistral, or any local model
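"Exact L2 search, no approximation" is easy to state precisely: squared L2 distance against every stored vector, smallest k win. A dependency-free sketch of what IndexFlatL2 computes (FAISS does this in optimized C; `top_k_l2` and its signature are illustrative):

```python
def top_k_l2(query_vec, index_vecs, k=5):
    """Exact nearest-neighbor search: squared L2 distance from the
    query to every stored vector, return the indices of the k closest.
    Brute force is fine at runbook scale (hundreds of chunks)."""
    def sq_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    scored = sorted((sq_l2(query_vec, v), i) for i, v in enumerate(index_vecs))
    return [i for _, i in scored[:k]]
```

The returned indices map back to chunks, which are then injected into the system prompt as context.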
AWS incident pipeline
CloudWatch alarm → ECS rollback, with a mandatory human approval gate between triage and action.
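The pause is Step Functions' `.waitForTaskToken` service integration. A state of roughly this shape publishes the task token to SNS and freezes until a callback arrives; this is an illustrative fragment, not the repo's actual `infra/state_machine_definition.json`, and the state names and substitution variables are assumptions:

```json
{
  "AwaitApproval": {
    "Type": "Task",
    "Resource": "arn:aws:states:::sns:publish.waitForTaskToken",
    "Parameters": {
      "TopicArn": "${ApprovalTopicArn}",
      "Message": {
        "alarm.$": "$.alarmName",
        "taskToken.$": "$$.Task.Token"
      }
    },
    "TimeoutSeconds": 3600,
    "Next": "Rollback"
  }
}
```

Until `SendTaskSuccess` or `SendTaskFailure` is called with that token (or `TimeoutSeconds` expires), the execution goes nowhere.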
Triage Lambda
- input: Alarm name + state from Step Functions
- routing: ALB/ECS alarms → rollback path
- context: Extracts cluster, service, task def ARN
- output: Context dict to state machine
- non-ECS: Completes without triggering rollback
- logs: Structured output every execution
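A minimal sketch of the triage routing, assuming the field names from the demo input shown later in this page; the real triage Lambda would look up the previous task definition via the ECS API rather than reading it off the event:

```python
def triage_handler(event, context=None):
    """Classify the alarm and, for ALB/ECS alarms, build the rollback
    context dict passed on to the state machine. Field names here
    mirror the demo input and are illustrative, not the exact schema."""
    alarm = event.get("alarmName", "")
    if event.get("alarmState") != "ALARM":
        return {"route": "noop", "reason": "alarm not in ALARM state"}
    if not (alarm.startswith("ALB-") or alarm.startswith("ECS-")):
        # Non-ECS alarms complete without triggering a rollback.
        return {"route": "noop", "reason": "no rollback path for " + alarm}
    return {
        "route": "rollback",
        "cluster": event.get("cluster"),
        "service": event.get("service"),
        "previousTaskDefArn": event.get("previousTaskDefArn"),
    }
```

The `noop` branch is what lets unrelated alarms flow through the state machine without ever reaching the approval gate.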
SNS Gate
- wait: Pauses at waitForTaskToken
- email: APPROVE + REJECT links via SNS
- callback: Links hit API Gateway with token
- resume: SendTaskSuccess or SendTaskFailure
- blocking: State machine frozen until response
- timeout: Configurable token expiry
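Resuming comes down to one of two Step Functions API calls. A sketch of the callback handler with the boto3 client injected so the branch logic can be exercised offline (in production it would be `boto3.client("stepfunctions")`); the event shape mirrors an API Gateway proxy request, and the handler name is hypothetical:

```python
import json

def callback_handler(event, sfn_client):
    """Turn an operator's APPROVE/REJECT click into SendTaskSuccess
    or SendTaskFailure on the paused execution."""
    params = event.get("queryStringParameters") or {}
    token = params.get("token")
    action = params.get("action", "").upper()
    if not token:
        return {"statusCode": 400, "body": "missing task token"}
    if action == "APPROVE":
        # Unfreezes the state machine and lets the rollback step run.
        sfn_client.send_task_success(
            taskToken=token,
            output=json.dumps({"approved": True}),
        )
        return {"statusCode": 200, "body": "rollback approved"}
    # Anything else fails the task: the execution ends without acting.
    sfn_client.send_task_failure(
        taskToken=token, error="Rejected", cause="operator rejected rollback"
    )
    return {"statusCode": 200, "body": "rollback rejected"}
```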
Rollback Lambda
- input: Cluster, service, prev task def ARN
- action: ECS update_service → prior task def
- guard: Exits cleanly if context is missing
- trigger: APPROVE signal only
- trace: Full execution log in Step Functions
- IAM: Scoped to target ECS service only
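The rollback itself is one ECS API call plus the guard. A sketch, again with the client injected for offline testing (`boto3.client("ecs")` in production); field names are illustrative:

```python
def rollback_handler(event, ecs_client):
    """Revert the ECS service to the previous task definition,
    but only when the full rollback context is present."""
    cluster = event.get("cluster")
    service = event.get("service")
    prev_arn = event.get("previousTaskDefArn")
    if not all([cluster, service, prev_arn]):
        # Guard: exit cleanly rather than act on partial context.
        return {"status": "skipped", "reason": "incomplete rollback context"}
    # ECS rolls the service forward onto the prior task definition.
    ecs_client.update_service(
        cluster=cluster, service=service, taskDefinition=prev_arn
    )
    return {"status": "rolled_back", "taskDefinition": prev_arn}
```

Because the Lambda's IAM role is scoped to the one target service, even a bad input can't widen the blast radius.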
Deploy with AWS SAM
# Build and deploy the full stack
sam build
sam deploy --guided

# Parameters prompted:
#   ApproverEmail: SNS subscription address
#   TriageFunctionName
#   CallbackApiStageName
Demo execution
# Start execution with simulated CloudWatch alarm
aws stepfunctions start-execution \
--state-machine-arn arn:aws:states:us-east-1:123456789:stateMachine:IncidentStateMachine \
--input '{"alarmName":"ALB-5xx-High","alarmState":"ALARM","cluster":"prod","service":"web-api"}'
# Check execution status
aws stepfunctions describe-execution \
--execution-arn <execution-arn>
# Approve rollback via API callback (sent in SNS email)
curl "https://<api-gw-id>.execute-api.us-east-1.amazonaws.com/prod/approve?token=<task-token>&action=APPROVE"

Repository structure
ops-copilot-bedrock/
data/ # Markdown runbooks (ALB, RDS, ECS)
src/
rag/
ingest.py # Load and chunk runbook Markdown files
embed_local.py # Embed chunks with sentence-transformers
build_index.py # Build and write FAISS index to disk
query.py # Embed question, search index, return top-K chunks
llm_generate.py # Build prompt, call Ollama or Bedrock, return answer
aws/
triage_lambda/ # Classify alarm, extract ECS context
callback_lambda/ # Receive API callback, send Step Functions task token
rollback_lambda/ # Call ECS update_service to revert task definition
ui/
app.py # Streamlit UI for RAG query interface
infra/
template.yaml # AWS SAM template (Step Functions, Lambda, API GW, SNS)
state_machine_definition.json
index/ # Generated: FAISS binary + chunk pickle (gitignored)
tests/
eval_harness.py # Scenario-based evaluation for ALB 5xx and RDS CPU
run_all.py # One-command local runner (index + Streamlit)
requirements.txt
.env.example

Full tech stack
| Component | Technology | Purpose |
|---|---|---|
| Frontend | Streamlit | Conversational query UI, real-time answer streaming |
| Embedding | sentence-transformers all-MiniLM-L6-v2 | 384-dim local embedding, no API key required |
| Vector store | FAISS IndexFlatL2 | Exact nearest-neighbor search over chunked runbooks |
| Local LLM | Ollama | Runs any locally pulled model, fully air-gapped capable |
| Cloud LLM | Amazon Bedrock (Titan) | Cloud inference path for production deployments |
| Orchestration | AWS Step Functions | State machine with waitForTaskToken approval step |
| Triage | AWS Lambda (Python) | Alarm classification and ECS context extraction |
| Approval gate | Amazon SNS | Email with approve/reject links, task token callback |
| Callback | API Gateway + Lambda | Receives operator click, resumes state machine |
| Remediation | AWS Lambda + Amazon ECS | Reverts ECS service to previous task definition |
| IaC | AWS SAM | Single template for all Lambda, Step Functions, API GW, SNS |
| Evaluation | Custom harness (Python) | Scenario-based retrieval and generation quality tests |
Quickstart
# Clone and install dependencies
git clone https://github.com/Sankartk/ops-copilot-bedrock
cd ops-copilot-bedrock
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env:
#   LLM_BACKEND=ollama      # or "bedrock"
#   OLLAMA_MODEL=mistral    # any model you have pulled
#   AWS_REGION=us-east-1
#   BEDROCK_MODEL_ID=amazon.titan-text-express-v1

# Add your runbooks to data/ as Markdown files, then index once:
python src/rag/build_index.py

# Launch the Streamlit UI
python run_all.py    # builds index + starts Streamlit in one command
Guardrails
- No rollback without an APPROVE click
- Full state trace in Step Functions
- Lambda exits cleanly on missing context
- IAM scoped to target ECS service
- Token timeout gives safe expiry
- RAG answers cite source files
Known limits
- FAISS has no incremental index updates
- Only ECS rollback is implemented
- SNS approval path needs inbound HTTPS
- Ollama answer quality varies by model; use mistral or better
- No chunk deduplication across files
- Eval harness is scenario-based only
Next up
- Slack approval instead of email
- RDS + Lambda error routing rules
- Persistent audit history export
- Chunk overlap + deduplication tuning
- Reranker before the LLM context window
- RAGAS scoring for retrieval quality