
Software & Data Engineer
Sankar Kalyanakumar
I build systems where bad data can't hide.
6 years designing pipelines that validate before they propagate, fail loud instead of silent, and let humans approve before anything irreversible runs.
// how I think about engineering
Validate at the boundary
Bad data shouldn't travel far. Catch it at ingestion, log exactly what failed and why, and stop the pipeline before the corruption spreads downstream.
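A minimal sketch of this principle (the record shape and field names are hypothetical): reject a bad record at ingestion and report exactly which field failed and why.

```python
def validate_record(record: dict) -> dict:
    """Validate one ingested record; raise with the exact failure so
    nothing downstream ever sees bad data."""
    errors = []
    if not record.get("id"):
        errors.append("id: missing or empty")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        errors.append(f"amount: expected non-negative number, got {amount!r}")
    if errors:
        # fail loud at the boundary -- log what failed and why, then stop
        raise ValueError(f"record rejected: {'; '.join(errors)}")
    return record
```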
Pause before irreversible
Automation is most dangerous right before it does something permanent. Build the gate first — approval, confirmation, timeout — then build the action.
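The gate-first idea can be sketched like this (the `approve` callback and the delete action are illustrative stand-ins): the approval sees the full plan, and the destructive step cannot run without it.

```python
def delete_with_gate(paths, approve):
    """Irreversible delete guarded by an approval callback.
    `approve` sees the full plan before anything runs; declining aborts cleanly."""
    plan = sorted(paths)
    if not approve(plan):      # the gate is built first...
        return "aborted"
    deleted = []
    for p in plan:             # ...the action runs only after it
        deleted.append(p)      # stand-in for the real destructive call
    return deleted
```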
Fail loud, not silent
A pipeline that swallows errors and marks rows 'processed' is worse than one that crashes. If something is wrong, scream and stop. Silent failures cost weeks.
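A toy illustration, assuming a simple row-transform pipeline: the first bad row raises with its index and cause instead of being swallowed and marked processed.

```python
def process_rows(rows, transform):
    """Stop on the first bad row instead of silently skipping it."""
    out = []
    for i, row in enumerate(rows):
        try:
            out.append(transform(row))
        except Exception as e:
            # scream and stop: name the row and the cause, never swallow
            raise RuntimeError(f"row {i} failed: {e}") from e
    return out
```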
$ ls ~/projects
Open source work
CashCast
“Every branch pads its vault order 15–20% as a buffer. CashCast forecasts that demand with ML — the buffer becomes a number, not a guess.”
- →Ridge regression per branch, 730 days — avg MAPE 9.1%
- →Isolation Forest flags demand anomalies before vault gaps occur
- →14-day forecast with confidence bands + $1K-rounded order rec
- →AI narrative: peak day, seasonal delta, idle cash risk per branch
- →Plotly.js ops dashboard: vault status, charts, CSV export
- →Swagger at /docs — 5 tagged endpoints, fully documented
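As a rough illustration of the per-branch idea (not CashCast's actual model, which is Ridge regression over 730 days of features), here is a closed-form one-feature ridge fit and a 14-day extrapolation:

```python
def fit_ridge_1d(xs, ys, lam=1.0):
    """Closed-form ridge fit for one branch: y ~ w*x + b, slope shrunk by lam."""
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xm) ** 2 for x in xs)
    sxy = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
    w = sxy / (sxx + lam)          # regularisation shrinks the slope toward 0
    return w, ym - w * xm

def forecast(xs, ys, horizon=14, lam=1.0):
    """Per-branch demand forecast: extrapolate the fitted line past the last day."""
    w, b = fit_ridge_1d(xs, ys, lam)
    last = xs[-1]
    return [w * (last + h) + b for h in range(1, horizon + 1)]
```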
Avg MAPE: 9.1%
Tests: 14/14
Branches: 6
Horizon: 14 days
Branches: 6
Avg MAPE: 9.1%
Total Rec: $867K
Horizon: 14d
Anomalies: 2
High Risk: 1
// 14-day demand forecast — BRK-01 Downtown
// order recs
FleetPulse
“A truck breaks down. The service was six weeks overdue. The spreadsheet was the last to know.”
- →Hourly scheduler catches overdue maintenance before anyone checks
- →Idempotent alerts — same event fires once, not on every poll
- →Live ops dashboard: resolve alerts, KPIs update every 60s
- →25+ REST endpoints, Flyway migrations, role-based access
- →16 integration tests — zero failures across full lifecycle
- →PostgreSQL + Spring Data JPA, containerised with Docker Compose
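The idempotent-alert behaviour can be sketched with an in-memory dedupe set (FleetPulse itself persists this state in Postgres, and the class and method names here are illustrative): the same event fires once, and resolving it re-arms the alert.

```python
class AlertSink:
    """Idempotent alerting: one alert per (vehicle, rule) event,
    no matter how many polls see the same condition."""
    def __init__(self):
        self._seen = set()
        self.sent = []

    def fire(self, vehicle_id: str, rule: str) -> bool:
        key = (vehicle_id, rule)
        if key in self._seen:      # already alerted; the hourly poll is a no-op
            return False
        self._seen.add(key)
        self.sent.append(key)
        return True

    def resolve(self, vehicle_id: str, rule: str):
        """Resolving re-arms the alert so a future recurrence fires again."""
        self._seen.discard((vehicle_id, rule))
```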
Endpoints: 25+
Tests: 16/16
Stack: Java 21
DB: Postgres
Vehicles: 8
Overdue: 1
Alerts: 3
Tests: 16/16
// unresolved alerts
// fleet status
Ops Copilot
“2am. Service is down. The fix is buried somewhere in a 40-page runbook.”
- →FAISS-indexed runbooks — answers cite exact file and line number
- →LLM stays grounded: only quotes what it found, never invents steps
- →Step Functions pauses at SNS gate — nothing runs until approved
- →Human-in-the-loop: approve or reject before any remediation fires
- →Swap one env var to switch between Ollama (local) and AWS Bedrock
- →Modular retriever: swap FAISS for any vector store without rewriting
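A stand-in sketch of citation-carrying retrieval, using keyword overlap in place of FAISS embeddings (runbook paths and the scoring are hypothetical): each hit keeps its source file and line number, which is what lets the answer cite exactly where it came from.

```python
def retrieve(query: str, runbooks: dict, k: int = 1):
    """Score runbook lines by keyword overlap with the query.
    Each hit carries (file, line) so the LLM's answer can cite its source."""
    q = set(query.lower().split())
    hits = []
    for path, lines in runbooks.items():
        for lineno, text in enumerate(lines, start=1):
            score = len(q & set(text.lower().split()))
            if score:
                hits.append((score, path, lineno, text))
    hits.sort(key=lambda h: -h[0])
    return [{"file": p, "line": n, "text": t} for _, p, n, t in hits[:k]]
```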
Vector DB: FAISS
LLM: Bedrock
Gate: SNS
Workflow: Step Functions
// incident query
// answer — grounded in runbooks
Run df -h /var/lib/postgresql to confirm. If >90%, execute cleanup as per section 3.2.
Ledger Reconciler
“Every break has a reason. They're just buried in 80 rows of noise before anyone can dig.”
- →94.7% match rate — 720 transactions over a 30-day run
- →4 ordered passes: exact → amount+date → reference → fuzzy
- →Every break classified with root cause before a human sees it
- →Streamlit dashboard: trend chart, aging heatmap, break drill-down
- →SQLite audit log — every match decision is traceable and replayable
- →Handles timing diffs, format mismatches, and near-duplicate entries
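The ordered-pass idea can be sketched like this (a toy version with the fuzzy pass omitted and illustrative record fields): each pass only sees what the cheaper passes left unmatched, and whatever survives all passes is a break.

```python
def reconcile(ledger, bank):
    """Ordered matching passes, strictest first: exact -> amount+date -> reference.
    Returns the matches (tagged with the pass that made them) and the open breaks."""
    passes = [
        ("exact",       lambda a, b: a == b),
        ("amount_date", lambda a, b: (a["amount"], a["date"]) == (b["amount"], b["date"])),
        ("reference",   lambda a, b: a["ref"] == b["ref"]),
    ]
    matches, open_l, open_b = [], list(ledger), list(bank)
    for name, rule in passes:
        for a in open_l[:]:
            for b in open_b[:]:
                if rule(a, b):
                    matches.append((name, a["ref"], b["ref"]))
                    open_l.remove(a)
                    open_b.remove(b)
                    break
    return matches, open_l  # open_l = the breaks left for classification
```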
Match rate: 94.7%
Txns: 720
Period: 30 days
Passes: 4
Match rate: 94.7%
Matched: 681
Breaks: 39
Period: 30d
// open breaks
// by category