Senior Software / AI Engineer based in Munich, Germany, with around 14 years across DevOps, SRE and cloud, and more recently building production AI and RAG systems. I currently work at Audi on AI image-generation pipelines.
What I enjoy: taking messy operational problems and turning them into reliable, automated systems, and lately into agentic AI tools that actually ship and keep a human in the loop.
- Currently: AI and automation engineering at Audi; finished my M.Sc. in Business Analytics and Data Science (June 2026).
- Working on: agentic RAG, LLM observability, and AI-driven operations automation.
- Tools I reach for: Python, FastAPI, LangChain / LangGraph, AWS, Kubernetes, Docker, Terraform, Prometheus and Grafana.
- Ask me about: RAG, agentic workflows, MLOps, or Linux/Unix automation.
| Project | What it does |
|---|---|
| enterprise-agentic-rag-azure | Production agentic RAG with LangGraph, guardrails, evals and observability |
| ai-ops-incident-agent | Triages incidents, suggests root cause, drafts change tickets for human review |
| rag-support-assistant | RAG support assistant with citations, guardrails, automated evaluation and KPIs |
| graphrag-knowledge-assistant | Multi-hop RAG over a knowledge graph |
| llm-observability-platform | Tracks LLM cost, latency, tokens and answer faithfulness |
| cloud-native-platform-aws | Internal developer platform: Terraform EKS, ArgoCD GitOps, Prometheus, SLOs |
Each repo has a short architecture diagram, a runnable quickstart, and sample output, so you can see how it works in a minute.
A newer set across AI, platform and data. Each one runs with a single `make demo`, ships a full test suite, and shows real output in its README.
| Project | What it does |
|---|---|
| realtime-stream-inference | Anomaly detection over event streams with queue backpressure and p99 latency tracking |
| ai-incident-copilot | Collapses Alertmanager alerts into incidents, scores severity, and suggests a runbook |
| slo-error-budget | Error budget, burn rate, and multi-window paging from the SRE workbook |
| kubernetes-resource-rightsizer | Right-sizes CPU and memory from real usage, flags throttling and OOM risk |
| agent-trajectory-eval | Scores an agent run on tool choice, forbidden tools, redundant steps, and budget |
| llm-semantic-cache | Caches LLM responses by prompt similarity to cut repeat cost and latency |
| llm-finetune-toolkit | Validates, splits, formats, and evaluates supervised fine-tuning datasets |
| ab-test-analyzer | A/B test significance, confidence intervals, and sample-size planning |
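To give a feel for what slo-error-budget computes, here is a minimal sketch of error budget, burn rate, and a two-window paging check. It is an illustration only, not the repo's actual code: the function names and the example numbers are hypothetical, and the default threshold of 14.4 is the SRE workbook's value for its 1-hour / 5-minute window pair.

```python
def error_budget(slo: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo


def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return error_ratio / error_budget(slo)


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Page only if both the long and the short window burn above the threshold.

    The short window lets the alert reset quickly once the problem is fixed.
    """
    return (burn_rate(long_window_ratio, slo) >= threshold
            and burn_rate(short_window_ratio, slo) >= threshold)


if __name__ == "__main__":
    slo = 0.999  # hypothetical 99.9% availability target
    # 2% errors in both windows: burning roughly 20x budget, so this pages.
    print(should_page(0.02, 0.02, slo))
    # The long window is still hot but the short window has recovered: no page.
    print(should_page(0.02, 0.0005, slo))
```

Requiring both windows to exceed the threshold is the design choice that keeps a short blip from paging and stops the page soon after recovery.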
- 14 years across system engineering, Linux administration, DevOps, SRE and cloud, now focused on AI engineering.
- Certifications: AWS Solutions Architect Associate, AWS ML Specialty, Databricks ML Professional, RHCE, RHCSA.
- LinkedIn: https://linkedin.com/in/krishna-gove-327463222
- Email: krishnagove88@gmail.com