name: Ajay Devineni
location: Atlanta, GA (Cumming, Georgia)
title: Senior SRE / DevSecOps Engineer · Cloud Architect
focus: AWS · Kubernetes · AI/ML Infrastructure · Fintech · Telecom
contact: ajayjboss@gmail.com
linkedin: linkedin.com/in/ajay-devineni
experience:
  total: 11+ years
  domains: [ Banking, Credit Union Platforms, Telecom, Healthcare, Insurance ]
  at_scale: Millions of transactions · Multi-region · 24/7 on-call
core_stack:
  cloud: [ AWS (primary), Azure, GCP ]
  iac: [ Terraform, Terragrunt, CloudFormation ]
  containers: [ Kubernetes/EKS, Docker, Helm, Flux, ArgoCD ]
  cicd: [ GitHub Actions, Jenkins, Azure Pipelines, CodeDeploy ]
  ai_ops: [ MLOps pipelines, LLM inference infra, GPU compute (P5/G5) ]
  observe: [ Dynatrace, Grafana, Prometheus, Datadog, Splunk, ELK ]
  security: [ CrowdStrike, Zero-Trust, PrivateLink, SOC2, AWS IAM/SSO ]
  scripting: [ Python, Bash/Shell ]
currently_building:
  - agentsre (PyPI): OSS SRE reliability instrumentation library for agentic AI
  - 4 original SLIs for AI agents: DQR · TIE · HER · AQDD
  - Publishing on AWS Community Builders & DEV Community: AIOps + SRE patterns

| Metric | Achievement |
|---|---|
| 💰 Cloud Cost Savings | $140,000/month eliminated via AWS resource optimization (FY24) |
| ⬆️ Uptime Delivered | 99.99% across 5 digital banking platforms simultaneously |
| ⚙️ Toil Reduction | 30% operational overhead cut via Python/Shell self-healing automation |
| 🕒 Hours Saved/Month | 10–15+ engineering hours freed via patching & audit automation |
| 🐛 Legacy Bug Resolved | 3–4 year recurring balance transfer failure permanently fixed |
| ⏱️ Downtime Prevented | 50+ hours cumulative production downtime averted over career |
| 🔒 Security Posture | Zero-trust migrations eliminating years of recurring network instability |
| 🏦 Platforms Supported | 5 credit union banking platforms — 90%+ reliability rate |
AWS Services Deep-Cut:
EKS · EC2 · RDS · Aurora · S3 · Lambda · VPC · IAM · SSO · Transit Gateway
PrivateLink · DMS · DynamoDB · CloudWatch · SQS · SNS · ACM · Route53 · Auto Scaling
Networking Specialization:
Aviatrix · AWS VPN (HA dual-tunnel) · PrivateLink · VPC Peering
Transit Gateway · Zero-Trust Architecture · OpenVPN · SSL/TLS · DNS
🌟 agentsre
📐 4 Original SLIs · 🔗 A2A Semantic Validator · ⚡ Circuit Breaker · ☁️ AWS-Native
pip install agentsre # core — zero dependencies
pip install agentsre[aws] # + boto3 for CloudWatch publishing

Python · AWS · CloudWatch · DynamoDB · EventBridge · SSM · X-Ray · A2A · LLM Agents · MIT License
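
For readers unfamiliar with the circuit-breaker pattern listed among the features above, here is a minimal generic sketch. It is not agentsre's API: the `CircuitBreaker` class, its parameters, and the injectable clock are hypothetical names chosen only to illustrate the idea of rejecting calls after repeated failures and retrying after a cooldown.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, retries after a cooldown."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock  # injectable so the cooldown can be tested without sleeping
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; call rejected")
            # Cooldown elapsed: half-open, so one trial call is allowed through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping an agent's tool call as `breaker.call(tool, payload)` would then fail fast while a downstream dependency is unhealthy, instead of piling up retries.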
Production-grade framework applying machine learning to SRE workflows. Detects anomalies in metrics streams before they become incidents. Implements intelligent runbook automation via LLM agents.

Battle-tested Terraform modules reflecting 11+ years of real production patterns. Encodes AWS best practices, security baselines, and cost guardrails learned from managing multi-million-dollar cloud estates.

Spring Boot service demonstrating SRE-first microservice design: health probes, graceful degradation, structured logging, and observability hooks. Built to 12-factor app principles for Kubernetes deployment.

Backend service for insurance domain with compliance-first design patterns. Demonstrates audit logging, role-based access, and observability integration for regulated workloads.

🏦 Candescent (formerly NCR) — Senior SRE / DevSecOps Engineer (Oct 2021 – Present)
Zero-Trust Network Modernization
- Migrated banking clients from legacy Aviatrix S2C VPNs to AWS-native VPN with HA dual-tunnel encryption — eliminating years of recurring instability
- Migrated clients from SSH tunnels to AWS PrivateLink, removing Aviatrix dependency entirely
- Redesigned QA RDS access architecture for security compliance during Azure → AWS migration
Cost Engineering
- Identified and decommissioned unused EC2, RDS, WorkSpaces, and Aviatrix components → $140K/month saved, exceeding 5% cost-reduction target
- Docker Hub audit + service account rollout → additional ~1% monthly spend reduction
AI & Automation
- Built AI-assisted escrow scripts for packaging, validating, and delivering release artifacts
- Automated AWS SSO user reporting via Python, so the audit team receives accurate access reports without manual effort
- Deployed self-hosted GitHub Actions runners via ASG — improved CI/CD reliability and eliminated managed-minute costs
- Python Lambda for automated SSM parameter sync during DR
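
A minimal sketch of what an SSM parameter sync Lambda could look like, assuming parameters are copied from a primary region to a DR region under a shared path prefix. The event keys (`source_region`, `target_region`, `path`) and the function names are hypothetical, and this is not the production implementation; only the boto3 calls `get_paginator("get_parameters_by_path")` and `put_parameter` are real SSM client operations.

```python
from typing import Any, Dict


def sync_parameters(source_ssm: Any, target_ssm: Any, path: str) -> int:
    """Copy every parameter under `path` from the source client to the target client."""
    copied = 0
    paginator = source_ssm.get_paginator("get_parameters_by_path")
    for page in paginator.paginate(Path=path, Recursive=True, WithDecryption=True):
        for param in page["Parameters"]:
            # Overwrite=True makes the sync idempotent across repeated runs.
            target_ssm.put_parameter(
                Name=param["Name"],
                Value=param["Value"],
                Type=param["Type"],
                Overwrite=True,
            )
            copied += 1
    return copied


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    # Imported lazily so sync_parameters can be exercised without AWS access.
    import boto3

    source = boto3.client("ssm", region_name=event["source_region"])
    target = boto3.client("ssm", region_name=event["target_region"])
    return {"copied": sync_parameters(source, target, event["path"])}
```

Passing the clients in keeps the copy logic testable with fakes. In this sketch, `SecureString` values are re-encrypted with the target region's default key because no `KeyId` is supplied.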
Critical Incident Resolution
- Permanently resolved a 3–4 year recurring banking balance transfer failure, restoring full production reliability
- Diagnosed and fixed non-functioning pods on EKS nodes during live client incidents
- Resolved CashEdge connectivity and no-route-to-destination failures under SLA
Security & Compliance
- Deployed CrowdStrike Falcon across Linux/Windows VMs and EKS clusters (Detection → Prevention mode)
- Proactively caught expiring SAML certificates (60-day notice); authored MOP docs for repeatable renewal
- Supported SOC2 compliance audits with DevSecOps policy enforcement
📡 AT&T — SRE / L3 Infrastructure Engineer (Apr 2017 – Oct 2021)
- L3 support for AT&T Digital Life smart home platform — millions of customers, dual redundant data centers
- Managed full AWS estate: EC2, S3, RDS, SQS, ELB, VPC, IAM, KMS, CloudWatch, SNS, Auto Scaling, CloudTrail, ACM
- Contributed to physical data center → Azure migration
- Managed middleware: Apache Tomcat, Nginx, JBoss — SSL/TLS cert renewals with zero cert-related outages
- Authored Shell scripts for log rotation, server auto-startup/shutdown, compliance automation
- DR War Rooms and BigButton failover automation across dual data centers
- 24/7 on-call rotation — monitoring via CloudWatch, Nagios, Zabbix, Introscope, Netcool
Traditional SRE asks: "How do we respond faster when things break?"
My approach asks: "How do we make the system tell us before it breaks?"
The stack I believe in:
[Alert fatigue] → AI anomaly detection filters signal from noise
[Manual toil] → Self-healing automation removes humans from the loop
[Reactive ops] → SLO-driven engineering makes reliability measurable
[Cost blindness] → FinOps embedded into reliability decisions from day one
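
To make "SLO-driven engineering makes reliability measurable" concrete, here is a small sketch of the standard error-budget arithmetic. The function names and the 30-day default window are assumptions for illustration. For a 99.99% availability SLO over 30 days, the budget works out to about 4.32 minutes of downtime (0.0001 × 43,200 minutes).

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a rolling window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.9999")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget
```

A team could gate risky releases on `budget_remaining(...)` staying above a chosen threshold, which ties reliability decisions to a number rather than a judgment call.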
Publishing original SRE + AIOps patterns — not theory, production experience.
| | Title | Platform |
|---|---|---|
| 🔥 | SLOs for Agentic AI: The Reliability Framework Production Teams Are Missing | DEV Community |
| 🔥 | Why SRE Principles Are the Missing Layer in MCP Security | DEV Community |
| ☁️ | Single-Agent + MCP: SLOs for Agentic AI on AWS | AWS Community Builders |
| ☁️ | Multi-Agent + A2A: The SRE Reliability Framework Nobody Has Written Yet | AWS Community Builders |

| Credential | Organization |
|---|---|
| 🏅 AWS Certified Solutions Architect | Amazon Web Services |
| 🎓 M.S. Information Security | University of the Cumberlands |
| 🎓 M.S. Computer Science | Silicon Valley University |
| 🎓 B.E. Electronics & Communication | JNTUH College of Engineering |
| 🔬 IEEE Senior Member | Institute of Electrical and Electronics Engineers |
| 📐 Fellow SCRS | Society of Clinical Research Sites |
| ⚡ IET Member | The Institution of Engineering and Technology |
Open to conversations around:
Staff SRE · Principal Cloud Architect · Platform Engineering Leadership · AI Infrastructure · DevSecOps · Fintech/Banking Cloud
"The best on-call rotation is the one that never pages — because the system healed itself."

