AJAY DEVINENI Ajay150313

`$ whoami`

name:         Ajay Devineni
location:     Atlanta, GA (Cumming, Georgia)
title:        Senior SRE / DevSecOps Engineer · Cloud Architect
focus:        AWS · Kubernetes · AI/ML Infrastructure · Fintech · Telecom
contact:      ajayjboss@gmail.com
linkedin:     linkedin.com/in/ajay-devineni

experience:
  total:      11+ years
  domains:    [ Banking, Credit Union Platforms, Telecom, Healthcare, Insurance ]
  at_scale:   Millions of transactions · Multi-region · 24/7 on-call

core_stack:
  cloud:      [ AWS (primary), Azure, GCP ]
  iac:        [ Terraform, Terragrunt, CloudFormation ]
  containers: [ Kubernetes/EKS, Docker, Helm, Flux, ArgoCD ]
  cicd:       [ GitHub Actions, Jenkins, Azure Pipelines, CodeDeploy ]
  ai_ops:     [ MLOps pipelines, LLM inference infra, GPU compute (P5/G5) ]
  observe:    [ Dynatrace, Grafana, Prometheus, Datadog, Splunk, ELK ]
  security:   [ CrowdStrike, Zero-Trust, PrivateLink, SOC2, AWS IAM/SSO ]
  scripting:  [ Python, Bash/Shell ]

currently_building:
  - agentsre (PyPI): OSS SRE reliability instrumentation library for agentic AI
  - 4 original SLIs for AI agents — DQR · TIE · HER · AQDD
  - Publishing on AWS Community Builders & DEV Community — AIOps + SRE patterns

📊 Impact By The Numbers

Metric	Achievement
💰 Cloud Cost Savings	$140,000/month eliminated via AWS resource optimization (FY24)
⬆️ Uptime Delivered	99.99% across 5 digital banking platforms simultaneously
⚙️ Toil Reduction	30% operational overhead cut via Python/Shell self-healing automation
🕒 Hours Saved/Month	10–15+ engineering hours freed via patching & audit automation
🐛 Legacy Bug Resolved	3–4 year recurring balance transfer failure permanently fixed
⏱️ Downtime Prevented	50+ hours cumulative production downtime averted over career
🔒 Security Posture	Zero-trust migrations eliminating years of recurring network instability
🏦 Platforms Supported	5 credit union banking platforms — 90%+ reliability rate

🛠️ Full Technology Stack

AWS Services Deep-Cut:
EKS · EC2 · RDS · Aurora · S3 · Lambda · VPC · IAM · SSO · Transit Gateway
PrivateLink · DMS · DynamoDB · CloudWatch · SQS · SNS · ACM · Route53 · Auto Scaling

Networking Specialization:
Aviatrix · AWS VPN (HA dual-tunnel) · PrivateLink · VPC Peering
Transit Gateway · Zero-Trust Architecture · OpenVPN · SSL/TLS · DNS

🚀 Featured Projects

🌟 agentsre · Active Development

SRE reliability instrumentation for agentic AI in production — pip install agentsre

Your AI agent returns HTTP 200. Uptime is 99.9%. Every health check is green. And it's making wrong decisions 30% of the time. Your current observability stack won't tell you. agentsre implements four SLIs that catch what CloudWatch, Datadog, and Grafana miss. It was born from a real production postmortem in which a financial services AI agent ran in silent failure mode for 6 hours before causing a 40-minute outage.

📐 4 Original SLIs

DQR — Decision Quality Rate
TIE — Tool Invocation Efficiency
HER — Human Escalation Rate
AQDD — Approval Queue Depth Drift (the one standard SLO burn-rate alerts miss entirely)
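
The four SLIs are named above but not defined in this README. As a rough illustration only, the sketch below assumes each one is a simple ratio over an evaluation window. The `AgentWindow` fields, the `compute_slis` helper, and all four formulas are hypothetical and are not agentsre's API or its actual definitions.

```python
from dataclasses import dataclass


@dataclass
class AgentWindow:
    """Counts observed for one agent over one evaluation window (hypothetical shape)."""
    decisions: int               # total decisions the agent made
    correct_decisions: int       # decisions later judged correct
    tool_calls: int              # total tool invocations
    useful_tool_calls: int       # invocations that contributed to the outcome
    escalations: int             # decisions handed off to a human
    queue_depth: float           # current approval-queue depth
    baseline_queue_depth: float  # typical depth for a comparable period


def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def compute_slis(w: AgentWindow) -> dict[str, float]:
    # All four formulas are assumed definitions for illustration.
    return {
        "DQR": ratio(w.correct_decisions, w.decisions),
        "TIE": ratio(w.useful_tool_calls, w.tool_calls),
        "HER": ratio(w.escalations, w.decisions),
        # Relative drift of queue depth from its baseline.
        "AQDD": ratio(w.queue_depth - w.baseline_queue_depth, w.baseline_queue_depth),
    }
```

Under these assumptions, an agent can sit at 100% HTTP success while DQR falls and AQDD climbs, which is the kind of failure the project description refers to.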

🔗 A2A Semantic Validator

Multi-agent boundary validation
Catches HTTP 200 semantic failures
Schema + behavioral drift detection
Blocks bad output before it propagates
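
A minimal sketch of what boundary validation could look like, assuming one agent hands another a JSON payload with an action, a confidence score, and a rationale. The field names, the allowed actions, and `validate_agent_output` are hypothetical and do not come from agentsre. The sketch covers schema and value checks only, not behavioral drift.

```python
from typing import Any

# Hypothetical contract for messages crossing an agent-to-agent boundary.
REQUIRED_FIELDS = {"action": str, "confidence": float, "rationale": str}
ALLOWED_ACTIONS = {"approve", "reject", "escalate"}


def validate_agent_output(status_code: int, payload: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the output may propagate."""
    problems = []
    if status_code != 200:
        problems.append(f"transport failure: HTTP {status_code}")
        return problems
    # Schema check: every required field is present with the expected type.
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(f"wrong type for {field}")
    if problems:
        return problems
    # Semantic checks: an HTTP 200 with well-typed fields can still be wrong.
    if payload["action"] not in ALLOWED_ACTIONS:
        problems.append(f"unknown action: {payload['action']}")
    if not 0.0 <= payload["confidence"] <= 1.0:
        problems.append("confidence outside [0, 1]")
    if not payload["rationale"].strip():
        problems.append("empty rationale")
    return problems
```

A caller would drop or quarantine any message for which the returned list is non-empty instead of forwarding it to the next agent.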

⚡ Circuit Breaker

Operates at the semantic layer
Opens on success rate drop, not HTTP errors
Progressive autonomy constraint ladder
AWS SSM Automation integration
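
The idea of opening on a success-rate drop can be sketched with a rolling window of semantic outcomes. This is an illustrative sketch, not agentsre's implementation: the class name, the thresholds, and the three-step autonomy ladder are all assumptions, and the SSM Automation integration is left out.

```python
from collections import deque


class SemanticCircuitBreaker:
    """Trips on a drop in semantic success rate rather than on HTTP errors."""

    # Hypothetical autonomy ladder, from least to most constrained.
    LADDER = ("autonomous", "human_approval_required", "halted")

    def __init__(self, window: int = 20, min_samples: int = 10,
                 degrade_below: float = 0.9, halt_below: float = 0.7):
        self.outcomes = deque(maxlen=window)  # True = semantically correct
        self.min_samples = min_samples
        self.degrade_below = degrade_below
        self.halt_below = halt_below

    def record(self, semantically_ok: bool) -> None:
        """Record whether the latest agent output passed semantic checks."""
        self.outcomes.append(semantically_ok)

    def success_rate(self) -> float:
        if not self.outcomes:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)

    def autonomy_level(self) -> str:
        # Too few samples to judge: stay at full autonomy.
        if len(self.outcomes) < self.min_samples:
            return self.LADDER[0]
        rate = self.success_rate()
        if rate < self.halt_below:
            return self.LADDER[2]
        if rate < self.degrade_below:
            return self.LADDER[1]
        return self.LADDER[0]
```

The caller feeds `record()` with the result of its own semantic check and consults `autonomy_level()` before letting the agent act, so the constraint tightens in steps rather than flipping straight from open to closed.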

☁️ AWS-Native

CloudWatch custom metrics publish
DynamoDB behavioral baseline store
EventBridge breach-triggered workflows
X-Ray distributed tracing across A2A

pip install agentsre          # core — zero dependencies
pip install "agentsre[aws]"   # + boto3 for CloudWatch publishing

Python · AWS · CloudWatch · DynamoDB · EventBridge · SSM · X-Ray · A2A · LLM Agents · MIT License

🤖 ai-sre-guardrails

ML-powered incident prediction & auto-remediation

Production-grade framework applying machine learning to SRE workflows. Detects anomalies in metrics streams before they become incidents. Implements intelligent runbook automation via LLM agents.

Highlights:

Time-series forecasting on infrastructure metrics
Automated remediation trigger engine
LLM-assisted root cause analysis
Designed for Kubernetes-native environments
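
As a simple stand-in for the anomaly detection described above, the sketch below flags points that deviate strongly from a trailing window using a rolling z-score. This is a deliberately basic technique chosen for illustration, not the forecasting model the project uses. The function name, window size, and threshold are assumptions.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(values, window=10, threshold=3.0):
    """Yield (index, value) for points far from the trailing-window mean."""
    history = deque(maxlen=window)
    for i, value in enumerate(values):
        # Only score a point once a full window of history is available.
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip flat windows to avoid dividing by zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                yield i, value
        history.append(value)
```

In a remediation pipeline, each yielded point would be a candidate trigger. A real deployment would also need to account for seasonality and for anomalous values contaminating the window.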

Python · AWS Lambda · Prometheus · LLM Agents · Kafka

🏗️ dajay-dev-iac

Enterprise-grade AWS Infrastructure as Code

Battle-tested Terraform modules reflecting 11+ years of real production patterns. Encodes AWS best practices, security baselines, and cost guardrails learned from managing multi-million-dollar cloud estates.

Highlights:

Multi-environment, multi-region AWS blueprints
Terragrunt layering for DRY infrastructure
SOC2-ready IAM and networking patterns
FinOps guardrails baked in

Terraform · HCL · AWS · Terragrunt · GitOps

💳 customer-service-spring-boot

Cloud-native microservice — fintech domain

Spring Boot service demonstrating SRE-first microservice design: health probes, graceful degradation, structured logging, and observability hooks. Built to 12-factor app principles for Kubernetes deployment.

Java · Spring Boot · AWS RDS · OpenTelemetry · EKS

🛡️ insurance-service-spring-boot

Regulated-domain microservice architecture

Backend service for insurance domain with compliance-first design patterns. Demonstrates audit logging, role-based access, and observability integration for regulated workloads.

Java · Spring Boot · AWS · RBAC · Audit Logging

🏆 Career Highlights

🏦 Candescent (formerly NCR) — Senior SRE / DevSecOps Engineer (Oct 2021 – Present)

Zero-Trust Network Modernization

Migrated banking clients from legacy Aviatrix S2C VPNs to AWS-native VPN with HA dual-tunnel encryption — eliminating years of recurring instability
Migrated clients from SSH tunnels to AWS PrivateLink, removing Aviatrix dependency entirely
Redesigned QA RDS access architecture for security compliance during Azure → AWS migration

Cost Engineering

Identified and decommissioned unused EC2, RDS, WorkSpaces, and Aviatrix components → $140K/month saved, exceeding 5% cost-reduction target
Docker Hub audit + service account rollout → additional ~1% monthly spend reduction

AI & Automation

Built AI-assisted escrow scripts for packaging, validating, and delivering release artifacts
Automated AWS SSO user reporting via Python, so the audit team receives accurate access reports without manual effort
Deployed self-hosted GitHub Actions runners via ASG — improved CI/CD reliability and eliminated managed-minute costs
Wrote a Python Lambda that automatically syncs SSM parameters during disaster recovery (DR)
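
A cross-region SSM parameter sync of the kind mentioned in the last item might look roughly like the sketch below. It is not the production Lambda: the event keys (`source_region`, `target_region`, `path`) and the function names are assumptions. It uses the boto3 SSM calls `get_parameters_by_path` and `put_parameter`, with the clients passed in so the copy logic can be exercised without AWS access.

```python
def sync_parameters(source_ssm, target_ssm, path):
    """Copy every parameter under `path` from the source client to the target."""
    copied = 0
    kwargs = {"Path": path, "Recursive": True, "WithDecryption": True}
    while True:
        page = source_ssm.get_parameters_by_path(**kwargs)
        for param in page["Parameters"]:
            target_ssm.put_parameter(
                Name=param["Name"],
                Value=param["Value"],
                Type=param["Type"],
                Overwrite=True,
            )
            copied += 1
        # Follow pagination until the service stops returning a token.
        token = page.get("NextToken")
        if not token:
            return copied
        kwargs["NextToken"] = token


def lambda_handler(event, context):
    import boto3  # imported here so sync_parameters can run without AWS

    # The event keys below are placeholders for illustration.
    source = boto3.client("ssm", region_name=event["source_region"])
    target = boto3.client("ssm", region_name=event["target_region"])
    return {"copied": sync_parameters(source, target, event["path"])}
```

Because values are read with decryption enabled, SecureString parameters are re-encrypted in the target region with whatever KMS key `put_parameter` defaults to there. A real DR setup would likely pass an explicit key.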

Critical Incident Resolution

Permanently resolved a 3–4 year recurring banking balance transfer failure, restoring full production reliability
Diagnosed and fixed non-functioning pods on EKS nodes during live client incidents
Resolved CashEdge connectivity and no-route-to-destination failures under SLA

Security & Compliance

Deployed CrowdStrike Falcon across Linux/Windows VMs and EKS clusters (Detection → Prevention mode)
Proactively caught expiring SAML certificates (60-day notice); authored MOP docs for repeatable renewal
Supported SOC2 compliance audits with DevSecOps policy enforcement

📡 AT&T — SRE / L3 Infrastructure Engineer (Apr 2017 – Oct 2021)

L3 support for AT&T Digital Life smart home platform — millions of customers, dual redundant data centers
Managed full AWS estate: EC2, S3, RDS, SQS, ELB, VPC, IAM, KMS, CloudWatch, SNS, Auto Scaling, CloudTrail, ACM
Contributed to physical data center → Azure migration
Managed middleware: Apache Tomcat, Nginx, JBoss — SSL/TLS cert renewals with zero cert-related outages
Authored Shell scripts for log rotation, server auto-startup/shutdown, compliance automation
DR War Rooms and BigButton failover automation across dual data centers
24/7 on-call rotation — monitoring via CloudWatch, Nagios, Zabbix, Introscope, Netcool

🧩 SRE Philosophy

Traditional SRE asks:  "How do we respond faster when things break?"
My approach asks:      "How do we make the system tell us before it breaks?"

The stack I believe in:

  [Alert fatigue]  →  AI anomaly detection filters signal from noise
  [Manual toil]    →  Self-healing automation removes humans from the loop
  [Reactive ops]   →  SLO-driven engineering makes reliability measurable
  [Cost blindness] →  FinOps embedded into reliability decisions from day one

✍️ Writing & Thought Leadership

Publishing original SRE + AIOps patterns — not theory, production experience.

	Title	Platform
🔥	SLOs for Agentic AI: The Reliability Framework Production Teams Are Missing	DEV Community
🔥	Why SRE Principles Are the Missing Layer in MCP Security	DEV Community
☁️	Single-Agent + MCP: SLOs for Agentic AI on AWS	AWS Community Builders
☁️	Multi-Agent + A2A: The SRE Reliability Framework Nobody Has Written Yet	AWS Community Builders

🎓 Credentials


🏅 AWS Certified Solutions Architect	Amazon Web Services
🎓 M.S. Information Security	University of the Cumberlands
🎓 M.S. Computer Science	Silicon Valley University
🎓 B.E. Electronics & Communication	JNTUH College of Engineering
🔬 IEEE Senior Member	Institute of Electrical and Electronics Engineers
📐 Fellow SCRS	Society of Clinical Research Sites
⚡ IET Member	The Institution of Engineering and Technology

🤝 Let's Connect

Open to conversations around: Staff SRE · Principal Cloud Architect · Platform Engineering Leadership · AI Infrastructure · DevSecOps · Fintech/Banking Cloud

"The best on-call rotation is the one that never pages — because the system healed itself."
