Ajay150313/README.md





$ whoami

name:         Ajay Devineni
location:     Atlanta, GA (Cumming, Georgia)
title:        Senior SRE / DevSecOps Engineer · Cloud Architect
focus:        AWS · Kubernetes · AI/ML Infrastructure · Fintech · Telecom
contact:      ajayjboss@gmail.com
linkedin:     linkedin.com/in/ajay-devineni

experience:
  total:      11+ years
  domains:    [ Banking, Credit Union Platforms, Telecom, Healthcare, Insurance ]
  at_scale:   Millions of transactions · Multi-region · 24/7 on-call

core_stack:
  cloud:      [ AWS (primary), Azure, GCP ]
  iac:        [ Terraform, Terragrunt, CloudFormation ]
  containers: [ Kubernetes/EKS, Docker, Helm, Flux, ArgoCD ]
  cicd:       [ GitHub Actions, Jenkins, Azure Pipelines, CodeDeploy ]
  ai_ops:     [ MLOps pipelines, LLM inference infra, GPU compute (P5/G5) ]
  observe:    [ Dynatrace, Grafana, Prometheus, Datadog, Splunk, ELK ]
  security:   [ CrowdStrike, Zero-Trust, PrivateLink, SOC2, AWS IAM/SSO ]
  scripting:  [ Python, Bash/Shell ]

currently_building:
  - agentsre (PyPI): OSS SRE reliability instrumentation library for agentic AI
  - 4 original SLIs for AI agents — DQR · TIE · HER · AQDD
  - Publishing on AWS Community Builders & DEV Community — AIOps + SRE patterns

📊 Impact By The Numbers

| Metric | Achievement |
| --- | --- |
| 💰 Cloud Cost Savings | $140,000/month eliminated via AWS resource optimization (FY24) |
| ⬆️ Uptime Delivered | 99.99% across 5 digital banking platforms simultaneously |
| ⚙️ Toil Reduction | 30% operational overhead cut via Python/Shell self-healing automation |
| 🕒 Hours Saved/Month | 10–15+ engineering hours freed via patching & audit automation |
| 🐛 Legacy Bug Resolved | 3–4 year recurring balance transfer failure permanently fixed |
| ⏱️ Downtime Prevented | 50+ hours cumulative production downtime averted over career |
| 🔒 Security Posture | Zero-trust migrations eliminating years of recurring network instability |
| 🏦 Platforms Supported | 5 credit union banking platforms, 90%+ reliability rate |

🛠️ Full Technology Stack

Cloud & Infrastructure

AWS · Azure · GCP · Kubernetes · Docker · Terraform · Ansible · Linux

Development & CI/CD

Python · Java · Bash · GitHub Actions · Jenkins · Git

Observability & Security

Grafana · Prometheus · Kafka

AWS Services Deep-Cut:
EKS · EC2 · RDS · Aurora · S3 · Lambda · VPC · IAM · SSO · Transit Gateway
PrivateLink · DMS · DynamoDB · CloudWatch · SQS · SNS · ACM · Route53 · Auto Scaling

Networking Specialization:
Aviatrix · AWS VPN (HA dual-tunnel) · PrivateLink · VPC Peering
Transit Gateway · Zero-Trust Architecture · OpenVPN · SSL/TLS · DNS

🚀 Featured Projects

🌟 agentsre  ·  PyPI  ·  Active Development

SRE reliability instrumentation for agentic AI in production — pip install agentsre

Your AI agent returns HTTP 200. Uptime is 99.9%. Every health check is green. And it's making wrong decisions 30% of the time. Your current observability stack won't tell you. agentsre implements the four SLIs that catch what CloudWatch, Datadog, and Grafana miss — born from a real production postmortem where a financial services AI agent ran 6 hours in silent failure mode before causing a 40-minute outage.

📐 4 Original SLIs

  • DQR — Decision Quality Rate
  • TIE — Tool Invocation Efficiency
  • HER — Human Escalation Rate
  • AQDD — Approval Queue Depth Drift (the one standard SLO burn-rate alerts miss entirely)
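The README names the four SLIs but does not give their formulas. The sketch below is only one plausible reading, under loud assumptions: DQR as the share of decisions judged correct, TIE as the share of tool calls whose result was used, HER as the share of decisions handed to a human, and AQDD as the change in mean approval queue depth between a baseline window and a recent one. `AgentWindow`, `compute_slis`, and `queue_depth_drift` are hypothetical names, not the agentsre API.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AgentWindow:
    """Hypothetical counters collected over one evaluation window."""
    decisions: int            # total decisions the agent made
    correct_decisions: int    # decisions judged correct by an evaluator
    tool_calls: int           # total tool invocations
    useful_tool_calls: int    # invocations whose result was actually used
    escalations: int          # decisions handed to a human


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or 0.0 for an empty window."""
    return numerator / denominator if denominator else 0.0


def compute_slis(w: AgentWindow) -> dict:
    """Assumed definitions of DQR, TIE, and HER as simple ratios."""
    return {
        "DQR": ratio(w.correct_decisions, w.decisions),
        "TIE": ratio(w.useful_tool_calls, w.tool_calls),
        "HER": ratio(w.escalations, w.decisions),
    }


def queue_depth_drift(baseline: list, recent: list) -> float:
    """Assumed AQDD: mean recent queue depth minus mean baseline depth."""
    return mean(recent) - mean(baseline)
```

A drift value that keeps growing while the three ratios look healthy is the kind of signal a request-based burn-rate alert would not see, since no request has failed yet.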

🔗 A2A Semantic Validator

  • Multi-agent boundary validation
  • Catches HTTP 200 semantic failures
  • Schema + behavioral drift detection
  • Blocks bad output before it propagates
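To make the "HTTP 200 semantic failure" idea concrete, here is a minimal sketch of a boundary check between two agents. It assumes a hypothetical message shape (`action`, `amount`, `confidence`) and covers only the schema and value checks, not behavioral drift. None of these names come from agentsre.

```python
from typing import Any

# Hypothetical schema for a message passed between two agents:
# field name -> required Python type.
TRANSFER_SCHEMA = {"action": str, "amount": float, "confidence": float}
ALLOWED_ACTIONS = {"approve", "reject", "escalate"}


def validate_message(payload: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the message may pass."""
    problems = []
    for field, expected in TRANSFER_SCHEMA.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            problems.append(f"wrong type for {field}")
    if problems:
        return problems  # skip semantic checks on malformed input
    # Semantic checks: structurally valid values can still be nonsense.
    if payload["action"] not in ALLOWED_ACTIONS:
        problems.append("unknown action")
    if not 0.0 <= payload["confidence"] <= 1.0:
        problems.append("confidence out of range")
    if payload["amount"] < 0:
        problems.append("negative amount")
    return problems
```

The receiving agent would drop or quarantine any message with a non-empty problem list, even though the transport call itself succeeded.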

⚡ Circuit Breaker

  • Operates at the semantic layer
  • Opens on success rate drop, not HTTP errors
  • Progressive autonomy constraint ladder
  • AWS SSM Automation integration
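A breaker that opens on a drop in success rate rather than on HTTP errors can be sketched as a rolling window over semantic outcomes. This is an illustration under assumptions, not the agentsre implementation: `SemanticCircuitBreaker`, its window size, and its threshold are hypothetical, and the autonomy ladder, reset logic, and SSM integration are left out.

```python
from collections import deque


class SemanticCircuitBreaker:
    """Opens when the rolling rate of semantically correct outcomes drops."""

    def __init__(self, window: int = 20, min_success_rate: float = 0.8):
        self.outcomes = deque(maxlen=window)  # True = semantically correct
        self.min_success_rate = min_success_rate
        self.open = False

    def record(self, semantically_ok: bool) -> None:
        self.outcomes.append(semantically_ok)
        # Only judge once the window is full, to avoid tripping on startup.
        if len(self.outcomes) == self.outcomes.maxlen:
            rate = sum(self.outcomes) / len(self.outcomes)
            if rate < self.min_success_rate:
                self.open = True

    def allow(self) -> bool:
        """Callers check this before letting the agent act autonomously."""
        return not self.open
```

The input to `record` is a verdict on the agent's output (for example, from a boundary validator), so a run of HTTP 200 responses with wrong decisions still trips the breaker.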

☁️ AWS-Native

  • CloudWatch custom metrics publish
  • DynamoDB behavioral baseline store
  • EventBridge breach-triggered workflows
  • X-Ray distributed tracing across A2A

pip install agentsre            # core, zero dependencies
pip install "agentsre[aws]"     # + boto3 for CloudWatch publishing (quotes keep zsh from expanding the brackets)
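For readers unfamiliar with custom metrics, the sketch below shows roughly what publishing SLI values to CloudWatch looks like with boto3's `put_metric_data`. It does not use or describe the agentsre API. The namespace `Example/AgentSLI`, the `AgentId` dimension, and both function names are assumptions for illustration.

```python
def build_metric_data(agent_id: str, slis: dict) -> list:
    """Turn SLI values into CloudWatch MetricData entries."""
    return [
        {
            "MetricName": name,
            "Dimensions": [{"Name": "AgentId", "Value": agent_id}],
            "Value": float(value),
            "Unit": "None",  # the SLIs here are unitless ratios
        }
        for name, value in sorted(slis.items())
    ]


def publish_slis(cloudwatch, agent_id: str, slis: dict,
                 namespace: str = "Example/AgentSLI") -> None:
    """Publish via a boto3 CloudWatch client, e.g. boto3.client("cloudwatch")."""
    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=build_metric_data(agent_id, slis),
    )
```

Passing the client in as an argument keeps the payload logic testable without AWS credentials.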

Python AWS CloudWatch DynamoDB EventBridge SSM X-Ray A2A LLM Agents MIT License



ML-powered incident prediction & auto-remediation

Production-grade framework applying machine learning to SRE workflows. Detects anomalies in metrics streams before they become incidents. Implements intelligent runbook automation via LLM agents.

Highlights:

  • Time-series forecasting on infrastructure metrics
  • Automated remediation trigger engine
  • LLM-assisted root cause analysis
  • Designed for Kubernetes-native environments
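As a much simpler stand-in for the forecasting models mentioned above, the sketch below flags points in a metric stream with a rolling z-score. It is a plain statistical baseline, not the project's method; `find_anomalies`, the window size, and the threshold are assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values: list, window: int = 5, threshold: float = 3.0) -> list:
    """Return indexes whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies
```

A remediation trigger would typically act only on sustained anomalies, since a single spike also inflates the window that follows it.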

Python AWS Lambda Prometheus LLM Agents Kafka


Enterprise-grade AWS Infrastructure as Code

Battle-tested Terraform modules reflecting 11+ years of real production patterns. Encodes AWS best practices, security baselines, and cost guardrails learned from managing multi-million-dollar cloud estates.

Highlights:

  • Multi-environment, multi-region AWS blueprints
  • Terragrunt layering for DRY infrastructure
  • SOC2-ready IAM and networking patterns
  • FinOps guardrails baked in

Terraform HCL AWS Terragrunt GitOps


Cloud-native microservice — fintech domain

Spring Boot service demonstrating SRE-first microservice design: health probes, graceful degradation, structured logging, and observability hooks. Built to 12-factor app principles for Kubernetes deployment.

Java Spring Boot AWS RDS OpenTelemetry EKS

Regulated-domain microservice architecture

Backend service for insurance domain with compliance-first design patterns. Demonstrates audit logging, role-based access, and observability integration for regulated workloads.

Java Spring Boot AWS RBAC Audit Logging


🏆 Career Highlights

🏦 Candescent (formerly NCR) — Senior SRE / DevSecOps Engineer (Oct 2021 – Present)

Zero-Trust Network Modernization

  • Migrated banking clients from legacy Aviatrix S2C VPNs to AWS-native VPN with HA dual-tunnel encryption — eliminating years of recurring instability
  • Migrated clients from SSH tunnels to AWS PrivateLink, removing Aviatrix dependency entirely
  • Redesigned QA RDS access architecture for security compliance during Azure → AWS migration

Cost Engineering

  • Identified and decommissioned unused EC2, RDS, WorkSpaces, and Aviatrix components → $140K/month saved, exceeding 5% cost-reduction target
  • Docker Hub audit + service account rollout → additional ~1% monthly spend reduction

AI & Automation

  • Built AI-assisted escrow scripts for packaging, validating, and delivering release artifacts
  • Automated AWS SSO user reporting via Python, giving the audit team accurate access reports without manual effort
  • Deployed self-hosted GitHub Actions runners via ASG — improved CI/CD reliability and eliminated managed-minute costs
  • Python Lambda for automated SSM parameter sync during DR

Critical Incident Resolution

  • Permanently resolved a 3–4 year recurring banking balance transfer failure, restoring full production reliability
  • Diagnosed and fixed non-functioning pods on EKS nodes during live client incidents
  • Resolved CashEdge connectivity and no-route-to-destination failures under SLA

Security & Compliance

  • Deployed CrowdStrike Falcon across Linux/Windows VMs and EKS clusters (Detection → Prevention mode)
  • Proactively caught expiring SAML certificates (60-day notice); authored MOP docs for repeatable renewal
  • Supported SOC2 compliance audits with DevSecOps policy enforcement
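The 60-day notice on expiring certificates comes down to a date comparison. A minimal sketch, assuming the certificate's notAfter timestamp has already been read from the certificate or IdP metadata (that step is not shown); `needs_renewal` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def needs_renewal(not_after: datetime, now: datetime,
                  notice: timedelta = timedelta(days=60)) -> bool:
    """True once the certificate is inside the notice window (or expired)."""
    return not_after - now <= notice


# Example: a certificate expiring 45 days from "now" is inside the window.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = datetime(2024, 2, 15, tzinfo=timezone.utc)
print(needs_renewal(expires, now))  # prints True
```

Run on a schedule against every tracked certificate, a check like this turns renewal into a routine ticket instead of an incident.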

📡 AT&T — SRE / L3 Infrastructure Engineer (Apr 2017 – Oct 2021)

  • L3 support for AT&T Digital Life smart home platform — millions of customers, dual redundant data centers
  • Managed full AWS estate: EC2, S3, RDS, SQS, ELB, VPC, IAM, KMS, CloudWatch, SNS, Auto Scaling, CloudTrail, ACM
  • Contributed to physical data center → Azure migration
  • Managed middleware: Apache Tomcat, Nginx, JBoss — SSL/TLS cert renewals with zero cert-related outages
  • Authored Shell scripts for log rotation, server auto-startup/shutdown, compliance automation
  • DR War Rooms and BigButton failover automation across dual data centers
  • 24/7 on-call rotation — monitoring via CloudWatch, Nagios, Zabbix, Introscope, Netcool

🧩 SRE Philosophy

Traditional SRE asks:  "How do we respond faster when things break?"
My approach asks:      "How do we make the system tell us before it breaks?"

The stack I believe in:

  [Alert Fatigue]  →  AI anomaly detection filters signal from noise
  [Manual toil]    →  Self-healing automation removes humans from the loop
  [Reactive ops]   →  SLO-driven engineering makes reliability measurable
  [Cost blindness] →  FinOps embedded into reliability decisions from day one
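"SLO-driven engineering makes reliability measurable" usually means tracking an error budget and how fast it is being spent. The sketch below uses the standard burn-rate definition (observed error rate divided by the budget). The function names are illustrative and not taken from any repository listed here.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget.

    1.0 means the budget is being spent exactly on schedule; higher values
    mean it will run out before the SLO window ends.
    """
    if total == 0:
        return 0.0
    return (failed / total) / error_budget(slo_target)
```

For example, 5 failures in 1,000 requests against a 99.9% target gives a burn rate of about 5, meaning the budget is being consumed roughly five times faster than the SLO allows.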

✍️ Writing & Thought Leadership

Publishing original SRE + AIOps patterns — not theory, production experience.

| Title | Platform |
| --- | --- |
| 🔥 SLOs for Agentic AI: The Reliability Framework Production Teams Are Missing | DEV Community |
| 🔥 Why SRE Principles Are the Missing Layer in MCP Security | DEV Community |
| ☁️ Single-Agent + MCP: SLOs for Agentic AI on AWS | AWS Community Builders |
| ☁️ Multi-Agent + A2A: The SRE Reliability Framework Nobody Has Written Yet | AWS Community Builders |

🎓 Credentials

| Credential | Issuer |
| --- | --- |
| 🏅 AWS Certified Solutions Architect | Amazon Web Services |
| 🎓 M.S. Information Security | University of the Cumberlands |
| 🎓 M.S. Computer Science | Silicon Valley University |
| 🎓 B.E. Electronics & Communication | JNTUH College of Engineering |
| 🔬 IEEE Senior Member | Institute of Electrical and Electronics Engineers |
| 📐 Fellow SCRS | Society of Clinical Research Sites |
| IET Member | The Institution of Engineering and Technology |

🤝 Let's Connect

LinkedIn Email GitHub


Open to conversations around: Staff SRE · Principal Cloud Architect · Platform Engineering Leadership · AI Infrastructure · DevSecOps · Fintech/Banking Cloud


"The best on-call rotation is the one that never pages — because the system healed itself."



Popular repositories

  1. agentsre (Public)

    SRE reliability instrumentation for agentic AI: DQR, TIE, HER, AQDD

    Python · 317 stars · 177 forks

  2. agentsre-langchain (Public)

    Python · 102 stars · 80 forks

  3. slo-impact (Public)

    User-impact-weighted SLO dashboard: measure true 99.99% uptime by actual customer exposure, not raw availability

    Python · 62 stars · 63 forks

  4. slo-burn (Public)

    sre slo alerting prometheus burn-rate observability devops monitoring reliability python

    Python · 26 stars

  5. dajay-dev-iac (Public)

    HCL

  6. customer-service-srping-boot (Public)

    Java