SYS_STATUS: AVAILABLE
v2.6.0 · BASED IN MEMPHIS, TN · OPEN TO RELOCATION · UPTIME: 8 YRS

Reliability for the
systems that train
and serve modern AI.

I'm Arafath Aamer — a Site Reliability Engineer focused on distributed systems, GPU infrastructure, and Kubernetes at scale. Eight years operating production cloud platforms — from CI/CD and incident response to LLM serving, RAG pipelines, and GPU telemetry. I build the substrate that lets engineering and research teams move fast without breaking prod.

arafath@platform-prod ~ %
  • Years in production ~8yr
  • Clouds operated 3 (aws · gcp · azure)
  • Postmortems authored 40+
  • Stack focus k8s · gpu · llm
  • Status healthy
01 //

Capabilities — what I bring to a platform team

SRE · Platform · AI Infrastructure

Reliability is the product underneath the product.

I work at the intersection of distributed systems, Kubernetes, GPU observability, and AI infrastructure — the layer where training jobs, inference services, and developer platforms have to actually stay up. Below are the practice areas I specialize in, with the kind of work I've shipped behind each one.

End-to-end reliability · SLOs

SLOs that balance velocity and stability.

Define and own SLOs across CI/CD, observability, and runtime layers — including error-budget policies that gate releases without blocking development velocity. Authored 40+ postmortems and drove follow-up fixes to closure.
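The error-budget arithmetic behind that kind of release gating can be sketched in a few lines. This is an illustrative sketch with example numbers (99.9% target, 30-day window), not a policy from any specific system:

```python
def error_budget_remaining(slo_target: float, window_seconds: float,
                           bad_seconds: float) -> float:
    """Fraction of the window's error budget still unspent."""
    budget = (1.0 - slo_target) * window_seconds  # total allowed "bad" time
    return 1.0 - bad_seconds / budget

def burn_rate(slo_target: float, error_ratio: float) -> float:
    """How many times faster than sustainable the budget is burning.
    1.0 means the budget lasts exactly the SLO window."""
    return error_ratio / (1.0 - slo_target)

# A 99.9% SLO over 30 days allows ~43.2 minutes of bad time.
budget_min = (1 - 0.999) * 30 * 24 * 60
print(round(budget_min, 1))  # → 43.2

# A sustained 0.5% error ratio burns that budget 5x too fast —
# the usual trigger for gating releases or paging.
print(round(burn_rate(0.999, 0.005), 1))  # → 5.0
```

A release gate then reduces to a threshold check on `error_budget_remaining`, and multi-window burn-rate alerts to the same math at two lookback windows.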

Distributed systems observability

Telemetry across the full request and training path.

Build observability stacks combining Prometheus, Grafana, OpenTelemetry, and DCGM for both service traffic and GPU health (utilization, ECC, temperature, NVLink, NCCL). Comfortable instrumenting code from the metric definition to the on-call alert.
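Under all of those tools sits one contract: the Prometheus text exposition format. A minimal pure-Python sketch of what a custom exporter emits (the metric names and readings here are hypothetical; real GPU fleets scrape equivalents from dcgm-exporter):

```python
def prom_line(name: str, labels: dict, value: float) -> str:
    """Render one sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

# Hypothetical GPU readings for a single device on one host.
samples = [
    ("gpu_sm_utilization_ratio", {"gpu": "0", "host": "node-a"}, 0.87),
    ("gpu_temperature_celsius",  {"gpu": "0", "host": "node-a"}, 71),
    ("gpu_ecc_errors_total",     {"gpu": "0", "host": "node-a"}, 0),
]
body = "\n".join(prom_line(n, l, v) for n, l, v in samples)
print(body)
```

Everything downstream — recording rules, Grafana panels, the on-call alert — keys off these name/label pairs, which is why getting the label scheme right up front matters more than the dashboard.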

Kubernetes at scale

Operating heterogeneous clusters in production.

Production Kubernetes across multiple clouds — namespace isolation, resource quotas, NetworkPolicies, Pod Security Standards, and tuning for mixed CPU/GPU workloads. Comfortable debugging the parts of K8s that nobody wants to debug.

Incident response · postmortems

Rapid recovery, blameless review, recurrence kill.

On-call rotations covering platform-tier services. Drove incidents from page → resolution → postmortem → systemic fix, including infrastructure-as-code guardrails and automated detection that kept entire failure classes from recurring after their first occurrence.

Tooling & automation

Software written to solve reliability problems.

Production code in Python, Go, and Bash: custom Prometheus exporters, PID-controller-based autoscaling logic, deployment automation, and CLI tools that pay for themselves the first week they're deployed.
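The PID-controller autoscaling idea is simple enough to sketch. This is a toy loop mapping a utilization error to a replica-count delta; the gains and target are illustrative, not tuned values from any production system:

```python
class PIDAutoscaler:
    """Toy PID loop: positive output means scale up, negative means scale down."""

    def __init__(self, target: float, kp: float, ki: float, kd: float):
        self.target, self.kp, self.ki, self.kd = target, kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, observed: float, dt: float = 1.0) -> float:
        error = observed - self.target
        self.integral += error * dt                    # accumulates sustained error
        derivative = (error - self.prev_error) / dt    # reacts to sudden spikes
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PIDAutoscaler(target=0.6, kp=10.0, ki=0.5, kd=1.0)
# Sustained 80% utilization against a 60% target keeps the output positive.
for util in (0.8, 0.8, 0.8):
    delta = pid.step(util)
print(delta > 0)  # → True
```

Compared to threshold-based HPA scaling, the integral term handles sustained drift and the derivative term damps flapping — the usual reasons to write this yourself.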

AI infrastructure

Where infra failures surface as ML behavior.

Operate the LLM serving stack: vLLM/FastAPI inference, RAG with ChromaDB, LoRA/PEFT/DPO fine-tuning, and multi-provider routing. Read training failures the way backend engineers read stack traces — NCCL timeouts, gradient explosions, and checkpoint corruption are infra problems wearing ML clothes.
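At its core, multi-provider routing is ordered fallback with per-provider error handling. A minimal sketch — the provider names and the `call(prompt) -> str` interface are illustrative stand-ins for real API clients:

```python
from typing import Callable

class AllProvidersFailed(Exception):
    pass

def route(prompt: str,
          providers: list[tuple[str, Callable[[str], str]]]) -> tuple[str, str]:
    """Try providers in priority order; return (provider_name, completion)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # real code would distinguish 429s, 5xx, timeouts
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))

# Stub providers standing in for real clients (e.g. Gemini, Claude, DeepSeek).
def flaky(_: str) -> str:
    raise TimeoutError("upstream timeout")

def healthy(prompt: str) -> str:
    return f"echo: {prompt}"

name, out = route("hello", [("gemini", flaky), ("claude", healthy)])
print(name, out)  # → claude echo: hello
```

A production version layers on per-provider circuit breakers and cost/latency-aware ordering, but the control flow stays this shape.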

02 //

Stack — service inventory

Orchestration & Compute

  • Kubernetes DAILY
  • Helm & Kustomize DAILY
  • Argo CD / Argo Workflows PROD
  • Docker / containerd DAILY
  • Linux / systemd DAILY

Observability

  • Prometheus / Mimir DAILY
  • Grafana DAILY
  • OpenTelemetry PROD
  • DCGM (GPU telemetry) PROD
  • Loki / log pipelines PROD

Cloud & Infra

  • AWS (EKS, EC2, IAM) DAILY
  • GCP (GKE, Compute, GPU VMs) DAILY
  • Azure FAMILIAR
  • Terraform PROD
  • GitHub Actions / CI-CD DAILY

AI / ML Platform

  • LLM serving (vLLM, FastAPI) DAILY
  • RAG pipelines (ChromaDB) DAILY
  • LoRA / PEFT / DPO PROD
  • PyTorch PROD
  • Multi-provider routing DAILY

Languages

  • Python DAILY
  • Go PROD
  • Bash DAILY
  • SQL DAILY
  • TypeScript / React PROD

Reliability Practice

  • SLO / SLI design DAILY
  • Incident command PROD
  • Postmortem authoring DAILY
  • Capacity planning PROD
  • Chaos / failure injection PROD

03 //

Selected work — prod incidents survived

2023 → present
Site Reliability Engineer
AutoZone (via Wipro)
Own reliability for retail-scale platform services. SLO design, on-call leadership, observability tooling (Prometheus, Grafana, OpenTelemetry), and incident response. Drove 40+ postmortems to systemic fixes and built automation that eliminated entire failure classes.
Kubernetes · Prometheus · SLOs · On-call · AWS
40+ postmortems
2025 → present
Course Distiller
Independent · Platform & AI Pipeline
Plugin-based agent pipeline that ingests transcripts and code, distills them into textbook chapters, flashcards, and concept maps. Multi-LLM routing (Gemini, Claude, DeepSeek), ChromaDB RAG, FastAPI backend, deployed on GCP with systemd.
FastAPI · GCP · RAG · Multi-LLM · systemd
324 lectures processed
2025
Medical ML Platform
Independent · GPU Inference
MedGemma 4B + Google MedASR served from a Tesla T4 GCP VM via FastAPI + systemd, with API-key auth and HIPAA-aware design. Lazy model loading, auto-unload, bfloat16 + greedy decoding for deterministic medical Q&A.
MedGemma · T4 GPU · FastAPI · HIPAA · Spring Boot
T4 GPU served
2021 → 2023
Software Development Engineer
Grant Thornton
Backend services and DevOps tooling for enterprise audit platforms. Containerized legacy workloads, built CI/CD pipelines, and migrated services to managed Kubernetes.
K8s migration · CI/CD · Backend · Azure
2yr tenure
2018 → 2021
Software Engineer
Web Initiate
Full-stack engineering across backend services, deployment automation, and customer-facing applications. Foundation for the SRE/platform direction.
Backend · Deployment · Foundations
3yr foundation
04 //

Writing — notes from the on-call rotation

05 //

Contact — page me

If you're hiring for the layer
where systems actually run,
let's talk.

Open to SRE, Platform Engineering, and AI Infrastructure roles — particularly on teams running real distributed systems, GPU clusters, or Kubernetes at scale. Available for relocation.

Email me