I'm Arafath Aamer — a Site Reliability Engineer focused on distributed systems, GPU infrastructure, and Kubernetes at scale. Eight years operating production cloud platforms — from CI/CD and incident response to LLM serving, RAG pipelines, and GPU telemetry. I build the substrate that lets engineering and research teams move fast without breaking prod.
Define and own SLOs across CI/CD, observability, and runtime layers — including error-budget policies that gate releases without blocking development velocity. Authored 40+ postmortems and drove follow-up fixes to closure.
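For a taste of what that gating logic looks like: a minimal sketch of an error-budget release gate as a burn-rate check. The SLO target, window math, and numbers are illustrative, not lifted from any production policy.

```python
# Minimal error-budget gate sketch. SLO target, window math, and numbers
# are illustrative; observed_error_ratio would come from your metrics
# backend (e.g. a PromQL query over the SLO window).

SLO_TARGET = 0.999              # 99.9% availability objective
ERROR_BUDGET = 1 - SLO_TARGET   # 0.1% of requests may fail in the window


def release_allowed(observed_error_ratio: float, window_fraction_elapsed: float) -> bool:
    """Block deploys once budget burns faster than the window elapses (burn rate > 1)."""
    budget_consumed = observed_error_ratio / ERROR_BUDGET
    return budget_consumed <= window_fraction_elapsed


# Example: halfway through a 30-day window, 0.04% of requests have failed,
# so 40% of the budget is spent and releases still flow.
print(release_allowed(observed_error_ratio=0.0004, window_fraction_elapsed=0.5))  # True
```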
Build observability stacks combining Prometheus, Grafana, OpenTelemetry, and DCGM for both service traffic and GPU health (utilization, ECC, temperature, NVLink, NCCL). Comfortable instrumenting code from the metric definition to the on-call alert.
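As one concrete slice of that pipeline, a minimal custom exporter sketch in Python. It reads NVML directly via the nvidia-ml-py bindings rather than DCGM (an assumption for brevity; in production you'd more likely scrape NVIDIA's dcgm-exporter), and the metric names are illustrative.

```python
# Minimal custom GPU health exporter: NVML (nvidia-ml-py) + prometheus_client.
# Metric names are illustrative; in production you'd likely run dcgm-exporter.
import time

import pynvml
from prometheus_client import Gauge, start_http_server

GPU_UTIL = Gauge("gpu_utilization_percent", "SM utilization", ["gpu"])
GPU_TEMP = Gauge("gpu_temperature_celsius", "Core temperature", ["gpu"])
GPU_ECC = Gauge("gpu_ecc_uncorrected_total", "Volatile uncorrected ECC errors", ["gpu"])


def collect() -> None:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        gpu = str(i)
        GPU_UTIL.labels(gpu=gpu).set(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
        GPU_TEMP.labels(gpu=gpu).set(
            pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        )
        try:
            GPU_ECC.labels(gpu=gpu).set(
                pynvml.nvmlDeviceGetTotalEccErrors(
                    handle,
                    pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                    pynvml.NVML_VOLATILE_ECC,
                )
            )
        except pynvml.NVMLError:
            pass  # ECC counters unsupported on this GPU (e.g. consumer cards)


if __name__ == "__main__":
    pynvml.nvmlInit()
    start_http_server(9400)  # exposes /metrics for Prometheus to scrape
    while True:
        collect()
        time.sleep(15)
```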
Production Kubernetes across multiple clouds — namespace isolation, resource quotas, NetworkPolicies, PodSecurity standards, and tuning for mixed CPU/GPU workloads. Comfortable debugging the parts of K8s that nobody wants to debug.
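A small sketch of the quota side, using the official kubernetes Python client. The namespace, quota name, and limits are illustrative, and in practice this would live in declarative IaC rather than an imperative script.

```python
# Sketch: cap CPU, memory, and GPU consumption per team namespace with a
# ResourceQuota. Namespace, name, and limits are illustrative.
from kubernetes import client, config


def apply_gpu_quota(namespace: str = "team-research") -> None:
    config.load_kube_config()  # or load_incluster_config() inside a pod
    quota = client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="gpu-quota"),
        spec=client.V1ResourceQuotaSpec(
            hard={
                "requests.cpu": "64",
                "requests.memory": "256Gi",
                "requests.nvidia.com/gpu": "8",  # extended-resource quota
            }
        ),
    )
    client.CoreV1Api().create_namespaced_resource_quota(namespace, quota)


if __name__ == "__main__":
    apply_gpu_quota()
```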
On-call rotations covering platform-tier services. Drove incidents from page → resolution → postmortem → systemic fix, including infrastructure-as-code guardrails and automated detection that catch an entire failure class after its first occurrence.
Production code in Python, Go, and Bash: custom Prometheus exporters, PID-controller-based autoscaling logic, deployment automation, and CLI tools that pay for themselves the first week they're deployed.
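To make the PID part concrete, a minimal sketch of the control loop. The gains, setpoint, and the hypothetical get_queue_depth()/set_replicas() hooks are all illustrative; a real controller would also clamp the integral term and rate-limit scaling.

```python
# Minimal PID autoscaling loop sketch. Gains, setpoint, and the
# get_queue_depth()/set_replicas() hooks are hypothetical.
from dataclasses import dataclass


@dataclass
class PID:
    kp: float
    ki: float
    kd: float
    setpoint: float
    _integral: float = 0.0
    _prev_error: float = 0.0

    def update(self, measured: float, dt: float) -> float:
        """Control signal from current, accumulated, and trending error."""
        error = measured - self.setpoint
        self._integral += error * dt
        derivative = (error - self._prev_error) / dt
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative


def autoscale(pid: PID, current_replicas: int, queue_depth: float, dt: float,
              min_replicas: int = 1, max_replicas: int = 64) -> int:
    """Translate the controller output into a clamped replica count."""
    desired = current_replicas + round(pid.update(queue_depth, dt))
    return max(min_replicas, min(max_replicas, desired))


# Example: target 100 pending requests per pool; the queue is running hot at 250.
# In a real loop: queue_depth = get_queue_depth(); set_replicas(new_count)
controller = PID(kp=0.02, ki=0.001, kd=0.0, setpoint=100.0)
print(autoscale(controller, current_replicas=4, queue_depth=250.0, dt=15.0))  # 9
```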
Operate the LLM serving stack: vLLM/FastAPI inference, RAG with ChromaDB, LoRA/PEFT/DPO fine-tuning, and multi-provider routing. Read training failures the way backend engineers read stack traces: NCCL timeouts, gradient explosions, and checkpoint corruption are infra problems wearing ML clothes.
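For the serving side, a minimal sketch of the vLLM-behind-FastAPI shape. The model name and route are assumptions, and production serving would use vLLM's async engine (or its bundled OpenAI-compatible server) rather than the synchronous LLM class shown here.

```python
# Minimal vLLM-behind-FastAPI sketch. Model name and route are illustrative;
# production would use vLLM's async engine or its OpenAI-compatible server.
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # loaded once at startup


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7


@app.post("/generate")
def generate(req: GenerateRequest) -> dict:
    params = SamplingParams(temperature=req.temperature, max_tokens=req.max_tokens)
    outputs = llm.generate([req.prompt], params)
    return {"text": outputs[0].outputs[0].text}
```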
Coming soon: a pattern I've seen repeatedly, where training jobs hang in ways that look like model bugs but are actually network or scheduler problems.
GPU telemetry isn't just "GPU util." A field guide to ECC errors, NVLink saturation, and the metrics that predict thermal throttling before it happens.
Job-completion SLOs and per-step latency SLOs measure different failure modes. How to pick the right one and what each one tells you about your platform.
Open to SRE, Platform Engineering, and AI Infrastructure roles — particularly on teams running real distributed systems, GPU clusters, or Kubernetes at scale. Available for relocation.
Email me →