I'm Arafath Aamer — a Site Reliability Engineer focused on distributed systems, GPU infrastructure, and Kubernetes at scale. Eight years operating production cloud platforms — from CI/CD and incident response to LLM serving, RAG pipelines, and GPU telemetry. I build the substrate that lets engineering and research teams move fast without breaking prod.
Define and own SLOs across CI/CD, observability, and runtime layers — including error-budget policies that gate releases without blocking development velocity. Authored 40+ postmortems and drove follow-up fixes to closure.
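For a taste of what that gating logic looks like: a minimal sketch of an error-budget release gate as a burn-rate check. The SLO target, window math, and numbers are illustrative, not lifted from any production policy.

```python
# Minimal error-budget gate sketch. SLO target, window math, and numbers
# are illustrative; observed_error_ratio would come from your metrics
# backend (e.g. a PromQL query over the SLO window).

SLO_TARGET = 0.999              # 99.9% availability objective
ERROR_BUDGET = 1 - SLO_TARGET   # 0.1% of requests may fail in the window


def release_allowed(observed_error_ratio: float, window_fraction_elapsed: float) -> bool:
    """Block deploys once budget burns faster than the window elapses (burn rate > 1)."""
    budget_consumed = observed_error_ratio / ERROR_BUDGET
    return budget_consumed <= window_fraction_elapsed


# Example: halfway through a 30-day window, 0.04% of requests have failed,
# so 40% of the budget is spent and releases still flow.
print(release_allowed(observed_error_ratio=0.0004, window_fraction_elapsed=0.5))  # True
```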
Build observability stacks combining Prometheus, Grafana, OpenTelemetry, and DCGM for both service traffic and GPU health (utilization, ECC, temperature, NVLink, NCCL). Comfortable instrumenting code from the metric definition to the on-call alert.
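As one concrete slice of that pipeline, a minimal custom exporter sketch in Python. It reads NVML directly via the nvidia-ml-py bindings rather than DCGM (an assumption for brevity; in production you'd more likely scrape NVIDIA's dcgm-exporter), and the metric names are illustrative.

```python
# Minimal custom GPU health exporter: NVML (nvidia-ml-py) + prometheus_client.
# Metric names are illustrative; in production you'd likely run dcgm-exporter.
import time

import pynvml
from prometheus_client import Gauge, start_http_server

GPU_UTIL = Gauge("gpu_utilization_percent", "SM utilization", ["gpu"])
GPU_TEMP = Gauge("gpu_temperature_celsius", "Core temperature", ["gpu"])
GPU_ECC = Gauge("gpu_ecc_uncorrected_total", "Volatile uncorrected ECC errors", ["gpu"])


def collect() -> None:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        gpu = str(i)
        GPU_UTIL.labels(gpu=gpu).set(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
        GPU_TEMP.labels(gpu=gpu).set(
            pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        )
        try:
            GPU_ECC.labels(gpu=gpu).set(
                pynvml.nvmlDeviceGetTotalEccErrors(
                    handle,
                    pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                    pynvml.NVML_VOLATILE_ECC,
                )
            )
        except pynvml.NVMLError:
            pass  # ECC counters unsupported on this GPU (e.g. consumer cards)


if __name__ == "__main__":
    pynvml.nvmlInit()
    start_http_server(9400)  # exposes /metrics for Prometheus to scrape
    while True:
        collect()
        time.sleep(15)
```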
Production Kubernetes across multiple clouds — namespace isolation, resource quotas, NetworkPolicies, PodSecurity standards, and tuning for mixed CPU/GPU workloads. Comfortable debugging the parts of K8s that nobody wants to debug.
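A small sketch of the quota side, using the official kubernetes Python client. The namespace, quota name, and limits are illustrative, and in practice this would live in declarative IaC rather than an imperative script.

```python
# Sketch: cap CPU, memory, and GPU consumption per team namespace with a
# ResourceQuota. Namespace, name, and limits are illustrative.
from kubernetes import client, config


def apply_gpu_quota(namespace: str = "team-research") -> None:
    config.load_kube_config()  # or load_incluster_config() inside a pod
    quota = client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="gpu-quota"),
        spec=client.V1ResourceQuotaSpec(
            hard={
                "requests.cpu": "64",
                "requests.memory": "256Gi",
                "requests.nvidia.com/gpu": "8",  # extended-resource quota
            }
        ),
    )
    client.CoreV1Api().create_namespaced_resource_quota(namespace, quota)


if __name__ == "__main__":
    apply_gpu_quota()
```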
On-call rotations covering platform-tier services. Drove incidents from page → resolution → postmortem → systemic fix, including infrastructure-as-code guardrails and automated detection that catch an entire failure class after its first occurrence.
Production code in Python, Go, and Bash: custom Prometheus exporters, PID-controller-based autoscaling logic, deployment automation, and CLI tools that pay for themselves the first week they're deployed.
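To make the PID part concrete, a minimal sketch of the control loop. The gains, setpoint, and the hypothetical get_queue_depth()/set_replicas() hooks are all illustrative; a real controller would also clamp the integral term and rate-limit scaling.

```python
# Minimal PID autoscaling loop sketch. Gains, setpoint, and the
# get_queue_depth()/set_replicas() hooks are hypothetical.
from dataclasses import dataclass


@dataclass
class PID:
    kp: float
    ki: float
    kd: float
    setpoint: float
    _integral: float = 0.0
    _prev_error: float = 0.0

    def update(self, measured: float, dt: float) -> float:
        """Control signal from current, accumulated, and trending error."""
        error = measured - self.setpoint
        self._integral += error * dt
        derivative = (error - self._prev_error) / dt
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative


def autoscale(pid: PID, current_replicas: int, queue_depth: float, dt: float,
              min_replicas: int = 1, max_replicas: int = 64) -> int:
    """Translate the controller output into a clamped replica count."""
    desired = current_replicas + round(pid.update(queue_depth, dt))
    return max(min_replicas, min(max_replicas, desired))


# Example: target 100 pending requests per pool; the queue is running hot at 250.
# In a real loop: queue_depth = get_queue_depth(); set_replicas(new_count)
controller = PID(kp=0.02, ki=0.001, kd=0.0, setpoint=100.0)
print(autoscale(controller, current_replicas=4, queue_depth=250.0, dt=15.0))  # 9
```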
Operate the LLM serving stack: vLLM/FastAPI inference, RAG with ChromaDB, LoRA/PEFT/DPO fine-tuning, and multi-provider routing. Read training failures the way backend engineers read stack traces: NCCL timeouts, gradient explosions, and checkpoint corruption are infra problems wearing ML clothes.
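For the serving side, a minimal sketch of the vLLM-behind-FastAPI shape. The model name and route are assumptions, and production serving would use vLLM's async engine (or its bundled OpenAI-compatible server) rather than the synchronous LLM class shown here.

```python
# Minimal vLLM-behind-FastAPI sketch. Model name and route are illustrative;
# production would use vLLM's async engine or its OpenAI-compatible server.
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # loaded once at startup


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7


@app.post("/generate")
def generate(req: GenerateRequest) -> dict:
    params = SamplingParams(temperature=req.temperature, max_tokens=req.max_tokens)
    outputs = llm.generate([req.prompt], params)
    return {"text": outputs[0].outputs[0].text}
```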
Coming soon: a pattern I've seen repeatedly, where training jobs hang in ways that look like model bugs but are actually network or scheduler problems.
GPU telemetry isn't just "GPU util." A field guide to ECC errors, NVLink saturation, and the metrics that predict thermal throttling before it happens.
Job-completion SLOs and per-step latency SLOs measure different failure modes. How to pick the right one and what each one tells you about your platform.
Open to SRE, Platform Engineering, and AI Infrastructure roles — particularly on teams running real distributed systems, GPU clusters, or Kubernetes at scale. Available for relocation.
Email me →