STATUS: graduating soon, accepting offers, mildly caffeinated

Dibya Darshan Khanal

SRE · DevOps · AIOps · Cloud · Data & Ops

I keep production from catching fire, automate the boring parts, and build AI systems that pull their weight. Currently finishing a Master’s in Computer Science at the University of Cincinnati and looking for full-time roles where pagers go off and somebody wants them to stop.

hire me, probably · see my AI receipts

5+

years in production

99.95%

SLO defended

100K+

users / hr served

batsignal owned

about / disclaimer

I’m an SRE / DevOps / AIOps engineer with five-plus years of running cloud infrastructure that refuses to fall over. AWS, Azure, Kubernetes, Terraform, observability stacks, the occasional three a.m. PagerDuty page that turns out to be DNS. Always DNS.

Right now I’m wrapping up an MS in Computer Science at the University of Cincinnati while moonlighting as an AI Operations researcher at the P&G Digital Accelerator, where I build agentic LLM systems that are smarter than me on a good day.

I’m looking for full-time roles in Cloud, DevOps, AIOps, Data, Operations, and SRE. If you have a fleet of microservices, a flaky deploy pipeline, or a Grafana dashboard that nobody reads, we should talk.

experience / war stories

Jul 2025 — Present
Research Assistant — AI Operations Engineer

P&G Digital Accelerator @ University of Cincinnati · Cincinnati, OH
- Designed an agentic LLM application with RAG over enterprise knowledge bases using LangChain, vector DBs, Python and Streamlit.
- Architected client-isolated, autoscaling deployments on Azure with immutable infra and load balancing.
- Built an end-to-end ETL pipeline in Databricks for unstructured data and downstream MLOps.
- Wrote a custom auth and usage-tracking wrapper for zero-trust access and per-tenant LLM cost attribution.
- Built load and performance test harnesses so SLOs aren’t just vibes.

Dec 2022 — Jan 2025
Site Reliability Engineer

UBA Solutions Pvt Ltd. · Kathmandu, Nepal
- Built an observability stack with Datadog, ELK, Grafana, Splunk and OpenTelemetry, improving alert accuracy by 40% and cutting MTTR by 25%.
- Maintained 99.95% uptime for a SaaS platform serving 100K+ users/hr across 3 AWS regions with multi-region failover.
- Ran 50+ workloads on Amazon EKS with Istio, canary deploys and policy-driven networking.
- Wrote self-healing runbooks and toil-killing automation that cut on-call manual work by 35%. Won the 2023 Above and Beyond Award.
- Defined SLOs, SLIs and error budgets like a responsible adult.
- Ran chaos engineering with Gremlin and home-grown fault injectors because hope is not a strategy.

Aug 2021 — Dec 2022
Associate Cloud & DevOps Engineer

Cloudlaya LLC · Kathmandu, Nepal
- Automated CI/CD for 50+ apps with Jenkins and AWS CodePipeline; deployment time down 80%.
- Migrated 20+ services to AWS ECS Fargate via Terraform and CloudFormation; infra cost down 30%.
- Refactored Spring Boot, PHP and Node.js apps with DevSecOps practices; security risk down 40%.
- Standardized reusable Terraform modules and GitFlow across 15+ concurrent projects.
- Supported SOC 2 and security audits with documented infra and access controls.

Apr 2020 — Jul 2021
Cloud & DevOps Engineer Intern

Cloudlaya LLC · Kathmandu, Nepal
- Operated 100+ production web apps at 99.9% availability with sub-15-minute MTTA.
- Managed DNS, SSL/TLS, Zimbra mail and Linux hosting for smooth client onboarding.

skills / loadout

Cloud & Infra

AWS (EC2, S3, EBS, EFS, Lambda, ECS, EKS, RDS, Auto Scaling, API Gateway, CloudFront, CloudTrail) · Azure · GCP (Compute Engine, GKE, Cloud Run, Vertex AI, Cloud Storage, IAM, Cloud Monitoring) · Nginx · DNS · SSL/TLS

CI/CD & IaC

Jenkins · GitHub Actions · GitLab CI · ArgoCD · GitOps · Terraform · Ansible · CloudFormation · Helm · Docker · Kubernetes

Observability

Datadog · Dynatrace · Prometheus · Grafana · OpenTelemetry · ELK · Splunk · CloudWatch · PagerDuty · Distributed Tracing

Reliability

Chaos engineering · Incident management · SLO/SLI design · Error budgets · On-call · Capacity planning

AI / ML / Data

LLMs · RAG · LangChain · Transformers · Prompt engineering · Vector DBs · MLOps · SageMaker · TensorFlow · ETL Pipelines

Languages & Data

Python · Bash · JavaScript · SQL · PostgreSQL · MongoDB · Redis

projects / artifacts

research · Dec 2025 – Present

LLM-Droid-Tester

LLM-driven Android bug reproduction. Hooks into an emulator, annotates screenshots with bounding boxes, and lets Gemini 2.5 Pro follow human-written bug reports.

Python · Gemini 2.5 Pro · ADB · Vision

chaos · Mar 2026

K8s Resilience Testing Framework

Chaos toolkit for Kubernetes. Pod kills, network partitions, resource pressure. Ships with pass/fail SLO gates inside CI/CD.

Kubernetes · Chaos · CI gates

observability · Oct 2024

Datadog Synthetic Monitor Validator

Health-check and alert validator with anomaly detection. Cuts false positives, tracks error budgets, routes incidents to Slack and PagerDuty.

Datadog API · Python · PagerDuty

platform · Apr 2022

Java Spring Boot CI/CD on EKS

End-to-end pipeline with Jenkins, SonarQube, Docker, Terraform and Helm. Blue-green and canary, container scanning, IaC, the whole production cosplay.

Jenkins · EKS · Terraform · Helm

AI burn rate / receipts

A fully transparent, slightly embarrassing accounting of how much I’ve fed to large language models so they could autocomplete me into a better engineer. Token estimates are blended input/output at public list prices, give or take a few hundred million tokens. No, my parents do not know.

$4,420 spent · across 5 platforms · still counting

2.1B approx. tokens consumed

$1,500

≈ 50M tokens

Heavy lifting. Architecture, refactors, and the answers I tell people I figured out myself.

$1,000

≈ 167M tokens

The everyday workhorse. Pull requests, scripts, “why is this YAML lying to me”.

$620

≈ 1.24B tokens

Bulk reasoning at suspiciously good per-token pricing. I asked, it answered, my CFO did not.

$400

≈ 500M tokens

Multi-modal models reading PDFs, screenshots, and the occasional whiteboard photo from 2 a.m.

$600

≈ 120M tokens

Powering LLM-Droid-Tester, a published research project, and one year of “explain this paper to me at midnight”.

$300

≈ 30M tokens

Since launch. ChatGPT Plus, Pro, the API, and one impulsive month of GPT-realtime.

* Token counts are blended estimates based on public list pricing for each model family. Actual numbers vary, but the receipts are real and increasingly difficult to explain.
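For anyone checking the math, here is a rough sketch of the arithmetic behind these estimates. It assumes each token count is simply spend divided by one blended price per token, which is a simplification of "blended input/output at public list prices". The labels, `RECEIPTS`, and both function names are placeholders for illustration, since the text does not name the platform behind each line item.

```python
# Spend (USD) and token estimates copied from the line items above.
# Labels are placeholders: the text does not name the platform for each line.
RECEIPTS = [
    ("line item 1", 1_500, 50_000_000),
    ("line item 2", 1_000, 167_000_000),
    ("line item 3", 620, 1_240_000_000),
    ("line item 4", 400, 500_000_000),
    ("line item 5", 600, 120_000_000),
    ("line item 6", 300, 30_000_000),
]


def implied_price_per_million(spend_usd, tokens):
    """Blended USD price per one million tokens implied by one line item."""
    return spend_usd / (tokens / 1_000_000)


def totals(receipts):
    """Return (total spend in USD, total estimated tokens)."""
    total_spend = sum(spend for _, spend, _ in receipts)
    total_tokens = sum(tokens for _, _, tokens in receipts)
    return total_spend, total_tokens


if __name__ == "__main__":
    for label, spend, tokens in RECEIPTS:
        rate = implied_price_per_million(spend, tokens)
        print(f"{label}: ${spend:,} / {tokens:,} tokens = ${rate:.2f} per 1M tokens")
    spend, tokens = totals(RECEIPTS)
    print(f"total: ${spend:,} for about {tokens / 1e9:.2f}B tokens")
```

On the figures listed, the six line items add up to $4,420 and about 2.1B tokens, with implied blended rates ranging from $0.50 to $30 per million tokens.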

education

Jan 2025 — Present

M.S. Computer Science

University of Cincinnati · Cincinnati, OH

Cloud Computing · Advanced Algorithms · Machine Learning · Artificial Intelligence · Data Analysis

Aug 2017 — Aug 2021

B.Sc. CSIT

Tribhuvan University · Nepal

Operating Systems · Computer Networks · DBMS · Data Structures

certs & honors

AWS Certified Solutions Architect — Associate
AWS Community Builder — Cloud Operations Track
2023 Above and Beyond Award · UBA Solutions
Graduate Student Judge, CEAS Expo 2026 · UC

contact / open a ticket

Currently in Cincinnati, OH. Open to relocation, remote, or a Bat-Signal in the night sky.

email dibya.ddk@gmail.com
linkedin /in/dibya-darshan-khanal
github @Dibae101
resume Khanal_Dibya_2026.pdf