STATUS: graduating soon, accepting offers, mildly caffeinated

Dibya Darshan Khanal

SRE · DevOps · AIOps · Cloud · Data & Ops

I keep production from catching fire, automate the boring parts, and occasionally let a large language model do my homework. Currently finishing a Master’s in Computer Science at the University of Cincinnati and looking for full-time roles where pagers go off and somebody wants them to stop.

hire me, probably · see my AI receipts
5+ years in production
99.95% SLO defended
100K+ users / hr served
1 batsignal owned

01

about / disclaimer

I’m an SRE / DevOps / AIOps engineer with five-plus years of running cloud infrastructure that refuses to fall over. AWS, Azure, Kubernetes, Terraform, observability stacks, the occasional three a.m. PagerDuty page that turns out to be DNS. Always DNS.

Right now I’m wrapping up an MS in Computer Science at the University of Cincinnati while moonlighting as an AI Operations researcher at the P&G Digital Accelerator, where I build agentic LLM systems that are smarter than me on a good day.

I’m looking for full-time roles in Cloud, DevOps, AIOps, Data, Operations, and SRE. If you have a fleet of microservices, a flaky deploy pipeline, or a Grafana dashboard that nobody reads, we should talk.

02

experience / war stories

  1. Jul 2025 — Present

    Research Assistant — AI Operations Engineer

    P&G Digital Accelerator @ University of Cincinnati · Cincinnati, OH

    • Designed an agentic LLM application with RAG over enterprise knowledge bases using LangChain, vector DBs, Python and Streamlit.
    • Architected client-isolated, autoscaling deployments on Azure with immutable infra and load balancing.
    • Built an end-to-end ETL pipeline in Databricks for unstructured data and downstream MLOps.
    • Wrote a custom auth and usage-tracking wrapper for zero-trust access and per-tenant LLM cost attribution.
    • Built load and performance test harnesses so SLOs aren’t just vibes.
  2. Dec 2022 — Jan 2025

    Site Reliability Engineer

    UBA Solutions Pvt Ltd. · Kathmandu, Nepal

    • Built an observability stack with Datadog, ELK, Grafana, Splunk and OpenTelemetry — 40% better alert accuracy, 25% lower MTTR.
    • Maintained 99.95% uptime for a SaaS platform serving 100K+ users/hr across 3 AWS regions with multi-region failover.
    • Ran 50+ workloads on Amazon EKS with Istio, canary deploys and policy-driven networking.
    • Wrote self-healing runbooks and toil-killing automation that cut on-call manual work by 35%. Won the 2023 Above and Beyond Award.
    • Defined SLOs, SLIs and error budgets like a responsible adult.
    • Ran chaos engineering with Gremlin and home-grown fault injectors because hope is not a strategy.
  3. Apr 2020 — Dec 2022

    Associate Cloud & DevOps Engineer

    Cloudlaya LLC · San Francisco, CA

    • Operated 100+ production web apps at 99.9% availability with sub-15-minute MTTA.
    • Automated CI/CD for 50+ apps with Jenkins and AWS CodePipeline; deployment time down 80%.
    • Migrated 20+ services to AWS ECS Fargate via Terraform and CloudFormation; infra cost down 30%.
    • Refactored Spring Boot, PHP and Node.js apps with DevSecOps practices; security risk down 40%.
    • Standardized Terraform modules and GitFlow across 15+ concurrent projects.
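
The availability figures quoted in the roles above (99.95% and 99.9%) are easier to reason about as downtime budgets. The following is a minimal sketch of that arithmetic, not code from any of the listed projects. The function name `downtime_budget_minutes` and the 30-day window are assumptions made for illustration.

```python
# Sketch: convert an availability target into an allowed-downtime budget.
# The 30-day window is an assumption; the source does not state one.

def downtime_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO allows over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    for target in (99.95, 99.9):
        budget = downtime_budget_minutes(target)
        print(f"{target}% -> {budget:.1f} min of downtime per 30 days")
```

Under that assumed window, 99.95% leaves about 21.6 minutes of downtime per 30 days and 99.9% about 43.2 minutes, which is the error budget that on-call work and chaos experiments draw from.
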
03

skills / loadout

Cloud & Infra

AWS (EC2, S3, Lambda, ECS, EKS, RDS, ASG, SNS, Secrets Manager) · Azure · GCP · Nginx · DNS · SSL/TLS

CI/CD & IaC

Jenkins · GitHub Actions · GitLab CI · ArgoCD · Terraform · Ansible · CloudFormation · Helm · Docker · Kubernetes

Observability

Datadog · CloudWatch · Splunk · ELK · Grafana · Prometheus · OpenTelemetry · PagerDuty · SLO/SLI

Reliability

Chaos engineering · Incident management · Capacity planning · Error budgets · On-call · MOPs/SOPs/EOPs

AI / ML / Data

LLMs · LangChain · RAG · Prompt engineering · TensorFlow · scikit-learn · SageMaker · Databricks · Vector DBs · MLOps

Languages & Data

Python · Bash · JavaScript · Groovy · SQL · PostgreSQL · MongoDB · Redis

04

projects / artifacts

research

CARBON

LLM-driven Android bug reproduction. Hooks into an emulator, screenshots with bounding boxes, and lets Gemini 2.5 Pro follow human bug reports.

Python · Gemini 2.5 Pro · ADB · Vision

platform

Java Spring Boot CI/CD on EKS

End-to-end pipeline with Jenkins, SonarQube, Docker, Terraform and Helm. Blue-green and canary, container scanning, IaC, the whole production cosplay.

Jenkins · EKS · Terraform · Helm

observability

Datadog Synthetic Monitor Validator

Health-check and alert validator with anomaly detection. Cuts false positives, tracks error budgets, routes incidents to Slack and PagerDuty.

Datadog API · Python · PagerDuty

chaos

K8s Resilience Testing Framework

Chaos toolkit for Kubernetes. Pod kills, network partitions, resource pressure. Ships with pass/fail SLO gates inside CI/CD.

Kubernetes · Chaos · CI gates
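
The pass/fail SLO gate mentioned for the K8s Resilience Testing Framework can be illustrated with a small sketch. This is not the framework's actual code: `slo_gate` and `gate_exit_code` are hypothetical names, and the request counts are made-up inputs standing in for whatever a chaos run would measure.

```python
# Sketch of a pass/fail SLO gate for a CI step (hypothetical helper names).
# Inputs are request counts observed while a fault was being injected.

def slo_gate(successes: int, total: int, slo_percent: float) -> bool:
    """Return True when the observed success rate meets the SLO target."""
    if total <= 0:
        # No traffic was observed, so the run proves nothing: fail the gate.
        return False
    return 100 * successes / total >= slo_percent


def gate_exit_code(successes: int, total: int, slo_percent: float) -> int:
    """Map the gate result to a process exit code (0 = pass, 1 = fail)."""
    return 0 if slo_gate(successes, total, slo_percent) else 1


if __name__ == "__main__":
    # Illustrative numbers only, not results from a real experiment.
    for ok, seen in ((9_999, 10_000), (9_990, 10_000)):
        verdict = "PASS" if slo_gate(ok, seen, 99.95) else "FAIL"
        print(f"{ok}/{seen} successful requests against 99.95%: {verdict}")
```

A CI job could pass the value from `gate_exit_code` to `sys.exit`, so a run that breaches the target fails the pipeline stage instead of only logging a warning.
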
05

AI burn rate / receipts

A fully transparent, slightly embarrassing accounting of how much I’ve fed to large language models so they could autocomplete me into a better engineer. Token estimates are blended input/output at public list prices, give or take a few hundred million tokens. No, my parents do not know.

total

grand total

$4,420
spent · across 5 platforms · still counting
2.1B approx. tokens consumed

Kiro · Claude Opus 4.7 ~34%

$1,500

≈ 50M tokens

Heavy lifting. Architecture, refactors, and the answers I tell people I figured out myself.

Kiro · Claude Sonnet 4.7 ~23%

$1,000

≈ 167M tokens

The everyday workhorse. Pull requests, scripts, “why is this YAML lying to me”.

AWS Bedrock · DeepSeek v3.2 ~14%

$620

≈ 1.24B tokens

Bulk reasoning at suspiciously good per-token pricing. I asked, it answered, my CFO did not.

Bedrock · OCR & Vision ~9%

$400

≈ 500M tokens

Multi-modal models reading PDFs, screenshots, and the occasional whiteboard photo from 2 a.m.

Gemini 2.5 Pro ~14%

$600

≈ 120M tokens

Powering CARBON, a published research project, and one year of “explain this paper to me at midnight”.

OpenAI subscriptions ~7%

$300

≈ 30M tokens

Since launch. ChatGPT Plus, Pro, the API, and one impulsive month of GPT-realtime.

* Token counts are blended estimates based on public list pricing for each model family. Actual numbers vary, but the receipts are real and increasingly difficult to explain.
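
For anyone checking the math, the receipts can be re-derived from the per-platform figures listed above. The following is a minimal sketch: the dollar and token values are copied from this section, while the helper names `blended_price_per_million` and `summarize` are assumptions made for illustration.

```python
# Figures copied from the per-platform receipts above:
# (label, dollars spent, approximate tokens).
RECEIPTS = [
    ("Kiro · Claude Opus 4.7", 1500, 50_000_000),
    ("Kiro · Claude Sonnet 4.7", 1000, 167_000_000),
    ("AWS Bedrock · DeepSeek v3.2", 620, 1_240_000_000),
    ("Bedrock · OCR & Vision", 400, 500_000_000),
    ("Gemini 2.5 Pro", 600, 120_000_000),
    ("OpenAI subscriptions", 300, 30_000_000),
]


def blended_price_per_million(dollars: float, tokens: int) -> float:
    """Dollars per one million tokens, input and output lumped together."""
    return dollars / (tokens / 1_000_000)


def summarize(receipts):
    """Return (total dollars, total tokens, per-line rows)."""
    total_dollars = sum(dollars for _, dollars, _ in receipts)
    total_tokens = sum(tokens for _, _, tokens in receipts)
    rows = [
        (label, round(100 * dollars / total_dollars),
         blended_price_per_million(dollars, tokens))
        for label, dollars, tokens in receipts
    ]
    return total_dollars, total_tokens, rows


if __name__ == "__main__":
    total_dollars, total_tokens, rows = summarize(RECEIPTS)
    print(f"total: ${total_dollars:,} for about {total_tokens / 1e9:.2f}B tokens")
    for label, share, price in rows:
        print(f"{label}: ~{share}% of spend, about ${price:.2f} per 1M tokens")
```

On these figures the line items sum to $4,420 and roughly 2.1B tokens, and the rounded shares reproduce the percentages shown next to each platform.
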

06

education

Jan 2025 — Present

M.S. Computer Science

University of Cincinnati · Cincinnati, OH

Cloud Computing · Advanced Algorithms · Machine Learning · Artificial Intelligence · Data Analysis

Aug 2017 — Aug 2021

B.Sc. Computer Science and Information Technology (CSIT)

Tribhuvan University · Nepal

Operating Systems · Computer Networks · DBMS · Data Structures

certs & honors

07

contact / open a ticket

Currently in Cincinnati, OH. Open to relocation, remote, or a Bat-Signal in the night sky.