I build a layer between Research and Reality — disaggregating inference across GPU clusters, profiling the exact layer that ate your memory budget, and building communication fabric that runs at RDMA speeds.
Most ML work stops at the model. I work on what happens after the model exists — how it runs across a 64-GPU cluster, why it stalls at layer 18, and whether it can reach an ARM chip without losing accuracy.
At MulticoreWare, I'm part of a Microsoft collaboration rethinking how large language models are served at scale. Attention and FFN operations have fundamentally different hardware needs — memory-bandwidth vs. raw FLOPS. Forcing them to share the same GPU wastes both. My work disaggregates them physically, then builds the RDMA communication fabric that makes the separation invisible to the inference runtime.
Before this, I spent five months at IIIT Hyderabad under Dr. CK Raju, pulling apart Transformer internals — not the API, the math. That habit of reasoning from first principles now shapes how I debug distributed systems: I start with the linear algebra, not the error message.
Outside production work, I run a personal lab — building AI experiments at the intersection of human memory, multi-agent autonomy, and the problems people have but haven't articulated yet.
Three projects spanning distributed GPU infrastructure, cross-platform ML profiling, and real-time computer vision — each production-facing, each solving a problem without an off-the-shelf answer.
"Monolithic LLM serving forces memory-bound Attention and compute-bound FFN to share the same GPUs — neither can be scaled independently. Hard wall around 70B parameters on a single node."
Implementing Attention-FFN Disaggregation (AFD) inside SGLang: routing Attention to HBM-dense GPU groups and FFN to FLOP-dense ones, then building the A2F/F2A (Attention-to-FFN and FFN-to-Attention) communication layer that stitches them back together without visible latency.
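A back-of-envelope roofline estimate shows why the two operator types want different hardware. The sketch below is an illustration, not project code: it assumes fp16 weights and KV cache, single-token decode, an ungated two-matrix FFN, and no grouped-query attention (which would raise attention intensity by the group size). The function names and example sizes are hypothetical.

```python
def attention_decode_intensity(seq_len: int, d_model: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for one decode step of attention, for a single request."""
    # q·K^T and scores·V: two matmuls, 2 FLOPs per multiply-add each.
    flops = 2 * 2 * seq_len * d_model
    # The K and V caches are both read once per generated token.
    bytes_moved = 2 * seq_len * d_model * bytes_per_elem
    return flops / bytes_moved


def ffn_decode_intensity(batch: int, d_model: int, d_ff: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for one decode step of the FFN over a whole batch."""
    weights = 2 * d_model * d_ff             # up and down projections
    flops = 2 * batch * weights              # every token multiplies every weight
    bytes_moved = weights * bytes_per_elem   # weights are read once per batch
    return flops / bytes_moved


print(attention_decode_intensity(seq_len=4096, d_model=8192))    # 1.0
print(ffn_decode_intensity(batch=64, d_model=8192, d_ff=28672))  # 64.0
```

Under these assumptions attention stays near 1 FLOP per byte however many requests are batched, because each request reads its own KV cache, while FFN intensity grows linearly with batch size. That gap is the reason to scale the two groups independently.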
Micro-batch overlapping schedules communication to run behind active compute — neither GPU group ever idles at a communication boundary. The scheduler is the real engineering contribution.
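The effect of overlapping can be seen in a toy timeline model. This is a simplified sketch, not the SGLang scheduler: it assumes four independent resources (attention GPUs, the A2F link, FFN GPUs, the F2A link), each handling one micro-batch at a time, with fixed per-stage costs. All names and timings are hypothetical.

```python
def serialized(num_microbatches, num_layers, t_attn, t_ffn, t_comm):
    """Makespan when nothing overlaps: every stage waits for the previous one."""
    return num_microbatches * num_layers * (t_attn + t_ffn + 2 * t_comm)


def simulate_overlap(num_microbatches, num_layers, t_attn, t_ffn, t_comm):
    """Makespan when micro-batches pipeline through attn -> A2F -> ffn -> F2A."""
    cost = {"attn": t_attn, "a2f": t_comm, "ffn": t_ffn, "f2a": t_comm}
    free_at = {stage: 0.0 for stage in cost}   # when each resource is next idle
    ready_at = [0.0] * num_microbatches        # when each micro-batch can proceed
    for _ in range(num_layers):
        for stage in ("attn", "a2f", "ffn", "f2a"):
            for mb in range(num_microbatches):
                # A stage starts once its input has arrived and its resource is free.
                start = max(ready_at[mb], free_at[stage])
                ready_at[mb] = free_at[stage] = start + cost[stage]
    return max(ready_at)


print(serialized(2, 2, 1.0, 1.0, 0.5))        # 12.0
print(simulate_overlap(2, 2, 1.0, 1.0, 0.5))  # 7.0
```

With a single micro-batch the two functions agree, since there is nothing to overlap with. With two, the transfer for one micro-batch runs while the other is computing, which is where the saving comes from.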
"Engineers deploying to edge hardware were running 6 separate profiling tools, hand-aligning CSVs, and still couldn't explain why ARM ran 40% slower than the benchmark predicted."
Layer-by-layer profiling across six runtimes simultaneously. Custom algorithm tracks layer fusion events — when 2+ ops silently merge into one kernel, invisible to standard profilers. Fusion tracking alone exposed that ~30% of expected quantization speedups were being lost to unfused operations. vLLM-backed RAG chatbot lets engineers query profiling data conversationally instead of reading raw tables at 2am.
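The idea behind fusion tracking can be sketched in a few lines. This is not the tool's algorithm: it assumes, purely for illustration, that a runtime reports a fused kernel by joining the source op names with "+". Real runtimes expose fusion in different ways, and `KernelRecord`, `map_fusions`, and the sample names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KernelRecord:
    name: str          # kernel name as reported by a runtime's profiler
    latency_ms: float


def map_fusions(graph_ops, kernels, sep="+"):
    """Map profiled kernels back to graph ops; return (fused, unfused, missing)."""
    known = set(graph_ops)
    fused, unfused, covered = {}, [], set()
    for kernel in kernels:
        parts = [p for p in kernel.name.split(sep) if p in known]
        covered.update(parts)
        if len(parts) >= 2:
            fused[kernel.name] = parts      # several ops merged into one kernel
        elif len(parts) == 1:
            unfused.append(parts[0])        # op ran as its own kernel
    # Ops the profiler never reported, e.g. folded away or renamed by the runtime.
    missing = [op for op in graph_ops if op not in covered]
    return fused, unfused, missing


ops = ["conv1", "bn1", "relu1", "conv2", "relu2"]
profile = [
    KernelRecord("conv1+bn1+relu1", 1.2),
    KernelRecord("conv2", 0.9),
    KernelRecord("relu2", 0.3),
]
fused, unfused, missing = map_fusions(ops, profile)
print(fused)    # {'conv1+bn1+relu1': ['conv1', 'bn1', 'relu1']}
print(unfused)  # ['conv2', 'relu2']
```

Here `conv2` and `relu2` ran as separate kernels even though the first block fused, which is the kind of gap that eats into an expected speedup.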
"CCTV systems record everything and flag nothing — security teams review footage only after an incident has concluded. The gap between detection and response was entirely manual."
YOLOv11 runs weapon detection on live feeds. Early versions hit 800ms+ latency due to synchronous DB writes on every frame. Decoupling detection from reporting via async workers cut end-to-end alert latency 4× to sub-200ms. Workers also continuously audit the database for duplicate vehicle reports — eliminating false-alarm fatigue before alerts reach dispatch. Each detector module is independently swappable without touching the pipeline.
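The decoupling pattern looks roughly like this. It is a minimal sketch using `asyncio`, with a sleep standing in for the slow database write and a label check standing in for the detector. The function names, frame format, and delay are assumptions, not the production pipeline.

```python
import asyncio


async def detect_frames(frames, queue):
    """Detection loop: hand hits to the queue and move on without waiting."""
    for frame_id, label in frames:
        if label is not None:                    # stand-in for a detector hit
            queue.put_nowait((frame_id, label))  # non-blocking hand-off
        await asyncio.sleep(0)                   # let the workers run


async def report_worker(queue, store, write_delay=0.01):
    """Reporting worker: absorbs the write latency off the detection path."""
    while True:
        item = await queue.get()
        try:
            await asyncio.sleep(write_delay)     # stand-in for a slow DB write
            store.append(item)
        finally:
            queue.task_done()


async def run_pipeline(frames, num_workers=2):
    queue, store = asyncio.Queue(), []
    workers = [asyncio.create_task(report_worker(queue, store))
               for _ in range(num_workers)]
    await detect_frames(frames, queue)
    await queue.join()                           # wait for pending writes to finish
    for worker in workers:
        worker.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return store


frames = [(0, None), (1, "weapon"), (2, None), (3, "weapon")]
print(sorted(asyncio.run(run_pipeline(frames))))  # [(1, 'weapon'), (3, 'weapon')]
```

The detection loop never awaits a write, so per-frame latency no longer includes database time. The same worker pool is a natural place for background jobs such as duplicate auditing.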
Personal projects at the intersection of AI, human memory, and multi-agent autonomy — less polished, more honest about the problems they're trying to solve.
Upload real chat exports — WhatsApp, iMessage, Telegram. helloEx builds an AI persona from the actual conversational style, tone, and phrases, then lets you have the conversations you never finished. Modes: Nostalgia, Cold, Honest, Ideal Future, Therapist. Voice-enabled via Whisper STT + ElevenLabs TTS.
The interesting problem: how do you make a language model feel like a specific person rather than a generic assistant? Custom prompt-built personas with memory retrieval from real message history.
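Memory retrieval of this kind can be sketched as follows. This version swaps in plain bag-of-words cosine similarity so it runs with the standard library alone. An embedding model would be the more typical choice, and nothing here reflects helloEx's actual retrieval. The function names, prompt wording, and sample messages are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(history, query, k=2):
    """Return up to k past messages most similar to the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(m))), i, m) for i, m in enumerate(history)]
    scored.sort(key=lambda s: (-s[0], s[1]))     # best score first, then oldest
    return [m for score, _, m in scored[:k] if score > 0]


def build_prompt(persona_name, memories, user_message):
    """Assemble a persona prompt grounded in retrieved messages."""
    lines = [f"Reply as {persona_name}, matching the tone of these past messages:"]
    lines += [f"- {m}" for m in memories]
    lines.append(f"User: {user_message}")
    return "\n".join(lines)


history = [
    "lol you always burn the pasta",
    "see you at the station at 6",
    "that pasta place near college was the best",
]
query = "remember the pasta place?"
print(build_prompt("Sam", retrieve(history, query), query))
```

Grounding each reply in a handful of real messages is what lets the model pick up phrasing that a generic assistant prompt would not produce.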
Natural language in — order placed out. A MasterAgent routes intent to specialists: FoodAgent, TravelAgent, ShoppingAgent, QuickCommerceAgent, PaymentAgent. Supports headless browser automation for placing real orders on Zepto, Blinkit, Swiggy Instamart, and BigBasket. Razorpay for payments. Groq Whisper for voice.
The design challenge: routing ambiguous intent ("order dinner for tonight") correctly without over-specializing the routing logic. Built on MCP for tool integration, with a central response synthesizer that merges multi-agent output into a single coherent reply.
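One way to keep routing from over-specializing is to score every specialist and ask for clarification when no one clearly wins. The sketch below uses keyword overlap as a stand-in for whatever classifier the MasterAgent really uses. The agent names come from the project, while the keyword table, the `route` function, and the margin rule are hypothetical.

```python
import re

# Hypothetical keyword table, for illustration only.
KEYWORDS = {
    "FoodAgent": {"dinner", "lunch", "pizza", "restaurant", "meal"},
    "TravelAgent": {"flight", "train", "hotel", "cab"},
    "ShoppingAgent": {"shoes", "laptop", "buy"},
    "QuickCommerceAgent": {"groceries", "milk", "eggs"},
    "PaymentAgent": {"pay", "refund", "invoice"},
}


def route(message, min_margin=1):
    """Pick the best-scoring agent, or 'clarify' when the winner is not clear."""
    tokens = set(re.findall(r"[a-z]+", message.lower()))
    scores = {agent: len(tokens & words) for agent, words in KEYWORDS.items()}
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    (best, best_score), (_, runner_up) = ranked[0], ranked[1]
    if best_score == 0 or best_score - runner_up < min_margin:
        return "clarify"     # no signal, or two agents tie
    return best


print(route("order dinner for tonight"))  # FoodAgent
print(route("buy milk"))                  # clarify
```

The margin check is the important part: "buy milk" matches both ShoppingAgent and QuickCommerceAgent, so the router asks a follow-up question instead of guessing.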
Real-time AI system tracking student activity and engagement from classroom video feeds — detecting attentiveness patterns and anomalous incidents. Built to give teachers actionable insight without reducing a classroom to a surveillance feed.
The work was rigorous enough to be presented at ICACCS-24 and to inform a filed patent. The ethical design constraints were as hard as the technical ones: what should the system flag, and what should it deliberately ignore?