ML Engineer · → San Jose

Shiva Nampalli

I build the layer between Research and Reality: disaggregating inference across GPU clusters, profiling the exact layer that ate your memory budget, and building communication fabric that runs at RDMA speeds.

01 / About

The engineer
behind the system.

Most ML work stops at the model. I work on what happens after the model exists — how it runs across a 64-GPU cluster, why it stalls at layer 18, and whether it can reach an ARM chip without losing accuracy.

At MulticoreWare, I'm part of a Microsoft collaboration rethinking how large language models are served at scale. Attention and FFN operations have fundamentally different hardware needs — memory-bandwidth vs. raw FLOPS. Forcing them to share the same GPU wastes both. My work disaggregates them physically, then builds the RDMA communication fabric that makes the separation invisible to the inference runtime.
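
To make the bandwidth-versus-FLOPS split concrete, here is a rough back-of-envelope sketch of arithmetic intensity (FLOPs per byte read) for a single decode step. The dimensions, the fp16 assumption, and the function names are illustrative and not taken from the project; activations and softmax are ignored.

```python
# Back-of-envelope arithmetic intensity for one decode step (one new token per sequence).
# All sizes are illustrative assumptions, not measurements from any real deployment.

BYTES_FP16 = 2


def attention_decode_intensity(context_len: int, d_model: int) -> float:
    """FLOPs per byte for attention over one sequence's cached context."""
    # Scores (q @ K^T) and the weighted sum over V: about 2 * context_len * d_model FLOPs each.
    flops = 2 * (2 * context_len * d_model)
    # The K and V caches are each read once: two tensors of context_len x d_model.
    bytes_moved = 2 * context_len * d_model * BYTES_FP16
    return flops / bytes_moved


def ffn_decode_intensity(batch: int, d_model: int, d_ff: int) -> float:
    """FLOPs per byte for a two-matrix FFN whose weights are shared by the whole batch."""
    # Up and down projections, 2 FLOPs per multiply-add, for every token in the batch.
    flops = 2 * (2 * batch * d_model * d_ff)
    # The two weight matrices are read once, however many tokens are in the batch.
    bytes_moved = 2 * d_model * d_ff * BYTES_FP16
    return flops / bytes_moved


attn = attention_decode_intensity(context_len=4096, d_model=4096)
ffn = ffn_decode_intensity(batch=64, d_model=4096, d_ff=16384)
print(f"attention: {attn:.1f} FLOPs/byte, FFN at batch 64: {ffn:.1f} FLOPs/byte")
```

Under these assumptions attention stays near 1 FLOP per byte no matter how long the context is, because every sequence reads its own KV cache, while FFN intensity grows with batch size because the weights are shared. That asymmetry is the argument for giving each its own hardware.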

"The real bottleneck in distributed inference isn't compute — it's the cost of moving tensors between nodes. Once you see that, everything else follows."

Before this, I spent five months at IIIT Hyderabad under Dr. CK Raju, pulling apart Transformer internals — not the API, the math. That habit of reasoning from first principles now shapes how I debug distributed systems: I start with the linear algebra, not the error message.

Outside production work, I run a personal lab — building AI experiments at the intersection of human memory, multi-agent autonomy, and the problems people have but haven't articulated yet.

02 / Work

Systems that
ship to production.

Three projects spanning distributed GPU infrastructure, cross-platform ML profiling, and real-time computer vision — each production-facing, each solving a problem without an off-the-shelf answer.

Project 02

Perfalign

Internal Product · MulticoreWare

"Engineers deploying to edge hardware were running 6 separate profiling tools, hand-aligning CSVs, and still couldn't explain why ARM ran 40% slower than the benchmark predicted."

Layer-by-layer profiling across six runtimes simultaneously. Custom algorithm tracks layer fusion events — when 2+ ops silently merge into one kernel, invisible to standard profilers. Fusion tracking alone exposed that ~30% of expected quantization speedups were being lost to unfused operations. vLLM-backed RAG chatbot lets engineers query profiling data conversationally instead of reading raw tables at 2am.
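
A minimal sketch of what fusion tracking can look like, assuming each profiled kernel can be mapped back to the graph ops it executed. The `KernelRecord` type, the op naming scheme (`conv_1`, `relu_1`), and the fusable-pair table are hypothetical; this is not Perfalign's actual algorithm.

```python
from dataclasses import dataclass


@dataclass
class KernelRecord:
    name: str              # kernel name as a profiler might report it (hypothetical)
    source_ops: list[str]  # graph ops this kernel executed
    time_ms: float


def find_fusions(kernels):
    """Kernels that cover two or more graph ops, i.e. fusion events."""
    return [k for k in kernels if len(k.source_ops) >= 2]


def unfused_candidates(graph_ops, kernels, fusable_pairs):
    """Adjacent op pairs of a fusable type that ran as separate kernels."""
    # Map each graph op to the index of the kernel that executed it.
    op_to_kernel = {op: i for i, k in enumerate(kernels) for op in k.source_ops}
    missed = []
    for first, second in zip(graph_ops, graph_ops[1:]):
        # Assumed naming scheme: "<type>_<index>", so the op type is the prefix.
        pair_types = (first.split("_")[0], second.split("_")[0])
        if pair_types in fusable_pairs and op_to_kernel.get(first) != op_to_kernel.get(second):
            missed.append((first, second))
    return missed


graph_ops = ["conv_1", "relu_1", "conv_2", "relu_2"]
kernels = [
    KernelRecord("conv_relu_fused", ["conv_1", "relu_1"], 0.8),
    KernelRecord("conv", ["conv_2"], 0.7),
    KernelRecord("relu", ["relu_2"], 0.2),
]
fused = find_fusions(kernels)
missed = unfused_candidates(graph_ops, kernels, {("conv", "relu")})
print([k.name for k in fused], missed)
```

In this toy trace the first conv and relu merged into one kernel, while the second pair ran separately and shows up as a missed fusion, which is the kind of gap a per-op profiler table does not surface on its own.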

PyTorch · ArmNN · ZenDNN · ONNX · TFLite · LangChain · vLLM · Vector DB
6-runtime comparison in a single view — hardware eval from multi-day to sub-hour
Fusion tracking revealed ~30% of quantization speedups lost to unfused ops
Modular — add a new runtime without touching the analysis layer
Takeaway
"Layer fusion is invisible in standard profiling output — but it's where deployment surprises hide. If you can't see fusions, you can't explain the gap between theory and silicon."
Project 03

SecureSight

Make UC Winner · Patent-Backed

"CCTV systems record everything and flag nothing — security teams review footage only after an incident has concluded. The gap between detection and response was entirely manual."

YOLOv11 runs weapon detection on live feeds. Early versions hit 800ms+ latency due to synchronous DB writes on every frame. Decoupling detection from reporting via async workers cut end-to-end alert latency 4× to sub-200ms. Workers also continuously audit the database for duplicate vehicle reports — eliminating false-alarm fatigue before alerts reach dispatch. Each detector module is independently swappable without touching the pipeline.
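
The decoupling described above can be sketched with a queue and a background worker from the standard library. The record fields, the 50 ms simulated write, and the naive duplicate key are assumptions for illustration, not SecureSight's actual pipeline.

```python
import queue
import threading
import time


def slow_db_write(record, store):
    """Stand-in for a synchronous database write (hypothetical, 50 ms)."""
    time.sleep(0.05)
    store.append(record)


def reporting_worker(q, store, seen):
    """Drain the queue off the hot path and drop duplicate detections."""
    while True:
        record = q.get()
        if record is None:  # sentinel: shut the worker down
            q.task_done()
            break
        key = (record["camera"], record["label"])
        if key not in seen:  # naive dedup key, an assumption for this sketch
            seen.add(key)
            slow_db_write(record, store)
        q.task_done()


def run_detection(detections, q):
    """Hot path: enqueue each detection and return without waiting on the DB."""
    start = time.perf_counter()
    for detection in detections:
        q.put(detection)
    return time.perf_counter() - start


q = queue.Queue()
store, seen = [], set()
worker = threading.Thread(target=reporting_worker, args=(q, store, seen), daemon=True)
worker.start()

detections = [{"camera": "cam1", "label": "weapon"}] * 3 + [{"camera": "cam2", "label": "weapon"}]
hot_path_seconds = run_detection(detections, q)
q.put(None)
q.join()  # wait for the worker to finish before inspecting the store
print(f"hot path: {hot_path_seconds * 1000:.2f} ms, rows written: {len(store)}")
```

The detection loop only pays for the enqueue, so the write latency and the duplicate check both move off the per-frame path.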

YOLOv11 · OpenCV · Flask · AWS · Docker · Node.js · Twilio
800ms → sub-200ms alert latency via async decoupling (4× reduction)
Modular detector architecture — swap models without pipeline rewrite
Won Make UC Global Hackathon · Directly informed 2 patent filings
Takeaway
"Real-time doesn't mean fast inference — it means fast end-to-end. The bottleneck was never the model; it was the synchronous write on every detection frame."
03 / Lab

Experiments
in progress.

Personal projects at the intersection of AI, human memory, and multi-agent autonomy — less polished, more honest about the problems they're trying to solve.

Active Development
Lab 01
helloEx
AI Closure Companion

Upload real chat exports — WhatsApp, iMessage, Telegram. helloEx builds an AI persona from the actual conversational style, tone, and phrases, then lets you have the conversations you never finished. Modes: Nostalgia, Cold, Honest, Ideal Future, Therapist. Voice-enabled via Whisper STT + ElevenLabs TTS.

The interesting problem: how do you make a language model feel like a specific person rather than a generic assistant? Custom prompt-built personas with memory retrieval from real message history.

FastAPI · React/Vite · vLLM/Ollama · Whisper · ElevenLabs · TypeScript · Python
Active Development
Lab 02
QuickPick
Multi-Agent Commerce System

Natural language in — order placed out. A MasterAgent routes intent to specialists: FoodAgent, TravelAgent, ShoppingAgent, QuickCommerceAgent, PaymentAgent. Supports headless browser automation for placing real orders on Zepto, Blinkit, Swiggy Instamart, and BigBasket. Razorpay for payments. Groq Whisper for voice.

The design challenge: routing ambiguous intent ("order dinner for tonight") correctly without over-specializing the routing logic. Built on MCP for tool integration, with a central response synthesizer that merges multi-agent output into a single coherent reply.
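
One way to keep routing from over-committing on ambiguous input is to score every specialist and ask for clarification on a tie or a miss. The sketch below uses keyword overlap as a stand-in scorer; the keyword sets and the `route` function are hypothetical, and the real MasterAgent may well use a model-based classifier instead.

```python
# Hypothetical keyword sets; only the agent names come from the project description.
AGENT_KEYWORDS = {
    "FoodAgent": {"dinner", "lunch", "pizza", "restaurant", "meal"},
    "TravelAgent": {"flight", "train", "hotel", "cab"},
    "ShoppingAgent": {"shoes", "laptop", "phone", "buy"},
    "QuickCommerceAgent": {"groceries", "milk", "eggs", "instant"},
}


def route(message: str, min_score: int = 1) -> str:
    """Return the best-matching agent, or "clarify" when the intent is ambiguous."""
    words = set(message.lower().replace(",", " ").split())
    scores = {agent: len(words & keywords) for agent, keywords in AGENT_KEYWORDS.items()}
    best = max(scores.values())
    winners = [agent for agent, score in scores.items() if score == best]
    # No signal at all, or two specialists tied: ask the user instead of guessing.
    if best < min_score or len(winners) > 1:
        return "clarify"
    return winners[0]


print(route("order dinner for tonight"))
print(route("buy milk"))
```

Here "order dinner for tonight" matches only the food keywords, while "buy milk" ties between shopping and quick commerce and falls through to a clarifying question rather than a guess.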

FastAPI · Python · MCP · Selenium · Razorpay · Groq · React
Published · ICACCS-24
Lab 03
AI Student
Tracker
Behaviour Analysis · CV

Real-time AI system tracking student activity and engagement from classroom video feeds — detecting attentiveness patterns and anomalous incidents. Built to give teachers actionable insight without reducing a classroom to a surveillance feed.

The work was rigorous enough to present at ICACCS-24, and it informed a filed patent. The ethical design constraints were as hard as the technical ones: what should the system flag, and what should it deliberately ignore?

Computer Vision · YOLOv8 · Python · OpenCV · Flask
04 / Stack

Tools I reach
for first.

Categories
Inference & Serving
Deep Learning
Systems & Infra
Languages & Tools
LLM Serving
SGLang · Core / Active
vLLM · Proficient
ONNX Runtime · Core

Edge Runtimes
ArmNN · Core
ZenDNN (AMD) · Proficient
IPEX (Intel) · Proficient
TensorFlow Lite · Proficient

Frameworks
PyTorch · Core
TensorFlow · Proficient

Architectures
Transformers / LLMs · Core
YOLO (v8 / v11) · Core
Encoder-Decoder · Core

Retrieval & RAG
LangChain · Proficient
Vector + Graph DBs · Proficient

Networking / Communication
RDMA / InfiniBand · Core
StepMesh · Core
ZMQ · Proficient

Deployment & Infra
Docker · Proficient
AWS · Familiar
Linux · Proficient

Languages
Python · Core
SQL / MySQL · Proficient
TypeScript · Familiar

Libraries
NumPy / Pandas · Core
OpenCV · Proficient
Flask / FastAPI · Proficient

05 / Experience

Where the
work happened.

Jul 2024 – Present
MulticoreWare
Chennai, India
🏅 QBR Best Performer
Machine Learning Engineer
Implementing Attention-FFN Disaggregation in SGLang for a Microsoft-backed distributed inference project — designing A2F/F2A tensor communication over RDMA with micro-batch overlapping to eliminate GPU idle time at communication boundaries. Goal: independent scaling of Attention and FFN beyond single-node memory limits.
Built Perfalign from scratch — production layer-by-layer profiling across 6 runtimes with a custom layer fusion tracking algorithm that exposed why hardware deployments consistently underperformed benchmark predictions. Added a vLLM-backed RAG chatbot that drove daily adoption among non-expert team members.
Consistent pattern: communication overhead — not compute — is always the real bottleneck, whether it's RDMA latency between GPU groups or the information gap between profiling tables and the engineers reading them.
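
The benefit of the micro-batch overlapping mentioned above can be estimated with a simple two-stage pipeline model. This is a simplified sketch, assuming uniform per-micro-batch compute and transfer times and perfect overlap; the function names and the example numbers are illustrative, not measurements from the project.

```python
def sequential_time(n_microbatches: int, compute_ms: float, comm_ms: float) -> float:
    """Every micro-batch computes, then waits for its transfer to finish."""
    return n_microbatches * (compute_ms + comm_ms)


def overlapped_time(n_microbatches: int, compute_ms: float, comm_ms: float) -> float:
    """Transfer of micro-batch i runs while micro-batch i+1 computes."""
    # The first micro-batch pays both stages; each later one costs only the slower stage.
    return compute_ms + comm_ms + (n_microbatches - 1) * max(compute_ms, comm_ms)


seq = sequential_time(4, compute_ms=2.0, comm_ms=1.5)
ovl = overlapped_time(4, compute_ms=2.0, comm_ms=1.5)
print(f"sequential: {seq} ms, overlapped: {ovl} ms")
```

In this model, once transfer time is at or below compute time, communication is hidden for all but the first micro-batch; when transfer is slower than compute, it sets the pace of the whole pipeline.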
Feb 2024 – Jul 2024
IIIT Hyderabad
Hyderabad, India
AIML Research Intern
Worked under Dr. CK Raju on Transformer internals — not the high-level API, but the mathematics of self-attention gradients, positional encoding geometry, and why Encoder-Decoder designs diverge from decoder-only in practice.
Built the core debugging habit I still use: start with first principles, not the error message. When something breaks in distributed inference, the answer is usually in the math, not the stack trace.
06 / Research

Published
contributions.

Patent
Automated Traffic Helmet Violation Detection (ATHD) and Reporting System for Law Enforcements
Filed Patent
AI-driven public safety
Patent + Paper
Real-Time Student Activity Detection and Incident Monitoring Using Artificial Intelligence
Patent & Paper
Computer vision · Edge AI
Conference
An AI-Based Student Tracking System to Analyse Student Behavior
ICACCS-24
Int'l Conference · 2024
Paper
Enhanced Multimodal Object Detection Using U-Net Centric Feature Fusion
Published Research
Computer vision

Let's build
something
harder.