Shiva Nampalli

01 / About

The engineer
behind the system.

Most ML work stops at the model. I work on what happens after the model exists — how it runs across a 64-GPU cluster, why it stalls at layer 18, and whether it can reach an ARM chip without losing accuracy.

At MulticoreWare, I'm part of a Microsoft collaboration rethinking how large language models are served at scale. Attention and FFN operations have fundamentally different hardware needs — memory-bandwidth vs. raw FLOPS. Forcing them to share the same GPU wastes both. My work disaggregates them physically, then builds the RDMA communication fabric that makes the separation invisible to the inference runtime.

"The real bottleneck in distributed inference isn't compute — it's the cost of moving tensors between nodes. Once you see that, everything else follows."

Before this, I spent five months at IIIT Hyderabad under Dr. CK Raju, pulling apart Transformer internals — not the API, the math. That habit of reasoning from first principles now shapes how I debug distributed systems: I start with the linear algebra, not the error message.

Outside production work, I run a personal lab — building AI experiments at the intersection of human memory, multi-agent autonomy, and the problems people have but haven't articulated yet.

02 / Work

Systems that
ship to production.

Three projects spanning distributed GPU infrastructure, cross-platform ML profiling, and real-time computer vision — each production-facing, each solving a problem without an off-the-shelf answer.

Microsoft Collaboration · Active

Project 01

Attention–FFN
Disaggregation
in SGLang

"Monolithic LLM serving forces memory-bound Attention and compute-bound FFN to share the same GPUs — neither can be scaled independently. Hard wall around 70B parameters on a single node."

Implementing AFD inside SGLang — routing Attention to HBM-dense GPU groups and FFN to FLOP-dense ones, then building the A2F/F2A communication layer that stitches them back without visible latency.
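A back-of-envelope sketch of why the two operations want different hardware. It assumes fp16 weights and KV cache, single-token decode, plain multi-head attention (no grouped-query sharing), and ignores activation traffic; the dimensions in the example are illustrative, not taken from any specific deployment. The function names are hypothetical.

```python
BYTES_FP16 = 2  # assumption: weights and KV cache stored in fp16


def attention_decode_intensity(seq_len: int, d_model: int) -> float:
    """FLOPs per byte for one decode step of attention over a cached sequence."""
    # Two matmuls (scores, then weighted values), each ~2 * seq_len * d_model FLOPs.
    flops = 4 * seq_len * d_model
    # K and V caches are each seq_len * d_model elements and are read once.
    bytes_moved = 2 * seq_len * d_model * BYTES_FP16
    return flops / bytes_moved


def ffn_intensity(batch_tokens: int, d_model: int, d_ff: int) -> float:
    """FLOPs per byte for a two-matrix FFN applied to a batch of tokens."""
    # Up and down projections: 2 * d_model * d_ff FLOPs per token for each matrix.
    flops = batch_tokens * 2 * (2 * d_model * d_ff)
    # Weight traffic dominates: two d_model x d_ff matrices read once per batch.
    bytes_moved = 2 * d_model * d_ff * BYTES_FP16
    return flops / bytes_moved


# Illustrative sizes only.
attn = attention_decode_intensity(seq_len=4096, d_model=8192)
ffn = ffn_intensity(batch_tokens=64, d_model=8192, d_ff=28672)
print(f"attention: {attn:.1f} FLOPs/byte, FFN: {ffn:.1f} FLOPs/byte")
```

Under these assumptions attention stays near 1 FLOP per byte no matter how long the sequence is, because each request reads its own KV cache, while FFN intensity grows with the number of tokens sharing the same weights. That asymmetry is the case for giving each side its own GPU group.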

Why RDMA over TCP/IP? Kernel-bypass reduces round-trip latency ~5–10× for large tensor transfers. ZMQ handles control-plane signaling because it tolerates irregular message sizes better than gRPC at microsecond intervals. StepMesh coordinates group membership without a central bottleneck.

Micro-batch overlapping schedules communication to run behind active compute — neither GPU group ever idles at a communication boundary. The scheduler is the real engineering contribution.
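A toy model of the overlap idea, not the SGLang scheduler itself. It treats Attention compute, the A2F transfer, and FFN compute as three stages that can each work on one micro-batch at a time, and compares running them back to back with running them as a pipeline. The stage times are made-up units chosen for illustration.

```python
def serial_time(n_micro: int, stage_times: list[float]) -> float:
    """Total time when every micro-batch runs all stages before the next starts."""
    return n_micro * sum(stage_times)


def pipelined_time(n_micro: int, stage_times: list[float]) -> float:
    """Total time when each stage starts a micro-batch as soon as both are free."""
    # finish[s] is when stage s completes the most recent micro-batch it handled.
    finish = [0.0] * len(stage_times)
    for _ in range(n_micro):
        ready = 0.0  # when this micro-batch left its previous stage
        for s, duration in enumerate(stage_times):
            start = max(ready, finish[s])  # wait for the data and for the stage
            finish[s] = start + duration
            ready = finish[s]
    return finish[-1]


# Hypothetical stage times: attention compute, A2F transfer, FFN compute.
stages = [1.0, 0.5, 1.0]
print(serial_time(4, stages))     # 10.0
print(pipelined_time(4, stages))  # 5.5
```

In the pipelined case the transfer for one micro-batch runs while the Attention group is already computing the next, so after the pipeline fills, total time is governed by the slowest stage rather than by the sum of all stages.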

SGLang · PyTorch · RDMA/InfiniBand · StepMesh · ZMQ · CUDA · Python

→ A2F/F2A pipelines between disaggregated GPU groups over RDMA fabric

→ Micro-batch scheduler keeps communication off the critical path entirely

→ Targets multi-fold throughput gains over monolithic serving for 70B+ models

→ Independent memory + compute scaling: decouples what was previously hard-coupled

↗ GitHub ↗ SGLang Docs

A2F / F2A Tensor Flow — SGLang Runtime

Before

FFN GPUs at ~40% utilization waiting on Attention. Memory caps entire batch. Hard scaling wall at 70B+.

After AFD

Both groups run near peak concurrently. Independent budgets unlock multi-node 100B+ architectures.

Engineering Insight

"Communication overhead compounds faster than compute. The micro-batch scheduler exists for one reason: keep communication off the critical path, always."

Project 02

Perfalign

Internal Product · MulticoreWare

"Engineers deploying to edge hardware were running 6 separate profiling tools, hand-aligning CSVs, and still couldn't explain why ARM ran 40% slower than the benchmark predicted."

Layer-by-layer profiling across six runtimes simultaneously. Custom algorithm tracks layer fusion events — when 2+ ops silently merge into one kernel, invisible to standard profilers. Fusion tracking alone exposed that ~30% of expected quantization speedups were being lost to unfused operations. vLLM-backed RAG chatbot lets engineers query profiling data conversationally instead of reading raw tables at 2am.
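One way to picture fusion tracking is as an alignment between the ops in the model graph and the kernels a profiler actually reports. The sketch below is a simplified heuristic, not Perfalign's algorithm: it assumes (hypothetically) that a fused kernel's name contains the lower-cased type of every op it absorbed, which real runtimes do not guarantee. All names are illustrative.

```python
def align_ops_to_kernels(
    graph_ops: list[str], kernel_names: list[str]
) -> list[tuple[str, list[str]]]:
    """Greedily assign consecutive graph ops to each profiled kernel, in order."""
    mapping = []
    i = 0  # index of the next graph op not yet assigned to a kernel
    for kernel in kernel_names:
        covered = []
        lowered = kernel.lower()
        # Assumption: a kernel's name mentions every op type it executes.
        while i < len(graph_ops) and graph_ops[i].lower() in lowered:
            covered.append(graph_ops[i])
            i += 1
        mapping.append((kernel, covered))
    return mapping


def fused_kernels(mapping: list[tuple[str, list[str]]]) -> list[tuple[str, list[str]]]:
    """Kernels that absorbed two or more graph ops."""
    return [(kernel, ops) for kernel, ops in mapping if len(ops) >= 2]


# Hypothetical graph and profiler output.
ops = ["Conv", "Relu", "MaxPool", "Gemm", "Relu"]
kernels = ["conv_relu_fused", "maxpool", "gemm", "relu"]
print(fused_kernels(align_ops_to_kernels(ops, kernels)))
# [('conv_relu_fused', ['Conv', 'Relu'])]
```

Here the first Conv and Relu were merged, while the Gemm and its Relu ran as separate kernels. Pairs like the second one are what a fusion tracker would surface as missed fusions.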

PyTorch · ArmNN · ZenDNN · ONNX · TFLite · LangChain · vLLM · Vector DB

→ 6-runtime comparison in a single view: hardware eval cut from multi-day to sub-hour

→ Fusion tracking revealed ~30% of quantization speedups lost to unfused ops

→ Modular: add a new runtime without touching the analysis layer

Takeaway

"Layer fusion is invisible in standard profiling output — but it's where deployment surprises hide. If you can't see fusions, you can't explain the gap between theory and silicon."

↗ GitHub

Project 03

SecureSight

Make UC Winner · Patent-Backed

"CCTV systems record everything and flag nothing — security teams review footage only after an incident has concluded. The gap between detection and response was entirely manual."

YOLOv11 runs weapon detection on live feeds. Early versions hit 800ms+ latency due to synchronous DB writes on every frame. Decoupling detection from reporting via async workers cut end-to-end alert latency 4× to sub-200ms. Workers also continuously audit the database for duplicate vehicle reports — eliminating false-alarm fatigue before alerts reach dispatch. Each detector module is independently swappable without touching the pipeline.
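The decoupling can be sketched with a queue between the detection loop and a background writer. This is a minimal in-process illustration using a thread, not the SecureSight code; `detect` and `write_report` are hypothetical callables standing in for the model and the database write, and a deployed system may well use separate worker processes or a message broker instead.

```python
import queue
import threading
from typing import Callable, Iterable, Optional


def run_pipeline(
    frames: Iterable,
    detect: Callable[[object], Optional[object]],
    write_report: Callable[[object], None],
) -> None:
    """Run detection on every frame while a worker thread handles slow writes."""
    reports: queue.Queue = queue.Queue()

    def worker() -> None:
        while True:
            item = reports.get()
            if item is None:  # sentinel: no more reports will arrive
                break
            write_report(item)  # slow I/O happens off the detection loop

    writer = threading.Thread(target=worker, daemon=True)
    writer.start()

    for frame in frames:
        detection = detect(frame)
        if detection is not None:
            reports.put(detection)  # enqueue and move on to the next frame

    reports.put(None)  # tell the worker to drain the queue and stop
    writer.join()


# Hypothetical stand-ins: even-numbered frames "contain" a detection.
written = []
run_pipeline(range(6), lambda f: f if f % 2 == 0 else None, written.append)
print(written)  # [0, 2, 4]
```

The detection loop only pays the cost of a queue insert per detection, so a slow write delays the report, not the next frame.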

YOLOv11 · OpenCV · Flask · AWS · Docker · Node.js · Twilio

→ 800ms → sub-200ms alert latency via async decoupling (4× reduction)

→ Modular detector architecture: swap models without pipeline rewrite

→ Won Make UC Global Hackathon · Directly informed 2 patent filings

Takeaway

"Real-time doesn't mean fast inference — it means fast end-to-end. The bottleneck was never the model; it was the synchronous write on every detection frame."

↗ GitHub ◎ Patent

03 / Lab

Experiments
in progress.

Personal projects at the intersection of AI, human memory, and multi-agent autonomy — less polished, more honest about the problems they're trying to solve.

Active Development

Lab 01

helloEx

AI Closure Companion

Upload real chat exports — WhatsApp, iMessage, Telegram. helloEx builds an AI persona from the actual conversational style, tone, and phrases, then lets you have the conversations you never finished. Modes: Nostalgia, Cold, Honest, Ideal Future, Therapist. Voice-enabled via Whisper STT + ElevenLabs TTS.

The interesting problem: how do you make a language model feel like a specific person rather than a generic assistant? Custom prompt-built personas with memory retrieval from real message history.
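A rough sketch of the persona-plus-retrieval idea, under loud assumptions: retrieval here is plain word overlap (a real system would more likely use embeddings), and the prompt wording, function names, and sample messages are all invented for illustration rather than taken from helloEx.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lower-case word set; crude, but enough for an overlap score."""
    return set(re.findall(r"[a-z']+", text.lower()))


def retrieve(history: list[str], query: str, k: int = 2) -> list[str]:
    """Return up to k past messages sharing the most words with the query."""
    q = tokenize(query)
    scored = [(len(q & tokenize(msg)), idx, msg) for idx, msg in enumerate(history)]
    # Highest overlap first; ties keep original chat order.
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:k]
    return [msg for score, _, msg in top if score > 0]


def build_prompt(name: str, mode: str, history: list[str], user_message: str) -> str:
    """Assemble a system prompt that grounds the persona in real past messages."""
    memories = retrieve(history, user_message)
    memory_block = "\n".join(f"- {m}" for m in memories) or "- (no related messages)"
    return (
        f"You are speaking as {name} in '{mode}' mode.\n"
        f"Match the tone and phrasing of these past messages:\n{memory_block}\n"
        f"Reply to: {user_message}"
    )


# Invented sample data.
chat = [
    "remember the beach trip in goa",
    "i hate mondays lol",
    "the beach was so cold that night",
]
print(build_prompt("Sam", "Nostalgia", chat, "do you miss the beach?"))
```

Grounding each reply in retrieved messages is what keeps the persona tied to how the person actually wrote, instead of falling back to a generic assistant voice.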

FastAPI · React/Vite · vLLM/Ollama · Whisper · ElevenLabs · TypeScript · Python

GitHub

Active Development

Lab 02

QuickPick

Multi-Agent Commerce System

Natural language in — order placed out. A MasterAgent routes intent to specialists: FoodAgent, TravelAgent, ShoppingAgent, QuickCommerceAgent, PaymentAgent. Supports headless browser automation for placing real orders on Zepto, Blinkit, Swiggy Instamart, and BigBasket. Razorpay for payments. Groq Whisper for voice.

The design challenge: routing ambiguous intent ("order dinner for tonight") correctly without over-specializing the routing logic. Built on MCP for tool integration, with a central response synthesizer that merges multi-agent output into a single coherent reply.
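To make the routing problem concrete, here is a deliberately simple router: score each specialist by keyword overlap and ask for clarification when no agent wins clearly. The agent names come from the description above; the keyword sets, the margin rule, and the "clarify" fallback are assumptions for illustration, and the sketch does not use MCP or reflect how QuickPick's MasterAgent is actually implemented.

```python
import re

# Hypothetical keyword sets per specialist agent.
ROUTES = {
    "FoodAgent": {"dinner", "lunch", "pizza", "restaurant", "meal"},
    "TravelAgent": {"flight", "train", "hotel", "cab", "ticket"},
    "ShoppingAgent": {"shoes", "laptop", "phone", "shirt"},
    "QuickCommerceAgent": {"groceries", "milk", "eggs", "bread"},
    "PaymentAgent": {"pay", "refund", "invoice"},
}


def route(utterance: str, min_margin: int = 1) -> str:
    """Pick the agent with the most keyword hits, or 'clarify' if it is not clear."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    scores = {agent: len(words & keywords) for agent, keywords in ROUTES.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_score), (_, second_score) = ranked[0], ranked[1]
    # No hits, or the top two agents are too close: ask the user instead of guessing.
    if best_score == 0 or best_score - second_score < min_margin:
        return "clarify"
    return best


print(route("order dinner for tonight"))  # FoodAgent
print(route("order milk and pizza"))      # clarify
```

The explicit fallback is the point of the sketch: an ambiguous request is handed back for clarification instead of being forced into the nearest specialist.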

FastAPI · Python · MCP · Selenium · Razorpay · Groq · React

GitHub

Published · ICACCS-24

Lab 03

AI Student
Tracker

Behaviour Analysis · CV

Real-time AI system tracking student activity and engagement from classroom video feeds — detecting attentiveness patterns and anomalous incidents. Built to give teachers actionable insight without reducing a classroom to a surveillance feed.

The work was rigorous enough to be presented at ICACCS-24, and it informed a filed patent. The ethical design constraints were as hard as the technical ones: what should the system flag, and what should it deliberately ignore?

Computer Vision · YOLOv8 · Python · OpenCV · Flask

GitHub ICACCS-24

04 / Stack

Tools I reach
for first.

Where the
work happened.

Jul 2024 – Present

MulticoreWare

Chennai, India

🏅 QBR Best Performer

Machine Learning Engineer

Implementing Attention-FFN Disaggregation in SGLang for a Microsoft-backed distributed inference project — designing A2F/F2A tensor communication over RDMA with micro-batch overlapping to eliminate GPU idle time at communication boundaries. Goal: independent scaling of Attention and FFN beyond single-node memory limits.

Built Perfalign from scratch — production layer-by-layer profiling across 6 runtimes with a custom layer fusion tracking algorithm that exposed why hardware deployments consistently underperformed benchmark predictions. Added a vLLM-backed RAG chatbot that drove daily adoption among non-expert team members.

Consistent pattern: communication overhead — not compute — is always the real bottleneck, whether it's RDMA latency between GPU groups or the information gap between profiling tables and the engineers reading them.

Feb 2024 – Jul 2024

IIIT Hyderabad

Hyderabad, India

AIML Research Intern

Worked under Dr. CK Raju on Transformer internals — not the high-level API, but the mathematics of self-attention gradients, positional encoding geometry, and why Encoder-Decoder designs diverge from decoder-only in practice.

Built the core debugging habit I still use: start with first principles, not the error message. When something breaks in distributed inference, the answer is usually in the math, not the stack trace.

06 / Research

Published
contributions.

Patent

Automated Traffic Helmet Violation Detection (ATHD) and Reporting System for Law Enforcements

Filed Patent
AI-driven public safety

Patent + Paper

Real-Time Student Activity Detection and Incident Monitoring Using Artificial Intelligence

Patent & Paper
Computer vision · Edge AI

Conference

An AI-Based Student Tracking System to Analyse Student Behavior

ICACCS-24
Int'l Conference · 2024

Paper

Enhanced Multimodal Object Detection Using U-Net Centric Feature Fusion

Published Research
Computer vision

Let's build something harder.