Abhishek Gupta (AbhiOnGithub)

Multi-Cloud Distributed Systems, Enterprise Inference Service.


Hey there 👋 I'm Abhishek

Obsessed with making models go brrr — from training to real-time inference at scale

⚡ About Me

  • 🔥 I live and breathe AI Inference — optimizing models to run faster, cheaper, and at massive scale
  • 🧠 Deep in the NVIDIA inference stack: TensorRT, Triton Inference Server, CUDA, TensorRT-LLM, and NIM
  • 🚀 Passionate about squeezing every last TFLOP out of GPUs — from A100s to H100s to Blackwell
  • 🏗️ Building and scaling inference pipelines that serve millions of requests with minimal latency
  • 🌐 Background in cloud-native architecture across AWS, Azure, and GCP — now laser-focused on GPU-accelerated inference infrastructure
  • 🤝 Open to collaborating on open-source inference tooling, model optimization, and high-performance serving systems

🛠️ Inference & AI Stack:

TensorRT, Triton, CUDA, TensorRT-LLM, NIM, Python, C++, Go, Rust, PyTorch, vLLM, Docker, Kubernetes

☁️ Cloud & Infra:

AWS, Azure, GCP, Docker, Kubernetes, Git


  • 💬 Ask me about GPU-accelerated inference, model optimization, batching strategies (see the toy batching sketch after this list), and scaling LLM serving
  • 👯 Looking to collaborate on inference engines, model compilers, and open-source AI infrastructure
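
A toy sketch of the batching idea referenced above, assuming nothing about any real serving stack (the batch size, queue-delay budget, and fake_model are all made up): requests accumulate on a queue and are flushed as a single batch once the batch is full or a small wait deadline expires.

```python
# Toy dynamic batching loop: collect requests until the batch is full or the
# oldest request has waited MAX_WAIT_S, then run one batched "model" call.
# Everything here is illustrative; real servers (Triton, vLLM) do far more.
import asyncio
import time

MAX_BATCH = 8
MAX_WAIT_S = 0.005  # 5 ms queue-delay budget

async def fake_model(batch):
    # Stand-in for one batched GPU forward pass.
    await asyncio.sleep(0.01)
    return [f"output for {req}" for req in batch]

async def batcher(queue: asyncio.Queue):
    while True:
        req, fut = await queue.get()           # block until the first request
        batch, futures = [req], [fut]
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                req, fut = await asyncio.wait_for(queue.get(), remaining)
            except asyncio.TimeoutError:
                break
            batch.append(req)
            futures.append(fut)
        results = await fake_model(batch)      # one call for the whole batch
        for fut, res in zip(futures, results):
            fut.set_result(res)

async def infer(queue, request):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((request, fut))
    return await fut

async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batcher(queue))
    outs = await asyncio.gather(*(infer(queue, f"req-{i}") for i in range(20)))
    print(outs[:3])
    worker.cancel()

asyncio.run(main())
```
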
⚡ What I'm focused on in 2025–2026
- Optimizing LLM inference — KV-cache management, speculative decoding, continuous batching
- TensorRT-LLM and TensorRT for maximum throughput on NVIDIA GPUs
- Triton Inference Server — model ensembles, dynamic batching, multi-GPU serving
- NVIDIA NIM microservices for production-grade AI deployment
- CUDA kernel optimization and custom inference operators
- vLLM, SGLang, and other open-source LLM serving frameworks (see the vLLM sketch after this list)
- Multi-node inference on H100 / Blackwell clusters with NVLink & NVSwitch
- Quantization (FP8, INT4, AWQ, GPTQ) for efficient model deployment
- Go, Rust, and C++ for high-performance inference infrastructure
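
To make the batching bullets above concrete, a minimal vLLM offline-generation sketch (the model id, prompts, and sampling values are placeholders): the engine handles continuous batching, scheduling, and paged KV-cache management internally.

```python
# Minimal vLLM offline-batching sketch; the model id and sampling values are
# placeholders, not recommendations.
from vllm import LLM, SamplingParams

prompts = [
    "Explain speculative decoding in one sentence.",
    "Why does continuous batching improve GPU utilization?",
]
sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model id
outputs = llm.generate(prompts, sampling)

for out in outputs:
    print(out.prompt)
    print(out.outputs[0].text)
```
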
🧠 Technologies I know
- Inference: TensorRT, TensorRT-LLM, Triton Inference Server, NVIDIA NIM, vLLM, ONNX Runtime (see the Triton client sketch after this list)
- GPU/Compute: CUDA, cuDNN, NCCL, NVLink, Multi-Instance GPU (MIG)
- ML Frameworks: PyTorch, JAX, ONNX
- Cloud: AWS (SageMaker, EKS, EC2 P/G instances), Azure (AKS, NC/ND VMs), GCP (GKE, A3/A2 VMs)
- Containers & Orchestration: Docker, Kubernetes, Helm, NVIDIA GPU Operator
- Languages: Python, C++, Go, Rust, C#, Java
- IaC: Terraform, Pulumi, AWS CloudFormation, Azure ARM
- Monitoring: Prometheus, Grafana, Splunk, Elastic Stack
- Streaming: Apache Kafka, Apache Flink, Spark Streaming
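
For the Triton side, a minimal HTTP client sketch; the model name and tensor names ("my_model", "INPUT0", "OUTPUT0") are hypothetical and have to match whatever model configuration the server actually loads.

```python
# Minimal Triton Inference Server HTTP client sketch (hypothetical model and
# tensor names); assumes a server is listening on localhost:8000.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 16).astype(np.float32)
inputs = [httpclient.InferInput("INPUT0", list(data.shape), "FP32")]
inputs[0].set_data_from_numpy(data)
outputs = [httpclient.InferRequestedOutput("OUTPUT0")]

result = client.infer(model_name="my_model", inputs=inputs, outputs=outputs)
print(result.as_numpy("OUTPUT0"))
```
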
📚 Previously
- Cloud-native architecture and distributed systems across AWS & Azure
- Serverless and modular monolithic architectures
- Full-stack development with C#/.NET, Java/Spring Boot, React
- Go microservices (GORM, Fiber, Chi, Mux)
- Dapr (Distributed Application Runtime)
- Cross-platform development with Xamarin/MAUI

Pinned repositories

  1. Code used in ML.NET blog posts (C#, 2 stars, 3 forks)
  2. A high-throughput and memory-efficient inference and serving engine for LLMs (Python, 75.6k stars, 15.3k forks)
  3. A datacenter-scale distributed inference serving framework (Rust, 6.5k stars, 995 forks)