# WebGPU Tensor Library
A high-performance tensor library built on WebGPU, designed for both eager and lazy execution with automatic CPU/GPU device management.
## What are Compute Shaders?
Compute shaders are GPU programs that run massively parallel computations. Unlike graphics shaders that render pixels, compute shaders can perform arbitrary calculations on large datasets. WebGPU exposes this power through a modern, cross-platform API that works in browsers and native environments.
**Performance Note:** For small matrices (< 1K elements), CPU is often faster due to GPU setup overhead. For large matrices (> 100K elements), GPU parallelism dominates. Our library handles both seamlessly.
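The crossover point varies by hardware, but a dispatch heuristic along these lines can pick the device automatically. This is an illustrative sketch: `pickDevice` and the threshold constant are hypothetical names, not part of the library's API.

```typescript
// Hypothetical device-selection heuristic. The threshold is an illustrative
// default, not a measured value from this library.
type Device = "cpu" | "gpu";

const GPU_THRESHOLD = 100_000; // above this, parallelism outweighs setup cost

function pickDevice(elementCount: number, gpuAvailable: boolean): Device {
  if (!gpuAvailable) return "cpu";
  // Small tensors: buffer upload + pipeline dispatch dominate, stay on CPU.
  return elementCount >= GPU_THRESHOLD ? "gpu" : "cpu";
}
```

In practice the threshold should be tuned per device, since integrated and discrete GPUs have very different setup costs.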
## Key Features
- Dual execution modes: Eager execution (immediate) and planned compiled execution (fused operations)
- Cross-device: Seamless CPU ↔ GPU transfers
- WebGPU native: Built for modern compute workloads
- TypeScript: Full type safety with device-aware types
## API Examples

### Eager Execution (Current)
```ts
import { Tensor } from "./src/tensor.ts";

// Create tensors on CPU or GPU
const a = Tensor.fromArray([1, 2, 3, 4, 5], { device: "gpu" });
const b = Tensor.fromArray([2, 3, 4, 5, 6], { device: "gpu" });

// Method chaining works - operations execute immediately
const result = await a.add(b).mul(2); // Each step creates a new tensor
const cpuResult = await result.to("cpu"); // Transfer back
console.log(cpuResult.toArray()); // [6, 10, 14, 18, 22]
```
### Compiled Execution (Planned)
```ts
import { Tensor, compile } from "./src/index.ts";

// Same chaining syntax, but with major optimizations:
const fusedOp = compile((x: Tensor, y: Tensor) => {
  return x.add(y).mul(2).relu(); // Fused into a single kernel
});

// Compiled mode advantages:
// ✅ Kernel fusion: single compute shader instead of 3 separate ones
// ✅ In-place operations: dynamic buffer allocator minimizes memory usage
// ✅ Auto cleanup: intermediate tensors destroyed at closure end
// ✅ Reuse: first call compiles, subsequent calls are blazing fast
const result = fusedOp(a, b); // Much faster than eager
```
## Design Decisions
**WebGPU Async Nature**: WebGPU operations are inherently asynchronous, but we don't always await intermediate steps: the runtime automatically queues and awaits the necessary operations, which enables better performance through automatic batching.
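One way such automatic queueing can work (a sketch under stated assumptions, not the library's actual runtime; `OpQueue` and its methods are hypothetical names) is to record pending operations during a synchronous chain and flush them together in a single microtask:

```typescript
// Sketch of an op queue that batches eagerly-chained operations.
type Op = () => void;

class OpQueue {
  private pending: Op[] = [];
  private scheduled = false;

  enqueue(op: Op): void {
    this.pending.push(op);
    if (!this.scheduled) {
      this.scheduled = true;
      // Flush once the current synchronous chain (a.add(b).mul(2)) finishes,
      // so all queued kernels go out in one batch.
      queueMicrotask(() => this.flush());
    }
  }

  flush(): void {
    const ops = this.pending;
    this.pending = [];
    this.scheduled = false;
    for (const op of ops) op(); // a real runtime would do one GPUQueue.submit()
  }
}
```

Batching at microtask boundaries means a whole chained expression submits as one unit, which is where the performance win over per-op submission comes from.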
Syntax Choices:
- `Tensor.fromArray()` for explicit construction
- `.to(device)` for clear device transfers
- Method chaining with automatic queueing
- Device-aware TypeScript types prevent cross-device errors at compile time
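The device-aware typing can be sketched with a phantom type parameter. `TypedTensor` below is a hypothetical, simplified stand-in for the library's real tensor class, shown only to illustrate the compile-time guarantee:

```typescript
// Phantom-typed tensor sketch: the device is tracked at the type level,
// so mixing CPU and GPU tensors fails to compile.
type Device = "cpu" | "gpu";

class TypedTensor<D extends Device> {
  constructor(readonly device: D, readonly data: number[]) {}

  // Both operands must live on the same device D.
  add(other: TypedTensor<D>): TypedTensor<D> {
    const out = this.data.map((v, i) => v + other.data[i]);
    return new TypedTensor(this.device, out);
  }
}

const g1 = new TypedTensor("gpu", [1, 2]);
const g2 = new TypedTensor("gpu", [3, 4]);
const c1 = new TypedTensor("cpu", [5, 6]);

const ok = g1.add(g2); // fine: both "gpu"
// g1.add(c1);         // compile-time error: "cpu" is not assignable to "gpu"
```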
## Platform Support
Currently using Deno exclusively due to its excellent built-in WebGPU support (`--unstable-webgpu`). Future plans include bundling for web browsers and Node.js.
## Getting Started

### Prerequisites
- Deno 1.40+ with WebGPU support
### Run Examples
```sh
deno task dev  # Run basic examples
deno run --unstable-webgpu --allow-all examples/basic_add.ts
deno run --unstable-webgpu --allow-all examples/performance_comparison.ts
```

### Run Tests

```sh
deno task test  # Run all tests
deno test --unstable-webgpu --allow-all tests/tensor_test.ts
```
## Contributing
We need help with several areas:
### 🔧 API Improvements
- Better TypeScript support: Current device typing could be more ergonomic
- Shape broadcasting: Automatic shape compatibility
- Error handling: Better error messages and recovery
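For the shape-broadcasting item, the standard NumPy-style rule (align shapes from the trailing dimension; sizes are compatible when equal or when either is 1) could look like this. `broadcastShapes` is a hypothetical helper sketched for contributors, not an existing API:

```typescript
// NumPy-style broadcast rule: walk both shapes from the trailing dimension;
// sizes are compatible when equal or when either is 1.
function broadcastShapes(a: number[], b: number[]): number[] {
  const rank = Math.max(a.length, b.length);
  const out = new Array<number>(rank);
  for (let i = 0; i < rank; i++) {
    const da = a[a.length - 1 - i] ?? 1; // missing leading dims act as size 1
    const db = b[b.length - 1 - i] ?? 1;
    if (da !== db && da !== 1 && db !== 1) {
      throw new Error(`Shapes [${a}] and [${b}] are not broadcastable`);
    }
    out[rank - 1 - i] = Math.max(da, db);
  }
  return out;
}
```

For example, `[2, 1, 3]` and `[4, 3]` broadcast to `[2, 4, 3]`, while `[2, 3]` and `[4, 3]` should throw.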
### 🐛 Bug Hunting
- Memory leaks: GPU buffer cleanup
- Edge cases: Empty tensors, large arrays, device switching
- Performance regressions: Benchmark against baselines
### ⚡ New Kernels

Easy starter tasks! Copy `src/kernels/add.ts` to implement:
- Element-wise ops: `sub.ts`, `mul.ts`, `div.ts`
- Activation functions: `relu.ts`, `sigmoid.ts`, `tanh.ts`
- Reductions: `sum.ts`, `mean.ts`, `max.ts`
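When writing a new kernel, a plain CPU reference makes it easy to verify the GPU output. The reference functions and `allClose` helper below are illustrative suggestions, not part of the existing test suite:

```typescript
// CPU reference implementations to check new GPU kernels against.
const refSub = (a: number[], b: number[]) => a.map((v, i) => v - b[i]);
const refRelu = (a: number[]) => a.map((v) => Math.max(0, v));
const refSum = (a: number[]) => a.reduce((acc, v) => acc + v, 0);

// Compare GPU readback against the reference within a small epsilon
// to tolerate float rounding differences between CPU and GPU math.
function allClose(a: number[], b: number[], eps = 1e-6): boolean {
  return a.length === b.length &&
    a.every((v, i) => Math.abs(v - b[i]) <= eps);
}
```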
### 🚀 Kernel Optimization
Current kernels are naive (1 thread = 1 element):
- 2D tiling: Each thread handles 8x8 tiles
- Memory coalescing: Optimal memory access patterns
- Workgroup optimization: Better thread group utilization
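The index math behind 2D tiling can be sketched on the CPU: instead of one thread per element, each logical thread covers a TILE×TILE block of the output. The tile size and `processTile` name are illustrative; a real WGSL kernel would use the same arithmetic with workgroup and invocation IDs:

```typescript
// Each "thread" (tx, ty) processes a TILE x TILE block of the output matrix.
// This mirrors the index math a tiled WGSL kernel would use.
const TILE = 8;

function processTile(
  out: Float32Array, // row-major rows x cols matrix
  rows: number,
  cols: number,
  tx: number, // tile column index
  ty: number, // tile row index
  f: (row: number, col: number) => number,
): void {
  for (let r = ty * TILE; r < Math.min((ty + 1) * TILE, rows); r++) {
    for (let c = tx * TILE; c < Math.min((tx + 1) * TILE, cols); c++) {
      out[r * cols + c] = f(r, c); // consecutive c => contiguous writes
    }
  }
}
```

Edge tiles are clamped with `Math.min`, so matrix dimensions need not be multiples of the tile size.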
### 📦 Core Features

- `compile()` API: Lazy execution with kernel fusion
- Automatic differentiation: Backpropagation support
- Shape inference: Automatic output shape calculation
- Memory pooling: Buffer reuse and allocation optimization
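The memory-pooling idea can be sketched as a free list keyed by buffer size: releasing a buffer returns it to the pool instead of destroying it, and acquiring reuses a matching-size buffer when one exists. `BufferPool` and its methods are hypothetical names, and `Float32Array` stands in for a real `GPUBuffer`:

```typescript
// Size-bucketed buffer pool sketch: release() returns buffers to a free list
// instead of destroying them, and acquire() reuses a matching-size buffer.
class BufferPool {
  private free = new Map<number, Float32Array[]>();

  acquire(size: number): Float32Array {
    const bucket = this.free.get(size);
    if (bucket && bucket.length > 0) return bucket.pop()!;
    return new Float32Array(size); // real impl: device.createBuffer(...)
  }

  release(buf: Float32Array): void {
    const bucket = this.free.get(buf.length) ?? [];
    bucket.push(buf);
    this.free.set(buf.length, bucket);
  }
}
```

Exact-size bucketing is the simplest policy; a production allocator would likely round sizes up to power-of-two classes to improve reuse.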
### 📚 Documentation
- API reference: Complete function documentation
- Tutorials: WebGPU tensor programming guide
- Examples: Real-world use cases
### 📊 Benchmarks
- Performance testing: Comprehensive CPU vs GPU benchmarks
- Memory profiling: Track buffer allocation and cleanup
- Regression testing: Ensure optimizations don't break performance
### 🔍 More Bug Hunting
Seriously, we need thorough testing of edge cases, memory management, and cross-device operations.
## Learning Resources
- An Even Easier Introduction to CUDA - NVIDIA's excellent tutorial on parallel GPU programming
- LeetGPU - Learn how to write CUDA and compute shaders
- Optimizing a WebGPU Matmul Kernel - Inspiration for kernel optimization techniques
## License
MIT