# WebGPU Tensor Library
A high-performance tensor library built on WebGPU, designed for both eager and lazy execution with automatic CPU/GPU device management.
## What are Compute Shaders?
Compute shaders are GPU programs that run massively parallel computations. Unlike graphics shaders that render pixels, compute shaders can perform arbitrary calculations on large datasets. WebGPU exposes this power through a modern, cross-platform API that works in browsers and native environments.
**Performance Note:** For small matrices (< 1K elements), CPU is often faster due to GPU setup overhead. For large matrices (> 100K elements), GPU parallelism dominates. Our library handles both seamlessly.
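The crossover point varies by hardware, but a dispatch heuristic along these lines can pick the device automatically. This is an illustrative sketch: `pickDevice` and the threshold constant are hypothetical names, not part of the library's API.

```typescript
// Hypothetical device-selection heuristic. The threshold is an illustrative
// default, not a measured value from this library.
type Device = "cpu" | "gpu";

const GPU_THRESHOLD = 100_000; // above this, parallelism outweighs setup cost

function pickDevice(elementCount: number, gpuAvailable: boolean): Device {
  if (!gpuAvailable) return "cpu";
  // Small tensors: buffer upload + pipeline dispatch dominate, stay on CPU.
  return elementCount >= GPU_THRESHOLD ? "gpu" : "cpu";
}
```

In practice the threshold should be tuned per device, since integrated and discrete GPUs have very different setup costs.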
## Key Features
- Dual execution modes: Eager execution (immediate) and planned compiled execution (fused operations)
- Cross-device: Seamless CPU ↔ GPU transfers
- WebGPU native: Built for modern compute workloads
- TypeScript: Full type safety with device-aware types
## API Examples

### Eager Execution (Current)
```ts
import { Tensor } from "./src/tensor.ts";

// Create tensors on CPU or GPU
const a = Tensor.fromArray([1, 2, 3, 4, 5], { device: "gpu" });
const b = Tensor.fromArray([2, 3, 4, 5, 6], { device: "gpu" });

// Method chaining works - operations execute immediately
const result = await a.add(b).mul(2); // Each step creates a new tensor
const cpuResult = await result.to("cpu"); // Transfer back
console.log(cpuResult.toArray()); // [6, 10, 14, 18, 22]
```
### Compiled Execution (Planned)
```ts
import { Tensor, compile } from "./src/index.ts";

// Same chaining syntax, but with major optimizations:
const fusedOp = compile((x: Tensor, y: Tensor) => {
  return x.add(y).mul(2).relu(); // Fused into a single kernel
});

// Compiled mode advantages:
// ✅ Kernel fusion: single compute shader instead of 3 separate ones
// ✅ In-place operations: dynamic buffer allocator minimizes memory usage
// ✅ Auto cleanup: intermediate tensors destroyed at closure end
// ✅ Reuse: first call compiles, subsequent calls are blazing fast
const result = fusedOp(a, b); // Much faster than eager
```
## Design Decisions
**WebGPU Async Nature**: WebGPU operations are inherently asynchronous, but we don't always await intermediate steps: the runtime automatically queues and awaits the necessary operations, which enables better performance through automatic batching.
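One way such automatic queueing can work (a sketch under stated assumptions, not the library's actual runtime; `OpQueue` and its methods are hypothetical names) is to record pending operations during a synchronous chain and flush them together in a single microtask:

```typescript
// Sketch of an op queue that batches eagerly-chained operations.
type Op = () => void;

class OpQueue {
  private pending: Op[] = [];
  private scheduled = false;

  enqueue(op: Op): void {
    this.pending.push(op);
    if (!this.scheduled) {
      this.scheduled = true;
      // Flush once the current synchronous chain (a.add(b).mul(2)) finishes,
      // so all queued kernels go out in one batch.
      queueMicrotask(() => this.flush());
    }
  }

  flush(): void {
    const ops = this.pending;
    this.pending = [];
    this.scheduled = false;
    for (const op of ops) op(); // a real runtime would do one GPUQueue.submit()
  }
}
```

Batching at microtask boundaries means a whole chained expression submits as one unit, which is where the performance win over per-op submission comes from.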
Syntax Choices:
- `Tensor.fromArray()` for explicit construction
- `.to(device)` for clear device transfers
- Method chaining with automatic queueing
- Device-aware TypeScript types prevent cross-device errors at compile time
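The device-aware typing can be sketched with a phantom type parameter. `TypedTensor` below is a hypothetical, simplified stand-in for the library's real tensor class, shown only to illustrate the compile-time guarantee:

```typescript
// Phantom-typed tensor sketch: the device is tracked at the type level,
// so mixing CPU and GPU tensors fails to compile.
type Device = "cpu" | "gpu";

class TypedTensor<D extends Device> {
  constructor(readonly device: D, readonly data: number[]) {}

  // Both operands must live on the same device D.
  add(other: TypedTensor<D>): TypedTensor<D> {
    const out = this.data.map((v, i) => v + other.data[i]);
    return new TypedTensor(this.device, out);
  }
}

const g1 = new TypedTensor("gpu", [1, 2]);
const g2 = new TypedTensor("gpu", [3, 4]);
const c1 = new TypedTensor("cpu", [5, 6]);

const ok = g1.add(g2); // fine: both "gpu"
// g1.add(c1);         // compile-time error: "cpu" is not assignable to "gpu"
```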
## Platform Support
Currently using Deno exclusively due to its excellent built-in WebGPU support (`--unstable-webgpu`). Future plans include bundling for web browsers and Node.js.
## Getting Started

### Prerequisites
- Deno 1.40+ with WebGPU support
### Run Examples
```sh
deno task dev  # Run basic examples
deno run --unstable-webgpu --allow-all examples/basic_add.ts
deno run --unstable-webgpu --allow-all examples/performance_comparison.ts
```

### Run Tests

```sh
deno task test  # Run all tests
deno test --unstable-webgpu --allow-all tests/tensor_test.ts
```
## Contributing
We need help with several areas:
### 🔧 API Improvements
- Better TypeScript support: Current device typing could be more ergonomic
- Shape broadcasting: Automatic shape compatibility
- Error handling: Better error messages and recovery
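For the shape-broadcasting item, the standard NumPy-style rule (align shapes from the trailing dimension; sizes are compatible when equal or when either is 1) could look like this. `broadcastShapes` is a hypothetical helper sketched for contributors, not an existing API:

```typescript
// NumPy-style broadcast rule: walk both shapes from the trailing dimension;
// sizes are compatible when equal or when either is 1.
function broadcastShapes(a: number[], b: number[]): number[] {
  const rank = Math.max(a.length, b.length);
  const out = new Array<number>(rank);
  for (let i = 0; i < rank; i++) {
    const da = a[a.length - 1 - i] ?? 1; // missing leading dims act as size 1
    const db = b[b.length - 1 - i] ?? 1;
    if (da !== db && da !== 1 && db !== 1) {
      throw new Error(`Shapes [${a}] and [${b}] are not broadcastable`);
    }
    out[rank - 1 - i] = Math.max(da, db);
  }
  return out;
}
```

For example, `[2, 1, 3]` and `[4, 3]` broadcast to `[2, 4, 3]`, while `[2, 3]` and `[4, 3]` should throw.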
### 🐛 Bug Hunting
- Memory leaks: GPU buffer cleanup
- Edge cases: Empty tensors, large arrays, device switching
- Performance regressions: Benchmark against baselines
### ⚡ New Kernels

Easy starter tasks! Copy `src/kernels/add.ts` to implement:
- Element-wise ops: `sub.ts`, `mul.ts`, `div.ts`
- Activation functions: `relu.ts`, `sigmoid.ts`, `tanh.ts`
- Reductions: `sum.ts`, `mean.ts`, `max.ts`
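When writing a new kernel, a plain CPU reference makes it easy to verify the GPU output. The reference functions and `allClose` helper below are illustrative suggestions, not part of the existing test suite:

```typescript
// CPU reference implementations to check new GPU kernels against.
const refSub = (a: number[], b: number[]) => a.map((v, i) => v - b[i]);
const refRelu = (a: number[]) => a.map((v) => Math.max(0, v));
const refSum = (a: number[]) => a.reduce((acc, v) => acc + v, 0);

// Compare GPU readback against the reference within a small epsilon
// to tolerate float rounding differences between CPU and GPU math.
function allClose(a: number[], b: number[], eps = 1e-6): boolean {
  return a.length === b.length &&
    a.every((v, i) => Math.abs(v - b[i]) <= eps);
}
```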
### 🚀 Kernel Optimization
Current kernels are naive (1 thread = 1 element):
- 2D tiling: Each thread handles 8x8 tiles
- Memory coalescing: Optimal memory access patterns
- Workgroup optimization: Better thread group utilization
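The index math behind 2D tiling can be sketched on the CPU: instead of one thread per element, each logical thread covers a TILE×TILE block of the output. The tile size and `processTile` name are illustrative; a real WGSL kernel would use the same arithmetic with workgroup and invocation IDs:

```typescript
// Each "thread" (tx, ty) processes a TILE x TILE block of the output matrix.
// This mirrors the index math a tiled WGSL kernel would use.
const TILE = 8;

function processTile(
  out: Float32Array, // row-major rows x cols matrix
  rows: number,
  cols: number,
  tx: number, // tile column index
  ty: number, // tile row index
  f: (row: number, col: number) => number,
): void {
  for (let r = ty * TILE; r < Math.min((ty + 1) * TILE, rows); r++) {
    for (let c = tx * TILE; c < Math.min((tx + 1) * TILE, cols); c++) {
      out[r * cols + c] = f(r, c); // consecutive c => contiguous writes
    }
  }
}
```

Edge tiles are clamped with `Math.min`, so matrix dimensions need not be multiples of the tile size.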
### 📦 Core Features

- `compile()` API: Lazy execution with kernel fusion
- Automatic differentiation: Backpropagation support
- Shape inference: Automatic output shape calculation
- Memory pooling: Buffer reuse and allocation optimization
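The memory-pooling idea can be sketched as a free list keyed by buffer size: releasing a buffer returns it to the pool instead of destroying it, and acquiring reuses a matching-size buffer when one exists. `BufferPool` and its methods are hypothetical names, and `Float32Array` stands in for a real `GPUBuffer`:

```typescript
// Size-bucketed buffer pool sketch: release() returns buffers to a free list
// instead of destroying them, and acquire() reuses a matching-size buffer.
class BufferPool {
  private free = new Map<number, Float32Array[]>();

  acquire(size: number): Float32Array {
    const bucket = this.free.get(size);
    if (bucket && bucket.length > 0) return bucket.pop()!;
    return new Float32Array(size); // real impl: device.createBuffer(...)
  }

  release(buf: Float32Array): void {
    const bucket = this.free.get(buf.length) ?? [];
    bucket.push(buf);
    this.free.set(buf.length, bucket);
  }
}
```

Exact-size bucketing is the simplest policy; a production allocator would likely round sizes up to power-of-two classes to improve reuse.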
### 📚 Documentation
- API reference: Complete function documentation
- Tutorials: WebGPU tensor programming guide
- Examples: Real-world use cases
### 📊 Benchmarks
- Performance testing: Comprehensive CPU vs GPU benchmarks
- Memory profiling: Track buffer allocation and cleanup
- Regression testing: Ensure optimizations don't break performance
### 🔍 More Bug Hunting
Seriously, we need thorough testing of edge cases, memory management, and cross-device operations.
## Learning Resources
- An Even Easier Introduction to CUDA - NVIDIA's excellent tutorial on parallel GPU programming
- LeetGPU - Learn how to write CUDA and compute shaders
- Optimizing a WebGPU Matmul Kernel - Inspiration for kernel optimization techniques
## License
MIT