Tugbars - Overview
Pinned
-
A GPU-accelerated SMC² framework with Rao–Blackwellized inner filters and correlated PMMH rejuvenation, designed to amortize likelihood-based trial-and-error through massive parallelism.
Cuda
-
Hand-written PTX flash attention kernel reaching 136 TFLOPS FP16 (58% tensor core utilization) on an RTX 5080, matching Flash Attention 2 on an A100 without WGMMA, TMA, or datacenter hardware.
Cuda
-
Bootstrap particle filter (BPF) in hand-written PTX, for educational purposes.
Cuda
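For readers unfamiliar with the algorithm named in the last entry, the following is a minimal CPU sketch of a bootstrap particle filter in plain Python. It is not taken from the repository and says nothing about how the PTX version is written. The toy linear-Gaussian model, the function names `simulate` and `bootstrap_particle_filter`, and all parameter defaults are assumptions chosen for illustration. It shows the three steps that define the method: propagate particles through the transition model, weight them by the observation likelihood, then resample.

```python
import math
import random


def simulate(n_steps, phi=0.9, sigma_x=1.0, sigma_y=1.0, seed=1):
    """Draw states and observations from a toy model (an assumption):
    x_t = phi * x_{t-1} + N(0, sigma_x^2),  y_t = x_t + N(0, sigma_y^2)."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, sigma_x)
    states, observations = [], []
    for _ in range(n_steps):
        x = phi * x + rng.gauss(0.0, sigma_x)
        states.append(x)
        observations.append(x + rng.gauss(0.0, sigma_y))
    return states, observations


def bootstrap_particle_filter(observations, n_particles=500, phi=0.9,
                              sigma_x=1.0, sigma_y=1.0, seed=0):
    """Return (filtered state means, log-likelihood estimate)."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, sigma_x) for _ in range(n_particles)]
    log_norm = -math.log(sigma_y) - 0.5 * math.log(2.0 * math.pi)
    means, log_lik = [], 0.0
    for y in observations:
        # 1. Propagate: sample each particle from the transition density.
        particles = [phi * x + rng.gauss(0.0, sigma_x) for x in particles]
        # 2. Weight: Gaussian observation log-density of y given each particle.
        log_w = [log_norm - 0.5 * ((y - x) / sigma_y) ** 2 for x in particles]
        max_lw = max(log_w)  # subtract the max before exp for stability
        w = [math.exp(lw - max_lw) for lw in log_w]
        total = sum(w)
        # Running estimate of log p(y_1..t): log of the mean unnormalized weight.
        log_lik += max_lw + math.log(total / n_particles)
        w = [wi / total for wi in w]
        means.append(sum(wi * xi for wi, xi in zip(w, particles)))
        # 3. Resample (multinomial): draw particles in proportion to weight.
        particles = rng.choices(particles, weights=w, k=n_particles)
    return means, log_lik


if __name__ == "__main__":
    states, obs = simulate(100)
    means, log_lik = bootstrap_particle_filter(obs)
    err = sum(abs(m - s) for m, s in zip(means, states)) / len(states)
    print(f"mean abs error: {err:.3f}, log-likelihood estimate: {log_lik:.2f}")
```

The resampling here is multinomial for brevity; systematic or stratified resampling is a common lower-variance alternative. The returned log-likelihood estimate is the quantity that particle MCMC methods such as PMMH consume.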