Tugbars - Overview

Pinned

A GPU-accelerated SMC² framework with Rao–Blackwellized inner filters and correlated PMMH rejuvenation, designed to amortize likelihood-based trial-and-error through massive parallelism.

Cuda
Hand-written PTX FlashAttention kernel that reaches 58% tensor core utilization (136 TFLOPS FP16) on an RTX 5080, matching FlashAttention-2 on an A100 without WGMMA, TMA, or datacenter hardware.

Cuda
BPF: a bootstrap particle filter in hand-written PTX, for educational purposes.

Cuda 1
High-performance ICEEMDAN implementation using Intel MKL. Header-only C++17, OpenMP-parallelized, ~11 ms at 2048 samples. Cubic/Akima splines and multiple processing modes (Standard/Finance/Scientific).

C++ 2 1
VectorFFT is a vectorized, pure C FFT library optimized for x86 processors (AVX-512, AVX2, SSE2) with zero external dependencies. It implements mixed-radix algorithms for common sizes and Bluestein…

C 4 1
High-performance Savitzky-Golay filter in C: batch, streaming, and 2D image processing. Embedded-friendly with coefficient export for MCUs. MATLAB-validated.

C 24 3