Tugbars - Overview

Pinned

  1. A GPU-accelerated SMC² framework with Rao–Blackwellized inner filters and correlated PMMH rejuvenation, designed to amortize likelihood-based trial-and-error through massive parallelism.

    Cuda

  2. Hand-written PTX flash attention kernel achieving 58% tensor core utilization on an RTX 5080, matching Flash Attention 2 on an A100 without WGMMA, TMA, or datacenter hardware. 136 TFLOPS FP16.

    Cuda

  3. Bootstrap Particle Filter (BPF) in hand-written PTX, for educational purposes.

    Cuda 1

  4. High-performance ICEEMDAN implementation using Intel MKL. Header-only C++17, OpenMP-parallelized, ~11 ms at 2048 samples. Cubic/Akima splines, multiple processing modes (Standard/Finance/Scientific).

    C++ 2 1

  5. VectorFFT is a vectorized, pure C FFT library optimized for x86 processors (AVX-512, AVX2, SSE2) with zero external dependencies. It implements mixed-radix algorithms for common sizes and Bluestein…

    C 4 1

  6. High-performance Savitzky-Golay filter in C: batch, streaming, and 2D image processing. Embedded-friendly with coefficient export for MCUs. MATLAB-validated.

    C 24 3
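
As a rough illustration of the idea behind the Savitzky-Golay entry above, here is a minimal C sketch of batch smoothing with a fixed coefficient table. It is not the repository's API: the names `SG5_WEIGHTS` and `sg5_smooth` are hypothetical, the window is fixed at the textbook 5-point quadratic/cubic weights (-3, 12, 17, 12, -3)/35, and the two samples at each edge are simply copied through, which a real implementation would likely handle more carefully.

```c
#include <assert.h>
#include <stddef.h>

/* Textbook 5-point quadratic/cubic Savitzky-Golay smoothing weights.
 * A constant table like this is the kind of thing that could be
 * exported to a microcontroller build (assumption, for illustration). */
static const double SG5_WEIGHTS[5] = {
    -3.0 / 35.0, 12.0 / 35.0, 17.0 / 35.0, 12.0 / 35.0, -3.0 / 35.0
};

/* Hypothetical helper: smooth in[0..n-1] into out[0..n-1].
 * Interior samples get the 5-point convolution; the first two and
 * last two samples are copied unchanged (a simplification). */
void sg5_smooth(const double *in, double *out, size_t n)
{
    assert(in != NULL && out != NULL && in != out);

    for (size_t i = 0; i < n; ++i) {
        if (i < 2 || i + 2 >= n) {
            out[i] = in[i];          /* edge sample: pass through */
            continue;
        }
        double acc = 0.0;
        for (size_t k = 0; k < 5; ++k) {
            acc += SG5_WEIGHTS[k] * in[i + k - 2];
        }
        out[i] = acc;                /* weighted local polynomial fit */
    }
}
```

Because the weights come from a least-squares polynomial fit, any input that is already a polynomial of degree three or lower passes through the interior unchanged, which makes a convenient sanity check when porting the table to another target.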