This directory contains examples demonstrating how to use Nsight Python for profiling and visualizing CUDA kernel performance.
Prerequisites
Required
- Python 3.10+
- CUDA-capable GPU
- NVIDIA Nsight Compute (for profiling)
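Before running the examples, it can help to confirm that the basic requirements are in place. The snippet below is a minimal sketch, not part of Nsight Python: the helper name `check_prerequisites` is hypothetical, and it assumes the Nsight Compute command-line tool is installed as `ncu` and is on your PATH. It does not check for a CUDA-capable GPU.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10), profiler="ncu"):
    """Report whether the Python version and the profiler CLI look usable.

    Hypothetical helper: `profiler` is assumed to be the name of the
    Nsight Compute executable as it appears on PATH.
    """
    return {
        # sys.version_info compares like a tuple, e.g. (3, 11, ...) >= (3, 10)
        "python": sys.version_info >= min_python,
        # shutil.which returns None when the executable is not found on PATH
        "nsight_compute": shutil.which(profiler) is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

If Nsight Compute is installed somewhere that is not on PATH, this check reports it as missing even though it is present.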
Python Dependencies
The examples require additional packages beyond the base nsight package:
PyTorch
Most examples use PyTorch for GPU operations:
# Install PyTorch with CUDA support matching your system (e.g., CUDA 12.6, 12.9, 13.0)
# Replace cuXXX with your CUDA version (e.g., cu126, cu129, cu130)
pip install torch --index-url https://download.pytorch.org/whl/cuXXX
Visit pytorch.org for installation commands matching your specific CUDA version.
Triton (Optional)
The Triton example (07_triton_minimal.py) also requires the Triton package.
Quick Start
The examples are numbered in order of complexity. Start with 00_minimal.py:
cd examples
python 00_minimal.py
This will profile a simple matrix multiplication and generate a plot showing the performance.
Examples Overview
- 00_minimal.py - Simplest possible benchmark
  - Basic @nsight.analyze.kernel usage
  - Single parameter configuration sweep
  - Default time-based profiling
- 01_compare_throughput.py - Comparing implementations
  - Multiple annotated regions (different matmul implementations)
  - Using Nsight Compute metrics (DRAM throughput)
  - Using @nsight.annotate as a function decorator
  - Printing collected data with print_data=True
- 02_parameter_sweep.py - Sweeping parameters
  - Multiple configuration values
  - Visualizing performance across problem sizes
- 03_custom_metrics.py - Computing TFLOPs
  - Using derive_metric to compute a custom metric
  - Understanding the metric function signature
  - Transforming time measurements into performance metrics
- 04_multi_parameter.py - Multiple parameters
  - Using itertools.product() for parameter combinations
  - Flexible metric functions with the *conf pattern
  - Handling multiple configuration dimensions
- 05_subplots.py - Creating subplot grids
  - Using row_panels and col_panels
  - Organizing multi-dimensional data visually
  - Creating publication-ready plots
- 06_plot_customization.py - Advanced plotting
  - Customizing plot appearance
  - Using plot_callback for advanced control
  - Line plots vs bar charts
  - Annotating data points
- 07_triton_minimal.py - Profiling Triton kernels
  - Integrating Triton GPU kernels
  - Using variant_fields and variant_annotations
  - Comparing against PyTorch baselines with normalize_against
  - Showing speedup metrics
- 08_multiple_metrics.py - Collecting multiple metrics
  - Collecting multiple metrics using a sequence of metric names
  - Merged results with "Metric" column in DataFrame
  - @plot decorator incompatible with multiple metrics
- 09_advanced_metric_custom.py - Computing an advanced custom metric
  - Using derive_metric to compute a custom metric from multiple metrics
- 10_multiple_kernels_combine.py - Combining metrics from multiple kernels
  - Using combine_kernel_metrics to aggregate metrics from multiple kernels
  - Summing metrics from consecutive kernel executions
- 11_output_csv.py - Outputting to CSV
  - Using output_csv parameter to enable/disable CSV file generation
  - Using output_prefix to specify output file location and naming
  - Reading and displaying generated CSV files
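The idea behind 03_custom_metrics.py, turning a time measurement into TFLOPs, comes down to a small piece of arithmetic. The sketch below is not taken from the example itself: the function name `matmul_tflops` is hypothetical, and it assumes a square n x n matrix multiplication with the measured time given in nanoseconds. Check the example file for the actual signature that derive_metric expects.

```python
def matmul_tflops(time_ns: float, n: int) -> float:
    """Convert the time of an n x n by n x n matmul into TFLOP/s.

    Assumption: time_ns is the kernel duration in nanoseconds.
    """
    # Each of the n*n outputs needs n multiplies and n adds: 2 * n^3 operations
    flops = 2 * n**3
    seconds = time_ns * 1e-9
    # Divide by 1e12 to go from FLOP/s to TFLOP/s
    return flops / seconds / 1e12


if __name__ == "__main__":
    # A 4096 x 4096 matmul that takes 1 ms (1e6 ns)
    print(f"{matmul_tflops(1e6, 4096):.2f} TFLOP/s")
```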
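The parameter combinations and the *conf pattern listed under 04_multi_parameter.py can be illustrated with the standard library alone. This is a sketch under assumptions: the parameter names (`sizes`, `dtypes`) and the function `tflops_any_conf` are hypothetical, and the measured value is assumed to arrive first, followed by the configuration values.

```python
import itertools

# Hypothetical sweep dimensions
sizes = [1024, 2048]
dtypes = ["float16", "float32"]

# Every (size, dtype) pair; the last input list varies fastest
configs = list(itertools.product(sizes, dtypes))


def tflops_any_conf(time_ns, *conf):
    """Metric function that tolerates any number of configuration values.

    Only the first configuration value (the matrix size n) is used, so
    adding another sweep dimension does not require changing this function.
    """
    n = conf[0]
    return 2 * n**3 / (time_ns * 1e-9) / 1e12


if __name__ == "__main__":
    for conf in configs:
        print(conf, tflops_any_conf(1e6, *conf))
```

Collecting the trailing arguments in *conf keeps the metric function stable as configuration dimensions are added or removed.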
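Reading a generated CSV file, as covered in 11_output_csv.py, can be done with Python's built-in csv module. The sketch below uses an in-memory sample so that it runs without a GPU. The column names and values are made up for illustration and may not match what Nsight Python writes, apart from the "Metric" column mentioned above.

```python
import csv
import io

# Hypothetical CSV content; real output files may use different columns
SAMPLE = """Annotation,n,Metric,AvgValue
matmul,1024,gpu__time_duration.sum,1000000
matmul,2048,gpu__time_duration.sum,8000000
"""


def load_rows(file_obj):
    """Read a CSV file object into a list of dicts keyed by the header row."""
    return list(csv.DictReader(file_obj))


rows = load_rows(io.StringIO(SAMPLE))

if __name__ == "__main__":
    for row in rows:
        # csv.DictReader yields every value as a string
        print(row["Annotation"], row["n"], row["AvgValue"])
```

For a file on disk, pass `open(path, newline="")` instead of the `io.StringIO` object.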