nsight-python/examples at main · NVIDIA/nsight-python

This directory contains examples demonstrating how to use Nsight Python for profiling and visualizing CUDA kernel performance.

Prerequisites

Required

  • Python 3.10+
  • CUDA-capable GPU
  • NVIDIA Nsight Compute (for profiling)

Python Dependencies

The examples require additional packages beyond the base nsight package:

PyTorch

Most examples use PyTorch for GPU operations:

# Install PyTorch with CUDA support matching your system (e.g., CUDA 12.6, 12.9, 13.0)
# Replace cuXXX with your CUDA version (e.g., cu126, cu129, cu130)
pip install torch --index-url https://download.pytorch.org/whl/cuXXX

Visit pytorch.org for installation commands matching your specific CUDA version.

Triton (Optional)

For the Triton examples (07_triton_minimal.py):

Quick Start

The examples are numbered in order of complexity. Start with 00_minimal.py:

cd examples
python 00_minimal.py

This will profile a simple matrix multiplication and generate a plot showing the performance.

Examples Overview

  • 00_minimal.py - Simplest possible benchmark

    • Basic @nsight.analyze.kernel usage
    • Single parameter configuration sweep
    • Default time-based profiling
  • 01_compare_throughput.py - Comparing implementations

    • Multiple annotated regions (different matmul implementations)
    • Using NSight Compute metrics (DRAM throughput)
    • Using @nsight.annotate as a function decorator
    • Printing collected data with print_data=True
  • 02_parameter_sweep.py - Sweeping parameters

    • Multiple configuration values
    • Visualizing performance across problem sizes
  • 03_custom_metrics.py - Computing TFLOPs

    • Using derive_metric to compute custom metric
    • Understanding the metric function signature
    • Transforming time measurements into performance metrics
  • 04_multi_parameter.py - Multiple parameters

    • Using itertools.product() for parameter combinations
    • Flexible metric functions with *conf pattern
    • Handling multiple configuration dimensions
  • 05_subplots.py - Creating subplot grids

    • Using row_panels and col_panels
    • Organizing multi-dimensional data visually
    • Creating publication-ready plots
  • 06_plot_customization.py - Advanced plotting

    • Customizing plot appearance
    • Using plot_callback for advanced control
    • Line plots vs bar charts
    • Annotating data points
  • 07_triton_minimal.py - Profiling Triton kernels

    • Integrating Triton GPU kernels
    • Using variant_fields and variant_annotations
    • Comparing against PyTorch baselines with normalize_against
    • Showing speedup metrics
  • 08_multiple_metrics.py - Collecting multiple metrics

    • Collecting multiple metrics with using a sequence of metric names
    • Merged results with "Metric" column in DataFrame
    • @plot decorator incompatible with multiple metrics
  • 09_advanced_metric_custom.py - Computing advanced custom metric

    • Using derive_metric to compute custom metric from multiple metrics
  • 10_multiple_kernels_combine.py - Combining metrics from multiple kernels

    • Using combine_kernel_metrics to aggregate metrics from multiple kernels
    • Summing metrics from consecutive kernel executions
  • 11_output_csv.py - Outputting to CSV

    • Using output_csv parameter to enable/disable CSV file generation
    • Using output_prefix to specify output file location and naming
    • Reading and displaying generated CSV files