ROS 2 Triager: Advanced Runtime Diagnostics for Robotic Systems
ROS 2 Triager is a command-line interface (CLI) plugin for runtime graph diagnostics in ROS 2 environments. It goes beyond static configuration checks and inspects your running robot, identifying issues such as dead topics, QoS mismatches, and TF tree problems, and it pairs each finding with an actionable suggestion for resolution.
ros2 doctor answers: "Is your ROS 2 system installed correctly?"
ros2 triage answers: "Is your robot behaving correctly right now?"
Key Differentiators & Novelty
ROS 2 Triager focuses on dynamic runtime analysis of the live graph, which matters for maintaining the health and performance of complex robotic deployments. Its distinguishing features are:
- Live Monitoring (--watch): Continuously observe your robot's health in real time, with automatic refreshes and clear terminal output, allowing for immediate detection and response to emerging issues.
- Snapshot-based Differencing (--snapshot-save, --snapshot-diff): Establish a "golden" baseline of a healthy system's graph state and then automatically compare current states against it. This enables proactive detection of regressions, unexpected changes, and graph drift over time.
- Expected Node Checking (--expected YAML): Define the anticipated set of running nodes in a simple YAML file. ROS 2 Triager will then report any missing or unexpected nodes, ensuring all critical components are operational.
- Actionable Suggestions: Every detected finding is accompanied by a clear, concise, and context-aware suggestion, so users can quickly diagnose and resolve problems.
- CI/CD Integration (--json): Generate machine-readable JSON output and use exit codes for severity levels, so the tool can run in automated testing and continuous integration/continuous deployment pipelines.
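The snapshot-based differencing idea can be illustrated with plain set arithmetic. This is a minimal sketch, not the tool's implementation: the snapshot layout (a dict with "nodes" and "topics" lists) and the function name diff_snapshots are assumptions made for illustration only.

```python
from typing import Dict, List


def diff_snapshots(baseline: Dict[str, List[str]],
                   current: Dict[str, List[str]]) -> Dict[str, Dict[str, List[str]]]:
    """Report what was added to or removed from each graph category.

    The snapshot layout used here is hypothetical; the real snapshot
    file format is not documented in this README.
    """
    drift = {}
    for key in sorted(set(baseline) | set(current)):
        before = set(baseline.get(key, []))
        after = set(current.get(key, []))
        drift[key] = {
            "added": sorted(after - before),
            "removed": sorted(before - after),
        }
    return drift


# Hypothetical baseline and current graph states.
baseline = {"nodes": ["/nav2_node", "/sensor_node"], "topics": ["/cmd_vel", "/scan"]}
current = {"nodes": ["/nav2_node"], "topics": ["/cmd_vel", "/scan", "/debug"]}

drift = diff_snapshots(baseline, current)
print(drift)
```

A missing node then shows up under "removed" and a new topic under "added", which is the kind of graph drift a baseline comparison is meant to surface.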
Features
ROS 2 Triager provides a comprehensive suite of checks to ensure the robustness of your robotic applications:
Core Checks
| Check | What it finds | Flag |
|---|---|---|
| Dead Topics | Topics with publishers but no subscribers (or vice versa), indicating communication breakdowns. | --dead-topics / --no-dead-topics |
| QoS Mismatches | Incompatible Quality of Service settings (e.g., reliability, durability) between publishers and subscribers, leading to message loss. | --qos / --no-qos |
| TF Tree Issues | Missing frames, broken transform chains, or inconsistencies in the robot's coordinate transformation tree. | --tf / --no-tf |
| Hz Rate Check | Anomalies in topic publishing rates, flagging topics that are slower than expected. | --check-hz |
| Expected Nodes | Deviations from a predefined list of expected running nodes, identifying missing or rogue processes. | --expected YAML_FILE |
| Graph Drift | Changes in the ROS 2 graph structure compared to a saved baseline snapshot. | --snapshot-diff FILE |
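To make the QoS mismatch check concrete, here is a minimal sketch of the standard DDS request-versus-offer rule for reliability and durability: a connection is compatible only if the publisher offers at least what the subscriber requests. The EndpointQos class and the qos_mismatches function are hypothetical names for illustration; they are not the tool's API and not rclpy types.

```python
from dataclasses import dataclass
from typing import List

# Higher rank means a stronger guarantee. A publisher must offer at least
# the rank that the subscriber requests.
RELIABILITY_RANK = {"BEST_EFFORT": 0, "RELIABLE": 1}
DURABILITY_RANK = {"VOLATILE": 0, "TRANSIENT_LOCAL": 1}


@dataclass(frozen=True)
class EndpointQos:
    """Hypothetical stand-in for the QoS of one publisher or subscriber."""
    reliability: str
    durability: str


def qos_mismatches(pub: EndpointQos, sub: EndpointQos) -> List[str]:
    """Return a description of each incompatible policy (empty if compatible)."""
    problems = []
    if RELIABILITY_RANK[pub.reliability] < RELIABILITY_RANK[sub.reliability]:
        problems.append(
            f"reliability: publisher={pub.reliability}, subscriber={sub.reliability}")
    if DURABILITY_RANK[pub.durability] < DURABILITY_RANK[sub.durability]:
        problems.append(
            f"durability: publisher={pub.durability}, subscriber={sub.durability}")
    return problems


best_effort_pub = EndpointQos("BEST_EFFORT", "VOLATILE")
reliable_sub = EndpointQos("RELIABLE", "VOLATILE")
print(qos_mismatches(best_effort_pub, reliable_sub))
```

Note the asymmetry: a RELIABLE publisher paired with a BEST_EFFORT subscriber is compatible, while the reverse pairing is not.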
Advanced Diagnostics
| Check | What it finds | Flag |
|---|---|---|
| Latency Analysis | Message timing and jitter (T_arrival - T_header_stamp) for stamped messages. | --check-latency |
| DDS Domain Probe | Port conflicts, domain mismatches, multicast configuration issues. | --check-dds |
| Correlation Engine | Root cause analysis using multi-signal correlation (graph + OS + logs). | --correlate |
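The latency formula in the table (T_arrival - T_header_stamp) can be worked through in a few lines. This sketch assumes that jitter is reported as the population standard deviation of the per-message latencies; the README does not define jitter precisely, so that choice and the latency_stats name are assumptions.

```python
import statistics
from typing import Dict, List, Tuple


def latency_stats(samples: List[Tuple[float, float]]) -> Dict[str, float]:
    """Summarize latency for (arrival_time, header_stamp) pairs in seconds.

    Jitter is computed here as the population standard deviation of the
    latencies, which is an assumption, not the tool's documented definition.
    """
    latencies = [arrival - stamp for arrival, stamp in samples]
    return {
        "mean_latency": statistics.fmean(latencies),
        "max_latency": max(latencies),
        "jitter": statistics.pstdev(latencies),
    }


# Two hypothetical stamped messages: 0.25 s and 0.75 s end-to-end delay.
stats = latency_stats([(1.25, 1.0), (2.75, 2.0)])
print(stats)
```

Both clocks must be synchronized for the subtraction to be meaningful; on a multi-machine robot an unsynchronized clock shows up as a constant latency offset rather than as jitter.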
Visualization & Monitoring
| Feature | Description | Flag |
|---|---|---|
| Rich TUI | Enhanced terminal output with colors and panels (requires rich). | --rich / --no-rich |
| Interactive Mode | Keyboard-navigable dashboard for exploring findings. | --interactive |
| Watch Mode | Live monitoring with auto-refresh. | --watch |
| Simulation Mode | Suppress Gazebo/Rviz/visualization topics. | --simulation |
All findings are severity-ranked (1=INFO, 2=WARN, 3=CRIT) and include actionable suggestions.
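The severity ranking can be sketched as a filter plus a sort. The example assumes that a threshold of N keeps findings with severity N or higher, which matches the --severity-threshold 3 usage shown in the Quickstart; the dict layout and function name are illustrative only.

```python
from typing import Dict, List

# Severity levels as documented: 1=INFO, 2=WARN, 3=CRIT.
INFO, WARN, CRIT = 1, 2, 3
SEVERITY_LABELS = {INFO: "INFO", WARN: "WARN", CRIT: "CRIT"}


def filter_by_severity(findings: List[Dict], threshold: int) -> List[Dict]:
    """Keep findings at or above the threshold, most severe first."""
    kept = [f for f in findings if f["severity"] >= threshold]
    # sorted() is stable, so findings of equal severity keep their order.
    return sorted(kept, key=lambda f: f["severity"], reverse=True)


# Hypothetical findings for illustration.
findings = [
    {"topic": "/debug", "severity": INFO},
    {"topic": "/scan", "severity": WARN},
    {"topic": "/cmd_vel", "severity": CRIT},
]

for finding in filter_by_severity(findings, WARN):
    print(f"[{SEVERITY_LABELS[finding['severity']]}] {finding['topic']}")
```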
Architecture
The modular architecture of ROS 2 Triager ensures efficient and extensible diagnostic capabilities. It operates by leveraging a temporary rclpy node to non-intrusively inspect the live ROS 2 graph.
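As a rough illustration of inspecting the graph through a temporary rclpy node, the sketch below separates a pure classification step from the live query. It is not the project's code: the node name, the settle delay, and the "UNSUBSCRIBED" label are assumptions, and collect_counts only works in a sourced ROS 2 environment.

```python
from typing import Dict, Tuple


def classify_dead_topics(counts: Dict[str, Tuple[int, int]]) -> Dict[str, str]:
    """Map each dead topic to a label, given (publisher, subscriber) counts."""
    dead = {}
    for topic, (pubs, subs) in counts.items():
        if subs > 0 and pubs == 0:
            dead[topic] = "UNPUBLISHED"
        elif pubs > 0 and subs == 0:
            dead[topic] = "UNSUBSCRIBED"  # label assumed for this sketch
    return dead


def collect_counts(settle_seconds: float = 1.0) -> Dict[str, Tuple[int, int]]:
    """Query the live graph through a short-lived rclpy node."""
    import time
    import rclpy  # imported lazily so classify_dead_topics works without ROS 2

    rclpy.init()
    node = rclpy.create_node("triage_inspector_sketch")  # hypothetical name
    try:
        time.sleep(settle_seconds)  # give discovery time to populate the graph
        return {
            name: (node.count_publishers(name), node.count_subscribers(name))
            for name, _types in node.get_topic_names_and_types()
        }
    finally:
        node.destroy_node()
        rclpy.shutdown()


# Classification on hypothetical counts, runnable without ROS 2 installed.
sample = {"/cmd_vel": (0, 1), "/scan": (1, 0), "/odom": (1, 2)}
print(classify_dead_topics(sample))
```

Keeping the classification free of rclpy makes it easy to unit test, while the temporary node is created and destroyed within a single call so it leaves nothing behind in the graph.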
Core Components:
- ROS 2 CLI Integration: Seamlessly integrates as a ros2 subcommand (ros2 triage).
- TriageCommand: The primary entry point, handling argument parsing and orchestrating the diagnostic process.
- rclpy Node (Inspector): A transient ROS 2 node responsible for gathering real-time information about topics, nodes, QoS settings, and the TF tree.
- Checks Orchestration: Manages the execution of various diagnostic modules, including both foundational and advanced checks.
- Findings: Standardized data structures encapsulating detected issues, their severity, and suggested resolutions.
- Reporter: Formats findings for human-readable console output (with color-coding) or machine-readable JSON for automated systems.
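The relationship between Findings and the Reporter can be sketched with a dataclass and a function that groups findings into the JSON shape documented under "JSON output (--json)" in this README. The field names follow that sample; the class layout and the build_report name are assumptions, not the contents of finding.py or reporter.py.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class Finding:
    """Hypothetical finding record mirroring the fields of the JSON sample."""
    check: str
    topic: str
    severity: int  # 1=INFO, 2=WARN, 3=CRIT
    message: str
    suggestion: str


def build_report(findings: List[Finding]) -> Dict:
    """Group findings by check and add the summary counters."""
    by_check: Dict[str, List[Dict]] = {}
    for finding in findings:
        by_check.setdefault(finding.check, []).append(asdict(finding))
    return {
        "schema_version": "1.0",
        "total_findings": len(findings),
        "summary": {
            "critical": sum(f.severity == 3 for f in findings),
            "warning": sum(f.severity == 2 for f in findings),
            "info": sum(f.severity == 1 for f in findings),
        },
        "checks": [{"name": name, "findings": items}
                   for name, items in by_check.items()],
    }


report = build_report([
    Finding("dead_topics", "/cmd_vel", 3,
            "1 subscriber(s) [nav2_node] but 0 publishers - topic is UNPUBLISHED.",
            "Check if the node that should publish this topic is running..."),
])
print(json.dumps(report, indent=2))
```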
Workflow
ROS 2 Triager's workflow is designed for both interactive debugging and automated system health monitoring.
Diagnostic Flow:
- Command Execution: A user or automated system invokes ros2 triage.
- Initialization & Argument Parsing: The tool initializes the ROS 2 context and processes command-line arguments to determine the desired checks and output format.
- Execution Mode Selection: Depending on the --watch flag, it either performs a single diagnostic run or enters a continuous monitoring loop.
- Graph Introspection: The Inspector Node builds a real-time snapshot of the ROS 2 graph.
- Check Execution: All enabled diagnostic checks are performed against the current graph state.
- Finding Collection: Results from each check are aggregated into a comprehensive list of findings.
- Reporting: Findings are presented in the specified format (human-readable or JSON).
- Exit Status: The tool exits with a status code reflecting the highest severity finding, enabling CI/CD pipeline integration.
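The last three steps of the flow (check execution, finding collection, exit status) can be sketched as below. The exact exit code mapping is not spelled out in this README beyond "code 1 if any severity-3 findings exist", so the 0/1 mapping, the function names, and the toy check are assumptions.

```python
from typing import Callable, Dict, List

# A check takes a graph snapshot and returns a list of findings.
Check = Callable[[Dict], List[Dict]]


def run_checks(graph: Dict, checks: List[Check]) -> List[Dict]:
    """Run every enabled check against one graph snapshot."""
    findings: List[Dict] = []
    for check in checks:
        findings.extend(check(graph))
    return findings


def exit_code(findings: List[Dict], threshold: int = 3) -> int:
    """Return 1 if any finding reaches the threshold, else 0 (assumed mapping)."""
    return 1 if any(f["severity"] >= threshold for f in findings) else 0


def unpublished_check(graph: Dict) -> List[Dict]:
    """Toy check: flag topics that have subscribers but no publishers."""
    return [
        {"topic": topic, "severity": 3}
        for topic, (pubs, subs) in graph["topics"].items()
        if pubs == 0 and subs > 0
    ]


# Hypothetical snapshot: topic name -> (publisher count, subscriber count).
graph = {"topics": {"/cmd_vel": (0, 1), "/odom": (1, 1)}}
findings = run_checks(graph, [unpublished_check])
print(findings, exit_code(findings))
```

In watch mode the same run_checks call would sit inside a loop with a refresh interval, and the exit code would only matter for single runs.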
Quickstart
Prerequisites
# ROS 2 Humble (or compatible distro, e.g., Jazzy, Rolling)
source /opt/ros/humble/setup.bash
Build
cd ~/ros2-triager
colcon build --symlink-install --packages-select ros2_triage
source install/local_setup.bash
Run
# Full check (all default checks enabled)
ros2 triage

# JSON output for CI/CD pipelines
ros2 triage --json

# Only critical findings (severity 3)
ros2 triage --severity-threshold 3

# Skip QoS check
ros2 triage --no-qos

# Enable Hz rate check (measures for 3 seconds by default)
ros2 triage --check-hz

# Check against an expected_nodes.yaml file
ros2 triage --expected path/to/expected_nodes.yaml

# Save a snapshot of the current healthy graph state
ros2 triage --snapshot-save healthy_baseline.json

# Diff current state against a saved snapshot
ros2 triage --snapshot-diff healthy_baseline.json

# Live monitoring mode (refreshes every 5 seconds)
ros2 triage --watch

# Advanced: Check message latency and jitter
ros2 triage --check-latency --latency-window 5.0

# Advanced: Probe DDS domain for conflicts
ros2 triage --check-dds

# Advanced: Enable root cause analysis
ros2 triage --correlate

# Advanced: Interactive TUI dashboard
ros2 triage --interactive

# Help
ros2 triage --help
Example Output
============================================================
ros2 triage - Runtime Diagnostic Report
============================================================
DEAD TOPICS
Topics with missing publishers or subscribers
------------------------------------------------------------
[CRIT] /cmd_vel
1 subscriber(s) [nav2_node] but 0 publishers - topic is UNPUBLISHED.
Check if the node that should publish this topic is running:
`ros2 node list`. Verify launch files include the publisher node.
QoS MISMATCHES
Publisher <-> Subscriber QoS incompatibilities
------------------------------------------------------------
[CRIT] /sensor_data
Reliability mismatch: publisher [sensor_node]=BEST_EFFORT,
subscriber [processor]=RELIABLE. Messages will be DROPPED.
Change processor subscription QoS to BEST_EFFORT, or publish as RELIABLE.
============================================================
Summary: 2 CRITICAL 0 WARNING 0 INFO
============================================================
JSON output (--json)
{
  "schema_version": "1.0",
  "total_findings": 2,
  "summary": {"critical": 2, "warning": 0, "info": 0},
  "checks": [
    {
      "name": "dead_topics",
      "findings": [
        {
          "check": "dead_topics",
          "topic": "/cmd_vel",
          "severity": 3,
          "message": "1 subscriber(s) [nav2_node] but 0 publishers - topic is UNPUBLISHED.",
          "suggestion": "Check if the node that should publish this topic is running..."
        }
      ]
    }
  ]
}
CI Integration
Leverage --json output and exit codes to automate checks in your CI/CD pipelines:
# .github/workflows/ros2_check.yml
- name: Run ros2 triage
  run: |
    source /opt/ros/humble/setup.bash
    source install/local_setup.bash
    ros2 triage --json --severity-threshold 3 > triage_report.json
    # Exits with code 1 if any severity-3 findings exist
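If a pipeline needs to act on the report contents rather than only on the exit code, the saved JSON can be inspected with a short script. This is a sketch under the assumption that the report follows the "summary" layout shown in the JSON output sample; the gate function name is hypothetical.

```python
import json


def gate(report_text: str) -> int:
    """Return 1 if the report lists any critical findings, else 0."""
    report = json.loads(report_text)
    critical = int(report.get("summary", {}).get("critical", 0))
    if critical:
        print(f"triage gate: {critical} critical finding(s)")
    return 1 if critical else 0


# Inline sample with the same top-level keys as the documented JSON output.
sample_report = json.dumps({
    "schema_version": "1.0",
    "total_findings": 2,
    "summary": {"critical": 2, "warning": 0, "info": 0},
    "checks": [],
})

status = gate(sample_report)
print(status)
```

In a workflow step, the contents of triage_report.json would be read from disk and passed to gate, and the returned value used as the step's exit status.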
Development & Testing
Project Structure
ros2_triage/
|-- command/
| |-- triage.py # TriageCommand - main CLI entry point
|-- checks/
| |-- finding.py # Finding dataclass + severity constants
| |-- graph_utils.py # rclpy topic graph snapshot
| |-- dead_topic.py # Dead publisher/subscriber detection
| |-- qos_check.py # QoS reliability/durability mismatch
| |-- tf_check.py # TF tree frame connectivity
| |-- hz_check.py # Topic rate anomaly check
| |-- node_check.py # Missing/unexpected node check
| |-- snapshot.py # Graph state save/diff
| |-- latency_engine.py # Message timing analysis
| |-- dds_probe.py # DDS domain conflict detection
|-- correlation_engine.py # Multi-signal root cause analysis
|-- interactive_tui.py # Keyboard-navigable Rich TUI
|-- reporter.py # Human (Rich/colorama) + JSON output
test/
|-- test_dead_topic.py
|-- test_qos_check.py
|-- test_finding.py
|-- test_correlation_engine.py
|-- test_latency_engine.py
|-- test_dds_probe.py
|-- test_reporter.py
Running Tests
cd ~/ros2-triager/src/ros2_triage
source /opt/ros/humble/setup.bash
python3 -m pytest test/ -v
License
Apache 2.0

