feat: second attempt to support DDS and NonZero op by zewenli98 · Pull Request #3388 · pytorch/TensorRT

@zewenli98

Description

This PR adds a new path to support Data Dependent Shape (DDS) and the NonZero op. Static and dynamic shapes go through the original path; DDS goes through the new path built on TensorRT's IOutputAllocator.
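As a rough illustration of the routing described above (all names here are hypothetical, not the actual Torch-TensorRT implementation): ops like NonZero produce outputs whose shape depends on input *values*, so their outputs cannot be pre-allocated and must be dispatched to an allocator-based path.

```python
# Hypothetical sketch of the dispatch described above; the real runtime
# uses TensorRT's IOutputAllocator interface for the DDS path.

def has_data_dependent_shape(op_names):
    # Ops like nonzero have output shapes that depend on input values.
    DDS_OPS = {"nonzero"}  # illustrative set, not the full list
    return any(name in DDS_OPS for name in op_names)

def choose_execution_path(op_names):
    # Static and dynamic shapes keep the original pre-allocated path;
    # DDS ops go through an output-allocator path.
    if has_data_dependent_shape(op_names):
        return "output_allocator"
    return "standard"

print(choose_execution_path(["add", "nonzero"]))  # output_allocator
print(choose_execution_path(["add", "relu"]))     # standard
```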

Fixes #2516

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that relevant reviewers are notified

keehyuna

narendasan


peri044

if (
node != output_node
and len(node.users) == 0
and len(node.all_input_nodes) > 0
):


Probably better to add an assert checking that it has only one input (print the number in the string if it fails).


I previously reused the code from another lowering pass. It looks like we can directly remove unused ops, right?

if (
node != output_node
and len(node.users) == 0
and len(node.all_input_nodes) > 0
):
gm.graph.erase_node(node)
gm = clean_up_graph_after_modifications(gm)
logger.debug(f"Removed [num_users=0] nodes:\n{gm.graph}")
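A minimal standalone sketch of the dead-node removal being discussed, using a toy node class instead of torch.fx to keep it self-contained (the condition mirrors the quoted code: not the output node, zero users, at least one input):

```python
# Toy stand-in for torch.fx nodes: a node is "dead" if nothing uses it,
# it is not the output node, and it has at least one input.

class Node:
    def __init__(self, name, inputs=()):
        self.name = name
        self.all_input_nodes = list(inputs)
        self.users = {}  # filled in below, like torch.fx's users dict

def build_users(nodes):
    for node in nodes:
        for inp in node.all_input_nodes:
            inp.users[node] = None

def remove_dead_nodes(nodes, output_node):
    build_users(nodes)
    return [
        n for n in nodes
        if not (
            n != output_node
            and len(n.users) == 0
            and len(n.all_input_nodes) > 0
        )
    ]

a = Node("a")
b = Node("b", [a])       # dead: has an input but no users
out = Node("out", [a])   # the graph output, always kept
kept = remove_dead_nodes([a, b, out], out)
print([n.name for n in kept])  # ['a', 'out']
```

Note that a single pass like this only removes nodes that are already dead; torch.fx's `eliminate_dead_code` iterates until a fixed point.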

Do you think there are any potential issues?

need_cudagraphs_reset,
) = self.runtime_states.set_runtime_states(
-    cudagraphs_enabled, self.use_pre_allocated_outputs, shape_changed
+    self.cudagraphs_enabled, self.use_pre_allocated_outputs, shape_changed


Is use_pre_allocated_outputs still valid now that you're adding the OA (OutputAllocator) feature?


I think the OA feature will not affect use_pre_allocated_outputs, because I didn't change the behavior of CUDA Graphs, and use_pre_allocated_outputs has its own context manager as well.

raise RuntimeError(
"Both CUDA Graphs and OutputAllocator are enabled. Please disable either one."
)
if self.use_output_allocator_outputs:


How is use_output_allocator_outputs set? Is it set by the user via the with context manager?


Yes, it will be set by the user via the with context manager. If users don't set it, the runtime will choose standard execution or OA according to the converter decorator.
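A hedged sketch of how a user-facing context manager could toggle such a flag and restore the default on exit, so the decorator-based choice applies again afterwards (names here are illustrative and do not match the actual Torch-TensorRT API):

```python
# Hypothetical sketch: a context manager that forces the output-allocator
# path while active, then restores the previous setting.
import contextlib

class Runtime:
    def __init__(self):
        # Default False: the path is chosen per-converter via decorators.
        self.use_output_allocator_outputs = False

    @contextlib.contextmanager
    def enable_output_allocator(self):
        prev = self.use_output_allocator_outputs
        self.use_output_allocator_outputs = True
        try:
            yield self
        finally:
            # Restore on exit, even if the body raised.
            self.use_output_allocator_outputs = prev

rt = Runtime()
with rt.enable_output_allocator():
    print(rt.use_output_allocator_outputs)  # True
print(rt.use_output_allocator_outputs)      # False
```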


peri044


LGTM

narendasan



LGTM after minor change
