[MAX][GPU] Enable GPTQ MoE execution for Qwen3 by bro4all · Pull Request #6321 · modular/modular

BEGIN_PUBLIC
[Kernels] Add grouped quantized MoE kernel wrapper

Add a kernel-layer grouped quantized matmul entry point for MoE paths and
switch StackedMoE to use it directly.

This keeps the first quantized MoE slice small while establishing the dispatch
seam needed for follow-up grouped GPTQ support.
END_PUBLIC
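The "dispatch seam" above can be pictured as a per-expert grouped matmul: tokens are bucketed by expert assignment and each bucket runs one matmul against that expert's weights. The sketch below is plain NumPy with dequantized weights and hypothetical names, not the MAX kernel API; it only shows the shape of the entry point that a grouped quantized kernel would later replace with a fused launch over packed operands.

```python
import numpy as np

def grouped_matmul(hidden, expert_weights, expert_ids):
    """Apply each token's assigned expert weight in per-expert groups.

    hidden:         (num_tokens, in_dim) activations
    expert_weights: (num_experts, in_dim, out_dim); dequantized here for
                    clarity -- the kernel-layer entry point would consume
                    packed GPTQ operands instead
    expert_ids:     (num_tokens,) expert assignment per token
    """
    out = np.zeros((hidden.shape[0], expert_weights.shape[2]), hidden.dtype)
    for e in np.unique(expert_ids):
        rows = expert_ids == e
        # One matmul per expert group -- this loop is the dispatch seam a
        # grouped quantized kernel replaces with a single fused launch.
        out[rows] = hidden[rows] @ expert_weights[e]
    return out
```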

Signed-off-by: Omar Habra <omarbro4all@gmail.com>

@bro4all changed the title from "[MAX] Add grouped quantized MoE kernel wrapper" to "[MAX][GPU] Enable GPTQ MoE execution for Qwen3" on Apr 1, 2026
BEGIN_PUBLIC
[Kernel][GPU] Enable GPTQ MoE execution for Qwen3

Add MoEGPTQ wiring, materialize packed GPTQ expert weights, fix the GPU qmatmul repack path for expert operands, and cover Qwen3 GPTQ MoE with unit and integration tests.
END_PUBLIC
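"Materializing packed GPTQ expert weights" means turning the checkpoint's bit-packed 4-bit tensors back into usable weights. A common GPTQ layout packs eight 4-bit values per int32 along the input dimension, with per-group scales. The sketch below is a simplified, hedged version of that dequantization in NumPy: real GPTQ checkpoints also carry packed zero points (qzeros) and, with desc_act, a g_idx permutation, both omitted here, and a symmetric zero point of 8 is assumed.

```python
import numpy as np

def unpack_gptq_qweight(qweight, scales, group_size=128):
    """Dequantize a simplified GPTQ-style 4-bit packed weight.

    qweight: integer array of shape (in_dim // 8, out_dim); each 32-bit
             value packs eight 4-bit weights along the input dimension.
    scales:  float array of shape (in_dim // group_size, out_dim).
    (Simplified: real GPTQ also packs zero points and may apply g_idx.)
    """
    packed_rows, out_dim = qweight.shape
    in_dim = packed_rows * 8
    # Extract the eight nibbles from every 32-bit word.
    shifts = np.arange(8, dtype=np.uint32) * 4
    nibbles = (qweight[:, None, :].astype(np.uint32)
               >> shifts[None, :, None]) & 0xF
    w_int = nibbles.reshape(in_dim, out_dim).astype(np.float32)
    # Apply per-group scales; 8 serves as a symmetric zero point here.
    group_ids = np.arange(in_dim) // group_size
    return (w_int - 8.0) * scales[group_ids]
```

A GPU repack path for expert operands would do the inverse bookkeeping: reshuffle these packed words into the layout the quantized matmul kernel expects, per expert.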

Signed-off-by: Omar Habra <omarbro4all@gmail.com>
BEGIN_PUBLIC
[Qwen3] Clarify GPTQ MoE review boundaries

Add pipeline-level tests that encode the current GPTQ support bounds for
Qwen3 MoE. This keeps the review story explicit by covering desc_act
perm_idx pruning and the single-device execution requirement, while
clarifying that GPTQ attention still materializes dense fallback weights
and only the MoE expert path opts into packed execution.
END_PUBLIC
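The desc_act perm_idx pruning mentioned above can be illustrated as follows: with desc_act disabled, GPTQ checkpoints still carry a g_idx tensor of the form `i // group_size`, which reorders nothing, so dropping those keys during conversion keeps the weight set minimal and lets tests assert exactly which perm_idx entries survive. The key naming and dict shape in this sketch are illustrative only, not the actual converter format.

```python
import numpy as np

def prune_trivial_perm_idx(tensors, group_size):
    """Drop g_idx tensors that encode the identity grouping.

    A trivial g_idx equals [0, 0, ..., 1, 1, ...] (i // group_size) and
    carries no reordering information, so it is safe to prune.
    Hypothetical helper -- names do not match the real converter.
    """
    kept = {}
    for name, t in tensors.items():
        if name.endswith("g_idx"):
            if np.array_equal(t, np.arange(t.shape[0]) // group_size):
                continue  # no-op ordering: safe to drop
        kept[name] = t
    return kept
```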

Signed-off-by: Omar Habra <omarbro4all@gmail.com>
BEGIN_PUBLIC
[Qwen3] Tighten GPTQ MoE review assertions

Tighten the Qwen3 GPTQ review-boundary tests by asserting the exact
perm_idx keys that survive conversion and by keeping the formatted test
and comment updates together. This makes the desc_act and single-device
support boundaries easier to audit during human review.
END_PUBLIC

Signed-off-by: Omar Habra <omarbro4all@gmail.com>
BEGIN_PUBLIC
[Qwen3] Assert GPTQ MoE construction boundaries

Add a constructor-level pipeline test that makes the current Qwen3 GPTQ
split explicit: attention is built through the GPTQ attention wrapper
without opting into packed execution, while MoE experts are built as
MoEGPTQ and mark their projections for packed execution.
END_PUBLIC

Signed-off-by: Omar Habra <omarbro4all@gmail.com>
BEGIN_PUBLIC
[Qwen3] Fix GPTQ MoE mypy regressions

Tighten the GPTQLinear and MoEGPTQ typing so the Python nn mypy gate passes again without changing runtime behavior.

- type materialize_gptq_linear_weights against GPTQLinear
- avoid local variable redefinition and ndarray type narrowing issues
- replace ops.cond lambdas with typed local callbacks
END_PUBLIC
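The last bullet reflects a standard mypy pattern: bare lambdas get loosely inferred signatures, while named local functions with explicit annotations give the checker something concrete to verify. The sketch below uses a stand-in `cond` combinator (the real MAX `ops.cond` graph API differs) purely to show the typing pattern.

```python
from typing import Callable

def cond(pred: bool,
         then_fn: Callable[[], float],
         else_fn: Callable[[], float]) -> float:
    # Stand-in for an ops.cond-style combinator; only the typing matters.
    return then_fn() if pred else else_fn()

def scaled(x: float, use_double: bool) -> float:
    # Named, annotated callbacks give mypy exact signatures to check,
    # where lambdas are inferred and can mask a return-type mismatch.
    def _double() -> float:
        return x * 2.0

    def _identity() -> float:
        return x

    return cond(use_double, _double, _identity)
```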

Signed-off-by: Omar Habra <omarbro4all@gmail.com>
BEGIN_PUBLIC
[Qwen3] Fix CI typing and lint follow-ups

Tighten the Qwen3 GPTQ typing annotations and MoE helper structure so the
existing review-hardening changes pass lint and mypy in CI without changing
runtime behavior.
END_PUBLIC

Signed-off-by: Omar Habra <omarbro4all@gmail.com>
BEGIN_PUBLIC
[Qwen3] Fix pipeline CI follow-ups

Fix the remaining pipeline CI regressions by avoiding eager allreduce
construction on the single-device path and declaring numpy explicitly for the
pipeline tests.
END_PUBLIC
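Avoiding eager allreduce construction on the single-device path amounts to deferring (or skipping) the collective when there is nothing to reduce across. The hypothetical helper below sketches the idea with plain Python callables; the real pipeline wires a graph op, not a function, and a single-device collective can be wasted work or an outright failure if no communicator exists.

```python
from typing import Callable, Sequence

def make_allreduce(num_devices: int) -> Callable[[Sequence[float]], list[float]]:
    """Build a sum-allreduce only when more than one device participates.

    Hypothetical sketch: on the single-device path no collective is
    constructed at all -- the identity is returned instead.
    """
    if num_devices == 1:
        return lambda shards: list(shards)

    def allreduce(shards: Sequence[float]) -> list[float]:
        total = sum(shards)  # stand-in for the cross-device sum
        return [total] * num_devices

    return allreduce
```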

Signed-off-by: Omar Habra <omarbro4all@gmail.com>
BEGIN_PUBLIC
[Qwen3] Isolate GPTQ MoE construction test

Construct the Qwen3 GPTQ MoE review-boundary test at the transformer block
level so it validates the attention and expert setup directly without depending
on unrelated model-level embedding initialization.
END_PUBLIC

Signed-off-by: Omar Habra <omarbro4all@gmail.com>
BEGIN_PUBLIC
[Qwen3] Fix GPTQ test helper typing

Tighten the GPTQ test helper with an explicit non-null RMSNorm epsilon
assertion so the isolated construction test passes mypy in CI.
END_PUBLIC
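The "non-null RMSNorm epsilon assertion" is the usual mypy narrowing idiom: an explicit `assert x is not None` narrows `Optional[float]` to `float` at the use site. The config field name below is hypothetical, chosen only to mirror the shape of the fix.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NormConfig:
    rms_norm_eps: Optional[float] = None  # hypothetical config field

def resolve_rms_norm_eps(cfg: NormConfig) -> float:
    # mypy cannot prove the Optional is populated here, so an explicit
    # assert narrows Optional[float] -> float before it is returned.
    assert cfg.rms_norm_eps is not None, "rms_norm_eps must be set"
    return cfg.rms_norm_eps
```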

Signed-off-by: Omar Habra <omarbro4all@gmail.com>

@bro4all marked this pull request as ready for review on April 2, 2026 at 00:42