[MAX][GPU] Enable GPTQ MoE execution for Qwen3 by bro4all · Pull Request #6321 · modular/modular
BEGIN_PUBLIC
[Kernels] Add grouped quantized MoE kernel wrapper

Add a kernel-layer grouped quantized matmul entry point for MoE paths and switch StackedMoE to use it directly. This keeps the first quantized MoE slice small while establishing the dispatch seam needed for follow-up grouped GPTQ support.
END_PUBLIC

Signed-off-by: Omar Habra <omarbro4all@gmail.com>
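The "grouped quantized matmul entry point" described above can be pictured as one call that takes stacked per-expert quantized weights plus token-to-expert routing and performs the dequantize-and-matmul per expert group. The sketch below is a minimal NumPy model of that dispatch seam, not the actual MAX kernel; all names (`grouped_quantized_matmul`, `dequantize`, the int8 per-channel scheme) are illustrative assumptions.

```python
import numpy as np

def dequantize(qweight: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Hypothetical per-output-channel symmetric int8 dequantization:
    # qweight is [hidden, out], scale is [out], broadcast over columns.
    return qweight.astype(np.float32) * scale

def grouped_quantized_matmul(
    tokens: np.ndarray,          # [num_tokens, hidden] activations
    expert_qweights: np.ndarray, # [num_experts, hidden, out] packed int8 weights
    expert_scales: np.ndarray,   # [num_experts, out] dequant scales
    expert_ids: np.ndarray,      # [num_tokens] routed expert per token
) -> np.ndarray:
    # One entry point for all experts: each token group is matmul'd against
    # its own expert's dequantized weight. A real kernel would fuse the
    # dequantize into the matmul instead of materializing w.
    out = np.zeros((tokens.shape[0], expert_qweights.shape[2]), dtype=np.float32)
    for e in range(expert_qweights.shape[0]):
        mask = expert_ids == e
        if not mask.any():
            continue
        w = dequantize(expert_qweights[e], expert_scales[e])
        out[mask] = tokens[mask] @ w
    return out
```

A caller such as a stacked-MoE layer would then route through this single wrapper rather than issuing one quantized matmul per expert, which is the dispatch seam the commit describes.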
bro4all
changed the title
[MAX] Add grouped quantized MoE kernel wrapper
[MAX][GPU] Enable GPTQ MoE execution for Qwen3
BEGIN_PUBLIC
[Kernel][GPU] Enable GPTQ MoE execution for Qwen3

Add MoEGPTQ wiring, materialize packed GPTQ expert weights, fix the GPU qmatmul repack path for expert operands, and cover Qwen3 GPTQ MoE with unit and integration tests.
END_PUBLIC

Signed-off-by: Omar Habra <omarbro4all@gmail.com>
BEGIN_PUBLIC
[Qwen3] Clarify GPTQ MoE review boundaries

Add pipeline-level tests that encode the current GPTQ support bounds for Qwen3 MoE. This keeps the review story explicit by covering desc_act perm_idx pruning and the single-device execution requirement, while clarifying that GPTQ attention still materializes dense fallback weights and only the MoE expert path opts into packed execution.
END_PUBLIC

Signed-off-by: Omar Habra <omarbro4all@gmail.com>
BEGIN_PUBLIC
[Qwen3] Tighten GPTQ MoE review assertions

Tighten the Qwen3 GPTQ review-boundary tests by asserting the exact perm_idx keys that survive conversion and by keeping the formatted test and comment updates together. This makes the desc_act and single-device support boundaries easier to audit during human review.
END_PUBLIC

Signed-off-by: Omar Habra <omarbro4all@gmail.com>
BEGIN_PUBLIC
[Qwen3] Assert GPTQ MoE construction boundaries

Add a constructor-level pipeline test that makes the current Qwen3 GPTQ split explicit: attention is built through the GPTQ attention wrapper without opting into packed execution, while MoE experts are built as MoEGPTQ and mark their projections for packed execution.
END_PUBLIC

Signed-off-by: Omar Habra <omarbro4all@gmail.com>
BEGIN_PUBLIC
[Qwen3] Fix GPTQ MoE mypy regressions

Tighten the GPTQLinear and MoEGPTQ typing so the Python nn mypy gate passes again without changing runtime behavior.

- type materialize_gptq_linear_weights against GPTQLinear
- avoid local variable redefinition and ndarray type narrowing issues
- replace ops.cond lambdas with typed local callbacks
END_PUBLIC

Signed-off-by: Omar Habra <omarbro4all@gmail.com>
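The last bullet (replacing cond lambdas with typed local callbacks) is a general mypy pattern: bare lambdas cannot carry explicit annotations, so a strict gate often rejects or loosely infers them, while named local functions are fully checked. The sketch below illustrates the pattern with a stand-in `cond` helper; the helper and its signature are hypothetical, not the real ops.cond API.

```python
from typing import Callable

def cond(pred: bool, then_fn: Callable[[], float], else_fn: Callable[[], float]) -> float:
    # Hypothetical stand-in for a graph-level conditional such as ops.cond:
    # takes two zero-argument branch callbacks and evaluates one of them.
    return then_fn() if pred else else_fn()

def scale_or_identity(x: float, use_scale: bool, scale: float) -> float:
    # Typed local callbacks: each branch is a named function with an explicit
    # return annotation, so mypy verifies both against cond's Callable[[], float]
    # parameters. The lambda equivalents could not be annotated in place.
    def apply_scale() -> float:
        return x * scale

    def identity() -> float:
        return x

    return cond(use_scale, apply_scale, identity)
```

The runtime behavior is identical to the lambda version; only the static-typing story changes, which matches the commit's "without changing runtime behavior" claim.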
BEGIN_PUBLIC
[Qwen3] Fix CI typing and lint follow-ups

Tighten the Qwen3 GPTQ typing annotations and MoE helper structure so the existing review-hardening changes pass lint and mypy in CI without changing runtime behavior.
END_PUBLIC

Signed-off-by: Omar Habra <omarbro4all@gmail.com>
BEGIN_PUBLIC
[Qwen3] Fix pipeline CI follow-ups

Fix the remaining pipeline CI regressions by avoiding eager allreduce construction on the single-device path and declaring numpy explicitly for the pipeline tests.
END_PUBLIC

Signed-off-by: Omar Habra <omarbro4all@gmail.com>
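"Avoiding eager allreduce construction on the single-device path" means the collective is never built when there is only one shard, rather than building a no-op collective and relying on it to be harmless. A minimal sketch of that guard, with a toy sum-based allreduce standing in for the real collective (both helper names are assumptions for illustration):

```python
from typing import Callable, List

Shards = List[List[float]]

def allreduce_sum(shards: Shards) -> Shards:
    # Toy allreduce: every shard receives the elementwise sum of all shards.
    # Stands in for a real multi-device collective.
    summed = [sum(vals) for vals in zip(*shards)]
    return [list(summed) for _ in shards]

def combine_expert_outputs(
    shards: Shards,
    build_allreduce: Callable[[Shards], Shards] = allreduce_sum,
) -> Shards:
    # Single-device fast path: return the lone shard directly, so the
    # collective is never constructed (the eager-construction bug this
    # commit describes would have built it unconditionally).
    if len(shards) == 1:
        return shards
    return build_allreduce(shards)
```

The point of taking `build_allreduce` as a callable is that on the single-device path it is never invoked, so whatever multi-device setup it requires never runs.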
BEGIN_PUBLIC [Qwen3] Isolate GPTQ MoE construction test Construct the Qwen3 GPTQ MoE review-boundary test at the transformer block level so it validates the attention and expert setup directly without depending on unrelated model-level embedding initialization. END_PUBLIC Signed-off-by: Omar Habra <omarbro4all@gmail.com>
bro4all
marked this pull request as ready for review