NASM lexer fixes by seanthegeek · Pull Request #3059 · pygments/pygments

Note: This patch was written by Claude Sonnet 4.6 via the Claude Code CLI. I know the Pygments project may be cautious about LLM-generated contributions, and I'd genuinely welcome feedback on the quality of this work — both the code itself and how well it follows Pygments conventions. I'm using this as a real-world test of how well Claude handles a non-trivial open-source contribution task given a detailed prompt. Any review comments, even harsh ones, are appreciated.

Original prompt (verbatim)

Pygments Lexer: NASM (Netwide Assembler)

Task

Fix the existing NASM (Netwide Assembler) lexer in Pygments. Work inside my local fork of the pygments/pygments repo on a separate branch.

Official references

MANDATORY: Before writing or modifying the lexer, you MUST fetch and read every
URL in this list. This is not background reading — it is a required prerequisite
step. Fetch each page, extract the keywords or function names, and verify them
against the lexer before declaring any work complete.

NASM documentation: https://www.nasm.us/doc/
NASM instruction reference: https://www.nasm.us/doc/nasmdocb.html
NASM preprocessor: https://www.nasm.us/doc/nasmdoc4.html
NASM directives: https://www.nasm.us/doc/nasmdoc7.html
NASM expressions: https://www.nasm.us/doc/nasmdoc3.html
x86/x86-64 instruction reference: https://www.felixcloutier.com/x86/
Pygments issue #1231 (sprintf@plt bug): Fix NasmLexer to support syntax like <sprintf@plt>
Pygments issue #728 (macro whitespace bug): NASM lexer: Macros with whitespace before it are not recognized

Pygments references

Write your own lexer: https://pygments.org/docs/lexerdevelopment/
Contributing to Pygments: https://pygments.org/docs/contributing/
Builtin tokens: https://pygments.org/docs/tokens/
Available lexers: https://pygments.org/docs/lexers/
Pygments GitHub repo: https://github.com/pygments/pygments
Existing ASM lexers (structural reference): pygments/lexers/asm.py
SQL lexer (structural reference for query languages): pygments/lexers/sql.py

Phase 1: Setup and audit

Confirm you're in the root of a Pygments repo checkout (look for pygments/lexers/, tests/, setup.py).
Run git checkout -b fix/nasm main to create a dedicated branch.
Set up a venv: python -m venv venv && source venv/bin/activate && pip install -e ".[dev]".
Run tox -e py to confirm the existing test suite passes.

Establish a baseline — run the existing lexer against a sample and count Error tokens:

echo '<sample code>' | python -m pygments -l nasm -f html | grep -o 'class="err"' | wc -l

Read the existing lexer end-to-end. Understand the current states, token patterns, and keyword sets.

Known issues to fix

Issue #1231: <sprintf@plt> causes sp inside sprintf to be tokenized as a register.
Issue #728: %define preceded by whitespace produces Error tokens.
Missing registers: x86-64 extended registers, AVX-512, mask registers may have gaps.
Missing preprocessor directives: Some % directives may not be covered.
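
To make the #728 symptom concrete, here is a minimal sketch using Python's standard re module rather than the real NasmLexer rules. Both patterns and the find_directives helper are hypothetical names introduced for illustration; they are not taken from pygments/lexers/asm.py. They only show why a rule anchored at column 0 misses an indented %define.

```python
from __future__ import annotations

import re

# Hypothetical patterns for illustration only; the real NasmLexer rules differ.
# Anchored at column 0: misses directives that are indented.
COLUMN0_DIRECTIVE = re.compile(r"^%[a-zA-Z]\w*", re.MULTILINE)
# Tolerates leading spaces or tabs before the percent sign.
INDENTED_DIRECTIVE = re.compile(r"^[ \t]*(%[a-zA-Z]\w*)", re.MULTILINE)


def find_directives(source: str, pattern: re.Pattern[str]) -> list[str]:
    """Return the directive names (for example '%define') matched in source."""
    # Use the capture group when the pattern has one, else the whole match.
    return [m.group(m.lastindex or 0) for m in pattern.finditer(source)]


# One directive at column 0, one indented with spaces, one indented with a tab.
SAMPLE = "%define A 1\n    %define B 2\n\t%ifdef A\n"

if __name__ == "__main__":
    print(find_directives(SAMPLE, COLUMN0_DIRECTIVE))
    print(find_directives(SAMPLE, INDENTED_DIRECTIVE))
```

The column-0 pattern finds only the first directive in SAMPLE, while the whitespace-tolerant one finds all three. In an actual lexer rule the leading whitespace would still need its own token, which this sketch does not model.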

Phase 1: Research

Before writing any code, fetch and read the official references listed above.

Do not invent or assume any syntax elements. If something is ambiguous in the docs, web-search to verify before including it.

Phase 2: Fix the lexer

Apply fixes to the existing lexer file.

Review the existing lexer at pygments/lexers/asm.py (the NasmLexer class) and fix:

Register matching greediness: The lexer matches register names like sp inside longer words (e.g., sprintf). Fix by using word boundary anchors or negative lookahead.
Macro whitespace: %define and other preprocessor directives must be recognized even when preceded by whitespace, not just at column 0.
Missing registers: Audit and add any missing x86-64 extended registers, AVX-512 registers, mask registers.
Missing directives: Ensure all NASM preprocessor and assembler directives are covered.
Disassembly compatibility: Consider gracefully handling <symbol@plt> patterns and hex address prefixes.
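
The register greediness item above can be sketched with plain re as follows. This is an assumption-laden illustration, not the patch itself: the REGISTERS tuple is a deliberately tiny subset, and NAIVE, GUARDED, and registers_in are hypothetical names. The guard uses lookarounds over NASM's identifier characters instead of \b, because \b would still treat "." or "@" as a boundary.

```python
from __future__ import annotations

import re

# A deliberately tiny register subset for illustration; a real list is far longer.
REGISTERS = ("rsp", "esp", "sp", "rax", "eax", "ax", "zmm0", "k1")
# Longest names first so that "rsp" is tried before "sp".
_ALTERNATION = "|".join(sorted(REGISTERS, key=len, reverse=True))

# Naive pattern: matches a register name anywhere, even inside a longer word.
NAIVE = re.compile(rf"(?:{_ALTERNATION})", re.IGNORECASE)

# Characters NASM allows in identifiers: letters, digits, _ $ # @ ~ . ?
_IDENT = r"[\w$#@~.?]"
# Guarded pattern: no identifier character may touch either side of the match.
GUARDED = re.compile(
    rf"(?<!{_IDENT})(?:{_ALTERNATION})(?!{_IDENT})", re.IGNORECASE
)


def registers_in(line: str, pattern: re.Pattern[str]) -> list[str]:
    """Return every substring of line that pattern treats as a register."""
    return [m.group(0) for m in pattern.finditer(line)]


if __name__ == "__main__":
    print(registers_in("call <sprintf@plt>", NAIVE))    # finds "sp" in sprintf
    print(registers_in("call <sprintf@plt>", GUARDED))  # finds nothing
    print(registers_in("mov rax, [rsp+8]", GUARDED))    # real registers still match
```

Inside a Pygments RegexLexer the same idea would normally be expressed through the words() helper with prefix and suffix arguments; the plain-regex form here is only meant to show the boundary logic.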

After each fix, run the tests to confirm no regressions:

tox -e py -- tests/snippets/nasm/

Phase 3: Expand tests

Review and expand the existing test snippets in tests/snippets/nasm/. Add snippets that cover the syntax that was previously broken.

Each snippet file is a .txt file containing source code. Run:

tox -- --update-goldens tests/snippets/nasm/new_test.txt

This auto-populates expected tokens. Review them for correctness, then check them in.

Phase 4: Test and iterate

This is the critical phase. Use pygmentize as the feedback loop.

Run tox -e py. Fix any failures.

Test your lexer on the example file and count Error tokens:

python -m pygments -l nasm -f html tests/examplefiles/nasm/* | grep -o 'class="err"' | wc -l

If there are Error tokens, identify the unmatched text:

python -m pygments -l nasm -f testcase tests/examplefiles/nasm/* | grep "Token.Error"

For each Error token:
a. Identify what syntax element the unmatched text represents.
b. Web-search the official docs to confirm the syntax is valid.
c. Fix the lexer rule.
d. Re-run tox -e py -- tests/snippets/nasm/ to confirm no regressions.
e. Re-test with pygmentize to verify the Error is gone.
Repeat until the Error token count is zero.
Run the full test suite one more time: tox -e py.

Visually inspect the HTML output for sanity:

python -m pygments -l nasm -f html -O full,style=monokai tests/examplefiles/nasm/* > /tmp/preview.html
open /tmp/preview.html  # or xdg-open on Linux

Confirm that keywords, functions, operators, strings, numbers, and comments are each highlighted distinctly.
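
The Error-token triage in this phase can be sketched as a small script that tallies the unmatched text, so the most frequent offenders are fixed first. It assumes the -f testcase formatter writes one tuple per line in the form (Token.Error, '<'), and the SAMPLE_OUTPUT below is invented for illustration, not real lexer output. tally_error_text is a hypothetical helper name.

```python
from __future__ import annotations

import ast
import re
from collections import Counter

# Assumed line shape from `-f testcase`: "(Token.Error, '<'),"
ERROR_LINE = re.compile(r"^\s*\(Token\.Error, (?P<value>.+)\),\s*$")


def tally_error_text(testcase_output: str) -> Counter[str]:
    """Count how often each piece of unmatched text appears as Token.Error."""
    counts: Counter[str] = Counter()
    for line in testcase_output.splitlines():
        match = ERROR_LINE.match(line)
        if match:
            # The value is a Python string literal (repr), so decode it safely.
            counts[ast.literal_eval(match.group("value"))] += 1
    return counts


# Invented sample in the assumed format; not real output from the NASM lexer.
SAMPLE_OUTPUT = """\
            (Token.Name.Function, 'call'),
            (Token.Text.Whitespace, ' '),
            (Token.Error, '<'),
            (Token.Name.Variable, 'sprintf'),
            (Token.Error, '@'),
            (Token.Error, '<'),
"""

if __name__ == "__main__":
    for text, count in tally_error_text(SAMPLE_OUTPUT).most_common():
        print(f"{count:4d}  {text!r}")
```

An empty tally corresponds to the zero-Error condition the prompt treats as ground truth.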

Phase 5: Finalize

Run tox -e py one final time — full pass, zero failures.
Review the diff: git diff --stat. You should have these files:
- pygments/lexers/asm.py (the fixes)
- tests/snippets/nasm/ (new or updated test snippets)
- Possibly tests/examplefiles/nasm/ (expanded example)
Commit: git add -A && git commit -m "Fix NASM (Netwide Assembler) lexer: <summarize fixes>".
Report what you've done: list the keyword count, function count, token types used, and confirm zero Error tokens.

Constraints (applies to all phases)

No hallucinated syntax. Every keyword, function, operator, and language construct must come from the official documentation listed above. If you're unsure, web-search the docs before adding it.
Follow Pygments conventions exactly. Read existing lexers (especially sql.py and the lexer development guide) for patterns. Use words(), bygroups(), include(), and default() helpers appropriately.
Python code must include type hints and pass ruff linter checks.
The Error token count is the ground truth. tox passing is necessary but not sufficient — you must also have zero Token.Error in both test snippets and example files.
Iterate until clean. Do not declare the task complete until both tox -e py passes AND the Error token count is zero.