A (nicer) tokenizer for model inference and training, with all known preventable gotchas normalized or auto-fixed.
News
- 03/03/2026 0.0.7: Fix Qwen 3.5 MoE compat.
- 02/09/2026 0.0.6: Fix ChatGLM compat.
- 09/04/2025 0.0.5: Fix `pad_token_id` detection for the `LongCat` model.
- 02/21/2025 0.0.4: ⚡ The `Tokenicer` instance now dynamically inherits the native `__class__` of the tokenizer passed in or loaded via our `tokenicer.load()` api. CI now tests tokenizer compat across 64 different models.
- 02/10/2025 0.0.2: 🤗 Initial release!
Features:
- Compatible with all HF `Transformers`-recognized tokenizers
- Auto-fix models that do not set a `padding_token`
- Auto-fix models released with the wrong `padding_token`: many models incorrectly use `eos_token` as `pad_token`, which leads to subtle and hidden errors in post-training and inference whenever batching is used, which is almost always.
- Zero external dependencies outside of `Transformers`
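To see why reusing `eos_token` as `pad_token` is harmful, consider the common loss-masking step in training, where label positions equal to the pad id are ignored. A minimal sketch of the failure mode (the token ids and the `mask_labels` helper are illustrative, not Tokenicer internals):

```python
EOS_ID, PAD_ID = 2, 0  # hypothetical vocabulary ids

def mask_labels(token_ids, pad_id):
    # Common loss-masking step: ignore pad positions (-100 is the HF convention).
    return [-100 if t == pad_id else t for t in token_ids]

# Broken setup: pad reuses eos, so padding and the real EOS are indistinguishable.
# The true EOS at index 3 is masked to -100 along with the padding.
broken = mask_labels([5, 9, 7, EOS_ID, EOS_ID, EOS_ID], pad_id=EOS_ID)

# Correct setup: a dedicated pad id keeps the EOS label intact.
fixed = mask_labels([5, 9, 7, EOS_ID, PAD_ID, PAD_ID], pad_id=PAD_ID)
```

With the broken setup the model never receives a training signal for the real EOS token, so it may fail to learn when to stop generating.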
Upcoming Features:
- Add automatic tokenizer validation to model training and subsequent inference, so that not only the tokenizer config but the actual `decode`/`encode` behavior is 100% re-validated on model load. Inference and training engines often modify the traditional tokenizers, causing subtle and inaccurate output when inference is performed on a platform disjointed from the trainer.
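The planned validation could amount to an encode/decode round-trip check at load time. A minimal sketch of the idea, using a toy character-level tokenizer as a stand-in for a real one (both `ToyTokenizer` and `validate_round_trip` are hypothetical, not part of Tokenicer's API):

```python
class ToyTokenizer:
    """Character-level stand-in for a real HF tokenizer."""
    def encode(self, text):
        return [ord(c) for c in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)

def validate_round_trip(tokenizer, probes):
    # Return every probe string that does not survive an encode/decode cycle.
    return [t for t in probes if tokenizer.decode(tokenizer.encode(t)) != t]

failures = validate_round_trip(ToyTokenizer(), ["hello", "pad_token: <pad>"])
# An empty list means the tokenizer round-trips all probes unchanged.
```

A real check would also compare against reference token ids recorded at training time, since a tokenizer can drift in ways a pure round-trip test cannot detect.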
Install
PIP/UV
```shell
pip install -v tokenicer
uv pip install -v tokenicer
```
Install from source
```shell
# clone repo
git clone https://github.com/ModelCloud/Tokenicer.git && cd Tokenicer

# compile
pip install -v .
```
Usage
- Replace all calls to `AutoTokenizer.from_pretrained()` with `Tokenicer.load()`: args are 100% compatible with `AutoTokenizer`
```python
# Replace `AutoTokenizer.from_pretrained()`
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B-Instruct')

# With `Tokenicer.load()`
from tokenicer import Tokenicer

# Returns a `Tokenicer` instance that inherits the original `Qwen2TokenizerFast` type.
tokenizer = Tokenicer.load('Qwen/Qwen2.5-0.5B-Instruct')

# That's it! Toke(n)icer has auto-fixed Qwen2.5-0.5B-Instruct's incorrect `pad_token`.
# Now this model can be `trained` and `inferenced` correctly with `batch` and `masks`.

# Use the new tokenizer like any normal HF PretrainedTokenizer(Fast)
print(f"pad_token: `{tokenizer.pad_token}`")
```
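For reference, this is what batching with a dedicated pad token produces under the hood: shorter sequences are right-padded to the batch width, and the `attention_mask` zeros out the padding positions. A simplified sketch (the ids and the `pad_batch` helper are illustrative, not the HF implementation):

```python
PAD_ID = 0  # hypothetical dedicated pad id, distinct from eos

def pad_batch(sequences):
    # Pad every sequence to the longest one in the batch;
    # mark real tokens with 1 and padding with 0 in the attention mask.
    width = max(len(s) for s in sequences)
    input_ids = [s + [PAD_ID] * (width - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (width - len(s)) for s in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 9], [5, 9, 7, 2]])
# ids  -> [[5, 9, 0, 0], [5, 9, 7, 2]]
# mask -> [[1, 1, 0, 0], [1, 1, 1, 1]]
```

Without a valid `pad_token`, this padding step either fails outright or silently reuses `eos_token`, which is exactly the class of bug Tokenicer auto-fixes.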
- If you already have a loaded or composite config, pass it directly so Tokenicer can normalize the resolved text config in-place:
```python
tokenizer = Tokenicer.load(tokenizer, model_config=model.config)
```
Citation
@misc{tokenicer,
author = {ModelCloud.ai and qubitium@modelcloud.ai},
title = {Toke(n)icer},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/modelcloud/tokenicer}},
note = {Contact: qubitium@modelcloud.ai}
}
