A (nicer) tokenizer for model inference and training, with all known preventable gotchas normalized or auto-fixed.
News
- 03/03/2026 0.0.7: Fix Qwen 3.5 MoE compat.
- 02/09/2026 0.0.6: Fix ChatGLM compat.
- 09/04/2025 0.0.5: Fix `pad_token_id` detection for the `LongCat` model.
- 02/21/2025 0.0.4: ⚡ The `Tokenicer` instance now dynamically inherits the native `__class__` of the tokenizer passed in or loaded via our `tokenicer.load()` api. CI now tests tokenizer compat across 64 different models.
- 02/10/2025 0.0.2: 🤗 Initial release!
Features:
- Compatible with all HF `Transformers`-recognized tokenizers
- Auto-fix models that do not set a `padding_token`
- Auto-fix models released with the wrong `padding_token`: many models incorrectly use `eos_token` as `pad_token`, which leads to subtle and hidden errors in post-training and inference whenever batching is used, which is almost always.
- Zero external dependencies outside of `Transformers`
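To see why reusing `eos_token` as `pad_token` is harmful, consider the common loss-masking step in training, where label positions equal to the pad id are ignored. A minimal sketch of the failure mode (the token ids and the `mask_labels` helper are illustrative, not Tokenicer internals):

```python
EOS_ID, PAD_ID = 2, 0  # hypothetical vocabulary ids

def mask_labels(token_ids, pad_id):
    # Common loss-masking step: ignore pad positions (-100 is the HF convention).
    return [-100 if t == pad_id else t for t in token_ids]

# Broken setup: pad reuses eos, so padding and the real EOS are indistinguishable.
# The true EOS at index 3 is masked to -100 along with the padding.
broken = mask_labels([5, 9, 7, EOS_ID, EOS_ID, EOS_ID], pad_id=EOS_ID)

# Correct setup: a dedicated pad id keeps the EOS label intact.
fixed = mask_labels([5, 9, 7, EOS_ID, PAD_ID, PAD_ID], pad_id=PAD_ID)
```

With the broken setup the model never receives a training signal for the real EOS token, so it may fail to learn when to stop generating.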
Upcoming Features:
- Add automatic tokenizer validation to model training and subsequent inference, so that not only the tokenizer config but the actual `decode`/`encode` behavior is 100% re-validated on model load. Inference and training engines often modify the traditional tokenizers, causing subtle and inaccurate output when inference is performed on a platform disjointed from the trainer.
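The planned validation could amount to an encode/decode round-trip check at load time. A minimal sketch of the idea, using a toy character-level tokenizer as a stand-in for a real one (both `ToyTokenizer` and `validate_round_trip` are hypothetical, not part of Tokenicer's API):

```python
class ToyTokenizer:
    """Character-level stand-in for a real HF tokenizer."""
    def encode(self, text):
        return [ord(c) for c in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)

def validate_round_trip(tokenizer, probes):
    # Return every probe string that does not survive an encode/decode cycle.
    return [t for t in probes if tokenizer.decode(tokenizer.encode(t)) != t]

failures = validate_round_trip(ToyTokenizer(), ["hello", "pad_token: <pad>"])
# An empty list means the tokenizer round-trips all probes unchanged.
```

A real check would also compare against reference token ids recorded at training time, since a tokenizer can drift in ways a pure round-trip test cannot detect.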
Install
PIP/UV
```shell
pip install -v tokenicer
uv pip install -v tokenicer
```
Install from source
```shell
# clone repo
git clone https://github.com/ModelCloud/Tokenicer.git && cd Tokenicer

# compile
pip install -v .
```
Usage
- Replace all calls to `AutoTokenizer.from_pretrained()` with `Tokenicer.load()`: args are 100% compatible with `AutoTokenizer`
```python
# Replace `AutoTokenizer.from_pretrained()`
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B-Instruct')

# With `Tokenicer.load()`
from tokenicer import Tokenicer

# Returns a `Tokenicer` instance that inherits the original `Qwen2TokenizerFast` type.
tokenizer = Tokenicer.load('Qwen/Qwen2.5-0.5B-Instruct')

# That's it! Toke(n)icer has auto-fixed Qwen2.5-0.5B-Instruct's incorrect `pad_token`.
# Now this model can be `trained` and `inferenced` correctly with `batch` and `masks`.

# Use the new tokenizer like any normal HF PretrainedTokenizer(Fast)
print(f"pad_token: `{tokenizer.pad_token}`")
```
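For reference, this is what batching with a dedicated pad token produces under the hood: shorter sequences are right-padded to the batch width, and the `attention_mask` zeros out the padding positions. A simplified sketch (the ids and the `pad_batch` helper are illustrative, not the HF implementation):

```python
PAD_ID = 0  # hypothetical dedicated pad id, distinct from eos

def pad_batch(sequences):
    # Pad every sequence to the longest one in the batch;
    # mark real tokens with 1 and padding with 0 in the attention mask.
    width = max(len(s) for s in sequences)
    input_ids = [s + [PAD_ID] * (width - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (width - len(s)) for s in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 9], [5, 9, 7, 2]])
# ids  -> [[5, 9, 0, 0], [5, 9, 7, 2]]
# mask -> [[1, 1, 0, 0], [1, 1, 1, 1]]
```

Without a valid `pad_token`, this padding step either fails outright or silently reuses `eos_token`, which is exactly the class of bug Tokenicer auto-fixes.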
- If you already have a loaded or composite config, pass it directly so Tokenicer can normalize the resolved text config in-place:
```python
tokenizer = Tokenicer.load(tokenizer, model_config=model.config)
```
Citation
@misc{tokenicer,
author = {ModelCloud.ai and qubitium@modelcloud.ai},
title = {Toke(n)icer},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/modelcloud/tokenicer}},
note = {Contact: qubitium@modelcloud.ai}
}
