add support for flux2 klein by leejet · Pull Request #1193 · leejet/stable-diffusion.cpp

@leejet

Flux.2 klein 4B

.\bin\Release\sd-cli.exe --diffusion-model  ..\..\ComfyUI\models\diffusion_models\flux-2-klein-4b.safetensors --vae ..\..\ComfyUI\models\vae\flux2_ae.safetensors  --llm ..\..\ComfyUI\models\text_encoders\qwen_3_4b.safetensors -p "a lovely cat" --cfg-scale 1.0 --steps 4 -v --offload-to-cpu --diffusion-fa

output

Flux.2 klein 9B

.\bin\Release\sd-cli.exe --diffusion-model  ..\..\ComfyUI\models\diffusion_models\flux-2-klein-9b.safetensors --vae ..\..\ComfyUI\models\vae\flux2_ae.safetensors  --llm ..\..\ComfyUI\models\text_encoders\qwen_3_8b.safetensors -p "a lovely cat" --cfg-scale 1.0 --steps 4 -v --offload-to-cpu --diffusion-fa

output

Flux.2 klein 4B edit

.\bin\Release\sd-cli.exe --diffusion-model  ..\..\ComfyUI\models\diffusion_models\flux-2-klein-4b.safetensors --vae ..\..\ComfyUI\models\vae\flux2_ae.safetensors  --llm ..\..\ComfyUI\models\text_encoders\qwen_3_4b.safetensors -r .\kontext_input.png -p "change 'flux.cpp' to 'klein.cpp'" --cfg-scale 1.0 --sampling-method euler -v --diffusion-fa --offload-to-cpu --steps 4

output

Flux.2 klein 9B edit

.\bin\Release\sd-cli.exe --diffusion-model  ..\..\ComfyUI\models\diffusion_models\flux-2-klein-9b.safetensors --vae ..\..\ComfyUI\models\vae\flux2_ae.safetensors  --llm ..\..\ComfyUI\models\text_encoders\qwen_3_8b.safetensors -r .\kontext_input.png -p "change 'flux.cpp' to 'klein.cpp'" --cfg-scale 1.0 --sampling-method euler -v --diffusion-fa --offload-to-cpu --steps 4

output



@leejet

Currently, there are still some issues with the flux.2 klein support. Padding needs to be applied during tokenization and attention_mask must be used in llm.hpp, but at the moment llm.hpp's handling of attention_mask may have problems: when attention_mask is enabled, the results become NaN. This is the same issue seen with longcat image. I am still investigating and working on a fix.
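For illustration, a fully padded row in an additive attention mask reproduces exactly this kind of NaN. This is a numpy sketch of the softmax failure mode, not code from llm.hpp; the mask convention (0 for real tokens, -inf for padding) is an assumption:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax along the last axis
    m = x.max(axis=-1, keepdims=True)
    e = np.exp(x - m)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((2, 4))                      # two query rows, four keys
mask_ok = np.array([0.0, 0.0, -np.inf, -np.inf])  # row attends to 2 real tokens
mask_bad = np.full(4, -np.inf)                    # row where every key is masked

out_ok = softmax(scores[0] + mask_ok)    # well-defined: [0.5, 0.5, 0.0, 0.0]
out_bad = softmax(scores[1] + mask_bad)  # max is -inf, so x - m = nan -> all NaN
```

Once one attention row is all NaN, the NaN propagates through the rest of the network, which matches the symptom described above. A common fix is to guarantee at least one attendable position per row (or zero out fully masked rows after the softmax).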

@stduhpf

So the --clip-on-cpu workaround should also work there?

@leejet

It doesn’t work on my side.

@leejet

I think I’ve correctly fixed the attention_mask issue.

@leejet

The quality of Flux.2 klein 4B doesn’t seem as good as z-image turbo.

Flux.2 klein 4b

.\bin\Release\sd-cli.exe --diffusion-model  ..\..\ComfyUI\models\diffusion_models\flux-2-klein-4b.safetensors --vae ..\..\ComfyUI\models\vae\flux2_ae.safetensors  --llm ..\..\ComfyUI\models\text_encoders\qwen_3_4b.safetensors -p "A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: 'THE CITY IS A CIRCUIT BOARD, AND I AM A BROKEN TRANSISTOR.' -- moody, atmospheric, profound, dark academic" --cfg-scale 5.0 --steps 4 -v --offload-to-cpu --diffusion-fa -v -H 1024 -W 512 --rng cpu
output

Flux.2 klein base 4b

.\bin\Release\sd-cli.exe --diffusion-model  ..\..\ComfyUI\models\diffusion_models\flux-2-klein-base-4b.safetensors --vae ..\..\ComfyUI\models\vae\flux2_ae.safetensors  --llm ..\..\ComfyUI\models\text_encoders\qwen_3_4b.safetensors -p "A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: 'THE CITY IS A CIRCUIT BOARD, AND I AM A BROKEN TRANSISTOR.' -- moody, atmospheric, profound, dark academic" --cfg-scale 5.0 --steps 20 -v --offload-to-cpu --diffusion-fa -v -H 1024 -W 512 --rng cpu
output

@Green-Sky

The quality of Flux.2 klein 4B doesn’t seem as good as z-image turbo.

Flux.2 klein 4b

.\bin\Release\sd-cli.exe --diffusion-model  ..\..\ComfyUI\models\diffusion_models\flux-2-klein-4b.safetensors --vae ..\..\ComfyUI\models\vae\flux2_ae.safetensors  --llm ..\..\ComfyUI\models\text_encoders\qwen_3_4b.safetensors -p "A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: 'THE CITY IS A CIRCUIT BOARD, AND I AM A BROKEN TRANSISTOR.' -- moody, atmospheric, profound, dark academic" --cfg-scale 5.0 --steps 4 -v --offload-to-cpu --diffusion-fa -v -H 1024 -W 512 --rng cpu

Not sure about cfg, but they use guidance_scale=1.0 for the distilled (non-base) model.

Also they use guidance_scale=4.0 and num_inference_steps=50 for the base model.

(ref is 4b variants on hf)

edit: cfg of 5 seems comparatively high for models that take larger llm embedding inputs.
edit2: logger.warning(f"Guidance scale {guidance_scale} is ignored for step-wise distilled models.") hmm

edit3:

    def do_classifier_free_guidance(self):
        return self._guidance_scale > 1 and not self.config.is_distilled

So cfg should be 1 for the distilled model.
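A minimal sketch of what classifier-free guidance computes (illustrative numpy, not diffusers or sd.cpp internals). With guidance_scale = 1 the result collapses to the conditional prediction alone, which is why the `do_classifier_free_guidance` check above disables CFG for the distilled model:

```python
import numpy as np

def cfg_combine(uncond, cond, guidance_scale):
    # classifier-free guidance: extrapolate from the unconditional
    # prediction in the direction of the conditional one
    return uncond + guidance_scale * (cond - uncond)

uncond = np.array([0.1, 0.2])  # toy model outputs
cond = np.array([0.5, 0.8])

# scale 1.0: identical to the conditional prediction, so the
# unconditional forward pass can be skipped entirely
one = cfg_combine(uncond, cond, 1.0)

# scale 4.0: amplifies the conditional direction
four = cfg_combine(uncond, cond, 4.0)  # [1.7, 2.6]
```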

@leejet

Not sure about cfg, but they use guidance_scale=1.0 for the distilled (non-base) model.

They changed the README on Hugging Face. When I first checked it, the distilled model was also using guidance_scale = 4.0. After changing guidance_scale to 1.0f, the image quality did improve a bit, but it's still not as good as z-image turbo.

https://huggingface.co/black-forest-labs/FLUX.2-klein-4B/commit/5e67da950fce4a097bc150c22958a05716994cea

.\bin\Release\sd-cli.exe --diffusion-model  ..\..\ComfyUI\models\diffusion_models\flux-2-klein-4b.safetensors --vae ..\..\ComfyUI\models\vae\flux2_ae.safetensors  --llm ..\..\ComfyUI\models\text_encoders\qwen_3_4b.safetensors -p "A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: 'THE CITY IS A CIRCUIT BOARD, AND I AM A BROKEN TRANSISTOR.' -- moody, atmospheric, profound, dark academic" --cfg-scale 1.0 --steps 4 -v --offload-to-cpu --diffusion-fa -v -H 1024 -W 512 --rng cpu
output

@leejet

Also they use guidance_scale=4.0 and num_inference_steps=50 for the base model.

In fact, many Hugging Face examples for base models use relatively large step counts, like 40–50 (SDXL uses 40, for example), but in practice around 20 steps often already gives good results.
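The effect of the step count can be seen on a toy flow-matching time grid (illustrative only; the even spacing is an assumption, and real pipelines typically apply a resolution-dependent time shift):

```python
def flow_timesteps(num_steps):
    # evenly spaced flow-matching times from t=1 (pure noise) to t=0 (image)
    return [1.0 - i / num_steps for i in range(num_steps + 1)]

grid4 = flow_timesteps(4)    # [1.0, 0.75, 0.5, 0.25, 0.0]
grid20 = flow_timesteps(20)  # 20 euler updates, each over a much smaller
                             # interval, so the ODE integration error shrinks
```

Past a certain point the integration error is no longer the bottleneck, which is consistent with 50 steps improving only a little over 20.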

This is the result with 50 steps. The quality has improved somewhat, but not by much.

.\bin\Release\sd-cli.exe --diffusion-model  ..\..\ComfyUI\models\diffusion_models\flux-2-klein-base-4b.safetensors --vae ..\..\ComfyUI\models\vae\flux2_ae.safetensors  --llm ..\..\ComfyUI\models\text_encoders\qwen_3_4b.safetensors -p "A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: 'THE CITY IS A CIRCUIT BOARD, AND I AM A BROKEN TRANSISTOR.' -- moody, atmospheric, profound, dark academic" --cfg-scale 5.0 --steps 50 -v --offload-to-cpu --diffusion-fa -v -H 1024 -W 512 --rng cpu
output

@Green-Sky

@leejet you talk about guidance scale, but your command only shows the cfg scale change. Or did you code the guidance scale?

@Green-Sky

Oh, and have you tried reference image(s)? This is a clear advantage over e.g. z-image.

@leejet

@leejet you talk about guidance scale, but your command only shows the cfg scale change. Or did you code the guidance scale?

guidance_scale in diffusers == --cfg-scale in sd.cpp

@leejet

Oh, and have you tried reference image(s)? This is a clear advantage over e.g. z-image.

Here I’m comparing the performance for T2I. Using a reference image means it’s image editing, which is a different task. Currently, z-image turbo does not support image editing.

@Green-Sky

@leejet you talk about guidance scale, but your command only shows the cfg scale change. Or did you code the guidance scale?

guidance_scale in diffusers == --cfg-scale in sd.cpp

Guidance scale as defined in [Classifier-Free Diffusion Guidance]

You are right, I did not know that.

Oh, and have you tried reference image(s)? This is a clear advantage over e.g. z-image.

Here I’m comparing the performance for T2I. Using a reference image means it’s image editing, which is a different task. Currently, z-image turbo does not support image editing.

Yes, I was asking because you did not show any examples yet. :)

@leejet

Yes, I was asking because you did not show any examples yet. :)

I’ve updated some examples of image editing. You can take a look. I think the overall quality of the image edits is pretty good.


@fcore117

Hello, with the default steps I get bad images; maybe I'm missing something? Z-Image, for example, works at full power. People here already get OK images with 4 steps, but for me 4 steps only produces a messy image.
_FLUX2_Klein.cmd.txt


@Green-Sky

Hello, with the default steps I get bad images; maybe I'm missing something? Z-Image, for example, works at full power. People here already get OK images with 4 steps, but for me 4 steps only produces a messy image. _FLUX2_Klein.cmd.txt

There are two versions: a distilled model, and an undistilled (base) model, which is what you are using. 4 steps will only give good results with the distilled version.

@fcore117

Hello, with the default steps I get bad images; maybe I'm missing something? Z-Image, for example, works at full power. People here already get OK images with 4 steps, but for me 4 steps only produces a messy image. _FLUX2_Klein.cmd.txt

There are two versions: a distilled model, and an undistilled (base) model, which is what you are using. 4 steps will only give good results with the distilled version.

Thank you for the very useful note. I'm new to AI and this is good to know. I started using stable-diffusion.cpp because I hate fat, bloated software with a passion; for me sd.cpp is easier to use and is very portable, light, and fast, and doesn't depend on system paths etc.

I presume this is a general rule about distilled vs. undistilled? Does undistilled just need a much higher step count?

@Green-Sky

Thank you for the very useful note. I'm new to AI and this is good to know. I started using stable-diffusion.cpp because I hate fat, bloated software with a passion; for me sd.cpp is easier to use and is very portable, light, and fast, and doesn't depend on system paths etc.

I presume this is a general rule about distilled vs. undistilled? Does undistilled just need a much higher step count?

This is getting a bit off-topic, so feel free to open a discussion for further questions (or PM on Tox or something).


Generally, there are different forms of "distillation". In this case it was both a step-distillation AND a cfg-distillation. Both reduce how often the diffusion model has to be run per image.
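The saving from the two distillations can be put in numbers (a back-of-the-envelope sketch; the step counts are just the examples used in this thread):

```python
def model_calls(num_steps, uses_cfg):
    # one diffusion forward pass per step; CFG doubles it,
    # since each step needs a conditional and an unconditional pass
    return num_steps * (2 if uses_cfg else 1)

base = model_calls(50, uses_cfg=True)        # base model, 50 steps with CFG
distilled = model_calls(4, uses_cfg=False)   # step- and cfg-distilled, 4 steps
```

That is 100 forward passes for the base configuration versus 4 for the distilled one, a 25x reduction in diffusion-model work per image.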

Also, generally, every model requires its own set of parameters; some work better than others.
E.g. most transformer-based models (flux, z-image, ...) work best with simple/smoothstep schedulers and non-ancestral samplers. But this is very much model-dependent.
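One way to read "smoothstep scheduler" is warping an even time grid with the classic smoothstep polynomial, which clusters steps near the start and end of sampling. This is an illustrative guess at the idea, not sd.cpp's actual scheduler implementation:

```python
def smoothstep(x):
    # classic smoothstep polynomial 3x^2 - 2x^3: flat at x=0 and x=1
    return x * x * (3.0 - 2.0 * x)

def smoothstep_times(num_steps):
    # warp an even grid of flow times from t=1 (noise) to t=0 (image)
    return [1.0 - smoothstep(i / num_steps) for i in range(num_steps + 1)]

times = smoothstep_times(4)  # [1.0, 0.84375, 0.5, 0.15625, 0.0]
```

Compared to a linear grid, the smaller intervals at both ends spend more solver accuracy where the trajectory tends to change fastest.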