More Flux.1-based models! Go faster with FP8 or NF4! New LoRAs and ControlNets! There is quite a bit of interest in this model, as evidenced by the speed of community-led enhancements.

The original Flux.1 FP16 models are 23.8 GB, too large for many PC setups. Hence, in my previous post, I set the weight_dtype parameter to FP8. However, I still had to download the full model plus the CLIP and T5 encoders and the VAE separately, so even the most basic workflow was more involved than its SDXL equivalent.

FP8 Installation

Now the team at Comfy have simplified using Flux.1 and (almost) halved the file size: just download the FP8 versions of Comfy-Org/flux1-schnell and/or Comfy-Org/flux1-dev, each a single .safetensors file (even better). They are 17.2 GB each and bundle the encoders and VAE.
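If you prefer to script the download rather than grab the file in a browser, here is a minimal sketch using huggingface_hub. The exact filename inside the repo is an assumption on my part, so check the repo's file listing before running it.

```python
# Hedged sketch: fetch the bundled FP8 checkpoint straight into ComfyUI's
# checkpoints folder. The filename is an assumption - confirm it against the
# file list of the Comfy-Org/flux1-dev repo first.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="Comfy-Org/flux1-dev",
    filename="flux1-dev-fp8.safetensors",   # check the repo for the exact name
    local_dir="ComfyUI/models/checkpoints",
)
```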

NF4 Installation

Recently, lllyasviel, the genius behind Forge and Fooocus, released lllyasviel/flux1-dev-bnb-nf4, now at v2. To quote:

“For GPUs with 6GB/8GB VRAM, the speed-up is about 1.3x to 2.5x (pytorch 2.4, cuda 12.4) or about 1.3x to 4x (pytorch 2.1, cuda 12.1)” and “NF4 may outperform FP8 (e4m3fn/e5m2) in numerical precision, and it does outperform e4m3fn/e5m2 in many (in fact, most) cases/benchmarks... NF4 is technically granted to outperform FP8 (e4m3fn/e5m2) in dynamic range in 100% cases.”

Now there is a beta node for ComfyUI too, comfyanonymous/ComfyUI_bitsandbytes_NF4, which is a copy of the Forge code. It’s currently in the Dev channel, and you may need to edit ComfyUI\custom_nodes\ComfyUI-Manager\config.ini so that security_level = weak. Run Comfy once, kill it once it says “After restarting ComfyUI, please refresh the browser”, and then start it again. That worked for me... but don’t do what I do, please!
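If you would rather script that config change than hand-edit the file, here is a small sketch using the standard-library configparser. It assumes the setting lives in a [default] section; open your own config.ini and check before trusting it.

```python
# Hedged sketch: flip security_level to "weak" in ComfyUI-Manager's config.ini.
# Assumes the setting lives in a [default] section - verify against your file.
import configparser
from pathlib import Path

cfg_path = Path(r"ComfyUI\custom_nodes\ComfyUI-Manager\config.ini")

config = configparser.ConfigParser()
config.read(cfg_path)

if not config.has_section("default"):
    config.add_section("default")
config["default"]["security_level"] = "weak"

with cfg_path.open("w") as f:
    config.write(f)
```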

FP8 Workflow

Here is a very familiar-looking workflow. Note the use of the CLIPTextEncodeFlux node, which includes the Guidance parameter, so there is no need for a separate FluxGuidance node.

Flux.1-Dev-FP8 workflow
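For anyone driving ComfyUI through its HTTP API rather than the graph editor, here is a minimal sketch of the loading and text-encoding part of the workflow above in API (prompt) format. The node IDs, checkpoint filename, and the CLIPTextEncodeFlux input names are assumptions based on current ComfyUI builds, so export your own graph as API JSON to confirm them.

```python
# Hedged sketch of the prompt-encoding part of the FP8 workflow in ComfyUI's
# API (prompt) format. Node IDs and filenames are placeholders; the
# CLIPTextEncodeFlux input names are assumptions - export your own workflow
# as API JSON to confirm.
import json

prompt_fragment = {
    # Single checkpoint load: the FP8 file bundles the UNet, CLIP/T5 and VAE.
    "1": {
        "class_type": "CheckpointLoaderSimple",
        "inputs": {"ckpt_name": "flux1-dev-fp8.safetensors"},
    },
    # CLIPTextEncodeFlux takes both text prompts plus the guidance value,
    # so no separate FluxGuidance node is needed.
    "2": {
        "class_type": "CLIPTextEncodeFlux",
        "inputs": {
            "clip": ["1", 1],          # CLIP output of the checkpoint loader
            "clip_l": "a photo of a red fox in the snow",
            "t5xxl": "a photo of a red fox in the snow",
            "guidance": 3.5,
        },
    },
}

print(json.dumps(prompt_fragment, indent=2))
```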

NF4 Workflow

The node you need is CheckpointLoaderNF4. I had less luck with this workflow on my potato GPU: as you can see from the timings generated by the ComfyUI Profiler extension, NF4 is much slower for me (both v1 and v2 take 1100+ seconds) than FP8 (~210 s).

Flux.1-Dev-NF4 workflow
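On the loading side, the only change from the FP8 sketch above is the loader node. A hedged fragment, assuming CheckpointLoaderNF4 mirrors the inputs of CheckpointLoaderSimple and that the v2 filename is as shown (check your exported JSON and your download):

```python
# Hedged sketch: the NF4 variant only swaps the loader node. The class name
# comes from ComfyUI_bitsandbytes_NF4; the input name and filename are
# assumptions - confirm against your own exported workflow.
nf4_loader = {
    "1": {
        "class_type": "CheckpointLoaderNF4",
        "inputs": {"ckpt_name": "flux1-dev-bnb-nf4-v2.safetensors"},
    },
}
```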

LoRAs

No time to test!