city96 has published GGUF versions of the Flux.1 Dev model and the T5-XXL text encoder, along with custom nodes to use them in ComfyUI - thought I’d try them on my M2 Mac mini, hoping for faster inference!
Installing ComfyUI
Note that I’m running this on macOS Sonoma 14.5 with the native Python 3.9.6. My Mac has only 16 GB of RAM.
Refer to my previous post, Faster Stable Diffusion on M-series macs, for details on installing ComfyUI:
# get ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
# create virtual environment and install required libraries
python3 -m venv v
source v/bin/activate
pip install -r requirements.txt
Your mileage with torch may vary, since support for macOS MPS is still lacking. I installed a known working version, per ComfyUI issue #4165:
pip install torch==2.3.1 torchaudio==2.3.1 torchvision==0.18.1
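To confirm the pinned build imports cleanly and that PyTorch sees the MPS device, here’s a quick sanity check you can run from inside the virtual environment (my own suggestion, not from the issue):
python3 -c "import torch; print(torch.__version__, torch.backends.mps.is_available())"
It should print 2.3.1 and True.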
Finally, install the city96/ComfyUI-GGUF custom nodes either via the ComfyUI Manager or manually:
cd custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF
pip install gguf
cd ..
BTW, if you already have ComfyUI installed, do remember to update it first using git pull!
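For an existing install, updating looks something like this (assuming you cloned into ComfyUI and used the venv setup above):
cd ComfyUI
git pull
source v/bin/activate
pip install -r requirements.txt  # pick up any new or changed dependencies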
Downloading GGUF models
- Download one of the GGUF-encoded Flux.1 models from city96/FLUX.1-dev-gguf - I downloaded flux1-dev-Q4_K_S.gguf (small, 6.81 GB) to ComfyUI/models/unet, but you can try the others.
- Download one of the GGUF-encoded T5-XXL text encoders from city96/t5-v1_1-xxl-encoder-gguf - I downloaded t5-v1_1-xxl-encoder-Q4_K_M.gguf (medium, 2.9 GB) to ComfyUI/models/clip, but you can try the others.
- Download the standard Flux.1 VAE - save ae.safetensors (335 MB) to ComfyUI/models/vae.
- And download the standard CLIP text encoder from comfyanonymous/flux_text_encoders (which is probably the same as text_encoder/model.safetensors on the Black Forest Labs Hugging Face page above) - save clip_l.safetensors (246 MB) to ComfyUI/models/clip.
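If you prefer the command line, here’s one way to fetch all four files with the huggingface_hub CLI - a sketch on my part, assuming you’ve run pip install -U huggingface_hub (and note that the FLUX.1-dev repo is gated, so you may need to accept its license on Hugging Face and run huggingface-cli login first):
# run from the ComfyUI directory; filenames match the ones above
huggingface-cli download city96/FLUX.1-dev-gguf flux1-dev-Q4_K_S.gguf --local-dir models/unet
huggingface-cli download city96/t5-v1_1-xxl-encoder-gguf t5-v1_1-xxl-encoder-Q4_K_M.gguf --local-dir models/clip
huggingface-cli download comfyanonymous/flux_text_encoders clip_l.safetensors --local-dir models/clip
huggingface-cli download black-forest-labs/FLUX.1-dev ae.safetensors --local-dir models/vae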
Issues
First, I got the error:
NotImplementedError: The operator 'aten::__rshift__.Scalar' is not currently implemented for the MPS device.
which I fixed by starting ComfyUI with the fallback argument like this:
source v/bin/activate
PYTORCH_ENABLE_MPS_FALLBACK=1 python main.py
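If you’d rather not prefix every launch, you could export the variable from your shell profile instead (a convenience tweak of mine, zsh shown):
echo 'export PYTORCH_ENABLE_MPS_FALLBACK=1' >> ~/.zshrc
source ~/.zshrc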
Then, I kept getting black output images. I traced the issue to the VAE node, since the preview (using the fast Latent2RGB method) was working fine (see the KSampler preview in the screenshot below).
I realized I had launched ComfyUI with the --fp16-vae argument, and removing it fixed things for me. For other users, per the aforementioned issue #4165, adding --fp32-vae worked:
source v/bin/activate
PYTORCH_ENABLE_MPS_FALLBACK=1 python main.py --fp32-vae
Workflow
A standard Flux.1 workflow, with two important new nodes:
- Unet Loader (GGUF)
- DualCLIPLoader (GGUF)
On my Mac, generating a 512x512 image from a fresh ComfyUI run took ~252 seconds... and memory pressure stayed in the green! In the middle of inference, before the VAE decode, I observed just under 14.7 GB of memory used with about 1.6 GB of swap.
However, a 1024x1024 image took ~760s: memory pressure spent quite a bit of time in the yellow, and hit the red when loading the VAE, with 4.6 GB of swap used. As I previously mentioned, my Nvidia RTX 2060 PC took ~210 seconds with FP8 at this image size... over 3x faster.