Model Type Support In SwarmUI

| Model | Architecture | Year | Author | Scale | Censored? | Quality/Status |
| --- | --- | --- | --- | --- | --- | --- |
| Stable Diffusion XL | unet | 2023 | Stability AI | 2B | Partial | Old but some finetunes remain popular |
| SD1 and SDXL Turbo Variants | unet | 2023 | Stability AI and others | 2B | Partial | Outdated |
| Stable Diffusion 3 | MMDiT | 2024 | Stability AI | 2B | Yes | Outdated, prefer 3.5 |
| Stable Diffusion 3.5 Large | MMDiT | 2024 | Stability AI | 8B | Partial | Outdated, Good Quality for its time |
| Stable Diffusion 3.5 Medium | MMDiT | 2024 | Stability AI | 2B | Partial | Outdated, Good Quality for its time |
| AuraFlow | MMDiT | 2024 | Fal.AI | 6B | Yes | Outdated |
| Flux.1 | MMDiT | 2024 | Black Forest Labs | 12B | Partial | Outdated, High Quality for its time |
| Flux.2 | MMDiT | 2025 | Black Forest Labs | 4B, 9B, 32B | Minimal | Recent, Incredible Quality, choice of speed or quality preference |
| Chroma | MMDiT | 2025 | Lodestone Rock | 8.9B | No | Recent, Decent Quality |
| Chroma Radiance | Pixel MMDiT | 2025 | Lodestone Rock | 8.9B | No | Recent, Bad Quality (WIP) |
| Lumina 2.0 | NextDiT | 2025 | Alpha-VLLM | 2.6B | Partial | Modern, Passable Quality |
| Qwen Image | MMDiT | 2025 | Alibaba-Qwen | 20B | Minimal | Modern, Great Quality, very memory intense |
| Hunyuan Image 2.1 | MMDiT | 2025 | Tencent | 17B | No | Modern, Great Quality, very memory intense |
| Z-Image | S3-DiT | 2025 | Tongyi MAI (Alibaba) | 6B | No | Modern, Great Quality, lightweight |
| Kandinsky 5 | DiT | 2025 | Kandinsky Lab | 6B | No | Modern, Decent Quality |
| Anima | DiT | 2026 | Circlestone Labs | 2B | WTF | Modern, very small, decent for anime |

Old or bad options are also tracked, and are listed in Obscure Model Support:

| Model | Architecture | Year | Author | Scale | Censored? | Quality/Status |
| --- | --- | --- | --- | --- | --- | --- |
| Stable Diffusion v1 and v2 | unet | 2022 | Stability AI | 1B | No | Outdated |
| Stable Diffusion v1 Inpainting Models | unet | 2022 | RunwayML | 1B | No | Outdated |
| Segmind SSD 1B | unet | 2023 | Segmind | 1B | Partial | Outdated |
| Stable Cascade | unet cascade | 2024 | Stability AI | 5B | Partial | Outdated |
| PixArt Sigma | DiT | 2024 | PixArt | 1B | ? | Outdated |
| Nvidia Sana | DiT | 2024 | NVIDIA | 1.6B | No | Just Bad |
| Nvidia Cosmos Predict2 | DiT | 2025 | NVIDIA | 2B/14B | Partial | Just Bad |
| HiDream i1 | MMDiT | 2025 | HiDream AI (Vivago) | 17B | Minimal | Good Quality, lost community attention |
| OmniGen 2 | MLLM | 2025 | VectorSpaceLab | 7B | No | Modern, Decent Quality, quickly outclassed |
| Ovis | MMDiT | 2025 | AIDC-AI (Alibaba) | 7B | No | Passable quality, but outclassed on launch |

  • Architecture is the fundamental machine learning structure used for the model. UNets were used in the past, but DiTs (Diffusion Transformers) are the modern choice.

  • Scale is how big the model is - "B" for "Billion", so for example "2B" means "Two billion parameters".

    • One parameter is one number value, so for example in fp16 (16 bit, ie 2 bytes per number), a 2B model is 4 gigabytes. In fp8 (8 bit, ie 1 byte per number), a 2B model is 2 gigabytes.
    • If you often use fp8 or q8 models, just read the "B" as "gigabytes" for a good approximation
  • Censored? is tested by generating eg "a photo of a naked woman" on the model.

    • This test only refers to the base models, finetunes can add nudity and other "risque" content back in.
    • Most base models will not generate genitalia, and have limited quality with other body parts and poses. Every popular model has finetunes available to add those capabilities, if you want them.
      • Sometimes it's not even intentional censorship, just the simple fact that broad base models aren't good at any one thing - so, again, content-specific finetunes fix that.
    • Model censorship can take other forms (eg does it recognize names of celebrities/artists/brands, can it do gore, etc.), so if a model sounds right to you, you may want to do your own testing to see if it's capable of the type of content you like.
    • "No" means it generates what was asked,
    • "Minimal" means it's eg missing genitals but otherwise complete,
    • "Partial" means it's clearly undertrained at NSFW content (eg difficult to prompt for or poor quality body) but doesn't explicitly refuse,
    • "Yes" means it's entirely incapable or provides an explicit refusal response,
    • "WTF" means it's the opposite of censored, may generate inappropriate content even without being asked.
  • Quality/Status is a loose vibe-based metric to imply whether it's worth using in the current year or not.

  • Video models are in Video Model Support

  • Audio models are in Audio Model Support
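
The parameter-count arithmetic from the Scale note above can be sketched in a few lines. This is only an illustration: the function name and the format table are made up for this example, and the result covers raw weights only (no text encoder, VAE, or runtime overhead).

```python
# Bits per parameter for the formats mentioned in the Scale note.
BITS_PER_PARAM = {"fp16": 16, "bf16": 16, "fp8": 8}


def approx_model_size_gb(params_billions: float, dtype: str) -> float:
    """Approximate weight size in gigabytes (1 GB = 1e9 bytes)."""
    bytes_per_param = BITS_PER_PARAM[dtype] / 8
    # (billions of params) * (bytes per param) = billions of bytes = GB
    return params_billions * bytes_per_param


print(approx_model_size_gb(2, "fp16"))  # 4.0
print(approx_model_size_gb(2, "fp8"))   # 2.0
print(approx_model_size_gb(12, "fp8"))  # 12.0
```

The last line shows the "read B as gigabytes" shortcut: at one byte per parameter, a 12B model is roughly 12 gigabytes of weights.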

Current Recommendations

Image model(s) most worth using, as of January 2026:

  • Z-Image is the best right now, especially for photoreal gens.
  • Flux.2 Klein is pretty great too, for Editing or for art style variety.
  • Flux.2 Dev is massive, but is the smartest of the bunch if you have the hardware and patience for it.

General Info

  • Swarm natively supports .safetensors format models with ModelSpec metadata
    • can also import metadata from some legacy formats used by other UIs (auto webui thumbnails, matrix jsons, etc)
    • can also fallback to a .swarm.json sidecar file for other supported file formats
  • Swarm can load other model file formats, see Alternative Model Formats
    • Notably, quantization technique formats. "Quantization" means shrinking a model to use lower memory than is normally reasonable.
      • Normal sizes are named like "BF16", "FP16", "FP8", ... ("BF"/"FP" prefixes are standard formats)
      • Quantized sizes have names like "NF4", "Q4_K_M", "Q8", "SVDQ-4", "Int-4", ("Q" means quantized, but there are technique-specific labels)
    • BnB NF4 (not recommended, quantization technique)
    • GGUF (recommended, good quality quantization technique, slower speed)
    • Nunchaku (very recommended, great quality high speed quantization technique)
    • TensorRT (not recommended, speedup technique)
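
To see what metadata a .safetensors file carries, you can read its header directly. The sketch below relies on the published safetensors layout (an 8-byte little-endian length, then a JSON header with an optional "__metadata__" map) and on ModelSpec keys using a "modelspec." prefix. The function name is made up for this example, and this is not how Swarm itself reads the file.

```python
import json
import struct


def read_modelspec_metadata(path: str) -> dict:
    """Return the 'modelspec.*' entries from a .safetensors header."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # Free-form string metadata lives under the optional "__metadata__" key.
    metadata = header.get("__metadata__", {})
    return {k: v for k, v in metadata.items() if k.startswith("modelspec.")}
```

An empty result means the file has no ModelSpec metadata, which is the case where an import from legacy formats or a .swarm.json sidecar would come into play.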

Image Models

  • Image demos included below are mini-grids of seeds 1, 2, 3 of the following prompt, run on each model: wide shot, photo of a cat with mixed black and white fur, sitting in the middle of an open roadway, holding a cardboard sign that says "Meow I'm a Cat". In the distance behind is a green road sign that says "Model Testing Street".
  • For all models, "standard parameters" are used.
    • Steps is set to 20 except for Turbo models. Turbo models are run at their standard fast steps (usually 4).
    • CFG is set appropriate to the model.
    • Resolution is model default.
  • This prompt is designed to require (1) multiple complex components (2) photorealism (3) text (4) impossible actions (cat holding a sign - Most models get very confused how to do this).
  • All generations are done on the base model of the relevant class, not on any finetune/lora/etc. Finetunes are likely to significantly change the qualitative capabilities, but unlikely to significantly change general ability to understand and follow prompts.
  • This is not a magic perfect test prompt, just a decent coverage of range to showcase approximately what you can expect from the model in terms of understanding and handling challenges.
    • You could make a point that maybe I should have set CFG different or used a sigma value or changed up prompt phrasing or etc. and get better quality - this test intentionally uses very bland parameters to maximize identical comparison. Keep in mind that you can get better results out of a model by fiddling parameters.
  • You'll note models started being able to do decently well on this test in late 2024. Older models noticeably fail at the basic requirements of this test.

Stable Diffusion XL

img

SDXL models work as normal, with the bonus that by default enhanced inference settings will be used (eg scaled up rescond).

Additionally, SDXL-Refiner architecture models can be inferenced, both as refiner or even as a base (you must manually set res to 512x512 and it will generate weird results).

SDXL Controlnets

  • There are official SDXL ControlNet LoRAs from Stability AI here
  • and there's a general collection of community ControlNet models here that you can use.

SD1 and SDXL Turbo Variants

Turbo, LCM (Latent Consistency Models), Lightning, etc. models work the same as regular models, just set CFG Scale to 1 and:

  • For Turbo, Steps to 1. Under the Sampling group set Scheduler to Turbo.
  • For LCM, Steps to 4. Under the Sampling group set Sampler to lcm.
  • For Lightning, (?)

Stable Diffusion 3

img

Stable Diffusion 3 Medium is supported and works as normal.

By default the first time you run an SD3 model, Swarm will download the text encoders for you.

Under the Sampling parameters group, a parameter named SD3 TextEncs is available to select whether to use CLIP, T5, or both. By default, CLIP is used (no T5) as results are near-identical but CLIP-only has much better performance, especially on systems with limited resources.

Under Advanced Sampling, the parameter Sigma Shift is available. This defaults to 3 on SD3, but you can lower it to around ~1.5 if you wish to experiment with different values. Messing with this value too much is not recommended.

For upscaling with SD3, the Refiner Do Tiling parameter is highly recommended (SD3 does not respond well to regular upscaling without tiling).

Stable Diffusion 3.5 Large

img

Stable Diffusion 3.5 Medium

img

  • Stable Diffusion 3.5 Medium is supported and works as normal.
  • They behave approximately the same as the SD3 Medium models, including same settings and all.
  • You can also use GGUF Versions of the models.
  • SD 3.5 Medium supports resolutions from 512x512 to 1440x1440, and the model metadata of the official model recommends 1440x1440. However, the official model is not good at this resolution. You will want to click the hamburger menu on a model, then Edit Metadata, then change the resolution to 1024x1024 for better results. You can of course set the Aspect Ratio parameter to Custom and then edit resolutions on the fly per-image.

AuraFlow

img (above image is AuraFlow v0.2)

  • Fal.ai's AuraFlow v0.1 and v0.2 and v0.3 are supported in Swarm, but you must manually select architecture to use it.
  • The model used "Pile T5-XXL" as its text encoder.
  • The model used the SDXL VAE as its VAE.
  • This model group was quickly forgotten by the community due to quality issues, but came back into popular attention much later via community finetune "Pony v7".
    • Pony wants to be in the diffusion_models folder, but regular AuraFlow goes in Stable-Diffusion folder
  • Parameters and usage are the same as any other normal model.
    • CFG recommended around 3.5 or 4.
    • Pony v7 allows higher resolutions than base AuraFlow normally targets.

Black Forest Labs' Flux.1 Models

Flux Dev

img

Flux Schnell

img

Install

Parameters

  • CFG Scale: For both models, use CFG Scale = 1 (negative prompt won't work).
    • For the Dev model, there is also a Flux Guidance Scale parameter under Sampling, which is a distilled embedding value that the model was trained to use.
    • Dev can use some slightly-higher CFG values (allowing for negative prompt), possibly higher if you reduce the Flux Guidance value and/or use Dynamic Thresholding.
  • Sampler: Leave it at default (Euler + Simple)
  • Steps: For Schnell use Steps=4 (or lower, it can even do 1 step), for Dev use Steps=20 or higher
  • Resolution: It natively supports any resolution up to 2 MP (1920x1088), and any aspect ratio thereof. By default, Swarm uses 1 MP (1024x1024). You can take it down to 256x256 and still get good results.
    • You can mess with the resolution quite a lot and still get decent results. It's very flexible even past what it was trained on.
  • You can do a refiner upscale 2x and it will work but take a long time and might not have excellent quality.
    • Enable Refiner Do Tiling for any upscale target resolution above 1536x1536.
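
As a rough aid for picking a resolution under a megapixel budget, and for checking whether an upscale crosses the tiling threshold mentioned above, here is a minimal sketch. Several things here are assumptions of this example rather than documented behavior: the function names, treating 1 MP as 1024x1024 pixels, rounding sides down to a multiple of 16, and comparing the "above 1536x1536" threshold by total pixel count.

```python
import math


def fit_resolution(aspect_w: int, aspect_h: int,
                   megapixels: float = 1.0, multiple: int = 16) -> tuple:
    """Pick a (width, height) for an aspect ratio within a pixel budget."""
    target_pixels = megapixels * 1024 * 1024  # assumption: 1 MP = 1024x1024
    height = math.sqrt(target_pixels * aspect_h / aspect_w)
    width = height * aspect_w / aspect_h

    def snap(value: float) -> int:
        # Round down to a multiple (assumption of this sketch), never below one step.
        return max(multiple, int(value // multiple) * multiple)

    return snap(width), snap(height)


def needs_refiner_tiling(width: int, height: int, upscale: float = 2.0) -> bool:
    """True if the upscaled target has more pixels than 1536x1536."""
    return (width * upscale) * (height * upscale) > 1536 * 1536


print(fit_resolution(1, 1))              # (1024, 1024)
print(needs_refiner_tiling(1024, 1024))  # True: 2048x2048 target
print(needs_refiner_tiling(768, 768))    # False: exactly 1536x1536
```

Because this sketch rounds down, a 16:9 request at 2 MP lands slightly under the 1920x1088 figure quoted above.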

Performance

  • Flux is best on a very high end GPU (eg 4090) for now. It is a 12B model.
    • Smaller GPUs can run it, but will be slow. This requires a lot of system RAM (32GiB+). It's been shown to work as low down as an RTX 2070 or 2060 (very slowly).
  • On a 4090, schnell takes about 4/5 seconds to generate a 4-step image, very close to SDXL 20 steps in time, but much higher quality.
  • By default, Swarm will use fp8_e4m3fn for Flux. If you have a very large GPU and want to use fp16/bf16, under Advanced Sampling set Preferred DType to Default (16 bit).

Flux.1 Tools

  • The Flux.1 Tools announced here by BFL are supported in SwarmUI
  • For "Redux", a Flux form of image prompting:
    • Download the Redux model to (SwarmUI)/Models/style_models
    • (Don't worry about sigclip, it is automanaged)
    • Drag an image to the prompt area
    • On the top left, find the Image Prompting parameter group
    • Set the Use Style Model parameter to the Redux model
    • There's an advanced Style Model Apply Start param to allow better structural control from your text prompt
      • set to 0.1 or 0.2 or so to have the text prompt guide structure before redux takes over styling
      • at 0, text prompt is nearly ignored
    • The advanced Style Model Merge Strength param lets you partial merge the style model against the nonstyled input, similar to Multiply Strength
    • The advanced Style Model Multiply Strength param directly multiplies the style model output, similar to Merge Strength
  • For "Canny" / "Depth" models, they work like regular models (or LoRAs), but require an Init Image to function.
    • Goes in the regular diffusion_models or lora folder depending on which you downloaded.
    • You must input an appropriate image. So eg for the depth model, input a Depth Map.
      • You can use the controlnet parameter group to generate depth maps or canny images from regular images.
        • (TODO: Native interface to make that easier instead of janking controlnet)
    • Make sure to set Creativity to 1.
    • This is similar in operation to Edit models.
  • For "Fill" (inpaint model), it works like other inpaint models.
    • It's a regular model file, it goes in the regular diffusion_models folder same as other flux models.
    • "Edit Image" interface encouraged.
    • Mask a region and go.
    • Creativity 1 works well.
    • Larger masks recommended. Small ones may not replace content.
    • Boosting the Flux Guidance Scale way up to eg 30 may improve quality
  • For "Kontext" (edit model), it works like other edit models.
  • If you want to use the ACE Plus Models (Character consistency)
    • Download the LoRAs from https://huggingface.co/ali-vilab/ACE_Plus/tree/main and save as normal loras
    • Enable the Flux Fill model, enable the LoRA you chose
    • Set Flux Guidance Scale way up to 50
    • Open an image editor for the image you want to use as an input (Drag to center area, click Edit Image)
    • set Init Image Creativity to 1 (max)
    • Change your Resolution parameters to have double the Width (eg 1024 input, double to 2048)
    • Add a Mask, draw a dot anywhere in the empty area (this is just a trick to tell the editor to automask all the empty area to the side, you don't need to mask it manually)
    • Type your prompt, hit generate

Flux 2

img

  • Black Forest Labs' Flux.2 Models are supported in SwarmUI
  • The main "Dev" model is an extremely massive model (32B diffusion model, 24B text encoder) that will demand significant RAM availability on your PC.
    • This can easily fill up 128 gigs of system RAM in usage, but does still work on 64 gig systems. Lower than 64 may not be possible, or may require heavily using swapfile.
    • The smaller Klein model is preferred for more normal PC hardware.
  • Download the standard FP8 model here silveroxides/FLUX.2-dev-fp8_scaled
  • The VAE is a brand new 16x16 downsample VAE with 128 channels. It will be autodownloaded.
  • The Text Encoder is 24B Mistral Small 3.2 (2506). It will be autodownloaded.
  • Parameters:
    • Prompt: Prompting guide from the model creators here https://docs.bfl.ai/guides/prompting_guide_flux2
      • Notably, they trained heavily on complex JSON structured prompts to allow for very complex scene control, though this is not required
      • They used a powerful LLM for inputs, which allows multiple languages and a variety of ways of phrasing/formatting text to work
    • Resolution: Flux2 supports just about any resolution you can think of, from 64x64 up to 4 megapixels (2048x2048)
    • CFG Scale: 1
    • Steps: They recommend 50, 20 still works but may have some quality reduction
    • Sigma Shift: Defaults to 2.02
    • Flux Guidance Scale: Defaults to 3.5, fiddling this value up a bit (to eg 4) may be better
    • Sampler: Defaults to regular Euler
    • Scheduler: Defaults to Flux2, a new specialty scheduler added for Flux.2 to use, but it makes very little difference
    • Prompt Images: add up to a max of 6 images to the prompt box to be used as reference images. This uses significantly more memory.

Flux.2 Klein

  • Klein is a smaller variant of Flux.2
    • It is lower quality vs the full Flux.2-Dev, but runs much faster. Certain aspects of the quality can actually be better, notably visual quality seems to have been tuned better, overall intelligence is lower.
    • There is a 4B and a 9B variant. While the 9B is larger, it often seems like the 4B is smarter.
    • It uses a smaller text encoder (Qwen 4B for Klein 4B, and Qwen 8B for Klein 9B). It will be autodownloaded.
    • Download Klein 4b here
      • or a gguf here or base gguf here
      • It has a distilled variant (Steps=8, CFG=1), and a "Base" variant (high steps, high CFG)
    • or Klein 9b here (you may need to accept a license here)
    • or klein 9b base here
    • or klein 9b-kv cache version here
      • This version uses "KV Cache" to accelerate image editing.
      • Any model with KV Cache support MUST HAVE 9b-kv in the filename. This is how Swarm detects and applies KV Cache behavior to the model. (There is no way to automatically detect).
      • This requires significantly more VRAM, so most people do not want this.
    • Save the file into diffusion_models
    • Broadly works the same as Flux.2-Dev
    • On the distilled model set Steps to 8, on base model use normal high step counts
    • On the distilled model set CFG Scale to 1, on base model use normal CFG eg 7
    • They have an official prompting guide here

Chroma

  • Chroma is a derivative of Flux, and is supported in SwarmUI
    • FP8 Scaled versions here: silveroxides/Chroma1-HD-fp8-scaled
    • Or GGUF versions here: silveroxides/Chroma-GGUF
    • Or original BF16 here (not recommended): lodestones/Chroma
    • Model files go in diffusion_models
    • Uses standard CFG, not distilled to 1 like other Flux models
    • Original official reference workflow used Scheduler=Align Your Steps with Steps=26 and CFG Scale=4
      • (It's named Optimal Steps in their workflow, but Swarm's AYS scheduler is equivalent to that)
      • "Sigmoid Offset" scheduler was their later recommendation, it requires a custom node
        • You can git clone https://github.com/silveroxides/ComfyUI_SigmoidOffsetScheduler into your ComfyUI custom_nodes, and then restart SwarmUI, and it will be available from the Scheduler param dropdown
      • Or, "power_shift" / "beta42" from ComfyUI_PowerShiftScheduler may be better
        • Works the same, git clone https://github.com/silveroxides/ComfyUI_PowerShiftScheduler into your ComfyUI custom_nodes and restart
    • Generally works better with longer prompts. Adding some "prompt fluff" on the end can help clean it up. This is likely related to it being a beta model with an odd training dataset.
  • Parameters
    • CFG Scale: around 3.5
    • Sampler: Defaults to regular Euler
    • Scheduler: Defaults to Beta
    • Steps: Normal step counts work, official recommendation is 26
    • Sigma Shift: Defaults to 1
    • Resolution: 1024x1024 or nearby values. The HD models were trained extra on 1152x1152.

Chroma Radiance

  • Chroma Radiance is a pixel-space model derived from Flux, and is supported in SwarmUI
    • It is a work in progress, expect quality to be limited for now
    • Download here lodestones/Chroma1-Radiance
      • Model files go in diffusion_models
    • It does not use a VAE
  • Parameters
    • CFG Scale: around 3.5
    • Sampler: Defaults to regular Euler
    • Scheduler: Defaults to Beta
    • Steps: Normal step counts work, higher is recommended to reduce quality issues
    • Sigma Shift: Defaults to 1. Set to 0 to explicitly remove shift and use the underlying comfy default behavior.
    • Prompt: Long and detailed prompts are recommended.
    • Negative Prompt: Due to the model's experimental early train status, a good negative prompt is essential.
      • Official example: This low quality greyscale unfinished sketch is inaccurate and flawed. The image is very blurred and lacks detail with excessive chromatic aberrations and artifacts. The image is overly saturated with excessive bloom. It has a toony aesthetic with bold outlines and flat colors.

Lumina 2

img (Generated with the highest degree of image-text alignment preprompt, CFG=4, SigmaShift=6, Steps=20)

  • Lumina 2 is an image diffusion transformer model, similar in structure to SD3/Flux/etc. rectified flow DiTs, with an LLM (Gemma 2 2B) as its input handler.
  • It is a 2.6B model, similar size to SDXL or SD3.5M, much smaller than Flux or SD3.5L
  • You can download the Comfy Org repackaged version of the model for use in SwarmUI here: https://huggingface.co/Comfy-Org/Lumina_Image_2.0_Repackaged/blob/main/all_in_one/lumina_2.safetensors
  • Because of the LLM input, you have to prompt it like an LLM.
    • This means a cat yields terrible results, instead give it: You are an assistant designed to generate superior images with the superior degree of image-text alignment based on textual prompts or user prompts. <Prompt Start> a cat to get good results
    • Lumina's published reference list of prompt prefixes from source code:
      • You are an assistant designed to generate high-quality images with the highest degree of image-text alignment based on textual prompts. <Prompt Start>
      • You are an assistant designed to generate high-quality images based on user prompts. <Prompt Start>
      • You are an assistant designed to generate high-quality images with highest degree of aesthetics based on user prompts. <Prompt Start>
      • You are an assistant designed to generate superior images with the superior degree of image-text alignment based on textual prompts or user prompts. <Prompt Start>
      • You are an assistant designed to generate four high-quality images with highest degree of aesthetics arranged in 2x2 grids based on user prompts. <Prompt Start>
      • You can absolutely make up your own though.
      • For longer prompts the prefix becomes less needed.
  • The model uses the Flux.1 VAE
  • Parameters:
    • CFG: 4 is their base recommendation
    • Sigma Shift: The default is 6 per Lumina reference script, Comfy recommends 3 for use with lower step counts, so you can safely mess with this parameter if you want to. 6 seems to be generally better for structure, while 3 is better for fine details by sacrificing structure, but may have unwanted artifacts. Raising step count reduces some artifacts.
    • Steps: The usual 20 steps is fine, but reference Lumina script uses 250(?!) by default (it has a weird sampler that is akin to Euler at 36 steps actually supposedly?)
      • Quick initial testing shows that raising steps high doesn't work any particularly different on this model than others, but the model at SigmaShift=6 produces some noise artifacts at regular 20 steps, raising closer to 40 cuts those out.
    • Renorm CFG: Lumina 2 reference code sets a new advanced parameter Renorm CFG to 1. This is available in Swarm under Advanced Sampling.
      • The practical difference is subjective and hard to predict, but enabling it seems to tend towards more fine detail

Qwen Image

img (Qwen Image run at CFG=4, Steps=50, Res=1328x1328. This took me about 3 minutes per image. This comparison is unfair to the other models, but this model seems intended to be a 'slow but smart' model, so this is the way to run it for now. The test prompt seems to be particularly hard on Qwen Image, I promise it's smarter than this makes it look lol.)

  • Qwen Image is natively supported in SwarmUI.
    • Download the model here Comfy-Org/Qwen-Image_ComfyUI
    • The text encoder is Qwen 2.5 VL 7B (LLM), and will be automatically downloaded.
    • It has its own VAE, and will be automatically downloaded.
    • SageAttention has compatibility issues, if you use Sage it will need to be disabled.
    • CFG: You can use CFG=1 for best performance. You can also happily use higher CFGs, eg CFG=4, at a performance cost.
    • Steps: normal ~20 works, but higher steps (eg 50) is recommended for best quality
    • Resolution: 1328x1328 is their recommended resolution, but you can shift it around to other resolutions in a range from 928 up to 1472.
    • Performance: Can be fast on Res=928x928 CFG=1 Steps=20, but standard params are very slow (one full minute for a standard res 20 step cfg 4 image on a 4090, compared to ~10 seconds for Flux on the same).
      • Requires >30 gigs of system RAM just to load at all in fp8. If you have limited sysram you're gonna have a bad time. Pagefile can help.
    • Prompts: TBD, but it seems very friendly to general prompts in both natural language and booru-tag styles. Official recommendations are very long LLM-ish prompts though.
    • Sigma Shift: Comfy defaults it to 1.15, but this ruins fine details, so Swarm defaults it to 3 instead. Many different values are potentially valid. Proper guidance on choices TBD.

Controlnets

  • There are three controlnet versions available for Qwen Image currently
    • Regular form
      • There's a regular controlnet-union available here InstantX/Qwen-Image-ControlNet-Union (be sure to rename the file when you save it)
      • works like any other controlnet. Select as controlnet model, give it an image, select a preprocessor. Probably lower the strength a bit.
      • Compatible with lightning loras.
      • If not using Lightning, probably raise your CFG a bit to ensure your prompt is stronger than the controlnet.
    • "Model Patch"
    • LoRA form
      • Download here Comfy-Org/Qwen-Image-DiffSynth-ControlNets: loras
      • Save to loras folder
      • Select the lora, use with a regular qwen image base model
      • Upload a prompt image of controlnet input (depth or canny)
        • You can create this from an existing image by using the Controlnet Parameter group, select the preprocessor (Canny, or MiDAS Depth), and hit "Preview"
      • You cannot use the controlnet parameters directly for actual generation due to the weird lora-hack this uses
  • Note that Qwen Image controlnets do not work the best on the Qwen Image Edit model.

Qwen Image Edit

  • The Qwen Image Edit model can be downloaded here: Comfy-Org/Qwen-Image-Edit_ComfyUI
    • qwen_image_edit_2511_fp8mixed recommended currently
    • Or GGUF version here: unsloth/Qwen-Image-Edit-2511-GGUF or old 2509 QuantStack/Qwen-Image-Edit-2509-GGUF (or old version QuantStack/Qwen-Image-Edit-GGUF)
    • Or nunchaku version here: nunchaku-qwen-image-edit-2509 (or old version nunchaku-qwen-image-edit)
    • For original Edit or v2509, the architecture cannot be autodetected and must be set manually. 2511 can autodetect.
      • Click the hamburger menu on a model, then Edit Metadata, then change Architecture to Qwen Image Edit Plus and hit Save
        • For the original model (prior to 2509), use Qwen Image Edit
    • Most params are broadly the same as regular Qwen Image
    • CFG must be 1, Edit is not compatible with higher CFGs normally (unless using an advanced alternate guidance option)
    • Sigma Shift: 3 or lower (as low as 0.5) is a valid range. Some users report that a value below 1 might be ideal for single-image inputs.
    • You can insert image(s) to the prompt box to have it edit that image
      • It will focus the first image, but you can get it to pull features from additional images (with limited quality)
      • Qwen Image Edit Plus works with up to 3 images well
      • Use phrasing like The person in Picture 1 to refer to the content of specific input images in the prompt
      • There are a few samples of how to prompt here https://www.alibabacloud.com/help/en/model-studio/qwen-image-edit-api
      • Smart Image Prompt Resizing parameter (top-left, under Image Prompting) will resize your input images automatically. Turn this off if you've carefully sized your images in advance.
        • Some versions of Qwen Edit require strict sizing to work well. 2511 reportedly works fine within a range of options.
    • There are a couple dedicated Qwen Image Edit Lightning Loras lightx2v/Qwen-Image-Edit-2511-Lightning or for older copies lightx2v/Qwen-Image-Lightning
      • Take care to separate the Edit lora vs the base Qwen Image lora.

Hunyuan Image 2.1

img

  • Hunyuan Image 2.1 is supported in SwarmUI.
    • The main model's official original download here: tencent/HunyuanImage-2.1, save to diffusion_models
    • There is also a distilled variant, you can download here: Comfy-Org/HunyuanImage_2.1_ComfyUI.
    • They also provide and recommend a Refiner model, you can download that here: hunyuanimage-refiner
      • FP8 download link pending
      • Or GGUF: QuantStack/HunyuanImage-2.1-Refiner-GGUF
      • This naturally is meant to be used via the Refine/Upscale parameter group in Swarm.
        • Set Refiner Control Percentage to 1, set Refiner Steps to 4, set Refiner CFG Scale to 1
        • You may also want to mess with the prompt, official recommend is some hacky LLM stuff: <|start_header_id|>system<|end_header_id|>Describe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background: <|eot_id|><|start_header_id|>user<|end_header_id|> Make the image high quality<|eot_id|>. You can use <base> my prompt here <refiner> that llm junk here in Swarm to automatically emit refiner-specific prompts.
      • This specific model is not required. In fact, it's pretty bad. It can be replaced with other models of other architectures - pick the model with details you like and refine with that instead.
      • Running the base model without a refiner works too, but fine detail quality is bad. You'll want to pick a refiner. (Possibly finetunes will fix the base in the future, as happened eg with SDXL Base years ago.)
    • CFG Scale: Normal CFG range, recommended around 3.5. The distilled model is capable of CFG=1. The refiner requires CFG=1.
    • Steps: Normal step values, around 20. Refiner prefers 4.
    • Resolution: Targets 2048x2048, can work at lower resolutions too.
      • The VAE is a 32x32 downscale (vs most image models use 8x8), so it's a much smaller latent image than other models would have at this scale.
      • 2048 on this model is the same latent size as 512 on other models.
    • Sigma Shift: Default is 5. Refine defaults to 4.
    • TBD: Info specific to Distilled variant usage (doesn't seem to work well with their documented settings, testing TBD or comfy fix), and dedicated Refiner model
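
The latent-size comparison in the Resolution note above is just a division of the pixel side length by the VAE downscale factor. A quick sketch (the function name is made up for this example):

```python
def latent_side(pixel_side: int, vae_downscale: int) -> int:
    """Side length of the latent image for a given pixel side length."""
    return pixel_side // vae_downscale


print(latent_side(2048, 32))  # 64: Hunyuan Image 2.1 at its 2048 target
print(latent_side(512, 8))    # 64: a typical 8x8-downscale model at 512
print(latent_side(1024, 8))   # 128: a typical 8x8-downscale model at 1024
```

So a 2048x2048 Hunyuan Image 2.1 generation works on the same 64x64 latent grid that an 8x8-downscale model would use for a 512x512 image.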

Z-Image

img

(Steps=9, Z-Image Turbo)

  • Z-Image and Z-Image Turbo are supported in SwarmUI!
    • It is a 6B scaled model, with both a strong base and an official turbo designed to run extremely fast while competing at the top level of image models
    • "Edit" and "Omni" variants are still expected
  • The "Turbo" model was the first version officially released, download here: Z-Image-Turbo-FP8Mixed
  • The "Base" model was released around 2 months later, download here (Pending: good fp8mixed)
  • Uses the Flux.1 VAE, will be downloaded and handled automatically
    • You might prefer swapping to the UltraFlux VAE which gets better photorealism quality (be sure to rename the file when you save it, eg Flux/UltraFlux-vae.safetensors)
  • Parameters:
    • Prompt: Supports general prompting in any format just fine. Speaks English and Chinese deeply, understands other languages decently well too.
    • Sampler: Default is fine. Some users find Euler Ancestral can be better on photorealism detail. Comfy examples suggest Res MultiStep.
    • Scheduler: Default is fine. Some users find Beta can be very slightly better.
    • CFG Scale: For Turbo, use 1. For Base, use normal CFG ranges (eg 4 or 7).
    • Steps: For Turbo, small numbers are fine. 4 will work, 8 is better. For Base, 20+ steps as normal.
      • Original Turbo repo suggests 5/9, but this appears redundant in Swarm.
      • For particularly difficult prompts, raising Steps up to 20 on Turbo or 50 on Base may help get the full detail.
    • Resolution: Side length 1024 is the standard, but anywhere up to 2048 is good. 512 noticeably loses some quality, above 2048 corrupts the image.
    • Sigma Shift: Default is 3, raising to 6 can yield stronger coherence.
    • Here's a big ol' grid of Z-Image Turbo params: Z-Image MegaGrid
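
The parameter guidance above can be summarized as a small lookup table. This is only a sketch: `ZImagePreset`, `PRESETS`, and `resolution_note` are hypothetical names, not Swarm identifiers. The numbers are copied from the list, with CFG 4 picked arbitrarily from the "4 or 7" range for Base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZImagePreset:
    cfg_scale: float
    steps: int
    steps_for_hard_prompts: int
    sigma_shift: float = 3.0  # default per the list above; 6 is an alternative


PRESETS = {
    "turbo": ZImagePreset(cfg_scale=1.0, steps=8, steps_for_hard_prompts=20),
    "base": ZImagePreset(cfg_scale=4.0, steps=20, steps_for_hard_prompts=50),
}


def resolution_note(side: int) -> str:
    """Classify a side length using the thresholds from the list above."""
    if side > 2048:
        return "too large: image corruption expected"
    # The guide does not say where between 512 and 1024 quality starts to
    # drop, so treating only <= 512 as degraded is an assumption.
    if side <= 512:
        return "works, with noticeable quality loss"
    return "ok"


print(PRESETS["turbo"], resolution_note(1024))
```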

Z-Image Turbo Seed Variety Trick

  • There's a trick to get better seed variety in Z-Image Turbo: (This is less needed on Base)
    • Add an init image (Any image, doesn't matter much - the broad color bias of the image may be used, but that's about it).
    • Set Steps higher than normal (say 8 instead of 4)
    • Set Init Image Creativity to a relatively high value (eg 0.7)
    • Set Advanced Sampling -> Sigma Shift to a very high value like 22
    • Hit generate.
    • (This basically just screws up the model in a way it can recover from, but the recovery makes it take very different paths depending on seed)
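
To get a feel for why a Sigma Shift of 22 is so disruptive, here is a rough sketch using the time-shift formula commonly applied to flow-matching models, `shift * s / (1 + (shift - 1) * s)`. Whether Swarm applies exactly this form to Z-Image is an assumption, and the function names are hypothetical.

```python
def shift_sigma(sigma: float, shift: float) -> float:
    """Common flow-matching time shift (assumed form, see the note above)."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def schedule(steps: int, shift: float) -> list:
    """Evenly spaced sigmas from 1 down to 0, with the shift applied."""
    return [shift_sigma(1.0 - i / steps, shift) for i in range(steps + 1)]


for shift in (3.0, 22.0):
    # Print the 8-step schedule for the default shift and for the trick's shift.
    print(shift, [round(s, 3) for s in schedule(8, shift)])
```

Under this assumed formula, a shift of 22 keeps the sigmas near 1 (very noisy) for most of the run and only drops at the very end. That fits the description above of a disruption the model has to recover from.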

Z-Image Controlnets

  • There's a "DiffSynth Model Patch" controlnet-union available here: alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.1
    • This goes in your regular ControlNets folder
      • Comfy treats this as a separate "model_patches" folder. To use the Comfy folder format, add ;model_patches to the end of Server Config->Paths->SDControlNetsFolder
    • Proper Architecture ID is Z-Image ControlNet (DiffPatch)
    • Works like any other controlnet. Select as controlnet model, give it an image, select a preprocessor. Fiddle the strength to taste.
    • Despite being a Union controlnet, the Union Type parameter is not used.
    • Because it is "Model Patch" based, the Start and End parameters also do not work.
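
The folder tweak above amounts to appending one entry to a semicolon-separated list. A minimal sketch follows; the starting value `controlnet` is only an assumed example of what the setting might contain, and the helper name is hypothetical.

```python
def add_folder_alternative(setting: str, extra: str = "model_patches") -> str:
    """Append `extra` to a semicolon-separated folder list if it is missing."""
    parts = [p for p in setting.split(";") if p]
    if extra not in parts:
        parts.append(extra)
    return ";".join(parts)


# "controlnet" is an assumed current value, used for illustration only.
print(add_folder_alternative("controlnet"))  # controlnet;model_patches
```

Calling it again on the result changes nothing, so the edit is safe to repeat.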

Kandinsky 5

Anima

  • Anima by Circlestone Labs is a 2B anime model built on Cosmos, and it is fully supported in SwarmUI.
    • It is designed to be tiny, lightweight, fast, but built on a strong architecture.
    • It is the first model architecture publicly released that was sponsored by Comfy Org!
    • It is explicitly still in Preview status, they will be training it further before it's entirely ready.
  • Download the preview version here
    • Save in diffusion_models
  • It uses a tiny Qwen 3 600M ("0.6B") text encoder. This will be autodownloaded.
  • It uses the Qwen Image VAE. This will be autodownloaded.
  • Parameters:
    • Prompt: Trained on both booru style tag prompts (1girl, etc) and natural language prompts. They have official specific writing guidance here
    • CFG Scale: Regular CFG scales (eg 4) work.
    • Steps: Regular 20+ steps.
    • Resolution: Side length 1024 recommended, but any lower value works too. Higher values do not work well. Refiner upscale needs tiling due to corruption at high res.
    • Sampler: Defaults to ER-SDE-Solver, but all common samplers work. They officially recommend also trying out Euler Ancestral or DPM++ 2M SDE.
    • Scheduler: Default is fine (Simple), or you can experiment at will. The model is adaptable.

Video Models

  • Video models are documented in Video Model Support.
  • You can use some (not all) Text2Video models as Text2Image models.
    • Generally, just set Text2Video Frames to 1 and it will be treated as image gen.
    • Some models may favor different parameters (CFG, Steps, Shift, etc.) for images vs videos.

Alternative Model Formats

Bits-and-Bytes NF4 Format Models

  • BnB NF4 and FP4 format models, such as this copy of Flux Dev lllyasviel/flux1-dev-bnb-nf4, are partially supported in SwarmUI automatically.
    • The detection internally works by looking for bitsandbytes__nf4 or bitsandbytes__fp4 in the model's keys
    • The first time you try to load a BNB-NF4 or BNB-FP4 model, it will give you a popup asking to install support
    • You can accept this popup, and it will install and reload the backend
    • Then try to generate again, and it should work
  • Note that BnB-NF4 and BnB-FP4 models have multiple compatibility limitations, including that even LoRAs don't apply properly.
    • If you want a quantized flux model, GGUF is recommended instead.
    • Support is barely tested. The latest bnb doesn't work with Comfy, but old bnb is incompatible with other dependencies, so good luck getting it to load.
      • Seriously, just use GGUF or something. bnb is not worth it.
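
The key-based detection described above can be sketched with only the standard library, since a `.safetensors` file starts with an 8-byte little-endian header length followed by a JSON header listing the tensor names. The function names here are hypothetical, and the substring match is an assumption about how the check works, not Swarm's actual code.

```python
import json
import struct
from pathlib import Path
from typing import List, Optional

BNB_MARKERS = ("bitsandbytes__nf4", "bitsandbytes__fp4")


def read_safetensors_keys(path: Path) -> List[str]:
    """Read tensor names from a .safetensors header without loading weights."""
    with path.open("rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return [key for key in header if key != "__metadata__"]


def detect_bnb_format(keys: List[str]) -> Optional[str]:
    """Return "nf4" or "fp4" if any key contains a BnB marker, else None."""
    for marker in BNB_MARKERS:
        if any(marker in key for key in keys):
            return marker.split("__")[1]
    return None
```

Usage would look like `detect_bnb_format(read_safetensors_keys(Path("model.safetensors")))`, where the file name is a placeholder.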

GGUF Quantized Models

  • GGUF Quantized diffusion_models models are supported in SwarmUI automatically.
    • The detection is based on file extension.
    • They go in (Swarm)/Models/diffusion_models and work similarly to other diffusion_models format models
      • Required VAE & TextEncoders will be autodownloaded if you do not already have them.
    • The first time you try to load a GGUF model, it will give you a popup asking to install support
      • This will autoinstall city96/ComfyUI-GGUF which is developed by city96.
      • You can accept this popup, and it will install and reload the backend
      • Then try to generate again, and it should just work
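
Extension-based detection is about as simple as it sounds. The sketch below shows it alongside an optional magic-bytes check, which is not part of the detection described above and is included only as an extra sanity check. The function names are hypothetical.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the four bytes a GGUF file starts with


def is_gguf_by_extension(filename: str) -> bool:
    """Extension check, matching the detection rule described above."""
    # Case-insensitive matching is an assumption of this sketch.
    return Path(filename).suffix.lower() == ".gguf"


def has_gguf_magic(path: Path) -> bool:
    """Optional extra check that the file content really is GGUF."""
    with path.open("rb") as f:
        return f.read(4) == GGUF_MAGIC
```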

Nunchaku (MIT Han Lab)

  • MIT Han Lab's "Nunchaku" / 4-bit SVDQuant models are an unusual quant format that is supported in SwarmUI.
    • Nunchaku is a very dense quantization of models (eg 6GiB for Flux models) that runs very fast (4.4 seconds for a 20 step Flux Dev image on Windows RTX 4090, vs ~11 seconds for fp8 on the same setup)
    • It is optimized for modern nvidia GPUs, with different optimizations per gpu generation
      • RTX 30xx and 40xx cards need "int4" format nunchaku models
      • RTX 50xx or newer cards need "fp4" format nunchaku models
    • They go in (Swarm)/Models/diffusion_models and work similarly to other diffusion_models format models
      • Make sure you download a "singlefile" nunchaku file, not a legacy "SVDQuant" folder
      • Required VAE & TextEncoders will be autodownloaded if you do not already have them.
    • For the older "SVDQuant" Folder Models (mit-han-lab/svdquant), the detection is based on the folder structure: you need the files transformer_blocks.safetensors and comfy_config.json inside the folder. You cannot have unrelated files in the folder.
    • The first time you try to load a Nunchaku model, it will give you a popup asking to install support
      • This will autoinstall mit-han-lab/ComfyUI-nunchaku and its dependencies
      • You can accept this popup, and it will install and reload the backend
      • Then try to generate again, and it should just work
    • Nunchaku has various compatibility limitations due to hacks in the custom nodes. Not all lora, textenc, etc. features will work as intended.
      • It does not work on all python/torch/etc. versions, as they have deeply cursed dependency distribution
    • The Nunchaku Cache Threshold param is available to enable block-caching, which improves performance further at the cost of quality.
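
Two of the rules above are easy to express in code: which Nunchaku format a given RTX card needs, and what a legacy "SVDQuant" folder must contain. This is a sketch with hypothetical helper names. Parsing the series out of the GPU name with a regex is an assumption, and the folder check does not enforce the "no unrelated files" rule because the doc does not say which files count as related.

```python
import re
from pathlib import Path
from typing import Optional

REQUIRED_LEGACY_FILES = ("transformer_blocks.safetensors", "comfy_config.json")


def nunchaku_precision(gpu_name: str) -> Optional[str]:
    """Map an RTX GPU name to "int4" or "fp4", or None if not covered above."""
    match = re.search(r"RTX\s*(\d{2})\d{2}", gpu_name, re.IGNORECASE)
    if match is None:
        return None
    series = int(match.group(1))
    if series >= 50:
        return "fp4"  # RTX 50xx or newer
    if series in (30, 40):
        return "int4"  # RTX 30xx and 40xx
    return None  # older cards are not mentioned in the list above


def looks_like_legacy_svdquant(folder: Path) -> bool:
    """Check that a folder holds the two files the legacy detection needs."""
    return folder.is_dir() and all(
        (folder / name).is_file() for name in REQUIRED_LEGACY_FILES
    )


print(nunchaku_precision("NVIDIA GeForce RTX 4090"))  # int4
```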

TensorRT

  • TensorRT support (.engine) is available for SDv1, SDv2-768-v, SDXL Base, SDXL Refiner, SVD, SD3-Medium
  • TensorRT is an nvidia-specific accelerator library that provides faster SD image generation at the cost of reduced flexibility. Generally this is best for heavy usage, especially for API/Bots/etc., and is less useful for regular individual usage.
  • You can generate TensorRT engines from the model menu. This includes a button on-page to autoinstall TRT support your first time using it, and configuration of graph size limits and optimal scales. (TensorRT works fastest when you generate at the selected optimal resolution, and slightly less fast at any dynamic resolution outside the optimal setting.)
  • Note that TensorRT is not compatible with LoRAs, ControlNets, etc.
  • Note that you need to make a fresh TRT engine for any different model you want to use.

Obscure Model Redirection

Stable Diffusion v1 and v2

SegMind SSD-1B

Stable Cascade

PixArt Sigma

NVIDIA Sana

HiDream-i1

Cosmos Predict2

OmniGen 2

Ovis

These obscure/old/bad/unpopular/etc. models have been moved to Obscure Model Support