The 'CUDA out of memory' error in Stable Diffusion means your GPU's VRAM (video memory) is exhausted before generation completes. Unlike system RAM, VRAM is finite and can't be paged to disk without major slowdowns. This guide covers every VRAM optimization technique from basic (lower resolution) to advanced (xformers, VAE tiling, split-attention), ranked from fastest to try to most involved.
RuntimeError: CUDA out of memory. Tried to allocate X GiB (GPU 0; Y GiB total capacity; Z GiB already allocated)
torch.cuda.OutOfMemoryError: CUDA out of memory
Generation starts but crashes mid-denoising: a partial image appears, then the error
AUTOMATIC1111 WebUI shows 'OutOfMemoryError' in the terminal and aborts generation
ComfyUI shows red node errors with 'CUDA out of memory' message in the execution log
System becomes unresponsive or monitor flickers before the error (GPU driver crash under extreme OOM)
Error occurs consistently at a specific step number in the denoising loop
Works at lower resolutions but consistently fails at target resolution
VRAM requirements scale roughly with the square of image resolution. Going from 512x512 to 1024x1024 requires ~4x the VRAM, not 2x. SD 1.5 base runs at 512x512 on 4GB VRAM; SDXL base needs 6-8GB at 1024x1024. Running SDXL at 1536x1536 requires 12-16GB or more. Resolution is the single biggest VRAM lever.
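As a rough rule of thumb (a sketch, not a measurement): activation VRAM grows with pixel count, which is the square of the image's side length. A few representative ratios:

```python
# Rough rule of thumb: activation VRAM scales with pixel count,
# i.e., with the square of the image side length.
def relative_vram(width: int, height: int, base: int = 512) -> float:
    """Pixel count relative to a 512x512 baseline."""
    return (width * height) / (base * base)

print(relative_vram(1024, 1024))  # 4.0  -> ~4x the activations of 512x512
print(relative_vram(1536, 1536))  # 9.0  -> ~9x
print(relative_vram(512, 768))    # 1.5  -> a modest step up
```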
SDXL models (6.9GB+ checkpoint files) require significantly more VRAM than SD 1.5 models (2GB checkpoint files). A GPU that runs SD 1.5 smoothly may OOM immediately on SDXL. SDXL needs a minimum of 6GB VRAM (with optimizations); SD 1.5 runs on 4GB. Check your checkpoint file size: anything over 5GB is likely SDXL or a large fine-tuned merge.
float32 uses exactly double the VRAM of float16 for identical models and images. AUTOMATIC1111 and most UIs default to float16 on modern GPUs, but some configurations, older drivers, or specific model types force float32. This doubles your effective VRAM usage with no quality benefit in most scenarios.
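To see the factor of two concretely, here is a small PyTorch check; the ~860M parameter count for the SD 1.5 UNet is approximate:

```python
import torch

# float32 stores 4 bytes per element, float16 stores 2.
print(torch.zeros(1, dtype=torch.float32).element_size())  # 4
print(torch.zeros(1, dtype=torch.float16).element_size())  # 2

# Applied to the SD 1.5 UNet (~860M parameters, approximate):
params = 860_000_000
print(f"fp32 weights: {params * 4 / 1e9:.1f} GB")  # ~3.4 GB
print(f"fp16 weights: {params * 2 / 1e9:.1f} GB")  # ~1.7 GB
```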
Generating 4 images at once (batch size 4) requires roughly 4x the VRAM of a single image. Beginners often increase batch size for efficiency, not realizing it's multiplicative on VRAM rather than additive. For 8GB GPUs, batch size 1 is usually the only viable option at 512-768px resolutions.
Each add-on consumes additional VRAM: a LoRA adds ~100-500MB, ControlNet adds 1-3GB, and high-res fix runs generation twice (once at low res, once upscaled). Stacking multiple ControlNet models or running HR fix at 2x upscale on SDXL is very VRAM-intensive. On 8GB GPUs, SDXL + ControlNet + HR fix often causes OOM.
Other GPU-accelerated apps (web browsers with hardware acceleration, games, video players, OBS streaming) all consume VRAM. A browser with 20 tabs open can occupy 1-2GB of VRAM. This headroom disappears during Stable Diffusion generation and causes OOM errors that don't occur when the browser is closed.
After a CUDA OOM error, VRAM sometimes remains partially allocated; the GPU driver doesn't always clean up properly. Subsequent generations can fail immediately even though the theoretical VRAM would be enough, because fragmented allocations prevent contiguous blocks from forming. A full restart of the WebUI (not just canceling the generation) clears this.
When to try: First fix; largest VRAM impact, no quality loss if you use proper upscaling afterward
Lower your width and height in the generation settings. SD 1.5 target: 512x512 or 512x768 (native training resolution). SDXL target: 1024x1024 (native resolution). If you need larger output, generate at native resolution and then upscale with a separate upscaler (Real-ESRGAN in AUTOMATIC1111 via the Extras tab, or the Ultimate SD Upscale script). Do not try to generate directly at 1920x1080; generate at 512x768 and upscale 2-4x.
When to try: If batch size is greater than 1
In AUTOMATIC1111: the Batch size field (separate from Batch count). Set it to 1. Batch count (how many times to run) is fine at any value. Batch size > 1 multiplies VRAM use per generation. This single change resolves many OOM errors, especially on 4-8GB GPUs. You can generate multiple images by using batch count instead; it just takes longer rather than running in parallel.
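If you script generation with the diffusers library instead of a WebUI, the same tradeoff looks like this (a sketch; the model ID and prompt are placeholders):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a watercolor fox"  # placeholder prompt

# Batch SIZE 4: one parallel call, ~4x peak VRAM. Often OOMs on 8GB cards.
# images = pipe(prompt, num_images_per_prompt=4).images

# Batch COUNT 4: four sequential calls, same peak VRAM as a single image.
images = [pipe(prompt).images[0] for _ in range(4)]
```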
When to try: For 4-8GB VRAM cards getting consistent OOM errors, especially at moderate resolutions
In AUTOMATIC1111: edit webui-user.bat (Windows) or webui-user.sh (Mac/Linux). Find the COMMANDLINE_ARGS line and add --medvram (for 4-6GB GPUs) or --lowvram (for GPUs under 4GB or extreme VRAM pressure). --medvram moves model components to system RAM when not in use, reducing peak VRAM by ~30-40% at the cost of 10-20% slower generation. --lowvram is more aggressive, reducing VRAM further but with a larger speed penalty. These flags are the standard fix for consistent OOM on lower-VRAM cards.
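For diffusers users, the closest equivalents are the CPU-offload helpers (a sketch; the mapping to --medvram/--lowvram is approximate, not exact):

```python
import torch
from diffusers import StableDiffusionPipeline

# Note: do NOT call .to("cuda") when using the offload helpers below.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# Roughly analogous to --medvram: whole sub-models (UNet, VAE, text
# encoder) are moved to the GPU only while in use. Moderate slowdown.
pipe.enable_model_cpu_offload()

# Roughly analogous to --lowvram: offload layer by layer. Much slower,
# minimal peak VRAM. Use one helper or the other, not both.
# pipe.enable_sequential_cpu_offload()

image = pipe("a watercolor fox").images[0]
```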
When to try: For NVIDIA GPU users, before resorting to medvram/lowvram (better speed/VRAM tradeoff)
xformers is a memory-efficient attention library that reduces VRAM usage by 10-40% depending on the model and resolution, often with a small speed boost. In AUTOMATIC1111: add --xformers to your COMMANDLINE_ARGS in webui-user.bat, then restart. You should see 'xformers' mentioned in the startup log if it loaded correctly. To install xformers if it's not present: in the AUTOMATIC1111 folder, run 'pip install xformers'. Note: xformers is NVIDIA-only; AMD GPU users need a different approach (see ROCm alternatives).
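In a diffusers script, the equivalent is a one-line toggle (a sketch; requires the xformers package and an NVIDIA GPU, and raises an exception if unavailable). On PyTorch 2.x, the built-in scaled-dot-product attention, which is what --opt-sdp-attention enables in AUTOMATIC1111, gives similar savings without the extra package.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Requires `pip install xformers`; raises if the package or GPU support
# is missing, so you know immediately whether it is active.
pipe.enable_xformers_memory_efficient_attention()

image = pipe("a watercolor fox").images[0]
```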
When to try: If Hires. fix is enabled and OOM occurs during the high-res pass (typically at 60-80% of the progress bar)
In AUTOMATIC1111's txt2img tab, the 'Hires. fix' checkbox roughly doubles the effective VRAM required because it generates the image twice. If enabled, either disable it (generate at base resolution and use Extras → Upscale instead), or reduce the Hires fix upscale ratio from 2x to 1.5x to lower peak VRAM usage. At 1.5x, most of the quality benefit is preserved while the second-pass VRAM overhead drops significantly.
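The arithmetic behind the 1.5x recommendation: the second pass's pixel count, and hence its activation footprint, grows with the square of the upscale ratio.

```python
# Second-pass pixel count relative to the base pass, by upscale ratio.
for ratio in (1.5, 2.0):
    print(f"{ratio}x upscale -> {ratio ** 2:.2f}x the base-pass pixels")
# 1.5x -> 2.25x, 2.0x -> 4.00x: dropping from 2x to 1.5x roughly
# halves the high-res pass's memory footprint.
```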
When to try: If OOM errors started recently or are inconsistent (sometimes works, sometimes doesn't)
Before generating, close your web browser (Chrome/Firefox use significant GPU memory with hardware acceleration) and close any games, video players, or streaming software. On Windows: open Task Manager → Performance → GPU and check 'Dedicated GPU Memory' usage before starting generation; you want it as close to 0 as possible. On Linux: nvidia-smi in a terminal shows per-process VRAM usage. Browsers are the biggest surprise VRAM consumer; 1-2GB is common.
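You can also query headroom programmatically before launching a long queue (NVIDIA/CUDA only; a minimal sketch):

```python
import torch

# Free vs. total VRAM as seen by the CUDA driver, in bytes.
free, total = torch.cuda.mem_get_info()
print(f"free: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")

# If `free` is far below `total` before you load any model, another
# process (often a hardware-accelerated browser) is holding VRAM.
```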
When to try: If the first generation after startup works but subsequent generations fail with OOM
After a CUDA OOM error, do a full restart of AUTOMATIC1111 or ComfyUI, not just canceling the generation. Close the Python process entirely and relaunch via webui.bat or webui.sh. This clears any fragmented VRAM allocations that remain after OOM events. In AUTOMATIC1111, you can also go to Settings → Actions → Unload model from VRAM, which frees the model without restarting (useful mid-session).
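If you are scripting with PyTorch directly, the mid-session equivalent is a best-effort cache flush (a sketch; it cannot defragment memory held by live tensors, which is why a full restart remains the reliable fix):

```python
import gc
import torch

# Drop your own references first (e.g., `del pipe`), or nothing is freed.
gc.collect()              # collect unreachable Python objects
torch.cuda.empty_cache()  # return PyTorch's cached blocks to the driver
torch.cuda.ipc_collect()  # reclaim memory from dead CUDA IPC handles

print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB still allocated")
```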
When to try: If you have --no-half or --precision full in your launch flags
In AUTOMATIC1111, open webui-user.bat (or webui-user.sh) and check COMMANDLINE_ARGS for --precision full or --no-half; remove them if present, since they force float32 and double VRAM usage. Most modern setups default to float16 automatically, but some configurations, especially on older GPUs (Pascal/Maxwell) or unusual model types, revert to float32. If you previously added --no-half to fix black-image output on an older card, try --upcast-sampling instead; it usually solves the same problem at a fraction of the VRAM cost.
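To confirm precision in a diffusers script, load in float16 explicitly and check each sub-model (a sketch):

```python
import torch
from diffusers import StableDiffusionPipeline

# Request float16 explicitly, then verify nothing silently upcast.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

print(pipe.unet.dtype)          # expect torch.float16
print(pipe.vae.dtype)           # expect torch.float16
print(pipe.text_encoder.dtype)  # expect torch.float16
```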
When to try: If OOM occurs at the very end of generation (near 100% progress bar) rather than during denoising
In AUTOMATIC1111: Settings → VAE → uncheck 'Decode in full precision (not recommended)', then enable 'Tiled VAE'. The VAE decode step (converting the latent back to pixels) is VRAM-intensive at large sizes. Tiled VAE processes the image in tiles, dramatically reducing peak VRAM during decode at the cost of negligible quality loss. This allows generating larger images (1024x1024+ on SD 1.5) that would otherwise OOM during the final decode step.
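diffusers exposes the same idea directly (a sketch; tiling helps large single images, slicing helps batches):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Decode the final latent in overlapping tiles instead of one pass;
# peak VRAM during decode drops sharply and tile seams are blended.
pipe.enable_vae_tiling()

# Lighter alternative for batches: decode one image at a time.
# pipe.enable_vae_slicing()

image = pipe("a watercolor fox", width=1024, height=1024).images[0]
```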
Know your GPU's VRAM: SD 1.5 at 512px works on 4GB; SDXL at 1024px needs 6-8GB minimum with optimizations
Always close your browser and GPU-heavy apps before starting long generation queues
For NVIDIA users, install xformers once and leave it enabled; it's free VRAM savings
Use batch count (sequential) instead of batch size (parallel) on GPUs with 8GB or less
Generate at native model resolution (512px for SD 1.5, 1024px for SDXL) and upscale with Real-ESRGAN afterward
In ComfyUI, use the 'Free model and clip from VRAM' node after generation to release memory between runs
Stable Diffusion is open-source software; there's no official support desk. Instead, use these resources: AUTOMATIC1111 issues are tracked at github.com/AUTOMATIC1111/stable-diffusion-webui/issues (search before opening a new issue). ComfyUI issues: github.com/comfyanonymous/ComfyUI. Community support: reddit.com/r/StableDiffusion; describe your GPU (make, model, VRAM), OS, Stable Diffusion version, and launch flags, and post your full terminal error output, since the 'Tried to allocate X GiB' line is critical context.
This error means your GPU's VRAM (video RAM) was full before Stable Diffusion could complete the operation. 'Tried to allocate X GiB' tells you the single allocation that failed; compare it to your GPU's total VRAM (shown in the error as 'Y GiB total capacity'). The 'already allocated' figure shows how much was in use before the failure. The gap between total and already allocated is your available headroom; if the allocation attempt exceeds that headroom, you get OOM.
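A quick way to do the headroom arithmetic is to pull the three figures out of the message (a sketch; it assumes the simplified message format quoted above, and the numbers here are made up):

```python
import re

msg = ("CUDA out of memory. Tried to allocate 2.50 GiB "
       "(GPU 0; 8.00 GiB total capacity; 6.20 GiB already allocated)")

# First three "N GiB" figures: allocation attempt, total, already in use.
tried, total, used = (float(x) for x in re.findall(r"([\d.]+) GiB", msg)[:3])
headroom = total - used
print(f"needed {tried:.2f} GiB, headroom {headroom:.2f} GiB")
if tried > headroom:
    print("allocation exceeds headroom -> OOM")
```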
Minimum by model: SD 1.5 (4GB VRAM with --medvram, 6GB comfortable); SDXL (6GB with xformers and medvram, 8GB comfortable, 12GB for HR fix); SD 3 and Flux models (10-16GB for full quality). AMD GPUs need more VRAM than NVIDIA equivalents due to ROCm overhead. For ControlNet or multiple LoRAs, add 1-3GB to these estimates. The RTX 3060 (12GB) and RTX 4060 Ti (16GB) are currently the best value cards for SDXL generation without compromises.
No. --medvram reduces peak VRAM usage by moving model components between GPU and system RAM, not by reducing model precision or quality. The generated images are identical to those produced without the flag. The only cost is speed: generation takes 10-30% longer on --medvram and up to 50-70% longer on --lowvram. For anyone with under 8GB VRAM, --medvram is almost always the right tradeoff.
Not directly and not efficiently. GPU computation requires data in VRAM; system RAM is too slow for real-time inference. Some workarounds exist: --lowvram swaps model components to system RAM and moves them back per-step, which works but is very slow (minutes per image). GGUF-format models (emerging in 2025-2026) allow mixed GPU/CPU inference with much better performance. For CPU-only generation (no GPU), Stable Diffusion runs but is extremely slow: 1-5 minutes per 512px image on a modern CPU.
SDXL's checkpoint files are 6.9GB for the base model versus ~2GB for SD 1.5. More importantly, SDXL generates at native 1024x1024 (4x the pixel count of 512x512), which requires roughly 4x the VRAM for the same denoising steps. SDXL also uses a two-stage architecture (base + refiner), doubling peak memory if you run both. Minimum viable SDXL: 6GB VRAM with --medvram --xformers at exactly 1024x1024, batch size 1, no ControlNet.
4GB is the practical minimum for SD 1.5. Required setup: AUTOMATIC1111 with --medvram plus one attention optimization (--xformers or --opt-sdp-attention) in COMMANDLINE_ARGS, resolution 512x512 only, batch size 1, no ControlNet, no HR fix. SD 1.5 fine-tuned models work fine. SDXL is not viable at 4GB even with all optimizations. For 2026, the free cloud alternatives (Google Colab free tier, Kaggle GPU notebooks) are often more practical than fighting VRAM limits on 4GB cards; they typically provide a 16GB-class GPU for free with usage caps.
ComfyUI OOM fixes: (1) Add --lowvram or --medvram to ComfyUI's command-line arguments in the run script. (2) Load the model with 'weight_dtype' set to 'fp8_e4m3fn' if your ComfyUI version supports it (the option appears on the 'Load Diffusion Model' node; it reduces model VRAM by ~50%). (3) Add a 'Free Model and Clip Memory' node after generation completes. (4) In ComfyUI Manager, install the ComfyUI-Impact-Pack, which includes memory management nodes. (5) Reduce the resolution in the 'Empty Latent Image' node. ComfyUI's modular graph also lets you identify which node causes OOM by watching which node turns red in the execution log.
Yes. For low-VRAM scenarios: SD-Turbo and SDXL-Turbo generate in 1-4 denoising steps instead of 20-50 and run without classifier-free guidance, which reduces per-step memory pressure and makes slow offloading modes far more tolerable. LCM (Latent Consistency Models) and Lightning models also reduce step count significantly. GGUF-quantized versions of SD3 and Flux models (available on Civitai and HuggingFace as of 2025) run in significantly less VRAM than their full-precision counterparts. SD 1.5 remains the most VRAM-efficient option for general creative work. Verified April 2026.
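A minimal SDXL-Turbo run with diffusers, following the published usage (one step, guidance disabled); treat the exact arguments as a sketch:

```python
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# Turbo models are trained for 1-4 steps with guidance disabled.
image = pipe(
    "a watercolor fox",
    num_inference_steps=1,
    guidance_scale=0.0,   # no classifier-free guidance
).images[0]
```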