add ernie image support #1427

Merged
leejet merged 3 commits into master from ernie
Apr 16, 2026

Conversation

@leejet
Owner

@leejet leejet commented Apr 15, 2026

.\bin\Release\sd-cli.exe --diffusion-model  ..\..\ComfyUI\models\diffusion_models\ernie-image-turbo.safetensors --vae ..\..\ComfyUI\models\vae\flux2_ae.safetensors  --llm ..\..\ComfyUI\models\text_encoders\ministral-3-3b.safetensors -p "a lovely cat holding a sign says 'ernie.cpp'" --cfg-scale 1.0 --steps 8 -v --offload-to-cpu --diffusion-fa
output

@leejet leejet mentioned this pull request Apr 15, 2026
@candrews

candrews commented Apr 16, 2026

Can an ernie.md file please be added under docs that includes how to use ernie, ernie-turbo, and the prompt enhancer?

@Green-Sky
Contributor

Tried some quants of turbo with the flux2 VAE smaller decoder (#1402):

q6_K output
q5_K output
q5_0 output
q4_K output
q4_0 output
q3_K output

Quants work really well with this model; must be the architecture.
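For context on what those quant levels cost, ggml's block quant formats store weights in blocks of 32 with a per-block fp16 scale. A rough back-of-the-envelope for q8_0 and q4_0 (ignoring tensors kept at higher precision):

```python
# ggml block quant layout: 32 quantized values + one fp16 scale per block
weights_per_block = 32

# q8_0: 32 int8 values (32 bytes) + fp16 scale (2 bytes) = 34 bytes/block
q8_0_bits = (32 * 1 + 2) * 8 / weights_per_block
print(q8_0_bits)  # 8.5 bits per weight

# q4_0: 32 4-bit values (16 bytes) + fp16 scale (2 bytes) = 18 bytes/block
q4_0_bits = (32 // 2 + 2) * 8 / weights_per_block
print(q4_0_bits)  # 4.5 bits per weight
```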

@Green-Sky
Contributor

1280x1280 q4_k turbo small-vae

[INFO ] ggml_extend.hpp:1957 - ernie_image offload params (4357.36 MB, 409 tensors) to runtime backend (CUDA0), taking 0.38s
[DEBUG] ggml_extend.hpp:1859 - ernie_image compute buffer size: 1011.37 MB(VRAM)
  |==================================================| 8/8 - 8.23s/it
[INFO ] stable-diffusion.cpp:3195 - sampling completed, taking 66.26s

Looks really good.

@kuhnchris

kuhnchris commented Apr 16, 2026

Interesting: the official text_encoder .safetensors files fail to load, since all their tensors seem to live under a "language_model" sub-node. Using a different Ministral-3B does work, though.

Failing:
VAE: https://huggingface.co/baidu/ERNIE-Image-Turbo/blob/main/vae/diffusion_pytorch_model.safetensors
LLM: https://huggingface.co/baidu/ERNIE-Image-Turbo/tree/main/text_encoder
Model: https://huggingface.co/unsloth/ERNIE-Image-Turbo-GGUF/blob/main/ernie-image-turbo-Q8_0.gguf

~/sd-cli-ernie --diffusion-model ernie-image-turbo-Q8_0.gguf --vae ERNIE-vae.safetensors --llm ERNIE-llm.safetensors -H 1024 -W 1024 --diffusion-fa --flow-shift 3 -p 'An playing card rave, cartoon/anime style, flashing disco lights' -o test.png
...
[INFO ] model.cpp:1617 - unknown tensor 'text_encoders.llm.language_model.model.embed_tokens.weight | bf16 | 2 [3072, 131072, 1, 1, 1]' in model file
[INFO ] model.cpp:1617 - unknown tensor 'text_encoders.llm.language_model.model.layers.0.input_layernorm.weight | bf16 | 1 [3072, 1, 1, 1, 1]' in model file
[INFO ] model.cpp:1617 - unknown tensor 'text_encoders.llm.language_model.model.layers.0.mlp.down_proj.weight | bf16 | 2 [9216, 3072, 1, 1, 1]' in model file
[INFO ] model.cpp:1617 - unknown tensor 'text_encoders.llm.language_model.model.layers.0.mlp.gate_proj.weight | bf16 | 2 [3072, 9216, 1, 1, 1]' in model file
...
[ERROR] model.cpp:1658 - tensor 'text_encoders.llm.model.embed_tokens.weight' not in model file
[ERROR] model.cpp:1658 - tensor 'text_encoders.llm.model.layers.0.input_layernorm.weight' not in model file
[ERROR] model.cpp:1658 - tensor 'text_encoders.llm.model.layers.0.mlp.down_proj.weight' not in model file
[ERROR] model.cpp:1658 - tensor 'text_encoders.llm.model.layers.0.mlp.gate_proj.weight' not in model file

Working:
VAE: https://huggingface.co/baidu/ERNIE-Image-Turbo/blob/main/vae/diffusion_pytorch_model.safetensors
LLM: https://huggingface.co/unsloth/Ministral-3-3B-Instruct-2512-GGUF?show_file_info=Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf
Model: https://huggingface.co/unsloth/ERNIE-Image-Turbo-GGUF/blob/main/ernie-image-turbo-Q8_0.gguf

~/sd-cli-ernie --diffusion-model ernie-image-turbo-Q8_0.gguf --vae ERNIE-vae.safetensors --llm Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf --diffusion-fa -p 'An playing card rave, cartoon/anime style' -o test.png --cfg-scale 1.0 --steps 8 -v
--
[INFO ] auto_encoder_kl.hpp:517  - vae decoder: ch = 128
[DEBUG] ggml_extend.hpp:2046 - vae params backend buffer size =  94.72 MB(VRAM) (140 tensors)
[INFO ] stable-diffusion.cpp:774  - Using flash attention in the diffusion model
[DEBUG] stable-diffusion.cpp:803  - loading weights
[DEBUG] model.cpp:1333 - using 48 threads for model loading
[DEBUG] model.cpp:1355 - loading tensors from ernie-image-turbo-Q8_0.gguf
  |======================>                           | 409/893 - 4.88GB/s
[DEBUG] model.cpp:1355 - loading tensors from Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf
  |====================================>             | 645/893 - 3.44GB/s
[DEBUG] model.cpp:1355 - loading tensors from ERNIE-vae.safetensors
  |==================================================| 893/893 - 3.43GB/s
[INFO ] model.cpp:1584 - loading tensors completed, taking 2.99s (process: 0.00s, read: 0.07s, memcpy: 0.00s, convert: 0.02s, copy_to_backend: 1.70s)
[DEBUG] stable-diffusion.cpp:843  - finished loaded file
[INFO ] stable-diffusion.cpp:895  - total params memory size = 11690.70MB (VRAM 11690.70MB, RAM 0.00MB): text_encoders 3303.90MB(VRAM), diffusion_model 8292.08MB(VRAM), vae 94.72MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:977  - running in FLOW mode
[INFO ] stable-diffusion.cpp:3130 - generate_image 512x512
[INFO ] denoiser.hpp:499  - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:2706 - sampling using Euler method
[DEBUG] conditioner.hpp:1695 - parse 'An playing card rave, cartoon/anime style' to [['', 1], ['An playing card rave, cartoon/anime style', 1], ['', 1], ]
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "" to tokens []
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "An playing card rave, cartoon/anime style" to tokens ["An", "Ġplaying", "Ġcard", "Ġra", "ve", ",", "Ġcartoon", "/an", "ime", "Ġstyle", ]
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "" to tokens []
[DEBUG] ggml_extend.hpp:1859 - ministral3.3b compute buffer size: 1.20 MB(VRAM)
[DEBUG] conditioner.hpp:1949 - computing condition graph completed, taking 101 ms
[INFO ] stable-diffusion.cpp:3060 - get_learned_condition completed, taking 0.10s
[INFO ] stable-diffusion.cpp:3164 - generating image: 1/1 - seed 42
[DEBUG] ggml_extend.hpp:1859 - ernie_image compute buffer size: 163.19 MB(VRAM)
  |==================================================| 8/8 - 3.08it/s
[INFO ] stable-diffusion.cpp:3195 - sampling completed, taking 2.65s
[INFO ] stable-diffusion.cpp:3213 - generating 1 latent images completed, taking 2.65s
[INFO ] stable-diffusion.cpp:3084 - decoding 1 latents
[DEBUG] ggml_extend.hpp:1859 - vae compute buffer size: 1664.50 MB(VRAM)
[DEBUG] vae.hpp:206  - computing vae decode graph completed, taking 0.13s
[INFO ] stable-diffusion.cpp:3100 - latent 1 decoded, taking 0.13s
[INFO ] stable-diffusion.cpp:3104 - decode_first_stage completed, taking 0.13s
[INFO ] stable-diffusion.cpp:3225 - generate_image completed in 2.95s
[INFO ] main.cpp:438  - save result image 0 to 'test.png' (success)
[INFO ] main.cpp:487  - 1/1 images saved
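A side note on the Ġ markers in the tokenizer debug output above: byte-level BPE tokenizers (GPT-2 style, which the Mistral tokenizer follows) remap every raw byte to a printable character so merges can operate on plain text, and the space byte 0x20 lands on 'Ġ'. A minimal sketch of the standard byte-to-unicode table:

```python
def bytes_to_unicode():
    """GPT-2 style byte -> printable-character table.

    Printable Latin-1 bytes keep their own codepoint; every other byte
    is shifted up past 255, which is how 0x20 (space) becomes 'Ġ'."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}

print(bytes_to_unicode()[ord(" ")])  # Ġ
```

So a token like "Ġplaying" simply means " playing" with its leading space made visible.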

While this works, as soon as I pass any of the officially supported width/height parameters (-W 1024 -H 1024), I only get white output...

Officially supported resolutions:
1024x1024
848x1264
1264x848
768x1376
896x1200
1376x768
1200x896
 ~/sd-cli-ernie --diffusion-model ernie-image-turbo-Q8_0.gguf --vae ERNIE-vae.safetensors --llm Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf --diffusion-fa -p 'An playing card rave, cartoon/anime style' -o test.png --cfg-scale 1.0 --steps 8 -v -W 1024 -H 1024 --seed 32

[INFO ] stable-diffusion.cpp:267  - loading diffusion model from 'ernie-image-turbo-Q8_0.gguf'
[INFO ] model.cpp:331  - load ernie-image-turbo-Q8_0.gguf using gguf format
[DEBUG] model.cpp:377  - init from 'ernie-image-turbo-Q8_0.gguf'
[INFO ] stable-diffusion.cpp:314  - loading llm from 'Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf'
[INFO ] model.cpp:331  - load Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf using gguf format
[DEBUG] model.cpp:377  - init from 'Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf'
[INFO ] stable-diffusion.cpp:328  - loading vae from 'ERNIE-vae.safetensors'
[INFO ] model.cpp:334  - load ERNIE-vae.safetensors using safetensors format
[DEBUG] model.cpp:468  - init from 'ERNIE-vae.safetensors', prefix = 'vae.'
[INFO ] stable-diffusion.cpp:353  - Version: Ernie Image
[INFO ] stable-diffusion.cpp:381  - Weight type stat:                      f32: 203  |    q8_0: 253  |    q4_K: 100  |    q5_K: 30   |    q6_K: 33   |  iq4_xs: 20   |    bf16: 254
[INFO ] stable-diffusion.cpp:382  - Conditioner weight type stat:          f32: 53   |    q4_K: 100  |    q5_K: 30   |    q6_K: 33   |  iq4_xs: 20
[INFO ] stable-diffusion.cpp:383  - Diffusion model weight type stat:      f32: 150  |    q8_0: 253  |    bf16: 6
[INFO ] stable-diffusion.cpp:384  - VAE weight type stat:                 bf16: 248
[DEBUG] stable-diffusion.cpp:386  - ggml tensor size = 400 bytes
[DEBUG] mistral_tokenizer.cpp:23   - vocab size: 131072
[DEBUG] mistral_tokenizer.cpp:31   - merges size 269443
[DEBUG] llm.hpp:693  - llm: num_layers = 26, vocab_size = 131072, hidden_size = 3072, intermediate_size = 9216
[INFO ] ernie_image.hpp:376  - ernie_image: layers = 36, hidden_size = 4096, heads = 32, ffn_hidden_size = 12288, in_channels = 128, out_channels = 128
[DEBUG] ggml_extend.hpp:2046 - ministral3.3b params backend buffer size =  3303.90 MB(VRAM) (236 tensors)
[DEBUG] ggml_extend.hpp:2046 - ernie_image params backend buffer size =  8292.08 MB(VRAM) (409 tensors)
[INFO ] stable-diffusion.cpp:679  - using VAE for encoding / decoding
[INFO ] auto_encoder_kl.hpp:517  - vae decoder: ch = 128
[DEBUG] ggml_extend.hpp:2046 - vae params backend buffer size =  94.72 MB(VRAM) (140 tensors)
[INFO ] stable-diffusion.cpp:774  - Using flash attention in the diffusion model
[DEBUG] stable-diffusion.cpp:803  - loading weights
[DEBUG] model.cpp:1333 - using 48 threads for model loading
[DEBUG] model.cpp:1355 - loading tensors from ernie-image-turbo-Q8_0.gguf
  |======================>                           | 409/893 - 4.77GB/s
[DEBUG] model.cpp:1355 - loading tensors from Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf
  |====================================>             | 645/893 - 3.41GB/s
[DEBUG] model.cpp:1355 - loading tensors from ERNIE-vae.safetensors
  |==================================================| 893/893 - 3.39GB/s
[INFO ] model.cpp:1584 - loading tensors completed, taking 3.02s (process: 0.00s, read: 0.07s, memcpy: 0.00s, convert: 0.02s, copy_to_backend: 1.74s)
[DEBUG] stable-diffusion.cpp:843  - finished loaded file
[INFO ] stable-diffusion.cpp:895  - total params memory size = 11690.70MB (VRAM 11690.70MB, RAM 0.00MB): text_encoders 3303.90MB(VRAM), diffusion_model 8292.08MB(VRAM), vae 94.72MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:977  - running in FLOW mode
[INFO ] stable-diffusion.cpp:3130 - generate_image 1024x1024
[INFO ] denoiser.hpp:499  - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:2706 - sampling using Euler method
[DEBUG] conditioner.hpp:1695 - parse 'An playing card rave, cartoon/anime style' to [['', 1], ['An playing card rave, cartoon/anime style', 1], ['', 1], ]
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "" to tokens []
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "An playing card rave, cartoon/anime style" to tokens ["An", "Ġplaying", "Ġcard", "Ġra", "ve", ",", "Ġcartoon", "/an", "ime", "Ġstyle", ]
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "" to tokens []
[DEBUG] ggml_extend.hpp:1859 - ministral3.3b compute buffer size: 1.20 MB(VRAM)
[DEBUG] conditioner.hpp:1949 - computing condition graph completed, taking 110 ms
[INFO ] stable-diffusion.cpp:3060 - get_learned_condition completed, taking 0.11s
[INFO ] stable-diffusion.cpp:3164 - generating image: 1/1 - seed 32
[DEBUG] ggml_extend.hpp:1859 - ernie_image compute buffer size: 647.69 MB(VRAM)
  |==================================================| 8/8 - 1.23s/it
[INFO ] stable-diffusion.cpp:3195 - sampling completed, taking 10.05s
[INFO ] stable-diffusion.cpp:3213 - generating 1 latent images completed, taking 10.06s
[INFO ] stable-diffusion.cpp:3084 - decoding 1 latents
[DEBUG] ggml_extend.hpp:1859 - vae compute buffer size: 6658.00 MB(VRAM)
[DEBUG] vae.hpp:206  - computing vae decode graph completed, taking 0.49s
[INFO ] stable-diffusion.cpp:3100 - latent 1 decoded, taking 0.51s
[INFO ] stable-diffusion.cpp:3104 - decode_first_stage completed, taking 0.51s
[INFO ] stable-diffusion.cpp:3225 - generate_image completed in 10.83s
[INFO ] main.cpp:438  - save result image 0 to 'test.png' (success)
[INFO ] main.cpp:487  - 1/1 images saved

However, removing the -W and -H parameters makes it work again.
Not sure whether this is connected to using the unsloth GGUFs, though.
Without -W and -H the generation is also far faster:

[DEBUG] ggml_extend.hpp:1859 - ernie_image compute buffer size: 163.19 MB(VRAM)
  |==================================================| 8/8 - 3.08it/s

(the default output seems to be 512x512, so that would check out)
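The buffer sizes back that up: the ernie_image compute buffer grows roughly with latent area, and 1024x1024 has 4x the pixels of 512x512. A rough sanity check on the figures from the logs above (not an exact model of the allocator):

```python
# Compute buffer sizes taken from the log output above (MB)
buf_512 = 163.19    # ernie_image compute buffer at 512x512
buf_1024 = 647.69   # ernie_image compute buffer at 1024x1024

area_ratio = (1024 * 1024) / (512 * 512)   # 4.0
print(buf_1024 / buf_512)                  # ~3.97, close to the 4x area ratio
```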

@GreenShadows

It would probably be a little faster once SD.cpp syncs with GGML and incorporates the latest optimizations.
ggml-org/llama.cpp#21713

@leejet
Owner Author

leejet commented Apr 16, 2026

Can an ernie.md file please be added under docs that includes how to use ernie, ernie-turbo, and the prompt enhancer?

Done — ernie_image.md has been added under docs with usage for ernie image and ernie image turbo.

The prompt enhancer isn’t built into sd.cpp; it’s just standard LLM-based prompt expansion and can be done via tools like llama.cpp or ChatGPT / Gemini.

@leejet
Owner Author

leejet commented Apr 16, 2026

Interesting: the official text_encoder .safetensors files fail to load, since all their tensors seem to live under a "language_model" sub-node. Using a different Ministral-3B does work, though.

@kuhnchris This is just a naming convention issue. You can download the compatible .safetensors files here: https://huggingface.co/Comfy-Org/ERNIE-Image/tree/main/text_encoders
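For anyone who'd rather fix up the official file locally, a minimal sketch of the renaming (the helper name is an assumption, not part of sd.cpp; applying it to a real checkpoint would go through the safetensors package's load_file/save_file):

```python
def strip_language_model_prefix(tensors: dict, prefix: str = "language_model.") -> dict:
    """Drop the leading 'language_model.' from tensor names, so that e.g.
    'language_model.model.embed_tokens.weight' becomes
    'model.embed_tokens.weight', the name sd.cpp looks for."""
    return {(k[len(prefix):] if k.startswith(prefix) else k): v
            for k, v in tensors.items()}

# Applied to a file (requires the safetensors package, hypothetical paths):
#   from safetensors.torch import load_file, save_file
#   save_file(strip_language_model_prefix(load_file("in.safetensors")),
#             "out.safetensors")
```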

@leejet leejet merged commit 5c243db into master Apr 16, 2026
9 of 15 checks passed