add ernie image support #1427

Merged
leejet merged 3 commits into master from ernie
Apr 16, 2026

Conversation

@leejet
Owner

@leejet leejet commented Apr 15, 2026

.\bin\Release\sd-cli.exe --diffusion-model  ..\..\ComfyUI\models\diffusion_models\ernie-image-turbo.safetensors --vae ..\..\ComfyUI\models\vae\flux2_ae.safetensors  --llm ..\..\ComfyUI\models\text_encoders\ministral-3-3b.safetensors -p "a lovely cat holding a sign says 'ernie.cpp'" --cfg-scale 1.0 --steps 8 -v --offload-to-cpu --diffusion-fa
output

@leejet leejet mentioned this pull request Apr 15, 2026
@candrews

candrews commented Apr 16, 2026

Can an ernie.md file please be added under docs that includes how to use ernie, ernie-turbo, and the prompt enhancer?

@Green-Sky
Contributor

Tried some quants of turbo with the flux2 VAE smaller decoder (#1402):

q6_K output
q5_K output
q5_0 output
q4_K output
q4_0 output
q3_K output

Quants work really well with this model; must be the architecture.
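For context on what those quant levels cost, ggml's block quant formats store weights in blocks of 32 with a per-block fp16 scale. A rough back-of-the-envelope for q8_0 and q4_0 (ignoring tensors kept at higher precision):

```python
# ggml block quant layout: 32 quantized values + one fp16 scale per block
weights_per_block = 32

# q8_0: 32 int8 values (32 bytes) + fp16 scale (2 bytes) = 34 bytes/block
q8_0_bits = (32 * 1 + 2) * 8 / weights_per_block
print(q8_0_bits)  # 8.5 bits per weight

# q4_0: 32 4-bit values (16 bytes) + fp16 scale (2 bytes) = 18 bytes/block
q4_0_bits = (32 // 2 + 2) * 8 / weights_per_block
print(q4_0_bits)  # 4.5 bits per weight
```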

@Green-Sky
Contributor

1280x1280 q4_k turbo small-vae

[INFO ] ggml_extend.hpp:1957 - ernie_image offload params (4357.36 MB, 409 tensors) to runtime backend (CUDA0), taking 0.38s
[DEBUG] ggml_extend.hpp:1859 - ernie_image compute buffer size: 1011.37 MB(VRAM)
  |==================================================| 8/8 - 8.23s/it
[INFO ] stable-diffusion.cpp:3195 - sampling completed, taking 66.26s

Looks really good.

@kuhnchris

kuhnchris commented Apr 16, 2026

Interesting: the official text_encoder .safetensors files fail to load, since all their tensors seem to live under a "language_model" sub-node. Using a different Ministral-3B does work, though.

Failing:
VAE: https://huggingface.co/baidu/ERNIE-Image-Turbo/blob/main/vae/diffusion_pytorch_model.safetensors
LLM: https://huggingface.co/baidu/ERNIE-Image-Turbo/tree/main/text_encoder
Model: https://huggingface.co/unsloth/ERNIE-Image-Turbo-GGUF/blob/main/ernie-image-turbo-Q8_0.gguf

~/sd-cli-ernie --diffusion-model ernie-image-turbo-Q8_0.gguf --vae ERNIE-vae.safetensors --llm ERNIE-llm.safetensors -H 1024 -W 1024 --diffusion-fa --flow-shift 3 -p 'An playing card rave, cartoon/anime style, flashing disco lights' -o test.png
...
[INFO ] model.cpp:1617 - unknown tensor 'text_encoders.llm.language_model.model.embed_tokens.weight | bf16 | 2 [3072, 131072, 1, 1, 1]' in model file
[INFO ] model.cpp:1617 - unknown tensor 'text_encoders.llm.language_model.model.layers.0.input_layernorm.weight | bf16 | 1 [3072, 1, 1, 1, 1]' in model file
[INFO ] model.cpp:1617 - unknown tensor 'text_encoders.llm.language_model.model.layers.0.mlp.down_proj.weight | bf16 | 2 [9216, 3072, 1, 1, 1]' in model file
[INFO ] model.cpp:1617 - unknown tensor 'text_encoders.llm.language_model.model.layers.0.mlp.gate_proj.weight | bf16 | 2 [3072, 9216, 1, 1, 1]' in model file
...
[ERROR] model.cpp:1658 - tensor 'text_encoders.llm.model.embed_tokens.weight' not in model file
[ERROR] model.cpp:1658 - tensor 'text_encoders.llm.model.layers.0.input_layernorm.weight' not in model file
[ERROR] model.cpp:1658 - tensor 'text_encoders.llm.model.layers.0.mlp.down_proj.weight' not in model file
[ERROR] model.cpp:1658 - tensor 'text_encoders.llm.model.layers.0.mlp.gate_proj.weight' not in model file

Working:
VAE: https://huggingface.co/baidu/ERNIE-Image-Turbo/blob/main/vae/diffusion_pytorch_model.safetensors
LLM: https://huggingface.co/unsloth/Ministral-3-3B-Instruct-2512-GGUF?show_file_info=Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf
Model: https://huggingface.co/unsloth/ERNIE-Image-Turbo-GGUF/blob/main/ernie-image-turbo-Q8_0.gguf

~/sd-cli-ernie --diffusion-model ernie-image-turbo-Q8_0.gguf --vae ERNIE-vae.safetensors --llm Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf --diffusion-fa -p 'An playing card rave, cartoon/anime style' -o test.png --cfg-scale 1.0 --steps 8 -v
--
[INFO ] auto_encoder_kl.hpp:517  - vae decoder: ch = 128
[DEBUG] ggml_extend.hpp:2046 - vae params backend buffer size =  94.72 MB(VRAM) (140 tensors)
[INFO ] stable-diffusion.cpp:774  - Using flash attention in the diffusion model
[DEBUG] stable-diffusion.cpp:803  - loading weights
[DEBUG] model.cpp:1333 - using 48 threads for model loading
[DEBUG] model.cpp:1355 - loading tensors from ernie-image-turbo-Q8_0.gguf
  |======================>                           | 409/893 - 4.88GB/s
[DEBUG] model.cpp:1355 - loading tensors from Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf
  |====================================>             | 645/893 - 3.44GB/s
[DEBUG] model.cpp:1355 - loading tensors from ERNIE-vae.safetensors
  |==================================================| 893/893 - 3.43GB/s
[INFO ] model.cpp:1584 - loading tensors completed, taking 2.99s (process: 0.00s, read: 0.07s, memcpy: 0.00s, convert: 0.02s, copy_to_backend: 1.70s)
[DEBUG] stable-diffusion.cpp:843  - finished loaded file
[INFO ] stable-diffusion.cpp:895  - total params memory size = 11690.70MB (VRAM 11690.70MB, RAM 0.00MB): text_encoders 3303.90MB(VRAM), diffusion_model 8292.08MB(VRAM), vae 94.72MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:977  - running in FLOW mode
[INFO ] stable-diffusion.cpp:3130 - generate_image 512x512
[INFO ] denoiser.hpp:499  - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:2706 - sampling using Euler method
[DEBUG] conditioner.hpp:1695 - parse 'An playing card rave, cartoon/anime style' to [['', 1], ['An playing card rave, cartoon/anime style', 1], ['', 1], ]
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "" to tokens []
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "An playing card rave, cartoon/anime style" to tokens ["An", "Ġplaying", "Ġcard", "Ġra", "ve", ",", "Ġcartoon", "/an", "ime", "Ġstyle", ]
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "" to tokens []
[DEBUG] ggml_extend.hpp:1859 - ministral3.3b compute buffer size: 1.20 MB(VRAM)
[DEBUG] conditioner.hpp:1949 - computing condition graph completed, taking 101 ms
[INFO ] stable-diffusion.cpp:3060 - get_learned_condition completed, taking 0.10s
[INFO ] stable-diffusion.cpp:3164 - generating image: 1/1 - seed 42
[DEBUG] ggml_extend.hpp:1859 - ernie_image compute buffer size: 163.19 MB(VRAM)
  |==================================================| 8/8 - 3.08it/s
[INFO ] stable-diffusion.cpp:3195 - sampling completed, taking 2.65s
[INFO ] stable-diffusion.cpp:3213 - generating 1 latent images completed, taking 2.65s
[INFO ] stable-diffusion.cpp:3084 - decoding 1 latents
[DEBUG] ggml_extend.hpp:1859 - vae compute buffer size: 1664.50 MB(VRAM)
[DEBUG] vae.hpp:206  - computing vae decode graph completed, taking 0.13s
[INFO ] stable-diffusion.cpp:3100 - latent 1 decoded, taking 0.13s
[INFO ] stable-diffusion.cpp:3104 - decode_first_stage completed, taking 0.13s
[INFO ] stable-diffusion.cpp:3225 - generate_image completed in 2.95s
[INFO ] main.cpp:438  - save result image 0 to 'test.png' (success)
[INFO ] main.cpp:487  - 1/1 images saved
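A side note on the Ġ markers in the tokenizer debug output above: byte-level BPE tokenizers (GPT-2 style, which the Mistral tokenizer follows) remap every raw byte to a printable character so merges can operate on plain text, and the space byte 0x20 lands on 'Ġ'. A minimal sketch of the standard byte-to-unicode table:

```python
def bytes_to_unicode():
    """GPT-2 style byte -> printable-character table.

    Printable Latin-1 bytes keep their own codepoint; every other byte
    is shifted up past 255, which is how 0x20 (space) becomes 'Ġ'."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}

print(bytes_to_unicode()[ord(" ")])  # Ġ
```

So a token like "Ġplaying" simply means " playing" with its leading space made visible.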

While this works, as soon as I pass any of the officially supported width/height parameters (-W 1024 -H 1024), I only get white output...

Officially supported resolutions:
1024x1024
848x1264
1264x848
768x1376
896x1200
1376x768
1200x896
 ~/sd-cli-ernie --diffusion-model ernie-image-turbo-Q8_0.gguf --vae ERNIE-vae.safetensors --llm Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf --diffusion-fa -p 'An playing card rave, cartoon/anime style' -o test.png --cfg-scale 1.0 --steps 8 -v -W 1024 -H 1024 --seed 32

[INFO ] stable-diffusion.cpp:267  - loading diffusion model from 'ernie-image-turbo-Q8_0.gguf'
[INFO ] model.cpp:331  - load ernie-image-turbo-Q8_0.gguf using gguf format
[DEBUG] model.cpp:377  - init from 'ernie-image-turbo-Q8_0.gguf'
[INFO ] stable-diffusion.cpp:314  - loading llm from 'Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf'
[INFO ] model.cpp:331  - load Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf using gguf format
[DEBUG] model.cpp:377  - init from 'Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf'
[INFO ] stable-diffusion.cpp:328  - loading vae from 'ERNIE-vae.safetensors'
[INFO ] model.cpp:334  - load ERNIE-vae.safetensors using safetensors format
[DEBUG] model.cpp:468  - init from 'ERNIE-vae.safetensors', prefix = 'vae.'
[INFO ] stable-diffusion.cpp:353  - Version: Ernie Image
[INFO ] stable-diffusion.cpp:381  - Weight type stat:                      f32: 203  |    q8_0: 253  |    q4_K: 100  |    q5_K: 30   |    q6_K: 33   |  iq4_xs: 20   |    bf16: 254
[INFO ] stable-diffusion.cpp:382  - Conditioner weight type stat:          f32: 53   |    q4_K: 100  |    q5_K: 30   |    q6_K: 33   |  iq4_xs: 20
[INFO ] stable-diffusion.cpp:383  - Diffusion model weight type stat:      f32: 150  |    q8_0: 253  |    bf16: 6
[INFO ] stable-diffusion.cpp:384  - VAE weight type stat:                 bf16: 248
[DEBUG] stable-diffusion.cpp:386  - ggml tensor size = 400 bytes
[DEBUG] mistral_tokenizer.cpp:23   - vocab size: 131072
[DEBUG] mistral_tokenizer.cpp:31   - merges size 269443
[DEBUG] llm.hpp:693  - llm: num_layers = 26, vocab_size = 131072, hidden_size = 3072, intermediate_size = 9216
[INFO ] ernie_image.hpp:376  - ernie_image: layers = 36, hidden_size = 4096, heads = 32, ffn_hidden_size = 12288, in_channels = 128, out_channels = 128
[DEBUG] ggml_extend.hpp:2046 - ministral3.3b params backend buffer size =  3303.90 MB(VRAM) (236 tensors)
[DEBUG] ggml_extend.hpp:2046 - ernie_image params backend buffer size =  8292.08 MB(VRAM) (409 tensors)
[INFO ] stable-diffusion.cpp:679  - using VAE for encoding / decoding
[INFO ] auto_encoder_kl.hpp:517  - vae decoder: ch = 128
[DEBUG] ggml_extend.hpp:2046 - vae params backend buffer size =  94.72 MB(VRAM) (140 tensors)
[INFO ] stable-diffusion.cpp:774  - Using flash attention in the diffusion model
[DEBUG] stable-diffusion.cpp:803  - loading weights
[DEBUG] model.cpp:1333 - using 48 threads for model loading
[DEBUG] model.cpp:1355 - loading tensors from ernie-image-turbo-Q8_0.gguf
  |======================>                           | 409/893 - 4.77GB/s
[DEBUG] model.cpp:1355 - loading tensors from Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf
  |====================================>             | 645/893 - 3.41GB/s
[DEBUG] model.cpp:1355 - loading tensors from ERNIE-vae.safetensors
  |==================================================| 893/893 - 3.39GB/s
[INFO ] model.cpp:1584 - loading tensors completed, taking 3.02s (process: 0.00s, read: 0.07s, memcpy: 0.00s, convert: 0.02s, copy_to_backend: 1.74s)
[DEBUG] stable-diffusion.cpp:843  - finished loaded file
[INFO ] stable-diffusion.cpp:895  - total params memory size = 11690.70MB (VRAM 11690.70MB, RAM 0.00MB): text_encoders 3303.90MB(VRAM), diffusion_model 8292.08MB(VRAM), vae 94.72MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:977  - running in FLOW mode
[INFO ] stable-diffusion.cpp:3130 - generate_image 1024x1024
[INFO ] denoiser.hpp:499  - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:2706 - sampling using Euler method
[DEBUG] conditioner.hpp:1695 - parse 'An playing card rave, cartoon/anime style' to [['', 1], ['An playing card rave, cartoon/anime style', 1], ['', 1], ]
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "" to tokens []
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "An playing card rave, cartoon/anime style" to tokens ["An", "Ġplaying", "Ġcard", "Ġra", "ve", ",", "Ġcartoon", "/an", "ime", "Ġstyle", ]
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "" to tokens []
[DEBUG] ggml_extend.hpp:1859 - ministral3.3b compute buffer size: 1.20 MB(VRAM)
[DEBUG] conditioner.hpp:1949 - computing condition graph completed, taking 110 ms
[INFO ] stable-diffusion.cpp:3060 - get_learned_condition completed, taking 0.11s
[INFO ] stable-diffusion.cpp:3164 - generating image: 1/1 - seed 32
[DEBUG] ggml_extend.hpp:1859 - ernie_image compute buffer size: 647.69 MB(VRAM)
  |==================================================| 8/8 - 1.23s/it
[INFO ] stable-diffusion.cpp:3195 - sampling completed, taking 10.05s
[INFO ] stable-diffusion.cpp:3213 - generating 1 latent images completed, taking 10.06s
[INFO ] stable-diffusion.cpp:3084 - decoding 1 latents
[DEBUG] ggml_extend.hpp:1859 - vae compute buffer size: 6658.00 MB(VRAM)
[DEBUG] vae.hpp:206  - computing vae decode graph completed, taking 0.49s
[INFO ] stable-diffusion.cpp:3100 - latent 1 decoded, taking 0.51s
[INFO ] stable-diffusion.cpp:3104 - decode_first_stage completed, taking 0.51s
[INFO ] stable-diffusion.cpp:3225 - generate_image completed in 10.83s
[INFO ] main.cpp:438  - save result image 0 to 'test.png' (success)
[INFO ] main.cpp:487  - 1/1 images saved

However, removing the -W and -H parameters makes it work again.
Not sure whether this is connected to using the unsloth GGUFs, though.
Without -W and -H the generation is also far faster:

[DEBUG] ggml_extend.hpp:1859 - ernie_image compute buffer size: 163.19 MB(VRAM)
  |==================================================| 8/8 - 3.08it/s

(the default output seems to be 512x512, so that would check out)
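The buffer sizes back that up: the ernie_image compute buffer grows roughly with latent area, and 1024x1024 has 4x the pixels of 512x512. A rough sanity check on the figures from the logs above (not an exact model of the allocator):

```python
# Compute buffer sizes taken from the log output above (MB)
buf_512 = 163.19    # ernie_image compute buffer at 512x512
buf_1024 = 647.69   # ernie_image compute buffer at 1024x1024

area_ratio = (1024 * 1024) / (512 * 512)   # 4.0
print(buf_1024 / buf_512)                  # ~3.97, close to the 4x area ratio
```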

@GreenShadows

It would probably be a little faster once SD.cpp syncs with GGML and incorporates the latest optimizations.
ggml-org/llama.cpp#21713

@leejet
Owner Author

leejet commented Apr 16, 2026

Can an ernie.md file please be added under docs that includes how to use ernie, ernie-turbo, and the prompt enhancer?

Done — ernie_image.md has been added under docs with usage for ernie image and ernie image turbo.

The prompt enhancer isn’t built into sd.cpp; it’s just standard LLM-based prompt expansion and can be done via tools like llama.cpp or ChatGPT / Gemini.

@leejet
Owner Author

leejet commented Apr 16, 2026

Interesting: the official text_encoder .safetensors files fail to load, since all their tensors seem to live under a "language_model" sub-node. Using a different Ministral-3B does work, though.

@kuhnchris This is just a naming convention issue. You can download the compatible .safetensors files here: https://huggingface.co/Comfy-Org/ERNIE-Image/tree/main/text_encoders
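For anyone who'd rather fix up the official file locally, a minimal sketch of the renaming (the helper name is an assumption, not part of sd.cpp; applying it to a real checkpoint would go through the safetensors package's load_file/save_file):

```python
def strip_language_model_prefix(tensors: dict, prefix: str = "language_model.") -> dict:
    """Drop the leading 'language_model.' from tensor names, so that e.g.
    'language_model.model.embed_tokens.weight' becomes
    'model.embed_tokens.weight', the name sd.cpp looks for."""
    return {(k[len(prefix):] if k.startswith(prefix) else k): v
            for k, v in tensors.items()}

# Applied to a file (requires the safetensors package, hypothetical paths):
#   from safetensors.torch import load_file, save_file
#   save_file(strip_language_model_prefix(load_file("in.safetensors")),
#             "out.safetensors")
```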

@leejet leejet merged commit 5c243db into master Apr 16, 2026
9 of 15 checks passed