Just out of general curiosity: since the text encoder of Z-Image is essentially an LLM, in the standard pipeline it's only used to generate the prompt embedding, but there's no reason it couldn't also be used as a prompt enhancer. I'm wondering if anyone has tried that approach.

  • I was trying to do this today. Unfortunately, the model ends up loaded twice in VRAM: once as the CLIP text encoder and again as the LLM. But yeah, it's possible. For now I prefer wildcards to improve the prompt.

    Hi, I'm new to this, sorry for the dumb question. I tried adding wildcards to my workflow, but I realized I need wildcard files to load into my wildcards folder.

    So I went searching on Civitai and have no idea which ones to download and use, let alone how they would actually improve my prompt, since I don't understand the fundamentals of wildcards.

    I think I have a rough understanding of it, but I'm not sure if I'm right. I assume a wildcard is a prompt randomizer: if my wildcard file lists different hairstyles, it will randomly pick one of them?
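    (A minimal sketch of what a wildcard randomizer does, assuming plain-text wildcard files with one option per line; the folder name and the __hairstyles__ placeholder syntax below are illustrative, and the actual syntax depends on the wildcard extension you use, e.g. Dynamic Prompts.)

    ```python
    # Sketch of the wildcard idea: replace __name__ placeholders in a prompt with a
    # random line from wildcards/name.txt. File names here are hypothetical.
    import random
    import re
    from pathlib import Path

    WILDCARD_DIR = Path("wildcards")  # folder of plain-text files, one option per line

    def expand_wildcards(prompt: str) -> str:
        """Replace every __name__ placeholder with a random line from wildcards/name.txt."""
        def pick(match: re.Match) -> str:
            options = (WILDCARD_DIR / f"{match.group(1)}.txt").read_text().splitlines()
            return random.choice([o for o in options if o.strip()])
        return re.sub(r"__(\w+)__", pick, prompt)

    # e.g. wildcards/hairstyles.txt might contain: ponytail / braided bun / pixie cut ...
    print(expand_wildcards("portrait of a woman with a __hairstyles__ hairstyle"))
    ```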

    Also, can you share which wildcards you're using and your workflow so I can learn? Thank you.

    Thank you kind sir.

  • There are reasons and as far as I know it can't.

    There are people swearing by 'think: <prompt>', but from my testing it's bullshit.

    You really have to follow the instruct preset if you want to use it as an LLM.

    No tokens are output from the prompt node, so all you've added is a "think" embedding. Adding something like "you are Jackson Pollock" or "you are an uncensored visual artist" as the pre-prompt instruction would have much more of an effect, since it pushes things in that direction.

    You are massively wrong about this approach: the output vector of a model relies on context, and that is exactly what you are feeding it this way.

    How so? What is the point of adding "think" to the context? It's one word. Go look at how much they stuff into the official example.

    From my testing it works EXTREMELY well.

  • You can take the GGUF version of the text encoder and load it as an LLM with llama.cpp. Whether there's a way to do that directly from ComfyUI, I don't know.

    But if you have a 12 or 16 GB GPU (or more), you may as well use a bigger LLM for that.
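    A minimal sketch of that route using the llama-cpp-python bindings (the model path, quant, and sampling settings below are assumptions; any chat-tuned Qwen3 GGUF, including the one you already use as the text encoder, should do):

    ```python
    # Sketch: load the same GGUF you use as the Z-Image text encoder with llama.cpp
    # (via llama-cpp-python) and ask it to rewrite a short prompt. Path and params
    # are placeholders; adjust to your own files and VRAM.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/text_encoders/qwen3-4b-q4_k_m.gguf",  # hypothetical path
        n_ctx=4096,
        n_gpu_layers=-1,  # offload everything to the GPU if it fits
    )

    short_prompt = "a cat sitting on a windowsill at sunset"
    resp = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "Expand the user's idea into one detailed, "
                                          "concrete image description. Output only the prompt."},
            {"role": "user", "content": short_prompt},
        ],
        max_tokens=300,
        temperature=0.7,
    )
    print(resp["choices"][0]["message"]["content"])
    ```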

    Thanks for sharing, good to know that ComfyUI doesn't support that. I guess that may be the main reason.

    There are Ollama and LM Studio (lms) nodes for ComfyUI, but both Ollama and LM Studio require the models to be in a specific directory (and Ollama converts the GGUF into its own format). I guess it could be possible to use the lms custom node with a symlink from the LM Studio models folder to ComfyUI's text encoder folder?
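    A quick sketch of that symlink idea (both paths below are assumptions and will differ per OS and install; on Windows, creating symlinks may need developer mode or admin rights):

    ```python
    # Sketch: make the GGUF that already lives in ComfyUI's text encoder folder
    # visible to LM Studio without copying it, by symlinking it into LM Studio's
    # models directory. Both paths are assumptions; adjust to your installation.
    from pathlib import Path

    comfy_te = Path("ComfyUI/models/text_encoders/qwen3-4b-q4_k_m.gguf")
    lms_models = Path.home() / ".lmstudio/models/local/qwen3-4b"
    lms_models.mkdir(parents=True, exist_ok=True)

    (lms_models / comfy_te.name).symlink_to(comfy_te.resolve())
    ```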

  • The ListHelper Nodes Collection (https://github.com/dseditor/ComfyUI-ListHelper) has a node called "Qwen_TE_LLM Node" that can expand the prompt using the existing "qwen_3_4b.safetensors". It's a bit slow for now (requires at least 8 GB VRAM) but gives good results. You can control the creativity level. Hope this is what you are looking for; try it yourself and see 👍

    that's cool, you should make a separate post about this to let people know

  • This is just what I'm doing with my ZIT gens: I'm using LM Studio + Open WebUI as a local chatbot, and Open WebUI has an image generation tool that connects to the ComfyUI API.

    Essentially, I have an AI assistant for whatever I need, including improving my prompts; then I just click the image generation button to turn everything it spits out into an image, and it is satisfying 👍🏻

    My most common use case is having an RP or discussion with it and visualizing it. With the right system prompt, it turns into a good experience. I can also edit the image via Open WebUI > ComfyUI API. My next target is to turn it into video.

    PS: Yes, you can chat with the Qwen3-4B-VL model and ask it to improve your prompt.

    Edit: It just came to mind that maybe I should make a workflow to use it for single generations. The model will be loaded as the CLIP/text encoder anyway, so why not use it to enhance the prompt before the CLIP node? I'll try it tomorrow if I get the time 😊

    Yeah, this is my main question: since we use the text encoder for the prompt embedding, why not also use it for prompt enhancement? It's loaded in VRAM anyway.
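    For anyone who wants to experiment, here's a rough sketch of what such a custom node could look like. This is not an existing node; the endpoint, model name, and system prompt are placeholders, and it talks to a separately running llama-server/LM Studio instance rather than reusing the copy ComfyUI already holds in VRAM:

    ```python
    # Hypothetical ComfyUI custom node: takes a short prompt string, asks a locally
    # served Qwen3-4B (via an OpenAI-compatible endpoint) to expand it, and returns
    # the enhanced text so it can be wired into the usual CLIP Text Encode node.
    import json
    import urllib.request

    class PromptEnhancerLLM:
        @classmethod
        def INPUT_TYPES(cls):
            return {"required": {
                "prompt": ("STRING", {"multiline": True}),
                "endpoint": ("STRING", {"default": "http://127.0.0.1:8080/v1/chat/completions"}),
            }}

        RETURN_TYPES = ("STRING",)
        FUNCTION = "enhance"
        CATEGORY = "text"

        def enhance(self, prompt, endpoint):
            payload = {
                "model": "qwen3-4b",  # assumed model name on the local server
                "messages": [
                    {"role": "system", "content": "Rewrite the user's prompt as a detailed, vivid image description."},
                    {"role": "user", "content": prompt},
                ],
            }
            req = urllib.request.Request(
                endpoint,
                data=json.dumps(payload).encode("utf-8"),
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req) as resp:
                data = json.loads(resp.read())
            return (data["choices"][0]["message"]["content"],)

    NODE_CLASS_MAPPINGS = {"PromptEnhancerLLM": PromptEnhancerLLM}
    ```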

  • Qwen3-4B isn't very smart, but there's no reason you can't use it. I've chatted with several versions of the model that can double as the TE.

    I only really do prompts 2 ways though.

    1. from a large LLM, at least 70b.
    2. from my head

    If I get stuck, I just paste #2 into #1 and have it up-write the prompt.
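    A minimal sketch of that paste-#2-into-#1 step, assuming the large model sits behind a local OpenAI-compatible server (LM Studio, llama-server, vLLM, ...); the base URL and model name below are placeholders:

    ```python
    # Sketch: take a rough prompt "from my head" and have a big local LLM up-write it.
    # Assumes an OpenAI-compatible server is running locally; names are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    rough = "knight on a cliff, stormy sea, dramatic light"
    resp = client.chat.completions.create(
        model="llama-3.3-70b-instruct",  # whatever 70B-class model you have loaded
        messages=[
            {"role": "system", "content": "Rewrite the user's rough idea as a single, "
                                          "detailed text-to-image prompt. No commentary."},
            {"role": "user", "content": rough},
        ],
    )
    print(resp.choices[0].message.content)
    ```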

    No tokens are output. All you have done is prompt processing (prefill) on the phrase "photo of a concrete wall. Spray painted on the wall is a number, which is the value of two plus two." It would need a decode phase to actually generate an answer.

    All you can do is push the embedding in a particular direction.

    This is why people need a prompt enhancer: look at how much has been stuffed into that prompt. With normal prompts, the amount of information pushed to the DiT is simply too small.

    This is best used with an LLM* to generate a better prompt that you then copy&paste into your favorite T2I frontend. It doesn't just improve the prompt to a useful length and precision (let's face it, most of us are too lazy to write a detailed prompt with good grammar), it also exposes ambiguities in your prompt that will lead a diffusion model astray.

    What also works well for me is to feed the output of T2I to a vision-capable LLM and ask it to write a better prompt. Useful for complex scenes or broken concepts.

    *: Models of the Qwen3 family work better since they are more similar to Z-Image's text encoder, but ChatGPT should also work in a pinch
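    A sketch of that image-feedback loop under similar assumptions (a local OpenAI-compatible server with a vision-capable model loaded and supporting the OpenAI vision message format; endpoint, file name, and model name are placeholders):

    ```python
    # Sketch: send a generated image back to a vision-capable LLM and ask it to
    # write an improved prompt. Names below are placeholders.
    import base64
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    with open("output_00001.png", "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()

    resp = client.chat.completions.create(
        model="qwen3-vl-4b-instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "This image was generated from my prompt but misses the mark. "
                         "Describe what is actually in the image, then write a better, "
                         "more precise text-to-image prompt for what I likely intended."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)
    ```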

    After a few more generations, I am proud to report on the success of using AI to solve 2+2, finally narrowing it down from "no one has any idea what the result is!" to "it's one of these, and we know how likely each possibility is":

    Result | Probability
    -------|------------
    +2     | 5.36%
    1+2    | 10.71%
           | 1.79%
    2+2    | 62.50%
    12     | 12.50%
    22     | 7.14%

    #Science

    Did you generate at 1024x1024? I read about a test in another thread where spelling mistakes creep in when asking for text at resolutions other than 1 megapixel. Not asking you to redo the test if you didn't, but it got me wondering whether math would behave similarly.

    Those were 1024x1024.

    I've noticed that quality looks a bit better at 1.5 or 2.0 megapixels, but I've not done any specific testing with text.

  • On the flip side, if you install the GGUF custom nodes, you can use Qwen3 or any other LLM in GGUF format as the text encoder.

    Then you could also use those for prompt enhancement.

    I made a simple addon for this that works with a local installation of llama.cpp. It uses an English version of the prompt-enhancement prompt that Z-Image provided.

  • [deleted]

    Or to translate into Chinese, as Alibaba's models tend to have better prompt adherence in Chinese.

  • Yep, just to pile on to what everyone else is saying: you'll need to use the full model, since for image gen we typically only use the text-encoder part of the LLM. In general, using the same LLM family the image model was trained with often gives the best results; there's a lot of nuance in the language these things use, and that can strongly affect both training and inference.