Just out of general curiosity: since the text encoder of Z-Image is essentially an LLM, in the standard pipeline it's only used to generate the prompt embedding, but there's no reason it couldn't also be used as a prompt enhancer. I'm wondering if anyone has tried that approach.

  • I was trying to do this today. Unfortunately, the model ends up loaded twice in VRAM: once as the CLIP text encoder and again as the LLM. But yeah, it's possible. For now I prefer wildcards to improve the prompt.

    Hi, I'm new to this, sorry for the dumb question. I tried adding wildcards to my workflow, but I realized I need wildcard files to load into my wildcards folder.

    So I went searching on Civitai and have no idea which ones to download and use, let alone how they would actually improve my prompt, since I don't understand the fundamentals of wildcards.

    I think I have a rough understanding of it, but I'm not sure if I'm right. I assume a wildcard is a prompt randomizer: if my wildcard file lists different hairstyles, it will randomly pick one of them?
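    (A minimal sketch of what a wildcard randomizer does, assuming plain-text wildcard files with one option per line; the folder name and the __hairstyles__ placeholder syntax below are illustrative, and the actual syntax depends on the wildcard extension you use, e.g. Dynamic Prompts.)

    ```python
    # Sketch of the wildcard idea: replace __name__ placeholders in a prompt with a
    # random line from wildcards/name.txt. File names here are hypothetical.
    import random
    import re
    from pathlib import Path

    WILDCARD_DIR = Path("wildcards")  # folder of plain-text files, one option per line

    def expand_wildcards(prompt: str) -> str:
        """Replace every __name__ placeholder with a random line from wildcards/name.txt."""
        def pick(match: re.Match) -> str:
            options = (WILDCARD_DIR / f"{match.group(1)}.txt").read_text().splitlines()
            return random.choice([o for o in options if o.strip()])
        return re.sub(r"__(\w+)__", pick, prompt)

    # e.g. wildcards/hairstyles.txt might contain: ponytail / braided bun / pixie cut ...
    print(expand_wildcards("portrait of a woman with a __hairstyles__ hairstyle"))
    ```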

    Also, can you share which wildcards you're using and your workflow so I can learn? Thank you.

    Thank you kind sir.

  • There are reasons and as far as I know it can't.

    There are people swearing by 'think: <prompt>', but from my testing it's bullshit.

    You really have to follow the instruct preset if you want to use it as an LLM.

    No tokens are output from the prompt node, so all you've added is a "think" embedding. Adding something like "you are Jackson Pollock" or "you are an uncensored visual artist" as the pre-prompt instruction would have much more of an effect, since it pushes things in that direction.

    You are massively wrong about this approach: the output vector of a model relies on context, and that is exactly what you are feeding it this way.

    How so? What is the point of adding "think" to the context? It's one word. Go look at how much they stuff into the official example.

    From my testing it works EXTREMELY well.

  • You can take the GGUF version of the text encoder and load it as an LLM with llama.cpp. Whether there's a way to do that directly from ComfyUI, I don't know.

    But if you have a 12 or 16 GB GPU (or more), you may as well use a bigger LLM for that.
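    A minimal sketch of that route using the llama-cpp-python bindings (the model path, quant, and sampling settings below are assumptions; any chat-tuned Qwen3 GGUF, including the one you already use as the text encoder, should do):

    ```python
    # Sketch: load the same GGUF you use as the Z-Image text encoder with llama.cpp
    # (via llama-cpp-python) and ask it to rewrite a short prompt. Path and params
    # are placeholders; adjust to your own files and VRAM.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/text_encoders/qwen3-4b-q4_k_m.gguf",  # hypothetical path
        n_ctx=4096,
        n_gpu_layers=-1,  # offload everything to the GPU if it fits
    )

    short_prompt = "a cat sitting on a windowsill at sunset"
    resp = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "Expand the user's idea into one detailed, "
                                          "concrete image description. Output only the prompt."},
            {"role": "user", "content": short_prompt},
        ],
        max_tokens=300,
        temperature=0.7,
    )
    print(resp["choices"][0]["message"]["content"])
    ```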

    Thanks for sharing, good to know that ComfyUI doesn't support that. I guess that may be the main reason.

    There are Ollama and LM Studio (lms) nodes for ComfyUI, but both Ollama and LM Studio require the models to be in a specific directory (and Ollama converts the GGUF into its own format). I guess it could be possible to use the lms custom node with a symlink from the LM Studio models folder to ComfyUI's text encoder folder?
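    A quick sketch of that symlink idea (both paths below are assumptions and will differ per OS and install; on Windows, creating symlinks may need developer mode or admin rights):

    ```python
    # Sketch: make the GGUF that already lives in ComfyUI's text encoder folder
    # visible to LM Studio without copying it, by symlinking it into LM Studio's
    # models directory. Both paths are assumptions; adjust to your installation.
    from pathlib import Path

    comfy_te = Path("ComfyUI/models/text_encoders/qwen3-4b-q4_k_m.gguf")
    lms_models = Path.home() / ".lmstudio/models/local/qwen3-4b"
    lms_models.mkdir(parents=True, exist_ok=True)

    (lms_models / comfy_te.name).symlink_to(comfy_te.resolve())
    ```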

  • The ListHelper Nodes Collection (https://github.com/dseditor/ComfyUI-ListHelper) has a node called "Qwen_TE_LLM Node" that can expand the prompt using the existing "qwen_3_4b.safetensors". It's a bit slow for now (requires at least 8 GB VRAM) but gives good results. You can control the creativity level. Hope this is what you are looking for; try it yourself and see 👍

    that's cool, you should make a separate post about this to let people know

  • This is just what I'm doing with my ZIT gens: I'm using LM Studio + Open WebUI as a local chatbot, and Open WebUI has an image generation tool that connects to the ComfyUI API.

    Essentially, I have an AI assistant for whatever I need, including improving my prompts; then I just click the image generation button to turn everything it spits out into an image, and it is satisfying 👍🏻

    My most common use case is having an RP or discussion with it and visualizing it. With the right system prompt, it turns into a good experience. I can also edit the image via Open WebUI > ComfyUI API. My next target is to turn it into video.

    PS: Yes, you can chat with the Qwen3-4B-VL model and ask it to improve your prompt.

    Edit: It just came to mind that maybe I should make a workflow to use it for single generations. The model will be loaded as the CLIP/text encoder anyway, so why not use it to enhance the prompt before the CLIP node? I'll try it tomorrow if I get the time 😊

    Yeah, this is my main question: since we use the text encoder for the prompt embedding, why not also use it for prompt enhancement? It's loaded in VRAM anyway.
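    For anyone who wants to experiment, here's a rough sketch of what such a custom node could look like. This is not an existing node; the endpoint, model name, and system prompt are placeholders, and it talks to a separately running llama-server/LM Studio instance rather than reusing the copy ComfyUI already holds in VRAM:

    ```python
    # Hypothetical ComfyUI custom node: takes a short prompt string, asks a locally
    # served Qwen3-4B (via an OpenAI-compatible endpoint) to expand it, and returns
    # the enhanced text so it can be wired into the usual CLIP Text Encode node.
    import json
    import urllib.request

    class PromptEnhancerLLM:
        @classmethod
        def INPUT_TYPES(cls):
            return {"required": {
                "prompt": ("STRING", {"multiline": True}),
                "endpoint": ("STRING", {"default": "http://127.0.0.1:8080/v1/chat/completions"}),
            }}

        RETURN_TYPES = ("STRING",)
        FUNCTION = "enhance"
        CATEGORY = "text"

        def enhance(self, prompt, endpoint):
            payload = {
                "model": "qwen3-4b",  # assumed model name on the local server
                "messages": [
                    {"role": "system", "content": "Rewrite the user's prompt as a detailed, vivid image description."},
                    {"role": "user", "content": prompt},
                ],
            }
            req = urllib.request.Request(
                endpoint,
                data=json.dumps(payload).encode("utf-8"),
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req) as resp:
                data = json.loads(resp.read())
            return (data["choices"][0]["message"]["content"],)

    NODE_CLASS_MAPPINGS = {"PromptEnhancerLLM": PromptEnhancerLLM}
    ```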

  • Qwen3-4B isn't very smart, but there's no reason you can't use it. I've chatted with several versions of the model that can double as the TE.

    I only really do prompts 2 ways though.

    1. from a large LLM, at least 70b.
    2. from my head

    If I get stuck, I just paste #2 into #1 and have it up-write the prompt.
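    A minimal sketch of that paste-#2-into-#1 step, assuming the large model sits behind a local OpenAI-compatible server (LM Studio, llama-server, vLLM, ...); the base URL and model name below are placeholders:

    ```python
    # Sketch: take a rough prompt "from my head" and have a big local LLM up-write it.
    # Assumes an OpenAI-compatible server is running locally; names are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    rough = "knight on a cliff, stormy sea, dramatic light"
    resp = client.chat.completions.create(
        model="llama-3.3-70b-instruct",  # whatever 70B-class model you have loaded
        messages=[
            {"role": "system", "content": "Rewrite the user's rough idea as a single, "
                                          "detailed text-to-image prompt. No commentary."},
            {"role": "user", "content": rough},
        ],
    )
    print(resp.choices[0].message.content)
    ```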

    No tokens are output. All you have done is prompt processing (prefill) on the phrase "photo of a concrete wall. Spray painted on the wall is a number, which is the value of two plus two." It would need a decode phase to actually generate an answer.

    All you can do is push the embedding in a particular direction.

    This is why people need a prompt enhancer: look at how much has been stuffed into that prompt. With normal prompts, the amount of information pushed to the DiT is simply too small.

    This is best used with an LLM* to generate a better prompt that you then copy&paste into your favorite T2I frontend. It doesn't just improve the prompt to a useful length and precision (let's face it, most of us are too lazy to write a detailed prompt with good grammar), it also exposes ambiguities in your prompt that will lead a diffusion model astray.

    What also works well for me is to feed the output of T2I to a vision-capable LLM and ask it to write a better prompt. Useful for complex scenes or broken concepts.

    *: Models of the Qwen3 family work better since they are more similar to Z-Image's text encoder, but ChatGPT should also work in a pinch
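    A sketch of that image-feedback loop under similar assumptions (a local OpenAI-compatible server with a vision-capable model loaded and supporting the OpenAI vision message format; endpoint, file name, and model name are placeholders):

    ```python
    # Sketch: send a generated image back to a vision-capable LLM and ask it to
    # write an improved prompt. Names below are placeholders.
    import base64
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    with open("output_00001.png", "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()

    resp = client.chat.completions.create(
        model="qwen3-vl-4b-instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "This image was generated from my prompt but misses the mark. "
                         "Describe what is actually in the image, then write a better, "
                         "more precise text-to-image prompt for what I likely intended."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)
    ```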

    After a few more generations, I am proud to report on the success of using AI to solve 2+2, finally narrowing it down from "no one has any idea what the result is!" to "it's one of these, and we know how likely each possibility is":

    Result | Probability
    -------|------------
    +2     | 5.36%
    1+2    | 10.71%
           | 1.79%
    2+2    | 62.50%
    12     | 12.50%
    22     | 7.14%

    #Science

    Did you generate at 1024x1024? I read about a test in another thread where spelling mistakes creep in when asking for text at resolutions other than 1 megapixel. Not asking you to redo the test if you didn't, but it got me wondering whether math would behave similarly.

    Those were 1024x1024.

    I've noticed that quality looks a bit better at 1.5 or 2.0 megapixels, but I've not done any specific testing with text.

  • On the flip side, if you install the GGUF custom nodes, you can use Qwen3 or any other LLM in GGUF format as the text encoder.

    Then you could also use those for prompt enhancement.

    I made a simple addon for this that works with a local installation of llama.cpp. It uses an English version of the prompt-enhancement prompt that Z-Image provided.

  • [deleted]

    Or to translate into Chinese, as Alibaba's models tend to have better prompt adherence in Chinese.

  • Yep, just to pile on to what everyone else is saying: you'll need to use the full model, since for image gen we typically only use the text-encoder part of the LLM. In general, using the same LLM family the image model was trained with often gives the best results; there's a lot of nuance in the language these things use, and that can strongly affect both training and inference.