Can we concatenate the image embedding and text embedding and input them into the network? #18

Hi, dear author. I am retraining your network to adapt it to other object-level tasks, and I found that the predicted metallic values are too small, close to 0. Your training pipeline feeds image embeddings to the model, whereas diffusion models usually take text embeddings as input (for example, I could give it a material prompt word). Can you tell me why you chose image embeddings? If text embeddings are used as input instead, will the predictions of the trained model become better or worse? Or can we concatenate the image embedding and the text embedding and input both into the network? Thank you very much for any answer.
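
To make the last question concrete, here is a minimal sketch of one way such a concatenation could look. It is not the repository's code. It assumes both embeddings are token sequences of shape (batch, tokens, dim) with the same width, and the function name `concat_conditioning` and the shapes used (1 image token, 77 text tokens, width 1024) are hypothetical.

```python
import numpy as np

def concat_conditioning(image_emb, text_emb):
    """Join image and text embeddings along the token axis.

    image_emb: (batch, n_img_tokens, dim)
    text_emb:  (batch, n_txt_tokens, dim)
    returns:   (batch, n_img_tokens + n_txt_tokens, dim)
    """
    if image_emb.ndim != 3 or text_emb.ndim != 3:
        raise ValueError("expected 3-D arrays shaped (batch, tokens, dim)")
    if image_emb.shape[0] != text_emb.shape[0]:
        raise ValueError("batch sizes differ")
    if image_emb.shape[2] != text_emb.shape[2]:
        # Assumption: a learned projection would be needed to match widths.
        raise ValueError("embedding widths differ; project one side first")
    # Image tokens first, text tokens after, stacked on the token axis.
    return np.concatenate([image_emb, text_emb], axis=1)

# Hypothetical shapes, chosen only for illustration.
rng = np.random.default_rng(0)
image_emb = rng.standard_normal((2, 1, 1024))
text_emb = rng.standard_normal((2, 77, 1024))
cond = concat_conditioning(image_emb, text_emb)
print(cond.shape)  # (2, 78, 1024)
```

Cross-attention layers generally accept a variable number of context tokens, so a longer sequence like this may fit without architectural changes. A model trained only on image embeddings would presumably still need fine-tuning to make use of the added text tokens.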
