Hi, dear author. I am retraining your network to adapt it to other object-level tasks, and I found that the predicted metallic values are too small, close to 0. Your training pipeline feeds image embeddings to the model, whereas diffusion models usually take text embeddings as conditioning (in my case, I could supply a material prompt word). Can you tell me why you chose image embeddings? If text embeddings were used instead, would the trained model's predictions get better or worse? Alternatively, could the image embedding and the text embedding be concatenated and fed into the network together? Thank you very much for any answer.