Using an Image Encoder in Place of a Text Encoder for Direct Concept Representation

Hello,

I was wondering whether it would be feasible to replace the text encoder with an image encoder in the swapping process. That way, we could supply images directly to represent concepts instead of describing them in text.

Would this be possible, and if so, what implications or challenges should we expect from such a change?

Thank you!