Hi @kaushav07!
I see that speech-based control is one of the core features of this project, but no one has taken it up yet. I would like to propose the following workflow:
- Run a speech-to-text model to convert the user's spoken request to text.
- Use a vision-language model such as CLIP (or a vision encoder such as DINOv2) to ground the request in the visual input and help complete it.
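
A rough sketch of how the two steps could fit together is below. It is only an illustration under my own assumptions: `ground_request`, `Grounding`, `transcribe`, and `score_labels` are hypothetical names, not part of this project. The speech-to-text and image-text scoring models are passed in as plain callables, so any concrete backend (for example a CLIP-style scorer) could be plugged in later.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class Grounding:
    """Result of matching a spoken request against the visual input."""
    request: str   # text produced by the speech-to-text step
    label: str     # candidate label that best matches the image
    score: float   # score the image-text model gave that label


def ground_request(
    audio: Any,
    image: Any,
    candidate_labels: Sequence[str],
    transcribe: Callable[[Any], str],
    score_labels: Callable[[Any, Sequence[str]], Sequence[float]],
) -> Grounding:
    """Hypothetical glue code: transcribe the audio, then pick the label
    that the image-text model scores highest for the given image."""
    if not candidate_labels:
        raise ValueError("candidate_labels must not be empty")

    # Step 1: speech-to-text (backend supplied by the caller).
    request = transcribe(audio)

    # Step 2: score each candidate against the image, conditioned on the request.
    prompts = [f"{request}: {label}" for label in candidate_labels]
    scores = list(score_labels(image, prompts))
    if len(scores) != len(prompts):
        raise ValueError("score_labels must return one score per prompt")

    best = max(range(len(scores)), key=lambda i: scores[i])
    return Grounding(request=request, label=candidate_labels[best], score=scores[best])
```

Keeping both models behind callables means the pipeline can be tested with stubs before committing to a specific speech or vision model.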
Would be happy to take this up!