Feature Request: Speech-based Controls

Hi @kaushav07!

I see that speech-based control is one of the core features of this project, but no one has taken it up yet. I would like to propose the following workflow:

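1. Use a speech-to-text model to convert the user's spoken request to text.
2. Use a vision model like CLIP or DINOv2 to help complete the user's request based on visual input. (Neither is a captioning model as such: CLIP scores how well an image matches a piece of text, and DINOv2 produces image features, so the request would be matched against what the camera sees rather than against a generated caption.)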
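
A rough sketch of how the two steps could fit together, just to make the proposal concrete. Everything here is an assumption on my part: `ground_request`, `GroundedRequest`, and the candidate list are hypothetical names, not anything from this repo, and the two `fake_*` functions are stand-ins so the sketch runs without a model installed. A real version would swap in an actual speech-to-text model and a CLIP-style image-text scorer.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class GroundedRequest:
    text: str      # transcript of the spoken request
    target: str    # candidate that best matches the request
    score: float   # score assigned to that candidate


def ground_request(
    audio: bytes,
    frame: object,
    candidates: Sequence[str],
    transcribe: Callable[[bytes], str],
    score: Callable[[object, str, Sequence[str]], Sequence[float]],
) -> GroundedRequest:
    """Hypothetical pipeline: transcribe the audio, then pick a candidate."""
    # Step 1: speech-to-text.
    text = transcribe(audio).strip()
    # Step 2: score every candidate against the request and the current frame.
    scores = score(frame, text, candidates)
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return GroundedRequest(text, candidates[best], scores[best])


# Stand-ins (assumptions) so the sketch runs without any model.
def fake_transcribe(audio: bytes) -> str:
    # Pretends the audio bytes already contain the transcript.
    return audio.decode("utf-8")


def fake_score(frame: object, text: str, candidates: Sequence[str]) -> list:
    # Word overlap between request and candidate; ignores the frame entirely.
    words = set(text.lower().split())
    return [float(len(words & set(c.lower().split()))) for c in candidates]


result = ground_request(
    b"pick up the red cup",
    frame=None,
    candidates=["red cup", "blue book", "green bottle"],
    transcribe=fake_transcribe,
    score=fake_score,
)
print(result.target)  # prints "red cup"
```

Passing `transcribe` and `score` in as plain callables keeps the pipeline independent of whichever models we end up choosing, so either step can be replaced without touching the other.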

Would be happy to take this up!