One killer feature for me would be the ability to see the text of an audiobook alongside the audio, with the current word or sentence highlighted, so I can read and listen simultaneously. Clicking a word or phrase should jump playback to that point.
To enable this, the idea is to use a local speech-to-text model (maybe Whisper, or a newer model like IBM Granite 4.0 1B Speech which benchmarks surprisingly well at a smaller size) to convert an audiobook to text with timestamps, and use that data to display and highlight text in sync with playback. Being fully local means no data ever leaves the device.
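To make the sync part concrete, here is a rough sketch of the lookup I have in mind, assuming the speech-to-text step produces a list of words with start and end times in seconds. The `TimedWord` and `Transcript` names are hypothetical and not tied to any particular model's output format or to the app's existing code; it's written in Python purely for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedWord:
    """One transcribed word with its timestamps (hypothetical shape)."""
    text: str
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio


class Transcript:
    """Maps playback positions to words and words back to positions."""

    def __init__(self, words):
        # Words are assumed to be sorted by start time.
        self.words = list(words)
        self._starts = [w.start for w in self.words]

    def index_at(self, position):
        """Index of the word to highlight at `position` seconds.

        Returns None before the first word. During a pause between
        words, the previous word stays highlighted.
        """
        i = bisect_right(self._starts, position) - 1
        return i if i >= 0 else None

    def seek_position(self, index):
        """Playback position to jump to when word `index` is clicked."""
        return self.words[index].start
```

The player would call `index_at` with the current playback position on each UI tick to decide what to highlight, and `seek_position` when a word is clicked. The binary search keeps each lookup cheap even for a transcript with hundreds of thousands of words.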
I plan on implementing this eventually and opening a PR for it, assuming it's something you're interested in supporting. It would massively expand the scope and size of the app (that model is going to be several gigabytes if it ends up bundled with the app).
Mostly creating this issue to keep track of this as a feature request and to discuss whether it's wanted.