My current thinking is that in order to support testing from audio input, we need to look at the end-state of a conversation, not the intermediate state of NLU results.
The main challenge here is described in #241.
This raises the question of what we can really test from audio. I believe it is still useful to test intents, but entities are challenging to test: the only real way to validate that the correct entities were detected is to look at an end-state (such as a fully resolved / disambiguated entity from text).
We may instead want to recommend testing from labeled, ASR-generated text rather than from the audio itself.
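
To make the intermediate-state vs. end-state distinction concrete, here is a rough sketch. Everything in it is hypothetical and not taken from this project: the `NluResult` and `EndState` types, the `resolve` helper, and the city lookup table are made up for illustration. It also assumes the difficulty is that the entity text recognized from audio can differ in surface form from the labeled text, which may not be the whole story in #241.

```python
from dataclasses import dataclass, field


@dataclass
class NluResult:
    """Hypothetical intermediate NLU output: intent plus raw entity text."""
    intent: str
    entities: dict = field(default_factory=dict)


@dataclass
class EndState:
    """Hypothetical end-state: intent plus fully resolved entities."""
    intent: str
    resolved_entities: dict = field(default_factory=dict)


# Hypothetical lookup that maps surface forms to one canonical value.
CANONICAL_CITIES = {
    "new york": "NYC",
    "new york city": "NYC",
    "nyc": "NYC",
}


def resolve(result: NluResult) -> EndState:
    """Resolve raw entity text to canonical values (illustrative only)."""
    resolved = {
        name: CANONICAL_CITIES.get(text.strip().lower(), text)
        for name, text in result.entities.items()
    }
    return EndState(result.intent, resolved)


# Labeled expectation vs. what the NLU might return for an audio utterance.
expected = NluResult("book_flight", {"destination": "New York"})
from_audio = NluResult("book_flight", {"destination": "new york city"})

# Comparing intermediate NLU results fails on the entity surface form,
# even though the intent matches.
intermediate_match = expected == from_audio

# Comparing end-states succeeds because both resolve to the same value.
end_state_match = resolve(expected) == resolve(from_audio)
```

In this sketch the intent comparison works at either level, while the entity comparison only becomes meaningful once both sides have been resolved.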