Hi CAMMA team,
I'm Hari Vignesh Balaji, an AI engineer from India.
I'm currently reproducing the Rendezvous evaluation on CholecT45-crossval, and the per-class verb AP breakdown raises a question I haven't found addressed in the paper or the annotation protocol.
When I inspect frames in which the model confuses 'dissect' with 'cut', they appear nearly identical visually: the same hook instrument, the same tissue contact, the same spatial configuration. The difference seems to lie in surgical intent and the motion vector rather than in any single-frame visual feature.
My question is about the annotation design: was there an explicit criterion in the annotation guide that allowed annotators to distinguish 'dissect' from 'cut' in ambiguous frames? Or was the boundary set at the clinical-semantic level (i.e., annotators were surgeons who applied clinical context that the model could not recover from a frozen frame)?
I'm trying to understand whether the verb mAP ceiling on these classes reflects a dataset labelling ambiguity, a fundamental single-frame limitation, or something addressable with temporal context, which I noticed the Rendezvous-in-Time paper partially addresses.
Thank you for any insight and for making the code and ivtmetrics publicly available. They made reproduction straightforward.
Best regards,
Hari Vignesh BALAJI
India
https://www.linkedin.com/in/harivignz/
https://github.com/Harivignz