Hi, great work on AutoGaze! The results on NVILA are impressive.
I'm wondering if there are any plans to extend AutoGaze support to the Qwen3-VL or Qwen3.5 model families? Given their strong performance on video understanding benchmarks and growing adoption in the community, it would be really valuable to see AutoGaze integrated with them.
Specifically:
- Qwen3-VL: Uses a different ViT architecture (ViT-600M with native dynamic resolution) and a distinct visual token compression scheme (spatial merge). Curious whether AutoGaze's patch selection can be adapted to work with their vision encoder.
- Qwen3.5: Builds on Qwen3-VL with further improvements. Same question applies.
The main architectural differences I see compared to NVILA:
- Qwen uses a SigLIP-based ViT with 2D-RoPE instead of InternViT
- Spatial merge (2×2 → 1 token) as the default compression, vs. NVILA's tile-based approach
- Different positional encoding scheme for video frames
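
To make the spatial-merge point concrete, here is a rough sketch of the constraint I have in mind. It is my own illustration, not code from AutoGaze or Qwen: the function names `group_keep_mask` and `expand_to_patches` are hypothetical, and I am assuming the patch selection can be expressed as a boolean mask over the patch grid. If 2×2 patches are merged into one token, the selection probably has to be made at the level of whole 2×2 groups, for example by keeping a group whenever any of its patches is selected:

```python
import numpy as np

def group_keep_mask(keep: np.ndarray, merge: int = 2) -> np.ndarray:
    """Collapse a per-patch keep mask (H, W) to a per-merge-group mask.

    A group is kept if ANY of its merge x merge patches is selected.
    This "any" rule is an assumption; "all" or a vote would also be possible.
    """
    h, w = keep.shape
    if h % merge or w % merge:
        raise ValueError("patch grid must be divisible by the merge size")
    # View the grid as blocks of shape (merge, merge).
    blocks = keep.reshape(h // merge, merge, w // merge, merge)
    return blocks.any(axis=(1, 3))

def expand_to_patches(group_keep: np.ndarray, merge: int = 2) -> np.ndarray:
    """Expand a per-group mask back to a per-patch mask of full groups."""
    return np.repeat(np.repeat(group_keep, merge, axis=0), merge, axis=1)
```

With this rule, selecting a single patch pulls in the other three patches of its group, so the effective token savings would be lower than the raw patch selection rate suggests. Whether that trade-off is acceptable is part of what I'm asking.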
Would love to hear if this is on the roadmap or if there are any known blockers. Happy to help with integration/testing if there's interest!
Thanks.