Plans for supporting Qwen3-VL or Qwen3.5?

Hi, great work on AutoGaze! The results on NVILA are impressive.

I'm wondering if there are any plans to extend AutoGaze support to the Qwen3-VL or Qwen3.5 model families? Given their strong performance on video understanding benchmarks and growing adoption in the community, it would be really valuable to see AutoGaze integrated with them.

Specifically:
- **Qwen3-VL**: Uses a different ViT architecture (ViT-600M with native dynamic resolution) and a distinct visual token compression scheme (spatial merge). I'm curious whether AutoGaze's patch selection can be adapted to work with its vision encoder.
- **Qwen3.5**: Builds on Qwen3-VL with further improvements. Same question applies.

The main architectural differences I see compared to NVILA:
1. Qwen uses a `SigLIP`-based ViT with 2D-RoPE instead of InternViT
2. Spatial merge (2×2 → 1 token) as the default compression, vs. NVILA's tile-based approach
3. Different positional encoding scheme for video frames
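
To make point 2 concrete, here is a rough sketch of the question I have in mind. It is my own toy illustration, not AutoGaze or Qwen code: `merged_keep_mask`, the `policy` argument, and the 4×4 mask values are all hypothetical. It assumes the selector outputs a per-patch boolean keep mask and that the merge groups are non-overlapping 2×2 blocks, then shows how that mask could be collapsed onto the merged-token grid.

```python
def merged_keep_mask(keep, merge=2, policy="any"):
    """Collapse a per-patch keep mask onto the merge x merge token grid.

    keep:   list of rows of bools, one entry per ViT patch (H x W).
    merge:  side length of each non-overlapping merge group (assumed 2 here).
    policy: "any" keeps a merged token if at least one patch in its group
            is selected; "all" keeps it only if every patch is selected.
    """
    if policy not in ("any", "all"):
        raise ValueError("policy must be 'any' or 'all'")
    h, w = len(keep), len(keep[0])
    if h % merge or w % merge:
        raise ValueError("patch grid must be divisible by the merge size")
    reduce = any if policy == "any" else all
    return [
        [
            reduce(
                keep[r + dr][c + dc]
                for dr in range(merge)
                for dc in range(merge)
            )
            for c in range(0, w, merge)
        ]
        for r in range(0, h, merge)
    ]


# Toy 4x4 patch grid with 5 selected patches (made-up values).
keep = [
    [True,  True,  False, False],
    [True,  True,  False, True],
    [False, False, False, False],
    [False, False, False, False],
]
any_mask = merged_keep_mask(keep, policy="any")
all_mask = merged_keep_mask(keep, policy="all")
```

With `"any"`, a single selected patch pulls its whole 2×2 group back in, so some of the savings from patch selection are lost; with `"all"`, partially selected groups are dropped entirely. Which of these (or something else, such as selecting at merge-group granularity in the first place) fits AutoGaze best is exactly what I'm unsure about.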

Would love to hear if this is on the roadmap or if there are any known blockers. Happy to help with integration/testing if there's interest!

Thanks.
