This repository was archived by the owner on Feb 3, 2026. It is now read-only.

Is the inference done using only one <image> token? #75

Open

I ran demo.sh, and the prompt seems to take in all the images but contains only one <image> token. The results become more ambiguous as the video gets longer. Is inference based only on the first frame the model sees after segmenting?
