Question about next vision token prediction

Thank you for sharing your code. The paper describes the vision loss as autoregressive (AR), but [this line](https://github.com/AlenjandroWang/ASVR/blob/a6b8cc2843a9a4e5d8debba871e9b856bed3b853/asvr/model/language_model/asvr_llama.py#L189) seems to reconstruct the current token directly rather than the next token. Where is the label shift for AR applied, or have I misunderstood this part? Thanks.
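To make clear what I mean by a label shift, here is a minimal sketch. It is not taken from the ASVR repository: `shift_for_next_token`, `outputs`, and `targets` are hypothetical names, and plain strings stand in for tensors. It only shows how positions would be paired with and without the shift.

```python
# Hypothetical illustration; not the ASVR implementation.
def shift_for_next_token(outputs, targets):
    """Align outputs[t] with targets[t + 1] along the sequence axis."""
    if len(outputs) != len(targets):
        raise ValueError("outputs and targets must have the same length")
    # Drop the last output (it has no next target) and the first target
    # (no earlier output predicts it).
    return outputs[:-1], targets[1:]


outputs = ["out_0", "out_1", "out_2", "out_3"]  # model outputs at vision positions
targets = ["tok_0", "tok_1", "tok_2", "tok_3"]  # vision tokens used as labels

# Without a shift, position t is supervised by token t (current-token reconstruction).
unshifted = list(zip(outputs, targets))

# With a shift, position t is supervised by token t + 1 (next-token prediction).
shifted = list(zip(*shift_for_next_token(outputs, targets)))

print(unshifted)
print(shifted)
```

If the shift is instead applied elsewhere, for example by offsetting the positions from which the hidden states are taken, the loss line itself could look unshifted while still being AR, which is the part I would like to confirm.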
