This repository was archived by the owner on Feb 3, 2026. It is now read-only.

Image token process malfunction #78

@Stevetich

It seems that in model_utils.py, pllava_answer builds the prompt with only one image token, no matter how many entries img_list contains:

def pllava_answer(conv: Conversation, model, processor, img_list, do_sample=True, max_new_tokens=200, num_beams=1, min_length=1, top_p=0.9,
                  repetition_penalty=1.0, length_penalty=1, temperature=1.0, stop_criteria_keywords=None, print_res=False):
    # torch.cuda.empty_cache()
    prompt = conv.get_prompt()
    inputs = processor(text=prompt, images=img_list, return_tensors="pt")
    if inputs['pixel_values'] is None:
        inputs.pop('pixel_values')
    inputs = inputs.to(model.device)

However, in eval_utils, answer pads the prompt with extra image tokens until there is one per entry in img_list before the model performs video inference. Is the missing padding in model_utils.py a bug?

def answer(self, conv: Conversation, img_list, max_new_tokens=200, num_beams=1, min_length=1, top_p=0.9,
           repetition_penalty=1.0, length_penalty=1, temperature=1.0):
    torch.cuda.empty_cache()
    prompt = conv.get_prompt()
    if prompt.count(conv.mm_token) < len(img_list):
        diff_mm_num = len(img_list) - prompt.count(conv.mm_token)
        for i in range(diff_mm_num):
            conv.user_query("", is_mm=True)
        prompt = conv.get_prompt()
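
To make the difference concrete, here is a minimal sketch of the padding step from eval_utils as a standalone helper that pllava_answer could presumably call before invoking the processor. StubConversation, the "<image>" token string, and the helper name pad_mm_tokens are all assumptions for illustration; only mm_token, user_query(..., is_mm=True) and get_prompt() come from the snippets above, and the real Conversation class may format turns differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StubConversation:
    """Hypothetical stand-in for the repo's Conversation class (assumption)."""
    mm_token: str = "<image>"  # assumed placeholder string, not taken from the repo
    turns: List[str] = field(default_factory=list)

    def user_query(self, text: str = "", is_mm: bool = False) -> None:
        # Prefix the multimodal token when the turn carries visual input.
        self.turns.append((self.mm_token if is_mm else "") + text)

    def get_prompt(self) -> str:
        return "\n".join(self.turns)


def pad_mm_tokens(conv, img_list) -> str:
    """Add empty multimodal turns until the prompt has one mm_token per entry."""
    prompt = conv.get_prompt()
    missing = len(img_list) - prompt.count(conv.mm_token)
    for _ in range(max(missing, 0)):
        conv.user_query("", is_mm=True)
    return conv.get_prompt()


conv = StubConversation()
conv.user_query("Describe the video.", is_mm=True)
frames = ["frame0", "frame1", "frame2"]  # placeholder inputs, not real tensors
prompt = pad_mm_tokens(conv, frames)
print(prompt.count(conv.mm_token))  # 3, one token per entry in frames
```

Calling the helper a second time leaves the prompt unchanged, since the token count already matches len(img_list).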
