We are concerned about the fairness of the comparisons in the experimental results. Both LLaVA and ASVR yield results lower than those reported in the original LLaVA paper. We ran the authors' code and also tested a version without the image loss, keeping all other settings identical (i.e., the LLaVA version). We used the datasets specified in the paper for both pretraining and finetuning. In our runs, the version without the image loss outperforms the version with the image loss (ASVR) on nearly every benchmark. Our ASVR results align with those reported in the paper, while our LLaVA results are significantly higher than those presented in the original paper.
As a result, we question the fairness of the comparison involving LLaVA in the paper. Below are some of our test results:
| model | gqa | vizwiz | scienceqa | textvqa | pope | mme |
|---|---|---|---|---|---|---|
| llava | 62.13229 | 56.28155 | 70.79822 | 59.044 | 87.55556 | 1441.554 |
| asvr | 60.50246 | 58.93031 | 69.06296 | 54.038 | 86.67778 | 1429.783 |