Excellent work!
I noticed in the paper that the LLM part is fine-tuned from Qwen2.5-VL. Is it possible to separate the weights of the image part from those of the LLM part, so that mature inference frameworks (such as TensorRT and Ollama) can be used later?
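
To make the question concrete, here is a minimal sketch of the kind of split I have in mind. It assumes the image-part parameters share a common key prefix in the checkpoint; the `visual.` prefix and the toy parameter names are my assumptions, not something confirmed from the released weights, and `split_state_dict` is a hypothetical helper rather than an existing API.

```python
from typing import Any, Dict, Tuple


def split_state_dict(
    state_dict: Dict[str, Any],
    vision_prefix: str = "visual.",  # assumed prefix, not confirmed for this model
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Partition a flat name -> tensor mapping into image-part and LLM-part dicts."""
    vision: Dict[str, Any] = {}
    language: Dict[str, Any] = {}
    for name, tensor in state_dict.items():
        # Route each parameter by its key prefix; values are left untouched.
        if name.startswith(vision_prefix):
            vision[name] = tensor
        else:
            language[name] = tensor
    return vision, language


# Toy checkpoint with placeholder values standing in for real tensors.
toy_checkpoint = {
    "visual.patch_embed.proj.weight": 0,
    "visual.blocks.0.attn.qkv.weight": 1,
    "model.layers.0.self_attn.q_proj.weight": 2,
    "lm_head.weight": 3,
}
vision_part, llm_part = split_state_dict(toy_checkpoint)
```

If a split like this is valid for the released checkpoint, each half could presumably be saved and exported separately for the target framework.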