Running LLaVA 7B locally is bottlenecked by hardware — the model gets heavily quantized to fit in limited RAM, which hurts both speed (~5–20 s per image pair) and accuracy.
PR #46 already merged a working Gemini 2.0 Flash client at app/services/gemini.py. Switching the VLM to use this would mean:
- No local Ollama dependency
- Faster inference (sub-second)
- Higher accuracy from a larger model
- Works on any machine with an API key
The existing build_prompt(), parse_damage_score(), and score_to_label() in generate_vlm.py are model-agnostic and can stay as-is. The main change is replacing the ollama.chat() call (line 119) with client.models.generate_content() using the Gemini client pattern already in app/services/gemini.py.
GEMINI_API_KEY goes in .env.