feat: #592 Abstract VLM providers into an extensible Strategy Pattern #668
suhaniiz wants to merge 2 commits into
Hi @param20h 👋 As the original author of issue #592, I wanted to submit my official implementation for this feature as part of GSSoC '26. While I notice another PR was opened alongside this, my implementation explicitly fulfills the architectural requirements outlined in the issue description. I have fully abstracted the VLM logic out of the core execution loop into a strictly decoupled I would highly appreciate it if you could review this solution, as it ensures long-term extensibility for future providers (like Gemini or Claude) without breaking the core pipeline. Thank you! 🚀
📋 PR Checklist
🔗 Related Issue
Closes #592
📝 What does this PR do?
Decouples the core image captioning pipeline from individual vendor integrations by implementing a Strategy Pattern.
- `BaseVisionProvider` abstract interface.
- `OpenAIVisionProvider` strategy (including base64 encoding preparation for standard multimodal payload handling).
- `VISION_PROVIDER_REGISTRY` dict factory mapping to allow seamless plug-and-play scaling for future models (e.g., Gemini, Claude, Ollama) without polluting core operational files, adhering strictly to the Open-Closed Principle.
🗂️ Type of Change
🧪 How was this tested?
_ocr_caption) without breaking chunk execution.
📸 Screenshots (if UI change)
N/A
The integration remains completely backwards-compatible; if no VLM engine string matches the registry, it safely skips ahead to local OCR execution as before.
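For reviewers, here is a minimal sketch of how the registry lookup and the OCR fallback could fit together. It is not the code in this PR. `BaseVisionProvider` and `VISION_PROVIDER_REGISTRY` are the names used above, but the `caption` method signature, the stand-in `EchoVisionProvider`, and the `caption_image` / `ocr_fallback` helpers are hypothetical and exist only for illustration. The stand-in makes no network calls.

```python
import base64
from abc import ABC, abstractmethod
from typing import Callable, Dict, Optional, Type


class BaseVisionProvider(ABC):
    """Strategy interface: each provider turns image bytes into a caption.

    The method name and signature are assumptions for this sketch.
    """

    @abstractmethod
    def caption(self, image_bytes: bytes) -> str:
        ...


class EchoVisionProvider(BaseVisionProvider):
    """Hypothetical stand-in provider used only to exercise the registry."""

    def caption(self, image_bytes: bytes) -> str:
        # Base64-encode the image, as a multimodal payload would carry it.
        encoded = base64.b64encode(image_bytes).decode("ascii")
        return f"caption for {len(encoded)} base64 chars"


# Maps an engine string to a provider class; adding a provider means adding
# an entry here, without touching the caller below.
VISION_PROVIDER_REGISTRY: Dict[str, Type[BaseVisionProvider]] = {
    "echo": EchoVisionProvider,
}


def caption_image(
    image_bytes: bytes,
    engine: Optional[str],
    ocr_fallback: Callable[[bytes], str],
) -> str:
    """Use the registered provider if one matches, else fall back to OCR."""
    provider_cls = VISION_PROVIDER_REGISTRY.get(engine or "")
    if provider_cls is None:
        # No matching engine string: keep the previous local OCR behaviour.
        return ocr_fallback(image_bytes)
    return provider_cls().caption(image_bytes)
```

With this shape, an unknown or missing engine string never raises; it simply routes to whatever OCR callable the pipeline already uses.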
✅ Self-Review Checklist
dev, not main
main branch or any HuggingFace deployment config