Skip to content

feat: #592 Abstract VLM providers into an extensible Strategy Pattern#668

Open
suhaniiz wants to merge 2 commits into
param20h:devfrom
suhaniiz:feat-vlm-strategy-pattern-592
Open

feat: #592 Abstract VLM providers into an extensible Strategy Pattern#668
suhaniiz wants to merge 2 commits into
param20h:devfrom
suhaniiz:feat-vlm-strategy-pattern-592

Conversation

@suhaniiz

Copy link
Copy Markdown
Contributor

📋 PR Checklist

Thank you for contributing to PDF-Assistant-RAG! 🎉
Please fill out this template before submitting. PRs without it filled in will be closed.


🔗 Related Issue

Closes #592


📝 What does this PR do?

Decouples the core image captioning pipeline from individual vendor integrations by implementing a Strategy Pattern.

  • Introduced a BaseVisionProvider abstract interface.
  • Refactored the previous inline OpenAI placeholder logic into a standalone, production-ready OpenAIVisionProvider strategy (including base64 encoding preparation for standard multimodal payload handling).
  • Introduced a central VISION_PROVIDER_REGISTRY dict factory mapping to allow seamless plug-and-play scaling for future models (e.g., Gemini, Claude, Ollama) without polluting core operational files, adhering strictly to the Open-Closed Principle.

🗂️ Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 🔧 Refactor / code cleanup
  • 📝 Documentation update
  • 🎨 UI / styling change
  • ⚙️ CI / tooling / config change
  • 🧪 Tests

🧪 How was this tested?

  • Verified code changes sit strictly within the registry mapping and fallback handlers.
  • Tested fallback logic to ensure unconfigured or failing provider runs gracefully chain downstream to the local OCR pipeline (_ocr_caption) without breaking chunk execution.

📸 Screenshots (if UI change)

N/A


⚠️ Anything to flag for reviewers?

The integration remains completely backwards-compatible; if no VLM engine string matches the registry, it safely skips ahead to local OCR execution as before.


✅ Self-Review Checklist

  • My branch is based on dev, not main
  • I have not added any secrets / API keys
  • I have not modified main branch or any HuggingFace deployment config
  • My code follows the existing style (no unnecessary formatting changes)
  • I have updated relevant docs / comments if needed

@suhaniiz suhaniiz requested a review from param20h as a code owner June 22, 2026 13:02
@suhaniiz

Copy link
Copy Markdown
Contributor Author

Hi @param20h 👋

As the original author of issue #592, I wanted to submit my official implementation for this feature as part of GSSoC '26.

While I notice another PR was opened alongside this, my implementation explicitly fulfills the architectural requirements outlined in the issue description. I have fully abstracted the VLM logic out of the core execution loop into a strictly decoupled BaseVisionProvider interface and a VISION_PROVIDER_REGISTRY factory. Additionally, the concrete OpenAIVisionProvider strategy has been updated to support modern, production-ready base64 image data URI encoding for real multimodal execution, rather than relying on standard text prompts or outdated endpoints.

I would highly appreciate it if you could review this solution, as it ensures long-term extensibility for future providers (like Gemini or Claude) without breaking the core pipeline. Thank you! 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEAT] Abstract VLM providers into an extensible Strategy Pattern

1 participant