feat: #592 Abstract VLM providers into an extensible Strategy Pattern by suhaniiz · Pull Request #668 · param20h/PDF-Assistant-RAG

suhaniiz · 2026-06-22T13:02:42Z

📋 PR Checklist

Thank you for contributing to PDF-Assistant-RAG! 🎉
Please fill out this template before submitting. PRs without it filled in will be closed.

🔗 Related Issue

Closes #592

📝 What does this PR do?

Decouples the core image captioning pipeline from individual vendor integrations by implementing a Strategy Pattern.

Introduced a BaseVisionProvider abstract interface.
Refactored the previous inline OpenAI placeholder logic into a standalone, production-ready OpenAIVisionProvider strategy (including base64 encoding preparation for standard multimodal payload handling).
Introduced a central VISION_PROVIDER_REGISTRY dict factory mapping to allow seamless plug-and-play scaling for future models (e.g., Gemini, Claude, Ollama) without polluting core operational files, adhering strictly to the Open-Closed Principle.

🗂️ Type of Change

🧪 How was this tested?

Verified code changes sit strictly within the registry mapping and fallback handlers.
Tested fallback logic to ensure unconfigured or failing provider runs gracefully chain downstream to the local OCR pipeline (_ocr_caption) without breaking chunk execution.

📸 Screenshots (if UI change)

N/A

⚠️ Anything to flag for reviewers?

The integration remains completely backwards-compatible; if no VLM engine string matches the registry, it safely skips ahead to local OCR execution as before.

✅ Self-Review Checklist

My branch is based on dev, not main
I have not added any secrets / API keys
I have not modified main branch or any HuggingFace deployment config
My code follows the existing style (no unnecessary formatting changes)
I have updated relevant docs / comments if needed

… Pattern

suhaniiz · 2026-06-22T13:12:31Z

Hi @param20h 👋

As the original author of issue #592, I wanted to submit my official implementation for this feature as part of GSSoC '26.

While I notice another PR was opened alongside this, my implementation explicitly fulfills the architectural requirements outlined in the issue description. I have fully abstracted the VLM logic out of the core execution loop into a strictly decoupled BaseVisionProvider interface and a VISION_PROVIDER_REGISTRY factory. Additionally, the concrete OpenAIVisionProvider strategy has been updated to support modern, production-ready base64 image data URI encoding for real multimodal execution, rather than relying on standard text prompts or outdated endpoints.

I would highly appreciate it if you could review this solution, as it ensures long-term extensibility for future providers (like Gemini or Claude) without breaking the core pipeline. Thank you! 🚀

feat: param20h#592 Abstract VLM providers into an extensible Strategy…

740d695

… Pattern

suhaniiz requested a review from param20h as a code owner June 22, 2026 13:02

Merge branch 'dev' into feat-vlm-strategy-pattern-592

c35e2fd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: #592 Abstract VLM providers into an extensible Strategy Pattern#668

feat: #592 Abstract VLM providers into an extensible Strategy Pattern#668
suhaniiz wants to merge 2 commits into
param20h:devfrom
suhaniiz:feat-vlm-strategy-pattern-592

suhaniiz commented Jun 22, 2026

Uh oh!

suhaniiz commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

suhaniiz commented Jun 22, 2026

📋 PR Checklist

🔗 Related Issue

📝 What does this PR do?

🗂️ Type of Change

🧪 How was this tested?

📸 Screenshots (if UI change)

⚠️ Anything to flag for reviewers?

✅ Self-Review Checklist

Uh oh!

suhaniiz commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant