feat(datasource): wire Docling into GDrive connector for binary document parsing #150

Merged: sks merged 6 commits into main from feat/docling-gdrive-content-transformer on Mar 31, 2026

@npkanaka
Contributor

Summary

  • Wire Docling Serve into the GDrive datasource connector so binary files (PDF, DOCX, PPTX, XLSX, images) are downloaded and parsed into searchable markdown
  • Add DownloadFile method to GDrive Service interface for raw byte access without MIME-type rejection
  • Add DocParser field to ConnectorOptions for injecting the document parsing provider
  • Request markdown output from Docling via to_formats=md form field
  • Graceful fallback to header-only stubs when Docling is unavailable, unconfigured, or parsing fails

…ent parsing

Binary files (PDF, DOCX, PPTX, XLSX, images) in Google Drive folders were
indexed with header-only stubs — title and URL but no content. This wires
Docling Serve into the GDrive datasource connector so non-text files are
downloaded and parsed into searchable markdown via the existing docparser
package. Gracefully falls back to header stubs when Docling is unavailable
or parsing fails.
@npkanaka npkanaka requested a review from a team as a code owner March 28, 2026 02:58
Copilot AI review requested due to automatic review settings March 28, 2026 02:58

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request integrates Docling-based document parsing into the Google Drive connector, enabling text extraction from binary files such as PDFs and DOCX. Key changes include the addition of a DownloadFile method to the GDrive service, updates to the connector factory for parser injection, and logic to concatenate parsed pages into markdown. A potential resource leak was identified in the GDrive wrapper where the response body should be closed if an error occurs during a file download.

Contributor

Copilot AI left a comment

Pull request overview

This PR wires the docparser (Docling Serve) provider into the Google Drive datasource connector so binary Drive files can be downloaded and parsed into searchable markdown, while keeping a header-only fallback when parsing is unavailable or fails.

Changes:

  • Added DownloadFile(ctx, fileID) to the GDrive Service (and wrapper + fake) to allow raw byte downloads for binary documents.
  • Extended the GDrive datasource connector with an optional DocParser (via functional options + factory wiring) to parse non-text MIME types and concatenate parsed pages into a single item’s content.
  • Updated the Docling provider to request markdown output (to_formats=md) and added/expanded tests for both the Docling request and GDrive parsing/fallback behavior.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| pkg/tools/google/gdrive/wrapper.go | Implements DownloadFile using Drive API media download for raw bytes. |
| pkg/tools/google/gdrive/gdrivefakes/fake_service.go | Updates generated fake to include DownloadFile for connector tests. |
| pkg/tools/google/gdrive/gdrive.go | Extends Service interface with DownloadFile(io.ReadCloser). |
| pkg/tools/google/gdrive/datasource.go | Adds optional doc parser wiring + binary parsing flow + page concatenation. |
| pkg/tools/google/gdrive/connector_factory.go | Wires ConnectorOptions.DocParser into the GDrive connector via type assertion. |
| pkg/tools/google/gdrive/datasource_test.go | Adds coverage for Docling parsing, concatenation, and fallback scenarios. |
| pkg/datasource/docparser/docling.go | Adds to_formats=md multipart field to request markdown output. |
| pkg/datasource/docparser/parser_test.go | Adds test verifying multipart includes to_formats=md. |
| pkg/datasource/builder.go | Adds DocParser any to ConnectorOptions for dependency injection into factories. |
| pkg/app/app.go | Instantiates the configured doc parser and passes it via ConnectorOptions. |

@codecov

codecov bot commented Mar 28, 2026

…LEVEL env var

Adds structured logging throughout the Docling document parsing pipeline
(request/response, parse results, routing decisions) and allows setting
log level via GENIE_LOG_LEVEL environment variable when the CLI flag is
at its default.
Copilot AI review requested due to automatic review settings March 30, 2026 17:30
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.


…tasource

docparser.Parse() now returns []ParsedPage instead of []NormalizedItem,
breaking the circular import that forced builder.go to use `any` for the
DocParser field. The field is now typed as docparser.Provider with
compile-time safety. Callers map ParsedPages into their own storage model.

Addresses PR review: #150 (comment)
@npkanaka npkanaka requested a review from sks March 30, 2026 19:24
… doc parser init

- Revert env var log level override in cmd/root.go (out of scope for this PR)
- Use if/else for doc parser creation since New() never returns (nil, nil)
Copilot AI review requested due to automatic review settings March 31, 2026 00:08
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.


@sks sks merged commit c43fa2f into main Mar 31, 2026
11 checks passed
@sks sks deleted the feat/docling-gdrive-content-transformer branch March 31, 2026 00:46
@github-actions github-actions bot locked and limited conversation to collaborators Mar 31, 2026