Conversation
…ent parsing Binary files (PDF, DOCX, PPTX, XLSX, images) in Google Drive folders were indexed with header-only stubs — title and URL but no content. This wires Docling Serve into the GDrive datasource connector so non-text files are downloaded and parsed into searchable markdown via the existing docparser package. Gracefully falls back to header stubs when Docling is unavailable or parsing fails.
There was a problem hiding this comment.
Code Review
This pull request integrates Docling-based document parsing into the Google Drive connector, enabling text extraction from binary files such as PDFs and DOCX. Key changes include the addition of a DownloadFile method to the GDrive service, updates to the connector factory for parser injection, and logic to concatenate parsed pages into markdown. A potential resource leak was identified in the GDrive wrapper where the response body should be closed if an error occurs during a file download.
There was a problem hiding this comment.
Pull request overview
This PR wires the docparser (Docling Serve) provider into the Google Drive datasource connector so binary Drive files can be downloaded and parsed into searchable markdown, while keeping a header-only fallback when parsing is unavailable or fails.
Changes:
- Added
DownloadFile(ctx, fileID)to the GDriveService(and wrapper + fake) to allow raw byte downloads for binary documents. - Extended the GDrive datasource connector with an optional
DocParser(via functional options + factory wiring) to parse non-text MIME types and concatenate parsed pages into a single item’s content. - Updated the Docling provider to request markdown output (
to_formats=md) and added/expanded tests for both the Docling request and GDrive parsing/fallback behavior.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/tools/google/gdrive/wrapper.go | Implements DownloadFile using Drive API media download for raw bytes. |
| pkg/tools/google/gdrive/gdrivefakes/fake_service.go | Updates generated fake to include DownloadFile for connector tests. |
| pkg/tools/google/gdrive/gdrive.go | Extends Service interface with DownloadFile(io.ReadCloser). |
| pkg/tools/google/gdrive/datasource.go | Adds optional doc parser wiring + binary parsing flow + page concatenation. |
| pkg/tools/google/gdrive/connector_factory.go | Wires ConnectorOptions.DocParser into the GDrive connector via type assertion. |
| pkg/tools/google/gdrive/datasource_test.go | Adds coverage for Docling parsing, concatenation, and fallback scenarios. |
| pkg/datasource/docparser/docling.go | Adds to_formats=md multipart field to request markdown output. |
| pkg/datasource/docparser/parser_test.go | Adds test verifying multipart includes to_formats=md. |
| pkg/datasource/builder.go | Adds DocParser any to ConnectorOptions for dependency injection into factories. |
| pkg/app/app.go | Instantiates the configured doc parser and passes it via ConnectorOptions. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
…ontent-transformer
…LEVEL env var Adds structured logging throughout the Docling document parsing pipeline (request/response, parse results, routing decisions) and allows setting log level via GENIE_LOG_LEVEL environment variable when the CLI flag is at its default.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…tasource docparser.Parse() now returns []ParsedPage instead of []NormalizedItem, breaking the circular import that forced builder.go to use `any` for the DocParser field. The field is now typed as docparser.Provider with compile-time safety. Callers map ParsedPages into their own storage model. Addresses PR review: #150 (comment)
… doc parser init - Revert env var log level override in cmd/root.go (out of scope for this PR) - Use if/else for doc parser creation since New() never returns (nil, nil)
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Summary
DownloadFilemethod to GDriveServiceinterface for raw byte access without MIME-type rejectionDocParserfield toConnectorOptionsfor injecting the document parsing providerto_formats=mdform field