Summary
Replace the current pdf-parse library as the primary PDF path with OpenAI's native PDF upload capability, which became available in March 2025. This will simplify our PDF processing pipeline and may improve extraction accuracy for complex documents.
Current Implementation
- We use the pdf-parse library to extract text from PDFs in src/actions/extractTextFromFile.ts
- Extracted text is then sent to the OpenAI API using the Vercel AI SDK
- We already use OpenAI models (GPT-4o-mini) via generateObject from the AI SDK
Proposed Solution
Leverage OpenAI's direct PDF file input support announced in March 2025:
- Send PDFs directly to the OpenAI API without pre-parsing
- Use the existing Vercel AI SDK, which should support file uploads (to be verified, see Technical Details)
- Keep pdf-parse as a fallback for when direct upload fails (direct upload has known reliability issues)
Benefits
- Simplified pipeline: Remove text extraction step
- Better accuracy: OpenAI handles PDF structure internally, preserving tables/layouts
- Mixed content support: Better handling of PDFs with images, tables, complex formatting
- Cost optimization: Can still pre-parse for high-volume scenarios if needed
Implementation Steps
- Update extractTextFromFile.ts to support direct PDF upload mode
- Modify API extraction endpoints to accept file buffers alongside text
- Update OpenAI client configuration to handle file uploads
- Implement fallback to pdf-parse when direct upload fails
- Test with various CV formats (simple text, tables, mixed content)
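The fallback step above could be sketched roughly as follows. This is a minimal illustration, not the final implementation: extractWithFallback, directUpload and parseThenExtract are hypothetical names, and both extractors are injected as plain functions so the sketch does not assume any particular SDK call.

```typescript
// Result of an extraction attempt, tagged with the path that produced it.
type ExtractionResult<T> = {
  data: T;
  source: "direct-upload" | "pdf-parse";
};

// Both extractors are injected so the sketch stays independent of any SDK.
type Extractor<T> = (pdf: Uint8Array) => Promise<T>;

async function extractWithFallback<T>(
  pdf: Uint8Array,
  directUpload: Extractor<T>, // hypothetical: sends the PDF straight to OpenAI
  parseThenExtract: Extractor<T>, // hypothetical: existing pdf-parse text flow
): Promise<ExtractionResult<T>> {
  try {
    return { data: await directUpload(pdf), source: "direct-upload" };
  } catch (error) {
    // Direct upload failed (for example an "unable to read PDF" error),
    // so retry through the existing text-extraction path.
    console.warn("Direct PDF upload failed, falling back to pdf-parse", error);
    return { data: await parseThenExtract(pdf), source: "pdf-parse" };
  }
}
```

Returning the source alongside the data makes it easy to log how often the fallback is hit, which helps when comparing cost and reliability of the two paths.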
Technical Details
- OpenAI supports PDF upload via files.create with purpose='user_data'
- Then reference the uploaded file in the chat completion with a type: 'file' content part
- Vercel AI SDK may need updates to support file messages
- Keep extracted text flow for Word documents (mammoth)
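As a rough sketch of the upload-then-reference flow described above: the interfaces below are simplified stand-ins for the real OpenAI SDK types (an assumption, not the SDK's actual declarations), and uploadAndBuildMessage and buildPdfMessage are hypothetical helper names. The shape of the file content part should be checked against the current OpenAI API reference before use.

```typescript
// Simplified stand-in for the client's files API (assumption, not the SDK type).
interface FileUploader {
  create(params: { file: unknown; purpose: "user_data" }): Promise<{ id: string }>;
}

// Content parts of a user message: plain text or a reference to an uploaded file.
type ContentPart =
  | { type: "text"; text: string }
  | { type: "file"; file: { file_id: string } };

interface UserMessage {
  role: "user";
  content: ContentPart[];
}

// Build a user message that points at an already uploaded PDF.
function buildPdfMessage(fileId: string, instruction: string): UserMessage {
  return {
    role: "user",
    content: [
      { type: "file", file: { file_id: fileId } },
      { type: "text", text: instruction },
    ],
  };
}

// Step 1: upload the PDF with purpose 'user_data'.
// Step 2: reference the returned file id in the message.
async function uploadAndBuildMessage(
  files: FileUploader,
  pdf: unknown,
  instruction: string,
): Promise<UserMessage> {
  const uploaded = await files.create({ file: pdf, purpose: "user_data" });
  return buildPdfMessage(uploaded.id, instruction);
}
```

Keeping the message builder separate from the upload call means the payload shape can be unit-tested without network access.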
Considerations
- Monitor API costs (direct upload may be more expensive)
- Handle intermittent "unable to read PDF" errors reported by users
- Maintain backward compatibility with existing data
- Consider hybrid approach: simple PDFs → direct upload, complex → parse first
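The hybrid approach could be expressed as a small routing function. This is only a sketch: the PdfTraits fields, the chooseMode name and the five-page threshold are all hypothetical, and what counts as a "simple" PDF would need to be decided from real test data.

```typescript
type ExtractionMode = "direct-upload" | "parse-first";

// Hypothetical traits that a cheap pre-check might provide.
interface PdfTraits {
  pageCount: number;
  hasImagesOrTables: boolean;
}

// Route simple PDFs to direct upload and complex ones to parse-first,
// following the hybrid rule in the list above. The threshold is a placeholder.
function chooseMode(traits: PdfTraits, maxSimplePages = 5): ExtractionMode {
  const isSimple =
    !traits.hasImagesOrTables && traits.pageCount <= maxSimplePages;
  return isSimple ? "direct-upload" : "parse-first";
}
```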
Priority
Medium - Current system works but this would improve accuracy and simplify code
Labels
enhancement, ai, infrastructure