Replace pdf-parse with direct PDF upload to OpenAI API #465

@bhekanik

Summary

Replace the pdf-parse library with OpenAI's native PDF file input, which became available in March 2025. This would simplify our PDF processing pipeline and could improve extraction accuracy for complex documents.

Current Implementation

  • We use the pdf-parse library to extract text from PDFs in src/actions/extractTextFromFile.ts
  • The extracted text is then sent to the OpenAI API through the Vercel AI SDK
  • We already use OpenAI models (GPT-4o-mini) via generateObject from the AI SDK

Proposed Solution

Leverage OpenAI's direct PDF file input support announced in March 2025:

  • Send PDFs directly to the OpenAI API without pre-parsing
  • Use the existing Vercel AI SDK, which should support file uploads (to be confirmed)
  • Keep pdf-parse as a fallback for when direct upload fails (direct upload has known reliability issues)

Benefits

  1. Simplified pipeline: Remove text extraction step
  2. Better accuracy: OpenAI handles PDF structure internally, preserving tables/layouts
  3. Mixed content support: Better handling of PDFs with images, tables, complex formatting
  4. Cost optimization: Can still pre-parse for high-volume scenarios if needed

Implementation Steps

  1. Update extractTextFromFile.ts to support direct PDF upload mode
  2. Modify API extraction endpoints to accept file buffers alongside text
  3. Update OpenAI client configuration to handle file uploads
  4. Implement fallback to pdf-parse when direct upload fails
  5. Test with various CV formats (simple text, tables, mixed content)
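Step 4 could be sketched roughly as follows. This is a minimal illustration, not the project's code: `withPdfParseFallback`, `Attempt`, and `Outcome` are hypothetical names, and the two callbacks stand in for whatever the direct-upload and pdf-parse paths end up looking like in extractTextFromFile.ts.

```typescript
// Sketch only: both extraction paths are injected as callbacks, so this block
// has no dependency on the OpenAI client or on pdf-parse.
// All names here are hypothetical.

type Attempt<T> = () => Promise<T>;

interface Outcome<T> {
  value: T;
  source: "direct-upload" | "pdf-parse";
  // Set only when the fallback path was used, so failures can be logged.
  directUploadError?: string;
}

async function withPdfParseFallback<T>(
  directUpload: Attempt<T>,
  parseThenExtract: Attempt<T>,
): Promise<Outcome<T>> {
  try {
    const value = await directUpload();
    return { value, source: "direct-upload" };
  } catch (err) {
    // Any failure of the direct path triggers the pdf-parse path.
    // Errors thrown by the fallback itself propagate to the caller.
    const value = await parseThenExtract();
    return {
      value,
      source: "pdf-parse",
      directUploadError: err instanceof Error ? err.message : String(err),
    };
  }
}
```

Returning the `source` alongside the value makes it possible to measure how often the fallback fires, which may help when evaluating the reliability of direct upload.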

Technical Details

  • OpenAI supports PDF upload via files.create with purpose='user_data'
  • The uploaded file is then referenced in a chat completion through a message content part with type: 'file'
  • The Vercel AI SDK may need updates to support file messages
  • Keep the existing text-extraction flow for Word documents (mammoth)
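The upload-then-reference flow described above might look roughly like this. It is a sketch under assumptions: `OpenAILike`, `buildPdfMessage`, and `extractFromPdf` are hypothetical names, and the interface is a hand-written structural stand-in covering only the two calls used here, not the `openai` package's own types. The `file: { file_id }` nesting reflects my understanding of the Chat Completions file part and should be checked against the current API reference.

```typescript
// Minimal structural types for the two calls used below (assumptions,
// not the SDK's real type definitions).
interface FileContentPart { type: "file"; file: { file_id: string } }
interface TextContentPart { type: "text"; text: string }
interface UserMessage {
  role: "user";
  content: Array<FileContentPart | TextContentPart>;
}

interface OpenAILike {
  files: {
    create(args: { file: unknown; purpose: "user_data" }): Promise<{ id: string }>;
  };
  chat: {
    completions: {
      create(args: { model: string; messages: UserMessage[] }): Promise<{
        choices: Array<{ message: { content: string | null } }>;
      }>;
    };
  };
}

// Builds a user message that points at an already-uploaded PDF.
function buildPdfMessage(fileId: string, instruction: string): UserMessage {
  return {
    role: "user",
    content: [
      { type: "file", file: { file_id: fileId } },
      { type: "text", text: instruction },
    ],
  };
}

// Uploads the PDF, then asks the model to read it. Returns "" if the
// model produced no text content.
async function extractFromPdf(
  client: OpenAILike,
  pdfFile: unknown, // e.g. a File or read stream in real use (assumption)
  instruction: string,
  model = "gpt-4o-mini",
): Promise<string> {
  const uploaded = await client.files.create({ file: pdfFile, purpose: "user_data" });
  const completion = await client.chat.completions.create({
    model,
    messages: [buildPdfMessage(uploaded.id, instruction)],
  });
  return completion.choices[0]?.message.content ?? "";
}
```

Passing the client in as a parameter keeps the function easy to test with a stub and leaves open whether the call ultimately goes through the OpenAI client or the Vercel AI SDK.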

Considerations

  • Monitor API costs (direct upload may be more expensive)
  • Handle intermittent "unable to read PDF" errors reported by users
  • Maintain backward compatibility with existing data
  • Consider hybrid approach: simple PDFs → direct upload, complex → parse first
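One way to handle the intermittent "unable to read PDF" errors is a bounded retry around the direct-upload call before any fallback to pdf-parse runs. The sketch below is an assumption-laden illustration: `retryTransient` and `looksLikeUnreadablePdf` are hypothetical names, the attempt count and delays are arbitrary defaults, and the real error shape returned by the API is not known from this issue, so the message match is only a guess.

```typescript
// Retries an async operation with exponential backoff, but only for errors
// the caller classifies as transient. Hypothetical helper, not project code.
async function retryTransient<T>(
  attempt: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  maxAttempts = 3,
  baseDelayMs = 250,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err;
      // Stop on non-transient errors or when attempts are exhausted.
      if (!isTransient(err) || i === maxAttempts - 1) break;
      // Wait baseDelayMs, then 2x, 4x, ... before the next attempt.
      await new Promise<void>((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}

// Assumed classifier: treats errors whose message mentions an unreadable PDF
// as transient. The actual error format should be confirmed before relying on it.
function looksLikeUnreadablePdf(err: unknown): boolean {
  return err instanceof Error && /unable to read pdf/i.test(err.message);
}
```

Keeping the attempt count small limits the extra API cost of retries, which matters given that direct upload may already be the more expensive path.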

Priority

Medium - Current system works but this would improve accuracy and simplify code

Labels

enhancement, ai, infrastructure
