Add source field to extracted variables (filename vs text) #41

@michaelbeutler

Description

Allow regex variable extraction to run against the original filename in addition to the extracted document text. Add a source field to ExtractedVariable.

Example Config

variables:
  extracted:
    - name: account_id
      source: filename   # NEW — extract from filename instead of text
      pattern: "[0-9]{8}_.*(?P<account_id>[0-9]{4}\\.[0-9]{4}\\.[0-9]{4})"
    - name: invoice_num
      source: text        # default behavior
      pattern: "INV-(?P<invoice_num>\\d+)"

Implementation

Schema change:

  • File: crates/paporg/src/config/schema.rs
  • Add source: Option<VariableSource> to ExtractedVariable (line ~65)
  • Add enum:
    enum VariableSource { Text, Filename }
    Default: Text
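
A minimal sketch of the schema change, assuming a simplified ExtractedVariable with only name, pattern, and source fields (the real struct in schema.rs may carry more fields and deserialization derives, which are omitted here to keep the example dependency-free). The effective_source() helper is hypothetical and only illustrates how an omitted source could resolve to Text:

```rust
/// Which input string a pattern is matched against.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum VariableSource {
    /// Extracted document text (the current behavior).
    #[default]
    Text,
    /// Original filename of the document.
    Filename,
}

/// Simplified stand-in for the real ExtractedVariable (assumption:
/// any other fields are left out of this sketch).
#[derive(Debug, Clone)]
pub struct ExtractedVariable {
    pub name: String,
    pub pattern: String,
    /// `None` means the config omitted `source`.
    pub source: Option<VariableSource>,
}

impl ExtractedVariable {
    /// Hypothetical helper: resolves an omitted `source` to `Text`.
    pub fn effective_source(&self) -> VariableSource {
        self.source.unwrap_or_default()
    }
}
```

Keeping the field as Option<VariableSource> means existing configs without a source key still parse, and the default is applied in one place.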

Variable engine change:

  • File: crates/paporg/src/config/variables.rs
  • Update extract_variables() (line ~36) to accept both text: &str and filename: &str
  • For each pattern, check source to decide which string to match against
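
The per-pattern source check could look roughly like the sketch below. The match_target() function is a hypothetical name, not something from the codebase, and the regex matching itself is left out; only the choice of input string is shown:

```rust
/// Which input string a pattern is matched against.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum VariableSource {
    #[default]
    Text,
    Filename,
}

/// Hypothetical helper: picks the string a pattern should run against.
/// An omitted source (`None`) falls back to the document text.
pub fn match_target<'a>(
    source: Option<VariableSource>,
    text: &'a str,
    filename: &'a str,
) -> &'a str {
    match source.unwrap_or_default() {
        VariableSource::Text => text,
        VariableSource::Filename => filename,
    }
}
```

Inside extract_variables(), each pattern would then be applied to the returned string, so the existing regex handling stays unchanged.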

Pipeline change:

  • File: crates/paporg/src/pipeline/runner.rs
  • In step_extract_variables() (line ~222): pass the original filename alongside the text
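
The issue does not say how the pipeline stores the original filename. As an assumption, if the runner holds the input as a path, the final path component (including the extension) could be taken with the standard library. original_filename() is a hypothetical helper name:

```rust
use std::path::Path;

/// Hypothetical helper: returns the last path component as UTF-8,
/// or `None` if the path has no file name or is not valid UTF-8.
pub fn original_filename(path: &Path) -> Option<&str> {
    path.file_name().and_then(|name| name.to_str())
}
```

The None case needs a decision in the pipeline: skipping filename-sourced variables or matching against an empty string are both plausible, and the issue leaves this open.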

Acceptance Criteria

  • source: filename extracts variables from the original filename
  • source: text (default) preserves current behavior
  • Omitting source defaults to text
  • Covered by unit tests for both sources
