
feat(xfa): expose raw scripts via FormSchema.Scripts; drop heuristic Rule API#3

Merged: lfstokols merged 4 commits into main from claude/xfa-script-parsing-redesign-tRWh1 on May 25, 2026

@lfstokols
Contributor

Motivation

The previous XFA script-handling surface had two layers:

  1. A working classification envelope (initial commit): each XFA <event> became a Rule with Type picked from the event activity, Source = field name, and a single Action{Type: ActionTypeExecute, Script: body} carrying the raw script. This part was fine — it didn't claim to interpret semantics.
  2. An aspirational structured layer (Condition, Operator, LogicOp, typed Actions beyond Execute). Declared but unused until c2c3fe0 (v1.2.0), when regex-based extraction started populating it for visibility / set-value / validate / calculate patterns.

The v1.2.0 heuristic was lossy and incorrect on non-trivial scripts. Faithful FormCalc / ES5 interpretation needs a real AST, which is out of scope for a PDF library. This PR replaces the envelope with a richer, lossless one and removes the aspirational layer entirely — callers that need to evaluate scripts should plug in their own interpreter.

What changed

New surface:

  • types.FormScript — verbatim script body plus event activity, language (javascript | formcalc), SOM owner path, owner ID, Name, RunAt, and a Properties map for unknown <event> / <script> attributes (listen, ref, id, binding, stateless, url, …).
  • FormSchema.Scripts []FormScript — flat declaration-order list.
  • Question.Scripts []string and FormSection.Scripts []string: lists of FormScript.ID references for back-lookup into FormSchema.Scripts.
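
A rough picture of the new surface and the ID back-lookup, as a sketch only. The field names follow the list above; the field types, the local `FormScript` and `Question` stand-ins, and the `scriptsFor` helper are assumptions for illustration, not the definitions in this PR.

```go
package main

import "fmt"

// FormScript is a local stand-in for types.FormScript. Field names come from
// the PR description; the field types are assumed.
type FormScript struct {
	ID         string
	Event      string // literal XFA activity, or "variables"
	Language   string // "javascript" or "formcalc"
	OwnerPath  string // SOM path of the owning node
	OwnerID    string
	Name       string
	RunAt      string
	Body       string            // verbatim script source
	Properties map[string]string // unknown <event>/<script> attributes
}

// Question is a stand-in for types.Question, reduced to the relevant fields.
type Question struct {
	ID      string
	Scripts []string // FormScript.ID references
}

// scriptsFor is a hypothetical helper that resolves a question's script IDs
// against the flat schema list, skipping IDs that do not resolve.
func scriptsFor(q Question, all []FormScript) []FormScript {
	byID := make(map[string]FormScript, len(all))
	for _, s := range all {
		byID[s.ID] = s
	}
	var out []FormScript
	for _, id := range q.Scripts {
		if s, ok := byID[id]; ok {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	all := []FormScript{
		{ID: "s1", Event: "calculate", Language: "formcalc", Body: "Sum(a, b)"},
		{ID: "s2", Event: "click", Language: "javascript", Body: "xfa.host.beep();"},
	}
	q := Question{ID: "q1", Scripts: []string{"s2"}}
	for _, s := range scriptsFor(q, all) {
		fmt.Printf("%s: [%s] %s\n", q.ID, s.Event, s.Body)
	}
}
```

Building the map once per schema rather than once per question would be the natural choice for large forms.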

Removed (breaking, but the only user-facing populator was the v1.2.0 heuristic):

  • types.Rule, RuleType, Condition, Operator, LogicOp, Action, ActionType and all their constants.
  • pdfer.Rule root-level alias (replaced by pdfer.FormScript).
  • FormSchema.Rules (replaced by FormSchema.Scripts).
  • Annotation.Actions (was unused).
  • All XFA script-classification helpers: parseXFAScript, detectScriptLanguage, tryParseVisibilityScript, tryParseSetValueScript, tryParseValidationScript, tryParseCalculateScript, extractSimpleCondition, invertCondition, extractAllPresenceTargets, and supporting regex utilities.

<variables><script> blocks are now exposed verbatim with Event="variables" rather than being chopped into per-function pseudo-rules. Language defaults to formcalc per the XFA spec when contentType is absent.
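
The defaulting rule can be sketched as below. `langFromContentType` is a hypothetical stand-in, not the PR's `contentTypeToLang()`, whose body is not shown here. The two content-type strings are assumed to be the usual XFA values, and passing unknown values through unchanged is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// langFromContentType maps an XFA <script contentType="..."> value to a
// language label. Hypothetical sketch; not the library's implementation.
func langFromContentType(contentType string) string {
	ct := strings.ToLower(strings.TrimSpace(contentType))
	switch ct {
	case "", "application/x-formcalc":
		// An absent contentType defaults to FormCalc.
		return "formcalc"
	case "application/x-javascript":
		return "javascript"
	default:
		// Assumption: surface unrecognized values as-is instead of guessing.
		return ct
	}
}

func main() {
	for _, ct := range []string{"", "application/x-javascript", "application/x-formcalc"} {
		fmt.Printf("%q -> %s\n", ct, langFromContentType(ct))
	}
}
```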

Name overlap: forms/acroform/actions.go still defines its own ActionType for PDF action dictionaries (URI, JavaScript, …). That type is unrelated to the deleted XFA ActionType and stays.

Migration

For callers that used the old Rule/Action surface — the field correspondence is:

| Old | New |
| --- | --- |
| Rule.ID (rule_N) | FormScript.ID (stable: SOM path + event + index) |
| Rule.Type (bucket: calculate/validate/setValue) | FormScript.Event (literal XFA activity) |
| Rule.Source | FormScript.OwnerID / FormScript.OwnerPath |
| Action.Script | FormScript.Body |
| none | FormScript.Language, Name, RunAt, Properties |

Typical usage:

for _, s := range schema.Scripts {
    fmt.Printf("[%s] %s on %s\n%s\n", s.Event, s.Language, s.OwnerPath, s.Body)
}
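
Since pdfer no longer interprets scripts, a caller that wants to evaluate them might route each script to its own interpreter by language. The following is a minimal sketch under stated assumptions: the `Interpreter` interface, the `recorder` type, and `dispatch` are all hypothetical caller-side code, and `FormScript` is a reduced local stand-in for the real type.

```go
package main

import "fmt"

// FormScript is a reduced local stand-in for types.FormScript.
type FormScript struct {
	Event, Language, OwnerPath, Body string
}

// Interpreter is a hypothetical caller-defined interface; pdfer does not
// provide one.
type Interpreter interface {
	Run(s FormScript) error
}

// recorder is a toy Interpreter that only records which owners it saw.
type recorder struct{ seen []string }

func (r *recorder) Run(s FormScript) error {
	r.seen = append(r.seen, s.OwnerPath)
	return nil
}

// dispatch runs every script for the given event through the interpreter
// registered for its language, and counts scripts with no interpreter.
func dispatch(scripts []FormScript, event string, byLang map[string]Interpreter) (int, error) {
	skipped := 0
	for _, s := range scripts {
		if s.Event != event {
			continue
		}
		in, ok := byLang[s.Language]
		if !ok {
			skipped++
			continue
		}
		if err := in.Run(s); err != nil {
			return skipped, fmt.Errorf("%s on %s: %w", s.Event, s.OwnerPath, err)
		}
	}
	return skipped, nil
}

func main() {
	scripts := []FormScript{
		{Event: "calculate", Language: "javascript", OwnerPath: "form1.total"},
		{Event: "calculate", Language: "formcalc", OwnerPath: "form1.tax"},
	}
	js := &recorder{}
	skipped, err := dispatch(scripts, "calculate", map[string]Interpreter{"javascript": js})
	fmt.Println(js.seen, skipped, err)
}
```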

Known gaps

Scripts attached to nodes pdfer does not surface in the schema are not extracted: decorative <draw> elements with events, <field> buttons with bind="none" other than AddAttachment, <pageArea>-level events, and individual <field> radio options collapsed into an <exclGroup>'s Options. Documented on the FormScript type comment. Callers needing full event fidelity should walk the raw XFA XML directly.

Verification

  • go test ./... — passes.
  • End-to-end on tests/resources/nonivd_estar.pdf: 769 scripts emitted, bodies preserved verbatim (including comments, escapes, multi-line if/else chains), 621/904 questions carry script ID references.

Test plan

  • go test ./...
  • XFA package tests cover envelope (event/script Properties capture, <variables><script> branch, Question/Section script ID resolution, language defaulting, name/runAt plumbing)
  • eSTAR roundtrip script counts (769 / 887 / 649) unchanged from pre-ripout

🤖 Generated with Claude Code

claude and others added 4 commits May 25, 2026 14:50
…istics

XFA script interpretation was regex-based and lossy: bodies got trimmed and
sliced when a heuristic "succeeded", contentType wasn't surfaced, and owning
field/subform context was dropped. Real ES5/FormCalc semantics need an AST,
which is out of scope here. Callers can do better with the raw source.

This is a breaking change to the form API:

- New types.FormScript carries the verbatim body, language, event activity,
  SOM owner path, and matching Question.ID / FormSection.Path as OwnerID.
- types.FormSchema gains Scripts []FormScript; loses Rules.
- types.Question and types.FormSection gain Scripts []string — ID indexes
  into FormSchema.Scripts in declaration order.
- Rule, RuleType, Condition, Operator, LogicOp, Action, ActionType and
  their constants are deleted. Annotation.Actions (unused) drops too.
- pdfer.Rule alias is replaced with pdfer.FormScript.

XFA translator: heuristic helpers (parseXFAScript, detectScriptLanguage,
tryParseVisibilityScript, tryParseSetValueScript, tryParseValidationScript,
tryParseCalculateScript, extractSimpleCondition, invertCondition,
extractAllPresenceTargets, splitIfElse, splitJSIfElseChain,
parseVariablesFunctionRules, extractJSFunctionBodies, and supporting
regex/match utilities) are removed. Language defaults to "formcalc" per the
XFA spec when contentType is absent; contentTypeToLang() handles the rest.
<variables><script> blocks are now exposed verbatim as FormScripts with
Event="variables" rather than being chopped into per-function pseudo-rules.

End-to-end verification against tests/resources/nonivd_estar.pdf: 769
scripts emitted, bodies preserved verbatim (including comments, escapes,
multi-line if/else chains), 621/904 questions carry script ID references.

The script-redesign FormScript struct already declared a Properties map
for additional <event>/<script> attributes, but nothing populated it.
Without listen/ref/id/binding/stateless/url, callers can't faithfully
reproduce XFA event semantics — ref/listen in particular re-target the
script to a different node than the host.

Wire all four parse sites (one <event>, three <script> branches:
variables, field-event, subform-event) to capture every attribute not
already consumed by a typed field via a small putAttr helper. Event
and script attrs share a single flat map per FormScript; key collisions
(e.g. both elements carrying id) are last-write-wins, documented on the
field comment.
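
The capture rule described above might be sketched as follows. This `putAttr` is a hypothetical reconstruction, not the helper from the commit: its signature, the set of attributes treated as typed, and the event-then-script write order are all assumptions.

```go
package main

import "fmt"

// putAttr stores an attribute in the shared Properties map unless a typed
// FormScript field already consumes it. Later writes overwrite earlier ones.
// Hypothetical sketch; the real helper's signature is not shown in this PR.
func putAttr(props map[string]string, typed map[string]bool, name, value string) {
	if typed[name] {
		return
	}
	props[name] = value
}

func main() {
	// Assumed set of attributes already surfaced as typed fields.
	typed := map[string]bool{"activity": true, "name": true, "contentType": true, "runAt": true}
	props := map[string]string{}

	// <event> attributes first (assumed order), then <script> attributes.
	putAttr(props, typed, "activity", "click") // dropped: typed field
	putAttr(props, typed, "id", "evt1")
	putAttr(props, typed, "ref", "$form.target")
	putAttr(props, typed, "id", "scr1") // collision: last write wins

	fmt.Println(props["id"], props["ref"], len(props)) // prints "scr1 $form.target 2"
}
```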

Also document the suppressed-node gap on FormScript: events on
decorative draws, bind="none" non-AddAttachment buttons, pageArea, and
<field> radio options collapsed into an <exclGroup>'s Options are not
extracted. Deferred to a follow-up; most such scripts in eSTAR are
chrome (status-indicator presence toggles).

Fix three stale "%d rules" log strings in tests/estar_test.go that
still said "rules" while passing len(schema.Scripts).

New tests cover the field-event + script Properties capture (including
absence of typed-field duplicates) and the <variables><script> branch.

eSTAR script counts unchanged (769 / 887 / 649).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- README Forms section: add a snippet showing schema.Scripts iteration
  and a note that pdfer doesn't interpret FormCalc/JavaScript. Links to
  xfa-web as an example of a downstream interpreter.
- README "Known limitations" Forms bullet: replace stale
  "ActionTypeExecute fallback" wording with the new verbatim-body contract
  and the list of unsupported owner nodes.
- GAPS.md: mark the v1.2.0 "XFA script parsing" entry as superseded
  with a pointer to FormScript and the type comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Trailing blank lines and column alignment shifts left over after the
xfa_form_translator.go / xfa_translator_test.go deletions in the heuristic
removal commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lfstokols force-pushed the claude/xfa-script-parsing-redesign-tRWh1 branch from fdc7320 to a76928f on May 25, 2026 19:08
lfstokols merged commit 7721d7a into main on May 25, 2026
5 checks passed