feat(xfa): expose raw scripts via FormSchema.Scripts; drop heuristic Rule API #3
Merged
…istics

XFA script interpretation was regex-based and lossy: bodies got trimmed and sliced when a heuristic "succeeded", contentType wasn't surfaced, and owning field/subform context was dropped. Real ES5/FormCalc semantics need an AST, which is out of scope here. Callers can do better with the raw source.

This is a breaking change to the form API:
- New types.FormScript carries the verbatim body, language, event activity, SOM owner path, and matching Question.ID / FormSection.Path as OwnerID.
- types.FormSchema gains Scripts []FormScript; loses Rules.
- types.Question and types.FormSection gain Scripts []string, whose IDs index into FormSchema.Scripts in declaration order.
- Rule, RuleType, Condition, Operator, LogicOp, Action, ActionType and their constants are deleted. Annotation.Actions (unused) drops too.
- pdfer.Rule alias is replaced with pdfer.FormScript.

XFA translator: heuristic helpers (parseXFAScript, detectScriptLanguage, tryParseVisibilityScript, tryParseSetValueScript, tryParseValidationScript, tryParseCalculateScript, extractSimpleCondition, invertCondition, extractAllPresenceTargets, splitIfElse, splitJSIfElseChain, parseVariablesFunctionRules, extractJSFunctionBodies, and supporting regex/match utilities) are removed. Language defaults to "formcalc" per the XFA spec when contentType is absent; contentTypeToLang() handles the rest. <variables><script> blocks are now exposed verbatim as FormScripts with Event="variables" rather than being chopped into per-function pseudo-rules.

End-to-end verification against tests/resources/nonivd_estar.pdf: 769 scripts emitted, bodies preserved verbatim (including comments, escapes, multi-line if/else chains), 621/904 questions carry script ID references.
The script-redesign FormScript struct already declared a Properties map for additional <event>/<script> attributes, but nothing populated it. Without listen/ref/id/binding/stateless/url, callers can't faithfully reproduce XFA event semantics; ref/listen in particular re-target the script to a different node than the host.

Wire all four parse sites (one <event>, three <script> branches: variables, field-event, subform-event) to capture every attribute not already consumed by a typed field via a small putAttr helper. Event and script attrs share a single flat map per FormScript; key collisions (e.g. both elements carrying id) are last-write-wins, documented on the field comment.

Also document the suppressed-node gap on FormScript: events on decorative draws, bind="none" non-AddAttachment buttons, pageArea, and <field> radio options collapsed into an <exclGroup>'s Options are not extracted. Deferred to a follow-up; most such scripts in eSTAR are chrome (status-indicator presence toggles).

Fix three stale "%d rules" log strings in tests/estar_test.go that still said "rules" while passing len(schema.Scripts).

New tests cover the field-event + script Properties capture (including absence of typed-field duplicates) and the <variables><script> branch. eSTAR script counts unchanged (769 / 887 / 649).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- README Forms section: add a snippet showing schema.Scripts iteration and a note that pdfer doesn't interpret FormCalc/JavaScript. Links to xfa-web as an example of a downstream interpreter.
- README "Known limitations" Forms bullet: replace stale "ActionTypeExecute fallback" wording with the new verbatim-body contract and the list of unsupported owner nodes.
- GAPS.md: mark the v1.2.0 "XFA script parsing" entry as superseded with a pointer to FormScript and the type comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trailing blank lines and column alignment shifts left over after the xfa_form_translator.go / xfa_translator_test.go deletions in the heuristic removal commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fdc7320 to a76928f
Motivation
The previous XFA script-handling surface had two layers:
- An envelope: each <event> became a Rule with Type picked from the event activity, Source = field name, and a single Action{Type: ActionTypeExecute, Script: body} carrying the raw script. This part was fine: it didn't claim to interpret semantics.
- An aspirational typed layer (Condition, Operator, LogicOp, typed Actions beyond Execute). Declared but unused until c2c3fe0 (v1.2.0), when regex-based extraction started populating it for visibility / set-value / validate / calculate patterns.

The v1.2.0 heuristic was lossy and incorrect on non-trivial scripts. Faithful FormCalc / ES5 interpretation needs a real AST, which is out of scope for a PDF library. This PR replaces the envelope with a richer, lossless one and removes the aspirational layer entirely; callers that need to evaluate scripts should plug in their own interpreter.

What changed
New surface:
- types.FormScript: verbatim script body plus event activity, language (javascript | formcalc), SOM owner path, owner ID, Name, RunAt, and a Properties map for unknown <event>/<script> attributes (listen, ref, id, binding, stateless, url, …).
- FormSchema.Scripts []FormScript: flat declaration-order list.
- Question.Scripts []string and FormSection.Scripts []string: FormScript.ID references for back-lookup.

Removed (breaking, but the only user-facing populator was the v1.2.0 heuristic):
- types.Rule, RuleType, Condition, Operator, LogicOp, Action, ActionType and all their constants.
- pdfer.Rule root-level alias (replaced by pdfer.FormScript).
- FormSchema.Rules (replaced by FormSchema.Scripts).
- Annotation.Actions (was unused).
- parseXFAScript, detectScriptLanguage, tryParseVisibilityScript, tryParseSetValueScript, tryParseValidationScript, tryParseCalculateScript, extractSimpleCondition, invertCondition, extractAllPresenceTargets, and supporting regex utilities.

<variables><script> blocks are now exposed verbatim with Event="variables" rather than being chopped into per-function pseudo-rules. Language defaults to formcalc per the XFA spec when contentType is absent.

Name overlap:
forms/acroform/actions.go still defines its own ActionType for PDF action dictionaries (URI, JavaScript, …). That type is unrelated to the deleted XFA ActionType and stays.

Migration
For callers that used the old Rule/Action surface, the field correspondence is:

| Old (Rule / Action) | New (FormScript) |
| --- | --- |
| Rule.ID (rule_N) | FormScript.ID (stable: SOM path + event + index) |
| Rule.Type (bucket: calculate/validate/setValue) | FormScript.Event (literal XFA activity) |
| Rule.Source | FormScript.OwnerID / FormScript.OwnerPath |
| Action.Script | FormScript.Body |
| (no equivalent) | FormScript.Language, Name, RunAt, Properties |

Typical usage:
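
A minimal sketch of iterating the flat script list and resolving a question's script IDs back to their bodies. The types below are local stand-ins so the example compiles on its own: the field names follow this description, but the field types, the FormSchema.Questions field, the helper names indexScripts and scriptsFor, and the sample ID strings are assumptions, not pdfer's actual definitions.

```go
package main

import "fmt"

// FormScript is a local stand-in for types.FormScript.
// Field names come from the PR description; field types are assumed.
type FormScript struct {
	ID         string
	Event      string // literal XFA activity, or "variables"
	Language   string // "javascript" or "formcalc"
	OwnerPath  string
	OwnerID    string
	Name       string
	RunAt      string
	Body       string            // verbatim script source
	Properties map[string]string // extra <event>/<script> attributes
}

// Question is a trimmed stand-in: only the fields used here.
type Question struct {
	ID      string
	Scripts []string // FormScript.ID references
}

// FormSchema is a trimmed stand-in; the Questions field is an assumption.
type FormSchema struct {
	Questions []Question
	Scripts   []FormScript
}

// indexScripts (hypothetical helper) builds an ID -> script lookup.
func indexScripts(schema *FormSchema) map[string]*FormScript {
	byID := make(map[string]*FormScript, len(schema.Scripts))
	for i := range schema.Scripts {
		s := &schema.Scripts[i]
		byID[s.ID] = s
	}
	return byID
}

// scriptsFor (hypothetical helper) resolves a question's script IDs,
// skipping any ID that is not present in the index.
func scriptsFor(q Question, byID map[string]*FormScript) []*FormScript {
	var out []*FormScript
	for _, id := range q.Scripts {
		if s, ok := byID[id]; ok {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// Sample data; the ID format shown is illustrative only.
	schema := &FormSchema{
		Questions: []Question{{ID: "q-total", Scripts: []string{"form.total#calculate#0"}}},
		Scripts: []FormScript{
			{ID: "form.total#calculate#0", Event: "calculate", Language: "formcalc",
				OwnerPath: "form.total", OwnerID: "q-total", Body: "Sum(item[*].price)"},
			{ID: "form#variables#0", Event: "variables", Language: "javascript",
				OwnerPath: "form", Body: "function helper() { return 1; }"},
		},
	}

	// Walk every script in declaration order.
	for _, s := range schema.Scripts {
		fmt.Printf("%s [%s/%s] owner=%s\n", s.ID, s.Event, s.Language, s.OwnerPath)
	}

	// Back-lookup from a question to its scripts.
	byID := indexScripts(schema)
	for _, q := range schema.Questions {
		for _, s := range scriptsFor(q, byID) {
			fmt.Printf("question %s -> %s: %s\n", q.ID, s.Event, s.Body)
		}
	}
}
```

Evaluating Body is left to the caller: the script is handed over verbatim, so Language decides which interpreter, if any, to feed it to.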
Known gaps
Scripts attached to nodes pdfer does not surface in the schema are not extracted: decorative <draw> elements with events, <field> buttons with bind="none" other than AddAttachment, <pageArea>-level events, and individual <field> radio options collapsed into an <exclGroup>'s Options. Documented on the FormScript type comment. Callers needing full event fidelity should walk the raw XFA XML directly.

Verification
- go test ./... passes.
- tests/resources/nonivd_estar.pdf: 769 scripts emitted, bodies preserved verbatim (including comments, escapes, multi-line if/else chains), 621/904 questions carry script ID references.

Test plan
- go test ./...
- … <variables><script> branch, Question/Section script ID resolution, language defaulting, name/runAt plumbing)

🤖 Generated with Claude Code