Recover IQ extraction gaps: tag-discriminated unions + honest drop metrics #3

Merged
jlucaso1 merged 6 commits into main from richer-iq-recovery
Jun 6, 2026

jlucaso1 commented Jun 6, 2026

Contributor

Closes the known IQ extraction/codegen gaps surfaced after #2 — but first re-frames them: a parallel diagnosis found the alarming headline counts were mostly benign noise, not lost data. This recovers what's genuinely recoverable and makes the rest honest.

Recoveries (each: gate green + the regenerated reference .rs compiles clean against the consumer API)

  • Fix the parser-completeness validator — its INIT_FIELD regex captured the init value (tag: tag_items,) instead of the key, so 8 specs wrongly fell back to (). () responses 39 → 31.
  • Keep router-only MixinGroup modules in the mixin index — a router (pure if/return of branch mixins, no direct <iq>) was dropped, so its branches' xmlns/type never reached the request. Recovers the GroupProfilePictures w:g2 stanza and enriches the PushConfig set request. Also dedups same-named request variant names (a latent bug the enrichment exposed: six platform-led <config> alternatives all collapsed to one enum variant).
  • Request-anchored response fallback — an op whose response module ends differently (Ping's …ResponseServerResponse, not RPC/ResponseSuccess) now resolves by anchoring on the exact op name. +1 typed (Ping w:p).
  • Tag-discriminated response unions → enums (the big one) — same-tag, field-discriminated unions were dropped. New UnionShape::TagDiscriminated: each variant becomes a newtype enum arm over its own generated struct (recursively, incl. nested children/unions); the parser tries them first-success, gated by pinned attr values + required fields, with a separability gate that drops shapes where an earlier arm would shadow a later one rather than misclassify. Union enums 10 → 116, including the GroupInfo participant (Admin/NonAdmin) and newsletter message (text/media/poll/…) unions.
  • Reclassify benign mixin-fragment drops — 51 of 58 "unparseable" modules only export merge…Mixin combinators (partial <iq> folded into real requests by design; their data already lives in the concrete consumers). Add DropReason::MixinFragment so the headline reads 6 unparseable (+ 51 benign mixin fragments) instead of a misleading 58.

Numbers (vs the post-#2 baseline)

  • union enums 10 → 116
  • () responses 39 → 31
  • unparseable 58 → 6 genuine (+ 51 reclassified benign)
  • typed 135 → 137, degraded 20 → 19, stanzas 155 → 156

What is deliberately left (and why it's correct)

The diagnosis established most of the remaining "gap" is not loss: 51 benign mixin fragments + 5 non-request IQs + 7 legitimate set/ack responses (() is the faithful model) + a few protobuf/stateful responses (not representable in the declarative IR) + 5 same-tag unions that aren't separable (e.g. Error == ErrorFallback) — the separability gate drops those on purpose, the same fall-back-not-misclassify stance as the outcome-union path.

Contract

Mostly codegen-only (the reference .rs is gitignored): the IR contract changes only where the extractor genuinely improved — the GroupProfilePictures stanza + PushConfig children + Ping response, and the 51 relabeled reason strings. whatspec update --check stays idempotent; schema validation passes.

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced union type handling in code generation, including support for tag-discriminated unions.
    • Added tuple/newtype enum variant representation.
  • Bug Fixes

    • Fixed enum variant name collision handling through intelligent deduplication.
    • Improved response resolution with fallback logic for module naming variations.
    • Corrected validation regex for struct initialization fields.
  • Improvements

    • Enhanced diagnostics to distinguish benign schema fragments from genuine failures.

jlucaso1 added 5 commits June 6, 2026 13:30
…ver 8 () specs)

The parser-completeness validator extracted struct-init field names with a `,`-only
regex. A nested repeated grandchild is emitted as `tag: tag_items,`, so the regex
captured the VALUE (`tag_items`) instead of the KEY (`tag`); the validator then
thought `tag` was never initialized and wrongly fell the whole spec back to (). Match
the identifier before `:` OR `,`. The acceptance check is one-directional
(required is a subset of inited), so the extra value/Default matches are harmless.

Recovers 8 specs (() 39 -> 31): BotList outcome union + 7 newsletter/group success
structs. Verified: 45 wa-codegen tests + syn net pass; regenerated iq.rs compiles
clean against the real wacore crate; IR/contract unchanged.
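
To illustrate the key-versus-value distinction this commit describes, here is a minimal sketch that extracts struct-init keys without a regex. It is not the validator's actual code: `init_field_keys` and `all_required_inited` are hypothetical names, and the sketch assumes one field per line and plain (non-raw) identifiers.

```rust
use std::collections::BTreeSet;

/// Collect the leading identifier of every line that looks like a struct-init
/// field, i.e. `key: value,` or the shorthand `key,`.
/// Assumption: one field per line; raw identifiers (`r#type`) are not handled.
fn init_field_keys(body: &str) -> BTreeSet<String> {
    let mut keys = BTreeSet::new();
    for line in body.lines() {
        let line = line.trim_start();
        // Length of the leading identifier-like run.
        let ident_len = line
            .char_indices()
            .find(|(_, c)| !(c.is_ascii_alphanumeric() || *c == '_'))
            .map(|(i, _)| i)
            .unwrap_or(line.len());
        if ident_len == 0 {
            continue;
        }
        let (ident, rest) = line.split_at(ident_len);
        let rest = rest.trim_start();
        // Accept the identifier when it is followed by `:` OR `,`,
        // so `tag: tag_items,` yields the key `tag`, not the value.
        if rest.starts_with(':') || rest.starts_with(',') {
            keys.insert(ident.to_string());
        }
    }
    keys
}

/// One-directional acceptance check: every required field must be initialized.
/// Extra keys in the init body are ignored.
fn all_required_inited(required: &[&str], body: &str) -> bool {
    let inited = init_field_keys(body);
    required.iter().all(|r| inited.contains(*r))
}
```

Because the check only asks whether the required set is a subset of the initialized set, over-matching is harmless here, while under-matching (missing `tag`) is what caused the spurious `()` fallbacks.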
…ames

Extractor: mixin_index extract_fragment dropped router-only MixinGroup modules
(e.g. mergeBaseGetGroupOrServerMixinGroup: pure if/return of branch mixins, no
direct <iq>), so resolve()'s BFS couldn't bridge request -> router -> branch ->
base. Keep a router when it has merged_callees. Recovers the GroupProfilePictures
w:g2 get stanza (155 -> 156, unparseable 58 -> 57, typed 135 -> 136) and enriches
the PushConfig set request with its <config>/<clear> children (verified additive:
0 stanzas lost; a full request+response tree A/B shows only the 1 new + 1 enriched).
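
The bridging problem can be pictured with a small breadth-first search. This is a sketch under assumptions, not the extractor's code: the `Fragment` struct, the `resolve_xmlns` function, and the module names used with it are hypothetical, and only the `merged_callees` field name comes from the commit message.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical, simplified view of one entry in the mixin index.
struct Fragment {
    /// The xmlns this fragment contributes, if it builds an <iq> directly.
    xmlns: Option<&'static str>,
    /// Other mixins this fragment merges in (a router has only these).
    merged_callees: Vec<&'static str>,
}

/// Breadth-first walk from a request module through its merged callees,
/// returning the first xmlns found. A module missing from the index
/// (for example a dropped router) ends that path.
fn resolve_xmlns(index: &HashMap<&'static str, Fragment>, start: &str) -> Option<&'static str> {
    let mut queue: VecDeque<&str> = VecDeque::new();
    let mut seen: HashSet<&str> = HashSet::new();
    queue.push_back(start);
    seen.insert(start);
    while let Some(name) = queue.pop_front() {
        let Some(fragment) = index.get(name) else {
            continue; // not indexed: the chain is broken here
        };
        if let Some(ns) = fragment.xmlns {
            return Some(ns);
        }
        for &callee in &fragment.merged_callees {
            if seen.insert(callee) {
                queue.push_back(callee);
            }
        }
    }
    None
}
```

With a chain request -> router -> branch, the search reaches the branch only if the router is kept in the index, which is why a router with non-empty `merged_callees` has to be retained even though it has no direct `<iq>`.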

Codegen: the PushConfig enrichment exposed a latent variant-group bug — six
<config> alternatives all lead with a dynamic "platform" attr, so variant_name
resolved them all to "Platform", producing a duplicate enum variant with mismatched
match-arm fields (E0428/E0026/E0559). Dedup variant names within a group
(Platform/Platform2/Platform3...), computed once and shared by the enum def and the
build match.

Verified: 46 wa-codegen + 101 wa-scan tests; regenerated iq.rs compiles clean
against the real wacore crate (0 errors); --check idempotent.
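
A minimal sketch of the suffixing scheme (Platform/Platform2/Platform3...), assuming a simple per-name counter. The function name `dedup_variant_names_sketch` is hypothetical, and the sketch does not guard against an input that already contains a suffixed name such as `Platform2`.

```rust
use std::collections::HashMap;

/// Make variant names unique within one group by appending a stable numeric
/// suffix to the second and later occurrences of the same name.
fn dedup_variant_names_sketch(names: &[&str]) -> Vec<String> {
    let mut seen: HashMap<&str, usize> = HashMap::new();
    names
        .iter()
        .map(|&name| {
            let count = seen.entry(name).or_insert(0);
            *count += 1;
            let n = *count;
            if n == 1 {
                name.to_string()
            } else {
                format!("{}{}", name, n)
            }
        })
        .collect()
}
```

Computing this list once and reusing it for both the enum definition and the match arms keeps the two in agreement, which is the property the commit relies on.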
Pass 1 (RPC) and Pass 2 (ResponseSuccess) of the response index miss an op whose
response module ends differently — PingsClient's parser lives in
WASmaxInPingsClientResponseServerResponse, which is neither an RPC nor a
*ResponseSuccess. Add ResponseIndex::resolve_for_request_op(x): anchored on the
EXACT op name from the request, it finds a WASmaxIn<x>Response<V> whose variant V
is success-like (never reverse-deriving x by stripping "Response", which would
mis-split "…ServerResponse"), and parses it on demand. module.rs chains it after
get_by_x.

Recovers the Ping response (from: Jid, type, t: u64; guard type="result"):
typed 136 -> 137, degraded 20 -> 19. A/B confirms only makeClientRequest changed;
regenerated iq.rs compiles clean against real wacore; --check idempotent.
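
The anchoring idea can be sketched as a prefix match on the exact op name. Everything below is an assumption for illustration: `resolve_by_op_prefix` and `is_success_like` are hypothetical names, and the "success-like" rule (reject variants containing "Error") is a guess, not the rule used by `ResponseIndex`.

```rust
/// Hypothetical success test for a response variant suffix.
fn is_success_like(variant: &str) -> bool {
    !variant.contains("Error")
}

/// Find a response module named `WASmaxIn<op>Response<V>` by anchoring on the
/// exact op name taken from the request. The op name is never derived by
/// stripping "Response" from the module name, so a variant such as
/// "ServerResponse" cannot cause a wrong split.
fn resolve_by_op_prefix<'a>(op: &str, modules: &[&'a str]) -> Option<&'a str> {
    let prefix = format!("WASmaxIn{}Response", op);
    modules.iter().copied().find(|m| {
        m.strip_prefix(prefix.as_str())
            .map_or(false, |variant| !variant.is_empty() && is_success_like(variant))
    })
}
```

For the op `PingsClient`, the prefix is `WASmaxInPingsClientResponse`, and the remaining suffix `ServerResponse` is treated as the variant.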
…72 more)

The largest fidelity gap: same-tag field-discriminated unions were dropped by
classify_union. Add UnionShape::TagDiscriminated — variants that share one node's
tag but carry richer payloads (nested children/unions). Each variant becomes a
newtype enum arm over its own generated struct (collect_response_fields recurses,
producing child item structs + nested enums); the parser tries them first-success,
each arm a closure gated by its pinned attr values (Cow-safe `.as_deref()` guards)
and required fields, returning the first that parses. A separability gate (signature
subset + pinned-attr conflict, reusing parser_is_valid) drops shapes where an earlier
arm would shadow a later one (e.g. userFetch Error == ErrorFallback) rather than
misclassify.

collect_union now returns (RustField, Vec<RustEnum>, Vec<RustChildStruct>) so the
per-variant structs/enums reach the module-level emit; emit_response_parser was
factored into a reusable emit_struct_parser for the per-arm bodies. The repeated-child
path (collect + emit) now also recurses into a Vec<Item>'s union columns, so unions
nested under repeated children are reached.

Recovers 116 union enums (was 10), including the GroupInfo participant
(Admin/NonAdmin) and newsletter message (text/media/poll/…) unions. Pure codegen:
IR/contract unchanged, --check idempotent. 48 wa-codegen tests (3 new: participant,
attr-value, non-separable); the regenerated iq.rs compiles clean against the real
wacore crate (0 errors).
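
As a rough picture of the emitted shape, the sketch below hand-writes a tag-discriminated union with first-success parsing. It is not generated output: the `Node` type, the struct and field names, and the `type="admin"` pin are hypothetical, and attributes are plain `String`s rather than the `Cow` values the real guards handle.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a parsed stanza node.
struct Node {
    tag: String,
    attrs: HashMap<String, String>,
}

/// Test helper that builds a node from a tag and attribute pairs.
fn node(tag: &str, attrs: &[(&str, &str)]) -> Node {
    Node {
        tag: tag.to_string(),
        attrs: attrs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect(),
    }
}

#[derive(Debug, PartialEq)]
struct AdminParticipant {
    jid: String,
}

#[derive(Debug, PartialEq)]
struct NonAdminParticipant {
    jid: String,
}

/// Each variant is a newtype arm over its own struct.
#[derive(Debug, PartialEq)]
enum Participant {
    Admin(AdminParticipant),
    NonAdmin(NonAdminParticipant),
}

/// Try the arms in order and return the first one that parses.
fn parse_participant(node: &Node) -> Option<Participant> {
    if node.tag != "participant" {
        return None;
    }
    // Arm 1: gated by a pinned attribute value plus a required field.
    let admin = || -> Option<AdminParticipant> {
        if node.attrs.get("type").map(String::as_str) != Some("admin") {
            return None;
        }
        Some(AdminParticipant { jid: node.attrs.get("jid")?.clone() })
    };
    // Arm 2: only the required field, so it must come after the pinned arm.
    let non_admin = || -> Option<NonAdminParticipant> {
        Some(NonAdminParticipant { jid: node.attrs.get("jid")?.clone() })
    };
    admin()
        .map(Participant::Admin)
        .or_else(|| non_admin().map(Participant::NonAdmin))
}
```

Arm order matters in this scheme: if the unpinned arm came first it would accept every node the pinned arm accepts, which is the shadowing case the separability gate is meant to reject.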
…7 -> 6)

The "unparseable" count was dominated by noise: 51 of 57 are cross-module mixin
fragments (modules that export ONLY merge…Mixin combinators and build a partial
<iq> missing xmlns/type by design, folded into real requests via mergeStanzas).
Their data already lives in the concrete request consumers — they are not dropped
stanzas. Add DropReason::MixinFragment and classify a module as such at the
unresolved bail when its only functions are merge…Mixin (hoisting the export
computation above the bail). The CLI headline now reports genuine unresolved
(6) separately from benign fragments (51), and the manifest dropsByReason reflects
the accurate split.

Diagnostics only: no stanza added or removed; the no-silent-vanish invariant
(candidates = producing + unparseable) holds. Contract change is limited to the
51 relabeled `reason` strings in index.json + the manifest buckets.
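
A minimal sketch of the reclassification and the resulting headline. The `DropReason::MixinFragment` name comes from the commit. The `Unresolved` variant, the function names, and the name-based predicate (starts with `merge`, ends with `Mixin`) are assumptions for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DropReason {
    /// Module exports only merge…Mixin combinators: benign by design.
    MixinFragment,
    /// Hypothetical catch-all for a genuinely unresolved module.
    Unresolved,
}

/// Assumed naming rule for a mixin combinator export.
fn is_mixin_combinator(export: &str) -> bool {
    export.starts_with("merge") && export.ends_with("Mixin")
}

/// Classify a module at the unresolved bail from its exported function names.
fn classify_drop(exports: &[&str]) -> DropReason {
    if !exports.is_empty() && exports.iter().all(|e| is_mixin_combinator(e)) {
        DropReason::MixinFragment
    } else {
        DropReason::Unresolved
    }
}

/// Report genuine failures separately from benign fragments.
fn headline(drops: &[DropReason]) -> String {
    let benign = drops
        .iter()
        .filter(|d| **d == DropReason::MixinFragment)
        .count();
    let genuine = drops.len() - benign;
    format!("{} unparseable (+ {} benign mixin fragments)", genuine, benign)
}
```

Both buckets are still counted as drops, so a relabel like this changes only the reason attached to each module, not the candidate total.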

coderabbitai Bot commented Jun 6, 2026



ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: e7bf2595-6e29-4869-b587-62db836ad17b

📥 Commits

Reviewing files that changed from the base of the PR and between 86dd8b6 and 6350317.

📒 Files selected for processing (1)
  • crates/wa-codegen/src/union.rs
Walkthrough

This PR extends XMPP codegen to handle tag-discriminated union shapes, refactors response parsing for reuse, deduplicates variant names, and introduces mixin fragment tracking. Union codegen now classifies variants sharing a node tag plus pinned attribute values and emits multi-enum payloads. Fragment lifecycle tracking distinguishes benign mixin-only modules from genuine failures, enabling request-anchored response resolution for cross-module dependencies.

Changes

Union Shape Support and Codegen Integration

| Layer | File(s) | Summary |
| --- | --- | --- |
| Parser Refactoring and Visibility | crates/wa-codegen/src/spec.rs, crates/wa-codegen/src/emit.rs | emit_struct_parser becomes parameterized on node_var and struct_name; emit_response_parser delegates to it. Validation helpers exposed as pub(crate). Struct-init field-key regex fixed to recognize both explicit key: and shorthand key, forms including raw identifiers. |
| Union Type System | crates/wa-codegen/src/union.rs | UnionShape gains TagDiscriminated variant and new TagArm struct carrying per-variant pinned attribute discriminators and payload fields. Imports updated for tag-discriminated classification and emission. |
| Union Classification | crates/wa-codegen/src/union.rs | classify_tag_discriminated detects unions where all variants share the same node tag and are separable by pinned attr values plus required-field presence. Integrated into classify_union pipeline with conflict checking to avoid arm shadowing. |
| Union Code Emission | crates/wa-codegen/src/union.rs | collect_union return type expanded to (RustField, Vec<RustEnum>, Vec<RustChildStruct>) to emit module-level enums and per-variant child structs. emit_union_read adds tag-discriminated branch: descends to shared node, cascades through arms with pinned-attr guards, parses payloads via emit_struct_parser. |
| Field Collection Integration | crates/wa-codegen/src/fields.rs | collect_response_fields now uses expanded collect_union return, accumulating generated child structs. RustEnumVariant gains tuple_type field; emit_enum_def emits tuple-form variants when present. |
| Repeated Field Union Handling | crates/wa-codegen/src/emit.rs | Repeated-child parsing detects union-typed fields, collects union_kids, emits per-union reads, initializes payloads, and includes initializers in repeated item struct construction. |
| Variant Name Deduplication | crates/wa-codegen/src/emit.rs | dedup_variant_names produces collision-free identifiers via stable suffixing (Platform, Platform2, …). emit_variant_groups computes deduped list once and uses consistently in enum generation and match arms. |
| Union and Deduplication Tests | crates/wa-codegen/src/union.rs, crates/wa-codegen/src/emit.rs | Tests updated for new collect_union return signature. New coverage for tag-discriminated unions: successful recovery, pinned-attribute variants, non-separable rejection. Deduplication test asserts distinct variant identifiers and matching arm references. |

Mixin Fragment Lifecycle and Response Resolution

| Layer | File(s) | Summary |
| --- | --- | --- |
| Fragment Classification and Detection | crates/wa-scan/src/mixin_index.rs, crates/wa-scan/src/module.rs | DropReason::MixinFragment classifies modules with only merge…Mixin combinators as benign fragments. extract_fragment retains fragments with non-empty merged_callees even without direct smax("iq", …). scan_module_outcome detects fragment-only exports to differentiate from unresolved failures. |
| Response Resolution with Fallback | crates/wa-scan/src/response_index.rs, crates/wa-scan/src/module.rs | ResponseIndex stores in_slices for request-anchored lookups. New resolve_for_request_op reconstructs WASmaxIn<x>Response* lookups for operations without direct RPC entries. Module resolution tries get_by_x first, falls back to request-anchored method. |
| CLI Diagnostics | crates/whatspec/src/main.rs | push_iq log distinguishes benign mixin-fragment unparseable entries from genuine failures: reports fragment count and genuine unparseable count separately while preserving original stanza totals in diagnostics. |

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • oxidezap/whatspec#1: Both PRs modify response-codegen in crates/wa-codegen/src/emit.rs and crates/wa-codegen/src/fields.rs at the parser-generation and field-collection touchpoints.

Poem

🐰 Union shapes now tagged and split,
Fragments tracked through mixin kit,
Variants dedupe with matching care,
Response fallback everywhere,
Codegen refactored, sharp and fit!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly addresses the main changes: tag-discriminated unions (codegen recovery) and honest drop metrics (reclassification of benign failures), matching the PR's core objectives. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 86dd8b6836

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +203 to +204
if f.required && (f.method == "child" || f.method.starts_with("attr")) {
    s.insert(format!("REQ:{}", f.name));


P1 Badge Exclude non-failing repeated children from union signatures

When a tag-discriminated arm has a required repeated child, the generated parser only builds a Vec from get_children_by_tag and succeeds with an empty vector if the child is absent, so it is not a fail-on-absent discriminator. Counting every required child here lets the separability gate accept shapes where an earlier arm actually matches unconditionally; for example the generated newsletter views-count union has a first arm with a required repeated views_count child and a later deprecated views_count/count arm, but the first parser succeeds even without that child and shadows the deprecated arm, dropping the count.


…dex P1)

variant_signature counted every required child as a fail-on-absent discriminator,
but a repeated child reads as a possibly-empty Vec — the parser never bails when it
is absent. So an arm whose only required field is a repeated child matched
unconditionally and shadowed later arms, and the separability gate wrongly accepted
it. Concretely the newsletter views-count union (required repeated `views_count` vs
a deprecated `count` attr) was emitted with the deprecated arm unreachable.

Skip repeated children (and their contents) when building the signature, so such a
shape is correctly rejected (dropped, not misclassified). Drops the 2 views-count
unions (116 -> 114 enums); the rest are unaffected. Pure codegen: contract unchanged,
regenerated iq.rs compiles clean against the real wacore crate. Adds a unit test.
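
The shadowing check with repeated children excluded can be sketched as follows. This is a simplified model, not the code in `union.rs`: the `Field` struct, the `Signature` map, and the `field`, `signature`, `shadows`, and `is_separable` names are hypothetical, and the real gate also reuses `parser_is_valid`, which is not modeled here.

```rust
use std::collections::BTreeMap;

/// Hypothetical, flattened description of one field of a union arm.
struct Field {
    name: &'static str,
    required: bool,
    repeated: bool,
    /// A pinned attribute value, if the arm only matches that value.
    pinned: Option<&'static str>,
}

fn field(name: &'static str, required: bool, repeated: bool, pinned: Option<&'static str>) -> Field {
    Field { name, required, repeated, pinned }
}

/// Fail-on-absent requirements of an arm: field name -> optional pinned value.
type Signature = BTreeMap<&'static str, Option<&'static str>>;

/// Build the signature, skipping repeated children: they parse to a possibly
/// empty Vec, so their absence never makes the arm fail.
fn signature(fields: &[Field]) -> Signature {
    fields
        .iter()
        .filter(|f| f.required && !f.repeated)
        .map(|f| (f.name, f.pinned))
        .collect()
}

/// An earlier arm shadows a later one when every requirement of the earlier
/// arm is implied by the later arm's requirements.
fn shadows(earlier: &Signature, later: &Signature) -> bool {
    earlier.iter().all(|(name, pin)| match (pin, later.get(name)) {
        (_, None) => false,
        (None, Some(_)) => true,
        (Some(v), Some(later_pin)) => *later_pin == Some(*v),
    })
}

/// A union is separable when no earlier arm shadows any later arm.
fn is_separable(arms: &[Vec<Field>]) -> bool {
    let sigs: Vec<Signature> = arms.iter().map(|a| signature(a)).collect();
    for (i, earlier) in sigs.iter().enumerate() {
        for later in &sigs[i + 1..] {
            if shadows(earlier, later) {
                return false;
            }
        }
    }
    true
}
```

An arm whose only required field is a repeated child ends up with an empty signature, which shadows every later arm, so a shape like the views-count union is rejected rather than emitted with an unreachable arm.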
jlucaso1 merged commit ecd4fb5 into main Jun 6, 2026
2 checks passed

chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 63503176d4


Comment on lines +197 to +200
if f.field_type == ParsedFieldType::Union {
    if f.required {
        s.insert(format!("NESTED:{}", f.name));
    }


P2 Badge Exclude nested unions from discriminator signatures

When an earlier tag-discriminated arm is distinguished only by a required nested type=union, this signature lets classification accept it as fail-on-absent, but emit_union_read always produces an Option<Enum> and returns None for an absent or unrecognized nested union rather than bailing. In that scenario the earlier arm's per-variant parser still succeeds with the nested union set to None, so it shadows later arms even though the separability gate accepted the shape.

