feat(structural-parsing): cross-locale label resolution, OrderPage split, synthetic CHIPS structure (0.9.0–0.9.2) #104

Closed
Flummy1 wants to merge 8 commits into dev from feature/multi-source-label-index


Flummy1 (Contributor) commented May 2, 2026

Why this PR exists

SubcategoryStructure (introduced in #102) was built from data-fields JSON, so it indexed fields by their English IDs ('weapon', 'quantity', 'currency'). But every other surface the engine actually consumes (OfferPage.fields, the OrderPage param-list, the buyer's order form) renders localized labels ('Категория', 'количество', 'Тип валюты', 'Telegram Username', 'Логин Steam'). The consequence, measured live across successive test stands:

| Stage | Recall on OrderPage.lot_fields |
| --- | --- |
| baseline (before this PR) | ~12% (1/8 fields, 32-purchase / 8-subcategory stand) |
| after b027046 (multi-source aliases + offerEdit-side seeding) | 59% raw → 80% after enrich_from_offer |
| after 8d979c5 (synthetic CHIPS + composite labels, 0.9.0) | targeting ≥95% |
| after 54e4d56 (no-HTTP enrichment paths + provenance, 0.9.1) | 75% on 62-purchase stand |
| after e78cc88 (delivery-fields classification + context lookup + merge, 0.9.2) | targeting ≥90% |

Three orthogonal contract issues blocked engine-side integration and were addressed alongside the recall work:

  1. OrderPage.data mixed three categories of data — order metadata ('игра', 'сумма'), lot-config fields ('регион', 'количество usd'), and per-order delivery-contract data the buyer types at checkout ('telegram username', 'логин steam'). Lookups against the flat dict risked false positives across all three.
  2. FieldCondition / SubcategoryStructure weren't FunPayObjects — broke pydantic from_attributes validators in funpaybotengine.
  3. CHIPS subcategories without a lot-fields block produced no structure — even though every offer's other_data carried structured (field_id → value_id) pairs.

This PR addresses the whole pipeline.


Commit chronology

| # | sha | What it adds |
| --- | --- | --- |
| 1 | b027046 | Foundational work: SubcategoryFieldDef.aliases, multi-source label_map, lookup_field_id / add_alias / enrich_from_offer / enrich_from_offer_fields, OrderPage data split into metadata + lot_fields, FunPayObject unification of FieldCondition and SubcategoryStructure. |
| 2 | 98dfc96 | Quantity-style SELECT/DROPDOWN options ('20 USD', '13 звёзд') collapse to bare int in enrich_from_offer value matching since the unit is already encoded in field_id. |
| 3 | 425e447 | _normalize_option (casefold + strip emoji/symbols + collapse ws + strip outer punct) bridges decorated values like 'RUB🔥' to clean options. __post_init__ on SubcategoryFieldDef auto-adds label to aliases (single source of truth; parser-side wiring becomes a one-liner). New enrich_from_offer_fields cross-source helper. |
| 4 | 71be2f3 | Condition-aware option matching: enrich_from_offer skips NUMERIC_RANGE candidates whose conditions are not satisfied by the already-resolved context. Fixes the live regression on subcat #1316. |
| 5 | 8d979c5 | Synthetic SubcategoryStructure for CHIPS subcategories without lot-fields (built from union of OfferPreview.other_data keys/values; opt-in flag). Composite-label expansion in _split_order_data ('количество usd' = '20 USD' → also 'usd' = '20'; opt-out flag). derived_from field + is_synthetic property. Bump to 0.9.0. |
| 6 | 54e4d56 | Two new no-HTTP enrichment paths: enrich_from_order_page(OrderPage) and enrich_from_offer_previews(Iterable[OfferPreview]). Adds AliasSource enum + side-channel _alias_sources map; add_alias gains a source= kwarg (default USER, backward-compatible); new alias_source() / forget_aliases_from(source). Bump to 0.9.1. |
| 7 | e78cc88 | Delivery-contract classification: OfferPage.delivery_fields_spec, SubcategoryStructure.delivery_fields + enrich_delivery_fields_from_offer, OrderPage.delivery_fields + static ORDER_DELIVERY_LABELS blacklist, reclassify_with_structure. Context-aware lookup_field_id(context=…). SubcategoryStructure.merge_from. Bump to 0.9.2. |
| 8 | 8da34ad | Refactor: remove _parse_title_fields, _TITLE_SUFFIX_TYPES, OfferPreview.parse_title_fields, OrderPreview.parse_title_fields, and their tests. Title parsing moves to a dedicated external package. _normalize_option / _DECORATION_RE are preserved; they are still used by enrich_from_offer and enrich_from_order_page. |

Areas changed

1. Multi-source label index — funpayparsers/types/subcategory_structure.py

The label/lookup machinery so a structure built from English data-fields IDs still resolves localized labels seen elsewhere.

  • SubcategoryFieldDef.aliases: set[str] — additional casefolded label aliases.
  • SubcategoryFieldDef.__post_init__ casefolds every alias entry, drops empties, and auto-adds the field's own label as an alias. Single source of truth for label registration.
  • SubcategoryStructure.label_map (@cached_property) indexes both f.label (as-is, possibly localized) and every entry in f.aliases. Per-field seen set guards against label itself being one of the aliases.
  • SubcategoryStructure.lower_label_map (@cached_property) merges duplicates produced by casefolding without re-introducing already-listed field IDs.
  • SubcategoryStructure.add_alias(field_id, alias, source=...) — registers a casefolded alias and busts both cached label maps via self.__dict__.pop(...).
  • SubcategoryStructure.enrich_from_offer_fields(offer_fields) — registers the canonical localized labels from an authenticated OfferFields.field_schema (offerEdit page) as aliases on a structure built from the public listing form, bridging the listing-form / offer-page locale gap. Returns self for chaining.
  • SubcategoryStructure.enrich_from_offer(offer) — heuristic enrichment by SELECT/DROPDOWN option value: registers label as alias only if exactly one field matches (after _normalize_option on both sides). Returns self.
  • SubcategoryStructure.from_offer_fields propagates offer_fields.raw_source into the new structure.

Parser side (funpayparsers/parsers/offer_fields_parser.py): only aliases={field_id} is passed; the localized <label> text flows in through __post_init__.
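
To make the alias machinery concrete, here is a minimal sketch of the idea, not the code from this PR. The class names FieldDef and Structure are hypothetical stand-ins for SubcategoryFieldDef and SubcategoryStructure, and the index is simplified to a single casefolded map.

```python
from dataclasses import dataclass, field
from functools import cached_property
from typing import Optional


@dataclass
class FieldDef:  # hypothetical stand-in for SubcategoryFieldDef
    id: str
    label: str = ""
    aliases: set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        # Casefold every alias, drop empties, then auto-add the field's label.
        cleaned = {a.casefold().strip() for a in self.aliases}
        cleaned.discard("")
        if self.label.strip():
            cleaned.add(self.label.casefold().strip())
        self.aliases = cleaned


@dataclass
class Structure:  # hypothetical stand-in for SubcategoryStructure
    fields: list[FieldDef] = field(default_factory=list)

    @cached_property
    def lower_label_map(self) -> dict[str, list[str]]:
        # casefolded label or alias -> field ids, in declaration order
        index: dict[str, list[str]] = {}
        for f in self.fields:
            for key in sorted(f.aliases):
                ids = index.setdefault(key, [])
                if f.id not in ids:
                    ids.append(f.id)
        return index

    def add_alias(self, field_id: str, alias: str) -> None:
        key = alias.casefold().strip()
        for f in self.fields:
            if f.id == field_id and key:
                f.aliases.add(key)
        # Bust the cached index so the next lookup sees the new alias.
        self.__dict__.pop("lower_label_map", None)

    def lookup_field_id(self, label: str) -> Optional[str]:
        ids = self.lower_label_map.get(label.casefold().strip(), [])
        return ids[0] if len(ids) == 1 else None
```

Popping the key from `__dict__` is the usual way to invalidate a `cached_property`: the next attribute access recomputes the map.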

2. Context-aware lookup_field_id — 0.9.2

SubcategoryStructure.lookup_field_id(label, *, context=None) now disambiguates ambiguous matches against already-resolved fields. Some FunPay subcategories declare two fields under the same form-locale label — most commonly a quantity SELECT and a quantity2 NUMERIC_RANGE both labelled 'Количество робуксов', where the latter is gated on quantity='другое количество'. Previously the lookup returned None on ambiguity and get_structured_fields silently picked first-by-declaration.

  • Per-candidate scoring against FieldCondition state in context (mapping {field_id: value} of fields already resolved this pass):
    • 2 — has conditions, all satisfied;
    • 1 — has no conditions (always-visible);
    • 0 — has conditions, at least one unsatisfied.
  • Returns None on miss, None on top-tier tie or all-zero scoring (still ambiguous), the winning id otherwise.
  • Backward-compatible: when context is omitted and the label is ambiguous, returns None exactly like before.

OrderPage.get_structured_fields is rewritten as an iterative resolve that feeds the accumulating result back as context on each step, with a fallback to first-by-declaration when context is insufficient (preserves legacy behaviour for callers with no per-field disambiguation needs).
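
The scoring rule can be sketched as follows. This is an illustration under simplifying assumptions, not the PR's implementation: a condition is reduced to a hypothetical (field_id, required_value) pair, a candidate to (field_id, conditions), and the helper names score and pick are invented for the example.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical simplified shapes (assumptions for this sketch only).
Condition = tuple[str, str]                  # (field_id, required value)
Candidate = tuple[str, Sequence[Condition]]  # (field_id, its conditions)


def score(conditions: Sequence[Condition], context: Mapping[str, str]) -> int:
    if not conditions:
        return 1  # always-visible field
    satisfied = all(
        str(context.get(fid, "")).casefold() == value.casefold()
        for fid, value in conditions
    )
    return 2 if satisfied else 0


def pick(
    candidates: Sequence[Candidate],
    context: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    if not candidates:
        return None  # miss
    if len(candidates) == 1:
        return candidates[0][0]  # unique label, no disambiguation needed
    if context is None:
        return None  # ambiguous and no context: legacy behaviour
    scored = [(score(conds, context), fid) for fid, conds in candidates]
    best = max(s for s, _ in scored)
    winners = [fid for s, fid in scored if s == best]
    if best == 0 or len(winners) != 1:
        return None  # all-zero scoring or top-tier tie
    return winners[0]
```

With the 'Количество робуксов' example above, the conditional quantity2 candidate wins only once quantity has already resolved to 'другое количество'; otherwise the unconditional quantity field is chosen.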

3. Value normalization — _DECORATION_RE / _normalize_option

Sellers commonly inject decorations into rendered values that are absent from the canonical filter-form options ('RUB🔥' vs option 'RUB', 'По логину🔥' vs 'По логину', '★★★Premium★★★' vs 'Premium').

  • _DECORATION_RE covers extended pictographs (emoji proper), misc symbols + dingbats, regional indicators (flags), misc technical (⌚ ⌛ ⏰), misc symbols and arrows (⭐ ⬆), variation selectors, ZWJ. Latin / Cyrillic / digits / interior punctuation are deliberately preserved.
  • _normalize_option(s) — casefold + strip decorations + collapse whitespace + strip outer punctuation.
  • Applied in enrich_from_offer and enrich_from_order_page (both sides of the value comparison).
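
A rough sketch of this kind of normalization is shown below. The character ranges only approximate the categories listed above, and the real _DECORATION_RE may well differ; the names _DECORATION and normalize_option are placeholders for the example.

```python
import re
import string

# Approximate decoration ranges (an assumption, not the PR's actual regex).
_DECORATION = re.compile(
    "["
    "\U0001F000-\U0001FAFF"  # pictographs, emoticons, regional indicators
    "\u2300-\u23FF"          # miscellaneous technical
    "\u2600-\u27BF"          # miscellaneous symbols + dingbats
    "\u2B00-\u2BFF"          # miscellaneous symbols and arrows
    "\uFE00-\uFE0F"          # variation selectors
    "\u200D"                 # zero-width joiner
    "]"
)


def normalize_option(s: str) -> str:
    s = _DECORATION.sub("", s.casefold())      # casefold, drop decorations
    s = " ".join(s.split())                    # collapse whitespace
    return s.strip(string.punctuation + " ")   # strip outer ASCII punctuation
```

Both sides of a comparison go through the same function, so 'RUB🔥' and the clean option 'RUB' meet at 'rub'.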

4. OrderPage.data split: metadata / lot_fields / delivery_fields

The previous flat data dict mixed RU/EN/UA-localized order metadata with everything else. After b027046 it became (metadata, lot_fields); after e78cc88 it splits into three disjoint dicts:

  • metadata — eight canonical keys (game, category, short_description, detailed_description, amount, open, closed, total), exposed via the public constant ORDER_METADATA_LABELS.
  • lot_fields — lot-config fields and the input for get_structured_fields. Composite labels like 'количество usd' = '20 USD' are additionally indexed under the trailing currency id ('usd' = '20') so the structure's usd/rub/… fields can resolve them. setdefault semantics on synthetic-key collisions.
  • delivery_fields (new in 0.9.2). Per-order data the buyer typed into the order form (Telegram username, Steam login, character name, email, …). Routed via the public constant ORDER_DELIVERY_LABELS (static blacklist of common casefolded labels). Live data showed 10 of 62 stand orders losing recall to misclassified delivery labels: 'telegram username' × 5 on subcat #2418 (Telegram Stars), 'логин steam' × 5 on subcat #1086 (Steam).

_split_order_data signature is now:

_split_order_data(
    data: dict[str, str],
    expand_composite: bool = True,
    extra_delivery_labels: frozenset[str] = frozenset(),
) -> tuple[dict[str, str], dict[str, str], dict[str, str]]

The new extra_delivery_labels kwarg lets callers pass a casefolded harvest of SubcategoryStructure.delivery_fields.values() for high-precision per-subcategory classification (extends, does not replace, the static blacklist). All three returned dicts are pairwise disjoint by key.
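
For illustration, a three-way split with composite expansion might look like the sketch below. The label tables are tiny hypothetical excerpts, the composite regex is a guess at "label ending in a currency id", and casefolding the output keys is an assumption; none of this is the PR's actual _split_order_data.

```python
import re

# Hypothetical, heavily trimmed label tables (the real constants are larger).
METADATA_LABELS = {"игра": "game", "game": "game", "категория": "category"}
DELIVERY_LABELS = frozenset({"telegram username", "логин steam"})
_COMPOSITE = re.compile(r"^.+\s([a-z]{3})$")  # label ending in a currency id
_LEADING_INT = re.compile(r"^(\d+)\b")        # numeric magnitude of the value


def split_order_data(data, expand_composite=True,
                     extra_delivery_labels=frozenset()):
    metadata, lot_fields, delivery = {}, {}, {}
    delivery_labels = DELIVERY_LABELS | extra_delivery_labels
    for label, value in data.items():
        key = label.casefold().strip()
        if key in METADATA_LABELS:
            metadata[METADATA_LABELS[key]] = value
        elif key in delivery_labels:
            delivery[key] = value
        else:
            lot_fields[key] = value
    if expand_composite:
        # 'количество usd' = '20 USD' is also indexed as 'usd' = '20'.
        for key, value in list(lot_fields.items()):
            label_match = _COMPOSITE.match(key)
            value_match = _LEADING_INT.match(value.strip())
            if label_match and value_match:
                # setdefault: a pre-existing key wins on collision.
                lot_fields.setdefault(label_match.group(1),
                                      value_match.group(1))
    return metadata, lot_fields, delivery
```

Passing a per-subcategory harvest through extra_delivery_labels widens the static set rather than replacing it, which matches the "extends, does not replace" contract described above.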

OrderPage.reclassify_with_structure(structure) re-splits data using the structure's delivery_fields for high-precision routing — useful when the page was originally parsed without a structure on hand and the caller has since obtained one.

OrderPage.metadata / lot_fields / delivery_fields are additional fields on the dataclass; data is preserved verbatim as the union for back-compat. All metadata properties (short_description, full_description, amount, open_date_text, close_date_text, order_category_name, order_subcategory_name, order_total) read from metadata directly.

Wired in funpayparsers/parsers/page_parsers/order_page_parser.py:

  • New option OrderPageParsingOptions.expand_composite_lot_labels: bool = True (opt-out for callers asserting len(lot_fields)).
  • Splits data once after the param-list loop and wires all three resulting dicts into the constructed OrderPage.

5. Buyer order form parsing → OfferPage.delivery_fields_spec — 0.9.2

The authoritative source for "which OrderPage.data labels are buyer-typed delivery inputs vs lot-config" is the offer's own <form action="/orders/new">. Each <div class="form-group"> containing a <label class="control-label"> and a named <input> / <select> is one delivery field.

  • New parser-private constants:
    • _DELIVERY_FORM_RESERVED_NAMES: csrf_token, type, preview, offer_id, price_guard, username (Chrome autofill bug-fix dummy), method (payment dropdown), amount, sum.
    • _DELIVERY_FORM_SKIP_CLASSES: multiple-purchase-switcher, offer-calc-box, js-price-row, js-order-prices.
  • New private helper _parse_delivery_fields_spec(form_node) — returns {input_name: label}. Skips reserved-name / skip-class / type="hidden" / unlabeled groups. First named input per form-group wins (setdefault).
  • New field OfferPage.delivery_fields_spec: dict[str, str]. Wired in OfferPageParser._parse via page_content.css_first('form[action$="orders/new"]', strict=False); gracefully empty when the form is absent (anonymous offer pages).

SubcategoryStructure.delivery_fields: dict[str, str] (also new in 0.9.2) accumulates specs across observed offers via enrich_delivery_fields_from_offer(offer). First-seen-label wins so the label seen on the first observed offer stays stable across the subcategory. The accumulated dict feeds OrderPage.reclassify_with_structure for high-precision per-order routing.
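
The form-group walk can be approximated with the standard library alone. The sketch below uses html.parser instead of the parser backend the library actually uses, expects to be fed only the order form's markup, and treats "first named input" as the first one that is neither reserved nor hidden; DeliverySpecParser and parse_delivery_fields_spec are hypothetical names.

```python
from html.parser import HTMLParser

RESERVED = {"csrf_token", "type", "preview", "offer_id", "price_guard",
            "username", "method", "amount", "sum"}
SKIP_CLASSES = {"multiple-purchase-switcher", "offer-calc-box",
                "js-price-row", "js-order-prices"}


class DeliverySpecParser(HTMLParser):
    """Collects {input_name: label} from div.form-group blocks."""

    def __init__(self) -> None:
        super().__init__()
        self.spec = {}
        self._reset_group()

    def _reset_group(self) -> None:
        self._depth, self._skip = 0, False  # div nesting inside the group
        self._label, self._name, self._in_label = "", None, False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        classes = set((a.get("class") or "").split())
        if tag == "div":
            if self._depth:
                self._depth += 1
            elif "form-group" in classes:
                self._depth = 1
                self._skip = bool(classes & SKIP_CLASSES)
        elif not self._depth:
            return  # outside any form-group
        elif tag == "label" and "control-label" in classes:
            self._in_label = True
        elif tag in ("input", "select") and self._name is None:
            name = a.get("name")
            if name and name not in RESERVED and a.get("type") != "hidden":
                self._name = name

    def handle_data(self, data):
        if self._in_label:
            self._label += data

    def handle_endtag(self, tag):
        if tag == "label":
            self._in_label = False
        elif tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:  # form-group closed: emit or drop it
                label = " ".join(self._label.split())
                if self._name and label and not self._skip:
                    self.spec.setdefault(self._name, label)
                self._reset_group()


def parse_delivery_fields_spec(form_html: str) -> dict:
    parser = DeliverySpecParser()
    parser.feed(form_html)
    return parser.spec
```

Accumulating the result across offers with dict.setdefault gives the first-seen-label-wins behaviour described for SubcategoryStructure.delivery_fields.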

6. Synthetic SubcategoryStructure for CHIPS — subcategory_page_parser.py

CHIPS-listing pages (e.g. #173) often have no div.lot-fields block, so no authoritative structure can be parsed. But each CHIPS offer's other_data already carries structured (field_id → value_id) pairs ({'server': 12448, 'side': 2}) — enough to synthesize one. Out of scope: OFFERS subcategories without lot-fields (e.g. #3789 ChatGPT) carry only order metadata in OfferPage.fields, nothing structural to infer.

  • New private helper _synthesize_chips_structure(subcategory_id, offers) — builds a SubcategoryStructure from the union of OfferPreview.other_data keys/values across the parsed previews. Each distinct key becomes a SubcategoryFieldDef(type=SELECT) whose options are the first-seen values. Labels come from OfferPreview.other_data_names, with field_id as fallback.
  • New opt-in option SubcategoryPageParsingOptions.fallback_structure_from_chips_offers: bool = False.
  • Integration in SubcategoryPageParser._parse: subcategory id is computed once; offers are parsed before the synth fallback; when structure is None, subcategory_type is CHIPS, and the option is set, synthesizes a structure from the parsed offers.

Provenance is exposed on the structure itself:

  • SubcategoryStructure.derived_from: Literal['lot_fields', 'chips_offers'], default 'lot_fields'. The default preserves back-compat for every existing callsite.
  • SubcategoryStructure.is_synthetic property — True iff derived_from != 'lot_fields'. Replaces sniffing raw_source for provenance.
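
As a rough picture of the synthesis step, the sketch below folds a list of previews into SELECT-like field definitions. Preview and SelectField are hypothetical stand-ins for OfferPreview and SubcategoryFieldDef, and other_data_names is assumed to map field_id to a display label; empty-input handling in the real helper may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Preview:  # hypothetical stand-in for OfferPreview
    other_data: dict
    other_data_names: dict = field(default_factory=dict)


@dataclass
class SelectField:  # hypothetical stand-in for a SELECT SubcategoryFieldDef
    id: str
    label: str
    options: list = field(default_factory=list)


def synthesize_fields(previews):
    fields = {}
    for preview in previews:
        for field_id, value in preview.other_data.items():
            current = fields.get(field_id)
            if current is None:
                # Label from other_data_names, falling back to the field id.
                label = preview.other_data_names.get(field_id) or field_id
                current = fields[field_id] = SelectField(field_id, label)
            if value not in current.options:
                current.options.append(value)  # first-seen ordering
    return list(fields.values())
```

A structure assembled this way would carry derived_from='chips_offers', so consumers can tell it apart from an authoritative one via is_synthetic.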

7. No-HTTP enrichment paths + alias provenance — 0.9.1

enrich_from_offer(OfferPage) covers the post-purchase OfferPage path, but two common flows still hit the structure with locale-mismatched labels: completed-order processing (where the engine has an OrderPage in hand from a NEW_ORDER message but no OfferPage) and listing-page batch enrichment (where every OfferPreview.other_data already carries structured (field_id → value) pairs — datapoints that cost zero extra HTTP).

A second concern: with multiple enrich-from-* sources feeding the same aliases set, debugging "why was this alias added?" and selectively invalidating one source's contributions become real needs. Solved with a side-channel provenance map.

  • New enum AliasSource: LABEL / LISTING / OFFER_EDIT / OFFER_PAGE / ORDER_PAGE / OFFER_PREVIEW / USER. Exported via __all__.
  • New field SubcategoryStructure._alias_sources: dict[(field_id, alias_cf), AliasSource]. Side-channel rather than a per-SubcategoryFieldDef change so the public aliases: set[str] API stays untouched.
  • New SubcategoryStructure.__post_init__ seeds existing field aliases (auto-added by SubcategoryFieldDef.__post_init__ from the field's label) with AliasSource.LABEL.
  • add_alias(field_id, alias, source=AliasSource.USER) — gained a source kwarg with a backward-compatible default. Re-registering an existing alias updates the recorded source.
  • New alias_source(field_id, alias) -> AliasSource | None — provenance lookup.
  • New forget_aliases_from(source) -> int — drops every alias previously registered with source, returning the count removed.
  • Existing enrich_from_offer and enrich_from_offer_fields now tag their additions with OFFER_PAGE and OFFER_EDIT respectively.

New no-HTTP enrichment paths:

  • enrich_from_order_page(order: OrderPage) — value-match heuristic on order.lot_fields, mirror of enrich_from_offer. Source: ORDER_PAGE.
  • enrich_from_offer_previews(offers: Iterable[OfferPreview]) — for each preview, registers other_data_names[field_id] as an alias for any field_id already in self.fields. Source: OFFER_PREVIEW.
  • enrich_delivery_fields_from_offer(offer) (0.9.2) — accumulates OfferPage.delivery_fields_spec entries into SubcategoryStructure.delivery_fields with first-seen-label-wins semantics.
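
The side-channel provenance map can be pictured with a small sketch. AliasBook is a hypothetical container standing in for the relevant part of SubcategoryStructure, and the enum member values are placeholders; only the member names come from this PR.

```python
from enum import Enum


class AliasSource(Enum):  # member values are placeholders for this sketch
    LABEL = "label"
    LISTING = "listing"
    OFFER_EDIT = "offer_edit"
    OFFER_PAGE = "offer_page"
    ORDER_PAGE = "order_page"
    OFFER_PREVIEW = "offer_preview"
    USER = "user"


class AliasBook:  # hypothetical stand-in for the alias part of the structure
    def __init__(self):
        self.aliases = {}   # field_id -> set of casefolded aliases
        self.sources = {}   # (field_id, casefolded alias) -> AliasSource

    def add_alias(self, field_id, alias, source=AliasSource.USER):
        key = alias.casefold()
        self.aliases.setdefault(field_id, set()).add(key)
        # Re-registering an existing alias updates the recorded source.
        self.sources[(field_id, key)] = source

    def alias_source(self, field_id, alias):
        return self.sources.get((field_id, alias.casefold()))

    def forget_aliases_from(self, source):
        doomed = [k for k, s in self.sources.items() if s is source]
        for field_id, key in doomed:
            self.aliases[field_id].discard(key)
            del self.sources[(field_id, key)]
        return len(doomed)
```

Keeping provenance in a separate map is what lets the public aliases set stay a plain set[str] while still supporting selective invalidation per source.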

8. SubcategoryStructure.merge_from — 0.9.2

Combines two structures into one. Used by the engine to layer synthetic structure (incomplete but cheap, from CHIPS listing) under an authoritative from_offer_fields structure, or to hydrate persistent-cached aliases on top of a freshly parsed structure.

  • For each field_id only in other — deep-copies the whole SubcategoryFieldDef and carries over _alias_sources entries.
  • For each field_id in both — unions the alias sets only. self is treated as the authoritative source for label, options, type, and conditions (other's values for these are ignored).
  • Provenance of newly-added aliases is taken from other's _alias_sources, falling back to AliasSource.USER.
  • delivery_fields are unioned with the same first-seen-wins semantic.
  • Invalidates label_map / lower_label_map caches.
  • Does not mutate other. Returns self for chaining.
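
The merge rules above can be sketched as follows. FieldDef and Structure are hypothetical simplified stand-ins keyed by field_id, and the sketch leaves out alias provenance and cache invalidation to stay short.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class FieldDef:  # hypothetical stand-in for SubcategoryFieldDef
    id: str
    label: str = ""
    options: list = field(default_factory=list)
    aliases: set = field(default_factory=set)


@dataclass
class Structure:  # hypothetical stand-in for SubcategoryStructure
    fields: dict = field(default_factory=dict)           # field_id -> FieldDef
    delivery_fields: dict = field(default_factory=dict)  # input name -> label

    def merge_from(self, other: "Structure") -> "Structure":
        for field_id, theirs in other.fields.items():
            mine = self.fields.get(field_id)
            if mine is None:
                # Only in other: take an independent deep copy.
                self.fields[field_id] = copy.deepcopy(theirs)
            else:
                # In both: union aliases only; self stays authoritative
                # for label, options, type, and conditions.
                mine.aliases |= theirs.aliases
        for name, label in other.delivery_fields.items():
            self.delivery_fields.setdefault(name, label)  # first-seen wins
        return self
```

Because new fields are deep-copied and overlapping ones are only read, other is left untouched and can be merged into several structures.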

9. FieldCondition and SubcategoryStructure unified under FunPayObject

In funpaybotengine PR #35 (commit d66422f), pydantic wrappers built via model_validate(..., from_attributes=True) choked on FieldCondition and SubcategoryStructure because they were plain @dataclasses with no raw_source. The engine had to add an asdict() shim in _add_raw_source.

  • FieldCondition extends FunPayObject with raw_source: str = field(default='', compare=False). All other fields gain defaults so existing kw-only callers keep working.
  • SubcategoryStructure similarly.
  • OfferFieldsParser._parse_conditions stores the raw condition JSON on each FieldCondition via json.dumps(cond, ensure_ascii=False).
  • SubcategoryPageParser passes lot_fields_div.html as raw_source to the constructed SubcategoryStructure.

The engine's asdict() shim can be removed after this PR is released.
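
A small sketch shows why the other fields need defaults once raw_source moves into a shared base, and what compare=False buys. BaseObject and Condition are hypothetical stand-ins for FunPayObject and FieldCondition, and the field names are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class BaseObject:  # hypothetical stand-in for FunPayObject
    # compare=False keeps the raw source out of == comparisons.
    raw_source: str = field(default="", compare=False)


@dataclass
class Condition(BaseObject):  # hypothetical stand-in for FieldCondition
    # The inherited raw_source has a default, so every field declared
    # after it must have one too (a dataclass ordering rule).
    field_id: str = ""
    value: str = ""


parsed = Condition(field_id="quantity", value="другое количество",
                   raw_source='{"id": "quantity"}')
handmade = Condition(field_id="quantity", value="другое количество")
```

Here parsed == handmade holds even though only one of them carries a raw_source, so tests that build conditions by hand keep comparing equal to parsed ones.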

10. Version bump — pyproject.toml

0.8.0 → 0.8.2 → 0.8.3 → 0.9.0 → 0.9.1 → 0.9.2.

The 0.9.0 minor bump covered composite-label expansion (additive, but a contract change for callers asserting lot_fields dict size). The 0.9.1 patch bump was strictly additive. The 0.9.2 patch bump is similarly additive at the dataclass level, but _split_order_data now returns a 3-tuple — a breaking change for any caller importing that private symbol directly (none in tree besides the parser and updated tests).


Public API surface added or changed

| Symbol | Where | Kind |
| --- | --- | --- |
| SubcategoryFieldDef.aliases | types/subcategory_structure.py | new field |
| SubcategoryStructure.derived_from | types/subcategory_structure.py | new field |
| SubcategoryStructure.delivery_fields | types/subcategory_structure.py | new field (0.9.2) |
| SubcategoryStructure.is_synthetic | types/subcategory_structure.py | new property |
| SubcategoryStructure.lookup_field_id(label, *, context=None) | types/subcategory_structure.py | new method (context kwarg in 0.9.2) |
| SubcategoryStructure.add_alias | types/subcategory_structure.py | new method |
| SubcategoryStructure.add_alias(source=...) | types/subcategory_structure.py | new kwarg (0.9.1, default USER) |
| SubcategoryStructure.enrich_from_offer | types/subcategory_structure.py | new method |
| SubcategoryStructure.enrich_from_offer_fields | types/subcategory_structure.py | new method |
| SubcategoryStructure.enrich_from_order_page | types/subcategory_structure.py | new method (0.9.1) |
| SubcategoryStructure.enrich_from_offer_previews | types/subcategory_structure.py | new method (0.9.1) |
| SubcategoryStructure.enrich_delivery_fields_from_offer | types/subcategory_structure.py | new method (0.9.2) |
| SubcategoryStructure.merge_from | types/subcategory_structure.py | new method (0.9.2) |
| SubcategoryStructure.alias_source | types/subcategory_structure.py | new method (0.9.1) |
| SubcategoryStructure.forget_aliases_from | types/subcategory_structure.py | new method (0.9.1) |
| AliasSource | types/subcategory_structure.py | new enum (0.9.1) |
| OfferPage.delivery_fields_spec | types/pages/offer_page.py | new field (0.9.2) |
| OrderPage.metadata | types/pages/order_page.py | new field |
| OrderPage.lot_fields | types/pages/order_page.py | new field |
| OrderPage.delivery_fields | types/pages/order_page.py | new field (0.9.2) |
| OrderPage.reclassify_with_structure | types/pages/order_page.py | new method (0.9.2) |
| ORDER_METADATA_LABELS | types/pages/order_page.py | new public constant |
| ORDER_DELIVERY_LABELS | types/pages/order_page.py | new public constant (0.9.2) |
| OrderPageParsingOptions.expand_composite_lot_labels | parsers/page_parsers/order_page_parser.py | new option (default True) |
| SubcategoryPageParsingOptions.fallback_structure_from_chips_offers | parsers/page_parsers/subcategory_page_parser.py | new option (default False) |
| FieldCondition | types/subcategory_structure.py | now a FunPayObject (gained raw_source) |
| SubcategoryStructure | types/subcategory_structure.py | now a FunPayObject (gained raw_source) |

Backward compatibility

  • OrderPage.data is kept verbatim — metadata, lot_fields, delivery_fields are additional fields. data is the union for any existing caller.
  • FieldCondition / SubcategoryStructure continue to construct via the same kw-only call patterns; new raw_source defaults to ''.
  • SubcategoryStructure.derived_from defaults to 'lot_fields', so any existing test or wrapper that constructs structures directly produces non-synthetic ones. delivery_fields defaults to {}.
  • OfferFieldsParser still emits the same SubcategoryFieldDef objects; the parser explicitly seeds aliases={field_id} and the localized label flows in via __post_init__.
  • OrderPageParsingOptions.expand_composite_lot_labels defaults to True — preserves the recall-boosting semantics. Callers asserting len(lot_fields) can opt out.
  • SubcategoryPageParsingOptions.fallback_structure_from_chips_offers defaults to False — opt-in only; CHIPS callers that previously got structure=None continue to get structure=None.
  • lookup_field_id(label) without a context kwarg behaves exactly as before — ambiguous matches still return None.
  • Breaking at the private-symbol level: _split_order_data now returns a 3-tuple. The only in-tree callers (OrderPageParser, the unit tests) are updated. External callers importing this private helper need to unpack three values.

Tests

pytest tests/: 574 passed.

tests/subcategory_structure_test.py

  • TestFieldCondition — case-insensitive matching, numeric value coercion, edge cases.
  • TestSubcategoryStructureLabelMap: label_map indexes both the original label and the casefolded form (label auto-added by __post_init__); duplicate labels preserved in declaration order; empty labels grouped; case-insensitive merge.
  • TestSubcategoryFieldDefAliases — alias casefolding, empty-alias filtering.
  • TestSubcategoryStructureAliasIndexing: lookup_field_id resolves both English ID and localized aliases; add_alias invalidates cached label maps; enrich_from_offer matches a unique SELECT option and skips ambiguous matches.
  • TestNormalizeOption — strips emoji / misc symbols / dingbats; collapses whitespace; strips outer punct; preserves Latin / Cyrillic / digits / interior punctuation.
  • TestEnrichFromOfferWithDecoratedValue: 'RUB🔥' resolves to a currency field whose options include 'RUB'.
  • TestSubcategoryFieldDefLabelAutoAlias: label auto-added to aliases by __post_init__; empty label not added.
  • TestEnrichFromOfferFields — localized labels from offerEdit-page OfferFields get registered as aliases on a listing-page-built structure; ids absent from self.fields are ignored; chaining returns self.
  • TestEnrichFromOrderPage (0.9.1) — unique-value match registers an alias with ORDER_PAGE provenance; ambiguous matches skipped; already-mapped labels left alone.
  • TestEnrichFromOfferPreviews (0.9.1) — other_data_names entries registered as aliases with OFFER_PREVIEW provenance; no-op on empty inputs / empty other_data; unknown field ids ignored.
  • TestAliasSource (0.9.1) — default source is USER; explicit source recorded; label-derived auto-aliases seeded with LABEL; forget_aliases_from(source) removes only matching entries and returns the count; positional add_alias(fid, alias) calls remain backward-compatible.
  • TestLookupFieldIdWithContext (0.9.2) — unique label resolves without context; ambiguous + no context returns None; context picks unconditional when conditions of the alternative are unsatisfied; context promotes the conditional candidate when its conditions hold; miss returns None regardless of context.
  • TestEnrichDeliveryFieldsFromOffer (0.9.2) — multiple offers union; first-seen label wins; chaining returns self.
  • TestSubcategoryStructureMerge (0.9.2) — fields only in other are deep-copied (mutating other afterwards doesn't affect self); overlapping field unions aliases; alias provenance carries over; other is not mutated; delivery_fields are unioned; label_map cache is invalidated; chaining returns self.
  • TestOrderPageReclassifyAndStructured (0.9.2) — reclassify_with_structure promotes a label out of lot_fields and into delivery_fields when the structure has the matching delivery spec; get_structured_fields picks the unconditional candidate on ambiguous lookup with no helpful context.

tests/order_page_split_test.py

  • TestSplitOrderData — canonical RU/EN/UA metadata extracted under canonical keys; metadata and lot_fields key spaces stay disjoint; English-locale variants.
  • TestSplitOrderDataCompositeExpansion: 'количество usd' = '20 USD' produces both the original entry and a synthetic 'usd' = '20'; same for rub; broad currency-id regex (gbp works); non-numeric value left alone; pre-existing synthetic-key wins on collision (setdefault); opt-out via expand_composite=False; non-composite labels untouched.
  • TestSplitOrderDataDeliveryClassification (0.9.2) — static blacklist routes Telegram / Steam labels to delivery_fields; extra_delivery_labels extends rather than replaces the static blacklist; the three returned dicts are pairwise disjoint; ORDER_DELIVERY_LABELS exposes the expected casefolded entries.

tests/parsers_test/offer_page_delivery_spec_test.py (0.9.2)

  • TestParseDeliveryFieldsSpec — basic Telegram-username form-group; payment-method dropdown skipped; chrome-autofill hidden input skipped; offer-calc-box form-group skipped; unlabeled groups skipped; multiple visible fields collected; reserved csrf_token skipped; named character-name input picked up.

tests/synthesize_chips_test.py

  • TestSynthesizeChipsStructure — empty / single / multi-offer cases; first-seen option ordering; label fallback to field id; default derived_from='lot_fields' for normal SubcategoryStructure constructors.
  • TestFallbackOptionDefault: fallback_structure_from_chips_offers defaults to False.

tests/parsers_test/lot_fields_parser_test.py — fixture asserts the auto-registered aliases ({field_id, casefolded label}) on each parsed SubcategoryFieldDef.


Engine cleanup once 0.9.2 is released

  • Remove the asdict() shim in _add_raw_source (was needed because FieldCondition / SubcategoryStructure weren't FunPayObject).
  • Switch any structure-provenance checks to structure.is_synthetic instead of sniffing raw_source.
  • Drop any locally maintained "delivery label blacklist" — ORDER_DELIVERY_LABELS + SubcategoryStructure.delivery_fields is now the canonical source.
  • Replace any positional _split_order_data result unpacking — the third element is delivery_fields.

Address structural-parsing recall problems found via live engine testing
on funpaybotengine PR #35: structure built from data-fields JSON keeps
English IDs while OfferPage / OrderPage render localized labels, leaving
get_structured_fields with ~1/8 recall on real data.

- SubcategoryFieldDef gains casefolded `aliases: set[str]`; OfferFieldsParser
  registers field id and form <label> as aliases. label_map / lower_label_map
  now index aliases. New `lookup_field_id`, `add_alias`, and heuristic
  `enrich_from_offer` (matches by SELECT/DROPDOWN option value).

- OfferPreview / OrderPreview gain `parse_title_fields(structure)` (cherry-
  picked from the lost feature/subcategory-structure-parser commits and
  adapted to the dict-based fields layout). SELECT/DROPDOWN segments are
  validated against `options`; unmatched segments are dropped.

- OrderPage data is split into `metadata` (canonical RU/EN/UA-aware order
  fields) and `lot_fields` (everything else). `get_structured_fields` now
  reads `lot_fields` only, eliminating false positives from order-metadata
  labels colliding with structural ones. `data` is preserved for back-compat.

- FieldCondition and SubcategoryStructure now extend FunPayObject with
  `raw_source`, removing the need for the asdict() shim in engine
  pydantic validators.

- Bump 0.8.0 -> 0.8.2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flummy1 force-pushed the feature/multi-source-label-index branch from 3da79ee to b027046 on May 2, 2026 13:45
Flummy1 and others added 3 commits May 2, 2026 16:58
…ues to int

Options like '20 USD', '5000 RUB', '13 звёзд', '50 stars' carry a numeric
magnitude with a redundant trailing unit; the unit is already encoded in
field_def.id (usd / rub / quantity), so consumers should not have to
re-strip it. Pure-text options (e.g. currency='USD', method='По username')
are preserved as strings unchanged.

Updated test_select_match_kept (now test_select_quantity_collapsed_to_int)
and added test_select_non_numeric_option_kept_as_string for coverage of
the string-preserving path.
… label seeding, right-to-left title matcher

Follow-up on the multi-source label index. Live testing on 32 real
purchases / 8 subcategories surfaced three remaining failure modes
that capped enriched recall at ~80% (43/54 lot_fields). This commit
addresses all three:

Task 1 — Value normalization for option matching.
  Sellers commonly decorate values with emoji / star symbols ('RUB🔥'
  vs option 'RUB', 'По логину🔥' vs 'По логину'), and option lists
  on the filter form are clean. New module-level _normalize_option
  helper (casefold + strip emoji/symbol decorations + collapse
  whitespace + strip outer punctuation) is now applied on both sides
  of the comparison in `SubcategoryStructure.enrich_from_offer` and
  in `_parse_title_fields`. Latin / Cyrillic / digits / interior
  punctuation are preserved.

Task 2 — Auto-alias the canonical localized label, plus an
  `enrich_from_offer_fields` cross-source helper.
  `SubcategoryFieldDef.__post_init__` now adds the casefolded `label`
  to `aliases` automatically — single source of truth for label
  registration, removing the duplicated parser-side wiring. New
  `SubcategoryStructure.enrich_from_offer_fields(offer_fields)`
  method registers the canonical localized labels from an
  authenticated offerEdit-page schema as aliases on a structure
  built from the (often English-labeled) public listing form,
  unblocking TEXT fields that have no `options` for value-based
  enrichment to work against.

Task 3 — Right-to-left greedy `_parse_title_fields`.
  Replaced positional rsplit+zip with a reverse walk that tries each
  rightmost not-yet-consumed segment against the current field; on
  miss, the segment is left for an earlier field. Fixes short-title
  rsplit-mismatch where titles with fewer segments than suffix
  fields produced wrong assignments (e.g. '50 звёзд, По username'
  with 3 declared suffix fields). Also folds in _normalize_option
  in the SELECT/DROPDOWN branch so decorated title segments resolve
  to the canonical option.

Bump 0.8.2 → 0.8.3 (purely additive surface, behavior of existing
match cases preserved).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ECT first

Live regression on gift-card subcat #1316: titles like
'…, RUB, 2000 RUB' were resolving to {inr2: 2000, currency: 'RUB'}
instead of {rub: 2000, currency: 'RUB'}. Root cause was the right-to-
left matcher in 425e447 attributing leading-numeric segments to the
LAST declared NUMERIC_RANGE field (`inr2`, conditioned on
`inr='другое количество'`) without verifying that the condition holds.

Two-pass algorithm now:
  Pass 1 — SELECT/DROPDOWN, right-to-left. These are authoritative
           because the segment must literally appear in the field's
           options.
  Pass 2 — NUMERIC_RANGE for unconsumed segments, ONLY for fields whose
           conditions are satisfied by pass-1 results (or have no
           conditions). Walks fields in declaration order so the
           innermost matching field wins.
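
The two passes can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the removed `_parse_title_fields`: the `Field` type, the comma-based segment split, the leading-digit regex, and the `'Gift card'` title prefix are all hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class Field:
    """Hypothetical field definition; kind is 'select' or 'numeric_range'."""
    id: str
    kind: str
    options: list[str] = field(default_factory=list)
    conditions: dict[str, str] = field(default_factory=dict)


_LEADING_NUMBER = re.compile(r"^\s*(\d+)")


def parse_title(title: str, fields: list[Field]) -> dict[str, object]:
    segments = [s.strip() for s in title.split(",")]
    consumed: set[int] = set()
    result: dict[str, object] = {}

    # Pass 1: SELECT fields, right-to-left; a segment must literally
    # equal one of the field's options (case-insensitively).
    for f in reversed(fields):
        if f.kind != "select":
            continue
        by_key = {opt.casefold(): opt for opt in f.options}
        for i in range(len(segments) - 1, -1, -1):
            if i in consumed:
                continue
            option = by_key.get(segments[i].casefold())
            if option is not None:
                result[f.id] = option
                consumed.add(i)
                break

    # Pass 2: NUMERIC_RANGE fields in declaration order, skipped unless
    # every condition is satisfied by what pass 1 resolved.
    for f in fields:
        if f.kind != "numeric_range":
            continue
        if any(result.get(dep) != want for dep, want in f.conditions.items()):
            continue
        for i in range(len(segments) - 1, -1, -1):
            if i in consumed:
                continue
            m = _LEADING_NUMBER.match(segments[i])
            if m:
                result[f.id] = int(m.group(1))
                consumed.add(i)
                break
    return result
```

In this sketch the condition gate is what keeps `'2000 RUB'` away from `inr2`: its condition on `inr` is never satisfied, so only `rub` is eligible in pass 2.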

Adds two tests:
  - test_select_preferred_over_numeric_range_with_unsatisfied_conditions:
    reproduces the live #1316 case.
  - test_unconditional_numeric_range_still_matches: ensures gating
    does not regress unconditional NUMERIC_RANGE fields.

531 tests pass.
@Flummy1 Flummy1 force-pushed the feature/multi-source-label-index branch from b1c35ef to 71be2f3 Compare May 2, 2026 14:45
Closes the remaining ~16% recall gap on the live structural-parsing
stand by addressing two well-scoped failure modes from the dataset
research:

Task 1 — Synthetic SubcategoryStructure for CHIPS subcategories.
  CHIPS-listing pages (e.g. #173 Necropolis) often have no
  ``div.lot-fields`` block, so no authoritative structure can be
  parsed. But each CHIPS offer's ``other_data`` already carries
  structured ``(field_id → value_id)`` pairs — enough to synthesize
  one. New ``_synthesize_chips_structure`` helper in the parser
  module builds a ``SubcategoryStructure`` from the union of
  ``other_data`` keys/values across the parsed previews, with each
  distinct key as a SELECT field whose options are the first-seen
  values, and labels pulled from ``other_data_names`` (or the
  field id as fallback).

  Wired into ``SubcategoryPageParser._parse``: parses the offers
  first, and when ``structure is None``, ``subcategory_type`` is
  CHIPS, and the new opt-in ``fallback_structure_from_chips_offers``
  flag on ``SubcategoryPageParsingOptions`` is set, synthesizes a
  structure from them. OFFERS subcategories without lot-fields are
  intentionally not covered — their ``OfferPage.fields`` carry only
  order metadata, nothing structural to infer.

  ``SubcategoryStructure`` gains a ``derived_from`` field
  (``Literal['lot_fields', 'chips_offers']``, default
  ``'lot_fields'``) and an ``is_synthetic`` property, so callers
  can distinguish authoritative vs inferred structures without
  sniffing ``raw_source``.
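
A rough sketch of the synthesis step, assuming ``other_data`` maps ``field_id → value_id`` and ``other_data_names`` maps ``field_id → label``. ``Preview``, ``SelectField`` and ``Structure`` are hypothetical stand-ins for the library's types, and the real helper may differ in detail.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterable, Literal


@dataclass
class Preview:
    """Hypothetical stand-in for a parsed CHIPS offer preview."""
    other_data: dict[str, str]
    other_data_names: dict[str, str] = field(default_factory=dict)


@dataclass
class SelectField:
    id: str
    label: str
    options: list[str] = field(default_factory=list)


@dataclass
class Structure:
    fields: dict[str, SelectField]
    derived_from: Literal["lot_fields", "chips_offers"] = "lot_fields"

    @property
    def is_synthetic(self) -> bool:
        return self.derived_from != "lot_fields"


def synthesize_chips_structure(previews: Iterable[Preview]) -> Structure | None:
    fields: dict[str, SelectField] = {}
    for preview in previews:
        for field_id, value_id in preview.other_data.items():
            current = fields.get(field_id)
            if current is None:
                # Label falls back to the field id when no name is known.
                label = preview.other_data_names.get(field_id, field_id)
                current = fields[field_id] = SelectField(field_id, label)
            if value_id not in current.options:
                current.options.append(value_id)  # first-seen order
    if not fields:
        return None
    return Structure(fields, derived_from="chips_offers")
```

Callers can then branch on ``is_synthetic`` to decide how much to trust the option lists, which only reflect values observed on the parsed page.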

Task 2 — Composite lot-label expansion in ``_split_order_data``.
  ``OrderPage.lot_fields`` regularly contains composite labels of
  the form ``'количество usd' = '20 USD'``, but the structure
  exposes the canonical currency-amount field as ``usd`` —
  ``get_structured_fields`` couldn't bridge the two. New
  ``_COMPOSITE_LABEL_RE`` matches ``'<quantity-locale> <2-4-letter-id>'``
  (broad enough to survive FunPay adding new currencies) and
  ``_COMPOSITE_VALUE_RE`` extracts the numeric magnitude. When both
  match, an additional ``lot_fields[currency_id] = '<number>'``
  entry is emitted alongside the original composite label. Original
  wins on synthetic-key collisions (``setdefault``).

  Behavior is gated by ``OrderPageParsingOptions.expand_composite_lot_labels``.
  The default is ``True``, which preserves the recall-boosting
  semantics; callers asserting ``len(lot_fields)`` can opt out.
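
The expansion can be illustrated with the sketch below. The two patterns are guesses at what ``_COMPOSITE_LABEL_RE`` and ``_COMPOSITE_VALUE_RE`` might look like (the set of quantity words in particular is an assumption), and ``expand_composite_lot_labels`` here is a free function invented for the example.

```python
import re

# Assumed shape: a localized "quantity" word followed by a 2-4 letter id.
COMPOSITE_LABEL_RE = re.compile(
    r"^(?:количество|quantity|amount)\s+([a-z]{2,4})$", re.IGNORECASE
)
# Assumed shape: the value starts with an integer or decimal magnitude.
COMPOSITE_VALUE_RE = re.compile(r"^\s*(\d+(?:[.,]\d+)?)")


def expand_composite_lot_labels(lot_fields: dict[str, str]) -> dict[str, str]:
    """Return a copy with synthetic '<currency_id>' entries added."""
    expanded = dict(lot_fields)
    for label, value in lot_fields.items():
        label_match = COMPOSITE_LABEL_RE.match(label.strip())
        value_match = COMPOSITE_VALUE_RE.match(value)
        if label_match and value_match:
            currency_id = label_match.group(1).casefold()
            # setdefault: an existing real key wins over the synthetic one.
            expanded.setdefault(currency_id, value_match.group(1))
    return expanded
```

Because the original composite entry is kept, the expanded dict has one more key per matched label, which is why length-sensitive callers may want the opt-out.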

Bump 0.8.3 → 0.9.0. The new API surface is opt-in or additive and
backward-compatible, but composite expansion is on by default and
changes the contents of ``lot_fields`` for existing payloads. That
behavior change is the reason for a minor bump rather than a patch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Flummy1 Flummy1 changed the title feat: multi-source label index, parse_title_fields, OrderPage data split feat: structural-parsing recall — multi-source aliases, parse_title_fields, OrderPage split, synthetic CHIPS, composite labels (0.9.0) May 2, 2026
@Flummy1 Flummy1 changed the title feat: structural-parsing recall — multi-source aliases, parse_title_fields, OrderPage split, synthetic CHIPS, composite labels (0.9.0) feat(structural-parsing): cross-locale label resolution, robust title matcher, OrderPage split, synthetic CHIPS structure (0.9.0) May 2, 2026
Flummy1 and others added 3 commits May 2, 2026 22:01
…ce provenance (0.9.1)

Adds two new no-HTTP enrichment paths and a provenance side-channel so
aliases registered by each enrich_from_* source can be inspected and
selectively invalidated.

* AliasSource enum (LABEL / LISTING / OFFER_EDIT / OFFER_PAGE /
  ORDER_PAGE / OFFER_PREVIEW / USER) and a SubcategoryStructure-side
  _alias_sources map keyed by (field_id, alias_casefold).
* add_alias gains source= kwarg (default USER, backward-compatible);
  alias_source() and forget_aliases_from() expose / invalidate by source.
* enrich_from_order_page(OrderPage) — value-match against SELECT options,
  mirror of enrich_from_offer for completed-order payloads.
* enrich_from_offer_previews(Iterable[OfferPreview]) — registers
  other_data_names entries as aliases for ids already in the structure.
* Existing enrich_from_offer / enrich_from_offer_fields tagged with
  OFFER_PAGE / OFFER_EDIT respectively; auto-aliases seeded from
  SubcategoryFieldDef.label tagged LABEL.
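
The provenance side-channel might look roughly like this. The enum members come from the commit message; `AliasIndex`, the `aliases()` accessor, and the rule that the first registration keeps its source are assumptions made for the sketch.

```python
from __future__ import annotations

from enum import Enum


class AliasSource(Enum):
    LABEL = "label"
    LISTING = "listing"
    OFFER_EDIT = "offer_edit"
    OFFER_PAGE = "offer_page"
    ORDER_PAGE = "order_page"
    OFFER_PREVIEW = "offer_preview"
    USER = "user"


class AliasIndex:
    """Hypothetical container; the PR keeps this state on the structure."""

    def __init__(self) -> None:
        self._aliases: dict[str, set[str]] = {}
        # Keyed by (field_id, alias_casefold), as in the commit message.
        self._alias_sources: dict[tuple[str, str], AliasSource] = {}

    def add_alias(self, field_id: str, alias: str, *,
                  source: AliasSource = AliasSource.USER) -> bool:
        key = alias.strip().casefold()
        bucket = self._aliases.setdefault(field_id, set())
        if key in bucket:
            return False  # assumption: first registration keeps its source
        bucket.add(key)
        self._alias_sources[(field_id, key)] = source
        return True

    def alias_source(self, field_id: str, alias: str) -> AliasSource | None:
        return self._alias_sources.get((field_id, alias.strip().casefold()))

    def forget_aliases_from(self, source: AliasSource) -> int:
        doomed = [k for k, s in self._alias_sources.items() if s is source]
        for field_id, key in doomed:
            self._aliases[field_id].discard(key)
            del self._alias_sources[(field_id, key)]
        return len(doomed)

    def aliases(self, field_id: str) -> frozenset[str]:
        return frozenset(self._aliases.get(field_id, ()))
```

Keeping the source map separate from the alias sets means existing lookups are untouched, while a stale source (say, an outdated offerEdit schema) can be dropped in one call.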

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…re merge (0.9.2)

* OfferPage.delivery_fields_spec — parsed from <form action="/orders/new">,
  maps input.name → label for per-order delivery contract fields
  (Telegram username, Steam login, …)
* SubcategoryStructure.delivery_fields + enrich_delivery_fields_from_offer
  to accumulate specs across observed offers (first-seen label wins)
* OrderPage.delivery_fields with static ORDER_DELIVERY_LABELS blacklist;
  reclassify_with_structure for high-precision per-subcategory routing
* lookup_field_id(context=…) scores ambiguous matches against
  FieldCondition state, disambiguating shared labels like
  quantity SELECT vs quantity2 NUMERIC_RANGE
* OrderPage.get_structured_fields uses iterative resolve with
  accumulated context, falling back to first-by-declaration
* SubcategoryStructure.merge_from for combining synthetic + authoritative
  structures: deep-copies missing fields, unions aliases, carries
  alias provenance, invalidates label-map caches
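
One way the context-aware lookup could work is sketched below. The scoring rule (plus one per satisfied condition, minus one per contradicted condition) is an assumption, as are the `Field` type and the free-function form; the commit only states that ambiguous matches are scored against FieldCondition state with a first-by-declaration fallback.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Field:
    """Hypothetical field with equality conditions on other fields."""
    id: str
    label: str
    conditions: dict[str, str] = field(default_factory=dict)


def lookup_field_id(fields: list[Field], label: str,
                    context: dict[str, str] | None = None) -> str | None:
    key = label.strip().casefold()
    candidates = [f for f in fields if f.label.casefold() == key]
    if not candidates:
        return None
    if context is None or len(candidates) == 1:
        return candidates[0].id  # first by declaration

    def score(f: Field) -> int:
        total = 0
        for dep, want in f.conditions.items():
            if dep in context:
                total += 1 if context[dep] == want else -1
        return total

    # max() returns the first maximal element, so ties fall back to
    # declaration order.
    return max(candidates, key=score).id
```

An iterative resolver would call this once per label and feed each resolved `field_id → value` pair back into `context` before resolving the next label.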

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…thods

Title parsing moves to a dedicated external package (title_playground).
Remove _TITLE_SUFFIX_TYPES and _parse_title_fields from
subcategory_structure.py, along with OfferPreview.parse_title_fields,
OrderPreview.parse_title_fields, their tests (TestParseTitleFields,
TestParseTitleFieldsRightToLeft), and the now-unused
SubcategoryStructure TYPE_CHECKING import in orders.py.

_normalize_option and _DECORATION_RE are preserved — they are still used
by enrich_from_offer and enrich_from_order_page for option-value matching.
@Flummy1 Flummy1 changed the title feat(structural-parsing): cross-locale label resolution, robust title matcher, OrderPage split, synthetic CHIPS structure (0.9.0) feat(structural-parsing): cross-locale label resolution, OrderPage split, synthetic CHIPS structure (0.9.0–0.9.2) May 5, 2026
@Flummy1 Flummy1 closed this Jun 7, 2026
@Flummy1 Flummy1 deleted the feature/multi-source-label-index branch June 7, 2026 19:51