
Dataset generator: queries from categories (facet results) #265

Open
radu-gheorghe wants to merge 4 commits into SeaseLtd:dataset-generator from radu-gheorghe:datset_generator_category_queries

@radu-gheorghe
This PR includes #264, so if we merge it, that one could be closed.

In addition to generating queries (and therefore judgments) from documents, we can now generate them from "categories", i.e., facet values. For example, movie genres or e-commerce product tags.

The category values can be provided manually in the config or derived from facet results (given a query template for the search engine).

Config example

category_queries:
  # Explicit-values mode
  - fields: ["genres"]
    values: ["comedy", "action"]
    query_text_template_file: "templates/genre_query.tmpl"   # optional NL wrapper

  # Engine-discovery mode
  - fields: ["genres"]
    values_query_template_file: "templates/genre_facet_solr.json"
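
To illustrate how the two modes in the config above could be resolved into query strings, here is a minimal sketch. It is not the PR's code: `resolve_category_values`, `build_query_texts`, the `fetch_values` callable, and the `{value}` placeholder syntax are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical signature: given a field name and a facet template path,
# return the facet values reported by the search engine.
FetchValues = Callable[[str, str], List[str]]


def resolve_category_values(entry: Dict, fetch_values: FetchValues) -> List[str]:
    """Return the category values for one category_queries entry."""
    field = entry["fields"][0]  # only a single field is supported for now
    if "values" in entry:
        # Explicit-values mode: take the values straight from the config.
        return list(entry["values"])
    # Engine-discovery mode: ask the engine for the facet values.
    return fetch_values(field, entry["values_query_template_file"])


def build_query_texts(values: List[str], template: Optional[str] = None) -> List[str]:
    """Wrap each value in an optional natural-language template."""
    if template is None:
        return list(values)
    # Assumed placeholder syntax; the real template format may differ.
    return [template.format(value=v) for v in values]


if __name__ == "__main__":
    explicit = {"fields": ["genres"], "values": ["comedy", "action"]}
    values = resolve_category_values(explicit, lambda field, path: [])
    print(build_query_texts(values, "best {value} movies"))
```

In this sketch the engine call is injected as a callable, so the explicit-values path never touches the search engine.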

High level implementation notes

  • fetch_field_values is implemented per engine; each implementation follows that engine's existing retrieval-template convention.

  • Each query is tagged with where it came from in this run: user, category, llm, or cached (loaded from a previous run's datastore.json).

  • Budget fix. When the total number of queries exceeds num_queries_needed, the budget used to truncate by raw insertion order, which meant cached queries from datastore.json (loaded first) could push fresh user/category/LLM queries out of scoring. The budget now sorts by source priority first (user/category > llm > cached), so fresh queries always win their slot. The same priority is used when deciding whether to call the LLM at all — cached queries don't count toward the "do we still need more?" check.

  • main() phase split. generate_and_add_queries now runs as four phases, in order: add_user_queries, add_category_queries, fetch_and_add_seed_documents, generate_and_add_queries_from_documents.
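
The source tagging and budget ordering described in the notes above can be sketched roughly as follows. This is an illustration only: the `Query` class, `SOURCE_PRIORITY`, `apply_budget`, and `needs_llm_generation` are hypothetical names, and the sketch assumes that insertion order is preserved within each priority tier.

```python
from dataclasses import dataclass
from typing import List

# Lower number = higher priority. user and category share the top tier,
# matching the ordering user/category > llm > cached.
SOURCE_PRIORITY = {"user": 0, "category": 0, "llm": 1, "cached": 2}


@dataclass(frozen=True)
class Query:
    text: str
    source: str  # one of the SOURCE_PRIORITY keys


def apply_budget(queries: List[Query], num_queries_needed: int) -> List[Query]:
    """Keep at most num_queries_needed queries, fresh sources first."""
    # sorted() is stable, so insertion order is kept within each tier.
    ranked = sorted(queries, key=lambda q: SOURCE_PRIORITY[q.source])
    return ranked[:num_queries_needed]


def needs_llm_generation(queries: List[Query], num_queries_needed: int) -> bool:
    """Cached queries do not count toward the 'do we still need more?' check."""
    fresh = sum(1 for q in queries if q.source != "cached")
    return fresh < num_queries_needed
```

With two cached queries loaded first, followed by one user, one category, and one llm query, a budget of three keeps the three fresh queries and drops both cached ones.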

Gotchas

  • generate_queries_from_documents semantics changed. It used to be Optional[bool] = True and was never read by main(). It is now bool = True and actually gates both LLM generation and the seed-fetch decision. A YAML null in this field used to be silently equivalent to omitting it; it now raises a validation error.

  • Validator ordering matters. Path-collision validators run before content-reading validators, so a swapped values_query_template_file / query_text_template_file produces "different concepts" rather than "missing placeholder". Don't reorder.
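
To make the null-versus-absent gotcha above concrete, here is a standard-library sketch of the stricter behavior. The PR does not show its validation code, so `read_generate_flag` is a hypothetical stand-in, and the use of `ValueError` is an assumption.

```python
from typing import Any, Dict


def read_generate_flag(config: Dict[str, Any]) -> bool:
    """Hypothetical stand-in for the stricter field validation."""
    key = "generate_queries_from_documents"
    if key not in config:
        return True  # absent: fall back to the default
    value = config[key]
    if not isinstance(value, bool):
        # An explicit YAML null arrives here as None and is rejected.
        raise ValueError(f"{key} must be true or false, got {value!r}")
    return value
```

Under this sketch, an existing config that spells out `generate_queries_from_documents:` with no value would need the line removed or set to `true` or `false`.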

Next steps (sometime 🙂 )

Multi-field category sources. fields: List[str] is already accepted in the schema; config currently rejects more than one entry with "not yet supported".
