Dataset generator: queries from categories (facet results) by radu-gheorghe · Pull Request #265 · SeaseLtd/rated-ranking-evaluator

radu-gheorghe · 2026-05-08T13:30:35Z

This PR includes #264, so if we merge it, that one could be closed.

In addition to generating queries (and therefore judgments) from documents, we can now create them from "categories", i.e., facet values. For example, movie genres or e-commerce product tags.

The category values could be provided manually in the config or derived from facet results (given a query template for the search engine).

Config example

category_queries:
  # Explicit-values mode
  - fields: ["genres"]
    values: ["comedy", "action"]
    query_text_template_file: "templates/genre_query.tmpl"   # optional NL wrapper

  # Engine-discovery mode
  - fields: ["genres"]
    values_query_template_file: "templates/genre_facet_solr.json"
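
To make the two modes concrete, here is a minimal Python sketch of how one `category_queries` entry could be resolved into category values. It is an illustration, not the PR's code: `CategoryQueryConfig`, `resolve_category_values`, `render_query_text`, and the `{value}` placeholder are hypothetical names, and the engine lookup is passed in as a plain callable. Letting explicit values win when both options are set is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CategoryQueryConfig:
    """Hypothetical mirror of one `category_queries` entry."""
    fields: List[str]
    values: Optional[List[str]] = None                 # explicit-values mode
    values_query_template_file: Optional[str] = None   # engine-discovery mode
    query_text_template_file: Optional[str] = None     # optional NL wrapper


def resolve_category_values(
    entry: CategoryQueryConfig,
    fetch_field_values: Callable[[str, str], List[str]],
) -> List[str]:
    """Return the category values for one entry, from config or from the engine."""
    if len(entry.fields) != 1:
        # More than one field is rejected for now.
        raise ValueError("multi-field category sources are not yet supported")
    if entry.values is not None:
        # Explicit-values mode: use the values exactly as configured.
        return list(entry.values)
    if entry.values_query_template_file is not None:
        # Engine-discovery mode: ask the engine for facet values of the field.
        return fetch_field_values(entry.fields[0], entry.values_query_template_file)
    raise ValueError("provide either values or values_query_template_file")


def render_query_text(value: str, template: Optional[str] = None) -> str:
    """Wrap a category value in natural language if a template is given.

    The `{value}` placeholder name is an assumption made for this sketch.
    """
    if template is None:
        return value
    return template.replace("{value}", value)
```

For example, with the explicit entry above, `resolve_category_values` never touches the engine, while the second entry delegates to whatever `fetch_field_values` callable is supplied for the configured engine.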

High level implementation notes

fetch_field_values per engine: each implementation matches that engine's existing retrieval-template convention.
Each query is tagged with its source for this run: user, category, llm, or cached (loaded from a previous run's datastore.json).
Budget fix. When the total number of queries exceeded num_queries_needed, the budget used to truncate by raw insertion order, so cached queries from datastore.json (loaded first) could push fresh user/category/LLM queries out of scoring. The budget now sorts by source priority first (user/category > llm > cached), so fresh queries always win their slot. The same priority applies when deciding whether to call the LLM at all: cached queries don't count toward the "do we still need more?" check.
main() phase split. generate_and_add_queries is now split into four phases, run in order: add_user_queries, add_category_queries, fetch_and_add_seed_documents, generate_and_add_queries_from_documents.
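
The budget behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the `Query` class, `SOURCE_PRIORITY`, `apply_budget`, and `needs_llm_generation` are hypothetical names; only the source labels and the priority order (user/category > llm > cached) come from the notes.

```python
from dataclasses import dataclass
from typing import List

# Lower number = higher priority. user and category share the top tier.
SOURCE_PRIORITY = {"user": 0, "category": 0, "llm": 1, "cached": 2}


@dataclass
class Query:
    text: str
    source: str  # one of: user, category, llm, cached


def apply_budget(queries: List[Query], num_queries_needed: int) -> List[Query]:
    """Keep at most num_queries_needed queries, preferring fresh sources.

    sorted() is stable, so insertion order is preserved within a tier.
    """
    ranked = sorted(queries, key=lambda q: SOURCE_PRIORITY[q.source])
    return ranked[:num_queries_needed]


def needs_llm_generation(queries: List[Query], num_queries_needed: int) -> bool:
    """Cached queries do not count toward the 'do we still need more?' check."""
    fresh = sum(1 for q in queries if q.source != "cached")
    return fresh < num_queries_needed


if __name__ == "__main__":
    # Cached queries are loaded first, as they would be from datastore.json.
    loaded = [
        Query("old one", "cached"),
        Query("old two", "cached"),
        Query("generated", "llm"),
        Query("comedy", "category"),
        Query("star wars", "user"),
    ]
    print([q.text for q in apply_budget(loaded, 3)])
    # → ['comedy', 'star wars', 'generated']
```

Truncating by insertion order would have kept the two cached queries and dropped the user and category ones; sorting by tier first avoids that without reordering queries inside a tier.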

Gotchas

generate_queries_from_documents semantics changed. It was Optional[bool] = True and never read by main(). It is now bool = True and actually gates both LLM generation and the seed-fetch decision. A YAML null in this field used to be silently equivalent to an absent field; it now raises a validation error.
Validator ordering matters. Path-collision validators run before content-reading validators, so a swapped values_query_template_file / query_text_template_file produces "different concepts" rather than "missing placeholder". Don't reorder.
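
The null-versus-absent distinction in the first gotcha can be illustrated with a small standard-library sketch. It assumes the YAML has already been parsed into a dict (so a YAML null arrives as `None`); `read_generate_flag` is a hypothetical helper, and the project's real schema validation may word its errors differently or be more lenient about non-boolean values.

```python
from typing import Any, Dict

FLAG = "generate_queries_from_documents"


def read_generate_flag(raw_config: Dict[str, Any]) -> bool:
    """Read the flag with 'bool = True' semantics.

    Absent key -> default True. Present but null (None) -> error,
    because None is not a bool.
    """
    if FLAG not in raw_config:
        return True
    value = raw_config[FLAG]
    if not isinstance(value, bool):
        raise ValueError(f"{FLAG} must be true or false, got {value!r}")
    return value


if __name__ == "__main__":
    print(read_generate_flag({}))             # → True
    print(read_generate_flag({FLAG: False}))  # → False
    try:
        read_generate_flag({FLAG: None})      # what a YAML null parses to
    except ValueError as err:
        print(err)
```

Configs that wrote the key with an empty or null value need either a real boolean or the key removed.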

Next steps (sometime 🙂 )

Multi-field category sources. fields: List[str] is already accepted in the schema; config currently rejects more than one entry with "not yet supported".

radu-gheorghe added 4 commits May 6, 2026 18:36

b04b8a4 allow using search_engine_type=vespa in config and default vespa_schema to collection_name
d4d4ff7 make queries needed respect the cache
b21bf95 test for query budget
1cf8eaa category queries initial implementation

radu-gheorghe mentioned this pull request May 12, 2026

LLM call batching and multithreading #266

Open

radu-gheorghe wants to merge 4 commits into SeaseLtd:dataset-generator from radu-gheorghe:datset_generator_category_queries
