Conversation
```diff
@@ -0,0 +1,88 @@
+# .. title:: Multiple configurations
+# If you name a dataset, even an in-memory IterableDataSource, you can reuse it in multiple runs with different configurations.
```
This file demonstrates the primary purpose of this PR: rather than requiring a particular input format, we now support multiple formats, including in-memory Python lists.

Among other things, that will simplify future testing and documented examples, which then don't need to rely on external data files.
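As a rough illustration of what that enables, here is a minimal sketch. `EvalRun`, `IterableDataSource`, and every constructor argument shown are assumptions inferred from this thread, not the verified flexeval API:

```python
from flexeval.schema import EvalRun, IterableDataSource  # import path assumed

# A named, in-memory dataset: no external data file is required.
demo_conversations = IterableDataSource(
    name="demo_conversations",
    data=[
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hi! How can I help?"},
    ],
)

# Run 1 and Run 2 differ only in configuration; because the dataset is
# named, the second run can reuse it instead of reloading the data.
eval_run_1 = EvalRun(data_sources=[demo_conversations])
eval_run_2 = EvalRun(data_sources=[demo_conversations])
```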
Claude Opus 4.6 PR Review: "Reuse datasets between runs"

Branch:

Summary: This is a substantial refactor that introduces a many-to-many relationship between eval runs and datasets.

Section headings: Strengths; Issues to Address (Medium, Minor / Nits); Questions.

Verdict: This is a well-motivated refactor that cleans up the data model, improves test quality, and enables a key feature (dataset reuse). The core design is sound, and the issues above are mostly minor.

🤖 Generated with Claude Code
BaptisteMP left a comment:
Looks great! This will make it much easier to interact with datasets and reuse them.

One question: when `clear_tables = True`, does it clear all the datasets? Would it be interesting to only clear some datasets in memory?

A few comments below.
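To make the suggestion concrete, here is a hypothetical sketch of selective clearing. Nothing below is the actual flexeval API; the registry, function name, and signature are all invented for illustration:

```python
from typing import Iterable, Optional

# Invented stand-in for wherever named datasets are kept in memory.
_DATASETS: dict[str, list] = {}


def clear_datasets(names: Optional[Iterable[str]] = None) -> None:
    """Clear every dataset (names=None) or only the named ones."""
    if names is None:
        _DATASETS.clear()
        return
    for name in names:
        _DATASETS.pop(name, None)


# clear_tables=True would correspond to clear_datasets(), while a
# selective variant could look like:
clear_datasets(names=["demo_conversations"])
```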
```python
DataSourceType = Annotated[
    Union[
        Annotated[NamedDataSource, Tag("named")],
```
I did not know you could name types with `Tag`.
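For readers unfamiliar with it, here is a self-contained sketch of a pydantic v2 tagged union. It uses the callable-discriminator form from pydantic's documentation (the PR itself appears to use the string form `Discriminator("type")`), and the two models are stand-ins:

```python
from typing import Annotated, Any, Union

from pydantic import BaseModel, Discriminator, Tag, TypeAdapter


class NamedDataSource(BaseModel):
    name: str


class FileDataSource(BaseModel):
    path: str


def _discriminate(value: Any) -> str:
    # Route dicts by their "type" key; model instances by class.
    if isinstance(value, dict):
        return value.get("type")
    return "named" if isinstance(value, NamedDataSource) else "file"


DataSourceType = Annotated[
    Union[
        Annotated[NamedDataSource, Tag("named")],
        Annotated[FileDataSource, Tag("file")],
    ],
    Discriminator(_discriminate),
]

adapter = TypeAdapter(DataSourceType)
src = adapter.validate_python({"type": "named", "name": "demo_conversations"})
assert isinstance(src, NamedDataSource)
```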
```python
# The named dataset "demo_conversations" is reused from Run 1.
# Note that you could use a flexeval.schema.NamedDataSource instead if you wanted.
eval_run_2 = EvalRun(
    data_sources=data_sources,
```
It's not so clear that the data is being reused in the second eval run; using `NamedDataSource` would make that explicit.
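A sketch of the suggested variant; the constructor arguments are assumptions based on the snippet above rather than the verified API:

```python
from flexeval.schema import EvalRun, NamedDataSource  # EvalRun's module is assumed

# Referencing the dataset by name makes the reuse from Run 1 explicit.
# All other EvalRun settings are omitted here.
eval_run_2 = EvalRun(
    data_sources=[NamedDataSource(name="demo_conversations")],
)
```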
```python
    dataset,
    data_source,
    max_n_conversation_threads=config.max_n_conversation_threads,
    nb_evaluations_per_thread=config.nb_evaluations_per_thread,
```
Did you have an idea of how to handle `nb_evaluations_per_thread`? It appears that the config still has it, but it might not work correctly if the `load_jsonl` loop no longer sets `eval_run_thread_id` (as per Claude below).

Claude says: "The `nb_evaluations_per_thread` / duplicate-thread logic for JSONL seems to have been removed during the refactor into `load_thread_to_dataset`. The `load_jsonl` function still has the loop, but it calls `load_thread_to_dataset`, which doesn't set `eval_run_thread_id`. Is this intentional?"

That said, in principle your fix should not interfere with the use of `nb_evaluations_per_thread`, because it touches only the loading of the datasets, not the loading of threads into datasets.
This is a good flag. I'll have to look at this.
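For concreteness, a sketch of the behavior being questioned. `load_jsonl`, `load_thread_to_dataset`, `nb_evaluations_per_thread`, and `eval_run_thread_id` come from this thread; everything else (the signatures, the stub body, the uuid scheme) is assumed:

```python
import json
import uuid


def load_thread_to_dataset(thread: dict, dataset: list) -> None:
    # Stand-in for the real function; per the discussion above, the real
    # one no longer sets eval_run_thread_id itself.
    dataset.append(thread)


def load_jsonl(path: str, dataset: list, nb_evaluations_per_thread: int = 1) -> None:
    """Duplicate each source thread nb_evaluations_per_thread times, giving
    each copy its own eval_run_thread_id so repeated evaluations of the
    same thread remain distinguishable."""
    with open(path) as f:
        threads = [json.loads(line) for line in f]

    for thread in threads:
        for _ in range(nb_evaluations_per_thread):
            copy = dict(thread)
            # This is the step Claude flags as possibly lost in the refactor.
            copy["eval_run_thread_id"] = str(uuid.uuid4())
            load_thread_to_dataset(copy, dataset)
```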
Claude summary (collapsed; only fragments survive): it references `Discriminator("type")`; the `raise_on_duplicate_dataset_name` and `raise_on_unnamed_dataset` options; the final checkpoint's cumulative `channel_values.messages`; and a loading test marked `@expectedFailure` (or just checking for no crash). Its section headings include "Schema/Model changes" and "Pipeline wiring".