
Reuse datasets between runs #90

Open

levon003 wants to merge 9 commits into dev from better-datasets

Conversation


levon003 (Member) commented Sep 19, 2025

  • Add logic for reusing datasets
  • Can now associate EvalRuns with one or more datasets
  • Bump to langgraph 1.0+ and update test data script
  • Unit & integration testing cleanup
  • Removed the context_only flag. First, it was broken, which suggests no one was using it. Second, it was confusing and unnecessary: if a function metric only wants the context, it can extract that itself (see the sketch below).
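For illustration, a hedged sketch of the replacement pattern; the metric signature and Turn accessor here are hypothetical, not the actual flexeval API:

```python
def context_word_count(turn) -> int:
    # Hypothetical accessor: assume the Turn object exposes the prior
    # conversation as a list of {"role": ..., "content": ...} dicts.
    context_messages = turn.context
    return sum(len(m["content"].split()) for m in context_messages)
```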

Claude summary

  1. Dataset loading refactor
  • Polymorphic data sources: DataSource now has three discriminated subtypes via a type field: FileDataSource, NamedDataSource (lookup by name), and IterableDataSource. Uses Pydantic Discriminator("type") (see the sketch after this list).
  • FileFormatEnum: Explicit enum for file formats (jsonl, langgraph_sqlite) replacing a bare Literal["jsonl"].
  • New run_utils.load_datasets(): Centralized dataset loading logic with support for reuse-by-name, duplicate detection, and named lookups. New Config flags: reuse_dataset_by_name,
    raise_on_duplicate_dataset_name, raise_on_unnamed_dataset.
  • Dataset class simplified: Removed ~80 lines of loading logic that moved into run_utils.
  2. LangGraph parser rewrite
  • data_loader.py: Rewrote load_langgraph_sqlite() for langgraph 1.0 compatibility. Old approach (~300 lines) parsed incremental metadata["writes"]; new approach (~100 lines) reads the
    final checkpoint's cumulative channel_values.messages.
  3. context_only feature removal
  • Removed the broken/defunct context_only feature from Turn, compute_metrics, completions, function_types, eval_schema, and test configs.
  4. Test improvements
  • ConfigFailures: Tests now assert specific exception types and verify zero metric rows for runtime errors (instead of @expectedFailure or just checking for no crash).
  • LangGraph integration tests: Unskipped all 4 suites, regenerated test data.
  • Unit tests: Expanded test_data_loader.py significantly (+380 lines) — new tests for NamedDataSource, reuse-by-name, duplicate detection, iterable reuse, and completed the LangGraph
    loading test.
  5. Other changes
  • Docs updates (DEVELOPMENT.md, vignettes.rst, abstractions.rst)
  • New vignette: vignettes/multiple_configs.py
  • Dependency updates in uv.lock / pyproject.toml
  • Minor cleanups in runner.py, db_utils.py, dependency_graph.py, metrics/save.py
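A minimal sketch of the tagged-union pattern described above, with illustrative field names (the real models live in evalrun_schema.py, and the PR's actual definition also attaches pydantic.Tag labels, as quoted later in this thread):

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Discriminator, TypeAdapter


class NamedDataSource(BaseModel):
    type: Literal["named"] = "named"
    name: str


class FileDataSource(BaseModel):
    type: Literal["file"] = "file"
    path: str
    format: str = "jsonl"  # a FileFormatEnum in the real code


class IterableDataSource(BaseModel):
    type: Literal["iterable"] = "iterable"
    name: str | None = None
    data: list[dict] = []


DataSourceType = Annotated[
    Union[NamedDataSource, FileDataSource, IterableDataSource],
    Discriminator("type"),
]

# The discriminator routes a plain dict (e.g. from YAML) to the right subtype:
source = TypeAdapter(DataSourceType).validate_python(
    {"type": "file", "path": "threads.jsonl"}
)
assert isinstance(source, FileDataSource)
```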

Schema/Model changes

  • dataset.py — Removed load_data(), is_sqlite_file(), max_n_conversation_threads, nb_evaluations_per_thread
  • thread.py — Removed evalsetrun FK
  • turn.py — Removed evalsetrun FK; get_completion() now accepts completion_config and evalsetrun as params
  • message.py — Same as turn.py
  • tool_call.py — Removed evalsetrun FK
  • eval_set_run.py — Renamed backref to dataset_links, added dataset_list property
  • evalrun_schema.py — Added type discriminator field to DataSource subclasses, created DataSourceType discriminated union
  • db_utils.py — Added EvalSetRunDatasets to DATABASE_TABLES (a sketch of the join-table shape follows this list)
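A sketch of the many-to-many shape in Peewee, with invented minimal fields; the real models carry more columns:

```python
from peewee import CharField, ForeignKeyField, Model, SqliteDatabase

db = SqliteDatabase(":memory:")

class BaseModel(Model):
    class Meta:
        database = db

class Dataset(BaseModel):
    name = CharField(null=True)

class EvalSetRun(BaseModel):
    name = CharField(null=True)

    @property
    def dataset_list(self):
        # Resolve join rows to Dataset instances via the renamed backref.
        return [link.dataset for link in self.dataset_links]

class EvalSetRunDatasets(BaseModel):
    evalsetrun = ForeignKeyField(EvalSetRun, backref="dataset_links")
    dataset = ForeignKeyField(Dataset, backref="run_links")

db.create_tables([Dataset, EvalSetRun, EvalSetRunDatasets])
```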

Pipeline wiring

  • data_loader.py — Removed all evalsetrun=dataset.evalsetrun references; load_file() now accepts config params
  • run_utils.py — Removed dataset_files from build_eval_set_run(); restructured load_datasets() with clean control flow (sketched after this list)
  • runner.py — Wired up build_evalsetrun_datasets(), passes datasets list to completions/metrics
  • completions.py — Accepts datasets list, iterates threads through datasets
  • compute_metrics.py — build_thread_task_graphs() takes Dataset; stored do_completion/grader_llm on MetricComputer; fixed load_rubrics() return value bug
  • metrics/save.py — Accepts evalsetrun and datasets as explicit parameters
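A pseudocode-level sketch of the load_datasets() control flow, inferred from the Config flags listed earlier; the helper names (find_dataset_by_name, load_from_source) are assumptions, not flexeval API:

```python
def load_datasets(config, data_sources):
    datasets = []
    for source in data_sources:
        name = getattr(source, "name", None)
        if isinstance(source, NamedDataSource):
            # Pure lookup: the dataset must already exist.
            dataset = find_dataset_by_name(name)
            if dataset is None:
                raise ValueError(f"No existing dataset named {name!r}")
        elif config.reuse_dataset_by_name and name is not None:
            # Reuse a prior run's dataset with the same name, else load fresh.
            dataset = find_dataset_by_name(name) or load_from_source(source)
        else:
            if name is None and config.raise_on_unnamed_dataset:
                raise ValueError("Data source has no name")
            dataset = load_from_source(source)
        datasets.append(dataset)
    return datasets
```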

@levon003 levon003 self-assigned this Sep 19, 2025
@levon003 levon003 changed the title Reuse datasets between runs [Draft] Reuse datasets between runs Mar 26, 2026
@levon003 levon003 changed the title [Draft] Reuse datasets between runs Reuse datasets between runs Mar 26, 2026
@levon003 levon003 marked this pull request as ready for review April 14, 2026 04:17
@levon003 levon003 requested a review from BaptisteMP April 14, 2026 04:22
@@ -0,0 +1,88 @@
# .. title:: Multiple configurations
# If you name a dataset, even an in-memory IterableDataSource, you can reuse it in multiple runs with different configurations.
levon003 (Member Author) commented:

This file demonstrates the primary purpose of this PR. Rather than requiring a particular format, we now support multiple input formats, including in-memory Python lists.

Among other things, that will simplify future testing and documented examples (which don't need to rely on external data files).

levon003 (Member Author) commented Apr 15, 2026

Claude Opus 4.6 PR Review: "Reuse datasets between runs"

Branch: better-datasets -> dev | 39 files changed | +1292 / -1042

Summary

This is a substantial refactor that introduces a many-to-many relationship between EvalSetRun and Dataset via a new EvalSetRunDatasets join table. It adds polymorphic data sources (FileDataSource, NamedDataSource, IterableDataSource), rewrites the LangGraph parser for langgraph >= 1.0, removes the broken context_only feature, and significantly improves tests.

Strengths

  1. Clean data model improvement. The old design tied datasets to a single EvalSetRun via FK. The new join table (EvalSetRunDatasets) is the right approach for dataset reuse. The FK removal from Thread, Turn, Message, and ToolCall is clean — they only need to reference their Dataset now.

  2. Polymorphic DataSource is well-designed. Using Pydantic's Discriminator("type") with tagged unions for NamedDataSource, FileDataSource, and IterableDataSource is idiomatic and makes YAML config and Python API usage consistent.

  3. LangGraph parser simplification is a big win. ~300 lines of fragile incremental metadata["writes"] parsing replaced with ~100 lines reading the final checkpoint's channel_values.messages. Much easier to reason about.

  4. context_only removal is the right call. It was broken, unused, and the comment in the PR explains the rationale well — wrapper functions are more explicit.

  5. Test quality improved substantially. ConfigFailures tests now assert specific exception types and verify zero metric rows instead of @expectedFailure. New unit tests for load_datasets control flow cover NamedDataSource, reuse-by-name, duplicate detection, iterable reuse, and type mismatch warnings. The completed LangGraph loading test is a nice finish.

  6. Vignette (vignettes/multiple_configs.py) is a good, self-contained example of the core feature.

Issues to Address

Medium:

  1. run_utils.py:53-58 — find_dataset_by_name uses len() on a query. len(eligible_datasets) on a Peewee SelectQuery fetches all rows just to count them. Use .count() instead, or better, use .first() and check for None, then run a second .count() only if you need the >1 check (first sketch below): count = Dataset.select().where(Dataset.name == name).count() Zach comment: fixed

  2. run_utils.py:83-84 — Auto-naming IterableDataSources with id(). f"_iterable_{id(data_source)}" uses the CPython memory address. If a data source is garbage-collected and another is created, id() can be reused within the same process. This is fine for same-run reuse, but the name is also persisted to the database (Dataset.name), making it meaningless for future sessions. Consider documenting that auto-names are not stable across processes, or using a hash/UUID instead (second sketch below). Zach comment: no change

  3. data_loader.py:189-197 — The PRAGMA statement is safe from SQL injection, and the thread_id query now uses parameterized queries correctly. Good. However, the old code had f"select * from checkpoints where thread_id = '{thread.langgraph_thread_id}'", which was SQL-injectable; the new parameterized approach (third sketch below) is a security improvement worth calling out. Zach comment: lol

  4. runner.py:89-91 — Data loading failures now raise after shutdown_logging(). This is a behavior change: previously, data loading errors were caught and logged, allowing the pipeline to continue (potentially producing empty results). Now it re-raises. This seems intentional and correct, but it's asymmetric with the completion and metric-computation blocks which still swallow exceptions. Consider making these consistent (or document why data loading is special). Zach comment: Interesting observation. Will need to think about where to document this.

  5. compute_metrics.py:570-594 — Thread pool created per-dataset. The multi-worker path now creates a new ThreadPoolExecutor for each dataset. If there are many small datasets, this adds overhead from pool spin-up/teardown. Consider collecting all graphs across datasets first, then processing them in a single pool (final sketch below). Zach comment: Good suggestion

  6. DEVELOPMENT.md:55 — Typo. "Integration tests live in tests/unit/" should say "Unit tests live in tests/unit/". Zach comment: Fixed
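Sketch for Medium #1, with a hypothetical signature (the real function is in run_utils.py):

```python
def find_dataset_by_name(name: str, raise_on_duplicate: bool = False):
    query = Dataset.select().where(Dataset.name == name)
    match = query.first()  # LIMIT 1; unlike len(), never fetches every row
    if match is not None and raise_on_duplicate and query.count() > 1:
        # Pay for COUNT(*) only when the duplicate check is actually needed.
        raise ValueError(f"Multiple datasets named {name!r}")
    return match
```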
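Sketch for Medium #2, a process-stable alternative to id()-based auto-names, assuming any unique string is an acceptable Dataset.name:

```python
import uuid

# uuid4 hex is unique across processes, unlike a potentially reused CPython id().
auto_name = f"_iterable_{uuid.uuid4().hex}"
```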
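Sketch for Medium #3, the parameterized form, with an illustrative connection and variable:

```python
import sqlite3

langgraph_thread_id = "thread-1"  # e.g. thread.langgraph_thread_id
conn = sqlite3.connect("checkpoints.sqlite")
# The placeholder keeps the thread id out of the SQL string entirely.
rows = conn.execute(
    "SELECT * FROM checkpoints WHERE thread_id = ?",
    (langgraph_thread_id,),
).fetchall()
```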
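Sketch for Medium #5: build every task graph first, then run them through one shared pool (process_graph and the worker count are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

graphs = [
    graph
    for dataset in datasets
    for graph in build_thread_task_graphs(dataset)
]
# One pool amortizes spin-up/teardown across all datasets.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(process_graph, graphs))
```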

Minor / Nits:

  1. eval_schema.py — extra = "forbid" added to MetricItem. This is a good validation improvement but it's a breaking change for anyone with context_only in their existing YAML configs. The evals.yaml cleanup removes all context_only entries, but external users may still have them. Consider whether this warrants a migration note or deprecation warning. Zach comment: No change

  2. completions.py:489 — The datasets: list parameter lacks a type annotation. Should be datasets: list[Dataset] for consistency with the rest of the codebase. Zach comment: Fixed

  3. compute_metrics.py:553 — Same untyped datasets: list. Should be list[Dataset]. Zach comment: Fixed

  4. data_loader.py — load_thread_to_dataset no longer sets eval_run_thread_id. The old JSONL loader set eval_run_thread_id=f"{thread_id}_{thread_eval_run_id}" for duplicate runs. The refactored load_thread_to_dataset omits this, which means nb_evaluations_per_thread > 1 for JSONL won't produce distinct eval_run_thread_id values. The LangGraph path still sets it. This looks like a regression.

  5. test_data_loader.py:345-346 — Dataset.create(..., source=...). The Dataset model doesn't have a source field — it has notes, metadata, etc. This will either be silently ignored by Peewee or raise an error. Should be notes=str(langgraph_db_path) or stored in metadata.

  6. functional_tests.py:539 — Changed expected count from 2 to 1. The comment says "Expected 1 turn with long enough text to evaluate." Worth a brief code comment explaining why the new parser produces different counts (the old parser was likely double-counting due to incremental checkpoint processing). Zach comment: No change

Questions

  • Is there a migration path for existing databases? The schema changes (removed evalsetrun FK from Thread/Turn/Message/ToolCall, removed dataset_files/filename from EvalSetRun/Dataset, new EvalSetRunDatasets table) will break existing .db files. Is clear_tables=True the expected workflow?
  • The nb_evaluations_per_thread / duplicate-thread logic for JSONL seems to have been removed during the refactor into load_thread_to_dataset. The load_jsonl function still has the loop, but it calls load_thread_to_dataset which doesn't set eval_run_thread_id. Is this intentional?

Verdict

This is a well-motivated refactor that cleans up the data model, improves test quality, and enables a key feature (dataset reuse). The core design is sound. The issues above are mostly minor — the DEVELOPMENT.md typo (Medium #6), the missing eval_run_thread_id in JSONL loading (Minor #4), and the invalid source field in the test (Minor #5) are the most likely to cause actual problems. I'd fix those before merging.

🤖 Generated with Claude Code

BaptisteMP (Contributor) left a comment:

Looks great! This will make it much easier to interact with datasets and reuse them.

One question: when clear_tables = True, does it clear all the datasets?
Would it be useful to be able to clear only some of the datasets instead?

A few comments below


DataSourceType = Annotated[
    Union[
        Annotated[NamedDataSource, Tag("named")],

BaptisteMP (Contributor):
did not know you could name types with Tag

# The named dataset "demo_conversations" is reused from Run 1.
# Note that you could use a flexeval.schema.NamedDataSource instead if you wanted.
eval_run_2 = EvalRun(
    data_sources=data_sources,

BaptisteMP (Contributor):

Not so clear that the data is being reused in the 2nd eval run; using the NamedDataSource would make it clearer (see the sketch below).
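A minimal sketch of the suggestion, assuming EvalRun otherwise takes the same kwargs as in the vignette:

```python
from flexeval.schema import NamedDataSource

# Reusing by name makes the dependency on Run 1 explicit:
eval_run_2 = EvalRun(
    data_sources=[NamedDataSource(name="demo_conversations")],
    # ...remaining configuration as in Run 1...
)
```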

Comment thread: src/flexeval/run_utils.py
    dataset,
    data_source,
    max_n_conversation_threads=config.max_n_conversation_threads,
    nb_evaluations_per_thread=config.nb_evaluations_per_thread,

BaptisteMP (Contributor):

Did you have an idea of how to handle nb_evaluations_per_thread?
It appears that config still has it, but it might not work correctly if the load_jsonl loop no longer sets eval_run_thread_id (as per Claude below).
Claude says: The nb_evaluations_per_thread / duplicate-thread logic for JSONL seems to have been removed during the refactor into load_thread_to_dataset. The load_jsonl function still has the loop, but it calls load_thread_to_dataset which doesn't set eval_run_thread_id. Is this intentional?

In principle your fix shouldn't affect nb_evaluations_per_thread, since it only touches how datasets are loaded, not how threads are loaded into datasets.

levon003 (Member Author):

This is a good flag. I'll have to look at this.
