
Reuse datasets between runs #90

Open

levon003 wants to merge 9 commits into dev from better-datasets

Conversation


levon003 (Member) commented Sep 19, 2025

  • Add logic for reusing datasets
  • Can now associate EvalRuns with one or more datasets
  • Bump to langgraph 1.0+ and update test data script
  • Unit & integration testing cleanup
  • Removed the context_only flag. First, it was broken, which suggests no one was using it. Second, it was confusing and unnecessary: if a function metric only wants the context, it can extract that itself (see the sketch below).
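For illustration, a hedged sketch of the replacement pattern; the metric signature and Turn accessor here are hypothetical, not the actual flexeval API:

```python
def context_word_count(turn) -> int:
    # Hypothetical accessor: assume the Turn object exposes the prior
    # conversation as a list of {"role": ..., "content": ...} dicts.
    context_messages = turn.context
    return sum(len(m["content"].split()) for m in context_messages)
```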

Claude summary

  1. Dataset loading refactor
  • Polymorphic data sources: DataSource now has three discriminated subtypes via a type field: FileDataSource, NamedDataSource (lookup by name), and IterableDataSource. Uses Pydantic Discriminator("type") (see the sketch after this list).
  • FileFormatEnum: Explicit enum for file formats (jsonl, langgraph_sqlite) replacing a bare Literal["jsonl"].
  • New run_utils.load_datasets(): Centralized dataset loading logic with support for reuse-by-name, duplicate detection, and named lookups. New Config flags: reuse_dataset_by_name,
    raise_on_duplicate_dataset_name, raise_on_unnamed_dataset.
  • Dataset class simplified: Removed ~80 lines of loading logic that moved into run_utils.
  2. LangGraph parser rewrite
  • data_loader.py: Rewrote load_langgraph_sqlite() for langgraph 1.0 compatibility. Old approach (~300 lines) parsed incremental metadata["writes"]; new approach (~100 lines) reads the
    final checkpoint's cumulative channel_values.messages.
  3. context_only feature removal
  • Removed the broken/defunct context_only feature from Turn, compute_metrics, completions, function_types, eval_schema, and test configs.
  4. Test improvements
  • ConfigFailures: Tests now assert specific exception types and verify zero metric rows for runtime errors (instead of @expectedFailure or just checking for no crash).
  • LangGraph integration tests: Unskipped all 4 suites, regenerated test data.
  • Unit tests: Expanded test_data_loader.py significantly (+380 lines) — new tests for NamedDataSource, reuse-by-name, duplicate detection, iterable reuse, and completed the LangGraph
    loading test.
  5. Other changes
  • Docs updates (DEVELOPMENT.md, vignettes.rst, abstractions.rst)
  • New vignette: vignettes/multiple_configs.py
  • Dependency updates in uv.lock / pyproject.toml
  • Minor cleanups in runner.py, db_utils.py, dependency_graph.py, metrics/save.py
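A minimal sketch of the tagged-union pattern described above, with illustrative field names (the real models live in evalrun_schema.py, and the PR's actual definition also attaches pydantic.Tag labels, as quoted later in this thread):

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Discriminator, TypeAdapter


class NamedDataSource(BaseModel):
    type: Literal["named"] = "named"
    name: str


class FileDataSource(BaseModel):
    type: Literal["file"] = "file"
    path: str
    format: str = "jsonl"  # a FileFormatEnum in the real code


class IterableDataSource(BaseModel):
    type: Literal["iterable"] = "iterable"
    name: str | None = None
    data: list[dict] = []


DataSourceType = Annotated[
    Union[NamedDataSource, FileDataSource, IterableDataSource],
    Discriminator("type"),
]

# The discriminator routes a plain dict (e.g. from YAML) to the right subtype:
source = TypeAdapter(DataSourceType).validate_python(
    {"type": "file", "path": "threads.jsonl"}
)
assert isinstance(source, FileDataSource)
```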

Schema/Model changes

  • dataset.py — Removed load_data(), is_sqlite_file(), max_n_conversation_threads, nb_evaluations_per_thread
  • thread.py — Removed evalsetrun FK
  • turn.py — Removed evalsetrun FK; get_completion() now accepts completion_config and evalsetrun as params
  • message.py — Same as turn.py
  • tool_call.py — Removed evalsetrun FK
  • eval_set_run.py — Renamed backref to dataset_links, added dataset_list property
  • evalrun_schema.py — Added type discriminator field to DataSource subclasses, created DataSourceType discriminated union
  • db_utils.py — Added EvalSetRunDatasets to DATABASE_TABLES (a sketch of the join-table shape follows this list)
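A sketch of the many-to-many shape in Peewee, with invented minimal fields; the real models carry more columns:

```python
from peewee import CharField, ForeignKeyField, Model, SqliteDatabase

db = SqliteDatabase(":memory:")

class BaseModel(Model):
    class Meta:
        database = db

class Dataset(BaseModel):
    name = CharField(null=True)

class EvalSetRun(BaseModel):
    name = CharField(null=True)

    @property
    def dataset_list(self):
        # Resolve join rows to Dataset instances via the renamed backref.
        return [link.dataset for link in self.dataset_links]

class EvalSetRunDatasets(BaseModel):
    evalsetrun = ForeignKeyField(EvalSetRun, backref="dataset_links")
    dataset = ForeignKeyField(Dataset, backref="run_links")

db.create_tables([Dataset, EvalSetRun, EvalSetRunDatasets])
```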

Pipeline wiring

  • data_loader.py — Removed all evalsetrun=dataset.evalsetrun references; load_file() now accepts config params
  • run_utils.py — Removed dataset_files from build_eval_set_run(); restructured load_datasets() with clean control flow (sketched after this list)
  • runner.py — Wired up build_evalsetrun_datasets(), passes datasets list to completions/metrics
  • completions.py — Accepts datasets list, iterates threads through datasets
  • compute_metrics.py — build_thread_task_graphs() takes Dataset; stored do_completion/grader_llm on MetricComputer; fixed load_rubrics() return value bug
  • metrics/save.py — Accepts evalsetrun and datasets as explicit parameters
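A pseudocode-level sketch of the load_datasets() control flow, inferred from the Config flags listed earlier; the helper names (find_dataset_by_name, load_from_source) are assumptions, not flexeval API:

```python
def load_datasets(config, data_sources):
    datasets = []
    for source in data_sources:
        name = getattr(source, "name", None)
        if isinstance(source, NamedDataSource):
            # Pure lookup: the dataset must already exist.
            dataset = find_dataset_by_name(name)
            if dataset is None:
                raise ValueError(f"No existing dataset named {name!r}")
        elif config.reuse_dataset_by_name and name is not None:
            # Reuse a prior run's dataset with the same name, else load fresh.
            dataset = find_dataset_by_name(name) or load_from_source(source)
        else:
            if name is None and config.raise_on_unnamed_dataset:
                raise ValueError("Data source has no name")
            dataset = load_from_source(source)
        datasets.append(dataset)
    return datasets
```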

@levon003 levon003 self-assigned this Sep 19, 2025
@levon003 levon003 changed the title Reuse datasets between runs [Draft] Reuse datasets between runs Mar 26, 2026
@levon003 levon003 changed the title [Draft] Reuse datasets between runs Reuse datasets between runs Mar 26, 2026
@levon003 levon003 marked this pull request as ready for review April 14, 2026 04:17
@levon003 levon003 requested a review from BaptisteMP April 14, 2026 04:22
@@ -0,0 +1,88 @@
# .. title:: Multiple configurations
# If you name a dataset, even an in-memory IterableDataSource, you can reuse it in multiple runs with different configurations.
levon003 (Member Author) commented:

This file demonstrates the primary purpose of this PR. Rather than requiring a particular format, we now support multiple input formats, including in-memory Python lists.

Among other things, that will simplify future testing and documented examples (which don't need to rely on external data files).

levon003 (Member Author) commented Apr 15, 2026

Claude Opus 4.6 PR Review: "Reuse datasets between runs"

Branch: better-datasets -> dev | 39 files changed | +1292 / -1042

Summary

This is a substantial refactor that introduces a many-to-many relationship between EvalSetRun and Dataset via a new EvalSetRunDatasets join table. It adds polymorphic data sources (FileDataSource, NamedDataSource, IterableDataSource), rewrites the LangGraph parser for langgraph >= 1.0, removes the broken context_only feature, and significantly improves tests.

Strengths

  1. Clean data model improvement. The old design tied datasets to a single EvalSetRun via FK. The new join table (EvalSetRunDatasets) is the right approach for dataset reuse. The FK removal from Thread, Turn, Message, and ToolCall is clean — they only need to reference their Dataset now.

  2. Polymorphic DataSource is well-designed. Using Pydantic's Discriminator("type") with tagged unions for NamedDataSource, FileDataSource, and IterableDataSource is idiomatic and makes YAML config and Python API usage consistent.

  3. LangGraph parser simplification is a big win. ~300 lines of fragile incremental metadata["writes"] parsing replaced with ~100 lines reading the final checkpoint's channel_values.messages. Much easier to reason about.

  4. context_only removal is the right call. It was broken, unused, and the comment in the PR explains the rationale well — wrapper functions are more explicit.

  5. Test quality improved substantially. ConfigFailures tests now assert specific exception types and verify zero metric rows instead of @expectedFailure. New unit tests for load_datasets control flow cover NamedDataSource, reuse-by-name, duplicate detection, iterable reuse, and type mismatch warnings. The completed LangGraph loading test is a nice finish.

  6. Vignette (vignettes/multiple_configs.py) is a good, self-contained example of the core feature.

Issues to Address

Medium:

  1. run_utils.py:53-58 — find_dataset_by_name uses len() on a query. len(eligible_datasets) on a Peewee SelectQuery fetches all rows just to count them. Use .count() instead, or better, use .first() and check for None, then run a second .count() only if you need the >1 check (first sketch below): count = Dataset.select().where(Dataset.name == name).count() Zach comment: fixed

  2. run_utils.py:83-84 — Auto-naming IterableDataSources with id(). f"_iterable_{id(data_source)}" uses the CPython memory address. If a data source is garbage-collected and another is created, id() can be reused within the same process. This is fine for same-run reuse, but the name is also persisted to the database (Dataset.name), making it meaningless for future sessions. Consider documenting that auto-names are not stable across processes, or using a hash/UUID instead (second sketch below). Zach comment: no change

  3. data_loader.py:189-197 — The PRAGMA statement is safe from SQL injection, and the thread_id query now uses parameterized queries correctly. Good. However, the old code had f"select * from checkpoints where thread_id = '{thread.langgraph_thread_id}'", which was SQL-injectable; the new parameterized approach (third sketch below) is a security improvement worth calling out. Zach comment: lol

  4. runner.py:89-91 — Data loading failures now raise after shutdown_logging(). This is a behavior change: previously, data loading errors were caught and logged, allowing the pipeline to continue (potentially producing empty results). Now it re-raises. This seems intentional and correct, but it's asymmetric with the completion and metric-computation blocks which still swallow exceptions. Consider making these consistent (or document why data loading is special). Zach comment: Interesting observation. Will need to think about where to document this.

  5. compute_metrics.py:570-594 — Thread pool created per-dataset. The multi-worker path now creates a new ThreadPoolExecutor for each dataset. If there are many small datasets, this adds overhead from pool spin-up/teardown. Consider collecting all graphs across datasets first, then processing them in a single pool (final sketch below). Zach comment: Good suggestion

  6. DEVELOPMENT.md:55 — Typo. "Integration tests live in tests/unit/" should say "Unit tests live in tests/unit/". Zach comment: Fixed
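Sketch for Medium #1, with a hypothetical signature (the real function is in run_utils.py):

```python
def find_dataset_by_name(name: str, raise_on_duplicate: bool = False):
    query = Dataset.select().where(Dataset.name == name)
    match = query.first()  # LIMIT 1; unlike len(), never fetches every row
    if match is not None and raise_on_duplicate and query.count() > 1:
        # Pay for COUNT(*) only when the duplicate check is actually needed.
        raise ValueError(f"Multiple datasets named {name!r}")
    return match
```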
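Sketch for Medium #2, a process-stable alternative to id()-based auto-names, assuming any unique string is an acceptable Dataset.name:

```python
import uuid

# uuid4 hex is unique across processes, unlike a potentially reused CPython id().
auto_name = f"_iterable_{uuid.uuid4().hex}"
```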
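Sketch for Medium #3, the parameterized form, with an illustrative connection and variable:

```python
import sqlite3

langgraph_thread_id = "thread-1"  # e.g. thread.langgraph_thread_id
conn = sqlite3.connect("checkpoints.sqlite")
# The placeholder keeps the thread id out of the SQL string entirely.
rows = conn.execute(
    "SELECT * FROM checkpoints WHERE thread_id = ?",
    (langgraph_thread_id,),
).fetchall()
```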
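Sketch for Medium #5: build every task graph first, then run them through one shared pool (process_graph and the worker count are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

graphs = [
    graph
    for dataset in datasets
    for graph in build_thread_task_graphs(dataset)
]
# One pool amortizes spin-up/teardown across all datasets.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(process_graph, graphs))
```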

Minor / Nits:

  1. eval_schema.py — extra = "forbid" added to MetricItem. This is a good validation improvement but it's a breaking change for anyone with context_only in their existing YAML configs. The evals.yaml cleanup removes all context_only entries, but external users may still have them. Consider whether this warrants a migration note or deprecation warning. Zach comment: No change

  2. completions.py:489 — The datasets: list parameter lacks a type annotation. Should be datasets: list[Dataset] for consistency with the rest of the codebase. Zach comment: Fixed

  3. compute_metrics.py:553 — Same untyped datasets: list. Should be list[Dataset]. Zach comment: Fixed

  4. data_loader.py — load_thread_to_dataset no longer sets eval_run_thread_id. The old JSONL loader set eval_run_thread_id=f"{thread_id}_{thread_eval_run_id}" for duplicate runs. The refactored load_thread_to_dataset omits this, which means nb_evaluations_per_thread > 1 for JSONL won't produce distinct eval_run_thread_id values. The LangGraph path still sets it. This looks like a regression.

  5. test_data_loader.py:345-346 — Dataset.create(..., source=...). The Dataset model doesn't have a source field — it has notes, metadata, etc. This will either be silently ignored by Peewee or raise an error. Should be notes=str(langgraph_db_path) or stored in metadata.

  6. functional_tests.py:539 — Changed expected count from 2 to 1. The comment says "Expected 1 turn with long enough text to evaluate." Worth a brief code comment explaining why the new parser produces different counts (the old parser was likely double-counting due to incremental checkpoint processing). Zach comment: No change

Questions

  • Is there a migration path for existing databases? The schema changes (removed evalsetrun FK from Thread/Turn/Message/ToolCall, removed dataset_files/filename from EvalSetRun/Dataset, new EvalSetRunDatasets table) will break existing .db files. Is clear_tables=True the expected workflow?
  • The nb_evaluations_per_thread / duplicate-thread logic for JSONL seems to have been removed during the refactor into load_thread_to_dataset. The load_jsonl function still has the loop, but it calls load_thread_to_dataset which doesn't set eval_run_thread_id. Is this intentional?

Verdict

This is a well-motivated refactor that cleans up the data model, improves test quality, and enables a key feature (dataset reuse). The core design is sound. The issues above are mostly minor — the DEVELOPMENT.md typo (Medium #6), the missing eval_run_thread_id in JSONL loading (Minor #4), and the invalid source field in the test (Minor #5) are the most likely to cause actual problems. I'd fix those before merging.

🤖 Generated with Claude Code

BaptisteMP (Contributor) left a comment:

Looks great! This will make it much easier to interact with datasets and reuse them.

One question: when clear_tables = True, does it clear all the datasets?
Would it be useful to be able to clear only some of the datasets instead?

A few comments below


DataSourceType = Annotated[
    Union[
        Annotated[NamedDataSource, Tag("named")],

BaptisteMP (Contributor):
did not know you could name types with Tag

# The named dataset "demo_conversations" is reused from Run 1.
# Note that you could use a flexeval.schema.NamedDataSource instead if you wanted.
eval_run_2 = EvalRun(
    data_sources=data_sources,

BaptisteMP (Contributor):

Not so clear that the data is being reused in the 2nd eval run; using the NamedDataSource would make it clearer (see the sketch below).
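A minimal sketch of the suggestion, assuming EvalRun otherwise takes the same kwargs as in the vignette:

```python
from flexeval.schema import NamedDataSource

# Reusing by name makes the dependency on Run 1 explicit:
eval_run_2 = EvalRun(
    data_sources=[NamedDataSource(name="demo_conversations")],
    # ...remaining configuration as in Run 1...
)
```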

Comment thread: src/flexeval/run_utils.py
    dataset,
    data_source,
    max_n_conversation_threads=config.max_n_conversation_threads,
    nb_evaluations_per_thread=config.nb_evaluations_per_thread,

BaptisteMP (Contributor):

Did you have an idea of how to handle nb_evaluations_per_thread?
It appears that config still has it, but it might not work correctly if the load_jsonl loop no longer sets eval_run_thread_id (as per Claude below).
Claude says: The nb_evaluations_per_thread / duplicate-thread logic for JSONL seems to have been removed during the refactor into load_thread_to_dataset. The load_jsonl function still has the loop, but it calls load_thread_to_dataset which doesn't set eval_run_thread_id. Is this intentional?

In principle your fix shouldn't affect nb_evaluations_per_thread, since it only touches how datasets are loaded, not how threads are loaded into datasets.

levon003 (Member Author):

This is a good flag. I'll have to look at this.
