Implement plan 05: test coverage gaps#960
Conversation
…leeps Replace each time.sleep(0.3) with wait_for_port(), which polls a TCP connect to the server port (5s timeout, 0.1s step) and raises TimeoutError on failure. Verified stable across 10 consecutive runs. Plan 05, task 3.
- Delete test/test_combinations.py: the whole class has been @pytest.mark.skip since 2023; its hypothesis sweep duplicates what test_bencher.py and the generated-example suite now cover. - Delete test_sweep_vars.py::test_int_sweep_samples_all: it was an exact line-for-line duplicate of the passing test_int_sweep_samples directly above it, skipped with no reason. - test_bencher.py::test_unique_file_names: replace the bare skip with a self-describing reason pointing at plans/05-test-coverage.md task 4. - test_usability.py: use bn.ResultFloat instead of the deprecated bn.ResultVar shim (identical behavior, verified before/after). Plan 05, task 4.
Cover to(result_type) conversion (values plotted match the worker), to_auto plot_list/remove_plots/failing-callback handling, to_auto_plots summary placement, plot() callback dispatch, default_plot_callbacks, from_existing state copies, and NaN-point robustness. Plan 05, task 1.
Cover viewer construction from a small ResultDataSet sweep, dataset_list round-tripping worker DataFrames unchanged, xarray index-to-frame mapping, and ds_to_container unwrapping the stored frame. Plan 05, task 1.
Cover 3-float volume construction (Plotly pane, single Volume trace, axis titles with units, trace values verified against the worker formula, isomin/isomax), the unsupported-shape rejection path, the over_time short-circuit, and NaN robustness (finite-only isomin/isomax). Plan 05, task 1.
…_dataframe.py) Assert data intactness through composition: append order/identity, single-DataFrame passthrough, right/down/sequence concat values and coords, overlay elementwise mean with skipna semantics, and label_formatter output. Complements test_composable_container_dataset.py which covers dims/sizes only. Plan 05, task 1.
Cover defaults, flag round-trips (repeats/over_time/cache flags), deprecated level= mapping, with_defaults semantics, persistent hash behavior (tag/bench_name/const-vars/include_repeats), describe_benchmark and sweep_sentence content, panel helpers, optimized/unoptimized input partition, optuna targets, and DimsCfg construction. Plan 05, task 2.
Cover construction defaults, function_input/canonical_input assembly (constants merged but excluded from canonical form), fn_inputs_sorted, and hash behavior: stable across dim ordering, sensitive to values/dim names/tag/constants, independent of bench_cfg_sample_hash. video_writer.py needs no new file: the core write path is already covered by test_video_writer_extended.py. Plan 05, task 2.
Cover the repeats>1 Curve+Spread overlay (labels, kdims/vdims, std band), to_plot delegation, categorical groupby via to_curve_ds, the repeats=1 filter rejection, and NaN robustness. Plan 05, task 1.
Cover static percentile-band composition (two Areas, median Curve, samples Scatter, percentile vdim names), default/explicit titles and unit-bearing ylabel, enable_scatter=False, categorical flattening into the sample pool, the over_time band path, regression-report suppression of to_band_ds, and NaN handling (nanpercentile + scatter mask drops NaN points). Plan 05, task 1.
Cover hv.Table construction (kdims/vdims), row count = samples x repeats, values matching the worker, repeat-dim squeeze at repeats=1, and NaN row preservation. Plan 05, task 1.
Cover _to_scatter_ds element type and title, the public to_scatter path, 2-cat NdOverlay grouping, float-input rejection, and NaN robustness. Also pins a discovered quirk: to_scatter filters result_types on the deprecated ResultVar, so modern ResultFloat sweeps silently return None (test_to_scatter_result_float_returns_none documents current behavior; the filter likely wants ResultFloat). Plan 05, task 1.
Cover to_bar Row/HoloViews/Bars structure, to_plot delegation, dims and unit-bearing ylabel via to_bar_ds, the ResultBool repeats=2 REDUCE scenario, 2-cat grouped kdims, float-input rejection, and NaN robustness. Plan 05, task 1.
…r_result.py) Cover overlay/element types, kdims/vdims and unit-bearing ylabel/title, raw repeats present per category (no aggregation), 2-cat kdims, repeats=1 filter rejection, and NaN robustness. Plan 05, task 1.
Cover overlay/element types, label/units propagation, raw repeats per category, filter rejection, and NaN robustness through the shared DistributionResult base. Plan 05, task 1.
…_jitter_result.py) Cover jitter scatter element type, label propagation, per-category raw sample integrity with repeats=3, its stricter cat_range filter (2-cat rejection), and NaN robustness. Plan 05, task 1.
Cover element structure (kdim/vdim names), bin frequencies summing to the sample count, bins= kwarg forwarding, title/ylabel/xrotation opts, the native 0-input repeats filter and float-input rejection, and NaN samples being dropped from bin counts without crashing. Note: result-var units never appear in histogram output (only the var name); tests assert the implemented behavior. Plan 05, task 1.
Cover exact best_value/best_params and summary() lines for single-objective studies, Pareto-front membership and the single-objective-only RuntimeError for multi-objective, sweep-driven structure (trial counts, target_names, directions), and NaN worker output marking the trial FAIL while the study continues. Plan 05, task 1.
Reviewer's GuideAdds extensive new unit tests for core result and configuration modules, replaces flaky file server sleeps with a port polling helper, removes or clarifies long-skipped tests, and tightens behavioral guarantees (including NaN handling and plotting/filtering behavior) without changing production code. File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- Several test helpers (e.g.
unwrap_hv,_inner_element,_run_sweep,run_cfg_with) are reimplemented with very similar logic across multiple new test modules; consider moving these into a shared test utility module to avoid duplication and keep behavior consistent. - The new
wait_for_porthelper intest_file_server.pyis specific but generic enough to be reused; if other tests also spin up HTTP/TCP servers it may be worth centralizing this function in a common test helper to avoid diverging polling logic and timeouts.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Several test helpers (e.g. `unwrap_hv`, `_inner_element`, `_run_sweep`, `run_cfg_with`) are reimplemented with very similar logic across multiple new test modules; consider moving these into a shared test utility module to avoid duplication and keep behavior consistent.
- The new `wait_for_port` helper in `test_file_server.py` is specific but generic enough to be reused; if other tests also spin up HTTP/TCP servers it may be worth centralizing this function in a common test helper to avoid diverging polling logic and timeouts.
## Individual Comments
### Comment 1
<location path="test/test_band_result.py" line_range="123-132" />
<code_context>
+class TestBandResult:
</code_context>
<issue_to_address>
**suggestion (testing):** Add a negative test to pin the BandResult filter behavior when repeats=1 or otherwise not matching the filter criteria.
Current BandResult tests only cover valid shapes and special cases; they don't verify what happens when the filter rejects input (e.g., sweeps that don't meet `float_range`/`repeats_range` like `repeats>=2`). For other result types, we assert that `to_*` with `override=False` returns a diagnostic/None instead of a plot when the filter fails. Please add a similar negative test here that:
- builds a Band-style sweep that should be rejected (e.g., `repeats=1`),
- calls `to_band(override=False)`, and
- asserts that the result is not an `hv.Overlay`/HoloViews pane (and optionally checks the diagnostic text when `print_debug` is enabled).
This will lock in the rejection behavior for unsupported shapes and prevent regressions where BandResult might silently emit misleading plots.
Suggested implementation:
```python
class TestBandResult:
def test_to_band_overlay_composition(self, res_1d):
"""to_band yields two percentile Areas, a median Curve and a samples Scatter."""
plot = res_1d.to_band()
assert plot is not None
overlay = unwrap_hv(plot)
assert isinstance(overlay, hv.Overlay)
# exact types: hv.Area is a subclass of hv.Curve, so isinstance would double count
assert len([el for el in overlay if type(el) is hv.Area]) == 2
assert len([el for el in overlay if type(el) is hv.Curve]) == 1
assert len([el for el in overlay if type(el) is hv.Scatter]) == 1
def test_to_band_rejected_when_filter_fails(self, res_1d_repeats_1):
"""to_band(override=False) should not emit a Band plot if the filter rejects the sweep.
In particular, a Band-style sweep with repeats=1 (or otherwise outside the
allowed repeats/float ranges) should be rejected and yield a diagnostic
instead of an hv.Overlay/HoloViews pane.
"""
# When the filter fails, to_band(override=False) should *not* return a plot.
# With print_debug enabled, we also expect a human-readable diagnostic.
result = res_1d_repeats_1.to_band(override=False, print_debug=True)
# Unwrap any Panel/HoloViews wrappers, then assert that we did *not* get an Overlay.
unwrapped = unwrap_hv(result)
assert not isinstance(unwrapped, hv.Overlay)
# Optional: pin a bit of the diagnostic so regressions in the filter behavior are caught.
# Adjust the substring below to match the actual diagnostic message emitted by BandResult.
assert "repeats" in str(result) or "filter" in str(result)
```
To make this test pass, you also need to:
1. Add a `res_1d_repeats_1` fixture in this file (or a shared conftest) that:
- Builds a Band-style sweep identical (or very similar) to `res_1d`, but with `repeats=1` or otherwise configured so that the BandResult filter rejects it (e.g., by violating `float_range`/`repeats_range` such as `repeats>=2`).
- Returns the corresponding `BandResult` instance.
For example (pseudocode, adapt to your actual API):
```python
@pytest.fixture
def res_1d_repeats_1(bench, run_cfg):
run_cfg_single = dataclasses.replace(run_cfg, repeats=1) # or whatever your API is
res = bench.plot_sweep(
"band_time",
input_vars=["size"],
result_vars=["throughput"],
run_cfg=run_cfg_single,
time_src="2026-06-10 snap0000",
)
return res
```
2. Adjust the diagnostic assertion in the test if the actual `BandResult.to_band` rejection message differs:
- Change the `"repeats"` / `"filter"` substring checks to match the concrete text your implementation emits when the filter fails.
3. If `to_band(override=False, print_debug=True)` returns a structured object (e.g. `(diagnostic, plot)` or a Panel pane) instead of a single value, update the test accordingly:
- Unpack the return value, and apply the `unwrap_hv` / `isinstance(..., hv.Overlay)` checks to the plot component.
- Apply the substring assertion to the diagnostic component.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Performance Report for
|
| Metric | Value |
|---|---|
| Total tests | 1653 |
| Total time | 123.24s |
| Mean | 0.0746s |
| Median | 0.0020s |
Top 10 slowest tests
| Test | Time (s) |
|---|---|
test.test_bench_examples.TestBenchExamples::test_example_meta |
16.957 |
test.test_over_time_save_perf::test_save_faster_without_aggregated_tab |
4.982 |
test.test_generated_examples::test_generated_example[regression/example_regression_tuning_drift.py] |
4.544 |
test.test_generated_examples::test_generated_example[cartesian_animation/example_cartesian_animation.py] |
2.949 |
test.test_split_render_examples::test_split_render_subprocess_media |
2.940 |
test.test_generated_examples::test_generated_example[regression/example_regression_tuning_step.py] |
2.830 |
test.test_generated_examples::test_generated_example[result_types/result_image/example_result_image_to_video.py] |
2.807 |
test.test_hash_persistent.TestCrossProcessDeterminism::test_hash_stable_across_two_processes[ResultBool] |
2.795 |
test.test_generated_examples::test_generated_example[regression/example_regression_tuning_noise.py] |
2.777 |
test.test_over_time_repeats.TestMaxSliderPoints::test_default_subsampling_caps_at_max |
2.398 |
Updated by Performance Tracking workflow
|
"Histogram output never shows result-var units (only the var name); tests assert the implemented behaviour" this is not desired. All graphs should show units. check this across the codebase (make a follow up pr if appropriate) |
Addresses review feedback that all graphs should display units. The
histogram x-axis carries the result variable but only showed its name;
now uses the '{name} [{units}]' convention shared with band/bar/heatmap.
Move the copy-pasted unwrap_hv, inner_element and run_cfg_with helpers into a shared module, and add a BandResult negative test pinning that a non-scalar (vector) result is rejected rather than silently plotted.
|
Good catch — fixed the histogram in this PR (e9a91fc): the x-axis now shows Audited the rest of the codebase per your request. Units are already shown by Also addressed the Sourcery feedback: consolidated the duplicated |
Performance Report for
|
| Metric | Value |
|---|---|
| Total tests | 1656 |
| Total time | 120.29s |
| Mean | 0.0726s |
| Median | 0.0020s |
Top 10 slowest tests
| Test | Time (s) |
|---|---|
test.test_bench_examples.TestBenchExamples::test_example_meta |
16.002 |
test.test_over_time_save_perf::test_save_faster_without_aggregated_tab |
4.634 |
test.test_generated_examples::test_generated_example[regression/example_regression_tuning_drift.py] |
4.377 |
test.test_generated_examples::test_generated_example[cartesian_animation/example_cartesian_animation.py] |
3.036 |
test.test_split_render_examples::test_split_render_subprocess_media |
3.023 |
test.test_hash_persistent.TestCrossProcessDeterminism::test_hash_stable_across_two_processes[ResultBool] |
2.869 |
test.test_generated_examples::test_generated_example[result_types/result_image/example_result_image_to_video.py] |
2.706 |
test.test_generated_examples::test_generated_example[regression/example_regression_tuning_noise.py] |
2.584 |
test.test_generated_examples::test_generated_example[regression/example_regression_tuning_step.py] |
2.253 |
test.test_over_time_repeats.TestMaxSliderPoints::test_default_subsampling_caps_at_max |
2.238 |
Updated by Performance Tracking workflow
Addresses the Sourcery review on #960: the per-file _run_sweep bodies in the bar/scatter/box-whisker/violin/scatter-jitter result tests were near-identical copies. Add run_named_sweep and run_dist_sweep to test/helpers.py and delegate to them, so the run-config and plot-callback setup lives in one place. Also bump version to 1.106.0.
Performance Report for
|
| Metric | Value |
|---|---|
| Total tests | 1659 |
| Total time | 124.12s |
| Mean | 0.0748s |
| Median | 0.0020s |
Top 10 slowest tests
| Test | Time (s) |
|---|---|
test.test_bench_examples.TestBenchExamples::test_example_meta |
17.088 |
test.test_over_time_save_perf::test_save_faster_without_aggregated_tab |
5.142 |
test.test_generated_examples::test_generated_example[regression/example_regression_tuning_drift.py] |
4.564 |
test.test_split_render_examples::test_split_render_subprocess_media |
2.986 |
test.test_generated_examples::test_generated_example[cartesian_animation/example_cartesian_animation.py] |
2.938 |
test.test_hash_persistent.TestCrossProcessDeterminism::test_hash_stable_across_two_processes[ResultBool] |
2.855 |
test.test_generated_examples::test_generated_example[result_types/result_image/example_result_image_to_video.py] |
2.826 |
test.test_generated_examples::test_generated_example[regression/example_regression_tuning_noise.py] |
2.782 |
test.test_generated_examples::test_generated_example[regression/example_regression_tuning_step.py] |
2.492 |
test.test_over_time_repeats.TestMaxSliderPoints::test_default_subsampling_caps_at_max |
2.414 |
Updated by Performance Tracking workflow
|
Follow-up PR for the remaining axis-unit gaps (curve/line/scatter) is up: #963 |
Summary
Implements
plans/05-test-coverage.md(from #959): direct unit tests for the previously indirectly-testedresults/and core modules, de-flaked file server tests, and resolved long-skipped tests. One module per commit, as the plan prescribes.Coverage: 90% → 91% (missed lines 1050 → 1001 of 10776). Tests: 1474 passed / 5 skipped → 1651 passed / 2 skipped (+177 tests across 16 new test files).
Task 1 — unit tests for untested result modules
results/bench_result.pyresults/histogram_result.pyresults/optimize_result.pyresults/dataset_result.pyholoview_results/scatter_result.pyholoview_results/bar_result.pyholoview_results/curve_result.pyholoview_results/band_result.pyholoview_results/table_result.pydistribution_result/box/violin/jitterresults/volume_result.pycomposable_container_dataframe.pySeveral of these were already at high line coverage purely through generated-example integration tests; the new tests assert behavior directly — element types, kdims/vdims, title/ylabel/units propagation, values matching the worker function, filter rejection paths, and NaN robustness (NaN is the missing-value default since v1.105).
Task 2 — core modules
test_bench_cfg.py(42 tests): defaults, flag round-trips, deprecatedlevel=mapping, persistent hash behavior, describe/sentence helpers, optuna target partition.bench_cfg.py97% → 99%.test_worker_job.py(16 tests): input assembly, canonical form, hash stability/sensitivity.video_writer.py: no new file — the plan's target (write path with synthetic frames, file exists and non-empty) is already covered bytest_video_writer_extended.py.Task 3 — de-flake
test_file_server.pyReplaced each
time.sleep(0.3)with await_for_port()TCP poll (5s timeout, 0.1s step,TimeoutErroron failure). Verified stable across 10 consecutive runs (all passed).Task 4 — long-skipped tests (with deviations; the plan's named tests don't exist)
test_combinations.py: not an empty placeholder — a 184-line hypothesis class fully@pytest.mark.skipsince 2023. Deleted per the plan's intent.test_sweep_vars.pyistest_int_sweep_samples_all(nottest_missing_default_value): an exact line-for-line duplicate of the passing test above it. Deleted.test_bencher.py::test_unique_file_names(nottest_plot_all_permutations): bare skip replaced with a self-describing reason pointing at the plan.test_usability.py: deprecatedbn.ResultVar→bn.ResultFloat(verified identical behavior).Bugs found while writing tests (not fixed here — test-only PR)
ScatterResult.to_scatterfiltersresult_types=(ResultVar,), butResultVaris the deprecated shim — modernResultFloatsweeps silently returnNone(the shipped scatter example appendsNoneto its report). Pinned bytest_to_scatter_result_float_returns_none; the filter likely wantsResultFloat.DistributionResult._plot_distributionforwardstitleinside**kwargsand as an explicit kwarg into.opts()— any caller passingtitle=would hit "multiple values for keyword argument".Notes
fail_under(plan 01) is not in place yet, so there is nothing to raise; when plan 01 lands it can start at 91%.pixi run cipasses locally.🤖 Generated with Claude Code
Summary by Sourcery
Increase unit test coverage for core result and configuration modules and stabilize flaky file server tests.
Bug Fixes:
Enhancements: