Feature/correct interpolation #249
Conversation
Pull request overview
This PR updates Hercules’ input resampling behavior to account for the difference between start-of-period timestamps in input files and instantaneous values used internally, by shifting numeric input samples to interval midpoints prior to interpolation.
Changes:
- Apply midpoint correction for numeric columns during interpolation (`_interpolate_with_polars`).
- Update unit/regression tests to reflect the new interpolation semantics.
- Clarify timing conventions in documentation (inputs = start-of-period averages; internal/outputs = instantaneous).
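For illustration, the midpoint correction at the heart of this PR can be sketched with plain NumPy (a minimal example of the idea, not the Hercules implementation; the data values are invented):

```python
import numpy as np

# Period-averaged samples: each timestamp marks the START of a 1 s period.
times = np.array([0.0, 1.0, 2.0, 3.0])
values = np.array([10.0, 20.0, 30.0, 40.0])

# Midpoint correction: treat each average as the instantaneous value at
# the middle of its period before interpolating.
midpoints = np.empty_like(times)
midpoints[:-1] = (times[:-1] + times[1:]) / 2.0
midpoints[-1] = times[-1] + (times[-1] - times[-2]) / 2.0  # assume last interval width

# Interpolate onto instantaneous query times.
query = np.array([1.0, 1.5, 2.0])
instantaneous = np.interp(query, midpoints, values)
```

Querying at an interval midpoint (here 1.5) recovers that period's average exactly, while querying at a period boundary blends the two adjacent averages.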
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| hercules/utilities.py | Implements midpoint-corrected interpolation and adds `_compute_interval_midpoints`. |
| tests/utilities_test.py | Updates interpolation expectations to validate midpoint behavior. |
| tests/wind_farm_scada_power_test.py | Adjusts expected SCADA replay values at step=1 due to midpoint correction. |
| tests/wind_farm_precom_floris_test.py | Updates wind input expectations by matching internal midpoint interpolation. |
| tests/wind_farm_dynamic_floris_test.py | Updates wind input expectations and adds midpoint-based comparison logic. |
| tests/power_playback_test.py | Updates expected power playback at step=1 to midpoint-averaged value. |
| tests/test_inputs/scada_input.csv | Tweaks SCADA power value to support updated rated power / expectations. |
| tests/regression_tests/solar_pysam_pvwatts_regression_test.py | Updates expected regression arrays after interpolation semantics change. |
| tests/example_regression_tests/example_03_regression_test.py | Updates expected final wind/solar/plant totals for corrected interpolation. |
| tests/example_regression_tests/example_00b_regression_precom_test.py | Updates expected final totals for corrected interpolation. |
| tests/example_regression_tests/example_00_regression_test.py | Updates expected final totals for corrected interpolation. |
| docs/timing.md | Adds explicit "Inputs vs internal values" section and explains midpoint correction. |
| docs/wind.md | Clarifies SCADA `time_utc` is start-of-period and points to timing docs. |
| docs/solar_pv.md | Clarifies solar weather `time_utc` start-of-period interpretation and conversion. |
| docs/power_playback.md | Clarifies playback file timestamps are start-of-period and are midpoint-corrected. |
| docs/output_files.md | Clarifies output values are instantaneous (not period averages). |
| docs/hercules_input.md | Clarifies external data file `time_utc` start-of-period interpretation. |
```python
midpoints = np.empty_like(time_values, dtype=np.float64)
midpoints[:-1] = (time_values[:-1] + time_values[1:]) / 2.0
midpoints[-1] = time_values[-1] + (time_values[-1] - time_values[-2]) / 2.0
return midpoints
```
_compute_interval_midpoints assumes at least 2 time points; with a single-row input (or after upstream filtering) this will raise an IndexError at time_values[-2]. Add an explicit guard for len(time_values) < 2 (e.g., return time_values.copy() or time_values + 0.0) so interpolate_df can handle degenerate/constant inputs gracefully.
This seems like an unlikely edge case; handling by simply returning that scalar value as its own midpoint.
```python
col_values = col_data[col].to_numpy()

midpoints = _compute_interval_midpoints(time_values)
```
_interpolate_with_polars recomputes midpoints inside the loop for every numeric column, even though they only depend on the (sorted) time array. For wide input tables (e.g., many turbines/signals) this is unnecessary overhead; compute the sorted time_values and midpoints once, then reuse them for each numeric column (only extracting col_values per column).
Good suggestion, made that change
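A minimal sketch of that change, using plain NumPy arrays to stand in for the Polars columns (the data and column names here are invented; only the compute-once/reuse structure is the point):

```python
import numpy as np

# Stand-ins for the DataFrame's time axis and numeric columns.
time_values = np.array([0.0, 1.0, 2.0])
columns = {"a": np.array([1.0, 2.0, 3.0]), "b": np.array([10.0, 20.0, 30.0])}
new_time = np.array([0.5, 1.5])

def _compute_interval_midpoints(t):
    m = np.empty_like(t, dtype=np.float64)
    m[:-1] = (t[:-1] + t[1:]) / 2.0
    m[-1] = t[-1] + (t[-1] - t[-2]) / 2.0
    return m

# Midpoints depend only on the (sorted) time axis, so compute them once...
midpoints = _compute_interval_midpoints(time_values)

# ...and reuse them for every numeric column; only col_values is per-column.
result = {"time": new_time}
for col, col_values in columns.items():
    result[col] = np.interp(new_time, midpoints, col_values)
```

For wide tables (many turbines/signals) this avoids recomputing the same midpoint array once per column.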
```
Numeric columns are assumed to represent period-averaged values whose
timestamps mark the start of each period. To recover the best estimate
of the instantaneous value at a query time, each value is assigned to the
midpoint of its interval before interpolating.

Datetime columns (e.g. ``time_utc``) are instantaneous coordinates — they
map simulation time to wall-clock time directly — so they are interpolated
without the midpoint shift.
```
The midpoint-correction behavior is implemented in _interpolate_with_polars, but interpolate_df is the public entrypoint and its docstring still describes generic interpolation. Consider updating interpolate_df’s docstring (and/or module docs) to explicitly mention the midpoint shift for numeric columns so callers don’t assume start-of-period semantics are preserved.
Suggested change:
```diff
-Numeric columns are assumed to represent period-averaged values whose
-timestamps mark the start of each period. To recover the best estimate
-of the instantaneous value at a query time, each value is assigned to the
-midpoint of its interval before interpolating.
-Datetime columns (e.g. ``time_utc``) are instantaneous coordinates — they
-map simulation time to wall-clock time directly — so they are interpolated
-without the midpoint shift.
+This helper implements the interpolation semantics used by the public
+DataFrame interpolation path: numeric columns are midpoint-corrected
+before interpolation, so the returned values do not preserve pure
+start-of-period semantics from the original timestamps.
+Numeric columns are assumed to represent period-averaged values whose
+timestamps mark the start of each period. To recover the best estimate
+of the instantaneous value at a query time, each value is assigned to the
+midpoint of its interval before interpolating.
+Datetime columns (e.g. ``time_utc``) are instantaneous coordinates—they
+map simulation time to wall-clock time directly—so they are interpolated
+against the original timestamps without the midpoint shift.
```
Adding to public docstring
```python
).all()

# Assume df_input represents time stamps indicating start of period.
# Convert to instantanous values with midpoint correction as would be done
```
Typo in comment: "instantanous" → "instantaneous".
Suggested change:
```diff
-# Convert to instantanous values with midpoint correction as would be done
+# Convert to instantaneous values with midpoint correction as would be done
```
hi @genevievestarke and @misi9170 , thanks for the feedback today. Please see the updated PR with a few changes based on our discussion:
genevievestarke
left a comment
I think this is looking really nice @paulf81!! I added some suggestions for the docs, but I think the question about how we want to handle external signals is the most important!
```diff
 # Interpolate using the utility function
-df_interpolated = interpolate_df(df_ext, new_times)
+df_interpolated = interpolate_df(
```
I'm not sure we should hard code this. It makes sense that the LMP prices should be this type of interpolation, but this is also how we input power reference signals. Should those also use the zoh_to_instantaneous method?
I agree this is tricky. I think to date in Hercules we haven't explicitly tracked LMP prices, these are just external signals that end up having signal names that Hycon expects.
One idea is we can add to the Hercules input a dictionary that says how different external channels should be upsampled, with a default to this for backward compatibility. Or not: force explicit and have one more breaking change? @misi9170 any thoughts here?
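Such a per-channel dictionary could look something like this (purely hypothetical: the key `external_data_interpolation`, the channel names, and the helper function are illustrative, not an existing Hercules input option):

```python
# Hypothetical input fragment mapping external channels to an
# interpolation method, with a fallback default for unlisted channels.
external_data_interpolation = {
    "default": "instantaneous_to_instantaneous",
    "lmp_price": "zoh_preprocessed",  # step-wise prices held between updates
    "power_reference": "instantaneous_to_instantaneous",
}

def method_for_channel(channel: str) -> str:
    """Resolve the interpolation method for an external channel."""
    return external_data_interpolation.get(
        channel, external_data_interpolation["default"]
    )
```

A scheme like this would keep LMP-style signals step-wise while letting reference signals interpolate smoothly, without hard-coding either behavior.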
Americanize spelling

Co-authored-by: genevievestarke <103534902+genevievestarke@users.noreply.github.com>
Co-authored-by: genevievestarke <103534902+genevievestarke@users.noreply.github.com>
Co-authored-by: genevievestarke <103534902+genevievestarke@users.noreply.github.com>
Thank you @genevievestarke for your comments! I committed your suggested changes, and the question on upsampling the LMP is a good one, so maybe a little more discussion to go there.
ok @genevievestarke and @misi9170 I removed the ZOH option and added documentation on how ZOH should be done in this framework. Hopefully much clearer now! I think back to ready for review.
Pull request overview
Copilot reviewed 22 out of 22 changed files in this pull request and generated 6 comments.
```diff
 df_pl = pl.from_pandas(df)
-result_pl = pl.DataFrame({"time": new_time})
+# Create a Polars DataFrame for the new time points
+new_time_pl = pl.DataFrame({"time": new_time})
+
+# Start with the time column
+result_pl = new_time_pl

-# Process numeric columns using Polars' interpolation
-if numeric_cols:
-    for col in numeric_cols:
-        # Use Polars' join_asof for efficient interpolation-like behavior
-        # This is more memory efficient than pandas for large datasets
-        col_data = df_pl.select(["time", col]).sort("time")
-
-        # Perform interpolation using Polars operations
-        # Note: Polars doesn't have direct linear interpolation, so we use numpy interp
-        # but with Polars' efficient data extraction
-        time_values = col_data["time"].to_numpy()
-        col_values = col_data[col].to_numpy()
-
-        # Linear interpolation with float32 precision
-        interpolated_values = np.interp(new_time, time_values, col_values).astype(
-            hercules_float_type
-        )
-
-        # Add interpolated column to result
-        result_pl = result_pl.with_columns(pl.lit(interpolated_values).alias(col))
+time_values = df_pl["time"].to_numpy()
+
+if interpolation_method == "averaged_to_instantaneous":
+    x_coords = _compute_interval_midpoints(time_values)
+else:
+    x_coords = time_values
+
+for col in numeric_cols:
+    col_values = df_pl[col].to_numpy()
+    interpolated_values = np.interp(new_time, x_coords, col_values).astype(hercules_float_type)
+    result_pl = result_pl.with_columns(pl.lit(interpolated_values).alias(col))
```
np.interp requires the x-coordinates to be increasing, but time_values (and the corresponding col_values) are no longer sorted for numeric columns. Previously, the code sorted per-column (.sort("time")); now numeric interpolation can silently produce incorrect results or error if df["time"] isn’t strictly increasing. Fix by sorting once by "time" (and applying the same ordering to all numeric columns) before computing x_coords and calling np.interp.
good catch, fixing
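The fix could look roughly like this (a sketch of the sort-once approach the review describes, using NumPy arrays to stand in for the Polars columns; the data is invented):

```python
import numpy as np

time_raw = np.array([2.0, 0.0, 1.0])            # unsorted input times
cols_raw = {"a": np.array([30.0, 10.0, 20.0])}  # values aligned with time_raw
new_time = np.array([0.5, 1.5])

# Sort ONCE by time and apply the same ordering to every numeric column,
# so np.interp's requirement of increasing x-coordinates is satisfied.
order = np.argsort(time_raw)
time_values = time_raw[order]

# Midpoint-correct the sorted time axis.
midpoints = np.empty_like(time_values)
midpoints[:-1] = (time_values[:-1] + time_values[1:]) / 2.0
midpoints[-1] = time_values[-1] + (time_values[-1] - time_values[-2]) / 2.0

# Reuse the same ordering and x-coordinates for each column.
result = {}
for col, raw in cols_raw.items():
    result[col] = np.interp(new_time, midpoints, raw[order])
```

Without the shared `order`, `np.interp` would silently return wrong values for any input whose `time` column is not already increasing.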
```python
}


def interpolate_df(df, new_time, interpolation_method):
```
interpolate_df previously accepted two parameters and now requires interpolation_method with no default, which is a breaking API change for any downstream/internal callers not updated in this PR. If backward compatibility is needed, consider providing a default (and optionally emitting a deprecation warning when omitted) so older call sites keep working while migrating to explicit behavior.
Not truly backward compatible, but this function was not used outside of Hercules
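For reference, the backward-compatible pattern Copilot suggests would look something like this (a sketch only; the PR intentionally keeps the argument required, and the body here is a placeholder):

```python
import warnings

def interpolate_df(df, new_time, interpolation_method=None):
    """Sketch: default the new argument and warn when it is omitted."""
    if interpolation_method is None:
        warnings.warn(
            "interpolate_df() will require 'interpolation_method' in a future "
            "release; defaulting to 'instantaneous_to_instantaneous'.",
            DeprecationWarning,
            stacklevel=2,
        )
        interpolation_method = "instantaneous_to_instantaneous"
    # ... interpolation body unchanged ...
    return interpolation_method  # placeholder return for illustration
```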
```diff
 The CSV file must contain:
-- A `time_utc` column with UTC timestamps in ISO 8601 format
+- A `time_utc` column with UTC timestamps in ISO 8601 format. Each timestamp marks the **start of a reporting period**; values on that row are treated as period averages. See [Time Interpretation](timing.md#time-interpretation-inputs-vs-internal-values) for how Hercules converts these to instantaneous values.
```
This description appears to conflict with HerculesModel._read_external_data_file, which explicitly interpolates external data with "instantaneous_to_instantaneous" (and docs/timing.md also documents external data as instantaneous unless preprocessed for ZOH). Update this line to reflect the external-data convention (instantaneous-to-instantaneous interpolation), and reserve the “start-of-period period-average” wording for wind/solar/SCADA/playback inputs.
Suggested change:
```diff
-- A `time_utc` column with UTC timestamps in ISO 8601 format. Each timestamp marks the **start of a reporting period**; values on that row are treated as period averages. See [Time Interpretation](timing.md#time-interpretation-inputs-vs-internal-values) for how Hercules converts these to instantaneous values.
+- A `time_utc` column with UTC timestamps in ISO 8601 format. Each timestamp represents an **instantaneous sample time** for the values on that row. Hercules interpolates external data using instantaneous-to-instantaneous interpolation. See [Time Interpretation](timing.md#time-interpretation-inputs-vs-internal-values) for the distinction between these external-data inputs and start-of-period period-average inputs such as wind/solar/SCADA/playback data.
```
Good catch, updated
```python
df_input["time"] = np.arange(0, df_input.shape[0], 1)
df_input["time_utc"] = pd.to_datetime(df_input["time_utc"])
df_input_interpolated = interpolate_df(
    df_input,
    np.arange(0, df_input.shape[0], 1),
```
This test mutates df_input in-place (adds "time" and converts "time_utc" dtype). That can create hard-to-debug coupling if the same df_input object is reused later in the test/module/fixture. Prefer working on a copy (e.g., df_input = df_input.copy() or creating a derived frame) before adding columns/type conversions.
Suggested change:
```diff
-df_input["time"] = np.arange(0, df_input.shape[0], 1)
-df_input["time_utc"] = pd.to_datetime(df_input["time_utc"])
-df_input_interpolated = interpolate_df(
-    df_input,
-    np.arange(0, df_input.shape[0], 1),
+df_input_interpolation = df_input.copy()
+df_input_interpolation["time"] = np.arange(0, df_input_interpolation.shape[0], 1)
+df_input_interpolation["time_utc"] = pd.to_datetime(
+    df_input_interpolation["time_utc"]
+)
+df_input_interpolated = interpolate_df(
+    df_input_interpolation,
+    np.arange(0, df_input_interpolation.shape[0], 1),
```
```python
value_points = time_points * 1.7
df = pd.DataFrame({"time": time_points, "value": value_points})

# Query at interval midpoints (0.5, 5.5, 9.5) and end points (0, 10)
```
The comment doesn’t match new_time (it mentions 0.5 and 9.5, but the array contains 5.0 and 10.5). Please update the comment to reflect the actual query points (or adjust new_time to match the comment) to keep the test intent clear.
Suggested change:
```diff
-# Query at interval midpoints (0.5, 5.5, 9.5) and end points (0, 10)
+# Query at endpoints and representative midpoint/boundary points (0.0, 5.0, 5.5, 10.0, 10.5)
```
```python
# ...
# Time 10 is in between last and second last midpoint, so value should be 9
expected_values = [0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
assert np.allclose(result["value"], expected_values), "Interpolated values should match y = x"
```
The assertion message no longer reflects the updated expectation for "averaged_to_instantaneous" (the expected output is not simply y = x anymore due to midpoint shift and clamping). Update the message to describe midpoint-corrected interpolation (or remove the message) so failures are easier to interpret.
Suggested change:
```diff
-assert np.allclose(result["value"], expected_values), "Interpolated values should match y = x"
+assert np.allclose(
+    result["value"], expected_values
+), "Interpolated values should match midpoint-corrected averaged_to_instantaneous output with edge clamping"
```
genevievestarke
left a comment
Looks good, @paulf81!
Just one suggestion for the docs and then we can merge!
```
#### Achieving zero-order-hold (ZOH) behaviour

`interpolate_df` does not provide a dedicated zero-order-hold mode. If you
need step/piecewise-constant semantics -- for example, LMP prices that
```
Suggested change:
```diff
-need step/piecewise-constant semantics -- for example, LMP prices that
+need step/piecewise-constant values -- for example, LMP prices that
```
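The preprocessing approach the ZOH docs describe can be sketched as follows (illustrative only, with invented prices): duplicating each breakpoint with the previous value just before it makes plain linear interpolation reproduce a step signal.

```python
import numpy as np

# Original step signal: each price holds until the next breakpoint.
times = np.array([0.0, 10.0, 20.0])
prices = np.array([30.0, 45.0, 25.0])

# Insert a near-duplicate time just before each breakpoint carrying the
# previous value, so linear interpolation between points yields a step.
eps = 1e-9
zoh_times = np.concatenate([times, times[1:] - eps])
zoh_prices = np.concatenate([prices, prices[:-1]])
order = np.argsort(zoh_times)
zoh_times, zoh_prices = zoh_times[order], zoh_prices[order]

# The price now holds at 30 until just before t=10, then steps to 45.
stepped = np.interp(np.array([5.0, 10.0, 15.0]), zoh_times, zoh_prices)
```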
Thank you @genevievestarke! I think it's good to go on my end now.
Issue #248 was a first flagging of a potential issue arising from the fact that solar, and really all input data files, are assumed to have the `time_utc` column marking the start of the period. As noted in #248, I was first thinking this was an issue to be corrected when calling the solar module specifically, since it directly uses `time_utc` for solar azimuth. However, I now think correcting this in the solar module is wrong, since `time_utc` is the start of the period only in the input files; it is meant to be instantaneous within Hercules.

Therefore I think the more general correction is in this pull request. Specifically, when we interpolate the input files, we need to account for the fact that we are interpolating from data files where `time_utc` marks the start of the time period onto a `time_utc` that is instantaneous time.

This pull request implements a correction where, during interpolation, the input values are moved to the midpoints of their time periods before interpolating. This change is implemented in `_interpolate_with_polars` in `utilities.py`. I believe this is more correct.

The PR also includes changes to the tests to match the new behavior, and updates to the documentation to make more explicit that `time_utc` is start-of-period in inputs, and instantaneous within Hercules and in Hercules output files.