Speed up recf.py and hourly_repperiods.py preprocessing #92

Closed
wesleyjcole wants to merge 8 commits into main from wjc/speedups
Conversation

Contributor

@wesleyjcole wesleyjcole commented May 14, 2026

Summary

Performance improvements for recf.py and hourly_repperiods.py, with changes to other scripts that are called by those two input processing files. The bulk of the changes are to hourly_repperiods.py and related files, because recf.py is slow mostly because it reads large data files. This PR does not result in model changes.

Technical details

Implementation notes

  • recf.py: Replace deprecated groupby(axis=1) with transposed groupby.
  • hourly_repperiods.py: Load recf.h5, csp.h5, and load.h5 once before the loop over ndays values and pass them into each hourly_writetimeseries.main call, avoiding repeated HDF5 reads. Replace deprecated groupby(axis=1) with transposed groupby. Replace scipy.spatial.distance.euclidean with np.linalg.norm.
  • hourly_writetimeseries.py: Vectorize ccseason lookup (replaces per-element lambda with MultiIndex.loc); replace .map('{:>03}'.format) with .str.zfill(3); replace per-element any([x.startswith(i) for i in rep_periods]) lambda with str.startswith(tuple(rep_periods)). Add recf_input/cspcf_input/load_input pass-through parameters (default None preserves existing call sites).
  • hourly_plots.py: Cache get_sitemap() results outside the tech loop in plot_maps, saving one redundant HDF5 read + GeoDataFrame CRS projection.
  • reeds/io.py: Replace .map(lambda x: x.decode()) with .str.decode('utf-8') in get_outage_hourly and get_temperatures. Vectorize datetime formatting in write_profile_to_h5 with strftime instead of per-element apply.
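A minimal sketch of two of the string vectorizations listed above, using made-up data (the Series names and values are hypothetical, not taken from ReEDS). It assumes pandas 1.5 or later, where `Series.str.startswith` accepts a tuple of prefixes, and only checks that the vectorized forms return the same values as the per-element versions.

```python
import pandas as pd

# Hypothetical sample data; names and values are illustrative only
rep_periods = ['y2012d001', 'y2012d045']
hours = pd.Series(['y2012d001h001', 'y2012d002h001', 'y2012d045h013'])

# Per-element version: one Python-level lambda call per row
slow_mask = hours.map(lambda x: any([x.startswith(i) for i in rep_periods]))
# Vectorized version: a single call with a tuple of prefixes
fast_mask = hours.str.startswith(tuple(rep_periods))

# Byte strings such as those read back from an HDF5 file
raw = pd.Series([b'p1', b'p2', b'p3'])
slow_decoded = raw.map(lambda x: x.decode())
fast_decoded = raw.str.decode('utf-8')
```

Both pairs should agree element by element, so the swap is a readability and deprecation-safety change as much as a speed change on small inputs.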

Validation, testing, and comparison report(s)

The Pacific, USA_defaults, and WECC_county cases all showed no changes to inputs.gdx with the changes made here. My USA_defaults test case showed no changes for any output except runtime: results-Main,Speedup.pptx.

Total input processing time went down by 18% (1.4 minutes) for Pacific, 19% (5.2 minutes) for USA_defaults, and 10% (2.3 minutes) for WECC_county. These gains are more modest than I was hoping for, but given that I already made the changes, I think they're still worth pushing into the model. My interest in faster input processing is for debugging, so that I don't need to wait as long for a new run to reach the part of the model that I want to test, and this helps at least a little with that.

Checklist for author

Details to double-check

  • Charge code provided to reviewers
  • Included comparison reports for appropriate test cases
  • Documentation updated if necessary
  • Dollar year recorded and converted to 2004$ for GAMS
  • Timeseries are in Central Time
  • Units are specified
  • Preprocessing steps have been documented and committed to ReEDS_Input_Processing
  • New large data files handled with .h5 instead of .csv

General information to guide review

  • Zero impact on results of default case
  • No large data file(s) added/modified
  • No substantive impact on runtime for full-US reference case
  • No change to process flow (runreeds.py, solve.py)
  • No change to code organization
  • No change to package requirements (environment.yml or Project.toml)

Did you use LLM tools (chatbot or copilot) in the preparation of this PR? If so, describe how

Yes — Claude (Sonnet 4.6 via Claude Code) was used to identify hotspots via cProfile, implement and verify the optimizations, and confirm output equivalence.

### Input processing
profiles_day = (
-    profiles_fitperiods.groupby(['property','region'], axis=1).mean())
+    profiles_fitperiods.T.groupby(level=['property','region']).mean().T)
Contributor

The old approach is deprecated in pandas 3 (which we haven't switched to yet) but I think these are unrelated to speed
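For reference, a small sketch of the transposed groupby on a toy frame with MultiIndex columns. The level names `property` and `region` come from the diff above; the third level `site` and all values are made up for illustration. The old `groupby(axis=1)` call is deliberately not run here, since it is not available in every pandas version.

```python
import numpy as np
import pandas as pd

# Toy frame: rows are time steps, columns are (property, region, site)
columns = pd.MultiIndex.from_tuples(
    [('load', 'p1', 'a'), ('load', 'p1', 'b'),
     ('wind', 'p2', 'a'), ('wind', 'p2', 'b')],
    names=['property', 'region', 'site'])
df = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4), columns=columns)

# Group columns by transposing, grouping on the (now row) index levels,
# and transposing back
grouped = df.T.groupby(level=['property', 'region']).mean().T
```

The result has one column per (property, region) pair, each holding the row-wise mean over the sites in that pair.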

Comment thread reeds/input_processing/hourly_plots.py Outdated
Comment on lines +277 to +281
_sitemaps = {
    False: reeds.io.get_sitemap(offshore=False),
    True: reeds.io.get_sitemap(offshore=True),
}

Contributor

These take ~0.4 s to run for the full US; for me that's small enough not to matter

Comment on lines +341 to +349
nearest_period = {}
centroid_values = centroids.values
profile_values = profiles_fitperiods.values
for i in range(int(sw['GSw_HourlyNumClusters'])):
    cluster_mask = idx == i
    # Calculate distance from each point in the cluster to the centroid
    dists = np.linalg.norm(profile_values[cluster_mask] - centroid_values[i], axis=1)
    # Find the index of the point closest to the centroid
    nearest_period[i] = profiles_fitperiods.index[cluster_mask][np.argmin(dists)]
Contributor

Did you try running with hierarchical clustering to make sure the results are unchanged?

In general hourly_repperiods.py is not the bottleneck, so I would avoid changing it. Bigger changes are needed to speed it up for multiple weather years, but those are implemented on pb/rep15; I just haven't gotten around to wrapping it up yet.
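One way to sanity-check the distance change in isolation is a toy comparison like the sketch below. All arrays are made up, and it only verifies that the batched `np.linalg.norm(..., axis=1)` picks the same nearest member per cluster as an explicit per-point Euclidean distance; it says nothing about how the cluster labels were produced, so it would not replace an actual run with hierarchical clustering.

```python
import numpy as np

# Hypothetical data: 4 points in 2 clusters, with given centroids
profiles = np.array([[0., 0.], [1., 1.], [10., 10.], [11., 12.]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.9, 0.9], [10.2, 10.1]])

# Batched version: one norm call per cluster
nearest = {}
for i in range(len(centroids)):
    mask = labels == i
    dists = np.linalg.norm(profiles[mask] - centroids[i], axis=1)
    nearest[i] = int(np.flatnonzero(mask)[np.argmin(dists)])

# Reference version: one explicit distance per point
reference = {}
for i in range(len(centroids)):
    members = np.flatnonzero(labels == i)
    d = [float(np.sqrt(((profiles[m] - centroids[i]) ** 2).sum()))
         for m in members]
    reference[i] = int(members[int(np.argmin(d))])
```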

sw=sw, reeds_path=reeds_path, inputs_case=inputs_case,
periodtype='rep',
make_plots=1,
recf_input=recf_input, cspcf_input=cspcf_input, load_input=load_input,
Contributor

This is a lot of inputs for the main() function of an input-processing script. For me, reeds.io.read_file('recf.h5') takes 2.5 s with parse_timestamps=False and 3.7 s with parse_timestamps=True. I guess I feel like that's not worth complicating the inputs to the main() function (and for timestamp parsing it'd be faster just to recreate the index with pd.date_range()).
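A rough sketch of the `pd.date_range()` idea, assuming the stored timestamps are tz-naive, hourly, and contiguous from a known start (the start date, length, and string format here are hypothetical; the real recf.h5 index may differ, for example in time zone handling).

```python
import pandas as pd

# Hypothetical: 48 consecutive hourly timestamps from a known start
n_hours = 48
start = '2012-01-01 00:00:00'

# Rebuild the index directly instead of parsing each stored string
rebuilt = pd.date_range(start, periods=n_hours, freq=pd.Timedelta(hours=1))

# What parsing would do: strings on disk converted back to timestamps
stored = rebuilt.strftime('%Y-%m-%d %H:%M:%S')
parsed = pd.to_datetime(stored)
```

If the two indexes match for the real file, the rebuilt one could stand in for the parsed one without changing the signature of main().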

load_input = reeds.io.read_file(
    os.path.join(inputs_case, 'load.h5'), parse_timestamps=True).unstack(level=0)
load_input.columns = load_input.columns.rename(['r', 't'])
load_input *= (1 - reeds.io.get_scalars(inputs_case)['distloss'])
Contributor

Should avoid doing this in additional places (here and in hourly_writetimeseries.py)

Comment on lines +77 to +78
+ 'd' + hmap_allyrs.yearperiod.astype(str).str.zfill(3)
+ 'h' + hmap_allyrs.periodhour.astype(str).str.zfill(3)
Contributor

This is ok but no faster; they both run in ~0.03 s
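A quick equivalence check for the two padding approaches, on a made-up Series of non-negative integers (the name `periodhour` mirrors the column in the diff, but the values are illustrative).

```python
import pandas as pd

# Hypothetical non-negative period-hour values
periodhour = pd.Series([1, 24, 168])

# Old approach: format each element with zero fill to width 3
via_format = periodhour.map('{:>03}'.format)
# New approach: vectorized string zero-padding
via_zfill = periodhour.astype(str).str.zfill(3)
```

The two can differ for negative numbers, because `zfill` pads after the sign while `'{:>03}'` pads before it, so the equivalence is assumed to matter only for non-negative indices like these.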

@wesleyjcole
Contributor Author

Closing and moving to an issue (#98) for a more systematic profiling and speed improvement.
