Speed up `recf.py` and `hourly_repperiods.py` preprocessing by wesleyjcole · Pull Request #92 · ReEDS-Model/ReEDS

wesleyjcole · 2026-05-14T14:51:35Z

Summary

Performance improvements for recf.py and hourly_repperiods.py, with changes to other scripts that are called by those two input processing files. The bulk of the changes are to hourly_repperiods.py and related files, because recf.py is slow mostly because it reads large data files. This PR does not result in model changes.

Technical details

Implementation notes

recf.py: Replace deprecated groupby(axis=1) with transposed groupby.
hourly_repperiods.py: Load recf.h5, csp.h5, and load.h5 once before the loop over ndays values and pass them into each hourly_writetimeseries.main call, avoiding repeated HDF5 reads. Replace deprecated groupby(axis=1) with transposed groupby. Replace scipy.spatial.distance.euclidean with np.linalg.norm.
hourly_writetimeseries.py: Vectorize ccseason lookup (replaces per-element lambda with MultiIndex.loc); replace .map('{:>03}'.format) with .str.zfill(3); replace per-element any([x.startswith(i) for i in rep_periods]) lambda with str.startswith(tuple(rep_periods)). Add recf_input/cspcf_input/load_input pass-through parameters (default None preserves existing call sites).
hourly_plots.py: Cache get_sitemap() results outside the tech loop in plot_maps, saving one redundant HDF5 read + GeoDataFrame CRS projection.
reeds/io.py: Replace .map(lambda x: x.decode()) with .str.decode('utf-8') in get_outage_hourly and get_temperatures. Vectorize datetime formatting in write_profile_to_h5 with strftime instead of per-element apply.
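
The string-handling replacements listed above can be checked for equivalence on toy data. The following is a minimal sketch, not the PR's code: the names `labels`, `rep_periods`, and `raw` and their values are made up for illustration and are not taken from ReEDS.

```python
import pandas as pd

# Toy stand-ins; these names and values are illustrative, not taken from ReEDS
labels = pd.Series(['y2012d001h001', 'y2012d045h013', 'y2013d200h007'])
rep_periods = ['y2012d001', 'y2013d200']

# Per-element lambda (old style) vs. vectorized str.startswith with a tuple
old_mask = labels.map(lambda x: any([x.startswith(i) for i in rep_periods]))
new_mask = labels.str.startswith(tuple(rep_periods))

# Per-element bytes decoding (old style) vs. vectorized .str.decode
raw = pd.Series([b'p1', b'p2', b'p10'])
old_decoded = raw.map(lambda x: x.decode())
new_decoded = raw.str.decode('utf-8')

print(old_mask.tolist() == new_mask.tolist())
print(old_decoded.tolist() == new_decoded.tolist())
```

Both pairs give the same values here; whether the vectorized form is measurably faster depends on the size of the real data.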

Validation, testing, and comparison report(s)

The Pacific, USA_defaults, and WECC_county cases all had no changes to inputs.gdx with the changes made here. My USA_defaults test case showed no changes for any output except runtime: results-Main,Speedup.pptx.

Total input processing time went down by 18% (1.4 minutes) for Pacific, 19% (5.2 minutes) for USA_defaults, and 10% (2.3 minutes) for WECC_county. These are more modest than I was hoping, but given that I already made them, I think it's still worth pushing into the model. My interest in faster input processing is for debugging, so that I don't need to wait as long for a new run to get to the part of the model that I want to test, and this helps at least a little with that.

Checklist for author

Details to double-check

Charge code provided to reviewers
Included comparison reports for appropriate test cases
~~Documentation updated if necessary~~
~~Dollar year recorded and converted to 2004$ for GAMS~~
~~Timeseries are in Central Time~~
~~Units are specified~~
~~Preprocessing steps have been documented and committed to ReEDS_Input_Processing~~
~~New large data files handled with .h5 instead of .csv~~

General information to guide review

Zero impact on results of default case
No large data file(s) added/modified
No substantive impact on runtime for full-US reference case
No change to process flow (runreeds.py, solve.py)
No change to code organization
No change to package requirements (environment.yml or Project.toml)

Did you use LLM tools (chatbot or copilot) in the preparation of this PR? If so, describe how

Yes — Claude (Sonnet 4.6 via Claude Code) was used to identify hotspots via cProfile, implement and verify the optimizations, and confirm output equivalence.
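
The PR does not say how cProfile was invoked. As a rough sketch of one common way to find hotspots with the standard library, assuming a script with a `main()` entry point (`main` and `slow_step` below are hypothetical stand-ins, not ReEDS functions):

```python
import cProfile
import io
import pstats


def slow_step(n=200_000):
    """Hypothetical stand-in for an expensive preprocessing step."""
    return sum(i * i for i in range(n))


def main():
    """Hypothetical stand-in for a script's main() function."""
    return slow_step()


# Collect profiling data only around the call of interest
profiler = cProfile.Profile()
profiler.enable()
result = main()
profiler.disable()

# Sort by cumulative time and print the ten most expensive calls
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats('cumulative').print_stats(10)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time tends to surface the high-level call that dominates the run, which is usually the more useful starting point when the cost is in file reads rather than in tight loops.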

patrickbrown4 · 2026-05-19T13:29:21Z

    ### Input processing
    profiles_day = (
-        profiles_fitperiods.groupby(['property','region'], axis=1).mean())
+        profiles_fitperiods.T.groupby(level=['property','region']).mean().T)


The old approach is deprecated in pandas 3 (which we haven't switched to yet) but I think these are unrelated to speed
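
For reference, the transposed form in the diff above can be exercised on a small frame. This is a sketch on made-up data (the frame `profiles` and its values are illustrative); it avoids calling `groupby(axis=1)` so that it does not depend on the pandas version installed.

```python
import pandas as pd

# Toy frame with (property, region) column levels; values are made up
columns = pd.MultiIndex.from_tuples(
    [('load', 'p1'), ('load', 'p1'), ('load', 'p2'), ('wind', 'p1')],
    names=['property', 'region'],
)
profiles = pd.DataFrame(
    [[1.0, 3.0, 5.0, 7.0], [2.0, 4.0, 6.0, 8.0]], columns=columns)

# Group columns by transposing, grouping on the (now row) index levels,
# and transposing back; this avoids groupby(axis=1)
grouped = profiles.T.groupby(level=['property', 'region']).mean().T

print(grouped)
```

The two `('load', 'p1')` columns are averaged into one, and the other columns pass through unchanged. One thing to watch with this pattern is that transposing a frame whose columns have mixed dtypes upcasts the values, which is not an issue for all-float profiles like these.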

patrickbrown4 · 2026-05-19T13:29:41Z

+    _sitemaps = {
+        False: reeds.io.get_sitemap(offshore=False),
+        True: reeds.io.get_sitemap(offshore=True),
+    }
+


These take ~0.4 s to run for the full US; for me that's small enough not to matter

patrickbrown4 · 2026-05-19T13:30:37Z

+        nearest_period = {}
+        centroid_values = centroids.values
+        profile_values = profiles_fitperiods.values
+        for i in range(int(sw['GSw_HourlyNumClusters'])):
+            cluster_mask = idx == i
+            # Calculate distance from each point in the cluster to the centroid
+            dists = np.linalg.norm(profile_values[cluster_mask] - centroid_values[i], axis=1)
+            # Find the index of the point closest to the centroid
+            nearest_period[i] = profiles_fitperiods.index[cluster_mask][np.argmin(dists)]


Did you try running with hierarchical clustering to make sure the results are unchanged?

In general hourly_repperiods.py is not the bottleneck, so I would avoid changing it. Bigger changes are needed to speed it up for multiple weather years, but those are implemented on pb/rep15; I just haven't gotten around to wrapping it up yet.
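
As a small check of the distance calculation in the diff above, the sketch below compares a single `np.linalg.norm(..., axis=1)` call per cluster against a per-point Euclidean distance. The data, cluster labels, and variable names are made up, and this does not address the hierarchical-clustering question, which would need a real ReEDS run.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((6, 4))            # six toy "periods", four features each
labels = np.array([0, 0, 1, 1, 1, 0])  # made-up cluster assignments
centroids = np.vstack([points[labels == k].mean(axis=0) for k in range(2)])

nearest_vectorized = {}
nearest_loop = {}
for k in range(2):
    members = np.flatnonzero(labels == k)
    # Vectorized: one norm call over all rows in the cluster
    dists = np.linalg.norm(points[members] - centroids[k], axis=1)
    nearest_vectorized[k] = int(members[np.argmin(dists)])
    # Per-point: explicit Euclidean distance, one row at a time
    loop_dists = [
        np.sqrt(np.sum((points[m] - centroids[k]) ** 2)) for m in members]
    nearest_loop[k] = int(members[int(np.argmin(loop_dists))])

print(nearest_vectorized, nearest_loop)
```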

patrickbrown4 · 2026-05-19T15:07:58Z

        sw=sw, reeds_path=reeds_path, inputs_case=inputs_case,
        periodtype='rep',
        make_plots=1,
+        recf_input=recf_input, cspcf_input=cspcf_input, load_input=load_input,


This is a lot of inputs for the main() function of an input-processing script. For me, reeds.io.read_file('recf.h5') takes 2.5 s with parse_timestamps=False and 3.7 s with parse_timestamps=True. I guess I feel like that's not worth complicating the inputs to the main() function (and for timestamp parsing it'd be faster just to recreate the index with pd.date_range()).
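
The `pd.date_range()` suggestion can be illustrated as follows. This is a sketch under assumptions: the start time, the string format, and the 8760-hour length are invented for the example, and rebuilding the index is only valid if the stored timestamps are regular and gap-free.

```python
import pandas as pd

# Assumed example: 8760 hourly timestamps stored as strings
start = pd.Timestamp('2012-01-01 00:00')  # illustrative start time
n_hours = 8760
one_hour = pd.Timedelta(hours=1)

stored = [
    (start + i * one_hour).strftime('%Y-%m-%d %H:%M:%S')
    for i in range(n_hours)
]

# Option 1: parse every stored string
parsed = pd.to_datetime(stored, format='%Y-%m-%d %H:%M:%S')

# Option 2: rebuild the index from its start, length, and step
rebuilt = pd.date_range(start=start, periods=n_hours, freq=one_hour)

print((parsed == rebuilt).all())
```

Option 2 never touches the stored strings, which is why it can be cheaper than parsing when the index is known to be a regular hourly sequence.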

patrickbrown4 · 2026-05-19T15:09:53Z

+    load_input = reeds.io.read_file(
+        os.path.join(inputs_case, 'load.h5'), parse_timestamps=True).unstack(level=0)
+    load_input.columns = load_input.columns.rename(['r', 't'])
+    load_input *= (1 - reeds.io.get_scalars(inputs_case)['distloss'])


Should avoid doing this in additional places (here and in hourly_writetimeseries.py)

patrickbrown4 · 2026-05-19T15:10:44Z

+            + 'd' + hmap_allyrs.yearperiod.astype(str).str.zfill(3)
+            + 'h' + hmap_allyrs.periodhour.astype(str).str.zfill(3)


This is ok but no faster; they both run in ~0.03 s
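
A quick way to compare the two variants locally is `timeit`. This is a minimal sketch on made-up period numbers (the series `yearperiod` here is a stand-in, not the real `hmap_allyrs` column); absolute timings depend on the machine and pandas version, so none are quoted.

```python
import timeit

import pandas as pd

# Made-up period numbers; the real hmap_allyrs columns are not reproduced here
yearperiod = pd.Series(range(1, 366))


def with_format():
    # Original style: per-element string formatting
    return yearperiod.map('{:>03}'.format)


def with_zfill():
    # Replacement style: cast to string, then left-pad with zeros
    return yearperiod.astype(str).str.zfill(3)


same = with_format().tolist() == with_zfill().tolist()

# Time each variant; absolute numbers depend on machine and pandas version
t_format = timeit.timeit(with_format, number=20)
t_zfill = timeit.timeit(with_zfill, number=20)
print(f'format: {t_format:.4f} s, zfill: {t_zfill:.4f} s, identical: {same}')
```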

wesleyjcole · 2026-05-19T22:21:01Z

Closing and moving to an issue (#98) for a more systematic profiling and speed improvement.

wesleyjcole added 7 commits May 14, 2026 08:44

Implement some speedups for recf.py

4054b3c

Add hourly_repperiods speed-ups by only loading data once and updatin…

334ac12

…g deprecated groupby

Only read in subset of cf files when not running the full U.S.

fbae80b

Reset sites filter as it had mixed impact on runtime

cda142c

Speed up hourly_writetimeseries.py, hourly_plots.py, and io.py

474ac91

Revert matrix multiplication--memory risk isn't worth the runtime ben…

8c84cc3

…efit

Revert a cosmetic change

9c7d932

github-actions Bot added the model_changes label May 14, 2026

patrickbrown4 reviewed May 19, 2026

Revert pre-loading in hourly_plots.py

9c7ac8d

wesleyjcole mentioned this pull request May 19, 2026

Speed up input processing scripts #98

Open

wesleyjcole closed this May 19, 2026

wesleyjcole wants to merge 8 commits into main from wjc/speedups
