Speed up recf.py and hourly_repperiods.py preprocessing #92
…g deprecated groupby
  ### Input processing
  profiles_day = (
-     profiles_fitperiods.groupby(['property','region'], axis=1).mean())
+     profiles_fitperiods.T.groupby(level=['property','region']).mean().T)
The old approach is deprecated in pandas 3 (which we haven't switched to yet) but I think these are unrelated to speed
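As a minimal sketch of the replacement pattern, the snippet below groups columns by two MultiIndex levels without `axis=1`. The frame `profiles` and its level values are hypothetical stand-ins for `profiles_fitperiods`, not data from the repository.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for profiles_fitperiods: rows are periods,
# columns are a (property, region, hour) MultiIndex.
columns = pd.MultiIndex.from_product(
    [['load', 'wind'], ['p1', 'p2'], [0, 1, 2]],
    names=['property', 'region', 'hour'],
)
rng = np.random.default_rng(0)
profiles = pd.DataFrame(rng.random((4, len(columns))), columns=columns)

# Transpose so the column levels become index levels, group on them,
# then transpose back so periods are rows again.
profiles_day = profiles.T.groupby(level=['property', 'region']).mean().T

# Cross-check one group against a direct column selection.
expected = profiles['load']['p1'].mean(axis=1)
```

The double transpose copies the data, so for wide frames it is worth timing before assuming it is free; as the comment above notes, the motivation here is deprecation rather than speed.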
_sitemaps = {
    False: reeds.io.get_sitemap(offshore=False),
    True: reeds.io.get_sitemap(offshore=True),
}
These take ~0.4 s to run for the full US; for me that's small enough not to matter
nearest_period = {}
centroid_values = centroids.values
profile_values = profiles_fitperiods.values
for i in range(int(sw['GSw_HourlyNumClusters'])):
    cluster_mask = idx == i
    # Calculate distance from each point in the cluster to the centroid
    dists = np.linalg.norm(profile_values[cluster_mask] - centroid_values[i], axis=1)
    # Find the index of the point closest to the centroid
    nearest_period[i] = profiles_fitperiods.index[cluster_mask][np.argmin(dists)]
Did you try running with hierarchical clustering to make sure the results are unchanged?
In general hourly_repperiods.py is not the bottleneck so I would avoid changing it. Bigger changes are needed to speed it up for multiple weather years but those are implemented on pb/rep15, I just haven't gotten around to wrapping it up yet
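For reference, a small self-contained sketch of the vectorized nearest-to-centroid selection shown in the diff. The period labels, feature values, and cluster assignments are made up for illustration; only the `np.linalg.norm(..., axis=1)` and `np.argmin` pattern comes from the PR.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 6 periods described by 2 features, in 2 clusters.
profiles = pd.DataFrame(
    [[0.0, 0.0], [1.0, 1.0], [0.2, 0.1],
     [5.0, 5.0], [4.0, 4.0], [4.2, 4.1]],
    index=['d001', 'd002', 'd003', 'd004', 'd005', 'd006'],
)
idx = np.array([0, 0, 0, 1, 1, 1])        # cluster label per period
centroids = profiles.groupby(idx).mean()  # one centroid row per cluster

profile_values = profiles.values
centroid_values = centroids.values

nearest_period = {}
for i in range(len(centroids)):
    cluster_mask = idx == i
    # Euclidean distance from every member of cluster i to its centroid,
    # computed in one call instead of one call per member.
    dists = np.linalg.norm(
        profile_values[cluster_mask] - centroid_values[i], axis=1)
    # Label of the member closest to the centroid.
    nearest_period[i] = profiles.index[cluster_mask][np.argmin(dists)]
```

If two members are equally close, `np.argmin` returns the first one, so it is worth confirming that the tie-breaking order matches the previous implementation when comparing results.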
    sw=sw, reeds_path=reeds_path, inputs_case=inputs_case,
    periodtype='rep',
    make_plots=1,
    recf_input=recf_input, cspcf_input=cspcf_input, load_input=load_input,
This is a lot of inputs for the main() function of an input-processing script. For me, reeds.io.read_file('recf.h5') takes 2.5 s with parse_timestamps=False and 3.7 s with parse_timestamps=True. I guess I feel like that's not worth complicating the inputs to the main() function (and for timestamp parsing it'd be faster just to recreate the index with pd.date_range()).
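The `pd.date_range()` suggestion could look roughly like the sketch below. The start date, hourly frequency, two-day length, and absence of a time zone are all assumptions made for illustration; the real index in `recf.h5` may differ, and the shortcut only applies when the stored timestamps are regular with no gaps.

```python
import pandas as pd

# Hypothetical stored index: hourly timestamp strings covering two days.
start = '2012-01-01 00:00'
periods = 48
stored = [
    f'2012-01-{1 + h // 24:02d} {h % 24:02d}:00:00' for h in range(periods)
]

# Option 1: parse every stored string.
parsed = pd.to_datetime(stored)

# Option 2: rebuild the index from its known start, length, and frequency,
# which skips string parsing entirely.
rebuilt = pd.date_range(start=start, periods=periods, freq='h')
```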
load_input = reeds.io.read_file(
    os.path.join(inputs_case, 'load.h5'), parse_timestamps=True).unstack(level=0)
load_input.columns = load_input.columns.rename(['r', 't'])
load_input *= (1 - reeds.io.get_scalars(inputs_case)['distloss'])
Should avoid doing this in additional places (here and in hourly_writetimeseries.py)
    + 'd' + hmap_allyrs.yearperiod.astype(str).str.zfill(3)
    + 'h' + hmap_allyrs.periodhour.astype(str).str.zfill(3)
This is ok but no faster; they both run in ~0.03 s
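A quick way to confirm that the two formatting approaches agree is sketched below. The frame `hmap` is a hypothetical stand-in for `hmap_allyrs` with made-up values; the `.map('{:>03}'.format)` form is the one the PR description says was replaced.

```python
import pandas as pd

# Hypothetical stand-in for the hmap_allyrs columns used in the diff.
hmap = pd.DataFrame({'yearperiod': [1, 12, 365], 'periodhour': [1, 7, 24]})

# Per-element formatting (the replaced approach).
old = (
    'd' + hmap.yearperiod.map('{:>03}'.format)
    + 'h' + hmap.periodhour.map('{:>03}'.format)
)

# Vectorized zero-padding via the .str accessor.
new = (
    'd' + hmap.yearperiod.astype(str).str.zfill(3)
    + 'h' + hmap.periodhour.astype(str).str.zfill(3)
)
```

The two forms match for non-negative integers like these. They would differ for negative values, where `zfill` keeps the sign in front of the padding.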
Closing and moving to an issue (#98) for a more systematic profiling and speed improvement.
Summary
Performance improvements for recf.py and hourly_repperiods.py, with changes to other scripts that are called by those two input processing files. The bulk of the changes are to hourly_repperiods.py and related files, because recf.py is slow mostly due to reading the large data files. This PR does not result in model changes.
Technical details
Implementation notes
recf.py: Replace deprecated groupby(axis=1) with transposed groupby.
hourly_repperiods.py: Load recf.h5, csp.h5, and load.h5 once before the loop over ndays values and pass them into each hourly_writetimeseries.main call, avoiding repeated HDF5 reads. Replace deprecated groupby(axis=1) with transposed groupby. Replace scipy.spatial.distance.euclidean with np.linalg.norm.
hourly_writetimeseries.py: Vectorize ccseason lookup (replaces per-element lambda with MultiIndex.loc); replace .map('{:>03}'.format) with .str.zfill(3); replace per-element any([x.startswith(i) for i in rep_periods]) lambda with str.startswith(tuple(rep_periods)). Add recf_input/cspcf_input/load_input pass-through parameters (default None preserves existing call sites).
hourly_plots.py: Cache get_sitemap() results outside the tech loop in plot_maps, saving one redundant HDF5 read + GeoDataFrame CRS projection.
reeds/io.py: Replace .map(lambda x: x.decode()) with .str.decode('utf-8') in get_outage_hourly and get_temperatures. Vectorize datetime formatting in write_profile_to_h5 with strftime instead of per-element apply.
Validation, testing, and comparison report(s)
The Pacific, USA_defaults, and WECC_county cases all had no changes to inputs.gdx with the changes made here. My USA_defaults test case showed no changes for any output except runtime: results-Main,Speedup.pptx.
Total input processing runtime went down by 18% (1.4 minutes) for Pacific, 19% (5.2 minutes) for USA_defaults, and 10% (2.3 minutes) for WECC_county. These gains are more modest than I was hoping, but given that the changes are already made, I think they are still worth pushing into the model. My interest in faster input processing is for debugging, so that I don't need to wait as long for a new run to get to the part of the model that I want to test, and this helps at least a little with that.
Checklist for author
Details to double-check
Documentation updated if necessary
Dollar year recorded and converted to 2004$ for GAMS
Timeseries are in Central Time
Units are specified
Preprocessing steps have been documented and committed to ReEDS_Input_Processing
New large data files handled with .h5 instead of .csv
General information to guide review
Did you use LLM tools (chatbot or copilot) in the preparation of this PR? If so, describe how
Yes. Claude (Sonnet 4.6 via Claude Code) was used to identify hotspots via cProfile, implement and verify the optimizations, and confirm output equivalence.