Disaggregate and re-aggregate inputs with regional scope of legacy (134) zones #102

Open
kodiobika wants to merge 23 commits into main from ko/agg_disagg_refactor
@kodiobika
Contributor

@kodiobika kodiobika commented May 20, 2026

Summary

The two major goals of this PR are to:

  • Allow the input files that we only have legacy zonal data for (i.e., data corresponding to the current default 134 zones) to be incorporated in ReEDS runs using new/custom zones by first disaggregating those inputs to the county level and then re-aggregating them to the zone level (using the zones for any given run).
  • Mostly deprecate reeds/input_processing/aggregate_regions.py

Technical details

  • Adds functions to reeds/spatial.py for disaggregating from the legacy 134 zones to counties and aggregating from counties to the zones corresponding to a given run.
  • For all files except the transmission files (which will be addressed in a follow-up PR), aggregation and disaggregation now take place in reeds/input_processing/copy_files.py.
  • Disaggregation and aggregation both take place in the same run now (rather than disaggregation only happening in sub-BA runs and aggregation only happening in super-BA runs)
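The two-step flow described above (legacy zones to counties, then counties to the run's zones) can be illustrated with a minimal sketch. This is not the code added to reeds/spatial.py; the function names, zone and county labels, and allocation fractions are all hypothetical, and only the sum aggregation case is shown.

```python
from collections import defaultdict

# Hypothetical allocation fractions: share of each legacy zone's value
# assigned to each of its counties (fractions sum to 1 within a zone).
legacy_zone_to_county_fraction = {
    ("p1", "c001"): 0.25,
    ("p1", "c002"): 0.75,
    ("p2", "c003"): 1.0,
}

# Hypothetical county-to-zone map for the zones used in a given run.
county_to_run_zone = {"c001": "zA", "c002": "zB", "c003": "zB"}


def disaggregate(zone_values, fractions):
    """Split each legacy-zone value across its counties by allocation fraction."""
    county_values = defaultdict(float)
    for (zone, county), fraction in fractions.items():
        county_values[county] += zone_values.get(zone, 0.0) * fraction
    return dict(county_values)


def aggregate_sum(county_values, county_to_zone):
    """Sum county values into the zones used by the run."""
    zone_values = defaultdict(float)
    for county, value in county_values.items():
        zone_values[county_to_zone[county]] += value
    return dict(zone_values)


legacy_values = {"p1": 100.0, "p2": 40.0}
county_values = disaggregate(legacy_values, legacy_zone_to_county_fraction)
run_zone_values = aggregate_sum(county_values, county_to_run_zone)
```

Because the fractions sum to 1 within each legacy zone, the total across the run's zones equals the total across the legacy zones, even though the run's zone boundaries cut across the legacy ones.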

Additional changes

  • To clean up reeds/input_processing/runfiles.csv and make it clearer which files are actually being aggregated and disaggregated, I set the aggfunc and disaggfunc for all files that are not read from the repo (i.e., files that are created after reeds/input_processing/aggregate_regions.py) to ignore.
  • In reeds/input_processing/runfiles.csv, I updated the disaggfunc for unapp_water_sea_distr.csv and water_req_psh_10h_1_51.csv from geosize to uniform and the aggfunc for water_req_psh_10h_1_51.csv from mean to sum.
  • As part of testing this change I realized that there are duplicate rows in inputs/hydro/SeaCapAdj_hy.csv so I deleted those. (@jvcarag will address this in a follow-up PR)

Issues resolved

Part of #16

Validation, testing, and comparison report(s)

  • Essentially zero change for the Pacific case:
    results-main_Pacific,test_Pacific.pptx

  • There are small changes for the NYVT_mixed case: results-0519_Main_NYVT_mixed,0519_AggDisagg_NYVT_mixed.pptx

    • Based on the inputs.gdx files (see the gdxdiffs.ipynb notebook below), the differences stem from the cap_hyd_szn_adj, water_req_psh, and watsa_temp parameters. cap_hyd_szn_adj is different because doing the disaggregation and aggregation ends up removing the duplicate rows from inputs/hydro/SeaCapAdj_hy.csv (see "Additional changes"), so for the groups with duplicates, the values in the parameter go from 2 to 1. water_req_psh and watsa_temp are different for only the county-level zones of this case because the disaggfunc for their corresponding files was changed from geosize to uniform.
  • There are small changes for the USA_defaults case:
    results-0518_Main_USA_defaults,0518_AggDisagg_USA_defaults.pptx

    • Based on the inputs.gdx files (see the gdxdiffs.ipynb notebook below), the differences almost all stem from the aggregated zones (z28 and z122). Because we now disaggregate to counties before re-aggregating to these zones, inputs where the aggfunc is mean now represent a weighted average of legacy zonal values (where weights correspond to the number of counties in each legacy zone) rather than a simple average of legacy zonal values, which results in different values for these aggregated zones. The only difference not related to the aggregated zones is in the hyd_add_upg_cap parameter, where p108 and p69 no longer have values. This is because we now disaggregate the corresponding file to counties first according to the hydroexist disaggfunc, and these legacy zones have no existing hydro.
  • There are small changes for the USA_decarb case:
    results-0519_Main_USA_decarb,0519_AggDisagg_USA_decarb.pptx

  • This is a notebook looking at the differences in inputs.gdx for the USA_defaults and NYVT_mixed cases:
    gdxdiffs.ipynb. Aside from the differences explained above, there are only rounding-error differences.
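The weighted-average effect described for the aggregated zones in the USA_defaults case can be seen with a small worked example. The zone names, county counts, and values here are hypothetical, and the sketch assumes the file is disaggregated by copying each legacy zone's value to all of its counties before the mean is taken.

```python
from statistics import mean

# Hypothetical: two legacy zones that are merged into one aggregated zone.
counties_per_legacy_zone = {"pA": 3, "pB": 1}
legacy_value = {"pA": 10.0, "pB": 20.0}

# Previous behavior: simple average of the legacy zonal values.
simple_mean = mean(legacy_value.values())

# New behavior: each county inherits its legacy zone's value, and the mean
# is taken over counties, so zones with more counties carry more weight.
county_values = [
    legacy_value[zone]
    for zone, n_counties in counties_per_legacy_zone.items()
    for _ in range(n_counties)
]
county_weighted_mean = mean(county_values)
```

With these numbers the simple mean is 15.0, while the county-weighted mean is (3 × 10 + 1 × 20) / 4 = 12.5.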

Checklist for author

Details to double-check

  • Charge code provided to reviewers
  • Included comparison reports for appropriate test cases
  • Code formatting standardized
  • Reusable functions used where possible instead of copy/pasted code

General information to guide review

  • Zero impact on results of default case
  • No large data file(s) added/modified
  • No substantive impact on runtime for full-US reference case
  • No substantive impact on folder size for full-US reference case
  • No change to process flow (runreeds.py, reeds/core/solve/solve.py)
  • No change to code organization
  • No change to package requirements (environment.yml or Project.toml)

Did you use LLM tools (chatbot or copilot) in the preparation of this PR? If so, describe how

No

Tag points of contact here if you would like additional review of the relevant parts of the model

@kodiobika kodiobika changed the title Rewrite aggregation and disaggregation functionalities in reeds/spatial.py Move most aggregation and disaggregation to input_processing/copy_files.py May 20, 2026
@kodiobika kodiobika self-assigned this May 20, 2026
@kodiobika kodiobika changed the title Move most aggregation and disaggregation to input_processing/copy_files.py Disaggregate and re-aggregate inputs with legacy zonal data May 20, 2026
@kodiobika kodiobika changed the title Disaggregate and re-aggregate inputs with legacy zonal data Disaggregate and re-aggregate inputs with regional scope of legacy (134) zones May 20, 2026
@patrickbrown4 patrickbrown4 mentioned this pull request May 20, 2026
12 tasks
@stuartcohen8

unapp_water_sea_distr.csv is ok to be uniform, but water_req_psh_10h_1_51.csv should be reverted to geosize. The latter are water volumes, so they should be split up when disaggregating. We think there is a separate bug with SeaCapAdj_hy.csv because older files do not have duplicates and do have values for the hydD technology. This was introduced somewhere in input processing for PR1692 in the internal repo. @jvcarag will make an issue to correct this file so that it has 1 values for hydD and no duplicates. In practice, it will not affect most if not all solutions because hydD capacity is high cost and therefore rarely economic. But if there are prescribed hydD builds, they might have their capacity zeroed out.

@kodiobika
Contributor Author

Thanks @stuartcohen8. Just checking again on water_req_psh_10h_1_51.csv, I see in b_inputs.gms that the units for water_req_psh are Mgal/MW/yr (as opposed to Mgal). Is that not correct? And if it's actually Mgal, should the aggfunc for that file be updated from mean to sum? Everything else sounds good to me.

@stuartcohen8

@kodiobika yes you're correct that aggfunc should be sum for that parameter. The water volumes are normalized by capacity because when they are used in water availability constraints, they are multiplied by the amount of PSH capacity that gets built to get a total water volume. The 'per year' convention is because water availability is generally characterized as an annual quantity of water available. So the units are right, and just to confirm you should sum to aggregate and use geosize to disaggregate. In reality it's more complicated based on basin dynamics, but that's beyond the scope of what we've done here so far.
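To illustrate why the pairing of disaggfunc and aggfunc matters for a quantity that is summed on aggregation, here is a minimal sketch. It assumes that uniform copies the zonal value to every county and that geosize splits it in proportion to county area; that reading of the two names, the function names, and the county areas are all assumptions for illustration, not the reeds implementation.

```python
# Hypothetical county areas (any consistent unit) within one legacy zone.
county_area = {"c001": 300.0, "c002": 100.0}
zone_value = 80.0  # hypothetical legacy-zone value for a volume-like input


def disagg_uniform(value, areas):
    """Copy the zonal value unchanged to every county (assumed meaning of uniform)."""
    return {county: value for county in areas}


def disagg_geosize(value, areas):
    """Split the zonal value across counties in proportion to area (assumed meaning of geosize)."""
    total_area = sum(areas.values())
    return {county: value * area / total_area for county, area in areas.items()}


# Re-aggregate with sum back to the original zone.
uniform_total = sum(disagg_uniform(zone_value, county_area).values())
geosize_split = disagg_geosize(zone_value, county_area)
geosize_total = sum(geosize_split.values())
```

Under these assumptions, uniform followed by sum doubles the zonal value (160.0 for two counties), while geosize followed by sum returns the original 80.0.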

@kodiobika
Contributor Author

Got it, thanks again!

Contributor

@patrickbrown4 patrickbrown4 left a comment

This is great! I really like the new structure. I'll defer to Stuart and Vincent on the hydro/water parts, but the rest looks good to me.

I think your approach to aggregate_regions.py makes sense; rather than picking through and removing the outdated code blocks now, we can move ahead with this PR and #95 in parallel, and then once they're both in, just remove the whole script.

Comment thread reeds/input_processing/aggregate_regions.py
Comment thread reeds/input_processing/runfiles.csv
Comment thread reeds/spatial.py Outdated
Comment thread reeds/spatial.py Outdated
Comment thread reeds/spatial.py
Comment on lines +264 to +268
    # Get legacy zone-to-county allocation factors for disagg_variable
    disagg_data = reeds.io.get_disagg_data(
        os.path.dirname(inputs_case),
        disagg_variable
    )
Contributor

In the next PR (or whichever one adds support for GSw_ZoneSet = z90), we should double-check what happens if GSw_ZoneSet = z90 and GSw_Region = r/NJ.NY_NYC. NY_NYC is a subset of p127, and it's not immediately clear from copy_files.write_disagg_data_files() what would happen if you're disaggregating/reaggregating with a subset of a z134 zone. I think it depends on whether you're using the full county2zone or the run-specific county2zone that only includes the zones in that run. I'd just want to make sure that the entire capacity (or whatever) of p127 doesn't end up in NY_NYC, since the other parts of p127 (NY_E and NY_W) aren't in the run. (Same idea as #23, but here for sub-z134 runs.)

I don't think we support sub-z134 runs now, so we can wait to test it once that capability works; just flagging since it came to mind.
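A toy example of the failure mode to test for: if county shares are renormalized over only the counties present in the run, a zone that covers part of a legacy zone would inherit the whole legacy value. This sketch makes no claim about what copy_files.write_disagg_data_files() currently does; the county labels, shares, and reaggregate helper are hypothetical.

```python
# Hypothetical shares of legacy zone p127's value held by three counties,
# one in each of the finer zones named in the comment above.
county_share_of_p127 = {"c1": 0.5, "c2": 0.3, "c3": 0.2}
county2zone_full = {"c1": "NY_NYC", "c2": "NY_E", "c3": "NY_W"}
zones_in_run = {"NY_NYC"}
p127_value = 1000.0


def reaggregate(value, shares, county2zone, renormalize):
    """Allocate a legacy-zone value to zones via the counties in county2zone."""
    if renormalize:
        total = sum(shares[county] for county in county2zone)
        shares = {county: shares[county] / total for county in county2zone}
    out = {}
    for county, zone in county2zone.items():
        out[zone] = out.get(zone, 0.0) + value * shares[county]
    return out


# Shares taken from the full map, then filtered to the zones in the run.
full = reaggregate(p127_value, county_share_of_p127, county2zone_full, renormalize=False)
from_full_map = {zone: v for zone, v in full.items() if zone in zones_in_run}

# Shares renormalized over a run-specific map holding only in-run counties.
county2zone_run = {c: z for c, z in county2zone_full.items() if z in zones_in_run}
from_run_map = reaggregate(p127_value, county_share_of_p127, county2zone_run, renormalize=True)
```

With the full map, NY_NYC receives only its share (500.0); with the renormalized run-specific map, it receives all of p127 (1000.0), which is the outcome to guard against.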
