Skip to content

Add Latin Hypercube Sampling for Monte Carlo Analysis#57

Merged
bsergi merged 30 commits into
mainfrom
bs/lhs
May 22, 2026
Merged

Add Latin Hypercube Sampling for Monte Carlo Analysis#57
bsergi merged 30 commits into
mainfrom
bs/lhs

Conversation

@bsergi
Copy link
Copy Markdown
Contributor

@bsergi bsergi commented Apr 23, 2026

Summary

This PR adds the option to sample using a Latin Hypercube sampling (LHS) method when running a Monte Carlo analysis. It also fixes the sampling approach for load.

Technical details

LHS is a quasi-random sampling method that partitions each input distribution being sample into N bins of equal probability and then draws one sample from each bin. The advantage of this method is that it enables convergence to the "true" input distribution with a lower number of samples. The figure below illustrates this with a comparison of random and LHS methods for uniform and triangular costs using the 2050 nuclear ATB costs:

image

More details on the LHS approach can be found in Sheikholeslami, R., & Razavi, S. (2017).

Implementation notes

The LHS method is implemented in the mcs_sampler.py during input processing. When activated, the LHS method samples an n x d matrix where n=number of samples (specified by MCS_runs) and d=dimensions, or the number of variables being sample (specified by MCS_dist_groups).

This matrix is generated upfront and provides the place on the cumulative distribution function for each sample. These are subsequently used by a set of lhs functions to derive the realized values from the respective distributions. The original LHS sample matrix is saved to inputs_case/mcs_latin_hypercube_samples. Note that this approach is distinct from the implementation of the pure random approach, which samples weights and the applies them to the relevant files and switches.

Additional changes

Switches added/removed/changed

Adds MCS_lhs: 0 to use random sampling, 1 to use LHS (default: 1)
Modifies input_processing_only: adds an option 2 that stops input processing right after Monte Carlo sampling (useful for testing the input distributions before running).

Issues resolved

Partially addresses #41 by fixing load.

Known incompatibilities

Relevant sources or documentation

Validation, testing, and comparison report(s)

Monte Carlo sampling is off by default so there is no change to the default case (see compare below).

results-main,update.pptx

The next slide deck summarizes input distribution and results from a set of 54-region ReEDS runs using both the random and LHS approaches for a different number of samples. In general the LHS converges faster to the expected input distributions than the random approach. Both approaches yield reasonably comparable results on aggregate metrics such as mean and 90% coverage for installed capacity, annual generation, and system costs.

20260430_latin_hypercube_sampling.pdf

Checklist for author

Details to double-check

  • Charge code provided to reviewers
  • Included comparison reports for appropriate test cases
  • Documentation updated if necessary
  • If input data added/modified:
    • Dollar year recorded and converted to 2004$ for GAMS
    • Timeseries are in Central Time
    • Units are specified
    • Preprocessing steps have been documented and committed to ReEDS_Input_Processing
    • New large data files handled with .h5 instead of .csv
    • If spatially resolved inputs are modified, the following visualizations for each file are included in the PR description (time-averaged if the inputs are time-resolved):
      • Map of absolute values before
      • Map of absolute values after
      • Map of differences: (after - before) or (after / before)
    • If entries are added/removed/changed in the EIA-NEMS unit database:
      • Changes have been committed to ReEDS_Input_Processing
      • hourlize/resource.py was rerun to regenerate the existing/prescribed VRE capacity data
  • Code formatting standardized
  • Reusable functions used where possible instead of copy/pasted code

General information to guide review

  • Zero impact on results of default case
  • No large data file(s) added/modified
  • No substantive impact on runtime for full-US reference case
  • No substantive impact on folder size for full-US reference case
  • No change to process flow (runbatch.py, d_solve_iterate.py)
  • No change to code organization
  • No change to package requirements (environment.yml or Project.toml)

Did you use LLM tools (chatbot or copilot) in the preparation of this PR? If so, describe how

I used Claude to generate the function docstrings

Tag points of contact here if you would like additional review of the relevant parts of the model

@bsergi bsergi marked this pull request as draft April 23, 2026 19:40
@bsergi bsergi self-assigned this Apr 23, 2026
Copy link
Copy Markdown
Contributor

@patrickbrown4 patrickbrown4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool! I think we should keep the old MC approach working (currently the MonteCarlo case in cases_test.csv is failing for me on this branch) but otherwise most of my comments are just stylistic. Happy to approve once the test is passing again.

Comment thread docs/source/user_guide.md Outdated
Comment thread input_processing/mcs_sampler.py Outdated
Comment thread input_processing/mcs_sampler.py Outdated
"""Check whether a dirichlet distribution can be mapped to a supported LHS distribution.

Dirichlet distributions with identical parameters of length 2 or 3 are
equivalent to uniform or triangular distributions, respectively, which
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you say more about how the usage with 3 parameters relates to the Dirichlet distribution? The "peak" of the 3-equal-parameter Dirichlet could range between a mix of low/high and pure mid, and since the high/mid and mid/low ratios change over time, they're not exactly equivalent. Does LHS use a pure triangular distribution or is it more like Dirichlet?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed; adding some spaghetti plots of the inputs could help explain the behavior

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attaching a few spaghetti plots comparing dirichlet(1,1,1) and triangular. For the former you notice some different angles, but I actually didn't observe any samples that crossed-over each other. I think that probably has to do with the low likelihood that you sample twice in a two places that are close enough to that they could cross, but I'm not certain.

Regardless, after some reflection I think it's probably best to just enforce using the triangular, uniform, or discrete with the latin hypercube and not try to translate a dirichlet over, so I've updated that accordingly.

20260514_lhs_distribution_comparison.pptx

Comment thread runbatch.py Outdated
Comment thread cases.csv
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add demand back to the MonteCarlo case in cases_test.csv so we can keep an eye on it? I think it would just be adding .load_country to its MCS_dist_groups setting.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And what do you think about adding a LHS test to cases_test.csv too? I could see it being used more often than some of the other capabilities we test for, and since the Monte Carlo capability is pretty sensitive to input file structure (which changes relatively often) it could be good to keep it in the test suite.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call on adding load back. I also like the idea of a separate LHS test so will add that in as well.

Comment thread input_processing/mcs_sampler.py Outdated
Comment thread input_processing/mcs_sampler.py Outdated
Comment thread reeds/input_processing/mcs_sampler.py Outdated
Comment on lines +1961 to +1963
# use sampling capability to set up random vector for modeing to generate alternatives (MGA)
if random_vector:
print("MGA random vector sampling will be implemented in the future")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd remove this for now since it's not implemented yet

Comment thread runbatch.py Outdated
# if using random sampling, set random seed using seed + MCS run number
# to allow reproducibility without having the same sample for each MCS-ReEDS call
else:
np.random.seed(seed + mcs_run_number)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there's still randomness in LHS, I think it might be best to still run np.random.seed(seed) for reproducibility when using LHS.

You could use the approach taken for the pras_seed switch, where if the switch is set to 0 we do not set the seed, and if it's a nonzero integer we set it to the integer; that way a user could still turn off the seed if desired. Personally I would set a nonzero default for reproducibility, but either works as long as it's user accessible.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incidentally the explicit seed would also make it easier to extend a batch of random Monte Carlo runs; like if you do 300 and decide you want 200 more, you could just set the seed to 300 to get the "next" 200 samples

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of thoughts on this one, let me know your reactions:

  • The LHS sampling is done via scipy.stats.qmc.LatinHypercube. From what I can tell np.random isn't used by those methods and it isn't used elsewhere in the mcs_sampler.py script, so I don't think setting this would impact things.
  • The function I mentioned above does take a seed argument, and we do pass one to it (the default it zero). I just did a quick test of two different runs and got the same values for the LHS matrix, so I think we've got the reproducibility piece covered.
    • One thing to note is that the changing the number of samples or dimensions changes the matrix. That means that otherwise two identical runs with N=50 and N=100, or two runs with the same N but where one sampled UPV costs the other sampled UPV + gas prices, would have different values even for shared parameters. So, the seed only guarantees reproducibility for the same setup.
  • There isn't a straightforward way to extend the LHS matrix with more samples and still have it be a Latin Hypercube; the reason is that the sampling method bins the distribution and puts one sample in each bin, so adding new samples results in overlap.
    • Your suggestion about setting the explicit seed would work for the random approach so I can add support for that. It's already in the python script so we just need to add to ReEDS; I put in as a scalar but let me know if you think a switch is more appropriate here. I also added some explanation on the usage in the docs/user_guide section.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thanks, that all makes sense. Sorry to complicate things. I think the way you have it now, with the MCS_seed being used by both the random and LHS methods, is great.

Copy link
Copy Markdown
Contributor

@patrickbrown4 patrickbrown4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going ahead and approving since the rest of the comments are mostly stylistic (though I do think it'd be good to add a LHS scenario to cases_test.csv and to be able to fix the LHS seed)

Copy link
Copy Markdown
Contributor

@kodiobika kodiobika left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool! LGTM

Comment thread docs/source/user_guide.md Outdated
Comment thread docs/source/user_guide.md Outdated
Comment thread docs/source/user_guide.md Outdated
Comment thread input_processing/mcs_sampler.py Outdated
Comment on lines +428 to +432
## check that distribution specified is valid
if distribution not in MCSConstants.VALID_DISTRIBUTIONS:
raise ValueError(
f"The distribution {distribution} is not supported."
f"Please choose one of the following: {MCSConstants.VALID_DISTRIBUTIONS}")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably better to check this upfront (before the for loop). In general I'd say it's ideal to check as many things upfront as possible to avoid going through the loop and then finding issues at the end, but I know some of the checks become more complicated outside of the for loop so no need to change any/everything

Suggested change
## check that distribution specified is valid
if distribution not in MCSConstants.VALID_DISTRIBUTIONS:
raise ValueError(
f"The distribution {distribution} is not supported."
f"Please choose one of the following: {MCSConstants.VALID_DISTRIBUTIONS}")
## check that all the distributions specified are valid
invalid_distributions = [
dist for dist in df_input_dist['dist'].unique()
if dist not in MCSConstants.VALID_DISTRIBUTIONS
]
if len(invalid_distributions) > 0:
raise ValueError(
f"The following distributions are not supported: {invalid_distributions}."
f"Please choose one of the following: {MCSConstants.VALID_DISTRIBUTIONS}")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved a few of the checks up out of the loop. Since the distributions I added complicate some of the rule checking I moved some of the rules in a new file (mcs_distribution_rules.yaml). There's still a loop at the end checking some things that I wasn't sure about change but those could probably be vectorized at some point too.

Comment thread input_processing/mcs_sampler.py Outdated
Comment thread input_processing/mcs_sampler.py Outdated
Comment thread input_processing/mcs_sampler.py Outdated
Comment on lines +1784 to +1785
#self._apply_weights_recf(dist_files, sample_idx)
pass
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this case raise an error?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now we flag that the sampling with siting is disabled upstream, so I'm going to remove the pass and revert the comment.

Copy link
Copy Markdown
Contributor Author

@bsergi bsergi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some changes to address the comments. I still need to update with main but will aim to do that later this week. In the meantime let me know if you have additional comments, and thanks for reviewing!

Comment thread input_processing/mcs_sampler.py Outdated
Comment on lines +428 to +432
## check that distribution specified is valid
if distribution not in MCSConstants.VALID_DISTRIBUTIONS:
raise ValueError(
f"The distribution {distribution} is not supported."
f"Please choose one of the following: {MCSConstants.VALID_DISTRIBUTIONS}")
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved a few of the checks up out of the loop. Since the distributions I added complicate some of the rule checking I moved some of the rules in a new file (mcs_distribution_rules.yaml). There's still a loop at the end checking some things that I wasn't sure about change but those could probably be vectorized at some point too.

Comment thread input_processing/mcs_sampler.py Outdated
"""Check whether a dirichlet distribution can be mapped to a supported LHS distribution.

Dirichlet distributions with identical parameters of length 2 or 3 are
equivalent to uniform or triangular distributions, respectively, which
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attaching a few spaghetti plots comparing dirichlet(1,1,1) and triangular. For the former you notice some different angles, but I actually didn't observe any samples that crossed-over each other. I think that probably has to do with the low likelihood that you sample twice in a two places that are close enough to that they could cross, but I'm not certain.

Regardless, after some reflection I think it's probably best to just enforce using the triangular, uniform, or discrete with the latin hypercube and not try to translate a dirichlet over, so I've updated that accordingly.

20260514_lhs_distribution_comparison.pptx

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok that's interesting. I took a pass at fixing for Monte Carlo runs in this commit but I've never used this feature before so might be good for a second set of eyes there.

Comment thread runbatch.py Outdated
Comment thread input_processing/mcs_sampler.py Outdated
if file_name == 'load.h5':
columns_other_states = [col for col in dist_files[0].keys() if col not in columns_in_hierarchy]
if len(columns_other_states) > 0:
generic_weight_matrix = self.r_weights[next(iter(self.r_weights))]
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Comment thread input_processing/mcs_sampler.py Outdated
if single_r_weight:
# Get the first region key
first_region = next(iter(self.r_weights))
first_region = next(iter(self.r_weights))
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is easy enough to change so adjusted. I wasn't familiar with the next(iter()) syntax but I will say it is growing on me!

Comment thread runreeds.py Outdated
# otherwise, drop any case marked ignore
if single:
if case not in single.split(','):
if not sum([s in case for s in single.split(',')]):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this might give weird behavior since some case names are subsets of other case names. I think my suggestion would just be to ignore the issue I raised about using --single for Monte Carlo, since it's kind of ill-defined for multi-case MC runs anyway, and is not directly related to the rest of the PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, will revert

@bsergi bsergi merged commit 719c3d2 into main May 22, 2026
10 checks passed
@bsergi bsergi deleted the bs/lhs branch May 22, 2026 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants