Conversation
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
patrickbrown4
left a comment
There was a problem hiding this comment.
Very cool! I think we should keep the old MC approach working (currently the MonteCarlo case in cases_test.csv is failing for me on this branch) but otherwise most of my comments are just stylistic. Happy to approve once the test is passing again.
| """Check whether a dirichlet distribution can be mapped to a supported LHS distribution. | ||
|
|
||
| Dirichlet distributions with identical parameters of length 2 or 3 are | ||
| equivalent to uniform or triangular distributions, respectively, which |
There was a problem hiding this comment.
Could you say more about how the usage with 3 parameters relates to the Dirichlet distribution? The "peak" of the 3-equal-parameter Dirichlet could range between a mix of low/high and pure mid, and since the high/mid and mid/low ratios change over time, they're not exactly equivalent. Does LHS use a pure triangular distribution or is it more like Dirichlet?
There was a problem hiding this comment.
Discussed; adding some spaghetti plots of the inputs could help explain the behavior
There was a problem hiding this comment.
Attaching a few spaghetti plots comparing dirichlet(1,1,1) and triangular. For the former you notice some different angles, but I actually didn't observe any samples that crossed-over each other. I think that probably has to do with the low likelihood that you sample twice in a two places that are close enough to that they could cross, but I'm not certain.
Regardless, after some reflection I think it's probably best to just enforce using the triangular, uniform, or discrete with the latin hypercube and not try to translate a dirichlet over, so I've updated that accordingly.
There was a problem hiding this comment.
Could you add demand back to the MonteCarlo case in cases_test.csv so we can keep an eye on it? I think it would just be adding .load_country to its MCS_dist_groups setting.
There was a problem hiding this comment.
And what do you think about adding a LHS test to cases_test.csv too? I could see it being used more often than some of the other capabilities we test for, and since the Monte Carlo capability is pretty sensitive to input file structure (which changes relatively often) it could be good to keep it in the test suite.
There was a problem hiding this comment.
Good call on adding load back. I also like the idea of a separate LHS test so will add that in as well.
| # use sampling capability to set up random vector for modeing to generate alternatives (MGA) | ||
| if random_vector: | ||
| print("MGA random vector sampling will be implemented in the future") |
There was a problem hiding this comment.
I'd remove this for now since it's not implemented yet
| # if using random sampling, set random seed using seed + MCS run number | ||
| # to allow reproducibility without having the same sample for each MCS-ReEDS call | ||
| else: | ||
| np.random.seed(seed + mcs_run_number) |
There was a problem hiding this comment.
Since there's still randomness in LHS, I think it might be best to still run np.random.seed(seed) for reproducibility when using LHS.
You could use the approach taken for the pras_seed switch, where if the switch is set to 0 we do not set the seed, and if it's a nonzero integer we set it to the integer; that way a user could still turn off the seed if desired. Personally I would set a nonzero default for reproducibility, but either works as long as it's user accessible.
There was a problem hiding this comment.
Incidentally the explicit seed would also make it easier to extend a batch of random Monte Carlo runs; like if you do 300 and decide you want 200 more, you could just set the seed to 300 to get the "next" 200 samples
There was a problem hiding this comment.
A couple of thoughts on this one, let me know your reactions:
- The LHS sampling is done via
scipy.stats.qmc.LatinHypercube. From what I can tellnp.randomisn't used by those methods and it isn't used elsewhere in themcs_sampler.pyscript, so I don't think setting this would impact things. - The function I mentioned above does take a seed argument, and we do pass one to it (the default it zero). I just did a quick test of two different runs and got the same values for the LHS matrix, so I think we've got the reproducibility piece covered.
- One thing to note is that the changing the number of samples or dimensions changes the matrix. That means that otherwise two identical runs with N=50 and N=100, or two runs with the same N but where one sampled UPV costs the other sampled UPV + gas prices, would have different values even for shared parameters. So, the seed only guarantees reproducibility for the same setup.
- There isn't a straightforward way to extend the LHS matrix with more samples and still have it be a Latin Hypercube; the reason is that the sampling method bins the distribution and puts one sample in each bin, so adding new samples results in overlap.
- Your suggestion about setting the explicit seed would work for the random approach so I can add support for that. It's already in the python script so we just need to add to ReEDS; I put in as a scalar but let me know if you think a switch is more appropriate here. I also added some explanation on the usage in the docs/user_guide section.
There was a problem hiding this comment.
Ok thanks, that all makes sense. Sorry to complicate things. I think the way you have it now, with the MCS_seed being used by both the random and LHS methods, is great.
patrickbrown4
left a comment
There was a problem hiding this comment.
Going ahead and approving since the rest of the comments are mostly stylistic (though I do think it'd be good to add a LHS scenario to cases_test.csv and to be able to fix the LHS seed)
| ## check that distribution specified is valid | ||
| if distribution not in MCSConstants.VALID_DISTRIBUTIONS: | ||
| raise ValueError( | ||
| f"The distribution {distribution} is not supported." | ||
| f"Please choose one of the following: {MCSConstants.VALID_DISTRIBUTIONS}") |
There was a problem hiding this comment.
It's probably better to check this upfront (before the for loop). In general I'd say it's ideal to check as many things upfront as possible to avoid going through the loop and then finding issues at the end, but I know some of the checks become more complicated outside of the for loop so no need to change any/everything
| ## check that distribution specified is valid | |
| if distribution not in MCSConstants.VALID_DISTRIBUTIONS: | |
| raise ValueError( | |
| f"The distribution {distribution} is not supported." | |
| f"Please choose one of the following: {MCSConstants.VALID_DISTRIBUTIONS}") | |
| ## check that all the distributions specified are valid | |
| invalid_distributions = [ | |
| dist for dist in df_input_dist['dist'].unique() | |
| if dist not in MCSConstants.VALID_DISTRIBUTIONS | |
| ] | |
| if len(invalid_distributions) > 0: | |
| raise ValueError( | |
| f"The following distributions are not supported: {invalid_distributions}." | |
| f"Please choose one of the following: {MCSConstants.VALID_DISTRIBUTIONS}") |
There was a problem hiding this comment.
I moved a few of the checks up out of the loop. Since the distributions I added complicate some of the rule checking I moved some of the rules in a new file (mcs_distribution_rules.yaml). There's still a loop at the end checking some things that I wasn't sure about change but those could probably be vectorized at some point too.
| #self._apply_weights_recf(dist_files, sample_idx) | ||
| pass |
There was a problem hiding this comment.
Should this case raise an error?
There was a problem hiding this comment.
Right now we flag that the sampling with siting is disabled upstream, so I'm going to remove the pass and revert the comment.
bsergi
left a comment
There was a problem hiding this comment.
Made some changes to address the comments. I still need to update with main but will aim to do that later this week. In the meantime let me know if you have additional comments, and thanks for reviewing!
| ## check that distribution specified is valid | ||
| if distribution not in MCSConstants.VALID_DISTRIBUTIONS: | ||
| raise ValueError( | ||
| f"The distribution {distribution} is not supported." | ||
| f"Please choose one of the following: {MCSConstants.VALID_DISTRIBUTIONS}") |
There was a problem hiding this comment.
I moved a few of the checks up out of the loop. Since the distributions I added complicate some of the rule checking I moved some of the rules in a new file (mcs_distribution_rules.yaml). There's still a loop at the end checking some things that I wasn't sure about change but those could probably be vectorized at some point too.
| """Check whether a dirichlet distribution can be mapped to a supported LHS distribution. | ||
|
|
||
| Dirichlet distributions with identical parameters of length 2 or 3 are | ||
| equivalent to uniform or triangular distributions, respectively, which |
There was a problem hiding this comment.
Attaching a few spaghetti plots comparing dirichlet(1,1,1) and triangular. For the former you notice some different angles, but I actually didn't observe any samples that crossed-over each other. I think that probably has to do with the low likelihood that you sample twice in a two places that are close enough to that they could cross, but I'm not certain.
Regardless, after some reflection I think it's probably best to just enforce using the triangular, uniform, or discrete with the latin hypercube and not try to translate a dirichlet over, so I've updated that accordingly.
There was a problem hiding this comment.
Ah ok that's interesting. I took a pass at fixing for Monte Carlo runs in this commit but I've never used this feature before so might be good for a second set of eyes there.
| if file_name == 'load.h5': | ||
| columns_other_states = [col for col in dist_files[0].keys() if col not in columns_in_hierarchy] | ||
| if len(columns_other_states) > 0: | ||
| generic_weight_matrix = self.r_weights[next(iter(self.r_weights))] |
| if single_r_weight: | ||
| # Get the first region key | ||
| first_region = next(iter(self.r_weights)) | ||
| first_region = next(iter(self.r_weights)) |
There was a problem hiding this comment.
This one is easy enough to change so adjusted. I wasn't familiar with the next(iter()) syntax but I will say it is growing on me!
| # otherwise, drop any case marked ignore | ||
| if single: | ||
| if case not in single.split(','): | ||
| if not sum([s in case for s in single.split(',')]): |
There was a problem hiding this comment.
Hm, this might give weird behavior since some case names are subsets of other case names. I think my suggestion would just be to ignore the issue I raised about using --single for Monte Carlo, since it's kind of ill-defined for multi-case MC runs anyway, and is not directly related to the rest of the PR.
There was a problem hiding this comment.
sounds good, will revert
This reverts commit eaad50c.
Summary
This PR adds the option to sample using a Latin Hypercube sampling (LHS) method when running a Monte Carlo analysis. It also fixes the sampling approach for load.
Technical details
LHS is a quasi-random sampling method that partitions each input distribution being sample into N bins of equal probability and then draws one sample from each bin. The advantage of this method is that it enables convergence to the "true" input distribution with a lower number of samples. The figure below illustrates this with a comparison of random and LHS methods for uniform and triangular costs using the 2050 nuclear ATB costs:
More details on the LHS approach can be found in Sheikholeslami, R., & Razavi, S. (2017).
Implementation notes
The LHS method is implemented in the
mcs_sampler.pyduring input processing. When activated, the LHS method samples an n x d matrix where n=number of samples (specified byMCS_runs) and d=dimensions, or the number of variables being sample (specified byMCS_dist_groups).This matrix is generated upfront and provides the place on the cumulative distribution function for each sample. These are subsequently used by a set of lhs functions to derive the realized values from the respective distributions. The original LHS sample matrix is saved to
inputs_case/mcs_latin_hypercube_samples. Note that this approach is distinct from the implementation of the pure random approach, which samples weights and the applies them to the relevant files and switches.Additional changes
Switches added/removed/changed
Adds
MCS_lhs: 0 to use random sampling, 1 to use LHS (default: 1)Modifies
input_processing_only: adds an option 2 that stops input processing right after Monte Carlo sampling (useful for testing the input distributions before running).Issues resolved
Partially addresses #41 by fixing load.
Known incompatibilities
Relevant sources or documentation
Validation, testing, and comparison report(s)
Monte Carlo sampling is off by default so there is no change to the default case (see compare below).
results-main,update.pptx
The next slide deck summarizes input distribution and results from a set of 54-region ReEDS runs using both the random and LHS approaches for a different number of samples. In general the LHS converges faster to the expected input distributions than the random approach. Both approaches yield reasonably comparable results on aggregate metrics such as mean and 90% coverage for installed capacity, annual generation, and system costs.
20260430_latin_hypercube_sampling.pdf
Checklist for author
Details to double-check
hourlize/resource.pywas rerun to regenerate the existing/prescribed VRE capacity dataGeneral information to guide review
Did you use LLM tools (chatbot or copilot) in the preparation of this PR? If so, describe how
I used Claude to generate the function docstrings
Tag points of contact here if you would like additional review of the relevant parts of the model