@mronkko
Contributor

@mronkko mronkko commented Dec 13, 2025

Summary

Implements #74 by adding a prepare parameter to runSimulation() that executes once per simulation condition to prepare/modify fixed_objects before replications run.

Implementation

The prepare function:

  • Accepts condition and fixed_objects as arguments
  • Returns the modified fixed_objects
  • Is called inside Analysis(), before the replication loop (mirroring how summarise is called after it)
  • Executes once per design row (condition)
  • Modified fixed_objects is passed to all replications for that condition
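
The call order in the list above can be pictured with a small base-R sketch. This is only a simplified model under stated assumptions, not the Analysis() code from the PR: `run_condition()` and `n_prepare_calls` are hypothetical names introduced for illustration, and the design is a plain data.frame rather than a createDesign() object.

```r
# Simplified model of the call order (hypothetical helper, not SimDesign code).
run_condition <- function(condition, fixed_objects, replications,
                          prepare = NULL, generate, analyse) {
    # prepare() runs once per design row, before any replication
    if (!is.null(prepare))
        fixed_objects <- prepare(condition, fixed_objects)
    # every replication then sees the same prepared fixed_objects
    vapply(seq_len(replications), function(r) {
        dat <- generate(condition, fixed_objects)
        analyse(condition, dat, fixed_objects)
    }, numeric(1))
}

n_prepare_calls <- 0
prepare <- function(condition, fixed_objects) {
    n_prepare_calls <<- n_prepare_calls + 1     # count calls for the demo
    fixed_objects$mu <- rep(0, condition$N)     # stand-in for an expensive object
    fixed_objects
}
generate <- function(condition, fixed_objects)
    rnorm(condition$N, mean = fixed_objects$mu)
analyse <- function(condition, dat, fixed_objects) mean(dat)

Design <- data.frame(N = c(10, 100))
results <- lapply(seq_len(nrow(Design)), function(i)
    run_condition(Design[i, , drop = FALSE], list(), replications = 5,
                  prepare = prepare, generate = generate, analyse = analyse))
```

With two design rows and five replications each, the counter ends at two, which is the point of the feature: the expensive step scales with the number of conditions, not the number of replications.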

Use Case

Pre-compute expensive condition-specific objects (design matrices, correlation matrices, lookup tables) once per condition instead of per replication. This avoids both:

  • Memory issues from pre-computing objects for all conditions upfront
  • Performance issues from recomputing expensive objects for every replication

Example Usage

library(SimDesign)

Design <- createDesign(N = c(100, 1000, 10000))

prepare <- function(condition, fixed_objects) {
    # Pre-compute expensive condition-specific objects
    fixed_objects$design_matrix <- matrix(rnorm(condition$N * 10), ncol=10)
    # compute_expensive_lookup() stands in for user-defined code
    fixed_objects$lookup_table <- compute_expensive_lookup(condition$N)
    return(fixed_objects)
}

generate <- function(condition, fixed_objects) {
    # Use prepared objects from fixed_objects
    X <- fixed_objects$design_matrix
    y <- X %*% rnorm(10) + rnorm(nrow(X))
    data.frame(y=y, X)
}

# analyse() and summarise() are assumed to be defined elsewhere
runSimulation(Design, replications=1000,
              prepare=prepare,
              generate=generate,
              analyse=analyse,
              summarise=summarise)

Changes

  • Added prepare parameter to runSimulation() function signature
  • Added comprehensive documentation for the prepare parameter
  • Added validation for prepare function signature (must include condition, fixed_objects)
  • Added prepare parameter to Analysis() function
  • Implemented prepare call in Analysis() (once per condition, before replication loop)
  • Added parallel cluster export for prepare when provided
  • Added prepare globals to check.globals functionality
  • Full backward compatibility (defaults to NULL)
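
The signature validation mentioned in the list above could look roughly like the following. This is a minimal sketch, assuming the check is based on argument names; `validate_prepare()` is a hypothetical helper, and the check actually added in the PR may differ in wording and placement.

```r
# Hypothetical validator; the PR's actual check may differ.
validate_prepare <- function(prepare) {
    if (is.null(prepare)) return(invisible(TRUE))   # NULL keeps the old behaviour
    if (!is.function(prepare))
        stop("prepare must be a function or NULL", call. = FALSE)
    required <- c("condition", "fixed_objects")
    missing_args <- setdiff(required, names(formals(prepare)))
    if (length(missing_args))
        stop("prepare is missing argument(s): ",
             paste(missing_args, collapse = ", "), call. = FALSE)
    invisible(TRUE)
}

ok  <- validate_prepare(function(condition, fixed_objects) fixed_objects)
bad <- tryCatch(validate_prepare(function(x) x),
                error = function(e) conditionMessage(e))
```

Failing early on a malformed signature gives the user a readable message before any condition starts running.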

Testing

✅ Prepare function correctly modifies fixed_objects per condition
✅ Modified objects available in all user functions
✅ Backward compatibility maintained (existing code works unchanged)
✅ Proper error handling when prepare fails

Configure pbapply to display text-based progress bars when running
in non-interactive mode (e.g., batch jobs, Rscript). Previously,
progress tracking was disabled in non-interactive sessions, making
it impossible to monitor simulation progress in SLURM log files.

Changes:
- R/analysis.R: Add pboptions configuration to force type="txt"
  in non-interactive mode while preserving timer bars in interactive
  sessions
- R/runSimulation.R: Update progress parameter documentation to
  describe the new behavior

Interactive users see no change. Non-interactive users (SLURM, batch
jobs) now see text progress bars when monitoring logs via tail -f.

Fixes philchalmers#75
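
A minimal sketch of the kind of configuration this commit describes, assuming the pbapply package is installed. `configure_progress()` is a hypothetical wrapper introduced here; the change in R/analysis.R may be structured differently.

```r
# Hypothetical wrapper; assumes pbapply is installed.
configure_progress <- function() {
    if (!requireNamespace("pbapply", quietly = TRUE))
        return(invisible(NULL))
    # Outside interactive sessions, request the plain-text bar so progress
    # shows up in log files; interactive sessions keep their current setting.
    if (!interactive())
        pbapply::pboptions(type = "txt")
    invisible(pbapply::pboptions()$type)
}

bar_type <- configure_progress()
```

Run under Rscript or a batch scheduler, `bar_type` should then report the text bar, and the progress output lands in the job log.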
Implements philchalmers#74 by adding a prepare parameter to runSimulation() that
modifies fixed_objects once per condition before replications run.

The prepare function accepts condition and fixed_objects as arguments
and returns the modified fixed_objects, which is then passed to all
replications for that condition.

Use case: Pre-compute expensive condition-specific objects (design
matrices, lookup tables) once per condition instead of per replication,
avoiding both memory issues (from pre-computing all conditions) and
performance issues (from recomputing per replication).

Implementation:
- Added prepare parameter with validation
- Calls prepare(condition, fixed_objects) in main loop per condition
- Returns modified fixed_objects for use in replications
- Exports prepare to parallel clusters when provided
- Includes prepare globals in check.globals
- Full backward compatibility (prepare defaults to NULL)

Example:
prepare <- function(condition, fixed_objects) {
    fixed_objects$design_matrix <- matrix(rnorm(condition$N * 10), ncol=10)
    return(fixed_objects)
}
@philchalmers
Owner

I generally like this structure now, thanks. The use case is fine, but maybe not the best way to think about how to use this in the documentation.

The way that I see this being useful is if, within the prepare() definition, something like fixed_objects$expensive_stuff <- readRDS('prepare/expensive_stuff') were used. The main reason is that objects created inside prepare() are not returned by runSimulation(), since they are expected to eat a good amount of RAM and you wouldn't want them stored. Moreover, you'd certainly want to know what the objects built in prepare() actually look like, given that they are a key component of the experiment (hence, we should note that anything produced by random number generation will be lost with this approach, and therefore saving RDS objects beforehand would be a more reasonable strategy).

As an aside, I particularly like this readRDS() idea in situations where binary files are precompiled locally and distributed on the cluster, as that should be considered a set-once-and-forget-it part of the codebase.
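
The two-step workflow suggested in this comment might be sketched as follows. The directory, file naming scheme, and seed are assumptions made for illustration, and the design is a plain data.frame so the block runs without SimDesign.

```r
# Step 1 (run once, e.g. locally): build and save one object per condition.
Design <- data.frame(N = c(100, 1000))
prep_dir <- file.path(tempdir(), "prepare")     # hypothetical location
dir.create(prep_dir, showWarnings = FALSE)
set.seed(1234)                                  # seed lives in the script
for (i in seq_len(nrow(Design))) {
    obj <- matrix(rnorm(Design$N[i] * 10), ncol = 10)
    saveRDS(obj, file.path(prep_dir,
                           sprintf("design_matrix_N%d.rds", Design$N[i])))
}

# Step 2: prepare() only loads the file matching the current condition.
prepare <- function(condition, fixed_objects) {
    file <- file.path(prep_dir,
                      sprintf("design_matrix_N%d.rds", condition$N))
    fixed_objects$design_matrix <- readRDS(file)
    fixed_objects
}

fo <- prepare(Design[2, , drop = FALSE], list())
dim(fo$design_matrix)
```

Because the objects exist on disk, they can be inspected after the simulation, which addresses the concern that prepared objects are not returned by runSimulation().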

This commit adds comprehensive random number generator (RNG) state
management for the prepare() function, ensuring reproducibility and
debugging support consistent with generate/analyse/summarise functions.

Key Changes:

1. Seed Capture (R/analysis.R:15-52)
   - Automatically capture .Random.seed state before prepare() executes
   - Initialize RNG if .Random.seed doesn't exist yet
   - Store prepare error seed when prepare() fails for debugging

2. Seed Storage (R/analysis.R:26-37, 251-261)
   - Save prepare seeds to disk when save_seeds=TRUE
   - File path format: design-row-{ID}/prepare-seed
   - Store prepare_Random.seed in attributes when store_Random.seeds=TRUE
   - Always store prepare_error_seed for debugging (independent of flag)

3. New Parameter: load_seed_prepare (R/runSimulation.R:1033)
   - Dedicated parameter for debugging prepare function
   - Accepts character path, integer vector, or tibble/data.frame
   - Supports both absolute and relative file paths
   - Automatically detects path type and handles appropriately
   - Documented at R/runSimulation.R:345-352

4. Seed Extraction (R/SimExtract.R:120-123, 199-209)
   - SimExtract(res, 'prepare_seeds') - extract all prepare seeds
   - SimExtract(res, 'prepare_error_seed') - extract error seeds

5. Attribute Preservation (R/runSimulation.R:1635-1636)
   - Manually restore prepare seed attributes when Result_list is
     rebuilt as data.frame to prevent attribute loss
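
The seed-capture idea in items 1 and 2 can be illustrated in base R. This is a sketch only: `with_seed_capture()` is a hypothetical helper, not a function from the PR, and it assumes the RNG has already been initialised.

```r
# Hypothetical helper: run f() and remember the RNG state it started from.
# Assumes .Random.seed already exists in the global environment.
with_seed_capture <- function(f) {
    seed <- get(".Random.seed", envir = .GlobalEnv)   # state before f() runs
    list(value = f(), seed = seed)
}

set.seed(2025)                                # guarantees .Random.seed exists
first <- with_seed_capture(function() rnorm(3))

# Later (e.g. when debugging): restore the stored state and rerun.
assign(".Random.seed", first$seed, envir = .GlobalEnv)
second <- rnorm(3)
same_draws <- identical(first$value, second)
```

Restoring the captured state before rerunning reproduces the same draws, which is what makes a stored prepare seed useful for debugging a failing condition.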

Example Usage:

# Run simulation with prepare that uses RNG
res <- runSimulation(Design, replications=10,
                     prepare=prepare,  # Uses rnorm(), runif(), etc.
                     control=list(save_seeds=TRUE,
                                  store_Random.seeds=TRUE))

# Extract prepare seeds for reproducibility
prepare_seeds <- SimExtract(res, 'prepare_seeds')

# Debug prepare errors by loading the error seed
res2 <- runSimulation(Design[2,], replications=1,
                      load_seed_prepare='design-row-2/prepare-seed')

Design Decisions:

- prepare_Random.seed only stored when store_Random.seeds=TRUE for
  consistency with stored_Random.seeds behavior
- prepare_error_seed always stored for debugging, like error_seeds
  and warning_seeds
- Separate attributes (prepare_Random.seed, prepare_error_seed)
  instead of nested list for consistency with existing codebase patterns
- File path detection allows both absolute and relative paths

Related: Complements PR philchalmers#78 (prepare function feature)
@mronkko
Contributor Author

mronkko commented Dec 14, 2025

I implemented seed storing for prepare in the pull request.

Pre-generating and loading the prepared objects is a solution, but it is not always an ideal approach:

  1. The pre-generation can be costly and thus better run on a cluster instead of a local computer. The files can also be large, making storage and transfer cumbersome.
  2. For reproducibility by others (i.e. how easy the code is to run and understand), it might be better to have one simulation file that does all preparation in one function instead of two functions for calculating and loading precalculated results.

This change allows both use cases, 1) prepare as a loader and 2) prepare as a data generator shared with all replications.

@philchalmers
Owner

philchalmers commented Dec 14, 2025

> I implemented seed storing for prepare in the pull request.
>
> Pre-generating and loading the prepared objects is a solution, but it is not always an ideal approach:
>
>   1. The pre-generation can be costly and thus better run on a cluster instead of a local computer. The files can also be large, making storage and transfer cumbersome.

True, but this should be considered the exception rather than the rule. I was referring to highlighting this in the documentation, as object generation within prepare() is unnecessary for a wide majority of simulations. Moreover, the prepare() step is run on a single core on the cluster, while all the prepare() functions across the design could easily be run in parallel locally or, for instance, on a SLURM landing node, and stored as individual and tractable objects. If the objects themselves are large I don't see why uploads to the cluster are going to be an issue, unless for some reason bandwidth is the issue. Of course, if the objects are so large that they can only be stored temporarily on the distributed arrays then you're forced to use this approach, in which case tracking what the actual generated objects were at a later time will be a time-consuming nightmare...

>   2. For reproducibility by others (i.e. how easy the code is to run and understand), it might be better to have one simulation file that does all preparation in one function instead of two functions for calculating and loading precalculated results.

The two steps can be handled using the usual source() approach early in the object preparation stage, on or off the landing node, while prepare() does the latter. I don't see the need to split more than is already available.

> This change allows both use cases, 1) prepare as a loader and 2) prepare as a data generator shared with all replications.

Great, I think this is coming together. Could you update the NEWS.md file to reflect the two pulls, and switch your ctb status to aut in DESCRIPTION? A few tests should probably be added to the tests/ directory as well just to make sure this works consistently in future releases.

R/analysis.R Outdated
.GlobalEnv$.Random.seed <- load_seed_prepare

# Ensure .Random.seed exists (initialize RNG if needed)
else if(!exists(".Random.seed", envir = .GlobalEnv))
Owner


This should be moved to .onAttach(), as it affects the other .Random.seed instances too

Contributor Author


I pushed the change that moves RNG initialization to .onAttach()
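
For readers unfamiliar with the hook, RNG initialization in `.onAttach()` might look something like this. It is a sketch under assumptions, not the code that was pushed: drawing a single uniform number is one common way to make R create `.Random.seed`, and the package may do it differently.

```r
# Sketch of a package hook; .onAttach(libname, pkgname) is R's standard signature.
.onAttach <- function(libname, pkgname) {
    # Drawing one number forces R to create .Random.seed if it is absent.
    if (!exists(".Random.seed", envir = .GlobalEnv))
        invisible(stats::runif(1))
}

# Simulate a fresh session, then call the hook by hand.
if (exists(".Random.seed", envir = .GlobalEnv))
    rm(".Random.seed", envir = .GlobalEnv)
.onAttach(NULL, "SimDesign")
seed_exists <- exists(".Random.seed", envir = .GlobalEnv)
```

Doing this once at attach time means every later read of `.Random.seed`, in prepare() or elsewhere, can assume the object exists.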

- Rewrite @param prepare to prioritize loading RDS files over dynamic generation
- Add RNG reproducibility warning when generating within prepare()
- Note that prepare objects are not stored by runSimulation()
- Add complete working example demonstrating recommended two-step workflow
- Document prepare seed storage in save_seeds and store_Random.seeds sections

Changes address feedback from PR philchalmers#78 to position prepare() primarily as
an object loader for cluster workflows, with dynamic generation as a
secondary use case requiring explicit RNG state management.
@mronkko
Contributor Author

mronkko commented Dec 15, 2025

I added tests, updated the documentation, added a new release entry to NEWS, and fixed my details in DESCRIPTION.

@philchalmers philchalmers merged commit 962eb77 into philchalmers:main Dec 15, 2025
2 of 3 checks passed