Support migrating existing R script to JupyterLab (R Docker): packages, paths, and DB connectivity #16

@ShishirShakya




Context

We have a fully functional R script developed in RStudio for our analysis. We are requesting assistance or guidance on the most efficient and recommended path to adapt and utilize this existing R code within the new JupyterLab environment (R Docker image). We want to ensure we do not have to re-create our work. A meeting with a data engineer has not yet been scheduled; we would like to align this issue with that discussion when it is scheduled.

Reference script: shakya_sepsis_july2025-Long.R. Credentials are suppressed. Line numbers in this issue refer to the original script; the shared copy may differ slightly.


1. R package availability

Request: Please confirm whether the following R packages can be installed (or are pre-installed) in the R Docker image used in the ML Workspace / JupyterLab:

| Package | Purpose in script |
| --- | --- |
| rstudioapi | Used only for script directory / working directory (see §2). |
| tidyverse | dplyr, tidyr, ggplot2, purrr, stringr, tibble, lubridate, etc. |
| data.table | rbindlist for combining batched query results. |
| DatabaseConnector | PostgreSQL connection and querySql for OMOP CDM. |
| caret | nearZeroVar for feature screening. |
| janitor | make_clean_names for variable names. |
| mice | Multiple imputation. |
| fpc | clusterboot, kmeansCBI for cluster stability. |
| FSA | dunnTest for post-hoc pairwise comparisons. |

Note: DatabaseConnector may require Java/JDBC drivers for PostgreSQL. Please confirm whether Java is available in the Docker image and whether any additional driver setup is required.

Request (install from session): Are users allowed to install packages from CRAN from within the R session (e.g. via install.packages(..., repos = "http://cran.us.r-project.org") for updates or extra packages), or must all packages be pre-installed in the image?
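
To make the confirmation easier, the check below could be run in any R session inside the image. It is a minimal sketch using only base R; the package list is copied from the table above, and missing_packages is a hypothetical helper name, not something from the script.

```r
# Packages the script loads (taken from the table above).
required <- c("rstudioapi", "tidyverse", "data.table", "DatabaseConnector",
              "caret", "janitor", "mice", "fpc", "FSA")

# Hypothetical helper: return the packages that cannot be found in any
# library path. system.file() returns "" for a package that is not installed,
# and it does not load the package.
missing_packages <- function(pkgs) {
  found <- vapply(pkgs, function(p) nzchar(system.file(package = p)), logical(1))
  pkgs[!found]
}

missing <- missing_packages(required)
if (length(missing) > 0) {
  message("Not installed: ", paste(missing, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

If in-session installation turns out to be allowed, the resulting vector could be passed to install.packages(); if not, it is the list we would ask to have added to the image.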


2. RStudio-specific code (adaptation for JupyterLab)

The script currently relies on RStudio in two places and will need equivalent patterns in JupyterLab:

Working directory (lines 23–24)

  • Current: path <- dirname(rstudioapi::getActiveDocumentContext()$path) and setwd(path).
  • Issue: In JupyterLab there is no “active document path” in the same way.
  • Request: What is the recommended way to set the working directory in the R kernel (e.g., fixed project path, environment variable, or “notebook directory” equivalent) so we can replace this without breaking file paths?
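
One replacement we are considering is sketched below, pending your recommendation. It assumes that either a project path is exposed through an environment variable (PROJECT_DIR is a hypothetical name) or that the kernel starts in the notebook's folder; both assumptions need to be confirmed for this image.

```r
# Hypothetical helper: prefer a configured project directory, otherwise fall
# back to the directory the R kernel was started in.
resolve_project_dir <- function(env_var = "PROJECT_DIR") {
  configured <- Sys.getenv(env_var, unset = "")
  if (nzchar(configured) && dir.exists(configured)) {
    return(normalizePath(configured))
  }
  normalizePath(getwd())
}

# Drop-in replacement for the two rstudioapi lines in the script.
path <- resolve_project_dir()
setwd(path)
```

This keeps the variable name path that the rest of the script already uses, so relative paths such as outcomes/... would not need to change.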

Graphics device (lines 4–6)

  • Current: while (!is.null(dev.list())) { dev.off(dev.list()["RStudioGD"]) } to clear the RStudio graphics device.
  • Issue: JupyterLab may use a different graphics device.
  • Request: What is the recommended way to clear or close plots in the R kernel under JupyterLab?
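
A device-independent alternative we could switch to is sketched below. It uses only base R (grDevices) and does not refer to RStudioGD by name; clear_plots is a hypothetical helper name. Whether any cleanup is needed at all in the R kernel, where plots may be rendered per cell, is something we would like confirmed.

```r
# Hypothetical helper: close every open graphics device, whatever it is called.
# graphics.off() does nothing when no device is open.
clear_plots <- function() {
  grDevices::graphics.off()
  # Report invisibly whether the device list is now empty.
  invisible(is.null(grDevices::dev.list()))
}

clear_plots()
```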

3. Selective environment cleanup

The script uses selective removal of objects to keep a clean workspace and separation of concerns (e.g. rm(list = ls()[!ls() %in% c("path", "ipak", "conn", "connectionDetails")]) after DB setup, and a helper stable_clean_environment() that keeps only a defined set of names).

Request: What is the recommended way in the R kernel under JupyterLab to achieve the same (remove all objects except a chosen set), or should we rely on restarting the kernel or a different workflow?
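
For reference, the pattern can be written as a small base R function, sketched below. keep_only is a hypothetical name (it is not the script's stable_clean_environment()), and the demonstration uses a scratch environment rather than the real workspace.

```r
# Hypothetical helper: remove every object in `envir` except the names in `keep`.
# Like the script's rm(list = ls()...) call, it leaves dot-prefixed objects alone.
keep_only <- function(keep, envir = globalenv()) {
  drop <- setdiff(ls(envir = envir), keep)
  rm(list = drop, envir = envir)
  invisible(drop)
}

# In the script this would be called as:
# keep_only(c("path", "ipak", "conn", "connectionDetails"))

# Demonstration in a scratch environment.
demo_env <- new.env()
assign("conn", "placeholder", envir = demo_env)
assign("tmp", 1:10, envir = demo_env)
keep_only("conn", envir = demo_env)
ls(demo_env)  # only "conn" remains
```

Nothing here is specific to RStudio, so we expect it to behave the same way in the R kernel; the open question is whether this or a kernel restart is the preferred workflow.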


4. File paths and output location

The script:

  • Creates and uses a subfolder outcomes/ for cached RDS files and figures.
  • Depends on a pre-existing file: outcomes/var_selection.rds (see script ~line 471). This file is read and must exist (or be generated) in the new environment.
  • Writes outputs such as:
    • outcomes/measurement_snapshot_72_after_before_batch_July2025.rds
    • outcomes/selected_var_imputed_data_july2025.rds
    • outcomes/cluster_results_1000_iter.rds
    • outcomes/observed_imputed_distribution.png
    • outcomes/cluster_results.png
    • outcomes/DunningZtest.png

Request: How does persistent storage and project-relative pathing work in the ML Workspace? Is there a project or home directory that persists across sessions, and how should we set the working directory so that paths like outcomes/... work consistently?


5. Database connectivity

The script connects to Azure PostgreSQL (OMOP CDM) using DatabaseConnector with host, user, password, and database name (script lines 31–44). Credentials are currently hardcoded.

Request:

  • Confirm that outbound connections to Azure PostgreSQL (or the provided OMOP database) are allowed from the JupyterLab/R Docker environment.
  • Recommend the supported way to provide credentials (e.g., environment variables, secrets service, or config file) so we can remove hardcoded credentials and comply with NIST/CHoRUS policies.
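
If environment variables turn out to be the supported route, the hardcoded block could be replaced along the lines sketched below. The variable names (OMOP_DB_HOST and so on) and read_db_settings are hypothetical. The DatabaseConnector call assumes the package's "host/database" server format for PostgreSQL, and a JDBC driver location may also have to be supplied, which ties in with the Java question in §1.

```r
# Hypothetical helper: read connection settings from environment variables and
# fail early, naming every variable that is missing.
read_db_settings <- function() {
  vars <- c(host = "OMOP_DB_HOST", database = "OMOP_DB_NAME",
            user = "OMOP_DB_USER", password = "OMOP_DB_PASSWORD")
  values <- Sys.getenv(vars, unset = NA)
  unset <- vars[is.na(values)]
  if (length(unset) > 0) {
    stop("Missing environment variables: ", paste(unset, collapse = ", "))
  }
  as.list(stats::setNames(values, names(vars)))
}

# Build the connection details only when DatabaseConnector is installed and
# the (hypothetical) variables are present.
if (requireNamespace("DatabaseConnector", quietly = TRUE) &&
    nzchar(Sys.getenv("OMOP_DB_HOST"))) {
  db <- read_db_settings()
  connectionDetails <- DatabaseConnector::createConnectionDetails(
    dbms = "postgresql",
    server = paste(db$host, db$database, sep = "/"),
    user = db$user,
    password = db$password
  )
}
```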

6. Cached / long-running steps

The script uses RDS files to cache heavy steps (measurements, prior conditions, imputation, cluster stability) so that re-runs do not repeat long queries or computations.

Request: Confirm that read/write access to a persistent directory (e.g. outcomes/) is supported and that such files persist across sessions (or that we can rely on a documented persistent volume/mount).
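
The caching pattern in question is roughly the one sketched below. cached_rds is a hypothetical helper name rather than a function from the script, and the demonstration writes under tempdir() instead of the real outcomes/ folder.

```r
# Hypothetical helper: return the cached result if the RDS file exists,
# otherwise run `compute`, save its result, and return it.
cached_rds <- function(file, compute, dir = "outcomes") {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  target <- file.path(dir, file)
  if (file.exists(target)) {
    return(readRDS(target))
  }
  result <- compute()
  saveRDS(result, target)
  result
}

# Demonstration: the second call reads the file and never runs its function.
demo_dir <- file.path(tempdir(), "outcomes")
first  <- cached_rds("demo.rds", function() sum(1:10), dir = demo_dir)
second <- cached_rds("demo.rds", function() stop("not re-run"), dir = demo_dir)
```

This only saves time if the directory survives a kernel restart and a new session, which is why the persistence guarantee matters to us.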


7. Progress bars

The script uses utils::txtProgressBar in batched DB steps (e.g. get_measurements_before_after_event_batched, get_prior_conditions_batched). In JupyterLab, text progress bars can behave differently (buffering, display).

Request: Do txtProgressBar (or other progress APIs) work as expected in the R kernel, or is there a recommended alternative for long-running steps?
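
If txtProgressBar does not render well, a fallback we could adopt is one log line per batch, as sketched below. run_batches is a hypothetical helper, not one of the script's batched functions, and it assumes that message() output is shown in the notebook while the cell is running.

```r
# Hypothetical helper: apply `fun` to each batch and log progress with
# message() instead of redrawing a text progress bar.
run_batches <- function(batches, fun, every = 1) {
  n <- length(batches)
  results <- vector("list", n)
  for (i in seq_len(n)) {
    results[[i]] <- fun(batches[[i]])
    if (i %% every == 0 || i == n) {
      message(sprintf("[%s] batch %d of %d done",
                      format(Sys.time(), "%H:%M:%S"), i, n))
    }
  }
  results
}

# Demonstration with ten ids split into batches of at most four.
ids <- 1:10
batches <- split(ids, ceiling(seq_along(ids) / 4))
out <- run_batches(batches, function(batch_ids) length(batch_ids))
```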


8. Long-running and resource-heavy steps

The script includes heavy operations (large batched DB queries, mice with m = 20, clusterboot with B = 1000).

Request: Are there limits on cell runtime or memory for the R kernel (e.g. timeouts, OOM), and is there a recommended way to run long jobs (e.g. chunking, running as a batch script, or increasing resources)?


9. Reproducibility and “no re-creation of work”

Request:

  • Any documented “migration checklist” or template for moving an RStudio R script into JupyterLab (e.g., replace rstudioapi with X, use Y for working directory, use Z for plots).
  • Confirmation that the same R version and key package versions can be fixed (e.g., via Docker image tag or a lockfile) so our analysis remains reproducible.
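
Until a pinning mechanism is confirmed, we could at least record the versions each run used, as sketched below. record_versions is a hypothetical helper built on base R and utils; it complements a Docker image tag or lockfile and does not replace one.

```r
# Hypothetical helper: record the R version and the versions of the listed
# packages that are installed; optionally write the table to a CSV file.
record_versions <- function(pkgs, file = NULL) {
  is_installed <- vapply(pkgs, function(p) nzchar(system.file(package = p)),
                         logical(1))
  installed <- pkgs[is_installed]
  pkg_versions <- vapply(installed,
                         function(p) as.character(utils::packageVersion(p)),
                         character(1), USE.NAMES = FALSE)
  versions <- data.frame(
    package = c("R", installed),
    version = c(as.character(getRversion()), pkg_versions),
    stringsAsFactors = FALSE
  )
  if (!is.null(file)) {
    utils::write.csv(versions, file, row.names = FALSE)
  }
  versions
}

# Packages that are not installed (possibly mice here) are simply skipped.
versions <- record_versions(c("stats", "utils", "mice"))
```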

10. External integrations (for tracking only)

We understand that integrations with VS Code, GitHub, or tools like Claude are not currently supported; we will log those as separate requests. This issue is focused only on running our existing R script inside the provided JupyterLab R environment.


Summary checklist for this issue

  • Install or allow installation of: rstudioapi, tidyverse, data.table, DatabaseConnector, caret, janitor, mice, fpc, FSA in the R Docker image.
  • Clarify whether users can install packages from CRAN from within the R session, or must rely on pre-installed packages.
  • Document or support working directory and script/notebook directory equivalent (replacement for rstudioapi::getActiveDocumentContext()$path).
  • Document recommended way to clear/display plots in R under JupyterLab (replace RStudioGD usage).
  • Document the recommended way to do selective environment cleanup in R under JupyterLab (remove all except a chosen set of objects), or recommend an alternative workflow (e.g. kernel restart).
  • Clarify persistent storage and project-relative paths for outcomes/ and other inputs/outputs.
  • Confirm Azure PostgreSQL (or OMOP DB) connectivity and credential management best practices.
  • Confirm Java/JDBC availability for DatabaseConnector if required.
  • Confirm progress bars (txtProgressBar or alternative) work as expected in the R kernel.
  • Clarify limits on cell runtime/memory and recommended approach for long-running jobs (chunking, batch script, or increased resources).
  • Provide or link to a migration guide for RStudio → JupyterLab R scripts.
