Support migrating existing R script to JupyterLab (R Docker): packages, paths, and DB connectivity #16

# GitHub Issue Log: Support migrating existing R script to JupyterLab (R Docker): packages, paths, and DB connectivity


---

## Context

We have a fully functional R script, developed in RStudio, for our analysis. We are asking for guidance on the recommended way to adapt this code to run in the new JupyterLab environment (R Docker image) without re-creating our work. A meeting with a data engineer has not yet been scheduled; once it is, we would like to align this issue with that discussion.

Reference script: [shakya_sepsis_july2025-Long.R](https://www.dropbox.com/scl/fi/n2izlsu9dddyaawc2al0d/shakya_sepsis_july2025-Long.R?rlkey=v5629knx84x0limvsvfvdthmm&st=3cdfnvof&dl=0). Credentials have been redacted from the shared copy. Line numbers in this issue refer to the original script; the shared copy may differ slightly.

---

## 1. R package availability

**Request:** Please confirm whether the following R packages can be installed (or are pre-installed) in the R Docker image used in the ML Workspace / JupyterLab:

| Package | Purpose in script |
|--------|--------------------|
| **rstudioapi** | Used only for script directory / working directory (see §2). |
| **tidyverse** | dplyr, tidyr, ggplot2, purrr, stringr, tibble, lubridate, etc. |
| **data.table** | `rbindlist` for combining batched query results. |
| **DatabaseConnector** | PostgreSQL connection and `querySql` for OMOP CDM. |
| **caret** | `nearZeroVar` for feature screening. |
| **janitor** | `make_clean_names` for variable names. |
| **mice** | Multiple imputation. |
| **fpc** | `clusterboot`, `kmeansCBI` for cluster stability. |
| **FSA** | `dunnTest` for post-hoc pairwise comparisons. |

**Note:** `DatabaseConnector` may require Java/JDBC drivers for PostgreSQL. Please confirm whether Java is available in the Docker image and whether any additional driver setup is required.

**Request (install from session):** Are users allowed to install packages from CRAN from within the R session (e.g. via `install.packages(..., repos = "http://cran.us.r-project.org")` for updates or extra packages), or must all packages be pre-installed in the image?
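To make the answer easy to verify on our side, a check along the following lines could be run in a notebook cell. It is only a sketch: it lists which of the packages from the table are missing from the image and whether a `java` executable is on the `PATH`. It does not confirm that the JDBC driver itself is configured.

```r
# Packages the script loads (taken from the table above).
required_pkgs <- c(
  "rstudioapi", "tidyverse", "data.table", "DatabaseConnector",
  "caret", "janitor", "mice", "fpc", "FSA"
)

# Compare against the packages visible in the current library paths.
installed <- rownames(installed.packages())
missing_pkgs <- setdiff(required_pkgs, installed)

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}

# Sys.which() returns an empty string when the executable is not found.
java_path <- Sys.which("java")
message("Java on PATH: ", if (nzchar(java_path)) java_path else "not found")

# If installs from the session are allowed, the missing ones could then be
# installed with install.packages(missing_pkgs, repos = "https://cloud.r-project.org").
```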

---

## 2. RStudio-specific code (adaptation for JupyterLab)

The script currently relies on RStudio in two places and will need equivalent patterns in JupyterLab:

### Working directory (lines 23–24)

- **Current:** `path <- dirname(rstudioapi::getActiveDocumentContext()$path)` and `setwd(path)`.
- **Issue:** `rstudioapi` only works inside a running RStudio session, so JupyterLab has no equivalent “active document path” for this call to return.
- **Request:** What is the recommended way to set the working directory in the R kernel (e.g., fixed project path, environment variable, or “notebook directory” equivalent) so we can replace this without breaking file paths?
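One replacement we could adopt, pending your recommendation, is sketched below. `PROJECT_DIR` is a hypothetical environment variable name, not something we know the workspace provides. The fallback assumes the kernel starts in a sensible directory (often the notebook's folder), which we have not confirmed for this image.

```r
# PROJECT_DIR is a hypothetical environment variable; fall back to the
# kernel's current working directory when it is not set.
project_dir <- Sys.getenv("PROJECT_DIR", unset = getwd())

# Fail early with a clear error if the directory does not exist.
stopifnot(dir.exists(project_dir))

setwd(project_dir)
path <- getwd()  # keeps the `path` object the rest of the script expects
message("Working directory: ", path)
```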

### Graphics device (lines 4–6)

- **Current:** `while (!is.null(dev.list())) { dev.off(dev.list()["RStudioGD"]) }` to clear the RStudio graphics device.
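- **Issue:** The `RStudioGD` device exists only in RStudio, so the R kernel in JupyterLab uses a different graphics device and this loop may fail when another device is open.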
- **Request:** What is the recommended way to clear or close plots in the R kernel under JupyterLab?
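A device-agnostic alternative from base R that we could use in the meantime is `graphics.off()`, which closes every open device regardless of its name. Whether it is needed at all under the R kernel is part of the question above.

```r
# Closes all open graphics devices; does nothing if none are open.
graphics.off()

# After the call no devices remain, so dev.list() returns NULL.
is.null(dev.list())
```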

---

## 3. Selective environment cleanup

The script uses selective removal of objects to keep a clean workspace and separation of concerns (e.g. `rm(list = ls()[!ls() %in% c("path", "ipak", "conn", "connectionDetails")])` after DB setup, and a helper `stable_clean_environment()` that keeps only a defined set of names).

**Request:** What is the recommended way in the R kernel under JupyterLab to achieve the same (remove all objects except a chosen set), or should we rely on restarting the kernel or a different workflow?
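For reference, the pattern we rely on can be written as a small helper in base R. The sketch below is not the script's own `stable_clean_environment()`; `keep_only` and `demo_env` are illustrative names, and the demo runs in a throwaway environment rather than the global one.

```r
# Remove every object in `envir` whose name is not listed in `keep`.
keep_only <- function(keep, envir = globalenv()) {
  to_remove <- setdiff(ls(envir = envir), keep)
  rm(list = to_remove, envir = envir)
  invisible(to_remove)  # names that were removed
}

# Demonstration in a scratch environment so nothing real is deleted.
demo_env <- new.env()
assign("conn", "connection placeholder", envir = demo_env)
assign("tmp_result", 1:10, envir = demo_env)

keep_only(c("path", "ipak", "conn", "connectionDetails"), envir = demo_env)
ls(demo_env)  # only "conn" is left
```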

---

## 4. File paths and output location

The script:

- Creates and uses a subfolder **`outcomes/`** for cached RDS files and figures.
- **Depends on a pre-existing file:** `outcomes/var_selection.rds` (see script ~line 471). This file is read and must exist (or be generated) in the new environment.
- Writes outputs such as:
  - `outcomes/measurement_snapshot_72_after_before_batch_July2025.rds`
  - `outcomes/selected_var_imputed_data_july2025.rds`
  - `outcomes/cluster_results_1000_iter.rds`
  - `outcomes/observed_imputed_distribution.png`
  - `outcomes/cluster_results.png`
  - `outcomes/DunningZtest.png`

**Request:** How do persistent storage and project-relative paths work in the ML Workspace? Is there a project or home directory that persists across sessions, and how should we set the working directory so that paths like `outcomes/...` work consistently?
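On our side, the path handling could be reduced to something like the following sketch, which builds every path from a single base directory and reports the missing input file up front. It assumes the working directory is already the project directory.

```r
# All outputs live under <working directory>/outcomes.
out_dir <- file.path(getwd(), "outcomes")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# The script reads this file, so check for it before any long step starts.
var_selection_path <- file.path(out_dir, "var_selection.rds")
if (!file.exists(var_selection_path)) {
  message("Missing input: ", var_selection_path,
          " (copy it into the workspace or regenerate it).")
}
```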

---

## 5. Database connectivity

The script connects to **Azure PostgreSQL** (OMOP CDM) using `DatabaseConnector` with host, user, password, and database name (script lines 31–44). Credentials are currently hardcoded.

**Request:**

- Confirm that outbound connections to Azure PostgreSQL (or the provided OMOP database) are allowed from the JupyterLab/R Docker environment.
- Recommend the supported way to provide credentials (e.g., environment variables, secrets service, or config file) so we can remove hardcoded credentials and comply with NIST/CHoRUS policies.
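If environment variables turn out to be the supported route, the connection setup might look like the sketch below. The variable names (`OMOP_DB_HOST`, `OMOP_DB_NAME`, `OMOP_DB_USER`, `OMOP_DB_PASSWORD`) are placeholders we made up, and `build_connection_details()` is only defined here, not called, because it needs `DatabaseConnector` and a reachable database.

```r
# Read one environment variable and stop with a clear error if it is unset.
get_required_env <- function(name) {
  value <- Sys.getenv(name, unset = NA_character_)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable not set: ", name)
  }
  value
}

# Hypothetical variable names; the real ones depend on the workspace setup.
build_connection_details <- function() {
  DatabaseConnector::createConnectionDetails(
    dbms     = "postgresql",
    # For PostgreSQL, DatabaseConnector expects "host/database" here.
    server   = paste(get_required_env("OMOP_DB_HOST"),
                     get_required_env("OMOP_DB_NAME"), sep = "/"),
    user     = get_required_env("OMOP_DB_USER"),
    password = get_required_env("OMOP_DB_PASSWORD")
  )
}
```

With this in place, the script would call `connectionDetails <- build_connection_details()` and no credentials would appear in the code.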

---

## 6. Cached / long-running steps

The script uses **RDS files** to cache heavy steps (measurements, prior conditions, imputation, cluster stability) so that re-runs do not repeat long queries or computations.

**Request:** Confirm that read/write access to a persistent directory (e.g. `outcomes/`) is supported and that such files persist across sessions (or that we can rely on a documented persistent volume/mount).
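The caching pattern in question is roughly the one sketched below: compute once, save to RDS, and read the file on later runs. `cache_rds` is an illustrative helper name rather than a function from the script, and the demo writes to a temporary directory instead of `outcomes/`.

```r
# Return the cached result if the file exists; otherwise compute and save it.
cache_rds <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  dir.create(dirname(path), showWarnings = FALSE, recursive = TRUE)
  saveRDS(result, path)
  result
}

# Demonstration with a cheap stand-in for a long-running step.
demo_path <- file.path(tempdir(), "cache_demo", "slow_step.rds")
unlink(demo_path)  # start from a clean state

first  <- cache_rds(demo_path, function() { Sys.sleep(0.1); sum(1:100) })
second <- cache_rds(demo_path, function() stop("not reached: result is cached"))
```

This only saves time if the directory survives between sessions, which is why the persistence question matters.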

---

## 7. Progress bars

The script uses `utils::txtProgressBar` in batched DB steps (e.g. `get_measurements_before_after_event_batched`, `get_prior_conditions_batched`). In JupyterLab, text progress bars can behave differently (buffering, display).

**Request:** Do `txtProgressBar` (or other progress APIs) work as expected in the R kernel, or is there a recommended alternative for long-running steps?
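A short test cell such as the sketch below could be used to see how the bar renders in the R kernel. The second half shows a fallback we could switch to if it renders poorly: one timestamped message per batch. `log_progress` and the batch count are illustrative, not taken from the script.

```r
n_batches <- 5  # hypothetical number of batches

# Part 1: the text progress bar as the script uses it.
pb <- utils::txtProgressBar(min = 0, max = n_batches, style = 3)
for (i in seq_len(n_batches)) {
  Sys.sleep(0.2)                   # stand-in for one batched query
  utils::setTxtProgressBar(pb, i)
  flush.console()                  # ask R to flush buffered console output
}
close(pb)

# Part 2: fallback that prints one line per batch via message().
log_progress <- function(i, n) {
  line <- sprintf("[%s] batch %d of %d done", format(Sys.time(), "%H:%M:%S"), i, n)
  message(line)
  invisible(line)
}
for (i in seq_len(n_batches)) {
  log_progress(i, n_batches)
}
```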

---

## 8. Long-running and resource-heavy steps

The script includes heavy operations (large batched DB queries, `mice` with `m = 20`, `clusterboot` with `B = 1000`).

**Request:** Are there limits on cell runtime or memory for the R kernel (e.g. timeouts, OOM), and is there a recommended way to run long jobs (e.g. chunking, running as a batch script, or increasing resources)?
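If chunking is the recommended approach, we would likely structure the batched steps so that an interrupted run can resume. The sketch below saves each batch to its own RDS file and skips batches that already have one. `run_in_batches` and `fake_fetch` are illustrative stand-ins for the script's batched query functions.

```r
# Process `ids` in batches, checkpointing each batch so a restart can resume.
run_in_batches <- function(ids, batch_size, fetch_batch, checkpoint_dir) {
  dir.create(checkpoint_dir, showWarnings = FALSE, recursive = TRUE)
  batches <- split(ids, ceiling(seq_along(ids) / batch_size))

  results <- lapply(seq_along(batches), function(i) {
    checkpoint <- file.path(checkpoint_dir, sprintf("batch_%04d.rds", i))
    if (file.exists(checkpoint)) {
      return(readRDS(checkpoint))  # finished in an earlier run
    }
    res <- fetch_batch(batches[[i]])
    saveRDS(res, checkpoint)
    res
  })

  do.call(rbind, results)
}

# Stand-in for a database query that returns one row per id.
fake_fetch <- function(batch_ids) {
  data.frame(person_id = batch_ids, value = batch_ids * 2)
}

combined <- run_in_batches(
  ids = 1:10, batch_size = 4, fetch_batch = fake_fetch,
  checkpoint_dir = file.path(tempdir(), "batch_checkpoints")
)
```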

---

## 9. Reproducibility and “no re-creation of work”

**Request:**

- Any documented “migration checklist” or template for moving an RStudio R script into JupyterLab (e.g., replace `rstudioapi` with X, use Y for working directory, use Z for plots).
- Confirmation that the same R version and key package versions can be fixed (e.g., via Docker image tag or a lockfile) so our analysis remains reproducible.
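Until versions are pinned by the image or a lockfile, we could at least record what each run used. The sketch below relies on base R only and writes the R version plus the versions of whichever script packages are installed to a CSV file. The output location is a temporary file for illustration, and a lockfile tool such as `renv` would be more complete.

```r
script_pkgs <- c(
  "rstudioapi", "tidyverse", "data.table", "DatabaseConnector",
  "caret", "janitor", "mice", "fpc", "FSA"
)

# Only query versions for packages that are actually installed.
present <- intersect(script_pkgs, rownames(installed.packages()))
pkg_versions <- vapply(
  present,
  function(p) as.character(utils::packageVersion(p)),
  character(1)
)

version_record <- data.frame(
  package = c("R", present),
  version = c(as.character(getRversion()), unname(pkg_versions)),
  stringsAsFactors = FALSE
)

# Illustrative location; in practice this would sit next to the outputs.
record_path <- file.path(tempdir(), "package_versions.csv")
write.csv(version_record, record_path, row.names = FALSE)
```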

---

## 10. External integrations (for tracking only)

We understand that integrations with VS Code, GitHub, or tools like Claude are not currently supported; we will log those as separate requests. This issue is focused only on running our existing R script inside the provided JupyterLab R environment.

---

## Summary checklist for this issue

- [ ] Install or allow installation of: **rstudioapi**, **tidyverse**, **data.table**, **DatabaseConnector**, **caret**, **janitor**, **mice**, **fpc**, **FSA** in the R Docker image.
- [ ] Clarify whether users can **install packages from CRAN** from within the R session, or must rely on pre-installed packages.
- [ ] Document or support a **working directory** or **script/notebook directory** equivalent (replacement for `rstudioapi::getActiveDocumentContext()$path`).
- [ ] Document the recommended way to **clear/display plots** in R under JupyterLab (replace `RStudioGD` usage).
- [ ] Document the recommended way to perform **selective environment cleanup** in R under JupyterLab (remove all except a chosen set of objects), or recommend an alternative workflow (e.g. kernel restart).
- [ ] Clarify **persistent storage** and **project-relative paths** for `outcomes/` and other inputs/outputs.
- [ ] Confirm **Azure PostgreSQL** (or OMOP DB) connectivity and **credential management** best practices.
- [ ] Confirm **Java/JDBC** availability for `DatabaseConnector` if required.
- [ ] Confirm **progress bars** (`txtProgressBar` or alternative) work as expected in the R kernel.
- [ ] Clarify **limits on cell runtime/memory** and recommended approach for **long-running jobs** (chunking, batch script, or increased resources).
- [ ] Provide or link to a **migration guide** for RStudio → JupyterLab R scripts.

