DaneshjouLab · gtcha2 · May 15, 2025 · May 10, 2025 · May 10, 2025 · May 10, 2025
diff --git a/requirements.txt b/requirements.txt
@@ -11,7 +11,7 @@ asyncer==0.0.8
 attrs==25.3.0
 backoff==2.2.1
 beautifulsoup4==4.13.4
-biopython==1.85
+bert-score>=0.3.13
 blis==1.3.0
 cachetools==5.5.2
 catalogue==2.0.10
@@ -28,7 +28,8 @@ dill==0.3.8
 diskcache==5.6.3
 distro==1.9.0
 dotenv==0.9.9
-dspy @ git+https://github.com/stanfordnlp/dspy.git@b7375196ba95215dc6237175490302fb8e5976df
+dspy @ git+https://github.com/stanfordnlp/dspy.git@ab340835d1a83f62fb18f86c5b17fdfb9d172713
+dspy-ai>=2.3.3
 en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl#sha256=1932429db727d4bff3deed6b34cfc05df17794f4a52eeb26cf8928f7c1a0fb85
 exceptiongroup==1.2.2
 fastapi==0.115.12
@@ -64,6 +65,7 @@ mdurl==0.1.2
 multidict==6.4.2
 multiprocess==0.70.16
 murmurhash==1.0.12
+networkx>=3.1
 numpy==2.0.2
 openai==1.61.0
 optuna==4.2.1
@@ -107,6 +109,7 @@ tiktoken==0.9.0
 tokenizers==0.21.1
 tomli==2.2.1
 tomlkit==0.13.2
+transformers>=4.37
 tqdm==4.67.1
 typer==0.15.2
 typing-inspection==0.4.0

diff --git a/requirements.txt.license b/requirements.txt.license
@@ -0,0 +1,5 @@
+This source file is part of the Daneshjou Lab projects
+
+SPDX-FileCopyrightText: 2024 Stanford University and the project authors (see AUTHORS.md)
+
+SPDX-License-Identifier: MIT
diff --git a/src/benchmark/Modular Benchmarking: Design Rules and Guidelines.md b/src/benchmark/Modular Benchmarking: Design Rules and Guidelines.md
@@ -0,0 +1,94 @@
+## Modular Benchmarking: Design Rules and Guidelines
+
+To support benchmarking across any task, data function, or model type, a modular benchmark must be built from interchangeable, well-scoped components. The following rules define how to structure such a benchmark to support flexibility, extensibility, and insight.
+
+---
+
+### Rule 1: Decompose into independent axes
+Separate all components of the system under test into **three disjoint axes**:
+
+- **Data function**: A parameterized function or generator that defines input-output pairs under specific sampling, augmentation, or preprocessing schemes.
+- **Model interface**: A callable or wrapped object that takes input and returns predictions, with a clear API across model types.
+- **Evaluation kernel**: A set of metrics or transformation functions that take predictions and targets and return scores, possibly per slice.
+
+Each axis must be independently swappable without requiring changes to the others.
+
+---
+
+### Rule 2: Treat data as a first-class function
+Data should not be loaded as static artifacts but built as **declarative recipes** with clearly scoped variability:
+
+- Input: Sampling strategy, transformation pipeline, split logic, perturbation level
+- Output: Deterministic or stochastic dataset objects conforming to a unified format
+- Example: `data_fn(task='translation', domain='medical', noise=0.1) -> Dataset`
+
+This ensures all experiments can vary data meaningfully and reproducibly.
+
+---
+
+### Rule 3: Enforce interface constraints
+Every component must conform to a simple, well-documented API:
+
+- **Model**: `predict(batch) -> outputs`
+- **Evaluator**: `evaluate(preds, targets, **kwargs) -> score_dict`
+- **Data function**: `data_config -> dataset object or iterator`
+
+Adapters should be used to wrap legacy models or datasets into these interfaces.
+
+---
+
+### Rule 4: Capture all variation as configuration
+All variable aspects (e.g., seed, data slice, model variant, metric set) must be represented as **configurable parameters**, not code changes. This allows:
+
+- Automatic sweep generation
+- Logging and reproducibility
+- Interpolation across experimental conditions
+
+Prefer structured configs (e.g., YAML, Pydantic) over hardcoded logic.
+
+---
+
+### Rule 5: Benchmark = Cartesian product of controlled variations
+A benchmark run is defined as the evaluation over a **Cartesian product** of selected variations from each axis:
+
+- `(model_i, data_j, eval_k) → score_ijk`
+
+This allows benchmarking to expose not just best performers but meaningful **interactions** between choices.
+
+---
+
+### Rule 6: Support hierarchical slicing and stratification
+To assess robustness and generalization, benchmarks must allow evaluation over:
+
+- Subpopulations (e.g., by label, length, domain, metadata)
+- Perturbations (e.g., noisy, adversarial, low-resource)
+- Shifts (e.g., domain, temporal, source-target pairs)
+
+Evaluators must optionally support slice-aware evaluation: `evaluate(preds, targets, slices=...)`.
+
+---
+
+### Rule 7: Log provenance at every level
+Every benchmark result must be traceable back to:
+
+- The exact model config
+- The data function and its parameters
+- The evaluation strategy and metric set
+- The random seed and runtime environment
+
+All runs should emit structured metadata with full parameter context (e.g., JSON, database, or hashable IDs).
+
+---
+
+### Rule 8: No single score without decomposition
+Avoid single-number metrics without exposing score components:
+
+- Report per-slice, per-class, or per-perturbation breakdowns
+- Visualize variance, not just central tendencies
+- Make performance *explainable*, not just measurable
+
+Benchmarks should yield diagnostic insight, not just rankings.
+
+---
+
+These rules make benchmarking **composable**, **transparent**, and **task-agnostic**, enabling research to scale across problem types while remaining grounded in meaningful comparisons.
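A minimal sketch of how Rules 1, 3, and 5 above could fit together, under stated assumptions: `EchoModel`, `UpperModel`, `data_fn`, `exact_match`, and `run_grid` are hypothetical names invented for illustration and are not part of this repository. The toy task (copying or upper-casing strings) only stands in for a real one.

```python
from itertools import product
from typing import Callable, Dict, List, Protocol, Sequence, Tuple


class Model(Protocol):
    """Rule 3 model interface: predict(batch) -> outputs."""

    def predict(self, batch: Sequence[str]) -> List[str]: ...


# Hypothetical unified dataset format: a list of (input, target) pairs.
Dataset = List[Tuple[str, str]]
Evaluator = Callable[[Sequence[str], Sequence[str]], Dict[str, float]]


class EchoModel:
    """Toy model that returns each input unchanged."""

    def predict(self, batch: Sequence[str]) -> List[str]:
        return list(batch)


class UpperModel:
    """Toy model that upper-cases each input."""

    def predict(self, batch: Sequence[str]) -> List[str]:
        return [x.upper() for x in batch]


def data_fn(uppercase_targets: bool = False) -> Dataset:
    """Rule 2: data built from parameters rather than loaded as a static file."""
    inputs = ["a", "b", "c"]
    return [(x, x.upper() if uppercase_targets else x) for x in inputs]


def exact_match(preds: Sequence[str], targets: Sequence[str]) -> Dict[str, float]:
    """Rule 3 evaluator interface: evaluate(preds, targets) -> score_dict."""
    hits = sum(p == t for p, t in zip(preds, targets))
    return {"exact_match": hits / len(targets)}


def run_grid(
    models: Dict[str, Model],
    datasets: Dict[str, Dataset],
    evaluators: Dict[str, Evaluator],
) -> Dict[Tuple[str, str, str], Dict[str, float]]:
    """Rule 5: score every (model, data, evaluator) combination."""
    scores = {}
    for (m_name, model), (d_name, data), (e_name, evaluate) in product(
        models.items(), datasets.items(), evaluators.items()
    ):
        inputs = [x for x, _ in data]
        targets = [y for _, y in data]
        scores[(m_name, d_name, e_name)] = evaluate(model.predict(inputs), targets)
    return scores


grid = run_grid(
    models={"echo": EchoModel(), "upper": UpperModel()},
    datasets={"plain": data_fn(False), "upper": data_fn(True)},
    evaluators={"em": exact_match},
)
for key, score in sorted(grid.items()):
    print(key, score)
```

Because each axis is passed in as a plain dictionary, adding a model, a data recipe, or a metric extends the grid without touching the other two axes.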
diff --git a/src/benchmark/README.md b/src/benchmark/README.md
@@ -1,94 +1,76 @@
-## Modular Benchmarking: Design Rules and Guidelines
+# 📊 Benchmarking Pipeline for Graph-Based Biomedical Case Reports
 
-To support benchmarking across any task, data function, or model type, a modular benchmark must be built from interchangeable, well-scoped components. The following rules define how to structure such a benchmark to support flexibility, extensibility, and insight.
+This repository contains a benchmarking framework for scoring and visualizing biomedical case reports represented as graphs, using NLP similarity metrics (e.g., `BERTScore`). It includes tools for:
 
----
-
-### Rule 1: Decompose into independent axes
-Separate all components of the system under test into **three disjoint axes**:
-
-- **Data function**: A parameterized function or generator that defines input-output pairs under specific sampling, augmentation, or preprocessing schemes.
-- **Model interface**: A callable or wrapped object that takes input and returns predictions, with a clear API across model types.
-- **Evaluation kernel**: A set of metrics or transformation functions that take predictions and targets and return scores, possibly per slice.
-
-Each axis must be independently swappable without requiring changes to the others.
-
----
-
-### Rule 2: Treat data as a first-class function
-Data should not be loaded as static artifacts but built as **declarative recipes** with clearly scoped variability:
-
-- Input: Sampling strategy, transformation pipeline, split logic, perturbation level
-- Output: Deterministic or stochastic dataset objects conforming to a unified format
-- Example: `data_fn(task='translation', domain='medical', noise=0.1) -> Dataset`
-
-This ensures all experiments can vary data meaningfully and reproducibly.
+- Computing similarity scores (e.g., `BERTScore`, `ROUGE`, `BLEU`)
+- Visualizing results (`F1` plots, `t-SNE`, topology distributions)
+- Summarizing metrics in tables
+- Preparing data for downstream evaluations
 
 ---
 
-### Rule 3: Enforce interface constraints
-Every component must conform to a simple, well-documented API:
-
-- **Model**: `predict(batch) -> outputs`
-- **Evaluator**: `evaluate(preds, targets, **kwargs) -> score_dict`
-- **Data function**: `data_config -> dataset object or iterator`
+## 🔧 Setup Instructions
 
-Adapters should be used to wrap legacy models or datasets into these interfaces.
+To set up the environment and run the benchmark scripts, use the provided shell script:
 
----
-
-### Rule 4: Capture all variation as configuration
-All variable aspects (e.g., seed, data slice, model variant, metric set) must be represented as **configurable parameters**, not code changes. This allows:
-
-- Automatic sweep generation
-- Logging and reproducibility
-- Interpolation across experimental conditions
+```bash
+bash setup_and_run.sh
+```
 
-Prefer structured configs (e.g., YAML, Pydantic) over hardcoded logic.
+This script:
+- Creates a Python 3.11 virtual environment
+- Installs dependencies from `requirements.txt`
+- Runs the benchmark modules as well as the visualization module to generate summary plots
 
 ---
 
-### Rule 5: Benchmark = Cartesian product of controlled variations
-A benchmark run is defined as the evaluation over a **Cartesian product** of selected variations from each axis:
-
-- `(model_i, data_j, eval_k) → score_ijk`
-
-This allows benchmarking to expose not just best performers but meaningful **interactions** between choices.
+## 🗂 Directory Structure
+
+```
+src/
+└── benchmark/
+    │
+    ├── batch_run.py                 # Main script to execute benchmarking
+    ├── generate_visuals.py          # Generates plots and summary tables
+    ├── requirements.txt             # Python dependencies
+    ├── setup_and_run.sh             # Script to set up the environment and run visuals
+    ├── output/
+    │   ├── results/                 # Directory where results are saved
+    │   └── plots/                   # Directory where plots are saved
+    └── modules/
+        ├── __init__.py              # Marks the directory as a Python package
+        ├── config.py                # Contains configuration variables and constants for the benchmarking pipeline
+        ├── embedding.py             # Functions for computing and managing graph or text embeddings
+        ├── evalution.py             # Evaluation metrics (e.g., BERTScore, ROUGE, BLEU) and utility functions
+        ├── io_utils.py              # File I/O helper functions (e.g., loading graphs, saving outputs)
+        ├── logging_utils.py         # Configures logging for tracking experiment progress and debugging
+        ├── reconstruciton.py        # Graph or text reconstruction methods from processed data
+        ├── regex_utils.py           # Utility functions for pattern matching and extraction using regular expressions
+        ├── run_benchmark.py         # Orchestrates the full benchmarking pipeline
+        └── visualization.py         # Plotting functions for visual summaries and metric reporting
+```
 
 ---
 
-### Rule 6: Support hierarchical slicing and stratification
-To assess robustness and generalization, benchmarks must allow evaluation over:
+## 📉 Outputs
 
-- Subpopulations (e.g., by label, length, domain, metadata)
-- Perturbations (e.g., noisy, adversarial, low-resource)
-- Shifts (e.g., domain, temporal, source-target pairs)
-
-Evaluators must optionally support slice-aware evaluation: `evaluate(preds, targets, slices=...)`.
+- `output/plots/bertscore_f1_barplot.png` – Bar chart of BERTScore F1 scores
+- `output/plots/trajectory_tsne.png` – 2D t-SNE embedding of graph trajectories
+- `output/plots/topology_distributions.png` – Histograms of node and edge counts
+- `bertscore_stats.csv` – Summary statistics of BERTScore F1 (optional export)
 
 ---
 
-### Rule 7: Log provenance at every level
-Every benchmark result must be traceable back to:
+## 📊 Example Metrics
 
-- The exact model config
-- The data function and its parameters
-- The evaluation strategy and metric set
-- The random seed and runtime environment
+If you're using string similarity tools, results may include:
 
-All runs should emit structured metadata with full parameter context (e.g., JSON, database, or hashable IDs).
+- `ROUGE-1`, `ROUGE-2`, `ROUGE-L`
+- `BLEU` score
+- Mean, median, standard deviation, and percentiles of F1 scores
 
 ---
 
-### Rule 8: No single score without decomposition
-Avoid single-number metrics without exposing score components:
-
-- Report per-slice, per-class, or per-perturbation breakdowns
-- Visualize variance, not just central tendencies
-- Make performance *explainable*, not just measurable
-
-Benchmarks should yield diagnostic insight, not just rankings.
-
----
+## 📬 Contact
 
-These rules make benchmarking **composable**, **transparent**, and **task-agnostic**, enabling research to scale across problem types while remaining grounded in meaningful comparisons.
+For questions or contributions, please contact the maintainers of the [Daneshjou Lab](https://daneshjoulab.stanford.edu).
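The summary statistics listed under "Example Metrics" in the README can be sketched with the Python standard library alone. This is an illustrative assumption, not the repository's implementation: `summarize_f1` is a hypothetical helper, and the scores below are made-up placeholder values rather than benchmark results.

```python
import statistics
from typing import Dict, Sequence


def summarize_f1(scores: Sequence[float]) -> Dict[str, float]:
    """Return mean, median, sample standard deviation, and quartiles."""
    ordered = sorted(scores)
    # quantiles(n=4) returns the three cut points: 25th, 50th, 75th percentile.
    p25, _, p75 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "std": statistics.stdev(ordered),
        "p25": p25,
        "p75": p75,
    }


# Placeholder values for illustration only.
example_scores = [0.80, 0.85, 0.90, 0.95]
summary = summarize_f1(example_scores)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Both `statistics.stdev` and `statistics.quantiles` need at least two data points, so a caller summarizing a single graph would have to handle that case separately.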
diff --git a/src/benchmark/batch_run.py b/src/benchmark/batch_run.py
@@ -0,0 +1,81 @@
+# This source file is part of the Daneshjou Lab projects
+#
+# SPDX-FileCopyrightText: 2025 Stanford University and the project authors (see AUTHORS.md)
+#
+# SPDX-License-Identifier: MIT
+#
+
+"""
+Run the full benchmarking pipeline on a batch of graph files in JSON format.
+Each graph is benchmarked for information fidelity, topology structure,
+and trajectory representation.
+"""
+
+from pathlib import Path
+
+# Local application imports
+from modules.run_benchmark import run_pipeline
+from modules.logging_utils import setup_logger
+from modules.io_utils import (
+    load_graph_from_file,
+    save_results,
+    build_graph_to_text_mapping,
+    extract_case_presentation_from_file,
+)
+
+logger = setup_logger(__name__)
+
+
+# Input directory: All graph files
+GRAPH_INPUT_DIR = Path(__file__).resolve().parents[2] / "webapp/static/graphs/"
+
+# Output directory: Results
+RESULTS_OUTPUT_DIR = Path("output/results/")
+RESULTS_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
+METRIC_SUMMARY_OUTPUT_DIR = Path("output/results/plots/metric_summary.csv")
+
+
+# Paths to the graph metadata CSV and the source HTML files
+METADATA_CSV_PATH = Path(__file__).resolve().parents[2] / "webapp/static/graphs/mapping/graph_metadata.csv"
+HTML_ROOT_DIR = Path(__file__).resolve().parents[2] / "webapp/static/pmc_htmls"
+
+graph_to_html = build_graph_to_text_mapping(METADATA_CSV_PATH, HTML_ROOT_DIR)
+
+if __name__ == "__main__":
+    graph_files = list(GRAPH_INPUT_DIR.glob("*.json"))
+
+    if not graph_files:
+        logger.warning("No JSON files found in input directory.")
+    else:
+        for fpath in graph_files:
+            graph_id = fpath.stem
+            html_path = graph_to_html.get(graph_id)
+
+            if not html_path:
+                logger.warning("No HTML path found for %s", graph_id)
+                continue
+
+            reference_case_text = extract_case_presentation_from_file(str(html_path))
+
+            logger.info("Running pipeline on: %s", fpath.name)
+
+            graph, ok = load_graph_from_file(fpath)
+            if not ok:
+                logger.error("Skipping %s due to load error.", fpath.name)
+                continue
+
+            cfg = {
+                "reconstruct_params": {"include_nodes": True, "include_edges": True},
+                "bertscore": True,
+                "string_similarity": True,
+                "topology": True,
+                "trajectory_embedding": True,
+            }
+
+            results = run_pipeline(graph, reference_case_text, cfg)
+
+            output_path = RESULTS_OUTPUT_DIR / f"results_{graph_id}.json"
+            save_results(results, output_path)
+
+
+        logger.info("Batch benchmarking completed.")
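One way the `cfg` dictionary in `batch_run.py` could be tied to the provenance rule in the design guidelines is to derive a stable identifier from it and store that alongside each result. The sketch below is an assumption for illustration: `config_id` is a hypothetical helper, and nothing in this diff shows `run_pipeline` or `save_results` recording such an ID.

```python
import hashlib
import json


def config_id(cfg: dict) -> str:
    """Hash a JSON-serializable config into a short, order-independent ID."""
    # sort_keys makes the serialization independent of key insertion order.
    canonical = json.dumps(cfg, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Same keys and values as the cfg built in batch_run.py.
cfg = {
    "reconstruct_params": {"include_nodes": True, "include_edges": True},
    "bertscore": True,
    "string_similarity": True,
    "topology": True,
    "trajectory_embedding": True,
}

run_id = config_id(cfg)
print(run_id)
```

Two runs with the same settings then share an ID even if the keys were written in a different order, while any changed value yields a different one.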