Merged
7 changes: 5 additions & 2 deletions requirements.txt
@@ -11,7 +11,7 @@ asyncer==0.0.8
attrs==25.3.0
backoff==2.2.1
beautifulsoup4==4.13.4
-biopython==1.85
+bert-score>=0.3.13
blis==1.3.0
cachetools==5.5.2
catalogue==2.0.10
@@ -28,7 +28,8 @@ dill==0.3.8
diskcache==5.6.3
distro==1.9.0
dotenv==0.9.9
-dspy @ git+https://github.com/stanfordnlp/dspy.git@b7375196ba95215dc6237175490302fb8e5976df
+dspy @ git+https://github.com/stanfordnlp/dspy.git@ab340835d1a83f62fb18f86c5b17fdfb9d172713
+dspy-ai>=2.3.3
en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl#sha256=1932429db727d4bff3deed6b34cfc05df17794f4a52eeb26cf8928f7c1a0fb85
exceptiongroup==1.2.2
fastapi==0.115.12
@@ -64,6 +65,7 @@ mdurl==0.1.2
multidict==6.4.2
multiprocess==0.70.16
murmurhash==1.0.12
+networkx>=3.1
numpy==2.0.2
openai==1.61.0
optuna==4.2.1
@@ -107,6 +109,7 @@ tiktoken==0.9.0
tokenizers==0.21.1
tomli==2.2.1
tomlkit==0.13.2
+transformers>=4.37
tqdm==4.67.1
typer==0.15.2
typing-inspection==0.4.0
5 changes: 5 additions & 0 deletions requirements.txt.license
@@ -0,0 +1,5 @@
This source file is part of the Daneshjou Lab projects

SPDX-FileCopyrightText: 2024 Stanford University and the project authors (see AUTHORS.md)

SPDX-License-Identifier: MIT
94 changes: 94 additions & 0 deletions src/benchmark/Modular Benchmarking: Design Rules and Guidelines.md
@@ -0,0 +1,94 @@
## Modular Benchmarking: Design Rules and Guidelines

To support benchmarking across any task, data function, or model type, a modular benchmark must be built from interchangeable, well-scoped components. The following rules define how to structure such a benchmark to support flexibility, extensibility, and insight.

---

### Rule 1: Decompose into independent axes
Separate all components of the system under test into **three disjoint axes**:

- **Data function**: A parameterized function or generator that defines input-output pairs under specific sampling, augmentation, or preprocessing schemes.
- **Model interface**: A callable or wrapped object that takes input and returns predictions, with a clear API across model types.
- **Evaluation kernel**: A set of metrics or transformation functions that take predictions and targets and return scores, possibly per slice.

Each axis must be independently swappable without requiring changes to the others.

---

### Rule 2: Treat data as a first-class function
Data should not be loaded as static artifacts but built as **declarative recipes** with clearly scoped variability:

- Input: Sampling strategy, transformation pipeline, split logic, perturbation level
- Output: Deterministic or stochastic dataset objects conforming to a unified format
- Example: `data_fn(task='translation', domain='medical', noise=0.1) -> Dataset`

This ensures all experiments can vary data meaningfully and reproducibly.
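
As a rough illustration of this rule, the sketch below builds a toy dataset from a frozen recipe object. The `DataConfig` fields, the `seed` and `n_examples` parameters, and the token-dropping perturbation are all assumptions made for the example; only the `data_fn(task=..., domain=..., noise=...)` shape comes from the text above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DataConfig:
    """Hypothetical recipe: every source of variability is an explicit field."""
    task: str
    domain: str
    noise: float = 0.0   # probability of perturbing each input
    seed: int = 0        # fixes the random stream so the recipe is reproducible
    n_examples: int = 4


def data_fn(config: DataConfig) -> list[dict]:
    """Build a toy dataset deterministically from the recipe."""
    rng = random.Random(config.seed)
    examples = []
    for i in range(config.n_examples):
        text = f"{config.domain} {config.task} example {i}"
        # Illustrative perturbation: drop the last token with probability `noise`.
        if rng.random() < config.noise:
            text = " ".join(text.split()[:-1])
        examples.append({"input": text, "target": f"target {i}"})
    return examples
```

Because the recipe is immutable and carries its own seed, calling `data_fn` twice with the same `DataConfig` should return identical datasets, which is the reproducibility property the rule asks for.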

---

### Rule 3: Enforce interface constraints
Every component must conform to a simple, well-documented API:

- **Model**: `predict(batch) -> outputs`
- **Evaluator**: `evaluate(preds, targets, **kwargs) -> score_dict`
- **Data function**: `data_config -> dataset object or iterator`

Adapters should be used to wrap legacy models or datasets into these interfaces.
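
One possible way to express these interfaces in Python is with `typing.Protocol`, shown in the sketch below. The class names `CallableModelAdapter` and `ExactMatchEvaluator` are hypothetical and exist only to illustrate the adapter idea; they are not part of this repository.

```python
from typing import Any, Callable, Protocol, Sequence


class Model(Protocol):
    def predict(self, batch: Sequence[Any]) -> Sequence[Any]: ...


class Evaluator(Protocol):
    def evaluate(self, preds: Sequence[Any], targets: Sequence[Any], **kwargs: Any) -> dict[str, float]: ...


class CallableModelAdapter:
    """Hypothetical adapter: wraps a legacy per-example function as a Model."""

    def __init__(self, fn: Callable[[Any], Any]) -> None:
        self._fn = fn

    def predict(self, batch: Sequence[Any]) -> list[Any]:
        return [self._fn(x) for x in batch]


class ExactMatchEvaluator:
    """Hypothetical evaluator returning a score dict, as the interface requires."""

    def evaluate(self, preds: Sequence[Any], targets: Sequence[Any], **kwargs: Any) -> dict[str, float]:
        if len(preds) != len(targets):
            raise ValueError("preds and targets must have the same length")
        correct = sum(p == t for p, t in zip(preds, targets))
        return {"exact_match": correct / len(targets) if targets else 0.0}
```

Structural typing means a wrapped legacy model satisfies `Model` without inheriting from it, which keeps adapters thin.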

---

### Rule 4: Capture all variation as configuration
All variable aspects (e.g., seed, data slice, model variant, metric set) must be represented as **configurable parameters**, not code changes. This allows:

- Automatic sweep generation
- Logging and reproducibility
- Interpolation across experimental conditions

Prefer structured configs (e.g., YAML, Pydantic) over hardcoded logic.
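
A minimal sketch of configuration-driven sweeps, using only the standard library rather than YAML or Pydantic. The `RunConfig` fields and the `expand_sweep` helper are assumptions chosen to mirror the examples listed above (seed, data slice, model variant, metric set).

```python
import itertools
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical structured config: one field per variable aspect."""
    seed: int
    data_slice: str
    model_variant: str
    metrics: tuple[str, ...] = ("exact_match",)


def expand_sweep(grid: dict[str, list]) -> list[RunConfig]:
    """Expand a parameter grid into one RunConfig per combination."""
    keys = list(grid)
    combos = itertools.product(*(grid[k] for k in keys))
    return [RunConfig(**dict(zip(keys, values))) for values in combos]


grid = {"seed": [0, 1], "data_slice": ["all"], "model_variant": ["small", "large"]}
configs = expand_sweep(grid)
# asdict(configs[0]) yields a plain dict that can be logged alongside the results.
```

Since each run is a value rather than a code path, the same objects can drive sweep generation and logging.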

---

### Rule 5: Benchmark = Cartesian product of controlled variations
A benchmark run is defined as the evaluation over a **Cartesian product** of selected variations from each axis:

- `(model_i, data_j, eval_k) → score_ijk`

This allows benchmarking to expose not just best performers but meaningful **interactions** between choices.
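
The product can be sketched with `itertools.product`, as below. The toy models, datasets, and the `exact_match` metric are placeholders invented for this example; the point is only the `(model_i, data_j, eval_k)` indexing of the score table.

```python
import itertools


def exact_match(preds, targets):
    """Toy metric: fraction of predictions equal to their targets."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


def run_benchmark(models, datasets, evaluators):
    """Evaluate every (model, dataset, evaluator) combination."""
    scores = {}
    for (m_name, predict), (d_name, data), (e_name, evaluate) in itertools.product(
        models.items(), datasets.items(), evaluators.items()
    ):
        inputs = [ex["input"] for ex in data]
        targets = [ex["target"] for ex in data]
        scores[(m_name, d_name, e_name)] = evaluate(predict(inputs), targets)
    return scores


models = {
    "echo": lambda batch: list(batch),
    "upper": lambda batch: [x.upper() for x in batch],
}
datasets = {
    "lowercase_targets": [{"input": "a", "target": "a"}],
    "uppercase_targets": [{"input": "a", "target": "A"}],
}
evaluators = {"exact": exact_match}
scores = run_benchmark(models, datasets, evaluators)
```

In this toy table neither model wins everywhere: each one scores perfectly on one dataset and fails on the other, which is the kind of interaction a single leaderboard number would hide.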

---

### Rule 6: Support hierarchical slicing and stratification
To assess robustness and generalization, benchmarks must allow evaluation over:

- Subpopulations (e.g., by label, length, domain, metadata)
- Perturbations (e.g., noisy, adversarial, low-resource)
- Shifts (e.g., domain, temporal, source-target pairs)

Evaluators must optionally support slice-aware evaluation: `evaluate(preds, targets, slices=...)`.
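
A slice-aware evaluator might look like the following sketch. It assumes `slices` is a sequence of labels aligned with the predictions, and it uses exact match as a stand-in metric; both choices are illustrative rather than prescribed by the rule.

```python
from collections import defaultdict


def evaluate(preds, targets, slices=None):
    """Return an overall score plus one score per slice label, if slices are given."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    hits = [float(p == t) for p, t in zip(preds, targets)]
    result = {"overall": sum(hits) / len(hits) if hits else 0.0}

    if slices is not None:
        if len(slices) != len(hits):
            raise ValueError("slices must align one-to-one with preds")
        by_slice = defaultdict(list)
        for label, hit in zip(slices, hits):
            by_slice[label].append(hit)
        for label, values in by_slice.items():
            result[f"slice/{label}"] = sum(values) / len(values)
    return result
```

Keeping `slices` optional means the same evaluator works for callers that only need the aggregate score.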

---

### Rule 7: Log provenance at every level
Every benchmark result must be traceable back to:

- The exact model config
- The data function and its parameters
- The evaluation strategy and metric set
- The random seed and runtime environment

All runs should emit structured metadata with full parameter context (e.g., JSON, database, or hashable IDs).
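
One way to produce hashable IDs is to hash a canonical JSON serialization of the run parameters, as sketched here. The record layout and the helper names `run_id` and `provenance_record` are assumptions for illustration.

```python
import hashlib
import json
import platform
import sys


def run_id(params: dict) -> str:
    """Derive a short, order-independent ID from the run parameters."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def provenance_record(model_cfg: dict, data_cfg: dict, eval_cfg: dict, seed: int) -> dict:
    """Bundle everything needed to trace a result back to its inputs."""
    params = {"model": model_cfg, "data": data_cfg, "eval": eval_cfg, "seed": seed}
    return {
        "run_id": run_id(params),
        "params": params,
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
    }
```

Sorting keys before hashing makes the ID depend on the parameter values only, not on the order in which they were written.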

---

### Rule 8: No single score without decomposition
Avoid single-number metrics without exposing score components:

- Report per-slice, per-class, or per-perturbation breakdowns
- Visualize variance, not just central tendencies
- Make performance *explainable*, not just measurable

Benchmarks should yield diagnostic insight, not just rankings.
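
A decomposed report could be as simple as the sketch below, which summarizes per-slice scores with spread as well as the mean. The `summarize` helper and its output fields are hypothetical, and it assumes every slice contains at least one score.

```python
import statistics


def summarize(scores_by_slice: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Report size, mean, spread, and range for each slice (slices must be non-empty)."""
    report = {}
    for name, scores in scores_by_slice.items():
        report[name] = {
            "n": len(scores),
            "mean": statistics.fmean(scores),
            # Sample standard deviation needs at least two points.
            "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
            "min": min(scores),
            "max": max(scores),
        }
    return report
```

Two slices with the same mean can have very different spreads, so reporting `stdev`, `min`, and `max` next to the mean keeps that difference visible.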

---

These rules make benchmarking **composable**, **transparent**, and **task-agnostic**, enabling research to scale across problem types while remaining grounded in meaningful comparisons.
122 changes: 52 additions & 70 deletions src/benchmark/README.md
@@ -1,94 +1,76 @@
# 📊 Benchmarking Pipeline for Graph-Based Biomedical Case Reports

This repository contains a benchmarking framework for analyzing and visualizing the performance of NLP models (e.g., `BERTScore`) on biomedical case reports represented as graphs. It includes tools for:

- Computing similarity scores (e.g., `BERTScore`, `ROUGE`, `BLEU`)
- Visualizing results (`F1` plots, `t-SNE`, topology distributions)
- Summarizing metrics in tables
- Preparing data for downstream evaluations

---

## 🔧 Setup Instructions

To set up the environment and run the benchmark scripts, use the provided shell script:

```bash
bash setup_and_run.sh
```

This script:
- Creates a Python 3.11 virtual environment
- Installs dependencies from `requirements.txt`
- Runs the benchmark modules as well as the visualization module to generate summary plots

---

## 🗂 Directory Structure

```
src/
└── benchmark/
    ├── batch_run.py              # Main script to execute benchmarking
    ├── generate_visuals.py       # Generates plots and summary tables
    ├── requirements.txt          # Python dependencies
    ├── setup_and_run.sh          # Script to setup the environment and run visuals
    ├── output/
    │   ├── results/              # Directory where results are saved
    │   └── plots/                # Directory where plots are saved
    └── modules/
        ├── __init__.py           # Marks the directory as a Python package
        ├── config.py             # Contains configuration variables and constants for the benchmarking pipeline
        ├── embedding.py          # Functions for computing and managing graph or text embeddings
        ├── evalution.py          # Evaluation metrics (e.g., BERTScore, ROUGE, BLEU) and utility functions
        ├── io_utils.py           # File I/O helper functions (e.g., loading graphs, saving outputs)
        ├── logging_utils.py      # Configures logging for tracking experiment progress and debugging
        ├── reconstruciton.py     # Graph or text reconstruction methods from processed data
        ├── regex_utils.py        # Utility functions for pattern matching and extraction using regular expressions
        ├── run_benchmark.py      # Orchestrates the full benchmarking pipeline
        └── visualization.py      # Plotting functions for visual summaries and metric reporting
```

---

## 📉 Outputs

- `output/plots/bertscore_f1_barplot.png` – Bar chart of BERTScore F1 scores
- `output/plots/trajectory_tsne.png` – 2D t-SNE embedding of graph trajectories
- `output/plots/topology_distributions.png` – Histograms of node and edge counts
- `bertscore_stats.csv` – Summary statistics of BERTScore F1 (optional export)

---

## 📊 Example Metrics

If you're using string similarity tools, results may include:

- `ROUGE-1`, `ROUGE-2`, `ROUGE-L`
- `BLEU` score
- Mean, median, std deviation, and percentiles of F1 scores

---

## 📬 Contact

For questions or contributions, please contact the maintainers of the [Daneshjou Lab](https://daneshjoulab.stanford.edu).

Check failure on line 76 in src/benchmark/README.md

GitHub Actions / Linkspector

[linkspector] src/benchmark/README.md#L76

Cannot reach https://daneshjoulab.stanford.edu Status: null net::ERR_NAME_NOT_RESOLVED at https://daneshjoulab.stanford.edu
Raw output
message:"Cannot reach https://daneshjoulab.stanford.edu Status: null net::ERR_NAME_NOT_RESOLVED at https://daneshjoulab.stanford.edu" location:{path:"src/benchmark/README.md" range:{start:{line:76 column:71} end:{line:76 column:121}}} severity:ERROR source:{name:"linkspector" url:"https://github.com/UmbrellaDocs/linkspector"}
81 changes: 81 additions & 0 deletions src/benchmark/batch_run.py
@@ -0,0 +1,81 @@
# This source file is part of the Daneshjou Lab projects
#
# SPDX-FileCopyrightText: 2025 Stanford University and the project authors (see AUTHORS.md)
#
# SPDX-License-Identifier: MIT
#

"""
Run the full benchmarking pipeline on a batch of graph files in JSON format.
Each graph is benchmarked for information fidelity, topology structure,
and trajectory representation.
"""

from pathlib import Path

# Local application imports
from modules.run_benchmark import run_pipeline
from modules.logging_utils import setup_logger
from modules.io_utils import (
    load_graph_from_file,
    save_results,
    build_graph_to_text_mapping,
    extract_case_presentation_from_file,
)

logger = setup_logger(__name__)


# Input directory: All graph files
GRAPH_INPUT_DIR = Path(__file__).resolve().parents[2] / "webapp/static/graphs/"

# Output directory: Results
RESULTS_OUTPUT_DIR = Path("output/results/")
RESULTS_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
METRIC_SUMMARY_OUTPUT_DIR = Path("output/results/plots/metric_summary.csv")


# Paths
METADATA_CSV_PATH = Path(__file__).resolve().parents[2] / "webapp/static/graphs/mapping/graph_metadata.csv"
HTML_ROOT_DIR = Path(__file__).resolve().parents[2] / "webapp/static/pmc_htmls"

graph_to_html = build_graph_to_text_mapping(METADATA_CSV_PATH, HTML_ROOT_DIR)

if __name__ == "__main__":
    graph_files = list(GRAPH_INPUT_DIR.glob("*.json"))

    if not graph_files:
        logger.warning("No JSON files found in input directory.")
    else:
        for fpath in graph_files:
            graph_id = fpath.stem
            html_path = graph_to_html.get(graph_id)

            if not html_path:
                logger.warning("No HTML path found for %s", graph_id)
                continue

            reference_case_text = extract_case_presentation_from_file(str(html_path))

            logger.info("Running pipeline on: %s", fpath.name)

            graph, ok = load_graph_from_file(fpath)
            if not ok:
                logger.error("Skipping %s due to load error.", fpath.name)
                continue

            cfg = {
                "reconstruct_params": {"include_nodes": True, "include_edges": True},
                "bertscore": True,
                "string_similarity": True,
                "topology": True,
                "trajectory_embedding": True,
            }

            results = run_pipeline(graph, reference_case_text, cfg)

            output_path = RESULTS_OUTPUT_DIR / f"results_{graph_id}.json"
            save_results(results, output_path)

        logger.info("Batch benchmarking completed.")