A standalone, configurable pipeline for running the Foundation Model FloorBenchmark on different data samples.
This pipeline allows you to:
- Run benchmarks on different samples (whole sample, stratified sample, or custom samples)
- Configure models and prompts through a YAML configuration file
- Parallelize execution for faster processing
- Automatically analyze results and generate performance metrics
- Track runs and compare different configurations
FloorBenchmark_Pipeline/
├── config/
│ └── pipeline_config.yaml # Main configuration file
├── scripts/
│ └── analyze_results.r # Analysis script
├── prompts/ # Prompt JSON files (copy from main project)
├── data/
│ ├── inputs/ # Symlinks to data samples
│ └── outputs/ # Pipeline results
├── results/ # Analysis results and summaries
├── logs/ # Execution logs
├── run_pipeline.r # Main pipeline runner
├── .env # API keys (copy from main project)
└── README.md # This file
From inside FloorBenchmark_Pipeline/, copy or symlink the following from the main project:
# Copy prompts
cp -r ../prompts/FloorBenchmark_Prompts/* prompts/
# Copy or create symlink to .env file
cp ../.env .
# OR
ln -s ../.env .env
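# Create symlinks to data directories.
# A relative symlink target is resolved from the directory that contains the link,
# so absolute targets are used here to avoid dangling links.
ln -s "$(cd .. && pwd)/data/inputs/TalkMoves/all_transcripts" data/inputs/whole_sample
ln -s "$(cd .. && pwd)/data/inputs/stratified_samples/RepresentativeSample_Oversampled_Chunks" data/inputs/stratified_sample

Install the required R packages:
install.packages(c(
  "dplyr", "readr", "stringr", "purrr", "jsonlite",
  "fs", "httr", "furrr", "yaml", "irr", "tidyr"
))

Edit config/pipeline_config.yaml: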
# Choose your sample
sample_type: "stratified_sample" # or "whole_sample" or "custom"
# Set run name
run_name: "FloorBenchmark_MyRun"
# Select models
models:
- "anthropic.claude-4.5-opus"
- "openai.gpt-5.1"
# ... add more models
# Enable test mode for quick testing
test_mode: false
test_subset_size: 10

Run the pipeline with the default configuration:
Rscript run_pipeline.r

Or pass a custom configuration file:
Rscript run_pipeline.r config/my_custom_config.yaml

For quick testing with a small subset:
- Edit config/pipeline_config.yaml:
  test_mode: true
  test_subset_size: 10
- Run:
  Rscript run_pipeline.r

To analyze existing results without running the pipeline:
Rscript scripts/analyze_results.r

sample_type: "stratified_sample" # Options: whole_sample, stratified_sample, custom
sample_paths:
whole_sample: "../data/inputs/TalkMoves/all_transcripts"
stratified_sample: "../data/inputs/stratified_samples/RepresentativeSample_Oversampled_Chunks"
# For custom samples
custom_sample_path: "/path/to/your/custom/sample"

models:
- "anthropic.claude-4.5-opus"
- "anthropic.claude-4.5-sonnet"
- "openai.gpt-5.1"
- "openai.o3"
- "google.gemini-2.5-pro"
- "google.gemini-3-pro-preview"

Use all prompts:
# Set prompt directory (e.g., for No Reasoning experiments)
prompt_directory: "prompts/TalkMoves_NoReasoningRequested"

Or use specific prompts:
specific_prompts:
- "TalkMoves_Teacher_ZeroShot.json"
- "TalkMoves_Teacher_OneShot.json"
- "TalkMoves_Teacher_FewShot_3.json"
- "TalkMoves_Teacher_FewShot_ALL.json"

parallel_workers: 20 # Number of parallel workers
temperature: 0.0 # LLM temperature (0.0 = deterministic)
max_tokens: 4000 # Max tokens per request
request_timeout: 300 # Timeout in seconds

test_mode: true # Enable test mode
test_subset_size: 10 # Number of files to process

data/outputs/FloorBenchmark_MyRun/
├── run_metadata.json # Run information
├── analysis_results.csv # Performance metrics
├── TalkMoves_Teacher_ZeroShot/
│ ├── anthropic.claude-4.5-opus/
│ │ ├── file1.json_raw.txt
│ │ └── file2.json_raw.txt
│ └── openai.gpt-5.1/
│ └── ...
└── TalkMoves_Teacher_OneShot/
└── ...
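The layout above suggests one way to check how far a run has progressed: count the `_raw.txt` files under each prompt/model folder. The following is a minimal base R sketch, not the pipeline's own code. The function name `count_outputs` is hypothetical, and it assumes every output file ends in `_raw.txt` and sits exactly two folders below the run directory, as in the tree.

```r
# Hypothetical helper: count raw output files per prompt/model folder.
# Assumes the layout <run_dir>/<prompt>/<model>/<file>_raw.txt shown above.
count_outputs <- function(run_dir) {
  files <- list.files(run_dir, pattern = "_raw\\.txt$", recursive = TRUE)
  parts <- strsplit(files, "/", fixed = TRUE)
  # Keep only files exactly two folders deep (prompt/model/file)
  parts <- parts[lengths(parts) == 3]
  if (length(parts) == 0) {
    return(data.frame(Prompt = character(), Model = character(), N = integer()))
  }
  prompt <- vapply(parts, `[`, character(1), 1)
  model  <- vapply(parts, `[`, character(1), 2)
  counts <- as.data.frame(table(prompt, model), stringsAsFactors = FALSE)
  names(counts) <- c("Prompt", "Model", "N")
  # Drop prompt/model pairs that have no output files yet
  counts[counts$N > 0, ]
}
```

Calling `count_outputs("data/outputs/FloorBenchmark_MyRun")` would then return one row per prompt/model pair that has at least one output file.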
Results are saved in the results/ directory:
- {run_name}_results.csv: Detailed results for each model/prompt combination
- {run_name}_summary.csv: Summary statistics per model

Detailed Results (_results.csv):
Prompt,Model,N,Accuracy,Kappa
TalkMoves_Teacher_ZeroShot,anthropic.claude-4.5-opus,8350,0.686,0.513

Summary (_summary.csv):
Model,Prompts_Tested,Avg_Accuracy,Avg_Kappa,Best_Kappa,Total_Utterances
anthropic.claude-4.5-opus,4,0.685,0.511,0.513,27000

To compare samples, first run the pipeline on the whole sample:
sample_type: "whole_sample"
run_name: "FloorBenchmark_Whole"

Rscript run_pipeline.r

Then run it on the stratified sample:
sample_type: "stratified_sample"
run_name: "FloorBenchmark_Stratified"

Rscript run_pipeline.r

Finally, compare the two result files in R:
library(readr)
library(dplyr)

# Load results
whole <- read_csv("results/FloorBenchmark_Whole_results.csv")
stratified <- read_csv("results/FloorBenchmark_Stratified_results.csv")
# Compare. N and Accuracy exist in both files, so the join suffixes them
# with .x (whole) and .y (stratified).
comparison <- full_join(
  whole %>% rename(Kappa_Whole = Kappa),
  stratified %>% rename(Kappa_Stratified = Kappa),
  by = c("Prompt", "Model")
) %>%
  mutate(Kappa_Diff = Kappa_Stratified - Kappa_Whole)

Logs are saved to logs/pipeline_{timestamp}.log when verbose_logging: true.
Example log entries:
[2025-12-18 09:00:00] === FloorBenchmark Pipeline Started ===
[2025-12-18 09:00:00] Sample Type: stratified_sample
[2025-12-18 09:00:00] Found 405 input files
[2025-12-18 09:00:00] Total tasks: 8100
[2025-12-18 09:45:00] === Pipeline Complete ===
[2025-12-18 09:45:00] Elapsed time: 45.2 minutes
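The task count in this example is consistent with one task per input file, prompt, and model: 8100 / 405 = 20 prompt/model combinations. The sketch below reproduces that arithmetic. The split into 4 prompts and 5 models is an assumption chosen only to give 20 combinations, and the file, prompt, and model names are placeholders.

```r
# Sketch: build a task grid as the cross product of files, prompts, and models.
# The counts of 4 prompts and 5 models are assumptions, not taken from the log.
files   <- sprintf("file%03d.json", 1:405)
prompts <- paste0("prompt_", 1:4)
models  <- paste0("model_", 1:5)

tasks <- expand.grid(
  file = files, prompt = prompts, model = models,
  stringsAsFactors = FALSE
)
nrow(tasks)  # 405 * 4 * 5 = 8100
```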
Always test with a small subset first:
test_mode: true
test_subset_size: 5

Watch the logs in real time:
tail -f logs/pipeline_*.log

The pipeline automatically skips completed files, so you can safely re-run after failures.
Adjust parallel_workers based on your system and API rate limits:
- For API rate limits: Use fewer workers (5-10)
- For fast execution: Use more workers (20-30)
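To illustrate what the worker count controls, here is a minimal sketch of parallel mapping with furrr, which is in the package list above. It is not the pipeline's actual code: `process_file` is a hypothetical stand-in for one API request, the input paths are placeholders, and the worker count is hard-coded rather than read from the configuration.

```r
library(future)
library(furrr)

# Hypothetical stand-in for one API request; the real pipeline function is not shown here.
process_file <- function(path) {
  paste0(basename(path), ": done")
}

n_workers <- 2  # in the pipeline this would come from parallel_workers
plan(multisession, workers = n_workers)

# Placeholder input paths; each one is handled by whichever worker is free
inputs  <- c("data/inputs/a.json", "data/inputs/b.json", "data/inputs/c.json")
results <- future_map_chr(inputs, process_file)

plan(sequential)  # release the background workers when finished
```

With more workers, more requests are in flight at once, which is why a lower count helps when the API gateway enforces rate limits.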
Save different configurations for different experiments:
cp config/pipeline_config.yaml config/experiment_1.yaml
# Edit experiment_1.yaml
Rscript run_pipeline.r config/experiment_1.yaml

- Ensure the .env file exists with valid credentials:
  AI_GATEWAY_KEY=your_key_here
  AI_GATEWAY_BASE_URL=https://your-gateway.com

- Check that symlinks are created correctly
- Verify sample_paths in the configuration
- Ensure the input directory contains .json files

- Check that the ground_truth_file path is correct
- Ensure the ground truth CSV has the required columns:
  ID,Transcript,Turn,TalkMove_truth

- Reduce parallel_workers
- Enable test_mode to process smaller batches
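The required ground truth columns listed above can be checked before running the analysis. This is a minimal sketch: the helper name `missing_columns` is hypothetical, and the small in-memory table stands in for a ground truth CSV that you would normally read from disk.

```r
# Hypothetical check: report which required columns a ground truth table lacks.
required_cols <- c("ID", "Transcript", "Turn", "TalkMove_truth")

missing_columns <- function(df, required = required_cols) {
  setdiff(required, names(df))
}

# Example with a small in-memory table that lacks the Turn column
gt <- data.frame(
  ID = 1:2,
  Transcript = c("t1", "t1"),
  TalkMove_truth = c("a", "b")
)
missing_columns(gt)  # "Turn"
```

An empty result (`character(0)`) means all four columns are present.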
Same as parent project.
For issues or questions, please refer to the main project repository.