
feat: add statistics + illustrative traces #347

Merged
MrtinoRG merged 3 commits into dev from last_todoss on Apr 21, 2026

Conversation

Collaborator

@MrtinoRG MrtinoRG commented Apr 19, 2026

Summary by Sourcery

Add analysis assets for the paper, including illustrative epistemic reasoning traces and a script to compute corpus-level statistics from experiment reports.

New Features:

  • Introduce a LaTeX appendix section showcasing representative reasoning breakdown traces with annotated epistemic graphs for each major failure category.
  • Add a Python analysis script that aggregates token usage, configuration–environment coverage, estimated API costs, and malformed-response rates from reports.jsonl.

Enhancements:

  • Provide LaTeX-ready summary metrics and structured logging output to streamline transferring analysis statistics into the paper.


sourcery-ai Bot commented Apr 19, 2026

Reviewer's Guide

Adds an appendix-style LaTeX file with illustrative reasoning traces for four breakdown categories, and introduces a Python script that computes paper statistics (token usage, config–environment counts, estimated API cost, and scaffold error rates) from the experiment reports JSONL file.

Class diagram for compute_paper_stats module structure

classDiagram
  class compute_paper_stats_module {
    +dict PRICING
    +Path DATA_PATH
    +set config_env_pairs
    +defaultdict tokens
    +defaultdict scaffold_errors
    +defaultdict trials_affected_count
    +int total_trials
    +enc
    +count_tokens(text str) int
    +fmt_tokens(n int) str
    +agg_model(model str) (int input_tokens, int output_tokens)
    +agg_all() (int input_tokens, int output_tokens)
    +agg_verbosity(verb str) (int input_tokens, int output_tokens)
  }

  class tiktoken_encoder {
    +encode(text str) list~int~
  }

  class loguru_logger {
    +info(message str)
  }

  class reports_source {
    +path str
    +open()
  }

  compute_paper_stats_module --> tiktoken_encoder : uses
  compute_paper_stats_module --> loguru_logger : logs_via
  compute_paper_stats_module --> reports_source : reads_from

Flow diagram for compute_paper_stats statistics pipeline

flowchart TD
  A_Start([Start compute_paper_stats.py]) --> B_ReadFile
  B_ReadFile["Open results/data/reports.jsonl"] --> C_LoopLines

  subgraph S_LineProcessing[Per JSONL record]
    C_LoopLines --> D_Parse["Parse JSON line to rec"]
    D_Parse --> E_ExtractMeta["Extract model, agent_type, environment, verbosity"]
    E_ExtractMeta --> F_AddConfigEnv["Add (model, agent_type, verbosity, env) to config_env_pairs"]
    F_AddConfigEnv --> G_LoopTasks["Loop over Task Results"]

    subgraph S_TaskTrials[Per task and trial]
      G_LoopTasks --> H_LoopTrials["Loop over trials"]
      H_LoopTrials --> I_IncTotalTrials["Increment total_trials"]
      I_IncTotalTrials --> J_GetMessages["Get trial messages"]
      J_GetMessages --> K_TokenizeMsgs["Compute msg_tok via count_tokens(content)"]

      K_TokenizeMsgs --> L_AssistantTurns["For each assistant message"]
      L_AssistantTurns --> M_SumInput["Input tokens += sum(prior msg_tok)"]
      M_SumInput --> N_AddOutput["Output tokens += this assistant msg_tok"]

      K_TokenizeMsgs --> O_ScaffoldInit["Initialize trial_has_error=False"]
      O_ScaffoldInit --> P_ScanScaffold["Scan messages for scaffold errors"]
      P_ScanScaffold --> Q_UpdateScaffold["Update scaffold_errors[(model, agent_type)]"]
      Q_UpdateScaffold --> R_UpdateTrialsAffected{agent_type == react?}
      R_UpdateTrialsAffected -- Yes --> S_UpdateReact["Update trials_affected_count[model]"]
      R_UpdateTrialsAffected -- No --> T_NextTrial["Next trial"]
      S_UpdateReact --> T_NextTrial
    end
  end

  T_NextTrial --> U_NextLine{More lines?}
  U_NextLine -- Yes --> C_LoopLines
  U_NextLine -- No --> V_Aggregation

  subgraph S_Aggregation[Aggregation helpers]
    V_Aggregation["Compute aggregates via agg_model, agg_all, agg_verbosity"]
  end

  V_Aggregation --> W_LoggingSections

  subgraph S_Logging[Reporting via loguru]
    W_LoggingSections["Log configuration counts, token tables, costs, error rates, LaTeX-ready values"]
  end

  W_LoggingSections --> X_End([End])
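The cost-estimation step in the pipeline above can be sketched as follows. This is an illustrative stand-in: the `PRICING` values and the `estimate_cost` helper are placeholders, not the script's actual table or API.

```python
# Hypothetical per-million-token pricing table (USD); the real
# script's PRICING values are not shown in this PR summary.
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def estimate_cost(model, input_tokens, output_tokens):
    # Cost = tokens / 1M * price-per-million, split by input/output,
    # as described for the proprietary-model cost estimate.
    p = PRICING[model]
    return (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]
```

For example, 2M input tokens and 500K output tokens at the placeholder rates come out to 10.00 USD.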

File-Level Changes

Change Details Files
Introduce LaTeX appendix with illustrative epistemic traces for four reasoning-breakdown categories, including custom node badges, color scheme, and annotated examples with diagrams and annotator quotes.
  • Define color palette and a reusable \nodebadge macro for epistemic node types (H, T, E, J, C, F).
  • Add a new subsection describing how to read the trace visualizations and linking to the online browser.
  • For each breakdown category (evidence non-uptake, untested claim, fixed belief trace, contradiction without repair), add a tcolorbox with model/context metadata, excerpted messages, node/edge annotations, a small TikZ diagram, and an annotator quote explaining the pattern.
analysis/illustrative_traces.tex
Add an analysis script to compute aggregate experiment statistics (token counts, configuration–environment pairs, estimated proprietary API cost, and ReAct scaffold error rates) from reports.jsonl and log them in human- and LaTeX-ready formats.
  • Read analysis/results/data/reports.jsonl and iterate over all records, aggregating by model, agent type, environment, and verbosity.
  • Compute per-assistant-turn token input/output using tiktoken with the o200k_base encoding, mirroring API usage by counting all prior messages as input and the current assistant message as output.
  • Aggregate token statistics by model and verbosity, including grand totals, and pretty-print values using a compact formatter (K/M/B).
  • Estimate API cost for proprietary models using a per-million-token pricing table, including a breakdown by verbosity.
  • Detect ReAct scaffold errors by scanning for 'No actions to execute' in user messages, computing per-model error rates and percentage of affected trials.
  • Log a summary section with LaTeX-ready scalar values for tokens, number of config–environment pairs, total API cost, and malformed-response rates for selected models.
analysis/compute_paper_stats.py
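The token-accounting approach described above can be sketched roughly as below. All names (`count_tokens`, `fmt_tokens`, `tally_trial`) are illustrative stand-ins, and the whitespace tokenizer only approximates the script's actual tiktoken `o200k_base` encoding so the example stays self-contained.

```python
from collections import defaultdict

def count_tokens(text):
    # Stand-in tokenizer for illustration; the script instead uses
    # tiktoken.get_encoding("o200k_base").encode(text).
    return len(text.split())

def fmt_tokens(n):
    # Compact K/M/B formatting, mirroring the described pretty-printer.
    for div, suffix in ((1e9, "B"), (1e6, "M"), (1e3, "K")):
        if n >= div:
            return f"{n / div:.1f}{suffix}"
    return str(n)

def tally_trial(messages, tokens, model):
    # Mirror API usage: each assistant turn counts all prior messages
    # as input and its own content as output.
    msg_tok = [count_tokens(m["content"]) for m in messages]
    for i, m in enumerate(messages):
        if m["role"] == "assistant":
            tokens[model]["input"] += sum(msg_tok[:i])
            tokens[model]["output"] += msg_tok[i]

tokens = defaultdict(lambda: {"input": 0, "output": 0})
trial = [
    {"role": "user", "content": "measure the spectrum"},
    {"role": "assistant", "content": "running the tool now"},
    {"role": "user", "content": "tool output: peak at 3000"},
    {"role": "assistant", "content": "the peak suggests an O-H stretch"},
]
tally_trial(trial, tokens, "gpt-4o")
```

With the stub tokenizer, the second assistant turn pays for all three preceding messages as input, which is exactly the accumulation pattern the reviewer later flags as quadratic.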



coderabbitai Bot commented Apr 19, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5b52b2fc-41ef-4670-befc-f331b21ccbb4



@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 1 issue, and left some high-level feedback:

  • The per-assistant-turn input token calculation currently does sum(msg_tok[:i]) inside the loop, which is O(n²) over messages; consider precomputing a prefix sum array so each input count is O(1) and the loop stays linear in the number of messages per trial.
  • All the JSONL parsing and aggregation runs at import time; wrapping this logic in a main() function and guarding with if __name__ == "__main__": would make the module safer to import and easier to reuse programmatically.
  • The top-level docstring still says "Compute the TODO statistics"—updating this to accurately describe the current outputs (tokens, config–env pairs, cost, malformed-response rates) will make the script’s purpose clearer to future readers.
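The first two suggestions can be sketched together as below, with hypothetical function names; a prefix-sum array replaces the quadratic `sum(msg_tok[:i])`, and a `main()` guard keeps the pipeline from running at import time.

```python
import itertools

def input_output_tokens(msg_tok, roles):
    # Prefix sums make each assistant turn's input count O(1),
    # so the loop stays linear instead of O(n^2) per trial.
    prefix = [0, *itertools.accumulate(msg_tok)]
    inp = out = 0
    for i, role in enumerate(roles):
        if role == "assistant":
            inp += prefix[i]   # tokens of all prior messages
            out += msg_tok[i]  # tokens of this assistant reply
    return inp, out

def main():
    # Guarding the pipeline keeps JSONL parsing and aggregation
    # from running when the module is merely imported.
    inp, out = input_output_tokens(
        [3, 4, 5, 6], ["user", "assistant", "user", "assistant"]
    )
    print(inp, out)

if __name__ == "__main__":
    main()
```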
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The per-assistant-turn input token calculation currently does `sum(msg_tok[:i])` inside the loop, which is O(n²) over messages; consider precomputing a prefix sum array so each input count is O(1) and the loop stays linear in the number of messages per trial.
- All the JSONL parsing and aggregation runs at import time; wrapping this logic in a `main()` function and guarding with `if __name__ == "__main__":` would make the module safer to import and easier to reuse programmatically.
- The top-level docstring still says "Compute the TODO statistics"—updating this to accurately describe the current outputs (tokens, config–env pairs, cost, malformed-response rates) will make the script’s purpose clearer to future readers.

## Individual Comments

### Comment 1
<location path="analysis/illustrative_traces.tex" line_range="3" />
<code_context>
+% Illustrative trace excerpts for the four major reasoning-breakdown categories.
+% Requires: tcolorbox, tikz, xcolor, enumitem, listings
+% Usage: \input{analysis/results/illustrative_traces.tex}
+
+\definecolor{colENU}{RGB}{198,219,239}   % Evidence non-uptake — light blue
</code_context>
<issue_to_address>
**issue (bug_risk):** The documented \input path does not match the actual file location in the repo.

The usage line references `\input{analysis/results/illustrative_traces.tex}`, but the file is at `analysis/illustrative_traces.tex`. This mismatch will cause the include to fail if copied. Please either correct the documented path or move the file to match it.
</issue_to_address>


@@ -0,0 +1,371 @@
% Illustrative trace excerpts for the four major reasoning-breakdown categories.
% Requires: tcolorbox, tikz, xcolor, enumitem, listings
% Usage: \input{analysis/results/illustrative_traces.tex}

@MrtinoRG MrtinoRG merged commit 6ddd5a1 into dev Apr 21, 2026
8 of 11 checks passed
@MrtinoRG MrtinoRG deleted the last_todoss branch April 21, 2026 13:11
MrtinoRG added a commit that referenced this pull request Apr 22, 2026
* build(deps): bump actions/cache from 3 to 5 (#259)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump actions/checkout from 5 to 6 (#239)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: refactor metrics and report (#236)

* feat: refactor metrics and report

* fix: remove dummy files

* fix: standardize files

* chore: correct docs

* feat: simplify task_metric calculation

* feat: expose the metrics in the runner

* feat: incorporate report saving

* feat: add messages to the report

* fix: debug

* fix: bug when content in reasoning + handle special characters in model name

* fix: solve issues in the report generation

* fix: remove commented code

* fix: apply suggestions from code review

* feat: add hardcoded context window for GPT-oss-120b

* fix: apply suggestions from docs code review

* fix: apply suggestions from code review

* feat: attach logprobs in message, add to hook context (#263)

* feat: attach logprobs in message, add to hook context

* fix: minor bug in logger

* chore: example script for run with logprob

* fix: update test mock object to have logprob

* feat: add after iteration hook in toolcalling as well

* fix: update test

* feat: llm response object and messages have id (#266)

* feat: llm response object and messages have id

* fix: lint and update tests

* fix: update tests

* fix: sourcery suggestion

* fix: update default in example script

* added docstrings for scoring functions (#272)

* feat: load from trace and fix toolcalling intervention (#271)

* feat: load from trace and fix toolcalling intervention

* chore: workdir autoset

* feat: make the visualization tool general (#261)

* feat: spectra runs (02022026) (#200)

* feat: organize the spectra environments as others

* fix: correct isomers tool

* feat: add gpt-4o tool calling brief

* fix: correct TOML file

* feat: add tests to the spectra elucidation environment (#212)

* feat: add tests to the spectra elucidation environment

* fix: solve issues raised in code review

* feat: add level 1 claude tasks

* chore: remove old files

* feat: add budget exhaust limit

* feat: make the error handling more robust

* feat: add runs and reports

* fix: solve CI

* fix: add corral as dependency

* feat: add subtask scores

* fix: remove old files

* fix: remove dummy files

* feat: add runs gptOss level 1

* feat: add agent logs substask gpt_oss

* feat: add more level 1 reports

* feat: add level 1 subtasks

* feat: add logprobs brief

* feat: add level1 logprobs

* feat: add level 1 metrics

* feat: add level 2 tasks

* feat: add level 2 tasks logprobs

* feat: add level 2 metrics

* feat: add level 2 subtasks logprobs

* feat: add level 2 comprehensive logprobs

* feat: add subtask 1 logprobs

* feat: add subtask 1 logprobs

* feat: add last logprobs

* feat: update scoring functions

* chore: remove uploaded logprobs

* fix: first scoring reruns

* fix: correct Claude runs

* fix: solve scoring for GPT-4o

* fix: rerun gpt_oss

* fix: solve tests

* feat: add first files of the retro env (#251)

* feat: add first files of the retro env

* fix: solve hooks

* fix: solve hooks score function

* feat: add cas management tools

* feat: add tools and target molecules

* feat: add functional group detector

* feat: add tool descriptions

* feat: add the level 1 tasks

* feat: add database setup files

* feat: add the last visuals

* fix: correct the template ids for the known reactions

* feat: add dataset filtering functions

* feat: add SMARTS checking to the search tool

* feat: add level 3

* feat: level 3

* feat: add final tasks

* fix: solve some issues + add script for custom reactions

* fix: remove old runs

* fix: solve task 1

* feat: add tool decorator

* fix: remove conflicting tasks

* feat: update visual and helpers

* feat: add some runs

* chore: update keywords of the docstring

* feat: add new runs

* feat: add subtasks

* fix: remove old files

* fix: correct subtask logic

* fix: correct logic in level 2 subtasks

* feat: update level_3

* feat: add level 1 task runs

* feat: add level 1 results for claude

* feat: add the origin of the SMARTS patterns

* feat: add level 2-task results

* feat: add level 3 task brief runs

* fix: correct scoring in the level 2

* feat: add reports level 2 subtask (begin)

* chore: move some files for test of the annotation app

* feat: add more reports

* feat: add level 1 subtasks gpt reports

* feat: add subtask level 2 gpt

* feat: add level 2 subtask claude

* chore: rerun subtask level2

* feat: rerun level 1 subtasks

* feat: rerun level 1 subtasks

* feat: add level 3 reports

* feat: add tests to the retro env + add database checks (#210)

* feat: add test to the retro env + check database availability at the beginning

* fix: move credentials into database config

* fix: apply suggestions from code review

* fix: remove dummy files

* feat: add some plots with the results

* feat: remove old reports

* feat: add the corrected retro environment

* feat: add  claude reports

* feat: add gpt-4o reports

* fix: remove files from other commit

* feat: add claude last results

* feat: add gpt-oss level 2 subtasks

* fix: solve organization of level 1 subtasks for GPT Oss

* feat: add level 2 tasks gpt-oss

* feat: add gpt-oss subtask level 2

* feat: some level 3 runs

* feat: new score

* chore: remove dummy file

* feat: add level 3 runs

* feat: rerun scoring of subtasks

* chore: remove uploaded logprobs

* fix: add last changes

* fix: solve tests

* feat: add pytest to the dependencies

* feat: add chemprice to the dependencies

* fix: solve the requests versioning

* fix: add swifter as a build dependency

* fix: add pandas as a build dependency

* fix: add others as a build dependency

* fix: add pyarrow as a build dependency

* fix: add psutil as a build dependency

* feat: add pandas as override dependency

* fix: add dummy environment variables to fix tests

* fix: solve tests that depend on the database

* fix: solve scoring for the spectra and retro subtasks (#285)

* feat: add gpt_oss ml runs  (#269)

* feat: add gpt_oss ml runs

* chore: remove logprobs and metrics

* chore: reorganize files

---------

Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>

* feat: gpt_oss reports for resistor (#268)

* feat: gpt_oss reports for resistor

* feat: rerun rate limits

* chore: remove old files

* chore: reorganize files

---------

Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>

* feat: add gpt_oss catalyst runs (#270)

* feat: add gpt_oss ml runs

* chore: update deps

* chore: remove and rename files

---------

Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>

* feat: reasoning qa , resistor, ml, catalyst (#280)

* chore: example template

* chore: utility script to push to hf

* chore: update

* chore: reasoning qa resistor

* feat: update reasoning qa

* feat: update reasoning qa

* feat: ml reasoning questions

* feat: add reasoning qa for ml

* feat: update keywords

* feat: catalyst reasoning qa

* feat: reasoning qa

* feat: update catalyst question

* wetlab reasoning qa

* chore: add requires_knowledge keywords

* chore: address review points

* fix keywords

* chore: code quality

* feat: apply suggestions from review

* chore: update review comments

* chore: update review comments

* chore: update keyword order

---------

Co-authored-by: Sadra <139479461+aaaghajani@users.noreply.github.com>
Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>

* afm and md qa - final version, comments already addressed (#287)

* afm and md qa - final version, comments already addressed

* fix: afm keyword requires_knowledge

* fix: afm keyword issues

* fix: add env as first option

---------

Co-authored-by: “imandal98” <indrajeetmandal.aaa@gmail.com>
Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>

* feat: add Spectra Reasoning QA (#279)

* feat: add Spectra Reasoning QA

* feat: add requires_reasoning keyword

* fix: apply suggestions from code review

* chore: shuffle target scores

* fix: make questions more clear + move knowledge questions

* feat: add retro reasoning qa (#276)

* feat: add questions

* feat: format tasks

* chore: untrack dummy files

* fix: add new Reasoning QA for the retro

* fix: correct some of the questions + move knowledge ones

* feat: add missing QA scores (including reasoning) (#267)

* feat: add QA scores for GPT-OSS

* feat: add all QA runs + reorganize repo

* chore: remove logprobs files

* fix: remove ambiguity with the old and new questions

* chore: remove uploaded reports

* AFM updates (#288)

* fix: subtasks

* fix: tasks

* fix: env files

* chore: solve code quality CI

* chore: remove corrupted lfs files

---------

Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>
Co-authored-by: Martiño Rios-Garcia <147555961+MrtinoRG@users.noreply.github.com>

* Md corr vis dev (#289)

* MD environment changes

* deleted additional files

* feat: expose history (#298)

* Fix issue #264: Pythoncom missing in the dependencies of AFM (#265)

* Fix issue #264: Pythoncom missing in the dependencies of AFM

* Fix: CoUninitialize

* Fix: CoUninitialize in tools

* Fix: Wrap COM initialization in try/finally to ensure proper cleanup

* fix: solve ci

---------

Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>

* build(deps): bump actions/checkout from 4 to 6 (#277)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: expose history

* feat: add the from_trace classmethod

* fix: remove tests that targeted history

* fix: solve error with deepcopy

--------

* feat: add code for `to_latex()` method (#262)

* feat: add the first version of code2latex

* fix: correct the code2latex

* fix: correct the code and add tables for spectra

* fix: apply suggestions from code review and remove tool cache

* feat: add tables to the single envs

* fix: correct the ml tools

* feat: add automatic longtable generation for scoring functions

* fix: move into a generate latex method

* fix: apply suggestions from code review

* feat: add verbosity level as an argument

* Update src/corral/backend/env.py

Co-authored-by: Kevin M Jablonka <32935233+kjappelbaum@users.noreply.github.com>

* Update src/corral/backend/env.py

Co-authored-by: Kevin M Jablonka <32935233+kjappelbaum@users.noreply.github.com>

* Update src/corral/backend/env.py

Co-authored-by: Kevin M Jablonka <32935233+kjappelbaum@users.noreply.github.com>

* Update src/corral/backend/env.py

Co-authored-by: Kevin M Jablonka <32935233+kjappelbaum@users.noreply.github.com>

* Update src/corral/backend/env.py

Co-authored-by: Kevin M Jablonka <32935233+kjappelbaum@users.noreply.github.com>

* Update src/corral/backend/env.py

Co-authored-by: Kevin M Jablonka <32935233+kjappelbaum@users.noreply.github.com>

---------

Co-authored-by: Kevin M Jablonka <32935233+kjappelbaum@users.noreply.github.com>

* chore: rm  older runs (#305)

* chore: rm ai4mat runs

* chore: rm md_optimized as they are not final

* fix: correct that trace is not saved when agent fail (#308)

* fix: correct that trace is not saved when agent fail

* feat: add reruns

* chore: fixes in consistencies with trial ids and missing traces (#306)

* chore: rename attempt_ -> trial ids

* chore: reran, ml environment comprehensive subtask

* chore: oss catalyst reruns

* chore: reruns catalyst subtask

* Md corrected runs (#290)

* MD FINAL RUNS

* updated gitignore

* gpt-4o runs restructured

* claude_45 runs - restructured

* gpt-oss runs - restructured

* fix: update with dev

* fix: solve dev again

* merge conflict - .gitignore

* MD reruns updated

* removed previous MD runs

---------

Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>
Co-authored-by: Martiño Rios-Garcia <147555961+MrtinoRG@users.noreply.github.com>

* feat: add script to pull data + snakefile (#309)

* chore: script to get QA score similar to reports (#310)

* chore: script to get QA score

* chore: update the script to the correct structure

* feat: add plot scripts (#312)

* feat: add appendix fig 5 plot

* feat: add appendix figure 4 plots

* feat: add more plots for analysis

* fix: apply suggestions from code review

* fix: review title case notation

* fix: apply suggestions from review

* fix: correct the plots for the new range_frame

* GPT OSS runs for AFM using a new naming convention (#320)

* fix: subtasks

* fix: tasks

* fix: env files

* chore: solve code quality CI

* chore: remove corrupted lfs files

* feat: add GPT OSS runs for AFM using a new naming convention

* fix: run pre-commit such that CI check pass

* fix: try to solve CI

---------

Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>

* Reports to hf (#302)

* script to push traces to HF

* added main

* script to push reports to HF

* improved hierarchies

* changed dir structure

* script to pull data from HF

* final version for subset name

* comments - in progress

* addressed comments

* chore: move reports around

---------

Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>

* feat: react oss - catalyst subtask and resistor comprehensive (#322)

* feat: rerun catalyst react subtask oss

* feat: resistor react oss comprehensive subtask reruns

* chore: utility shared across plots (#314)

* chore: utility shared across plots

* chore: .snakemake in gitignore

* Update analysis/plot_config.py

Co-authored-by: Kevin M Jablonka <32935233+kjappelbaum@users.noreply.github.com>

* feat: add display names, environment groups, and centralize level config

- Update ENVIRONMENT_NAMES with descriptive display names
- Add ENVIRONMENT_GROUPS for high-level categorization
- Rename ENVIRONMENT_LEVELS to ENVIRONMENT_MAX_LEVELS
- Move DEFAULT_ENV_LEVEL_MAP from plot_utils to plot_config

* refactor: use dicts for colours and simplify filter API

- Convert MODEL_COLOURS, AGENT_COLOURS, ENVIRONMENT_COLOURS from lists
  to dicts keyed by id (removes separate _MAP dicts)
- Simplify filter functions: replace "average" strategy with None,
  remove redundant override parameters

* fix: rename generic df variable to satisfy ruff PD901

---------

Co-authored-by: Kevin M Jablonka <32935233+kjappelbaum@users.noreply.github.com>

* Md re runs (#330)

* claude reruns added

* gpt 4o reruns added

* minor update to handle lammps log files

* Wetlab+Reports (#323)

* bring wetlab env from wetlab branch

* fix: correct JSON schema for simulate_color_mixture in wetlab

* wetlab: removed old qa files

* task-level reports

* subtask-level reports PART1

* subtask-level reports PART2

* feat: rerun missing traces

* fix: run pre-commit

* fix: run pre-commit in src

---------

Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>

* feat: add appendix tables (#325)

* feat: add scripts and tables

* fix: correct table

* feat: split figure 4 appendix

* feat: retro db can be pulled from ghcr

* feat: add panel 5 plots (#317)

* feat: add plots about the annotations

* fix: apply suggestions from review

* feat: apply suggestions from code review

* chore: merge 'dev' into 'marker_plots'

* feat: add second plot to the figure

* feat: update the env names

* feat: add wetlab + change env names

* feat: move legend below

* feat: correct qa table

* fix: get model names as the other figures

* feat: add epistemic analysis (#319)

* feat: add first annotation

* feat: add raw counts script

* feat: update plot

* feat: add title case

* feat: add final annotations

* feat: add creation of a latex table to the analysis script

* feat: add first draft of the figures

* feat: add the scripts to the snakemake file

* feat: update figure

* fix: apply suggestion from review to the image

* chore: update capitalization

* feat: add tables and figure for the appendix

* fix: update figure

* feat: panel 2  performance plots (#313)

* Fix issue #264: Pythoncom missing in the dependencies of AFM (#265)

* Fix issue #264: Pythoncom missing in the dependencies of AFM

* Fix: CoUninitialize

* Fix: CoUninitialize in tools

* Fix: Wrap COM initialization in try/finally to ensure proper cleanup

* fix: solve ci

---------

Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>

* build(deps): bump actions/checkout from 4 to 6 (#277)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: script to get QA score

* chore: plots

* chore: bottleneck plot

* chore: bottleneck plot

* chore: plot utils update

* chore: move to common utils that are shared

* chore: utility shared across plots

* chore: update gitignore

* chore: more plots

* chore: try more plots

* chore: task category plot

* chore: .snakemake in gitignore

* feat: mean logprobs

* chore: update styles in plot

* chore: correlation with logprobs

* chore: correlation with logprobs

* refactor: consolidate panel 2 plots with proper layout and config

- Rename scripts to 2a/2b/2c naming convention
- Add combined 2_panel.py with gridspec layout
- Fix group labels to appear above heatmap (use transAxes)
- Reduce heatmap cell height, match font sizes
- Remove old scripts and plot outputs

* fix: update figure sizes for 2b (1/3 width) and 2c (2/3 width)

* feat: add marginal mean bar charts to 2a heatmap

Add top (column means) and right (row means) bar charts alongside
the heatmap. Move model labels to right side with group lines.
Add vertical separators between environment groups. Remove colorbar.

* feat: update environment names and two-level x-axis labels

Update ENVIRONMENT_NAMES to match paper naming conventions (e.g.
Spectroscopic Structure Elucidation, AFM Experiment Execution).
Split x-axis into S1/S2/S3 tick labels with environment names
below as 45-degree rotated labels with grouping lines.

* feat: add panel 2 subplot variants and combined layout

- 2b scatter: color-code by environment group, model/scaffold spread labels
- 2c task category: add bar and line plot variants (models averaged)
- 2d logprobs: standalone script with y-axis on right
- 2_panel: combined figure with heatmap + 3 bottom subplots
- plot_config: add GROUP_COLOURS, update wetlab/afm category tags

* chore: updates in heatmap

* chore: scatter plot

* chore: push origin main

* chore: feedback

* chore: font size increase

* chore: font size increase

* chore: font size increase

* chore: add subset logprobs plot and fix linting issues

- Add 2d_logprobs_subset.py with 5-env subset and red-to-blue gradient
- Fix ruff warnings: unused vars, ambiguous names, implicit concat, noqa

* feat: migrate panel_2 and panel_6 plots to analysis/ with Snakemake integration

- Move 9 panel_2 scripts and 1 panel_6 script into analysis/ with plot_panel2_* naming
- Extract shared classify_subtask/load_category_tags into plot_utils.py
- Remove sys.path hacks; scripts now import plot_config/plot_utils directly
- Output to analysis/results/figures/panel_2/ and panel_6/
- Add 10 Snakemake rules and update rule all targets

* chore: remove plots/ directory (scripts and outputs)

Canonical scripts are in analysis/plot_panel2_*.py and
analysis/plot_panel6_*.py with Snakefile rules that output to
analysis/results/figures/. The plots/ directory was a duplicate.

* feat: update panel_2 plots and add grouped logprobs

- Remove "(Score)" from scatter plot axis labels
- Clarify heatmap marginal bar labels (per environment / per agent)
- Add spacing between env names and S1/S2 tick labels in heatmap
- Remove combined panel plot from Snakefile
- Add grouped logprobs plot averaging by environment group

---------

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: panel 4 irt plots (#315)

* chore: utility shared across plots

* feat: add irt modeling and plots

* chore: add task_category to modelling

* chore: improve plots

* chore: make plots

* chore: script to evaluate

* chore: evals

* chore: update eval

* chore: validation

* chore: new plots

* chore: irt tikz

* feat: add lfm and lfm-binomial Bayesian latent factor models with final report plots

- Add lfm/ (Bernoulli) and lfm-binomial/ (Binomial) with 8 hierarchical models each
- Extend analysis/Snakefile with lfm and lfm-binomial rules
- Add final_report.py plotting script and 7 publication-quality plots for best model (M7)
- Remove old IRT evaluation scripts and outputs

* feat: add LFM 3-subplot panel (variance decomposition, LOO predictions, task-averaged scatter)

* chore: update IRT tikz figure with purple theme, legend, and panel label

* chore: feedback

* chore: feedback

* chore: model capabilities and stuff

* chore: plots

* feat: migrate panel_4 plots to analysis/ with Snakemake integration

Move 7 panel_4 plotting scripts into analysis/ directory, removing
importlib hacks and integrating with Snakefile. Remove capability_comparison,
capability_profiles, and task_level_residuals plots.

* chore: rm old plots

* fix: resolve pre-commit lint errors and update gitignore

Fix ruff errors (PD901, RUF001/002/003, ARG001, B026, PD011, C408)
across analysis and plot files. Add **/.snakemake/ to gitignore to
cover nested dirs, remove tracked .snakemake metadata, and delete
duplicate plot file.

* chore: remove duplicate plots/panel_4/ and tracked png/pdf files

Canonical scripts live in analysis/plot_panel4_*.py and output to
analysis/results/figures/panel_4/. Remove the old plots/panel_4/
directory (scripts + generated output) and stray PDFs from analysis/.

* refactor: rename plot scripts to plot_irt_*, remove old lfm directory

Rename plot_panel4_* and plot_panel4_irt_* scripts to plot_irt_*.
Remove old lfm/ directory and its Snakefile rules (superseded by
lfm-binomial). Clean up stray tex/aux files.

* chore: remove model3 LOO plot and fitting rule (no longer best model)

* fix: update plot_irt_results rule to use lfm-binomial data

* feat: add HF download script for lfm-binomial results

Add download_lfm_binomial_from_hf.py to fetch pre-fitted model results
from jablonkagroup/corral_lfm_binomial_results. Add download_lfm_binomial
Snakefile rule so plot rules can resolve without local model fitting.

* chore: move fire to dev dependency (not required at runtime)

* fix: remove hardcoded environment count from docstring

* chore: add epistemology panel PDF

* chore: remove report files (#337)

* chore: remove report files

* chore: mv database creation to scripts

---------

Co-authored-by: n0w0f <pvt.nawaf@gmail.com>

* fix: standardize @tool docstring tags across all environments (#331)

* feat: add pre-commit hook to validate @tool docstring tags

Adds a validation script that checks all @tool-decorated functions
for the required tagged docstring format (BRIEF, DETAILED, PROCEDURAL,
WORKFLOW_INTEGRATION, CONTEXTUAL, SYNTACTICAL, RAISES, LIMITATIONS,
ARGS_*, RETURNS_*, and nested tags). Also detects common misspellings
like ARGS_SYNTACTIC vs ARGS_SYNTACTICAL and unclosed tags.
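A minimal sketch of the kind of check such a hook performs (the tag subset, the misspelling table, and `check_docstring` itself are illustrative, not the actual validator):

```python
import re

# Illustrative subset of the required top-level docstring tags.
REQUIRED_TAGS = {"BRIEF", "DETAILED", "SYNTACTICAL", "RAISES", "LIMITATIONS"}
# Common misspellings mapped to their canonical form.
MISSPELLINGS = {"ARGS_SYNTACTIC": "ARGS_SYNTACTICAL"}


def check_docstring(doc: str) -> list[str]:
    """Return a list of problems found in a tagged docstring."""
    problems = []
    opened = re.findall(r"\[([A-Z_]+)\]", doc)
    closed = re.findall(r"\[/([A-Z_]+)\]", doc)
    for tag in sorted(REQUIRED_TAGS - set(opened)):
        problems.append(f"missing tag: {tag}")
    for tag in opened:
        if tag in MISSPELLINGS:
            problems.append(f"misspelled tag: {tag} -> {MISSPELLINGS[tag]}")
        elif tag not in closed:
            problems.append(f"unclosed tag: {tag}")
    return problems
```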

* ci: add dedicated docstring validation job

Adds a validate-docstrings job that runs the full standalone scan
across all task environments, independent of pre-commit.

* revert: remove redundant docstring validation CI job

The pre-commit hook already runs in CI via the code-quality job,
so a separate job is unnecessary duplication.

* fix: rename ARGS_SYNTACTIC to ARGS_SYNTACTICAL across all environments

The verbosity system expects ARGS_SYNTACTICAL but catalyst, ml,
corral_md, and resistor_network used ARGS_SYNTACTIC (missing -AL),
causing those tags to be silently ignored during filtering.

Also switches pre-commit hook entry to python3 since the script
uses only stdlib modules.

* fix: standardize AFM tool docstring tags to use ARGS_*/RETURNS_* prefixes

AFM tools used bare [BRIEF], [DETAILED], [SYNTACTICAL], [EXAMPLES]
inside Args/Returns sections instead of the required [ARGS_BRIEF],
[ARGS_DETAILED], [ARGS_SYNTACTICAL], [ARGS_EXAMPLES] and
[RETURNS_BRIEF], [RETURNS_DETAILED], [RETURNS_EXAMPLES].
RAISES blocks were already correct and left unchanged.

* fix: rename ARGS_* to RETURNS_* in corral_md Returns sections

All 8 tools in corral_md used ARGS_BRIEF/ARGS_DETAILED/ARGS_EXAMPLES
inside Returns sections instead of RETURNS_BRIEF/RETURNS_DETAILED/
RETURNS_EXAMPLES. Also fixes a typo: ARGSDETAILED -> RETURNS_DETAILED
in execute_python_script.

* fix: standardize RAISES blocks across resistor, retrosynthesis, spectra

- retrosynthesis: fix typo [//ERROR_WHEN] -> [/ERROR_WHEN]
- resistor_network: add nested ERROR_WHEN/ERROR_DETAILS/ERROR_RECOVERY
  tags to propose_simple_topology and generate_test_measurements,
  fix unclosed RAISES block, fix EXAMPLES -> RETURNS_EXAMPLES
- spectra_elucidation: add nested RAISES tags to 6 tools that don't
  raise, add real error docs for obtain_isomers_from_molecular_formula
  (remote_call failure) and return_possible_fragments (ValueError)

* fix: standardize spectra_elucidation docstring tags

- Add PREREQUISITE/CURRENT/FOLLOW_UP nested tags to WORKFLOW_INTEGRATION
  in obtain_isomers_from_molecular_formula and validate_smiles
- Rename bare BRIEF/DETAILED/SYNTACTICAL/EXAMPLES to ARGS_*/RETURNS_*
  in simulate_spectra Args/Returns sections

* fix: rewrite samplemath tool docstrings with full tagged format

Both calculator and percentage_calculator had plain docstrings with
no verbosity tags. Rewrites both with the complete tagged format
including BRIEF, DETAILED, PROCEDURAL, WORKFLOW_INTEGRATION,
CONTEXTUAL, SYNTACTICAL, ARGS_*, RETURNS_*, RAISES, and LIMITATIONS.

* fix: standardize docstring tags in src/corral/utils/ tools and extend validator scope

Rename bare [BRIEF]/[DETAILED]/[SYNTACTIC]/[EXAMPLES] tags to
ARGS_*/RETURNS_* prefixed versions in terminal_tools.py, context7_tools.py,
and code_tools.py (5 @tool functions, 75 violations fixed).

Update validator find_tool_files() to also scan src/corral/utils/ and
widen pre-commit hook file pattern to match *_tools.py files.

* fix: scope pre-commit docstring hook to avoid matching test files

Narrow _tools.py pattern to only match src/corral/utils/*_tools.py,
preventing tests/test_tools.py from being validated as real tool files.

* fix: standardize wetlab tool docstring tags and fix pre-commit scope

- Rename ARGS_SYNTACTIC → ARGS_SYNTACTICAL across 8 tools
- Add missing ARGS_SYNTACTICAL tags for 5 args (sol1_label, sol2_label, sol_label)
- Add RAISES nested tags to 6 tools with empty RAISES blocks
- Scope pre-commit hook pattern to src/corral/utils/*_tools.py to avoid
  matching test files

* fix: exclude tests/ from tool docstring validation hook

Test files contain @tool fixtures with simple docstrings that don't
need the full tagged format. This was caught by CI (--all-files) but
not locally since test files were never staged.

* fix: add pytest-mock to dependency-groups dev

Was already in [project.optional-dependencies] but missing from
[dependency-groups] which is what uv sync --dev uses. Fixes 13 test
errors in test_prompt_utils.py.

* refactor: standardize task configs across all environments (#328)

* refactor: standardize task configs to follow spectra_elucidation pattern

All environments now use a consistent directory structure:
  <env>/environments/level_N/tasks_json/task_M.json
  <env>/environments/level_N/subtasks_json/task_M.json

Changes per environment:
- catalyst: config/{single,chained}/ → environments/level_1/{tasks_json,subtasks_json}/
  Split 3 tasks (si, tio2, cu2o) into individual files
- ml: config/{single,chained}/ → environments/level_1/{tasks_json,subtasks_json}/
  Split 3 tasks (oxides, nitrides, sulphides) into individual files
- resistor_network: config/{single,chained}/ → environments/level_1/{tasks_json,subtasks_json}/
  Split 6 tasks into individual files with grouped subtasks
- corral_md: Flattened categories (melting, quenching, surface_energy) into
  numbered tasks within level dirs, renamed tasks/ → tasks_json/, subtasks/ → subtasks_json/
- afm: src/enviroment/{tasks_N,subtasks_N}.json → environments/level_N/{tasks_json,subtasks_json}/task_1.json
  4 levels with tasks and subtasks
- retrosynthesis: Renamed tasks/ → tasks_json/, subtasks/ → subtasks_json/

* ci: add pre-commit hook to enforce environment config convention

Validates that all task environments follow the standard structure:
  <env>/environments/level_N/{tasks_json,subtasks_json}/<task>.json

Catches legacy patterns (config/, src/enviroment/) and checks for
valid JSON, correct directory naming, and required tasks_json/ dirs.

* feat: standardize task JSON schemas and add HF upload script

Schema standardization:
- All task/subtask JSONs now use array format at top level
- Dict-based files converted: dict keys become `id` field on each entry
- `scoring_fn` renamed to `scoring_function` across all environments
- `description` field added from `input.prompt` for spectra/retrosynthesis
- UUID added to every task and subtask entry

New scripts:
- scripts/standardize_task_schemas.py: normalizes all JSON schemas
- scripts/upload_tasks_to_hf.py: uploads tasks to HF dataset
  (jablonkagroup/corral-environment-tasks) with 26 subsets,
  681 total entries across all environments and levels
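The dict-to-array conversion described above can be sketched roughly as follows (field names follow the commit message; the function itself is an illustrative stand-in for the migration script):

```python
import uuid


def standardize(tasks: dict) -> list[dict]:
    """Convert a dict-keyed task file to the standardized array format.

    Dict keys become `id` fields, `scoring_fn` is renamed to
    `scoring_function`, every entry gets a UUID, and the prompt is
    mirrored into a top-level `description` where available.
    """
    entries = []
    for key, task in tasks.items():
        entry = dict(task)
        entry["id"] = key
        if "scoring_fn" in entry:
            entry["scoring_function"] = entry.pop("scoring_fn")
        entry.setdefault("uuid", str(uuid.uuid4()))
        if "description" not in entry and "input" in entry:
            entry["description"] = entry["input"].get("prompt", "")
        entries.append(entry)
    return entries
```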

* fix: add subset configs to HF dataset README for proper viewer support

The HF dataset viewer requires configs declared in the YAML front matter
of README.md to display subset selection. Now generates proper config
entries for all 26 subsets.

* feat: add unified task loader utility and refactor catalyst env.py

Add `corral.utils.task_loader` with `load_task_entries()` and
`load_task_entries_from_env_package()` that load standardized task
configs from either HuggingFace or local JSON directories.

Refactor catalyst env.py as reference implementation: replace custom
`load_tasks_from_json()` with the shared task loader, add
`entries_to_task_definitions()` for converting entries to TaskDefinition
objects using the environment's scoring function registry.

* fix: deduplicate subtask/task IDs across files in same directory

Prefix duplicated IDs with filename stem (e.g. task_1_retrieve_structure)
in catalyst and ml subtasks. Also updates input_from_tasks references.

* fix: update catalyst integration tests for refactored task loader API

Replace removed load_tasks_from_json with entries_to_task_definitions +
load_task_entries, update fixtures to standardized array format, and fix
create_environments call signatures.

* feat: standardize wetlab task configs to match other environments

- Move wetlab tasks/subtasks from wetlab/wetlab/{tasks,subtasks}_json/level_X/
  to wetlab/environments/level_X/{tasks,subtasks}_json/
- Add wetlab to GROUP_A_ENVS in standardize script
- Standardize all 60 wetlab JSON files: scoring_fn → scoring_function,
  add uuid, add description from input.prompt

* feat: standardize samplemath (demo env) to follow framework conventions

- Move config/example.json → environments/level_1/{tasks,subtasks}_json/
- Remove legacy config/ directory
- Standardize JSON: dict→array format, add uuid, add id from dict keys
- Remove samplemath from SKIP_ENVS in all scripts (standardize, upload, check)
- Demo env should exemplify best practices, not be an exception

* chore: remove legacy checks and one-time migration script

- Remove legacy config/ and src/enviroment/ detection from check_env_convention.py
  since all environments are now standardized
- Delete standardize_task_schemas.py — migration is complete, no longer needed

* feat: landing page for Corral benchmark (#324)

* feat: add landing page with environment explorer and verbosity slider

- extract_data.py: AST-based extraction of tools, tasks, and scoring
  functions from all 7 environments (docstrings stripped from code)
- data.js: auto-generated data with verbosity-tagged sections per tool
- index.html: Tailwind + Alpine.js page with environment explorer,
  interactive verbosity slider, code panels, leaderboard, and team tabs

* feat: redesign landing page with hero, architecture section, and font pairing

- Add proper Home tab with hero, stats, architecture diagram, env grid
- Use Space Grotesk (headings) + Newsreader (body) + JetBrains Mono (code)
- Copy logo and arch images into site/ for correct relative paths
- Increase code snippet extraction limit from 20 to 50 lines
- Add M3RG Lab credit to footer and team section
- Add external Docs link in nav pointing to GitHub Pages
- Environments tab now separate from landing page

* docs: add README with local setup instructions for site

* update team members

* feat: show all tasks and subtasks across levels on landing page

- Rewrite extract_data.py to walk standardized directory structure
- Load 85 tasks + 596 subtasks across all 7 environments and levels
- Add subtasks tab to environment detail view
- Add level selector for multi-level environments
- Update stats strip: 7 envs, 83 tools, 85 tasks, 596 subtasks
- Environment cards now show task + subtask counts

* fix: unique subtask IDs, cap chips at 25 with HF link, equal-height panels

- Use _uid (from uuid) for task/subtask selection to handle duplicate IDs
- Cap task/subtask chip lists at 25 with "+ N more" link to HuggingFace
- Remove 50-line truncation on tool and scoring function code snippets
- Equal-height columns for description and code panels (items-stretch)

* feat: show arguments and returns in tool descriptions by verbosity level

Include ARGS_BRIEF/DETAILED/SYNTACTICAL and RETURNS_BRIEF/DETAILED/EXAMPLES
progressively in the verbosity slider output. Add section headers
(Arguments, Returns, Raises, etc.) for clarity.

* fix: render args/returns as styled HTML sections in tool description

Switch from x-text to x-html for tool description body. Args, Returns,
Raises, Limitations, and Examples now render with uppercase header labels
and visual separators so they're clearly visible at higher verbosity.

* refactor: auto-discover environments, remove team tab, dynamic stats

- extract_data.py now auto-discovers environments from tasks/*/environments/
  and finds tools.py/score.py via rglob (no hardcoded paths)
- Display names and descriptions moved to site/env_meta.json
- Stats strip computed dynamically from CORRAL_DATA (no hardcoded numbers)
- Remove team tab and team members data

* docs: update environment names and descriptions to match paper

Rename environments to paper terminology: Surface Construction,
Circuit Inference, Retrosynthetic Planning, AFM Operation,
Molecular Simulation, ML Property Prediction, Spectra Elucidation.

* fix: discovery ignores venv

* fix: combine ARGS/RETURNS sections into unified blocks and show at all verbosity levels

- ARGS_BRIEF and RETURNS_BRIEF now shown from Brief level onward
- Higher verbosity levels are cumulative (ARGS_DETAILED adds to ARGS_BRIEF, not replaces)
- All ARGS_* sections render as single "Arguments" block, RETURNS_* as single "Returns" block
- Regenerated data.js with standardized docstring tags
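The cumulative behaviour can be sketched as follows (the level ordering and the tag-to-level mapping here are illustrative):

```python
# Illustrative verbosity ordering: each level includes all earlier ones.
VERBOSITY_LEVELS = ["BRIEF", "DETAILED", "EXAMPLES"]


def visible_sections(sections: dict[str, str], level: str) -> dict[str, str]:
    """Return every ARGS_*/RETURNS_* section at or below the given level.

    Higher levels are cumulative: DETAILED adds to BRIEF rather than
    replacing it.
    """
    cutoff = VERBOSITY_LEVELS.index(level)
    allowed = set(VERBOSITY_LEVELS[: cutoff + 1])
    return {
        tag: text
        for tag, text in sections.items()
        if tag.rsplit("_", 1)[-1] in allowed
    }
```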

* refactor: remove leaderboard tab from landing page

* feat: floating glassmorphism navbar and dark/light mode toggle

- Navbar is now floating with rounded corners and frosted glass effect
- Added dark mode support with localStorage persistence
- Sun/moon toggle button in the nav bar
- Dark mode covers all surfaces, text, borders, cards, and code panels
- Respects reduced-motion preferences

* fix: dark mode glassmorphism for cards and stats strip

- Stats strip uses transparent bg with proper dark dividers
- Feature cards (Environments/Agents/Tasks) get glass-card effect in dark mode
- Environment grid cards get glass-card effect in dark mode
- Environment selector bar styled for dark mode
- Added violet dark mode color tokens

* ui: move verbosity slider above description body

Control appears before the content it filters, improving discoverability.

* feat: add Wet Chemistry environment to landing page

- Added wetlab entry to env_meta.json
- Regenerated data.js (14 tools, 30 tasks, 190 subtasks, 3 levels)
- Totals now: 8 environments, 97 tools, 115 tasks, 786 subtasks

* fix: remove Tool Verbosity from architecture, trim principles, rename wetlab

- Remove Tool Verbosity bullet from Decoupled Architecture section
- Remove Efficiency and Simplicity from foundational principles
- Rename wetlab to "Qualitative Analysis" with proper description
- Regenerated data.js

* fix: center foundational principles grid for 4 items

* copy: change heading to "A framework for..."

* feat: render inline code in tool descriptions with monospace styling

Backtick-wrapped content in docstring sections now renders as styled
<code> tags with monospace font and subtle background, matching GitHub
inline code appearance. Works in both light and dark mode.

* feat: default dark mode to system preference

Falls back to prefers-color-scheme when no localStorage override exists.
Manual toggle still persists the user's choice.

* fix: resolve missing scoring functions across all environments

- AFM: rename check_equation to check_mathematical_eq to match registry key
- extract_data.py: parse SCORING_FUNCTIONS dict from env.py to build
  registry-key aliases for scoring functions automatically
- Normalize scoring_function field to string (fixes spectra integer IDs)
- All 8 environments now have complete scoring function coverage

* feat: add verifiers count to stats strip and fix single-row layout

* fix: render per-argument descriptions for multi-arg tools

ARGS_*/RETURNS_* sections were stored as single strings, so only the
last argument's description survived when a tool had multiple params.
Now collected as arrays in extract_data.py and rendered per-argument
in the frontend with arg name labels and clean verbosity formatting.

* feat: deploy landing page and docs via GitHub Actions Pages

Replace mkdocs gh-deploy with actions/deploy-pages to serve both
the landing page (root) and MkDocs docs (/docs/) from a single
GitHub Pages deployment. CI now also generates data.js from task sources.

* chore: update environment display names to match paper conventions

* ci: enable site deployment on push to dev branch (#344)

* feat: intervention experiment pipeline (#334)

* feat: intervention experiment pipeline

Add scripts to run intervention experiments that inject steps from
successful/failed traces into new agent runs to measure knowledge
vs reasoning gaps across scientific environments.

Pipeline: select tasks (from reports_v2) -> run baseline -> pick
traces from baseline -> run intervention conditions -> analyze.

* feat: per-agent server ports, env venv setup, resistor argparse

- Each env now has two server ports (react/toolcalling) to allow safe
  parallel runs — the server is stateful and concurrent agents would clash
- Add scripts/setup_envs.sh for one-time venv creation (uv for spectra/
  resistor, micromamba for wetlab due to conda-only reaktoro)
- launch_sweep.sh gains --start-servers/--stop-servers/--server-status
- Resistor env.py uses argparse with --mode single/chained (no path needed)
- Wetlab pyproject.toml updated with corral dep and uv.sources

* fix: bash 3 compatibility for launch_sweep.sh

Replace declare -A (bash 4+) with case-based lookup functions.
Tested on macOS bash 3.2.57. Also add generated task_selection.json.

* fix: count dry-run launches in launch_sweep.sh

* feat: smoke-tested baseline pipeline with Bedrock

- setup_envs.sh: upgrade promptstore + install boto3 for Bedrock
- launch_sweep.sh: add --trials flag for smoke testing (e.g. --trials 1)
- run_intervention.py: cap k_values at trials count to avoid validation error
- Verified end-to-end: setup venvs → start servers → launch baselines → reports
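The k_values cap is essentially a one-line guard; a sketch (the helper name and variable names are assumptions, not the script's actual code):

```python
def cap_k_values(k_values: list[int], trials: int) -> list[int]:
    """Drop any pass@k value that exceeds the number of trials.

    pass@k is undefined when k exceeds the trial count, so requesting
    e.g. k=4 with --trials 1 would otherwise fail validation downstream.
    """
    return [k for k in k_values if k <= trials]
```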

* chore: update lock files and add promptstore index

Updated uv.lock files across all task environments after
upgrading promptstore. Added generated prompts/index.json.

* chore: baseline runs

* chore: pass plot

* chore: checkpoint intervention runs

* chore: push update

* chore: intervention first batch

* chore: retro intervention runs

* chore: more intervention runs

* chore: new plots

* chore: plot results

* refactor: move intervention analysis to HF-backed pipeline

- Add intervention plotting scripts to analysis/ (pass@k, pass^k,
  recovery curves, baseline compact, statistical tests)
- Create intervention_utils.py and aggregate_intervention_results.py
  to read from HF-downloaded JSONL instead of local filesystem
- Add download_intervention_reports_from_hf.py for fetching data
- Update Snakefile with intervention analysis rules
- Remove reports_v3/intervention/runs/ (4.3 GB, now on HF)
- Remove reports_v3/intervention/analysis/ (moved to analysis/)

* chore: remove Snakefile rules for deleted plot scripts

Remove rules and outputs for plot_avg_output_tokens,
plot_avg_tool_calls_per_task, plot_action_distributions,
plot_behavior_summary_panel, and plot_env_verbosity_performance
whose scripts were already deleted.

* feat: add grouped recovery curves and baseline plots

New scripts that average metrics across environment groups:
- Hypothesis-driven inquiry (spectra, wetlab, resistor)
- Strategic reasoning (retrosynthesis)
- Workflow construction (catalyst, md, ml)

* chore: add HF push scripts and clean up stale intervention artifacts

Add scripts to push intervention reports and traces to HuggingFace.
Remove stale agent logs, pid files, and temporary documents.

* chore: remove reports_v3 and update stale intervention paths

Delete reports_v3/intervention (now lives under analysis/intervention).
Update RUNS_ROOT and docstrings to reference the new location.

* fix: remove stale context window error test

The test expected a bare Message return but get_llm_response now wraps
it in LLMResponse.

* fix: restore analysis plot scripts accidentally deleted in refactor

These files were removed in 09ea114a5 but are still referenced by the
Snakefile and present on dev.

* feat: add guidelines plus the app (#340)

* feat: add guidelines plus the app

* chore: remove useless comments

* fix: solve logging

* fix: solve multi-file issues

* fix: solve path problem

* feat: add files for analysis

* feat: add new annotations

* feat: add a new iteration

* feat: add new annotations

* feat: update app

* feat: add antipattern excerpt figure

* feat: solve comments

* feat: update the plots and tables from the epistemology analysis

* feat: add annotations and analysis

* chore: remove data as it is in HF

---------

Co-authored-by: Nawaf Alampara <pvt.nawaf@gmail.com>

* feat: add scripts for solving last ToDos (#342)

* feat: add the domain summary table

* chore: update colors

---------

Co-authored-by: Nawaf Alampara <pvt.nawaf@gmail.com>

* feat: plot enhancements for panel 2, panel 4, and tikz figure (#343)

* feat: intervention experiment pipeline

Add scripts to run intervention experiments that inject steps from
successful/failed traces into new agent runs to measure knowledge
vs reasoning gaps across scientific environments.

Pipeline: select tasks (from reports_v2) -> run baseline -> pick
traces from baseline -> run intervention conditions -> analyze.

* feat: per-agent server ports, env venv setup, resistor argparse

- Each env now has two server ports (react/toolcalling) to allow safe
  parallel runs — the server is stateful and concurrent agents would clash
- Add scripts/setup_envs.sh for one-time venv creation (uv for spectra/
  resistor, micromamba for wetlab due to conda-only reaktoro)
- launch_sweep.sh gains --start-servers/--stop-servers/--server-status
- Resistor env.py uses argparse with --mode single/chained (no path needed)
- Wetlab pyproject.toml updated with corral dep and uv.sources

* fix: bash 3 compatibility for launch_sweep.sh

Replace declare -A (bash 4+) with case-based lookup functions.
Tested on macOS bash 3.2.57. Also add generated task_selection.json.
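The case-based replacement reads roughly like this; the environment names and port numbers below are made up for illustration and are not the actual tables in launch_sweep.sh:

```shell
#!/usr/bin/env bash
# bash 3 has no `declare -A` (associative arrays are bash 4+), so a
# read-only map can be emulated with a case-based lookup function.
# Env names and ports here are illustrative placeholders.
env_port() {
  case "$1" in
    spectra)   echo 8001 ;;
    resistor)  echo 8002 ;;
    wetlab)    echo 8003 ;;
    *)         echo "unknown env: $1" >&2; return 1 ;;
  esac
}

env_port spectra
```

Unlike `declare -A`, this pattern works identically on the bash 3.2 that ships with macOS and on bash 4+.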

* fix: count dry-run launches in launch_sweep.sh

* feat: smoke-tested baseline pipeline with Bedrock

- setup_envs.sh: upgrade promptstore + install boto3 for Bedrock
- launch_sweep.sh: add --trials flag for smoke testing (e.g. --trials 1)
- run_intervention.py: cap k_values at trials count to avoid validation error
- Verified end-to-end: setup venvs → start servers → launch baselines → reports
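The k_values cap can be sketched as below; `cap_k_values` is a hypothetical helper name, since the commit only describes the behavior inside run_intervention.py, not its code:

```python
def cap_k_values(k_values, trials):
    """Clamp each requested k to the number of trials actually run.

    pass@k is undefined for k greater than the trial count, so a
    smoke test with --trials 1 must not request k=5. The set() also
    deduplicates values that collapse after clamping.
    """
    return sorted({min(k, trials) for k in k_values})
```

For example, a default `k_values=[1, 5, 10]` with a single smoke-test trial collapses to just `[1]`.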

* chore: update lock files and add promptstore index

Updated uv.lock files across all task environments after
upgrading promptstore. Added generated prompts/index.json.

* chore: baseline runs

* chore: pass plot

* chore: checkpoint intervention runs

* chore: push update

* chore: intervention first batch

* chore: retro intervention runs

* chore: more intervention runs

* chore: new plots

* chore: plot results

* refactor: move intervention analysis to HF-backed pipeline


* feat: plot enhancements for panel 2, panel 4, and tikz figure

Panel 2:
- Fix OOM crash in logprob scripts by streaming logprobs.jsonl via
  load_logprobs_stats() instead of loading full per-token arrays into RAM
- Scatter: x-axis reduced to 2 ticks, y-axis fixed to 0.2 intervals,
  legend handles explicitly coloured to match scatter dots
- Coverage heatmap: S-labels rotated 90°
- Task category line: remove grey background grid
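The streaming fix for the OOM crash can be sketched as follows. `load_logprobs_stats()` is the function name from this commit, but the `token_logprobs` field name and the exact statistics returned are illustrative assumptions, not the repository's actual schema:

```python
import json
import math


def load_logprobs_stats(path):
    """Stream logprobs.jsonl line by line, keeping only running
    aggregates (count, sum, sum of squares) instead of loading the
    full per-token arrays into RAM.

    NOTE: the `token_logprobs` field name is an assumption made for
    illustration; the real file layout may differ.
    """
    count, total, total_sq = 0, 0.0, 0.0
    with open(path) as fh:
        for line in fh:
            record = json.loads(line)
            for lp in record.get("token_logprobs", []):
                count += 1
                total += lp
                total_sq += lp * lp
    mean = total / count if count else 0.0
    var = total_sq / count - mean * mean if count else 0.0
    return {"n": count, "mean": mean, "std": math.sqrt(max(var, 0.0))}
```

Peak memory is now bounded by the longest single line rather than the whole file.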

Panel 4 (IRT/LFM):
- Variance decomposition labels now include notation symbols
  (γ_s, δ_ℓ, ξ_v, κ_c, e, t, θ_K, θ_R) in both radar_and_variance
  and final_report scripts
- Switch default model from model3 to model7_abilities_env_level
- Fix arviz API: hdi_prob → prob

Tikz figure:
- Fix theta notation to match table (θ^(K/R) → θ_K/R)
- Add λ/ψ slope labels on capability → model arrows
- Add Category (κ_c), Task (t), … to covariates box
- Text colour updated to lama_aesthetics grey (#758D99)

deps: add datasets, netcdf4, snakemake


* chore: minor plot enhancements

* chore: enhancements to plots

* chore: apply consistent colour scheme across intervention and panel 2 grouped plots



* fix: resolve pre-commit failures (ruff lint + formatting)


---------

* feat: add scripts to do ai scientists search (#339)

* feat: add scripts to do ai scientists search

* feat: add tables for examples + snakemake

* feat: add new plot

* feat: add script to push to HF + remove data

---------

Co-authored-by: Nawaf Alampara <pvt.nawaf@gmail.com>

* chore: remove duplicate deps, pin litellm<1.82, clean gitignore (#346)

- Remove duplicate litellm, requests, modal entries in pyproject.toml
- Pin litellm to >=1.56.4,<1.82
- Remove duplicate mkdocs-gen-files and mkdocstrings entries
- Remove duplicate classifier entry
- Add *.aux to .gitignore and untrack analysis/tikz_figure.aux

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add epistemic trace explorer to landing page (#345)

* feat: add epistemic trace explorer to landing page

Replace the placeholder Results tab with an interactive Explainers tab
that visualizes epistemological graphs of LLM agent reasoning traces.

Key design decisions:

Data pipeline (extract_traces.py):
- Pulls annotated traces from HF dataset (jablonkagroup/corral-reasoning-annotations)
- Strips raw messages field (~50-100KB/trace) since support quotes in
  nodes/edges already provide grounding text
- Truncates node text (200 chars) and quotes (150 chars)
- Caps pattern instance lists at 5 per pattern type to control file size
- Selects 54 traces (18/model) via diversity scoring across 8 environments
- Result: 1.2MB traces.js (vs ~150MB if all 619 traces with messages)
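The size-control steps in this pipeline can be sketched as below. The limits (200/150 characters, 5 instances per pattern, dropping `messages`) come from the commit message; the trace dict layout and the helper names are assumptions for illustration:

```python
# Illustrative sketch of extract_traces.py's size controls; the
# actual trace schema in the HF dataset may differ.
MAX_NODE_TEXT = 200
MAX_QUOTE = 150
MAX_PATTERN_INSTANCES = 5


def truncate(text, limit):
    # Keep a visible marker so truncation is obvious downstream.
    return text if len(text) <= limit else text[: limit - 1] + "…"


def shrink_trace(trace):
    trace = dict(trace)  # shallow copy; nodes are edited in place
    trace.pop("messages", None)  # raw messages dominate file size
    for node in trace.get("nodes", []):
        node["text"] = truncate(node.get("text", ""), MAX_NODE_TEXT)
        node["quotes"] = [truncate(q, MAX_QUOTE) for q in node.get("quotes", [])]
    trace["patterns"] = {
        name: instances[:MAX_PATTERN_INSTANCES]
        for name, instances in trace.get("patterns", {}).items()
    }
    return trace
```

Dropping the raw `messages` field is the big win here, since the support quotes on nodes and edges already carry the grounding text.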

Visualization (index.html):
- Nested Alpine.js scope (traceExplorer) isolated from corralApp
- D3.js v7 for graph rendering with two layout modes:
  - Temporal: X=message time, Y=node type lanes (H/T/E/J/U/C)
  - Force-directed: draggable physics simulation
- 6 node types with distinct color palettes (light + dark mode)
- 6 edge relations with unique stroke colors and dash patterns
- Pattern highlighting: click productive/breakdown patterns to glow
  involved nodes (green/red halos) and dim unrelated nodes
- MutationObserver re-renders graph on dark mode toggle
- Cascading filters: model -> environment -> level
- 3-column glass layout matching existing landing page design system

* fix: repair graph toolbar buttons (zoom, fit, layout toggle)

- Replace viewBox with explicit width/height so D3 zoom transforms
  are not visually cancelled by SVG auto-scaling
- Fit button now computes bounding box and centers content
- Layout toggle stops force simulation before switching
- Add fill:none on temporal edge paths to prevent arc fill

* feat: show full node text in collapsible panel, improve trace curation

- Remove text/quote truncation — show full node text and support quotes
- Reduce to 10 traces per model (30 total, 620KB) since full text fits
- Guarantee environment coverage: pick 1 best per env before filling by score
- Add collapsible Node Text section in detail panel (expanded by default)
- Show all support quotes (was limited to 3), increase max-h for readability

* feat: add node-type-adaptive color accent to text panel

- Left border tinted to node type color (violet/blue/amber/cyan/emerald/rose)
- Header background gets subtle node-color wash (8% light, 12% dark)
- Improves visual connection between graph node and detail panel

* feat: use descriptive environment display names

Map raw env keys (afm, catalyst, md, etc.) to full display names
(AFM Experiment Execution, Adsorption Surface Construction, etc.)
in filter dropdown and trace list.

* feat: rename level→scope in UI, guarantee scope coverage in curation

- Rename all user-facing "Level" labels to "Scope" across both
  Environments and Explainers tabs (matches paper terminology)
- Display level_1 as "scope 1" throughout
- Curation now picks 1 best trace per (env, scope) pair before
  filling remaining slots by score — all 17 pairs covered per model
- Increase to 20 traces/model (60 total, 1.2MB) to fit all pairs
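The two-pass curation rule above can be sketched as follows; trace dicts with `env`, `scope`, and `score` keys are an assumed shape, and `curate` is a hypothetical name:

```python
def curate(traces, per_model=20):
    """Pick one best-scoring trace per (env, scope) pair first,
    then top up to the quota with the best remaining traces."""
    picked, seen_pairs = [], set()
    by_score = sorted(traces, key=lambda t: -t["score"])
    # First pass: guarantee coverage of every (env, scope) pair.
    for t in by_score:
        pair = (t["env"], t["scope"])
        if pair not in seen_pairs:
            seen_pairs.add(pair)
            picked.append(t)
    # Second pass: fill remaining slots purely by score.
    remaining = [t for t in by_score if t not in picked]
    picked.extend(remaining[: max(0, per_model - len(picked))])
    return picked[:per_model]
```

With 17 (env, scope) pairs and a quota of 20 per model, the first pass covers every pair and the second pass adds the three best leftovers.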

* feat: add hash-based permalinks for all tabs

Navigate directly to tabs via URL hash:
  /index.html#explainers → Explainers tab
  /index.html#environments → Environments tab

- Read hash on init to set active tab
- Push hash to history on tab change
- Handle browser back/forward via popstate

* fix: point docs links to mkdocs site at /corral/docs/

All four docs links were pointing to the landing page itself (/corral/).
Updated to point to the mkdocs-deployed documentation at /corral/docs/.

* feat: embed trace annotator as 'Annotate' tab in landing page

Integrates docs/trace-visualizer into the landing page as a new tab with
3-column layout (Details | Graph | Annotation), glassmorphism styling,
dark mode support, and ann- namespace isolation to avoid D3/Alpine conflicts.

* fix: include traces.js and annotator.js in CI site assembly

The deploy workflow was only copying 4 files to _build/, missing the
new traces.js and annotator.js needed by the Explainers and Annotate tabs.

* fix: force-add annotator.js ignored by /site gitignore rule

* chore: remove /site from gitignore

The /site rule was a leftover from mkdocs defaults, but this project
builds mkdocs to _build/docs. The site/ directory contains landing page
source files that were all force-added — removing the rule avoids that.

* chore: switch mkdocs palette from gruvbox_dark to dark

Better visual consistency with the landing page's dark slate theme.

* feat: add statistics + illustrative traces (#347)

* feat: add statistics + illustrative traces

* fix: update graph tables

* feat: possible solution for the tool types (#333)

* feat: possible solution for the tool types

* fix: remove hardcoded code

* feat: add corrections for all the environments + fix the tool implementation

* fix: apply suggestions from code review

* fix: change imports in MD

---------

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Nawaf Alampara <86834161+n0w0f@users.noreply.github.com>
Co-authored-by: Chandan Gupta <chandan18386@iiitd.ac.in>
Co-authored-by: Nawaf Alampara <pvt.nawaf@gmail.com>
Co-authored-by: Sadra <139479461+aaaghajani@users.noreply.github.com>
Co-authored-by: "imandal98" <indrajeetmandal.aaa@gmail.com>
Co-authored-by: Indrajeet Mandal <143293460+imandal98@users.noreply.github.com>
Co-authored-by: Kevin M Jablonka <32935233+kjappelbaum@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>