
feat: add statistics + illustrative traces #347

Merged
MrtinoRG merged 3 commits into dev from last_todoss on Apr 21, 2026

Conversation

Collaborator

@MrtinoRG MrtinoRG commented Apr 19, 2026

Summary by Sourcery

Add analysis assets for the paper, including illustrative epistemic reasoning traces and a script to compute corpus-level statistics from experiment reports.

New Features:

  • Introduce a LaTeX appendix section showcasing representative reasoning breakdown traces with annotated epistemic graphs for each major failure category.
  • Add a Python analysis script that aggregates token usage, configuration–environment coverage, estimated API costs, and malformed-response rates from reports.jsonl.

Enhancements:

  • Provide LaTeX-ready summary metrics and structured logging output to streamline transferring analysis statistics into the paper.


sourcery-ai Bot commented Apr 19, 2026

Reviewer's Guide

Adds an appendix-style LaTeX file with illustrative reasoning traces for four breakdown categories, and introduces a Python script that computes paper statistics (token usage, config–environment counts, estimated API cost, and scaffold error rates) from the experiment reports JSONL file.

Class diagram for compute_paper_stats module structure

classDiagram
  class compute_paper_stats_module {
    +dict PRICING
    +Path DATA_PATH
    +set config_env_pairs
    +defaultdict tokens
    +defaultdict scaffold_errors
    +defaultdict trials_affected_count
    +int total_trials
    +enc
    +count_tokens(text str) int
    +fmt_tokens(n int) str
    +agg_model(model str) (int input_tokens, int output_tokens)
    +agg_all() (int input_tokens, int output_tokens)
    +agg_verbosity(verb str) (int input_tokens, int output_tokens)
  }

  class tiktoken_encoder {
    +encode(text str) list~int~
  }

  class loguru_logger {
    +info(message str)
  }

  class reports_source {
    +path str
    +open()
  }

  compute_paper_stats_module --> tiktoken_encoder : uses
  compute_paper_stats_module --> loguru_logger : logs_via
  compute_paper_stats_module --> reports_source : reads_from

Flow diagram for compute_paper_stats statistics pipeline

flowchart TD
  A_Start([Start compute_paper_stats.py]) --> B_ReadFile
  B_ReadFile["Open results/data/reports.jsonl"] --> C_LoopLines

  subgraph S_LineProcessing[Per JSONL record]
    C_LoopLines --> D_Parse["Parse JSON line to rec"]
    D_Parse --> E_ExtractMeta["Extract model, agent_type, environment, verbosity"]
    E_ExtractMeta --> F_AddConfigEnv["Add (model, agent_type, verbosity, env) to config_env_pairs"]
    F_AddConfigEnv --> G_LoopTasks["Loop over Task Results"]

    subgraph S_TaskTrials[Per task and trial]
      G_LoopTasks --> H_LoopTrials["Loop over trials"]
      H_LoopTrials --> I_IncTotalTrials["Increment total_trials"]
      I_IncTotalTrials --> J_GetMessages["Get trial messages"]
      J_GetMessages --> K_TokenizeMsgs["Compute msg_tok via count_tokens(content)"]

      K_TokenizeMsgs --> L_AssistantTurns["For each assistant message"]
      L_AssistantTurns --> M_SumInput["Input tokens += sum(prior msg_tok)"]
      M_SumInput --> N_AddOutput["Output tokens += this assistant msg_tok"]

      K_TokenizeMsgs --> O_ScaffoldInit["Initialize trial_has_error=False"]
      O_ScaffoldInit --> P_ScanScaffold["Scan messages for scaffold errors"]
      P_ScanScaffold --> Q_UpdateScaffold["Update scaffold_errors[(model, agent_type)]"]
      Q_UpdateScaffold --> R_UpdateTrialsAffected{agent_type == react?}
      R_UpdateTrialsAffected -- Yes --> S_UpdateReact["Update trials_affected_count[model]"]
      R_UpdateTrialsAffected -- No --> T_NextTrial["Next trial"]
      S_UpdateReact --> T_NextTrial
    end
  end

  T_NextTrial --> U_NextLine{More lines?}
  U_NextLine -- Yes --> C_LoopLines
  U_NextLine -- No --> V_Aggregation

  subgraph S_Aggregation[Aggregation helpers]
    V_Aggregation["Compute aggregates via agg_model, agg_all, agg_verbosity"]
  end

  V_Aggregation --> W_LoggingSections

  subgraph S_Logging[Reporting via loguru]
    W_LoggingSections["Log configuration counts, token tables, costs, error rates, LaTeX-ready values"]
  end

  W_LoggingSections --> X_End([End])
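The cost-estimation step in the pipeline above can be sketched as follows. This is an illustrative stand-in: the `PRICING` values and the `estimate_cost` helper are placeholders, not the script's actual table or API.

```python
# Hypothetical per-million-token pricing table (USD); the real
# script's PRICING values are not shown in this PR summary.
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def estimate_cost(model, input_tokens, output_tokens):
    # Cost = tokens / 1M * price-per-million, split by input/output,
    # as described for the proprietary-model cost estimate.
    p = PRICING[model]
    return (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]
```

For example, 2M input tokens and 500K output tokens at the placeholder rates come out to 10.00 USD.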

File-Level Changes

Change Details Files
Introduce LaTeX appendix with illustrative epistemic traces for four reasoning-breakdown categories, including custom node badges, color scheme, and annotated examples with diagrams and annotator quotes.
  • Define color palette and a reusable \nodebadge macro for epistemic node types (H, T, E, J, C, F).
  • Add a new subsection describing how to read the trace visualizations and linking to the online browser.
  • For each breakdown category (evidence non-uptake, untested claim, fixed belief trace, contradiction without repair), add a tcolorbox with model/context metadata, excerpted messages, node/edge annotations, a small TikZ diagram, and an annotator quote explaining the pattern.
analysis/illustrative_traces.tex
Add an analysis script to compute aggregate experiment statistics (token counts, configuration–environment pairs, estimated proprietary API cost, and ReAct scaffold error rates) from reports.jsonl and log them in human- and LaTeX-ready formats.
  • Read analysis/results/data/reports.jsonl and iterate over all records, aggregating by model, agent type, environment, and verbosity.
  • Compute per-assistant-turn token input/output using tiktoken with the o200k_base encoding, mirroring API usage by counting all prior messages as input and the current assistant message as output.
  • Aggregate token statistics by model and verbosity, including grand totals, and pretty-print values using a compact formatter (K/M/B).
  • Estimate API cost for proprietary models using a per-million-token pricing table, including a breakdown by verbosity.
  • Detect ReAct scaffold errors by scanning for 'No actions to execute' in user messages, computing per-model error rates and percentage of affected trials.
  • Log a summary section with LaTeX-ready scalar values for tokens, number of config–environment pairs, total API cost, and malformed-response rates for selected models.
analysis/compute_paper_stats.py
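The token-accounting approach described above can be sketched roughly as below. All names (`count_tokens`, `fmt_tokens`, `tally_trial`) are illustrative stand-ins, and the whitespace tokenizer only approximates the script's actual tiktoken `o200k_base` encoding so the example stays self-contained.

```python
from collections import defaultdict

def count_tokens(text):
    # Stand-in tokenizer for illustration; the script instead uses
    # tiktoken.get_encoding("o200k_base").encode(text).
    return len(text.split())

def fmt_tokens(n):
    # Compact K/M/B formatting, mirroring the described pretty-printer.
    for div, suffix in ((1e9, "B"), (1e6, "M"), (1e3, "K")):
        if n >= div:
            return f"{n / div:.1f}{suffix}"
    return str(n)

def tally_trial(messages, tokens, model):
    # Mirror API usage: each assistant turn counts all prior messages
    # as input and its own content as output.
    msg_tok = [count_tokens(m["content"]) for m in messages]
    for i, m in enumerate(messages):
        if m["role"] == "assistant":
            tokens[model]["input"] += sum(msg_tok[:i])
            tokens[model]["output"] += msg_tok[i]

tokens = defaultdict(lambda: {"input": 0, "output": 0})
trial = [
    {"role": "user", "content": "measure the spectrum"},
    {"role": "assistant", "content": "running the tool now"},
    {"role": "user", "content": "tool output: peak at 3000"},
    {"role": "assistant", "content": "the peak suggests an O-H stretch"},
]
tally_trial(trial, tokens, "gpt-4o")
```

With the stub tokenizer, the second assistant turn pays for all three preceding messages as input, which is exactly the accumulation pattern the reviewer later flags as quadratic.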



coderabbitai Bot commented Apr 19, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5b52b2fc-41ef-4670-befc-f331b21ccbb4



@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 1 issue, and left some high-level feedback:

  • The per-assistant-turn input token calculation currently does sum(msg_tok[:i]) inside the loop, which is O(n²) over messages; consider precomputing a prefix sum array so each input count is O(1) and the loop stays linear in the number of messages per trial.
  • All the JSONL parsing and aggregation runs at import time; wrapping this logic in a main() function and guarding with if __name__ == "__main__": would make the module safer to import and easier to reuse programmatically.
  • The top-level docstring still says "Compute the TODO statistics"—updating this to accurately describe the current outputs (tokens, config–env pairs, cost, malformed-response rates) will make the script’s purpose clearer to future readers.
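The first two suggestions can be sketched together as below, with hypothetical function names; a prefix-sum array replaces the quadratic `sum(msg_tok[:i])`, and a `main()` guard keeps the pipeline from running at import time.

```python
import itertools

def input_output_tokens(msg_tok, roles):
    # Prefix sums make each assistant turn's input count O(1),
    # so the loop stays linear instead of O(n^2) per trial.
    prefix = [0, *itertools.accumulate(msg_tok)]
    inp = out = 0
    for i, role in enumerate(roles):
        if role == "assistant":
            inp += prefix[i]   # tokens of all prior messages
            out += msg_tok[i]  # tokens of this assistant reply
    return inp, out

def main():
    # Guarding the pipeline keeps JSONL parsing and aggregation
    # from running when the module is merely imported.
    inp, out = input_output_tokens(
        [3, 4, 5, 6], ["user", "assistant", "user", "assistant"]
    )
    print(inp, out)

if __name__ == "__main__":
    main()
```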
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The per-assistant-turn input token calculation currently does `sum(msg_tok[:i])` inside the loop, which is O(n²) over messages; consider precomputing a prefix sum array so each input count is O(1) and the loop stays linear in the number of messages per trial.
- All the JSONL parsing and aggregation runs at import time; wrapping this logic in a `main()` function and guarding with `if __name__ == "__main__":` would make the module safer to import and easier to reuse programmatically.
- The top-level docstring still says "Compute the TODO statistics"—updating this to accurately describe the current outputs (tokens, config–env pairs, cost, malformed-response rates) will make the script’s purpose clearer to future readers.

## Individual Comments

### Comment 1
<location path="analysis/illustrative_traces.tex" line_range="3" />
<code_context>
+% Illustrative trace excerpts for the four major reasoning-breakdown categories.
+% Requires: tcolorbox, tikz, xcolor, enumitem, listings
+% Usage: \input{analysis/results/illustrative_traces.tex}
+
+\definecolor{colENU}{RGB}{198,219,239}   % Evidence non-uptake — light blue
</code_context>
<issue_to_address>
**issue (bug_risk):** The documented \input path does not match the actual file location in the repo.

The usage line references `\input{analysis/results/illustrative_traces.tex}`, but the file is at `analysis/illustrative_traces.tex`. This mismatch will cause the include to fail if copied. Please either correct the documented path or move the file to match it.
</issue_to_address>


@@ -0,0 +1,371 @@
% Illustrative trace excerpts for the four major reasoning-breakdown categories.
% Requires: tcolorbox, tikz, xcolor, enumitem, listings
% Usage: \input{analysis/results/illustrative_traces.tex}

@MrtinoRG MrtinoRG merged commit 6ddd5a1 into dev Apr 21, 2026
8 of 11 checks passed
@MrtinoRG MrtinoRG deleted the last_todoss branch April 21, 2026 13:11
MrtinoRG added a commit that referenced this pull request Apr 22, 2026
* build(deps): bump actions/cache from 3 to 5 (#259)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* build(deps): bump actions/checkout from 5 to 6 (#239)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: refactor metrics and report (#236)

* feat: refactor metrics and report

* fix: remove dummy files

* fix: standardize files

* chore: correct docs

* feat: simplify task_metric calculation

* feat: expose the metrics in the runner

* feat: incorporate report saving

* feat: add messages to the report

* fix: debug

* fix: bug when content in reasoning + handle special characters in model name

* fix: solve issues in the report generation

* fix: remove commented code

* fix: apply suggestions from code review

* feat: add hardcoded context window for GPT-oss-120b

* fix: apply suggestions from docs code review

* fix: apply suggestions from code review

* feat: attach logprobs in message, add to hook context (#263)

* feat: attach logprobs in message, add to hook context

* fix: minor bug in logger

* chore: example script for run with logprob

* fix: update test mock object to have logprob

* feat: add after iteration hook in toolcalling as well

* fix: update test

* feat: llm response object and messages have id (#266)

* feat: llm response object and messages have id

* fix: lint and update tests

* fix: update tests

* fix: sourcery suggestion

* fix: update default in example script

* added docstrings for scoring functions (#272)

* feat: load from trace and fix toolcalling intervention (#271)

* feat: load from trace and fix toolcalling intervention

* chore: workdir autoset

* feat: make the visualization tool general (#261)

* feat: spectra runs (02022026) (#200)

* feat: organize the spectra environments as others

* fix: correct isomers tool

* feat: add gpt-4o tool calling brief

* fix: correct TOML file

* feat: add tests to the spectra elucidation environment (#212)

* feat: add tests to the spectra elucidation environment

* fix: solve issues raised in code review

* feat: add level 1 claude tasks

* chore: remove old files

* feat: add budget exhaust limit

* feat: make the error handling more robust

* feat: add runs and reports

* fix: solve CI

* fix: add corral as dependency

* feat: add subtask scores

* fix: remove old files

* fix: remove dummy files

* feat: add runs gptOss level 1

* feat: add agent logs substask gpt_oss

* feat: add more level 1 reports

* feat: add level 1 subtasks

* feat: add logprobs brief

* feat: add level1 logprobs

* feat: add level 1 metrics

* feat: add level 2 tasks

* feat: add level 2 tasks logprobs

* feat: add level 2 metrics

* feat: add level 2 subtasks logprobs

* feat: add level 2 comprehensive logprobs

* feat: add subtask 1 logprobs

* feat: add subtask 1 logprobs

* feat: add last logprobs

* feat: update scoring functions

* chore: remove uploaded logprobs

* fix: first scoring reruns

* fix: correct Claude runs

* fix: solve scoring for GPT-4o

* fix: rerun gpt_oss

* fix: solve tests

* feat: add first files of the retro env (#251)

* feat: add first files of the retro env

* fix: solve hooks

* fix: solve hooks score function

* feat: add cas management tools

* feat: add tools and target molecules

* feat: add functional group detector

* feat: add tool descriptions

* feat: add the level 1 tasks

* feat: add database setup files

* feat: add the last visuals

* fix: correct the template ids for the known reactions

* feat: add dataset filtering functions

* feat: add SMARTS checking to the search tool

* feat: add level 3

* feat: level 3

* feat: add final tasks

* fix: solve some issues + add script for custom reactions

* fix: remove old runs

* fix: solve task 1

* feat: add tool decorator

* fix: remove conflicting tasks

* feat: update visual and helpers

* feat: add some runs

* chore: update keywords of the docstring

* feat: add new runs

* feat: add subtasks

* fix: remove old files

* fix: correct subtask logic

* fix: correct logic in level 2 subtasks

* feat: update level_3

* feat: add level 1 task runs

* feat: add level 1 results for claude

* feat: add the origin of the SMARTS patterns

* feat: add level 2-task results

* feat: add level 3 task brief runs

* fix: correct scoring in the level 2

* feat: add reports level 2 subtask (begin)

* chore: move some files for test of the annotation app

* feat: add more reports

* feat: add level 1 subtasks gpt reports

* feat: add subtask level 2 gpt

* feat: add level 2 subtask claude

* chore: rerun subtask level2

* feat: rerun level 1 subtasks

* feat: rerun level 1 subtasks

* feat: add level 3 reports

* feat: add tests to the retro env + add database checks (#210)

* feat: add test to the retro env + check database availability at the beginning

* fix: move credentials into database config

* fix: apply suggestions from code review

* fix: remove dummy files

* feat: add some plots with the results

* feat: remove old reports

* feat: add the corrected retro environment

* feat: add  claude reports

* feat: add gpt-4o reports

* fix: remove files from other commit

* feat: add claude last results

* feat: add gpt-oss level 2 subtasks

* fix: solve organization of level 1 subtasks for GPT Oss

* feat: add level 2 tasks gpt-oss

* feat: add gpt-oss subtask level 2

* feat: some level 3 runs

* feat: new score

* chore: remove dummy file

* feat: add level 3 runs

* feat: rerun scoring of subtasks

* chore: remove uploaded logprobs

* fix: add last changes

* fix: solve tests

* feat: add pytest to the dependencies

* feat: add chemprice to the dependencies

* fix: solve the requests versioning

* fix: add swifter as a build dependency

* fix: add pandas as a build dependency

* fix: add others as a build dependency

* fix: add pyarrow as a build dependency

* fix: add psutil as a build dependency

* feat: add pandas as override dependency

* fix: add dummy environment variables to fix tests

* fix: solve tests that depend on the database

* fix: solve scoring for the spectra and retro subtasks (#285)

* feat: add gpt_oss ml runs  (#269)

* feat: add gpt_oss ml runs

* chore: remove logprobs and metrics

* chore: reorganize files

---------

Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>

* feat: gpt_oss reports for resistor (#268)

* feat: gpt_oss reports for resistor

* feat: rerun rate limits

* chore: remove old files

* chore: reorganize files

---------

Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>

* feat: add gpt_oss catalyst runs (#270)

* feat: add gpt_oss ml runs

* chore: update deps

* chore: remove and rename files

---------

Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>

* feat: reasoning qa , resistor, ml, catalyst (#280)

* chore: example template

* chore: utility script to push to hf

* chore: update

* chore: reasoning qa resistor

* feat: update reasoning qa

* feat: update reasoning qa

* feat: ml reasoning questions

* feat: add reasoning qa for ml

* feat: update keywords

* feat: catalyst reasoning qa

* feat: reasoning qa

* feat: update catalyst question

* wetlab reasoning qa

* chore: add requires_knowledge keywords

* chore: address review points

* fix keywords

* chore: code quality

* feat: apply suggestions from review

* chore: update review comments

* chore: update review comments

* chore: update keyword order

---------

Co-authored-by: Sadra <139479461+aaaghajani@users.noreply.github.com>
Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>

* afm and md qa - final version, comments already addressed (#287)

* afm and md qa - final version, comments already addressed

* fix: afm keyword requires_knowledge

* fix: afm keyword issues

* fix: add env as first option

---------

Co-authored-by: “imandal98” <indrajeetmandal.aaa@gmail.com>
Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>

* feat: add Spectra Reasoning QA (#279)

* feat: add Spectra Reasoning QA

* feat: add requires_reasoning keyword

* fix: apply suggestions from code review

* chore: shuffle target scores

* fix: make questions more clear + move knowledge questions

* feat: add retro reasoning qa (#276)

* feat: add questions

* feat: format tasks

* chore: untrack dummy files

* fix: add new Reasoning QA for the retro

* fix: correct some of the questions + move knowledge ones

* feat: add missing QA scores (including reasoning) (#267)

* feat: add QA scores for GPT-OSS

* feat: add all QA runs + reorganize repo

* chore: remove logprobs files

* fix: remove ambiguity with the old and new questions

* chore: remove uploaded reports

* AFM updates (#288)

* fix: subtasks

* fix: tasks

* fix: env files

* chore: solve code quality CI

* chore: remove corrupted lfs files

---------

Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>
Co-authored-by: Martiño Rios-Garcia <147555961+MrtinoRG@users.noreply.github.com>

* Md corr vis dev (#289)

* MD environment changes

* deleted additional files

* feat: expose history (#298)

* Fix issue #264: Pythoncom missing in the dependencies of AFM (#265)

* Fix issue #264: Pythoncom missing in the dependencies of AFM

* Fix: CoUninitialize

* Fix: CoUninitialize in tools

* Fix: Wrap COM initialization in try/finally to ensure proper cleanup

* fix: solve ci

---------

Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>

* build(deps): bump actions/checkout from 4 to 6 (#277)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: expose history

* feat: add the from_trace classmethod

* fix: remove tests that targeted history

* fix: solve error with deepcopy

--------

* feat: add code for `to_latex()` method (#262)

* feat: add the first version of code2latex

* fix: correct the code2latex

* fix: correct the code and add tables for spectra

* fix: apply suggestions from code review and remove tool cache

* feat: add tables to the single envs

* fix: correct the ml tools

* feat: add automatic longtable generation for scoring functions

* fix: move into a generate latex method

* fix: apply suggestions from code review

* feat: add verbosity level as an argument

* Update src/corral/backend/env.py

Co-authored-by: Kevin M Jablonka <32935233+kjappelbaum@users.noreply.github.com>

* Update src/corral/backend/env.py

Co-authored-by: Kevin M Jablonka <32935233+kjappelbaum@users.noreply.github.com>

* Update src/corral/backend/env.py

Co-authored-by: Kevin M Jablonka <32935233+kjappelbaum@users.noreply.github.com>

* Update src/corral/backend/env.py

Co-authored-by: Kevin M Jablonka <32935233+kjappelbaum@users.noreply.github.com>

* Update src/corral/backend/env.py

Co-authored-by: Kevin M Jablonka <32935233+kjappelbaum@users.noreply.github.com>

* Update src/corral/backend/env.py

Co-authored-by: Kevin M Jablonka <32935233+kjappelbaum@users.noreply.github.com>

---------

Co-authored-by: Kevin M Jablonka <32935233+kjappelbaum@users.noreply.github.com>

* chore: rm  older runs (#305)

* chore: rm ai4mat runs

* chore: rm md_optimized as they are not final

* fix: correct that trace is not saved when agent fail (#308)

* fix: correct that trace is not saved when agent fail

* feat: add reruns

* chore: fixes in consistencies with trial ids and missing traces (#306)

* chore: rename attempt_ -> trial ids

* chore: reran, ml environment comprehensive subtask

* chore: oss catalyst reruns

* chore: reruns catalyst subtask

* Md corrected runs (#290)

* MD FINAL RUNS

* updated gitignore

* gpt-4o runs restructured

* claude_45 runs - restructured

* gpt-oss runs - restructured

* fix: update with dev

* fix: solve dev again

* merge conflict - .gitignore

* MD reruns updated

* removed previous MD runs

---------

Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>
Co-authored-by: Martiño Rios-Garcia <147555961+MrtinoRG@users.noreply.github.com>

* feat: add script to pull data + snakefile (#309)

* chore: script to get QA score similar to reports (#310)

* chore: script to get QA score

* chore: update the script to the correct structure

* feat: add plot scripts (#312)

* feat: add appendix fig 5 plot

* feat: add appendix figure 4 plots

* feat: add more plots for analysis

* fix: apply suggestions from code review

* fix: review title case notation

* fix: apply suggestions from review

* fix: correct the plots for the new range_frame

* GPT OSS runs for AFM using a new naming convention (#320)

* fix: subtasks

* fix: tasks

* fix: env files

* chore: solve code quality CI

* chore: remove corrupted lfs files

* feat: add GPT OSS runs for AFM using a new naming convention

* fix: run pre-commit such that CI check pass

* fix: try to solve CI

---------

Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>

* Reports to hf (#302)

* script to push traces to HF

* added main

* script to push reports to HF

* improved hierarchies

* changed dir structure

* script to pull data from HF

* final version for subset name

* comments - in progress

* addressed comments

* chore: move reports around

---------

Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>

* feat: react oss - catalyst subtask and resistor comprehensive (#322)

* feat: rerun catalyst react subtask oss

* feat: resistor react oss comprehensive subtask reruns

* chore: utility shared across plots (#314)

* chore: utility shared across plots

* chore: .snakemake in gitignore

* Update analysis/plot_config.py

Co-authored-by: Kevin M Jablonka <32935233+kjappelbaum@users.noreply.github.com>

* feat: add display names, environment groups, and centralize level config

- Update ENVIRONMENT_NAMES with descriptive display names
- Add ENVIRONMENT_GROUPS for high-level categorization
- Rename ENVIRONMENT_LEVELS to ENVIRONMENT_MAX_LEVELS
- Move DEFAULT_ENV_LEVEL_MAP from plot_utils to plot_config

* refactor: use dicts for colours and simplify filter API

- Convert MODEL_COLOURS, AGENT_COLOURS, ENVIRONMENT_COLOURS from lists
  to dicts keyed by id (removes separate _MAP dicts)
- Simplify filter functions: replace "average" strategy with None,
  remove redundant override parameters

* fix: rename generic df variable to satisfy ruff PD901

---------

Co-authored-by: Kevin M Jablonka <32935233+kjappelbaum@users.noreply.github.com>

* Md re runs (#330)

* claude reruns added

* gpt 4o reruns added

* minor update to handle lammps log files

* Wetlab+Reports (#323)

* bring wetlab env from wetlab branch

* fix: correct JSON schema for simulate_color_mixture in wetlab

* wetlab: removed old qa files

* task-level reports

* subtask-level reports PART1

* subtask-level reports PART2

* feat: rerun missing traces

* fix: run pre-commit

* fix: run pre-commit in src

---------

Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>

* feat: add appendix tables (#325)

* feat: add scripts and tables

* fix: correct table

* feat: split figure 4 appendix

* feat: retro db can be pulled from ghcr

* feat: add panel 5 plots (#317)

* feat: add plots about the annotations

* fix: apply suggestions from review

* feat: apply suggestions from code review

* chore: merge 'dev' into 'marker_plots'

* feat: add second plot to the figure

* feat: update the env names

* feat: add wetlab + change env names

* feat: move legend below

* feat: correct qa table

* fix: get model names as the other figures

* feat: add epistemic analysis (#319)

* feat: add first annotation

* feat: add raw counts script

* feat: update plot

* feat: add title case

* feat: add final annotations

* feat: add creation of a latex table to the analysis script

* feat: add first draft of the figures

* feat: add the scripts to the snakemake file

* feat: update figure

* fix: apply suggestion from review to the image

* chore: update capitalization

* feat: add tables and figure for the appendix

* fix: update figure

* feat: panel 2  performance plots (#313)

* Fix issue #264: Pythoncom missing in the dependencies of AFM (#265)

* Fix issue #264: Pythoncom missing in the dependencies of AFM

* Fix: CoUninitialize

* Fix: CoUninitialize in tools

* Fix: Wrap COM initialization in try/finally to ensure proper cleanup

* fix: solve ci

---------

Co-authored-by: MrtinoRG <martinriosgarcia@gmail.com>

* build(deps): bump actions/checkout from 4 to 6 (#277)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore: script to get QA score

* chore: plots

* chore: bottleneck plot

* chore: bottleneck plot

* chore: plot utils update

* chore: move to common utils that are shared

* chore: utility shared across plots

* chore: update gitignore

* chore: more plots

* chore: try more plots

* chore: task category plot

* chore: .snakemake in gitignore

* feat: mean logprobs

* chore: update styles in plot

* chore: correlation with logprobs

* chore: correlation with logprobs

* refactor: consolidate panel 2 plots with proper layout and config

- Rename scripts to 2a/2b/2c naming convention
- Add combined 2_panel.py with gridspec layout
- Fix group labels to appear above heatmap (use transAxes)
- Reduce heatmap cell height, match font sizes
- Remove old scripts and plot outputs

* fix: update figure sizes for 2b (1/3 width) and 2c (2/3 width)

* feat: add marginal mean bar charts to 2a heatmap

Add top (column means) and right (row means) bar charts alongside
the heatmap. Move model labels to right side with group lines.
Add vertical separators between environment groups. Remove colorbar.

* feat: update environment names and two-level x-axis labels

Update ENVIRONMENT_NAMES to match paper naming conventions (e.g.
Spectroscopic Structure Elucidation, AFM Experiment Execution).
Split x-axis into S1/S2/S3 tick labels with environment names
below as 45-degree rotated labels with grouping lines.

* feat: add panel 2 subplot variants and combined layout

- 2b scatter: color-code by environment group, model/scaffold spread labels
- 2c task category: add bar and line plot variants (models averaged)
- 2d logprobs: standalone script with y-axis on right
- 2_panel: combined figure with heatmap + 3 bottom subplots
- plot_config: add GROUP_COLOURS, update wetlab/afm category tags

* chore: updates in heatmap

* chore: scatter plot

* chore: push origin main

* chore: feedback

* chore: font size increase

* chore: font size increase

* chore: font size increase

* chore: add subset logprobs plot and fix linting issues

- Add 2d_logprobs_subset.py with 5-env subset and red-to-blue gradient
- Fix ruff warnings: unused vars, ambiguous names, implicit concat, noqa

* feat: migrate panel_2 and panel_6 plots to analysis/ with Snakemake integration

- Move 9 panel_2 scripts and 1 panel_6 script into analysis/ with plot_panel2_* naming
- Extract shared classify_subtask/load_category_tags into plot_utils.py
- Remove sys.path hacks; scripts now import plot_config/plot_utils directly
- Output to analysis/results/figures/panel_2/ and panel_6/
- Add 10 Snakemake rules and update rule all targets

* chore: remove plots/ directory (scripts and outputs)

Canonical scripts are in analysis/plot_panel2_*.py and
analysis/plot_panel6_*.py with Snakefile rules that output to
analysis/results/figures/. The plots/ directory was a duplicate.

* feat: update panel_2 plots and add grouped logprobs

- Remove "(Score)" from scatter plot axis labels
- Clarify heatmap marginal bar labels (per environment / per agent)
- Add spacing between env names and S1/S2 tick labels in heatmap
- Remove combined panel plot from Snakefile
- Add grouped logprobs plot averaging by environment group

---------

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: panel 4 irt plots (#315)

* chore: utility shared across plots

* feat: add irt modeling and plots

* chore: add task_category to modelling

* chore: improve plots

* chore: make plots

* chore: script to evaluate

* chore: evals

* chore: update eval

* chore: validation

* chore: new plots

* chore: irt tikz

* feat: add lfm and lfm-binomial Bayesian latent factor models with final report plots

- Add lfm/ (Bernoulli) and lfm-binomial/ (Binomial) with 8 hierarchical models each
- Extend analysis/Snakefile with lfm and lfm-binomial rules
- Add final_report.py plotting script and 7 publication-quality plots for best model (M7)
- Remove old IRT evaluation scripts and outputs

* feat: add LFM 3-subplot panel (variance decomposition, LOO predictions, task-averaged scatter)

* chore: update IRT tikz figure with purple theme, legend, and panel label

* chore: feedback

* chore: feedback

* chore: model capabilities and stuff

* chore: plots

* feat: migrate panel_4 plots to analysis/ with Snakemake integration

Move 7 panel_4 plotting scripts into analysis/ directory, removing
importlib hacks and integrating with Snakefile. Remove capability_comparison,
capability_profiles, and task_level_residuals plots.

* chore: rm old plots

* fix: resolve pre-commit lint errors and update gitignore

Fix ruff errors (PD901, RUF001/002/003, ARG001, B026, PD011, C408)
across analysis and plot files. Add **/.snakemake/ to gitignore to
cover nested dirs, remove tracked .snakemake metadata, and delete
duplicate plot file.

* chore: remove duplicate plots/panel_4/ and tracked png/pdf files

Canonical scripts live in analysis/plot_panel4_*.py and output to
analysis/results/figures/panel_4/. Remove the old plots/panel_4/
directory (scripts + generated output) and stray PDFs from analysis/.

* refactor: rename plot scripts to plot_irt_*, remove old lfm directory

Rename plot_panel4_* and plot_panel4_irt_* scripts to plot_irt_*.
Remove old lfm/ directory and its Snakefile rules (superseded by
lfm-binomial). Clean up stray tex/aux files.

* chore: remove model3 LOO plot and fitting rule (no longer best model)

* fix: update plot_irt_results rule to use lfm-binomial data

* feat: add HF download script for lfm-binomial results

Add download_lfm_binomial_from_hf.py to fetch pre-fitted model results
from jablonkagroup/corral_lfm_binomial_results. Add download_lfm_binomial
Snakefile rule so plot rules can resolve without local model fitting.

* chore: move fire to dev dependency (not required at runtime)

* fix: remove hardcoded environment count from docstring

* chore: add epistemology panel PDF

* chore: remove report files (#337)

* chore: remove report files

* chore: mv database creation to scripts

---------

Co-authored-by: n0w0f <pvt.nawaf@gmail.com>

* fix: standardize @tool docstring tags across all environments (#331)

* feat: add pre-commit hook to validate @tool docstring tags

Adds a validation script that checks all @tool-decorated functions
for the required tagged docstring format (BRIEF, DETAILED, PROCEDURAL,
WORKFLOW_INTEGRATION, CONTEXTUAL, SYNTACTICAL, RAISES, LIMITATIONS,
ARGS_*, RETURNS_*, and nested tags). Also detects common misspellings
like ARGS_SYNTACTIC vs ARGS_SYNTACTICAL and unclosed tags.
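A minimal sketch of the kind of check such a hook performs (the tag subset, the misspelling table, and `check_docstring` itself are illustrative, not the actual validator):

```python
import re

# Illustrative subset of the required top-level docstring tags.
REQUIRED_TAGS = {"BRIEF", "DETAILED", "SYNTACTICAL", "RAISES", "LIMITATIONS"}
# Common misspellings mapped to their canonical form.
MISSPELLINGS = {"ARGS_SYNTACTIC": "ARGS_SYNTACTICAL"}


def check_docstring(doc: str) -> list[str]:
    """Return a list of problems found in a tagged docstring."""
    problems = []
    opened = re.findall(r"\[([A-Z_]+)\]", doc)
    closed = re.findall(r"\[/([A-Z_]+)\]", doc)
    for tag in sorted(REQUIRED_TAGS - set(opened)):
        problems.append(f"missing tag: {tag}")
    for tag in opened:
        if tag in MISSPELLINGS:
            problems.append(f"misspelled tag: {tag} -> {MISSPELLINGS[tag]}")
        elif tag not in closed:
            problems.append(f"unclosed tag: {tag}")
    return problems
```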

* ci: add dedicated docstring validation job

Adds a validate-docstrings job that runs the full standalone scan
across all task environments, independent of pre-commit.

* revert: remove redundant docstring validation CI job

The pre-commit hook already runs in CI via the code-quality job,
so a separate job is unnecessary duplication.

* fix: rename ARGS_SYNTACTIC to ARGS_SYNTACTICAL across all environments

The verbosity system expects ARGS_SYNTACTICAL but catalyst, ml,
corral_md, and resistor_network used ARGS_SYNTACTIC (missing -AL),
causing those tags to be silently ignored during filtering.

Also switches pre-commit hook entry to python3 since the script
uses only stdlib modules.

* fix: standardize AFM tool docstring tags to use ARGS_*/RETURNS_* prefixes

AFM tools used bare [BRIEF], [DETAILED], [SYNTACTICAL], [EXAMPLES]
inside Args/Returns sections instead of the required [ARGS_BRIEF],
[ARGS_DETAILED], [ARGS_SYNTACTICAL], [ARGS_EXAMPLES] and
[RETURNS_BRIEF], [RETURNS_DETAILED], [RETURNS_EXAMPLES].
RAISES blocks were already correct and left unchanged.

* fix: rename ARGS_* to RETURNS_* in corral_md Returns sections

All 8 tools in corral_md used ARGS_BRIEF/ARGS_DETAILED/ARGS_EXAMPLES
inside Returns sections instead of RETURNS_BRIEF/RETURNS_DETAILED/
RETURNS_EXAMPLES. Also fixes a typo: ARGSDETAILED -> RETURNS_DETAILED
in execute_python_script.

* fix: standardize RAISES blocks across resistor, retrosynthesis, spectra

- retrosynthesis: fix typo [//ERROR_WHEN] -> [/ERROR_WHEN]
- resistor_network: add nested ERROR_WHEN/ERROR_DETAILS/ERROR_RECOVERY
  tags to propose_simple_topology and generate_test_measurements,
  fix unclosed RAISES block, fix EXAMPLES -> RETURNS_EXAMPLES
- spectra_elucidation: add nested RAISES tags to 6 tools that don't
  raise, add real error docs for obtain_isomers_from_molecular_formula
  (remote_call failure) and return_possible_fragments (ValueError)

* fix: standardize spectra_elucidation docstring tags

- Add PREREQUISITE/CURRENT/FOLLOW_UP nested tags to WORKFLOW_INTEGRATION
  in obtain_isomers_from_molecular_formula and validate_smiles
- Rename bare BRIEF/DETAILED/SYNTACTICAL/EXAMPLES to ARGS_*/RETURNS_*
  in simulate_spectra Args/Returns sections

* fix: rewrite samplemath tool docstrings with full tagged format

Both calculator and percentage_calculator had plain docstrings with
no verbosity tags. Rewrites both with the complete tagged format
including BRIEF, DETAILED, PROCEDURAL, WORKFLOW_INTEGRATION,
CONTEXTUAL, SYNTACTICAL, ARGS_*, RETURNS_*, RAISES, and LIMITATIONS.

* fix: standardize docstring tags in src/corral/utils/ tools and extend validator scope

Rename bare [BRIEF]/[DETAILED]/[SYNTACTIC]/[EXAMPLES] tags to
ARGS_*/RETURNS_* prefixed versions in terminal_tools.py, context7_tools.py,
and code_tools.py (5 @tool functions, 75 violations fixed).

Update validator find_tool_files() to also scan src/corral/utils/ and
widen pre-commit hook file pattern to match *_tools.py files.

* fix: scope pre-commit docstring hook to avoid matching test files

Narrow _tools.py pattern to only match src/corral/utils/*_tools.py,
preventing tests/test_tools.py from being validated as real tool files.

* fix: standardize wetlab tool docstring tags and fix pre-commit scope

- Rename ARGS_SYNTACTIC → ARGS_SYNTACTICAL across 8 tools
- Add missing ARGS_SYNTACTICAL tags for 5 args (sol1_label, sol2_label, sol_label)
- Add RAISES nested tags to 6 tools with empty RAISES blocks
- Scope pre-commit hook pattern to src/corral/utils/*_tools.py to avoid
  matching test files

* fix: exclude tests/ from tool docstring validation hook

Test files contain @tool fixtures with simple docstrings that don't
need the full tagged format. This was caught by CI (--all-files) but
not locally since test files were never staged.

* fix: add pytest-mock to dependency-groups dev

Was already in [project.optional-dependencies] but missing from
[dependency-groups] which is what uv sync --dev uses. Fixes 13 test
errors in test_prompt_utils.py.

* refactor: standardize task configs across all environments (#328)

* refactor: standardize task configs to follow spectra_elucidation pattern

All environments now use a consistent directory structure:
  <env>/environments/level_N/tasks_json/task_M.json
  <env>/environments/level_N/subtasks_json/task_M.json

Changes per environment:
- catalyst: config/{single,chained}/ → environments/level_1/{tasks_json,subtasks_json}/
  Split 3 tasks (si, tio2, cu2o) into individual files
- ml: config/{single,chained}/ → environments/level_1/{tasks_json,subtasks_json}/
  Split 3 tasks (oxides, nitrides, sulphides) into individual files
- resistor_network: config/{single,chained}/ → environments/level_1/{tasks_json,subtasks_json}/
  Split 6 tasks into individual files with grouped subtasks
- corral_md: Flattened categories (melting, quenching, surface_energy) into
  numbered tasks within level dirs, renamed tasks/ → tasks_json/, subtasks/ → subtasks_json/
- afm: src/enviroment/{tasks_N,subtasks_N}.json → environments/level_N/{tasks_json,subtasks_json}/task_1.json
  4 levels with tasks and subtasks
- retrosynthesis: Renamed tasks/ → tasks_json/, subtasks/ → subtasks_json/

* ci: add pre-commit hook to enforce environment config convention

Validates that all task environments follow the standard structure:
  <env>/environments/level_N/{tasks_json,subtasks_json}/<task>.json

Catches legacy patterns (config/, src/enviroment/) and checks for
valid JSON, correct directory naming, and required tasks_json/ dirs.

* feat: standardize task JSON schemas and add HF upload script

Schema standardization:
- All task/subtask JSONs now use array format at top level
- Dict-based files converted: dict keys become `id` field on each entry
- `scoring_fn` renamed to `scoring_function` across all environments
- `description` field added from `input.prompt` for spectra/retrosynthesis
- UUID added to every task and subtask entry

New scripts:
- scripts/standardize_task_schemas.py: normalizes all JSON schemas
- scripts/upload_tasks_to_hf.py: uploads tasks to HF dataset
  (jablonkagroup/corral-environment-tasks) with 26 subsets,
  681 total entries across all environments and levels
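The dict-to-array conversion described above can be sketched roughly as follows (field names follow the commit message; the function itself is an illustrative stand-in for the migration script):

```python
import uuid


def standardize(tasks: dict) -> list[dict]:
    """Convert a dict-keyed task file to the standardized array format.

    Dict keys become `id` fields, `scoring_fn` is renamed to
    `scoring_function`, every entry gets a UUID, and the prompt is
    mirrored into a top-level `description` where available.
    """
    entries = []
    for key, task in tasks.items():
        entry = dict(task)
        entry["id"] = key
        if "scoring_fn" in entry:
            entry["scoring_function"] = entry.pop("scoring_fn")
        entry.setdefault("uuid", str(uuid.uuid4()))
        if "description" not in entry and "input" in entry:
            entry["description"] = entry["input"].get("prompt", "")
        entries.append(entry)
    return entries
```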

* fix: add subset configs to HF dataset README for proper viewer support

The HF dataset viewer requires configs declared in the YAML front matter
of README.md to display subset selection. Now generates proper config
entries for all 26 subsets.

* feat: add unified task loader utility and refactor catalyst env.py

Add `corral.utils.task_loader` with `load_task_entries()` and
`load_task_entries_from_env_package()` that load standardized task
configs from either HuggingFace or local JSON directories.

Refactor catalyst env.py as reference implementation: replace custom
`load_tasks_from_json()` with the shared task loader, add
`entries_to_task_definitions()` for converting entries to TaskDefinition
objects using the environment's scoring function registry.

* fix: deduplicate subtask/task IDs across files in same directory

Prefix duplicated IDs with filename stem (e.g. task_1_retrieve_structure)
in catalyst and ml subtasks. Also updates input_from_tasks references.

* fix: update catalyst integration tests for refactored task loader API

Replace removed load_tasks_from_json with entries_to_task_definitions +
load_task_entries, update fixtures to standardized array format, and fix
create_environments call signatures.

* feat: standardize wetlab task configs to match other environments

- Move wetlab tasks/subtasks from wetlab/wetlab/{tasks,subtasks}_json/level_X/
  to wetlab/environments/level_X/{tasks,subtasks}_json/
- Add wetlab to GROUP_A_ENVS in standardize script
- Standardize all 60 wetlab JSON files: scoring_fn → scoring_function,
  add uuid, add description from input.prompt

* feat: standardize samplemath (demo env) to follow framework conventions

- Move config/example.json → environments/level_1/{tasks,subtasks}_json/
- Remove legacy config/ directory
- Standardize JSON: dict→array format, add uuid, add id from dict keys
- Remove samplemath from SKIP_ENVS in all scripts (standardize, upload, check)
- Demo env should exemplify best practices, not be an exception

* chore: remove legacy checks and one-time migration script

- Remove legacy config/ and src/enviroment/ detection from check_env_convention.py
  since all environments are now standardized
- Delete standardize_task_schemas.py — migration is complete, no longer needed

* feat: landing page for Corral benchmark (#324)

* feat: add landing page with environment explorer and verbosity slider

- extract_data.py: AST-based extraction of tools, tasks, and scoring
  functions from all 7 environments (docstrings stripped from code)
- data.js: auto-generated data with verbosity-tagged sections per tool
- index.html: Tailwind + Alpine.js page with environment explorer,
  interactive verbosity slider, code panels, leaderboard, and team tabs

* feat: redesign landing page with hero, architecture section, and font pairing

- Add proper Home tab with hero, stats, architecture diagram, env grid
- Use Space Grotesk (headings) + Newsreader (body) + JetBrains Mono (code)
- Copy logo and arch images into site/ for correct relative paths
- Increase code snippet extraction limit from 20 to 50 lines
- Add M3RG Lab credit to footer and team section
- Add external Docs link in nav pointing to GitHub Pages
- Environments tab now separate from landing page

* docs: add README with local setup instructions for site

* update team members

* feat: show all tasks and subtasks across levels on landing page

- Rewrite extract_data.py to walk standardized directory structure
- Load 85 tasks + 596 subtasks across all 7 environments and levels
- Add subtasks tab to environment detail view
- Add level selector for multi-level environments
- Update stats strip: 7 envs, 83 tools, 85 tasks, 596 subtasks
- Environment cards now show task + subtask counts

* fix: unique subtask IDs, cap chips at 25 with HF link, equal-height panels

- Use _uid (from uuid) for task/subtask selection to handle duplicate IDs
- Cap task/subtask chip lists at 25 with "+ N more" link to HuggingFace
- Remove 50-line truncation on tool and scoring function code snippets
- Equal-height columns for description and code panels (items-stretch)

* feat: show arguments and returns in tool descriptions by verbosity level

Include ARGS_BRIEF/DETAILED/SYNTACTICAL and RETURNS_BRIEF/DETAILED/EXAMPLES
progressively in the verbosity slider output. Add section headers
(Arguments, Returns, Raises, etc.) for clarity.

* fix: render args/returns as styled HTML sections in tool description

Switch from x-text to x-html for tool description body. Args, Returns,
Raises, Limitations, and Examples now render with uppercase header labels
and visual separators so they're clearly visible at higher verbosity.

* refactor: auto-discover environments, remove team tab, dynamic stats

- extract_data.py now auto-discovers environments from tasks/*/environments/
  and finds tools.py/score.py via rglob (no hardcoded paths)
- Display names and descriptions moved to site/env_meta.json
- Stats strip computed dynamically from CORRAL_DATA (no hardcoded numbers)
- Remove team tab and team members data

* docs: update environment names and descriptions to match paper

Rename environments to paper terminology: Surface Construction,
Circuit Inference, Retrosynthetic Planning, AFM Operation,
Molecular Simulation, ML Property Prediction, Spectra Elucidation.

* fix: discovery ignores venv

* fix: combine ARGS/RETURNS sections into unified blocks and show at all verbosity levels

- ARGS_BRIEF and RETURNS_BRIEF now shown from Brief level onward
- Higher verbosity levels are cumulative (ARGS_DETAILED adds to ARGS_BRIEF, not replaces)
- All ARGS_* sections render as single "Arguments" block, RETURNS_* as single "Returns" block
- Regenerated data.js with standardized docstring tags
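The cumulative behaviour can be sketched as follows (the level ordering and the tag-to-level mapping here are illustrative):

```python
# Illustrative verbosity ordering: each level includes all earlier ones.
VERBOSITY_LEVELS = ["BRIEF", "DETAILED", "EXAMPLES"]


def visible_sections(sections: dict[str, str], level: str) -> dict[str, str]:
    """Return every ARGS_*/RETURNS_* section at or below the given level.

    Higher levels are cumulative: DETAILED adds to BRIEF rather than
    replacing it.
    """
    cutoff = VERBOSITY_LEVELS.index(level)
    allowed = set(VERBOSITY_LEVELS[: cutoff + 1])
    return {
        tag: text
        for tag, text in sections.items()
        if tag.rsplit("_", 1)[-1] in allowed
    }
```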

* refactor: remove leaderboard tab from landing page

* feat: floating glassmorphism navbar and dark/light mode toggle

- Navbar is now floating with rounded corners and frosted glass effect
- Added dark mode support with localStorage persistence
- Sun/moon toggle button in the nav bar
- Dark mode covers all surfaces, text, borders, cards, and code panels
- Respects reduced-motion preferences

* fix: dark mode glassmorphism for cards and stats strip

- Stats strip uses transparent bg with proper dark dividers
- Feature cards (Environments/Agents/Tasks) get glass-card effect in dark mode
- Environment grid cards get glass-card effect in dark mode
- Environment selector bar styled for dark mode
- Added violet dark mode color tokens

* ui: move verbosity slider above description body

Control appears before the content it filters, improving discoverability.

* feat: add Wet Chemistry environment to landing page

- Added wetlab entry to env_meta.json
- Regenerated data.js (14 tools, 30 tasks, 190 subtasks, 3 levels)
- Totals now: 8 environments, 97 tools, 115 tasks, 786 subtasks

* fix: remove Tool Verbosity from architecture, trim principles, rename wetlab

- Remove Tool Verbosity bullet from Decoupled Architecture section
- Remove Efficiency and Simplicity from foundational principles
- Rename wetlab to "Qualitative Analysis" with proper description
- Regenerated data.js

* fix: center foundational principles grid for 4 items

* copy: change heading to "A framework for..."

* feat: render inline code in tool descriptions with monospace styling

Backtick-wrapped content in docstring sections now renders as styled
<code> tags with monospace font and subtle background, matching GitHub
inline code appearance. Works in both light and dark mode.

* feat: default dark mode to system preference

Falls back to prefers-color-scheme when no localStorage override exists.
Manual toggle still persists the user's choice.

* fix: resolve missing scoring functions across all environments

- AFM: rename check_equation to check_mathematical_eq to match registry key
- extract_data.py: parse SCORING_FUNCTIONS dict from env.py to build
  registry-key aliases for scoring functions automatically
- Normalize scoring_function field to string (fixes spectra integer IDs)
- All 8 environments now have complete scoring function coverage

* feat: add verifiers count to stats strip and fix single-row layout

* fix: render per-argument descriptions for multi-arg tools

ARGS_*/RETURNS_* sections were stored as single strings, so only the
last argument's description survived when a tool had multiple params.
Now collected as arrays in extract_data.py and rendered per-argument
in the frontend with arg name labels and clean verbosity formatting.

* feat: deploy landing page and docs via GitHub Actions Pages

Replace mkdocs gh-deploy with actions/deploy-pages to serve both
the landing page (root) and MkDocs docs (/docs/) from a single
GitHub Pages deployment. CI now also generates data.js from task sources.

* chore: update environment display names to match paper conventions

* ci: enable site deployment on push to dev branch (#344)

* feat: intervention experiment pipeline (#334)

* feat: intervention experiment pipeline

Add scripts to run intervention experiments that inject steps from
successful/failed traces into new agent runs to measure knowledge
vs reasoning gaps across scientific environments.

Pipeline: select tasks (from reports_v2) -> run baseline -> pick
traces from baseline -> run intervention conditions -> analyze.

* feat: per-agent server ports, env venv setup, resistor argparse

- Each env now has two server ports (react/toolcalling) to allow safe
  parallel runs — the server is stateful and concurrent agents would clash
- Add scripts/setup_envs.sh for one-time venv creation (uv for spectra/
  resistor, micromamba for wetlab due to conda-only reaktoro)
- launch_sweep.sh gains --start-servers/--stop-servers/--server-status
- Resistor env.py uses argparse with --mode single/chained (no path needed)
- Wetlab pyproject.toml updated with corral dep and uv.sources

* fix: bash 3 compatibility for launch_sweep.sh

Replace declare -A (bash 4+) with case-based lookup functions.
Tested on macOS bash 3.2.57. Also add generated task_selection.json.

* fix: count dry-run launches in launch_sweep.sh

* feat: smoke-tested baseline pipeline with Bedrock

- setup_envs.sh: upgrade promptstore + install boto3 for Bedrock
- launch_sweep.sh: add --trials flag for smoke testing (e.g. --trials 1)
- run_intervention.py: cap k_values at trials count to avoid validation error
- Verified end-to-end: setup venvs → start servers → launch baselines → reports
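The k_values cap is essentially a one-line guard; a sketch (the helper name and variable names are assumptions, not the script's actual code):

```python
def cap_k_values(k_values: list[int], trials: int) -> list[int]:
    """Drop any pass@k value that exceeds the number of trials.

    pass@k is undefined when k exceeds the trial count, so requesting
    e.g. k=4 with --trials 1 would otherwise fail validation downstream.
    """
    return [k for k in k_values if k <= trials]
```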

* chore: update lock files and add promptstore index

Updated uv.lock files across all task environments after
upgrading promptstore. Added generated prompts/index.json.

* chore: baseline runs

* chore: pass plot

* chore: checkpoint intervention runs

* chore: push update

* chore: intervention first batch

* chore: retro intervention runs

* chore: more intervention runs

* chore: new plots

* chore: plot results

* refactor: move intervention analysis to HF-backed pipeline

- Add intervention plotting scripts to analysis/ (pass@k, pass^k,
  recovery curves, baseline compact, statistical tests)
- Create intervention_utils.py and aggregate_intervention_results.py
  to read from HF-downloaded JSONL instead of local filesystem
- Add download_intervention_reports_from_hf.py for fetching data
- Update Snakefile with intervention analysis rules
- Remove reports_v3/intervention/runs/ (4.3 GB, now on HF)
- Remove reports_v3/intervention/analysis/ (moved to analysis/)

* chore: remove Snakefile rules for deleted plot scripts

Remove rules and outputs for plot_avg_output_tokens,
plot_avg_tool_calls_per_task, plot_action_distributions,
plot_behavior_summary_panel, and plot_env_verbosity_performance
whose scripts were already deleted.

* feat: add grouped recovery curves and baseline plots

New scripts that average metrics across environment groups:
- Hypothesis-driven inquiry (spectra, wetlab, resistor)
- Strategic reasoning (retrosynthesis)
- Workflow construction (catalyst, md, ml)

* chore: add HF push scripts and clean up stale intervention artifacts

Add scripts to push intervention reports and traces to HuggingFace.
Remove stale agent logs, pid files, and temporary documents.

* chore: remove reports_v3 and update stale intervention paths

Delete reports_v3/intervention (now lives under analysis/intervention).
Update RUNS_ROOT and docstrings to reference the new location.

* fix: remove stale context window error test

The test expected a bare Message return but get_llm_response now wraps
it in LLMResponse.

* fix: restore analysis plot scripts accidentally deleted in refactor

These files were removed in 09ea114a5 but are still referenced by the
Snakefile and present on dev.

* feat: add guidelines plus the app (#340)

* feat: add guidelines plus the app

* chore: remove useless comments

* fix: solve logging

* fix: solve multi-file issues

* fix: solve path problem

* feat: add files for analysis

* feat: add new annotations

* feat: add a new iteration

* feat: add new annotations

* feat: update app

* feat: add antipattern excerpt figure

* feat: solve comments

* feat: update the plots and tables from the epistemology analysis

* feat: add annotations and analysis

* chore: remove data as it is in HF

---------

Co-authored-by: Nawaf Alampara <pvt.nawaf@gmail.com>

* feat: add scripts for solving last ToDos (#342)

* feat: add the domain summary table

* chore: update colors

---------

Co-authored-by: Nawaf Alampara <pvt.nawaf@gmail.com>

* feat: plot enhancements for panel 2, panel 4, and tikz figure (#343)

* feat: intervention experiment pipeline

Add scripts to run intervention experiments that inject steps from
successful/failed traces into new agent runs to measure knowledge
vs reasoning gaps across scientific environments.

Pipeline: select tasks (from reports_v2) -> run baseline -> pick
traces from baseline -> run intervention conditions -> analyze.

* feat: per-agent server ports, env venv setup, resistor argparse

- Each env now has two server ports (react/toolcalling) to allow safe
  parallel runs — the server is stateful and concurrent agents would clash
- Add scripts/setup_envs.sh for one-time venv creation (uv for spectra/
  resistor, micromamba for wetlab due to conda-only reaktoro)
- launch_sweep.sh gains --start-servers/--stop-servers/--server-status
- Resistor env.py uses argparse with --mode single/chained (no path needed)
- Wetlab pyproject.toml updated with corral dep and uv.sources

* fix: bash 3 compatibility for launch_sweep.sh

Replace declare -A (bash 4+) with case-based lookup functions.
Tested on macOS bash 3.2.57. Also add generated task_selection.json.
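The case-based replacement reads roughly like this; the environment names and port numbers below are made up for illustration and are not the actual tables in launch_sweep.sh:

```shell
#!/usr/bin/env bash
# bash 3 has no `declare -A` (associative arrays are bash 4+), so a
# read-only map can be emulated with a case-based lookup function.
# Env names and ports here are illustrative placeholders.
env_port() {
  case "$1" in
    spectra)   echo 8001 ;;
    resistor)  echo 8002 ;;
    wetlab)    echo 8003 ;;
    *)         echo "unknown env: $1" >&2; return 1 ;;
  esac
}

env_port spectra
```

Unlike `declare -A`, this pattern works identically on the bash 3.2 that ships with macOS and on bash 4+.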

* fix: count dry-run launches in launch_sweep.sh

* feat: smoke-tested baseline pipeline with Bedrock

- setup_envs.sh: upgrade promptstore + install boto3 for Bedrock
- launch_sweep.sh: add --trials flag for smoke testing (e.g. --trials 1)
- run_intervention.py: cap k_values at trials count to avoid validation error
- Verified end-to-end: setup venvs → start servers → launch baselines → reports
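The k_values cap can be sketched as below; `cap_k_values` is a hypothetical helper name, since the commit only describes the behavior inside run_intervention.py, not its code:

```python
def cap_k_values(k_values, trials):
    """Clamp each requested k to the number of trials actually run.

    pass@k is undefined for k greater than the trial count, so a
    smoke test with --trials 1 must not request k=5. The set() also
    deduplicates values that collapse after clamping.
    """
    return sorted({min(k, trials) for k in k_values})
```

For example, a default `k_values=[1, 5, 10]` with a single smoke-test trial collapses to just `[1]`.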

* chore: update lock files and add promptstore index

Updated uv.lock files across all task environments after
upgrading promptstore. Added generated prompts/index.json.

* chore: baseline runs

* chore: pass plot

* chore: checkpoint intervention runs

* chore: push update

* chore: intervention first batch

* chore: retro intervention runs

* chore: more intervention runs

* chore: new plots

* chore: plot results

* refactor: move intervention analysis to HF-backed pipeline


* feat: plot enhancements for panel 2, panel 4, and tikz figure

Panel 2:
- Fix OOM crash in logprob scripts by streaming logprobs.jsonl via
  load_logprobs_stats() instead of loading full per-token arrays into RAM
- Scatter: x-axis reduced to 2 ticks, y-axis fixed to 0.2 intervals,
  legend handles explicitly coloured to match scatter dots
- Coverage heatmap: S-labels rotated 90°
- Task category line: remove grey background grid
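The streaming fix for the OOM crash can be sketched as follows. `load_logprobs_stats()` is the function name from this commit, but the `token_logprobs` field name and the exact statistics returned are illustrative assumptions, not the repository's actual schema:

```python
import json
import math


def load_logprobs_stats(path):
    """Stream logprobs.jsonl line by line, keeping only running
    aggregates (count, sum, sum of squares) instead of loading the
    full per-token arrays into RAM.

    NOTE: the `token_logprobs` field name is an assumption made for
    illustration; the real file layout may differ.
    """
    count, total, total_sq = 0, 0.0, 0.0
    with open(path) as fh:
        for line in fh:
            record = json.loads(line)
            for lp in record.get("token_logprobs", []):
                count += 1
                total += lp
                total_sq += lp * lp
    mean = total / count if count else 0.0
    var = total_sq / count - mean * mean if count else 0.0
    return {"n": count, "mean": mean, "std": math.sqrt(max(var, 0.0))}
```

Peak memory is now bounded by the longest single line rather than the whole file.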

Panel 4 (IRT/LFM):
- Variance decomposition labels now include notation symbols
  (γ_s, δ_ℓ, ξ_v, κ_c, e, t, θ_K, θ_R) in both radar_and_variance
  and final_report scripts
- Switch default model from model3 to model7_abilities_env_level
- Fix arviz API: hdi_prob → prob

Tikz figure:
- Fix theta notation to match table (θ^(K/R) → θ_K/R)
- Add λ/ψ slope labels on capability → model arrows
- Add Category (κ_c), Task (t), … to covariates box
- Text colour updated to lama_aesthetics grey (#758D99)

deps: add datasets, netcdf4, snakemake


* chore: minor plot enhancements

* chore: enhancements to plots

* chore: apply consistent colour scheme across intervention and panel 2 grouped plots



* fix: resolve pre-commit failures (ruff lint + formatting)


---------

* feat: add scripts to do ai scientists search (#339)

* feat: add scripts to do ai scientists search

* feat: add tables for examples + snakemake

* feat: add new plot

* feat: add script to push to HF + remove data

---------

Co-authored-by: Nawaf Alampara <pvt.nawaf@gmail.com>

* chore: remove duplicate deps, pin litellm<1.82, clean gitignore (#346)

- Remove duplicate litellm, requests, modal entries in pyproject.toml
- Pin litellm to >=1.56.4,<1.82
- Remove duplicate mkdocs-gen-files and mkdocstrings entries
- Remove duplicate classifier entry
- Add *.aux to .gitignore and untrack analysis/tikz_figure.aux

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add epistemic trace explorer to landing page (#345)

* feat: add epistemic trace explorer to landing page

Replace the placeholder Results tab with an interactive Explainers tab
that visualizes epistemological graphs of LLM agent reasoning traces.

Key design decisions:

Data pipeline (extract_traces.py):
- Pulls annotated traces from HF dataset (jablonkagroup/corral-reasoning-annotations)
- Strips raw messages field (~50-100KB/trace) since support quotes in
  nodes/edges already provide grounding text
- Truncates node text (200 chars) and quotes (150 chars)
- Caps pattern instance lists at 5 per pattern type to control file size
- Selects 54 traces (18/model) via diversity scoring across 8 environments
- Result: 1.2MB traces.js (vs ~150MB if all 619 traces with messages)
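The size-control steps in this pipeline can be sketched as below. The limits (200/150 characters, 5 instances per pattern, dropping `messages`) come from the commit message; the trace dict layout and the helper names are assumptions for illustration:

```python
# Illustrative sketch of extract_traces.py's size controls; the
# actual trace schema in the HF dataset may differ.
MAX_NODE_TEXT = 200
MAX_QUOTE = 150
MAX_PATTERN_INSTANCES = 5


def truncate(text, limit):
    # Keep a visible marker so truncation is obvious downstream.
    return text if len(text) <= limit else text[: limit - 1] + "…"


def shrink_trace(trace):
    trace = dict(trace)  # shallow copy; nodes are edited in place
    trace.pop("messages", None)  # raw messages dominate file size
    for node in trace.get("nodes", []):
        node["text"] = truncate(node.get("text", ""), MAX_NODE_TEXT)
        node["quotes"] = [truncate(q, MAX_QUOTE) for q in node.get("quotes", [])]
    trace["patterns"] = {
        name: instances[:MAX_PATTERN_INSTANCES]
        for name, instances in trace.get("patterns", {}).items()
    }
    return trace
```

Dropping the raw `messages` field is the big win here, since the support quotes on nodes and edges already carry the grounding text.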

Visualization (index.html):
- Nested Alpine.js scope (traceExplorer) isolated from corralApp
- D3.js v7 for graph rendering with two layout modes:
  - Temporal: X=message time, Y=node type lanes (H/T/E/J/U/C)
  - Force-directed: draggable physics simulation
- 6 node types with distinct color palettes (light + dark mode)
- 6 edge relations with unique stroke colors and dash patterns
- Pattern highlighting: click productive/breakdown patterns to glow
  involved nodes (green/red halos) and dim unrelated nodes
- MutationObserver re-renders graph on dark mode toggle
- Cascading filters: model -> environment -> level
- 3-column glass layout matching existing landing page design system

* fix: repair graph toolbar buttons (zoom, fit, layout toggle)

- Replace viewBox with explicit width/height so D3 zoom transforms
  are not visually cancelled by SVG auto-scaling
- Fit button now computes bounding box and centers content
- Layout toggle stops force simulation before switching
- Add fill:none on temporal edge paths to prevent arc fill

* feat: show full node text in collapsible panel, improve trace curation

- Remove text/quote truncation — show full node text and support quotes
- Reduce to 10 traces per model (30 total, 620KB) since full text fits
- Guarantee environment coverage: pick 1 best per env before filling by score
- Add collapsible Node Text section in detail panel (expanded by default)
- Show all support quotes (was limited to 3), increase max-h for readability

* feat: add node-type-adaptive color accent to text panel

- Left border tinted to node type color (violet/blue/amber/cyan/emerald/rose)
- Header background gets subtle node-color wash (8% light, 12% dark)
- Improves visual connection between graph node and detail panel

* feat: use descriptive environment display names

Map raw env keys (afm, catalyst, md, etc.) to full display names
(AFM Experiment Execution, Adsorption Surface Construction, etc.)
in filter dropdown and trace list.

* feat: rename level→scope in UI, guarantee scope coverage in curation

- Rename all user-facing "Level" labels to "Scope" across both
  Environments and Explainers tabs (matches paper terminology)
- Display level_1 as "scope 1" throughout
- Curation now picks 1 best trace per (env, scope) pair before
  filling remaining slots by score — all 17 pairs covered per model
- Increase to 20 traces/model (60 total, 1.2MB) to fit all pairs
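The two-pass curation rule above can be sketched as follows; trace dicts with `env`, `scope`, and `score` keys are an assumed shape, and `curate` is a hypothetical name:

```python
def curate(traces, per_model=20):
    """Pick one best-scoring trace per (env, scope) pair first,
    then top up to the quota with the best remaining traces."""
    picked, seen_pairs = [], set()
    by_score = sorted(traces, key=lambda t: -t["score"])
    # First pass: guarantee coverage of every (env, scope) pair.
    for t in by_score:
        pair = (t["env"], t["scope"])
        if pair not in seen_pairs:
            seen_pairs.add(pair)
            picked.append(t)
    # Second pass: fill remaining slots purely by score.
    remaining = [t for t in by_score if t not in picked]
    picked.extend(remaining[: max(0, per_model - len(picked))])
    return picked[:per_model]
```

With 17 (env, scope) pairs and a quota of 20 per model, the first pass covers every pair and the second pass adds the three best leftovers.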

* feat: add hash-based permalinks for all tabs

Navigate directly to tabs via URL hash:
  /index.html#explainers → Explainers tab
  /index.html#environments → Environments tab

- Read hash on init to set active tab
- Push hash to history on tab change
- Handle browser back/forward via popstate

* fix: point docs links to mkdocs site at /corral/docs/

All four docs links were pointing to the landing page itself (/corral/).
Updated to point to the mkdocs-deployed documentation at /corral/docs/.

* feat: embed trace annotator as 'Annotate' tab in landing page

Integrates docs/trace-visualizer into the landing page as a new tab with
3-column layout (Details | Graph | Annotation), glassmorphism styling,
dark mode support, and ann- namespace isolation to avoid D3/Alpine conflicts.

* fix: include traces.js and annotator.js in CI site assembly

The deploy workflow was only copying 4 files to _build/, missing the
new traces.js and annotator.js needed by the Explainers and Annotate tabs.

* fix: force-add annotator.js ignored by /site gitignore rule

* chore: remove /site from gitignore

The /site rule was a leftover from mkdocs defaults, but this project
builds mkdocs to _build/docs. The site/ directory contains landing page
source files that were all force-added — removing the rule avoids that.

* chore: switch mkdocs palette from gruvbox_dark to dark

Better visual consistency with the landing page's dark slate theme.

* feat: add statistics + illustrative traces (#347)

* feat: add statistics + illustrative traces

* fix: update graph tables

* feat: possible solution for the tool types (#333)

* feat: possible solution for the tool types

* fix: remove hardcoded code

* feat: add corrections for all the environments + fix the tool implementation

* fix: apply suggestions from code review

* fix: change imports in MD

---------

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Nawaf Alampara <86834161+n0w0f@users.noreply.github.com>
Co-authored-by: Chandan Gupta <chandan18386@iiitd.ac.in>
Co-authored-by: Nawaf Alampara <pvt.nawaf@gmail.com>
Co-authored-by: Sadra <139479461+aaaghajani@users.noreply.github.com>
Co-authored-by: "imandal98" <indrajeetmandal.aaa@gmail.com>
Co-authored-by: Indrajeet Mandal <143293460+imandal98@users.noreply.github.com>
Co-authored-by: Kevin M Jablonka <32935233+kjappelbaum@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>