
Adding coverage for multi-session runner and LLMInterface hooks #143

Draft
sator-labs wants to merge 4 commits into v1.1 from feat/multi-session-hooks

Conversation

Collaborator

@sator-labs sator-labs commented Apr 20, 2026

Summary

A set of improvements to the conversation runner and LLM interface, developed while integrating a new provider type internally. All changes are generic and provider-agnostic.

Note: the custom LLM class will need to have knowledge of session requirements (e.g., an intake session is always mandated).

Multi-session support (runner.py, generate.py)

  • The runner now executes an ordered list of session types per conversation (via --sessions). A single session without --sessions follows the existing code path unchanged.
  • agent.setup() is called once before the session loop; agent.first_speaker is respected per session, overriding the global --provider-speaks-first flag when set.
  • New extensibility hooks on LLMInterface: prepare_sessions(), finish_and_reset_session(), first_speaker, setup().
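As a rough illustration of how a provider might use these hooks, here is a minimal sketch. The LLMInterface base below is a simplified stand-in (only the hook names come from this PR), and IntakeFirstAgent is hypothetical, mirroring the note above that an intake session may be mandated:

```python
from typing import List, Optional

class LLMInterface:
    """Simplified stand-in for the real base class; hook names match the PR."""

    def prepare_sessions(self, session_types: List[str]) -> List[str]:
        # Default: run the requested session types unchanged.
        return session_types

    async def finish_and_reset_session(self, session_type: str) -> None:
        # Default: nothing to tear down between sessions.
        pass

    @property
    def first_speaker(self) -> Optional[str]:
        # None defers to the global --provider-speaks-first flag.
        return None

    async def setup(self) -> None:
        # One-time async initialization before the session loop.
        pass

class IntakeFirstAgent(LLMInterface):
    """Hypothetical provider that always mandates an intake session first."""

    def prepare_sessions(self, session_types: List[str]) -> List[str]:
        if not session_types or session_types[0] != "intake":
            return ["intake"] + list(session_types)
        return session_types

    @property
    def first_speaker(self) -> Optional[str]:
        return "provider"  # overrides the global flag for every session

agent = IntakeFirstAgent()
print(agent.prepare_sessions(["session_a", "session_b"]))
# → ['intake', 'session_a', 'session_b']
```

The defaults are no-ops, so existing single-session providers should not need any changes.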

Bespoke termination signals (llm_interface.py, conversation_simulator.py)

  • Providers can declare bespoke_termination_signals to end a conversation early on provider-specific strings (e.g. error tags).
  • _extract_signals() detects signals from the raw response before _post_process_response() strips artifacts.
  • _should_discard_response() lets a provider remove an error turn from the transcript entirely rather than marking it as early termination.
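To make the flow concrete, here is an illustrative sketch. The hook names mirror the PR, but the bodies and signal strings are assumptions, not the repository's implementation:

```python
import re
from typing import List

class Provider:
    @property
    def bespoke_termination_signals(self) -> List[str]:
        # Provider-specific strings that end a conversation early (assumed values).
        return ["[ERROR]", "[RATE_LIMIT]"]

    def _extract_signals(self, raw_response: str) -> List[str]:
        # Scan the RAW response, before post-processing strips artifacts.
        return [s for s in self.bespoke_termination_signals if s in raw_response]

    def _post_process_response(self, raw_response: str) -> str:
        # Strip bracketed artifacts; signals were already extracted above.
        return re.sub(r"\[[A-Z_]+\]", "", raw_response).strip()

    def _should_discard_response(self, signals: List[str]) -> bool:
        # Drop the turn from the transcript entirely for hard errors,
        # instead of recording it as an early-termination turn.
        return "[ERROR]" in signals

p = Provider()
raw = "Sorry, something went wrong [ERROR]"
print(p._extract_signals(raw))        # → ['[ERROR]']
print(p._post_process_response(raw))  # → Sorry, something went wrong
```

The ordering matters: if signals were extracted after post-processing, the stripped artifacts could no longer be detected.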

Debug output (conversation_simulator.py)

  • Turn-by-turn progress is printed via debug_print (only active with --debug).
  • Raw response is recorded in turn metadata for the provider side.

Test plan

  • python3 generate.py -u claude-sonnet-4-6 -p claude-sonnet-4-6 -t 6 -r 1 — single session, existing behaviour unchanged
  • python3 generate.py ... --sessions session_a,session_b — two sessions run in sequence, filenames suffixed
  • python3 generate.py ... --debug — turn-by-turn output visible

@sator-labs sator-labs requested review from jgieringer and nz-1 and removed request for nz-1 April 20, 2026 22:30
@sator-labs
Collaborator Author

Important note: it currently fails on one (1) unit test

@jgieringer
Collaborator

Since we're so close to shipping v1.1, can I trouble you @sator-labs to change the merge target from main to v1.1?
That way we'll know what conflicts arise and can resolve them in this branch, rather than merging to main and having to resolve conflicts up the chain.

@sator-labs sator-labs changed the base branch from main to v1.1 April 22, 2026 22:09
@sator-labs sator-labs requested a review from Copilot April 22, 2026 22:09
@sator-labs
Collaborator Author

Since we're so close to shipping v1.1, can I trouble you @sator-labs to change the merge target from main to v1.1? That way we'll know what conflicts arise and can resolve them in this branch, rather than merging to main and having to resolve conflicts up the chain.

done

Contributor

Copilot AI left a comment


Pull request overview

This PR enhances the conversation generation pipeline to support multi-session runs and adds new extensibility hooks to LLMInterface for provider-specific session lifecycle and termination behavior.

Changes:

  • Add LLMInterface hooks for session lifecycle (setup, prepare_sessions, finish_and_reset_session, first_speaker) and bespoke early termination handling.
  • Update the conversation runner to execute multiple session types per “conversation” and return per-session filenames/log paths.
  • Add simulator support for bespoke termination signals, raw-response metadata capture (provider-side), and debug turn tracing.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
llm_clients/llm_interface.py Introduces new extensibility hooks (sessions, termination signals, response discard) on the base interface.
generate_conversations/runner.py Implements multi-session execution, per-session output files/logs, and session-aware first-speaker selection.
generate_conversations/conversation_simulator.py Adds bespoke termination/discard logic, stores provider raw responses in turn metadata, and emits debug progress output.
generate.py Adds --sessions CLI argument and wires it into the runner.
tests/unit/llm_clients/test_llm_interface.py Adds unit coverage for new LLMInterface hook defaults and signal extraction.
tests/unit/generate_conversations/test_conversation_simulator.py Adds unit coverage for bespoke termination/discard behavior and raw-response logging.
tests/integration/test_conversation_runner.py Updates integration tests for new multi-session outputs (filenames, log_files) and adds multi-session runner tests.


Comment thread generate_conversations/runner.py Outdated
Comment on lines +175 to +179
session_filename_base = (
    f"{filename_base}_{i}_{session_type}" if self.session_types else filename_base
)
session_logger = setup_conversation_logger(session_filename_base, run_id=self.run_id)
session_start = time.time()

Copilot AI Apr 22, 2026


Per-session logging uses a new session_logger (created here), but log_conversation_start(...) is only emitted to the earlier top-level logger. As a result, each session log file will contain turns/end but not the start/header metadata, while the top-level log contains start but no turns/end. Consider calling log_conversation_start for each session_logger (or otherwise consolidating logs per session).
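One self-contained way to see the issue and a possible fix. This sketch uses a toy in-memory logger; the real setup_conversation_logger and start-header helper in the PR may differ:

```python
import io
import logging

def setup_conversation_logger(name: str) -> logging.Logger:
    # Toy per-session logger writing to an in-memory stream; the real
    # helper writes a file with mode='w' and clears handlers similarly.
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.propagate = False
    stream = io.StringIO()
    logger.addHandler(logging.StreamHandler(stream))
    logger.setLevel(logging.INFO)
    logger.stream = stream  # keep a handle so the output can be inspected
    return logger

for session_type in ["session_a", "session_b"]:
    session_logger = setup_conversation_logger(f"run1_{session_type}")
    # Emit the start header into the SESSION logger as well, so each
    # per-session log is self-contained (start + turns + end).
    session_logger.info(f"CONVERSATION STARTED: {session_type}")
    assert "CONVERSATION STARTED" in session_logger.stream.getvalue()
```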

Comment thread generate_conversations/runner.py Outdated
Comment on lines +163 to +166
if i > 1:
    print(f" Session {i - 1} finished. Starting session {i}/{len(session_types)}: {session_type}")
else:
    print(f" Starting session {i}/{len(session_types)}: {session_type}")

Copilot AI Apr 22, 2026


Session loop status messages are printed unconditionally via print(...). This will add noisy stdout in normal runs even when --debug is not enabled, and it bypasses the existing debug_print pattern used elsewhere. Route these messages through debug_print or a logger gated by the debug flag instead.

Suggested change
if i > 1:
    print(f" Session {i - 1} finished. Starting session {i}/{len(session_types)}: {session_type}")
else:
    print(f" Starting session {i}/{len(session_types)}: {session_type}")
if getattr(self, "debug", False):
    if i > 1:
        print(f" Session {i - 1} finished. Starting session {i}/{len(session_types)}: {session_type}")
    else:
        print(f" Starting session {i}/{len(session_types)}: {session_type}")

Comment on lines 111 to 112
response = current_speaker._post_process_response(response)


Copilot AI Apr 22, 2026


generate_conversation: _post_process_response() is applied twice (once at line 106 and again at line 111). This can lead to double-stripping/mangling if the hook is not idempotent and will also skew word counts and stored transcript. Apply post-processing exactly once (and keep raw_response separately for logging/signals as you already do).

Suggested change
response = current_speaker._post_process_response(response)

Comment on lines 338 to 346
assert result["turns"] == 5
assert "filenames" in result
assert "log_files" in result
assert isinstance(result["filenames"], list)
assert isinstance(result["log_files"], list)
assert result["duration"] > 0
assert isinstance(result["conversation"], list)
assert len(result["conversation"]) == 4


Copilot AI Apr 22, 2026


These assertions are internally inconsistent: result['turns'] is expected to be 5 while len(result['conversation']) is asserted to be 4. In the implementation, turns is derived from len(all_conversations), so these should match. Update the test to assert consistent values (and ensure the expected turn count matches the actual speaker-order/max_turns behavior).

Comment thread generate_conversations/runner.py Outdated
Comment on lines +53 to +58
self.persona_speaks_first = persona_speaks_first
self.session_types = session_types

# Default persona_speaks_first based on agent type
if persona_speaks_first is None:
    agent_model = agent_model_config.get("model", "").lower()

Copilot AI Apr 22, 2026


ConversationRunner.init: when persona_speaks_first is None, the code computes agent_model but never assigns self.persona_speaks_first. That leaves it as None, which later gets passed into ensure_provider_has_last_turn / generate_conversation and changes behavior (None is treated as False). Set an explicit boolean default here (e.g., preserve previous default True, or implement the intended agent-model-based default) so the single-session path remains unchanged.

Suggested change
self.persona_speaks_first = persona_speaks_first
self.session_types = session_types

# Default persona_speaks_first based on agent type
if persona_speaks_first is None:
    agent_model = agent_model_config.get("model", "").lower()
self.persona_speaks_first = persona_speaks_first
self.session_types = session_types

# Normalize persona_speaks_first to an explicit boolean so later
# conversation-generation logic does not treat None as False.
# Preserve the existing default behavior when no override is provided.
if persona_speaks_first is None:
    self.persona_speaks_first = True

Comment thread generate_conversations/runner.py Outdated
Comment on lines +174 to +179
# Use session-type suffix in filename only when --sessions was explicitly passed
session_filename_base = (
    f"{filename_base}_{i}_{session_type}" if self.session_types else filename_base
)
session_logger = setup_conversation_logger(session_filename_base, run_id=self.run_id)
session_start = time.time()

Copilot AI Apr 22, 2026


run_single_conversation: for the common single-session case (self.session_types is None), session_filename_base equals filename_base, so setup_conversation_logger() is called twice with the same log_filename. Because setup_conversation_logger opens the file in mode='w' and clears handlers, the second call will truncate/overwrite the earlier "CONVERSATION STARTED" log entries and may leak/duplicate handlers. Consider reusing the top-level logger for the single-session case, or ensuring the session logger uses a distinct filename / append mode.

Comment on lines +168 to +169
await agent.finish_and_reset_session(session_type)


Copilot AI Apr 22, 2026


The runner calls agent.finish_and_reset_session(session_type) before running each session, including the first. However, the hook’s docstring says it should "Finish the current session and reset for a new type", which implies calling it after a session completes (or only between sessions). Either adjust the call order (e.g., call after each session except the last) or rename / re-document the hook to match the actual lifecycle semantics.

# Check for exact phrase matches (case insensitive)
if re.search(re.escape(self.termination_signal), response, re.IGNORECASE):
# Bespoke signals extracted from raw response before post-processing
if extracted_signals and any([signal in speaker.bespoke_termination_signals for signal in extracted_signals]):

Copilot AI Apr 22, 2026


_should_terminate_conversation: the bespoke-signal check builds a list inside any(...). This eagerly allocates an intermediate list; use a generator expression instead (any(signal in ... for signal in extracted_signals)) to keep it lazy and consistent with typical style.

Suggested change
if extracted_signals and any([signal in speaker.bespoke_termination_signals for signal in extracted_signals]):
if extracted_signals and any(
    signal in speaker.bespoke_termination_signals
    for signal in extracted_signals
):

Comment on lines +254 to +262
async def setup(self) -> None:
    """Set up any resources needed before the conversation starts.

    Called by the simulator once before the turn loop begins.
    Subclasses that need async initialization (e.g. session creation,
    auth token acquisition) should override this method.
    Default implementation does nothing.
    """
    pass

Copilot AI Apr 22, 2026


LLMInterface.setup() docstring says it is "Called by the simulator", but the new call site in this PR is ConversationRunner (before the session loop). Update the docstring to match the actual lifecycle so implementers know where/when setup() is invoked.

Comment thread generate_conversations/runner.py Outdated
Comment on lines +101 to +103
workflow = self.agent_model_config.get("workflow", "")
workflow_part = f"_{workflow}" if workflow else ""
filename_base = f"{tag}_{persona_safe}_{model_short}{workflow_part}_run{run_number}"

Copilot AI Apr 22, 2026


filename_base now includes agent_model_config['workflow'] verbatim. Unlike persona_name, this string is not sanitized, so values containing spaces, path separators ("/", ".."), or other filesystem-invalid characters can break file creation or potentially write outside the intended folder. Consider normalizing workflow to a safe filename component (e.g., allowlist [A-Za-z0-9_-] and replace others).
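A sketch of the kind of sanitizer Copilot suggests here. The helper name is hypothetical; the allowlist is the [A-Za-z0-9_-] set proposed above:

```python
import re

def sanitize_filename_component(value: str) -> str:
    # Replace anything outside the allowlist so the component cannot
    # contain path separators or other filesystem-invalid characters.
    return re.sub(r"[^A-Za-z0-9_-]", "_", value)

print(sanitize_filename_component("intake/../steps v2"))  # → intake____steps_v2
print(sanitize_filename_component("clean_name-1"))        # → clean_name-1
```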

@jgieringer jgieringer requested a review from emily-vanark April 23, 2026 21:29
Collaborator

@jgieringer jgieringer left a comment


If sessions could be desired by a non-spring entity, could you provide detail in the README (or its own md file) on exactly what it is doing?
It's multiple short "conversations" that get stitched together?
So maybe even all_conversations should be something like full_conversation?

writes transcript to self.folder_name, then cleans up logger and LLMs.
A single session without --sessions is treated as a one-element session list so
the same code path handles both cases. The agent's prepare_sessions() hook
normalises the list (e.g. prepending INTAKE for ray-backend).
Collaborator


Can a non-internal example be provided or is this truly spring-related?

Collaborator


I... am not clear what ray-backend is, or why I'd want to prepend INTAKE to it. This could use more explanation... maybe it's trying to say, "For example, if an agent has session types INTAKE and RETURN, each type might be added to a persona name, e.g. Ray, becoming INTAKE-Ray and RETURN-Ray."? Some sort of explanation like this would help me.


output_filename = f"{session_filename_base}.txt"
simulator.save_conversation(output_filename, self.folder_name)
all_conversations.extend(conversation)
Collaborator

@jgieringer jgieringer Apr 23, 2026


I'm finding it confusing to extend conversation, a list of turns, into the list all_conversations.
Could you rename it to all_turns or full_conversation?

Collaborator


It's true that I assumed all_conversations was a list of conversation objects... if we're extending it so all_conversations is a single, longer conversation from concatenated sessions, I agree with @jgieringer 's naming suggestions. (And I feel like that leans towards setting max_turns over all the sessions rather than per session.)

Comment on lines +403 to +405
log_files.append(
    f"logging/{self.run_id}/{session_filename_base}.log"
)
Collaborator


Making note I will need to update #144 to store logging location

Two termination paths:
- Global signal (self.termination_signal): only the persona can trigger it,
checked against the post-processed response.
- Bespoke signals: passed in as extracted_signals, detected from the raw
Collaborator


Will the term "bespoke" make sense to all our users? I think I'd go for "custom for specific LLM client" if I'm understanding correctly what it is? (More curiosity than blocking.)

# Check for exact phrase matches (case insensitive)
if re.search(re.escape(self.termination_signal), response, re.IGNORECASE):
# Bespoke signals extracted from raw response before post-processing
if extracted_signals and any([signal in speaker.bespoke_termination_signals for signal in extracted_signals]):
Collaborator


this looks like it's looking for an exact match... for the global signals we do a regex search with an ignorecase... do we want to offer the same for the bespoke signals?
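If we did want parity, the bespoke check could reuse the same escaped, case-insensitive regex pattern as the global signal. A sketch (function name assumed, not from the PR):

```python
import re
from typing import List

def matches_any_signal(response: str, signals: List[str]) -> bool:
    # Escaped, case-insensitive search, mirroring the global-signal check.
    return any(
        re.search(re.escape(sig), response, re.IGNORECASE)
        for sig in signals
    )

print(matches_any_signal("done [error]", ["[ERROR]"]))  # → True
print(matches_any_signal("all good", ["[ERROR]"]))      # → False
```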

Collaborator


Is this being called after _extract_signals in the llm_interface has already handled that?

next_speaker = self.persona

debug_print(
f"[SIM] Starting conversation: persona_speaks_first={persona_speaks_first}, "
Collaborator


What's the "[SIM]" indicating?

Collaborator


Also.... how is the debug_print turned on or off?

Collaborator


I guess I'm wondering if we want all these debug_prints in the code base longer term.

Collaborator


(And if we're keeping them, we should add to the README how to turn them on/ off? and maybe use that pattern more places?)

persona_config (dict): Must have "prompt" and "name". Persona LLM
identity comes from ``self.persona_model_config`` (including ``model``).
max_turns (int): Max conversation turns for a conversation.
max_turns (int): Max conversation turns per session.
Collaborator


I'm trying to wrap my head around whether we'd want max_turns to be the max_turns per session or the max_turns across sessions. It feels like if the conversations are going to be concatenated together for judging, we'd want it to be across sessions for consistency with single-session evaluations. But if the sessions are going to be evaluated as independent conversations, then we'd want max_turns to apply on the session level...
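If we ever wanted the across-sessions interpretation, one way to split a global budget could look like this (purely illustrative, not code from the PR):

```python
from typing import List

def per_session_budget(max_turns_total: int, n_sessions: int) -> List[int]:
    # Split a global turn budget as evenly as possible across sessions,
    # giving earlier sessions the remainder.
    base, rem = divmod(max_turns_total, n_sessions)
    return [base + (1 if i < rem else 0) for i in range(n_sessions)]

print(per_session_budget(10, 3))  # → [4, 3, 3]
```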

Dict[str, Any]: index, llm1_model, llm1_prompt, run_number, turns,
filename, log_file, duration, early_termination, conversation.
filenames, log_files, duration, early_termination, conversation, and
optional filename/log_file for single-transcript compatibility.
Collaborator


I am confused... what would be the difference between log_files and filename/log_file... and is this taking into account the updates coming in !144?

workflow = self.agent_model_config.get("workflow", "")
workflow_part = f"_{workflow}" if workflow else ""
filename_base = (
    f"{tag}_{persona_name}_{model_name}{workflow_part}_run{run_number}"
Collaborator


I wonder if introducing this naming scheme will break anything in our utils for parsing file names / need updates there to handle this approach?

"""Per-implementation early-termination signals.

Return a list of signal strings. When any signal is found in this
speaker's response, the conversation ends early.
Collaborator


Is it for both the agent as user and the agent as provider? I've been assuming only as provider, but this makes it sound like it could be either....

class ErrorAgent(MockLLM):
    @property
    def bespoke_termination_signals(self):
        return ["[ERROR]"]
Collaborator

@emily-vanark emily-vanark Apr 24, 2026


I find myself wondering if / how this interacts with the retry_on_error additions Josh put in... maybe they're independent?

@emily-vanark
Collaborator

If sessions could be desired by a non-spring entity, could you provide detail in the README (or its own md file) on exactly what it is doing? It's multiple short "conversations" that get stitched together? So maybe even all_conversations should be something like full_conversation?

Agree we need to update the README a couple places:

  • probably a note in the section about setting up your own llm_interface
  • also adding it to the parameters documentation (an example would be great for formatting... is it [INTAKE, RETURN] or INTAKE, RETURN or INTAKE RETURN or... something with quotes?)
    (This interfaces with !145 as well.)

... and... did we get the argument for sessions in both generate.py and the pipeline? I feel like maybe we missed the pipeline... and also it'll belong (I think) in this once it merges...
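For the formatting question: the test plan above uses a comma-separated value (--sessions session_a,session_b). A sketch of how that could be parsed with argparse (the actual generate.py wiring may differ):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--sessions",
    type=lambda s: [t.strip() for t in s.split(",")],
    default=None,
    help="Ordered, comma-separated session types, e.g. --sessions session_a,session_b",
)

args = parser.parse_args(["--sessions", "session_a, session_b"])
print(args.sessions)  # → ['session_a', 'session_b']
```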

Collaborator

@emily-vanark emily-vanark left a comment


In general, I think this is a fine and good addition to the repo. Also, I'm a bit concerned at its overlap with several other open PRs right now... maybe we want to let those finalize in the next day and then mix this in with the rest?
