Adding coverage for multi-session runner and LLMInterface hooks #143
sator-labs wants to merge 4 commits into v1.1 from
Conversation
Important note: it currently fails on one (1) unit test

Since we're so close to shipping v1.1, can I trouble you @sator-labs to change merging from main to v1.1?

done
Pull request overview
This PR enhances the conversation generation pipeline to support multi-session runs and adds new extensibility hooks to LLMInterface for provider-specific session lifecycle and termination behavior.
Changes:
- Add `LLMInterface` hooks for session lifecycle (`setup`, `prepare_sessions`, `finish_and_reset_session`, `first_speaker`) and bespoke early termination handling.
- Update the conversation runner to execute multiple session types per “conversation” and return per-session filenames/log paths.
- Add simulator support for bespoke termination signals, raw-response metadata capture (provider-side), and debug turn tracing.
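To make the new lifecycle hooks concrete, here is a minimal sketch of the default behaviors the PR describes. The names follow the PR summary, but the signatures and bodies below are assumptions for illustration, not the PR's actual code.

```python
import asyncio

# Hypothetical sketch of the LLMInterface hook defaults described above;
# signatures and bodies are assumptions, not the PR's exact implementation.
class LLMInterface:
    @property
    def first_speaker(self):
        # None means "no per-session preference"; the runner falls back
        # to the global first-speaker flag.
        return None

    @property
    def bespoke_termination_signals(self):
        # Provider-specific strings that end a conversation early; empty by default.
        return []

    async def setup(self):
        # One-time async initialization before the session loop; no-op by default.
        pass

    def prepare_sessions(self, session_types):
        # Providers may normalise the requested session list; identity by default.
        return session_types

    async def finish_and_reset_session(self, session_type):
        # Reset provider-side state between session types; no-op by default.
        pass
```

With these defaults, a provider that needs none of the hooks behaves exactly as before, while a custom provider overrides only what it needs.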
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `llm_clients/llm_interface.py` | Introduces new extensibility hooks (sessions, termination signals, response discard) on the base interface. |
| `generate_conversations/runner.py` | Implements multi-session execution, per-session output files/logs, and session-aware first-speaker selection. |
| `generate_conversations/conversation_simulator.py` | Adds bespoke termination/discard logic, stores provider raw responses in turn metadata, and emits debug progress output. |
| `generate.py` | Adds `--sessions` CLI argument and wires it into the runner. |
| `tests/unit/llm_clients/test_llm_interface.py` | Adds unit coverage for new LLMInterface hook defaults and signal extraction. |
| `tests/unit/generate_conversations/test_conversation_simulator.py` | Adds unit coverage for bespoke termination/discard behavior and raw-response logging. |
| `tests/integration/test_conversation_runner.py` | Updates integration tests for new multi-session outputs (filenames, log_files) and adds multi-session runner tests. |
```python
session_filename_base = (
    f"{filename_base}_{i}_{session_type}" if self.session_types else filename_base
)
session_logger = setup_conversation_logger(session_filename_base, run_id=self.run_id)
session_start = time.time()
```
Per-session logging uses a new session_logger (created here), but log_conversation_start(...) is only emitted to the earlier top-level logger. As a result, each session log file will contain turns/end but not the start/header metadata, while the top-level log contains start but no turns/end. Consider calling log_conversation_start for each session_logger (or otherwise consolidating logs per session).
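A minimal, self-contained sketch of the consolidation suggested above: emit the start header into each session's own logger so every session log carries both the header and its turns. `setup_conversation_logger` and `log_conversation_start` here are simplified stand-ins for the project's helpers, not their real implementations.

```python
import logging
import os
import tempfile

def setup_conversation_logger(name, folder):
    # Simplified stand-in: one dedicated file handler per session log.
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.addHandler(logging.FileHandler(os.path.join(folder, f"{name}.log"), mode="w"))
    logger.setLevel(logging.INFO)
    return logger

def log_conversation_start(logger, session_type):
    # Emit the start header into *this* session's log, not only the top-level one.
    logger.info(f"CONVERSATION STARTED (session={session_type})")

folder = tempfile.mkdtemp()
for i, session_type in enumerate(["intake", "followup"], start=1):
    session_logger = setup_conversation_logger(f"base_{i}_{session_type}", folder)
    log_conversation_start(session_logger, session_type)  # per-session header
    session_logger.info("turn 1 ...")                     # turns land in the same file
```

Each session log file then contains both the start metadata and the turns, instead of splitting them across the top-level and session logs.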
```python
if i > 1:
    print(f" Session {i - 1} finished. Starting session {i}/{len(session_types)}: {session_type}")
else:
    print(f" Starting session {i}/{len(session_types)}: {session_type}")
```
Session loop status messages are printed unconditionally via print(...). This will add noisy stdout in normal runs even when --debug is not enabled, and it bypasses the existing debug_print pattern used elsewhere. Route these messages through debug_print or a logger gated by the debug flag instead.
Suggested change:
```python
if getattr(self, "debug", False):
    if i > 1:
        print(f" Session {i - 1} finished. Starting session {i}/{len(session_types)}: {session_type}")
    else:
        print(f" Starting session {i}/{len(session_types)}: {session_type}")
```
```python
response = current_speaker._post_process_response(response)
```
generate_conversation: _post_process_response() is applied twice (once at line 106 and again at line 111). This can lead to double-stripping/mangling if the hook is not idempotent and will also skew word counts and stored transcript. Apply post-processing exactly once (and keep raw_response separately for logging/signals as you already do).
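To illustrate why a double call is unsafe, here is a small demonstration with a hypothetical non-idempotent post-processing hook (`strip_end_marker` is an illustrative example, not the PR's code): applying it once is correct, while a second application can mangle legitimate content.

```python
# Hypothetical non-idempotent post-processing hook for illustration only:
# each call removes exactly one trailing "[END]" marker.
def strip_end_marker(text: str) -> str:
    return text[: -len("[END]")] if text.endswith("[END]") else text

raw_response = "Sure, goodbye![END]"
processed = strip_end_marker(raw_response)   # apply exactly once: "Sure, goodbye!"

# If the response legitimately ends with the marker text, a second
# application strips real content:
risky = "See section [END][END]"
once = strip_end_marker(risky)    # "See section [END]"
twice = strip_end_marker(once)    # "See section " -- content lost
```

Keeping the raw response separately for logging and signal extraction, as the PR already does, and post-processing exactly once avoids this class of bug.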
Suggested change: delete the duplicated `response = current_speaker._post_process_response(response)` call.
```python
assert result["turns"] == 5
assert "filenames" in result
assert "log_files" in result
assert isinstance(result["filenames"], list)
assert isinstance(result["log_files"], list)
assert result["duration"] > 0
assert isinstance(result["conversation"], list)
assert len(result["conversation"]) == 4
```
These assertions are internally inconsistent: result['turns'] is expected to be 5 while len(result['conversation']) is asserted to be 4. In the implementation, turns is derived from len(all_conversations), so these should match. Update the test to assert consistent values (and ensure the expected turn count matches the actual speaker-order/max_turns behavior).
```python
self.persona_speaks_first = persona_speaks_first
self.session_types = session_types

# Default persona_speaks_first based on agent type
if persona_speaks_first is None:
    agent_model = agent_model_config.get("model", "").lower()
```
ConversationRunner.init: when persona_speaks_first is None, the code computes agent_model but never assigns self.persona_speaks_first. That leaves it as None, which later gets passed into ensure_provider_has_last_turn / generate_conversation and changes behavior (None is treated as False). Set an explicit boolean default here (e.g., preserve previous default True, or implement the intended agent-model-based default) so the single-session path remains unchanged.
Suggested change:
```python
self.session_types = session_types
# Normalize persona_speaks_first to an explicit boolean so later
# conversation-generation logic does not treat None as False.
# Preserve the existing default behavior when no override is provided.
if persona_speaks_first is None:
    self.persona_speaks_first = True
```
```python
# Use session-type suffix in filename only when --sessions was explicitly passed
session_filename_base = (
    f"{filename_base}_{i}_{session_type}" if self.session_types else filename_base
)
session_logger = setup_conversation_logger(session_filename_base, run_id=self.run_id)
session_start = time.time()
```
run_single_conversation: for the common single-session case (self.session_types is None), session_filename_base equals filename_base, so setup_conversation_logger() is called twice with the same log_filename. Because setup_conversation_logger opens the file in mode='w' and clears handlers, the second call will truncate/overwrite the earlier "CONVERSATION STARTED" log entries and may leak/duplicate handlers. Consider reusing the top-level logger for the single-session case, or ensuring the session logger uses a distinct filename / append mode.
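One way to implement the "reuse the top-level logger" option is to branch on whether a distinct per-session filename actually exists. The helper below is an illustrative sketch with made-up names, not the PR's code:

```python
# Sketch of the single-session fix suggested above: only create a new session
# logger when the per-session filename differs from the top-level one;
# otherwise reuse the existing logger so its file is never reopened in
# mode='w' and truncated. All names here are illustrative.
def choose_session_logger(session_types, filename_base, session_filename_base,
                          top_level_logger, make_logger):
    if session_types and session_filename_base != filename_base:
        return make_logger(session_filename_base)  # distinct per-session log file
    return top_level_logger  # single-session: keep the existing log intact

created = []
top_level = object()  # stand-in for the already-configured top-level logger
logger = choose_session_logger(
    None, "run1", "run1", top_level, lambda name: created.append(name) or name
)
```

In the single-session path this returns the top-level logger untouched, so the earlier "CONVERSATION STARTED" entries survive and no duplicate handlers are attached.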
```python
await agent.finish_and_reset_session(session_type)
```
The runner calls agent.finish_and_reset_session(session_type) before running each session, including the first. However, the hook’s docstring says it should "Finish the current session and reset for a new type", which implies calling it after a session completes (or only between sessions). Either adjust the call order (e.g., call after each session except the last) or rename / re-document the hook to match the actual lifecycle semantics.
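The "call after each session except the last" option can be sketched as follows. `RecordingAgent` and `run_sessions` are stand-ins for illustration, not the PR's classes:

```python
import asyncio

# Sketch of the lifecycle ordering proposed above: finish_and_reset_session
# runs *between* sessions, i.e. after every session except the last.
class RecordingAgent:
    def __init__(self):
        self.calls = []

    async def finish_and_reset_session(self, session_type):
        self.calls.append(session_type)

async def run_sessions(agent, session_types):
    for i, session_type in enumerate(session_types, start=1):
        pass  # ... run the session itself here ...
        if i < len(session_types):  # reset only between sessions
            await agent.finish_and_reset_session(session_type)

agent = RecordingAgent()
asyncio.run(run_sessions(agent, ["intake", "main", "wrapup"]))
```

With three sessions, the hook fires twice (after `intake` and after `main`), matching the "finish the current session" semantics in the docstring.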
```python
# Check for exact phrase matches (case insensitive)
if re.search(re.escape(self.termination_signal), response, re.IGNORECASE):
```
```python
# Bespoke signals extracted from raw response before post-processing
if extracted_signals and any([signal in speaker.bespoke_termination_signals for signal in extracted_signals]):
```
_should_terminate_conversation: the bespoke-signal check builds a list inside any(...). This eagerly allocates an intermediate list; use a generator expression instead (any(signal in ... for signal in extracted_signals)) to keep it lazy and consistent with typical style.
Suggested change:
```python
if extracted_signals and any(
    signal in speaker.bespoke_termination_signals
    for signal in extracted_signals
):
```
```python
async def setup(self) -> None:
    """Set up any resources needed before the conversation starts.

    Called by the simulator once before the turn loop begins.
    Subclasses that need async initialization (e.g. session creation,
    auth token acquisition) should override this method.
    Default implementation does nothing.
    """
    pass
```
LLMInterface.setup() docstring says it is "Called by the simulator", but the new call site in this PR is ConversationRunner (before the session loop). Update the docstring to match the actual lifecycle so implementers know where/when setup() is invoked.
```python
workflow = self.agent_model_config.get("workflow", "")
workflow_part = f"_{workflow}" if workflow else ""
filename_base = f"{tag}_{persona_safe}_{model_short}{workflow_part}_run{run_number}"
```
filename_base now includes agent_model_config['workflow'] verbatim. Unlike persona_name, this string is not sanitized, so values containing spaces, path separators ("/", ".."), or other filesystem-invalid characters can break file creation or potentially write outside the intended folder. Consider normalizing workflow to a safe filename component (e.g., allowlist [A-Za-z0-9_-] and replace others).
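A minimal sketch of the allowlist normalization suggested above. The function name and exact policy (what to allow, what to replace with) are illustrative choices, not the PR's code:

```python
import re

# Hypothetical sanitizer: keep only [A-Za-z0-9_-] in a filename component,
# replacing anything else (spaces, "/", "..", etc.) with "_". This blocks
# path separators and traversal sequences from reaching the filesystem.
def sanitize_filename_component(value: str) -> str:
    return re.sub(r"[^A-Za-z0-9_-]", "_", value)

workflow = sanitize_filename_component("triage v2/../escape")
```

Applying this to `workflow` (as is presumably already done for `persona_name`) keeps `filename_base` safe to use as a filename.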
jgieringer left a comment
If sessions could be desired by a non-spring entity, could you provide detail in the README (or its own md file) on exactly what it is doing?
It's multiple short "conversations" that get stitched together?
So maybe even all_conversations should be something like full_conversation?
```python
writes transcript to self.folder_name, then cleans up logger and LLMs.
A single session without --sessions is treated as a one-element session list so
the same code path handles both cases. The agent's prepare_sessions() hook
normalises the list (e.g. prepending INTAKE for ray-backend).
```
Can a non-internal example be provided or is this truly spring-related?
I... am not clear what ray-backend is, or why I'd want to prepend INTAKE to it. This could use more explanation... maybe it's trying to say, "For example, if an agent has session types INTAKE and RETURN, each type might be added to a persona name, e.g. Ray, becoming INTAKE-Ray and RETURN-Ray."? Some sort of explanation like this would help me.
```python
output_filename = f"{session_filename_base}.txt"
simulator.save_conversation(output_filename, self.folder_name)
all_conversations.extend(conversation)
```
I'm finding extending conversation, a list of turns, into list all_conversations is confusing.
Could you rename it to be all_turns or full_conversation?
It's true that I assumed all_conversations was a list of conversation objects... if we're extending it so all_conversations is a single, longer conversation from concatenated sessions, I agree with @jgieringer 's naming suggestions. (And I feel like that leans towards setting max_turns over all the sessions rather than per session.)
```python
log_files.append(
    f"logging/{self.run_id}/{session_filename_base}.log"
)
```
Making note I will need to update #144 to store logging location
```python
Two termination paths:
- Global signal (self.termination_signal): only the persona can trigger it,
  checked against the post-processed response.
- Bespoke signals: passed in as extracted_signals, detected from the raw
```
Will the term "bespoke" make sense to all our users? I think I'd go for "custom for specific LLM client" if I'm understanding correctly what it is? (More curiosity than blocking.)
```python
# Check for exact phrase matches (case insensitive)
if re.search(re.escape(self.termination_signal), response, re.IGNORECASE):
```
```python
# Bespoke signals extracted from raw response before post-processing
if extracted_signals and any([signal in speaker.bespoke_termination_signals for signal in extracted_signals]):
```
this looks like it's looking for an exact match... for the global signals we do a regex search with an ignorecase... do we want to offer the same for the bespoke signals?
Is this being called after _extract_signals in the llm_interface has already handled that?
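If case-insensitive matching is wanted for bespoke signals too, the check could mirror the global-signal regex. This is a sketch of one possible answer to the question above, with an illustrative signal, not the PR's code:

```python
import re

# Sketch: case-insensitive bespoke-signal check, mirroring the global-signal
# pattern (re.escape + re.IGNORECASE). Signal strings here are illustrative.
def matches_bespoke_signal(response: str, signals) -> bool:
    return any(
        re.search(re.escape(signal), response, re.IGNORECASE)
        for signal in signals
    )

hit = matches_bespoke_signal("An [error] occurred upstream", ["[ERROR]"])
```

`re.escape` matters here because bespoke signals like `[ERROR]` contain regex metacharacters that would otherwise be interpreted as a character class.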
```python
next_speaker = self.persona
```
```python
debug_print(
    f"[SIM] Starting conversation: persona_speaks_first={persona_speaks_first}, "
```
What's the "[SIM]" indicating?
Also.... how is the debug_print turned on or off?
I guess I'm wondering if we want all these debug_prints in the code base longer term.
(And if we're keeping them, we should add to the README how to turn them on/off? and maybe use that pattern more places?)
```diff
     persona_config (dict): Must have "prompt" and "name". Persona LLM
         identity comes from ``self.persona_model_config`` (including ``model``).
-    max_turns (int): Max conversation turns for a conversation.
+    max_turns (int): Max conversation turns per session.
```
I'm trying to wrap my head around whether we'd want max_turns to be the max_turns per session or the max_turns across sessions. It feels like if the conversations are going to be concatenated together for judging, we'd want it to be across sessions for consistency with single-session evaluations. But if the sessions are going to be evaluated as independent conversations, then we'd want max_turns to apply on the session level...
```diff
     Dict[str, Any]: index, llm1_model, llm1_prompt, run_number, turns,
-        filename, log_file, duration, early_termination, conversation.
+        filenames, log_files, duration, early_termination, conversation, and
+        optional filename/log_file for single-transcript compatibility.
```
I am confused... what would be the difference between log_files and filename/log_file... and... is this taking into account the updates coming in #144?
```python
workflow = self.agent_model_config.get("workflow", "")
workflow_part = f"_{workflow}" if workflow else ""
filename_base = (
    f"{tag}_{persona_name}_{model_name}{workflow_part}_run{run_number}"
```
I wonder if introducing this naming scheme will break anything in our utils for parsing file names / need updates there to handle this approach?
| """Per-implementation early-termination signals. | ||
|
|
||
| Return a list of signal strings. When any signal is found in this | ||
| speaker's response, the conversation ends early. |
Is it for both the agent as user and the agent as provider? I've been assuming only as provider, but this makes it sound like it could be either....
```python
class ErrorAgent(MockLLM):
    @property
    def bespoke_termination_signals(self):
        return ["[ERROR]"]
```
I find myself wondering if / how this interacts with the retry_on_error additions Josh put in... maybe they're independent?
Agree we need to update the README a couple places:
... and... did we get the argument for sessions in both generate.py and the pipeline? I feel like maybe we missed the pipeline... and also it'll belong (I think) in this once it merges...
Summary
A set of improvements to the conversation runner and LLM interface, developed while integrating a new provider type internally. All changes are generic and provider-agnostic.
Note: the custom LLM class will need to have knowledge of session requirements (e.g., an intake session is always mandated).
Multi-session support (`runner.py`, `generate.py`)
- Multiple session types per conversation (`--sessions`). A single session without `--sessions` follows the existing code path unchanged.
- `agent.setup()` is called once before the session loop; `agent.first_speaker` is respected per session, overriding the global `--provider-speaks-first` flag when set.
- New hooks on `LLMInterface`: `prepare_sessions()`, `finish_and_reset_session()`, `first_speaker`, `setup()`.

Bespoke termination signals (`llm_interface.py`, `conversation_simulator.py`)
- `bespoke_termination_signals` to end a conversation early on provider-specific strings (e.g. error tags).
- `_extract_signals()` detects signals from the raw response before `_post_process_response()` strips artifacts.
- `_should_discard_response()` lets a provider remove an error turn from the transcript entirely rather than marking it as early termination.

Debug output (`conversation_simulator.py`)
- Turn-by-turn tracing via `debug_print` (only active with `--debug`).

Test plan
- `python3 generate.py -u claude-sonnet-4-6 -p claude-sonnet-4-6 -t 6 -r 1`: single session, existing behaviour unchanged
- `python3 generate.py ... --sessions session_a,session_b`: two sessions run in sequence, filenames suffixed
- `python3 generate.py ... --debug`: turn-by-turn output visible