Adding coverage for multi-session runner and LLMInterface hooks #143
sator-labs wants to merge 4 commits into v1.1 from
Conversation
Important note: it currently fails on one (1) unit test

Since we're so close to shipping v1.1, can I trouble you @sator-labs to change merging from main to v1.1?

done
Pull request overview
This PR enhances the conversation generation pipeline to support multi-session runs and adds new extensibility hooks to LLMInterface for provider-specific session lifecycle and termination behavior.
Changes:
- Add `LLMInterface` hooks for session lifecycle (`setup`, `prepare_sessions`, `finish_and_reset_session`, `first_speaker`) and bespoke early termination handling.
- Update the conversation runner to execute multiple session types per “conversation” and return per-session filenames/log paths.
- Add simulator support for bespoke termination signals, raw-response metadata capture (provider-side), and debug turn tracing.
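To make the new lifecycle hooks concrete, here is a minimal sketch of the default behaviors the PR describes. The names follow the PR summary, but the signatures and bodies below are assumptions for illustration, not the PR's actual code.

```python
import asyncio

# Hypothetical sketch of the LLMInterface hook defaults described above;
# signatures and bodies are assumptions, not the PR's exact implementation.
class LLMInterface:
    @property
    def first_speaker(self):
        # None means "no per-session preference"; the runner falls back
        # to the global first-speaker flag.
        return None

    @property
    def bespoke_termination_signals(self):
        # Provider-specific strings that end a conversation early; empty by default.
        return []

    async def setup(self):
        # One-time async initialization before the session loop; no-op by default.
        pass

    def prepare_sessions(self, session_types):
        # Providers may normalise the requested session list; identity by default.
        return session_types

    async def finish_and_reset_session(self, session_type):
        # Reset provider-side state between session types; no-op by default.
        pass
```

With these defaults, a provider that needs none of the hooks behaves exactly as before, while a custom provider overrides only what it needs.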
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `llm_clients/llm_interface.py` | Introduces new extensibility hooks (sessions, termination signals, response discard) on the base interface. |
| `generate_conversations/runner.py` | Implements multi-session execution, per-session output files/logs, and session-aware first-speaker selection. |
| `generate_conversations/conversation_simulator.py` | Adds bespoke termination/discard logic, stores provider raw responses in turn metadata, and emits debug progress output. |
| `generate.py` | Adds `--sessions` CLI argument and wires it into the runner. |
| `tests/unit/llm_clients/test_llm_interface.py` | Adds unit coverage for new LLMInterface hook defaults and signal extraction. |
| `tests/unit/generate_conversations/test_conversation_simulator.py` | Adds unit coverage for bespoke termination/discard behavior and raw-response logging. |
| `tests/integration/test_conversation_runner.py` | Updates integration tests for new multi-session outputs (filenames, log_files) and adds multi-session runner tests. |
```python
session_filename_base = (
    f"{filename_base}_{i}_{session_type}" if self.session_types else filename_base
)
session_logger = setup_conversation_logger(session_filename_base, run_id=self.run_id)
session_start = time.time()
```
Per-session logging uses a new session_logger (created here), but log_conversation_start(...) is only emitted to the earlier top-level logger. As a result, each session log file will contain turns/end but not the start/header metadata, while the top-level log contains start but no turns/end. Consider calling log_conversation_start for each session_logger (or otherwise consolidating logs per session).
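A minimal, self-contained sketch of the consolidation suggested above: emit the start header into each session's own logger so every session log carries both the header and its turns. `setup_conversation_logger` and `log_conversation_start` here are simplified stand-ins for the project's helpers, not their real implementations.

```python
import logging
import os
import tempfile

def setup_conversation_logger(name, folder):
    # Simplified stand-in: one dedicated file handler per session log.
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.addHandler(logging.FileHandler(os.path.join(folder, f"{name}.log"), mode="w"))
    logger.setLevel(logging.INFO)
    return logger

def log_conversation_start(logger, session_type):
    # Emit the start header into *this* session's log, not only the top-level one.
    logger.info(f"CONVERSATION STARTED (session={session_type})")

folder = tempfile.mkdtemp()
for i, session_type in enumerate(["intake", "followup"], start=1):
    session_logger = setup_conversation_logger(f"base_{i}_{session_type}", folder)
    log_conversation_start(session_logger, session_type)  # per-session header
    session_logger.info("turn 1 ...")                     # turns land in the same file
```

Each session log file then contains both the start metadata and the turns, instead of splitting them across the top-level and session logs.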
```python
if i > 1:
    print(f" Session {i - 1} finished. Starting session {i}/{len(session_types)}: {session_type}")
else:
    print(f" Starting session {i}/{len(session_types)}: {session_type}")
```
Session loop status messages are printed unconditionally via print(...). This will add noisy stdout in normal runs even when --debug is not enabled, and it bypasses the existing debug_print pattern used elsewhere. Route these messages through debug_print or a logger gated by the debug flag instead.
Suggested change:
```python
if getattr(self, "debug", False):
    if i > 1:
        print(f" Session {i - 1} finished. Starting session {i}/{len(session_types)}: {session_type}")
    else:
        print(f" Starting session {i}/{len(session_types)}: {session_type}")
```
```python
response = current_speaker._post_process_response(response)
```
generate_conversation: _post_process_response() is applied twice (once at line 106 and again at line 111). This can lead to double-stripping/mangling if the hook is not idempotent and will also skew word counts and stored transcript. Apply post-processing exactly once (and keep raw_response separately for logging/signals as you already do).
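To illustrate why a double call is unsafe, here is a small demonstration with a hypothetical non-idempotent post-processing hook (`strip_end_marker` is an illustrative example, not the PR's code): applying it once is correct, while a second application can mangle legitimate content.

```python
# Hypothetical non-idempotent post-processing hook for illustration only:
# each call removes exactly one trailing "[END]" marker.
def strip_end_marker(text: str) -> str:
    return text[: -len("[END]")] if text.endswith("[END]") else text

raw_response = "Sure, goodbye![END]"
processed = strip_end_marker(raw_response)   # apply exactly once: "Sure, goodbye!"

# If the response legitimately ends with the marker text, a second
# application strips real content:
risky = "See section [END][END]"
once = strip_end_marker(risky)    # "See section [END]"
twice = strip_end_marker(once)    # "See section " -- content lost
```

Keeping the raw response separately for logging and signal extraction, as the PR already does, and post-processing exactly once avoids this class of bug.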
Suggested change: delete the duplicated `response = current_speaker._post_process_response(response)` call.
```python
assert result["turns"] == 5
assert "filenames" in result
assert "log_files" in result
assert isinstance(result["filenames"], list)
assert isinstance(result["log_files"], list)
assert result["duration"] > 0
assert isinstance(result["conversation"], list)
assert len(result["conversation"]) == 4
```
These assertions are internally inconsistent: result['turns'] is expected to be 5 while len(result['conversation']) is asserted to be 4. In the implementation, turns is derived from len(all_conversations), so these should match. Update the test to assert consistent values (and ensure the expected turn count matches the actual speaker-order/max_turns behavior).
```python
self.persona_speaks_first = persona_speaks_first
self.session_types = session_types

# Default persona_speaks_first based on agent type
if persona_speaks_first is None:
    agent_model = agent_model_config.get("model", "").lower()
```
ConversationRunner.init: when persona_speaks_first is None, the code computes agent_model but never assigns self.persona_speaks_first. That leaves it as None, which later gets passed into ensure_provider_has_last_turn / generate_conversation and changes behavior (None is treated as False). Set an explicit boolean default here (e.g., preserve previous default True, or implement the intended agent-model-based default) so the single-session path remains unchanged.
Suggested change:
```python
self.session_types = session_types
# Normalize persona_speaks_first to an explicit boolean so later
# conversation-generation logic does not treat None as False.
# Preserve the existing default behavior when no override is provided.
if persona_speaks_first is None:
    self.persona_speaks_first = True
```
```python
# Use session-type suffix in filename only when --sessions was explicitly passed
session_filename_base = (
    f"{filename_base}_{i}_{session_type}" if self.session_types else filename_base
)
session_logger = setup_conversation_logger(session_filename_base, run_id=self.run_id)
session_start = time.time()
```
run_single_conversation: for the common single-session case (self.session_types is None), session_filename_base equals filename_base, so setup_conversation_logger() is called twice with the same log_filename. Because setup_conversation_logger opens the file in mode='w' and clears handlers, the second call will truncate/overwrite the earlier "CONVERSATION STARTED" log entries and may leak/duplicate handlers. Consider reusing the top-level logger for the single-session case, or ensuring the session logger uses a distinct filename / append mode.
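One way to implement the "reuse the top-level logger" option is to branch on whether a distinct per-session filename actually exists. The helper below is an illustrative sketch with made-up names, not the PR's code:

```python
# Sketch of the single-session fix suggested above: only create a new session
# logger when the per-session filename differs from the top-level one;
# otherwise reuse the existing logger so its file is never reopened in
# mode='w' and truncated. All names here are illustrative.
def choose_session_logger(session_types, filename_base, session_filename_base,
                          top_level_logger, make_logger):
    if session_types and session_filename_base != filename_base:
        return make_logger(session_filename_base)  # distinct per-session log file
    return top_level_logger  # single-session: keep the existing log intact

created = []
top_level = object()  # stand-in for the already-configured top-level logger
logger = choose_session_logger(
    None, "run1", "run1", top_level, lambda name: created.append(name) or name
)
```

In the single-session path this returns the top-level logger untouched, so the earlier "CONVERSATION STARTED" entries survive and no duplicate handlers are attached.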
```python
await agent.finish_and_reset_session(session_type)
```
The runner calls agent.finish_and_reset_session(session_type) before running each session, including the first. However, the hook’s docstring says it should "Finish the current session and reset for a new type", which implies calling it after a session completes (or only between sessions). Either adjust the call order (e.g., call after each session except the last) or rename / re-document the hook to match the actual lifecycle semantics.
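The "call after each session except the last" option can be sketched as follows. `RecordingAgent` and `run_sessions` are stand-ins for illustration, not the PR's classes:

```python
import asyncio

# Sketch of the lifecycle ordering proposed above: finish_and_reset_session
# runs *between* sessions, i.e. after every session except the last.
class RecordingAgent:
    def __init__(self):
        self.calls = []

    async def finish_and_reset_session(self, session_type):
        self.calls.append(session_type)

async def run_sessions(agent, session_types):
    for i, session_type in enumerate(session_types, start=1):
        pass  # ... run the session itself here ...
        if i < len(session_types):  # reset only between sessions
            await agent.finish_and_reset_session(session_type)

agent = RecordingAgent()
asyncio.run(run_sessions(agent, ["intake", "main", "wrapup"]))
```

With three sessions, the hook fires twice (after `intake` and after `main`), matching the "finish the current session" semantics in the docstring.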
```python
# Check for exact phrase matches (case insensitive)
if re.search(re.escape(self.termination_signal), response, re.IGNORECASE):
```
```python
# Bespoke signals extracted from raw response before post-processing
if extracted_signals and any([signal in speaker.bespoke_termination_signals for signal in extracted_signals]):
```
_should_terminate_conversation: the bespoke-signal check builds a list inside any(...). This eagerly allocates an intermediate list; use a generator expression instead (any(signal in ... for signal in extracted_signals)) to keep it lazy and consistent with typical style.
Suggested change:
```python
if extracted_signals and any(
    signal in speaker.bespoke_termination_signals
    for signal in extracted_signals
):
```
```python
async def setup(self) -> None:
    """Set up any resources needed before the conversation starts.

    Called by the simulator once before the turn loop begins.
    Subclasses that need async initialization (e.g. session creation,
    auth token acquisition) should override this method.
    Default implementation does nothing.
    """
    pass
```
LLMInterface.setup() docstring says it is "Called by the simulator", but the new call site in this PR is ConversationRunner (before the session loop). Update the docstring to match the actual lifecycle so implementers know where/when setup() is invoked.
```python
workflow = self.agent_model_config.get("workflow", "")
workflow_part = f"_{workflow}" if workflow else ""
filename_base = f"{tag}_{persona_safe}_{model_short}{workflow_part}_run{run_number}"
```
filename_base now includes agent_model_config['workflow'] verbatim. Unlike persona_name, this string is not sanitized, so values containing spaces, path separators ("/", ".."), or other filesystem-invalid characters can break file creation or potentially write outside the intended folder. Consider normalizing workflow to a safe filename component (e.g., allowlist [A-Za-z0-9_-] and replace others).
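A minimal sketch of the allowlist normalization suggested above. The function name and exact policy (what to allow, what to replace with) are illustrative choices, not the PR's code:

```python
import re

# Hypothetical sanitizer: keep only [A-Za-z0-9_-] in a filename component,
# replacing anything else (spaces, "/", "..", etc.) with "_". This blocks
# path separators and traversal sequences from reaching the filesystem.
def sanitize_filename_component(value: str) -> str:
    return re.sub(r"[^A-Za-z0-9_-]", "_", value)

workflow = sanitize_filename_component("triage v2/../escape")
```

Applying this to `workflow` (as is presumably already done for `persona_name`) keeps `filename_base` safe to use as a filename.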
jgieringer left a comment
If sessions could be desired by a non-spring entity, could you provide detail in the README (or its own md file) on exactly what it is doing?
It's multiple short "conversations" that get stitched together?
So maybe even all_conversations should be something like full_conversation?
```python
writes transcript to self.folder_name, then cleans up logger and LLMs.
A single session without --sessions is treated as a one-element session list so
the same code path handles both cases. The agent's prepare_sessions() hook
normalises the list (e.g. prepending INTAKE for ray-backend).
```
Can a non-internal example be provided or is this truly spring-related?
I... am not clear what ray-backend is, or why I'd want to prepend INTAKE to it. This could use more explanation... maybe it's trying to say, "For example, if an agent has session types INTAKE and RETURN, each type might be added to a persona name, e.g. Ray, becoming INTAKE-Ray and RETURN-Ray."? Some sort of explanation like this would help me.
```python
output_filename = f"{session_filename_base}.txt"
simulator.save_conversation(output_filename, self.folder_name)
all_conversations.extend(conversation)
```
I'm finding extending conversation, a list of turns, into list all_conversations is confusing.
Could you rename it to be all_turns or full_conversation?
It's true that I assumed all_conversations was a list of conversation objects... if we're extending it so all_conversations is a single, longer conversation from concatenated sessions, I agree with @jgieringer 's naming suggestions. (And I feel like that leans towards setting max_turns over all the sessions rather than per session.)
```python
log_files.append(
    f"logging/{self.run_id}/{session_filename_base}.log"
)
```
Making note I will need to update #144 to store logging location
```python
Two termination paths:
- Global signal (self.termination_signal): only the persona can trigger it,
  checked against the post-processed response.
- Bespoke signals: passed in as extracted_signals, detected from the raw
```
Will the term "bespoke" make sense to all our users? I think I'd go for "custom for specific LLM client" if I'm understanding correctly what it is? (More curiosity than blocking.)
```python
# Check for exact phrase matches (case insensitive)
if re.search(re.escape(self.termination_signal), response, re.IGNORECASE):
```
```python
# Bespoke signals extracted from raw response before post-processing
if extracted_signals and any([signal in speaker.bespoke_termination_signals for signal in extracted_signals]):
```
this looks like it's looking for an exact match... for the global signals we do a regex search with an ignorecase... do we want to offer the same for the bespoke signals?
Is this being called after _extract_signals in the llm_interface has already handled that?
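If case-insensitive matching is wanted for bespoke signals too, the check could mirror the global-signal regex. This is a sketch of one possible answer to the question above, with an illustrative signal, not the PR's code:

```python
import re

# Sketch: case-insensitive bespoke-signal check, mirroring the global-signal
# pattern (re.escape + re.IGNORECASE). Signal strings here are illustrative.
def matches_bespoke_signal(response: str, signals) -> bool:
    return any(
        re.search(re.escape(signal), response, re.IGNORECASE)
        for signal in signals
    )

hit = matches_bespoke_signal("An [error] occurred upstream", ["[ERROR]"])
```

`re.escape` matters here because bespoke signals like `[ERROR]` contain regex metacharacters that would otherwise be interpreted as a character class.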
```python
next_speaker = self.persona
```
```python
debug_print(
    f"[SIM] Starting conversation: persona_speaks_first={persona_speaks_first}, "
```
What's the "[SIM]" indicating?
Also.... how is the debug_print turned on or off?
I guess I'm wondering if we want all these debug_prints in the code base longer term.
(And if we're keeping them, we should add to the README how to turn them on/off? and maybe use that pattern more places?)
```diff
     persona_config (dict): Must have "prompt" and "name". Persona LLM
         identity comes from ``self.persona_model_config`` (including ``model``).
-    max_turns (int): Max conversation turns for a conversation.
+    max_turns (int): Max conversation turns per session.
```
I'm trying to wrap my head around whether we'd want max_turns to be the max_turns per session or the max_turns across sessions. It feels like if the conversations are going to be concatenated together for judging, we'd want it to be across sessions for consistency with single-session evaluations. But if the sessions are going to be evaluated as independent conversations, then we'd want max_turns to apply on the session level...
```diff
     Dict[str, Any]: index, llm1_model, llm1_prompt, run_number, turns,
-        filename, log_file, duration, early_termination, conversation.
+        filenames, log_files, duration, early_termination, conversation, and
+        optional filename/log_file for single-transcript compatibility.
```
I am confused... what would be the difference between log_files and filename/log_file... and... is this taking into account the updates coming in #144?
```python
workflow = self.agent_model_config.get("workflow", "")
workflow_part = f"_{workflow}" if workflow else ""
filename_base = (
    f"{tag}_{persona_name}_{model_name}{workflow_part}_run{run_number}"
```
I wonder if introducing this naming scheme will break anything in our utils for parsing file names / need updates there to handle this approach?
| """Per-implementation early-termination signals. | ||
|
|
||
| Return a list of signal strings. When any signal is found in this | ||
| speaker's response, the conversation ends early. |
Is it for both the agent as user and the agent as provider? I've been assuming only as provider, but this makes it sound like it could be either....
```python
class ErrorAgent(MockLLM):
    @property
    def bespoke_termination_signals(self):
        return ["[ERROR]"]
```
I find myself wondering if / how this interacts with the retry_on_error additions Josh put in... maybe they're independent?
Agree we need to update the README a couple places:
... and... did we get the argument for sessions in both generate.py and the pipeline? I feel like maybe we missed the pipeline... and also it'll belong (I think) in this once it merges...
Summary
A set of improvements to the conversation runner and LLM interface, developed while integrating a new provider type internally. All changes are generic and provider-agnostic.
Note: the custom LLM class will need to have knowledge of session requirements (e.g., an intake session is always mandated).
Multi-session support (`runner.py`, `generate.py`)
- Multiple session types per conversation (`--sessions`). A single session without `--sessions` follows the existing code path unchanged.
- `agent.setup()` is called once before the session loop; `agent.first_speaker` is respected per session, overriding the global `--provider-speaks-first` flag when set.
- New hooks on `LLMInterface`: `prepare_sessions()`, `finish_and_reset_session()`, `first_speaker`, `setup()`.

Bespoke termination signals (`llm_interface.py`, `conversation_simulator.py`)
- `bespoke_termination_signals` to end a conversation early on provider-specific strings (e.g. error tags).
- `_extract_signals()` detects signals from the raw response before `_post_process_response()` strips artifacts.
- `_should_discard_response()` lets a provider remove an error turn from the transcript entirely rather than marking it as early termination.

Debug output (`conversation_simulator.py`)
- Turn-by-turn tracing via `debug_print` (only active with `--debug`).

Test plan
- `python3 generate.py -u claude-sonnet-4-6 -p claude-sonnet-4-6 -t 6 -r 1`: single session, existing behaviour unchanged
- `python3 generate.py ... --sessions session_a,session_b`: two sessions run in sequence, filenames suffixed
- `python3 generate.py ... --debug`: turn-by-turn output visible