
Proposal: eval_defaults on Step — single source of truth for dynamic prompts #5

@justi

Proposal: eval_defaults on Step

Problem

Steps with dynamic prompts (system input[:system_message]) force evals to provide the full system_message in default_input. This creates two sources of truth — the production method that builds the system_message (in a service/concern) and a hardcoded copy in the eval. When the prompt changes, the eval drifts silently.

Example incident: an eval for a link-insertion step had a stripped-down system_message missing "when to skip" rules. The model always inserted links, even in unrelated comments. Live optimize reported 0.00 while manual tests with the production prompt passed 5/5. Root cause took 30 minutes to find — the eval prompt had drifted from production.

Proposed API

```ruby
class InsertLink < RubyLLM::Contract::Step::Base
  prompt do |input|
    system input[:system_message]
    user input[:prompt_text]
  end

  eval_defaults do
    { system_message: MyApp::Prompts.link_insertion_system_message }
  end
end
```

Eval definitions inherit eval_defaults merged into default_input:

```ruby
InsertLink.define_eval("smoke") do
  # system_message is provided automatically by eval_defaults; no duplication
  default_input({
    prompt_text: "[ORIGINAL COMMENT]\n...",
    original_comment: "...",
    allowed_urls: ["https://example.com/page"]
  })

  sample_response({ comment: "...", link_inserted: true, ... })
  verify "link inserted", expect: ->(o) { o[:link_inserted] }
end
```

An eval can still override `system_message` in `default_input` if needed: on a key collision, the explicit value wins over the default.
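The precedence rule follows from how `Hash#merge` resolves key collisions. A minimal sketch, using hypothetical placeholder values rather than real prompts:

```ruby
# Hypothetical values standing in for eval_defaults and default_input.
step_defaults  = { system_message: "production prompt" }
explicit_input = { system_message: "override for this eval", prompt_text: "hello" }

# Hash#merge keeps the argument's value when both hashes share a key,
# so the eval's explicit input wins over the step-level default.
effective = step_defaults.merge(explicit_input)

# Keys the eval does not set fall through from the defaults.
inherited = step_defaults.merge({ prompt_text: "hello" })
```

Here `effective[:system_message]` is the override, while `inherited[:system_message]` is still the step default.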

When this helps

  • Step has `system input[:system_message]` — prompt comes from a service, not from the step itself. The service builds it from persona, language, voice rules, etc. Eval needs the same prompt but has no access to the service.
  • Multiple evals per step — each eval would otherwise duplicate the same system_message. With eval_defaults, it's defined once on the step.
  • Prompt iteration — when you change the production prompt, evals automatically pick up the change. No manual sync.

When this is unnecessary

  • Step has a static prompt — `system "You classify tickets..."` or `system RUBRIC_CONSTANT`. The prompt lives on the step, not in external services. Eval already tests the real prompt without needing eval_defaults.
  • Step has `prompt "Classify: {input}"` — simple string prompt, no system_message in input. Nothing to default.
  • One eval per step — the duplication cost is low. A support module (current workaround) is fine.

Data from a real project

11 steps total. Prompt patterns:

| Pattern | Count | `eval_defaults` needed? |
|---|---|---|
| `system input[:system_message]` (dynamic from service) | 4 | yes |
| `system <<~SYS` (inline static) | 3 | no |
| `system CONSTANT` | 2 | no |
| `system "string"` (one-liner static) | 1 | no |

4/11 steps would benefit. The 3 that already have evals use a workaround — a support module that includes the production prompts concern and delegates. It works but is boilerplate that eval_defaults would eliminate.

Current workaround

```ruby
module EvalSupport
  class PromptHost
    include MyApp::Prompts

    def self.system_message
      new.system_message_for_link_insertion
    end
  end
end

# In eval:
InsertLink.define_eval("smoke") do
  default_input({ system_message: EvalSupport::PromptHost.system_message, ... })
end
```

Works, but eval authors must know to use it instead of hardcoding. Easy to forget — as the incident showed.

Implementation sketch

```ruby

# In Step::Base
def self.eval_defaults(&block)
  @eval_defaults_block = block
end

def self.resolved_eval_defaults
  @eval_defaults_block&.call || {}
end

# In EvalDefinition#build_dataset
def effective_default_input
  step.resolved_eval_defaults.merge(@default_input || {})
end
```

Evaluation is lazy (a block, not a hash), so production methods are called at eval time, not at class load time.
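To illustrate why the block matters, here is a self-contained sketch. `SketchStep`, `SketchPrompts`, and `SketchInsertLink` are hypothetical stand-ins, not the real `RubyLLM::Contract` classes; only the two class methods mirror the sketch in this proposal.

```ruby
# Stand-in for Step::Base (hypothetical, not the real library class).
class SketchStep
  def self.eval_defaults(&block)
    @eval_defaults_block = block
  end

  def self.resolved_eval_defaults
    @eval_defaults_block&.call || {}
  end
end

# Hypothetical prompt source whose output can change after class load.
module SketchPrompts
  @version = 1

  class << self
    attr_accessor :version

    def system_message
      "prompt v#{version}"
    end
  end
end

class SketchInsertLink < SketchStep
  # The block is stored here, not called.
  eval_defaults { { system_message: SketchPrompts.system_message } }
end

# The prompt changes after the step class has been loaded.
SketchPrompts.version = 2

# The block runs only now, so it sees the current prompt.
resolved = SketchInsertLink.resolved_eval_defaults

# A step that never declares eval_defaults resolves to an empty hash.
no_defaults = Class.new(SketchStep).resolved_eval_defaults
```

With a plain hash instead of a block, `system_message` would have been captured as `"prompt v1"` at class load. One point the sketch leaves open: class-level instance variables are not inherited in Ruby, so a subclass of a step would not see its parent's block unless the lookup also walks `superclass`.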

Decision

Not blocking — workaround exists and is used in production. Consider for 0.7 if more projects report the same drift issue.
