Problem
Completion responses can be longer than needed, causing avoidable token spend.
Proposal
Introduce token-aware completion guidance:
- Instruct responses to stay concise (e.g., a maximum word count or a compact output format).
- Keep factual completeness and citation/grounding requirements.
Note: current RLM config already sets max_tokens=4096 per subcall; this change targets actual verbosity, not just hard caps.
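As an illustration only, the guidance could be a small configurable block appended to the system prompt. The sketch below assumes a plain-string prompt. `ConcisenessConfig`, `build_guidance`, and `apply_guidance` are hypothetical names, not part of the existing RLM config, and the default of 150 words is a placeholder rather than a tuned value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConcisenessConfig:
    """Hypothetical settings for the conciseness guidance (all names assumed)."""
    enabled: bool = True
    max_words: Optional[int] = 150   # placeholder default; None disables the word limit
    compact_format: bool = True      # ask for bullets instead of long paragraphs


def build_guidance(cfg: ConcisenessConfig) -> str:
    """Return the guidance text, or an empty string when guidance is disabled."""
    if not cfg.enabled:
        return ""
    parts = ["Answer concisely."]
    if cfg.max_words is not None:
        parts.append(f"Use at most {cfg.max_words} words.")
    if cfg.compact_format:
        parts.append("Prefer short bullet points over long paragraphs.")
    # Completeness and grounding stay mandatory, as the proposal requires.
    parts.append("Do not omit required facts, citations, or grounding references.")
    return " ".join(parts)


def apply_guidance(system_prompt: str, cfg: ConcisenessConfig) -> str:
    """Append the guidance to an existing system prompt, leaving it untouched if disabled."""
    guidance = build_guidance(cfg)
    return f"{system_prompt}\n\n{guidance}" if guidance else system_prompt
```

Keeping the guidance separate from the max_tokens hard cap means it can be switched off or tuned per subcall without changing the cap itself.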
Expected Impact
- Estimated token reduction: ~5-10% (likely marginal in practice)
Risk
Acceptance Criteria
- Conciseness guidance is configurable and documented.
- Measurable reduction in average completion length/tokens.
- No meaningful regression in correctness/grounding.
- Report includes token savings vs baseline.
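One possible way to produce the savings figure for the report is to compare mean completion tokens between a baseline run and a run with guidance enabled. This is a minimal sketch that assumes per-completion token counts are already collected as lists of integers. `token_savings` is a hypothetical helper, and correctness/grounding regression would need a separate evaluation.

```python
from statistics import mean
from typing import Dict, Sequence


def token_savings(baseline_tokens: Sequence[int],
                  candidate_tokens: Sequence[int]) -> Dict[str, float]:
    """Compare mean completion tokens for a baseline run and a guided run."""
    if not baseline_tokens or not candidate_tokens:
        raise ValueError("both runs need at least one completion")
    base = mean(baseline_tokens)
    cand = mean(candidate_tokens)
    if base == 0:
        raise ValueError("baseline mean is zero; reduction is undefined")
    return {
        "baseline_mean": float(base),
        "candidate_mean": float(cand),
        # Positive values mean the guided run used fewer tokens on average.
        "reduction_pct": 100.0 * (base - cand) / base,
    }
```

Both runs should use the same prompts so that the difference reflects the guidance and not a change in workload.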