Improve token estimation for comparison prompts #22

@ArjenSchwarz

Description

Context

From code review of #21 (multi-spec comparison feature).

Current Behavior

The estimatePromptTokens function in internal/comparison/prompt.go:87-89 uses a simple heuristic of 4 characters per token:

func estimatePromptTokens(prompt string) int {
    return len(prompt) / 4
}

Issue

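This is only a rough estimate. Claude's tokenizer typically uses ~3.5-4 chars/token for English prose, but code diffs may tokenize less efficiently (more symbols, varied line lengths), in which case dividing by 4 underestimates the real token count. Note also that Go's len returns the byte length of the string rather than the character count, so the heuristic is really 4 bytes per token.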
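To make the risk concrete, here is a small sketch comparing the estimate at ratios of 4, 3.5, and 3 bytes per token. The 480,000-byte prompt size and the helper name estimateAt are illustrative assumptions, not values or code from the repository; only the 150k limit and the ratios come from this issue.

```go
package main

import (
	"fmt"
	"math"
)

// estimateAt is a hypothetical helper (not part of the codebase): it
// estimates the token count for a prompt of n bytes at the given
// bytes-per-token ratio, rounding up.
func estimateAt(n int, bytesPerToken float64) int {
	return int(math.Ceil(float64(n) / bytesPerToken))
}

func main() {
	const promptBytes = 480_000 // illustrative size, not a measurement
	const limit = 150_000       // token limit mentioned in this issue

	for _, ratio := range []float64{4, 3.5, 3} {
		est := estimateAt(promptBytes, ratio)
		fmt.Printf("ratio %.1f: ~%d tokens, over limit: %t\n", ratio, est, est > limit)
	}
}
```

With these assumed numbers, the current divide-by-4 heuristic reports 120,000 tokens, comfortably under the limit, while a ratio of 3 gives 160,000. If the real ratio for diff-heavy prompts is closer to 3, a prompt the check accepts could still exceed the limit.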

Suggested Improvement

Options:

  1. Use a more conservative estimate (e.g., len(prompt) / 3) for a safety margin near the 150k token limit
  2. Use a proper tokenizer library for accurate estimates
  3. Add different heuristics for code vs prose content
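As a rough illustration of how options 1 and 3 might combine, the sketch below estimates line by line and applies the more conservative ratio to lines that look like unified-diff content. Everything here is an assumption for discussion: the function names estimatePromptTokensMixed and looksLikeDiffLine are hypothetical, the prefix-based classifier is deliberately crude, and the ratios of 4 and 3 have not been validated against a real tokenizer.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// Assumed ratios; neither has been measured against a real tokenizer.
const (
	proseBytesPerToken = 4.0 // the existing heuristic
	codeBytesPerToken  = 3.0 // more conservative guess for diff/code lines
)

// looksLikeDiffLine is a hypothetical classifier: it treats unified-diff
// markers at the start of a line as a sign that the line is code.
func looksLikeDiffLine(line string) bool {
	for _, prefix := range []string{"+", "-", "@@", "diff "} {
		if strings.HasPrefix(line, prefix) {
			return true
		}
	}
	return false
}

// estimatePromptTokensMixed estimates tokens line by line, using the
// conservative ratio for diff-like lines, and rounds the total up.
func estimatePromptTokensMixed(prompt string) int {
	var total float64
	for _, line := range strings.Split(prompt, "\n") {
		// +1 approximates the newline removed by Split; this overcounts
		// by one byte on the final line, which errs on the safe side.
		n := float64(len(line) + 1)
		if looksLikeDiffLine(line) {
			total += n / codeBytesPerToken
		} else {
			total += n / proseBytesPerToken
		}
	}
	return int(math.Ceil(total))
}

func main() {
	prompt := "Compare these specs.\n@@ -1,3 +1,4 @@\n-old line\n+new line"
	fmt.Println("mixed estimate:", estimatePromptTokensMixed(prompt))
	fmt.Println("current estimate:", len(prompt)/4)
}
```

One side effect of the prefix check is that prose lines beginning with "-", such as Markdown bullets, are counted at the code ratio. That only makes the estimate more conservative, which seems acceptable for a limit check, but a tokenizer library (option 2) would avoid the guesswork entirely.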

Priority

Low. The current implementation works, but it could underestimate the token count and fail on edge cases near the context limit.
