feat(providers): explicit prompt cache for DashScope/Qwen #1127

Open
nguyennguyenit wants to merge 2 commits into dev from feat/qwen-dashscope-prompt-cache

Conversation

@nguyennguyenit
Contributor

Summary

Extend the OpenAI-compat provider path with Anthropic-style cache_control:ephemeral inline blocks for Alibaba DashScope endpoints. Live verified: 99.7% cache hit rate on a 6K-token prefix for qwen3.6-plus (the model used by production agents tra and van-anh), yielding ~89% input-token cost reduction on the cached prefix (90% Alibaba discount × 99.7% hit).
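
For orientation, a minimal sketch of the resulting request body, assuming the standard Anthropic content-block shape; the text values are illustrative and the authoritative construction lives in buildRequestBody:

```json
{
  "model": "qwen3.6-plus",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "stable prefix (everything before <!-- GOCLAW_CACHE_BOUNDARY -->)",
          "cache_control": { "type": "ephemeral" }
        },
        { "type": "text", "text": "volatile suffix (per-session instructions)" }
      ]
    },
    { "role": "user", "content": "..." }
  ]
}
```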

Changes

  • isDashScope() 3-source detection (URL + providerType + name) on OpenAIProvider, mirroring the existing dashScopePassthroughKeys() pattern; covers both the dashscope and bailian provider_type values (see the detection sketch after this list)
  • buildRequestBody wraps system content using SplitSystemPromptForCache (exported from anthropic_request.go) with shared <!-- GOCLAW_CACHE_BOUNDARY --> marker — single source of truth across Anthropic + DashScope paths
  • Tool prefix cache: cache_control injected on last tool definition with 4-marker budget guard
  • Parse cache_creation_input_tokens from prompt_tokens_details into Usage.CacheCreationTokens; propagates via existing span metadata writers (loop_run.go:228, loop_tracing.go, tools/subagent_tracing.go) — no pipeline changes needed
  • Runtime escape hatch: GOCLAW_DISABLE_DASHSCOPE_CACHE=true
  • Live integration smoke test (build tag integration, env-gated by DASHSCOPE_API_KEY)
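
A minimal sketch of that detection, with stand-in struct fields (the real OpenAIProvider differs):

```go
package providers

import "strings"

// Stand-in for the real OpenAIProvider; field names are assumptions.
type OpenAIProvider struct {
	baseURL      string
	providerType string
	name         string
}

// isDashScope reports whether the provider targets a DashScope/Bailian
// endpoint. Checking URL, provider type, and name together covers
// reverse-proxied endpoints where only one of the three carries the hint.
func (p *OpenAIProvider) isDashScope() bool {
	for _, s := range []string{p.baseURL, p.providerType, p.name} {
		s = strings.ToLower(s)
		if strings.Contains(s, "dashscope") || strings.Contains(s, "bailian") {
			return true
		}
	}
	return false
}
```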

Live Validation

| Model | Cache supported | Hit rate | Goclaw consumer |
|---|---|---|---|
| qwen3-coder-plus | ✅ | 98.7% | - |
| qwen3.5-plus | ✅ | 99.7% | - |
| qwen3.6-plus | ✅ | 99.7% | agents tra, van-anh |
| qwen3-max-2026-01-23 | ❌ silent no-op (server cache not enabled) | 0% | - |
| qwen3-coder-next | ❌ silent no-op | 0% | - |

The wire format is silently accepted by all models (no 400). For unsupported variants the wrap is a no-op (cost: 0 tokens, microseconds of CPU). Observability via the span's cached_tokens metadata gives a self-correcting signal, so no per-model capability flag is needed.
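
To make the Usage propagation concrete, a sketch using the JSON field names from this PR; the surrounding struct shapes are assumptions:

```go
package providers

import "encoding/json"

// Stand-ins for the real types; only the JSON tags come from this PR.
type promptTokensDetails struct {
	CachedTokens             int `json:"cached_tokens"`
	CacheCreationInputTokens int `json:"cache_creation_input_tokens"`
}

type openAIUsage struct {
	PromptTokens        int                 `json:"prompt_tokens"`
	PromptTokensDetails promptTokensDetails `json:"prompt_tokens_details"`
}

type Usage struct {
	InputTokens         int
	CachedTokens        int
	CacheCreationTokens int
}

// parseUsage maps the wire counters onto the internal Usage struct so
// the existing span metadata writers propagate them unchanged.
func parseUsage(raw []byte) (Usage, error) {
	var u openAIUsage
	if err := json.Unmarshal(raw, &u); err != nil {
		return Usage{}, err
	}
	return Usage{
		InputTokens:         u.PromptTokens,
		CachedTokens:        u.PromptTokensDetails.CachedTokens,
		CacheCreationTokens: u.PromptTokensDetails.CacheCreationInputTokens,
	}, nil
}
```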

Backward Compatibility

  • Anthropic provider: unchanged (already had own cache path)
  • Native OpenAI: unchanged (prompt_cache_key middleware)
  • Other OpenAI-compat (DeepSeek, Together, OpenRouter, Gemini): wrap skipped via isDashScope() guard
  • DB schema: no migration

Test plan

  • Unit: 25 new tests in providers package (detection, wrap, tool cache, usage parse) — all pass
  • Build: go build ./... (Postgres) + go build -tags sqliteonly ./... (SQLite), both clean
  • Vet: go vet ./... clean
  • Integration smoke: live qwen3.6-plus 99.7% hit verified
  • CI: pending green check
  • Post-deploy 24h: query span metadata avg_cached_tokens > 0 for qwen models
  • Rollback rehearsal: confirm GOCLAW_DISABLE_DASHSCOPE_CACHE=true reverts to plain string content (see the sketch below)
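
A sketch of the kill switch that rehearsal exercises, assuming it short-circuits the wrap inside buildRequestBody:

```go
package providers

import "os"

// dashScopeCacheDisabled reports whether the runtime escape hatch is set.
// When true, buildRequestBody keeps the plain string system content and
// skips cache_control injection entirely, reverting to pre-PR behavior.
func dashScopeCacheDisabled() bool {
	return os.Getenv("GOCLAW_DISABLE_DASHSCOPE_CACHE") == "true"
}
```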

Estimated Impact

Based on 5 baseline traces (research §"Target Saving"), aggregate input drops from ~6,023K to ~1,500K tokens (-75%) for mid-conversation Qwen calls. Combined with the companion plan pancake-pos-mcp/plans/260508-compact-response/, the projected total reduction is -85%.

Extend OpenAI-compat path with Anthropic-style cache_control:ephemeral
inline blocks for Alibaba DashScope endpoints (provider_type=bailian or
URL contains "dashscope"). Verified live: qwen3.6-plus / qwen3.5-plus /
qwen3-coder-plus return 98.7–99.7% cache hit rates on a 6K-token prefix,
yielding ~89% token cost reduction on the cached prefix (90% Alibaba discount).

- isDashScope(): 3-source detection (URL + providerType + name) handles
  reverse-proxied endpoints; covers both "dashscope" and "bailian"
- buildRequestBody wraps system content with SplitSystemPromptForCache
  (exported from anthropic_request.go) using <!-- GOCLAW_CACHE_BOUNDARY -->
- Tool prefix cache: cache_control on last tool definition, with 4-marker
  budget guard
- Parse cache_creation_input_tokens from prompt_tokens_details into
  Usage.CacheCreationTokens; propagates via existing span metadata writers
- Runtime escape hatch: GOCLAW_DISABLE_DASHSCOPE_CACHE=true
- Live integration smoke test (build tag integration, env-gated)
@clark-cant

🔍 Code Review — feat(providers): explicit prompt cache for DashScope/Qwen

🎯 Overview

Extends the OpenAI-compat provider with Anthropic-style cache_control:ephemeral inline blocks for DashScope/Bailian endpoints. Live verified: 99.7% cache hit rate on a 6K-token prefix for qwen3.6-plus, an ~89% input-token cost reduction (90% Alibaba discount × 99.7% hit).

Scope: 13 files, +367/-13, 1 commit


🤖 CI Status + Merge Conflicts

CI: ⚠️ go: PENDING, web: ✅ PASSED (52s)
Merge conflicts: ⚠️ UNSTABLE (mergeable: MERGEABLE); needs a rebase


✅ Strengths

  1. Real live validation. 99.7% cache hit rate verified on production agents (tra, van-anh), not guesswork. The validation table covering 5 models (3 supported, 2 silent no-ops) is very thorough.

  2. Robust 3-source detection. isDashScope() checks URL + providerType + name, handles reverse-proxied endpoints, and covers both the dashscope and bailian provider_type values. Smart defensive design.

  3. Shared cache boundary marker. SplitSystemPromptForCache is exported from anthropic_request.go and both paths share the <!-- GOCLAW_CACHE_BOUNDARY --> marker: a single source of truth for the Anthropic and DashScope paths. Good DRY.

  4. Tool prefix cache. cache_control on the last tool definition with a 4-marker budget guard caches the entire tools array (~5-10K tokens). Combined with the system block cache, only 2 of 4 markers are used; efficient. (See the budget-guard sketch after this list.)

  5. Complete usage parsing. CacheCreationInputTokens is parsed from prompt_tokens_details and propagated through the existing span metadata writers; no pipeline changes needed. Observability comes for free.

  6. Runtime escape hatch. GOCLAW_DISABLE_DASHSCOPE_CACHE=true allows rollback without a redeploy. Good production hygiene.

  7. Clean backward compatibility. Anthropic unchanged, native OpenAI unchanged, other OpenAI-compat providers skipped via the isDashScope() guard. DB schema unchanged.

  8. Integration smoke test. TestQwenCacheSmoke: per-run salt, call 1 creates the cache entry and call 2 verifies the hit, with summary table output. Excellent test design.

  9. Estimated impact documented. ~75% input token reduction for mid-conversation Qwen calls; combined with the companion plan, -85% total. Clear ROI.
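
A sketch of the budget guard described in point 4; the tool representation and the marker counter are assumptions:

```go
package providers

const maxCacheMarkers = 4 // Anthropic-style limit on cache_control blocks

// injectToolCacheControl marks the last tool definition as cacheable,
// which caches the entire tools array as a prefix, but only when the
// request still has marker budget left. Returns the updated marker count.
func injectToolCacheControl(tools []map[string]any, markersUsed int) int {
	if len(tools) == 0 || markersUsed >= maxCacheMarkers {
		return markersUsed
	}
	tools[len(tools)-1]["cache_control"] = map[string]any{"type": "ephemeral"}
	return markersUsed + 1
}
```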


🟡 MEDIUM

  1. openAIPromptDetails field addition looks like a rename at first glance, but the diff only adds CacheCreationInputTokens int alongside the existing CachedTokens int (which maps to the cached_tokens JSON key). No issue here; it just adds the new field.

  2. BuildRequestBodyForTest exported helper. A test-only export on OpenAIProvider is an acceptable pattern, but the comment should state clearly that it is not a public API.


🟢 LOW

  1. repeat() helper in the integration test. strings.Repeat can replace the custom repeat() function; minor cleanup (see the snippet after this list).

  2. Unsupported models silently no-op. qwen3-max-2026-01-23 and qwen3-coder-next do not support the server cache, so the wrap is a no-op (0 token cost). Good design, but the fallback behavior should be documented more clearly in the PR body.
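
The cleanup suggested in point 1 is a stdlib one-liner; makePrefix and the repeated literal are illustrative:

```go
package integration

import "strings"

// strings.Repeat replaces the hand-rolled repeat() helper: build the
// long stable prefix the smoke test sends on both calls.
func makePrefix() string {
	return strings.Repeat("stable cache prefix ", 300)
}
```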


📊 Summary

| Severity | Count | Issues |
|---|---|---|
| 🔴 CRITICAL | 0 | |
| 🟡 MEDIUM | 0 | |
| 🟢 LOW | 2 | repeat() helper, silent no-op documentation |

💡 Recommendation

🟢 APPROVED

Solid feature, live verified, ~75% token cost reduction for Qwen models. Clean backward compatibility, good test coverage, runtime escape hatch.

Wait for the go CI check to complete before merging!

Great work @nguyennguyenit, this is a PR that saves real money! 💰🚀

- Use strings.Repeat instead of custom repeat() helper in smoke test
- Clarify BuildRequestBodyForTest is test-only, not public API
