feat(web_fetch): partial content, Readability.js extraction, split size limits (#86)
Conversation
…it size limits — Replace max_bytes with max_fetch_bytes (512KB) and max_output_bytes (64KB). Oversized responses now return partial content instead of erroring. Integrate Mozilla Readability.js via the readability-js crate for article extraction, with htmd fallback.
…y output limit semantics — Ensure content validation scans the full output body by normalizing max_scan_bytes to max_output_bytes when underspecified. Clarify that max_output_bytes applies to body content only (warnings excluded). Fix the content_length docs to reflect null when the header is absent.
No actionable comments were generated in the recent review.
Walkthrough
The web_fetch tool gets a major overhaul. Fetch and output limits are separated, and the previous error response for oversized content is replaced with a partial-content response. Readability.js extracts the HTML body text, and the extraction method and truncation flags are added to the result details.
Changes
web_fetch fetch/output truncation and improved body extraction
Sequence Diagram

```mermaid
sequenceDiagram
    participant API Caller
    participant WebFetch as web_fetch::execute
    participant HTTPClient as HTTP Client
    participant BodyReader as Streaming Reader
    participant HTMLProc as HTML Processing
    participant Output as Output Truncate
    participant Validate as Content Validate
    participant Result as Tool Result
    API Caller->>WebFetch: URL + max_output_bytes
    WebFetch->>HTTPClient: GET request
    HTTPClient-->>WebFetch: response stream
    WebFetch->>BodyReader: read until max_fetch_bytes
    alt Exceeded max_fetch_bytes
        BodyReader-->>WebFetch: partial body + response_truncated=true
    else Within limit
        BodyReader-->>WebFetch: complete body
    end
    WebFetch->>WebFetch: UTF-8 boundary normalization
    WebFetch->>HTMLProc: body + content_type + url
    HTMLProc-->>WebFetch: {text, extraction_method}
    WebFetch->>Output: text + max_output_bytes
    alt Exceeded max_output_bytes
        Output-->>WebFetch: truncated + output_truncated=true
    else Within limit
        Output-->>WebFetch: complete text
    end
    WebFetch->>Validate: final text
    alt Partial flags set
        Validate-->>WebFetch: add PARTIAL_CONTENT_WARNING
    end
    WebFetch->>Result: {text, extraction, content_length,<br/>fetched_bytes, response_truncated,<br/>output_truncated, max_fetch_bytes, max_output_bytes}
    Result-->>API Caller: success_with_details
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: 5 passed.
Actionable comments posted: 2
Caution: some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/config/loader.rs (1)
117-126: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win — old config keys are silently ignored, which is dangerous.
With the migration from max_bytes to max_fetch_bytes and max_output_bytes, a stale max_bytes left in an old config file is ignored at parse time and the defaults take over, so this kind of migration mistake goes unnoticed. Adding #[serde(deny_unknown_fields)] makes parsing fail fast and is safer. Suggested diff:
```diff
 #[derive(Debug, Deserialize, Default)]
+#[serde(deny_unknown_fields)]
 struct FileWebFetchConfig {
     allowed_schemes: Option<Vec<String>>,
     timeout_secs: Option<u64>,
```
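For illustration, a hedged sketch of the resulting fail-fast behavior, assuming the config is parsed with the toml crate and that the struct carries the new limit fields (field names taken from this PR's description; the real struct may differ):

```rust
use serde::Deserialize;

#[derive(Debug, Deserialize, Default)]
#[serde(deny_unknown_fields)]
struct FileWebFetchConfig {
    allowed_schemes: Option<Vec<String>>,
    timeout_secs: Option<u64>,
    max_fetch_bytes: Option<u64>,
    max_output_bytes: Option<u64>,
}

fn main() {
    // A stale pre-migration key now fails fast instead of being
    // silently dropped in favor of the defaults.
    let stale = "max_bytes = 65536";
    let err = toml::from_str::<FileWebFetchConfig>(stale).unwrap_err();
    println!("parse error: {err}"); // the error names the unknown field `max_bytes`
}
```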
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/tools/web_fetch/html_processing.rs`:
- Around lines 66-70: The is_html_content function uses a case-sensitive contains
check; update it to perform a case-insensitive check by normalizing the header
string (e.g., call to_lowercase or equivalent on the content_type before
checking contains "text/html") so values like "Text/HTML; charset=UTF-8" are
detected as HTML; ensure you still return true for None and only change the
matching logic inside is_html_content.
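
A minimal sketch of the suggested fix, assuming the function takes an Option<&str> and treats an absent header as HTML (the actual signature in html_processing.rs may differ):

```rust
// Case-insensitive Content-Type check; None still counts as HTML so the
// existing default behavior is preserved.
fn is_html_content(content_type: Option<&str>) -> bool {
    match content_type {
        None => true,
        // Normalize case so headers like "Text/HTML; charset=UTF-8" match.
        Some(ct) => ct.to_lowercase().contains("text/html"),
    }
}

fn main() {
    assert!(is_html_content(None));
    assert!(is_html_content(Some("Text/HTML; charset=UTF-8")));
    assert!(!is_html_content(Some("application/json")));
}
```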
In `@src/tools/web_fetch/mod.rs`:
- Around lines 271-278: The current truncate_to_utf8_boundary pops bytes until
the whole buffer is valid UTF-8, which can remove valid data; change it to only
drop an incomplete trailing UTF-8 sequence. In truncate_to_utf8_boundary,
inspect at most the last 4 bytes: scan backward to find the nearest potential
UTF-8 leading byte (by testing byte patterns: 0xxxxxxx, 110xxxxx, 1110xxxx,
11110xxx) and count how many bytes the sequence should have, then if the buffer
ends with fewer bytes than that expected length, truncate those trailing
continuation bytes (bytes with pattern 10xxxxxx) only; otherwise leave the
buffer untouched so interior invalid bytes remain for later UTF-8 error
handling.
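
A minimal sketch of the boundary-only truncation described above, assuming the buffer is a Vec<u8> (the real function in mod.rs may differ in signature):

```rust
// Only an incomplete trailing UTF-8 sequence is dropped; interior
// invalid bytes are left for later UTF-8 error handling.
fn truncate_to_utf8_boundary(buf: &mut Vec<u8>) {
    let len = buf.len();
    // A UTF-8 sequence is at most 4 bytes, so only the tail matters.
    for back in 1..=len.min(4) {
        let b = buf[len - back];
        // Continuation bytes (10xxxxxx) cannot start a sequence; keep scanning.
        if b & 0b1100_0000 == 0b1000_0000 {
            continue;
        }
        // Expected length of the sequence starting at this leading byte.
        let expected = match b {
            b if b & 0b1000_0000 == 0b0000_0000 => 1, // 0xxxxxxx
            b if b & 0b1110_0000 == 0b1100_0000 => 2, // 110xxxxx
            b if b & 0b1111_0000 == 0b1110_0000 => 3, // 1110xxxx
            b if b & 0b1111_1000 == 0b1111_0000 => 4, // 11110xxx
            _ => return, // invalid leading byte: leave for error handling
        };
        // Drop the tail only if the final sequence was cut short.
        if expected > back {
            buf.truncate(len - back);
        }
        return;
    }
}

fn main() {
    // "é" is 0xC3 0xA9; cutting after 0xC3 leaves an incomplete sequence.
    let mut cut = b"caf\xC3".to_vec();
    truncate_to_utf8_boundary(&mut cut);
    assert_eq!(cut, b"caf");
    // A complete buffer is left untouched.
    let mut ok = "café".as_bytes().to_vec();
    truncate_to_utf8_boundary(&mut ok);
    assert_eq!(ok, "café".as_bytes());
}
```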
---
Outside diff comments:
In `@src/config/loader.rs`:
- Around lines 117-126: The FileWebFetchConfig struct silently ignores
unknown/old keys like max_bytes; add serde strict deserialization by annotating
the struct (FileWebFetchConfig) with #[serde(deny_unknown_fields)] so any
unknown fields in incoming config cause a parse error (fail-fast) and force
migration; apply same deny_unknown_fields to related config structs (e.g.,
FileWebFetchContentValidationConfig) if they also must reject unknown keys.
⛔ Files ignored due to path filters (1): Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
- Cargo.toml
- docs/config.md
- docs/tools.md
- src/config/loader.rs
- src/config/persist.rs
- src/config/web_fetch.rs
- src/tools/web_fetch/html_processing.rs
- src/tools/web_fetch/mod.rs
…ncation — Make is_html_content case-insensitive for Content-Type headers like 'Text/HTML'. Rewrite truncate_to_utf8_boundary to drop only the incomplete trailing UTF-8 sequence instead of rescanning the entire buffer.
Overview
Bring the web_fetch tool up to a level that is practical on modern web pages. Plan: docs/plan-web-fetch-partial-readability.md
Changes
Partial content
- Exceeding max_fetch_bytes no longer returns an error; the fetched range is returned
- response_truncated / output_truncated are included in details, and the call still returns success
- max_bytes split into max_fetch_bytes / max_output_bytes
- max_fetch_bytes (default 512KB): network fetch cap (config only)
- max_output_bytes (default 64KB): maximum body bytes (overridable via tool parameter, clamped to the config cap; see the sketch after this list)
- max_bytes is removed entirely
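A hedged sketch of the clamp described above; the function and parameter names are illustrative, not the tool's actual API:

```rust
// `requested` is the per-call tool parameter, `config_max` the config
// ceiling; requests above the cap are clamped rather than rejected.
fn effective_output_limit(requested: Option<usize>, config_max: usize) -> usize {
    requested.map_or(config_max, |r| r.min(config_max))
}

fn main() {
    assert_eq!(effective_output_limit(None, 65_536), 65_536);
    assert_eq!(effective_output_limit(Some(16_384), 65_536), 16_384);
    assert_eq!(effective_output_limit(Some(1_000_000), 65_536), 65_536);
}
```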
readability-js = "0.1.5"クレートを導入Readability::parse_with_url()で本文抽出 →htmdで Markdown 化htmd直接変換にフォールバックstd::panic::catch_unwindで panic 安全対策セキュリティ不変条件
Security invariants
- normalize() guarantees content_validation.max_scan_bytes >= max_output_bytes (sketched below)
- Even when max_output_bytes is raised, the entire body remains subject to injection scanning
- Clarified the definition of max_output_bytes
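A minimal sketch of the invariant, with field names taken from this description (the real layout in src/config/web_fetch.rs may differ):

```rust
struct WebFetchConfig {
    max_output_bytes: usize,
    max_scan_bytes: Option<usize>, // content_validation.max_scan_bytes
}

impl WebFetchConfig {
    // Ensure the injection scan covers at least the whole output body.
    fn normalize(&mut self) {
        let floor = self.max_output_bytes;
        self.max_scan_bytes = Some(match self.max_scan_bytes {
            Some(n) if n >= floor => n,
            _ => floor, // underspecified or too small: raise to the floor
        });
    }
}

fn main() {
    let mut cfg = WebFetchConfig { max_output_bytes: 65_536, max_scan_bytes: None };
    cfg.normalize();
    assert_eq!(cfg.max_scan_bytes, Some(65_536));
}
```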
{ "final_url": "...", "content_type": "...", "content_length": 181374, "fetched_bytes": 524288, "response_truncated": true, "output_truncated": true, "max_fetch_bytes": 524288, "max_output_bytes": 65536, "extraction": "readability-js" }変更ファイル(9ファイル、+603 / -121)
- Cargo.toml — readability-js dependency added
- src/config/web_fetch.rs
- src/config/loader.rs
- src/config/persist.rs
- src/tools/web_fetch/html_processing.rs
- src/tools/web_fetch/mod.rs
- docs/config.md
- docs/tools.md

Verification
- cargo fmt --check ✓
- cargo check ✓
- cargo test — 1011 passed ✓
- cargo clippy -D warnings ✓

Summary by CodeRabbit
New Features
Documentation
Chores
Tests