
feat(web_fetch): partial content, Readability.js extraction, split size limits#86

Merged
endo-ly merged 4 commits into main from feat/web-fetch-partial-readability on May 16, 2026
endo-ly (Owner) commented May 16, 2026

Overview

Bring the web_fetch tool up to a level where it is practical on modern web pages.

Plan: docs/plan-web-fetch-partial-readability.md

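Changes

Partial Content (partial retrieval)

  • If a response exceeds max_fetch_bytes, return the portion fetched so far instead of raising an error
  • Stream cutoff that safely preserves UTF-8 boundaries
  • Return success with response_truncated / output_truncated included in details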
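
The capped read described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from this PR: the helper name `read_capped` and its signature are assumptions, and the real implementation streams an HTTP response rather than a generic reader.

```rust
use std::io::{self, Read};

/// Hypothetical helper: reads at most `max_fetch_bytes` from `reader`.
/// Returns the bytes read and a flag that plays the role of `response_truncated`.
fn read_capped<R: Read>(reader: R, max_fetch_bytes: usize) -> io::Result<(Vec<u8>, bool)> {
    let mut body = Vec::new();
    // Read one byte past the cap so "exactly at the limit" can be told
    // apart from "over the limit".
    let limit = (max_fetch_bytes as u64).saturating_add(1);
    let mut limited = reader.take(limit);
    limited.read_to_end(&mut body)?;

    let truncated = body.len() > max_fetch_bytes;
    // Keep only the allowed prefix; the caller still has to trim a
    // possibly split UTF-8 sequence at the end.
    body.truncate(max_fetch_bytes);
    Ok((body, truncated))
}
```

An oversized source yields `Ok` with the prefix and `truncated == true` rather than an error, which is the behavior change the bullets describe.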

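Split max_bytes → max_fetch_bytes / max_output_bytes

  • max_fetch_bytes (default 512KB): upper limit for the network fetch (config only)
  • max_output_bytes (default 64KB): maximum size of the body text in bytes (can be overridden via tool parameter; clamped to the config limit)
  • No backward-compatible alias (max_bytes is removed entirely)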
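
One way to picture the override-and-clamp rule for max_output_bytes is the sketch below. The constant names and the helper `effective_max_output_bytes` are hypothetical; only the default sizes and the "clamped to the config limit" behavior come from the description above.

```rust
/// Default limits quoted in the PR description (512KB and 64KB).
const DEFAULT_MAX_FETCH_BYTES: usize = 512 * 1024;
const DEFAULT_MAX_OUTPUT_BYTES: usize = 64 * 1024;

/// Hypothetical helper: resolves the output limit for a single tool call.
/// `requested` is the optional tool parameter, `config_max` the configured ceiling.
fn effective_max_output_bytes(requested: Option<usize>, config_max: usize) -> usize {
    match requested {
        // A caller may lower the limit but can never exceed the config ceiling.
        Some(n) => n.min(config_max),
        // Without a parameter, the configured value applies as-is.
        None => config_max,
    }
}
```

How a requested value of 0 should be treated is not stated in the description, so this sketch leaves it unhandled.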

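Main-content extraction with Mozilla Readability.js

  • Add the readability-js = "0.1.5" crate
  • Extract the main content with Readability::parse_with_url(), then convert it to Markdown with htmd
  • On failure, fall back to the existing direct htmd conversion
  • Guard against panics with std::panic::catch_unwind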
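
The "try the primary extractor, fall back on failure or panic" pattern might look like the sketch below. It deliberately does not call the readability-js or htmd crates: both extractors are passed in as closures, and the function name and enum variants are assumptions made for illustration.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical variants; the PR's ExtractionMethod may be shaped differently.
#[derive(Debug, PartialEq)]
enum ExtractionMethod {
    Readability,
    HtmlToMarkdown,
}

/// Runs `primary` under catch_unwind and uses `fallback` when it fails,
/// panics, or returns only whitespace.
fn extract_with_fallback<P, F>(html: &str, primary: P, fallback: F) -> (String, ExtractionMethod)
where
    P: Fn(&str) -> Result<String, String>,
    F: Fn(&str) -> String,
{
    // A panic inside the primary extractor becomes an Err instead of
    // unwinding through the tool call.
    let attempt = panic::catch_unwind(AssertUnwindSafe(|| primary(html)));
    match attempt {
        Ok(Ok(text)) if !text.trim().is_empty() => (text, ExtractionMethod::Readability),
        _ => (fallback(html), ExtractionMethod::HtmlToMarkdown),
    }
}
```

Note that `catch_unwind` only helps when the binary is built with unwinding panics; with `panic = "abort"` the process still terminates.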

Security invariants

  • normalize() guarantees content_validation.max_scan_bytes >= max_output_bytes
  • Even when max_output_bytes is raised, the entire body remains subject to the injection scan
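
The invariant can be illustrated with a simplified stand-in for the config types. The struct names `WebFetchLimits` and `ContentValidation` are hypothetical and carry only the fields needed here; the zero-value fill mirrors the normalize behavior mentioned in the review walkthrough.

```rust
/// Simplified, hypothetical stand-in for the content validation settings.
#[derive(Debug, Clone, PartialEq)]
struct ContentValidation {
    max_scan_bytes: usize,
}

/// Simplified, hypothetical stand-in for the web_fetch config.
#[derive(Debug, Clone, PartialEq)]
struct WebFetchLimits {
    max_fetch_bytes: usize,
    max_output_bytes: usize,
    content_validation: ContentValidation,
}

impl WebFetchLimits {
    fn normalize(&mut self) {
        // Treat 0 as "unset" and fill in the defaults.
        if self.max_fetch_bytes == 0 {
            self.max_fetch_bytes = 512 * 1024;
        }
        if self.max_output_bytes == 0 {
            self.max_output_bytes = 64 * 1024;
        }
        // Raise the scan window so it always covers the whole output body.
        let scan = &mut self.content_validation.max_scan_bytes;
        *scan = (*scan).max(self.max_output_bytes);
    }
}
```

After `normalize()`, a config that sets max_output_bytes to 128KB with a 64KB scan window ends up with a 128KB scan window.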

Clarified definition of max_output_bytes

  • Maximum size of the body text in bytes (the warning is safety metadata that does not count against the limit)
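
A rough sketch of "the limit applies to the body, the warning sits outside it" follows. The constant name PARTIAL_CONTENT_WARNING appears in the review walkthrough, but its text here, the helper names, and the placement of the warning after the body are all assumptions.

```rust
/// Placeholder text; the actual warning wording is not shown in this PR page.
const PARTIAL_CONTENT_WARNING: &str = "[warning: content is partial]";

/// Cuts `text` to at most `max_output_bytes` without splitting a character.
/// Returns the kept slice and a flag that plays the role of `output_truncated`.
fn truncate_output(text: &str, max_output_bytes: usize) -> (&str, bool) {
    if text.len() <= max_output_bytes {
        return (text, false);
    }
    let mut end = max_output_bytes;
    // Step back until `end` lands on a character boundary (index 0 always is one).
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    (&text[..end], true)
}

/// Appends the warning after truncation, so it does not count against the limit.
fn render(text: &str, max_output_bytes: usize) -> String {
    let (body, truncated) = truncate_output(text, max_output_bytes);
    if truncated {
        format!("{}\n\n{}", body, PARTIAL_CONTENT_WARNING)
    } else {
        body.to_string()
    }
}
```

With this shape the body never exceeds max_output_bytes, while the final string may be slightly longer because of the appended warning.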

details schema

{
  "final_url": "...",
  "content_type": "...",
  "content_length": 181374,
  "fetched_bytes": 524288,
  "response_truncated": true,
  "output_truncated": true,
  "max_fetch_bytes": 524288,
  "max_output_bytes": 65536,
  "extraction": "readability-js"
}

Changed files (9 files, +603 / -121)

| File | Change |
| --- | --- |
| Cargo.toml | Add readability-js dependency |
| src/config/web_fetch.rs | Field split + normalization + tests |
| src/config/loader.rs | Loader path support |
| src/config/persist.rs | Serialization support |
| src/tools/web_fetch/html_processing.rs | Readability integration + ProcessedBody/ExtractionMethod |
| src/tools/web_fetch/mod.rs | Partial fetch + output truncation + new details |
| docs/config.md | Config spec update |
| docs/tools.md | Tool spec update |

Verification

  • cargo fmt --check
  • cargo check
  • cargo test — 1011 passed ✓
  • cargo clippy -D warnings

Summary by CodeRabbit

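  • New features

    • Improved size control for web fetch by separating the fetch limit (max_fetch_bytes) from the output limit (max_output_bytes)
    • When the size limit is exceeded, the portion already fetched is returned (treated as a partial fetch) instead of an error
    • HTML main-content extraction now tries Readability.js first, and extraction metadata (extraction method, byte counts, truncation flags, etc.) is added to the result
  • Documentation

    • Updated the web_fetch spec and config examples to the new spec (added output/fetch limits and details fields)
  • Chores

    • Added an HTML processing library as a dependency
  • Tests

    • Added and updated tests for fetching, extraction, and truncation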

endo-ly added 2 commits May 15, 2026 19:09
…it size limits

Replace max_bytes with max_fetch_bytes (512KB) and max_output_bytes
(64KB). Oversized responses now return partial content instead of
erroring. Integrate Mozilla Readability.js via readability-js crate
for article extraction with htmd fallback.
…y output limit semantics

Ensure content validation scans the full output body by normalizing
max_scan_bytes to max_output_bytes when underspecified. Clarify that
max_output_bytes applies to body content only (warnings excluded).
Fix content_length docs to reflect null on absent header.
coderabbitai Bot commented May 16, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 84693456-1690-40a5-813d-bb8609ab812a

📥 Commits

Reviewing files that changed from the base of the PR and between 53534aa and eaa616e.

📒 Files selected for processing (3)
  • deny.toml
  • src/tools/web_fetch/html_processing.rs
  • src/tools/web_fetch/mod.rs
✅ Files skipped from review due to trivial changes (1)
  • deny.toml
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/tools/web_fetch/html_processing.rs
  • src/tools/web_fetch/mod.rs

Walkthrough

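The web_fetch tool is substantially overhauled. The fetch and output limits are separated, and the previous error return is replaced with a partial-content return. HTML main content is extracted with Readability.js, and the extraction method and truncation flags are added to the result details.

Changes

web_fetch fetch/output truncation and improved main-content extraction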

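| Layer / File(s) | Summary |
| --- | --- |
| Dependencies and type/constant updates (Cargo.toml, src/config/web_fetch.rs) | Adds readability-js. Splits max_bytes in WebFetchConfig into max_fetch_bytes (fetch limit) and max_output_bytes (output limit), with defaults of 512KB/64KB. |
| Config loading (src/config/loader.rs) | Adds the new fields to FileWebFetchConfig and updates normalize_web_fetch to assign the new defaults. |
| Config persistence/serialization (src/config/persist.rs) | Adds max_fetch_bytes/max_output_bytes to SerializableWebFetchConfig and introduces a skip-if-default check. Updates the Config→Serializable conversion logic. |
| WebFetchConfig implementation and tests (src/config/web_fetch.rs) | Adds default constants, zero-value filling in Default/normalize, adjustment of content_validation.max_scan_bytes, and updated deserialization/normalize tests. |
| readability-js integration in HTML processing (src/tools/web_fetch/html_processing.rs) | Adds ProcessedBody and ExtractionMethod. Introduces process_response_body_with_metadata, which tries Readability.js first and falls back to HTML→Markdown on failure or an empty result. Adjustments such as title prepending, plus new tests. |
| web_fetch tool implementation (fetch/output truncation) (src/tools/web_fetch/mod.rs) | Adds max_output_bytes to FetchParams. Replaces execute with the flow: streaming read (config.max_fetch_bytes) → extraction → output truncation (max_output_bytes). Introduces the response_truncated/output_truncated flags and PARTIAL_CONTENT_WARNING, and writes the various metadata to the details JSON. Many tests updated and added. |
| Documentation update (docs/config.md, docs/tools.md) | Updates the web_fetch spec and YAML examples for max_fetch_bytes/max_output_bytes and the extraction/details fields. The previous "error on overflow" description is removed. |
| License allow-list update (deny.toml) | Adds "UPL-1.0" to [licenses].allow. |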

Sequence Diagram

sequenceDiagram
  participant API Caller
  participant WebFetch as web_fetch::execute
  participant HTTPClient as HTTP Client
  participant BodyReader as Streaming Reader
  participant HTMLProc as HTML Processing
  participant Output as Output Truncate
  participant Validate as Content Validate
  participant Result as Tool Result
  API Caller->>WebFetch: URL + max_output_bytes
  WebFetch->>HTTPClient: GET request
  HTTPClient-->>WebFetch: response stream
  WebFetch->>BodyReader: read until max_fetch_bytes
  alt Exceeded max_fetch_bytes
    BodyReader-->>WebFetch: partial body + response_truncated=true
  else Within limit
    BodyReader-->>WebFetch: complete body
  end
  WebFetch->>WebFetch: UTF-8 boundary normalization
  WebFetch->>HTMLProc: body + content_type + url
  HTMLProc-->>WebFetch: {text, extraction_method}
  WebFetch->>Output: text + max_output_bytes
  alt Exceeded max_output_bytes
    Output-->>WebFetch: truncated + output_truncated=true
  else Within limit
    Output-->>WebFetch: complete text
  end
  WebFetch->>Validate: final text
  alt Partial flags set
    Validate-->>WebFetch: add PARTIAL_CONTENT_WARNING
  end
  WebFetch->>Result: {text, extraction, content_length,<br/>fetched_bytes, response_truncated,<br/>output_truncated, max_fetch_bytes, max_output_bytes}
  Result-->>API Caller: success_with_details

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • endo-ly/egopulse#85: Laid the groundwork for the web_fetch tool implementation as the preceding step; this PR is its direct follow-up, adding fetch/output truncation and the readability-js integration.

Poem

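📚✨ The journey of reading articles, now smarter
Readability catches the essence
Overlong responses are now "cut and returned"
Not an error, but success with partial content
The metadata tells the whole story 🎯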

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title concisely expresses the three main changes to the web_fetch tool (partial content support, Readability.js adoption, and the split of size limits) and accurately summarizes the core of the change set. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 90.57% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/config/loader.rs (1)

117-126: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

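Old config keys are silently ignored, which is risky

With the migration from max_bytes to max_fetch_bytes / max_output_bytes, a max_bytes entry left in an old config file is ignored at parse time and the value falls back to the default. The risk is that such a migration mistake goes unnoticed. To fail fast, it would be safer to add #[serde(deny_unknown_fields)].

Suggested diff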
 #[derive(Debug, Deserialize, Default)]
+#[serde(deny_unknown_fields)]
 struct FileWebFetchConfig {
     allowed_schemes: Option<Vec<String>>,
     timeout_secs: Option<u64>,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/config/loader.rs` around lines 117 - 126, The FileWebFetchConfig struct
silently ignores unknown/old keys like max_bytes; add serde strict
deserialization by annotating the struct (FileWebFetchConfig) with
#[serde(deny_unknown_fields)] so any unknown fields in incoming config cause a
parse error (fail-fast) and force migration; apply same deny_unknown_fields to
related config structs (e.g., FileWebFetchContentValidationConfig) if they also
must reject unknown keys.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/tools/web_fetch/html_processing.rs`:
- Around line 66-70: The is_html_content function uses a case-sensitive contains
check; update it to perform a case-insensitive check by normalizing the header
string (e.g., call to_lowercase or equivalent on the content_type before
checking contains "text/html") so values like "Text/HTML; charset=UTF-8" are
detected as HTML; ensure you still return true for None and only change the
matching logic inside is_html_content.

In `@src/tools/web_fetch/mod.rs`:
- Around line 271-278: The current truncate_to_utf8_boundary pops bytes until
the whole buffer is valid UTF-8, which can remove valid data; change it to only
drop an incomplete trailing UTF-8 sequence. In truncate_to_utf8_boundary,
inspect at most the last 4 bytes: scan backward to find the nearest potential
UTF-8 leading byte (by testing byte patterns: 0xxxxxxx, 110xxxxx, 1110xxxx,
11110xxx) and count how many bytes the sequence should have, then if the buffer
ends with fewer bytes than that expected length, truncate those trailing
continuation bytes (bytes with pattern 10xxxxxx) only; otherwise leave the
buffer untouched so interior invalid bytes remain for later UTF-8 error
handling.

---

Outside diff comments:
In `@src/config/loader.rs`:
- Around line 117-126: The FileWebFetchConfig struct silently ignores
unknown/old keys like max_bytes; add serde strict deserialization by annotating
the struct (FileWebFetchConfig) with #[serde(deny_unknown_fields)] so any
unknown fields in incoming config cause a parse error (fail-fast) and force
migration; apply same deny_unknown_fields to related config structs (e.g.,
FileWebFetchContentValidationConfig) if they also must reject unknown keys.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 473453fc-5f45-4780-acf7-e129c8429c22

📥 Commits

Reviewing files that changed from the base of the PR and between b2175e7 and 53534aa.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • Cargo.toml
  • docs/config.md
  • docs/tools.md
  • src/config/loader.rs
  • src/config/persist.rs
  • src/config/web_fetch.rs
  • src/tools/web_fetch/html_processing.rs
  • src/tools/web_fetch/mod.rs

Comment thread src/tools/web_fetch/html_processing.rs
Comment thread src/tools/web_fetch/mod.rs
endo-ly added 2 commits May 16, 2026 03:10
…ncation

Make is_html_content case-insensitive for Content-Type headers like
'Text/HTML'. Rewrite truncate_to_utf8_boundary to only drop the
incomplete trailing UTF-8 sequence instead of rescanning the entire
buffer.
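
The two fixes named in this commit message could be sketched as follows. The function names come from the review comments, but the signatures and bodies are assumptions based on the reviewer's description, not the merged code.

```rust
/// Case-insensitive HTML check; a missing Content-Type is treated as HTML,
/// as the review comment asks to preserve.
fn is_html_content(content_type: Option<&str>) -> bool {
    match content_type {
        None => true,
        Some(ct) => ct.to_ascii_lowercase().contains("text/html"),
    }
}

/// Drops only an incomplete UTF-8 sequence at the very end of `buf`.
/// Interior invalid bytes are left for later UTF-8 error handling.
fn truncate_to_utf8_boundary(buf: &mut Vec<u8>) {
    let len = buf.len();
    // A UTF-8 sequence is at most 4 bytes, so look back at most 4 bytes.
    for back in 1..=len.min(4) {
        let idx = len - back;
        let b = buf[idx];
        if b & 0b1100_0000 == 0b1000_0000 {
            continue; // continuation byte (10xxxxxx): keep scanning backward
        }
        // `b` is a lead byte candidate; derive the expected sequence length.
        let expected = if b & 0b1000_0000 == 0 {
            1
        } else if b & 0b1110_0000 == 0b1100_0000 {
            2
        } else if b & 0b1111_0000 == 0b1110_0000 {
            3
        } else if b & 0b1111_1000 == 0b1111_0000 {
            4
        } else {
            return; // not a valid lead byte: leave the buffer untouched
        };
        if back < expected {
            buf.truncate(idx); // the trailing sequence is cut short: drop it
        }
        return;
    }
}
```

Because only the last few bytes are inspected, the cost is constant regardless of buffer size, which matches the commit's stated goal of not rescanning the entire buffer.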
@endo-ly endo-ly merged commit 8df7385 into main May 16, 2026
2 checks passed