feat(web_fetch): partial content, Readability.js extraction, split size limits by endo-ly · Pull Request #86 · endo-ly/egopulse

endo-ly · 2026-05-16T02:39:13Z

Overview

Bring the web_fetch tool up to a level where it is practical on modern web pages.

Plan: docs/plan-web-fetch-partial-readability.md

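Changes

Partial Content (partial retrieval)

When a response exceeds max_fetch_bytes, return the portion already fetched instead of an error
Stream truncation that safely preserves UTF-8 boundaries
Return success with response_truncated / output_truncated included in details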
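
The capped read described above could look roughly like the following. This is a minimal sketch, not the code in this PR: it assumes a synchronous `std::io::Read` source as a stand-in for the actual streaming HTTP body, and `CappedBody` and `read_capped` are hypothetical names introduced for illustration.

```rust
use std::io::{self, Read};

/// Result of a capped read (hypothetical type, not taken from the PR).
pub struct CappedBody {
    pub bytes: Vec<u8>,
    pub response_truncated: bool,
}

/// Reads at most `max_fetch_bytes` from `reader`. If the source holds more
/// data than that, reading stops and truncation is flagged instead of an
/// error being returned.
pub fn read_capped<R: Read>(mut reader: R, max_fetch_bytes: usize) -> io::Result<CappedBody> {
    let mut bytes = Vec::new();
    let mut chunk = [0u8; 8192];
    loop {
        let n = reader.read(&mut chunk)?;
        if n == 0 {
            // End of stream reached within the limit.
            return Ok(CappedBody { bytes, response_truncated: false });
        }
        let remaining = max_fetch_bytes - bytes.len();
        if n > remaining {
            // Keep only what fits and report a partial body.
            bytes.extend_from_slice(&chunk[..remaining]);
            return Ok(CappedBody { bytes, response_truncated: true });
        }
        bytes.extend_from_slice(&chunk[..n]);
    }
}
```

A body that is exactly `max_fetch_bytes` long is not flagged, because the flag is only set once at least one byte beyond the limit has been observed.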

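`max_bytes` → split into `max_fetch_bytes` / `max_output_bytes`

max_fetch_bytes (default 512KB): upper limit on the network fetch (config only)
max_output_bytes (default 64KB): maximum number of bytes of body text (can be overridden via tool parameter, clamped to the config limit)
No backward-compatible alias (max_bytes is removed entirely)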
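
As an illustration of the clamp rule, a minimal sketch follows. The constant and function names are hypothetical; only the default values (512KB and 64KB) and the "tool parameter may override, config limit clamps" behavior come from the description above.

```rust
/// Defaults as stated in the PR description (constant names are hypothetical).
pub const DEFAULT_MAX_FETCH_BYTES: usize = 512 * 1024; // 524288
pub const DEFAULT_MAX_OUTPUT_BYTES: usize = 64 * 1024; // 65536

/// Resolves the output limit for one call: the tool parameter may lower the
/// limit, but it can never exceed the configured maximum.
pub fn effective_max_output_bytes(config_max: usize, requested: Option<usize>) -> usize {
    match requested {
        Some(n) => n.min(config_max),
        None => config_max,
    }
}
```

How a requested value of 0 should be treated is not stated in the description, so the sketch leaves it unhandled.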

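Mozilla Readability.js main-content extraction

Introduce the readability-js = "0.1.5" crate
Extract the main content with Readability::parse_with_url(), then convert to Markdown with htmd
On failure, fall back to the existing direct htmd conversion
Guard against panics with std::panic::catch_unwind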
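
The fallback and panic guard can be sketched with the standard library alone. This sketch deliberately does not call the readability-js or htmd crates: the two extractors are passed in as closures, and `Extraction` and `extract_with_fallback` are hypothetical names, so it only illustrates the control flow under those assumptions.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Which path produced the text (hypothetical enum for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum Extraction {
    Readability,
    HtmlToMarkdown,
}

/// Runs `primary` under `catch_unwind`. A panic, a `None`, or a blank result
/// all route to `fallback`, so the caller always receives some text.
pub fn extract_with_fallback<P, F>(html: &str, primary: P, fallback: F) -> (String, Extraction)
where
    P: FnOnce(&str) -> Option<String>,
    F: FnOnce(&str) -> String,
{
    // AssertUnwindSafe is acceptable here because nothing captured by the
    // closure is reused after a panic.
    let attempt = panic::catch_unwind(AssertUnwindSafe(|| primary(html)));
    match attempt {
        Ok(Some(text)) if !text.trim().is_empty() => (text, Extraction::Readability),
        _ => (fallback(html), Extraction::HtmlToMarkdown),
    }
}
```

Note that `catch_unwind` only helps when the binary is built with unwinding panics; with `panic = "abort"` the process would still terminate.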

Security invariant

normalize() guarantees content_validation.max_scan_bytes >= max_output_bytes
Even when max_output_bytes is raised, the entire body remains subject to the injection scan
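
One way to enforce this invariant is sketched below. The struct layout is a simplified assumption (the real WebFetchConfig has more fields), and the zero-value fill uses the 64KB default mentioned earlier.

```rust
/// Simplified stand-ins for the config types (hypothetical layout).
pub struct ContentValidation {
    pub max_scan_bytes: usize,
}

pub struct WebFetchLimits {
    pub max_output_bytes: usize,
    pub content_validation: ContentValidation,
}

impl WebFetchLimits {
    /// Fills a zero output limit with the default, then raises
    /// `max_scan_bytes` so the scan always covers the whole output body.
    pub fn normalize(&mut self) {
        if self.max_output_bytes == 0 {
            self.max_output_bytes = 64 * 1024;
        }
        if self.content_validation.max_scan_bytes < self.max_output_bytes {
            self.content_validation.max_scan_bytes = self.max_output_bytes;
        }
    }
}
```

A scan limit that is already larger than the output limit is left untouched, so normalization only ever widens the scanned range.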

Clarified definition of `max_output_bytes`

Maximum number of bytes of body text (the warning is safety metadata that sits outside the limit)
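
To make "the warning sits outside the limit" concrete, here is a minimal sketch. Function names are hypothetical, and placing the warning before the body is an assumption; the description only states that the warning does not count toward max_output_bytes.

```rust
/// Cuts `text` to at most `max_output_bytes`, backing up to the nearest
/// character boundary so the result is always valid UTF-8.
pub fn truncate_output(text: &str, max_output_bytes: usize) -> (&str, bool) {
    if text.len() <= max_output_bytes {
        return (text, false);
    }
    let mut end = max_output_bytes;
    // Index 0 is always a boundary, so this loop terminates.
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    (&text[..end], true)
}

/// Applies the limit to the body only; the warning is added afterwards and
/// is not counted against `max_output_bytes`.
pub fn render_output(text: &str, max_output_bytes: usize, warning: &str) -> String {
    let (body, truncated) = truncate_output(text, max_output_bytes);
    if truncated {
        format!("{warning}\n\n{body}")
    } else {
        body.to_string()
    }
}
```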

details schema

{
  "final_url": "...",
  "content_type": "...",
  "content_length": 181374,
  "fetched_bytes": 524288,
  "response_truncated": true,
  "output_truncated": true,
  "max_fetch_bytes": 524288,
  "max_output_bytes": 65536,
  "extraction": "readability-js"
}

Changed files (9 files, +603 / -121)

File	Change
`Cargo.toml`	Add `readability-js` dependency
`src/config/web_fetch.rs`	Field split + normalization + tests
`src/config/loader.rs`	Loader path support
`src/config/persist.rs`	Serialization support
`src/tools/web_fetch/html_processing.rs`	Readability integration + ProcessedBody/ExtractionMethod
`src/tools/web_fetch/mod.rs`	partial fetch + output truncation + new details
`docs/config.md`	Config spec update
`docs/tools.md`	Tool spec update

Verification

cargo fmt --check ✓
cargo check ✓
cargo test — 1011 passed ✓
cargo clippy -D warnings ✓

Summary by CodeRabbit

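New features
- Improved web fetch size control by separating the fetch limit (max_fetch_bytes) from the output limit (max_output_bytes)
- When the size is exceeded, the portion already fetched is returned (treated as a partial fetch) instead of an error
- HTML main-content extraction now prefers Readability.js, and extraction metadata (extraction method, byte counts, truncation flags, etc.) is added to the result
Documentation
- Updated the web_fetch spec and config examples to the new spec (added output/fetch limits and details fields)
Chores
- Added an HTML processing library as a dependency
Tests
- Added and updated tests for fetching, extraction, and truncation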

…it size limits Replace max_bytes with max_fetch_bytes (512KB) and max_output_bytes (64KB). Oversized responses now return partial content instead of erroring. Integrate Mozilla Readability.js via readability-js crate for article extraction with htmd fallback.

…y output limit semantics Ensure content validation scans the full output body by normalizing max_scan_bytes to max_output_bytes when underspecified. Clarify that max_output_bytes applies to body content only (warnings excluded). Fix content_length docs to reflect null on absent header.

coderabbitai · 2026-05-16T02:39:24Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 84693456-1690-40a5-813d-bb8609ab812a

📥 Commits

Reviewing files that changed from the base of the PR and between 53534aa and eaa616e.

📒 Files selected for processing (3)

deny.toml
src/tools/web_fetch/html_processing.rs
src/tools/web_fetch/mod.rs

✅ Files skipped from review due to trivial changes (1)

deny.toml

🚧 Files skipped from review as they are similar to previous changes (2)

src/tools/web_fetch/html_processing.rs
src/tools/web_fetch/mod.rs

Walkthrough

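Major overhaul of the web_fetch tool. The fetch and output limits are separated, and the previous error return is replaced with a partial-content return. Readability.js extracts the HTML main content, and the extraction method and truncation flags are added to the result details.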

Changes

web_fetch fetch/output truncation and main-content extraction improvements

Layer / File(s)	Summary
Dependency and type/constant updates `Cargo.toml`, `src/config/web_fetch.rs`	Added `readability-js`. Split WebFetchConfig's `max_bytes` into `max_fetch_bytes` (fetch limit) and `max_output_bytes` (output limit), with defaults of 512KB/64KB.
Config loading `src/config/loader.rs`	Added the new fields to FileWebFetchConfig and updated normalize_web_fetch to assign the new defaults.
Config persistence/serialization `src/config/persist.rs`	Added `max_fetch_bytes`/`max_output_bytes` to SerializableWebFetchConfig and introduced a skip-if-default check. Updated the Config→Serializable conversion logic.
WebFetchConfig implementation and tests `src/config/web_fetch.rs`	Added default constants, zero-value filling in `Default` and `normalize`, adjustment of `content_validation.max_scan_bytes`, and updated deserialization/normalize tests.
readability-js integration in HTML processing `src/tools/web_fetch/html_processing.rs`	Added `ProcessedBody` and `ExtractionMethod`. Introduced `process_response_body_with_metadata`, which prefers Readability.js and falls back to HTML→Markdown on failure or an empty result. Adjustments such as title prepending, plus added tests.
web_fetch tool implementation (fetch/output truncation) `src/tools/web_fetch/mod.rs`	Added `max_output_bytes` to `FetchParams`. Replaced `execute` with a flow of streaming read (`config.max_fetch_bytes`) → extraction → output truncation (`max_output_bytes`). Introduced the `response_truncated`/`output_truncated` flags and `PARTIAL_CONTENT_WARNING`, and emit various metadata in the details JSON. Updated and added many tests.
Documentation update `docs/config.md`, `docs/tools.md`	Updated the `web_fetch` spec and YAML examples for `max_fetch_bytes`/`max_output_bytes` and the extraction/details fields. Removed the old "error on overflow" description.
License allowlist update `deny.toml`	Added `"UPL-1.0"` to `[licenses].allow`.

Sequence Diagram

sequenceDiagram
  participant API Caller
  participant WebFetch as web_fetch::execute
  participant HTTPClient as HTTP Client
  participant BodyReader as Streaming Reader
  participant HTMLProc as HTML Processing
  participant Output as Output Truncate
  participant Validate as Content Validate
  participant Result as Tool Result
  API Caller->>WebFetch: URL + max_output_bytes
  WebFetch->>HTTPClient: GET request
  HTTPClient-->>WebFetch: response stream
  WebFetch->>BodyReader: read until max_fetch_bytes
  alt Exceeded max_fetch_bytes
    BodyReader-->>WebFetch: partial body + response_truncated=true
  else Within limit
    BodyReader-->>WebFetch: complete body
  end
  WebFetch->>WebFetch: UTF-8 boundary normalization
  WebFetch->>HTMLProc: body + content_type + url
  HTMLProc-->>WebFetch: {text, extraction_method}
  WebFetch->>Output: text + max_output_bytes
  alt Exceeded max_output_bytes
    Output-->>WebFetch: truncated + output_truncated=true
  else Within limit
    Output-->>WebFetch: complete text
  end
  WebFetch->>Validate: final text
  alt Partial flags set
    Validate-->>WebFetch: add PARTIAL_CONTENT_WARNING
  end
  WebFetch->>Result: {text, extraction, content_length,<br/>fetched_bytes, response_truncated,<br/>output_truncated, max_fetch_bytes, max_output_bytes}
  Result-->>API Caller: success_with_details

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

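endo-ly/egopulse#85: Laid the groundwork for the web_fetch tool implementation as a precursor to this PR; this PR is its direct follow-up, delivering fetch/output truncation and the readability-js integration.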

Poem

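📚✨ A journey through articles, now smarter
Catching the essence with Readability
Overlong responses are now "cut and returned"
Not an error, but a partial success
The metadata tells the whole story 🎯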

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
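Title check	✅ Passed	The PR title concisely expresses the three main changes to the web_fetch tool (partial content support, Readability.js adoption, and the size-limit split) and accurately summarizes the core of the change set.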
Docstring Coverage	✅ Passed	Docstring coverage is 90.57% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/config/loader.rs (1)
117-126: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

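Old config keys are silently ignored, which is risky

With the migration from max_bytes to max_fetch_bytes and max_output_bytes, a max_bytes entry left in an old config file is ignored at parse time and the value falls back to the default. The risk is that a migration mistake goes unnoticed. To fail fast, it is safer to add #[serde(deny_unknown_fields)].
Suggested diff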
 #[derive(Debug, Deserialize, Default)]
+#[serde(deny_unknown_fields)]
 struct FileWebFetchConfig {
     allowed_schemes: Option<Vec<String>>,
     timeout_secs: Option<u64>,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/config/loader.rs` around lines 117 - 126, The FileWebFetchConfig struct
silently ignores unknown/old keys like max_bytes; add serde strict
deserialization by annotating the struct (FileWebFetchConfig) with
#[serde(deny_unknown_fields)] so any unknown fields in incoming config cause a
parse error (fail-fast) and force migration; apply same deny_unknown_fields to
related config structs (e.g., FileWebFetchContentValidationConfig) if they also
must reject unknown keys.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/tools/web_fetch/html_processing.rs`:
- Around line 66-70: The is_html_content function uses a case-sensitive contains
check; update it to perform a case-insensitive check by normalizing the header
string (e.g., call to_lowercase or equivalent on the content_type before
checking contains "text/html") so values like "Text/HTML; charset=UTF-8" are
detected as HTML; ensure you still return true for None and only change the
matching logic inside is_html_content.

In `@src/tools/web_fetch/mod.rs`:
- Around line 271-278: The current truncate_to_utf8_boundary pops bytes until
the whole buffer is valid UTF-8, which can remove valid data; change it to only
drop an incomplete trailing UTF-8 sequence. In truncate_to_utf8_boundary,
inspect at most the last 4 bytes: scan backward to find the nearest potential
UTF-8 leading byte (by testing byte patterns: 0xxxxxxx, 110xxxxx, 1110xxxx,
11110xxx) and count how many bytes the sequence should have, then if the buffer
ends with fewer bytes than that expected length, truncate those trailing
continuation bytes (bytes with pattern 10xxxxxx) only; otherwise leave the
buffer untouched so interior invalid bytes remain for later UTF-8 error
handling.

---

Outside diff comments:
In `@src/config/loader.rs`:
- Around line 117-126: The FileWebFetchConfig struct silently ignores
unknown/old keys like max_bytes; add serde strict deserialization by annotating
the struct (FileWebFetchConfig) with #[serde(deny_unknown_fields)] so any
unknown fields in incoming config cause a parse error (fail-fast) and force
migration; apply same deny_unknown_fields to related config structs (e.g.,
FileWebFetchContentValidationConfig) if they also must reject unknown keys.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 473453fc-5f45-4780-acf7-e129c8429c22

📥 Commits

Reviewing files that changed from the base of the PR and between b2175e7 and 53534aa.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (8)

Cargo.toml
docs/config.md
docs/tools.md
src/config/loader.rs
src/config/persist.rs
src/config/web_fetch.rs
src/tools/web_fetch/html_processing.rs
src/tools/web_fetch/mod.rs

…ncation Make is_html_content case-insensitive for Content-Type headers like 'Text/HTML'. Rewrite truncate_to_utf8_boundary to only drop the incomplete trailing UTF-8 sequence instead of rescanning the entire buffer.
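
The two fixes named in this commit message can be sketched as follows. This is an illustration written from the review comments, not the committed code: the signatures are assumptions, and only the function names and the intended behavior (case-insensitive match, `None` treated as HTML, drop only an incomplete trailing sequence) come from the review.

```rust
/// Case-insensitive Content-Type check; a missing header is treated as HTML.
pub fn is_html_content(content_type: Option<&str>) -> bool {
    match content_type {
        None => true,
        Some(ct) => ct.to_ascii_lowercase().contains("text/html"),
    }
}

/// Drops only an incomplete trailing UTF-8 sequence. Invalid bytes in the
/// interior are left alone for later UTF-8 error handling.
pub fn truncate_to_utf8_boundary(buf: &mut Vec<u8>) {
    let len = buf.len();
    // Look at no more than the last 4 bytes for the final lead byte.
    for back in 1..=len.min(4) {
        let idx = len - back;
        let b = buf[idx];
        if b & 0b1100_0000 == 0b1000_0000 {
            continue; // continuation byte (10xxxxxx): keep scanning backward
        }
        let expected = if b & 0b1000_0000 == 0 {
            1 // 0xxxxxxx
        } else if b & 0b1110_0000 == 0b1100_0000 {
            2 // 110xxxxx
        } else if b & 0b1111_0000 == 0b1110_0000 {
            3 // 1110xxxx
        } else if b & 0b1111_1000 == 0b1111_0000 {
            4 // 11110xxx
        } else {
            return; // not a valid lead byte: leave the buffer untouched
        };
        if back < expected {
            buf.truncate(idx); // the sequence was cut off mid-character
        }
        return;
    }
}
```

Because at most four trailing bytes are inspected, the cost is constant regardless of buffer size, which is the point of replacing the whole-buffer rescan.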

endo-ly added 2 commits May 15, 2026 19:09

coderabbitai Bot reviewed May 16, 2026


endo-ly added 2 commits May 16, 2026 03:10

fix(ci): add UPL-1.0 to allowed licenses for readability-js

eaa616e

endo-ly merged commit 8df7385 into main May 16, 2026
2 checks passed

endo-ly merged 4 commits into main from feat/web-fetch-partial-readability
