fix(splitter): pass dataset-level stage-two chunking config #286

Draft
ottercoconut wants to merge 1 commit into dev from feature/link-226-stage-two-config

@ottercoconut

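Summary of changes

Fixes dataset-level chunking configuration not being fully passed through to splitter assembly. Related issue: #285

With this change, create_chunking_engine(config=...) uses the following fields from the dataset-level ChunkingConfig: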

  • stage_two_algorithm
  • protected_neighbor_overlap

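Previously, the factory already consumed overlap_tokens, min_candidate_chunk_tokens, max_chunk_tokens, hard_max_tokens, and heading_break_level, but StageTwoRouter still read the global settings.CHUNKING_STAGE_TWO_ALGORITHM. As a result, even after the frontend or dataset enabled semantic_depth_window, chunking still ran as candidate_boundary + noop whenever the server-wide default remained noop.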

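Root cause

StageServices.run_chunking() already uses the dataset-level stage_two_algorithm to decide whether the user's EMBEDDING needs to be loaded, but src/core/splitter/factory.py falls back to the global settings when it actually constructs StageTwoRouter: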

StageTwoRouter(
    algorithm_name=settings.CHUNKING_STAGE_TWO_ALGORITHM,
    ...
)

The result: the configuration was read, but the execution route never used it.
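
The intended resolution order can be sketched as follows. This is a minimal illustration, not the project's code: the reduced `Settings` and `ChunkingConfig` shapes and the `resolve_stage_two_algorithm` helper are hypothetical, and only the field and setting names come from this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    # Hypothetical stand-in for the global settings object.
    CHUNKING_STAGE_TWO_ALGORITHM: str = "noop"


@dataclass(frozen=True)
class ChunkingConfig:
    # Hypothetical: reduced to the single field relevant here.
    stage_two_algorithm: str = "noop"


def resolve_stage_two_algorithm(
    config: Optional[ChunkingConfig], settings: Settings
) -> str:
    """Prefer the dataset-level value; use the global one only without a config."""
    if config is None:
        return settings.CHUNKING_STAGE_TWO_ALGORITHM
    return config.stage_two_algorithm
```

With this order, a dataset that sets semantic_depth_window is routed accordingly even when the global default is noop, and a call without a config behaves as before.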

Changes

  • src/core/splitter/factory.py

    • Resolve stage_two_algorithm from ChunkingConfig and pass it to StageTwoRouter.
    • Resolve protected_neighbor_overlap from ChunkingConfig and pass it to StructuredSemanticChunker.
    • When config is None, keep falling back to the system-level CHUNKING_* settings.
  • tests/unit/core/splitter/test_factory.py

    • Cover dataset-level semantic_depth_window overriding global noop.
    • Cover dataset-level noop overriding global semantic_depth_window.
    • Cover runtime output metadata: when the global value is noop and the dataset-level value is semantic_depth_window, the final split_strategy should be candidate_boundary + semantic_depth_window.
    • Also cover propagation of the chunking config fields: heading_break_level, min_candidate_chunk_tokens, overlap_tokens, max_chunk_tokens, hard_max_tokens, stage_two_algorithm, protected_neighbor_overlap.
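
One way to check field propagation is to give every field a distinct sentinel value and compare what arrives after assembly. The sketch below only illustrates that idea: the field names are taken from the list above, while `assemble_kwargs` and `unpropagated_fields` are hypothetical stand-ins, not the actual factory or the tests in this PR.

```python
# Field names come from the PR description; everything else is hypothetical.
CONFIG_FIELDS = (
    "heading_break_level",
    "min_candidate_chunk_tokens",
    "overlap_tokens",
    "max_chunk_tokens",
    "hard_max_tokens",
    "stage_two_algorithm",
    "protected_neighbor_overlap",
)


def assemble_kwargs(config: dict) -> dict:
    """Hypothetical stand-in for the factory: forward every dataset-level field."""
    return {name: config[name] for name in CONFIG_FIELDS}


def unpropagated_fields(assemble=assemble_kwargs) -> list:
    """Return the fields whose value did not survive assembly unchanged."""
    # A distinct sentinel per field catches both dropped and swapped values.
    config = {name: f"sentinel-{index}" for index, name in enumerate(CONFIG_FIELDS)}
    assembled = assemble(config)
    return [name for name in CONFIG_FIELDS if assembled.get(name) != config[name]]
```

An empty result means every field arrived intact. An assembly step that silently replaces stage_two_algorithm with a global value would be reported by name.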

Verification

.venv/bin/python -m pytest tests/unit/core/splitter
.venv/bin/python scripts/quality/check_docs_sync.py --working
git diff --check

Results:

  • tests/unit/core/splitter: 69 passed
  • docs sync: OK, 2 changed files, no doc-sync issues
  • git diff --check: passed

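Risks and notes

  • This PR only fixes the dataset-level configuration not being passed to the splitter routing.
  • The LINK-226 diagnosis also observed that, once stage two is enabled, neighbor overlap is appended after the hard max validation, so individual chunks can still exceed the user-configured target or hard limit. That issue is distinct from the configuration propagation gap fixed here and is best handled in a separate follow-up.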
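
The ordering problem in the second note can be illustrated with a simplified model that works on token counts only. The `final_sizes` function, the assumption that every chunk receives the same overlap, and the numbers used are all hypothetical; the real splitter is not reproduced here.

```python
from typing import List


def final_sizes(chunk_tokens: List[int], neighbor_overlap: int, hard_max: int) -> List[int]:
    """Model the described order: validate chunk bodies, then append overlap."""
    # Step 1: the hard-max check sees only the chunk body.
    for size in chunk_tokens:
        if size > hard_max:
            raise ValueError(f"chunk of {size} tokens exceeds hard max {hard_max}")
    # Step 2: overlap is added afterwards, so the total is never re-checked.
    return [size + neighbor_overlap for size in chunk_tokens]


# Illustrative numbers only: both bodies pass validation, both totals exceed 512.
sizes = final_sizes([480, 512], neighbor_overlap=64, hard_max=512)
```

In this model, `sizes` is `[544, 576]`, although each chunk body passed the check against a hard max of 512.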
