feat: align 27 dataset config items across Java/Python models and complete the three MinIO buckets (LINK-219) #142
Merged
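- Aligned the 5 categories / 27 items of dataset parse/retrieval configuration one by one with the authoritative Python model (src/core/dataset_config/models.py): field names, types, validation, and default-fallback semantics are kept consistent.
- Fixed ChunkingConfig.protected_neighbor_overlap: Integer → Boolean, matching Python's on/off semantics (whether chunks containing protected elements participate in neighbor overlap).
- Added the case RAW branch to LocalFileService, fixing the compile failure caused by OssSavePlaceEnum.RAW not being covered.
- Synced the docs and config related to the three MinIO buckets (public / raw files tolink-rag-raw / parsed artifacts tolink-rag-docs): .env.example, configuration.md, object_storage_module.md, document_file_module.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>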
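Background
The Python side (toLink-Rag) has finished defining every parameter of the dataset-level parse/retrieval configuration; the authoritative source is src/core/dataset_config/models.py. This PR aligns the Java admin side with it item by item and, along the way, fixes the compile error left over from the MinIO three-bucket change and syncs the related docs. Related issue: LINK-219.
Java ↔ Python configuration alignment (5 categories, 27 items)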
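Field names, types, validation ranges, and default-fallback semantics all match the Python Pydantic models. Both ends also share the same layered semantics: fields that are not submitted are not persisted, and Python fills in defaults from the system Settings (L1) when it consumes the config. The 27 items, grouped by category:
- heading_break_level / min_candidate_chunk_tokens / overlap_tokens / max_chunk_tokens / hard_max_tokens / stage_two_algorithm / protected_neighbor_overlap
- enable_table_enhancement / enable_image_enhancement / enable_heading_hierarchy
- pdf_parser_backend
- recall_result_limit / recall_context_token_budget / bm25_top_k / sparse_top_k / sparse_score_threshold / dense_top_k / dense_score_threshold / recall_enabled_sources / recall_fusion_strategy / fusion_bm25_weight / fusion_sparse_weight / fusion_dense_weight / rerank_top_n / recall_strict
- sparse_embedding_config_id / dense_embedding_config_id (two independent BIGINT columns; immutable once bound, enforced on the Java side)
The only semantic fix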
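ChunkingConfig.protected_neighbor_overlap: the original Java implementation typed it as Integer (mistaking it for a "number of overlap tokens"), whereas the authoritative Python model defines it as Boolean (a switch: whether chunks containing protected elements, i.e. tables, code blocks, and formulas, participate in neighbor overlap). This PR changes it to Boolean, eliminating the semantic drift caused by cross-end int→bool coercion.
Two items confirmed aligned, no changes needed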
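- Parser backend values auto / mineru / opendataloader / naive: consistent with the backends actually registered in the Python registry.
- embedding_config JSON column: in practice both ends use two independent BIGINT columns, so they are fully compatible.
MinIO three-bucket completion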
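- Added the case RAW branch to LocalFileService, fixing the switch compile failure caused by OssSavePlaceEnum.RAW not being covered.
- Synced the docs and config related to the three MinIO buckets (public tolink-public / raw files tolink-rag-raw / parsed artifacts tolink-rag-docs): .env.example, docs/ops/configuration.md, docs/internals/object_storage_module.md, docs/internals/document_file_module.md.
Tests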
- DatasetParseConfigServiceImplTest (13) / DatasetParseConfigControllerTest (27) / LocalFileServiceTest (7): 47 cases in total, 0 failures.
- python3 scripts/check_docs_sync.py --working passes, with no missed doc syncs.
Wrap-up
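With this, the Java admin side and the Python execution side are aligned on the dataset configuration contract: both ends share the same dataset_parse_config table, the same field semantics, and the same default-fallback rules. A new config item only needs to be added to the authoritative Python model and to the Java DTOs touched by this PR, in sync. The web config form will take these 27 items as its next alignment target.
🤖 Generated with Claude Code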
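The shared contract summarized above (unsubmitted fields are not persisted, defaults are filled from system settings on consumption, and protected_neighbor_overlap is a strict boolean switch) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's code: SYSTEM_DEFAULTS, validate_stored, and resolve_config are hypothetical names, the default values are invented, and treating a None value as "not submitted" is an assumption. Only the field names and backend names come from the PR description.

```python
from typing import Any, Dict, Mapping

# Hypothetical stand-in for the system "Settings (L1)" layer.
# Field names come from the PR; the default values are invented.
SYSTEM_DEFAULTS: Dict[str, Any] = {
    "overlap_tokens": 64,
    "max_chunk_tokens": 512,
    "protected_neighbor_overlap": False,
    "pdf_parser_backend": "auto",
}

# Backend names as listed in the PR description.
PDF_PARSER_BACKENDS = {"auto", "mineru", "opendataloader", "naive"}


def validate_stored(stored: Mapping[str, Any]) -> None:
    """Reject values whose type would silently change meaning across ends."""
    for name in stored:
        if name not in SYSTEM_DEFAULTS:
            raise KeyError(f"unknown config field: {name}")

    flag = stored.get("protected_neighbor_overlap")
    # bool is a subclass of int in Python, so test for bool explicitly:
    # an integer such as 32 must not be coerced into True.
    if flag is not None and not isinstance(flag, bool):
        raise TypeError("protected_neighbor_overlap must be a boolean switch")

    backend = stored.get("pdf_parser_backend")
    if backend is not None and backend not in PDF_PARSER_BACKENDS:
        raise ValueError(f"unsupported pdf_parser_backend: {backend}")


def resolve_config(stored: Mapping[str, Any]) -> Dict[str, Any]:
    """Overlay the persisted (possibly partial) row on the system defaults."""
    validate_stored(stored)
    # Assumption: a missing key or a None value (e.g. a NULL column)
    # both mean "not submitted", so the system default applies.
    submitted = {k: v for k, v in stored.items() if v is not None}
    return {**SYSTEM_DEFAULTS, **submitted}
```

Calling resolve_config({"overlap_tokens": 128}) keeps the submitted value and takes every other field from SYSTEM_DEFAULTS, while resolve_config({"protected_neighbor_overlap": 32}) raises TypeError instead of coercing the integer into a switch.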