
feat: support fuzzy command search #573

Open
gdm257 wants to merge 3 commits into
ZToolsCenter:mainfrom
gdm257:pr/token-search

@gdm257 gdm257 commented Jul 4, 2026

Contributor

Issues

Closes #495
Related #462 #352

TL;DR

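  • Command search now accepts initial-letter / word-start matches such as tm → Task Manager and renwu → 任务管理器, whereas the old implementation only supported initial letters.
  • The existing ranking design (preferences, system apps, commands, frequency, etc.) is preserved as far as possible. The feature is controlled by the "Word token search" (分词搜索) switch in Settings and is enabled by default.
  • Core idea: "in-order matching" (the existing design) plus "token-based ranking and filtering" (inspired by fzf and IDE completion tools). Tests cover the core tokenization, segmentation, matching, and ranking logic.
  • Tested and tuned locally for several days; changes to existing files are kept to a minimum and compatibility is preserved.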

Screenshots

PixPin_2026-07-04_08-57-36 PixPin_2026-07-04_21-52-41 PixPin_2026-07-04_21-53-38 PixPin_2026-07-04_21-52-13

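Weight ranking

| Weight | Direct tokenization | Encoded tokenization (pinyin) | Example (query → item) |
|---|---|---|---|
| 1000 | multi tokens, full item ✅ | multi tokens, full item ✅ | task manager → Task Manager / renwuguanliqi → 任务管理器 |
| 900 | multi tokens, consecutive word-start ✅ | multi tokens, consecutive word-start ✅ | tm → Task Manager / rwgl → 任务管理器 |
| 800 | single token, full token ✅ | single token, full token ✅ | task → task manager / guan → 管理器 |
| 600 | single token, word-start substring ✅ | | tas → task manager |
| 400 | multi tokens, non-consecutive word-start ✅ | multi tokens, non-consecutive word-start ✅ | vc → Visual + Code (skipping Studio) / rlq → 任务管理器 (with skipped characters) |
| | ❌ single token: non-word-start substring, subsequence | ❌ single token: word-start substring, non-word-start substring, subsequence | ana → manager / mgr → manager |
| | ❌ multi tokens: cross-word mixed, cross-word all mid-word | ❌ multi tokens: cross-word mixed, cross-word all mid-word | askan → task+manager |

  • System apps get a slight boost: within the same tier, direct+app ranks slightly ahead of commands, but never crosses tiers to outrank a higher tier (existing design preserved).
  • Higher frequency wins at equal weight: within the same tier and the same type, items are sorted by usage frequency in descending order, so frequently used items come first (existing design preserved).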
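
The ranking rules above (tier first, a small system-app boost inside a tier, then frequency) can be sketched as a comparator. This is a minimal illustration, not the PR's engine: the tier values and the app boost of 50 come from the PR text, while the pattern names, the `Hit` shape, and the sample data are hypothetical.

```typescript
// Hypothetical pattern names; only the numeric tiers come from the weight table.
type Pattern =
  | "fullItem"
  | "consecutivePrefix"
  | "fullToken"
  | "tokenPrefix"
  | "discontinuousPrefix";

const TIER: Record<Pattern, number> = {
  fullItem: 1000,
  consecutivePrefix: 900,
  fullToken: 800,
  tokenPrefix: 600,
  discontinuousPrefix: 400,
};

// Soft boost for system apps; it has to stay below the smallest tier gap (100).
const APP_BONUS = 50;

interface Hit {
  name: string;
  pattern: Pattern;
  isApp: boolean;
  frequency: number;
}

function score(hit: Hit): number {
  return TIER[hit.pattern] + (hit.isApp ? APP_BONUS : 0);
}

// Higher score first; equal scores fall back to usage frequency, descending.
function compareHits(a: Hit, b: Hit): number {
  return score(b) - score(a) || b.frequency - a.frequency;
}

// Made-up sample data for illustration only.
const sampleHits: Hit[] = [
  { name: "app-800", pattern: "fullToken", isApp: true, frequency: 0 },
  { name: "cmd-900", pattern: "consecutivePrefix", isApp: false, frequency: 0 },
  { name: "cmd-800-rare", pattern: "fullToken", isApp: false, frequency: 1 },
  { name: "cmd-800-often", pattern: "fullToken", isApp: false, frequency: 9 },
];

const ranked = [...sampleHits].sort(compareHits).map((hit) => hit.name);
```

In this sketch the app in the 800 tier scores 850, so it leads its own tier but still sits below the command in the 900 tier.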

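MECE classification

| Category | Definition | Direct tokenization | Encoded tokenization (pinyin) | Example (query → item) |
|---|---|---|---|---|
| multi tokens, full item | Q = original text of the candidate item (covers all tokens, every segment complete) | | | task manager → Task Manager |
| multi tokens, consecutive word-start | each segment = contiguous word-start substring of its own token; tokens are adjacent | | | tm → [task]+[manager]; tasm → [tas]k+[m]anager; ren w g l → [任][务][管][理]器 |
| single token, full token | Q = one complete token (contiguous, word-start, fully covered) | | | task → [task, manager] |
| single token, word-start substring | Q = contiguous word-start substring of one token (tokenStart=0) | | | man → [task, manager]; m → manager |
| multi tokens, non-consecutive word-start | each segment = contiguous word-start substring of its own token; tokens are skipped | | | vc → [visual]+[code] (skipping Studio) |
| single token, non-word-start substring | Q = contiguous substring inside one token (tokenStart>0) | | | ana → [task, manager] |
| single token, subsequence | Q = non-contiguous, ordered characters inside one token | | | mgr → [task, manager] |
| multi tokens, cross-word mixed | spans ≥2 tokens, with both a word-start segment and a mid-word segment | | | askan → t[ask]+m[an]ager |
| multi tokens, cross-word all mid-word | spans ≥2 tokens, every segment mid-word (no word-start segment) | | | skan → ta[sk]+m[an]ager |

  • Pinyin is stricter: syllables must be complete; for example, the syllable gu does not match the syllable guan.
  • In-order matching: every character of the query must appear in the candidate item in the same order as in the query; otherwise there is no match.
  • Token filtering and ranking: the command is split into tokens at ASCII boundaries (each CJK character is its own token). The match pattern is then determined by where the query lands relative to the tokens; low-information patterns are filtered out, and the remaining patterns are ranked by tier.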
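
The tokenization and word-start matching described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's `tokenizer.ts` or `tokenSearch.ts`: `tokenize` and `matchTokenPrefixes` are hypothetical names, every non-alphanumeric ASCII character is treated as a separator, camelCase boundaries split tokens, and pinyin is not handled. The query is assumed to be lower-case with spaces removed.

```typescript
function isCjk(ch: string): boolean {
  const code = ch.charCodeAt(0);
  return (
    (code >= 0x4e00 && code <= 0x9fff) ||
    (code >= 0x3400 && code <= 0x4dbf) ||
    (code >= 0xf900 && code <= 0xfaff)
  );
}

// Split text into lower-cased tokens; each CJK character becomes its own token.
function tokenize(text: string): string[] {
  const tokens: string[] = [];
  let current = "";
  const flush = (): void => {
    if (current) tokens.push(current.toLowerCase());
    current = "";
  };
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (isCjk(ch)) {
      flush();
      tokens.push(ch);
    } else if (!/[A-Za-z0-9]/.test(ch)) {
      flush(); // separators and symbols end the current token
    } else {
      // Assumed camelCase rule: a lower-case letter followed by an upper-case one.
      if (/[a-z]/.test(current.slice(-1)) && /[A-Z]/.test(ch)) flush();
      current += ch;
    }
  }
  flush();
  return tokens;
}

// Returns the indexes of the tokens whose word-start prefixes spell the query
// in order, or null. It returns the first alignment found, which is not
// necessarily the best-ranked one.
function matchTokenPrefixes(query: string, tokens: string[], from = 0): number[] | null {
  if (query === "") return [];
  for (let t = from; t < tokens.length; t++) {
    const token = tokens[t] ?? "";
    let common = 0;
    while (
      common < query.length &&
      common < token.length &&
      query.charAt(common) === token.charAt(common)
    ) {
      common++;
    }
    // Try the longest word-start prefix first, then shorter ones.
    for (let len = common; len >= 1; len--) {
      const rest = matchTokenPrefixes(query.slice(len), tokens, t + 1);
      if (rest) return [t, ...rest];
    }
  }
  return null;
}

// Adjacent token indexes correspond to the "consecutive word-start" category.
function isConsecutive(indexes: number[]): boolean {
  return indexes.every((idx, i) => i === 0 || idx === (indexes[i - 1] ?? -2) + 1);
}
```

With these definitions, `matchTokenPrefixes("tasm", tokenize("Task Manager"))` yields `[0, 1]`, `matchTokenPrefixes("vc", tokenize("Visual Studio Code"))` yields `[0, 2]` (non-consecutive), and both `askan` and `mgr` return `null` against `Task Manager`, mirroring the filtered categories.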

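Known limitations / intended design

  • Queries cannot mix pinyin and English: a query is either all pinyin or all English (or raw Unicode characters). Mixed encodings would introduce noise from other languages into the results, complicate the algorithm, and hurt performance. Only English and pinyin are supported today; if encodings such as double pinyin (双拼), Wubi (五笔), or romaji are added later, the noise from mixed input would get worse.
  • Queries cannot contain symbols: symbols are not counted as tokens. The impact is small. For example, 7-Zip File Manager cannot be found with 7-Zip, but 7z works; only very few apps (such as Dism++) are named with symbols.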

gdm257 added 2 commits July 4, 2026 19:04
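fix(search): fix soft-weighting cross-tier regression and list-mode sorting

- typeWeights.app 300→50: with modeTiers spaced 100 apart, the previous +300 caused a cross-tier regression; 50 is below the minimum tier gap, which guarantees no tier crossing
- Remove the post-processing call to stableSortByAppWeight, which was never implemented in commandDataStore (soft weighting already lives in the engine's scoreByPattern)
- List-mode search-preference fix: extract preference pinning into the pure function sortListModeResults, and restore the searchPreference export on the store
- List mode reverts from hard grouping to soft weighting (aligned with the original design in 1554a27)
- Rewrite listModeSort.test.ts with soft-weighting assertions (8 cases); comments now use the two-layer wording of primary sort + tiebreaker

style(tests): prettier formatting fixes and test description sync

- prettier adds the missing trailing newline to 6 test files (left over from the pre-commit hook)
- tokenSearchRegression: +300 semantics → typeWeights.app soft weighting (aligned with the 300→50 change)

test(tokenSearch): add engine-level anchors for soft weighting not crossing tiers

The core constraint of this fix (typeWeights.app < minimum modeTier gap) previously
had no test anchoring it at the engine level. There was only a positive assertion
that app > plugin within the same tier, and no negative constraint that a lower-tier
app must not outrank a higher-tier plugin. Add two tests, a config constraint and a
behavior constraint, so that reverting to 300 makes the tests fail.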
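
The two constraints this commit describes might be checked along the following lines. This is only a sketch: the tier values are the ones listed in the weight table, and `minTierGap`, `satisfiesConfigConstraint`, and `lowerTierAppOutranks` are hypothetical helper names, not functions from the PR's test files.

```typescript
// Tier weights as listed in the PR's weight table.
const modeTiers: number[] = [1000, 900, 800, 600, 400];

// Smallest distance between two neighbouring tiers.
function minTierGap(tiers: number[]): number {
  const sorted = [...tiers].sort((a, b) => b - a);
  const gaps = sorted.slice(1).map((tier, i) => (sorted[i] as number) - tier);
  return Math.min(...gaps);
}

// Config constraint: the app boost must be smaller than the smallest tier gap.
function satisfiesConfigConstraint(appWeight: number): boolean {
  return appWeight < minTierGap(modeTiers);
}

// Behavior constraint: does a boosted app in a lower tier reach or pass
// an unboosted item in the next higher tier?
function lowerTierAppOutranks(appWeight: number): boolean {
  const sorted = [...modeTiers].sort((a, b) => b - a);
  return sorted.slice(1).some((lower, i) => lower + appWeight >= (sorted[i] as number));
}
```

With the table's values the smallest gap is 100, so a boost of 50 passes both checks, while the earlier value of 300 fails both.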

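test(tokenSearch): remove spec-number references from tests so assertions are self-contained

Artifacts are not committed with the PR, so req/Req/Task numbers and file names
such as design.md are opaque to reviewers. Replace them with inline semantic
descriptions so the assertions explain themselves.

- tokenSearchDualPath.test.ts: 6 places (Req 1.1/3.4/5.2/5.4/6.1/6.2)
- tokenSearchIntegration.test.ts: 1 place (Task 3.3)
- tokenSearch.test.ts: 2 places (req 3.4, Req 8.5)
- tokenSearchRegression.test.ts: 1 place (req 3.4)

fix(search): fix list mode being completely blank (TypeError caused by Pinia ref unwrapping on destructure)

In allListModeResults, const pref = searchPreference.value[query] throws a
TypeError: once destructured from the Pinia setup store, searchPreference has
already been auto-unwrapped to a plain value object, so .value is undefined and
undefined[query] throws. The computed then fails to evaluate, and list mode shows
nothing no matter what is typed. Aggregate mode is unaffected because its
preference pinning happens inside store.search (where ref.value is correct).

Root cause: in the base version, allListModeResults never reads searchPreference
directly. List-mode preference pinning relies on store.search placing the preferred
item first in bestSearchResults, so it naturally stays in front after merge and
dedup. 0eae037 needlessly re-implemented preference pinning in list mode and
introduced the bug.

Fix (return to the base design):
- Remove the pref parameter and the search-preference pinning logic from sortListModeResults
- Remove searchPreference from the useSearchResults destructuring, along with ListModeSortCtx.pref
- Revert the redundant searchPreference export in the store's return (base has no such export)
- List-mode preference is still guaranteed by store.search, so no duplicate handling is needed
- Tests: delete the search-preference pinning describe block (this function no longer owns that responsibility)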
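
The failure mode described in this commit can be reproduced without Pinia. The snippet below is a stand-in, not Pinia itself: a plain getter imitates the way a store hands out the inner value of a ref, and all names and data (`searchPreferenceRef`, `store`, the `tm` entry) are hypothetical.

```typescript
type Preferences = Record<string, string>;

// Stand-in for a ref created inside a setup store (made-up data).
const searchPreferenceRef: { value: Preferences } = {
  value: { tm: "Task Manager" },
};

// Stand-in for the store: property access yields the inner value,
// not the ref wrapper, which is the behavior the commit describes.
const store = {
  get searchPreference(): Preferences {
    return searchPreferenceRef.value;
  },
};

// After destructuring, this is already a plain object with no `.value`.
const { searchPreference } = store;

// Mirrors the buggy access pattern `searchPreference.value[query]`.
function readViaValue(query: string): string {
  const maybeRef = searchPreference as unknown as { value?: Preferences };
  try {
    return (maybeRef.value as Preferences)[query] ?? "(no preference)";
  } catch (err) {
    return err instanceof TypeError ? "TypeError" : "other error";
  }
}

// Reading the unwrapped object directly works.
function readDirectly(query: string): string {
  return searchPreference[query] ?? "(no preference)";
}
```

Here `readViaValue("tm")` ends in the `TypeError` branch because `.value` is `undefined`, while `readDirectly("tm")` returns the stored entry. The PR's actual fix goes further and removes the list-mode lookup entirely.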

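docs(word-token-search): sync the Phase 9 list-mode regression fix to artifacts

- design.md: correct the description of list-mode sorting. Preference pinning is
  guaranteed by store.search; allListModeResults does not repeat it and only applies token-tier sorting
- tasks.md: append Phase 9 Implementation Notes recording the 3ee0862 fix
  (TypeError caused by Pinia ref unwrapping on destructure), explicitly overturning
  the Phase 8 search-preference pinning approach

fix(search): align list-mode tier order with the engine's modeTier weight table

The old listModeRank placed all cross-word word-start matches (including
discontinuous ones) at rank 1, above full-token at rank 2, which violates the weight
table: discontinuous word-start (400) should rank below full-token (800). As a result,
for the query 'dance', 本地安全策略 (pinyin multiTokensPrefixDiscontinuous)
was ranked ahead of MikuMikuDance (singleTokenExactitude).

Fix: listModeRank now returns the DEFAULT_CONFIG.modeTiers weight directly (larger
means higher priority), eliminating the hand-written rank table that ran in parallel
with the engine and disagreed with it. The sortListModeResults comparator becomes
descending (rankB - rankA).

- The engine layer was correct (MikuMikuDance 1086 > 本地安全策略 627); the bug was only in
  the parallel listModeRank implementation used by list mode
- listModeRank.test.ts assertions rewritten to weight-value semantics
- listModeSort.test.ts adds a regression anchor: discontinuous word-start (400) must not outrank full-token (800)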
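
The idea of this fix, reading the rank from the same weight table the engine uses and sorting in descending order, can be sketched as below. The identifiers `DEFAULT_CONFIG.modeTiers`, `listModeRank`, `sortListModeResults`, and the two mode names appear in the commit message; the object shapes, signatures, and the reduced two-entry table are assumptions made for illustration.

```typescript
// Reduced weight table; only the two modes named in the commit are included.
const DEFAULT_CONFIG = {
  modeTiers: {
    singleTokenExactitude: 800,
    multiTokensPrefixDiscontinuous: 400,
  },
} as const;

type Mode = keyof typeof DEFAULT_CONFIG.modeTiers;

// Assumed result shape for the sketch.
interface ListItem {
  name: string;
  mode: Mode;
}

// Rank is the tier weight itself, so there is no second table to keep in sync.
function listModeRank(item: ListItem): number {
  return DEFAULT_CONFIG.modeTiers[item.mode];
}

// Larger weight means higher priority, hence the descending comparator.
function sortListModeResults(items: ListItem[]): ListItem[] {
  return [...items].sort((a, b) => listModeRank(b) - listModeRank(a));
}

// The 'dance' case from the commit message.
const danceResults = sortListModeResults([
  { name: "本地安全策略", mode: "multiTokensPrefixDiscontinuous" },
  { name: "MikuMikuDance", mode: "singleTokenExactitude" },
]);
```

In this sketch `danceResults[0]` is the MikuMikuDance entry, matching the order the commit says the engine already produced.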

docs(word-token-search): add a Phase 10 record of the listModeRank tier-order alignment with modeTier

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new word token search feature (分词搜索) that supports cross-word abbreviation matching. It includes a new tokenizer, a token search engine, a setting toggle in the general settings UI, and updated highlighting and sorting logic. The review feedback highlights several critical improvements: ensuring Vue's reactivity tracks the search setting correctly, resolving a highlighting misalignment bug for mixed pinyin/number queries by including digits in character position mapping, preventing a potential infinite loop in pinyin segmentation when encountering empty syllables, and adding defensive checks in the character classification function to handle undefined inputs.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

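// still come first after merge and dedup, so no need to handle them again here), then sort by token tier.
// Tiebreaker within a tier: system-app soft weighting → frequency.
// Switch OFF: legacy comparator (exact match → prefix → system app → frequency)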
if (wordTokenEnabled) {

high

Issue: Access commandDataStore.wordTokenEnabled directly here to ensure Vue's reactivity system tracks the dependency correctly and re-evaluates the computed property when the setting changes.

Suggested change
- if (wordTokenEnabled) {
+ if (commandDataStore.wordTokenEnabled) {

Comment on lines +186 to +199
function getChineseCharPositions(text: string): number[] {
  const positions: number[] = []
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i)
    if (
      (code >= 0x4e00 && code <= 0x9fff) ||
      (code >= 0x3400 && code <= 0x4dbf) ||
      (code >= 0xf900 && code <= 0xfaff)
    ) {
      positions.push(i)
    }
  }
  return positions
}

high

🔴 Critical Highlighting Alignment Bug

Issue:
The getChineseCharPositions function only extracts the indices of Chinese characters. However, buildPinyinFields in commandDataStore.ts includes digits 0-9 in pinyinTokens to support mixed pinyin/number queries (e.g., "360极速" -> ['3', '6', '0', 'ji', 'su']).

Because digits are omitted from getChineseCharPositions, the mapped indices for pinyin matches containing numbers will be completely misaligned, leading to incorrect characters being highlighted (e.g., searching "360" on "360极速" highlights "极速" instead of "360").

Resolution:
Update getChineseCharPositions to also include digits 0-9 (ASCII range 0x30 to 0x39) so that it aligns perfectly with the characters processed in pinyinTokens.

Suggested change
  function getChineseCharPositions(text: string): number[] {
    const positions: number[] = []
    for (let i = 0; i < text.length; i++) {
      const code = text.charCodeAt(i)
      if (
        (code >= 0x4e00 && code <= 0x9fff) ||
        (code >= 0x3400 && code <= 0x4dbf) ||
-       (code >= 0xf900 && code <= 0xfaff)
+       (code >= 0xf900 && code <= 0xfaff) ||
+       (code >= 0x30 && code <= 0x39)
      ) {
        positions.push(i)
      }
    }
    return positions
  }
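
To make the misalignment concrete, the sketch below maps token indexes back to character positions with and without digits. It is an illustration only: `charPositions` and `highlightedChars` are hypothetical helpers, and the token list `['3', '6', '0', 'ji', 'su']` for "360极速" is the one quoted in the review comment.

```typescript
// Collect the string indexes of CJK characters and, optionally, ASCII digits.
function charPositions(text: string, includeDigits: boolean): number[] {
  const positions: number[] = [];
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    const isCjk =
      (code >= 0x4e00 && code <= 0x9fff) ||
      (code >= 0x3400 && code <= 0x4dbf) ||
      (code >= 0xf900 && code <= 0xfaff);
    const isDigit = code >= 0x30 && code <= 0x39;
    if (isCjk || (includeDigits && isDigit)) positions.push(i);
  }
  return positions;
}

// Map matched token indexes to the characters that would be highlighted.
function highlightedChars(text: string, tokenIndexes: number[], includeDigits: boolean): string {
  const positions = charPositions(text, includeDigits);
  return tokenIndexes
    .map((tokenIndex) => {
      const pos = positions[tokenIndex];
      return pos === undefined ? "" : text.charAt(pos);
    })
    .join("");
}

// Query "360" matches tokens 0, 1 and 2 of ['3', '6', '0', 'ji', 'su'].
const withoutDigits = highlightedChars("360极速", [0, 1, 2], false);
const withDigits = highlightedChars("360极速", [0, 1, 2], true);
```

Without digits the position list only holds the two CJK characters, so token indexes 0 and 1 land on "极速"; with digits included the same indexes land on "360".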

Comment thread src/shared/tokenSearch.ts
Comment on lines +332 to +333
for (let i = cursor; i < syllables.length; i++) {
  const syl = syllables[i]

high

⚠️ Potential App-Freezing Infinite Loop

Issue:
In segmentPinyin, if any element in syllables is empty or falsy, rest.startsWith(syl) will evaluate to true (since any string starts with ""), and qStart will be incremented by syl.length (which is 0). This results in an infinite loop that freezes the entire Electron application.

Resolution:
Add a defensive check if (!syl) continue at the beginning of the loop to skip empty syllables.

Suggested change
  for (let i = cursor; i < syllables.length; i++) {
    const syl = syllables[i]
+   if (!syl) continue
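
The effect of the guard can be seen in a small syllable aligner. This is a sketch under assumptions, not the PR's segmentPinyin: `alignSyllables` is a hypothetical function that only handles complete syllables (initial-letter matching such as rwgl is out of scope), and it uses a slice comparison that is equivalent to `rest.startsWith(syl)`.

```typescript
// Align a pinyin query against complete syllables, in order.
// Returns the indexes of the matched syllables, or null when the query
// cannot be consumed entirely by whole syllables.
function alignSyllables(query: string, syllables: string[]): number[] | null {
  const matched: number[] = [];
  let qStart = 0;
  let cursor = 0;
  while (qStart < query.length) {
    const rest = query.slice(qStart);
    let found = -1;
    for (let i = cursor; i < syllables.length; i++) {
      const syl = syllables[i];
      // Guard: an empty syllable would "match" every position with length 0.
      if (!syl) continue;
      // Equivalent to rest.startsWith(syl).
      if (rest.slice(0, syl.length) === syl) {
        found = i;
        break;
      }
    }
    if (found < 0) return null;
    matched.push(found);
    qStart += (syllables[found] as string).length;
    cursor = found + 1;
  }
  return matched;
}
```

In this sketch, `alignSyllables("renwu", ["ren", "", "wu"])` skips the empty entry and returns `[0, 2]`, and `alignSyllables("gu", ["guan"])` returns `null` because a partial syllable does not count. Without the guard, the empty entry would be recorded as a zero-length match, and whether progress stalls depends on how the surrounding loop advances its cursor.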

Comment thread src/shared/tokenizer.ts
Comment on lines +5 to +7
function classifyChar(ch: string): CharKind {
  if (ch === ' ' || ch === '\t' || ch === '\n' || ch === '\r' || ch === '_' || ch === '-')
    return 'separator'

medium

Issue: If classifyChar is ever called with undefined, ch.codePointAt(0) will throw a TypeError: Cannot read properties of undefined (reading 'codePointAt'). With an empty string it does not throw, but codePointAt(0) returns undefined instead of a code point. Adding a defensive check at the beginning of the function makes it robust against both unexpected inputs.

Suggested change
- function classifyChar(ch: string): CharKind {
-   if (ch === ' ' || ch === '\t' || ch === '\n' || ch === '\r' || ch === '_' || ch === '-')
-     return 'separator'
+ function classifyChar(ch: string | undefined): CharKind {
+   if (!ch || ch === ' ' || ch === '\t' || ch === '\n' || ch === '\r' || ch === '_' || ch === '-')
+     return 'separator'

- useSearchResults: access wordTokenEnabled via the commandDataStore proxy instead of destructuring the ref snapshot, so list-mode sort re-evaluates when the setting toggles

- highlight: include ASCII digits 0-9 in getChineseCharPositions to align with pinyinTokens, fixing highlight misalignment for mixed pinyin/number queries

- tokenSearch: skip empty syllables in segmentPinyin to avoid zero-length alignment pollution

Development

Successfully merging this pull request may close these issues.

[Feature] Gap matching (间隔匹配)
