feat: add Latin-script Kazakh support to core stemmer #3

Draft
darkhanakh wants to merge 1 commit into main from cursor/latin-script-support

Summary

  • Add auto-detection and transliteration of official Kazakh Latin-script input to canonical Cyrillic, reusing the existing BFS stemmer engine so both scripts converge to the same stem output
  • Implement confidence gating to avoid false positives on English/ASCII tokens, Turkic-aware I/İ case folding, and ambiguous-letter branching (i→и/й, h→х/һ)
  • Expose ScriptMode config (Auto/CyrillicOnly) in CLI (--cyrillic-only), PostgreSQL dictionary (script_mode), and Elasticsearch bridge
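
The ambiguous-letter branching in the second bullet can be pictured as candidate expansion: each Latin letter maps to one or more Cyrillic letters, and every ambiguous letter multiplies the candidate list. The following is a minimal sketch, not the code in this PR. The names `cyrillic_options` and `cyrillic_candidates` are hypothetical, and the mapping table is a partial, assumed one that covers only the letters used by the examples in this description plus the two ambiguous letters. Cyrillic letters are written as escapes to avoid Latin/Cyrillic look-alike mistakes.

```rust
/// Cyrillic options for one lowercase Latin letter.
/// Assumption: partial, illustrative table; a real mapping is larger.
fn cyrillic_options(ch: char) -> Option<&'static [char]> {
    let options: &'static [char] = match ch {
        'a' => &['\u{0430}'], // а
        'e' => &['\u{0435}'], // е
        'k' => &['\u{043A}'], // к
        'l' => &['\u{043B}'], // л
        'm' => &['\u{043C}'], // м
        'p' => &['\u{043F}'], // п
        'q' => &['\u{049B}'], // қ
        'r' => &['\u{0440}'], // р
        't' => &['\u{0442}'], // т
        'z' => &['\u{0437}'], // з
        // Ambiguous letters named in the summary: i→и/й, h→х/һ.
        'i' => &['\u{0438}', '\u{0439}'], // и, й
        'h' => &['\u{0445}', '\u{04BB}'], // х, һ
        // Anything else (letters outside this table, Cyrillic, digits) is unmapped.
        _ => return None,
    };
    Some(options)
}

/// Expands a lowercase Latin token into every Cyrillic candidate.
/// Returns `None` as soon as one character has no mapping, so the
/// caller can pass the token through unchanged.
pub fn cyrillic_candidates(token: &str) -> Option<Vec<String>> {
    let mut candidates = vec![String::new()];
    for ch in token.chars() {
        let options = cyrillic_options(ch)?;
        let mut next = Vec::with_capacity(candidates.len() * options.len());
        for prefix in &candidates {
            for &option in options {
                let mut candidate = prefix.clone();
                candidate.push(option);
                next.push(candidate);
            }
        }
        candidates = next;
    }
    Some(candidates)
}
```

In this sketch an unambiguous token such as `qazaqtar` yields exactly one candidate, while a token with k ambiguous letters yields 2^k candidates. How the stemmer chooses among them is outside the scope of the sketch.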

Test plan

  • Cross-script equivalence: almalar → алма, mektepter → мектеп, qazaqtar → қазақ
  • Uppercase Latin support: ALMALAR → алма, QAZAQTAR → қазақ
  • Turkic I/İ case folding: lower_kazakh("Iİ") == "ıi"
  • Mixed-script passthrough: алmaлар unchanged
  • ASCII safety: docker, solar pass through unchanged
  • CyrillicOnly mode disables Latin handling
  • All 27 existing Cyrillic stem tests still green
  • Parity tests (C vs Rust, with/without lexicon) still green
  • PostgreSQL smoke test via just test-ext (requires container)
  • Elasticsearch Java tests (requires JDK + native lib build)
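
For context on the I/İ item: Rust's default lowercasing maps `I` to `i` and `İ` (U+0130) to `i` followed by a combining dot (U+0307), which is wrong for Turkic alphabets. The sketch below shows one way a function satisfying `lower_kazakh("Iİ") == "ıi"` could look. Only the function name and the expected result come from the test plan; the body is an assumption, not the PR's implementation.

```rust
/// Lowercases text using Turkic dotted/dotless I rules.
/// Sketch only: the body is assumed, not taken from the PR.
pub fn lower_kazakh(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            // Dotless capital I lowercases to dotless ı (U+0131).
            'I' => out.push('\u{0131}'),
            // Dotted capital İ (U+0130) lowercases to plain ASCII i.
            '\u{0130}' => out.push('i'),
            // Every other character uses the default Unicode mapping,
            // which may expand to more than one char.
            _ => out.extend(ch.to_lowercase()),
        }
    }
    out
}
```

Applying this folding to every token would also turn the `I` in English words into `ı`, so in practice it presumably runs only on tokens already judged to be Kazakh Latin.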

Auto-detect official Kazakh Latin input (ä ö ü ū ğ ş ñ ı, q/w),
transliterate to canonical Cyrillic, and reuse the existing BFS
engine. Both scripts now converge to the same Cyrillic stem output,
keeping indexing unified across scripts.

Includes confidence gating to avoid false positives on English/ASCII
tokens, Turkic-aware I/İ case folding, ambiguous-letter branching
(i→и/й, h→х/һ), and a ScriptMode config (Auto/CyrillicOnly) exposed in
the CLI, the PostgreSQL dictionary, and the Elasticsearch bridge.
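
As a rough illustration of how the detection signals listed above and the ScriptMode switch might interact, here is a minimal sketch. `ScriptMode` and its two variants are named in this description; `LatinSignal`, `latin_signal`, and the strong/weak split are hypothetical and are not the PR's confidence gate. The sketch assumes the token has already been lowercased.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ScriptMode {
    Auto,
    CyrillicOnly,
}

/// Hypothetical strength of the evidence that a token is Kazakh Latin.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LatinSignal {
    Absent,
    Weak,
    Strong,
}

/// Non-ASCII letters listed in the description: ä ö ü ū ğ ş ñ ı.
const DISTINCTIVE: [char; 8] = [
    '\u{00E4}', '\u{00F6}', '\u{00FC}', '\u{016B}',
    '\u{011F}', '\u{015F}', '\u{00F1}', '\u{0131}',
];

/// Classifies a lowercase token by the signals named in the description.
pub fn latin_signal(token: &str, mode: ScriptMode) -> LatinSignal {
    // CyrillicOnly disables Latin handling entirely.
    if mode == ScriptMode::CyrillicOnly {
        return LatinSignal::Absent;
    }
    if token.chars().any(|c| DISTINCTIVE.contains(&c)) {
        LatinSignal::Strong
    } else if token.chars().any(|c| c == 'q' || c == 'w') {
        // Assumption: ASCII q/w also occur in English, so treat them as weak.
        LatinSignal::Weak
    } else {
        LatinSignal::Absent
    }
}
```

This check alone reports `Absent` for `almalar`, which the test plan expects to stem to алма, so the actual gate evidently relies on more evidence than these letters. The sketch shows only the letter signals and the mode switch.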