Tokenizer
Description: The `str` library exposes two tokenizer functions for extracting grapheme clusters from text:
- `str.chars/1`: an experimental pure-Gleam implementation that approximates grapheme segmentation.
- `str.chars_stdlib/1`: a thin wrapper over the BEAM stdlib `string.to_graphemes/1` and the recommended choice for production.
When to use which:
- `chars_stdlib/1` (recommended): Use in production code. It uses the BEAM runtime's grapheme segmentation, is more accurate on edge cases (UAX #29), and is typically faster.
- `chars/1` (experimental): Useful when you want a self-contained, pure-Gleam implementation (for debugging, learning, or portability guarantees within Gleam code). It approximates common grapheme rules (combining marks, variation selectors, skin tones, ZWJ sequences) but may differ on rare or exotic sequences.
Examples:
- `str.chars("café")` -> `["c", "a", "f", "é"]`
- `str.chars_stdlib("👩\u{200D}👩")` -> `["👩\u{200D}👩"]`
Notes:
- Both functions return a `List(String)` of grapheme clusters.
- If you are writing performance-sensitive code that repeatedly scans long strings, prefer `chars_stdlib/1` and avoid repeated full tokenization where possible.
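One way to follow the performance note is to tokenize once and derive everything you need from the resulting list. The sketch below is only an illustration: `summarize` is a hypothetical helper, not part of the `str` API, and it assumes `str.chars_stdlib` is importable as shown elsewhere in this document.

```gleam
import gleam/list
import str

/// Hypothetical helper: returns the grapheme count and the first
/// grapheme of `text`, using a single tokenization pass.
pub fn summarize(text: String) -> #(Int, Result(String, Nil)) {
  // One full tokenization pass over the input.
  let graphemes = str.chars_stdlib(text)
  // Both values reuse the same list instead of tokenizing again.
  #(list.length(graphemes), list.first(graphemes))
}
```

Calling the tokenizer separately for each derived value would walk the whole string every time, so holding on to the list is usually the cheaper option for long inputs.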
Guidance for library maintainers
- Keep `chars/1` as an experimental reference implementation. If a user-reported bug shows a clear mismatch between `chars/1` and the BEAM stdlib for a case that matters, prefer fixing the docs or recommending `chars_stdlib/1` over changing the experimental algorithm in place, unless a change is necessary.
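When triaging such a report, it can help to check which inputs actually produce different results from the two tokenizers. The following is a minimal sketch, not an existing part of the library: `mismatches` is a hypothetical helper name, and the sample strings are whatever the maintainer chooses to pass in.

```gleam
import gleam/list
import str

/// Hypothetical helper: returns the samples on which the experimental
/// tokenizer and the stdlib-backed tokenizer disagree.
pub fn mismatches(samples: List(String)) -> List(String) {
  list.filter(samples, fn(text) {
    // Structural comparison of the two grapheme lists.
    str.chars(text) != str.chars_stdlib(text)
  })
}
```

An empty result for the reported inputs suggests the two functions agree and the issue lies elsewhere. A non-empty result gives concrete cases to document as known differences.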
Note (2.0+): This module is internal. Access these functions via the public `str` module: `str.chars(text)` and `str.chars_stdlib(text)`.
This module contains a tokenizer implemented entirely in Gleam as a pedagogical reference. It is not intended to replace standard library APIs, but to show how to iterate grapheme clusters in pure Gleam without NIFs or native dependencies.
Key functions
- `chars(text: String) -> List(String)`: Returns the list of grapheme clusters for the input string.
- `chars_stdlib(text: String) -> List(String)`: Uses the BEAM stdlib grapheme segmentation (more accurate).
Example
    import str

    let chars = str.chars("café")
    // -> ["c", "a", "f", "é"]

When to use this module
- Use when you need a pure-Gleam tokenizer for study, debugging, or environments that cannot rely on native libraries.