What
A new tab in the Statistics modal: Voice distinctiveness. Per character:
- Lemma vocabulary size (unique lemmas they use)
- Sentence-length distribution (median, p90)
- Register markers — formal vs colloquial, based on POS tags from mlmorph
- Preferred verb tenses / aspect markers
- A "voice similarity matrix": how much character A's lemma distribution overlaps with character B's
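One possible way to fill the similarity matrix is a histogram-intersection overlap between two characters' lemma frequency distributions. The sketch below is only an illustration: the choice of overlap measure and the names `lemma_overlap` and `similarity_matrix` are assumptions, not something this proposal fixes.

```python
from collections import Counter


def lemma_overlap(a: Counter, b: Counter) -> float:
    """Overlap of two lemma frequency distributions, in the range [0, 1]."""
    total_a, total_b = sum(a.values()), sum(b.values())
    if total_a == 0 or total_b == 0:
        return 0.0  # a character with no analyzed lemmas overlaps with nobody
    # For each shared lemma, take the smaller of the two relative frequencies.
    return sum(
        min(a[lemma] / total_a, b[lemma] / total_b)
        for lemma in a.keys() & b.keys()
    )


def similarity_matrix(profiles: dict) -> dict:
    """Map every (character, character) pair to its lemma overlap."""
    names = sorted(profiles)
    return {
        (x, y): lemma_overlap(profiles[x], profiles[y])
        for x in names
        for y in names
    }
```

With this measure, a value of 0.92 would read as "92% of the lemma distribution is shared", and identical distributions score 1.0. Other measures (cosine similarity, Jensen-Shannon divergence) would fit the same matrix shape.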
Why this matters
A common quality problem in screenplays: every character sounds the same — same vocabulary range, same sentence rhythm, same register. Writers often don't notice until a script reader points it out.
This panel surfaces it visually. "Your protagonist and antagonist share 92% of their lemma distribution — they sound alike." The writer can then deliberately differentiate.
Final Draft, Highland, etc. ship character dialogue counts. None ship grammatical-distinctiveness analysis.
Dependency
mlmorph FST integration. Once the analyzer runs, the per-line lemma + POS extraction is a straightforward walk.
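To make "lemma + POS extraction" concrete, here is a minimal sketch of splitting one analysis string into lemma and tags. It assumes the analyzer emits strings shaped like a lemma followed by angle-bracket tags (for example `വീട്<n><pl>`), with compound analyses repeating that pattern. That shape, the example tag names, and the function name `parse_analysis` are assumptions to check against the actual mlmorph output.

```python
import re

# Assumed shape: lemma text followed by one or more <tag> groups.
# A compound analysis is assumed to repeat this pattern back to back.
_SEGMENT = re.compile(r"([^<>]+)((?:<[^<>]+>)+)")
_TAG = re.compile(r"<([^<>]+)>")


def parse_analysis(analysis: str) -> list:
    """Split an analysis string into (lemma, [tags]) segments.

    Returns an empty list when the string carries no tags at all.
    """
    return [
        (lemma, _TAG.findall(tags))
        for lemma, tags in _SEGMENT.findall(analysis)
    ]
```

A token with several candidate analyses would need a selection rule (for example, keeping the lowest-weight analysis); that choice is left open here.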
Technical sketch
- Walk the active script, group lines by Character cue.
- For each character's accumulated dialogue:
  - Tokenize via malayalam-tokenizer (already a Rust crate at github.com/smc/malayalam-tokenizer, MIT).
  - Analyze each token via mlmorph → lemma + tag set.
  - Aggregate: set of unique lemmas, histogram of sentence lengths, frequency of each POS tag.
- Render in StatisticsModal: a table of characters × metrics, plus a small heatmap for the similarity matrix.
- CSV export per the existing pattern.
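The grouping and aggregation steps above could look roughly like the following. This is a sketch under stated assumptions: the input is taken to be already tokenized and analyzed, arriving as `(character_cue, sentence)` pairs where each sentence is a list of `(lemma, pos)` tuples. `VoiceProfile` and `build_profiles` are hypothetical names, and p90 uses the nearest-rank definition.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from statistics import median


@dataclass
class VoiceProfile:
    lemmas: Counter = field(default_factory=Counter)
    pos_tags: Counter = field(default_factory=Counter)
    sentence_lengths: list = field(default_factory=list)

    def summary(self) -> dict:
        lengths = sorted(self.sentence_lengths)
        n = len(lengths)
        # Nearest-rank p90: ceil(0.9 * n) computed with integer arithmetic.
        p90 = lengths[(9 * n + 9) // 10 - 1] if n else None
        return {
            "vocabulary_size": len(self.lemmas),
            "median_sentence_length": median(lengths) if n else None,
            "p90_sentence_length": p90,
            "pos_frequencies": dict(self.pos_tags),
        }


def build_profiles(dialogue) -> dict:
    """Group analyzed sentences by character cue and accumulate metrics."""
    profiles = defaultdict(VoiceProfile)
    for cue, sentence in dialogue:
        profile = profiles[cue]
        profile.sentence_lengths.append(len(sentence))  # length in tokens
        for lemma, pos in sentence:
            profile.lemmas[lemma] += 1
            profile.pos_tags[pos] += 1
    return dict(profiles)
```

Each `summary()` dict maps onto one row of the characters × metrics table, and the `lemmas` counters are the input a similarity matrix would need.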
Out of scope
English-side analysis (Latin tokens). English has its own NLP stack — out of scope for v1.0+. Only Malayalam dialogue is analyzed; characters who speak only English get a "—" in their row.
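A simple way to separate Malayalam tokens from Latin ones is to test for characters in the Malayalam Unicode block (U+0D00 to U+0D7F). The sketch below assumes that rule. The function names are hypothetical, and the cell shown is a plain count of distinct Malayalam surface tokens, standing in for whichever metric the row displays.

```python
def is_malayalam_token(token: str) -> bool:
    """True if the token contains at least one Malayalam-block character."""
    return any("\u0d00" <= ch <= "\u0d7f" for ch in token)


def malayalam_only(tokens) -> list:
    """Drop tokens with no Malayalam characters (English words, numbers, etc.)."""
    return [t for t in tokens if is_malayalam_token(t)]


def display_cell(tokens) -> str:
    """Render a table cell; characters with no Malayalam dialogue get a dash."""
    ml_tokens = malayalam_only(tokens)
    if not ml_tokens:
        return "—"
    return str(len(set(ml_tokens)))
```

Mixed-script tokens (a Malayalam suffix attached to an English word) count as Malayalam under this rule; whether the analyzer handles them usefully is a separate question.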