Computational linguistics pipeline for the Voynich Manuscript. Trains word embeddings on the EVA transcription and produces 57 interactive HTML analyses covering morphology, syntax, semantics, statistics, and document structure.

- Python 3.10 or later
- Install dependencies: `pip install -r requirements.txt`

Run the core pipeline, then build the portal:

    make all      # parse → embed → reduce → visualize (core pipeline)
    make portal   # generate index.html, the searchable analysis portal

Open `index.html` in a browser for a searchable library of all analyses.

| Directory | Contents |
|---|---|
| `src/core/` | Core pipeline: `parse.py` → `embed.py` → `reduce.py` → `visualize.py` |
| `src/analysis/` | 52 independent analysis scripts, each producing one or more HTML outputs |
| `src/viz/` | Output renderers: `visualize.py`, `visualize_dash.py`, `visualize_3d.py`, `report.py`, `portal.py` |
| `src/cli/` | Interactive tools: `neighbors.py`, `analogy_discover.py` |
| `data/` | Generated artefacts (CSV, NPY, JSON), created by `make parse` / `make embed` |
| `docs/` | Design specs and implementation plans |

All scripts are run from the project root via `make`, so `data/` paths resolve correctly.

The core pipeline must run in order:

    make parse      → data/sentences.json  data/metadata.csv  data/vocab.json
    make embed      → data/embeddings.npy  data/w2v.model
    make embed-ft   → data/embeddings_ft.npy  data/ft.model   (FastText, optional)
    make reduce     → data/embeddings_2d.npy
    make visualize  → voynich_embeddings.html
    make portal     → index.html

After `make embed`, all `src/analysis/` scripts are independent and can be run in any order.

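The ordering constraint above can be checked mechanically. The following is a minimal sketch and not part of the repository: it assumes the artefact paths listed above, and the helper name `next_target` is hypothetical. It reports the first core target whose outputs are missing.

```python
from pathlib import Path

# Core targets in the order given above, with the files each one produces.
# The mapping is copied from this README; the helper itself is hypothetical.
CORE_STAGES = [
    ("parse", ["data/sentences.json", "data/metadata.csv", "data/vocab.json"]),
    ("embed", ["data/embeddings.npy", "data/w2v.model"]),
    ("reduce", ["data/embeddings_2d.npy"]),
    ("visualize", ["voynich_embeddings.html"]),
    ("portal", ["index.html"]),
]


def next_target(root="."):
    """Return the first make target whose outputs are missing, or None."""
    base = Path(root)
    for target, outputs in CORE_STAGES:
        if not all((base / out).exists() for out in outputs):
            return target
    return None


if __name__ == "__main__":
    target = next_target()
    print(f"run: make {target}" if target else "core pipeline is complete")
```

Run from the project root, this prints the next `make` target to invoke. It only checks that files exist, not that they are up to date, which is the job of `make` itself.
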
| Target | Description |
|---|---|
| `all` | parse → embed → reduce → visualize |
| `parse` | Tokenise EVA transcription → `sentences.json`, `metadata.csv`, `vocab.json` |
| `embed` | Train word2vec skip-gram (64d, window=5) |
| `embed-ft` | Train FastText character-ngram embeddings |
| `reduce` | UMAP 2D projection |
| `reduce-tsne` | t-SNE 2D projection |
| `reduce-both` | Both UMAP and t-SNE |
| `reduce-3d` | UMAP 3D projection |
| `visualize` | Interactive 2D scatter (7 coloring modes) |
| `visualize-3d` | Interactive 3D scatter |
| `dash` | Dash app at http://127.0.0.1:8050 |
| `folio` | Folio-level embedding visualization |
| `report` | HTML summary dashboard |
| `portal` | Searchable HTML library (`index.html`) |
| `analyze` | EVA prefix/suffix pattern analysis |
| `bigrams` | Bigram frequency + heatmap + network |
| `similarity` | Pairwise cosine similarity heatmap |
| `section-vocab` | TF-IDF distinctive vocabulary per section |
| `vocab-drift` | Word frequency drift across folio windows |
| `nn-graph` | K-nearest-neighbour graph in UMAP space |
| `word-families` | Morphological word families via FastText clustering |
| `pmi` | Pointwise Mutual Information bigram analysis |
| `function-words` | Function-word candidates (initial rate × entropy) |
| `entropy-scatter` | Directional entropy scatter (prev vs next) |
| `analogy` | Morphological offset coherence |
| `char-ngrams` | EVA character n-gram analysis |
| `line-structure` | First/last word + line-length analysis |
| `positional-bigrams` | G² bigram enrichment by line zone |
| `cooccurrence` | Word co-occurrence network + communities |
| `hapax` | Zipf / Heaps / hapax legomena analysis |
| `word-length` | Word-length by section / position / folio |
| `hmm` | Unsupervised HMM (K=6 latent POS-like states) |
| `word-transition` | Directed word transition probability network |
| `cluster-purity` | ARI/NMI: KMeans vs section/HMM/prefix/length |
| `context-profile` | Left/right context probability heatmap |
| `folio-drift` | Folio semantic trajectory (PCA of mean embeddings) |
| `morpheme` | EVA morpheme candidates via n-gram segmentation |
| `section-distance` | Pairwise section distance matrices (6 metrics) |
| `word-fingerprint word=<w>` | Multi-signal fingerprint card for a word |
| `word-roles` | GMM functional role clustering |
| `phonotactics` | EVA phonotactic patterns and CV shapes |
| `line-entropy` | Line-slot entropy, Jaccard, repetition rate |
| `line-clusters` | Line-embedding UMAP + KMeans section recovery |
| `lm-perplexity` | Kneser-Ney bigram LM perplexity per line |
| `embed-stability` | Bootstrap embedding stability (15 models) |
| `cross-section` | Chi² exclusivity, TF-IDF, vocabulary overlap |
| `folio-richness` | Per-folio TTR, MSTTR, hapax rate, entropy |
| `line-similarity` | Near-duplicate and exact-copy line detection |
| `paradigm-finder` | Suffix/prefix paradigm detection |
| `semantic-fields` | Hierarchical dendrogram of top-200 words |
| `entropy-rate` | Block entropy H(n) and entropy rate |
| `changepoint` | Folio change-point detection (MMD + CUSUM) |
| `topic-model` | NMF topic model (K=8) |
| `char-position-entropy` | Per-slot character entropy |
| `sif-embeddings` | SIF line embeddings |
| `word-sequence-model` | Successor entropy and boundary constraints |
| `word-burstiness` | Goh-Barabási burstiness coefficient |
| `affix-entropy` | Trie-based prefix/suffix continuation entropy |
| `folio-similarity-matrix` | Pairwise folio similarity (Ward-clustered) |
| `zipf-analysis` | Zipf / power-law MLE fit |
| `prefix-suffix-matrix` | Prefix × suffix PMI heatmap |
| `line-position-words` | G² slot affinity for line-position words |
| `cv-skeleton` | Consonant/Vowel skeleton analysis |
| `word-network-centrality` | PageRank / betweenness / degree centrality |
| `ppmi-vectors` | PPMI distributional vectors vs word2vec |
| `context-asymmetry` | Left-right context asymmetry A = H_left − H_right |
| `surprisal-map` | Per-line/folio bigram surprisal (KN LM) |
| `clean` | Remove all generated `data/` and HTML files |
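
Several of the targets above rest on simple corpus statistics. As an illustration of the quantity behind the `pmi` target, the sketch below computes pointwise mutual information for adjacent word pairs, PMI(a, b) = log2(p(a, b) / (p(a) · p(b))). It is a stand-alone illustration and not the repository's script: the function name `bigram_pmi`, the `min_count` parameter, and the toy lines are assumptions.

```python
import math
from collections import Counter


def bigram_pmi(lines, min_count=1):
    """Return {(a, b): PMI in bits} for adjacent word pairs within each line."""
    unigrams = Counter()
    bigrams = Counter()
    for line in lines:
        unigrams.update(line)
        bigrams.update(zip(line, line[1:]))  # pairs never cross line breaks

    n_words = sum(unigrams.values())
    n_pairs = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, whose PMI estimates are unreliable
        p_ab = count / n_pairs
        p_a = unigrams[a] / n_words
        p_b = unigrams[b] / n_words
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


if __name__ == "__main__":
    # Toy input in the shape described for sentences.json (tokenised lines).
    toy = [["daiin", "ol", "chedy"], ["daiin", "ol"], ["chedy", "daiin"]]
    for pair, score in sorted(bigram_pmi(toy).items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 3))
```

Raw PMI overweights rare pairs, so a frequency floor such as `min_count` is a common precaution. Whether the actual script applies one is not stated here.
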

    # Find nearest neighbours of a word in embedding space
    python3 src/cli/neighbors.py daiin

    # Cluster the vocabulary into K groups
    python3 src/cli/neighbors.py --cluster 8

    # Vector analogy: A − B + C = ?
    python3 src/cli/neighbors.py --analogy daiin chedy ol

    # Semantic interpolation: N steps from A to B
    python3 src/cli/neighbors.py --interpolate daiin ol 6

`data/` is created by the pipeline and holds all intermediate artefacts:

| File | Created by | Contents |
|---|---|---|
| `sentences.json` | `make parse` | List of tokenised lines |
| `metadata.csv` | `make parse` | Line → folio/section mapping |
| `vocab.json` | `make parse` | Word → index mapping |
| `embeddings.npy` | `make embed` | word2vec vectors (V × 64) |
| `w2v.model` | `make embed` | Full gensim word2vec model |
| `embeddings_2d.npy` | `make reduce` | UMAP 2D projection |
| `embeddings_3d.npy` | `make reduce-3d` | UMAP 3D projection |
| `embeddings_ft.npy` | `make embed-ft` | FastText vectors |
| `*.csv` | various `make` targets | Per-analysis statistics |
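
The artefacts in the table can also be read directly from Python. The sketch below is a minimal illustration under stated assumptions: that `vocab.json` maps each word to the row index of its vector in `embeddings.npy` (the table only says "Word → index mapping"), and that NumPy is installed. The helper names `cosine`, `nearest`, and `load_vectors` are hypothetical and are not the implementation of `src/cli/neighbors.py`.

```python
import json
import math
from pathlib import Path


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(word, vocab, vectors, k=5):
    """Return the k words most similar to `word` as (word, similarity) pairs."""
    target = vectors[vocab[word]]
    scored = [(other, cosine(target, vectors[idx]))
              for other, idx in vocab.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def load_vectors(data_dir="data"):
    """Load vocab.json and embeddings.npy (assumes word -> row index)."""
    import numpy as np  # imported here so the helpers above need only the stdlib

    with open(Path(data_dir) / "vocab.json", encoding="utf-8") as fh:
        vocab = json.load(fh)
    vectors = np.load(Path(data_dir) / "embeddings.npy")
    return vocab, vectors


if __name__ == "__main__":
    if Path("data/embeddings.npy").exists():
        vocab, vectors = load_vectors()
        for other, sim in nearest("daiin", vocab, vectors):
            print(f"{other}\t{sim:.3f}")
    else:
        print("data/embeddings.npy not found; run `make embed` first")
```

The pure-Python similarity loop keeps the example dependency-light but is slow for a large vocabulary. A vectorised NumPy version, or the saved gensim model in `w2v.model`, would be the practical choice.
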