VoyMS — Voynich Manuscript Analysis Pipeline

Computational linguistics pipeline for the Voynich Manuscript. Trains word embeddings on the EVA transcription and produces 57 interactive HTML analyses covering morphology, syntax, semantics, statistics, and document structure.

Prerequisites

Python 3.10 or later
Install dependencies: pip install -r requirements.txt

Quick start

make all      # parse → embed → reduce → visualize (core pipeline)
make portal   # generate index.html — the searchable analysis portal

Open index.html in a browser for a searchable library of all analyses.

Project layout

Directory	Contents
`src/core/`	Core pipeline: `parse.py` → `embed.py` → `reduce.py` → `visualize.py`
`src/analysis/`	52 independent analysis scripts, each producing one or more HTML outputs
`src/viz/`	Output renderers: `visualize.py`, `visualize_dash.py`, `visualize_3d.py`, `report.py`, `portal.py`
`src/cli/`	Interactive tools: `neighbors.py`, `analogy_discover.py`
`data/`	Generated artefacts (CSV, NPY, JSON) — created by `make parse` / `make embed`
`docs/`	Design specs and implementation plans

Run everything from the project root, whether through make or by calling a script directly, so that relative data/ paths resolve correctly.

Pipeline

The core pipeline must run in order:

make parse     →  data/sentences.json  data/metadata.csv  data/vocab.json
make embed     →  data/embeddings.npy  data/w2v.model
make embed-ft  →  data/embeddings_ft.npy  data/ft.model   (FastText, optional)
make reduce    →  data/embeddings_2d.npy
make visualize →  voynich_embeddings.html
make portal    →  index.html

After make embed, all src/analysis/ scripts are independent and can be run in any order.
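
The ordering above can be checked mechanically. The following is a minimal sketch, not part of the repository: it looks for the output files named in the list and reports the first core target whose outputs are missing. The helper name `next_target` is hypothetical, and `embed-ft` is left out because it is optional.

```python
from pathlib import Path

# Make targets and the files they produce, copied from the pipeline list.
# embed-ft is optional, so it is not part of the required chain.
PIPELINE = [
    ("parse", ["sentences.json", "metadata.csv", "vocab.json"]),
    ("embed", ["embeddings.npy", "w2v.model"]),
    ("reduce", ["embeddings_2d.npy"]),
]


def next_target(data_dir="data"):
    """Return the first make target whose outputs are missing, or None."""
    root = Path(data_dir)
    for target, outputs in PIPELINE:
        if not all((root / name).exists() for name in outputs):
            return target
    return None
```

Calling `next_target()` from the project root returns `"parse"` on a fresh checkout and `None` once the core data artefacts exist, at which point the analysis scripts should be safe to run.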

Make targets

Target	Description
`all`	parse → embed → reduce → visualize
`parse`	Tokenise EVA transcription → sentences.json, metadata.csv, vocab.json
`embed`	Train word2vec skip-gram (64d, window=5)
`embed-ft`	Train FastText character-ngram embeddings
`reduce`	UMAP 2D projection
`reduce-tsne`	t-SNE 2D projection
`reduce-both`	Both UMAP and t-SNE
`reduce-3d`	UMAP 3D projection
`visualize`	Interactive 2D scatter (7 coloring modes)
`visualize-3d`	Interactive 3D scatter
`dash`	Dash app at http://127.0.0.1:8050
`folio`	Folio-level embedding visualization
`report`	HTML summary dashboard
`portal`	Searchable HTML library (index.html)
`analyze`	EVA prefix/suffix pattern analysis
`bigrams`	Bigram frequency + heatmap + network
`similarity`	Pairwise cosine similarity heatmap
`section-vocab`	TF-IDF distinctive vocabulary per section
`vocab-drift`	Word frequency drift across folio windows
`nn-graph`	K-nearest-neighbour graph in UMAP space
`word-families`	Morphological word families via FastText clustering
`pmi`	Pointwise Mutual Information bigram analysis
`function-words`	Function-word candidates (initial rate × entropy)
`entropy-scatter`	Directional entropy scatter (prev vs next)
`analogy`	Morphological offset coherence
`char-ngrams`	EVA character n-gram analysis
`line-structure`	First/last word + line-length analysis
`positional-bigrams`	G² bigram enrichment by line zone
`cooccurrence`	Word co-occurrence network + communities
`hapax`	Zipf / Heap / hapax legomena analysis
`word-length`	Word-length by section / position / folio
`hmm`	Unsupervised HMM (K=6 latent POS-like states)
`word-transition`	Directed word transition probability network
`cluster-purity`	ARI/NMI: KMeans vs section/HMM/prefix/length
`context-profile`	Left/right context probability heatmap
`folio-drift`	Folio semantic trajectory (PCA of mean embeddings)
`morpheme`	EVA morpheme candidates via n-gram segmentation
`section-distance`	Pairwise section distance matrices (6 metrics)
`word-fingerprint word=<w>`	Multi-signal fingerprint card for a word
`word-roles`	GMM functional role clustering
`phonotactics`	EVA phonotactic patterns and CV shapes
`line-entropy`	Line-slot entropy, Jaccard, repetition rate
`line-clusters`	Line-embedding UMAP + KMeans section recovery
`lm-perplexity`	Kneser-Ney bigram LM perplexity per line
`embed-stability`	Bootstrap embedding stability (15 models)
`cross-section`	Chi² exclusivity, TF-IDF, vocabulary overlap
`folio-richness`	Per-folio TTR, MSTTR, hapax rate, entropy
`line-similarity`	Near-duplicate and exact-copy line detection
`paradigm-finder`	Suffix/prefix paradigm detection
`semantic-fields`	Hierarchical dendrogram of top-200 words
`entropy-rate`	Block entropy H(n) and entropy rate
`changepoint`	Folio change-point detection (MMD + CUSUM)
`topic-model`	NMF topic model (K=8)
`char-position-entropy`	Per-slot character entropy
`sif-embeddings`	SIF line embeddings
`word-sequence-model`	Successor entropy and boundary constraints
`word-burstiness`	Goh-Barabási burstiness coefficient
`affix-entropy`	Trie-based prefix/suffix continuation entropy
`folio-similarity-matrix`	Pairwise folio similarity (Ward-clustered)
`zipf-analysis`	Zipf / power-law MLE fit
`prefix-suffix-matrix`	Prefix × suffix PMI heatmap
`line-position-words`	G² slot affinity for line-position words
`cv-skeleton`	Consonant/Vowel skeleton analysis
`word-network-centrality`	PageRank / betweenness / degree centrality
`ppmi-vectors`	PPMI distributional vectors vs word2vec
`context-asymmetry`	Left-right context asymmetry A = H_left − H_right
`surprisal-map`	Per-line/folio bigram surprisal (KN LM)
`clean`	Remove all generated data/ and HTML files
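
As an illustration of one entry in the table, the `context-asymmetry` measure A = H_left − H_right can be sketched in a few lines. This is an assumption-laden sketch rather than the script's actual code: it takes H_left and H_right to be the Shannon entropies (in bits) of the words immediately before and after the target word, ignores context across line boundaries, and applies no smoothing. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a Counter of observations."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def context_asymmetry(lines, word):
    """A = H_left - H_right for `word` over a list of tokenised lines."""
    left, right = Counter(), Counter()
    for tokens in lines:
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            if i > 0:                      # word has a left neighbour
                left[tokens[i - 1]] += 1
            if i + 1 < len(tokens):        # word has a right neighbour
                right[tokens[i + 1]] += 1
    return entropy(left) - entropy(right)
```

Under these assumptions, a positive A means the word is preceded by a more varied set of words than it is followed by, and a negative A means the reverse.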

CLI tools

# Find nearest neighbours of a word in embedding space
python3 src/cli/neighbors.py daiin

# Cluster the vocabulary into K groups
python3 src/cli/neighbors.py --cluster 8

# Vector analogy: A − B + C = ?
python3 src/cli/neighbors.py --analogy daiin chedy ol

# Semantic interpolation: N steps from A to B
python3 src/cli/neighbors.py --interpolate daiin ol 6
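
The analogy and interpolation commands can be understood through the vector arithmetic they describe. The sketch below is a plain-Python illustration on generic vectors, not the code in `neighbors.py`: ranking by cosine similarity, excluding the three query words from the results, and linear interpolation with both endpoints included are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(vectors, a, b, c, topn=1):
    """Rank words by cosine similarity to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], target), reverse=True)
    return ranked[:topn]


def interpolate(u, v, steps):
    """Return `steps` evenly spaced points from u to v, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [[a + (b - a) * i / (steps - 1) for a, b in zip(u, v)] for i in range(steps)]
```

Here `vectors` is a dict mapping each word to a list of floats. A tool built this way would typically report, for each interpolated point, the vocabulary words nearest to it.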

Data directory

data/ is created by the pipeline and holds all intermediate artefacts:

File	Created by	Contents
`sentences.json`	`make parse`	List of tokenised lines
`metadata.csv`	`make parse`	Line → folio/section mapping
`vocab.json`	`make parse`	Word → index mapping
`embeddings.npy`	`make embed`	word2vec vectors (V × 64)
`w2v.model`	`make embed`	Full gensim word2vec model
`embeddings_2d.npy`	`make reduce`	UMAP 2D projection
`embeddings_3d.npy`	`make reduce-3d`	UMAP 3D projection
`embeddings_ft.npy`	`make embed-ft`	FastText vectors
`*.csv`	various `make` targets	Per-analysis statistics
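
The JSON artefacts can be read with the standard library alone. The sketch below assumes, based only on the descriptions in the table, that `sentences.json` is a JSON array of token arrays and `vocab.json` is a JSON object mapping each word to an integer index. The real files may carry additional structure, and both function names are hypothetical.

```python
import json
from pathlib import Path


def load_corpus(data_dir="data"):
    """Load tokenised lines and the word-to-index mapping from data_dir."""
    root = Path(data_dir)
    sentences = json.loads((root / "sentences.json").read_text(encoding="utf-8"))
    vocab = json.loads((root / "vocab.json").read_text(encoding="utf-8"))
    return sentences, vocab


def out_of_vocab(sentences, vocab):
    """Sorted tokens that occur in the corpus but have no index in vocab."""
    return sorted({tok for line in sentences for tok in line if tok not in vocab})
```

If the vocabulary is filtered (for example by a minimum frequency, which the README does not state), `out_of_vocab` would list the tokens that were dropped.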
