CoderXYZ7/VMSembedding

VoyMS — Voynich Manuscript Analysis Pipeline

Computational linguistics pipeline for the Voynich Manuscript. Trains word embeddings on the EVA transcription and produces 57 interactive HTML analyses covering morphology, syntax, semantics, statistics, and document structure.


Prerequisites

  • Python 3.10 or later
  • Install dependencies: pip install -r requirements.txt

Quick start

make all      # parse → embed → reduce → visualize (core pipeline)
make portal   # generate index.html — the searchable analysis portal

Open index.html in a browser for a searchable library of all analyses.


Project layout

| Directory | Contents |
| --- | --- |
| src/core/ | Core pipeline: parse.py, embed.py, reduce.py, visualize.py |
| src/analysis/ | 52 independent analysis scripts, each producing one or more HTML outputs |
| src/viz/ | Output renderers: visualize.py, visualize_dash.py, visualize_3d.py, report.py, portal.py |
| src/cli/ | Interactive tools: neighbors.py, analogy_discover.py |
| data/ | Generated artefacts (CSV, NPY, JSON), created by make parse / make embed |
| docs/ | Design specs and implementation plans |

All scripts are run from the project root via make, so data/ paths resolve correctly.


Pipeline

The core pipeline must run in order:

make parse     →  data/sentences.json  data/metadata.csv  data/vocab.json
make embed     →  data/embeddings.npy  data/w2v.model
make embed-ft  →  data/embeddings_ft.npy  data/ft.model   (FastText, optional)
make reduce    →  data/embeddings_2d.npy
make visualize →  voynich_embeddings.html
make portal    →  index.html

After make embed, all src/analysis/ scripts are independent and can be run in any order.
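Because the core steps depend on each other's outputs, it can help to see which step is due next. The following is a minimal sketch, not part of the repository: the helper name `next_target` is hypothetical, and the artefact list is copied from the listing above (the optional FastText outputs are left out).

```python
from pathlib import Path

# Artefact -> make target, copied from the pipeline listing above.
# The optional embed-ft outputs are deliberately omitted.
PIPELINE_OUTPUTS = [
    ("data/sentences.json", "parse"),
    ("data/metadata.csv", "parse"),
    ("data/vocab.json", "parse"),
    ("data/embeddings.npy", "embed"),
    ("data/w2v.model", "embed"),
    ("data/embeddings_2d.npy", "reduce"),
    ("voynich_embeddings.html", "visualize"),
    ("index.html", "portal"),
]


def next_target(root="."):
    """Return the first make target whose output is missing, or None if all exist."""
    root = Path(root)
    for rel_path, target in PIPELINE_OUTPUTS:
        if not (root / rel_path).exists():
            return target
    return None


if __name__ == "__main__":
    target = next_target()
    print(f"next step: make {target}" if target else "core pipeline outputs are all present")
```

Run from the project root, this only inspects the file system; it does not invoke make itself.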


Make targets

| Target | Description |
| --- | --- |
| all | parse → embed → reduce → visualize |
| parse | Tokenise EVA transcription → sentences.json, metadata.csv, vocab.json |
| embed | Train word2vec skip-gram (64d, window=5) |
| embed-ft | Train FastText character-ngram embeddings |
| reduce | UMAP 2D projection |
| reduce-tsne | t-SNE 2D projection |
| reduce-both | Both UMAP and t-SNE |
| reduce-3d | UMAP 3D projection |
| visualize | Interactive 2D scatter (7 coloring modes) |
| visualize-3d | Interactive 3D scatter |
| dash | Dash app at http://127.0.0.1:8050 |
| folio | Folio-level embedding visualization |
| report | HTML summary dashboard |
| portal | Searchable HTML library (index.html) |
| analyze | EVA prefix/suffix pattern analysis |
| bigrams | Bigram frequency + heatmap + network |
| similarity | Pairwise cosine similarity heatmap |
| section-vocab | TF-IDF distinctive vocabulary per section |
| vocab-drift | Word frequency drift across folio windows |
| nn-graph | K-nearest-neighbour graph in UMAP space |
| word-families | Morphological word families via FastText clustering |
| pmi | Pointwise Mutual Information bigram analysis |
| function-words | Function-word candidates (initial rate × entropy) |
| entropy-scatter | Directional entropy scatter (prev vs next) |
| analogy | Morphological offset coherence |
| char-ngrams | EVA character n-gram analysis |
| line-structure | First/last word + line-length analysis |
| positional-bigrams | G² bigram enrichment by line zone |
| cooccurrence | Word co-occurrence network + communities |
| hapax | Zipf / Heaps / hapax legomena analysis |
| word-length | Word-length by section / position / folio |
| hmm | Unsupervised HMM (K=6 latent POS-like states) |
| word-transition | Directed word transition probability network |
| cluster-purity | ARI/NMI: KMeans vs section/HMM/prefix/length |
| context-profile | Left/right context probability heatmap |
| folio-drift | Folio semantic trajectory (PCA of mean embeddings) |
| morpheme | EVA morpheme candidates via n-gram segmentation |
| section-distance | Pairwise section distance matrices (6 metrics) |
| word-fingerprint word=<w> | Multi-signal fingerprint card for a word |
| word-roles | GMM functional role clustering |
| phonotactics | EVA phonotactic patterns and CV shapes |
| line-entropy | Line-slot entropy, Jaccard, repetition rate |
| line-clusters | Line-embedding UMAP + KMeans section recovery |
| lm-perplexity | Kneser-Ney bigram LM perplexity per line |
| embed-stability | Bootstrap embedding stability (15 models) |
| cross-section | Chi² exclusivity, TF-IDF, vocabulary overlap |
| folio-richness | Per-folio TTR, MSTTR, hapax rate, entropy |
| line-similarity | Near-duplicate and exact-copy line detection |
| paradigm-finder | Suffix/prefix paradigm detection |
| semantic-fields | Hierarchical dendrogram of top-200 words |
| entropy-rate | Block entropy H(n) and entropy rate |
| changepoint | Folio change-point detection (MMD + CUSUM) |
| topic-model | NMF topic model (K=8) |
| char-position-entropy | Per-slot character entropy |
| sif-embeddings | SIF line embeddings |
| word-sequence-model | Successor entropy and boundary constraints |
| word-burstiness | Goh-Barabási burstiness coefficient |
| affix-entropy | Trie-based prefix/suffix continuation entropy |
| folio-similarity-matrix | Pairwise folio similarity (Ward-clustered) |
| zipf-analysis | Zipf / power-law MLE fit |
| prefix-suffix-matrix | Prefix × suffix PMI heatmap |
| line-position-words | G² slot affinity for line-position words |
| cv-skeleton | Consonant/Vowel skeleton analysis |
| word-network-centrality | PageRank / betweenness / degree centrality |
| ppmi-vectors | PPMI distributional vectors vs word2vec |
| context-asymmetry | Left-right context asymmetry A = H_left − H_right |
| surprisal-map | Per-line/folio bigram surprisal (KN LM) |
| clean | Remove all generated data/ and HTML files |
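As an illustration of one statistic named above, the context-asymmetry score A = H_left − H_right can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the project's script: the function names are hypothetical, entropy is measured in bits, contexts are the immediately adjacent words within a line, and line boundaries are not counted as context.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a Counter of observed context words."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def context_asymmetry(lines, word):
    """Return H_left - H_right for `word`, or None if a side has no observations.

    `lines` is a list of token lists. Only the immediately preceding and
    following word inside the same line is counted (an assumption).
    """
    left, right = Counter(), Counter()
    for tokens in lines:
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                right[tokens[i + 1]] += 1
    if not left or not right:
        return None
    return entropy(left) - entropy(right)


# Toy lines with placeholder tokens; these are not manuscript data.
toy_lines = [["a", "x", "b"], ["c", "x", "b"], ["a", "x", "b"], ["d", "x", "b"]]
print(context_asymmetry(toy_lines, "x"))
```

A positive score means the word is preceded by a more varied set of words than it is followed by; a negative score means the reverse.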

CLI tools

# Find nearest neighbours of a word in embedding space
python3 src/cli/neighbors.py daiin

# Cluster the vocabulary into K groups
python3 src/cli/neighbors.py --cluster 8

# Vector analogy: A − B + C = ?
python3 src/cli/neighbors.py --analogy daiin chedy ol

# Semantic interpolation: N steps from A to B
python3 src/cli/neighbors.py --interpolate daiin ol 6
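The analogy and interpolation modes rest on simple vector arithmetic. The sketch below shows the idea on plain Python lists; it is not the code in neighbors.py. The function names, the exclusion of the three query words from the candidates, and the inclusion of both endpoints in the interpolation are all assumptions, and the toy vectors are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(vectors, a, b, c, topn=1):
    """Rank words by cosine similarity to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], target), reverse=True)
    return ranked[:topn]


def interpolate(vectors, a, b, steps):
    """Return `steps` evenly spaced points on the line from vec(a) to vec(b)."""
    va, vb = vectors[a], vectors[b]
    points = []
    for i in range(steps):
        t = i / (steps - 1) if steps > 1 else 0.0
        points.append([(1 - t) * x + t * y for x, y in zip(va, vb)])
    return points


# Invented 2-d toy vectors, not real embeddings.
toy = {"p": [1, 0], "q": [0, 1], "r": [0, 2], "s": [1, 1], "t": [-1, 0]}
print(analogy(toy, "p", "q", "r"))
print(interpolate(toy, "p", "q", 3))
```

In a real run each interpolated point would then be mapped back to its nearest vocabulary words, which is presumably what the CLI reports.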

Data directory

data/ is created by the pipeline and holds all intermediate artefacts:

| File | Created by | Contents |
| --- | --- | --- |
| sentences.json | make parse | List of tokenised lines |
| metadata.csv | make parse | Line → folio/section mapping |
| vocab.json | make parse | Word → index mapping |
| embeddings.npy | make embed | word2vec vectors (V × 64) |
| w2v.model | make embed | Full gensim word2vec model |
| embeddings_2d.npy | make reduce | UMAP 2D projection |
| embeddings_3d.npy | make reduce-3d | UMAP 3D projection |
| embeddings_ft.npy | make embed-ft | FastText vectors |
| *.csv | various make targets | Per-analysis statistics |
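For ad-hoc work outside the make targets, the parse outputs can be read with the standard library. This is a hedged sketch: it assumes sentences.json is a JSON array of token arrays and vocab.json is a JSON object mapping each word to an integer index, which matches the descriptions above but has not been checked against the actual files. The function name `load_encoded_lines` is hypothetical.

```python
import json
from pathlib import Path


def load_encoded_lines(data_dir="data"):
    """Load tokenised lines and map each word to its vocabulary index.

    Assumes sentences.json is a list of token lists and vocab.json is a
    word -> index object. Words missing from the vocabulary are skipped.
    """
    data_dir = Path(data_dir)
    with open(data_dir / "sentences.json", encoding="utf-8") as f:
        sentences = json.load(f)
    with open(data_dir / "vocab.json", encoding="utf-8") as f:
        vocab = json.load(f)
    return [[vocab[w] for w in line if w in vocab] for line in sentences]


if __name__ == "__main__":
    encoded = load_encoded_lines()
    print(f"{len(encoded)} lines loaded")
```

Run it from the project root after make parse so that the default data/ path resolves.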
