<!DOCTYPE html>
<html lang="en">
<head><meta charset="utf-8">
<title>Voynich Embedding Pipeline — Report</title>
<style>
body{font-family:sans-serif;margin:0;background:#f5f5f5;color:#222}
.hero{background:#263238;color:#fff;padding:28px 36px}
.hero h1{margin:0 0 6px;font-size:22px}
.hero p{margin:0;color:#90a4ae;font-size:14px}
section{background:#fff;margin:20px 36px;padding:20px 24px;border-radius:4px;
        box-shadow:0 1px 3px rgba(0,0,0,.1)}
h2{margin:0 0 14px;font-size:16px;color:#263238;border-bottom:2px solid #e0e0e0;
   padding-bottom:6px}
.stat-grid{display:flex;flex-wrap:wrap;gap:16px;margin-bottom:4px}
.stat{background:#e8f5e9;border-radius:4px;padding:10px 18px;min-width:110px}
.stat b{display:block;font-size:22px;color:#2e7d32}
.stat span{font-size:12px;color:#555}
.quality{background:#e3f2fd;border-radius:4px;padding:10px 18px;min-width:140px}
.quality b{display:block;font-size:22px;color:#1565c0}
table{border-collapse:collapse;width:100%;font-size:13px}
th{background:#eceff1;text-align:left;padding:6px 10px;font-weight:600}
td{padding:5px 10px;border-bottom:1px solid #f0f0f0}
tr:hover td{background:#fafafa}
.viz-grid{display:flex;flex-wrap:wrap;gap:12px}
.viz-card{background:#f9f9f9;border:1px solid #e0e0e0;border-radius:4px;
          padding:10px 14px;width:280px}
.viz-card a{color:#1565c0;text-decoration:none;display:block;margin-bottom:4px}
.viz-card a:hover{text-decoration:underline}
.viz-card span{font-size:12px;color:#777}
.note{font-size:12px;color:#888;margin-top:12px}
</style></head>
<body>
<div class="hero">
  <h1>Voynich Manuscript · Embedding Pipeline Report</h1>
  <p>Generated 2026-05-14 &nbsp;·&nbsp;
     word2vec + FastText · 64d → UMAP / t-SNE &nbsp;·&nbsp;
     1,414 word types · 31,773 tokens</p>
</div>

<section>
<h2>Corpus statistics</h2>
<div class="stat-grid">
  <div class="stat"><b>31,773</b><span>total tokens</span></div>
  <div class="stat"><b>1,414</b><span>unique word types</span></div>
  <div class="stat"><b>4,183</b><span>lines</span></div>
  <div class="stat"><b>183</b><span>folios</span></div>
  <div class="stat"><b>7</b><span>sections</span></div>
  <div class="stat"><b>0.045</b><span>type/token ratio</span></div>
</div>
<p class="note">Section sizes: Stars 11,411 · Herbal 10,738 · Balneological 6,359 · Unknown 1,292 · Text-only 1,113 · Cosmological 476 · Zodiac 384</p>
</section>

<section>
<h2>Embedding quality — section separation (silhouette score)</h2>
<div class="stat-grid">
  <div class="quality"><b>-0.082</b><span>word2vec (cosine)</span></div>
  <div class="quality"><b>-0.054</b><span>FastText (cosine)</span></div>
</div>
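<p class="note">Silhouette score ∈ [–1, 1]. Higher = sections more separated in 64-d embedding space. Scores near or below 0, as here, mean that words from different sections overlap rather than form distinct clusters.</p>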
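<p class="note">For reference, a minimal pure-Python sketch of a mean silhouette score under cosine distance. It assumes each word carries one section label and at least two sections are present. The names <code>cosine_distance</code> and <code>mean_silhouette</code> and the toy vectors are illustrative, not taken from the pipeline.</p>
<pre>
```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_silhouette(vectors, labels):
    """Mean silhouette coefficient over all points, using cosine distance."""
    groups = {}
    for idx, label in enumerate(labels):
        groups.setdefault(label, []).append(idx)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in groups[label] if j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the same section
        a = sum(cosine_distance(vectors[i], vectors[j]) for j in own) / len(own)
        # b: mean distance to the nearest other section
        b = min(
            sum(cosine_distance(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in groups.items() if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy 2-d vectors (hypothetical data, not from the corpus)
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
by_direction = ["Herbal", "Herbal", "Stars", "Stars"]
interleaved = ["Herbal", "Stars", "Herbal", "Stars"]
separated_score = mean_silhouette(vectors, by_direction)
mixed_score = mean_silhouette(vectors, interleaved)
print(round(separated_score, 3), round(mixed_score, 3))
```
</pre>
<p class="note">Labels that follow the geometry score close to 1. Interleaved labels score below 0, which is the regime both models fall into above.</p>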
</section>

<section>
<h2>Top 10 most frequent words</h2>
<table><thead><tr><th>Word</th><th>Freq</th><th>Section</th><th>Initial%</th><th>Ctx entropy</th><th>Next entropy</th><th>Prev entropy</th></tr></thead><tbody><tr><td>daiin</td><td>744</td><td>Herbal</td><td>17.5%</td><td>9.32</td><td>8.09</td><td>7.99</td></tr>
<tr><td>chedy</td><td>465</td><td>Stars</td><td>1.3%</td><td>8.59</td><td>7.22</td><td>7.18</td></tr>
<tr><td>ol</td><td>457</td><td>Balneological</td><td>6.3%</td><td>8.65</td><td>7.08</td><td>7.61</td></tr>
<tr><td>shedy</td><td>403</td><td>Balneological</td><td>1.5%</td><td>8.60</td><td>6.93</td><td>7.12</td></tr>
<tr><td>aiin</td><td>362</td><td>Stars</td><td>0.0%</td><td>9.03</td><td>7.56</td><td>6.75</td></tr>
<tr><td>chol</td><td>351</td><td>Herbal</td><td>4.8%</td><td>8.65</td><td>6.97</td><td>7.35</td></tr>
<tr><td>chey</td><td>308</td><td>Stars</td><td>1.3%</td><td>8.60</td><td>7.25</td><td>7.16</td></tr>
<tr><td>qokeedy</td><td>300</td><td>Balneological</td><td>10.7%</td><td>8.18</td><td>6.89</td><td>6.45</td></tr>
<tr><td>qokeey</td><td>283</td><td>Stars</td><td>10.2%</td><td>8.41</td><td>7.01</td><td>6.67</td></tr>
<tr><td>or</td><td>283</td><td>Herbal</td><td>9.9%</td><td>8.78</td><td>6.49</td><td>7.31</td></tr>
</tbody></table>
</section>

<section>
<h2>Function-word candidates (high line-initial rate + high context entropy)</h2>
<table><thead><tr><th>Word</th><th>Freq</th><th>Initial%</th><th>Context entropy</th><th>Section</th></tr></thead><tbody><tr><td>sol</td><td>64</td><td>56.2%</td><td>7.30</td><td>Balneological</td></tr>
<tr><td>sain</td><td>61</td><td>50.8%</td><td>7.31</td><td>Stars</td></tr>
<tr><td>saiin</td><td>125</td><td>42.4%</td><td>8.19</td><td>Stars</td></tr>
<tr><td>sar</td><td>63</td><td>38.1%</td><td>7.57</td><td>Stars</td></tr>
<tr><td>qotchy</td><td>61</td><td>31.1%</td><td>7.30</td><td>Herbal</td></tr>
<tr><td>dor</td><td>65</td><td>30.8%</td><td>7.47</td><td>Herbal</td></tr>
<tr><td>y</td><td>139</td><td>30.2%</td><td>8.32</td><td>Herbal</td></tr>
<tr><td>o</td><td>70</td><td>30.0%</td><td>7.49</td><td>Herbal</td></tr>
</tbody></table>
<p class="note">
  High initial fraction = often starts a line (grammatical role?).
  High context entropy = appears in many different contexts.
</p>
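<p class="note">A sketch of how the two columns could be computed. This report does not define context entropy, so the sketch assumes it is the Shannon entropy, in bits, of a word's left and right neighbours pooled together. <code>function_word_stats</code> and the toy lines are illustrative names and data.</p>
<pre>
```python
import math
from collections import Counter, defaultdict

def shannon_entropy(counts):
    """Shannon entropy in bits of a Counter of observations."""
    total = sum(counts.values())
    return 0.0 - sum((c / total) * math.log2(c / total) for c in counts.values())

def function_word_stats(lines):
    """Per-word line-initial fraction and pooled-neighbour entropy.

    `lines` is a list of token lists, one list per manuscript line.
    """
    freq = Counter()
    initial = Counter()
    neighbours = defaultdict(Counter)
    for tokens in lines:
        for pos, word in enumerate(tokens):
            freq[word] += 1
            if pos == 0:
                initial[word] += 1          # word opens the line
            else:
                neighbours[word][tokens[pos - 1]] += 1   # left neighbour
            if pos + 1 != len(tokens):
                neighbours[word][tokens[pos + 1]] += 1   # right neighbour
    return {
        w: {
            "initial_frac": initial[w] / freq[w],
            "context_entropy": shannon_entropy(neighbours[w]),
        }
        for w in freq
    }

# Toy lines (hypothetical, built from words that occur in the tables)
toy_lines = [
    ["sol", "chedy", "daiin"],
    ["sol", "ol", "daiin"],
    ["chedy", "sol", "ol"],
]
stats = function_word_stats(toy_lines)
print(stats["sol"])
```
</pre>
<p class="note">In the toy data <code>sol</code> opens two of its three lines and is surrounded by two neighbour types equally often, so it scores high on the first measure and exactly one bit on the second.</p>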
</section>

<section>
<h2>Prefix-frame words — constrained successors (prev entropy &gt;&gt; next entropy)</h2>
<table><thead><tr><th>Word</th><th>Freq</th><th>Next entropy</th><th>Prev entropy</th><th>Asymmetry</th></tr></thead><tbody><tr><td>qokam</td><td>24</td><td>1.58</td><td>4.22</td><td>-2.64</td></tr>
<tr><td>am</td><td>58</td><td>3.58</td><td>5.56</td><td>-1.97</td></tr>
<tr><td>okam</td><td>21</td><td>2.58</td><td>4.30</td><td>-1.71</td></tr>
<tr><td>ldy</td><td>24</td><td>2.81</td><td>4.42</td><td>-1.61</td></tr>
<tr><td>oly</td><td>53</td><td>3.97</td><td>5.52</td><td>-1.56</td></tr>
<tr><td>sy</td><td>30</td><td>3.42</td><td>4.86</td><td>-1.44</td></tr>
<tr><td>dary</td><td>21</td><td>2.81</td><td>4.06</td><td>-1.25</td></tr>
<tr><td>otam</td><td>36</td><td>3.58</td><td>4.82</td><td>-1.23</td></tr>
</tbody></table>
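<p class="note">
  Low next entropy = followed by only a small set of words.
  Asymmetry = next entropy − prev entropy, so strongly negative values mark words with varied predecessors but constrained successors.
  These may act as frame-words that introduce specific fixed expressions.
</p>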
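<p class="note">A sketch of the directional entropies, assuming they are Shannon entropies in bits over adjacent words within a line, with asymmetry taken as next minus prev (this matches the table: 1.58 − 4.22 = −2.64). For scale, 1.58 bits is what three equally likely successors would give (log2 3 ≈ 1.585). <code>directional_entropy</code> and the toy lines are illustrative.</p>
<pre>
```python
import math
from collections import Counter, defaultdict

def entropy_bits(counts):
    """Shannon entropy in bits of a Counter (0.0 for an empty Counter)."""
    total = sum(counts.values())
    return 0.0 - sum((c / total) * math.log2(c / total) for c in counts.values())

def directional_entropy(lines):
    """Map each word to (next_entropy, prev_entropy, asymmetry)."""
    successors = defaultdict(Counter)
    predecessors = defaultdict(Counter)
    for tokens in lines:
        for left, right in zip(tokens, tokens[1:]):
            successors[left][right] += 1
            predecessors[right][left] += 1
    result = {}
    for word in set(successors) | set(predecessors):
        h_next = entropy_bits(successors[word])
        h_prev = entropy_bits(predecessors[word])
        result[word] = (h_next, h_prev, h_next - h_prev)
    return result

# Toy lines (hypothetical): four different predecessors, one fixed successor
toy_lines = [
    ["ol", "qokam", "chedy"],
    ["or", "qokam", "chedy"],
    ["aiin", "qokam", "chedy"],
    ["chol", "qokam", "chedy"],
]
h_next, h_prev, asymmetry = directional_entropy(toy_lines)["qokam"]
print(h_next, h_prev, asymmetry)
```
</pre>
<p class="note">The toy word has a single possible successor and four equally likely predecessors, the extreme form of the pattern in the table.</p>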
</section>

<section>
<h2>Top NPMI collocations (adjacent words, count ≥ 5)</h2>
<table><thead><tr><th>Word 1</th><th>Word 2</th><th>NPMI</th><th>Count</th><th>Section</th></tr></thead><tbody><tr><td>k</td><td>x</td><td>0.716</td><td>5</td><td>Cosmological</td></tr>
<tr><td>l</td><td>o</td><td>0.389</td><td>7</td><td>Stars</td></tr>
<tr><td>chy</td><td>kchy</td><td>0.361</td><td>6</td><td>Herbal</td></tr>
<tr><td>qokchy</td><td>qotchy</td><td>0.339</td><td>5</td><td>Herbal</td></tr>
<tr><td>oteedy</td><td>qotain</td><td>0.328</td><td>6</td><td>Stars</td></tr>
<tr><td>qol</td><td>sheedy</td><td>0.319</td><td>10</td><td>Balneological</td></tr>
<tr><td>ain</td><td>sar</td><td>0.313</td><td>5</td><td>Stars</td></tr>
<tr><td>ain</td><td>r</td><td>0.304</td><td>6</td><td>Stars</td></tr>
<tr><td>chaiin</td><td>chy</td><td>0.295</td><td>5</td><td>Herbal</td></tr>
<tr><td>checthy</td><td>qokain</td><td>0.293</td><td>6</td><td>Balneological</td></tr>
</tbody></table>
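<p class="note">A sketch of normalized PMI for adjacent word pairs, using the standard definition NPMI(x, y) = ln(p(x, y) / (p(x) p(y))) / −ln p(x, y), which ranges from −1 to 1. It assumes ordered pairs within a line and probabilities estimated from pair counts. The pipeline's exact estimator is not shown in this report, and <code>npmi_table</code> is an illustrative name.</p>
<pre>
```python
import math
from collections import Counter

def npmi_table(lines, min_count=5):
    """NPMI for every adjacent (left, right) pair seen at least min_count times."""
    pair_counts = Counter()
    left_counts = Counter()
    right_counts = Counter()
    for tokens in lines:
        for left, right in zip(tokens, tokens[1:]):
            pair_counts[(left, right)] += 1
            left_counts[left] += 1
            right_counts[right] += 1
    total = sum(pair_counts.values())
    scores = {}
    for (left, right), n in pair_counts.items():
        # Skip rare pairs, and the degenerate case p(x, y) = 1 (zero denominator)
        if n >= min_count and n != total:
            p_xy = n / total
            p_x = left_counts[left] / total
            p_y = right_counts[right] / total
            pmi = math.log(p_xy / (p_x * p_y))
            scores[(left, right)] = pmi / -math.log(p_xy)
    return scores

# Toy lines (hypothetical counts, not from the corpus)
toy_lines = (
    [["qol", "sheedy"]] * 5
    + [["ol", "daiin"]] * 5
    + [["ol", "sheedy"]]
)
scores = npmi_table(toy_lines, min_count=5)
for pair, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 3))
```
</pre>
<p class="note">The single <code>ol sheedy</code> pair falls under the count threshold and is dropped, mirroring the count ≥ 5 filter in the heading.</p>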
</section>

<section>
<h2>Top morphological word families (FastText, threshold 0.85)</h2>
<table><thead><tr><th>Family core</th><th>Members</th><th>Top members</th></tr></thead><tbody><tr><td>ke*</td><td>37</td><td>keeody, keody, lkeeody, okeeo, okeeody, okeeol</td></tr>
<tr><td>cho*</td><td>31</td><td>chod, chodar, chody, cholody, cholor, chor</td></tr>
<tr><td>ched*</td><td>27</td><td>chedy, dchedy, fchedy, lched, lchedy, lpchedy</td></tr>
<tr><td>ed*</td><td>12</td><td>eedy, eeedy, lteedy, oeedy, okeedy, okeeedy</td></tr>
<tr><td>aiin*</td><td>11</td><td>chaiin, cphaiin, cthaiin, laiin, olaiin, opaiin</td></tr>
<tr><td>shedy*</td><td>10</td><td>dshedy, lshedy, olshedy, otshedy, qokshedy, rshedy</td></tr>
<tr><td>shey*</td><td>9</td><td>kshey, lshey, okshey, olshey, oshey, shey</td></tr>
<tr><td>chdy*</td><td>8</td><td>chdy, dchdy, lchdy, olchdy, otchdy, qotchdy</td></tr>
<tr><td>hed*</td><td>8</td><td>ched, chedaiin, chedain, chedal, chedam, chedar</td></tr>
<tr><td>cheol*</td><td>7</td><td>cheol, cheoly, dcheol, lcheol, olcheol, tcheol</td></tr>
</tbody></table>
</section>

<section>
<h2>Visualizations</h2>
<div class="viz-grid"><div class="viz-card">  <a href="voynich_embeddings.html" target="_blank"><b>Word embedding scatter</b></a>  <span>7 coloring modes, UMAP 2D</span></div>
<div class="viz-card">  <a href="similarity_matrix.html" target="_blank"><b>Similarity matrix</b></a>  <span>Pairwise cosine sim, hierarchically clustered</span></div>
<div class="viz-card">  <a href="nn_graph.html" target="_blank"><b>Nearest-neighbour graph</b></a>  <span>Top-200 words, 4-NN edges in 64d space</span></div>
<div class="viz-card">  <a href="bigrams_heatmap.html" target="_blank"><b>Bigram heatmap</b></a>  <span>Top prefix × top suffix pairs</span></div>
<div class="viz-card">  <a href="bigrams_network.html" target="_blank"><b>Bigram network</b></a>  <span>Force-directed bigram co-occurrence graph</span></div>
<div class="viz-card">  <a href="folio_embeddings.html" target="_blank"><b>Folio embeddings</b></a>  <span>Per-folio mean-pooled embeddings, UMAP 2D</span></div>
<div class="viz-card">  <a href="section_vocab.html" target="_blank"><b>Section-distinctive vocabulary</b></a>  <span>TF-IDF-style heatmap, sections × words</span></div>
<div class="viz-card">  <a href="vocab_drift.html" target="_blank"><b>Vocabulary drift</b></a>  <span>Word frequency across folio windows</span></div>
<div class="viz-card">  <a href="pmi_heatmap.html" target="_blank"><b>NPMI heatmap</b></a>  <span>Normalized PMI for top-40 words</span></div>
<div class="viz-card">  <a href="word_families.html" target="_blank"><b>Morphological word families</b></a>  <span>FastText clustering treemap</span></div>
<div class="viz-card">  <a href="function_words.html" target="_blank"><b>Function-word scatter</b></a>  <span>Initial fraction vs context entropy</span></div>
<div class="viz-card">  <a href="entropy_scatter.html" target="_blank"><b>Directional entropy scatter</b></a>  <span>prev-entropy vs next-entropy asymmetry</span></div>
<div class="viz-card">  <a href="char_ngrams.html" target="_blank"><b>Character n-gram analysis</b></a>  <span>EVA bigram transition matrix + log-odds surprise</span></div>
<div class="viz-card">  <a href="line_structure.html" target="_blank"><b>Line structure analysis</b></a>  <span>First/last words, length distribution, section comparison</span></div>
<div class="viz-card">  <a href="positional_bigrams.html" target="_blank"><b>Positional bigrams</b></a>  <span>G² enrichment of word bigrams by line zone (begin/mid/end)</span></div>
<div class="viz-card">  <a href="cooccurrence_network.html" target="_blank"><b>Co-occurrence network</b></a>  <span>Word co-occurrence graph, window=3, greedy community detection</span></div>
<div class="viz-card">  <a href="hapax_analysis.html" target="_blank"><b>Hapax &amp; Zipf analysis</b></a>  <span>Zipf/Heaps' law fit, hapax ratio by section, frequency tiers</span></div>
<div class="viz-card">  <a href="word_length_profile.html" target="_blank"><b>Word-length profile</b></a>  <span>Length by section/position, folio trend, char-type composition</span></div>
<div class="viz-card">  <a href="hmm_states.html" target="_blank"><b>HMM latent states</b></a>  <span>Unsupervised Baum-Welch HMM, K=6 POS-like states on UMAP</span></div>
<div class="viz-card">  <a href="voynich_3d.html" target="_blank"><b>3D embedding scatter</b></a>  <span>UMAP 3D — section / frequency / prefix / HMM coloring</span></div>
<div class="viz-card">  <a href="word_transition.html" target="_blank"><b>Word transition network</b></a>  <span>Directed P(B|A) graph, top-120 words, spring layout</span></div>
<div class="viz-card">  <a href="cluster_purity.html" target="_blank"><b>Cluster purity analysis</b></a>  <span>ARI/NMI of KMeans vs section/HMM/prefix/length labels</span></div>
<div class="viz-card">  <a href="context_profile.html" target="_blank"><b>Context profile heatmap</b></a>  <span>Left/right conditional context probabilities for top-40 words</span></div>
<div class="viz-card">  <a href="folio_drift.html" target="_blank"><b>Folio semantic trajectory</b></a>  <span>PCA 2D of per-folio mean embeddings — semantic drift across folios</span></div>
<div class="viz-card">  <a href="morpheme_inventory.html" target="_blank"><b>Morpheme inventory</b></a>  <span>EVA morpheme candidates via char n-gram log-odds segmentation</span></div>
</div>
<p class="note">Greyed-out links have not been generated yet.
  Run the corresponding <code>make</code> target to create them.</p>
</section>

</body></html>