LittleStinkerGuy/word2Vec
W2V Search Engine — Usage Guide

A terminal-based semantic search engine powered by a from-scratch Word2Vec (skip-gram) implementation. Index .txt documents, then search them interactively with ranked results and live previews.

Quick Start

pip install -r requirements.txt
python main.py --dir test_docs

This launches the TUI pointed at the test_docs/ directory. On first run you'll see a prompt to index — press i to begin.

Pretrained Embeddings (Recommended)

For best results, use pretrained embeddings to search your documents. This gives the search engine a broad vocabulary — words like "squid" will work even if they don't appear in your target documents.

Three pretrained models are included in example_models/, all trained on WikiText:

Model Description
example_models/wikitext_Actual Default model — trained with subsampling and vocabulary pruning
example_models/wikitext_NoSubsampling Trained without subsampling of frequent words
example_models/wikitext_NoPruning Trained without vocabulary pruning
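The subsampling that distinguishes these models typically follows the frequent-word discard rule from the original skip-gram work. This repo's exact threshold isn't documented here, so the sketch below uses the conventional t = 1e-5; the function names are illustrative, not this project's API:

```python
import math
import random

def keep_probability(word_freq: float, t: float = 1e-5) -> float:
    """Probability of keeping a word with relative frequency word_freq
    under the skip-gram subsampling rule: rare words are always kept,
    very frequent words are aggressively discarded."""
    if word_freq <= t:
        return 1.0
    return math.sqrt(t / word_freq) + t / word_freq

def subsample(tokens, freqs, t=1e-5, rng=random.random):
    """Drop each occurrence of a word with probability 1 - keep_probability."""
    return [w for w in tokens if rng() < keep_probability(freqs[w], t)]
```

Subsampling thins out high-frequency words like "the" during training, which tends to sharpen the embeddings of content words; the NoSubsampling model skips this step.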

Step 1: Index your documents

python main.py --cli --index \
  --dir test_docs/ \
  --model-path example_models/wikitext_Actual

This uses the pretrained embeddings to vectorize your documents — no training needed.

Step 2: Search

# TUI mode
python main.py --dir test_docs/ --model-path example_models/wikitext_Actual

# CLI mode
python main.py --cli --search "squid" --dir test_docs/

Running the TUI

# Default (indexes current directory)
python main.py

# Specify a document directory
python main.py --dir /path/to/documents

# With pretrained model
python main.py --dir test_docs/ --model-path example_models/wikitext_Actual

When launched with --model-path, pressing i will index documents using the pretrained model (fast, no training). Without it, pressing i trains and indexes in one step (legacy mode).

Layout

┌──────────────────────────────────────────────┐
│  W2V Search Engine                           │  Title bar
├──────────────────────────────────────────────┤
│ Dir: /path/to/documents                      │
│ Search: your query here                      │  Search input
│         ────────────────                     │
├──────────────────────────────────────────────┤
│ Results (N)                                  │
│  1. [+0.8542] document1.txt                  │  Ranked results
│  2. [+0.7231] document2.txt                  │  with similarity
│  3. [+0.6892] document3.txt  ◄── selected   │  scores
│                                              │
│ Preview: document3.txt                       │
│ Lorem ipsum dolor sit amet, consectetur...   │  Scrollable
│ adipiscing elit. Sed do eiusmod tempor...    │  preview
├──────────────────────────────────────────────┤
│ [i] Index  [Enter] Search  [↑↓] Navigate    │  Status bar
│ [PgUp/Dn] Scroll  [Esc] Quit                │
└──────────────────────────────────────────────┘

Keybindings

Search Mode

Key Action
Any printable character Insert into query and search
Backspace Delete character before cursor
Delete Delete character at cursor
Enter Execute search
← / → Move cursor within query
Home / Ctrl+A Jump to start of query
End / Ctrl+E Jump to end of query
Ctrl+U Clear entire query
↑ / ↓ Navigate results list
PgUp / PgDn Scroll document preview
i Start indexing (only when query is empty)
q Quit (only when query is empty)
Esc Quit

Indexing Mode

Key Action
q Quit

All other input is ignored while indexing is in progress.

Workflow

1. Indexing

Before you can search, the documents must be indexed. If no index exists, the status bar will prompt you to press i.

When indexing starts:

  • Documents in the target directory are scanned (.txt files only)
  • If using a pretrained model: documents are vectorized immediately
  • If no pretrained model: vocabulary is built, Word2Vec trains with live progress, then documents are vectorized
  • The index is saved to .w2v_index/ for reuse

Once complete, the status updates to "Ready to search!" and you're returned to search mode.
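The test suite mentions SIF weights, so the vectorization step plausibly computes each document vector as a smooth-inverse-frequency weighted average of its words' embeddings. A minimal sketch under that assumption (all names here are illustrative, not this repo's API):

```python
import numpy as np

def sif_weight(count: int, total: int, a: float = 1e-3) -> float:
    """Smooth inverse frequency weight a / (a + p(w)):
    frequent words contribute less to the document vector."""
    return a / (a + count / total)

def doc_vector(tokens, vocab, embeddings, counts, total):
    """SIF-weighted average of the embeddings of in-vocabulary tokens.
    Out-of-vocabulary tokens are skipped; an all-OOV document maps
    to the zero vector."""
    rows = [sif_weight(counts[w], total) * embeddings[vocab[w]]
            for w in tokens if w in vocab]
    if not rows:
        return np.zeros(embeddings.shape[1])
    return np.mean(rows, axis=0)
```

With a pretrained model, this is the only per-document work needed at index time, which is why indexing is fast compared to legacy mode's full training pass.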

2. Searching

Type your query and results appear instantly, ranked by cosine similarity. Each result shows:

 1. [+0.8542] machine_learning.txt
  • Rank — position in results (top 10 shown)
  • Score — cosine similarity between query and document vectors
  • Filename — the matched document

Use ↑ / ↓ to select a result and see its preview below.
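The cosine-similarity ranking described above can be sketched in a few lines (illustrative only; the project's internal function names will differ):

```python
import numpy as np

def rank(query_vec, doc_vecs, names, top=10):
    """Rank documents by cosine similarity to the query vector.
    Normalizing both sides reduces cosine similarity to a dot product;
    the small epsilon guards against zero vectors."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    d = doc_vecs / (np.linalg.norm(doc_vecs, axis=1, keepdims=True) + 1e-12)
    scores = d @ q
    order = np.argsort(scores)[::-1][:top]
    return [(float(scores[i]), names[i]) for i in order]
```

Because scores are cosines of precomputed vectors, ranking is a single matrix-vector product, which is what makes the as-you-type search feel instant.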

3. Previewing

The bottom section shows the full text of the selected result, word-wrapped to your terminal width. Use PgUp/PgDn to scroll through longer documents.

CLI Mode

Index with pretrained model

python main.py --cli --index --dir test_docs/ --model-path example_models/wikitext_Actual

Index without pretrained model (legacy)

python main.py --cli --index --dir test_docs

Search

python main.py --cli --search "ocean creatures" --dir test_docs/ --top 5

Alternative CLI entry point

python -m w2v.cli index --dir test_docs/ --model-path example_models/wikitext_Actual
python -m w2v.cli search "ocean creatures" --dir test_docs/ --top 5

Training Parameters

Flag Default Description
--embed-dim 100 Embedding dimensions
--window 5 Context window size
--epochs 5 Training epochs
--min-count 1 Minimum word frequency
--neg-samples 5 Negative samples

Storage

Pretrained model (example_models/<name>/)

File Contents
embeddings.npy Trained word embeddings (NumPy binary)
vocab.json Word-to-index mappings and word counts
config.json Hyperparameters and corpus info
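Since a model directory is plain NumPy and JSON files, it can be inspected without the project's code. A sketch, assuming vocab.json maps each word to its row index in embeddings.npy (the exact JSON schema is an assumption):

```python
import json
from pathlib import Path

import numpy as np

def load_model(model_dir: str):
    """Load a pretrained model directory: embeddings, vocab, and config."""
    d = Path(model_dir)
    embeddings = np.load(d / "embeddings.npy")          # (vocab_size, embed_dim)
    vocab = json.loads((d / "vocab.json").read_text())  # word -> row index
    config = json.loads((d / "config.json").read_text())
    return embeddings, vocab, config
```

This makes it easy to, for example, check whether a query word is in the pretrained vocabulary before wondering why a search returned nothing.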

Document index (<doc_dir>/.w2v_index/)

File Contents
doc_vectors.json Precomputed document vectors
config.json Model path reference, IDF values

When using legacy mode (no pretrained model), .w2v_index/ also contains embeddings.npy and vocab.json.

Delete .w2v_index/ to force a re-index.

Search Tips and Limitations

Multi-word queries work best. Single ambiguous words may produce surprising results because the embedding reflects whatever context that word appeared in most during training:

# Weak: single word, ambiguous context
python main.py --cli --search "fire" --dir test_docs/
# → automotive.txt (not cooking)

# Strong: multiple topic-specific words
python main.py --cli --search "fire grill roast cooking" --dir test_docs/
# → cooking_recipes.txt

Some known limitations with the wikitext pretrained model:

Query Expected Actual #1 Why
fire cooking_recipes.txt automotive.txt "fire" appears in engine/military contexts in wikitext
squid ocean_marine_life.txt python_programming.txt Rare word, noisy embedding — ocean is a close #2
crypto cybersecurity.txt ancient_history.txt "crypto" co-occurs with Greek/historical roots in wikitext

To work around this, use more specific queries or train on a domain-relevant corpus.

Testing

pip install pytest
python -m pytest tests/ -v

The test suite (208 tests) covers:

File Tests What it covers
test_tokenizer.py 17 Preprocessing, vocabulary building, token-to-id conversion, subsampling
test_word2vec.py 25 Model init, unigram table, training pairs, sigmoid, training loop, save/load
test_indexer.py 22 Document scanning, SIF weights, vectorization, PCA removal, full indexing pipeline
test_searcher.py 18 Searcher init, query vectorization, search ranking, model path resolution
test_progress.py 9 Progress bar output, train/index callbacks
test_cli.py 10 Both CLI entry points (main.py and w2v.cli)
test_integration.py 11 End-to-end pipelines, reindexing, edge cases, separate model/index dirs
test_search_relevance.py 63 Search quality against test_docs/ — topic queries, natural language, score properties, CLI

Run a single file:

python -m pytest tests/test_search_relevance.py -v

Run a single test class:

python -m pytest tests/test_search_relevance.py::TestDirectTopicQueries -v

The relevance tests in test_search_relevance.py require test_docs/ to have a pre-built index. If missing, the fixture rebuilds it automatically.

Requirements

  • Python 3
  • numpy
  • pytest (for running tests)
  • A terminal that supports curses (minimum 40x10)
