LittleStinkerGuy/word2Vec
W2V Search Engine — Usage Guide

A terminal-based semantic search engine powered by a from-scratch Word2Vec (skip-gram) implementation. Index .txt documents, then search them interactively with ranked results and live previews.

Quick Start

pip install -r requirements.txt
python main.py --dir test_docs

This launches the TUI pointed at the test_docs/ directory. On first run you'll see a prompt to index — press i to begin.

Pretrained Embeddings (Recommended)

For best results, use pretrained embeddings to search your documents. This gives the search engine a broad vocabulary — words like "squid" will work even if they don't appear in your target documents.

Three pretrained models are included in example_models/, all trained on WikiText:

Model Description
example_models/wikitext_Actual Default model — trained with subsampling and vocabulary pruning
example_models/wikitext_NoSubsampling Trained without subsampling of frequent words
example_models/wikitext_NoPruning Trained without vocabulary pruning
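The subsampling that distinguishes these models typically follows the frequent-word discard rule from the original skip-gram work. This repo's exact threshold isn't documented here, so the sketch below uses the conventional t = 1e-5; the function names are illustrative, not this project's API:

```python
import math
import random

def keep_probability(word_freq: float, t: float = 1e-5) -> float:
    """Probability of keeping a word with relative frequency word_freq
    under the skip-gram subsampling rule: rare words are always kept,
    very frequent words are aggressively discarded."""
    if word_freq <= t:
        return 1.0
    return math.sqrt(t / word_freq) + t / word_freq

def subsample(tokens, freqs, t=1e-5, rng=random.random):
    """Drop each occurrence of a word with probability 1 - keep_probability."""
    return [w for w in tokens if rng() < keep_probability(freqs[w], t)]
```

Subsampling thins out high-frequency words like "the" during training, which tends to sharpen the embeddings of content words; the NoSubsampling model skips this step.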

Step 1: Index your documents

python main.py --cli --index \
  --dir test_docs/ \
  --model-path example_models/wikitext_Actual

This uses the pretrained embeddings to vectorize your documents — no training needed.

Step 2: Search

# TUI mode
python main.py --dir test_docs/ --model-path example_models/wikitext_Actual

# CLI mode
python main.py --cli --search "squid" --dir test_docs/

Running the TUI

# Default (indexes current directory)
python main.py

# Specify a document directory
python main.py --dir /path/to/documents

# With pretrained model
python main.py --dir test_docs/ --model-path example_models/wikitext_Actual

When launched with --model-path, pressing i will index documents using the pretrained model (fast, no training). Without it, pressing i trains and indexes in one step (legacy mode).

Layout

┌──────────────────────────────────────────────┐
│  W2V Search Engine                           │  Title bar
├──────────────────────────────────────────────┤
│ Dir: /path/to/documents                      │
│ Search: your query here                      │  Search input
│         ────────────────                     │
├──────────────────────────────────────────────┤
│ Results (N)                                  │
│  1. [+0.8542] document1.txt                  │  Ranked results
│  2. [+0.7231] document2.txt                  │  with similarity
│  3. [+0.6892] document3.txt  ◄── selected   │  scores
│                                              │
│ Preview: document3.txt                       │
│ Lorem ipsum dolor sit amet, consectetur...   │  Scrollable
│ adipiscing elit. Sed do eiusmod tempor...    │  preview
├──────────────────────────────────────────────┤
│ [i] Index  [Enter] Search  [↑↓] Navigate    │  Status bar
│ [PgUp/Dn] Scroll  [Esc] Quit                │
└──────────────────────────────────────────────┘

Keybindings

Search Mode

Key Action
Any printable character Insert into query and search
Backspace Delete character before cursor
Delete Delete character at cursor
Enter Execute search
← / → Move cursor within query
Home / Ctrl+A Jump to start of query
End / Ctrl+E Jump to end of query
Ctrl+U Clear entire query
↑ / ↓ Navigate results list
PgUp / PgDn Scroll document preview
i Start indexing (only when query is empty)
q Quit (only when query is empty)
Esc Quit

Indexing Mode

Key Action
q Quit

All other input is ignored while indexing is in progress.

Workflow

1. Indexing

Before you can search, the documents must be indexed. If no index exists, the status bar will prompt you to press i.

When indexing starts:

  • Documents in the target directory are scanned (.txt files only)
  • If using a pretrained model: documents are vectorized immediately
  • If no pretrained model: vocabulary is built, Word2Vec trains with live progress, then documents are vectorized
  • The index is saved to .w2v_index/ for reuse

Once complete, the status updates to "Ready to search!" and you're returned to search mode.
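The test suite mentions SIF weights, so the vectorization step plausibly computes each document vector as a smooth-inverse-frequency weighted average of its words' embeddings. A minimal sketch under that assumption (all names here are illustrative, not this repo's API):

```python
import numpy as np

def sif_weight(count: int, total: int, a: float = 1e-3) -> float:
    """Smooth inverse frequency weight a / (a + p(w)):
    frequent words contribute less to the document vector."""
    return a / (a + count / total)

def doc_vector(tokens, vocab, embeddings, counts, total):
    """SIF-weighted average of the embeddings of in-vocabulary tokens.
    Out-of-vocabulary tokens are skipped; an all-OOV document maps
    to the zero vector."""
    rows = [sif_weight(counts[w], total) * embeddings[vocab[w]]
            for w in tokens if w in vocab]
    if not rows:
        return np.zeros(embeddings.shape[1])
    return np.mean(rows, axis=0)
```

With a pretrained model, this is the only per-document work needed at index time, which is why indexing is fast compared to legacy mode's full training pass.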

2. Searching

Type your query and results appear instantly, ranked by cosine similarity. Each result shows:

 1. [+0.8542] machine_learning.txt
  • Rank — position in results (top 10 shown)
  • Score — cosine similarity between query and document vectors
  • Filename — the matched document

Use ↑ / ↓ to select a result and see its preview below.
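The cosine-similarity ranking described above can be sketched in a few lines (illustrative only; the project's internal function names will differ):

```python
import numpy as np

def rank(query_vec, doc_vecs, names, top=10):
    """Rank documents by cosine similarity to the query vector.
    Normalizing both sides reduces cosine similarity to a dot product;
    the small epsilon guards against zero vectors."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    d = doc_vecs / (np.linalg.norm(doc_vecs, axis=1, keepdims=True) + 1e-12)
    scores = d @ q
    order = np.argsort(scores)[::-1][:top]
    return [(float(scores[i]), names[i]) for i in order]
```

Because scores are cosines of precomputed vectors, ranking is a single matrix-vector product, which is what makes the as-you-type search feel instant.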

3. Previewing

The bottom section shows the full text of the selected result, word-wrapped to your terminal width. Use PgUp/PgDn to scroll through longer documents.

CLI Mode

Index with pretrained model

python main.py --cli --index --dir test_docs/ --model-path example_models/wikitext_Actual

Index without pretrained model (legacy)

python main.py --cli --index --dir test_docs

Search

python main.py --cli --search "ocean creatures" --dir test_docs/ --top 5

Alternative CLI entry point

python -m w2v.cli index --dir test_docs/ --model-path example_models/wikitext_Actual
python -m w2v.cli search "ocean creatures" --dir test_docs/ --top 5

Training Parameters

Flag Default Description
--embed-dim 100 Embedding dimensions
--window 5 Context window size
--epochs 5 Training epochs
--min-count 1 Minimum word frequency
--neg-samples 5 Negative samples

Storage

Pretrained model (example_models/<name>/)

File Contents
embeddings.npy Trained word embeddings (NumPy binary)
vocab.json Word-to-index mappings and word counts
config.json Hyperparameters and corpus info
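Since a model directory is plain NumPy and JSON files, it can be inspected without the project's code. A sketch, assuming vocab.json maps each word to its row index in embeddings.npy (the exact JSON schema is an assumption):

```python
import json
from pathlib import Path

import numpy as np

def load_model(model_dir: str):
    """Load a pretrained model directory: embeddings, vocab, and config."""
    d = Path(model_dir)
    embeddings = np.load(d / "embeddings.npy")          # (vocab_size, embed_dim)
    vocab = json.loads((d / "vocab.json").read_text())  # word -> row index
    config = json.loads((d / "config.json").read_text())
    return embeddings, vocab, config
```

This makes it easy to, for example, check whether a query word is in the pretrained vocabulary before wondering why a search returned nothing.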

Document index (<doc_dir>/.w2v_index/)

File Contents
doc_vectors.json Precomputed document vectors
config.json Model path reference, IDF values

When using legacy mode (no pretrained model), .w2v_index/ also contains embeddings.npy and vocab.json.

Delete .w2v_index/ to force a re-index.

Search Tips and Limitations

Multi-word queries work best. Single ambiguous words may produce surprising results because the embedding reflects whatever context that word appeared in most during training:

# Weak: single word, ambiguous context
python main.py --cli --search "fire" --dir test_docs/
# → automotive.txt (not cooking)

# Strong: multiple topic-specific words
python main.py --cli --search "fire grill roast cooking" --dir test_docs/
# → cooking_recipes.txt

Some known limitations with the wikitext pretrained model:

Query Expected Actual #1 Why
fire cooking_recipes.txt automotive.txt "fire" appears in engine/military contexts in wikitext
squid ocean_marine_life.txt python_programming.txt Rare word, noisy embedding — ocean is a close #2
crypto cybersecurity.txt ancient_history.txt "crypto" co-occurs with Greek/historical roots in wikitext

To work around this, use more specific queries or train on a domain-relevant corpus.

Testing

pip install pytest
python -m pytest tests/ -v

The test suite (208 tests) covers:

File Tests What it covers
test_tokenizer.py 17 Preprocessing, vocabulary building, token-to-id conversion, subsampling
test_word2vec.py 25 Model init, unigram table, training pairs, sigmoid, training loop, save/load
test_indexer.py 22 Document scanning, SIF weights, vectorization, PCA removal, full indexing pipeline
test_searcher.py 18 Searcher init, query vectorization, search ranking, model path resolution
test_progress.py 9 Progress bar output, train/index callbacks
test_cli.py 10 Both CLI entry points (main.py and w2v.cli)
test_integration.py 11 End-to-end pipelines, reindexing, edge cases, separate model/index dirs
test_search_relevance.py 63 Search quality against test_docs/ — topic queries, natural language, score properties, CLI

Run a single file:

python -m pytest tests/test_search_relevance.py -v

Run a single test class:

python -m pytest tests/test_search_relevance.py::TestDirectTopicQueries -v

The relevance tests in test_search_relevance.py require test_docs/ to have a pre-built index. If missing, the fixture rebuilds it automatically.

Requirements

  • Python 3
  • numpy
  • pytest (for running tests)
  • A terminal that supports curses (minimum 40x10)
