A terminal-based semantic search engine powered by a from-scratch Word2Vec (skip-gram) implementation. Index .txt documents, then search them interactively with ranked results and live previews.
```bash
pip install -r requirements.txt
python main.py --dir test_docs
```

This launches the TUI pointed at the `test_docs/` directory. On first run you'll see a prompt to index; press `i` to begin.
For best results, use pretrained embeddings to search your documents. This gives the search engine a broad vocabulary — words like "squid" will work even if they don't appear in your target documents.
Three pretrained models are included in example_models/, all trained on WikiText:
| Model | Description |
|---|---|
| `example_models/wikitext_Actual` | Default model, trained with subsampling and vocabulary pruning |
| `example_models/wikitext_NoSubsampling` | Trained without subsampling of frequent words |
| `example_models/wikitext_NoPruning` | Trained without vocabulary pruning |
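For context on what subsampling means here: standard Word2Vec randomly drops very frequent words during training so rare words get relatively more gradient updates. A minimal sketch of that scheme follows; the threshold value and the exact formula used by this repo are assumptions, not taken from its source:

```python
import math
import random

def subsample_prob(word_count: int, total_count: int, t: float = 1e-5) -> float:
    """Probability of KEEPING a word, per the standard subsampling formula.

    Frequent words (freq >> t) get a small keep probability; rare words
    are always kept (probability capped at 1.0).
    """
    freq = word_count / total_count
    return min(1.0, math.sqrt(t / freq))

def subsample(tokens, counts, total, t=1e-5):
    """Randomly drop frequent tokens from a token stream."""
    return [w for w in tokens if random.random() < subsample_prob(counts[w], total, t)]
```

With this formula, a word making up half the corpus is kept less than 1% of the time, while a word at or below the threshold frequency is always kept.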
```bash
python main.py --cli --index \
    --dir test_docs/ \
    --model-path example_models/wikitext_Actual
```

This uses the pretrained embeddings to vectorize your documents; no training needed.
```bash
# TUI mode
python main.py --dir test_docs/ --model-path example_models/wikitext_Actual

# CLI mode
python main.py --cli --search "squid" --dir test_docs/
```

```bash
# Default (indexes current directory)
python main.py

# Specify a document directory
python main.py --dir /path/to/documents

# With pretrained model
python main.py --dir test_docs/ --model-path example_models/wikitext_Actual
```

When launched with `--model-path`, pressing `i` will index documents using the pretrained model (fast, no training). Without it, pressing `i` trains and indexes in one step (legacy mode).
```
┌──────────────────────────────────────────────┐
│ W2V Search Engine                            │  Title bar
├──────────────────────────────────────────────┤
│ Dir: /path/to/documents                      │
│ Search: your query here                      │  Search input
│ ────────────────                             │
├──────────────────────────────────────────────┤
│ Results (N)                                  │
│ 1. [+0.8542] document1.txt                   │  Ranked results
│ 2. [+0.7231] document2.txt                   │  with similarity
│ 3. [+0.6892] document3.txt ◄── selected      │  scores
│                                              │
│ Preview: document3.txt                       │
│ Lorem ipsum dolor sit amet, consectetur...   │  Scrollable
│ adipiscing elit. Sed do eiusmod tempor...    │  preview
├──────────────────────────────────────────────┤
│ [i] Index [Enter] Search [↑↓] Navigate       │  Status bar
│ [PgUp/Dn] Scroll [Esc] Quit                  │
└──────────────────────────────────────────────┘
```
| Key | Action |
|---|---|
| Any printable character | Insert into query and search |
| Backspace | Delete character before cursor |
| Delete | Delete character at cursor |
| Enter | Execute search |
| ← / → | Move cursor within query |
| Home / Ctrl+A | Jump to start of query |
| End / Ctrl+E | Jump to end of query |
| Ctrl+U | Clear entire query |
| ↑ / ↓ | Navigate results list |
| PgUp / PgDn | Scroll document preview |
| i | Start indexing (only when query is empty) |
| q | Quit (only when query is empty) |
| Esc | Quit |
| Key | Action |
|---|---|
| q | Quit |
All other input is ignored while indexing is in progress.
Before you can search, the documents must be indexed. If no index exists, the status bar will prompt you to press i.
When indexing starts:
- Documents in the target directory are scanned (`.txt` files only)
- If using a pretrained model: documents are vectorized immediately
- If no pretrained model: vocabulary is built, Word2Vec trains with live progress, then documents are vectorized
- The index is saved to `.w2v_index/` for reuse
Once complete, the status updates to "Ready to search!" and you're returned to search mode.
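The vectorization step can be pictured as a weighted average of word embeddings. Here is a minimal sketch assuming plain IDF weighting; the repo's actual indexer also applies SIF weighting and PCA common-component removal (per its test suite), and the function name here is illustrative:

```python
import numpy as np

def vectorize_doc(tokens, embeddings, vocab, idf):
    """IDF-weighted average of word vectors; out-of-vocabulary words are skipped."""
    vecs, weights = [], []
    for w in tokens:
        if w in vocab:
            vecs.append(embeddings[vocab[w]])
            weights.append(idf.get(w, 1.0))
    if not vecs:
        # No known words: return the zero vector so the doc never ranks highly.
        return np.zeros(embeddings.shape[1])
    return np.average(np.array(vecs), axis=0, weights=weights)
```

IDF weighting means common words (low IDF) pull the document vector less than distinctive words, which is why topic-specific terms dominate the ranking.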
Type your query and results appear instantly, ranked by cosine similarity. Each result shows:
```
1. [+0.8542] machine_learning.txt
```
- Rank — position in results (top 10 shown)
- Score — cosine similarity between query and document vectors
- Filename — the matched document
Use ↑/↓ to select a result and see its preview below.
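The ranking itself is just cosine similarity between the query vector and each stored document vector. A minimal sketch (function and variable names are illustrative, not the repo's API):

```python
import numpy as np

def rank(query_vec, doc_vectors):
    """Return (filename, cosine similarity) pairs, best first, top 10 only."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    scored = []
    for name, vec in doc_vectors.items():
        d = vec / (np.linalg.norm(vec) + 1e-12)
        scored.append((name, float(np.dot(q, d))))
    return sorted(scored, key=lambda x: x[1], reverse=True)[:10]
```

Because both vectors are normalized, scores fall in [-1, +1], which matches the `[+0.8542]`-style scores shown in the results list.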
The bottom section shows the full text of the selected result, word-wrapped to your terminal width. Use PgUp/PgDn to scroll through longer documents.
```bash
# Index with a pretrained model (main.py entry point)
python main.py --cli --index --dir test_docs/ --model-path example_models/wikitext_Actual

# Index with training (legacy mode)
python main.py --cli --index --dir test_docs

# Search
python main.py --cli --search "ocean creatures" --dir test_docs/ --top 5

# Equivalent w2v.cli entry point
python -m w2v.cli index --dir test_docs/ --model-path example_models/wikitext_Actual
python -m w2v.cli search "ocean creatures" --dir test_docs/ --top 5
```

| Flag | Default | Description |
|---|---|---|
| `--embed-dim` | 100 | Embedding dimensions |
| `--window` | 5 | Context window size |
| `--epochs` | 5 | Training epochs |
| `--min-count` | 1 | Minimum word frequency |
| `--neg-samples` | 5 | Negative samples |
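To make `--window` and `--neg-samples` concrete, here is a hedged sketch of how skip-gram training pairs and negative samples are typically generated. Note one simplification: real Word2Vec (and this repo, judging by its unigram-table tests) draws negatives from a unigram^0.75 distribution, while this sketch samples uniformly for brevity.

```python
import random

def skipgram_pairs(token_ids, window=5):
    """Yield (center, context) pairs for every position within the window."""
    for i, center in enumerate(token_ids):
        lo, hi = max(0, i - window), min(len(token_ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, token_ids[j]

def with_negatives(pairs, vocab_size, k=5):
    """Attach k random negative word ids to each positive (center, context) pair."""
    for center, ctx in pairs:
        negs = [random.randrange(vocab_size) for _ in range(k)]
        yield center, ctx, negs
```

A larger `--window` produces more training pairs per token (broader context), and `--neg-samples` controls how many "fake" context words each positive pair is contrasted against.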
| File | Contents |
|---|---|
| `embeddings.npy` | Trained word embeddings (NumPy binary) |
| `vocab.json` | Word-to-index mappings and word counts |
| `config.json` | Hyperparameters and corpus info |
| File | Contents |
|---|---|
| `doc_vectors.json` | Precomputed document vectors |
| `config.json` | Model path reference, IDF values |
When using legacy mode (no pretrained model), .w2v_index/ also contains embeddings.npy and vocab.json.
Delete .w2v_index/ to force a re-index.
Multi-word queries work best. Single ambiguous words may produce surprising results because the embedding reflects whatever context that word appeared in most during training:
```bash
# Weak: single word, ambiguous context
python main.py --cli --search "fire" --dir test_docs/
# → automotive.txt (not cooking)

# Strong: multiple topic-specific words
python main.py --cli --search "fire grill roast cooking" --dir test_docs/
# → cooking_recipes.txt
```

Some known limitations with the wikitext pretrained model:
| Query | Expected | Actual #1 | Why |
|---|---|---|---|
| `fire` | cooking_recipes.txt | automotive.txt | "fire" appears in engine/military contexts in wikitext |
| `squid` | ocean_marine_life.txt | python_programming.txt | Rare word, noisy embedding; ocean is a close #2 |
| `crypto` | cybersecurity.txt | ancient_history.txt | "crypto" co-occurs with Greek/historical roots in wikitext |
To work around this, use more specific queries or train on a domain-relevant corpus.
```bash
pip install pytest
python -m pytest tests/ -v
```

The test suite (208 tests) covers:
| File | Tests | What it covers |
|---|---|---|
| `test_tokenizer.py` | 17 | Preprocessing, vocabulary building, token-to-id conversion, subsampling |
| `test_word2vec.py` | 25 | Model init, unigram table, training pairs, sigmoid, training loop, save/load |
| `test_indexer.py` | 22 | Document scanning, SIF weights, vectorization, PCA removal, full indexing pipeline |
| `test_searcher.py` | 18 | Searcher init, query vectorization, search ranking, model path resolution |
| `test_progress.py` | 9 | Progress bar output, train/index callbacks |
| `test_cli.py` | 10 | Both CLI entry points (main.py and w2v.cli) |
| `test_integration.py` | 11 | End-to-end pipelines, reindexing, edge cases, separate model/index dirs |
| `test_search_relevance.py` | 63 | Search quality against test_docs/: topic queries, natural language, score properties, CLI |
Run a single file:
```bash
python -m pytest tests/test_search_relevance.py -v
```

Run a single test class:

```bash
python -m pytest tests/test_search_relevance.py::TestDirectTopicQueries -v
```

The relevance tests in `test_search_relevance.py` require `test_docs/` to have a pre-built index. If it is missing, the fixture rebuilds it automatically.
- Python 3
- numpy
- pytest (for running tests)
- A terminal that supports curses (minimum 40x10)