Seeklet is a minimal educational web search engine written in Python.
It exists to make the core ideas of crawling, indexing, and ranking easy to read in a small codebase. The project deliberately favors simplicity, readability, and contributor friendliness over production-scale complexity.
Current MVP behavior:
- seeded website crawling from one or more URLs
- same-host crawl scoping
- robots.txt support
- HTML title, visible text, and link extraction
- normalized URL handling and tokenization
- SQLite-backed local persistence
- inverted index rebuilding on crawl
- BM25 ranking
- result snippet generation
- CLI commands for crawl, search, stats, and reset
- tests with pytest
- linting and formatting with ruff
- GitHub Actions CI
These are deliberate non-goals for the MVP:
- JavaScript rendering
- asynchronous or distributed crawling
- PageRank or link-analysis ranking
- phrase, boolean, fuzzy, or vector search
- a REST API
- a browser UI
Seeklet follows a small set of stable rules:
- keep the code easy to inspect
- prefer straightforward solutions over clever ones
- add dependencies only when they clearly help
- preserve educational value when making improvements
- prefer behavior-preserving refactors over broad rewrites
The project now keeps its documentation in four primary files:
- README.md: overview, setup, and day-one usage
- docs/spec.md: product scope, constraints, and quality bar
- docs/architecture.md: module boundaries, data flow, and tradeoffs
- docs/roadmap.md: current status and next milestones
- Python 3.12
- Linux or macOS
- internet access for crawling live websites
git clone https://github.com/0xklkuo/seeklet.git
cd seeklet
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
If your editor reports missing imports such as bs4, point it at .venv/bin/python.
seeklet crawl https://example.com --max-pages 20 --max-depth 1
seeklet stats
seeklet search "example domain"
seeklet reset --yes
Example search output shape:
1. Example Domain
URL: https://example.com/
Score: 1.2345
Snippet: This domain is for use in illustrative examples...
Exact scores and snippets depend on the crawled site.
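One simple way to produce a snippet like the one shown is to cut a fixed-width window around the first query term found in the page text. The sketch below assumes that approach. `make_snippet`, its `width` parameter, and the substring-based matching are hypothetical and may differ from what snippet.py does.

```python
import re


def make_snippet(text: str, query: str, width: int = 80) -> str:
    """Hypothetical helper: return a window of text around the first query hit."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    lowered = text.lower()
    # Character offsets of each query term that occurs in the text.
    positions = [pos for term in terms if (pos := lowered.find(term)) != -1]
    # Start a little before the earliest hit, or at the beginning if no term matches.
    start = max(0, min(positions) - width // 4) if positions else 0
    end = min(len(text), start + width)
    body = " ".join(text[start:end].split())  # collapse runs of whitespace
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return f"{prefix}{body}{suffix}"
```

This window can cut words in half at either edge. Snapping `start` and `end` to word boundaries would be a natural refinement.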
Crawl and index one or more seed URLs.
seeklet crawl SEED_URL [SEED_URL ...] [--db PATH] [--max-pages N] [--max-depth N] [--delay-seconds N]
Options:
- --db: path to the SQLite database
- --max-pages: maximum number of pages to crawl
- --max-depth: maximum crawl depth from the seed URLs
- --delay-seconds: delay between requests in seconds
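To show how the page and depth limits might interact, here is a minimal breadth-first sketch over an in-memory link graph. It assumes seed URLs sit at depth 0 and that the page limit counts visited pages. `crawl_order` and `get_links` are hypothetical names, and Seeklet's crawl.py may interpret the flags differently.

```python
from collections import deque
from collections.abc import Callable, Iterable


def crawl_order(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    max_pages: int,
    max_depth: int,
) -> list[str]:
    """Hypothetical helper: return URLs in the order a bounded BFS visits them."""
    unique_seeds = list(dict.fromkeys(seeds))  # de-duplicate, keep order
    seen = set(unique_seeds)
    queue = deque((url, 0) for url in unique_seeds)
    visited: list[str] = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# A tiny in-memory site standing in for fetched pages.
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
order = crawl_order(["a"], lambda url: graph.get(url, []), max_pages=10, max_depth=1)
```

A real crawl loop would also pause between requests, for example with `time.sleep`, to honor the configured delay.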
Search the local index.
seeklet search "query text" [--db PATH] [--top-k N]
Options:
- --db: path to the SQLite database
- --top-k: maximum number of results to return
Show index statistics.
seeklet stats [--db PATH]
Delete local index data.
seeklet reset [--db PATH] [--yes]
seed URLs
-> crawl allowed pages
-> fetch HTML
-> extract title, text, and links
-> normalize URLs and tokenize text
-> rebuild SQLite index
-> execute BM25 search
-> print ranked CLI results
For deeper detail, see docs/spec.md and docs/architecture.md.
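The tokenize, index, and rank steps in the flow above can be sketched in memory as follows. This is an illustration under stated assumptions, not Seeklet's implementation: the function names, the regex tokenizer, the `k1=1.5` and `b=0.75` defaults, and the `ln(1 + ...)` IDF variant are all choices made for this example, and the real project persists its index in SQLite rather than in dictionaries.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[int, str]) -> tuple[dict[str, dict[int, int]], dict[int, int]]:
    """Build an inverted index (term -> {doc_id: term frequency}) and doc lengths."""
    postings: dict[str, dict[int, int]] = defaultdict(dict)
    lengths: dict[int, int] = {}
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    return dict(postings), lengths


def bm25_search(query, postings, lengths, top_k=10, k1=1.5, b=0.75):
    """Score documents with BM25 and return the top_k (doc_id, score) pairs."""
    n_docs = len(lengths)
    if n_docs == 0:
        return []
    avg_len = sum(lengths.values()) / n_docs
    scores: dict[int, float] = defaultdict(float)
    for term in set(tokenize(query)):
        term_postings = postings.get(term, {})
        if not term_postings:
            continue
        df = len(term_postings)  # number of documents containing the term
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in term_postings.items():
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
    # Sort by descending score, breaking ties by document id.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


docs = {
    1: "example domain for examples",
    2: "another page about search engines",
    3: "domain names and domain registration",
}
postings, lengths = build_index(docs)
results = bm25_search("domain", postings, lengths)
```

Here document 3 ranks above document 1 because it mentions the query term twice, and document 2 is absent because it never contains the term.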
Run the project from an activated virtual environment:
source .venv/bin/activate
make check
Common targets:
- make check: run lint, format check, and tests
- make format: apply Ruff formatting
- make lint: run Ruff lint checks
- make test: run the pytest suite
Run the CLI directly:
python -m seeklet --help
src/seeklet/
    __init__.py
    __main__.py
    cli.py
    config.py
    crawl.py
    extract.py
    index.py
    models.py
    normalize.py
    ranking.py
    search.py
    snippet.py
    storage.py
tests/
docs/
.github/workflows/
Seeklet is at the educational MVP stage. It is ready to crawl a small site, rebuild a local index, and perform BM25-based keyword search from the CLI. It is not intended for large-scale crawling or advanced retrieval features yet.
See docs/roadmap.md for the current refactor decisions and follow-up work.