Seeklet

Seeklet is a minimal educational web search engine written in Python.

It exists to make the core ideas of crawling, indexing, and ranking easy to read in a small codebase. The project deliberately favors simplicity, readability, and contributor friendliness over production-scale complexity.

What Seeklet includes

Current MVP behavior:

seeded website crawling from one or more URLs
same-host crawl scoping
robots.txt support
HTML title, visible text, and link extraction
normalized URL handling and tokenization
SQLite-backed local persistence
inverted index rebuilding on crawl
BM25 ranking
result snippet generation
CLI commands for crawl, search, stats, and reset
tests with pytest
linting and formatting with ruff
GitHub Actions CI
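
The crawl-scoping items in the list above (normalized URLs, same-host scoping, robots.txt support) can be illustrated with the standard library alone. This is a minimal sketch, not Seeklet's actual code: the function names `normalize_url`, `same_host`, `load_robots`, and `is_crawlable`, the `seeklet-example` user agent, and the exact normalization rules are all assumptions made for illustration.

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def normalize_url(url: str) -> str:
    """Drop the fragment, lowercase scheme and host, default an empty path to '/'."""
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    path = parts.path or "/"
    # Assumption: lowercasing the whole netloc is acceptable for this sketch.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def same_host(url: str, seed: str) -> bool:
    """Return True when both URLs share the same hostname."""
    return urlsplit(url).hostname == urlsplit(seed).hostname


def load_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_crawlable(url: str, seed: str, robots: RobotFileParser,
                 user_agent: str = "seeklet-example") -> bool:
    """Combine same-host scoping with the robots.txt rules."""
    return same_host(url, seed) and robots.can_fetch(user_agent, url)
```

Parsing robots.txt from a string keeps the sketch testable offline; a real crawler would fetch the file once per host before applying the check.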

What Seeklet does not include

These are deliberate non-goals for the MVP:

JavaScript rendering
asynchronous or distributed crawling
PageRank or link-analysis ranking
phrase, boolean, fuzzy, or vector search
a REST API
a browser UI

Project principles

Seeklet follows a small set of stable rules:

keep the code easy to inspect
prefer straightforward solutions over clever ones
add dependencies only when they clearly help
preserve educational value when making improvements
prefer behavior-preserving refactors over broad rewrites

Core documentation

The project keeps its documentation in four primary files:

README.md — overview, setup, and day-one usage
docs/spec.md — product scope, constraints, and quality bar
docs/architecture.md — module boundaries, data flow, and tradeoffs
docs/roadmap.md — current status and next milestones

Requirements

Python 3.12
Linux or macOS
internet access for crawling live websites

Installation

git clone https://github.com/0xklkuo/seeklet.git
cd seeklet
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

If your editor reports missing imports such as bs4, point it at .venv/bin/python.

Quickstart

seeklet crawl https://example.com --max-pages 20 --max-depth 1
seeklet stats
seeklet search "example domain"
seeklet reset --yes

Example search output shape:

1. Example Domain
   URL: https://example.com/
   Score: 1.2345
   Snippet: This domain is for use in illustrative examples...

Exact scores and snippets depend on the crawled site.
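
One simple way to produce a snippet like the one in the output above is to cut a fixed-width window around the first matching query term. The sketch below assumes that approach; `make_snippet` and its `width` parameter are hypothetical names, and Seeklet's own snippet logic may differ.

```python
def make_snippet(text: str, terms: list[str], width: int = 60) -> str:
    """Return a window of roughly `width` characters around the earliest term match."""
    lowered = text.lower()
    # Collect the first position of every term that occurs in the text.
    positions = [p for p in (lowered.find(t.lower()) for t in terms) if p != -1]
    # Center the window on the earliest match, or fall back to the start of the text.
    start = max(0, min(positions) - width // 2) if positions else 0
    end = min(len(text), start + width)
    # Collapse runs of whitespace so the snippet prints on one line.
    body = " ".join(text[start:end].split())
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return f"{prefix}{body}{suffix}"
```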

CLI reference

`seeklet crawl`

Crawl and index one or more seed URLs.

seeklet crawl SEED_URL [SEED_URL ...] [--db PATH] [--max-pages N] [--max-depth N] [--delay-seconds N]

Options:

--db — path to the SQLite database
--max-pages — maximum number of pages to crawl
--max-depth — maximum crawl depth from the seed URLs
--delay-seconds — delay between requests in seconds
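
To show how `--max-pages` and `--max-depth` can bound a crawl, here is a hedged sketch of a breadth-first traversal over an in-memory link graph. The function `crawl_order` and the `links` mapping are hypothetical stand-ins for real fetching, and treating seed URLs as depth 0 is an assumption about the CLI's semantics.

```python
from collections import deque


def crawl_order(seeds: list[str], links: dict[str, list[str]],
                max_pages: int, max_depth: int) -> list[str]:
    """Visit pages breadth-first, stopping at the page and depth limits."""
    queue: deque[tuple[str, int]] = deque()
    seen: set[str] = set()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append((seed, 0))  # assumption: seeds sit at depth 0

    visited: list[str] = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        # A real crawler would fetch here and sleep for the configured delay.
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited
```

Breadth-first order means the page budget is spent on pages closest to the seeds first, which is usually what a small educational crawl wants.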

`seeklet search`

Search the local index.

seeklet search "query text" [--db PATH] [--top-k N]

Options:

--db — path to the SQLite database
--top-k — maximum number of results to return
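
The effect of `--top-k` can be sketched as a selection of the highest-scoring documents. The helper below is illustrative only; `top_k` and the shape of the `scores` mapping are assumptions, not Seeklet's internal API.

```python
import heapq


def top_k(scores: dict[str, float], k: int) -> list[tuple[str, float]]:
    """Return up to k (url, score) pairs, highest score first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```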

`seeklet stats`

Show index statistics.

seeklet stats [--db PATH]

`seeklet reset`

Delete local index data.

seeklet reset [--db PATH] [--yes]

Architecture at a glance

seed URLs
  -> crawl allowed pages
  -> fetch HTML
  -> extract title, text, and links
  -> normalize URLs and tokenize text
  -> rebuild SQLite index
  -> execute BM25 search
  -> print ranked CLI results

For deeper detail, see docs/spec.md and docs/architecture.md.
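
The tokenize, index, and rank steps of the flow above can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not the project's implementation: `tokenize`, `build_index`, and `bm25_scores` are hypothetical names, the index lives in a dict rather than SQLite, and the IDF form and the `k1` and `b` defaults are one common BM25 variant rather than values confirmed by the source.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, list[str]]) -> dict[str, dict[str, int]]:
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index: dict[str, dict[str, int]] = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index


def bm25_scores(query: list[str], docs: dict[str, list[str]],
                index: dict[str, dict[str, int]],
                k1: float = 1.2, b: float = 0.75) -> dict[str, float]:
    """Score every document that contains at least one query term."""
    if not docs:
        return {}
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    scores: dict[str, float] = defaultdict(float)
    for term in query:
        postings = index.get(term, {})
        if not postings:
            continue
        df = len(postings)
        # Non-negative IDF variant (assumption; other BM25 forms exist).
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings.items():
            # Length normalization: longer-than-average documents are penalized.
            norm = 1 - b + b * len(docs[doc_id]) / avg_len
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return dict(scores)
```

Only documents that appear in a query term's postings are touched, which is the main reason to build an inverted index before ranking.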

Development

Run the project from an activated virtual environment:

source .venv/bin/activate
make check

Common targets:

make check — run lint, format check, and tests
make format — apply Ruff formatting
make lint — run Ruff lint checks
make test — run the pytest suite

Run the CLI directly:

python -m seeklet --help

Project layout

src/seeklet/
    __init__.py
    __main__.py
    cli.py
    config.py
    crawl.py
    extract.py
    index.py
    models.py
    normalize.py
    ranking.py
    search.py
    snippet.py
    storage.py

tests/
docs/
.github/workflows/

Current status

Seeklet is at the educational MVP stage. It is ready to crawl a small site, rebuild a local index, and perform BM25-based keyword search from the CLI. It is not intended for large-scale crawling or advanced retrieval features yet.

See docs/roadmap.md for the current refactor decisions and follow-up work.
