Seeklet

CI | License: MIT | Python 3.12

Seeklet is a minimal educational web search engine written in Python.

It exists to make the core ideas of crawling, indexing, and ranking easy to read in a small codebase. The project deliberately favors simplicity, readability, and contributor friendliness over production-scale complexity.

What Seeklet includes

Current MVP behavior:

  • seeded website crawling from one or more URLs
  • same-host crawl scoping
  • robots.txt support
  • HTML title, visible text, and link extraction
  • normalized URL handling and tokenization
  • SQLite-backed local persistence
  • inverted index rebuilding on crawl
  • BM25 ranking
  • result snippet generation
  • CLI commands for crawl, search, stats, and reset
  • tests with pytest
  • linting and formatting with ruff
  • GitHub Actions CI
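As an illustration of the URL handling, same-host scoping, and robots.txt items above, here is a minimal sketch using only the standard library. The function names (`normalize_url`, `same_host`, `build_robots`) and the exact normalization rules are assumptions made for this example; Seeklet's own `normalize.py` and `crawl.py` may behave differently.

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, default the path to '/'."""
    without_fragment, _ = urldefrag(url)
    parts = urlsplit(without_fragment)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def same_host(seed_url: str, candidate_url: str) -> bool:
    """Return True when both URLs share a host (hypothetical scoping rule)."""
    return urlsplit(seed_url).hostname == urlsplit(candidate_url).hostname


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that was fetched elsewhere."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


robots = build_robots("User-agent: *\nDisallow: /private/\n")
seed = normalize_url("HTTPS://Example.com")
link = normalize_url("https://example.com/private/page#top")
print(seed)
print(same_host(seed, link))
# "seeklet-sketch" is a made-up user agent string for this example.
print(robots.can_fetch("seeklet-sketch", link))
```

Parsing robots.txt from a string keeps the sketch offline; a real crawler would fetch the file from the seed host first.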

What Seeklet does not include

These are deliberate non-goals for the MVP:

  • JavaScript rendering
  • asynchronous or distributed crawling
  • PageRank or link-analysis ranking
  • phrase, boolean, fuzzy, or vector search
  • a REST API
  • a browser UI

Project principles

Seeklet follows a small set of stable rules:

  • keep the code easy to inspect
  • prefer straightforward solutions over clever ones
  • add dependencies only when they clearly help
  • preserve educational value when making improvements
  • prefer behavior-preserving refactors over broad rewrites

Core documentation

The project keeps its documentation in four primary files:

  • README.md — overview, setup, and day-one usage
  • docs/spec.md — product scope, constraints, and quality bar
  • docs/architecture.md — module boundaries, data flow, and tradeoffs
  • docs/roadmap.md — current status and next milestones

Requirements

  • Python 3.12
  • Linux or macOS
  • internet access for crawling live websites

Installation

git clone https://github.com/0xklkuo/seeklet.git
cd seeklet
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

If your editor reports missing imports such as bs4, point it at .venv/bin/python.

Quickstart

seeklet crawl https://example.com --max-pages 20 --max-depth 1
seeklet stats
seeklet search "example domain"
seeklet reset --yes

Example search output shape:

1. Example Domain
   URL: https://example.com/
   Score: 1.2345
   Snippet: This domain is for use in illustrative examples...

Exact scores and snippets depend on the crawled site.
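To show why snippets vary with the page text, here is one simple way a snippet could be produced: take a fixed-width window around the first query term found. This is a sketch under assumptions, not Seeklet's `snippet.py`; the `make_snippet` name and the `width` parameter are hypothetical.

```python
import re


def make_snippet(text: str, query: str, width: int = 60) -> str:
    """Return a window of text around the first query term found."""
    collapsed = " ".join(text.split())  # Collapse runs of whitespace.
    terms = [re.escape(term) for term in query.split()]
    match = re.search("|".join(terms), collapsed, flags=re.IGNORECASE) if terms else None
    # Center the window on the match, or start at the beginning when nothing matches.
    start = max(match.start() - width // 2, 0) if match else 0
    end = min(start + width, len(collapsed))
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(collapsed) else ""
    return f"{prefix}{collapsed[start:end]}{suffix}"


page_text = "This domain is for use in illustrative examples in documents."
print(make_snippet(page_text, "examples", width=30))
```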

CLI reference

seeklet crawl

Crawl and index one or more seed URLs.

seeklet crawl SEED_URL [SEED_URL ...] [--db PATH] [--max-pages N] [--max-depth N] [--delay-seconds N]

Options:

  • --db — path to the SQLite database
  • --max-pages — maximum number of pages to crawl
  • --max-depth — maximum crawl depth from the seed URLs
  • --delay-seconds — delay between requests in seconds

seeklet search

Search the local index.

seeklet search "query text" [--db PATH] [--top-k N]

Options:

  • --db — path to the SQLite database
  • --top-k — maximum number of results to return
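For readers new to BM25, the following sketch scores tokenized documents against a query and keeps the best `k` results, which is roughly what `--top-k` implies. It uses a common BM25 formulation with `k1 = 1.5` and `b = 0.75`; those constants, the IDF variant, and the function names are assumptions, and Seeklet's `ranking.py` may use different choices.

```python
import heapq
import math
from collections import Counter


def bm25_scores(
    query_terms: list[str],
    docs: dict[str, list[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> dict[str, float]:
    """Score each tokenized document against the query with a common BM25 variant."""
    if not docs:
        return {}
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    scores: dict[str, float] = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        if score > 0:
            scores[doc_id] = score
    return scores


def top_k(scores: dict[str, float], k: int) -> list[tuple[str, float]]:
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


docs = {
    "https://example.com/": ["example", "domain", "example"],
    "https://example.com/a": ["example", "page"],
    "https://example.com/b": ["other", "page"],
}
for url, score in top_k(bm25_scores(["example"], docs), k=2):
    print(f"{score:.4f}  {url}")
```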

seeklet stats

Show index statistics.

seeklet stats [--db PATH]

seeklet reset

Delete local index data.

seeklet reset [--db PATH] [--yes]
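The command shapes documented above map naturally onto `argparse` subcommands. The sketch below mirrors only the flags listed in this reference; the `build_parser` name, the destination names, the argument types, and the absence of default values are assumptions, so treat it as an illustration, not a copy of `cli.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented commands (defaults left unset)."""
    parser = argparse.ArgumentParser(prog="seeklet")
    sub = parser.add_subparsers(dest="command", required=True)

    crawl = sub.add_parser("crawl", help="Crawl and index one or more seed URLs.")
    crawl.add_argument("seed_urls", nargs="+", metavar="SEED_URL")
    crawl.add_argument("--db", metavar="PATH")
    crawl.add_argument("--max-pages", type=int, metavar="N")
    crawl.add_argument("--max-depth", type=int, metavar="N")
    crawl.add_argument("--delay-seconds", type=float, metavar="N")

    search = sub.add_parser("search", help="Search the local index.")
    search.add_argument("query")
    search.add_argument("--db", metavar="PATH")
    search.add_argument("--top-k", type=int, metavar="N")

    stats = sub.add_parser("stats", help="Show index statistics.")
    stats.add_argument("--db", metavar="PATH")

    reset = sub.add_parser("reset", help="Delete local index data.")
    reset.add_argument("--db", metavar="PATH")
    reset.add_argument("--yes", action="store_true")
    return parser


args = build_parser().parse_args(
    ["crawl", "https://example.com", "--max-pages", "20", "--max-depth", "1"]
)
print(args.command, args.seed_urls, args.max_pages, args.max_depth)
```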

Architecture at a glance

seed URLs
  -> crawl allowed pages
  -> fetch HTML
  -> extract title, text, and links
  -> normalize URLs and tokenize text
  -> rebuild SQLite index
  -> execute BM25 search
  -> print ranked CLI results

For deeper detail, see docs/spec.md and docs/architecture.md.
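The "tokenize text" and "rebuild SQLite index" steps can be pictured with a small inverted-index sketch. The `postings` table, its columns, and the `tokenize`, `rebuild_index`, and `lookup` helpers are hypothetical names chosen for this example; the schema in `storage.py` and `index.py` may look quite different.

```python
import re
import sqlite3
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rebuild_index(conn: sqlite3.Connection, pages: dict[str, str]) -> None:
    """Drop and recreate a postings table from a url -> text mapping."""
    conn.execute("DROP TABLE IF EXISTS postings")
    conn.execute(
        "CREATE TABLE postings (term TEXT, url TEXT, tf INTEGER, PRIMARY KEY (term, url))"
    )
    rows = [
        (term, url, count)
        for url, text in pages.items()
        for term, count in Counter(tokenize(text)).items()
    ]
    conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", rows)
    conn.commit()


def lookup(conn: sqlite3.Connection, term: str) -> list[tuple[str, int]]:
    """Return (url, term frequency) pairs for one term, most frequent first."""
    cursor = conn.execute(
        "SELECT url, tf FROM postings WHERE term = ? ORDER BY tf DESC, url", (term,)
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")  # In-memory database, nothing is written to disk.
rebuild_index(
    conn,
    {
        "https://example.com/": "Example Domain. This domain is for examples.",
        "https://example.com/a": "Another page about a domain",
    },
)
print(lookup(conn, "domain"))
```

Rebuilding the whole table on each crawl is the simplest approach and matches the "rebuild" wording above; incremental updates would need more bookkeeping.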

Development

Run the project from an activated virtual environment:

source .venv/bin/activate
make check

Common targets:

  • make check — run lint, format check, and tests
  • make format — apply Ruff formatting
  • make lint — run Ruff lint checks
  • make test — run the pytest suite

Run the CLI directly:

python -m seeklet --help

Project layout

src/seeklet/
    __init__.py
    __main__.py
    cli.py
    config.py
    crawl.py
    extract.py
    index.py
    models.py
    normalize.py
    ranking.py
    search.py
    snippet.py
    storage.py

tests/
docs/
.github/workflows/

Current status

Seeklet is at the educational MVP stage. It is ready to crawl a small site, rebuild a local index, and perform BM25-based keyword search from the CLI. It is not intended for large-scale crawling or advanced retrieval features yet.

See docs/roadmap.md for the current refactor decisions and follow-up work.
