leann watch hard-crashes on media/binary files, and over-scans when a root-level path is indexed #345

@ArtifexSystems

Version: leann-core 0.3.7
Env: WSL2, Python 3.12

Problem A — leann watch hard-crashes on media/binary files

leann watch's change detection hashes file contents via a LlamaIndex SimpleDirectoryReader with the full default extractor set:

cli.py _detect_build_changes → sync.py detect_changes → sync.py generate_file_hashes
  → reader.iter_data() → SimpleDirectoryReader.load_file → readers/file/video_audio/base.py
  → ImportError: Please install OpenAI whisper ...

So a single audio or video file anywhere in the scanned roots takes the whole watcher down with an unhandled ImportError (image and other binary files also emit load errors). Notably, leann build survives the same tree because it honors --file-types (required_exts); the watch/sync hash scan does not apply that filter, which is inconsistent. Change detection should restrict itself to the index's configured file types, or at least skip unreadable files gracefully, rather than instantiating media readers.
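To illustrate the suggested behavior, here is a minimal sketch of a hash scan that filters by extension before touching any file. It is not leann's code: `hash_indexed_files` is a hypothetical helper, and it swaps the SimpleDirectoryReader-based hashing for a plain `hashlib` digest of the raw bytes, on the assumption that change detection only needs to know whether content changed.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable


def hash_indexed_files(roots: Iterable[str], required_exts: Iterable[str]) -> Dict[str, str]:
    """Hypothetical helper: hash only files whose suffix is in required_exts."""
    exts = {e.lower() for e in required_exts}
    hashes: Dict[str, str] = {}
    for root in roots:
        root_path = Path(root)
        # A root may be a single file or a directory to walk recursively.
        candidates = [root_path] if root_path.is_file() else sorted(root_path.rglob("*"))
        for path in candidates:
            if not path.is_file() or path.suffix.lower() not in exts:
                continue  # media/binary files are filtered out before being opened
            try:
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
            except OSError:
                continue  # unreadable file: skip it instead of crashing the watcher
            hashes[str(path)] = digest
    return hashes
```

With `required_exts={".py", ".md"}`, an `.mp3` or `.png` in the tree is never read, so no media reader (and no whisper import) is involved.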

Problem B — indexing a root-level file expands the scan to the whole repo

If --docs includes any loose file at the repo root (e.g. README.md), the watcher's resolved sync roots collapse to the repo root. It then crawls everything, including node_modules/, build dirs, and vendored mirrors, and trips Problem A on whatever binary/media it finds:

📂 Indexing 15 paths:
    1. /home/.../my-repo          ← entire repo
Failed to load .../assets/img/icon/wasm.png ...
ImportError: Please install OpenAI whisper ...
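One way to avoid the collapse is to keep individually listed files as files rather than promoting them to their parent directory. The sketch below shows that idea only; `split_scope` and `in_scope` are hypothetical names, and how leann actually resolves its sync roots is not confirmed by anything above.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def split_scope(docs: Iterable[str]) -> Tuple[List[Path], List[Path]]:
    """Hypothetical helper: separate directory roots from individually listed files."""
    dirs: List[Path] = []
    files: List[Path] = []
    for entry in docs:
        p = Path(entry).resolve()
        # Directories are walked; anything else is tracked as a single path.
        (dirs if p.is_dir() else files).append(p)
    return dirs, files


def in_scope(path: str, dirs: List[Path], files: List[Path]) -> bool:
    """True if path is a listed file or lies under one of the listed directories."""
    p = Path(path).resolve()
    return p in files or any(p.is_relative_to(d) for d in dirs)
```

Under this scheme, `--docs ./src README.md` would watch `src/` recursively plus the single file `README.md`, and a path such as `assets/img/icon/wasm.png` at the repo root would fall outside the scope.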

Repro

  1. Build an index whose --docs includes a repo-root file plus subdirs, in a repo that also contains any media/binary file (images count), e.g. leann build x --docs ./src README.md --file-types .py,.md.
  2. Start leann watch x → it scans the repo root and crashes on the first media file.

Expected

Watch honors the index's configured file types / scan scope, and never crashes on files it wouldn't index anyway.

Workaround

Pass only clean subdirectories to --docs (no root-level files), and keep media/binaries out of indexed dirs.
