dedux - Semantic Overlap Detector

Find what your notes are really saying twice.


dedux detects semantically overlapping notes in your knowledge base -- Obsidian vaults, Zettelkasten, or any markdown directory. No embeddings, no API keys, no cloud. Just pure local text analysis.

Hash-based deduplication tools find only exact duplicates. dedux finds semantic overlap: notes that cover the same topic in different words or from different angles, often without you realizing it.

How It Works

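dedux combines character n-gram shingling with word n-gram analysis and scores each pair of files by Jaccard similarity (shared n-grams divided by all distinct n-grams across both files). This flags files that discuss the same concepts even when their wording is not identical.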
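To make the terms concrete, here is a minimal sketch of character shingling, word n-grams, and Jaccard similarity. The function names, the default n-gram sizes, and the whitespace/lowercase normalization are assumptions for illustration; they are not taken from dedux's source and its actual implementation may differ.

```python
def char_shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of overlapping character n-grams (shingles) in text."""
    # Assumed normalization: lowercase and collapse runs of whitespace.
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def word_ngrams(text: str, n: int = 2) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0  # Two empty sets are treated as having no overlap.
    return len(a & b) / len(a | b)
```

Character shingles tolerate small spelling and inflection differences, while word n-grams capture shared phrases; a tool could reasonably blend the two scores, though how dedux weights them is not stated here.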

  1. Extract key phrases from each markdown file (headings, body, tags)
  2. Compute similarity between all file pairs
  3. Flag pairs above your threshold
  4. Show overlapping phrases so you can decide whether to merge
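The four steps above can be sketched as a single function. This is an illustrative outline under stated assumptions, not dedux's implementation: `find_overlaps` and `word_bigrams` are hypothetical names, notes are passed in as an in-memory mapping instead of being read from disk, and word bigrams stand in for whatever key-phrase extraction dedux actually performs.

```python
from itertools import combinations


def word_bigrams(text: str) -> set[str]:
    """Step 1 (simplified): treat every pair of adjacent words as a phrase."""
    words = text.lower().split()
    return {" ".join(words[i:i + 2]) for i in range(len(words) - 1)}


def find_overlaps(
    notes: dict[str, str], threshold: float = 0.5
) -> list[tuple[str, str, float, list[str]]]:
    """Return (file_a, file_b, score, shared_phrases) for pairs at or above threshold."""
    phrases = {name: word_bigrams(body) for name, body in notes.items()}
    results = []
    # Step 2: compare every unordered pair of files exactly once.
    for a, b in combinations(sorted(phrases), 2):
        union = phrases[a] | phrases[b]
        if not union:
            continue  # Both files are empty; nothing to compare.
        shared = phrases[a] & phrases[b]
        score = len(shared) / len(union)  # Jaccard similarity
        # Step 3: keep only pairs that meet the threshold.
        if score >= threshold:
            # Step 4: include the shared phrases so the user can decide to merge.
            results.append((a, b, score, sorted(shared)))
    results.sort(key=lambda r: r[2], reverse=True)
    return results
```

Comparing all pairs is quadratic in the number of files, which is usually acceptable for a personal vault of a few hundred or a few thousand notes.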

Installation

pip install dedux

Or install from source:

git clone https://github.com/izag8216/dedux.git
cd dedux
pip install -e .

Usage

Scan a directory

# Find overlapping notes (default 50% threshold)
dedux scan ./vault/

# Only high-overlap pairs
dedux scan ./vault/ --threshold 0.7

# Output as JSON
dedux scan ./vault/ --format json -o results.json

# Output as CSV
dedux scan ./vault/ --format csv -o overlaps.csv

Compare two files

dedux diff note1.md note2.md

Export results

dedux export ./vault/ --format json -o results.json
dedux export ./vault/ --format markdown -o report.md

Output Formats

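| Format   | Flag                | Use Case                  |
|----------|---------------------|---------------------------|
| Text     | `--format text`     | Terminal review (default) |
| JSON     | `--format json`     | Programmatic processing   |
| CSV      | `--format csv`      | Spreadsheet analysis      |
| Markdown | `--format markdown` | Wikis, documentation      |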

Example Output

dedux scan results
----------------------------------------
  Files scanned: 42
  Threshold:     50%
  Overlaps found: 3
  Time:          0.34s

  Top overlapping pairs:

   1. [78.0%] ████████████████░░░░
      python-notes.md
      python-overview.md

   2. [62.0%] ████████████░░░░░░░░
      cli-design.md
      terminal-tools.md
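The bars in the sample output are 20 characters wide, with the filled portion proportional to the score. A minimal sketch of how such a bar could be rendered follows; `render_bar` is a hypothetical helper, and rounding to the nearest cell is an assumption that happens to match the two bars shown.

```python
def render_bar(score: float, width: int = 20) -> str:
    """Render a similarity score in [0, 1] as a fixed-width block bar."""
    clamped = max(0.0, min(1.0, score))
    filled = round(clamped * width)  # Assumed: round to the nearest cell.
    return "█" * filled + "░" * (width - filled)


def format_pair(rank: int, score: float) -> str:
    """Format one ranked line in the style of the sample output."""
    return f"{rank:>4}. [{score * 100:.1f}%] {render_bar(score)}"
```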

Why dedux?

| dedux             | md-dedupe         | Embeddings          |
|-------------------|-------------------|---------------------|
| Semantic overlap  | Exact hash match  | Semantic similarity |
| No API key        | No API key        | Requires API        |
| Zero dependencies | Zero dependencies | Heavy ML stack      |
| Offline           | Offline           | Often online        |
| Markdown-aware    | Any file          | Any text            |

Tech Stack

  • Python 3.10+ -- No external dependencies
  • argparse -- CLI interface
  • difflib -- Sequence matching
  • collections -- Frequency analysis
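For readers unfamiliar with the two standard-library modules listed above, here is a small sketch of what sequence matching and frequency analysis look like in practice. It uses only documented stdlib calls (`difflib.SequenceMatcher.ratio` and `collections.Counter.most_common`); the helper names, the punctuation stripping, and the minimum word length are assumptions, and how dedux itself uses these modules is not shown here.

```python
from collections import Counter
from difflib import SequenceMatcher


def sequence_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] based on the longest matching blocks of characters."""
    return SequenceMatcher(None, a, b).ratio()


def top_terms(text: str, k: int = 3, min_len: int = 4) -> list[tuple[str, int]]:
    """Return the k most frequent words of at least min_len characters."""
    # Assumed cleanup: strip common punctuation and lowercase each word.
    words = [w.strip(".,;:!?()[]").lower() for w in text.split()]
    counts = Counter(w for w in words if len(w) >= min_len)
    return counts.most_common(k)
```

`SequenceMatcher` is order-sensitive, so it suits comparing individual phrases or headings, whereas set-based measures such as Jaccard ignore order and suit whole-file comparison.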

Development

# Install dev dependencies
pip install -e .
pip install pytest

# Run tests
pytest tests/ -v

# Run CLI directly
python -m dedux.cli scan ./tests/fixtures/

License

MIT License -- see LICENSE for details.
