
Conversation


@dubadub dubadub commented Nov 20, 2025

Added detailed research document analyzing issue #5 (duplicate search results).

Key findings:

  • Root cause: per-feed deduplication allows same recipe from different sources
  • Current UNIQUE(feed_id, external_id) constraint only prevents duplicates within a single feed
  • No content-based or cross-source deduplication exists

Proposed solutions:

  • Phase 1 (quick): fuzzy title-based post-search deduplication
  • Phase 2 (medium): content hash system for accurate matching
  • Phase 3 (long-term): canonical recipe system with source tracking

Document includes:

  • Detailed architecture analysis
  • Code references with line numbers
  • Multiple implementation approaches with pros/cons
  • Phased implementation plan
  • Sample code for each approach
  • Testing strategies and monitoring recommendations

Ready for implementation discussion and prioritization.

Updated research document with critical finding: the same recipe ID
appears multiple times in search results due to a bug in the indexing logic.

Root cause:
- When recipes are updated, Tantivy index_recipe() adds new document
- Old document with same recipe_id is NEVER deleted first
- Result: N updates = N duplicate documents with same ID

Evidence from user report:
- /recipes/2473 appears multiple times
- /recipes/2457 appears multiple times
- HTML inspection confirms same IDs, not just similar content

The fix (2 lines):
- In src/indexer/search.rs:89, add delete_term() before add_document()
- This ensures only one document per recipe_id exists
- Estimated effort: 30 minutes + reindex time

Updated implementation plan:
- Phase 0 (URGENT): Fix indexing bug
- Phase 1 (Optional): Post-search deduplication for content
- Phase 2-3: Content hash system and canonical recipes

This bug fix should resolve the immediate user-reported issue.
The content-based deduplication remains a separate enhancement.
Detailed plan covering two phases:

Phase 0 (Critical):
- Delete-before-add logic to fix same recipe_id duplicates
- Simple 2-line fix in search indexer
- Estimated: 1-2 hours

Phase 2 (Content Hash):
- SHA-256 based content hashing
- Database migration for content_hash column
- Deduplication at search query time
- Backfill script for existing recipes
- Estimated: 12-15 hours

Includes code examples, testing strategies, deployment steps,
monitoring metrics, and rollback procedures.
This commit adds comprehensive failing tests for both Phase 0 and Phase 2
of the deduplication implementation plan:

Phase 0 - Delete-Before-Add Logic:
- Added test_index_recipe_deletes_before_adding in src/indexer/search.rs
- Tests that updating a recipe removes old entry before adding new one
- Currently FAILS: old title remains in index (proves bug exists)

Phase 2 - Content Hash Deduplication:
- Added content hash calculation functions to src/db/recipes.rs:
  * calculate_content_hash() - SHA-256 hash of normalized title+content
  * normalize_title() - lowercase, trim, collapse whitespace
  * normalize_cooklang_content() - remove comments, normalize formatting
  * find_recipe_by_content_hash() - query by hash
  * find_duplicate_recipes() - get all recipes with same hash

- Added 7 unit tests for content hash normalization:
  * test_normalize_title
  * test_same_content_produces_same_hash
  * test_whitespace_differences_produce_same_hash
  * test_comments_dont_affect_hash
  * test_different_content_produces_different_hash
  * test_title_case_differences_produce_same_hash
  * test_block_comments_dont_affect_hash (FAILS - reveals bug)

- Added 3 integration tests in tests/deduplication_test.rs:
  * test_duplicate_detection_by_hash
  * test_different_recipes_have_different_hashes
  * test_find_recipe_by_content_hash
  * All FAIL with "no such column: content_hash" (expected - migration needed)

Dependencies:
- Added sha2 = "0.10" to Cargo.toml for hash calculation

These tests follow TDD principles - they are written before implementation
to guide development and ensure correctness. They will pass once the
features described in plan.md are implemented.
This commit implements both Phase 0 and Phase 2 of the deduplication plan
to eliminate duplicate recipes in search results.

Phase 0 (Critical - Delete-Before-Add Logic):
- Modified index_recipe() in src/indexer/search.rs to delete existing
  documents before adding new ones
- Added delete_term() call before add_document() to prevent duplicate
  entries for the same recipe_id
- Fixed search index schema to make ID field INDEXED (not just FAST) to
  enable proper deletion via delete_term()
- Added comprehensive unit test to verify delete-before-add behavior

Phase 2 (Content Hash Deduplication):
- Created database migration 006_add_content_hash.sql to add content_hash
  column with index
- Added content_hash field to Recipe and NewRecipe models
- Updated create_recipe() to store content_hash in database
- Implemented hash calculation functions:
  * calculate_content_hash() - SHA-256 hash of normalized title+content
  * normalize_title() - lowercase, trim, collapse whitespace
  * normalize_cooklang_content() - remove comments, normalize formatting
  * find_recipe_by_content_hash() - query by hash
  * find_duplicate_recipes() - get all recipes with same hash
- Updated GitHub indexer to calculate and set content_hash
- Added comprehensive unit tests for hash normalization
- Added integration tests for duplicate detection

All Tests Passing:
- 55 library tests pass
- 3 integration tests pass
- Delete-before-add test verifies no duplicate documents
- Hash normalization tests verify correct behavior
- Duplicate detection tests verify database queries work

Next Steps:
- Apply migration to production database
- Update RSS crawler to calculate content_hash (TODO)
- Implement search-time deduplication using content_hash (future enhancement)
- Add backfill script for existing recipes (future enhancement)

Resolves duplicate search results issue by ensuring each recipe appears
only once in the search index, even after updates.
- Add missing content_hash field to NewRecipe in parser integration test
- Refactor loop to while let to satisfy clippy::while_let_loop lint
- Apply rustfmt formatting fixes across codebase

claude bot commented Nov 20, 2025

Pull Request Review: Deduplication Implementation

Reviewed all 15 changed files addressing duplicate search results (Issue #5).

Summary

Strong Points:

  • Excellent research documentation
  • Two-phase approach (critical bug fix + content hashing)
  • Comprehensive test coverage (11 unit + 3 integration tests)
  • Delete-before-add fix correctly prevents duplicates

Critical Issues:

  • Schema breaking change requires full index rebuild
  • Content hash calculated but NOT used in search
  • RSS crawler integration missing (marked TODO)
  • No backfill strategy for existing recipes

Key Findings

1. Delete-Before-Add Fix ✅ (src/indexer/search.rs:91-97)

Correctly prevents the same recipe_id from appearing multiple times. Production-ready.

2. Content Normalization ⚠️ (src/db/recipes.rs:54-99)

Issues:

  • Block comment removal fragile (can panic on UTF-8)
  • Doesn't strip metadata sections (>> servings: 4)
  • Multiple string allocations

3. Missing Implementation ⚠️

Hash stored in DB but search doesn't deduplicate by hash. Adds overhead without benefit.

4. Migration Incomplete ⚠️ (migrations/006_add_content_hash.sql)

Adds column but doesn't populate it. No backfill script.

Recommendations

Must Fix:

  1. Document index rebuild requirement
  2. Either implement search deduplication OR remove hash from this PR
  3. Add backfill script for existing recipes
  4. Complete RSS crawler or remove from scope

Should Fix:

  5. Improve normalization (use regex, handle metadata)
  6. Add NULL hash handling tests
  7. Add progress logging for rebuild

Consider:

  8. Split into 2 PRs: bug fix (ready) + enhancement (needs work)

Verdict

  • Phase 0 (delete-before-add): Production-ready ✅
  • Phase 2 (content hashing): Incomplete ⚠️

Grades:

  • Code Quality: B+
  • Test Coverage: B
  • Documentation: A-
  • Architecture: B (good design, incomplete)

The delete-before-add fix solves the critical bug immediately. Content hashing is well-designed but needs completion before providing value.

Great TDD approach and analysis!

@cooklang cooklang deleted a comment from claude bot Nov 20, 2025
@dubadub dubadub merged commit 320c134 into main Nov 20, 2025
5 checks passed
