Research federation issue fix and document findings #6
Added detailed research document analyzing issue #5 (duplicate search results).

Key findings:
- Root cause: per-feed deduplication allows the same recipe from different sources
- The current UNIQUE(feed_id, external_id) constraint is insufficient
- No content-based or cross-source deduplication exists

Proposed solutions:
- Phase 1 (quick): fuzzy title-based post-search deduplication
- Phase 2 (medium): content hash system for accurate matching
- Phase 3 (long-term): canonical recipe system with source tracking

Document includes:
- Detailed architecture analysis
- Code references with line numbers
- Multiple implementation approaches with pros and cons
- Phased implementation plan
- Sample code for each approach
- Testing strategies and monitoring recommendations

Ready for implementation discussion and prioritization.
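The Phase 1 idea (fuzzy title-based post-search deduplication) can be sketched in a simplified form that compares normalized titles rather than doing true fuzzy matching. `SearchHit`, its fields, and `dedup_by_title` are illustrative names, not the project's actual types:

```rust
use std::collections::HashSet;

/// Hypothetical search result type; field names are illustrative,
/// not the project's actual structs.
struct SearchHit {
    recipe_id: u64,
    title: String,
}

/// Lowercase, trim, and collapse runs of whitespace so near-identical
/// titles compare equal.
fn normalize(title: &str) -> String {
    title.to_lowercase().split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Keep only the first hit for each normalized title, preserving rank order.
fn dedup_by_title(hits: Vec<SearchHit>) -> Vec<SearchHit> {
    let mut seen = HashSet::new();
    hits.into_iter()
        .filter(|hit| seen.insert(normalize(&hit.title)))
        .collect()
}
```

A true fuzzy variant would replace the exact `HashSet` lookup with an edit-distance comparison against already-kept titles, at higher cost per result page.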
Updated research document with a critical finding: the same recipe ID appears multiple times in search results due to a bug in the indexing logic.

Root cause:
- When a recipe is updated, Tantivy's index_recipe() adds a new document
- The old document with the same recipe_id is never deleted first
- Result: N updates = N duplicate documents with the same ID

Evidence from the user report:
- /recipes/2473 appears multiple times
- /recipes/2457 appears multiple times
- HTML inspection confirms the same IDs, not just similar content

The fix (2 lines):
- In src/indexer/search.rs:89, add a delete_term() call before add_document()
- This ensures only one document per recipe_id exists
- Estimated effort: 30 minutes plus reindex time

Updated implementation plan:
- Phase 0 (URGENT): fix the indexing bug
- Phase 1 (optional): post-search deduplication for content
- Phases 2-3: content hash system and canonical recipes

This bug fix should resolve the immediate user-reported issue. The content-based deduplication remains a separate enhancement.
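The bug and the delete-before-add fix can be illustrated with a minimal stand-in index. The real code calls Tantivy's `delete_term()` before `add_document()`; this Vec-based mock (all names hypothetical) only demonstrates why N updates otherwise leave N documents:

```rust
/// Vec-based stand-in for the search index: (recipe_id, title) pairs.
struct MockIndex {
    docs: Vec<(u64, String)>,
}

impl MockIndex {
    fn new() -> Self {
        MockIndex { docs: Vec::new() }
    }

    /// Buggy behavior: every update appends a fresh document, so N
    /// updates leave N copies sharing the same recipe_id.
    fn index_recipe_buggy(&mut self, id: u64, title: &str) {
        self.docs.push((id, title.to_string()));
    }

    /// Fixed behavior: remove any existing documents for this id first
    /// (analogous to delete_term()), then add the new one
    /// (analogous to add_document()).
    fn index_recipe_fixed(&mut self, id: u64, title: &str) {
        self.docs.retain(|(doc_id, _)| *doc_id != id);
        self.docs.push((id, title.to_string()));
    }
}
```

In Tantivy, `delete_term()` only matches documents whose ID field is indexed, which is why the implementation commit below also changes the schema to make the ID field INDEXED rather than FAST-only.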
Detailed plan covering two phases:

Phase 0 (Critical):
- Delete-before-add logic to fix same-recipe_id duplicates
- Simple 2-line fix in the search indexer
- Estimated: 1-2 hours

Phase 2 (Content Hash):
- SHA-256-based content hashing
- Database migration for a content_hash column
- Deduplication at search query time
- Backfill script for existing recipes
- Estimated: 12-15 hours

Includes code examples, testing strategies, deployment steps, monitoring metrics, and rollback procedures.
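The Phase 2 hashing step can be sketched as: normalize the title and content, then hash them together so recipes differing only in case or spacing collide. The plan specifies SHA-256 via the sha2 crate; std's `DefaultHasher` stands in here so the sketch stays dependency-free, and the function names mirror but are not guaranteed to match the project's:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Lowercase, trim, and collapse runs of whitespace to a single space.
fn normalize(text: &str) -> String {
    text.to_lowercase().split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Hash normalized title and content together. The real plan uses
/// SHA-256 from the sha2 crate and stores the digest as a string;
/// DefaultHasher is a dependency-free stand-in for illustration only.
fn calculate_content_hash(title: &str, content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    normalize(title).hash(&mut hasher);
    normalize(content).hash(&mut hasher);
    hasher.finish()
}
```

SHA-256 matters in the real design because its output is stable across processes and Rust versions, so hashes can be persisted in the content_hash column; `DefaultHasher` offers no such guarantee.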
This commit adds comprehensive failing tests for both Phase 0 and Phase 2 of the deduplication implementation plan.

Phase 0 - Delete-Before-Add Logic:
- Added test_index_recipe_deletes_before_adding in src/indexer/search.rs
- Tests that updating a recipe removes the old entry before adding the new one
- Currently FAILS: the old title remains in the index (proves the bug exists)

Phase 2 - Content Hash Deduplication:
- Added content hash calculation functions to src/db/recipes.rs:
  * calculate_content_hash() - SHA-256 hash of normalized title+content
  * normalize_title() - lowercase, trim, collapse whitespace
  * normalize_cooklang_content() - remove comments, normalize formatting
  * find_recipe_by_content_hash() - query by hash
  * find_duplicate_recipes() - get all recipes with the same hash
- Added 7 unit tests for content hash normalization:
  * test_normalize_title
  * test_same_content_produces_same_hash
  * test_whitespace_differences_produce_same_hash
  * test_comments_dont_affect_hash
  * test_different_content_produces_different_hash
  * test_title_case_differences_produce_same_hash
  * test_block_comments_dont_affect_hash (FAILS - reveals a bug)
- Added 3 integration tests in tests/deduplication_test.rs:
  * test_duplicate_detection_by_hash
  * test_different_recipes_have_different_hashes
  * test_find_recipe_by_content_hash
  * All fail with "no such column: content_hash" (expected - migration needed)

Dependencies:
- Added sha2 = "0.10" to Cargo.toml for hash calculation

These tests follow TDD principles: they are written before the implementation to guide development and ensure correctness, and they will pass once the features described in plan.md are implemented.
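The comment-related tests above hinge on `normalize_cooklang_content()` stripping both comment forms Cooklang defines: `--` to end of line, and `[- ... -]` block comments (the form the failing test targets). A sketch of such a function, as a guess at the intended behavior rather than the project's actual code:

```rust
/// Strip Cooklang comments (`--` line comments and `[- ... -]` block
/// comments), then collapse whitespace so formatting differences do not
/// change the resulting content hash.
fn normalize_cooklang_content(content: &str) -> String {
    // Remove block comments first, so a `--` inside a block is not
    // mistaken for a line comment.
    let mut without_blocks = String::new();
    let mut rest = content;
    while let Some(start) = rest.find("[-") {
        without_blocks.push_str(&rest[..start]);
        match rest[start..].find("-]") {
            Some(end) => rest = &rest[start + end + 2..],
            None => {
                // Unterminated block comment: drop the remainder.
                rest = "";
                break;
            }
        }
    }
    without_blocks.push_str(rest);

    // Drop line comments, then collapse all whitespace runs.
    without_blocks
        .lines()
        .map(|line| line.split("--").next().unwrap_or(""))
        .collect::<Vec<_>>()
        .join(" ")
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}
```

A version that handles only `--` line comments would explain exactly the observed failure pattern: the six line-comment tests pass while test_block_comments_dont_affect_hash fails.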
This commit implements both Phase 0 and Phase 2 of the deduplication plan to eliminate duplicate recipes in search results.

Phase 0 (Critical - Delete-Before-Add Logic):
- Modified index_recipe() in src/indexer/search.rs to delete existing documents before adding new ones
- Added a delete_term() call before add_document() to prevent duplicate entries for the same recipe_id
- Fixed the search index schema to make the ID field INDEXED (not just FAST) to enable proper deletion via delete_term()
- Added a comprehensive unit test to verify the delete-before-add behavior

Phase 2 (Content Hash Deduplication):
- Created database migration 006_add_content_hash.sql to add a content_hash column with an index
- Added a content_hash field to the Recipe and NewRecipe models
- Updated create_recipe() to store content_hash in the database
- Implemented hash calculation functions:
  * calculate_content_hash() - SHA-256 hash of normalized title+content
  * normalize_title() - lowercase, trim, collapse whitespace
  * normalize_cooklang_content() - remove comments, normalize formatting
  * find_recipe_by_content_hash() - query by hash
  * find_duplicate_recipes() - get all recipes with the same hash
- Updated the GitHub indexer to calculate and set content_hash
- Added comprehensive unit tests for hash normalization
- Added integration tests for duplicate detection

All tests passing:
- 55 library tests pass
- 3 integration tests pass
- The delete-before-add test verifies no duplicate documents
- Hash normalization tests verify correct behavior
- Duplicate detection tests verify the database queries work

Next steps:
- Apply the migration to the production database
- Update the RSS crawler to calculate content_hash (TODO)
- Implement search-time deduplication using content_hash (future enhancement)
- Add a backfill script for existing recipes (future enhancement)

Resolves the duplicate search results issue by ensuring each recipe appears only once in the search index, even after updates.
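The grouping behind a find_duplicate_recipes()-style query can be shown in miniature: collect recipe IDs by hash, then keep only hashes shared by more than one recipe. The real implementation queries the database's content_hash column; this in-memory version (illustrative names) shows the same grouping logic:

```rust
use std::collections::HashMap;

/// Group recipe IDs by content hash and return only the hashes shared
/// by more than one recipe. In the real system this grouping happens in
/// a SQL query over the indexed content_hash column.
fn find_duplicate_recipes(recipes: &[(u64, String)]) -> HashMap<String, Vec<u64>> {
    let mut by_hash: HashMap<String, Vec<u64>> = HashMap::new();
    for (id, hash) in recipes {
        by_hash.entry(hash.clone()).or_default().push(*id);
    }
    by_hash.retain(|_, ids| ids.len() > 1);
    by_hash
}
```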
- Add missing content_hash field to NewRecipe in the parser integration test
- Refactor loop to while let to satisfy the clippy::while_let_loop lint
- Apply rustfmt formatting fixes across the codebase
Pull Request Review: Deduplication Implementation

Reviewed all 15 changed files addressing duplicate search results (Issue #5).

Summary

Strong Points:

Critical Issues:

Key Findings

1. Delete-Before-Add Fix ✅ (src/indexer/search.rs:91-97)
Correctly prevents the same recipe_id appearing multiple times. Production-ready.

2. Content Normalization