Context
ScoutBot currently deduplicates by Application Link URL. This works well for most cases, but the same opportunity sometimes appears on multiple aggregator sites under different URLs. Students can receive the same opportunity twice if it's listed on both OpportunityDesk and AfterSchoolAfrica, for example.
Task
Add a secondary deduplication pass based on title similarity.
Suggested approach
After URL-based dedup, compare the new entry's title against every title already in the sheet. If the similarity to any existing title is above a threshold (e.g. 85%), skip the new entry as a duplicate.
A simple approach without ML:
from difflib import SequenceMatcher

def title_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
Or strip common words (Scholarship, Fellowship, Program, 2025, 2026) before comparing.
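The word-stripping variant could look roughly like the sketch below. The names `COMMON_WORDS`, `normalize_title`, and `normalized_similarity` are hypothetical, and the word list is just the examples given above; a real list would need tuning against actual titles.

```python
import re
from difflib import SequenceMatcher

# Hypothetical stop-word list, taken from the examples in this issue.
COMMON_WORDS = {"scholarship", "fellowship", "program", "2025", "2026"}

def normalize_title(title):
    """Lowercase the title, drop punctuation, and remove common words."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return " ".join(w for w in words if w not in COMMON_WORDS)

def normalized_similarity(a, b):
    """Similarity ratio in [0, 1], computed on the normalized titles."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()

print(normalize_title("Mastercard Foundation Scholarship Program 2025"))
```

One caveat with this approach: a title made up only of common words normalizes to an empty string, and SequenceMatcher treats two empty strings as identical, so such titles may need a guard before comparison.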
Files to touch
scoutbot/pipelines.py — where dedup currently happens
Notes
- Keep URL-based dedup as the primary check (fast)
- Title-based dedup only runs when the URL check finds no match
- Log any title-based skips for visibility
- Add a test in tests/
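Putting the notes together, the two-stage check might be sketched as follows. This is not the existing code in scoutbot/pipelines.py: it assumes each row is a dict with hypothetical `url` and `title` keys, that the sheet's rows are already loaded into a list, and that a logger named `scoutbot.dedup` is acceptable. The test function at the end is likewise only an illustration of what a test could cover.

```python
import logging
from difflib import SequenceMatcher

logger = logging.getLogger("scoutbot.dedup")  # hypothetical logger name

# The 85% example threshold, on SequenceMatcher's 0-1 scale.
TITLE_THRESHOLD = 0.85

def title_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_duplicate(entry, existing, threshold=TITLE_THRESHOLD):
    """Return True if entry duplicates a row in existing.

    entry and each row in existing are assumed to be dicts with
    'url' and 'title' keys (hypothetical field names).
    """
    # Primary check: exact URL match via a set lookup.
    seen_urls = {row["url"] for row in existing}
    if entry["url"] in seen_urls:
        return True

    # Secondary check: only reached when the URL is new.
    for row in existing:
        score = title_similarity(entry["title"], row["title"])
        if score > threshold:
            # Log title-based skips so they can be reviewed later.
            logger.info(
                "Skipping %r: title %.0f%% similar to existing %r",
                entry["title"], score * 100, row["title"],
            )
            return True
    return False

def test_title_dedup():
    """Illustrative test covering both stages of the check."""
    existing = [{"url": "https://a.example/x", "title": "Chevening Scholarship 2025"}]
    same_url = {"url": "https://a.example/x", "title": "Anything"}
    near_title = {"url": "https://b.example/y", "title": "Chevening Scholarships 2025"}
    unrelated = {"url": "https://c.example/z", "title": "Google Summer of Code"}
    assert is_duplicate(same_url, existing)
    assert is_duplicate(near_title, existing)
    assert not is_duplicate(unrelated, existing)
```

The sketch rebuilds the URL set on every call for simplicity; if the pipeline already keeps a set of seen URLs, reusing it would avoid that cost. The title loop is linear in the number of existing rows, which is likely fine for a sheet of modest size.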