Improve deduplication — catch same opportunity posted under different URLs #51

@kamsirichard

Context

ScoutBot currently deduplicates by Application Link URL. This works well in most cases, but the same opportunity sometimes appears on multiple aggregator sites under different URLs. For example, students can receive the same opportunity twice if it is listed on both OpportunityDesk and AfterSchoolAfrica.

Task

Add a secondary deduplication pass based on title similarity.

Suggested approach

After URL-based dedup, compare the new entry's title against every title already in the sheet. If the similarity exceeds a threshold (e.g. 85%), treat the new entry as a duplicate and skip it.

A simple approach without ML:

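from difflib import SequenceMatcher

def title_similarity(a, b):
    """Return a case-insensitive similarity ratio between 0.0 and 1.0."""
    # ratio() is 1.0 for identical strings, so an 85% threshold is 0.85 here
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()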
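
To check a new title against everything already in the sheet, the helper above could be wrapped in a loop. This is only a sketch: `find_similar_title`, its parameters, and the plain list of titles are assumptions for illustration, not existing ScoutBot code.

```python
from difflib import SequenceMatcher


def title_similarity(a, b):
    # Same helper as suggested in this issue
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_similar_title(new_title, existing_titles, threshold=0.85):
    """Return the first existing title at or above the threshold, else None.

    `existing_titles` is assumed to be a list of title strings already
    read from the sheet (hypothetical input shape).
    """
    for existing in existing_titles:
        if title_similarity(new_title, existing) >= threshold:
            return existing
    return None
```

Returning the matched title rather than a bare boolean makes it easy to log which existing entry caused the skip. The loop compares against every stored title, so the cost grows with the size of the sheet.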

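Alternatively, strip common words (Scholarship, Fellowship, Program, 2025, 2026) before comparing, so that boilerplate terms shared by unrelated opportunities do not inflate the similarity score.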
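
A rough sketch of the word-stripping variant follows. The `COMMON_WORDS` set, the `normalize_title` and `normalized_similarity` names, and the choice to drop any four-digit year starting with 20 (not just 2025 and 2026) are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical stop-word list based on the words named in this issue
COMMON_WORDS = {"scholarship", "fellowship", "program"}


def normalize_title(title):
    """Lowercase, drop punctuation, and remove common words and years."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    kept = [
        w for w in words
        if w not in COMMON_WORDS and not re.fullmatch(r"20\d{2}", w)
    ]
    return " ".join(kept)


def normalized_similarity(a, b):
    """Similarity ratio computed on the normalized titles."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()
```

One caveat: a title made up only of stripped words normalizes to an empty string, and `SequenceMatcher` reports two empty strings as identical. It may be worth falling back to the raw titles in that case.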

Files to touch

  • scoutbot/pipelines.py — where dedup currently happens

Notes

  • Keep URL-based dedup as the primary check (fast)
  • Title-based dedup runs only when the URL check finds no match (i.e. the URL is new)
  • Log any title-based skips for visibility
  • Add a test in tests/
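
The notes above could fit together roughly as follows. This is a sketch under assumptions: the `is_duplicate` function, the dict-shaped entry with `url` and `title` keys, the `seen_urls` and `seen_titles` collections, and the logger name are all hypothetical and would need adapting to what scoutbot/pipelines.py actually does.

```python
import logging
from difflib import SequenceMatcher

logger = logging.getLogger("scoutbot.dedup")  # hypothetical logger name


def title_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_duplicate(entry, seen_urls, seen_titles, threshold=0.85):
    """Two-stage check: exact URL first, then title similarity.

    `entry` is assumed to be a dict with "url" and "title" keys.
    """
    # Primary check: URL lookup (fast when seen_urls is a set)
    if entry["url"] in seen_urls:
        return True

    # Secondary check: reached only when the URL is new
    for existing in seen_titles:
        score = title_similarity(entry["title"], existing)
        if score >= threshold:
            # Log title-based skips for visibility
            logger.info(
                "Skipping %r: title is %.0f%% similar to %r",
                entry["title"], score * 100, existing,
            )
            return True
    return False
```

A test in tests/ could then cover three cases: a known URL, a new URL with a near-identical title, and a new URL with an unrelated title. For example, with `seen_titles = ["Chevening Scholarship 2026"]`, an entry titled "chevening scholarship 2026" under a different URL should be reported as a duplicate.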

Labels

  • enhancement
  • help wanted
