Skip to content

fix: dedupe personal website candidates#198

Merged
michaelmwu merged 3 commits intomainfrom
michaelmwu/site-dedupe
Mar 9, 2026
Merged

fix: dedupe personal website candidates#198
michaelmwu merged 3 commits intomainfrom
michaelmwu/site-dedupe

Conversation

@michaelmwu
Copy link
Member

Description

Normalize website dedupe to ignore scheme-only differences so http and https variants do not produce duplicate personal websites.
Stop appending heuristic non-social website candidates after an accepted LLM personal website candidate, while keeping the existing social-link split behavior.
Add extractor and resume profile processor regression tests covering scheme-only duplicates and heuristic junk websites.

Related Issue

N/A

How Has This Been Tested?

  • uv run pytest tests/unit/test_resume_extractor.py -q
  • uv run pytest tests/unit/test_resume_profile_processor.py -q

Copilot AI review requested due to automatic review settings March 9, 2026 20:02
@coderabbitai
Copy link

coderabbitai bot commented Mar 9, 2026

Warning

Rate limit exceeded

@michaelmwu has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 38 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e9b5be9c-1b52-427c-bcd9-8253716b451b

📥 Commits

Reviewing files that changed from the base of the PR and between 24679e8 and ebcd2d7.

📒 Files selected for processing (6)
  • apps/worker/src/five08/worker/crm/resume_profile_processor.py
  • packages/shared/src/five08/crm_normalization.py
  • packages/shared/src/five08/resume_extractor.py
  • tests/unit/test_crm_normalization.py
  • tests/unit/test_resume_extractor.py
  • tests/unit/test_resume_profile_processor.py
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch michaelmwu/site-dedupe

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Normalizes personal-website deduplication so http:// vs https:// variants collapse to a single website, and updates extraction behavior to avoid adding heuristic “junk” website candidates after an accepted LLM personal-website candidate.

Changes:

  • Introduce a scheme-insensitive website identity key and use it for website/social dedupe in the shared resume extractor.
  • Update the worker resume profile processor to dedupe CRM website updates using a scheme-insensitive key.
  • Add regression tests covering scheme-only duplicates and suppression of heuristic backfill after an accepted LLM website.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
packages/shared/src/five08/resume_extractor.py Adds _website_identity_key() and applies it across multiple website/social dedupe paths; suppresses heuristic non-social website backfill after LLM personal website acceptance.
apps/worker/src/five08/worker/crm/resume_profile_processor.py Updates website link coercion/merge logic to dedupe using a scheme-insensitive key.
tests/unit/test_resume_extractor.py Adds extractor regression tests for scheme-only dedupe and heuristic suppression after LLM website success.
tests/unit/test_resume_profile_processor.py Adds processor regression test ensuring existing/extracted website links don’t trigger updates when differing only by scheme.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +478 to +481
result = extractor.extract("Portfolio: https://bit.ly/charleschen-portfolio")

assert result.website_links == ["http://bit.ly/charleschen-portfolio"]

Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion hard-codes that the surviving deduped URL must be the http variant. The behavior being tested is scheme-insensitive dedupe; depending on candidate ordering or confidence ties, the implementation could reasonably keep https instead while still satisfying the requirement. To reduce brittleness, assert that only one link remains and that its identity matches the expected host/path (ignoring scheme).

Copilot uses AI. Check for mistakes.
netloc = parsed.netloc.casefold()
if netloc.startswith("www."):
netloc = netloc[4:]
path = re.sub(r"/+", "/", parsed.path or "").rstrip("/")
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_website_identity_key() case-folds the host but leaves the URL path casing unchanged. Previously dedupe used normalized_url.casefold(), so links that differ only by path casing (e.g. /About vs /about) would have been deduped but will now be treated as distinct. If case-insensitive path dedupe is still desired for this pipeline, case-fold (or otherwise normalize) the path component when building the key, or add coverage to lock in the intended behavior.

Suggested change
path = re.sub(r"/+", "/", parsed.path or "").rstrip("/")
path = re.sub(r"/+", "/", parsed.path or "").rstrip("/").casefold()

Copilot uses AI. Check for mistakes.
Comment on lines +1255 to +1263
def _website_identity_key(value: str) -> str | None:
normalized_url = _normalize_website_url(value)
if not normalized_url:
return None

try:
parsed = urlsplit(normalized_url)
except Exception:
return normalized_url.casefold()
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_website_identity_key() re-normalizes the URL via _normalize_website_url(), but most call sites already pass a normalized value (e.g., normalized_link / normalized_url). This causes redundant parsing/normalization work and also couples dedupe-key generation to the full normalization policy. Consider letting the key helper accept an already-normalized URL (or splitting into normalize + key steps) so callers don’t pay the normalization cost twice.

Copilot uses AI. Check for mistakes.
netloc = parsed.netloc.casefold()
if netloc.startswith("www."):
netloc = netloc[4:]
path = re.sub(r"/+", "/", parsed.path or "").rstrip("/")
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_website_dedupe_key() duplicates the URL identity logic that already exists in packages/shared/src/five08/resume_extractor.py (_website_identity_key). Keeping two slightly different implementations increases the chance they drift over time (e.g., the current key keeps original path casing, which also changes prior case-insensitive dedupe behavior). Consider moving this into a shared helper (e.g., crm_normalization) and reusing it from both places, and clarify whether path casing should be normalized as part of dedupe.

Suggested change
path = re.sub(r"/+", "/", parsed.path or "").rstrip("/")
path = re.sub(r"/+", "/", parsed.path or "").rstrip("/").casefold()

Copilot uses AI. Check for mistakes.
@michaelmwu michaelmwu merged commit 2b0dbdd into main Mar 9, 2026
5 checks passed
@michaelmwu michaelmwu deleted the michaelmwu/site-dedupe branch March 9, 2026 20:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants