fix: dedupe personal website candidates by michaelmwu · Pull Request #198 · 508-dev/508-workflows

michaelmwu · 2026-03-09T20:02:48Z

Description

Normalize website dedupe to ignore scheme-only differences so http and https variants do not produce duplicate personal websites.
Stop appending heuristic non-social website candidates after an accepted LLM personal website candidate, while keeping the existing social-link split behavior.
Add extractor and resume profile processor regression tests covering scheme-only duplicates and heuristic junk websites.
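
The backfill rule in the second point can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's code: `merge_candidates`, its arguments, and the `SOCIAL_HOSTS` set are hypothetical names chosen for the example.

```python
from __future__ import annotations

from urllib.parse import urlsplit

# Hypothetical set of social hosts; the real classifier is not shown in this PR.
SOCIAL_HOSTS = {"linkedin.com", "github.com", "twitter.com"}


def _host(url: str) -> str:
    """Lower-cased host without a leading 'www.'."""
    host = urlsplit(url).netloc.casefold()
    return host[4:] if host.startswith("www.") else host


def merge_candidates(
    llm_websites: list[str], heuristic_urls: list[str]
) -> tuple[list[str], list[str]]:
    """Return (website_links, social_links) for one resume."""
    websites = list(llm_websites)
    socials: list[str] = []
    llm_accepted = bool(llm_websites)
    for url in heuristic_urls:
        if _host(url) in SOCIAL_HOSTS:
            socials.append(url)  # the social-link split is unchanged
        elif not llm_accepted:
            websites.append(url)  # heuristic backfill only when no LLM website was accepted
    return websites, socials
```

With an accepted LLM website, heuristic non-social URLs are dropped while social links still pass through; without one, the heuristic URLs are kept as a fallback.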

Related Issue

N/A

How Has This Been Tested?

uv run pytest tests/unit/test_resume_extractor.py -q
uv run pytest tests/unit/test_resume_profile_processor.py -q

coderabbitai · 2026-03-09T20:03:08Z

Warning

Rate limit exceeded

@michaelmwu has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 38 seconds before requesting another review.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e9b5be9c-1b52-427c-bcd9-8253716b451b

📥 Commits

Reviewing files that changed from the base of the PR and between 24679e8 and ebcd2d7.

📒 Files selected for processing (6)

apps/worker/src/five08/worker/crm/resume_profile_processor.py
packages/shared/src/five08/crm_normalization.py
packages/shared/src/five08/resume_extractor.py
tests/unit/test_crm_normalization.py
tests/unit/test_resume_extractor.py
tests/unit/test_resume_profile_processor.py

Copilot

Pull request overview

Normalizes personal-website deduplication so http:// vs https:// variants collapse to a single website, and updates extraction behavior to avoid adding heuristic “junk” website candidates after an accepted LLM personal-website candidate.

Changes:

Introduce a scheme-insensitive website identity key and use it for website/social dedupe in the shared resume extractor.
Update the worker resume profile processor to dedupe CRM website updates using a scheme-insensitive key.
Add regression tests covering scheme-only duplicates and suppression of heuristic backfill after an accepted LLM website.
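
A scheme-insensitive identity key of the kind listed above might look like the sketch below. The host and path handling mirrors the diff lines quoted in this review; the function names `identity_key` and `dedupe`, and the choice to ignore query strings and fragments, are assumptions made for the example.

```python
from __future__ import annotations

import re
from urllib.parse import urlsplit


def identity_key(url: str) -> str | None:
    """Hypothetical stand-in for the PR's key helper: host + path, no scheme."""
    parsed = urlsplit(url.strip())
    if not parsed.netloc:
        return None
    netloc = parsed.netloc.casefold()
    if netloc.startswith("www."):
        netloc = netloc[4:]
    # Collapse repeated slashes and drop a trailing slash.
    path = re.sub(r"/+", "/", parsed.path or "").rstrip("/")
    # Assumption: query string and fragment are ignored in this sketch.
    return f"{netloc}{path}"


def dedupe(urls: list[str]) -> list[str]:
    """Keep the first URL seen for each identity key, preserving order."""
    seen: set[str] = set()
    kept: list[str] = []
    for url in urls:
        key = identity_key(url)
        if key is None or key in seen:
            continue
        seen.add(key)
        kept.append(url)
    return kept
```

Because the first occurrence wins, which scheme survives depends on candidate order rather than on the key itself.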

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `packages/shared/src/five08/resume_extractor.py` | Adds `_website_identity_key()` and applies it across multiple website/social dedupe paths; suppresses heuristic non-social website backfill after LLM personal website acceptance. |
| `apps/worker/src/five08/worker/crm/resume_profile_processor.py` | Updates website link coercion/merge logic to dedupe using a scheme-insensitive key. |
| `tests/unit/test_resume_extractor.py` | Adds extractor regression tests for scheme-only dedupe and heuristic suppression after LLM website success. |
| `tests/unit/test_resume_profile_processor.py` | Adds processor regression test ensuring existing/extracted website links don’t trigger updates when differing only by scheme. |

Copilot · 2026-03-09T20:07:22Z

tests/unit/test_resume_extractor.py

+    result = extractor.extract("Portfolio: https://bit.ly/charleschen-portfolio")
+
+    assert result.website_links == ["http://bit.ly/charleschen-portfolio"]
+


This assertion hard-codes that the surviving deduped URL must be the http variant. The behavior being tested is scheme-insensitive dedupe; depending on candidate ordering or confidence ties, the implementation could reasonably keep https instead while still satisfying the requirement. To reduce brittleness, assert that only one link remains and that its identity matches the expected host/path (ignoring scheme).
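
One way to express the looser assertion is a small test helper that checks the count and compares host and path while accepting either scheme. This is only a sketch: `assert_single_link` is a hypothetical helper, not something that exists in the test suite.

```python
from urllib.parse import urlsplit


def assert_single_link(links, expected_host, expected_path):
    """Assert exactly one link remains and it matches host/path, ignoring scheme."""
    assert len(links) == 1, f"expected one link, got {links!r}"
    parsed = urlsplit(links[0])
    assert parsed.scheme in {"http", "https"}
    assert parsed.netloc.casefold() == expected_host
    assert parsed.path.rstrip("/") == expected_path
```

In the test above this would replace the equality check with something like `assert_single_link(result.website_links, "bit.ly", "/charleschen-portfolio")`.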

Copilot · 2026-03-09T20:07:22Z

packages/shared/src/five08/resume_extractor.py

+    netloc = parsed.netloc.casefold()
+    if netloc.startswith("www."):
+        netloc = netloc[4:]
+    path = re.sub(r"/+", "/", parsed.path or "").rstrip("/")


_website_identity_key() case-folds the host but leaves the URL path casing unchanged. Previously dedupe used normalized_url.casefold(), so links that differ only by path casing (e.g. /About vs /about) would have been deduped but will now be treated as distinct. If case-insensitive path dedupe is still desired for this pipeline, case-fold (or otherwise normalize) the path component when building the key, or add coverage to lock in the intended behavior.

Suggested change

-    path = re.sub(r"/+", "/", parsed.path or "").rstrip("/")
+    path = re.sub(r"/+", "/", parsed.path or "").rstrip("/").casefold()
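
The difference between the two variants can be seen in a small sketch. `path_key` and its `fold_path` flag are hypothetical and exist only to compare the behaviors side by side. Note that URL paths are case-sensitive in general (RFC 3986 only makes the scheme and host case-insensitive), so folding the path trades strictness for more aggressive dedupe.

```python
import re
from urllib.parse import urlsplit


def path_key(url: str, fold_path: bool) -> str:
    """Normalized path component of a dedupe key, optionally case-folded."""
    path = re.sub(r"/+", "/", urlsplit(url).path or "").rstrip("/")
    return path.casefold() if fold_path else path


a = "https://example.com/About"
b = "https://example.com/about"

# Without folding, the two paths are treated as distinct pages.
distinct_without_fold = path_key(a, fold_path=False) != path_key(b, fold_path=False)
# With folding, they collapse to the same key, matching the earlier casefold() dedupe.
same_with_fold = path_key(a, fold_path=True) == path_key(b, fold_path=True)
```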

Copilot · 2026-03-09T20:07:22Z

packages/shared/src/five08/resume_extractor.py

+def _website_identity_key(value: str) -> str | None:
+    normalized_url = _normalize_website_url(value)
+    if not normalized_url:
+        return None
+
+    try:
+        parsed = urlsplit(normalized_url)
+    except Exception:
+        return normalized_url.casefold()


_website_identity_key() re-normalizes the URL via _normalize_website_url(), but most call sites already pass a normalized value (e.g., normalized_link / normalized_url). This causes redundant parsing/normalization work and also couples dedupe-key generation to the full normalization policy. Consider letting the key helper accept an already-normalized URL (or splitting into normalize + key steps) so callers don’t pay the normalization cost twice.
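
A possible shape for the normalize/key split is sketched below. The normalizer here is a deliberately simplified, hypothetical stand-in (it only trims whitespace and adds a default scheme) and does not reflect the repository's `_normalize_website_url()` policy; all three function names are assumptions.

```python
from __future__ import annotations

import re
from urllib.parse import urlsplit


def normalize_url(value: str) -> str | None:
    """Hypothetical, simplified normalizer (not the repository's policy)."""
    candidate = value.strip()
    if not candidate:
        return None
    if "://" not in candidate:
        candidate = f"https://{candidate}"
    return candidate if urlsplit(candidate).netloc else None


def key_from_normalized(normalized_url: str) -> str:
    """Build the dedupe key from an already-normalized URL; no re-normalization."""
    parsed = urlsplit(normalized_url)
    netloc = parsed.netloc.casefold()
    if netloc.startswith("www."):
        netloc = netloc[4:]
    path = re.sub(r"/+", "/", parsed.path or "").rstrip("/")
    return f"{netloc}{path}"


def key_from_raw(value: str) -> str | None:
    """Convenience wrapper for callers that only hold a raw string."""
    normalized = normalize_url(value)
    return key_from_normalized(normalized) if normalized else None
```

Call sites that already hold a normalized URL would use `key_from_normalized` directly, so normalization runs once per candidate.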

Copilot · 2026-03-09T20:07:23Z

apps/worker/src/five08/worker/crm/resume_profile_processor.py

+        netloc = parsed.netloc.casefold()
+        if netloc.startswith("www."):
+            netloc = netloc[4:]
+        path = re.sub(r"/+", "/", parsed.path or "").rstrip("/")


_website_dedupe_key() duplicates the URL identity logic that already exists in packages/shared/src/five08/resume_extractor.py (_website_identity_key). Keeping two slightly different implementations increases the chance they drift over time (e.g., the current key keeps original path casing, which also changes prior case-insensitive dedupe behavior). Consider moving this into a shared helper (e.g., crm_normalization) and reusing it from both places, and clarify whether path casing should be normalized as part of dedupe.

Suggested change

-        path = re.sub(r"/+", "/", parsed.path or "").rstrip("/")
+        path = re.sub(r"/+", "/", parsed.path or "").rstrip("/").casefold()
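
If the key logic lived in one shared helper, the processor side could reuse it when deciding whether extracted links add anything to the CRM record. The sketch below assumes a shared `website_identity_key` and a hypothetical `new_links_for_crm` function; neither name nor the module placement is confirmed by this PR.

```python
import re
from urllib.parse import urlsplit


def website_identity_key(url: str) -> str:
    """Hypothetical shared helper that both extractor and processor would import."""
    parsed = urlsplit(url)
    netloc = parsed.netloc.casefold()
    if netloc.startswith("www."):
        netloc = netloc[4:]
    path = re.sub(r"/+", "/", parsed.path or "").rstrip("/")
    return f"{netloc}{path}"


def new_links_for_crm(existing, extracted):
    """Return extracted links whose identity key is not already stored."""
    seen = {website_identity_key(url) for url in existing}
    fresh = []
    for url in extracted:
        key = website_identity_key(url)
        if key not in seen:
            seen.add(key)
            fresh.append(url)
    return fresh
```

An empty result means the extracted links differ from the stored ones only by scheme, `www.` prefix, or trailing slash, so no CRM update would be needed.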

Fix website candidate dedupe

0ed9edd

Copilot AI review requested due to automatic review settings March 9, 2026 20:02

michaelmwu temporarily deployed to test March 9, 2026 20:02 — with GitHub Actions Inactive

Copilot started reviewing on behalf of michaelmwu March 9, 2026 20:03

Suppress heuristic URL backfill

5c416d8

michaelmwu temporarily deployed to test March 9, 2026 20:06 — with GitHub Actions Inactive

Copilot AI reviewed Mar 9, 2026

Refine website dedupe helpers

ebcd2d7

michaelmwu temporarily deployed to test March 9, 2026 20:09 — with GitHub Actions Inactive

michaelmwu merged commit 2b0dbdd into main Mar 9, 2026
5 checks passed

michaelmwu deleted the michaelmwu/site-dedupe branch March 9, 2026 20:13

michaelmwu merged 3 commits into main from michaelmwu/site-dedupe
