Refactor banned-term scanning to check visible text#19
Refactor banned-term scanning to check visible text#19harbourviewcompany-create wants to merge 1 commit into
Conversation
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
✅ Deploy Preview for wurx-can ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
✅ Deploy Preview for wurx-otta ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
| .replace(/&#(\d+);/g, (_, codePoint) => String.fromCodePoint(Number(codePoint))) | ||
| .replace(/&#x([a-f0-9]+);/gi, (_, hexCodePoint) => String.fromCodePoint(parseInt(hexCodePoint, 16))); |
There was a problem hiding this comment.
🟡 Unhandled RangeError from String.fromCodePoint on invalid numeric HTML entities
The generic numeric entity handlers on lines 48-49 pass decoded numbers directly to String.fromCodePoint() without validating that they are valid Unicode code points (0 to 0x10FFFF). If an HTML file contains a malformed entity like � or �, String.fromCodePoint() throws an unhandled RangeError, crashing the entire check script. Since this function is applied to every .html file found in the root directory, a single malformed entity in any file would prevent the entire validation suite from running.
| .replace(/&#(\d+);/g, (_, codePoint) => String.fromCodePoint(Number(codePoint))) | |
| .replace(/&#x([a-f0-9]+);/gi, (_, hexCodePoint) => String.fromCodePoint(parseInt(hexCodePoint, 16))); | |
| .replace(/&#(\d+);/g, (_, codePoint) => { const n = Number(codePoint); return n <= 0x10FFFF ? String.fromCodePoint(n) : ""; }) | |
| .replace(/&#x([a-f0-9]+);/gi, (_, hexCodePoint) => { const n = parseInt(hexCodePoint, 16); return n <= 0x10FFFF ? String.fromCodePoint(n) : ""; }); |
Was this helpful? React with 👍 or 👎 to provide feedback.
| const visibleText = extractVisibleText(html); | ||
| for (const pattern of bannedPatterns) { | ||
| if (pattern.test(html)) { | ||
| if (pattern.test(visibleText)) { |
There was a problem hiding this comment.
🚩 Banned-term scan no longer covers HTML attribute content (meta tags, alt text, etc.)
The refactor from pattern.test(html) to pattern.test(visibleText) at line 103 means banned terms in HTML attributes are no longer checked. The extractVisibleText function strips all tags via /<[^>]+>/g at line 57, which discards attribute values entirely. This means banned legacy terms like SummitLine or Roofing appearing in <meta name="description" content="...">, <meta property="og:title" content="...">, <img alt="...">, or <input placeholder="..."> would go undetected. These are user-visible in search results, social media previews, and screen readers respectively. The commit message ("use visible HTML text") suggests this narrowing is intentional—likely to avoid false positives from CSS class names, data attributes, or JS identifiers—but the trade-off should be explicitly acknowledged since meta descriptions and OG tags are the most common places legacy branding lingers after a rebrand.
Was this helpful? React with 👍 or 👎 to provide feedback.
| function decodeHtmlEntities(text) { | ||
| return text | ||
| .replace(/ /gi, " ") | ||
| .replace(/&/gi, "&") | ||
| .replace(/</gi, "<") | ||
| .replace(/>/gi, ">") | ||
| .replace(/"/gi, "\"") | ||
| .replace(/'/gi, "'") | ||
| .replace(/'/gi, "'") | ||
| .replace(///gi, "/") | ||
| .replace(/&#(\d+);/g, (_, codePoint) => String.fromCodePoint(Number(codePoint))) | ||
| .replace(/&#x([a-f0-9]+);/gi, (_, hexCodePoint) => String.fromCodePoint(parseInt(hexCodePoint, 16))); | ||
| } |
There was a problem hiding this comment.
📝 Info: Entity decoding order causes asymmetric double-decode behavior
In decodeHtmlEntities, & is decoded on line 41 before the generic numeric entity handlers on lines 48-49. This means double-encoded numeric entities like &#82; get decoded in two passes: first & → & producing R, then R → R. However, double-encoded named entities like &nbsp; do NOT get double-decoded because the replacement (line 40) already ran before & was decoded (line 41). This asymmetry is not a practical bug for this use case—it actually helps catch obfuscated banned terms—but it's worth noting the function doesn't faithfully mirror browser entity decoding semantics.
Was this helpful? React with 👍 or 👎 to provide feedback.
Motivation
<script>/<style>blocks as a minimum preprocessing step before pattern matching.Description
decodeHtmlEntities(text)to perform lightweight HTML entity decoding and unicode numeric/entity decoding.extractVisibleText(html)which strips HTML comments, removes<script>and<style>blocks, strips tags, decodes entities, and normalizes whitespace.pattern.test(visibleText)wherevisibleTextis produced byextractVisibleText(html), keeping thebannedPatternsarray and failure message format unchanged.Testing
node scripts/check-site.jsand the script completed successfully with no failures.bannedPatternsbehavior were preserved to avoid changing downstream expectations.Codex Task