27 changes: 26 additions & 1 deletion scripts/check-site.js
@@ -35,6 +35,30 @@ function exists(file) {
  return fs.existsSync(path.join(root, file));
}

function decodeHtmlEntities(text) {
  return text
    .replace(/&nbsp;/gi, " ")
    .replace(/&amp;/gi, "&")
    .replace(/&lt;/gi, "<")
    .replace(/&gt;/gi, ">")
    .replace(/&quot;/gi, "\"")
    .replace(/&#39;/gi, "'")
    .replace(/&#x27;/gi, "'")
    .replace(/&#x2F;/gi, "/")
    .replace(/&#(\d+);/g, (_, codePoint) => String.fromCodePoint(Number(codePoint)))
    .replace(/&#x([a-f0-9]+);/gi, (_, hexCodePoint) => String.fromCodePoint(parseInt(hexCodePoint, 16)));
Comment on lines +48 to +49
🟡 Unhandled RangeError from String.fromCodePoint on invalid numeric HTML entities

The generic numeric entity handlers on lines 48-49 pass decoded numbers directly to String.fromCodePoint() without validating that they are valid Unicode code points (0 to 0x10FFFF). If an HTML file contains a malformed entity like &#99999999999; or &#xFFFFFF;, String.fromCodePoint() throws an unhandled RangeError, crashing the entire check script. Since this function is applied to every .html file found in the root directory, a single malformed entity in any file would prevent the entire validation suite from running.

Suggested change
-    .replace(/&#(\d+);/g, (_, codePoint) => String.fromCodePoint(Number(codePoint)))
-    .replace(/&#x([a-f0-9]+);/gi, (_, hexCodePoint) => String.fromCodePoint(parseInt(hexCodePoint, 16)));
+    .replace(/&#(\d+);/g, (_, codePoint) => { const n = Number(codePoint); return n <= 0x10FFFF ? String.fromCodePoint(n) : ""; })
+    .replace(/&#x([a-f0-9]+);/gi, (_, hexCodePoint) => { const n = parseInt(hexCodePoint, 16); return n <= 0x10FFFF ? String.fromCodePoint(n) : ""; });
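
The same guard can be sketched as a standalone helper so it can be exercised in isolation. The names `safeFromCodePoint` and `decodeNumericEntities` are hypothetical, and dropping out-of-range entities (rather than leaving them verbatim) is an assumption carried over from the suggestion above, not something the PR specifies.

```javascript
// Hypothetical helper: returns the character for a code point, or "" when the
// value lies outside 0..0x10FFFF, where String.fromCodePoint would throw.
function safeFromCodePoint(n) {
  return Number.isInteger(n) && n >= 0 && n <= 0x10ffff ? String.fromCodePoint(n) : "";
}

// Decodes only decimal and hexadecimal numeric entities.
function decodeNumericEntities(text) {
  return text
    .replace(/&#(\d+);/g, (_, dec) => safeFromCodePoint(Number(dec)))
    .replace(/&#x([a-f0-9]+);/gi, (_, hex) => safeFromCodePoint(parseInt(hex, 16)));
}

console.log(decodeNumericEntities("&#82;oof"));         // "Roof"
console.log(decodeNumericEntities("a&#99999999999;b")); // "ab"
console.log(decodeNumericEntities("a&#xFFFFFF;b"));     // "ab"
```

The `Number.isInteger` check also covers digit runs long enough for `Number()` to return `Infinity`.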

}
Comment on lines +38 to +50
📝 Info: Entity decoding order causes asymmetric double-decode behavior

In decodeHtmlEntities, &amp; is decoded on line 41 before the generic numeric entity handlers on lines 48-49. This means double-encoded numeric entities like &amp;#82; get decoded in two passes: first &amp; → &, producing &#82;, then &#82; → R. However, double-encoded named entities like &amp;nbsp; do NOT get double-decoded, because the &nbsp; replacement (line 40) has already run by the time &amp; is decoded (line 41). This asymmetry is not a practical bug for this use case (it actually helps catch obfuscated banned terms), but it's worth noting that the function doesn't faithfully mirror browser entity decoding semantics.
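
The asymmetry can be reproduced with a reduced copy of the replacement chain. This is a sketch, not the PR's function: `decodeInPrOrder` is a hypothetical name, and it keeps only the three replacements that matter here, in the same order as the PR.

```javascript
// Reduced copy of the chain: &nbsp; first, then &amp;, then decimal entities.
function decodeInPrOrder(text) {
  return text
    .replace(/&nbsp;/gi, " ")
    .replace(/&amp;/gi, "&")
    .replace(/&#(\d+);/g, (_, codePoint) => String.fromCodePoint(Number(codePoint)));
}

// Double-encoded numeric entity: &amp; becomes &, then &#82; becomes R.
console.log(decodeInPrOrder("&amp;#82;oofing")); // "Roofing"

// Double-encoded named entity: the &nbsp; pass has already run, so the
// result still contains a literal "&nbsp;".
console.log(decodeInPrOrder("a&amp;nbsp;b"));    // "a&nbsp;b"
```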


function extractVisibleText(html) {
  const withoutComments = html.replace(/<!--[\s\S]*?-->/g, " ");
  const withoutScriptAndStyle = withoutComments
    .replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, " ")
    .replace(/<style\b[^>]*>[\s\S]*?<\/style>/gi, " ");
  const withoutTags = withoutScriptAndStyle.replace(/<[^>]+>/g, " ");
  const decodedText = decodeHtmlEntities(withoutTags);
  return decodedText.replace(/\s+/g, " ").trim();
}

for (const file of requiredFiles) {
  if (!exists(file)) {
    failures.push(`Missing required file: ${file}`);
@@ -74,8 +98,9 @@ for (const [file, formName] of Object.entries(formRequirements)) {

for (const file of fs.readdirSync(root).filter((name) => name.endsWith(".html"))) {
  const html = read(file);
  const visibleText = extractVisibleText(html);
  for (const pattern of bannedPatterns) {
-    if (pattern.test(html)) {
+    if (pattern.test(visibleText)) {
🚩 Banned-term scan no longer covers HTML attribute content (meta tags, alt text, etc.)

The refactor from pattern.test(html) to pattern.test(visibleText) at line 103 means banned terms in HTML attributes are no longer checked. The extractVisibleText function strips all tags via /<[^>]+>/g at line 57, which discards attribute values entirely. As a result, banned legacy terms like SummitLine or Roofing appearing in <meta name="description" content="...">, <meta property="og:title" content="...">, <img alt="...">, or <input placeholder="..."> would go undetected, even though these values surface to users in search results, social media previews, screen readers, and form fields, respectively. The commit message ("use visible HTML text") suggests this narrowing is intentional, likely to avoid false positives from CSS class names, data attributes, or JS identifiers, but the trade-off should be explicitly acknowledged, since meta descriptions and OG tags are the most common places legacy branding lingers after a rebrand.
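
One possible way to restore coverage without going back to scanning raw HTML is to collect a short allowlist of user-facing attribute values and scan them alongside the visible text. The sketch below is an assumption, not part of the PR: `extractUserFacingAttributes` and `USER_FACING_ATTRIBUTES` are hypothetical names, the attribute list is illustrative, and the regex handles quoted values only.

```javascript
// Hypothetical allowlist of attributes whose values reach users.
const USER_FACING_ATTRIBUTES = ["content", "alt", "title", "placeholder", "aria-label"];

// Collects the quoted values of allowlisted attributes into one string.
// Requiring leading whitespace keeps names like data-content from matching.
function extractUserFacingAttributes(html) {
  const names = USER_FACING_ATTRIBUTES.join("|");
  const attribute = new RegExp(`\\s(?:${names})\\s*=\\s*(?:"([^"]*)"|'([^']*)')`, "gi");
  const values = [];
  for (const match of html.matchAll(attribute)) {
    values.push(match[1] ?? match[2]);
  }
  return values.join(" ");
}

const sample =
  '<meta name="description" content="SummitLine Roofing experts">' +
  '<img class="summitline-hero" alt="Crew at work">';

// The class name is skipped; the meta description and alt text are kept.
console.log(extractUserFacingAttributes(sample)); // "SummitLine Roofing experts Crew at work"
```

In the check loop, the result could be passed through decodeHtmlEntities and tested against bannedPatterns together with visibleText, which keeps class names and data attributes out of the scan.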


      failures.push(`${file} contains banned legacy term: ${pattern}`);
    }
  }