
fix(seo): fix canonical URLs, hreflang, soft-404s, and noindex for private pages #798

Merged: ImJustChew merged 1 commit into main from fix/seo-improvements on Apr 27, 2026

Conversation

@ImJustChew
Member

Summary

Addresses multiple issues from Google Search Console Page Indexing report (2026-04-27):

  • 2337 "Alternate page with proper canonical tag" — sitemap listed only /zh/courses/:id as <loc>; en course pages were just hreflang alternates. Now both zh and en are separate <url> entries so Google indexes both language versions independently.
  • 1321 Soft 404s — missing courses returned HTTP 200 with generic shell HTML. Now returns 404 status + X-Robots-Tag: noindex + <meta name="robots" content="noindex"> via handleMissingCourse(). Client-side rendering also adds noindex Helmet.
  • Hreflang bug — the Cloudflare Worker rewrote canonical for bots but left the static index.html hreflang pointing to the root (nthumods.com/zh, nthumods.com/en) for all pages. Added applyHreflang() helper that updates hreflang links to be page-specific in all bot responses (course, bus, department, and generic pages).
  • Hardcoded /zh/ canonical in CourseDetailsContainer — client-side Helmet always emitted canonical /zh/courses/:id even on /en/courses/:id. Fixed to use the lang prop. Also added hreflang alternates to the Helmet.
  • Duplicate without canonical (55 pages) — added handleGenericBotPage() that sets the correct canonical (strips query params) and updates hreflang for all other bot requests to lang-prefixed pages.
  • Noindex missing from private routes: settings, student/*, next-steps, waitlist, and design-system were blocked by robots.txt but had no noindex handle. Added a belt-and-suspenders handle: { noindex: true } so TitleUpdater emits the right meta even if a page is crawled via a backlink.
  • Sitemap additions — bus routes (main/nanda zh+en), calendar, sports-venues, chat, shops, apps, privacy-policy added to both the dynamic worker sitemap and the static fallback.
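The first bullet (separate <url> entries per language) can be sketched roughly as follows. This is a minimal illustration, not the code in worker.ts: the helper names `courseSitemapEntries` and `buildCourseSitemap` and the sample ids are hypothetical, and the real sitemap may carry extra fields such as lastmod or hreflang alternates.

```typescript
// Hypothetical sketch; the actual worker.ts implementation is not shown in this PR page.
const SITEMAP_ORIGIN = "https://nthumods.com";
const SITEMAP_LANGS = ["zh", "en"];

// Build one <url> entry per language for a single course id,
// so each language version is listed as its own <loc>.
function courseSitemapEntries(rawId: string): string[] {
  // encodeURIComponent also keeps characters like "&" out of the XML.
  const id = encodeURIComponent(rawId);
  return SITEMAP_LANGS.map(function (lang) {
    return "<url><loc>" + SITEMAP_ORIGIN + "/" + lang + "/courses/" + id + "</loc></url>";
  });
}

// Wrap the entries for many courses in a <urlset> document.
function buildCourseSitemap(rawIds: string[]): string {
  let body = "";
  for (const rawId of rawIds) {
    body += courseSitemapEntries(rawId).join("\n") + "\n";
  }
  return (
    '<?xml version="1.0" encoding="UTF-8"?>\n' +
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
    body +
    "</urlset>"
  );
}
```

Each course id yields two entries, one under /zh/ and one under /en/, instead of a single /zh/ entry with the en page attached only as an alternate.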

Files changed

| File | Change |
| --- | --- |
| apps/web/worker.ts | New handleMissingCourse, applyHreflang, handleGenericBotPage; fixed sitemap to include en course URLs; hreflang injection for course/bus/dept/generic pages |
| apps/web/src/components/CourseDetails/CourseDetailsContainer.tsx | Fix lang in canonical/OG URL; add hreflang alternates; add noindex Helmet on error state |
| apps/web/src/router.tsx | Add noindex: true to settings, student/*, next-steps, waitlist, design-system |
| apps/web/public/sitemap.xml | Add bus routes, calendar, sports-venues, chat, shops, apps (static fallback only; the worker generates the dynamic sitemap at runtime) |

Test plan

  • Verify /sitemap.xml returns both /zh/courses/:id and /en/courses/:id for each course
  • Verify bot UA (e.g. curl with User-Agent: Googlebot) to a nonexistent course URL returns HTTP 404 with X-Robots-Tag: noindex
  • Verify bot UA to /en/courses/:id returns canonical pointing to /en/courses/:id (not /zh/)
  • Verify bot UA to /zh/timetable returns correct hreflang links (not root URLs)
  • Verify client-side /en/courses/:id has <link rel="canonical" href="...en/courses/..."> in DOM
  • Verify settings page has <meta name="robots" content="noindex, nofollow">
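The HTML-level checks in this test plan could be scripted with a pair of small helpers like the ones below. This is only a sketch: `extractCanonical` and `hasNoindexMeta` are hypothetical names, and the regular expressions are a quick check for well-formed tags, not a substitute for a real HTML parser.

```typescript
// Return the href of the first <link rel="canonical"> in the HTML, or null if absent.
function extractCanonical(html: string): string | null {
  const linkTag = /<link[^>]*rel=["']canonical["'][^>]*>/i.exec(html);
  if (!linkTag) return null;
  const href = /href=["']([^"']*)["']/i.exec(linkTag[0]);
  return href ? href[1] : null;
}

// Report whether the first <meta name="robots"> tag carries a noindex directive.
function hasNoindexMeta(html: string): boolean {
  const metaTag = /<meta[^>]*name=["']robots["'][^>]*>/i.exec(html);
  if (!metaTag) return false;
  return /content=["'][^"']*noindex/i.test(metaTag[0]);
}
```

Fed the body of a bot-UA response for /en/courses/:id, `extractCanonical` should return a URL under /en/courses/; fed the settings page DOM, `hasNoindexMeta` should return true.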

🤖 Generated with Claude Code

…ivate pages

- worker.ts: generate both zh and en course URLs as separate sitemap entries so
  Google indexes both language versions instead of treating en as alternates
- worker.ts: inject correct hreflang links (zh-TW/en/x-default) for course and
  bus pages served to bots, fixing the static root hreflang in index.html
- worker.ts: return 404 status + noindex meta + X-Robots-Tag for missing courses
  to eliminate 1321 soft-404 pages (was returning 200 with generic shell)
- worker.ts: add handleGenericBotPage() that sets correct canonical and hreflang
  for all other lang-prefixed bot requests (strips query params from canonical)
- worker.ts: add bus route URLs (main/nanda, zh+en) to dynamic sitemap
- worker.ts: add calendar, sports-venues, chat, shops, apps to static pages list
- CourseDetailsContainer: fix hardcoded /zh/ canonical to use lang prop so en
  course pages self-canonicalize correctly for client-side rendering
- CourseDetailsContainer: add hreflang alternates in course Helmet so bots that
  render JS see correct language relationships
- CourseDetailsContainer: add noindex Helmet to 404/error state
- router.tsx: add noindex handle to settings, student/*, next-steps, waitlist,
  design-system routes (already blocked by robots.txt, belt-and-suspenders)
- sitemap.xml: add bus routes, calendar, sports-venues, chat, shops, apps pages
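As a rough illustration of the zh-TW/en/x-default hreflang set mentioned above, the page-specific links could be generated like this. The helper `hreflangTags` is hypothetical, and pointing x-default at the zh URL is an assumption; the PR does not say which language x-default targets.

```typescript
const HREFLANG_ORIGIN = "https://nthumods.com";

// Build the three alternate links for a path given without its language prefix,
// e.g. "/timetable". The path is assumed to be safe for use in an HTML attribute.
function hreflangTags(restPath: string): string[] {
  const zhUrl = HREFLANG_ORIGIN + "/zh" + restPath;
  const enUrl = HREFLANG_ORIGIN + "/en" + restPath;
  const alternates = [
    ["zh-TW", zhUrl],
    ["en", enUrl],
    ["x-default", zhUrl], // assumption: x-default falls back to the zh version
  ];
  return alternates.map(function (pair) {
    return '<link rel="alternate" hreflang="' + pair[0] + '" href="' + pair[1] + '" />';
  });
}
```

The key point is that every href includes the page path, rather than the root URLs (nthumods.com/zh, nthumods.com/en) that the static index.html carried.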

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 27, 2026 09:30
@vercel

vercel Bot commented Apr 27, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| courseweb | Ready | Preview | Apr 27, 2026 9:30am |

@sonarqubecloud

Quality Gate failed

Failed conditions
3.9% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@cloudflare-workers-and-pages

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ Deployment successful! | courseweb-web | 45a6c8b | Commit Preview URL, Branch Preview URL | Apr 27 2026, 09:31 AM |

Contributor

Copilot AI left a comment

Pull request overview

This PR addresses several SEO/indexing issues (canonical URLs, hreflang, soft-404 handling, and noindex for private pages) by updating both the Cloudflare Worker’s bot responses and the client-side Helmet metadata, plus expanding sitemap coverage.

Changes:

  • Update the Cloudflare Worker to emit correct canonical + hreflang for bot requests, return proper 404 + noindex for missing courses, and include both language variants in the dynamic sitemap.
  • Fix course detail client-side canonical/OG URL generation to respect the current language and add hreflang alternates.
  • Add noindex route handles for private/internal pages and expand the static sitemap fallback routes.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| apps/web/worker.ts | Adds bot-specific handlers for canonical/hreflang updates, missing-course 404/noindex behavior, and improves dynamic sitemap generation. |
| apps/web/src/components/CourseDetails/CourseDetailsContainer.tsx | Fixes language-specific canonical/OG URL and adds hreflang alternates + noindex in error state. |
| apps/web/src/router.tsx | Marks additional private/internal routes with handle.noindex so shared SEO handling can emit noindex. |
| apps/web/public/sitemap.xml | Updates static fallback sitemap with additional routes and refreshed lastmod dates. |
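For readers unfamiliar with the handle.noindex pattern, the idea is that a shared component inspects the matched routes and emits a robots meta accordingly. A minimal sketch, assuming a match list shaped like the hypothetical `RouteMatch` type below (the real TitleUpdater is not shown here, and the "index, follow" default is an assumption):

```typescript
// Hypothetical shape of a matched route; only the fields used here are modeled.
type RouteMatch = {
  pathname: string;
  handle?: { noindex?: boolean };
};

// True if any matched route (parent or child) opts out of indexing.
function shouldNoindex(matches: RouteMatch[]): boolean {
  for (const match of matches) {
    if (match.handle && match.handle.noindex === true) return true;
  }
  return false;
}

// Value for <meta name="robots">. "noindex, nofollow" mirrors the test plan;
// the indexable default is an assumption.
function robotsContent(matches: RouteMatch[]): string {
  return shouldNoindex(matches) ? "noindex, nofollow" : "index, follow";
}
```

Checking every match rather than only the leaf means a handle set on a parent route such as student/* also covers its children.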

Comment on lines 167 to +170
courseCode: `${course.department} ${course.course}-${course.class}`,
educationalLevel: "University",
inLanguage: "zh-TW",
- url: `https://nthumods.com/zh/courses/${course.raw_id}`,
+ url: `https://nthumods.com/${lang}/courses/${course.raw_id}`,

Copilot AI Apr 27, 2026

The JSON-LD Course object still hardcodes inLanguage: "zh-TW" even when rendering an /en/... page, while url is now language-specific. This produces inconsistent structured data for English pages. Consider deriving inLanguage (and possibly the primary name) from the lang prop so the schema matches the page language.
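One way to act on this suggestion is to derive inLanguage from the same lang value that drives the URL. The sketch below is an assumption about shape, not the component's actual code: `CourseLike`, `inLanguageFor`, and `courseJsonLd` are hypothetical names, and mapping "en" to the tag "en" is a guess at what the author would want.

```typescript
type Lang = "zh" | "en";

// Only the course fields visible in the reviewed snippet are modeled here.
type CourseLike = { department: string; course: string; class: string; raw_id: string };

// Map the route language to a language tag for structured data.
function inLanguageFor(lang: Lang): string {
  return lang === "zh" ? "zh-TW" : "en";
}

// Build a JSON-LD Course object whose inLanguage and url agree with the page language.
function courseJsonLd(course: CourseLike, lang: Lang) {
  return {
    "@context": "https://schema.org",
    "@type": "Course",
    courseCode: course.department + " " + course.course + "-" + course.class,
    educationalLevel: "University",
    inLanguage: inLanguageFor(lang),
    url: "https://nthumods.com/" + lang + "/courses/" + course.raw_id,
  };
}
```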

Copilot uses AI. Check for mistakes.
Comment on lines +232 to +233
<meta name="robots" content="noindex, nofollow" />
<meta name="googlebot" content="noindex, nofollow" />

Copilot AI Apr 27, 2026

In the missing-course/error state, the component sets noindex, nofollow but doesn’t set a canonical URL. Since the rest of the app (e.g. TitleUpdater) will still emit a canonical pointing at the missing /courses/:id URL, crawlers that hit the client-rendered version can see conflicting signals compared to the Worker (which canonicalizes to /courses). Consider explicitly setting a canonical in this error Helmet (and aligning nofollow vs follow with the Worker directive) to keep behavior consistent.

Suggested change
- <meta name="robots" content="noindex, nofollow" />
- <meta name="googlebot" content="noindex, nofollow" />
+ <meta name="robots" content="noindex, follow" />
+ <meta name="googlebot" content="noindex, follow" />
+ <link
+   rel="canonical"
+   href={`https://nthumods.com/${lang}/courses`}
+ />

Comment thread apps/web/worker.ts
Comment on lines +415 to +416
const zhPath = pathname.replace(/^\/(zh|en)\//, "/zh/");
const enPath = pathname.replace(/^\/(zh|en)\//, "/en/");

Copilot AI Apr 27, 2026

handleGenericBotPage() computes zhPath/enPath using a regex that only matches /zh/ or /en/ (with a trailing slash). For paths like /zh or /en (which are valid routes), the replacement won’t happen and both alternates can end up pointing to the same URL. Update the replacement logic to also handle the end-of-string case so that /zh and /en are rewritten correctly.

Suggested change
- const zhPath = pathname.replace(/^\/(zh|en)\//, "/zh/");
- const enPath = pathname.replace(/^\/(zh|en)\//, "/en/");
+ const zhPath = pathname.replace(/^\/(zh|en)(\/|$)/, "/zh$2");
+ const enPath = pathname.replace(/^\/(zh|en)(\/|$)/, "/en$2");
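The difference between the two patterns is easiest to see side by side. The wrappers below (`swapLangOriginal`, `swapLangSuggested`) are hypothetical names used only to compare the regexes quoted in this thread.

```typescript
const ORIGINAL_LANG_PATTERN = /^\/(zh|en)\//;
const SUGGESTED_LANG_PATTERN = /^\/(zh|en)(\/|$)/;

// Original behavior: only rewrites when the prefix is followed by a slash.
function swapLangOriginal(pathname: string, lang: string): string {
  return pathname.replace(ORIGINAL_LANG_PATTERN, "/" + lang + "/");
}

// Suggested behavior: also rewrites a bare "/zh" or "/en";
// "$2" re-inserts the captured slash, or nothing at end of string.
function swapLangSuggested(pathname: string, lang: string): string {
  return pathname.replace(SUGGESTED_LANG_PATTERN, "/" + lang + "$2");
}

console.log(swapLangOriginal("/zh", "en"));            // "/zh" (unchanged, the bug)
console.log(swapLangSuggested("/zh", "en"));           // "/en"
console.log(swapLangSuggested("/zh/timetable", "en")); // "/en/timetable"
```

Because the suggested pattern requires a slash or end of string after the prefix, a path that merely starts with the same letters is left alone.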

Comment thread apps/web/worker.ts
Comment on lines 242 to 246
async function handleDepartmentPage(url: URL, env: Env): Promise<Response> {
const dept = url.searchParams.get("department") ?? "";
const lang = url.pathname.includes("/zh/") ? "zh" : "en";
- const fallback = () =>
-   env.ASSETS.fetch(new Request(`${url.origin}/index.html`));
+ const fallback = () => handleGenericBotPage(url, env);


Copilot AI Apr 27, 2026

handleGenericBotPage() always drops url.search from the canonical URL (canonicalUrl uses only pathname). But handleDepartmentPage() uses this as its fallback, which means a department-filtered page could canonicalize to /courses (missing ?department=...) when the API errors/returns empty. Consider preserving the department query param in the fallback canonical (and alternates) to avoid collapsing distinct pages into the same canonical on failures.
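One possible shape for this is a canonical builder that drops the query string by default but keeps an explicit allowlist of parameters. This is a sketch under assumptions: `canonicalWithParams` is a hypothetical helper, and whether department is the only parameter worth preserving is for the author to decide.

```typescript
// Build a canonical URL from origin + pathname, keeping only allowlisted query params.
function canonicalWithParams(rawUrl: string, keep: string[]): string {
  const url = new URL(rawUrl);
  const kept: string[] = [];
  for (const name of keep) {
    const value = url.searchParams.get(name);
    // Skip params that are absent or empty so "?department=" does not leak through.
    if (value !== null && value !== "") {
      kept.push(encodeURIComponent(name) + "=" + encodeURIComponent(value));
    }
  }
  return url.origin + url.pathname + (kept.length > 0 ? "?" + kept.join("&") : "");
}
```

For example, `canonicalWithParams("https://nthumods.com/zh/courses?department=CS&utm_source=x", ["department"])` keeps ?department=CS and drops the tracking parameter, while passing an empty allowlist reproduces the strip-everything behavior of the generic handler.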

Comment thread apps/web/worker.ts
Comment on lines +424 to +428
let rewriter = new HTMLRewriter().on('link[rel="canonical"]', {
element(el) {
el.setAttribute("href", canonicalUrl);
},
});

Copilot AI Apr 27, 2026

handleGenericBotPage() rewrites only <link rel="canonical">, but leaves og:url (and other social meta) as whatever is in the static index.html (currently https://nthumods.com). Since social crawlers are included in isBot(), this can produce incorrect OG previews/metadata on most pages. Consider also rewriting meta[property="og:url"] (at least) to match the computed canonical URL.

Suggested change
- let rewriter = new HTMLRewriter().on('link[rel="canonical"]', {
-   element(el) {
-     el.setAttribute("href", canonicalUrl);
-   },
- });
+ let rewriter = new HTMLRewriter()
+   .on('link[rel="canonical"]', {
+     element(el) {
+       el.setAttribute("href", canonicalUrl);
+     },
+   })
+   .on('meta[property="og:url"]', {
+     element(el) {
+       el.setAttribute("content", canonicalUrl);
+     },
+   })
+   .on('meta[name="twitter:url"]', {
+     element(el) {
+       el.setAttribute("content", canonicalUrl);
+     },
+   })
+   .on('meta[property="twitter:url"]', {
+     element(el) {
+       el.setAttribute("content", canonicalUrl);
+     },
+   });

Comment thread apps/web/worker.ts
Comment on lines +141 to 143
if (!apiRes.ok || apiRes.status === 404) {
return handleMissingCourse(lang, env, origin);
}

Copilot AI Apr 27, 2026

handleCourseDetailPage() treats any non-OK response (including 5xx/timeouts) as a missing course and serves a 404 + noindex. If the upstream API has a transient outage, this can cause bots to see widespread 404s and potentially deindex valid pages. Consider distinguishing true 404s from other failures (e.g., return a 503/Retry-After or fall back to a 200 shell with a safer robots directive) so temporary API issues don’t look like permanent removals.
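A minimal sketch of that distinction, as a pure decision function the handler could consult before choosing a response. The names `BotAction` and `classifyUpstream` are hypothetical, the 300-second Retry-After is an arbitrary placeholder, and treating every non-404 failure as temporary is an assumption. Note also that in the quoted condition, `apiRes.status === 404` is already covered by `!apiRes.ok`.

```typescript
type BotAction =
  | { kind: "render" }
  | { kind: "missing"; status: number }
  | { kind: "unavailable"; status: number; retryAfterSeconds: number };

// Decide how to answer a bot based on the upstream API status code.
function classifyUpstream(apiStatus: number): BotAction {
  if (apiStatus >= 200 && apiStatus < 300) {
    return { kind: "render" };
  }
  if (apiStatus === 404) {
    // The course truly does not exist: serve 404 + noindex.
    return { kind: "missing", status: 404 };
  }
  // Anything else (5xx, rate limits, and so on) is treated as temporary,
  // so crawlers are asked to retry rather than told the page is gone.
  return { kind: "unavailable", status: 503, retryAfterSeconds: 300 };
}
```

With this split, only a genuine upstream 404 would reach handleMissingCourse(); a transient outage would surface as a 503 with a Retry-After header instead of a wave of 404s.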

ImJustChew merged commit 4f73e0e into main on Apr 27, 2026
9 of 10 checks passed

2 participants