fix(seo): fix canonical URLs, hreflang, soft-404s, and noindex for private pages #798
ImJustChew merged 1 commit into main
…ivate pages

- worker.ts: generate both zh and en course URLs as separate sitemap entries so Google indexes both language versions instead of treating en as alternates
- worker.ts: inject correct hreflang links (zh-TW/en/x-default) for course and bus pages served to bots, fixing the static root hreflang in index.html
- worker.ts: return 404 status + noindex meta + X-Robots-Tag for missing courses to eliminate 1321 soft-404 pages (was returning 200 with generic shell)
- worker.ts: add handleGenericBotPage() that sets correct canonical and hreflang for all other lang-prefixed bot requests (strips query params from canonical)
- worker.ts: add bus route URLs (main/nanda, zh+en) to dynamic sitemap
- worker.ts: add calendar, sports-venues, chat, shops, apps to static pages list
- CourseDetailsContainer: fix hardcoded /zh/ canonical to use lang prop so en course pages self-canonicalize correctly for client-side rendering
- CourseDetailsContainer: add hreflang alternates in course Helmet so bots that render JS see correct language relationships
- CourseDetailsContainer: add noindex Helmet to 404/error state
- router.tsx: add noindex handle to settings, student/*, next-steps, waitlist, design-system routes (already blocked by robots.txt, belt-and-suspenders)
- sitemap.xml: add bus routes, calendar, sports-venues, chat, shops, apps pages

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful | courseweb-web | 45a6c8b | Commit Preview URL, Branch Preview URL | Apr 27 2026, 09:31 AM |
Pull request overview
This PR addresses several SEO/indexing issues (canonical URLs, hreflang, soft-404 handling, and noindex for private pages) by updating both the Cloudflare Worker’s bot responses and the client-side Helmet metadata, plus expanding sitemap coverage.
Changes:
- Update the Cloudflare Worker to emit correct canonical + hreflang for bot requests, return proper 404 + noindex for missing courses, and include both language variants in the dynamic sitemap.
- Fix course detail client-side canonical/OG URL generation to respect the current language and add hreflang alternates.
- Add noindex route handles for private/internal pages and expand the static sitemap fallback routes.
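
The sitemap change listed above (one entry per language instead of en as an alternate only) could look roughly like the sketch below. The helper name `courseSitemapEntries`, the `ORIGIN` constant, and the example ID are illustrative assumptions; the actual worker.ts code is not shown on this page.

```typescript
// Hypothetical sketch: not the actual worker.ts implementation.
type Lang = "zh" | "en";

const ORIGIN = "https://nthumods.com";
const LANGS: Lang[] = ["zh", "en"];

// Emit one <url> entry per language so each version has its own <loc>.
function courseSitemapEntries(rawId: string): string[] {
  return LANGS.map(
    (lang) =>
      `<url><loc>${ORIGIN}/${lang}/courses/${encodeURIComponent(rawId)}</loc></url>`,
  );
}

// "EXAMPLE101" is a made-up course ID used only for illustration.
console.log(courseSitemapEntries("EXAMPLE101").join("\n"));
```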
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| apps/web/worker.ts | Adds bot-specific handlers for canonical/hreflang updates, missing-course 404/noindex behavior, and improves dynamic sitemap generation. |
| apps/web/src/components/CourseDetails/CourseDetailsContainer.tsx | Fixes language-specific canonical/OG URL and adds hreflang alternates + noindex in error state. |
| apps/web/src/router.tsx | Marks additional private/internal routes with handle.noindex so shared SEO handling can emit noindex. |
| apps/web/public/sitemap.xml | Updates static fallback sitemap with additional routes and refreshed lastmod dates. |
      courseCode: `${course.department} ${course.course}-${course.class}`,
      educationalLevel: "University",
      inLanguage: "zh-TW",
    - url: `https://nthumods.com/zh/courses/${course.raw_id}`,
    + url: `https://nthumods.com/${lang}/courses/${course.raw_id}`,
The JSON-LD Course object still hardcodes inLanguage: "zh-TW" even when rendering an /en/... page, while url is now language-specific. This produces inconsistent structured data for English pages. Consider deriving inLanguage (and possibly the primary name) from the lang prop so the schema matches the page language.
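
One way to act on this comment is to look up the schema language tag from the same `lang` value that drives the URL. The sketch below is an assumption about shape only: `IN_LANGUAGE`, `courseJsonLd`, and the reduced set of fields are hypothetical, and the real component builds a larger object.

```typescript
// Hypothetical sketch: field set reduced to the two properties under discussion.
type Lang = "zh" | "en";

// Map the route language to the tag used in structured data.
const IN_LANGUAGE: Record<Lang, string> = { zh: "zh-TW", en: "en" };

function courseJsonLd(lang: Lang, rawId: string) {
  return {
    "@type": "Course",
    inLanguage: IN_LANGUAGE[lang], // stays in step with the URL below
    url: `https://nthumods.com/${lang}/courses/${rawId}`,
  };
}

// "EXAMPLE101" is a made-up course ID used only for illustration.
console.log(JSON.stringify(courseJsonLd("en", "EXAMPLE101")));
```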
    <meta name="robots" content="noindex, nofollow" />
    <meta name="googlebot" content="noindex, nofollow" />
In the missing-course/error state, the component sets noindex, nofollow but doesn’t set a canonical URL. Since the rest of the app (e.g. TitleUpdater) will still emit a canonical pointing at the missing /courses/:id URL, crawlers that hit the client-rendered version can see conflicting signals compared to the Worker (which canonicalizes to /courses). Consider explicitly setting a canonical in this error Helmet (and aligning nofollow vs follow with the Worker directive) to keep behavior consistent.
    - <meta name="robots" content="noindex, nofollow" />
    - <meta name="googlebot" content="noindex, nofollow" />
    + <meta name="robots" content="noindex, follow" />
    + <meta name="googlebot" content="noindex, follow" />
    + <link
    +   rel="canonical"
    +   href={`https://nthumods.com/${lang}/courses`}
    + />
    const zhPath = pathname.replace(/^\/(zh|en)\//, "/zh/");
    const enPath = pathname.replace(/^\/(zh|en)\//, "/en/");
handleGenericBotPage() computes zhPath/enPath using a regex that only matches /zh/ or /en/ (with a trailing slash). For paths like /zh or /en (which are valid routes), the replacement won’t happen and both alternates can end up pointing to the same URL. Update the replacement logic to also handle the end-of-string case so /zh ↔ /en works correctly.
    - const zhPath = pathname.replace(/^\/(zh|en)\//, "/zh/");
    - const enPath = pathname.replace(/^\/(zh|en)\//, "/en/");
    + const zhPath = pathname.replace(/^\/(zh|en)(\/|$)/, "/zh$2");
    + const enPath = pathname.replace(/^\/(zh|en)(\/|$)/, "/en$2");
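
The behaviour of the suggested regex can be checked in isolation. The wrapper `swapLangPrefix` below is a hypothetical name introduced for this sketch; only the pattern itself comes from the suggestion above.

```typescript
// Hypothetical wrapper around the suggested pattern.
type Lang = "zh" | "en";

// Replace a leading /zh or /en segment, whether it is followed by "/" or ends the path.
// Group 2 captures that trailing "/" (or the empty string) and is put back via $2.
function swapLangPrefix(pathname: string, lang: Lang): string {
  return pathname.replace(/^\/(zh|en)(\/|$)/, `/${lang}$2`);
}

console.log(swapLangPrefix("/zh", "en"));
console.log(swapLangPrefix("/en/timetable", "zh"));
```

Because the pattern requires either a slash or the end of the string after the language code, a path that merely begins with the same letters (for example `/zhongwen`) is left untouched.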
      async function handleDepartmentPage(url: URL, env: Env): Promise<Response> {
        const dept = url.searchParams.get("department") ?? "";
        const lang = url.pathname.includes("/zh/") ? "zh" : "en";
    -   const fallback = () =>
    -     env.ASSETS.fetch(new Request(`${url.origin}/index.html`));
    +   const fallback = () => handleGenericBotPage(url, env);
handleGenericBotPage() always drops url.search from the canonical URL (canonicalUrl uses only pathname). But handleDepartmentPage() uses this as its fallback, which means a department-filtered page could canonicalize to /courses (missing ?department=...) when the API errors/returns empty. Consider preserving the department query param in the fallback canonical (and alternates) to avoid collapsing distinct pages into the same canonical on failures.
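
A possible shape for this, sketched under the assumption that only an allow-list of query parameters should survive into the canonical URL. `canonicalFor` and its `keepParams` argument are hypothetical and are not part of the PR.

```typescript
// Hypothetical sketch: keep only allow-listed query params in the canonical URL.
function canonicalFor(url: URL, keepParams: string[] = ["department"]): string {
  const kept = new URLSearchParams();
  for (const name of keepParams) {
    const value = url.searchParams.get(name);
    if (value) kept.set(name, value); // skip absent or empty params
  }
  const query = kept.toString();
  return `${url.origin}${url.pathname}${query ? `?${query}` : ""}`;
}

// Tracking params are dropped, the department filter is preserved.
console.log(
  canonicalFor(new URL("https://nthumods.com/zh/courses?department=CS&utm_source=x")),
);
```

With an empty allow-list the function falls back to the current behaviour of stripping the whole query string.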
    let rewriter = new HTMLRewriter().on('link[rel="canonical"]', {
      element(el) {
        el.setAttribute("href", canonicalUrl);
      },
    });
handleGenericBotPage() rewrites only <link rel="canonical">, but leaves og:url (and other social meta) as whatever is in the static index.html (currently https://nthumods.com). Since social crawlers are included in isBot(), this can produce incorrect OG previews/metadata on most pages. Consider also rewriting meta[property="og:url"] (at least) to match the computed canonical URL.
    - let rewriter = new HTMLRewriter().on('link[rel="canonical"]', {
    -   element(el) {
    -     el.setAttribute("href", canonicalUrl);
    -   },
    - });
    + let rewriter = new HTMLRewriter()
    +   .on('link[rel="canonical"]', {
    +     element(el) {
    +       el.setAttribute("href", canonicalUrl);
    +     },
    +   })
    +   .on('meta[property="og:url"]', {
    +     element(el) {
    +       el.setAttribute("content", canonicalUrl);
    +     },
    +   })
    +   .on('meta[name="twitter:url"]', {
    +     element(el) {
    +       el.setAttribute("content", canonicalUrl);
    +     },
    +   })
    +   .on('meta[property="twitter:url"]', {
    +     element(el) {
    +       el.setAttribute("content", canonicalUrl);
    +     },
    +   });
    if (!apiRes.ok || apiRes.status === 404) {
      return handleMissingCourse(lang, env, origin);
    }
handleCourseDetailPage() treats any non-OK response (including 5xx/timeouts) as a missing course and serves a 404 + noindex. If the upstream API has a transient outage, this can cause bots to see widespread 404s and potentially deindex valid pages. Consider distinguishing true 404s from other failures (e.g., return a 503/Retry-After or fall back to a 200 shell with a safer robots directive) so temporary API issues don’t look like permanent removals.
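
The distinction described here could be sketched as a small classifier on the upstream status code. Everything in this block is hypothetical: `BotOutcome`, `classifyUpstream`, and the 300-second retry hint are illustrative choices, and treating every non-404 failure as temporary is an assumption rather than something the PR specifies.

```typescript
// Hypothetical sketch: decide how the bot response should react to the API status.
type BotOutcome =
  | { kind: "render" }
  | { kind: "missing"; status: 404 }
  | { kind: "unavailable"; status: 503; retryAfterSeconds: number };

function classifyUpstream(status: number): BotOutcome {
  // 2xx: the course exists, render the page as usual.
  if (status >= 200 && status < 300) return { kind: "render" };
  // A real 404 is the only case treated as a permanently missing course.
  if (status === 404) return { kind: "missing", status: 404 };
  // Anything else (5xx, rate limits, other 4xx) is assumed to be temporary.
  return { kind: "unavailable", status: 503, retryAfterSeconds: 300 };
}

console.log(classifyUpstream(500).kind);
```

A caller would serve the 404 + noindex page only for `missing`, and answer `unavailable` with a 503 and a `Retry-After` header so crawlers come back later instead of dropping the URL.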


Summary
Addresses multiple issues from Google Search Console Page Indexing report (2026-04-27):
- … `/zh/courses/:id` as `<loc>`; en course pages were just hreflang alternates. Now both zh and en are separate `<url>` entries so Google indexes both language versions independently.
- … `X-Robots-Tag: noindex` + `<meta name="robots" content="noindex">` via `handleMissingCourse()`. Client-side rendering also adds noindex Helmet.
- … `index.html` hreflang pointing to the root (`nthumods.com/zh`, `nthumods.com/en`) for all pages. Added `applyHreflang()` helper that updates hreflang links to be page-specific in all bot responses (course, bus, department, and generic pages).
- … `/zh/` canonical in CourseDetailsContainer: client-side Helmet always emitted canonical `/zh/courses/:id` even on `/en/courses/:id`. Fixed to use the `lang` prop. Also added hreflang alternates to the Helmet.
- … `handleGenericBotPage()` that sets the correct canonical (strips query params) and updates hreflang for all other bot requests to lang-prefixed pages.
- `settings`, `student/*`, `next-steps`, `waitlist`, `design-system` were blocked by `robots.txt` but had no `noindex` handle. Added belt-and-suspenders `handle: { noindex: true }` so `TitleUpdater` emits the right meta even if crawled via a backlink.

Files changed

- `apps/web/worker.ts`: … `handleMissingCourse`, `applyHreflang`, `handleGenericBotPage`; fixed sitemap to include en course URLs; hreflang injection for course/bus/dept/generic pages
- `apps/web/src/components/CourseDetails/CourseDetailsContainer.tsx`: … `lang` in canonical/OG URL; add hreflang alternates; add noindex Helmet on error state
- `apps/web/src/router.tsx`: … `noindex: true` to settings, student/*, next-steps, waitlist, design-system
- `apps/web/public/sitemap.xml`: …

Test plan

- `/sitemap.xml` returns both `/zh/courses/:id` and `/en/courses/:id` for each course
- … (`User-Agent: Googlebot`) to a nonexistent course URL returns HTTP 404 with `X-Robots-Tag: noindex`
- `/en/courses/:id` returns canonical pointing to `/en/courses/:id` (not `/zh/`)
- `/zh/timetable` returns correct hreflang links (not root URLs)
- `/en/courses/:id` has `<link rel="canonical" href="...en/courses/...">` in DOM
- `settings` page has `<meta name="robots" content="noindex, nofollow">`

🤖 Generated with Claude Code