Canonry · arberx · Jun 3, 2026 · Jun 3, 2026 · Jun 3, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -67,17 +67,17 @@ test/                # Unit and integration tests
 Any change to user-visible CLI surface must update **all three** of these in the same change:
 
 1. `printHelp()` in `src/cli.ts` — Options list **and** Examples block
-2. `README.md` — the `## CLI Usage` section (flags, examples, and any defaults)
+2. `docs/cli.md` — the relevant mode section, the **Flag reference** table, and the **Exit codes** section (flags, examples, and any defaults). The root `README.md` only carries a handful of headline examples; the full CLI surface lives in `docs/cli.md`.
 3. `skills/aeo/SKILL.md` — Examples and the relevant mode section (e.g. `### Sitemap Mode`)
 
 This applies to:
 
 - Adding, renaming, or removing a flag
-- Changing a default value (e.g. sitemap `--limit` default of 200) — the default must be stated in the help string, the README, and SKILL.md
+- Changing a default value (e.g. sitemap `--limit` default of 200) — the default must be stated in the help string, `docs/cli.md`, and SKILL.md
 - Adding a new CLI mode (e.g. `--sitemap`) — document flags, defaults, exit-code behavior, and an example invocation in all three places
 - Changing exit-code semantics or output format options
 
-Before opening a PR that touches `src/cli.ts`, grep the README and SKILL.md for the affected flag name to confirm everything still matches.
+Before opening a PR that touches `src/cli.ts`, grep `docs/cli.md` and SKILL.md for the affected flag name to confirm everything still matches.
 
 ## Versioning
 

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -3,7 +3,7 @@
 ## Getting Started
 
 ```bash
-git clone https://github.com/AINYC/aeo-audit.git
+git clone https://github.com/Canonry/aeo-audit.git
 cd aeo-audit
 pnpm install
 pnpm run typecheck

diff --git a/README.md b/README.md
diff --git a/docs/api.md b/docs/api.md
@@ -0,0 +1,62 @@
+# Programmatic API
+
+The library exposes three audit entry points. **Use `runSitemapAudit` for site-wide checks.** `runAeoAudit` only fetches the URL you pass it, so per-page issues like duplicate `FAQPage` blocks, JSON parse errors, or missing schema on individual templates are invisible if you call it on the homepage of a multi-page site.
+
+TypeScript declaration files are included automatically.
+
+## Single page
+
+```ts
+import { runAeoAudit } from '@ainyc/aeo-audit'
+
+const report = await runAeoAudit('https://example.com/specific-page', {
+  includeGeo: false,         // Include geographic signals (default: false)
+  includeAgentSkills: false, // Include agent skill exposure (default: false)
+  includeLighthouse: false,  // Include Lighthouse via PageSpeed Insights (default: false; adds ~15-30s)
+  factors: undefined,        // Run all factors (or pass array of factor IDs)
+  allowPrivateHost: undefined, // Permit ONE named host to resolve to a private/loopback IP (e.g. 'localhost').
+                               // Scoped to that exact host; redirects/sitemap entries to other private hosts stay blocked.
+})
+
+console.log(report.overallGrade) // 'A+'
+console.log(report.overallScore) // 98
+console.log(report.factors)      // Array of factor results with scores, findings, recommendations
+```
+
+## Site-wide (sitemap)
+
+```ts
+import { runSitemapAudit } from '@ainyc/aeo-audit'
+
+const report = await runSitemapAudit('https://example.com', {
+  limit: 200,               // Max pages to audit (default 200, sorted by sitemap priority)
+  factors: ['schema-validity', 'structured-data'],  // Optional subset
+})
+
+console.log(report.aggregateGrade)   // 'B+'
+console.log(report.pagesAudited)     // 22
+console.log(report.crossCuttingIssues) // Per-factor rollup with affectedUrls for every recommendation
+console.log(report.prioritizedFixes)   // Top 5 fixes ranked by site-wide impact
+```
+
+Each entry in `crossCuttingIssues[].topIssues` carries a `recommendation` plus the exact `affectedUrls` so you can attribute each problem to specific pages, e.g. "FAQPage duplicate" pointing at every blog post that has it.
+
+## Static output (offline, from disk)
+
+```ts
+import { runStaticAudit } from '@ainyc/aeo-audit'
+
+const result = await runStaticAudit('./out', {
+  baseUrl: 'https://example.com', // maps files to page URLs (default https://localhost)
+  limit: 200,                     // max HTML files when the path is a directory
+})
+
+if (result.kind === 'single') {
+  console.log(result.report.overallGrade)   // single .html file → AuditReport
+} else {
+  console.log(result.report.aggregateGrade) // directory → SitemapAuditReport shape
+  console.log(result.report.crossCuttingIssues)
+}
+```
+
+`runStaticAudit` performs no network I/O. Coverage is partial: server-only signals (redirects, `X-Robots-Tag`, `Last-Modified`, `Link` headers) aren't visible from static files.
diff --git a/docs/cli.md b/docs/cli.md
@@ -0,0 +1,207 @@
+# CLI reference
+
+```bash
+npx @ainyc/aeo-audit <url|path> [options]
+```
+
+Pass a **URL** to audit a live site, or a **filesystem path** (a `.html` file or a directory of built HTML, e.g. `./out`) to audit static output offline.
+
+Exit code is `0` when the score is ≥ 70 and `1` when it's below (CI-friendly). See [Exit codes](#exit-codes) for the full rules.
+
+## Output formats
+
+```bash
+# Colored terminal output (default)
+npx @ainyc/aeo-audit https://example.com
+
+# JSON output (for CI/CD)
+npx @ainyc/aeo-audit https://example.com --format json
+
+# Markdown report
+npx @ainyc/aeo-audit https://example.com --format markdown
+```
+
+## Running a subset of factors
+
+```bash
+# Run specific factors only
+npx @ainyc/aeo-audit https://example.com --factors structured-data,faq-content
+
+# Validate JSON-LD blocks for parse errors and duplicate singleton @types
+# (catches issues like duplicate FAQPage that Google flags as invalid)
+npx @ainyc/aeo-audit https://example.com --factors schema-validity
+```
+
+Factor IDs are listed in [scoring.md](scoring.md).
+
+## Optional factors
+
+```bash
+# Geographic signals (LocalBusiness geo data, address, areaServed)
+npx @ainyc/aeo-audit https://example.com --include-geo
+
+# Agent skill exposure (Schema.org Action, MCP, A2A cards, form affordances)
+npx @ainyc/aeo-audit https://example.com --include-agent-skills
+
+# Lighthouse (Performance + A11y + Best Practices, mobile) via Google PageSpeed
+# Insights. Adds ~15-30s. Single-URL only (cannot combine with --sitemap).
+npx @ainyc/aeo-audit https://example.com --lighthouse
+
+# Provide a PageSpeed Insights API key to lift anonymous rate limits
+PAGESPEED_API_KEY=xxx npx @ainyc/aeo-audit https://example.com --lighthouse --format json
+```
+
+See [scoring.md](scoring.md#optional-factors) for what each optional factor measures.
+
+## CI gating
+
+```bash
+# Force exit 1 when the meta description is missing (on top of the score gate)
+npx @ainyc/aeo-audit https://example.com --require-meta
+npx @ainyc/aeo-audit https://example.com --sitemap --require-meta
+```
+
+## Sitemap mode
+
+Audit every page discovered from the site's sitemap with bounded concurrency (5 in flight):
+
+```bash
+# Auto-discover the sitemap (tries /sitemap.xml, then /sitemap-index.xml,
+# then the Sitemap: directive in /robots.txt)
+npx @ainyc/aeo-audit https://example.com --sitemap
+
+# Provide an explicit sitemap URL
+npx @ainyc/aeo-audit https://example.com --sitemap https://example.com/sitemap.xml
+
+# Cap the number of pages (default 200, sorted by sitemap priority)
+npx @ainyc/aeo-audit https://example.com --sitemap --limit 50
+
+# Skip per-page output and show only cross-cutting issues
+npx @ainyc/aeo-audit https://example.com --sitemap --top-issues
+
+# Rewrite each <loc>'s origin to the target you named (audit staging with prod's sitemap)
+npx @ainyc/aeo-audit https://staging.example.com --sitemap --rewrite-sitemap-origin
+
+# Audit a whole local dev server: rewrite the sitemap onto localhost and unblock it
+npx @ainyc/aeo-audit http://localhost:3000 --sitemap --rewrite-sitemap-origin --allow-local
+```
+
+Auto-discovery checks `/sitemap.xml` → `/sitemap-index.xml` → `Sitemap:` directives in `/robots.txt`. Astro / Next.js / Vercel sites that only publish `sitemap-index.xml` are discovered without needing an explicit URL.
+
+`--rewrite-sitemap-origin` re-homes every `<loc>` onto the origin of the target URL you passed (preserving path and query) before crawling. Use it when a sitemap hardcodes the canonical/prod domain but you want to audit a different origin that serves the same paths: a staging host, or a local dev server. Every crawled URL is pinned to the origin you explicitly named, so there's no SSRF cost; combined with `--allow-local` it makes a local dev server's whole sitemap auditable in one command.
+
+When the sitemap has more URLs than `--limit`, the run audits the highest-priority pages and prints a notice to stderr listing how many were skipped and how to audit them all.
+
+The optional in-process factors are honored per page: pass `--include-geo` and/or `--include-agent-skills` to add them to every audited page. `--lighthouse` is the exception: it cannot be combined with `--sitemap` because each PageSpeed Insights call takes 15-30s.
+
+## Static-output mode
+
+Point the CLI at a filesystem path instead of a URL to audit built HTML directly: no network, ideal for CI on a `next export` / `dist` / `out` directory:
+
+```bash
+# Audit a whole built directory (aggregated like sitemap mode)
+npx @ainyc/aeo-audit ./out
+
+# Map files to real URLs so canonical / og:url checks are meaningful
+npx @ainyc/aeo-audit ./out --base-url https://example.com
+
+# A single built file
+npx @ainyc/aeo-audit ./dist/index.html
+
+# Gate CI on a missing meta description across the build
+npx @ainyc/aeo-audit ./out --require-meta
+```
+
+A `.html`/`.htm` file produces a single-page report; a directory is walked for HTML files and aggregated like sitemap mode (`--limit`, `--top-issues`, `--factors`, `--include-geo`, `--include-agent-skills`, and `--require-meta` all apply). `index.html` maps to its directory URL (`out/about/index.html` → `<base>/about/`); other files drop the extension (`out/blog/post.html` → `<base>/blog/post`). `llms.txt`, `llms-full.txt`, `robots.txt`, and `sitemap.xml` are read from the directory root when present.
+
+Coverage is **partial by design**: server-only signals (redirects, `X-Robots-Tag`, `Last-Modified`, `Link` headers) aren't visible from static files, so factors that depend on them score as if the header were absent. Audit the deployed URL for full coverage.
+
+## Platform detection
+
+Detect what platform, CMS, framework, or static site generator a website is built on. Useful for competitor research, lead qualification, and triage before an audit.
+
+```bash
+# Identify the stack (WordPress, Webflow, Shopify, Next.js, Vercel, etc.)
+npx @ainyc/aeo-audit https://example.com --detect-platform
+
+# JSON for programmatic use
+npx @ainyc/aeo-audit https://example.com --detect-platform --format json
+
+# Only show high-confidence matches
+npx @ainyc/aeo-audit https://example.com --detect-platform --min-confidence high
+```
+
+The detector inspects HTML, response headers, `<meta name="generator">`, script and link sources, and platform-specific globals to fingerprint:
+
+- **CMS:** WordPress, Drupal, Joomla, Ghost, HubSpot, Craft CMS, Sanity, Contentful, Notion
+- **Site builders:** Wix, Squarespace, Webflow, Framer, Carrd, Bubble
+- **E-commerce:** Shopify, WooCommerce, BigCommerce, Magento, PrestaShop
+- **Frameworks:** Next.js, Nuxt, Gatsby, Remix, Astro, SvelteKit, Angular, Vue, React, Ember, Qwik
+- **Static site generators:** Hugo, Jekyll, Eleventy, Hexo, Docusaurus, MkDocs
+- **Hosting / CDN:** Vercel, Netlify, Cloudflare, GitHub Pages, Fastly, AWS CloudFront
+
+Each detected platform is reported with a confidence bucket (`high`, `medium`, `low`), a numeric score, an optional version, and the list of signals that matched. When no CMS, site builder, or e-commerce platform is found, the report flags the site as `custom-built` (framework and hosting fingerprints are still surfaced for context). Exit code is `0` when at least one platform is detected, `1` otherwise.
+
+### Batch detection
+
+Pass `--urls` to fingerprint many sites in a single run. Pages are fetched with bounded concurrency (5 in flight by default; tune with `--concurrency`).
+
+```bash
+# From a file (one URL per line; # comments and blank lines are skipped)
+npx @ainyc/aeo-audit --detect-platform --urls urls.txt
+
+# Inline comma-separated list
+npx @ainyc/aeo-audit --detect-platform --urls https://a.com,https://b.com,https://c.com
+
+# From stdin
+cat urls.txt | npx @ainyc/aeo-audit --detect-platform --urls -
+
+# JSON for downstream processing
+npx @ainyc/aeo-audit --detect-platform --urls urls.txt --format json
+```
+
+Per-URL fetch errors don't abort the batch: each entry is reported with `status: 'success'` or `status: 'error'`. Exit code is `0` when at least one URL succeeded, `1` otherwise.
+
+## Auditing a local or private target
+
+By default the audit refuses any URL that resolves to a private, loopback, or link-local address. That is the right default for a tool that also runs as a hosted service on arbitrary input. To audit your **own** dev or staging server, pass `--allow-local` (alias `--allow-private`):
+
+```bash
+# Audit a local dev server (pass the explicit scheme; bare hosts default to https)
+npx @ainyc/aeo-audit http://localhost:3000 --allow-local
+
+# A staging box on a private IP / VPN
+npx @ainyc/aeo-audit http://10.0.5.20 --allow-private
+```
+
+The relaxation is **scoped to the single host you named on the CLI, and only that host**. It is evaluated per request hop, so a redirect or a sitemap `<loc>` pointing at any *other* private address (cloud metadata at `169.254.169.254`, internal services, …) is still blocked. There is no flag that disables the guard wholesale, and library/service callers that never set it stay fully protected.
+
+## Auxiliary file diagnostics
+
+When fetching `/llms.txt`, `/llms-full.txt`, `/robots.txt`, and `/sitemap.xml` the audit runs a **content-negotiation probe** that surfaces as a finding on the **AI-Readable Content** factor: if a file returns OK to a bare request but a non-2xx response under `Accept: text/markdown`, the audit reports a content-negotiation trap. This catches Astro / Vercel / Starlight setups that redirect `.txt` → non-existent `.md` for markdown-accepting clients, which makes the file invisible to AI content-extraction tools, even though the file is "present" by every other measure.
+
+## Flag reference
+
+| Flag | Description |
+|------|-------------|
+| `--format <type>` | Output format: `text` (default), `json`, `markdown` |
+| `--factors <list>` | Comma-separated factor IDs to run (runs all if omitted) |
+| `--include-geo` | Include the optional geographic signals factor |
+| `--include-agent-skills` | Include the optional agent skill exposure factor |
+| `--lighthouse` | Include the optional Lighthouse factor (Performance + Accessibility + Best Practices, mobile strategy) via Google PageSpeed Insights. Single-URL only; cannot combine with `--sitemap` or `--detect-platform`. Adds ~15-30s. Set `PAGESPEED_API_KEY` env var to lift anonymous rate limits. |
+| `--sitemap [url]` | Audit all pages from the sitemap. Auto-discovery tries `/sitemap.xml`, then `/sitemap-index.xml`, then `Sitemap:` directives in `/robots.txt`. Pass an explicit URL to override. |
+| `--limit <n>` | Max pages to audit in sitemap mode (default 200, sorted by sitemap priority) |
+| `--top-issues` | In sitemap mode, skip per-page output and show only cross-cutting issues |
+| `--detect-platform` | Identify the platform/CMS/framework powering the site instead of running an audit |
+| `--urls <src>` | In `--detect-platform` mode, run on multiple URLs. `<src>` is a file path (one URL per line), a comma-separated list, or `-` for stdin |
+| `--concurrency <n>` | In `--detect-platform` batch mode, max in-flight fetches (default 5) |
+| `--min-confidence <lvl>` | In platform-detect mode, only report matches at or above this level: `low` (default), `medium`, `high` |
+| `--require-meta` | Exit `1` if any audited page is missing `<meta name="description">`, regardless of the overall score. Works in single-URL, sitemap, and static-output modes. |
+| `--allow-local` (alias `--allow-private`) | Allow the single target host you named on the CLI to resolve to a private/loopback IP (e.g. `http://localhost:3000`). Scoped to that one host only; redirects and sitemap `<loc>`s to any other private host stay blocked. |
+| `--rewrite-sitemap-origin` | In `--sitemap` mode, rewrite every `<loc>`'s origin to the target URL's origin (preserving path/query) before crawling. For auditing a staging host or local dev server with a sitemap that hardcodes the prod domain. |
+| `--base-url <url>` | In static-output mode, the base URL used to map files to page URLs (e.g. `out/about/index.html` → `<base>/about/`). Default `https://localhost`. |
+| `-h`, `--help` | Show the help message |
+
+## Exit codes
+
+Exit code `0` for score ≥ 70, `1` for < 70 (CI-friendly). In sitemap and static-directory modes the exit code is based on the aggregate score. In `--detect-platform` mode the exit code is `0` if any platform is detected (or, in batch mode, if any URL succeeded), `1` otherwise. When `--require-meta` is passed, exit is forced to `1` if any audited page lacks `<meta name="description">`, regardless of the score-based rule.