Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -67,17 +67,17 @@ test/ # Unit and integration tests
Any change to user-visible CLI surface must update **all three** of these in the same change:

1. `printHelp()` in `src/cli.ts` — Options list **and** Examples block
2. `README.md` — the `## CLI Usage` section (flags, examples, and any defaults)
2. `docs/cli.md` — the relevant mode section, the **Flag reference** table, and the **Exit codes** section (flags, examples, and any defaults). The root `README.md` only carries a handful of headline examples; the full CLI surface lives in `docs/cli.md`.
3. `skills/aeo/SKILL.md` — Examples and the relevant mode section (e.g. `### Sitemap Mode`)

This applies to:

- Adding, renaming, or removing a flag
- Changing a default value (e.g. sitemap `--limit` default of 200) — the default must be stated in the help string, the README, and SKILL.md
- Changing a default value (e.g. sitemap `--limit` default of 200) — the default must be stated in the help string, `docs/cli.md`, and SKILL.md
- Adding a new CLI mode (e.g. `--sitemap`) — document flags, defaults, exit-code behavior, and an example invocation in all three places
- Changing exit-code semantics or output format options

Before opening a PR that touches `src/cli.ts`, grep the README and SKILL.md for the affected flag name to confirm everything still matches.
Before opening a PR that touches `src/cli.ts`, grep `docs/cli.md` and SKILL.md for the affected flag name to confirm everything still matches.

## Versioning

Expand Down
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
## Getting Started

```bash
git clone https://github.com/AINYC/aeo-audit.git
git clone https://github.com/Canonry/aeo-audit.git
cd aeo-audit
pnpm install
pnpm run typecheck
Expand Down
389 changes: 37 additions & 352 deletions README.md

Large diffs are not rendered by default.

62 changes: 62 additions & 0 deletions docs/api.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
# Programmatic API

The library exposes three audit entry points. **Use `runSitemapAudit` for site-wide checks.** `runAeoAudit` only fetches the URL you pass it, so per-page issues like duplicate `FAQPage` blocks, JSON parse errors, or missing schema on individual templates are invisible if you call it on the homepage of a multi-page site.

TypeScript declaration files are included automatically.

## Single page

```ts
import { runAeoAudit } from '@ainyc/aeo-audit'

const report = await runAeoAudit('https://example.com/specific-page', {
includeGeo: false, // Include geographic signals (default: false)
includeAgentSkills: false, // Include agent skill exposure (default: false)
includeLighthouse: false, // Include Lighthouse via PageSpeed Insights (default: false; adds ~15-30s)
factors: undefined, // Run all factors (or pass array of factor IDs)
allowPrivateHost: undefined, // Permit ONE named host to resolve to a private/loopback IP (e.g. 'localhost').
// Scoped to that exact host; redirects/sitemap entries to other private hosts stay blocked.
})

console.log(report.overallGrade) // 'A+'
console.log(report.overallScore) // 98
console.log(report.factors) // Array of factor results with scores, findings, recommendations
```

## Site-wide (sitemap)

```ts
import { runSitemapAudit } from '@ainyc/aeo-audit'

const report = await runSitemapAudit('https://example.com', {
limit: 200, // Max pages to audit (default 200, sorted by sitemap priority)
factors: ['schema-validity', 'structured-data'], // Optional subset
})

console.log(report.aggregateGrade) // 'B+'
console.log(report.pagesAudited) // 22
console.log(report.crossCuttingIssues) // Per-factor rollup with affectedUrls for every recommendation
console.log(report.prioritizedFixes) // Top 5 fixes ranked by site-wide impact
```

Each entry in `crossCuttingIssues[].topIssues` carries a `recommendation` plus the exact `affectedUrls` so you can attribute each problem to specific pages, e.g. "FAQPage duplicate" pointing at every blog post that has it.

## Static output (offline, from disk)

```ts
import { runStaticAudit } from '@ainyc/aeo-audit'

const result = await runStaticAudit('./out', {
baseUrl: 'https://example.com', // maps files to page URLs (default https://localhost)
limit: 200, // max HTML files when the path is a directory
})

if (result.kind === 'single') {
console.log(result.report.overallGrade) // single .html file → AuditReport
} else {
console.log(result.report.aggregateGrade) // directory → SitemapAuditReport shape
console.log(result.report.crossCuttingIssues)
}
```

`runStaticAudit` performs no network I/O. Coverage is partial: server-only signals (redirects, `X-Robots-Tag`, `Last-Modified`, `Link` headers) aren't visible from static files.
207 changes: 207 additions & 0 deletions docs/cli.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,207 @@
# CLI reference

```bash
npx @ainyc/aeo-audit <url|path> [options]
```

Pass a **URL** to audit a live site, or a **filesystem path** (a `.html` file or a directory of built HTML, e.g. `./out`) to audit static output offline.

Exit code is `0` when the score is ≥ 70 and `1` when it's below (CI-friendly). See [Exit codes](#exit-codes) for the full rules.

## Output formats

```bash
# Colored terminal output (default)
npx @ainyc/aeo-audit https://example.com

# JSON output (for CI/CD)
npx @ainyc/aeo-audit https://example.com --format json

# Markdown report
npx @ainyc/aeo-audit https://example.com --format markdown
```

## Running a subset of factors

```bash
# Run specific factors only
npx @ainyc/aeo-audit https://example.com --factors structured-data,faq-content

# Validate JSON-LD blocks for parse errors and duplicate singleton @types
# (catches issues like duplicate FAQPage that Google flags as invalid)
npx @ainyc/aeo-audit https://example.com --factors schema-validity
```

Factor IDs are listed in [scoring.md](scoring.md).

## Optional factors

```bash
# Geographic signals (LocalBusiness geo data, address, areaServed)
npx @ainyc/aeo-audit https://example.com --include-geo

# Agent skill exposure (Schema.org Action, MCP, A2A cards, form affordances)
npx @ainyc/aeo-audit https://example.com --include-agent-skills

# Lighthouse (Performance + A11y + Best Practices, mobile) via Google PageSpeed
# Insights. Adds ~15-30s. Single-URL only (cannot combine with --sitemap).
npx @ainyc/aeo-audit https://example.com --lighthouse

# Provide a PageSpeed Insights API key to lift anonymous rate limits
PAGESPEED_API_KEY=xxx npx @ainyc/aeo-audit https://example.com --lighthouse --format json
```

See [scoring.md](scoring.md#optional-factors) for what each optional factor measures.

## CI gating

```bash
# Force exit 1 when the meta description is missing (on top of the score gate)
npx @ainyc/aeo-audit https://example.com --require-meta
npx @ainyc/aeo-audit https://example.com --sitemap --require-meta
```

## Sitemap mode

Audit every page discovered from the site's sitemap with bounded concurrency (5 in flight):

```bash
# Auto-discover the sitemap (tries /sitemap.xml, then /sitemap-index.xml,
# then the Sitemap: directive in /robots.txt)
npx @ainyc/aeo-audit https://example.com --sitemap

# Provide an explicit sitemap URL
npx @ainyc/aeo-audit https://example.com --sitemap https://example.com/sitemap.xml

# Cap the number of pages (default 200, sorted by sitemap priority)
npx @ainyc/aeo-audit https://example.com --sitemap --limit 50

# Skip per-page output and show only cross-cutting issues
npx @ainyc/aeo-audit https://example.com --sitemap --top-issues

# Rewrite each <loc>'s origin to the target you named (audit staging with prod's sitemap)
npx @ainyc/aeo-audit https://staging.example.com --sitemap --rewrite-sitemap-origin

# Audit a whole local dev server: rewrite the sitemap onto localhost and unblock it
npx @ainyc/aeo-audit http://localhost:3000 --sitemap --rewrite-sitemap-origin --allow-local
```

Auto-discovery checks `/sitemap.xml` → `/sitemap-index.xml` → `Sitemap:` directives in `/robots.txt`. Astro / Next.js / Vercel sites that only publish `sitemap-index.xml` are discovered without needing an explicit URL.

`--rewrite-sitemap-origin` re-homes every `<loc>` onto the origin of the target URL you passed (preserving path and query) before crawling. Use it when a sitemap hardcodes the canonical/prod domain but you want to audit a different origin that serves the same paths: a staging host, or a local dev server. Every crawled URL is pinned to the origin you explicitly named, so there's no SSRF cost; combined with `--allow-local` it makes a local dev server's whole sitemap auditable in one command.

When the sitemap has more URLs than `--limit`, the run audits the highest-priority pages and prints a notice to stderr listing how many were skipped and how to audit them all.

The optional in-process factors are honored per page: pass `--include-geo` and/or `--include-agent-skills` to add them to every audited page. `--lighthouse` is the exception: it cannot be combined with `--sitemap` because each PageSpeed Insights call takes 15-30s.

## Static-output mode

Point the CLI at a filesystem path instead of a URL to audit built HTML directly: no network, ideal for CI on a `next export` / `dist` / `out` directory:

```bash
# Audit a whole built directory (aggregated like sitemap mode)
npx @ainyc/aeo-audit ./out

# Map files to real URLs so canonical / og:url checks are meaningful
npx @ainyc/aeo-audit ./out --base-url https://example.com

# A single built file
npx @ainyc/aeo-audit ./dist/index.html

# Gate CI on a missing meta description across the build
npx @ainyc/aeo-audit ./out --require-meta
```

A `.html`/`.htm` file produces a single-page report; a directory is walked for HTML files and aggregated like sitemap mode (`--limit`, `--top-issues`, `--factors`, `--include-geo`, `--include-agent-skills`, and `--require-meta` all apply). `index.html` maps to its directory URL (`out/about/index.html` → `<base>/about/`); other files drop the extension (`out/blog/post.html` → `<base>/blog/post`). `llms.txt`, `llms-full.txt`, `robots.txt`, and `sitemap.xml` are read from the directory root when present.

Coverage is **partial by design**: server-only signals (redirects, `X-Robots-Tag`, `Last-Modified`, `Link` headers) aren't visible from static files, so factors that depend on them score as if the header were absent. Audit the deployed URL for full coverage.

## Platform detection

Detect what platform, CMS, framework, or static site generator a website is built on. Useful for competitor research, lead qualification, and triage before an audit.

```bash
# Identify the stack (WordPress, Webflow, Shopify, Next.js, Vercel, etc.)
npx @ainyc/aeo-audit https://example.com --detect-platform

# JSON for programmatic use
npx @ainyc/aeo-audit https://example.com --detect-platform --format json

# Only show high-confidence matches
npx @ainyc/aeo-audit https://example.com --detect-platform --min-confidence high
```

The detector inspects HTML, response headers, `<meta name="generator">`, script and link sources, and platform-specific globals to fingerprint:

- **CMS:** WordPress, Drupal, Joomla, Ghost, HubSpot, Craft CMS, Sanity, Contentful, Notion
- **Site builders:** Wix, Squarespace, Webflow, Framer, Carrd, Bubble
- **E-commerce:** Shopify, WooCommerce, BigCommerce, Magento, PrestaShop
- **Frameworks:** Next.js, Nuxt, Gatsby, Remix, Astro, SvelteKit, Angular, Vue, React, Ember, Qwik
- **Static site generators:** Hugo, Jekyll, Eleventy, Hexo, Docusaurus, MkDocs
- **Hosting / CDN:** Vercel, Netlify, Cloudflare, GitHub Pages, Fastly, AWS CloudFront

Each detected platform is reported with a confidence bucket (`high`, `medium`, `low`), a numeric score, an optional version, and the list of signals that matched. When no CMS, site builder, or e-commerce platform is found, the report flags the site as `custom-built` (framework and hosting fingerprints are still surfaced for context). Exit code is `0` when at least one platform is detected, `1` otherwise.

### Batch detection

Pass `--urls` to fingerprint many sites in a single run. Pages are fetched with bounded concurrency (5 in flight by default; tune with `--concurrency`).

```bash
# From a file (one URL per line; # comments and blank lines are skipped)
npx @ainyc/aeo-audit --detect-platform --urls urls.txt

# Inline comma-separated list
npx @ainyc/aeo-audit --detect-platform --urls https://a.com,https://b.com,https://c.com

# From stdin
cat urls.txt | npx @ainyc/aeo-audit --detect-platform --urls -

# JSON for downstream processing
npx @ainyc/aeo-audit --detect-platform --urls urls.txt --format json
```

Per-URL fetch errors don't abort the batch: each entry is reported with `status: 'success'` or `status: 'error'`. Exit code is `0` when at least one URL succeeded, `1` otherwise.

## Auditing a local or private target

By default the audit refuses any URL that resolves to a private, loopback, or link-local address. That is the right default for a tool that also runs as a hosted service on arbitrary input. To audit your **own** dev or staging server, pass `--allow-local` (alias `--allow-private`):

```bash
# Audit a local dev server (pass the explicit scheme; bare hosts default to https)
npx @ainyc/aeo-audit http://localhost:3000 --allow-local

# A staging box on a private IP / VPN
npx @ainyc/aeo-audit http://10.0.5.20 --allow-private
```

The relaxation is **scoped to the single host you named on the CLI, and only that host**. It is evaluated per request hop, so a redirect or a sitemap `<loc>` pointing at any *other* private address (cloud metadata at `169.254.169.254`, internal services, …) is still blocked. There is no flag that disables the guard wholesale, and library/service callers that never set it stay fully protected.

## Auxiliary file diagnostics

When fetching `/llms.txt`, `/llms-full.txt`, `/robots.txt`, and `/sitemap.xml` the audit runs a **content-negotiation probe** that surfaces as a finding on the **AI-Readable Content** factor: if a file returns OK to a bare request but a non-2xx response under `Accept: text/markdown`, the audit reports a content-negotiation trap. This catches Astro / Vercel / Starlight setups that redirect `.txt` → non-existent `.md` for markdown-accepting clients, which makes the file invisible to AI content-extraction tools, even though the file is "present" by every other measure.

## Flag reference

| Flag | Description |
|------|-------------|
| `--format <type>` | Output format: `text` (default), `json`, `markdown` |
| `--factors <list>` | Comma-separated factor IDs to run (runs all if omitted) |
| `--include-geo` | Include the optional geographic signals factor |
| `--include-agent-skills` | Include the optional agent skill exposure factor |
| `--lighthouse` | Include the optional Lighthouse factor (Performance + Accessibility + Best Practices, mobile strategy) via Google PageSpeed Insights. Single-URL only; cannot combine with `--sitemap` or `--detect-platform`. Adds ~15-30s. Set `PAGESPEED_API_KEY` env var to lift anonymous rate limits. |
| `--sitemap [url]` | Audit all pages from the sitemap. Auto-discovery tries `/sitemap.xml`, then `/sitemap-index.xml`, then `Sitemap:` directives in `/robots.txt`. Pass an explicit URL to override. |
| `--limit <n>` | Max pages to audit in sitemap mode (default 200, sorted by sitemap priority) |
| `--top-issues` | In sitemap mode, skip per-page output and show only cross-cutting issues |
| `--detect-platform` | Identify the platform/CMS/framework powering the site instead of running an audit |
| `--urls <src>` | In `--detect-platform` mode, run on multiple URLs. `<src>` is a file path (one URL per line), a comma-separated list, or `-` for stdin |
| `--concurrency <n>` | In `--detect-platform` batch mode, max in-flight fetches (default 5) |
| `--min-confidence <lvl>` | In platform-detect mode, only report matches at or above this level: `low` (default), `medium`, `high` |
| `--require-meta` | Exit `1` if any audited page is missing `<meta name="description">`, regardless of the overall score. Works in single-URL, sitemap, and static-output modes. |
| `--allow-local` (alias `--allow-private`) | Allow the single target host you named on the CLI to resolve to a private/loopback IP (e.g. `http://localhost:3000`). Scoped to that one host only; redirects and sitemap `<loc>`s to any other private host stay blocked. |
| `--rewrite-sitemap-origin` | In `--sitemap` mode, rewrite every `<loc>`'s origin to the target URL's origin (preserving path/query) before crawling. For auditing a staging host or local dev server with a sitemap that hardcodes the prod domain. |
| `--base-url <url>` | In static-output mode, the base URL used to map files to page URLs (e.g. `out/about/index.html` → `<base>/about/`). Default `https://localhost`. |
| `-h`, `--help` | Show the help message |

## Exit codes

Exit code `0` for score ≥ 70, `1` for < 70 (CI-friendly). In sitemap and static-directory modes the exit code is based on the aggregate score. In `--detect-platform` mode the exit code is `0` if any platform is detected (or, in batch mode, if any URL succeeded), `1` otherwise. When `--require-meta` is passed, exit is forced to `1` if any audited page lacks `<meta name="description">`, regardless of the score-based rule.
Loading
Loading