A small HTTP microservice that validates and sanitizes Markdown payloads against an HTML-tag allowlist. It is meant to sit between an untrusted producer (form, CMS, API caller) and any consumer that will render Markdown as HTML, so the consumer can rely on the body being free of script-bearing or otherwise dangerous tags.
The service exposes a validation endpoint backed by sanitize-html, plus a liveness probe. It does not render Markdown to HTML; it inspects raw Markdown for embedded HTML, strips anything outside the allowlist, and tells the caller whether the input was modified.
POST /validate accepts JSON of the form { "markdown": "..." } and returns:
```json
{
  "safe": true,
  "message": "Markdown is safe",
  "sanitized": "...",
  "frontMatter": null
}
```

- `safe` is `true` only if sanitization made no changes to the body and no HTML-like content was detected in the front matter. Any disallowed tag, attribute, or URL scheme in the body, or any HTML-like token in the front matter, flips it to `false`.
- `sanitized` contains only the Markdown body, post-sanitization. It never contains the front-matter block.
- `frontMatter` is the raw YAML between the `---` markers, or `null` if no front matter was present. It is returned untouched; see the front-matter notes below.
- `message` is a human-readable summary.
The status code is 200 for any request that conforms to the published JSON Schema, and 400 when the body fails schema validation (missing markdown, wrong type, unexpected fields, etc.). 400 responses include a details array listing each per-field violation.
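The 400 rules above can be illustrated with a small hand-written check. This is only a sketch: the service itself validates with ajv against its OpenAPI schema, and the function name `validateRequestBody` and the `{ field, message }` shape of each `details` entry are assumptions made for this example, not the service's actual output format.

```javascript
// Hand-rolled stand-in for the schema check (the real service uses ajv).
// Returns an array of violations; an empty array means the body is acceptable.
function validateRequestBody(body) {
  if (body === null || typeof body !== 'object' || Array.isArray(body)) {
    return [{ field: '(body)', message: 'must be a JSON object' }];
  }

  const details = [];

  if (!Object.hasOwn(body, 'markdown')) {
    details.push({ field: 'markdown', message: 'is required' });
  } else if (typeof body.markdown !== 'string') {
    details.push({ field: 'markdown', message: 'must be a string' });
  } else if (body.markdown.length === 0) {
    details.push({ field: 'markdown', message: 'must not be empty' });
  }

  // Any key other than "markdown" counts as an unexpected field.
  for (const key of Object.keys(body)) {
    if (key !== 'markdown') {
      details.push({ field: key, message: 'unexpected field' });
    }
  }

  return details;
}
```

A handler built on this sketch would answer 400 with the array as `details` when it is non-empty, and pass the body on to the sanitizer otherwise.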
GET /health returns 200 with { "status": "ok" }. It is intended for liveness probes (Docker HEALTHCHECK, Kubernetes, load balancers) and does not exercise the sanitizer.
GET /openapi.json returns the OpenAPI 3.1 specification of this service, including request/response schemas, the documented headers, and all status codes. Point a Swagger UI / Postman / Stoplight at it to explore or generate clients.
Allowed tags: headings h1-h6, paragraphs and breaks (p, br, hr), lists (ul, ol, li, dl, dt, dd), text emphasis (strong, em, u, s, b, i, mark, sub, sup), code blocks (pre, code, kbd, samp), tables (table, thead, tbody, tr, td, th), blockquotes, images (img), and links (a).
Allowed attributes:
- `a`: `href`, `title`, `target`
- `img`: `src`, `alt`, `width`, `height`
- `code`: `class`
Allowed URL schemes for hrefs and image sources: http, https, mailto. Anything else (including javascript:, data:, vbscript:) is dropped.
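As a rough illustration of the scheme rule, the sketch below checks a URL against the three allowed schemes. It is a simplified approximation, not sanitize-html's actual algorithm. The name `hasAllowedScheme` is hypothetical, and treating scheme-less (relative) URLs as acceptable is an assumption of this example.

```javascript
const ALLOWED_SCHEMES = new Set(['http', 'https', 'mailto']);

// Returns true when the URL's scheme is on the allowlist.
// Assumption: URLs without a scheme (relative paths, fragments) are kept.
function hasAllowedScheme(url) {
  // Drop whitespace and control characters first, so that obfuscated
  // values such as "java\tscript:" are judged by their effective scheme.
  const cleaned = url.replace(/[\u0000-\u0020]+/g, '');
  const match = /^([a-zA-Z][a-zA-Z0-9+.-]*):/.exec(cleaned);
  if (!match) {
    return true; // no scheme present
  }
  return ALLOWED_SCHEMES.has(match[1].toLowerCase());
}
```

The comparison is case-insensitive, so `JavaScript:alert(1)` is rejected just like its lower-case form.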
A leading YAML block of the form ---\n...\n---\n is detected and exposed in a separate frontMatter field. The block contents are not run through sanitize-html — they are returned to the caller raw. The reason is that YAML is a data format, not a display format, and trying to sanitize it as HTML produces false positives on legitimate values.
What the service does check: if the front-matter content contains an HTML-like token (< immediately followed by a letter, !, or /), safe is set to false. That covers the realistic threat model — an attacker smuggling <script> or <iframe> past the sanitizer by hiding it in metadata. It does not catch every possible misuse, so:
If you intend to render any front-matter value as HTML, sanitize it on the consumer side. Treat `frontMatter` as untrusted input.
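The front-matter handling described above might look roughly like the following. This is a minimal sketch under stated assumptions: the helper names and both regular expressions are illustrative and may differ from what `server.js` actually does (for example, this pattern does not match an empty `---\n---\n` block).

```javascript
// A leading block of the form ---\n...\n---\n at the very start of the input.
const FRONT_MATTER = /^---\r?\n([\s\S]*?)\r?\n---\r?\n/;

// "<" immediately followed by a letter, "!" or "/".
const HTML_LIKE = /<[a-zA-Z!\/]/;

// Splits the input into the raw front matter (or null) and the Markdown body.
function splitFrontMatter(markdown) {
  const match = FRONT_MATTER.exec(markdown);
  if (!match) {
    return { frontMatter: null, body: markdown };
  }
  return { frontMatter: match[1], body: markdown.slice(match[0].length) };
}

// Coarse check only: it flags HTML-like tokens, it does not sanitize anything.
function frontMatterLooksLikeHtml(frontMatter) {
  return frontMatter !== null && HTML_LIKE.test(frontMatter);
}
```

With this pattern a value such as `title: a < b` is not flagged, because the `<` is followed by a space, while `title: <script>` is.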
```bash
npm install
npm start   # listens on http://localhost:5001
npm test    # runs the Jest suite
```

```bash
curl -s -X POST http://localhost:5001/validate \
  -H 'content-type: application/json' \
  -d '{"markdown":"# Hello\n\n<script>alert(1)</script>"}'
```

```json
{
  "safe": false,
  "message": "Markdown contains unsafe content",
  "sanitized": "# Hello\n\n",
  "frontMatter": null
}
```

```bash
docker build -t markdown-security .
docker run --rm -p 5001:5001 markdown-security
```

The image is built on node:24-alpine, runs as the unprivileged node user, and ships a HEALTHCHECK that hits /health. The bundled .dockerignore keeps .git, .env, tests and CI artefacts out of the image.
| Env var | Default | Description |
|---|---|---|
| `PORT` | `5001` | TCP port the HTTP server binds to. |
| `LOG_LEVEL` | `info` | pino log level (`trace`, `debug`, `info`, `warn`, `error`, `fatal`, `silent`). Forced to `silent` under `NODE_ENV=test`. |
| `ALLOWLIST_FILE` | unset | Path to a JSON file with a custom sanitize-html configuration. When set, replaces the built-in allowlist wholesale. See Customising the allowlist. |
| `RATE_LIMIT_RPM` | unset | Positive integer. When set, enables per-IP rate limiting on `POST /validate` at this many requests per minute. Disabled by default. See Rate limiting. |
The JSON body limit is fixed at 256kb. Markdown larger than that is rejected by Express with a 413 before reaching the handler. Adjust express.json({ limit: ... }) in server.js if you need more.
Set ALLOWLIST_FILE to a JSON file whose contents are passed straight to sanitize-html. Useful when different consumers need different policies (e.g. a strict subset for user-generated content, a relaxed superset for trusted authoring tools).
```json
{
  "allowedTags": ["p", "em", "strong", "a"],
  "allowedAttributes": { "a": ["href"] },
  "allowedSchemes": ["https"],
  "disallowedTagsMode": "discard"
}
```

The file is loaded once at startup. Malformed JSON, a missing file, or a non-array `allowedTags` causes the process to exit immediately rather than silently fall back. The default allowlist lives in `lib/allowlist.js` and is exported as `DEFAULT_ALLOWLIST` for reference.
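The fail-fast loading behaviour can be sketched as below. The function name `loadAllowlist` and the error message are assumptions for illustration; the actual loader in `lib/allowlist.js` may be structured differently. The sketch throws, and a caller at startup would be expected to log the error and exit.

```javascript
const fs = require('node:fs');

// Reads and validates a custom sanitize-html configuration.
// Throws on a missing file (ENOENT), malformed JSON (SyntaxError),
// or a configuration whose "allowedTags" is not an array.
function loadAllowlist(filePath) {
  const config = JSON.parse(fs.readFileSync(filePath, 'utf8'));
  if (config === null || typeof config !== 'object' || !Array.isArray(config.allowedTags)) {
    throw new Error(`${filePath}: "allowedTags" must be an array`);
  }
  return config;
}
```

Throwing instead of falling back to the default keeps a typo in the policy file from silently widening or narrowing what the service accepts.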
Set RATE_LIMIT_RPM to a positive integer to enable per-IP rate limiting on POST /validate. The window is 60 seconds and the limit applies only to /validate; /health and /openapi.json are always reachable so that probes and clients can introspect the service even under load. Exceeding the limit returns 429 Too Many Requests with retry-after and ratelimit-* headers (the latter as described in the IETF draft-ietf-httpapi-ratelimit-headers draft).
The limiter keys on req.ip. If the service is deployed behind a reverse proxy, configure app.set('trust proxy', ...) in server.js so the limiter sees the real client address rather than the proxy. The service ships with no trust proxy configuration to avoid header-injection in untrusted topologies.
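To make the per-IP, 60-second window concrete, here is a minimal fixed-window counter. It is a sketch only: the service relies on express-rate-limit, whose internals are not shown here, and `createFixedWindowLimiter` with its return shape is a hypothetical name and design chosen for this example.

```javascript
// Fixed-window counter keyed by an arbitrary string (for example req.ip).
// `now` is injectable so the behaviour can be exercised without waiting.
function createFixedWindowLimiter(limit, windowMs = 60_000, now = Date.now) {
  const hits = new Map(); // key -> { count, resetAt }

  return function check(key) {
    const t = now();
    let entry = hits.get(key);
    if (!entry || t >= entry.resetAt) {
      // First request from this key, or the previous window has expired.
      entry = { count: 0, resetAt: t + windowMs };
      hits.set(key, entry);
    }
    entry.count += 1;
    const allowed = entry.count <= limit;
    return {
      allowed,
      remaining: Math.max(0, limit - entry.count),
      // Seconds until the window resets; only meaningful when blocked.
      retryAfterSeconds: allowed ? 0 : Math.ceil((entry.resetAt - t) / 1000),
    };
  };
}
```

Because the key is whatever address the server sees, every client behind an unconfigured proxy would share one counter, which is why the `trust proxy` setting matters. The sketch never evicts old keys, so it is not suitable for production use as written.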
Invalid values (0, negative, non-integer) cause the process to exit at startup rather than silently disable.
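The startup validation of `RATE_LIMIT_RPM` could be sketched like this. The function name `parseRateLimitRpm` is hypothetical, and treating an empty string the same as an unset variable is an assumption of this example rather than documented behaviour.

```javascript
// Returns null when rate limiting should stay disabled, a positive integer
// when it should be enabled, and throws for anything else.
function parseRateLimitRpm(raw) {
  if (raw === undefined || raw === '') {
    return null; // unset (assumption: empty string counts as unset)
  }
  // Digits only: rejects "-5", "1.5", "abc" and similar before conversion.
  if (!/^\d+$/.test(raw)) {
    throw new Error(`RATE_LIMIT_RPM must be a positive integer, got "${raw}"`);
  }
  const value = Number(raw);
  if (!Number.isSafeInteger(value) || value <= 0) {
    throw new Error(`RATE_LIMIT_RPM must be a positive integer, got "${raw}"`);
  }
  return value;
}
```

A startup script would call this with `process.env.RATE_LIMIT_RPM` and exit with a non-zero code if it throws, so that a misconfigured limit is never mistaken for "no limit".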
Every request is logged as a single JSON line on stdout via pino-http. Each request is tagged with an id surfaced in the x-request-id response header and included in every log line. If the caller sends an x-request-id header that matches ^[a-zA-Z0-9_.-]{1,128}$, the service reuses it; otherwise a fresh UUID is generated. Use this id to correlate a client trace with the server log for a given request.
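The reuse-or-generate rule for the request id can be sketched as follows. The helper name `resolveRequestId` is an assumption; only the pattern and the UUID fallback come from the description above.

```javascript
const { randomUUID } = require('node:crypto');

// Accept caller-supplied ids only when they are short and use safe characters.
const REQUEST_ID_PATTERN = /^[a-zA-Z0-9_.-]{1,128}$/;

// Returns the caller's id when it matches the pattern, otherwise a fresh UUID.
function resolveRequestId(headerValue) {
  if (typeof headerValue === 'string' && REQUEST_ID_PATTERN.test(headerValue)) {
    return headerValue;
  }
  return randomUUID();
}
```

Restricting the accepted characters and length keeps a caller from injecting arbitrary content into log lines through the `x-request-id` header.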
- Allowlist, not denylist. New tags are blocked by default. To extend the surface, edit the `allowedTags`/`allowedAttributes` arrays in `lib/allowlist.js` and add a regression test.
- Front matter is exposed raw, not trusted. It is returned in its own `frontMatter` field, never inside `sanitized`. A coarse HTML-like check decides `safe`, but the consumer must sanitize any front-matter value it intends to render as HTML.
- `query parser` is set to `simple`. The `qs`-based `extended` parser has shipped two array-limit DoS bypasses (`GHSA-w7fw-mjwx-w883`, `GHSA-6rw7-vpxm-498p`); the simple parser is not affected. Do not change this without re-reviewing those advisories.
- Body size cap. `express.json({ limit: '256kb' })` is the first line of defence against payload-amplification attacks against `sanitize-html`.
- Rate limiting is opt-in via `RATE_LIMIT_RPM` and disabled by default. The service still expects to live behind a gateway for auth and TLS; the in-process limiter is defence-in-depth for `/validate`, not a substitute for an upstream policy layer.
- Property-based fuzzing. `tests/fuzzing.test.js` runs `fast-check` against `/validate` to exercise invariants (no dangerous tags ever leak to `sanitized`, sanitization is idempotent, front matter never appears inside `sanitized`). Hundreds of randomized payloads are run per release.
- Schema-validated boundary. `POST /validate` rejects any body that does not conform to the OpenAPI `ValidateRequest` schema (`ajv`). Extra fields, wrong types, and missing/empty values are caught before reaching the sanitizer, with structured `details` per violation.
- SBOM attached to every release. A CycloneDX 1.6 JSON Software Bill of Materials (`sbom.cdx.json`) is generated from the production lockfile and uploaded as a release asset by `.github/workflows/sbom.yml`. Run `npm run sbom` to produce one locally.

npm audit reports zero vulnerabilities at the time of writing (May 2026, against express@5, sanitize-html@2.17, pino@10, pino-http@11, ajv@8, express-rate-limit@8, jest@30, supertest@7.2, fast-check@4).
```text
server.js                   Express app + /validate, /health and /openapi.json handlers.
lib/allowlist.js            Default sanitize-html allowlist and ALLOWLIST_FILE loader.
openapi.json                OpenAPI 3.1 contract served by /openapi.json.
tests/validation.test.js    Jest + Supertest suite covering happy path and rejection cases.
tests/fuzzing.test.js       Property-based tests (fast-check) for sanitizer invariants.
tests/request-id.test.js    Coverage for the x-request-id middleware.
tests/openapi.test.js       Coverage for the OpenAPI endpoint and contract.
tests/allowlist.test.js     Unit + integration coverage for the allowlist loader.
tests/rate-limit.test.js    Coverage for the RATE_LIMIT_RPM middleware on /validate.
Dockerfile, .dockerignore   Container build.
```

See CHANGELOG.md for the version history.
MIT - see LICENSE.