SINENSIA/markdown-security

markdown-security

A small HTTP microservice that validates and sanitizes Markdown payloads against an HTML-tag allowlist. It is meant to sit between an untrusted producer (form, CMS, API caller) and any consumer that will render Markdown as HTML, so the consumer can rely on the body being free of script-bearing or otherwise dangerous tags.

The service exposes a validation endpoint backed by sanitize-html, plus a liveness probe. It does not render Markdown to HTML; it inspects raw Markdown for embedded HTML, strips anything outside the allowlist, and tells the caller whether the input was modified.

How it works

POST /validate accepts JSON of the form { "markdown": "..." } and returns:

{
  "safe": true,
  "message": "Markdown is safe",
  "sanitized": "...",
  "frontMatter": null
}
  • safe is true only if sanitization made no changes to the body and no HTML-like content was detected in the front matter. Any disallowed tag, attribute, or URL scheme in the body, or any HTML-like token in the front matter, flips it to false.
  • sanitized contains only the Markdown body, post-sanitization. It never contains the front-matter block.
  • frontMatter is the raw YAML between the --- markers, or null if no front matter was present. It is returned untouched — see the front-matter section below.
  • message is a human-readable summary.

The status code is 200 for any request that conforms to the published JSON Schema, and 400 when the body fails schema validation (missing markdown, wrong type, unexpected fields, etc.). 400 responses include a details array listing each per-field violation.
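The response contract above can be consumed from Node.js roughly as follows. This is a minimal sketch, not part of the service: the helper names interpretValidation and validateMarkdown, the baseUrl parameter, and the shape of the returned object are assumptions made for illustration. It relies on the global fetch available in recent Node.js versions.

```javascript
// Hypothetical client helpers; names and return shape are illustrative only.

// Map an HTTP status and parsed JSON body to a small result object.
function interpretValidation(status, body) {
  if (status === 400) {
    // Schema violation: the service lists per-field problems in `details`.
    return { ok: false, reason: 'invalid-request', details: body.details ?? [] };
  }
  if (status !== 200) {
    // For example 413 (body too large) or 429 (rate limited).
    return { ok: false, reason: `unexpected-status-${status}`, details: [] };
  }
  return {
    ok: true,
    safe: body.safe,
    markdown: body.sanitized, // body only, never the front matter
    frontMatter: body.frontMatter, // raw YAML or null; treat as untrusted
  };
}

// POST the Markdown to /validate and interpret the answer.
async function validateMarkdown(baseUrl, markdown) {
  const res = await fetch(new URL('/validate', baseUrl), {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ markdown }),
  });
  // Non-JSON error bodies fall back to an empty object.
  const body = await res.json().catch(() => ({}));
  return interpretValidation(res.status, body);
}
```

A caller would typically render result.markdown only when result.ok is true, and decide separately whether safe === false should reject the submission or merely log it.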

GET /health returns 200 with { "status": "ok" }. It is intended for liveness probes (Docker HEALTHCHECK, Kubernetes, load balancers) and does not exercise the sanitizer.

GET /openapi.json returns the OpenAPI 3.1 specification of this service, including request/response schemas, the documented headers, and all status codes. Point a Swagger UI / Postman / Stoplight at it to explore or generate clients.

Allowlist

Allowed tags: headings h1-h6, paragraphs and breaks (p, br, hr), lists (ul, ol, li, dl, dt, dd), text emphasis (strong, em, u, s, b, i, mark, sub, sup), code blocks (pre, code, kbd, samp), tables (table, thead, tbody, tr, td, th), blockquotes (blockquote), images (img), and links (a).

Allowed attributes:

  • a: href, title, target
  • img: src, alt, width, height
  • code: class

Allowed URL schemes for hrefs and image sources: http, https, mailto. Anything else (including javascript:, data:, vbscript:) is dropped.
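Expressed as a sanitize-html options object, the allowlist described above might look like the sketch below. It is reconstructed from this section only; the constant name sketchAllowlist is hypothetical, and the actual DEFAULT_ALLOWLIST exported from lib/allowlist.js may set further options or differ in detail.

```javascript
// Sketch reconstructed from the README text, not copied from lib/allowlist.js.
const sketchAllowlist = {
  allowedTags: [
    'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
    'p', 'br', 'hr',
    'ul', 'ol', 'li', 'dl', 'dt', 'dd',
    'strong', 'em', 'u', 's', 'b', 'i', 'mark', 'sub', 'sup',
    'pre', 'code', 'kbd', 'samp',
    'table', 'thead', 'tbody', 'tr', 'td', 'th',
    'blockquote', 'img', 'a',
  ],
  allowedAttributes: {
    a: ['href', 'title', 'target'],
    img: ['src', 'alt', 'width', 'height'],
    code: ['class'],
  },
  // Anything else (javascript:, data:, vbscript:, ...) is dropped.
  allowedSchemes: ['http', 'https', 'mailto'],
};
```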

YAML front matter

A leading YAML block of the form ---\n...\n---\n is detected and exposed in a separate frontMatter field. The block contents are not run through sanitize-html — they are returned to the caller raw. The reason is that YAML is a data format, not a display format, and trying to sanitize it as HTML produces false positives on legitimate values.

What the service does check: if the front-matter content contains an HTML-like token (< immediately followed by a letter, !, or /), safe is set to false. That covers the realistic threat model — an attacker smuggling <script> or <iframe> past the sanitizer by hiding it in metadata. It does not catch every possible misuse, so:

If you intend to render any front-matter value as HTML, sanitize it on the consumer side. Treat frontMatter as untrusted input.
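The front-matter handling described in this section can be approximated as below. This is a sketch under assumptions: the regular expressions and the function names splitFrontMatter and frontMatterLooksLikeHtml are illustrative, and the service's own detection may handle edge cases (empty blocks, trailing whitespace) differently.

```javascript
// Hypothetical helpers; the service's real implementation may differ.

// A leading block of the form ---\n...\n---\n (CRLF tolerated).
const FRONT_MATTER_RE = /^---\r?\n([\s\S]*?)\r?\n---(?:\r?\n|$)/;

// "<" immediately followed by a letter, "!" or "/".
const HTML_LIKE_RE = /<[A-Za-z!\/]/;

// Split raw input into the YAML block (or null) and the Markdown body.
function splitFrontMatter(markdown) {
  const match = FRONT_MATTER_RE.exec(markdown);
  if (!match) return { frontMatter: null, body: markdown };
  return { frontMatter: match[1], body: markdown.slice(match[0].length) };
}

// Coarse check: true when the front matter contains an HTML-like token.
function frontMatterLooksLikeHtml(frontMatter) {
  return frontMatter !== null && HTML_LIKE_RE.test(frontMatter);
}
```

With this check, a value such as `note: 1 < 2` passes because the `<` is followed by a space, while `title: <script>` is flagged. That matches the trade-off stated above: few false positives on ordinary YAML, at the cost of not being a full sanitizer.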

Quickstart

npm install
npm start              # listens on http://localhost:5001
npm test               # runs the Jest suite

curl -s -X POST http://localhost:5001/validate \
  -H 'content-type: application/json' \
  -d '{"markdown":"# Hello\n\n<script>alert(1)</script>"}'

{
  "safe": false,
  "message": "Markdown contains unsafe content",
  "sanitized": "# Hello\n\n",
  "frontMatter": null
}

Docker

docker build -t markdown-security .
docker run --rm -p 5001:5001 markdown-security

The image is built on node:24-alpine, runs as the unprivileged node user, and ships a HEALTHCHECK that hits /health. The bundled .dockerignore keeps .git, .env, tests and CI artefacts out of the image.

Configuration

Env var         Default  Description
PORT            5001     TCP port the HTTP server binds to.
LOG_LEVEL       info     pino log level (trace, debug, info, warn, error, fatal, silent). Forced to silent under NODE_ENV=test.
ALLOWLIST_FILE  unset    Path to a JSON file with a custom sanitize-html configuration. When set, replaces the built-in allowlist wholesale. See Customising the allowlist.
RATE_LIMIT_RPM  unset    Positive integer. When set, enables per-IP rate limiting on POST /validate at this many requests per minute. Disabled by default. See Rate limiting.

The JSON body limit is fixed at 256kb. Markdown larger than that is rejected by Express with a 413 before reaching the handler. Adjust express.json({ limit: ... }) in server.js if you need more.
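A caller can avoid a round trip that would end in 413 by checking the encoded size first. The sketch below assumes the 256kb limit is interpreted as 256 * 1024 bytes of request body, which is how Express parses that string; the names BODY_LIMIT_BYTES and fitsBodyLimit are hypothetical.

```javascript
// Assumption: '256kb' means 256 * 1024 bytes of JSON request body.
const BODY_LIMIT_BYTES = 256 * 1024;

// True when the JSON envelope for this Markdown fits within the limit.
// The size is measured on the serialized payload, not the raw Markdown,
// because JSON escaping and the {"markdown": ...} wrapper add bytes.
function fitsBodyLimit(markdown) {
  const payload = JSON.stringify({ markdown });
  return Buffer.byteLength(payload, 'utf8') <= BODY_LIMIT_BYTES;
}
```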

Customising the allowlist

Set ALLOWLIST_FILE to a JSON file whose contents are passed straight to sanitize-html. Useful when different consumers need different policies (e.g. a strict subset for user-generated content, a relaxed superset for trusted authoring tools).

{
  "allowedTags": ["p", "em", "strong", "a"],
  "allowedAttributes": { "a": ["href"] },
  "allowedSchemes": ["https"],
  "disallowedTagsMode": "discard"
}

The file is loaded once at startup. Malformed JSON, a missing file, or a non-array allowedTags causes the process to exit immediately rather than silently fall back. The default allowlist lives in lib/allowlist.js and is exported as DEFAULT_ALLOWLIST for reference.
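The fail-fast loading behaviour described above could be sketched as follows. The function name loadAllowlist and the error messages are assumptions; the real loader in lib/allowlist.js may validate more fields. This version throws, leaving it to the startup code to log the error and exit.

```javascript
const fs = require('node:fs');

// Hypothetical loader: read, parse and minimally validate a custom allowlist.
function loadAllowlist(filePath) {
  let parsed;
  try {
    // A missing file and malformed JSON both end up in the catch branch.
    parsed = JSON.parse(fs.readFileSync(filePath, 'utf8'));
  } catch (err) {
    throw new Error(`Cannot load allowlist from ${filePath}: ${err.message}`);
  }
  if (parsed === null || typeof parsed !== 'object' || !Array.isArray(parsed.allowedTags)) {
    throw new Error(`Allowlist in ${filePath} must define allowedTags as an array`);
  }
  return parsed;
}
```

At startup, a caller might wrap loadAllowlist(process.env.ALLOWLIST_FILE) in a try/catch and call process.exit(1) on failure, so that a broken policy file never results in a silently different allowlist.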

Rate limiting

Set RATE_LIMIT_RPM to a positive integer to enable per-IP rate limiting on POST /validate. The window is 60 seconds and the limit applies only to /validate; /health and /openapi.json are always reachable so that probes and clients can introspect the service even under load. Exceeding the limit returns 429 Too Many Requests with retry-after and ratelimit-* headers (the latter follow the IETF draft-ietf-httpapi-ratelimit-headers specification).
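On the client side, a 429 can be handled by honouring the retry-after header before resending. The helper below is a sketch with a hypothetical name, retryDelayMs; it accepts both forms that HTTP permits for Retry-After (a number of seconds or an HTTP date) and makes no assumption about which form this service emits.

```javascript
// Hypothetical helper: convert a Retry-After header value to milliseconds.
// Returns null when the header is absent or cannot be parsed.
function retryDelayMs(retryAfter, now = Date.now()) {
  if (retryAfter == null) return null;
  const value = String(retryAfter).trim();
  // Form 1: delay in whole seconds.
  if (/^[0-9]+$/.test(value)) return Number(value) * 1000;
  // Form 2: an HTTP date; wait until that instant, never a negative time.
  const date = Date.parse(value);
  if (Number.isNaN(date)) return null;
  return Math.max(0, date - now);
}
```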

The limiter keys on req.ip. If the service is deployed behind a reverse proxy, configure app.set('trust proxy', ...) in server.js so the limiter sees the real client address rather than the proxy. The service ships with no trust proxy configuration to avoid header-injection in untrusted topologies.
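For a deployment with a known number of reverse proxies in front of the service, the trust proxy setting could be applied as sketched here. The wrapper name configureTrustProxy and its trustedHops parameter are hypothetical; only app.set('trust proxy', n) is Express API, where a numeric value tells Express how many proxy hops to trust when deriving req.ip.

```javascript
// Hypothetical wrapper around Express's 'trust proxy' setting.
// trustedHops is the number of reverse proxies between client and service.
function configureTrustProxy(app, trustedHops) {
  if (!Number.isInteger(trustedHops) || trustedHops < 1) {
    throw new RangeError('trustedHops must be a positive integer');
  }
  // With 1, Express takes req.ip from the address added by the nearest proxy.
  app.set('trust proxy', trustedHops);
  return app;
}
```

Setting the hop count higher than the real topology lets a client forge its address through X-Forwarded-For, so the value should match the deployment exactly.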

Invalid values (0, negative, non-integer) cause the process to exit at startup rather than silently disable.
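The startup validation described above might be implemented along these lines. The function name parseRateLimitRpm is an assumption, as is the choice to reject an empty string rather than treat it as unset; the source only states that 0, negative, and non-integer values are fatal.

```javascript
// Hypothetical parser for the RATE_LIMIT_RPM environment variable.
// Returns null when unset (limiter disabled), a positive integer when valid,
// and throws otherwise so startup code can exit instead of silently disabling.
function parseRateLimitRpm(raw) {
  if (raw === undefined) return null;
  // Digits only: rejects '', '-3', '1.5', '10rpm' and similar.
  if (!/^[0-9]+$/.test(raw)) {
    throw new Error(`RATE_LIMIT_RPM must be a positive integer, got "${raw}"`);
  }
  const value = Number(raw);
  if (!Number.isSafeInteger(value) || value <= 0) {
    throw new Error(`RATE_LIMIT_RPM must be a positive integer, got "${raw}"`);
  }
  return value;
}
```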

Logging and request correlation

Every request is logged as a single JSON line on stdout via pino-http. Each request is tagged with an id surfaced in the x-request-id response header and included in every log line. If the caller sends an x-request-id header that matches ^[a-zA-Z0-9_.-]{1,128}$, the service reuses it; otherwise a fresh UUID is generated. Use this id to correlate a client trace with the server log for a given request.
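The reuse-or-generate rule for request ids can be sketched as a single function. The pattern is the one quoted above; the function name resolveRequestId is hypothetical, and the sketch assumes the generated id comes from Node's crypto.randomUUID, which the source does not specify.

```javascript
const { randomUUID } = require('node:crypto');

// Pattern quoted in the README for acceptable caller-supplied ids.
const REQUEST_ID_RE = /^[a-zA-Z0-9_.-]{1,128}$/;

// Reuse a well-formed caller id, otherwise mint a fresh UUID.
// Rejecting anything outside the pattern keeps header content out of the logs.
function resolveRequestId(headerValue) {
  if (typeof headerValue === 'string' && REQUEST_ID_RE.test(headerValue)) {
    return headerValue;
  }
  return randomUUID();
}
```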

Security notes

  • Allowlist, not denylist. New tags are blocked by default. To extend the surface, edit the allowedTags / allowedAttributes entries of the default allowlist in lib/allowlist.js (or supply a custom file via ALLOWLIST_FILE) and add a regression test.
  • Front matter is exposed raw, not trusted. It is returned in its own frontMatter field, never inside sanitized. A coarse HTML-like check decides safe, but the consumer must sanitize any front-matter value it intends to render as HTML.
  • query parser is set to simple. Express's default qs-based parser has shipped two array-limit DoS bypasses (GHSA-w7fw-mjwx-w883, GHSA-6rw7-vpxm-498p); the simple parser is not affected. Do not change this without re-reviewing those advisories.
  • Body size cap. express.json({ limit: '256kb' }) is the first line of defence against payload-amplification attacks against sanitize-html.
  • Rate limiting is opt-in via RATE_LIMIT_RPM and disabled by default. The service still expects to live behind a gateway for auth and TLS; the in-process limiter is defence-in-depth for /validate, not a substitute for an upstream policy layer.
  • Property-based fuzzing. tests/fuzzing.test.js runs fast-check against /validate to exercise invariants (no dangerous tags ever leak to sanitized, sanitization is idempotent, front matter never appears inside sanitized). Each release is checked against hundreds of randomized payloads.
  • Schema-validated boundary. POST /validate rejects any body that does not conform to the OpenAPI ValidateRequest schema (ajv). Extra fields, wrong types, and missing/empty values are caught before reaching the sanitizer, with structured details per violation.
  • SBOM attached to every release. A CycloneDX 1.6 JSON Software Bill of Materials (sbom.cdx.json) is generated from the production lockfile and uploaded as a release asset by .github/workflows/sbom.yml. Run npm run sbom to produce one locally.

npm audit reports zero vulnerabilities at the time of writing (May 2026, against express@5, sanitize-html@2.17, pino@10, pino-http@11, ajv@8, express-rate-limit@8, jest@30, supertest@7.2, fast-check@4).

Project layout

server.js                 Express app + /validate, /health and /openapi.json handlers.
lib/allowlist.js          Default sanitize-html allowlist and ALLOWLIST_FILE loader.
openapi.json              OpenAPI 3.1 contract served by /openapi.json.
tests/validation.test.js  Jest + Supertest suite covering happy path and rejection cases.
tests/fuzzing.test.js     Property-based tests (fast-check) for sanitizer invariants.
tests/request-id.test.js  Coverage for the x-request-id middleware.
tests/openapi.test.js     Coverage for the OpenAPI endpoint and contract.
tests/allowlist.test.js   Unit + integration coverage for the allowlist loader.
tests/rate-limit.test.js  Coverage for the RATE_LIMIT_RPM middleware on /validate.
Dockerfile, .dockerignore Container build.

Changelog

See CHANGELOG.md for the version history.

License

MIT - see LICENSE.
