Finish typed text annotations from #464 (PlainTextStr / RawTextStr + DB backfill) #527

@jordandrako

Background

PR #464 (the post-pentest fix for CRIT-002: Stored XSS + DoS via task title/description) introduced schema-level HTML sanitisation. The original PR scoped it to TaskBase.title (_strip_html, plain text) and TaskBase.description (_sanitize_html, rich text).

In review, @LeeJMorel proposed a systemic approach:

Sanitizing fields most definitely impacts more than just tasks. We should probably address this on a more systemic level, given it's an app that ingests a LOT of data by its nature.

Ideally I'd like to make _strip_html / _sanitize_html into typed annotations: PlainText, RichText, RawText, etc., and then do a DB migration from str and maybe get a sanitizer library (I'm thinking nh3? RUST FOR LYFE)

…and then partly executed it:

Okay I added a sanitizer at the schema level, remove it from here and submit the CORS fix?

Current state — only half done

What landed:

  • SanitizedBaseModel (backend/app/schemas/base.py) — model-level validator that runs nh3.clean() on every str field via model_validator(mode="before").
  • nh3 adopted as the sanitiser.

What didn't:

  • ❌ The typed annotations Lee specified. Only RichTextStr exists — and it's an opt-out marker, not part of a PlainText / RichText / RawText triad. Everything unmarked falls through to nh3.clean, which is the rich-text sanitiser.
  • ❌ The DB migration from str to the typed columns.

The bug this leaves

nh3.clean is an HTML sanitiser — its job is to produce output safe to embed in HTML. It HTML-encodes ambiguous characters:

>>> nh3.clean("Session Zero & Planning")
'Session Zero &amp; Planning'

That's correct for fields rendered via dangerouslySetInnerHTML (rich text). It's wrong for plain-text fields rendered as React text nodes, because React text nodes don't decode entities, so &amp; shows up literally on screen.

The validator skips work when the input isn't a dict (if not isinstance(data, dict): return data), so anything built via Model.model_validate(orm_instance) quietly avoids the issue. Anything built from a dict — request bodies and kwargs construction — gets sanitised.

So inbound writes (ProjectCreate, QueueCreate, DocumentCreate, etc.) HTML-encode plain-text field values and store the encoded form. Reads via the standard ORM path return whatever's in the DB unchanged.
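
The asymmetry between the write path and the read path can be reproduced in isolation. What follows is a minimal stdlib-only sketch, not the project's validator: encode_text uses html.escape as a stand-in for what nh3.clean does to tag-free text, and before_validator is a hypothetical name that only mimics the isinstance check quoted above.

```python
import html
from types import SimpleNamespace


def encode_text(value: str) -> str:
    # Stand-in (assumption) for nh3.clean on tag-free input: encodes &, < and >.
    return html.escape(value, quote=False)


def before_validator(data):
    """Mimic the described mode="before" hook: only dict input is touched."""
    if not isinstance(data, dict):
        return data  # ORM instances pass through unchanged
    return {
        key: encode_text(value) if isinstance(value, str) else value
        for key, value in data.items()
    }


request_body = {"name": "Sheet & Stuff"}          # inbound write (dict)
orm_row = SimpleNamespace(name="Sheet & Stuff")   # read path (non-dict object)

written = before_validator(request_body)
read_back = before_validator(orm_row)

print(written["name"])   # Sheet &amp; Stuff
print(read_back.name)    # Sheet & Stuff
```

The dict-shaped request body comes out encoded and would be stored that way, while the object-shaped read is returned untouched, so nothing on the read side ever undoes the encoding.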

Evidence — DB is already corrupt

Spot-checking the dev DB (queried after a fresh dev-seed.sh + normal app use today):

queues.name             id=19  'Death House Encounter &amp;'
documents.title         id=18  'Sheet &amp; Stuff'
counter_groups.name     id=1   'Sample &amp; Full'
tags.name               id=121 'stuff &amp; things'
tasks.title             id=4257 '&amp; &lt; &gt; ... ;{ } ^ ! @ # $ %…'

These rows weren't seeded (the seed script bypasses Pydantic by writing SQLModels directly). They're real API writes from today. Production almost certainly has similar rows for any user-supplied name that contained &, <, >, or ".

This was first noticed in the new /api/v1/recents endpoint in PR #526, where the symptom was even more obvious because RecentItemRead was being constructed with kwargs (so the validator ran a second time on already-encoded DB data, double-escaping it). #526 patched that specific endpoint with model_construct(...) to skip the redundant re-sanitisation, but the systemic issue remains.

Proposed scope

This is "finish what #464 started," not a new design. Lee already specified the shape.

Schema layer

  1. Replace the single RichTextStr opt-out with three typed annotations on SanitizedBaseModel:
    • PlainTextStr: strips HTML tags and preserves benign characters (no & → &amp; encoding). Implementation: a tag-stripping pass via nh3.clean(s, tags=set(), attributes={}) followed by html.unescape(...), or equivalent. Use this for name, title, slug, tag labels, etc.
    • RichTextStr: current behaviour, nh3.clean(s) with the standard allowlist. Use for fields rendered as HTML.
    • RawTextStr: opt-out, no sanitisation. Reserved for fields with their own validation (e.g. enums stored as strings, emoji, slugs already validated by regex). Use sparingly.
  2. Make the default for an unmarked str either an explicit error (force the schema author to pick) or PlainTextStr (safe-but-non-encoding default). Today's "everything is rich-text-encoded unless marked" is the wrong default.
  3. Update every existing schema field to its correct marker. Audit pass on app/schemas/.
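
The marker-per-field idea, including the "unmarked str is an error" option, can be sketched without any third-party imports. This is an illustration under stated assumptions, not the proposed implementation: TextKind, sanitise_payload, ExampleCreate and Unmarked are hypothetical names, the real dispatch would live in SanitizedBaseModel's validator, and the stdlib HTMLParser below only stands in for the nh3.clean(s, tags=set(), attributes={}) + html.unescape(...) pass (it differs from nh3 in details such as script content handling). RichTextStr is left out because it needs nh3 itself.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Annotated, Callable, get_type_hints


class _TextCollector(HTMLParser):
    """Keeps text nodes, drops every tag, decodes entities (stdlib stand-in)."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_to_plain_text(value: str) -> str:
    # Real code would be: html.unescape(nh3.clean(value, tags=set(), attributes={}))
    collector = _TextCollector()
    collector.feed(value)
    collector.close()
    return "".join(collector.parts)


@dataclass(frozen=True)
class TextKind:
    """Marker carried in Annotated[...] metadata (hypothetical helper)."""
    label: str
    sanitise: Callable[[str], str]


PlainTextStr = Annotated[str, TextKind("plain", strip_to_plain_text)]
RawTextStr = Annotated[str, TextKind("raw", lambda value: value)]


def sanitise_payload(schema: type, data: dict) -> dict:
    """Apply each field's marker to str values; reject unmarked str fields."""
    hints = get_type_hints(schema, include_extras=True)
    cleaned = dict(data)
    for field, hint in hints.items():
        value = data.get(field)
        if not isinstance(value, str):
            continue
        kinds = [m for m in getattr(hint, "__metadata__", ()) if isinstance(m, TextKind)]
        if not kinds:
            raise TypeError(f"{schema.__name__}.{field} is a str with no text marker")
        cleaned[field] = kinds[0].sanitise(value)
    return cleaned


class ExampleCreate:  # hypothetical schema, not one from app/schemas/
    name: PlainTextStr
    status: RawTextStr


class Unmarked:  # hypothetical schema where the author forgot a marker
    title: str


cleaned = sanitise_payload(
    ExampleCreate, {"name": "<b>Sheet</b> & Stuff", "status": "in_progress"}
)
print(cleaned)  # {'name': 'Sheet & Stuff', 'status': 'in_progress'}
```

Calling sanitise_payload(Unmarked, {"title": "x"}) raises TypeError, which is the "force the schema author to pick" behaviour from item 2. Defaulting to PlainTextStr instead would mean replacing that raise with a call to strip_to_plain_text.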

Data migration

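  1. Alembic data migration that runs html.unescape(...) on every column now typed PlainTextStr whose stored value matches the HTML-encoded pattern. It is a no-op on values that contain no entities (html.unescape("Foo & Bar") == "Foo & Bar"), so it is safe to run on rows that were already clean. It is not strictly idempotent, though: each pass decodes one layer, so a value that legitimately contains entity text (a user who typed a literal &amp; has it stored as &amp;amp;) would be over-decoded by a second pass. The migration should therefore touch each row exactly once.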
  2. Scope to confirmed plain-text columns only — never touch rich-text columns, those are correctly HTML-encoded.
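
The backfill logic could look roughly like the sketch below. It is written against an in-memory sqlite3 database so it runs standalone. In the real migration the same loop would presumably sit inside an Alembic upgrade() and use the migration's connection, which is an assumption here. PLAIN_TEXT_COLUMNS and decode_column are hypothetical names, and the two tables are cut-down stand-ins for the real schema.

```python
import html
import sqlite3

# Hypothetical allowlist: only columns confirmed to be plain text belong here.
PLAIN_TEXT_COLUMNS = [("tags", "name"), ("documents", "title")]


def decode_column(conn: sqlite3.Connection, table: str, column: str) -> int:
    """Decode one layer of HTML entities in one column; return rows changed."""
    # Identifiers come from the hard-coded allowlist above, never from user input.
    rows = conn.execute(
        f"SELECT id, {column} FROM {table} WHERE {column} LIKE '%&%;%'"
    ).fetchall()
    changed = 0
    for row_id, value in rows:
        decoded = html.unescape(value)
        if decoded != value:
            conn.execute(
                f"UPDATE {table} SET {column} = ? WHERE id = ?", (decoded, row_id)
            )
            changed += 1
    return changed


# Demo data standing in for rows written through the API and by the seed script.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO tags (id, name) VALUES (?, ?)",
    [(1, "stuff &amp; things"), (2, "already & clean")],
)
conn.execute("INSERT INTO documents (id, title) VALUES (1, 'Sheet &amp; Stuff')")

first_pass = sum(decode_column(conn, t, c) for t, c in PLAIN_TEXT_COLUMNS)
second_pass = sum(decode_column(conn, t, c) for t, c in PLAIN_TEXT_COLUMNS)
names = [row[0] for row in conn.execute("SELECT name FROM tags ORDER BY id")]

print(first_pass, second_pass)  # 2 0
print(names)                    # ['stuff & things', 'already & clean']
```

The LIKE filter narrows the scan to values that look entity-encoded, and the decoded != value check keeps the update count honest. Rich-text columns never appear in the allowlist, so they are never read or rewritten.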

Tests

  1. Round-trip test: POST a name like "Foo & <Bar>", GET it back, assert the response matches the input string exactly (no &amp;, no entity-encoded tags).
  2. XSS regression: POST <img src=x onerror=alert(1)> into a plain-text field, GET it back, assert tags are removed but ampersands etc. survive intact. Re-runs the <img onerror> payload that was the original CRIT-002 demonstrator.
  3. Rich-text path unchanged: keep the allowlist tests from #464 (security: fix CORS wildcard, stored XSS, CSP header, and validation error disclosure).

Out of scope

  • Frontend decoding workarounds. Rejected — wrong layer, lossy round-trip if a user intentionally types &amp;, and dangerous if accidentally applied to rich-text fields.
  • Removing nh3 — keep it; it's the right library for the rich-text path.

Owner / reviewer

@LeeJMorel — you wrote the original design and partial implementation; this is finishing it. Happy to write the PR if you'd like, just want sign-off on the PlainTextStr / RawTextStr design and the migration approach before I start.

cc @jordandrako
