You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
PR #464 (the post-pentest fix for CRIT-002: Stored XSS + DoS via task title/description) introduced schema-level HTML sanitisation. The original PR scoped it to TaskBase.title (_strip_html, plain text) and TaskBase.description (_sanitize_html, rich text).
In review, @LeeJMorel proposed a systemic approach:
Sanitizing fields most definitely impacts more than just tasks. We should probably address this on a more systemic level, given it's an app that ingests a LOT of data by its nature.
Ideally I'd like to make _strip_html / _sanitize_html into typed annotations: PlainText, RichText, RawText, etc., and then do a DB migration from str and maybe get a sanitizer library (I'm thinking nh3? RUST FOR LYFE)
…and then partly executed it:
Okay I added a sanitizer at the schema level, remove it from here and submit the CORS fix?
Current state — only half done
What landed:
✅ SanitizedBaseModel (backend/app/schemas/base.py) — model-level validator that runs nh3.clean() on every str field via model_validator(mode="before").
✅ nh3 adopted as the sanitiser.
What didn't:
❌ The typed annotations Lee specified. Only RichTextStr exists — and it's an opt-out marker, not part of a PlainText / RichText / RawText triad. Everything unmarked falls through to nh3.clean, which is the rich-text sanitiser.
❌ The DB migration from str to the typed columns.
The bug this leaves
nh3.clean is an HTML sanitiser — its job is to produce output safe to embed in HTML. It HTML-encodes ambiguous characters:
>>>nh3.clean("Session Zero & Planning")
'Session Zero & Planning'
That's correct for fields rendered via dangerouslySetInnerHTML (rich text). It's wrong for plain-text fields rendered as React text nodes, because React text nodes don't decode entities — & shows up literally on screen.
The validator skips work when the input isn't a dict (if not isinstance(data, dict): return data), so anything built via Model.model_validate(orm_instance) quietly avoids the issue. Anything built from a dict — request bodies and kwargs construction — gets sanitised.
So inbound writes (ProjectCreate, QueueCreate, DocumentCreate, etc.) HTML-encode plain-text field values and store the encoded form. Reads via the standard ORM path return whatever's in the DB unchanged.
Evidence — DB is already corrupt
Spot-checking the dev DB (queried after a fresh dev-seed.sh + normal app use today):
queues.name id=19 'Death House Encounter &'
documents.title id=18 'Sheet & Stuff'
counter_groups.name id=1 'Sample & Full'
tags.name id=121 'stuff & things'
tasks.title id=4257 '& < > ... ;{ } ^ ! @ # $ %…'
These rows weren't seeded (the seed script bypasses Pydantic by writing SQLModels directly). They're real API writes from today. Production almost certainly has similar rows for any user-supplied name that contained &, <, >, or ".
This was first noticed in the new /api/v1/recents endpoint in PR #526, where the symptom was even more obvious because RecentItemRead was being constructed with kwargs (so the validator ran a second time on already-encoded DB data, double-escaping it). #526 patched that specific endpoint with model_construct(...) to skip the redundant re-sanitisation, but the systemic issue remains.
Proposed scope
This is "finish what #464 started," not a new design. Lee already specified the shape.
Schema layer
Replace the single RichTextStr opt-out with three typed annotations on SanitizedBaseModel:
PlainTextStr — strips HTML tags, preserves benign characters (no & → & encoding). Implementation: a tag-stripping pass via nh3.clean(s, tags=set(), attributes={}) followed by html.unescape(...), or equivalent. Use this for name, title, slug, tag labels, etc.
RichTextStr — current behaviour: nh3.clean(s) with the standard allowlist. Use for fields rendered as HTML.
RawTextStr — opt-out, no sanitisation. Reserved for fields with their own validation (e.g. enums stored as strings, emoji, slugs already validated by regex). Used sparingly.
Make the default for an unmarked str either an explicit error (force the schema author to pick) or PlainTextStr (safe-but-non-encoding default). Today's "everything is rich-text-encoded unless marked" is the wrong default.
Update every existing schema field to its correct marker. Audit pass on app/schemas/.
Data migration
Alembic data migration that runs html.unescape(...) on every column now typed PlainTextStr whose stored value matches the HTML-encoded pattern. Idempotent (html.unescape("Foo & Bar") == "Foo & Bar"), so safe to run on rows that were already clean.
Scope to confirmed plain-text columns only — never touch rich-text columns, those are correctly HTML-encoded.
Tests
Round-trip test: POST a name like "Foo & <Bar>", GET it back, assert the response matches the input string exactly (no &, no entity-encoded tags).
XSS regression: POST <img src=x onerror=alert(1)> into a plain-text field, GET it back, assert tags are removed but ampersands etc. survive intact. Re-runs the <img onerror> payload that was the original CRIT-002 demonstrator.
Frontend decoding workarounds. Rejected — wrong layer, lossy round-trip if a user intentionally types &, and dangerous if accidentally applied to rich-text fields.
Removing nh3 — keep it; it's the right library for the rich-text path.
Owner / reviewer
@LeeJMorel — you wrote the original design and partial implementation; this is finishing it. Happy to write the PR if you'd like, just want sign-off on the PlainTextStr / RawTextStr design and the migration approach before I start.
Background
PR #464 (the post-pentest fix for CRIT-002: Stored XSS + DoS via task title/description) introduced schema-level HTML sanitisation. The original PR scoped it to
TaskBase.title(_strip_html, plain text) andTaskBase.description(_sanitize_html, rich text).In review, @LeeJMorel proposed a systemic approach:
…and then partly executed it:
Current state — only half done
What landed:
SanitizedBaseModel(backend/app/schemas/base.py) — model-level validator that runsnh3.clean()on everystrfield viamodel_validator(mode="before").nh3adopted as the sanitiser.What didn't:
RichTextStrexists — and it's an opt-out marker, not part of aPlainText/RichText/RawTexttriad. Everything unmarked falls through tonh3.clean, which is the rich-text sanitiser.strto the typed columns.The bug this leaves
nh3.cleanis an HTML sanitiser — its job is to produce output safe to embed in HTML. It HTML-encodes ambiguous characters:That's correct for fields rendered via
dangerouslySetInnerHTML(rich text). It's wrong for plain-text fields rendered as React text nodes, because React text nodes don't decode entities —&shows up literally on screen.The validator skips work when the input isn't a dict (
if not isinstance(data, dict): return data), so anything built viaModel.model_validate(orm_instance)quietly avoids the issue. Anything built from a dict — request bodies and kwargs construction — gets sanitised.So inbound writes (
ProjectCreate,QueueCreate,DocumentCreate, etc.) HTML-encode plain-text field values and store the encoded form. Reads via the standard ORM path return whatever's in the DB unchanged.Evidence — DB is already corrupt
Spot-checking the dev DB (queried after a fresh
dev-seed.sh+ normal app use today):These rows weren't seeded (the seed script bypasses Pydantic by writing SQLModels directly). They're real API writes from today. Production almost certainly has similar rows for any user-supplied name that contained
&,<,>, or".This was first noticed in the new
/api/v1/recentsendpoint in PR #526, where the symptom was even more obvious becauseRecentItemReadwas being constructed with kwargs (so the validator ran a second time on already-encoded DB data, double-escaping it). #526 patched that specific endpoint withmodel_construct(...)to skip the redundant re-sanitisation, but the systemic issue remains.Proposed scope
This is "finish what #464 started," not a new design. Lee already specified the shape.
Schema layer
RichTextStropt-out with three typed annotations onSanitizedBaseModel:PlainTextStr— strips HTML tags, preserves benign characters (no&→&encoding). Implementation: a tag-stripping pass vianh3.clean(s, tags=set(), attributes={})followed byhtml.unescape(...), or equivalent. Use this forname,title,slug, tag labels, etc.RichTextStr— current behaviour:nh3.clean(s)with the standard allowlist. Use for fields rendered as HTML.RawTextStr— opt-out, no sanitisation. Reserved for fields with their own validation (e.g. enums stored as strings, emoji, slugs already validated by regex). Used sparingly.streither an explicit error (force the schema author to pick) orPlainTextStr(safe-but-non-encoding default). Today's "everything is rich-text-encoded unless marked" is the wrong default.app/schemas/.Data migration
html.unescape(...)on every column now typedPlainTextStrwhose stored value matches the HTML-encoded pattern. Idempotent (html.unescape("Foo & Bar") == "Foo & Bar"), so safe to run on rows that were already clean.Tests
"Foo & <Bar>", GET it back, assert the response matches the input string exactly (no&, no entity-encoded tags).<img src=x onerror=alert(1)>into a plain-text field, GET it back, assert tags are removed but ampersands etc. survive intact. Re-runs the<img onerror>payload that was the original CRIT-002 demonstrator.Out of scope
&, and dangerous if accidentally applied to rich-text fields.nh3— keep it; it's the right library for the rich-text path.Owner / reviewer
@LeeJMorel — you wrote the original design and partial implementation; this is finishing it. Happy to write the PR if you'd like, just want sign-off on the
PlainTextStr/RawTextStrdesign and the migration approach before I start.cc @jordandrako