Finish typed text annotations from #464 (PlainTextStr / RawTextStr + DB backfill)

## Background

[PR #464](https://github.com/Morelitea/initiative/pull/464) (the post-pentest fix for CRIT-002: Stored XSS + DoS via task title/description) introduced schema-level HTML sanitisation. The original PR scoped it to `TaskBase.title` (`_strip_html`, plain text) and `TaskBase.description` (`_sanitize_html`, rich text).

In review, @LeeJMorel proposed a systemic approach:

> Sanitizing fields most definitely impacts more than just tasks. We should probably address this on a more systemic level, given it's an app that ingests a LOT of data by its nature.
>
> Ideally I'd like to make `_strip_html` / `_sanitize_html` into typed annotations: **PlainText, RichText, RawText**, etc., and then do a DB migration from `str` and maybe get a sanitizer library (I'm thinking nh3? RUST FOR LYFE)

…and then partly executed it:

> Okay I added a sanitizer at the schema level, remove it from here and submit the CORS fix?

## Current state — only half done

What landed:

- ✅ `SanitizedBaseModel` ([backend/app/schemas/base.py](https://github.com/Morelitea/initiative/blob/main/backend/app/schemas/base.py)) — model-level validator that runs `nh3.clean()` on every `str` field via `model_validator(mode="before")`.
- ✅ `nh3` adopted as the sanitiser.

What didn't:

- ❌ The typed annotations Lee specified. Only `RichTextStr` exists — and it's an **opt-out** marker, not part of a `PlainText` / `RichText` / `RawText` triad. Everything unmarked falls through to `nh3.clean`, which is the rich-text sanitiser.
- ❌ The DB migration from `str` to the typed columns.

## The bug this leaves

`nh3.clean` is an **HTML sanitiser** — its job is to produce output safe to embed in HTML. It HTML-encodes ambiguous characters:

```python
>>> nh3.clean("Session Zero & Planning")
'Session Zero &amp; Planning'
```

That's correct for fields rendered via `dangerouslySetInnerHTML` (rich text). It's **wrong for plain-text fields** rendered as React text nodes, because React text nodes don't decode entities — `&amp;` shows up literally on screen.

The validator skips work when the input isn't a dict (`if not isinstance(data, dict): return data`), so anything built via `Model.model_validate(orm_instance)` quietly avoids the issue. Anything built from a dict — request bodies **and** kwargs construction — gets sanitised.

So inbound writes (`ProjectCreate`, `QueueCreate`, `DocumentCreate`, etc.) HTML-encode plain-text field values and store the encoded form. Reads via the standard ORM path return whatever's in the DB unchanged.

## Evidence — DB is already corrupt

Spot-checking the dev DB (queried after a fresh `dev-seed.sh` + normal app use today):

```
queues.name             id=19  'Death House Encounter &amp;'
documents.title         id=18  'Sheet &amp; Stuff'
counter_groups.name     id=1   'Sample &amp; Full'
tags.name               id=121 'stuff &amp; things'
tasks.title             id=4257 '&amp; &lt; &gt; ... ;{ } ^ ! @ # $ %…'
```

These rows weren't seeded (the seed script bypasses Pydantic by writing SQLModels directly). They're real API writes from today. Production almost certainly has similar rows for any user-supplied name that contained `&`, `<`, `>`, or `"`.

This was first noticed in the new `/api/v1/recents` endpoint in [PR #526](https://github.com/Morelitea/initiative/pull/526), where the symptom was even more obvious because `RecentItemRead` was being constructed with kwargs (so the validator ran a *second* time on already-encoded DB data, double-escaping it). #526 patched that specific endpoint with `model_construct(...)` to skip the redundant re-sanitisation, but the systemic issue remains.

## Proposed scope

This is "finish what #464 started," not a new design. Lee already specified the shape.

### Schema layer

1. Replace the single `RichTextStr` opt-out with three typed annotations on `SanitizedBaseModel`:
   - `PlainTextStr` — strips HTML tags, **preserves benign characters** (no `&` → `&amp;` encoding). Implementation: a tag-stripping pass via `nh3.clean(s, tags=set(), attributes={})` followed by `html.unescape(...)`, or equivalent. Use this for `name`, `title`, `slug`, tag labels, etc.
   - `RichTextStr` — current behaviour: `nh3.clean(s)` with the standard allowlist. Use for fields rendered as HTML.
   - `RawTextStr` — opt-out, no sanitisation. Reserved for fields with their own validation (e.g. enums stored as strings, emoji, slugs already validated by regex). Used sparingly.
2. Make the default for an unmarked `str` either an explicit error (force the schema author to pick) or `PlainTextStr` (safe-but-non-encoding default). Today's "everything is rich-text-encoded unless marked" is the wrong default.
3. Update every existing schema field to its correct marker. Audit pass on `app/schemas/`.

### Data migration

1. Alembic data migration that runs `html.unescape(...)` on every column now typed `PlainTextStr` whose stored value matches the HTML-encoded pattern. Idempotent (`html.unescape("Foo & Bar") == "Foo & Bar"`), so safe to run on rows that were already clean.
2. Scope to confirmed plain-text columns only — never touch rich-text columns, those are correctly HTML-encoded.

### Tests

1. Round-trip test: POST a name like `"Foo & <Bar>"`, GET it back, assert the response matches the input string exactly (no `&amp;`, no entity-encoded tags).
2. XSS regression: POST `<img src=x onerror=alert(1)>` into a plain-text field, GET it back, assert tags are removed but ampersands etc. survive intact. Re-runs the `<img onerror>` payload that was the original CRIT-002 demonstrator.
3. Rich-text path unchanged — keep #464's allowlist tests.

## Out of scope

- Frontend decoding workarounds. Rejected — wrong layer, lossy round-trip if a user intentionally types `&amp;`, and dangerous if accidentally applied to rich-text fields.
- Removing `nh3` — keep it; it's the right library for the rich-text path.

## Owner / reviewer

@LeeJMorel — you wrote the original design and partial implementation; this is finishing it. Happy to write the PR if you'd like, just want sign-off on the `PlainTextStr` / `RawTextStr` design and the migration approach before I start.

cc @jordandrako


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Finish typed text annotations from #464 (PlainTextStr / RawTextStr + DB backfill) #527

Background

Current state — only half done

The bug this leaves

Evidence — DB is already corrupt

Proposed scope

Schema layer

Data migration

Tests

Out of scope

Owner / reviewer

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Finish typed text annotations from #464 (PlainTextStr / RawTextStr + DB backfill) #527

Description

Background

Current state — only half done

The bug this leaves

Evidence — DB is already corrupt

Proposed scope

Schema layer

Data migration

Tests

Out of scope

Owner / reviewer

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions