
Paste C / interactive SAS verification: matrix-nio is too thin; plan to switch to mautrix-python #1

@amiller


TL;DR

Our three-tier agent onboarding is Paste A → Paste B → Paste C, with a visible Element effect at each tier. Pastes A and B are shipped and verified end-to-end. Paste C (SAS / interactive verification so Element shows the bot as "verified by you") is not shipping because the Python SDK we standardized on — matrix-nio — does not have reliable high-level SAS support, and hand-rolling the state machine on top of its primitives did not produce a stable flow.

Plan: switch the responder to mautrix-python for the verification layer. mautrix-python is the Python SDK of the mautrix project, whose bridges (Telegram, iMessage, Signal, WhatsApp, etc.) are widely deployed, several of them now on its Go sibling, mautrix-go. It drives real SAS flows in production and is already what hermes-agent's gateway uses internally. This is not a fork and not a rewrite of our stack: it's using the right tool for this layer.

This issue documents what we tried, why it didn't work, and the recommended next steps.


Context

The three-tier onboarding in the signup page (/signup) is designed so an agent — or a new human — progresses through visible Element states:

| Paste | End state in Element | Status |
| --- | --- | --- |
| A | Bot alive, E2EE DM decrypts; Element shows a yellow "encrypted by a device not verified by its owner" warning (expected intermediate state) | ✅ shipped |
| B | MSK/SSK/USK published; the warning drops to "not verified by you" | ✅ shipped (POST /signup/api/crosssign) |
| C | User clicks Verify → SAS emoji flow → green shield | 🚧 blocked on SDK |

Paste A uses matrix-nio[e2e] (libolm-backed crypto) and works even on weak open models (verified with gpt-oss-120b via OpenRouter): end-to-end onboarding, an encrypted DM, and a server-assigned event_id returned as un-fakeable proof of success.

Paste B is a server-side helper (knock-approver/approver.py). It generates three Ed25519 keypairs for the new user (the master, self-signing, and user-signing keys: MSK, SSK, USK), signs SSK and USK with MSK, signs the user's device with SSK, and uploads the results via /_matrix/client/v3/keys/device_signing/upload and /_matrix/client/v3/keys/signatures/upload. Verified with /keys/query, which shows master_keys, self_signing_keys, user_signing_keys, and the device signed by SSK.
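The Paste B upload can be sketched as follows. This is not approver.py's code: it is a stdlib-only sketch of the two request bodies as the Matrix spec describes them (canonical JSON, unpadded base64, signatures computed over the object minus `signatures` and `unsigned`). The signer callables and function names are hypothetical stand-ins for the three Ed25519 private keys, and the real device_signing/upload call also needs user-interactive auth, which is not shown.

```python
import base64
import json
from typing import Any, Callable, Dict

# Hypothetical: wraps an Ed25519 private key, returns a 64-byte signature.
Signer = Callable[[bytes], bytes]
Json = Dict[str, Any]


def unpadded_b64(data: bytes) -> str:
    # Matrix encodes keys and signatures as standard base64 without padding.
    return base64.b64encode(data).decode("ascii").rstrip("=")


def canonical_json(obj: Json) -> bytes:
    # Canonical JSON: sorted keys, no insignificant whitespace, UTF-8.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def sign_json(obj: Json, user_id: str, key_id: str, signer: Signer) -> Json:
    # Sign everything except `signatures` and `unsigned`, then merge the result in.
    payload = {k: v for k, v in obj.items() if k not in ("signatures", "unsigned")}
    sig = unpadded_b64(signer(canonical_json(payload)))
    signatures = {u: dict(s) for u, s in obj.get("signatures", {}).items()}
    signatures.setdefault(user_id, {})[key_id] = sig
    return {**obj, "signatures": signatures}


def cross_signing_key(user_id: str, usage: str, public_key: bytes) -> Json:
    pub = unpadded_b64(public_key)
    return {"user_id": user_id, "usage": [usage], "keys": {f"ed25519:{pub}": pub}}


def device_signing_upload_body(user_id: str, msk_pub: bytes, ssk_pub: bytes,
                               usk_pub: bytes, msk_signer: Signer) -> Json:
    # Body for POST /_matrix/client/v3/keys/device_signing/upload (auth omitted).
    msk_id = f"ed25519:{unpadded_b64(msk_pub)}"
    ssk = cross_signing_key(user_id, "self_signing", ssk_pub)
    usk = cross_signing_key(user_id, "user_signing", usk_pub)
    return {
        "master_key": cross_signing_key(user_id, "master", msk_pub),
        "self_signing_key": sign_json(ssk, user_id, msk_id, msk_signer),
        "user_signing_key": sign_json(usk, user_id, msk_id, msk_signer),
    }


def signatures_upload_body(user_id: str, device_keys: Json, ssk_pub: bytes,
                           ssk_signer: Signer) -> Json:
    # Body for POST /_matrix/client/v3/keys/signatures/upload: the device's
    # published key object (as returned by /keys/query), signed by the SSK.
    device = {k: v for k, v in device_keys.items() if k not in ("signatures", "unsigned")}
    ssk_id = f"ed25519:{unpadded_b64(ssk_pub)}"
    return {user_id: {device["device_id"]: sign_json(device, user_id, ssk_id, ssk_signer)}}
```

With PyNaCl, for example, a signer could be `lambda msg: signing_key.sign(msg).signature`. Keeping signing behind a callable keeps the payload logic testable without a crypto dependency.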

Paste C would add a SAS verification handler to the responder so that when the user clicks "Verify" on the bot's profile in Element, the Matrix verification state machine runs through m.key.verification.{request,ready,start,accept,key,mac,done} without any human-on-the-bot-side intervention.
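The event sequence above can be modeled, from the bot's side as the responder, as a small transition table. This is a sketch under stated assumptions: it tracks event ordering only (no key agreement, commitment, or MAC computation), it assumes the bot auto-confirms the SAS because there is no human on its side, and the names State, TRANSITIONS, and Responder are hypothetical.

```python
from enum import Enum, auto
from typing import Dict, Tuple

P = "m.key.verification."


class State(Enum):
    IDLE = auto()
    READY_SENT = auto()
    ACCEPT_SENT = auto()
    KEYS_EXCHANGED = auto()
    AWAITING_DONE = auto()
    DONE = auto()
    CANCELLED = auto()


# (current state, incoming event type) -> (event types the bot sends, next state)
TRANSITIONS: Dict[Tuple[State, str], Tuple[Tuple[str, ...], State]] = {
    (State.IDLE, P + "request"): ((P + "ready",), State.READY_SENT),
    # A bare to-device `start` with no prior request is answered directly.
    (State.IDLE, P + "start"): ((P + "accept",), State.ACCEPT_SENT),
    (State.READY_SENT, P + "start"): ((P + "accept",), State.ACCEPT_SENT),
    # The starter sends its ephemeral key first; the accepter replies with its own.
    (State.ACCEPT_SENT, P + "key"): ((P + "key",), State.KEYS_EXCHANGED),
    # Bot auto-confirms: answer the user's MAC with our MAC, then done.
    (State.KEYS_EXCHANGED, P + "mac"): ((P + "mac", P + "done"), State.AWAITING_DONE),
    (State.AWAITING_DONE, P + "done"): ((), State.DONE),
}


class Responder:
    def __init__(self) -> None:
        self.state = State.IDLE

    def handle(self, event_type: str) -> Tuple[str, ...]:
        """Return the event types to send in reply, advancing the state."""
        if self.state in (State.DONE, State.CANCELLED):
            return ()
        if event_type == P + "cancel":
            self.state = State.CANCELLED
            return ()
        key = (self.state, event_type)
        if key not in TRANSITIONS:
            # Out-of-order event: cancel (the spec's code is m.unexpected_message).
            self.state = State.CANCELLED
            return (P + "cancel",)
        replies, self.state = TRANSITIONS[key]
        return replies
```

A table like this makes a stall visible: if the bot sits in ACCEPT_SENT, the other side never sent its key, which narrows where to look.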

What we tried (matrix-nio)

Reference callback shape, modeled on matrix-nio docs and examples:

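import logging

from nio import (
    KeyVerificationStart, KeyVerificationKey,
    KeyVerificationMac, KeyVerificationCancel,
)
from nio.exceptions import LocalProtocolError

log = logging.getLogger("bot.verif")

# `client` is the responder's logged-in nio AsyncClient with E2EE enabled.
async def on_verification(event):
    tx = event.transaction_id
    log.info("[bot verif] %s tx=%s", type(event).__name__, tx)
    if isinstance(event, KeyVerificationStart):
        # The SAS method is named in `method`; `short_authentication_string`
        # only lists display formats such as "emoji" and "decimal".
        if event.method == "m.sas.v1":
            await client.accept_key_verification(tx)
    elif isinstance(event, KeyVerificationKey):
        # No human on the bot side: confirm the SAS immediately.
        await client.confirm_short_auth_string(tx)
    elif isinstance(event, KeyVerificationMac):
        sas = client.key_verifications.get(tx)
        if sas:
            try:
                await client.to_device(sas.get_mac())
            except LocalProtocolError as e:
                # e.g. the flow was already cancelled; log instead of ignoring.
                log.warning("get_mac failed tx=%s: %s", tx, e)
    elif isinstance(event, KeyVerificationCancel):
        log.warning("verification cancelled tx=%s: %s", tx, event.reason)

client.add_to_device_callback(on_verification, (
    KeyVerificationStart, KeyVerificationKey,
    KeyVerificationMac, KeyVerificationCancel,
))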

To test, we wrote a driver (tests/test_sas_verify.py, not committed) that:

  1. signs up a fresh bot via /signup/api
  2. runs /signup/api/crosssign (Paste B) so the bot has MSK/SSK/USK
  3. launches the Paste A+C responder
  4. as a separate matrix-nio client (standing in for the user's Element), fetches the bot's device via /_matrix/client/v3/keys/query, injects it into the verifier's device_store, and calls start_key_verification(device) to send the initial m.key.verification.start

What happened

Bot's log consistently showed exactly one event:

[bot verif] KeyVerificationStart tx=...

Then silence. The verifier never received m.key.verification.key from the bot, so confirm_short_auth_string never fired, no MACs were exchanged, and the driver gave up at its 30-second timeout.
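When a run stalls like this, it helps if the driver reports the last verification event it saw instead of only timing out. A minimal sketch, assuming the driver's to-device callback calls a hypothetical VerificationTrace.record() with each event type it sends or receives:

```python
import asyncio
from typing import List

TERMINAL = ("m.key.verification.done", "m.key.verification.cancel")


class VerificationTrace:
    """Hypothetical helper: records verification event types in the order seen."""

    def __init__(self) -> None:
        self.seen: List[str] = []
        self.finished = asyncio.Event()

    def record(self, event_type: str) -> None:
        self.seen.append(event_type)
        if event_type in TERMINAL:
            self.finished.set()


async def wait_or_report(trace: VerificationTrace, timeout: float = 30.0) -> str:
    """Wait for done/cancel; on timeout, say where the flow stalled."""
    try:
        await asyncio.wait_for(trace.finished.wait(), timeout)
    except asyncio.TimeoutError:
        last = trace.seen[-1] if trace.seen else "(no events)"
        return f"timed out after {timeout:g}s; last event: {last}"
    return f"finished with {trace.seen[-1]}"
```

Create the trace inside the running event loop (for example in the driver's async main) so the asyncio.Event is bound to the right loop on older Python versions.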

Things we ruled out:

  • ❌ bot crashed (process stayed up, sync loop healthy, !ping still returned pong)
  • ❌ device_signed: false (we re-ran crosssign after the bot uploaded its device keys; device_signed: true confirmed via /keys/query)
  • ❌ nio-specific event-routing issue (a "cast a wide net" variant that registered the callback on the ToDeviceEvent base class behaved identically)
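The device_signed check can be reproduced from the raw /_matrix/client/v3/keys/query response. A minimal sketch; the function and summary field names are hypothetical (device_signed mirrors the flag above), and it only checks that the expected keys and signature entries are present, not that the Ed25519 signatures verify.

```python
from typing import Any, Dict


def cross_signing_summary(resp: Dict[str, Any], user_id: str,
                          device_id: str) -> Dict[str, bool]:
    """Summarise one user's cross-signing state from a /keys/query response."""
    ssk = resp.get("self_signing_keys", {}).get(user_id, {})
    ssk_ids = set(ssk.get("keys", {}))  # e.g. {"ed25519:<ssk public key>"}
    device = resp.get("device_keys", {}).get(user_id, {}).get(device_id, {})
    device_sigs = device.get("signatures", {}).get(user_id, {})
    return {
        "has_master_key": user_id in resp.get("master_keys", {}),
        "has_self_signing_key": bool(ssk_ids),
        # user_signing_keys is only returned when you query your own user.
        "has_user_signing_key": user_id in resp.get("user_signing_keys", {}),
        # Signed by SSK = the device's signatures include one under an SSK key ID.
        "device_signed": bool(ssk_ids & set(device_sigs)),
    }
```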

Root cause (as best we can tell)

matrix-nio's Sas class (in nio/crypto/sas.py) is a partial implementation: it exposes the primitives (share_key, accept_sas, get_mac; AsyncClient wraps some of them as accept_key_verification and confirm_short_auth_string) but does not orchestrate the full state machine for you. The caller is expected to:

  • know when to call share_key and when nio's sync loop has already sent the key implicitly
  • rely on accept_key_verification to send m.key.verification.accept, but also, depending on whether the bot is the initiator or the responder, sometimes send m.key.verification.key itself via share_key(); it is not clear when that second send is redundant
  • fetch the MAC at the right moment (calling get_mac() before the internal state is ready raises or returns None)

Upstream matrix-nio has open issues about verification flow fragility (see issues labeled verification in the poljar/matrix-nio repo). The Rust SDK (matrix-rust-sdk) has a bootstrap_cross_signing() helper and higher-level verification orchestration; Python nio doesn't.

Options (re-evaluated)

  1. Switch verification to mautrix-python — ✅ recommended

    • mature; the mautrix bridges run on it (or on mautrix-go) in production
    • same dependency hermes-agent's own gateway already uses (no new dependency tree)
    • has OlmMachine-level verification helpers that coordinate the full flow
    • trade-off: heavier setup (an aiosqlite/asyncpg store for OlmMachine state) and a more verbose API; Paste A's responder can either stay on nio or migrate too
  2. Augment matrix-nio with our own SAS state machine — possible

    • ~150 lines against raw to-device events + python-olm primitives, bypassing nio's Sas class
    • keeps a single SDK surface for agents
    • trade-off: we now own Matrix verification spec compliance; any drift is on us
  3. Fork matrix-nio — overkill, not recommended

    • a fork is cheap to create but costly to maintain, and the project isn't at a scale that justifies that cost.
  4. Wait for upstream nio — ❌ rejected

    • that's what we were doing; no sign of movement. Not a real option.
  5. Rust SDK bindings (matrix-sdk-ffi, matrix-sdk-crypto-js) — possible, higher friction

    • canonical implementation, best feature coverage
    • Python ergonomics are rougher than mautrix-python for a small bot
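Option 2 means owning pieces of the spec that nio's Sas class handles today. As one small example, this is how the spec turns the shared SAS bytes into what each side displays. The sketch assumes the SAS bytes are already derived (HKDF over the ECDH shared secret, not shown) and returns indices into the spec's 64-entry emoji table rather than reproducing the table; the function names are hypothetical.

```python
from typing import List, Tuple


def sas_decimal(sas_bytes: bytes) -> Tuple[int, int, int]:
    """Three numbers from the first 5 SAS bytes: 13 bits each, plus 1000."""
    if len(sas_bytes) < 5:
        raise ValueError("decimal SAS needs at least 5 bytes")
    b0, b1, b2, b3, b4 = sas_bytes[:5]
    return (
        ((b0 << 5) | (b1 >> 3)) + 1000,
        (((b1 & 0x07) << 10) | (b2 << 2) | (b3 >> 6)) + 1000,
        (((b3 & 0x3F) << 7) | (b4 >> 1)) + 1000,
    )


def sas_emoji_indices(sas_bytes: bytes) -> List[int]:
    """Seven 6-bit indices taken from the first 42 bits of the SAS bytes."""
    if len(sas_bytes) < 6:
        raise ValueError("emoji SAS needs at least 6 bytes")
    bits = int.from_bytes(sas_bytes[:6], "big")  # 48 bits; use the top 42
    return [(bits >> (42 - 6 * i)) & 0x3F for i in range(7)]
```

Both sides must agree on every bit of this, which is the kind of spec-compliance surface option 2 would put on us.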

Proposed follow-up work

  • Spike: port responder.py from matrix-nio to mautrix-python. ~200 lines of rewrite. Verify the same !ping round-trip works.
  • Add OlmMachine-driven verification callback that handles the full SAS flow (Element's handle_verification pattern from the bridges).
  • Write an automated end-to-end test: a second mautrix-python client plays the "human clicks Verify" role, drives the SAS handshake, asserts completion.
  • Once green: re-enable Paste C on /signup and in skills/matrix-invite-join/SKILL.md with the mautrix-python version.
  • Long-shot: consider whether Paste A's responder itself should be mautrix-python too, for consistency and because mautrix handles megolm session lifetime / key backup more gracefully than nio.

Interim guidance for users

Paste C is not live. To trust the bot without it, compare the bot's MSK fingerprint out-of-band: it's returned as msk_public in the POST /signup/api/crosssign response. Open the bot's profile in Element; Element shows the same fingerprint under "Security." If they match, the bot is the one you onboarded. This is the same security signal SAS gives you, just checked by eye instead of with emojis.
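The comparison can also be scripted as a sanity check that the homeserver is serving the master key Paste B generated, alongside (not instead of) the by-eye check. A minimal sketch, assuming msk_public is the unpadded-base64 Ed25519 public key, i.e. the same encoding as the values under master_keys in a /keys/query response; the function names are hypothetical.

```python
from typing import Any, Dict


def group4(key: str) -> str:
    """Split a key into space-separated 4-character groups for reading aloud."""
    return " ".join(key[i:i + 4] for i in range(0, len(key), 4))


def normalise(key: str) -> str:
    # Drop any whitespace a human copy/paste (or group4) introduced.
    return "".join(key.split())


def msk_matches(msk_public: str, keys_query: Dict[str, Any], user_id: str) -> bool:
    """True if msk_public equals a master key the homeserver serves for user_id."""
    master = keys_query.get("master_keys", {}).get(user_id, {})
    served = master.get("keys", {}).values()
    return any(normalise(msk_public) == normalise(k) for k in served)
```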

Broader take

Matrix has the right threat model for a federated E2EE messenger, but its Python client ecosystem has significant integration gaps for anyone building bots or agents that do more than "send and receive plaintext." Paste A is enough to demonstrate the whole stack working end-to-end from scratch in a single paste to an agent; that's real and shippable. Paste B proves out the server-side cross-signing path, also real. Paste C makes a good aha-moment story for demos, but on matrix-nio as it stands it is not reliable enough to demo. We should stop banging on that door and walk through the one next to it (mautrix-python).


Labels: roadmap, matrix-sdk, verification
