KeyError: 'quarantined_media' in on_POSITION causes all workers to crash repeatedly on 1.152.0 #19750

@fir3drag0n

Description

After upgrading to 1.152.0, the event_worker crashes repeatedly with an unhandled KeyError: 'quarantined_media' in on_POSITION. This tears down the Redis
replication connection every ~3 seconds, causing the sync worker to stop receiving events and bridges to fail delivering E2EE decryption keys.

Steps to reproduce

Root Cause

on_POSITION in synapse/replication/tcp/handler.py line 635 does a direct dict lookup:

stream = self._streams[cmd.stream_name]

If a worker receives a POSITION for a stream it doesn't own (e.g. quarantined_media on the event worker), this raises an unhandled KeyError which tears down the
Twisted connection.
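
The failure mode can be reproduced in isolation. The following is a minimal sketch, not Synapse's actual code: `PositionCommand` and `ReplicationHandlerSketch` are hypothetical stand-ins, and the streams registered on the worker are chosen purely for illustration. It only shows that a direct lookup in a dict of known streams raises `KeyError` for a stream name the worker has not registered.

```python
from dataclasses import dataclass


@dataclass
class PositionCommand:
    """Hypothetical stand-in for a replication POSITION command."""
    stream_name: str
    new_token: int


class ReplicationHandlerSketch:
    """Hypothetical stand-in for a worker's replication command handler."""

    def __init__(self, known_streams):
        # Maps stream name -> last position seen for that stream.
        self._streams = {name: 0 for name in known_streams}

    def on_POSITION(self, cmd: PositionCommand) -> None:
        # Direct lookup: raises KeyError for a name that is not registered.
        current = self._streams[cmd.stream_name]
        self._streams[cmd.stream_name] = max(current, cmd.new_token)


handler = ReplicationHandlerSketch(["events", "receipts"])
handler.on_POSITION(PositionCommand("events", 5))  # known stream: accepted

try:
    handler.on_POSITION(PositionCommand("quarantined_media", 1))
    raised = None
except KeyError as exc:
    # The exception carries the missing stream name as its argument.
    raised = exc.args[0]
```

In the real worker nothing catches this exception, which matches the `Unhandled Error` reported by Twisted in the log output.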

Note: adding quarantined_media: ["media_worker"] to stream_writers in homeserver.yaml does not help. WriterLocations.__init__() rejects it as an unexpected keyword
argument.
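
Why the config workaround fails can be illustrated with a small sketch. It assumes the `stream_writers` mapping is passed as keyword arguments to a class with a fixed set of fields; `WriterLocationsSketch`, its three fields, and the `"master"` defaults are hypothetical and are not copied from Synapse.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WriterLocationsSketch:
    """Hypothetical config class that declares a fixed set of streams."""
    events: List[str] = field(default_factory=lambda: ["master"])
    receipts: List[str] = field(default_factory=lambda: ["master"])
    typing: List[str] = field(default_factory=lambda: ["master"])


# Parsed `stream_writers` mapping, including a key the class does not declare.
stream_writers = {
    "events": ["event_worker"],
    "quarantined_media": ["media_worker"],
}

try:
    WriterLocationsSketch(**stream_writers)
    error = None
except TypeError as exc:
    # Undeclared keys surface as an "unexpected keyword argument" error.
    error = str(exc)
```

Under that assumption, a stream name can only be configured once the class itself declares it, so the setting cannot be used to work around the crash.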

### Homeserver

homeserver

### Synapse Version

1.152.0

### Installation Method

Docker (matrixdotorg/synapse)

### Database

PostgreSQL 18

### Workers

Multiple workers

### Platform

| Component | Details |
| --- | --- |
| OS | Ubuntu 24.04 (Oracle Cloud) |
| Kernel | 6.17.0 |
| Arch | aarch64 (ARM64) |
| Docker | 27.4.1 |
| Python | 3.13.13 |
| Synapse | 1.152.0 (matrixdotorg/synapse:latest) |
| Database | PostgreSQL 18 |
| Deployment | Docker, worker setup with Redis replication |


### Configuration

homeserver.yaml (relevant excerpt):
stream_writers:
  events: ["event_worker"]
  receipts: ["event_worker"]
  typing: ["event_worker"]
  presence: ["event_worker"]
  to_device: ["event_worker"]
  account_data: ["event_worker"]

worker_event.yaml:
worker_app: synapse.app.generic_worker
worker_name: event_worker

worker_listeners:
  - port: 8083
    bind_addresses: ['127.0.0.1']
    type: http
    resources:
      - names: [client, federation, replication]
        compress: true

worker_media.yaml:
worker_app: synapse.app.generic_worker
worker_name: media_worker

worker_listeners:
  - type: http
    port: 8085
    resources:
      - names: [media, replication]


### Relevant log output

```shell
CRITICAL - sentinel - Unhandled Error
Traceback (most recent call last):
  File ".../twisted/internet/posixbase.py", line 491, in _doReadOrWrite
  File ".../twisted/internet/tcp.py", line 250, in doRead
  File ".../txredisapi.py", line 1858, in dataReceived
  File ".../synapse/replication/tcp/redis.py", line 178, in messageReceived
  File ".../synapse/replication/tcp/redis.py", line 219, in handle_command
  File ".../synapse/replication/tcp/handler.py", line 635, in on_POSITION
builtins.KeyError: 'quarantined_media'
```

### Anything else that would be useful to know?

Fix

stream = self._streams.get(cmd.stream_name)
if stream is None:
    logger.debug("Ignoring POSITION for unknown stream %s", cmd.stream_name)
    return

The fix/workaround needs to be applied to all workers, not just the event worker.
Every worker subscribes to Redis pub/sub and receives all POSITION broadcasts,
including for streams it doesn't own.

Affected workers in our setup: event_worker, sync_worker, media_worker,
federation_worker, push_worker.
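
The effect of the guard under a broadcast can be sketched as follows. This is a simplified model under stated assumptions, not a patch for Synapse: `TolerantHandlerSketch`, `PositionCommand`, `broadcast`, the worker names `worker_a` and `worker_b`, and the streams each one registers are all hypothetical. Pub/sub is modelled as a plain loop that hands every command to every worker.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("replication_sketch")


@dataclass
class PositionCommand:
    """Hypothetical stand-in for a replication POSITION command."""
    stream_name: str
    new_token: int


class TolerantHandlerSketch:
    """Hypothetical worker-side handler that skips unknown streams."""

    def __init__(self, worker_name, known_streams):
        self.worker_name = worker_name
        # Maps stream name -> last position seen for that stream.
        self._streams = {name: 0 for name in known_streams}
        self.ignored = []  # skipped stream names, kept for inspection

    def on_POSITION(self, cmd):
        stream = self._streams.get(cmd.stream_name)
        if stream is None:
            # Unknown stream: log and return instead of raising KeyError.
            logger.debug("Ignoring POSITION for unknown stream %s", cmd.stream_name)
            self.ignored.append(cmd.stream_name)
            return
        self._streams[cmd.stream_name] = max(stream, cmd.new_token)


workers = [
    TolerantHandlerSketch("worker_a", ["events", "receipts"]),
    TolerantHandlerSketch("worker_b", ["quarantined_media"]),
]


def broadcast(cmd):
    # Every worker sees every command, whether or not it knows the stream.
    for worker in workers:
        worker.on_POSITION(cmd)


broadcast(PositionCommand("quarantined_media", 7))
broadcast(PositionCommand("events", 3))
```

Each worker advances only the streams it knows and records the rest as ignored, so a single unknown stream name no longer takes down the connection.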
