☂️ Coherent update epic #638

@unmarshall

What you would like to be added?

Disaggregated inference splits LLM serving into distinct components that are deployed and scaled separately: prefill (context generation) and decode (token generation). This introduces a hard operational constraint during version upgrades, which are often incompatible across versions. Only compatible prefill and decode instances should communicate, and the decode:prefill update ratio should be kept proportional; otherwise the update leaves mismatched pools of compatible instances, which reduces the effective end-to-end serving capacity during the update.
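
To make the capacity loss concrete, here is a minimal sketch of the arithmetic. It rests on assumptions that are not part of this issue: each version forms its own compatible pool, a pool's throughput is limited by whichever component it has proportionally less of, and replaced instances are immediately available. The function name `effective_capacity` and the instance counts are hypothetical.

```python
from fractions import Fraction


def effective_capacity(prefill_total, decode_total, prefill_new, decode_new):
    """Fraction of baseline serving capacity available mid-update.

    Assumption: old and new versions cannot talk to each other, so each
    version is a separate pool whose capacity is capped by its scarcer
    component, measured relative to the balanced baseline.
    """
    def pool(prefill, decode):
        return min(Fraction(prefill, prefill_total), Fraction(decode, decode_total))

    old_pool = pool(prefill_total - prefill_new, decode_total - decode_new)
    new_pool = pool(prefill_new, decode_new)
    return old_pool + new_pool


# Baseline: 4 prefill and 8 decode instances (decode:prefill = 2:1).
# Proportional step: 2 prefill and 4 decode moved to the new version.
print(effective_capacity(4, 8, 2, 4))  # 1
# Skewed step: 4 decode moved to the new version, no prefill yet.
print(effective_capacity(4, 8, 0, 4))  # 1/2
```

Under these assumptions, the skewed step strands four new decode instances with no compatible prefill, and the old pool is left with only half of its decode capacity, so half of the baseline capacity is lost until prefill catches up.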

Why is this needed?

The goal of the coherent update strategy is to maintain balanced, compatible capacity across components so that version upgrades (especially incompatible ones) do not reduce serving capacity while they are in progress.

Labels: enhancement
