What you would like to be added?
Disaggregated inference splits LLM serving into distinct components that are deployed and scaled separately: prefill (context generation) and decode (token generation). This introduces a hard operational constraint during version upgrades, which are often incompatible across versions. Only compatible prefill and decode instances should communicate, and the decode:prefill update ratio should be kept proportional. Otherwise the rollout results in mismatched pools of compatible instances, which reduces the effective end-to-end serving capacity during the update.
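To make the capacity argument concrete, here is a minimal sketch in Python. It is an illustrative model rather than an existing API: the function names `effective_capacity` and `proportional_plan` are hypothetical, old and new versions are assumed to be fully incompatible, and each version's pool is assumed to be limited by whichever component (prefill or decode) is scarcer relative to its total. It also ignores capacity lost while an individual instance is being replaced.

```python
from fractions import Fraction


def effective_capacity(prefill_new, decode_new, prefill_total, decode_total):
    """Fraction of full serving capacity that is usable mid-rollout.

    Assumption (illustrative only): old and new versions cannot talk to
    each other, and each version's pool is capped by its scarcer component.
    """
    new_pool = min(Fraction(prefill_new, prefill_total),
                   Fraction(decode_new, decode_total))
    old_pool = min(Fraction(prefill_total - prefill_new, prefill_total),
                   Fraction(decode_total - decode_new, decode_total))
    return new_pool + old_pool


def proportional_plan(prefill_total, decode_total, steps):
    """Yield (prefill_new, decode_new) after each step, rounded down,
    so both components stay at roughly the same updated fraction."""
    for i in range(1, steps + 1):
        yield (prefill_total * i // steps, decode_total * i // steps)


if __name__ == "__main__":
    # Hypothetical fleet: 4 prefill and 8 decode instances, updated in 4 steps.
    for p_new, d_new in proportional_plan(4, 8, 4):
        print(p_new, d_new, effective_capacity(p_new, d_new, 4, 8))
    # Uncoordinated case: all prefill updated before any decode.
    print(effective_capacity(4, 0, 4, 8))
```

Under this model, the proportional plan keeps the compatible capacity at the full value after every step, while updating all prefill instances before any decode instance leaves no compatible pair at all.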
Why is this needed?
The goal of a coherent update strategy is to maintain balanced, compatible capacity across components so that version upgrades (especially incompatible ones) do not reduce serving capacity while they are in progress.