By default, max_inflight_tasks = max_staleness * per_device_train_batch_size * gradient_accumulation_steps * num_processes.
However, the aiohttp TCP connector used by AsyncRolloutWorker limits concurrent connections to 100 by default, so vLLM never has more than 100 requests running, even when max_inflight_tasks > 100.
# My max_inflight_tasks = 256 and yet I only have 100 vLLM reqs
Engine 000: Avg prompt throughput: 255.0 tokens/s, Avg generation throughput: 4757.7 tokens/s
Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.7%, Prefix cache hit rate: 0.0%
^^^^^^^^ ^^^^^^
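To make the mismatch concrete, here is a minimal sketch of the arithmetic. The four config values are hypothetical (picked only so that the product is 256, matching the number above), and both helper names are illustrative, not functions from the library. The value 100 is the default of the `limit` parameter of aiohttp's `TCPConnector`.

```python
def default_max_inflight_tasks(
    max_staleness: int,
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    num_processes: int,
) -> int:
    """Default value as described in this issue (product of the four settings)."""
    return (
        max_staleness
        * per_device_train_batch_size
        * gradient_accumulation_steps
        * num_processes
    )


def effective_concurrency(max_inflight_tasks: int, connector_limit: int = 100) -> int:
    """Requests that can actually be open at once behind a connection-limited client."""
    return min(max_inflight_tasks, connector_limit)


# Hypothetical settings, chosen only so that the product is 256.
inflight = default_max_inflight_tasks(
    max_staleness=2,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    num_processes=4,
)
capped = effective_concurrency(inflight)
print(inflight, capped)
```

With these assumed values the scheduler allows 256 tasks, but the connector lets only 100 reach vLLM, which matches the "Running: 100 reqs" line in the log.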
I wonder whether this behavior is intentional, for stability. If not, would it make sense to set that limit to max(100, max_inflight_tasks) so that vLLM receives all max_inflight_tasks requests? Users could then adjust vLLM's --max-num-seqs to control the number of running requests from there.
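A rough sketch of what I have in mind, assuming the worker builds its own aiohttp session. `connector_limit` and `open_session` are hypothetical names for illustration, not the actual AsyncRolloutWorker code; only `aiohttp.TCPConnector(limit=...)` and `aiohttp.ClientSession(connector=...)` are real aiohttp APIs.

```python
def connector_limit(max_inflight_tasks: int, floor: int = 100) -> int:
    """Proposed connection limit: never below aiohttp's default of 100."""
    return max(floor, max_inflight_tasks)


async def open_session(max_inflight_tasks: int):
    """Hypothetical helper: open a session whose connector allows all in-flight tasks.

    Must be awaited from inside a running event loop.
    """
    import aiohttp  # imported lazily so connector_limit stays usable without aiohttp

    connector = aiohttp.TCPConnector(limit=connector_limit(max_inflight_tasks))
    return aiohttp.ClientSession(connector=connector)
```

The limit keeps a floor of 100, so small configurations behave exactly as they do today, and only runs with max_inflight_tasks > 100 open more connections.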
I can move this to Discussions if needed. Thanks in advance!