Motivation
VIDUR currently simulates collocated prefill+decode on a single replica fleet. As disaggregated serving (à la Dynamo, Mooncake, DistServe) becomes standard for production LLM deployment, VIDUR needs a scheduler that can sweep prefill:decode worker ratios and predict optimal fleet splits.
Proposed addition
DisaggregatedScheduler — a discrete-event simulation of separate prefill and decode worker fleets:
arrival_queue → prefill_queue → kv_transfer_queue → decode_queue → done
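To make the queue chain above concrete, here is a minimal sketch of the staged pipeline. It is not the proposed DisaggregatedScheduler: the names `Request`, `fifo_stage`, and `simulate` are hypothetical, each worker is assumed to serve one request at a time (no batching), and KV transfers are assumed not to contend with each other for bandwidth. Instead of running a full event loop, it computes FIFO finish times stage by stage, which gives the same start times for first-come-first-served multi-worker queues.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Request:
    arrival: float           # arrival time in seconds
    prefill_time: float      # service time on a prefill worker
    kv_transfer_time: float  # t_kv for this request
    decode_time: float       # service time on a decode worker


def fifo_stage(ready_times, service_times, num_workers):
    """Finish times for a FIFO stage served by `num_workers` identical workers.

    Jobs are taken in order of ready time; each one starts on whichever
    worker frees up first, but never before the job itself is ready.
    """
    free_at = [0.0] * num_workers  # min-heap of worker-free timestamps
    heapq.heapify(free_at)
    finish = [0.0] * len(ready_times)
    for i in sorted(range(len(ready_times)), key=lambda j: ready_times[j]):
        start = max(ready_times[i], heapq.heappop(free_at))
        finish[i] = start + service_times[i]
        heapq.heappush(free_at, finish[i])
    return finish


def simulate(requests, num_prefill, num_decode):
    """Return per-request E2E latency for a given prefill:decode split."""
    arrivals = [r.arrival for r in requests]
    prefill_done = fifo_stage(
        arrivals, [r.prefill_time for r in requests], num_prefill
    )
    # Assumption: the transfer is a pure delay with no queueing of its own.
    decode_ready = [
        t + r.kv_transfer_time for t, r in zip(prefill_done, requests)
    ]
    decode_done = fifo_stage(
        decode_ready, [r.decode_time for r in requests], num_decode
    )
    return [done - r.arrival for done, r in zip(decode_done, requests)]


# Three identical requests arriving together, one prefill worker.
reqs = [Request(0.0, 1.0, 0.5, 2.0) for _ in range(3)]
one_decode = simulate(reqs, num_prefill=1, num_decode=1)
two_decode = simulate(reqs, num_prefill=1, num_decode=2)
```

In this toy case, adding a second decode worker shortens the tail because decode, not prefill, is the bottleneck; a ratio sweep would repeat this over many p:d splits.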
KV transfer latency (the key new latency term):
t_kv = kv_bytes / interconnect_bandwidth
kv_bytes = 2 (K and V) × num_layers × num_kv_heads × head_dim × seq_len × 2 bytes (fp16), computed per request
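Configurable interconnect: NVLink (600 GB/s), InfiniBand (400 GB/s), PCIe (64 GB/s). Note that a single NDR InfiniBand port is 400 Gb/s, roughly 50 GB/s, so the 400 GB/s figure only holds as an aggregate across multiple NICs.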
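As a worked example of the two formulas above, the following sketch evaluates kv_bytes and t_kv for one request. The function names and the model shape (32 layers, 32 KV heads, head_dim 128, 2048 tokens) are illustrative assumptions, not values from the implementation; the bandwidths are the nominal figures listed in this issue, with GB taken as 10^9 bytes.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size for one request; the leading 2 covers K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def kv_transfer_seconds(kv_bytes, bandwidth_gb_per_s):
    """t_kv = kv_bytes / interconnect_bandwidth, with GB = 1e9 bytes."""
    return kv_bytes / (bandwidth_gb_per_s * 1e9)


# Nominal bandwidths as listed in this issue (GB/s).
INTERCONNECT_GB_PER_S = {"nvlink": 600, "infiniband": 400, "pcie": 64}

# Hypothetical model shape, chosen only for illustration.
kv_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=2048)

t_kv = {
    name: kv_transfer_seconds(kv_bytes, bw)
    for name, bw in INTERCONNECT_GB_PER_S.items()
}

for name, seconds in t_kv.items():
    print(f"{name}: {seconds * 1e3:.2f} ms for {kv_bytes / 2**30:.1f} GiB")
```

For this shape the cache is exactly 1 GiB, so the transfer takes a few milliseconds on the faster links and about 17 ms over PCIe. This ignores per-transfer setup latency and link contention, which a real model of the interconnect would likely need.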
Outputs: p50/p90/p99 E2E latency, TTFT, KV transfer stats, per-fleet utilization, effective throughput — exactly what's needed to answer "what p:d ratio minimises p99 TTFT at traffic λ for model M over interconnect I?"
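The percentile outputs listed above could be computed along these lines. This is a minimal sketch assuming the nearest-rank definition of a percentile; the implementation may use a different (for example interpolating) definition, and `percentile` and `summarize` are hypothetical names.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(latencies):
    """Return the p50/p90/p99 of a list of latency samples."""
    return {f"p{p}": percentile(latencies, p) for p in (50, 90, 99)}


# Synthetic latencies of 1..100 ms, for illustration only.
summary = summarize(list(range(1, 101)))
```

Running `summarize` on the TTFT and E2E samples of each p:d split gives one row per split, and the split with the lowest p99 TTFT answers the question posed above.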
Implementation status
Working implementation + 3 passing tests. Will submit PR once issue is confirmed in scope (wanted to check before opening a large PR).
Related