[Feature] Disaggregated prefill/decode scheduler for P/D ratio optimization #79

## Motivation

  VIDUR currently simulates collocated prefill+decode on a single replica fleet. As disaggregated serving (à la Dynamo, Mooncake, DistServe) becomes standard for production LLM deployment, VIDUR needs a scheduler that can sweep **prefill:decode worker ratios** and predict optimal fleet splits.

  ## Proposed addition
  
  `DisaggregatedScheduler`: a discrete-event simulation of separate prefill and decode worker fleets. Each request moves through the following stages:

  `arrival_queue → prefill_queue → kv_transfer_queue → decode_queue → done`
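The pipeline above can be sketched as two FIFO fleets joined by a transfer delay. This is a minimal illustration, not the proposed `DisaggregatedScheduler`: the `Request` fields and the `simulate` function are hypothetical names, each worker is assumed to serve one request at a time (no batching), KV transfers are assumed not to contend for the link, and TTFT is assumed to be measured at prefill completion.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Request:
    arrival: float    # arrival time (s)
    prefill_s: float  # prefill service time (s)
    kv_s: float       # KV transfer time (s)
    decode_s: float   # decode service time (s)


def simulate(requests, num_prefill, num_decode):
    """Return a list of (ttft, e2e) tuples, one per request in input order.

    Assumptions (illustrative only): FIFO queues, identical workers,
    one request per worker at a time, uncontended KV transfers.
    """
    # Each list is a min-heap holding the time at which a worker becomes free.
    prefill_free = [0.0] * num_prefill
    decode_free = [0.0] * num_decode
    ttft = [0.0] * len(requests)
    e2e = [0.0] * len(requests)

    # Stage 1: prefill, served in arrival order.
    ready = []  # (time the KV cache lands on the decode side, request index)
    order = sorted(range(len(requests)), key=lambda i: requests[i].arrival)
    for i in order:
        r = requests[i]
        start = max(r.arrival, heapq.heappop(prefill_free))
        done = start + r.prefill_s
        heapq.heappush(prefill_free, done)
        ttft[i] = done - r.arrival
        ready.append((done + r.kv_s, i))

    # Stage 2: decode, served in order of KV arrival.
    for ready_at, i in sorted(ready):
        r = requests[i]
        start = max(ready_at, heapq.heappop(decode_free))
        done = start + r.decode_s
        heapq.heappush(decode_free, done)
        e2e[i] = done - r.arrival

    return list(zip(ttft, e2e))
```

With two identical requests arriving at time 0, a 1:1 split queues the second request behind the first in both stages, while a 2:2 split serves them in parallel.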

  **KV transfer latency** (the key new latency term):

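  `t_kv = kv_bytes / interconnect_bandwidth`

  `kv_bytes = 2 × num_layers × num_kv_heads × head_dim × seq_len × 2 B` (fp16)

  The leading factor of 2 counts the key and value tensors; the trailing 2 B is the size of one fp16 element.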

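  **Configurable interconnect:** NVLink (600 GB/s), InfiniBand (400 GB/s), PCIe (64 GB/s). Note that InfiniBand link rates are usually quoted in bits (NDR is 400 Gb/s per port, about 50 GB/s), so the 400 GB/s preset should be read as an aggregate across several NICs rather than a single link.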
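As a worked example of the KV transfer term, the sketch below evaluates the two formulas for a hypothetical model shape. The function names, the `INTERCONNECT_GBPS` table, and the example shape (32 layers, 32 KV heads, head dimension 128, 4096 tokens) are illustrative assumptions; the bandwidth values are the presets listed in this issue.

```python
GB = 1e9  # decimal gigabyte, matching the GB/s figures above

# Bandwidth presets as listed in this issue (GB/s).
INTERCONNECT_GBPS = {"nvlink": 600, "infiniband": 400, "pcie": 64}


def kv_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size in bytes for one request; the leading 2 is K plus V."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def kv_transfer_seconds(nbytes, interconnect):
    """Bandwidth-only transfer time; ignores link latency and contention."""
    return nbytes / (INTERCONNECT_GBPS[interconnect] * GB)


# Hypothetical shape: 32 layers, 32 KV heads, head_dim 128, 4096 tokens, fp16.
size = kv_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(size)  # 2147483648 bytes (2 GiB)
for name in INTERCONNECT_GBPS:
    print(name, f"{kv_transfer_seconds(size, name) * 1e3:.2f} ms")
```

Because `t_kv` scales linearly with `seq_len`, long prompts over a slow interconnect can make the transfer stage a visible part of end-to-end latency.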

  **Outputs:** p50/p90/p99 E2E latency, TTFT, KV transfer stats, per-fleet utilization, effective throughput — exactly what's needed to answer "what p:d ratio minimises p99 TTFT at traffic λ for model M over interconnect I?"
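A ratio sweep over these outputs could look like the sketch below. It is an illustration under stated assumptions, not the PR's code: `percentile` uses the nearest-rank definition, and `sweep_splits` takes a hypothetical `run_sim(num_prefill, num_decode)` callable that returns a list of TTFT samples for one fleet split.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile of a non-empty sample list, with 0 < q <= 100."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def sweep_splits(total_workers, run_sim, q=99):
    """Try every prefill:decode split of a fixed fleet and rank by the q-th
    percentile TTFT. `run_sim(p, d)` is a hypothetical simulator hook."""
    results = {}
    for p in range(1, total_workers):
        d = total_workers - p
        results[(p, d)] = percentile(run_sim(p, d), q)
    best = min(results, key=results.get)
    return best, results
```

Passing the simulator as a callable keeps the sweep independent of how TTFT samples are produced, so the same loop could rank splits by E2E latency or another metric.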

  ## Implementation status

  Working implementation + 3 passing tests. Will submit PR once issue is confirmed in scope (wanted to check before opening a large PR).

  ## Related

  - Issue #75 (MoE/EP support) — the same PR includes MoE model configs (DeepSeek-V3, Mixtral)
