[Feature] Disaggregated prefill/decode scheduler for P/D ratio optimization #79

@buddywhitman

Motivation

VIDUR currently simulates collocated prefill+decode on a single replica fleet. As disaggregated serving (à la Dynamo, Mooncake, DistServe) becomes standard for production LLM deployment, VIDUR needs a scheduler that can sweep prefill:decode worker ratios and predict optimal fleet splits.

Proposed addition

DisaggregatedScheduler — a discrete-event simulation of separate prefill and decode worker fleets:
arrival_queue → prefill_queue → kv_transfer_queue → decode_queue → done
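As a rough illustration of the queue chain above, here is a minimal sketch that treats each stage as a FIFO pool of identical workers and pushes requests through prefill, KV transfer, and decode in turn. It is not the proposed DisaggregatedScheduler: the names `run_stage` and `simulate` are hypothetical, service times are supplied by the caller, and batching, preemption, and link contention models are all left out.

```python
import heapq


def run_stage(ready_times, service_times, num_workers):
    """FIFO multi-worker stage: return each request's completion time.

    Requests are served in order of the time they become ready; each one
    takes the worker that frees up earliest.
    """
    order = sorted(range(len(ready_times)), key=lambda i: ready_times[i])
    free_at = [0.0] * num_workers  # min-heap of worker-available times
    heapq.heapify(free_at)
    done = [0.0] * len(ready_times)
    for i in order:
        worker_free = heapq.heappop(free_at)
        start = max(ready_times[i], worker_free)
        done[i] = start + service_times[i]
        heapq.heappush(free_at, done[i])
    return done


def simulate(arrivals, prefill_s, kv_s, decode_s,
             num_prefill, num_decode, num_links=1):
    """Chain the three stages: prefill -> KV transfer -> decode.

    Assumption: KV transfer is modelled as `num_links` serial links, which
    is one possible reading of kv_transfer_queue, not a confirmed design.
    """
    prefill_done = run_stage(arrivals, prefill_s, num_prefill)
    kv_done = run_stage(prefill_done, kv_s, num_links)
    decode_done = run_stage(kv_done, decode_s, num_decode)
    return prefill_done, kv_done, decode_done


if __name__ == "__main__":
    p, k, d = simulate(
        arrivals=[0.0, 0.0, 1.0],
        prefill_s=[1.0, 1.0, 1.0],
        kv_s=[0.5, 0.5, 0.5],
        decode_s=[2.0, 2.0, 2.0],
        num_prefill=1,
        num_decode=2,
    )
    print("prefill done:", p)
    print("kv done:", k)
    print("decode done:", d)
```

With one prefill worker and two decode workers, the example shows prefill as the bottleneck: requests leave prefill one second apart, so the second decode worker is never contended.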

KV transfer latency (the key new latency term):

t_kv = kv_bytes / interconnect_bandwidth
kv_bytes = 2 (K and V) × num_layers × num_kv_heads × head_dim × seq_len × 2 bytes per element (fp16)

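Configurable interconnect: NVLink (600 GB/s), InfiniBand (400 GB/s), PCIe (64 GB/s). Note that InfiniBand NDR is rated at 400 Gb/s per port (about 50 GB/s), so the InfiniBand figure may need a bits-versus-bytes check.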
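The KV transfer term can be worked through numerically. The sketch below evaluates the two formulas for one request; `KVShape`, `kv_bytes`, `kv_transfer_seconds`, and the `INTERCONNECT_BW` table are hypothetical names, the bandwidths are simply the figures quoted in this issue, and the model shape in the example (32 layers, 32 KV heads, head_dim 128) is an illustrative choice rather than anything VIDUR ships.

```python
from dataclasses import dataclass

# Bandwidths in bytes per second, copied from the figures quoted in this
# issue; treat them as placeholders rather than measured values.
INTERCONNECT_BW = {
    "nvlink": 600e9,
    "infiniband": 400e9,
    "pcie": 64e9,
}


@dataclass(frozen=True)
class KVShape:
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_elem: int = 2  # fp16


def kv_bytes(shape: KVShape, seq_len: int) -> int:
    """KV cache size for one sequence: the leading 2 covers K and V."""
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * seq_len * shape.bytes_per_elem)


def kv_transfer_seconds(shape: KVShape, seq_len: int, interconnect: str) -> float:
    """t_kv = kv_bytes / interconnect_bandwidth (no latency or overlap terms)."""
    return kv_bytes(shape, seq_len) / INTERCONNECT_BW[interconnect]


if __name__ == "__main__":
    shape = KVShape(num_layers=32, num_kv_heads=32, head_dim=128)
    size = kv_bytes(shape, seq_len=2048)
    print(f"kv_bytes = {size} ({size / 2**30:.2f} GiB)")
    for name in INTERCONNECT_BW:
        ms = kv_transfer_seconds(shape, 2048, name) * 1e3
        print(f"{name:>10}: {ms:.2f} ms")
```

For this shape a 2048-token prompt carries exactly 1 GiB of KV cache, so the transfer takes roughly 17 ms at 64 GB/s and under 2 ms at 600 GB/s. A pure bandwidth model like this ignores per-message latency and any overlap of transfer with compute.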

Outputs: p50/p90/p99 E2E latency, TTFT, KV transfer stats, per-fleet utilization, and effective throughput. Together these answer the question "what p:d ratio minimises p99 TTFT at traffic λ for model M over interconnect I?"
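To make the ratio question concrete, the following sketch shows one way the percentile summary and the p:d sweep could be expressed. The helpers `percentile`, `summarize`, and `best_split` are hypothetical, the nearest-rank percentile definition is an assumption (the implementation may use interpolation), and `p99_ttft_for_split` stands in for a full simulator run at a given split.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies):
    """Return the p50/p90/p99 summary for a list of latencies."""
    return {f"p{p}": percentile(latencies, p) for p in (50, 90, 99)}


def best_split(total_workers, p99_ttft_for_split):
    """Try every (prefill, decode) split with at least one worker per fleet
    and return the split with the lowest reported p99 TTFT."""
    splits = [(p, total_workers - p) for p in range(1, total_workers)]
    return min(splits, key=lambda s: p99_ttft_for_split(*s))


if __name__ == "__main__":
    print(summarize(list(range(1, 101))))
    # Toy cost function standing in for a simulator run.
    print(best_split(8, lambda p, d: abs(p - 3)))
```

An exhaustive sweep is cheap here because a fleet of N workers has only N - 1 candidate splits; the cost is dominated by the simulation run behind each evaluation.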

Implementation status

Working implementation + 3 passing tests. Will submit PR once issue is confirmed in scope (wanted to check before opening a large PR).

Related
