[perf]: support multistream overlap(dbo) for deepseek #941
What this PR does / why we need it?
Based on the dual-batch overlap design proposed by the DeepSeek team and the fused MoE implementation in the vLLM project, we implement multi-stream (also known as dual-batch) overlap for DeepSeek + MLA on Ascend NPU. We split the model's input batch into two microbatches and overlap the computation/communication ops of the attention and MoE layers across two streams to improve performance. The approach can easily be extended when dispatch/combine communications are added for the MoE layer.
Compared with the previously proposed draft, we use one stream for computation ops and the other for communication ops. In our opinion, this makes it easier to arrange the execution order of different ops and avoids contention between computation and communication resources, as sketched below.
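To make the idea concrete, here is a minimal, illustrative sketch of the two-stream arrangement. It is not the PR's actual code: `layer.compute` and `layer.communicate` are hypothetical stand-ins for the attention/MoE computation and the collective communication, stream-memory bookkeeping is omitted, and it is written against `torch.cuda` streams for brevity (an Ascend NPU implementation would use the corresponding `torch_npu` stream APIs).

```python
import torch

comp_stream = torch.cuda.Stream()   # stream dedicated to computation ops
comm_stream = torch.cuda.Stream()   # stream dedicated to communication ops

def forward_with_dbo(layer, hidden_states: torch.Tensor) -> torch.Tensor:
    # Split the input batch into two microbatches along the token dimension.
    mb0, mb1 = torch.chunk(hidden_states, 2, dim=0)
    outputs = [None, None]

    for i, mb in enumerate((mb0, mb1)):
        with torch.cuda.stream(comp_stream):
            # attention / MoE expert computation for this microbatch
            local = layer.compute(mb)          # hypothetical helper
        # The communication stream waits only for this microbatch's compute,
        # so its collective can overlap with the other microbatch's compute.
        comm_stream.wait_stream(comp_stream)
        with torch.cuda.stream(comm_stream):
            # e.g. all-reduce (or dispatch/combine) for the MoE layer
            outputs[i] = layer.communicate(local)  # hypothetical helper

    # Re-join the default stream before returning.
    torch.cuda.current_stream().wait_stream(comm_stream)
    return torch.cat(outputs, dim=0)
```

Keeping all computation on one stream and all communication on the other is what lets the second microbatch's compute launch while the first microbatch's collective is still in flight.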
Note that this PR is still in progress; benchmark performance numbers will be updated soon.
ref: overlap for llama
ref: dbo in sglang
Does this PR introduce any user-facing change?
Yes. This PR adds an environment variable, "VLLM_ENABLE_DBO". Users can enable dbo by setting "VLLM_ENABLE_DBO=1".
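For illustration, a minimal sketch of how such a switch could be read and used to gate the microbatch split; the parsing location and the helper below are hypothetical, not the PR's actual code.

```python
import os
import torch

# Hedged sketch: read a VLLM_ENABLE_DBO-style switch from the environment.
VLLM_ENABLE_DBO: bool = os.environ.get("VLLM_ENABLE_DBO", "0") == "1"

def maybe_split_microbatches(hidden_states: torch.Tensor) -> list[torch.Tensor]:
    # Keep a single batch when dbo is disabled or the batch is too small to split.
    if not VLLM_ENABLE_DBO or hidden_states.shape[0] < 2:
        return [hidden_states]
    # Otherwise split into two microbatches for the dual-stream path.
    return list(torch.chunk(hidden_states, 2, dim=0))
```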
How was this patch tested?
This patch can be tested with vllm 0.8.5 by running its online serving together with benchmark tests. We have decoupled the dbo functionality from vLLM, so it should run without any modification to vLLM's code (though some of the changes would be better implemented inside vLLM).
Any advice/discussion is welcome.
Performance Benchmark
We ran vLLM's benchmark_serving script to measure performance with dual-batch overlap enabled.
    python -m vllm.entrypoints.openai.api_server \
        --model=DeepSeek-R1-W8A8 \
        --trust-remote-code \
        --distributed-executor-backend=mp \
        -tp=16 \
        --port 8006 \
        --max-num-seqs 390 \
        --max-model-len 32768 \
        --max-num-batched-tokens 65536 \
        --block-size 128 \
        --compilation_config 0 \
        --gpu-memory-utilization 0.90 \
        --disable-log-requests \
        --additional-config '{"expert_tensor_parallel_size":1,"enable_inter_dp_scheduling":true,"init_torchair_graph_batch_sizes":true,"trace_recompiles":true,"ascend_scheduler_config":{},"enable_graph_mode":false}'
and ran the benchmark with the following parameters:
    --dataset-name random --random-input-len 4096 --random-output-len 1 --num-prompts 200 --max-concurrency 8 --request-rate 5 --metric-percentiles 90
prefill qps: 2.17 -> 2.60
prefill qps: 0.90 -> 1.01
Mean TTFT: 8226 ms -> 7432 ms