Purpose
Support benchmark trimming with a user-defined time interval to obtain accurate decode-related metrics.
vLLM's current benchmark does not support trimming, which is essential for obtaining precise decode-related metrics. At the beginning of a benchmark run, the inference server cannot fully utilize its decode capacity because prefill prevents it from processing large batches immediately, so it takes some time for the server to reach a fully loaded decoding state. Similarly, at the end of the run, server utilization drops as the remaining requests finish gradually. The vLLM benchmark currently cannot distinguish this low-load data from the stable, high-load data, which skews the resulting decode metrics.
We have added `--warmup-time` and `--cooldown-time` options to configure the effective time interval for measurement. The effective time interval is calculated by the following formula:

(Effective time interval) = (Benchmark end time (E) - Cooldown time (c)) - (Benchmark start time (S) + Warmup time (w))
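As an illustration of the trimming idea (a minimal sketch, not the actual benchmark implementation; the names below such as `RequestMetrics`, `effective_window`, and `trimmed_decode_throughput` are hypothetical), decode metrics can be recomputed using only the requests that complete inside the effective window `[S + w, E - c]`:

```python
from dataclasses import dataclass


@dataclass
class RequestMetrics:
    # Illustrative per-request record; field names are hypothetical.
    send_time: float      # wall-clock time the request was sent
    finish_time: float    # wall-clock time the last output token arrived
    output_tokens: int    # number of generated (decode) tokens


def effective_window(start: float, end: float,
                     warmup_time: float, cooldown_time: float) -> tuple[float, float]:
    """Return (lower, upper) of the effective interval: (E - c) - (S + w)."""
    return start + warmup_time, end - cooldown_time


def trimmed_decode_throughput(requests: list[RequestMetrics],
                              start: float, end: float,
                              warmup_time: float, cooldown_time: float) -> float:
    """Decode throughput (tokens/s) over requests that finish inside the window."""
    lower, upper = effective_window(start, end, warmup_time, cooldown_time)
    if upper <= lower:
        raise ValueError("warmup + cooldown exceed the benchmark duration")
    kept = [r for r in requests if lower <= r.finish_time <= upper]
    total_tokens = sum(r.output_tokens for r in kept)
    return total_tokens / (upper - lower)
```

With something like `--warmup-time 60 --cooldown-time 60`, the first and last 60 seconds of the run would be excluded from the trimmed metrics.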
Test Plan
Test Result
The trimmed benchmark result is appended at the bottom of the benchmark results, so we can read accurate decode-related metrics for the stable, fully loaded interval.