[Feature] Support NVFP4 MoE #6003
Conversation
Thanks for your contribution!
Codecov Report
❌ Patch coverage is …
Additional details and impacted files:

```
@@            Coverage Diff             @@
##           develop    #6003   +/-   ##
==========================================
  Coverage         ?   67.28%
==========================================
  Files            ?      355
  Lines            ?    46098
  Branches         ?     7111
==========================================
  Hits             ?    31015
  Misses           ?    12829
  Partials         ?     2254
```
/re-run all-failed
/re-run base_tests
/re-run all-failed
Removed the logic for generating random padding IDs.
/re-run all-failed
/re-run all-failed
Motivation
This PR supports ModelOpt-format NVFP4 inference (currently only Qwen/Ernie) by introducing Flashinfer as a backend. It requires a GPU with SM >= 100 and a Flashinfer installation.
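As a quick sanity check (a minimal sketch, not part of this PR), the prerequisites above can be verified from the shell. The `compute_cap` query field is assumed to be available in the installed nvidia-smi:

```bash
# Minimal sketch (not part of this PR): verify the prerequisites described above.
# Assumes a driver whose nvidia-smi supports the compute_cap query field.
nvidia-smi --query-gpu=compute_cap --format=csv,noheader   # expect 10.0 or higher (SM >= 100)
python -c "import flashinfer"                              # fails if Flashinfer is not installed
```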
Modifications
Using Paddle-compatible APIs, this PR introduces Flashinfer as a backend. There may be coexistence issues with some third-party PyTorch-based code (e.g. xgrammar, triton); currently they cannot be used at the same time, and we are working on resolving this.
Usage or Command
New Environment Variables
- `FD_FLASHINFER_MOE_BACKEND`: FP4 MoE backend; can be `flashinfer-cutlass`, `flashinfer-trtllm`, or `None` (default is `None`, which falls back to `flashinfer-cutlass`). Currently only `flashinfer-cutlass` is supported.
- `FD_NVFP4_GEMM_BACKEND`: FP4 dense GEMM backend; can be `flashinfer-cutlass`, `flashinfer-trtllm`, `flashinfer-cudnn`, or `None` (default is `None`, which falls back to `flashinfer-cutlass`). Currently only `flashinfer-cutlass` is supported.
- `PADDLE_COMPATIBLE_API`: Environment variable for using Flashinfer with Paddle; set it to `true` to use the Paddle-compatible API. Default is `false`.
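For example, the backends described above could be enabled before launching the server like this (a minimal sketch; the values shown are the ones currently supported):

```bash
# Minimal sketch: select the Flashinfer NVFP4 backends described above.
export FD_FLASHINFER_MOE_BACKEND=flashinfer-cutlass   # currently the only supported MoE backend
export FD_NVFP4_GEMM_BACKEND=flashinfer-cutlass       # currently the only supported dense GEMM backend
export PADDLE_COMPATIBLE_API=true                     # use the Paddle-compatible Flashinfer API
```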
Start the Server

```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model nv-community/Qwen3-30B-A3B-FP4 \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --cache-queue-port 8183 \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --max-num-seqs 128
```
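Once the server is up, it exposes an OpenAI-compatible API on the configured port; a request might look like the following (a minimal sketch; the payload fields are illustrative):

```bash
# Minimal sketch: query the OpenAI-compatible endpoint started above (payload is illustrative).
curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nv-community/Qwen3-30B-A3B-FP4",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```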
Performance Benchmark Command
Reference: https://github.com/PaddlePaddle/FastDeploy/tree/develop/benchmarks
Accuracy Tests
Checklist
- [ ] Add at least one tag in the PR title, chosen from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax].
- [ ] Run `pre-commit` before commit.
- [ ] If the PR targets the `release` branch, make sure it has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.