
Conversation


@Echo-Nie Echo-Nie commented Jan 12, 2026

Motivation

This PR supports ModelOpt-format NVFP4 inference (currently Qwen/Ernie only) by introducing FlashInfer as a backend. It requires a GPU with sm >= 100 and a FlashInfer installation.

Modifications

Using a Paddle-compatible API, this PR introduces FlashInfer as a backend. There may be coexistence issues with some third-party PyTorch-based components (e.g. xgrammar, Triton); currently they cannot be used at the same time, and we are working on resolving this.

Usage or Command

New Environment Variables

  • FD_FLASHINFER_MOE_BACKEND: FP4 MoE backend; one of flashinfer-cutlass, flashinfer-trtllm, or None (default is None, which falls back to flashinfer-cutlass). Currently only flashinfer-cutlass is supported.

  • FD_NVFP4_GEMM_BACKEND: FP4 dense GEMM backend; one of flashinfer-cutlass, flashinfer-trtllm, flashinfer-cudnn, or None (default is None, which falls back to flashinfer-cutlass). Currently only flashinfer-cutlass is supported.

  • PADDLE_COMPATIBLE_API: Environment variable for using FlashInfer with Paddle; set it to true to use the Paddle-compatible API (default is false).
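
The variables above can be set before launching the server. A minimal sketch using the currently supported values (the validation `case` is illustrative, not part of FastDeploy):

```shell
# Select the currently supported backends described in this PR.
export FD_FLASHINFER_MOE_BACKEND=flashinfer-cutlass
export FD_NVFP4_GEMM_BACKEND=flashinfer-cutlass
# Required for FlashInfer to use the Paddle-compatible API.
export PADDLE_COMPATIBLE_API=true

# Illustrative sanity check: fail fast on a value this PR does not support yet.
case "$FD_FLASHINFER_MOE_BACKEND" in
  flashinfer-cutlass|"") echo "MoE backend OK: ${FD_FLASHINFER_MOE_BACKEND:-default}" ;;
  *) echo "Unsupported MoE backend: $FD_FLASHINFER_MOE_BACKEND" >&2; exit 1 ;;
esac
```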

Start the Server

python -m fastdeploy.entrypoints.openai.api_server \
    --model nv-community/Qwen3-30B-A3B-FP4 \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --cache-queue-port 8183 \
    --tensor-parallel-size 1 \
    --max-model-len  32768 \
    --max-num-seqs 128
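
The server exposes an OpenAI-compatible API, so once it is up a request can be sent to the chat completions endpoint. A dry-run sketch (the prompt and max_tokens are illustrative; remove the leading `echo` to actually send the request):

```shell
# Dry run: prints the curl command instead of sending it.
PORT=8180
PAYLOAD='{"model": "nv-community/Qwen3-30B-A3B-FP4", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'
echo curl -s "http://localhost:${PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD"
```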

Performance Benchmark Command

Reference: https://github.com/PaddlePaddle/FastDeploy/tree/develop/benchmarks

python benchmark_serving.py \
  --backend openai-chat \
  --model nv-community/Qwen3-30B-A3B-FP4 \
  --endpoint /v1/chat/completions \
  --host 0.0.0.0 \
  --port 8180 \
  --dataset-name EBChat \
  --dataset-path ./data/filtered_sharedgpt_2000_input_1136_output_200_fd.json \
  --hyperparameter-path ./yaml/request_yaml/qwen25-vl-32kyaml \
  --percentile-metrics ttft,tpot,itl,e2el,s_ttft,s_itl,s_e2el,s_decode,input_len,s_input_len,output_len \
  --metric-percentiles 80,95,99,99.9,99.95,99.99 \
  --num-prompts 1000 \
  --max-concurrency 64 \
  --save-result

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests; if none are added, explain why in this PR.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.


paddle-bot bot commented Jan 12, 2026

Thanks for your contribution!


codecov-commenter commented Jan 12, 2026

Codecov Report

❌ Patch coverage is 15.51155% with 256 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@fe5ba4b). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...deploy/model_executor/layers/quantization/nvfp4.py | 12.79% | 224 Missing and 1 partial ⚠️ |
| fastdeploy/model_executor/layers/moe/moe.py | 20.00% | 19 Missing and 1 partial ⚠️ |
| fastdeploy/flashinfer.py | 63.63% | 2 Missing and 2 partials ⚠️ |
| ...loy/model_executor/layers/quantization/__init__.py | 20.00% | 4 Missing ⚠️ |
| fastdeploy/model_executor/utils.py | 25.00% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6003   +/-   ##
==========================================
  Coverage           ?   67.28%           
==========================================
  Files              ?      355           
  Lines              ?    46098           
  Branches           ?     7111           
==========================================
  Hits               ?    31015           
  Misses             ?    12829           
  Partials           ?     2254           
| Flag | Coverage Δ |
|---|---|
| GPU | 67.27% <15.51%> (?) |


@Echo-Nie
Contributor Author

/re-run all-failed

@Echo-Nie
Contributor Author

/re-run base_tests

@Echo-Nie
Contributor Author

/re-run all-failed

Removed the logic for generating random padding IDs.
@Echo-Nie
Contributor Author

/re-run all-failed

@Echo-Nie
Contributor Author

/re-run all-failed

