[Feature] Support NVFP4 MoE #6003
Conversation
Thanks for your contribution!
Codecov Report
❌ Patch coverage is …
Additional details and impacted files:

```
@@            Coverage Diff             @@
##           develop    #6003   +/-   ##
==========================================
  Coverage         ?   67.28%
==========================================
  Files            ?      355
  Lines            ?    46098
  Branches         ?     7111
==========================================
  Hits             ?    31015
  Misses           ?    12829
  Partials         ?     2254
```
/re-run all-failed
/re-run base_tests
/re-run all-failed
Removed the logic for generating random padding IDs.
/re-run all-failed
/re-run all-failed
Motivation
This PR supports ModelOpt-format NVFP4 inference (currently only Qwen/Ernie) by introducing Flashinfer as a backend. It requires a GPU with SM >= 100 and a Flashinfer installation.
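As a quick sanity check (a minimal sketch, not part of this PR), the prerequisites above can be verified from the shell. The `compute_cap` query field is assumed to be available in the installed nvidia-smi:

```bash
# Minimal sketch (not part of this PR): verify the prerequisites described above.
# Assumes a driver whose nvidia-smi supports the compute_cap query field.
nvidia-smi --query-gpu=compute_cap --format=csv,noheader   # expect 10.0 or higher (SM >= 100)
python -c "import flashinfer"                              # fails if Flashinfer is not installed
```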
Modifications
Using Paddle-compatible APIs, this PR introduces Flashinfer as a backend. There may be coexistence issues with some third-party PyTorch-based code (e.g. xgrammar, triton); currently they cannot be used at the same time, and we are working on resolving this.
Usage or Command
New Environment Variables
- `FD_FLASHINFER_MOE_BACKEND`: FP4 MoE backend; can be `flashinfer-cutlass`, `flashinfer-trtllm`, or `None` (default is `None`, which falls back to `flashinfer-cutlass`). Currently only `flashinfer-cutlass` is supported.
- `FD_NVFP4_GEMM_BACKEND`: FP4 dense GEMM backend; can be `flashinfer-cutlass`, `flashinfer-trtllm`, `flashinfer-cudnn`, or `None` (default is `None`, which falls back to `flashinfer-cutlass`). Currently only `flashinfer-cutlass` is supported.
- `PADDLE_COMPATIBLE_API`: Environment variable for using Flashinfer with Paddle; set it to `true` to use the Paddle-compatible API. Default is `false`.
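For example, the backends described above could be enabled before launching the server like this (a minimal sketch; the values shown are the ones currently supported):

```bash
# Minimal sketch: select the Flashinfer NVFP4 backends described above.
export FD_FLASHINFER_MOE_BACKEND=flashinfer-cutlass   # currently the only supported MoE backend
export FD_NVFP4_GEMM_BACKEND=flashinfer-cutlass       # currently the only supported dense GEMM backend
export PADDLE_COMPATIBLE_API=true                     # use the Paddle-compatible Flashinfer API
```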
Start the Server

```bash
python -m fastdeploy.entrypoints.openai.api_server \
    --model nv-community/Qwen3-30B-A3B-FP4 \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --cache-queue-port 8183 \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --max-num-seqs 128
```
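Once the server is up, it exposes an OpenAI-compatible API on the configured port; a request might look like the following (a minimal sketch; the payload fields are illustrative):

```bash
# Minimal sketch: query the OpenAI-compatible endpoint started above (payload is illustrative).
curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nv-community/Qwen3-30B-A3B-FP4",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```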
Performance Benchmark Command
Reference: https://github.com/PaddlePaddle/FastDeploy/tree/develop/benchmarks
Accuracy Tests
Checklist
- [ ] Add at least one tag in the PR title, chosen from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax].
- [ ] Run `pre-commit` before commit.
- [ ] If the PR targets the `release` branch, make sure it has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.