[Feature] Support NVFP4 MoE #6003
Open
Echo-Nie wants to merge 39 commits into PaddlePaddle:develop from Echo-Nie:fp4_moe
+961 −4
Changes from all commits (39 commits):
5250085  fp4 dense (zoooo0820)
b0c863a  [WIP] support nvfp4, dense part (zoooo0820)
d5f3fd2  [wip] developing loading qwen model (zoooo0820)
1176cae  loading (bukejiyu)
7137054  update (bukejiyu)
0594090  dense fp4 OK, cudagraph error (zoooo0820)
ae80853  [WIP] moe forward part (zoooo0820)
6b2ebd6  with flashinfer-backend (zoooo0820)
0b28b4b  qwen3_moe_fp4 (bukejiyu)
2d2bd06  update (bukejiyu)
c329d92  support flashinfer-cutlass moe, qwen3-moe-fp4 OK (zoooo0820)
eb089b3  support ernie4.5-fp4 (zoooo0820)
1931732  solve confilict (zoooo0820)
03aa695  fix load error (zoooo0820)
5233398  add some ut (zoooo0820)
748e812  add docs (zoooo0820)
3d38d73  Merge branch 'develop' into support_fp4_moe (zoooo0820)
be11fc3  fix CLA, test (Echo-Nie)
e071d51  Merge remote-tracking branch 'zoooo/support_fp4_moe' into fp4_moe (Echo-Nie)
509fc33  fix the apply() in ModelOptNvFp4FusedMoE (Echo-Nie)
798cb6b  fix CodeStyle (Echo-Nie)
17d0740  Merge branch 'develop' into fp4_moe (Echo-Nie)
ca2a699  del the PADDLE_COMPATIBLE_API (Echo-Nie)
359b6b6  Merge branch 'develop' into fp4_moe (Echo-Nie)
14fc296  fix broken url: nvidia_gpu.md (Echo-Nie)
a25fea0  fix docs (Echo-Nie)
d93cdb5  Merge branch 'develop' into fp4_moe (Echo-Nie)
88c8347  Merge branch 'develop' into fp4_moe (Echo-Nie)
14bbd6b  Merge branch 'develop' into fp4_moe (Echo-Nie)
d7426fd  Merge branch 'develop' into fp4_moe (Echo-Nie)
b3e600d  fix token_ids (Echo-Nie)
d9f8a74  Merge branch 'develop' into fp4_moe (Echo-Nie)
ee8f622  fix CI in Hopper (Echo-Nie)
4057e1e  move flashinfer imports inside the function (Echo-Nie)
0faf0c2  Merge branch 'PaddlePaddle:develop' into fp4_moe (Echo-Nie)
f9ec344  fix model_runner (Echo-Nie)
ce6d40f  Merge branch 'develop' into fp4_moe (Echo-Nie)
fb71cca  Remove skip condition for CUDA version in nvfp4 test (Echo-Nie)
a2fa9ff  Merge branch 'develop' into fp4_moe (Echo-Nie)
@@ -0,0 +1,66 @@
# NVFP4 Quantization
NVFP4 is an innovative 4-bit floating-point format introduced by NVIDIA. For detailed information, please refer to [Introducing NVFP4 for Efficient and Accurate Low-Precision Inference](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/).
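As a rough intuition for the format, here is a minimal, self-contained sketch, not FastDeploy's implementation: each value is stored as a 4-bit E2M1 number, and each small block of elements shares a higher-precision scale. Per the NVIDIA blog above, real NVFP4 uses 16-element blocks with FP8 (E4M3) block scales; the block size is kept below, but the scale is held in full precision for simplicity.

```python
# Minimal NVFP4-style round-trip sketch (illustration only, not FastDeploy
# or FlashInfer code; the real format stores block scales in FP8 E4M3).
import numpy as np

# Magnitudes representable by the 4-bit E2M1 format (sign is a separate bit).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_roundtrip(block: np.ndarray) -> np.ndarray:
    """Quantize one 16-element block to E2M1 plus one scale, then dequantize."""
    scale = max(float(np.abs(block).max()) / E2M1[-1], 1e-12)
    scaled = block / scale  # now within [-6, 6], the E2M1 range
    # Snap each element to the nearest representable E2M1 magnitude.
    nearest = E2M1[np.abs(np.abs(scaled)[:, None] - E2M1[None, :]).argmin(axis=1)]
    return np.sign(scaled) * nearest * scale

block = np.random.randn(16).astype(np.float32)
print("max abs error:", np.abs(block - nvfp4_roundtrip(block)).max())
```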
Based on [FlashInfer](https://github.com/flashinfer-ai/flashinfer), FastDeploy supports inference of NVFP4-quantized models in the format produced by [ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer).

- Note: currently, this feature only supports FP4-quantized models from the Ernie and Qwen series.

## How to Use
### Environment Setup
#### Supported Environment
- **Supported Hardware**: GPU with SM >= 100 (a quick check is shown below)
- **PaddlePaddle Version**: 3.3.0 or higher
- **FastDeploy Version**: 2.5.0 or higher
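One quick way to confirm the SM >= 100 requirement, assuming a driver recent enough to support the `compute_cap` query, is:

```bash
# Prints each GPU's compute capability; NVFP4 requires 10.0 or higher.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```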
#### FastDeploy Installation
Please ensure that FastDeploy is installed with NVIDIA GPU support.
Follow the official guide to set up the base environment: [FastDeploy NVIDIA GPU Environment Installation Guide](https://paddlepaddle.github.io/FastDeploy/get_started/installation/nvidia_gpu/).
### Running the Inference Service
```bash
python -m fastdeploy.entrypoints.openai.api_server \
  --model nv-community/Qwen3-30B-A3B-FP4 \
  --port 8180 \
  --metrics-port 8181 \
  --engine-worker-queue-port 8182 \
  --cache-queue-port 8183 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --max-num-seqs 128
```
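Before sending requests, you can optionally confirm the server is up. The route below assumes the deployment exposes the standard OpenAI model-listing endpoint; adjust if yours differs.

```bash
# Optional smoke test: list the models served on the configured port.
curl http://0.0.0.0:8180/v1/models
```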
### API Access
Make service requests with the following command:

```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "把李白的静夜思改写为现代诗"}
    ]
  }'
```
The FastDeploy service interface is compatible with the OpenAI protocol, so you can also make requests with the following Python code.

```python
import openai

host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
        {"role": "user", "content": "把李白的静夜思改写为现代诗"},
    ],
    stream=True,
)
for chunk in response:
    # Guard on the delta's content: a chunk may carry an empty delta,
    # and printing a missing field would emit "None".
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
print("\n")
```
@@ -0,0 +1,66 @@
[English](../../quantization/nvfp4.md)

# NVFP4 Quantization
NVFP4 is an innovative 4-bit floating-point format introduced by NVIDIA. For detailed information, please refer to [Introducing NVFP4 for Efficient and Accurate Low-Precision Inference](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/).

Based on [FlashInfer](https://github.com/flashinfer-ai/flashinfer), FastDeploy supports inference of NVFP4-quantized models in the format produced by [ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer).

- Note: currently, this feature only supports FP4-quantized models from the Ernie and Qwen series.

## How to Use
### Environment Setup
#### Supported Environment
- **Supported Hardware**: GPU with SM >= 100
- **PaddlePaddle Version**: 3.3.0 or higher
- **FastDeploy Version**: 2.5.0 or higher

#### FastDeploy Installation
FastDeploy must be installed with NVIDIA GPU support; for details, see the official guide: [FastDeploy NVIDIA GPU Environment Installation Guide](https://paddlepaddle.github.io/FastDeploy/zh/get_started/installation/nvidia_gpu/).

### Running the Inference Service
```bash
python -m fastdeploy.entrypoints.openai.api_server \
  --model nv-community/Qwen3-30B-A3B-FP4 \
  --port 8180 \
  --metrics-port 8181 \
  --engine-worker-queue-port 8182 \
  --cache-queue-port 8183 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --max-num-seqs 128
```

### API Access
Make service requests with the following command:

```shell
curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "把李白的静夜思改写为现代诗"}
    ]
  }'
```

The FastDeploy service interface is compatible with the OpenAI protocol, so you can also make requests with the following Python code.

```python
import openai

host = "0.0.0.0"
port = "8180"
client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")

response = client.chat.completions.create(
    model="null",
    messages=[
        {"role": "system", "content": "I'm a helpful AI assistant."},
        {"role": "user", "content": "把李白的静夜思改写为现代诗"},
    ],
    stream=True,
)
for chunk in response:
    # Guard on the delta's content: a chunk may carry an empty delta,
    # and printing a missing field would emit "None".
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
print("\n")
```
@@ -0,0 +1,33 @@
| """ | ||
| # Copyright (c) 2026 PaddlePaddle Authors. All Rights Reserved. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| """ | ||
|
|
||
| import functools | ||
| import importlib | ||
| import importlib.util | ||
| import shutil | ||
|
|
||
|
|
||
| @functools.cache | ||
| def has_flashinfer() -> bool: | ||
| """Return `True` if FlashInfer is available.""" | ||
| # Use find_spec to check if the module exists without importing it | ||
| # This avoids potential CUDA initialization side effects | ||
| if importlib.util.find_spec("flashinfer") is None: | ||
| return False | ||
| # Also check if nvcc is available since it's required to JIT compile flashinfer | ||
| if shutil.which("nvcc") is None: | ||
| return False | ||
| return True | ||
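For illustration, a caller might gate the optional FlashInfer-backed path on this helper. The import path and backend names below are hypothetical, not APIs confirmed by this PR:

```python
# Hypothetical usage sketch; the import path and backend names are assumptions.
from fastdeploy.utils import has_flashinfer

def select_moe_backend() -> str:
    # Prefer the FlashInfer CUTLASS path when FlashInfer (and the nvcc it
    # needs for JIT compilation) is available; otherwise fall back.
    return "flashinfer-cutlass" if has_flashinfer() else "default"
```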