Skip to content

Fix GatherND division by zero when batch dimensions mismatch#27090

Open
chaya2350 wants to merge 16 commits intomicrosoft:mainfrom
chaya2350:fix-gathernd-division-by-zero
Open

Fix GatherND division by zero when batch dimensions mismatch#27090
chaya2350 wants to merge 16 commits intomicrosoft:mainfrom
chaya2350:fix-gathernd-division-by-zero

Conversation

@chaya2350
Copy link

@chaya2350 chaya2350 commented Jan 21, 2026

Fixes #23828

Added validation to check:

  • num_batches is not zero
  • num_slices is divisible by num_batches

Before this fix, mismatched batch dimensions caused a crash due to division by zero.

Description

This PR fixes a division by zero crash in the GatherND operator when batch dimensions mismatch between input and indices tensors.

Changes made:

Added validation in gather_nd.cc to check that num_batches is not zero before division
Added validation that num_slices is divisible by num_batches
Added a unit test to verify the fix

Motivation and Context

Description
Fixes #23828

When batch_dims is set but the actual batch dimensions of the input tensor and indices tensor don't align correctly, the code performs a division that can result in division by zero, causing a crash.

For example, with:

Input shape: [2, 2, 2]
Indices shape: [2, 1]
batch_dims=1
The calculation num_slices / num_batches would crash if num_batches is 0, or produce unexpected results if they don't divide evenly.

This fix returns a clear error message instead of crashing.

Fixes microsoft#23828

Added validation to check:
- num_batches is not zero
- num_slices is divisible by num_batches

Before this fix, mismatched batch dimensions caused a crash due to
division by zero.
@chaya2350
Copy link
Author

@fdwr @justinchuby Could you please review this PR? It fixes the issue you reported. Thanks! 🙂

@justinchuby
Copy link
Contributor

@tianleiwu

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a division by zero crash in the GatherND operator when batch dimensions between input and indices tensors are mismatched. The fix adds validation to ensure num_batches is non-zero and that num_slices is evenly divisible by num_batches before performing the division.

Changes:

  • Added validation checks in gather_nd.cc to prevent division by zero
  • Added a unit test to verify the fix handles batch dimension mismatches correctly

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
onnxruntime/core/providers/cpu/tensor/gather_nd.cc Added validation to check num_batches is not zero and num_slices is divisible by num_batches before division
onnxruntime/test/providers/cpu/tensor/gather_nd_op_test.cc Added test case to verify batch dimension mismatch returns an error instead of crashing

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

justinchuby
justinchuby previously approved these changes Jan 21, 2026
@justinchuby justinchuby requested review from fdwr and tianleiwu January 21, 2026 17:08
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@chaya2350
Copy link
Author

I've addressed the Copilot comment. Ready for re-review! 🙂

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

chaya2350 and others added 2 commits January 22, 2026 09:50
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Address Copilot review: added test case to cover the num_batches == 0 validation path.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

justinchuby
justinchuby previously approved these changes Jan 22, 2026
@justinchuby justinchuby enabled auto-merge (squash) January 22, 2026 16:09
fdwr
fdwr previously approved these changes Jan 23, 2026
Copy link
Contributor

@fdwr fdwr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Thanks :).

auto-merge was automatically disabled January 26, 2026 11:18

Head branch was pushed to by a user without write access

auto-merge was automatically disabled January 28, 2026 09:56

Head branch was pushed to by a user without write access

@justinchuby justinchuby enabled auto-merge (squash) January 28, 2026 17:47
justinchuby
justinchuby previously approved these changes Jan 28, 2026
auto-merge was automatically disabled January 29, 2026 10:20

Head branch was pushed to by a user without write access

@chaya2350 chaya2350 closed this Feb 5, 2026
@chaya2350 chaya2350 deleted the fix-gathernd-division-by-zero branch February 5, 2026 14:44
@justinchuby
Copy link
Contributor

Why was this closed?

@fdwr
Copy link
Contributor

fdwr commented Feb 6, 2026

@justinchuby - Weird, the branch was deleted too.

@chaya2350 chaya2350 restored the fix-gathernd-division-by-zero branch February 8, 2026 07:23
@chaya2350
Copy link
Author

chaya2350 commented Feb 8, 2026

This was a mistake...
I got this one confused with another PR
Apologies!

@chaya2350 chaya2350 reopened this Feb 8, 2026
@justinchuby
Copy link
Contributor

@chaya2350 could you run lintrunner f to format the code? Thanks

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@tianleiwu
Copy link
Contributor

@chaya2350, please format the code with lintrunner -a. The "Lint / Python format" check is required to pass for pull request.
See https://github.com/microsoft/onnxruntime/blob/main/docs/Coding_Conventions_and_Standards.md#linting

LGTM. The changes are safe and improve the robustness of the GatherND operator on CPU.

@justinchuby
Copy link
Contributor

Looks like there are some errors with CUDA. @tianleiwu do you have suggestions?

@chaya2350
Copy link
Author

Hi,
The failing test is intended to run only on the CPU execution provider.
As defined in the test code, I explicitly excluded all other EPs (including CUDA) using the exclude_providers argument in OpTester::Run.
However, according to the test log, the test is still being executed with CUDAExecutionProvider:

registered execution providers: CUDAExecutionProvider
This suggests that the exclusion is not being respected in the CI environment, or that the test runner is forcing all tests to run on all available EPs.
Could you please advise how to ensure this test is executed only on CPU as intended?
If there is a CI configuration or test runner setting that needs to be updated, I would appreciate your guidance.
Thank you!

@justinchuby
Copy link
Contributor

Based on the team's suggestions, could you take a look at

and ensure the error strings match with the cpu provider?

@tianleiwu
Copy link
Contributor

tianleiwu commented Feb 14, 2026

@chaya2350, test can explicitly run on CPU provider only like

    std::vector<std::unique_ptr<IExecutionProvider>> execution_providers;
    execution_providers.push_back(DefaultCpuExecutionProvider());
    test.Run(OpTester::ExpectResult::kExpectFailure,
           "GatherND: input tensor batch dimensions cannot be zero", {}, nullptr, &execution_providers);

see

void Run(ExpectResult expect_result = ExpectResult::kExpectSuccess, const std::string& expected_failure_string = "",
const std::unordered_set<std::string>& excluded_provider_types = {},
const RunOptions* run_options = nullptr,
std::vector<std::unique_ptr<IExecutionProvider>>* execution_providers = nullptr,
ExecutionMode execution_mode = ExecutionMode::ORT_SEQUENTIAL,
const Graph::ResolveOptions& resolve_options = {});

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CPU EP] GatherND crashes with division by zero when batch dimensions mismatch between input and indices

4 participants