Skip to content

Python: Fix IndexError when reasoning models produce reasoning-only messages in Magentic-One workflow#4413

Merged
moonbox3 merged 4 commits intomicrosoft:mainfrom
moonbox3:agent/fix-4384-1
Mar 3, 2026
Merged

Python: Fix IndexError when reasoning models produce reasoning-only messages in Magentic-One workflow#4413
moonbox3 merged 4 commits intomicrosoft:mainfrom
moonbox3:agent/fix-4384-1

Conversation

@moonbox3
Copy link
Contributor

@moonbox3 moonbox3 commented Mar 3, 2026

Motivation and Context

When reasoning models (e.g. gpt-5-mini) are used inside Magentic-One workflows, they can produce messages containing only text_reasoning content without any preceding text content. This caused an IndexError because the code assumed all_messages was non-empty when attaching reasoning_details.

Fixes #4384

Description

The root cause was in _prepare_message_for_openai in _chat_client.py: the text_reasoning case unconditionally accessed all_messages[-1] to attach reasoning details, but when reasoning content appeared before any text content, the list was empty. The fix initializes the message entry in all_messages if it hasn't been created yet before attaching reasoning details. Two regression tests are added covering reasoning-only messages and reasoning-before-text ordering.

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

Note: PR autogenerated by moonbox3's agent

…t#4384)

In _prepare_message_for_openai(), the text_reasoning case unconditionally
accessed all_messages[-1] to attach reasoning_details. When a reasoning
model (e.g. gpt-5-mini) returns reasoning_details without text content,
all_messages is empty, causing an IndexError.

Guard the access by initializing all_messages with the current args dict
when it is empty, so reasoning_details can be safely attached.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 3, 2026 08:47
Copy link
Contributor Author

@moonbox3 moonbox3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review

Reviewers: 3 | Confidence: 64%

✓ Correctness

This fix addresses an IndexError when text_reasoning content appears before any other content (so all_messages is still empty). The fix appends args to all_messages when the list is empty, preventing the crash on all_messages[-1]. The tests adequately cover the regression cases. However, there is a potential correctness concern: if text_reasoning appears before text content, args is appended to all_messages by the new guard, and if later content-processing logic also appends args to all_messages (at end-of-loop or in another case branch), the same dict reference would appear twice in the output list. The test for the mixed case (test_prepare_message_with_text_reasoning_before_text) uses a very loose assertion (len >= 1) which could be masking this double-append issue.

✓ Security Reliability

This diff fixes an IndexError when all_messages is empty and a text_reasoning content is processed. The fix appends args to all_messages when the list is empty, preventing the crash. There are no security issues (no new injection surfaces, no secrets, no unsafe deserialization beyond pre-existing json.loads). The main reliability concern is that args could potentially be appended twice—once in the text_reasoning guard and once in a subsequent content handler—leading to a duplicate entry in the output. The test for mixed ordering uses a loose len(prepared) >= 1 assertion that could mask such duplication bugs.

✓ Test Coverage

The two new tests directly cover the regression fix (empty all_messages when text_reasoning is the first content item). The first test (test_prepare_message_with_only_text_reasoning_content) is well-structured with strong assertions. The second test (test_prepare_message_with_text_reasoning_before_text) uses weak assertions (len >= 1, any(...)) that would pass even if the text content were silently dropped, reducing its value as a regression guard.

Suggestions

  • In test_prepare_message_with_text_reasoning_before_text, the assertion assert len(prepared) >= 1 is unusually loose. Consider asserting the exact expected length (e.g., == 1 or == 2) to verify the fix doesn't produce duplicate messages when reasoning precedes text content. A loose assertion could mask a double-append bug where args is added to all_messages once by the new guard and again by the normal text-content path.
  • Verify that the rest of _prepare_message_for_openai does not also append args to all_messages after iterating content items (e.g., at the end of the content loop). If it does, the if not all_messages: all_messages.append(args) guard would cause args to appear twice in the result when reasoning content precedes text content.
  • Consider whether args can be appended to all_messages a second time by a subsequent content case handler (e.g., the default _ case for text content), which would produce a duplicate message entry. If so, a guard like if args not in all_messages or tracking an appended flag would be more robust.
  • The test test_prepare_message_with_text_reasoning_before_text uses assert len(prepared) >= 1, which is too loose to catch duplicate-entry bugs. Consider asserting the exact expected length and verifying each element's structure.
  • The second test (test_prepare_message_with_text_reasoning_before_text) should assert the exact expected length and verify the text content ('The answer is 42.') is present in the prepared output, rather than using loose checks like len(prepared) >= 1 and any(...).
  • Consider adding a test for multiple text_reasoning contents without any text content, to verify the fix handles repeated reasoning-only entries correctly.

Automated review by moonbox3's agents

@markwallace-microsoft
Copy link
Member

markwallace-microsoft commented Mar 3, 2026

Python Test Coverage

Python Test Coverage Report •
FileStmtsMissCoverMissing
packages/core/agent_framework/openai
   _chat_client.py2922691%210, 240–241, 245, 363, 370, 446–453, 455–458, 468, 546, 548, 565, 603, 629, 645, 685
TOTAL22264275887% 

Python Unit Test Overview

Tests Skipped Failures Errors Time
4728 247 💤 0 ❌ 0 🔥 1m 21s ⏱️

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes an IndexError in the Python OpenAI chat client message preparation path when reasoning-capable models emit text_reasoning content without preceding text in Magentic-One workflows.

Changes:

  • Guard _prepare_message_for_openai against empty all_messages when attaching reasoning_details.
  • Add regression tests for reasoning-only messages and reasoning-before-text ordering.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
python/packages/core/agent_framework/openai/_chat_client.py Prevents IndexError when text_reasoning appears before any message content.
python/packages/core/tests/openai/test_openai_chat_client.py Adds regression coverage for the previously-crashing scenarios.

…icrosoft#4384)

- Buffer pending reasoning details and attach to the next message with
  content/tool_calls, avoiding standalone reasoning-only messages.
- When reasoning is the only content, emit a message with empty content
  to satisfy Chat Completions schema requirements.
- Strengthen test assertions to verify text+reasoning co-location and
  that all messages with reasoning_details also have content or tool_calls.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Contributor Author

@moonbox3 moonbox3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review

Reviewers: 3 | Confidence: 89%

✓ Correctness

The change correctly fixes a bug where text_reasoning content appearing before any other content in a message would create an invalid Chat Completions payload (a message appended to all_messages without content or tool_calls). The new buffering approach via pending_reasoning properly defers the reasoning details until a real content or tool_calls message is ready, and the fallback for reasoning-only messages emits a valid payload with empty content. The updated tests are more precise and verify the structural correctness of the output. No correctness issues found.

✓ Security Reliability

The diff fixes a reliability bug where reasoning-only content could produce an invalid Chat Completions payload (missing the 'content' field). The changes buffer pending reasoning and attach it to the next substantive message, with a fallback that emits an empty-content message. No new security concerns: json.loads is called on internal framework-owned protected_data (not a user-controlled trust boundary), no secrets are introduced, and no resource leaks are present. The only minor reliability observation is that consecutive text_reasoning entries would silently overwrite the buffer, but this mirrors the prior behavior and is unlikely in practice.

✓ Test Coverage

The updated tests meaningfully strengthen assertions for the reasoning-before-text and reasoning-only scenarios, replacing weak 'any message has reasoning' checks with precise structural validation. However, the diff introduces a code path where pending_reasoning attaches to a message with tool_calls (not just content), and this path has no test coverage. Additionally, there is no test for the edge case where multiple consecutive text_reasoning contents appear (the second silently overwrites the first via pending_reasoning reassignment), nor for the author_name/role handling in the trailing fallback block.

Suggestions

  • If multiple text_reasoning contents appear before the first content/tool_calls item, pending_reasoning is silently overwritten, losing earlier reasoning blocks. Consider accumulating into a list if that scenario is possible.
  • If multiple consecutive text_reasoning contents are possible, consider accumulating pending_reasoning into a list rather than overwriting, to avoid silent data loss. The old code had the same limitation, so this is pre-existing.
  • Add a test for text_reasoning appearing before a tool-call content (e.g., FunctionCallContent). The production code attaches pending_reasoning when 'tool_calls' is in args, but no test exercises this branch.
  • Add a test for multiple consecutive text_reasoning contents before any text/tool content. Currently pending_reasoning is silently overwritten; a test would document whether this is intentional or a data-loss bug.
  • Add a test verifying the fallback message includes 'name' when message.author_name is set and role != 'tool', and omits it when role == 'tool'. The new fallback block at lines 597-600 has this logic but no dedicated coverage.

Automated review by moonbox3's agents

moonbox3 and others added 2 commits March 3, 2026 18:47
…ft#4384)

- Always buffer reasoning into pending_reasoning instead of conditionally
  attaching to the previous message via fragile all_messages emptiness check
- Attach buffered reasoning to last message at end-of-loop when no subsequent
  content consumed it
- Assert exact content values (content == '' not in ('', None))
- Assert exact list lengths (== 1 not >= 1) for stronger regression guards
- Add test for reasoning before FunctionCallContent

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@moonbox3 moonbox3 enabled auto-merge March 3, 2026 09:55
@moonbox3 moonbox3 self-assigned this Mar 3, 2026
@moonbox3 moonbox3 added this pull request to the merge queue Mar 3, 2026
Merged via the queue into microsoft:main with commit dae3caa Mar 3, 2026
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Python: [Bug]: Reasoning models failed inside Magentic-One workflow

5 participants