Ground Truth Error in Agent multi_step Trails #23

@Bokai-Ji

Hi, I noticed some inconsistencies in the ground truth labels and descriptions for agent_multi_step samples.

1. Logical Inconsistency in Message Deletion

In these samples, the user explicitly instructs: "If the message capacity is full, delete the most recent message."

However, the current ground truth performs the following:

  • Calls get_earliest_message_id() instead of get_latest_message_id().
  • Deletes message_id=3 instead of message_id=4 (assuming 4 is the latest).

💡 Suggested Fix: The ground truth should call get_latest_message_id() and delete the message it returns, so that it matches the user's instruction.
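To make the mismatch concrete, here is a minimal sketch of the two lookups side by side. Only get_earliest_message_id() and get_latest_message_id() come from the samples; the MessageStore class, the delete_message() helper, and the arrival-order data are hypothetical stand-ins chosen so that message 3 is the earliest and message 4 is the latest, as in the example above.

```python
from dataclasses import dataclass, field


@dataclass
class MessageStore:
    """Toy stand-in for the benchmark's message environment (hypothetical)."""

    # Maps message_id -> arrival order (larger means more recent). Made-up data.
    arrival: dict[int, int] = field(default_factory=dict)

    def get_earliest_message_id(self) -> int:
        # ID of the message that arrived first.
        return min(self.arrival, key=self.arrival.get)

    def get_latest_message_id(self) -> int:
        # ID of the message that arrived last, i.e. the "most recent" one.
        return max(self.arrival, key=self.arrival.get)

    def delete_message(self, message_id: int) -> None:
        # Hypothetical deletion helper; the real tool name may differ.
        del self.arrival[message_id]


store = MessageStore({1: 20, 2: 30, 3: 10, 4: 40})

# What the current ground truth looks up (the earliest message).
current_target = store.get_earliest_message_id()

# What "delete the most recent message" asks for (the latest message).
expected_target = store.get_latest_message_id()

store.delete_message(expected_target)
```

With this toy data the two lookups pick different messages, which is the discrepancy reported above.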

2. Ambiguity in the todo-list Description

The current description states:

"First, help Frank place a food delivery order at 'Hema Fresh,' ordering two 'Fresh Gift Packs.'"

The phrase "help Frank" is ambiguous. An agent is likely to read it as an instruction to log in to Frank's account, whereas the context implies that the current user (Eve) is placing the order.

💡 Suggested Fix: I suggest removing "help Frank" to avoid confusion regarding account switching. The sentence would be clearer as:

"First, place a food delivery order at 'Hema Fresh,' ordering two 'Fresh Gift Packs.'"

3. Unnatural Dependency leading to "Goal Dropping"

The current environment enforces a constraint under which send_message() fails when the inbox is full. This is counter-intuitive and contradicts the standard logic of email systems, where sending is independent of inbox storage.

The Core Problem: Because this dependency is unnatural, the agent has no prior knowledge from which to infer that cleaning the inbox is a prerequisite for sending. When the agent receives the error, it correctly switches focus to the sub-goal (deleting a message). However, once the deletion succeeds, the agent fails to link it back to the original goal (sending). It treats the task as "handle the storage issue" and ends the conversation without resending, so the sample fails.

💡 Suggested Fix: Since the constraint is non-standard, the environment must explicitly bridge this logical gap. The error message should serve as a chain-of-thought prompt to remind the Agent of the suspended original goal.

Change Error Message From:

...You need to ask the user which message to delete.

To:

...You need to ask the user which message to delete and resend the message.
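A rough sketch of the intended flow, assuming a simplified environment: only send_message() and the wording of the error text come from this issue, while the capacity value, the list-based inbox, and the return format are hypothetical. The elided start of the error message is kept as "..." because the full text is not quoted here.

```python
INBOX_CAPACITY = 4  # Hypothetical capacity; the real value may differ.

# Proposed wording; the leading "..." stands for the part not quoted in this issue.
FULL_INBOX_ERROR = (
    "...You need to ask the user which message to delete and resend the message."
)


def send_message(inbox: list[str], text: str) -> dict:
    """Toy version of the constraint: sending fails while the inbox is full."""
    if len(inbox) >= INBOX_CAPACITY:
        return {"status": False, "message": FULL_INBOX_ERROR}
    inbox.append(text)
    return {"status": True, "message": "sent"}


inbox = ["m1", "m2", "m3", "m4"]

# Step 1: the first attempt fails because the inbox is full.
first_attempt = send_message(inbox, "hello")

# Step 2: the sub-goal, deleting the most recent message as the user instructed.
inbox.pop()

# Step 3: the step agents tend to drop, returning to the original goal and resending.
second_attempt = send_message(inbox, "hello")
```

The point of the wording change is that the error text itself names step 3, so the agent does not have to infer it.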

Thanks!
