fix: Exclude computer tool from tool_use modality by dzorlu · Pull Request #398 · meta-pytorch/OpenEnv

dzorlu · 2026-02-19T23:17:38Z

Summary

Fixed bug where tool_use tasks were incorrectly exposed to the computer tool
Added inverse filtering for tool_use modality to exclude computer tool
Updated and added tests to verify correct behavior

Problem

The tool filtering logic in task_env.py only handled computer_use modality (keeping ONLY the computer tool). For tool_use modality, there was no filtering - all tools passed through, including the computer tool if the MCP endpoint exposed it.

This caused training runs to log call_tool(computer) for what should be pure tool-use tasks.

Changes

src/envs/fleet_env/task_env.py: Add filtering to exclude computer tool for tool_use modality
tests/envs/test_fleet_task_env.py: Update test name and assertions, add new test for function format

Test plan

Run pytest tests/envs/test_fleet_task_env.py::TestFleetTaskEnvComputerUseFiltering -v - all 5 tests pass

🤖 Generated with Claude Code

- FleetTaskEnv wraps FleetEnvClient with task-oriented interface
- Accepts task configs from export_training_tasks.py
- Creates versioned environments on reset
- Injects task prompt into observations
- Executes verifier for reward computation on episode completion
- Supports both sync and async step methods
- Factory functions: make_fleet_task_env, from_json_file
- Tests: 20 unit tests for init, specs, verifiers, factories

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The MCP images don't exist for all environment versions, causing FleetVersionNotFoundError when trying to create environments. Changing the default to None allows the Fleet SDK to use standard images which are available for all versions. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

FleetEnvClient.from_fleet() was not accepting data_key/data_version parameters, causing them to be passed through **kwargs to HTTPEnvClient which doesn't accept them.
- Add data_key and data_version as explicit parameters
- Pass them to fleet.make()
- Update task_env.py to pass them separately

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Fleet SDK expects data_key in "key:version" format, not as separate parameters. Updated from_fleet() to combine them before calling fleet.make(). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
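
A minimal sketch of the combining step described above. The helper name `combine_data_key` is hypothetical, and passing a key through unchanged when no version is given is an assumption; the actual code in from_fleet() may differ.

```python
from typing import Optional


def combine_data_key(data_key: Optional[str], data_version: Optional[str]) -> Optional[str]:
    """Join key and version into a single "key:version" string (hypothetical helper)."""
    if data_key is None:
        # Nothing to combine; the caller did not request a dataset.
        return None
    if data_version is None:
        # Assumption: a key without a version is passed through unchanged.
        return data_key
    return f"{data_key}:{data_version}"
```

For example, `combine_data_key("seed", "v1")` yields `"seed:v1"`, which is the shape the commit message says the SDK expects.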

HTTPEnvClient.reset() doesn't support seed parameter yet. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Increases default timeout from 15s to 60s for Fleet API calls. This prevents timeouts during environment initialization. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Previously reset() did partial work and reset_async() added tool fetching. Now reset_async() does all the work (including fetching tools) and reset() is just a sync wrapper that calls it via run_until_complete(). This ensures both methods return identical results including tools. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
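
The wrapper pattern can be sketched roughly as follows. `TaskEnvSketch` and `_fetch_tools` are hypothetical stand-ins, not the real FleetTaskEnv; only the relationship between reset() and reset_async() is illustrated.

```python
import asyncio
from typing import Any, Dict, List


class TaskEnvSketch:
    """Hypothetical stand-in for FleetTaskEnv; only the reset pattern is shown."""

    async def _fetch_tools(self) -> List[Dict[str, Any]]:
        # Placeholder for the real tool listing call.
        await asyncio.sleep(0)
        return [{"name": "search"}]

    async def reset_async(self) -> Dict[str, Any]:
        # All of the work, including tool fetching, lives in the async method.
        tools = await self._fetch_tools()
        return {"prompt": "example task", "tools": tools}

    def reset(self) -> Dict[str, Any]:
        # Thin sync wrapper: run the coroutine to completion on a fresh loop.
        loop = asyncio.new_event_loop()
        try:
            return loop.run_until_complete(self.reset_async())
        finally:
            loop.close()
```

Because reset() only delegates, both entry points return the same observation. One caveat of this pattern: calling reset() from code that is already running inside an event loop raises a RuntimeError, so async callers should await reset_async() directly.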

MCP's call_tool() returns a CallToolResult Pydantic object, not plain text. This was causing ugly repr strings to be passed to agents like:

"meta=None content=[TextContent(type='text', text='...')] ..."

Now properly extracts:
- Text content from result.content[].text
- Tries JSON parsing for structured results
- Falls back to structuredContent if available
- Handles isError cases

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
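
The extraction order listed above might look roughly like this. The dataclasses are simplified stand-ins for the MCP types (only the fields named in the commit message are modeled), and the `{"error": ...}` shape and the empty-string fallback are assumptions rather than the PR's actual return values.

```python
import json
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class TextContent:
    """Simplified stand-in for MCP's TextContent."""
    text: str
    type: str = "text"


@dataclass
class CallToolResult:
    """Simplified stand-in for MCP's CallToolResult."""
    content: List[Any] = field(default_factory=list)
    structuredContent: Optional[dict] = None
    isError: bool = False


def extract_tool_result(result: CallToolResult) -> Any:
    # Collect the text of every text-typed content item.
    texts = [c.text for c in result.content if getattr(c, "type", None) == "text"]
    joined = "\n".join(texts)
    if result.isError:
        # Assumption: errors are surfaced as a small dict.
        return {"error": joined or "unknown error"}
    if joined:
        try:
            # Structured results often arrive as JSON encoded in a text item.
            return json.loads(joined)
        except json.JSONDecodeError:
            return joined
    if result.structuredContent is not None:
        # Fallback when there is no text content at all.
        return result.structuredContent
    return ""
```

The point of the sketch is that the agent receives either parsed data or plain text, never the Pydantic repr string.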

Tests for:
- FleetMCPClient._extract_tool_result():
  - Single text content extraction
  - JSON parsing from text
  - Multiple text contents
  - Error result handling
  - Structured content fallback
  - Empty result handling
- FleetTaskEnv reset:
  - reset_async() returns tools
  - reset() calls reset_async() (sync wrapper)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Move fleet.make() and list_tools() into FleetTaskEnv.__init__()
- Tools are now fetched at env creation, not during reset
- reset_async() calls _orch.reset() with error handling, returns cached tools
- Use asyncio.run() for Python 3.13 compatibility
- Update tests for new initialization pattern

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Log task_key and verifier code preview when verifier fails
- Catch syntax errors separately with clear message
- Show which functions were found if 'verify' is missing

Helps debug issues like "Verifier code must define a 'verify' function"

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Replace custom _execute_verifier_local() with Fleet SDK's Task.verify_detailed() which properly sets up the verifier namespace with:
- Environment type annotation
- Helper functions (normalized_contains, etc.)
- Proper function discovery (not just "verify" function)

This fixes "name 'Environment' is not defined" errors during verifier execution.

Changes:
- _compute_reward: Create Fleet SDK Task and call verify_detailed()
- Support both 'verifier_code' and 'verifier_func' field names
- Add comprehensive logging for debugging
- Remove broken _execute_verifier_local method

Tests:
- Update all verifier tests to mock Fleet SDK Task.verify_detailed()
- Add tests for various edge cases (no verifier, no orch, exceptions)
- Fix fixture to avoid asyncio.run() conflicts with pytest-asyncio

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…context

- Add retry with exponential backoff (3 attempts, 1s/2s/4s delays)
- Log errors instead of silently swallowing exceptions
- Log warning when some clients fail but others succeed
- Log error after all retries exhausted

This fixes silent failures when MCP connections are flaky, which caused 'no tools found' errors in SkyRL training.

call_tool now retries with exponential backoff (3 attempts, 1s/2s/4s) on connection errors, similar to list_tools. ValueError (tool not found) is not retried. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
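
A rough sketch of this retry policy, under stated assumptions: `retry_with_backoff` and its `sleep` parameter are hypothetical names, and the sketch sleeps only between attempts, so with three attempts only the 1s and 2s delays are used. The commit message lists 1s/2s/4s, so the actual code may count attempts or delays differently.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    op: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Retry op with exponentially growing delays; ValueError is never retried."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_exc: Optional[BaseException] = None
    for attempt in range(attempts):
        try:
            return await op()
        except ValueError:
            # e.g. tool not found: retrying cannot change the outcome.
            raise
        except Exception as exc:
            last_exc = exc
            if attempt < attempts - 1:
                # Delays double each time: base_delay, 2 * base_delay, ...
                await sleep(base_delay * 2 ** attempt)
    assert last_exc is not None
    raise last_exc
```

Injecting `sleep` keeps the helper testable without real waiting; production code would leave the default `asyncio.sleep` in place.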

Adds exponential backoff retry (3 attempts, 2s base delay) around fleet.make() to handle transient Fleet API errors like health check failures that can occur during instance provisioning. Only retries on transient errors (health check, timeout, connection). Permanent errors are raised immediately. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add Toolathlon-style context management tools for long trajectories:
- check_context: Check visible/total turn counts
- manage_context: Drop old turns to free up context space
- search_history: Search all history (including dropped)
- search_tool_output: Search truncated tool output
- view_tool_output: Paginate through truncated output

The ContextManager class can be used by any training framework that maintains chat_history. It tracks full history and handles truncated tool outputs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
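
To illustrate the idea of keeping full history while shrinking the visible window, here is a small sketch covering three of the tools. `ContextManagerSketch` and its method signatures are hypothetical and are not the ContextManager class from this PR; truncated tool output handling is left out.

```python
from typing import Dict, List


class ContextManagerSketch:
    """Hypothetical illustration; not the ContextManager from the PR."""

    def __init__(self) -> None:
        self.full_history: List[Dict[str, str]] = []  # never shrinks
        self.visible_start = 0  # index of the first still-visible turn

    def add_turn(self, role: str, content: str) -> None:
        self.full_history.append({"role": role, "content": content})

    def visible(self) -> List[Dict[str, str]]:
        return self.full_history[self.visible_start:]

    def check_context(self) -> Dict[str, int]:
        """Report visible and total turn counts."""
        return {
            "visible_turns": len(self.visible()),
            "total_turns": len(self.full_history),
        }

    def manage_context(self, drop_oldest: int) -> int:
        """Hide up to drop_oldest visible turns; return how many were dropped."""
        dropped = max(0, min(drop_oldest, len(self.visible())))
        self.visible_start += dropped
        return dropped

    def search_history(self, query: str) -> List[int]:
        """Return indices of all turns, dropped ones included, that contain query."""
        q = query.lower()
        return [
            i for i, turn in enumerate(self.full_history)
            if q in turn["content"].lower()
        ]
```

Dropping turns only moves an index, so a later search can still find content that is no longer sent to the model.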

Computer-use tasks require MCP-enabled container images (e.g., famazon:mcp0.0.7) which have scrot installed for screenshots and the MCP server with 'computer' tool for mouse/keyboard control.

Previously, tools were only fetched for tool_use modality due to a restrictive condition. This caused computer_use tasks to fail with "no tools found in observation" because the computer tool (mouse, keyboard, screenshot) was never fetched. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

When task_modality is computer_use, filter tools to only include the 'computer' tool. This prevents the model from using API tools when it should be using mouse/keyboard control. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Two critical fixes for VL (vision-language) model training:
1. ImageContent extraction: _extract_tool_result() now handles MCP ImageContent (base64 images with mimeType) and converts them to OpenAI-compatible format for VL models.
2. Tool filtering: computer_use modality now always filters to only the 'computer' tool. If no computer tool found, clears all tools and logs warning (prevents model from using API tools).

Tests added:
- test_extract_image_content
- test_extract_mixed_text_and_image_content
- test_extract_image_default_mimetype
- test_computer_use_filters_to_computer_tool
- test_computer_use_clears_tools_when_no_computer_tool
- test_tool_use_does_not_filter
- test_computer_use_filters_function_format

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
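
The image conversion in the first fix could be sketched as below. This assumes "OpenAI-compatible format" means the chat `image_url` content part carrying a data URL, and that the default MIME type is `image/png`; neither detail is confirmed by the commit message, and the function name is hypothetical.

```python
from typing import Dict, Optional


def image_content_to_openai(data: str, mime_type: Optional[str] = None) -> Dict[str, object]:
    """Wrap base64 image data in an OpenAI-style image_url content part."""
    # Assumption: fall back to PNG when the MCP item carries no mimeType.
    mime = mime_type or "image/png"
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{data}"},
    }
```

The base64 payload is passed through untouched; only the wrapper around it changes.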

For VL (vision-language) models doing computer_use tasks, the model needs visual input to know where to click. Previously, reset() only returned metadata without a screenshot, leaving VL models blind. Now for computer_use modality, reset_async() automatically takes a screenshot after reset and includes it in the observation as `initial_screenshot`. This is in OpenAI-compatible format for VL models. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The manager API (POST /reset) hangs indefinitely on some env images (e.g. google-maps v0.0.53). Since reset failure is already handled gracefully (warning + continue), this adds a short dedicated timeout (default 10s) so the reset fails fast instead of blocking for the full request_timeout_s (60-120s). This saves 50-110s per episode during training when the manager API is unresponsive, while still allowing reset to succeed on healthy envs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
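
A minimal sketch of a fail-fast reset using `asyncio.wait_for`. The function name `reset_with_timeout`, the `do_reset` callable, and the boolean return value are assumptions made for illustration; the PR's actual implementation may apply the timeout at the HTTP client level instead.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)


async def reset_with_timeout(
    do_reset: Callable[[], Awaitable[None]],
    reset_timeout_s: float = 10.0,
) -> bool:
    """Attempt a reset, giving up after reset_timeout_s; never raises."""
    try:
        await asyncio.wait_for(do_reset(), timeout=reset_timeout_s)
        return True
    except asyncio.TimeoutError:
        # The hung call is cancelled and the episode continues without a reset.
        logger.warning("reset timed out after %.1fs; continuing", reset_timeout_s)
        return False
    except Exception as exc:
        # Other reset failures are also treated as non-fatal.
        logger.warning("reset failed: %s; continuing", exc)
        return False
```

With a 10s limit in place of a 60-120s request timeout, an unresponsive manager API costs at most 10s per episode, which is where the 50-110s saving quoted above comes from.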

Previously, tool filtering only handled computer_use modality (keeping only the computer tool). For tool_use modality, all tools passed through including the computer tool if the MCP endpoint exposed it. This caused tool_use training runs to incorrectly have access to the computer tool (mouse/keyboard control) when they should only have API tools.

Changes:
- Add inverse filtering for tool_use to exclude the computer tool
- Update test to verify computer tool is excluded for tool_use
- Add test for function format exclusion

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

greptile-apps · 2026-02-19T23:20:25Z

Greptile Summary

Fixes tool filtering bug where tool_use modality tasks incorrectly received the computer tool alongside API tools. The fix adds inverse filtering logic to explicitly exclude the computer tool for tool_use tasks (lines 222-233 in task_env.py), complementing the existing computer_use filtering that keeps only the computer tool. The implementation correctly handles both tool formats (direct name field and nested function.name field). Tests were updated to verify the exclusion behavior for both formats.
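
The two filters and the two tool formats can be sketched as follows. This is an illustration only: it assumes tools are plain dicts, and `tool_name` and `filter_tools` are hypothetical names rather than the functions in task_env.py.

```python
import logging
from typing import Any, Dict, List

logger = logging.getLogger(__name__)

COMPUTER_TOOL = "computer"


def tool_name(tool: Dict[str, Any]) -> str:
    """Read the name from either the flat or the nested function format."""
    if "name" in tool:
        return tool["name"]
    return tool.get("function", {}).get("name", "")


def filter_tools(tools: List[Dict[str, Any]], modality: str) -> List[Dict[str, Any]]:
    if modality == "computer_use":
        # Keep ONLY the computer tool; an empty result signals a config error.
        kept = [t for t in tools if tool_name(t) == COMPUTER_TOOL]
        if not kept:
            logger.warning("computer_use task but no 'computer' tool found")
        return kept
    if modality == "tool_use":
        # Inverse filter: drop the computer tool, keep the API tools.
        return [t for t in tools if tool_name(t) != COMPUTER_TOOL]
    # Unknown modality: leave the list unchanged (assumption).
    return list(tools)
```

Resolving the name in one helper keeps the two modality branches symmetric, so a tool in function format cannot slip past either filter.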

Confidence Score: 5/5

This PR is safe to merge with no issues identified
The fix is focused, well-tested, and addresses a clear bug. The logic mirrors the existing computer_use filtering pattern, handles both tool formats correctly, and includes comprehensive test coverage for the new behavior. No mechanical issues, alignment concerns, or edge cases identified.
No files require special attention

Important Files Changed

Filename	Overview
src/envs/fleet_env/task_env.py	Added inverse filtering logic to exclude `computer` tool for `tool_use` modality, preventing incorrect tool exposure during training
tests/envs/test_fleet_task_env.py	Updated test assertions and added new test case to verify `computer` tool exclusion for both name and function formats

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[reset_async called] --> B{Fetch tools if not fetched}
    B -->|Tools fetched| C{Check modality}
    C -->|tool_use| D[Filter: EXCLUDE 'computer' tool]
    C -->|computer_use| E[Filter: KEEP ONLY 'computer' tool]
    D --> F{Check tool format}
    E --> G{Check tool format}
    F -->|name format| H[Remove if t.name == 'computer']
    F -->|function format| I[Remove if t.function.name == 'computer']
    G -->|name format| J[Keep only if t.name == 'computer']
    G -->|function format| K[Keep only if t.function.name == 'computer']
    H --> L[Return filtered tools cache]
    I --> L
    J --> L
    K --> L
    E -->|No computer tool found| M[Warning: Config error - Clear tools cache]
    M --> L

Last reviewed commit: 26f3940

Deniz and others added 30 commits December 12, 2025 12:14

fleet integartion step 0

38edfc2

updated README

f67bc43

readme update

164853c

another iteraton

935826f

readme

7c09d5b

conb

7efae22

Add __init__.py to envs package for pip install compatibility

791a071

fix: Remove seed parameter from reset() call

7852847


Fix: fetch tools lazily in reset_async to avoid asyncio.run in async …

ced5eca

…context

debug: add logging for call_tool to trace success/failure paths

a2f3531

fix: unwrap ExceptionGroup to show actual error cause

a08cb6d

debug: add logging for Fleet instance creation timing

9806eb8

Use image_type='mcp' for computer_use tasks

a1ac1a7


Deniz and others added 8 commits February 11, 2026 19:19

debug: Log actual screenshot result format from MCP

6e4a522

fix: Handle Fleet MCP base64_image format for VL models

80be63b

Remove debug logging from task_env.py

7a1a755

test: Add tests for base64_image format handling

0c0b535

meta-cla bot added the CLA Signed label on Feb 19, 2026

dzorlu wants to merge 38 commits into meta-pytorch:main from fleet-ai:fix/exclude-computer-tool-for-tool-use
