This repository contains the source code and experimental setup for "Enhancing Multimodal GUI AI-Agent Benchmarking," a modular, code-centric framework designed to improve upon the foundation of the Spider2-V benchmark.
While powerful benchmarks like Spider2-V provide realistic tasks for evaluating AI agents, their often monolithic designs can hinder research due to poor developer experience (Dev-X), limited observability, and a lack of reproducibility. This framework, built upon the lightweight and code-centric Smolagents library, addresses these challenges directly.
The proposed architecture decouples the agent's logic from the task environment. It uses a stack of modern, open-source tools to create isolated, observable, and reproducible sandboxes for the agent to operate in. This design focuses on modularity and standard interfaces, allowing components to be swapped and extended with minimal friction.
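For example, host-to-sandbox communication over SSH could look like the following sketch (using the `paramiko` library; the host, port, credentials, and command are placeholders, not the framework's actual configuration):

```python
# Illustrative sketch of host-to-sandbox SSH communication; the address,
# port, credentials, and command are placeholders.
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("localhost", port=2222, username="agent", password="agent")  # placeholder
_, stdout, _ = client.exec_command("xdotool getactivewindow getwindowname")
print(stdout.read().decode())
client.close()
```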
At the core of the framework is a dynamic, iterative workflow loop that enables the agent to reason, act, and learn from feedback. This Observe -> Plan -> Act cycle is central to its problem-solving strategy:
1. Observation: The agent begins by assessing its environment through a multimodal observation: a screenshot of the GUI together with the textual output (or error trace) from the last executed code.
2. Reasoning and Planning: Based on this combined visual and textual data, the agent analyzes its progress, diagnoses any errors, and formulates a step-by-step plan toward the final goal.
3. Action and Tool Usage: The plan culminates in a concrete action. The agent decides whether to use direct Python code for efficiency or `pyautogui` for GUI interaction, then generates the code for the chosen tool.
4. Execution: The action runs in the sandboxed environment, producing a new observation and restarting the loop.
This iterative process allows the agent to tackle complex, multi-step tasks and provides a mechanism for autonomous error recovery.
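As a rough illustration, the loop can be sketched as follows; `agent`, `sandbox`, and their methods are hypothetical names, not the framework's actual API:

```python
# Minimal sketch of the Observe -> Plan -> Act loop. All names here are
# illustrative stand-ins for the framework's real components.
def run_episode(agent, sandbox, max_steps: int = 25) -> bool:
    observation = sandbox.observe()            # screenshot + last stdout/stderr
    for _ in range(max_steps):
        plan = agent.reason(observation)       # analyze progress, diagnose errors
        action = agent.act(plan)               # Python code or a pyautogui snippet
        observation = sandbox.execute(action)  # run in the sandbox, observe again
        if observation.task_complete:          # task evaluator decides success
            return True
    return False
```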
This framework introduces several key improvements over the baseline implementation:
- Powered by Smolagents: Built on the simple, code-first Smolagents framework, enabling transparent agent logic and easy integration with the broader open-source AI ecosystem (e.g., Hugging Face).
- Modular & Decoupled Architecture: Built on Docker, QEMU, and SSH, the framework isolates the agent, its tools, and the sandbox environment. Components are hot-swappable, allowing for rapid iteration.
- Flexible Action Space: The agent isn't locked into rigid primitives. It can dynamically choose between direct Python code execution (via a Jupyter Kernel Gateway) for efficiency and GUI automation (`pyautogui`) for UI-specific tasks; see the sketch after this list.
- Rich Observability: Gone are the days of manual debugging in VM snapshots. This framework provides:
  - Live GUI Access: Monitor the agent in real-time through any web browser using noVNC.
  - Structured Logging: Detailed, step-by-step logs of every action, observation, and thought process.
  - Telemetry: Compatibility with OpenTelemetry for exporting detailed performance metrics.
- Improved Developer Experience (Dev-X): By leveraging open standards and reducing the core Python codebase by 66%, the framework is significantly easier to understand, maintain, and extend.
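The sketch below illustrates the two action styles the agent chooses between; the function bodies, file paths, and keyboard shortcuts are made up for illustration and are not part of the repository:

```python
# Illustrative only: two action styles the agent can emit each step.
# pyautogui calls run inside the sandbox VM, not on the host.
import shutil
import pyautogui

def move_report_via_python() -> None:
    # Direct code execution (via the Jupyter Kernel Gateway): fast, no GUI round-trips.
    shutil.move("/home/user/report.csv", "/home/user/archive/report.csv")

def open_spreadsheet_via_gui() -> None:
    # GUI automation: for behavior only the UI exposes.
    pyautogui.hotkey("ctrl", "alt", "t")                        # open a terminal
    pyautogui.typewrite("libreoffice --calc\n", interval=0.05)  # launch LibreOffice Calc
```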
Follow these steps to set up the environment and run the benchmark.
Ensure you have the following dependencies installed on your system:
- Python (3.12+): The core programming language.
- UV: A fast Python package installer and resolver, used here to manage dependencies.
- Direnv: A tool to manage project-specific environment variables automatically.
- Docker: To build and manage the containerized sandbox environments.
1. Clone the repository:

   ```bash
   git clone https://github.com/melchiorhering/Automating-DS-DE-Workflows.git
   cd Automating-DS-DE-Workflows
   ```
2. Prepare the Sandbox OS Image: This framework requires a configured QEMU virtual machine image to serve as the agent's sandbox. You have two options to set this up:

   - Option A (Recommended): Download a pre-configured image directly from the Hugging Face Hub and place the downloaded `ubuntu-base` directory inside the `src/docker/vms/` folder. Either use Git submodules to sync the OS image files into this repository, or use the Python script in `src/docker` to download the files (see the sketch below).
   - Option B (Advanced): Build the OS image from scratch. The necessary scripts are provided for Linux-based systems in the `src/docker/` directory; follow the instructions in the `README.md` file within that directory.
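   For Option A, the download is conceptually similar to this sketch using the `huggingface_hub` library; the `repo_id` below is a placeholder, so prefer the repository's own script in `src/docker`:

   ```python
   # Illustrative sketch: the repo_id is a placeholder; the real download
   # script lives in src/docker and should be preferred.
   from huggingface_hub import snapshot_download

   snapshot_download(
       repo_id="your-org/ubuntu-base-vm-image",  # placeholder, not the real repo
       local_dir="src/docker/vms/ubuntu-base",   # location the framework expects
   )
   ```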
3. Allow Direnv to manage the environment:

   ```bash
   direnv allow
   ```

   This automatically creates a virtual environment for you based on the `.envrc.example` file.
4. Install Python dependencies with UV:

   ```bash
   uv sync --all-extras
   ```

   This command reads the `pyproject.toml` file and syncs the project environment for you.
5. Set up pre-commit:

   ```bash
   uv run pre-commit install
   ```

   This installs the pre-commit hooks, which keep the codebase clean and consistent.
The main entry point for running experiments is the orchestrator script. You can run all tasks sequentially or a specific subset.
```bash
cd src

# Example: run all tasks defined in the index file
uv run cli.py \
    --task-index /path/to/your/task_list.json \
    --tasks-root /path/to/your/task_definitions/ \
    --results-root results/
```

Results, logs, and other artifacts for each task run are saved to the `results/` directory.
The repository includes Python scripts and notebooks to process the generated `summary.json` files, calculate aggregate metrics, and create plots to visualize the results.
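As an illustration, a minimal aggregation over those files might look like this, assuming each run writes a `results/<task>/summary.json` with at least a boolean `success` field (the actual schema may differ):

```python
# Minimal aggregation sketch; the top-level boolean "success" field is
# an assumption, not the actual summary.json format.
import json
from pathlib import Path

summaries = [json.loads(p.read_text()) for p in Path("results").glob("*/summary.json")]
if summaries:
    success_rate = sum(bool(s.get("success")) for s in summaries) / len(summaries)
    print(f"{len(summaries)} runs, success rate: {success_rate:.1%}")
```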
This project is licensed under the MIT License. See the LICENSE file for details.
This repository is the official implementation for the paper "Enhancing Multimodal GUI AI-Agent Benchmarking." For a detailed explanation of the methodology, experimental setup, and results, please refer to the paper.

