GUI Agent

A desktop automation agent powered by multimodal vision-language models. Automates computer tasks through visual understanding and precise control.

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Configure API Key

cp .env.example .env

Edit .env:

DASHSCOPE_API_KEY=sk-your-api-key-here    # Alibaba Cloud DashScope
BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
MODEL_NAME=qwen3-vl-plus

Supported Models: GUI Agent works with any vision-language model that follows the OpenAI API format.

Provider	Models
Alibaba Cloud (DashScope)	`qwen3-vl-plus` (recommended), `qwen3-vl-flash`, `qwen-vl-max`
OpenAI	`gpt-4o`, `gpt-4o-mini`
Google	`gemini-1.5-pro`, `gemini-1.5-flash`
Others	Any model compatible with OpenAI API format

Get API Keys:

Alibaba Cloud: Bailian Console
OpenAI: Platform Console
Google: AI Studio

3. Run

python gui_agent.py

Then enter instructions like:

"Open Notepad"
"Search for Python tutorials on Google"
"Minimize all windows"

Demo

Basic Task: Open Calculator

Multi-step Task: Web Search

Advanced Task: Form Filling

Usage Examples

"Open Windows Calculator"
"Search for Python tutorials on Google"
"Type 'Hello, World!' in Notepad"
"Open browser, visit GitHub, and log in"

Supported Actions

Action	Description	Example
`CLICK`	Click at coordinates (0.0-1.0)	`CLICK(0.5, 0.5)`
`TYPE`	Type text via clipboard paste	`TYPE("Hello")`
`SCROLL`	Scroll mouse wheel	`SCROLL(-100)`
`HOTKEY`	Keyboard shortcuts	`HOTKEY("ctrl+c")`
`FOCUS`	Zoom-in for precision	`FOCUS(0.5, 0.5)`
`DONE`	Task completed	`DONE`

Configuration

Variable	Description	Default
`DASHSCOPE_API_KEY`	API Key (DashScope/OpenAI/etc.)	-
`BASE_URL`	API endpoint	`https://dashscope.aliyuncs.com/compatible-mode/v1`
`MODEL_NAME`	Model to use	`qwen3-vl-plus`
`MAX_ITERATIONS`	Max iterations per task	`50`

Recommended Models: qwen3-vl-plus (best for GUI tasks), gpt-4o (general purpose), qwen3-vl-flash (fast)

Architecture

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ Perception  │───▶│   Brain     │───▶│   Action    │
│ ScreenCap   │    │  LLM Client │    │ Executor    │
└─────────────┘    └─────────────┘    └─────────────┘

Files:

gui_agent.py - Main agent with ReAct loop
main.py - Multi-step planner orchestrator
planner.py - Task decomposition

Safety

PyAutoGUI FailSafe: Move mouse to screen corner to emergency stop.

Always monitor agent operations

Test in safe environment first

Ready to press Ctrl+C to interrupt

Troubleshooting

API connection fails: Check API key and network connection.

Inaccurate actions: Use more detailed task descriptions.

Slow response: Try a faster model like qwen3-vl-flash or gpt-4o-mini.

License

MIT License - See LICENSE file for details

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.github		.github
examples		examples
.env.example		.env.example
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
cli.py		cli.py
config.py		config.py
gui_agent.py		gui_agent.py
main.py		main.py
planner.py		planner.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GUI Agent

Quick Start

1. Install Dependencies

2. Configure API Key

3. Run

Demo

Basic Task: Open Calculator

Multi-step Task: Web Search

Advanced Task: Form Filling

Usage Examples

Supported Actions

Configuration

Architecture

Safety

Troubleshooting

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

GUI Agent

Quick Start

1. Install Dependencies

2. Configure API Key

3. Run

Demo

Basic Task: Open Calculator

Multi-step Task: Web Search

Advanced Task: Form Filling

Usage Examples

Supported Actions

Configuration

Architecture

Safety

Troubleshooting

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages