A desktop automation agent powered by multimodal vision-language models. Automates computer tasks through visual understanding and precise control.
pip install -r requirements.txtcp .env.example .envEdit .env:
DASHSCOPE_API_KEY=sk-your-api-key-here # Alibaba Cloud DashScope
BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
MODEL_NAME=qwen3-vl-plusSupported Models: GUI Agent works with any vision-language model that follows the OpenAI API format.
| Provider | Models |
|---|---|
| Alibaba Cloud (DashScope) | qwen3-vl-plus (recommended), qwen3-vl-flash, qwen-vl-max |
| OpenAI | gpt-4o, gpt-4o-mini |
gemini-1.5-pro, gemini-1.5-flash |
|
| Others | Any model compatible with OpenAI API format |
Get API Keys:
- Alibaba Cloud: Bailian Console
- OpenAI: Platform Console
- Google: AI Studio
python gui_agent.pyThen enter instructions like:
- "Open Notepad"
- "Search for Python tutorials on Google"
- "Minimize all windows"
"Open Windows Calculator"
"Search for Python tutorials on Google"
"Type 'Hello, World!' in Notepad"
"Open browser, visit GitHub, and log in"| Action | Description | Example |
|---|---|---|
CLICK |
Click at coordinates (0.0-1.0) | CLICK(0.5, 0.5) |
TYPE |
Type text via clipboard paste | TYPE("Hello") |
SCROLL |
Scroll mouse wheel | SCROLL(-100) |
HOTKEY |
Keyboard shortcuts | HOTKEY("ctrl+c") |
FOCUS |
Zoom-in for precision | FOCUS(0.5, 0.5) |
DONE |
Task completed | DONE |
| Variable | Description | Default |
|---|---|---|
DASHSCOPE_API_KEY |
API Key (DashScope/OpenAI/etc.) | - |
BASE_URL |
API endpoint | https://dashscope.aliyuncs.com/compatible-mode/v1 |
MODEL_NAME |
Model to use | qwen3-vl-plus |
MAX_ITERATIONS |
Max iterations per task | 50 |
Recommended Models: qwen3-vl-plus (best for GUI tasks), gpt-4o (general purpose), qwen3-vl-flash (fast)
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Perception │───▶│ Brain │───▶│ Action │
│ ScreenCap │ │ LLM Client │ │ Executor │
└─────────────┘ └─────────────┘ └─────────────┘
Files:
gui_agent.py- Main agent with ReAct loopmain.py- Multi-step planner orchestratorplanner.py- Task decomposition
PyAutoGUI FailSafe: Move mouse to screen corner to emergency stop.
- Always monitor agent operations
- Test in safe environment first
- Ready to press Ctrl+C to interrupt
API connection fails: Check API key and network connection.
Inaccurate actions: Use more detailed task descriptions.
Slow response: Try a faster model like qwen3-vl-flash or gpt-4o-mini.
MIT License - See LICENSE file for details


