Browser Testing Agent

An autonomous browser testing agent powered by AI. It uses vision models and LLMs to interact with web applications and complete testing scenarios without manual scripting.

TypeScript Node.js pnpm License: MIT

What It Does

Instead of writing brittle selectors and waiting for elements, you just tell the agent what you want to test. It figures out how to do it by:

  • Looking at screenshots to understand the page
  • Finding buttons, inputs, and links automatically
  • Planning actions based on your goal
  • Executing them and checking if it worked

Features

  • Vision-based element detection - Uses screenshots to find interactive elements, no selectors needed
  • Multi-action planning - Fills entire forms in one go instead of field-by-field
  • Smart evaluation - Distinguishes a goal that was actually achieved from a mere visual change on the page
  • Multi-step workflows - Handles complex flows like signup processes from start to finish
  • Batch operations - Faster execution by grouping related actions
  • Screenshot debugging - Captures screenshots at each step so you can see what happened

Installation

You'll need:

  • Node.js 18+
  • pnpm 8+
  • A Google Gemini API key (available from Google AI Studio)

Then clone the repo, install dependencies, and build:

git clone https://github.com/kareem2002-k/browser-testing-agent.git
cd browser-testing-agent
pnpm install
pnpm build

Create a .env file:

GOOGLE_API_KEY=your-api-key-here

Usage

Basic usage:

pnpm agent --url "https://example.com" --goal "click the login button"

Headed mode (see the browser):

pnpm agent --url "https://example.com" --goal "create an account" --headed

Verbose logging:

pnpm agent --url "https://example.com" --goal "fill the contact form" --verbose

Examples

Simple button click:

pnpm agent --url "https://example.com" --goal "click the sign up button"

Form filling:

pnpm agent --url "https://app.example.com" --goal "create an account by filling all required fields and completing signup"

The agent fills all form fields in one batch, clicks submit, and then completes any additional steps.
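
To make the batching concrete, here is a hedged sketch of what a single planner iteration for a signup form might produce. The plan schema is internal to agent-core, so the field names below are illustrative only:

// Hypothetical plan for one iteration on a signup form.
// The real plan schema lives in agent-core; these field names are illustrative.
const plan = {
  reasoning: "All required signup fields are visible; fill them in one batch, then submit.",
  actions: [
    { type: "type", target: "full name input", value: "Jane Tester" },
    { type: "type", target: "email input", value: "jane@example.com" },
    { type: "type", target: "password input", value: "S3cure-pass!" },
    { type: "click", target: "sign up button" },
  ],
};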

Complex workflow:

pnpm agent --url "https://app.example.com" --goal "navigate to sign up, fill all the data needed, and complete the full signup process"

Architecture

This project uses Hexagonal Architecture (ports and adapters). The core domain logic is completely separate from external dependencies like Playwright or LLM APIs.

Domain (agent-core)
   ↑       ↑
Ports   Ports
   ↓       ↓
Adapters (browser-mcp, vision, llm)
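
As a rough sketch of the idea, a browser port might look like the interface below. The names are placeholders, not the actual declarations in agent-core:

// Illustrative port: the domain depends only on this interface, never on Playwright.
// The browser-mcp package would supply the concrete implementation.
export interface BrowserPort {
  navigate(url: string): Promise<void>;
  click(elementDescription: string): Promise<void>;
  type(elementDescription: string, text: string): Promise<void>;
  screenshot(): Promise<Buffer>;
}

// The domain receives adapters from the outside (dependency injection) rather
// than importing them, which keeps agent-core free of external dependencies.
export class TestAgent {
  constructor(private readonly browser: BrowserPort) {}
}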

Project Structure

browser-testing-agent/
├── apps/
│   └── agent-runner/          # CLI entry point
├── packages/
│   ├── agent-core/            # Domain logic (no external deps)
│   ├── browser-mcp/           # Playwright adapter
│   ├── vision/                # Vision analysis
│   └── llm/                   # LLM integration
└── scripts/                   # Utilities

How It Works

The agent runs a MAPE-K loop:

  1. Monitor - Takes a screenshot
  2. Analyze - Vision model extracts facts (buttons, inputs, links)
  3. Plan - Planner LLM decides what to do next
  4. Execute - Runs the browser action
  5. Evaluate - Checks if the goal was achieved

This repeats until the goal is complete or it hits the step limit.
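
Sketched in TypeScript, one run of the loop looks roughly like this; the interfaces and names are placeholders for the real ports in agent-core:

// Simplified MAPE-K loop; illustrative only, not the actual agent-core API.
interface Deps {
  screenshot(): Promise<Buffer>;                                          // browser adapter
  extractFacts(image: Buffer): Promise<unknown>;                          // vision adapter
  plan(goal: string, facts: unknown): Promise<{ actions: string[] }>;     // planner LLM
  execute(action: string): Promise<void>;                                 // browser adapter
  evaluate(goal: string, image: Buffer): Promise<{ achieved: boolean }>;  // evaluator LLM
}

async function runLoop(deps: Deps, goal: string, maxSteps = 20): Promise<boolean> {
  for (let step = 1; step <= maxSteps; step++) {
    const before = await deps.screenshot();                   // 1. Monitor
    const facts = await deps.extractFacts(before);            // 2. Analyze
    const { actions } = await deps.plan(goal, facts);         // 3. Plan
    for (const action of actions) await deps.execute(action); // 4. Execute
    const after = await deps.screenshot();
    const { achieved } = await deps.evaluate(goal, after);    // 5. Evaluate
    if (achieved) return true;
  }
  return false; // hit the step limit without achieving the goal
}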

Design Patterns

  • Hexagonal Architecture - Domain isolated from adapters
  • Command Pattern - Actions are commands (NavigateCommand, ClickCommand, TypeCommand); see the sketch after this list
  • State Pattern - Test lifecycle with explicit states
  • Strategy Pattern - Swappable LLM providers
  • Observer Pattern - Event system for observability
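
A minimal sketch of the command idea, assuming a browser-port-style interface; the actual NavigateCommand and ClickCommand in the repo may be shaped differently:

// Command-pattern sketch; constructors and dependencies here are assumptions.
interface BrowserLike {
  goto(url: string): Promise<void>;
  click(target: string): Promise<void>;
}

interface BrowserCommand {
  execute(browser: BrowserLike): Promise<void>;
}

class NavigateCommand implements BrowserCommand {
  constructor(private readonly url: string) {}
  execute(browser: BrowserLike): Promise<void> {
    return browser.goto(this.url);
  }
}

class ClickCommand implements BrowserCommand {
  constructor(private readonly target: string) {}
  execute(browser: BrowserLike): Promise<void> {
    return browser.click(this.target);
  }
}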

Development

Build:

pnpm build

Lint:

pnpm lint

Clean:

pnpm clean

Configuration

Screenshots

Screenshots are saved to ./screenshots by default. Each step creates:

  • step-N-before.png - Before the action
  • step-N-after.png - After the action

Models

Default models:

  • Planner: gemini-2.0-flash-lite (falls back to gemini-2.5-flash if needed)
  • Evaluator: gemini-2.0-flash-lite
  • Vision: gemini-2.0-flash-lite (falls back to full model)

You can configure these in the LLM adapter.
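
As a rough illustration, overriding the defaults might look something like the object below; the actual option names live in packages/llm and may differ:

// Hypothetical model configuration; option names are placeholders, not the real API.
const modelConfig = {
  planner: "gemini-2.0-flash-lite",
  plannerFallback: "gemini-2.5-flash",
  evaluator: "gemini-2.0-flash-lite",
  vision: "gemini-2.0-flash-lite",
};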

Performance

The agent is optimized for speed:

  • Batch form filling (all fields at once)
  • Heuristic-first evaluation (fewer LLM calls)
  • Tiered image comparison (pixel → lite vision → full vision); see the sketch after this list
  • Parallel screenshot and DOM extraction
  • Smart page ready detection
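
The tiered comparison can be pictured like this; pixelDiff, liteVisionCompare, and fullVisionCompare are assumed helpers for illustration, not the repo's actual functions:

// Cheap checks first, the full vision model only as a last resort.
// All three helpers below are assumptions, declared only so the sketch type-checks.
declare function pixelDiff(a: Buffer, b: Buffer): number; // fraction of changed pixels, 0..1
declare function liteVisionCompare(a: Buffer, b: Buffer): Promise<{ confident: boolean; changed: boolean }>;
declare function fullVisionCompare(a: Buffer, b: Buffer): Promise<{ changed: boolean }>;

async function pageChanged(before: Buffer, after: Buffer): Promise<boolean> {
  const delta = pixelDiff(before, after);
  if (delta === 0) return false;  // identical screenshots, no LLM call needed
  if (delta > 0.2) return true;   // obviously different, no LLM call needed

  const lite = await liteVisionCompare(before, after); // cheap vision model
  if (lite.confident) return lite.changed;

  return (await fullVisionCompare(before, after)).changed; // full vision model
}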

Contributing

Pull requests welcome!

  1. Fork the repo
  2. Create a feature branch
  3. Make your changes
  4. Submit a PR

Please:

  • Follow TypeScript best practices
  • Keep the hexagonal architecture
  • Add tests for new features
  • Update docs as needed

License

MIT

Credits

Documentation

Known Issues

  • Heavy JavaScript apps might need longer wait times
  • Vision accuracy depends on screenshot quality
  • Complex SPAs might need retries for element detection

Roadmap

  • More browser actions (scroll, hover, drag-and-drop)
  • Test result reporting (JSON, HTML, JUnit)
  • CI/CD examples
  • Docker support
  • Multiple LLM providers (OpenAI, Anthropic)
  • Visual regression testing
  • Test recording/replay
  • Parallel execution

Support

Found a bug? Have a question?

  1. Check existing issues
  2. Open a new issue with details
  3. Include screenshots and logs if possible
