Goal: Measure how well robust detection finds elements at click locations.
Related Documents:
- Literature Review - SOTA analysis of UI grounding methods
- Experiment Plan - Detailed comparison of OmniParser vs UI-TARS
Based on our literature review, we are comparing multiple approaches:
| Method | Expected Accuracy | Notes |
|---|---|---|
| OmniParser (baseline) | ~40% | Current approach |
| OmniParser + ScreenSeekeR | ~55% | LLM-guided cropping |
| UI-TARS 1.5 | ~62% | SOTA model |
| UI-TARS + ScreenSeekeR | ~70%+ | Combined approach |
See the experiment plan for full methodology.
| Question | Metric |
|---|---|
| Can we find the clicked element? | Detection Rate |
| How accurate is the bounding box? | IoU (Intersection over Union) |
| How many attempts are needed? | Attempts to Success |
| How long does it take? | Latency (ms) |
| Do we find the wrong element? | False Positive Rate |
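The harness later in this document records detection and IoU per element but does not compute the false positive rate from the table above. One possible reading, sketched here, treats a reported detection as wrong when its IoU with the ground truth falls below a threshold. The `Outcome` class, the function name, and the 0.5 threshold are assumptions for illustration, not part of the project. Strictly speaking, this ratio is the share of wrong results among reported detections.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class Outcome:
    """Hypothetical per-element outcome: was a detection reported, and how well did it overlap?"""
    detected: bool
    iou: float

def false_positive_rate(outcomes: Iterable[Outcome], iou_threshold: float = 0.5) -> float:
    """Share of reported detections whose IoU with the ground truth is below iou_threshold."""
    reported = [o for o in outcomes if o.detected]
    if not reported:
        return 0.0  # Nothing was reported, so nothing can be wrong
    wrong = sum(1 for o in reported if o.iou < iou_threshold)
    return wrong / len(reported)

outcomes = [Outcome(True, 0.9), Outcome(True, 0.1), Outcome(False, 0.0), Outcome(True, 0.6)]
print(false_positive_rate(outcomes))  # one wrong detection out of three reported
```

Misses (`detected=False`) are excluded from the denominator, so this metric stays independent of the detection rate.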
For each test sample, we need:
- Screenshot image
- Click location (x, y in pixels)
- Ground truth bounding box of clicked element
- Element metadata (text, type)
evaluation/
├── datasets/
│   ├── synthetic/          # Generated UIs (automatic ground truth)
│   │   ├── samples/
│   │   │   ├── 001.png
│   │   │   ├── 002.png
│   │   │   └── ...
│   │   └── annotations.json
│   │
│   ├── curated/            # Real screenshots (manual annotation)
│   │   ├── samples/
│   │   └── annotations.json
│   │
│   └── recorded/           # From real user sessions
│       ├── samples/
│       └── annotations.json
│
├── results/                # Evaluation outputs
│   ├── baseline/
│   └── robust/
│
└── harness.py              # Evaluation runner
{
  "version": "1.0",
  "dataset": "curated",
  "samples": [
    {
      "id": "sample_001",
      "image": "samples/001.png",
      "width": 1920,
      "height": 1080,
      "elements": [
        {
          "id": "elem_001",
          "bbox": [0.10, 0.20, 0.25, 0.28],
          "text": "Login",
          "type": "button",
          "click_point": [0.175, 0.24]
        },
        {
          "id": "elem_002",
          "bbox": [0.10, 0.32, 0.25, 0.40],
          "text": "Sign Up",
          "type": "button",
          "click_point": [0.175, 0.36]
        }
      ]
    }
  ]
}

Note: All coordinates are normalized (0-1).
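Because every later step depends on these coordinates, it may help to sanity-check annotations before running an evaluation. The sketch below assumes the `[x1, y1, x2, y2]` ordering suggested by the sample values (the click point sits at the box center under that reading). `validate_element` and `bbox_to_pixels` are hypothetical helper names, not existing project functions.

```python
from typing import List

def validate_element(elem: dict) -> List[str]:
    """Return a list of problems found in one annotated element (empty if none)."""
    problems = []
    x1, y1, x2, y2 = elem["bbox"]
    cx, cy = elem["click_point"]
    if not all(0.0 <= v <= 1.0 for v in (x1, y1, x2, y2, cx, cy)):
        problems.append("coordinate outside [0, 1]")
    if x2 <= x1 or y2 <= y1:
        problems.append("bbox has non-positive width or height")
    if not (x1 <= cx <= x2 and y1 <= cy <= y2):
        problems.append("click_point outside bbox")
    return problems

def bbox_to_pixels(bbox, width: int, height: int) -> List[int]:
    """Scale a normalized [x1, y1, x2, y2] box to integer pixel coordinates."""
    x1, y1, x2, y2 = bbox
    return [round(x1 * width), round(y1 * height), round(x2 * width), round(y2 * height)]

login = {"id": "elem_001", "bbox": [0.10, 0.20, 0.25, 0.28], "click_point": [0.175, 0.24]}
print(validate_element(login))                    # []
print(bbox_to_pixels(login["bbox"], 1920, 1080))  # [192, 216, 480, 302]
```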
Generate fake UIs with known ground truth:
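import random
from PIL import Image, ImageDraw

def random_color(rng: random.Random) -> tuple:
    """Pick a dark RGB fill so the white label stays legible.

    random_color was not defined in the original; this is one possible implementation.
    """
    return tuple(rng.randint(0, 160) for _ in range(3))

def generate_synthetic_sample(seed: int) -> dict:
    """Generate a synthetic UI with random buttons/text."""
    rng = random.Random(seed)  # Private RNG so each seed reproduces the same sample
    width, height = 800, 600
    img = Image.new('RGB', (width, height), '#f0f0f0')
    draw = ImageDraw.Draw(img)
    elements = []
    # Random buttons (positions are independent, so buttons may overlap)
    for i in range(rng.randint(3, 8)):
        x = rng.randint(50, 600)
        y = rng.randint(50, 500)
        w = rng.randint(80, 150)
        h = rng.randint(30, 50)
        # The label list was elided ("...") in the original; extend as needed
        text = rng.choice(["Submit", "Cancel", "OK", "Save", "Delete"])
        draw.rectangle([x, y, x + w, y + h], fill=random_color(rng))
        draw.text((x + 10, y + 10), text, fill='white')
        elements.append({
            "id": f"elem_{i + 1:03d}",  # The harness looks up elem["id"]
            "bbox": [x / width, y / height, (x + w) / width, (y + h) / height],
            "text": text,
            "type": "button",
            "click_point": [(x + w / 2) / width, (y + h / 2) / height]
        })
    return {"image": img, "elements": elements}

- Pros: Unlimited samples, perfect ground truth
- Cons: May not reflect real-world complexity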
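The generator above places each button independently, so two buttons can overlap and the element under a click point becomes ambiguous. A minimal sketch of one way to avoid this is rejection sampling: draw a candidate box and keep it only if it does not intersect any box already placed. The function names and the `max_tries` limit are assumptions for illustration. The coordinate ranges mirror the generator's.

```python
import random

def boxes_overlap(a, b) -> bool:
    """True if two [x1, y1, x2, y2] boxes share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def place_buttons(rng: random.Random, count: int, max_tries: int = 200):
    """Rejection-sample up to `count` pixel boxes that do not overlap each other."""
    placed = []
    for _ in range(max_tries):
        if len(placed) == count:
            break
        x, y = rng.randint(50, 600), rng.randint(50, 500)
        w, h = rng.randint(80, 150), rng.randint(30, 50)
        candidate = [x, y, x + w, y + h]
        # Keep the candidate only if it is clear of every box placed so far
        if not any(boxes_overlap(candidate, other) for other in placed):
            placed.append(candidate)
    return placed

boxes = place_buttons(random.Random(0), count=5)
print(len(boxes), "non-overlapping boxes placed")
```

If the canvas is crowded, the function may return fewer than `count` boxes rather than loop forever, which is usually the safer failure mode for a dataset generator.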
- Collect screenshots from real apps
- Run OmniParser to get initial detections
- Human review in annotation tool:
  - Accept correct detections
  - Adjust incorrect bounding boxes
  - Add missed elements
  - Mark representative click points
- Export to annotation format
Simple terminal-based annotation:
from PIL import Image

# omniparser, show_element, get_click_point, edit_element and manual_annotate
# are project helpers assumed to be defined elsewhere.

def annotate_sample(image_path: str) -> dict:
    """Interactive CLI annotation."""
    img = Image.open(image_path)
    elements = omniparser.parse(img)
    print(f"OmniParser found {len(elements)} elements")
    print("For each element, enter: [a]ccept, [r]eject, [e]dit, [s]kip")
    accepted = []
    for i, elem in enumerate(elements):
        show_element(img, elem)  # Display with bbox highlighted
        choice = input(f"Element {i+1} '{elem.get('content', '')[:20]}': ").strip().lower()
        if choice == 'a':
            click = get_click_point(elem)  # Center of bbox
            accepted.append({**elem, "click_point": click})
        elif choice == 'e':
            edited = edit_element(elem)  # Manual bbox adjustment
            # Recompute the click point from the adjusted bbox
            accepted.append({**edited, "click_point": get_click_point(edited)})
        # 'r' (reject) and 's' (skip) both leave the element out
    # Add missed elements
    while input("Add missed element? [y/n]: ").strip().lower() == 'y':
        elem = manual_annotate(img)
        accepted.append(elem)
    return {"image": image_path, "elements": accepted}

Record real user interactions:
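def record_session() -> list:
    """Record clicks with screenshots until interrupted with Ctrl+C."""
    # wait_for_click, capture_screenshot and save_screenshot are project
    # helpers assumed to be defined elsewhere.
    samples = []
    try:
        while True:
            # Wait for click
            click_xy = wait_for_click()
            # Note: captured after the click, so the UI may already have changed
            screenshot = capture_screenshot()
            # Human labels what they clicked
            label = input("What did you click? (text/description): ")
            samples.append({
                "image": save_screenshot(screenshot),
                "click_xy": click_xy,
                "label": label,
                # Ground truth bbox determined by human later
            })
    except KeyboardInterrupt:
        pass  # Ctrl+C ends the session
    return samples

from dataclasses import dataclass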
from pathlib import Path
from typing import List
import json
import time

from PIL import Image

@dataclass
class EvalResult:
    sample_id: str
    element_id: str
    detected: bool
    iou: float
    attempts: int
    latency_ms: float
    transform_used: str

def evaluate_dataset(
    detector,
    dataset_path: str,
    output_path: str
) -> dict:
    """Run evaluation on a dataset."""
    with open(dataset_path) as f:
        dataset = json.load(f)
    # Image paths in annotations.json are relative to the dataset directory
    dataset_dir = Path(dataset_path).parent
    results = []
    for sample in dataset["samples"]:
        img = Image.open(dataset_dir / sample["image"])
        w, h = img.size
        for elem in sample["elements"]:
            # Convert normalized click to pixels
            cx = int(elem["click_point"][0] * w)
            cy = int(elem["click_point"][1] * h)
            # Run detection (perf_counter is monotonic, so it suits timing)
            start = time.perf_counter()
            result = detector.detect(img, (cx, cy))
            latency = (time.perf_counter() - start) * 1000
            # Calculate IoU if detected. elem["bbox"] is normalized (0-1),
            # so result.bbox must be in the same coordinate space.
            iou = 0.0
            if result.found and result.bbox:
                iou = calculate_iou(elem["bbox"], result.bbox)
            results.append(EvalResult(
                sample_id=sample["id"],
                element_id=elem["id"],
                detected=result.found,
                iou=iou,
                attempts=result.attempts,
                latency_ms=latency,
                transform_used=result.successful_transform or "none"
            ))
    # Aggregate metrics
    metrics = compute_metrics(results)
    # Save results (save_results is assumed to be defined elsewhere)
    save_results(results, metrics, output_path)
    return metrics
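from collections import Counter
from statistics import mean

def compute_metrics(results: List[EvalResult]) -> dict:
    """Compute aggregate metrics."""
    total = len(results)
    if total == 0:
        # Avoid division by zero and mean() of an empty sequence
        return {"detection_rate": 0.0, "mean_iou": 0.0, "mean_attempts": 0.0,
                "mean_latency_ms": 0.0, "transform_breakdown": {}}
    detected_ious = [r.iou for r in results if r.detected]
    return {
        "detection_rate": len(detected_ious) / total,
        # Mean IoU over detected elements only
        "mean_iou": mean(detected_ious) if detected_ious else 0.0,
        "mean_attempts": mean(r.attempts for r in results),
        "mean_latency_ms": mean(r.latency_ms for r in results),
        # Counter stands in for the undefined count_transforms helper
        "transform_breakdown": dict(Counter(r.transform_used for r in results)),
    }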
def calculate_iou(bbox1, bbox2) -> float:
    """Calculate Intersection over Union.

    Both boxes are [x1, y1, x2, y2] in the same coordinate space.
    """
    x1 = max(bbox1[0], bbox2[0])
    y1 = max(bbox1[1], bbox2[1])
    x2 = min(bbox1[2], bbox2[2])
    y2 = min(bbox1[3], bbox2[3])
    if x2 <= x1 or y2 <= y1:
        return 0.0
    intersection = (x2 - x1) * (y2 - y1)
    area1 = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
    area2 = (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1])
    union = area1 + area2 - intersection
    return intersection / union if union > 0 else 0.0

# Generate synthetic dataset
uv run python -m openadapt_grounding.eval generate --type synthetic --count 500
# Run OmniParser baseline
uv run python -m openadapt_grounding.eval run --method omniparser --dataset synthetic
# Run OmniParser with ScreenSeekeR cropping
uv run python -m openadapt_grounding.eval run --method omniparser-screenseeker --dataset synthetic
# Run UI-TARS baseline
uv run python -m openadapt_grounding.eval run --method uitars --dataset synthetic
# Compare all methods
uv run python -m openadapt_grounding.eval compare --output results/comparison.json

============================================================
Evaluation Results: OmniParser vs UI-TARS
============================================================
Dataset: synthetic (500 samples)

Method                       Accuracy   Latency
─────────────────────────────────────────────────────
OmniParser (baseline)        ~40%       250ms
OmniParser + ScreenSeekeR    ~55%       1200ms
UI-TARS 1.5 (baseline)       ~62%       350ms
UI-TARS + ScreenSeekeR       ~70%+      1500ms

By Element Size:
Size        OmniParser   UI-TARS
<32px       ~15%         ~35%
32-100px    ~45%         ~65%
>100px      ~70%         ~85%

Failure Analysis:
- Small icons (<32px) remain hardest
- Text elements detected more reliably than icons
- ScreenSeekeR cropping helps most on small elements
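The size breakdown above needs a rule for assigning each element to a bucket. The sketch below uses the thresholds from the table, but the definition of "size" as the shorter side of the box in pixels is an assumption, as are the function names. The document does not say whether width, height, or area is meant.

```python
from collections import defaultdict

def size_bucket(bbox, width: int, height: int) -> str:
    """Bucket a normalized [x1, y1, x2, y2] box by its shorter side in pixels."""
    w_px = (bbox[2] - bbox[0]) * width
    h_px = (bbox[3] - bbox[1]) * height
    side = min(w_px, h_px)  # Assumed definition of element size
    if side < 32:
        return "<32px"
    if side <= 100:
        return "32-100px"
    return ">100px"

def detection_rate_by_size(records):
    """records: iterable of (bucket, detected) pairs -> {bucket: detection rate}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for bucket, detected in records:
        totals[bucket] += 1
        hits[bucket] += int(detected)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}

# The "Login" button from the annotation example is 288 x 86.4 px on a 1920x1080 screenshot
print(size_bucket([0.10, 0.20, 0.25, 0.28], 1920, 1080))  # 32-100px
```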
┌─────────────────────────────────────────────────────────────┐
│                      CURATION WORKFLOW                      │
└─────────────────────────────────────────────────────────────┘

1. COLLECT
   ├── Take screenshots of target applications
   ├── Include variety: light/dark themes, resolutions, apps
   └── Aim for 50-100 screenshots initially

2. AUTO-ANNOTATE
   ├── Run OmniParser on each screenshot
   └── Generate initial annotations.json

3. HUMAN REVIEW
   ├── For each detected element:
   │   ├── Accept (bbox correct, element meaningful)
   │   ├── Adjust (fix bbox boundaries)
   │   └── Reject (false positive)
   ├── Add missed elements manually
   └── Mark click points (typically bbox center)

4. VALIDATE
   ├── Run evaluation with known-good detector
   ├── Check for annotation errors
   └── Fix outliers

5. VERSION & SHARE
   ├── Commit dataset to repo
   ├── Document collection methodology
   └── Track changes over time
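For the VALIDATE step, one candidate for a "known-good detector" is an oracle that reads the annotations instead of the pixels. The sketch below infers the detector interface (`detect(img, (x, y))` returning an object with `found`, `bbox`, `attempts`, and `successful_transform`) from how the harness uses its detector. The `DetectionResult` and `OracleDetector` names are hypothetical, and one instance per sample is assumed.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

@dataclass
class DetectionResult:
    """Result shape inferred from the attributes the harness reads."""
    found: bool
    bbox: Optional[List[float]] = None
    attempts: int = 1
    successful_transform: Optional[str] = None

class OracleDetector:
    """Returns the annotated box under the click; useful only for sanity checks."""

    def __init__(self, elements: Sequence[dict], width: int, height: int):
        self.elements = elements
        self.width = width
        self.height = height

    def detect(self, img, click_xy: Tuple[int, int]) -> DetectionResult:
        # img is ignored: the oracle consults the annotations, not the pixels
        nx, ny = click_xy[0] / self.width, click_xy[1] / self.height
        for elem in self.elements:
            x1, y1, x2, y2 = elem["bbox"]
            if x1 <= nx <= x2 and y1 <= ny <= y2:
                return DetectionResult(found=True, bbox=list(elem["bbox"]))
        return DetectionResult(found=False)

elements = [{"id": "elem_001", "bbox": [0.10, 0.20, 0.25, 0.28]}]
oracle = OracleDetector(elements, width=1920, height=1080)
hit = oracle.detect(None, (336, 259))   # pixel position of the "Login" click point
miss = oracle.detect(None, (0, 0))      # top-left corner, outside every box
print(hit.found, hit.bbox, miss.found)
```

With such an oracle, any sample that is not detected with an IoU of 1.0 points at an annotation problem, for example a click point outside its box or two overlapping boxes.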
See experiment_plan.md for the detailed implementation plan.
- Deploy UI-TARS 1.5 (GPU server)
- Implement unified evaluation harness
- Create synthetic dataset generator (500 samples)
- Download ScreenSpot subset (200 samples)
- Run OmniParser baseline on all datasets
- Run UI-TARS baseline on all datasets
- Generate initial comparison plots
- Identify failure cases
- Implement ScreenSeekeR-style cropping for both models
- Run full evaluation matrix
- Compare all 6 methods (2 models × 3 cropping strategies)
- Generate final plots (bar charts, scatter, line, confusion matrix)
- Analyze failure cases by element size/type
- Write recommendations for production
- Document findings
- Detection rate by dataset (synthetic, ScreenSpot, real)
- Detection rate by element size (<32px, 32-100px, >100px)
- Latency vs accuracy tradeoff
- Cropping strategy effectiveness
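For the latency vs accuracy tradeoff, one simple analysis is to keep only the methods that no other method beats on both axes (a Pareto front). The sketch below runs on the expected figures quoted earlier in this document, which are estimates rather than measured results. The function name and the tuple layout are assumptions for illustration.

```python
def pareto_front(methods):
    """Keep methods that no other method beats on both accuracy and latency.

    methods: list of (name, accuracy_percent, latency_ms) tuples.
    """
    front = []
    for name, acc, lat in methods:
        # A method is dominated if another is at least as good on both axes
        # and strictly better on at least one
        dominated = any(
            o_acc >= acc and o_lat <= lat and (o_acc > acc or o_lat < lat)
            for _, o_acc, o_lat in methods
        )
        if not dominated:
            front.append(name)
    return front

# Expected (not measured) figures from the tables in this document
expected = [
    ("OmniParser (baseline)", 40, 250),
    ("OmniParser + ScreenSeekeR", 55, 1200),
    ("UI-TARS 1.5", 62, 350),
    ("UI-TARS + ScreenSeekeR", 70, 1500),
]
print(pareto_front(expected))
# ['OmniParser (baseline)', 'UI-TARS 1.5', 'UI-TARS + ScreenSeekeR']
```

On these estimates, OmniParser + ScreenSeekeR drops out because UI-TARS 1.5 is expected to be both more accurate and faster. The measured results may of course differ.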