fix: use NSApplication.shared for proper WindowServer connection #2

Open
Korkyzer wants to merge 3 commits into different-ai:main from Korkyzer:fix/fullscreen-capture

Korkyzer commented Mar 4, 2026

Improve OCR text extraction quality (dark UIs, resolution, extraction logic)

Problem

The current OCR pipeline misses most text on dark-themed applications (WhatsApp, Slack, Discord, etc.). On a WhatsApp conversation with dozens of visible messages, agent-watch only captured ~80 characters (menu bar text: "File Edit Chat Call View Window Help").

Root causes identified

  1. NativeTextExtractor short-circuits OCR: The accessibility extractor runs first. If it returns ≥ minimumAccessibilityChars (even just menu bar items), OCR is skipped entirely. For WhatsApp, accessibility returns ~100 chars of sidebar/menu text, satisfying the minimum — so the Vision framework OCR never runs on the actual message content.

  2. Frame buffer resolution too low: FrameBufferStore downscales captures to maxDimension = 1280, which halves Retina resolution (2560 → 1280). Text becomes too small for reliable OCR, especially in dense UIs.

  3. Apple Vision framework struggles with dark themes: VNRecognizeTextRequest performs poorly on light-on-dark text. The Vision framework was designed primarily for document scanning (dark text on light backgrounds).

Changes

1. NativeTextExtractor.swift — Always run both extractors, keep the best

Before: Accessibility runs first; if it returns enough chars, OCR is skipped.
After: Both accessibility AND OCR always run. The result with more text wins.

// Before
if let accessibilityText = accessibilityExtractor.extractText(),
   accessibilityText.count >= minimumAccessibilityChars {
    return ExtractedText(text: accessibilityText, source: .accessibility, metadata: metadata)
}
// OCR only runs as fallback

// After
let accessibilityText = accessibilityExtractor.extractText()
var ocrText: String? = nil
if ocrEnabled {
    ocrText = try ocrExtractor.extractText()
}
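// Return whichever source extracted more text (ties go to accessibility).
let accLen = accessibilityText?.count ?? 0
let ocrLen = ocrText?.count ?? 0
if let ocrText, ocrLen > accLen {
    // `.ocr` is assumed to be the source case behind text_source "ocr"
    return ExtractedText(text: ocrText, source: .ocr, metadata: metadata)
}
if let accessibilityText {
    return ExtractedText(text: accessibilityText, source: .accessibility, metadata: metadata)
}
// Otherwise neither extractor produced text.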

2. FrameBufferStore.swift — Increase resolution to full Retina

// Before
maxDimension: Int = 1280

// After
maxDimension: Int = 2560

Disk impact: frames go from ~250KB to ~400-800KB. With the existing retention/pruning policy this remains well under control.
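
To make the resolution change concrete, here is a minimal sketch of how a maxDimension cap could translate into frame sizes. The function name scaledSize and its rounding behaviour are assumptions for illustration, not the actual FrameBufferStore code.

```swift
/// Hypothetical helper: scales a capture so its longest side does not exceed
/// `maxDimension`, preserving aspect ratio. Smaller captures are left as-is.
func scaledSize(width: Int, height: Int, maxDimension: Int) -> (width: Int, height: Int) {
    let longest = max(width, height)
    guard longest > maxDimension else { return (width, height) }
    let scale = Double(maxDimension) / Double(longest)
    return (
        width: Int((Double(width) * scale).rounded()),
        height: Int((Double(height) * scale).rounded())
    )
}

let before = scaledSize(width: 2560, height: 1662, maxDimension: 1280)
let after = scaledSize(width: 2560, height: 1662, maxDimension: 2560)
print(before, after)
```

For a 2560×1662 capture, a cap of 1280 halves both sides to 1280×831, while a cap of 2560 leaves the frame untouched. These are the two frame resolutions reported in this PR.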

3. OCRTextExtractor.swift — Color inversion for dark themes

Runs OCR twice: once on the original image, once on a color-inverted version (using CoreImage CIColorInvert). Keeps whichever result contains more text. Also lowered minimumTextHeight from 0.005 to 0.002 to catch smaller text.
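
The two-pass approach described here could look roughly like the sketch below. It is not the code from OCRTextExtractor.swift: the function names invertColors, recognizeText and recognizeTextTryingInversion are hypothetical, and the recognition settings other than minimumTextHeight are assumptions. It assumes a recent macOS SDK in which `VNRecognizeTextRequest.results` is typed as `[VNRecognizedTextObservation]?`.

```swift
import CoreImage
import Vision

/// Returns a color-inverted copy of `image`, or nil if filtering or rendering fails.
func invertColors(of image: CGImage, context: CIContext = CIContext()) -> CGImage? {
    guard let filter = CIFilter(name: "CIColorInvert") else { return nil }
    let input = CIImage(cgImage: image)
    filter.setValue(input, forKey: kCIInputImageKey)
    guard let output = filter.outputImage else { return nil }
    return context.createCGImage(output, from: input.extent)
}

/// Runs Vision text recognition and joins the top candidate of each observation.
func recognizeText(in image: CGImage, minimumTextHeight: Float = 0.002) throws -> String {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate            // assumed setting
    request.minimumTextHeight = minimumTextHeight   // fraction of image height
    try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
    let observations = request.results ?? []
    return observations
        .compactMap { $0.topCandidates(1).first?.string }
        .joined(separator: "\n")
}

/// OCRs the original and the inverted image, keeping whichever yields more text.
func recognizeTextTryingInversion(in image: CGImage) throws -> String {
    let original = try recognizeText(in: image)
    guard let inverted = invertColors(of: image) else { return original }
    let invertedText = try recognizeText(in: inverted)
    return invertedText.count > original.count ? invertedText : original
}
```

Comparing by character count is a simple heuristic: it favors the pass that recognized more text, not necessarily the more accurate one. The second pass also roughly doubles OCR time per frame.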

4. LaunchAgent — Fix TCC accessibility permission persistence

Problem: accessibilityGranted always showed false after reboot, even with AgentWatch toggled ON in System Settings.

Root cause: The LaunchAgent ran a bash script (launch) as intermediary. macOS TCC grants permissions per-process based on code signing identity. The launch bash script had its own identity (launch-3c34cc36...), so when the user granted accessibility to "AgentWatch.app", the TCC entry didn't match the actual process calling AXIsProcessTrusted().

Fix: Split into two separate LaunchAgents that launch the agent-watch binary directly:

  • com.differentai.agentwatch.plist → agent-watch daemon
  • com.differentai.agentwatch.serve.plist → agent-watch serve --host 127.0.0.1 --port 41733

Both have RunAtLoad: true. The TCC grant now correctly matches the process identity and persists across reboots.
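
One way to confirm that the grant matches the launched process is to query TCC from inside that process. The sketch below uses the real `AXIsProcessTrusted()` and `CGPreflightScreenCaptureAccess()` calls. The PermissionStatus type and its field names only mirror the flags mentioned in this thread and are hypothetical. How agent-watch actually computes them is an assumption.

```swift
import ApplicationServices
import CoreGraphics

/// Hypothetical container mirroring the flags discussed in this PR.
struct PermissionStatus {
    let accessibilityGranted: Bool
    let screenRecordingGranted: Bool
}

/// Queries TCC for the *current* process. Both calls only read the state
/// and do not trigger a permission prompt.
func currentPermissionStatus() -> PermissionStatus {
    PermissionStatus(
        accessibilityGranted: AXIsProcessTrusted(),
        screenRecordingGranted: CGPreflightScreenCaptureAccess()
    )
}

let status = currentPermissionStatus()
print("accessibilityGranted=\(status.accessibilityGranted)",
      "screenRecordingGranted=\(status.screenRecordingGranted)")
```

Because the answer depends on which process asks, the check has to run in the binary the LaunchAgent starts. Running it from a terminal reports the terminal's grants instead.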

Results

| Metric | Before | After |
|---|---|---|
| WhatsApp text captured | ~80 chars (menu bar only) | 1567 chars (all messages, contacts, timestamps, links) |
| Frame resolution | 1280×831 | 2560×1662 |
| text_source for WhatsApp | accessibility (short-circuited) | ocr (full Vision + inversion) |
| accessibilityGranted after reboot | false (TCC mismatch) | true (direct binary launch) |

Environment

  • macOS 26.2 (Tahoe)
  • MacBook Pro M-series (Retina display)
  • WhatsApp desktop, Slack, Discord (dark theme)

Related

Korkyzer added 2 commits March 4, 2026 13:34
CGDisplayCreateImage returns desktop wallpaper instead of actual screen
content when running as a background process. This is because
RunLoop.main.run() does not establish a WindowServer connection, so
macOS TCC silently denies real screen capture.

Fix: Initialize NSApplication.shared with .accessory activation policy
before setting up capture timers, and use app.run() instead of
RunLoop.main.run().
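
A minimal sketch of the startup change this commit message describes, assuming the daemon's entry point is a plain main.swift. The helper name makeAccessoryApp is hypothetical and the timer setup is elided.

```swift
import AppKit

/// Hypothetical helper: obtains the shared application object and marks the
/// process as an accessory app (no Dock icon, no menu bar).
@MainActor
func makeAccessoryApp() -> NSApplication {
    let app = NSApplication.shared
    app.setActivationPolicy(.accessory)
    return app
}

// In main.swift, roughly:
//   let app = makeAccessoryApp()
//   ... schedule capture timers here ...
//   app.run()   // replaces RunLoop.main.run(); blocks until the app terminates
```

The `.accessory` policy keeps the process out of the Dock while still letting it run as a regular AppKit application.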

@benjaminshafii (Member) commented

Thanks. But why the app icon though? Is it necessary for this to be successfully registered?

@benjaminshafii (Member) commented

lemme know @Korkyzer

Korkyzer (Author) commented Mar 4, 2026

@benjaminshafii Honestly it's not strictly necessary, just added it to make the bundle feel more complete/clean. Happy to remove it if you'd prefer to keep the PR focused on the fix only.

- maxDimension 1280→2560, jpegQuality 0.45→0.85
- Run both accessibility + OCR, keep longer result
- Color-invert frames before OCR for dark UIs
- minimumTextHeight 0.005→0.002

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Korkyzer (Author) commented Mar 4, 2026

TCC permissions breaking after codesign (fix)

If accessibilityGranted or screenRecordingGranted show false in /status even though they're toggled ON in System Settings, the issue is the launch bash script in the LaunchAgent.

Root cause: The LaunchAgent runs Contents/MacOS/launch (a bash script) which spawns agent-watch serve and execs agent-watch daemon. macOS TCC can't properly associate processes spawned through a bash intermediary with the app bundle's code signature — so grants don't match.

Fix: Split into two LaunchAgents that launch agent-watch directly:

<!-- com.differentai.agentwatch.plist -->
<key>ProgramArguments</key>
<array>
    <string>/path/to/AgentWatch.app/Contents/MacOS/agent-watch</string>
    <string>daemon</string>
</array>
<!-- com.differentai.agentwatch.serve.plist -->
<key>ProgramArguments</key>
<array>
    <string>/path/to/AgentWatch.app/Contents/MacOS/agent-watch</string>
    <string>serve</string>
    <string>--host</string>
    <string>127.0.0.1</string>
    <string>--port</string>
    <string>41733</string>
</array>

After any codesign --force, you need to:

  1. tccutil reset Accessibility com.differentai.agentwatch.app
  2. tccutil reset ScreenCapture com.differentai.agentwatch.app
  3. Re-add ~/Applications/AgentWatch.app in both Accessibility and Screen Recording settings (Cmd+Shift+G → ~/Applications)
  4. Restart both LaunchAgents

Also pushed OCR improvements to my fork (fix/fullscreen-capture) — captures went from ~18-78 chars (menu bars only) to 1000-2600 chars (full page content).
