Take Picture and Click Skill - Skill

← Back to Workflow

You are an expert at locating and clicking precise UI elements on Windows using programmatic methods instead of visual coordinate guessing.

Why This Skill Exists

When an AI views a screenshot via view_file, the image is automatically downscaled by the vision pipeline. Pixel coordinates estimated from the displayed image are systematically wrong — potentially off by hundreds of pixels. This has caused the AI to click the wrong UI element (e.g., Antigravity's close button instead of an Edge tab), crashing the IDE.

Rule

To click a small UI element, NEVER guess absolute pixel coordinates from a full-screen screenshot. Use one of the methods below.

Method 0: MCP Vision Server (Zero-Code Extension)

If the MCP Vision Server is installed on your system (see new_machine_setup.md), you DO NOT need to use any of the manual scripts below. The MCP Vision Server natively provides OpenInterpreter-style mouse and keyboard control directly in your toolset. Always check if you have these tools available before writing manual Win32Mouse scripts.

Method 1: Coordinate Grid Overlay (Good — Vision with Anchors)

Draw labeled gridlines every 25px onto the screenshot BEFORE viewing it. Labels survive downscaling.

import cv2

img = cv2.imread('screenshot.png')
h, w = img.shape[:2]

step = 25
for x in range(0, w, step):
    cv2.line(img, (x, 0), (x, h), (0, 0, 255), 1)
    cv2.putText(img, str(x), (x + 5, 25), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 255), 2)

for y in range(0, h, step):
    cv2.line(img, (0, y), (w, y), (0, 0, 255), 1)
    cv2.putText(img, str(y), (5, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 255), 2)

cv2.imwrite('screenshot_grid.png', img)

View screenshot_grid.png instead of the raw screenshot. Read coordinates from the grid labels.

Method 2: Crop and Zoom (Good — For Tiny Targets)

View the full screenshot to identify the rough area (e.g., "top-right quadrant")
Crop a small region (~600x300) from the original full-res image using Python
View the crop — small images bypass the vision downscaler and display at 1:1
Add the cropping offset back: Native_X = Offset_X + Local_X

Method 3: Programmatic Window Bounds (Minimum Viable)

Use GetWindowRect to get exact window bounds, then calculate positions relative to those bounds:

$proc = Get-Process msedge | Where-Object { $_.MainWindowTitle -ne "" } | Select-Object -First 1
# Then use GetWindowRect as shown in computer-access.md
# Tab strip is typically at: X = WindowLeft + offset, Y = WindowTop + 18

Method 4: Scaling Normalization (CRITICAL for Win32)

(Note: We are adding this as a safeguard, though we aren't 100% certain scaling is the ONLY cause of coordinate misses. However, tests confirm Windows actively mangles Python mouse coordinates if scaling is >100%.)

If the user's Windows display is scaled to 125% or 150%, the pixel coordinates from the image will not match physical Win32 screen coordinates, and your mouse clicks will completely miss the target. Rule: When executing pyautogui or ctypes mouse clicks, ALWAYS make your Python script DPI-aware at the very top of the script:

import ctypes
# Tell Windows not to scale our coordinate inputs
ctypes.windll.shcore.SetProcessDpiAwareness(2)

Universal Vision Utilities

Utility 1: Taking Screenshots

[!CAUTION] LIBERALLY TAKE PICTURES OF THE ENTIRE SCREEN. Unlike the browser-specific skills (which only capture the viewport), this skill relies on physical coordinates. Therefore, you MUST capture the entire physical screen.

CRITICAL: The user should NEVER catch you asking a question, having an issue, or failing a script because you failed to look at what was on the screen. Do not run blind scripts that assume the UI state. You must take a screenshot to understand what is going on before, during, and after your actions.

Capture the entire screen to visually verify your actions:

# Make DPI-aware so we capture full physical resolution on scaled displays
Add-Type -TypeDefinition 'using System; using System.Runtime.InteropServices; public class DpiAware { [DllImport("user32.dll")] public static extern bool SetProcessDPIAware(); }'
[DpiAware]::SetProcessDPIAware()

Add-Type -AssemblyName System.Windows.Forms
Add-Type -AssemblyName System.Drawing
$screen = [System.Windows.Forms.Screen]::PrimaryScreen.Bounds
$bitmap = New-Object System.Drawing.Bitmap($screen.Width, $screen.Height)
$graphics = [System.Drawing.Graphics]::FromImage($bitmap)
$graphics.CopyFromScreen($screen.Location, [System.Drawing.Point]::Empty, $screen.Size)
$bitmap.Save("C:\path\to\screenshot.png")
$graphics.Dispose()
$bitmap.Dispose()

Utility 2: Win32 Mouse & Keyboard Input

If using Physical Clicks:

Add-Type @"
using System; using System.Runtime.InteropServices;
public class Win32Mouse {
    [DllImport("user32.dll")] public static extern bool SetCursorPos(int X, int Y);
    [DllImport("user32.dll")] public static extern void mouse_event(uint dwFlags, uint dx, uint dy, uint cButtons, uint dwExtraInfo);
}
"@
[Win32Mouse]::SetCursorPos(1200, 310)
Start-Sleep -Milliseconds 200
[Win32Mouse]::mouse_event(0x02, 0, 0, 0, 0)  # LEFTDOWN
[Win32Mouse]::mouse_event(0x04, 0, 0, 0, 0)  # LEFTUP

For basic scrolling or typing, use SendKeys:

[System.Windows.Forms.SendKeys]::SendWait("^{HOME}")   # Ctrl+Home
[System.Windows.Forms.SendKeys]::SendWait("{PGDN}")     # Page Down

Utility 3: Clipboard-Mediated Input

For long strings (paths, large blocks of text) in GUI input fields, do not use SendKeys character-by-character. Load into clipboard and paste:

Set-Clipboard -Value "C:\Users\user\Documents\file.txt"
[System.Windows.Forms.SendKeys]::SendWait("^v")   # Paste (Ctrl+V)
Start-Sleep -Milliseconds 300
[System.Windows.Forms.SendKeys]::SendWait("{ENTER}")