
Point and Talk: How Clicky’s AI Interface Works

May 3, 2026

Clicky is an AI buddy made by Farza that lives on your Mac. You press a key, talk, and a glowing blue triangle flies across your screen, points at whatever you asked about, and talks you through the answer.

Most apps with an agent should have something like this. Point and talk is closer to how you'd ask a person sitting next to you for help than anything I do with a chat box. So I decided to figure out how it works.

Farza released an open-source prototype of Clicky that this post is based on. The product version has more features and runs more smoothly (it can click things, spawn background agents, etc.).

Demo via Farza on X.

The triangle and the overlay

Three layers sit stacked on top of each other: your normal desktop at the bottom, a transparent overlay window per screen, and the triangle drawn inside that overlay.

Clicky creates one transparent full-screen overlay window per screen. These windows let clicks pass through, float above everything (including menus and popups), and join all desktops so the buddy (the flying blue triangle) follows you. Inside each window, Clicky draws the triangle wherever it wants the buddy to be.
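In AppKit terms, that's a handful of window properties. A minimal sketch of the setup (the property names are real AppKit; the exact window level Clicky uses is my guess):

import AppKit

// one borderless, transparent, click-through window per screen
func makeOverlayWindow(for screen: NSScreen) -> NSWindow {
    let window = NSWindow(contentRect: screen.frame,
                          styleMask: .borderless,
                          backing: .buffered,
                          defer: false)
    window.isOpaque = false
    window.backgroundColor = .clear
    window.hasShadow = false
    window.ignoresMouseEvents = true    // clicks fall through to whatever's underneath
    window.level = .screenSaver         // floats above menus and popups
    window.collectionBehavior = [.canJoinAllSpaces, .fullScreenAuxiliary]  // follows you across desktops
    window.orderFrontRegardless()
    return window
}

let overlays = NSScreen.screens.map(makeOverlayWindow)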


This sidesteps the question of "how do you move a cursor in another app." Clicky draws its own blue triangle as its cursor, and the user’s cursor works normally underneath the invisible overlay. The triangle has no idea what app is below it. It's just in its own little window.

The flight animation makes it feel alive instead of mechanical. Clicky picks a midpoint between the buddy and the target, lifts it so the path arcs upward, and glides the buddy along the resulting quadratic bezier curve. The triangle rotates to face its direction of travel each frame and scales up at mid-flight so it swoops.
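The math is compact. A sketch of the arc, heading, and swoop, with the lift and scale constants invented for illustration:

import CoreGraphics

// evaluate the quadratic bezier at t in [0, 1]
func quadBezier(_ t: CGFloat, _ p0: CGPoint, _ c: CGPoint, _ p1: CGPoint) -> CGPoint {
    let u = 1 - t
    return CGPoint(x: u * u * p0.x + 2 * u * t * c.x + t * t * p1.x,
                   y: u * u * p0.y + 2 * u * t * c.y + t * t * p1.y)
}

// position, rotation, and scale for one frame of the flight
func flightFrame(t: CGFloat, from start: CGPoint, to end: CGPoint)
    -> (position: CGPoint, angle: CGFloat, scale: CGFloat) {
    // midpoint between buddy and target, lifted so the path arcs upward
    let control = CGPoint(x: (start.x + end.x) / 2,
                          y: max(start.y, end.y) + 120)
    let position = quadBezier(t, start, control, end)
    // the bezier's derivative is the direction of travel
    let dx = 2 * (1 - t) * (control.x - start.x) + 2 * t * (end.x - control.x)
    let dy = 2 * (1 - t) * (control.y - start.y) + 2 * t * (end.y - control.y)
    let angle = atan2(dy, dx)            // rotate the triangle to face this
    let scale = 1 + 0.4 * sin(.pi * t)   // biggest at mid-flight, so it swoops
    return (position, angle, scale)
}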

One tag, one regex

Claude writes its answer like it would in a chat box, then types a coordinate tag at the end of the message. The app pulls that tag off with a regex. The text before the tag goes to text-to-speech. The numbers in the tag drive the cursor.

The instruction lives at the bottom of the system prompt:

when you point, append a coordinate tag at the very end of your response, AFTER your spoken text.

The prompt defines the coordinate space (origin top-left, x increases right, y increases down, dimensions taken from each screenshot's label) before getting to the format:

format: [POINT:x,y:label] where x,y are integer pixel coordinates in the screenshot's coordinate space, and label is a short 1-3 word description of the element (like "search bar" or "save button"). if the element is on the cursor's screen you can omit the screen number. if the element is on a DIFFERENT screen, append :screenN where N is the screen number from the image label (e.g. :screen2).

if pointing wouldn't help, append [POINT:none].
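Pulling the tag back off is one regex anchored to the end of the reply. A sketch (my regex, not necessarily the repo's):

import Foundation

struct PointTag {
    let x: Int, y: Int
    let label: String
    let screen: Int?   // nil means the cursor's screen
}

// split Claude's reply into spoken text and an optional coordinate tag
func splitTag(from reply: String) -> (spoken: String, tag: PointTag?) {
    // [POINT:none] means pointing wouldn't help
    if let m = reply.firstMatch(of: #/\[POINT:none\]\s*$/#) {
        return (String(reply[..<m.range.lowerBound])
                    .trimmingCharacters(in: .whitespacesAndNewlines), nil)
    }
    // [POINT:x,y:label] with an optional :screenN suffix
    guard let m = reply.firstMatch(of: #/\[POINT:(\d+),(\d+):([^:\]]+)(?::screen(\d+))?\]\s*$/#),
          let x = Int(m.1), let y = Int(m.2) else {
        return (reply, nil)   // no tag: the whole reply is spoken
    }
    let spoken = String(reply[..<m.range.lowerBound])
        .trimmingCharacters(in: .whitespacesAndNewlines)
    return (spoken, PointTag(x: x, y: y,
                             label: String(m.3),
                             screen: m.4.flatMap { Int($0) }))
}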

But Claude can only point at what it sees.

What Claude sees

When you release the hotkey, Clicky takes one screenshot per monitor, resizes each to a max of 1280 pixels per dimension, and encodes it at 80% quality. That's enough for Claude to read most UI text. Clicky also filters out its own windows, so Claude sees what the user sees, minus Clicky.
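A sketch of the capture step, assuming CGDisplayCreateImage and JPEG encoding (the repo may do it differently; the 1280 and 80% numbers come from above, and filtering out Clicky's own windows is extra work this skips):

import AppKit

// grab one display, cap the longest side at 1280, encode at 80% quality
func captureJPEG(display: CGDirectDisplayID, maxDimension: CGFloat = 1280) -> Data? {
    guard let full = CGDisplayCreateImage(display) else { return nil }
    let scale = min(1, maxDimension / CGFloat(max(full.width, full.height)))
    let size = NSSize(width: CGFloat(full.width) * scale,
                      height: CGFloat(full.height) * scale)

    let resized = NSImage(size: size)
    resized.lockFocus()
    NSImage(cgImage: full, size: .zero).draw(in: NSRect(origin: .zero, size: size))
    resized.unlockFocus()

    guard let tiff = resized.tiffRepresentation,
          let rep = NSBitmapImageRep(data: tiff) else { return nil }
    return rep.representation(using: .jpeg, properties: [.compressionFactor: 0.8])
}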

Each screenshot ships with a label:

screen 1 of 2 — cursor is on this screen (primary focus) (image dimensions: 1280x831 pixels)

The label does two things:

  1. Tells Claude which screen has the cursor and to prioritize it, because users normally ask about whatever they're looking at.
  2. Gives Claude the coordinate space for picking a point, via the image dimensions.

Alongside the screenshots come the user's transcript (from AssemblyAI streaming transcription) and the last 10 exchanges so Clicky can answer follow-ups like "what about the other one."
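Assembled, the request content interleaves each label with its image and ends with the transcript, in Anthropic's messages format. Roughly (the exact assembly is an assumption):

import Foundation

// build the content blocks for one user turn: label, image, label, image, ..., transcript
func buildUserContent(screens: [(jpeg: Data, label: String)], transcript: String) -> [[String: Any]] {
    var content: [[String: Any]] = []
    for screen in screens {
        content.append(["type": "text", "text": screen.label])
        content.append(["type": "image",
                        "source": ["type": "base64",
                                   "media_type": "image/jpeg",
                                   "data": screen.jpeg.base64EncodedString()]])
    }
    content.append(["type": "text", "text": transcript])
    return content
}

// the last 10 exchanges ride along as earlier entries in the messages array,
// and the buddy-voice system prompt goes in the top-level system field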

The same system prompt gives it the casual "buddy" vibes. I like that the prompt is written in the voice it asks for:

you're clicky, a friendly always-on companion that lives in the user's menu bar. the user just spoke to you via push-to-talk and you can see their screen(s). your reply will be spoken aloud via text-to-speech, so write the way you'd actually talk. this is an ongoing conversation — you remember everything they've said before.

rules:

  • default to one or two sentences. be direct and dense. BUT if the user asks you to explain more, go deeper, or elaborate, then go all out — give a thorough, detailed explanation with no length limit.
  • all lowercase, casual, warm. no emojis.
  • write for the ear, not the eye. short sentences. no lists, bullet points, markdown, or formatting — just natural speech.
  • never say "simply" or "just".
  • don't read out code verbatim. describe what the code does or what needs to change conversationally.

The rule I didn't expect was about how to end:

don't end with simple yes/no questions like "want me to explain more?" or "should i show you?" — those are dead ends that force the user to just say yes. instead, when it fits naturally, end by planting a seed — mention something bigger or more ambitious they could try, a related concept that goes deeper, or a next-level technique that builds on what you just explained.

Most assistants close with "want me to keep going?" Clicky is told to plant a hook instead. That's a small choice and it changes the whole texture of the conversation.

Clicky's text goes to ElevenLabs, which turns it into audio.

That's everything Claude gets:

  • a screenshot per monitor with the dimensions in the label
  • a transcript
  • the last 10 exchanges
  • a one-screen system prompt

The triangle is about to fly.

Coordinate math

Claude returns coordinates in the screenshot's pixel space. Getting the buddy to that spot on the display takes three transforms.

The screenshot is, say, 1280 by 831 pixels with the origin in the top-left. The display puts its origin in the bottom-left, so the Y axis is flipped. Clicky clamps the coordinate to the screenshot bounds, scales up to display size, and flips the Y axis. Then it adds the display's offset, because macOS lays out multiple monitors on one shared grid and each monitor's origin sits wherever macOS placed it.
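As a sketch, using NSScreen for the display's size and its origin on the shared grid:

import AppKit

// screenshot pixel space -> macOS global display space
func toGlobalPoint(tagX: Int, tagY: Int,
                   screenshotSize: CGSize, screen: NSScreen) -> CGPoint {
    // clamp to the screenshot bounds
    let x = min(max(CGFloat(tagX), 0), screenshotSize.width)
    let y = min(max(CGFloat(tagY), 0), screenshotSize.height)

    // scale from screenshot pixels up to display points
    let frame = screen.frame
    let scaledX = x / screenshotSize.width * frame.width
    let scaledY = y / screenshotSize.height * frame.height

    // flip Y (screenshot origin is top-left, display origin is bottom-left)
    // and add the display's offset on the shared monitor grid
    return CGPoint(x: frame.origin.x + scaledX,
                   y: frame.origin.y + (frame.height - scaledY))
}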

That global coordinate goes to the overlay windows. Each one decides whether the target is on its screen and starts the bezier flight if it is.

The overlay then converts that global AppKit point into its local SwiftUI coordinate space, nudges the target right and down, and clamps it inside the screen padding so the triangle points beside the element instead of sitting on top of it.
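The last hop, sketched with invented nudge and padding values:

import AppKit

// global AppKit point -> this overlay window's SwiftUI space
func toLocalPoint(global: CGPoint, screen: NSScreen,
                  nudge: CGFloat = 24, padding: CGFloat = 16) -> CGPoint {
    // each overlay first checks screen.frame.contains(global) before animating
    let frame = screen.frame
    // SwiftUI's origin is the window's top-left; AppKit's is the screen's bottom-left
    var local = CGPoint(x: global.x - frame.minX,
                        y: frame.maxY - global.y)
    local.x += nudge   // sit beside the element instead of on top of it
    local.y += nudge
    local.x = min(max(local.x, padding), frame.width - padding)
    local.y = min(max(local.y, padding), frame.height - padding)
    return local
}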

None of this is exotic. It's just that every coordinate system in macOS disagrees with every other one. Three disagreements in one pipeline: screenshot vs display Y-axis, per-monitor offsets on the global grid, and AppKit's point vs SwiftUI's point inside a single window. This is the only place doing real custom work between Claude's text output and the buddy moving across your screen.

The flow

Here it is end to end: hotkey release → screenshot per monitor + transcript → Claude → spoken text + [POINT] tag → ElevenLabs audio + coordinate transforms → the triangle's bezier flight.

That's the point-and-talk path. No accessibility UI-tree inspection for finding targets, no agent loop, no robot driving the real mouse. The app needs Accessibility permission, but only for the hotkey. The pointing is a screenshot and a text tag.

Two small things

The hotkey doesn't intercept. Clicky watches ctrl+option without claiming it. Hold the modifiers while typing in another app and that app still sees them.
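That's a global event monitor, which by design can only observe events, never consume them:

import AppKit

// watch ctrl+option without claiming it from other apps
let monitor = NSEvent.addGlobalMonitorForEvents(matching: .flagsChanged) { event in
    let held = event.modifierFlags.intersection(.deviceIndependentFlagsMask)
    if held == [.control, .option] {
        // hotkey down: start listening
    } else {
        // hotkey up: stop, screenshot, transcribe, call Claude
    }
}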

The AI keys live in a proxy. A Cloudflare Worker proxies Anthropic and ElevenLabs and hands out short-lived AssemblyAI tokens for the websocket. The real AI keys never leave the server. For the core voice-and-vision path, the app calls Worker endpoints instead of shipping those provider keys.
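From the app's side that's plain HTTPS to the Worker. A sketch with a hypothetical route:

import Foundation

// no Anthropic key in the request; the Worker injects the real one server-side
func callClaude(body: Data) async throws -> Data {
    var request = URLRequest(url: URL(string: "https://clicky-proxy.example.workers.dev/anthropic/v1/messages")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = body
    let (data, _) = try await URLSession.shared.data(for: request)
    return data
}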

Why point and talk

The chat box isn't a great interface for an LLM. You describe your screen in words to a model that can already see it, then translate the answer back into where to click. A blue triangle landing on a button is a more natural way to say "click here" than "in the top right, between the search field and the avatar, you'll see a small icon."

V1 Clicky is open source and MIT licensed. If you've been building yet another chat box, go read the code and steal the shape.
