# teaching ai to navigate screens
i was listening to a [podcast by machine learning street talk](https://www.youtube.com/watch?v=3ZTNps2PraM) about how to generate programs with ai. the host tim scarfe was interviewing alessandro palmarini, a researcher who came to talk about his new paper on "decompiling dreams"
![[Pasted image 20250322204737.png]]
his paper builds upon another paper from 2021, "[dreamcoder](https://arxiv.org/abs/2006.08381)", whose main idea is to use library learning to improve the search through the space of all possible programs for a given task:
![[Pasted image 20250322203308.png]]
dreamcoder is a metalearning algorithm that combines library learning and neurally guided search. it generates a bunch of candidate programs to solve a variety of tasks, then looks through the successful ones and aggregates their building blocks. a simple example could be factoring the addition function out of the code and storing it in the library, so that search gets simpler - in the next iteration dreamcoder can grab the library function instead of searching for and recreating it from scratch
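to make the library-learning idea concrete, here's a toy sketch - not dreamcoder's actual code, just an illustration of the compression step:

```python
# toy illustration of library learning (not dreamcoder's actual implementation)

# two programs found by search share the same building block written out in primitives
prog_a = ["inc", "inc", "inc"]            # adds 3 by repeated increment
prog_b = ["inc", "inc", "inc", "double"]  # adds 3, then doubles

# the compression step notices the repeated fragment and stores it in the library
library = {"add3": ["inc", "inc", "inc"]}

# in the next iteration, search can use the abstraction directly,
# so the programs it has to find are much shorter
prog_a_v2 = ["add3"]
prog_b_v2 = ["add3", "double"]
```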
alessandro proceeded to explain his contribution, which i didn't get on the first listen, but i thought it was worth understanding what he's saying. so i decided to download his paper and read it
![[2025-04-12 17-02-33.mkv]]
why do i have to do it by hand?
so where is my ai that i can press alt+d and tell "download his paper"?
such that it is titled the way i like it, and saved in the right folder?
it doesn't exist yet, and i think the reason is that the agent developers focus on the wrong level of abstraction
## primitives of screen navigation
tools like openai operator, manus, genspark, anthropic computer use, and others are fascinating. they understand the task, create a plan, follow it, try multiple times - and click all around their own little virtual machine. some of them, like manus or openai code interpreter, can create files, while coding agents like cursor are explicitly designed to do so. clicking + file operations tend to be called computer use, while visiting sites and filling out forms are known as browser use. those primitives are handled by libraries like playwright, puppeteer, and pyclick for browser and computer use, the os package in python, plus tools like [omniparser](https://github.com/microsoft/OmniParser) that read the information on the screen and identify control elements such as buttons and dropdowns
but still, they fail often. they focus too much on the primitives of screen navigation. those agents know how to click, type, and scroll but miss the forest for the trees. yann lecun gives an example of a person planning a trip from new york to shanghai. said person does not think of it in terms of oh, and now i have to pick up my left foot and take a step, pick up my right foot and take another. rather, they say: i need to get to the airport --> board the plane --> spend time on the plane --> get off in shanghai etc
language models are not designed to think in terms of "left foot right foot" aka click here, parse the screen there. they are trained (mostly) on human-written text, and human-written text does not operate at that level of detail. a model trained from scratch on interface interactions might be better at it. but until then, if we want screen navigation to be
- fast
- precise
- reproducible
we need to build a layer of abstraction in the middle - between clicking every button and making very high-level generic plans that language models gravitate towards
thus, the role of screen parsers changes from striving for real-time parsing to learning the best playwright-compatible representation of an interface - creating its imprint
## imprint of an interface
an imprint is a wrapper api. on one side of the call, it exposes endpoints to a language model that needs to navigate the screen - let's call this side the lm-in api
lm says:
```lmspeak
i need to go to youtube and search for machine learning street talk on decompiling dreams
let me call browsing.youtube.search()
```
following this, a function call / tool use is initiated
on the other side, there is the second part of the imprint api sandwich, one that has already defined and learned the clicks on a given interface - let's call it the ui-out api. it knows how to click on the search bar in youtube and doesn't have to find it every time, which is what agents currently do
the idea is similar to [anthropic mcp](https://docs.anthropic.com/en/docs/agents-and-tools/mcp) (mediating data exchange) and [langchain agent protocol](https://github.com/langchain-ai/agent-protocol) (mediating task communication) but for interfaces - on the web, desktop, and mobile
*nb: while writing this article, google released their own [agent2agent protocol](https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/), which has a chance of becoming industry standard due to google's weight*
```python
# browser.youtube.py
# made with sonnet37r
import imprint
from playwright.async_api import async_playwright


async def search(query: str):
    """
    Search YouTube by clicking the interface and typing a query using Playwright's async API.

    This function:
    1. Opens a browser
    2. Navigates to YouTube
    3. Clicks on the search bar
    4. Types the query
    5. Clicks the search button

    Args:
        query (str): Search term to type into YouTube
    """
    async with async_playwright() as p:
        # Launch browser
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()

        # Go to YouTube
        await page.goto("https://www.youtube.com")

        # Accept cookies if the dialog appears
        if await page.locator('button:has-text("Accept all")').is_visible():
            await page.locator('button:has-text("Accept all")').click()

        # Find, click and type in the search bar
        search_bar = page.locator('input#search_query')
        await search_bar.click()
        await search_bar.fill(query)

        # Click the search button
        search_button = page.locator('button#Search')
        await search_button.click()

        # Wait for results page to load
        await page.wait_for_url("**/results?search_query=**")
        print(f"Successfully searched for: {query}")

        # Pause to see the results (5 seconds)
        await page.wait_for_timeout(5000)

        # Close browser
        await browser.close()


imprint.run(search())
```
that's just playwright code, it certainly is. the trick isn't the code itself but how it is created (in the hypothetical scenario of generating an imprint of an interface)
it wasn't written by a human, and in this case that's by design. a [[screen reader]] (a tool with a screen parser + language model inside) grabs a site, clicks all around it, and logs which clicks and keystrokes led to which results. this log is a library of primitives that the screen reader then abstracts away, generating functions similar to the one depicted above. later, when another language model, an agent, whatever, wants to click on that website, said agent doesn't need to learn the site's interface on the fly but relies on the imprint produced by the screen reader ahead of time
imprint can be stored under www.site.com/ui-out (akin to an openapi spec) or on pypi / npm, backed by a gpl-licensed repository that any person or agent can create pull requests for - if the site changes div names, adds functionality, or the screen reader simply hasn't discovered smth on the initial run. maybe the imprint itself is an agent and has an endpoint www.site.com/ui-out?feedback= that anyone can POST to with their observations on how to improve the imprint's ui-out api
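what could such an imprint look like when fetched from www.site.com/ui-out? here's a hypothetical sketch of a manifest - every field name is made up, the point is only that the learned playwright functions become discoverable, versioned endpoints:

```python
# hypothetical imprint manifest - field names are invented for illustration
youtube_imprint = {
    "site": "https://www.youtube.com",
    "version": "2025-04-12",
    "endpoints": {
        "search": {
            "params": {"query": "string"},
            "implementation": "browser/youtube.py::search",  # the playwright function above
        },
    },
    # anyone can POST observations here to improve the imprint
    "feedback": "https://www.youtube.com/ui-out?feedback=",
}
```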
**
the other part of the imprint sandwich, lm-in api, is much more boring
```python
import aiohttp


async def execute_action(tool_call_json):
    # API endpoint
    url = 'https://www.youtube.com/ui-out/execute'

    # Make the async POST request
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=tool_call_json) as response:
            # Handle the response
            if response.status == 200:
                return await response.json()
            else:
                text = await response.text()
                raise Exception(f"Error: {response.status}, {text}")
```
and is hidden behind a usual tool call
```json
{
  "name": "youtube_search",
  "description": "get youtube videos on request",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "search query, eg 'mlst palmarini'"
      }
    },
    "required": ["query"]
  }
}
```
ui-out can be learned and explicitly designed by ai for ai, without ever touching the llm's tool call - thus decoupling the two and creating a middleware layer
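to show that decoupling, here's a minimal sketch of the glue in between - assuming an anthropic-style tool_use block and the execute_action() helper above; the payload shape is my invention, not a real spec:

```python
# hypothetical glue between an llm tool call and the ui-out api;
# the payload format is an assumption, not a real spec
async def handle_tool_use(tool_use: dict):
    # the llm only ever sees the tool name and its input schema - which clicks
    # happen behind youtube's ui-out endpoint is none of its business
    if tool_use["name"] == "youtube_search":
        return await execute_action({"action": "search", "query": tool_use["input"]["query"]})
    raise ValueError(f"no ui-out mapping for tool: {tool_use['name']}")
```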
## screen recorder that isn't pure evil
all those interface imprints are well and good but how do i tell ai what to do without writing out every single step, which it will then fail to interpret?
movie makers say "show don't tell" and i think it's good advice
every agent and their grandma are trying to automate booking flights or ordering groceries, with complete disregard for which sites people actually use, their preferred airlines, or past purchases. striking insight: people have usually ordered food before paying $200 a month for openai operator, and likely want to keep doing it the same way
case in point: i use flink to order groceries and i hate their website. it is so bad and dysfunctional that i have a whole note logging its failures + ideas to fix them
one of the stupidest things is that they do not offer a "buy again" button that would re-order everything from the last cart. the best they've got is a list of products you still have to click through manually. oh, and the buttons to add an item don't work if you click too fast. agents, beware!
![[Pasted image 20250411171222.png]]
it would be great if i could press my alt+d and say
```imaginary instruction
hey order food like the last time
but also add stuff from the grocery note
```
go ahead, try it. agents fail, and manus does too
![[Pasted image 20250411175033.png]]
sites are not very agent-friendly
for a second, let's assume the agent solves this captcha. still, the experience is slow and dumb, as it requires the human to do a bunch of things that aren't part of the usual ordering process. since we know that people are lazy - i meant to say, we have evolved to minimize energy expenditure - why not create an agent ux (aka [[ax]]) that mirrors that?
**
imagine a screen recorder just like [obs](https://obsproject.com/): you press alt+r and then just do your thing
![[2025-04-11 17-58-57.mkv]]
this recorder is similar to the universally hated [microsoft recall](https://www.theregister.com/2025/04/11/microsoft_windows_recall/) - but owned, local, and explicit. you decide when to run it and what data to store. it is not used for spying on the user and extorting money from them, but to help them automate
then a screen reader, which is a screen parser plus a language model, reads the video created by the recorder and generates a script with a bunch of commands that call the lm-in api
```pseudocode
site = open_browser("https://www.goflink.com/shop/de-DE/")
products = load_from_history(site)
for p in products:
    found = find_product(site, p)
    add_to_cart(found)
```
when run, this script replicates what the user has done on the interface
smth like ``add_to_cart()`` does not click around; instead, it calls flink's ui-out api. if we were searching for products on amazon or thomann, the lm-in script would stay the same or very similar, while the ui-out calls would be very different, since flink, amazon, and thomann have their add-to-cart buttons in different places and named differently
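a sketch of that split - the ui-out urls are hypothetical, and post_json() stands in for the same aiohttp call as in execute_action() above:

```python
# hypothetical ui-out locations - the lm-in verb stays identical across shops
UI_OUT = {
    "flink":   "https://www.goflink.com/ui-out/execute",
    "amazon":  "https://www.amazon.de/ui-out/execute",
    "thomann": "https://www.thomann.de/ui-out/execute",
}

async def add_to_cart(shop: str, product: str):
    # same lm-in call, different learned clicks behind each site's imprint;
    # post_json() is a stand-in for the aiohttp POST shown earlier
    return await post_json(UI_OUT[shop], {"action": "add_to_cart", "product": product})
```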
the lm-in script can and should be changed by the language model if the task changes, and the most used scripts should be stored in a library - local to the user, as well as a public package for some use cases of broad interest
if you're asking what's the point, consider how much faster and more reliable the system would be. a language model could generate a simple script in the style of the pseudocode above in a second and run it with a code interpreter, while all the clicking doesn't need to be figured out on the go, which takes trial and error - said trial and error happens beforehand, while generating the ui-out api. on the first run, if there is no flink ui-out api in the local library or on pypi, it would still take a while to generate one, but the next time it can run in a fraction of a second instead of 5 minutes - and submit the prototype of the ui-out api to the repository, thus building the library for oneself and every other user. network effects at their best
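the lookup order could be as simple as this sketch - the registry and screen reader calls are hypothetical placeholders:

```python
# hypothetical lookup order: local library -> public registry -> learn from scratch
local_library: dict[str, object] = {}  # imprints the user already has on disk

async def resolve_imprint(site: str):
    if site in local_library:
        return local_library[site]                         # fraction of a second
    imprint = await fetch_from_registry(site)              # pypi / npm / site.com/ui-out
    if imprint is None:
        imprint = await generate_with_screen_reader(site)  # slow first run: click around
        await submit_to_registry(site, imprint)            # share the prototype with everyone else
    local_library[site] = imprint
    return imprint
```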
## training on tutorials to make navigation verifiable
generating lm-in scripts isn't an obvious thing for an llm, since there is nothing in the training data like translating video recordings into calls to libraries that don't exist yet
how does a screen reader know what to do with the recording?
it doesn't. but it can learn
it is not wild to
- seed a library with a handful of video-script pairs as examples
- extract frames from a bunch of software tutorial videos, such that
	- frame 1 shows the initial state
	- frame 2 shows the state after the youtuber clicks a certain element
	- the screen parser determines the coordinates of the control element and the interaction modality (click, scroll, type)
- extract the youtuber's verbal comments on the action being taken
	- these tend to happen at the desired intermediate level of abstraction - not about clicking but about doing the thing
	- eg, "improve the clarity of your voice by adding a compressor to your audio track"
- generate code that clicks in the interface until the desired effect is reached
- the video tutorial serves as ground truth to enable self-supervised learning
	- remove a frame, make the model predict it
	- remove a step from the prompt, make the model predict it
	- this allows creating verifiers in the domain of screen navigation
that's quite fun!
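for instance, one training example extracted from a tutorial could look like this - every field name here is made up for illustration:

```python
# hypothetical training sample extracted from one tutorial step
sample = {
    "frame_before": "tutorial_0231/frame_0412.png",
    "frame_after": "tutorial_0231/frame_0413.png",
    "narration": "improve the clarity of your voice by adding a compressor to your audio track",
    "action": {"modality": "click", "x": 1184, "y": 712, "element": "Compressor"},
}

# verifier-style objectives: hide one piece and make the model predict it
masked_frame_task = {**sample, "frame_after": None}   # predict the resulting state
masked_action_task = {**sample, "action": None}       # predict the click from frames + narration
```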
many apps and sites won't have tutorials. but we know that the performance of models improves with the diversity of data. in image recognition, people love to rotate and flip images to extend the training set. here we can do the same by flipping screens or changing resolutions
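a minimal sketch of such augmentation with pillow - the labeled click coordinates have to move with the pixels:

```python
from PIL import Image

def flip_sample(screenshot_path: str, x: int, y: int):
    # mirror the screenshot horizontally and remap the labeled click: x -> width - 1 - x
    img = Image.open(screenshot_path)
    flipped = img.transpose(Image.Transpose.FLIP_LEFT_RIGHT)
    return flipped, (img.width - 1 - x, y)

def rescale_sample(screenshot_path: str, x: int, y: int, scale: float):
    # simulate a different resolution and scale the click coordinates accordingly
    img = Image.open(screenshot_path)
    resized = img.resize((int(img.width * scale), int(img.height * scale)))
    return resized, (int(x * scale), int(y * scale))
```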
but another massive source of entropy would be to train on different apps
let's say you're collecting data from the interfaces of office apps. you should take microsoft word, libreoffice, google docs, and whatever the apple app is called, and extract data on similar actions from all of them. this will help the model learn the abstractions instead of the exact clicks and generate ui-out scripts for any office app, even when an interface changes or for entirely new interfaces
this approach can work well for desktop applications, but likely not for the web. just recently, i saw a [post by a wikipedia board member](https://www.linkedin.com/posts/maciej-nadzikiewicz_costs-ofwikipediainfrastructure-have-grown-activity-7314897236524462081-N_U3) claiming their infrastructure costs have grown 50% in recent months due to a spike in ai bot traffic
<iframe src="https://www.linkedin.com/embed/feed/update/urn:li:share:7314661038732337152" height="399" width="604" frameborder="0" allowfullscreen="" title="Embedded post"></iframe>
this will likely remain a difficult issue with website owners and hosting providers facing conflicting incentives:
- on one hand, it is beneficial for them to optimize their site for agents, as in the future, a large chunk of their traffic is likely to come from automated tools
- on the other, there is no monetary incentive to do so now, only costs associated with it
the screen reader and imprint generation approach will require not just visiting a site and reading the information once, but clicking around thousands, if not millions, of times (to feed the rl beast). plus, modern web apps are massive and server-to-page speeds are quite low
to solve both the predatory traffic & blockage issue and the performance problem, another approach is needed, namely the creation of simulated environments
## simulated environments for screen navigation
when reinforcement learning people talk about environments, they tend to mean a game like chess, go, or mario, or some sort of world simulation where the agent can run around, jump, fall etc. however, as the understanding of the word "agent" transcended its use in the rl context, so did the understanding of environments
if llm-powered ai agents run around our computers, clicking, typing, scrolling, manipulating files and opening webpages, our simulated environments have to allow for that as well
**
a simulated app is a replica of an existing app but simplified - and created automatically
just for fun, i did the following:
### screenshotted an interface
![[Pasted image 20250411185927.png]]
nb: ableton live is a great program for making music
### extracted a detailed description
![[Pasted image 20250411185957.png]]
see [[Detailed Visual Description and Image Generation Prompt]], made with [imaginator](https://poe.com/imaginator-x365) bot on poe
### generated requirements
![[Pasted image 20250411190052.png]]
see [[REQUIREMENTS DOCUMENT ABLETON LIVE 11 REPLICA APPLICATION]], made with [requirements-engineer](https://poe.com/reqs-engineer)
### generated a web app
![[Pasted image 20250411190329.png]]
it's horrible and nothing works, but [bolt.new](https://bolt.new/~/sb1-reeaxmnh) made it in one shot - and ableton is one of the most complex apps out there, on par with photoshop and blender
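chained together, the manual steps above could look something like this - every function is a hypothetical stand-in for the bots i used (imaginator, requirements-engineer, bolt.new respectively):

```python
# hypothetical pipeline for generating a simulated app from a single screenshot
async def replicate_app(screenshot_path: str) -> str:
    description = await describe_interface(screenshot_path)  # detailed visual description
    requirements = await write_requirements(description)     # requirements document
    app_url = await generate_web_app(requirements)           # one-shot web app
    return app_url                                           # a simulated app to train screen navigators on
```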
i bet $100 that in one year coding agents will build apps that are more than mildly functional
in 2027, creating a simulated app will be a breeze - not as good, as functional, or as beautiful as the original - but good enough to train screen navigators
now consider that
- the [[cost of intelligence]] of a gpt4-level model has dropped 10x in the last 18 months
- [[fast inference]] providers like cerebras, groq, and sambanova are just starting to gain traction
- algorithmic improvements are happening and more will come
and creating hundreds of thousands of simulated apps for any purpose will become feasible in terms of model capability, price, and speed. this also includes some yet unknown environments that can help explore the problems of open-endedness to create truly creative and self-learning models
bringing together accelerated usage through imprints, full personalization through screen recorders & readers, and capability advancements thanks to rl training on video & simulated environments -
## how fast will agents learn to navigate screens?
---
// 12 apr 2025
#navigation #training #reinforcement_learning