# screen navigation is a precursor of total automation
everybody and their dog is trying to build ai agents that can click buttons on our screens reliably. the first examples, like [vimgpt](https://github.com/ishan0102/vimGPT/commits/main/?after=cf443aa287793ebeab3443268ccbc447690dc2c2+34) and [tarsier](https://github.com/reworkd/tarsier/commits/main/?after=cc0888b1c3536c89e81d71d06def92324d7498b7+279), date back to early november 2023, the day after openai released gpt-4 with vision at their dev day.
![[tarsier.mp4]]
then there were adept's experiments, and logically, they had started working on it much earlier:
![[Pasted image 20250830235440.png]]
![[Pasted image 20250830235425.png]]
a year later, there was a storm:
- anthropic computer use was released in late october 2024,
- google project mariner in mid-december 2024, and
- screen navigation was brought into the spotlight in late january 2025 with openai operator (now deprecated in favor of openai agent),
- quickly followed by products from perplexity, manus, tars, genspark, browser-use, and dozens of others
the reasons for this gold rush are clear. in a world made up of millions of programs originally built for mouse-moving, pixel-clicking monkeys, an ai that can click, scroll, and type through screens without being explicitly programmed - adapting to every new interface and to every moved button in an old one - carries an immense opportunity to get rich.
---
**screen navigation is a universal integration**. forget awkward csv exports or latency-ridden, rigid, and integration-demanding application programming interfaces - universal screen navigation makes every human interface an ai interface.
however, this approach ignores a simple truth: if you automate a shitty process, you just accelerate shit creation.
the biggest leverage point to change a system is to shift its paradigm, so what if **creating crutches for ai systems to digest our vision-oriented interfaces is a fundamentally wrong idea**?
what if the potential of ai-driven money making becomes so appealing, and building or rewriting software with ai becomes so reliable, fast, and cheap, that over the long term there is much more cumulative value in rebuilding all interfaces from scratch, natively designed for agentic interaction?
---
some people already talk about [[ax]] aka agent experience. i first saw it about four months ago in the announcement of netlify's partnership with windsurf.
![[Pasted image 20250408194511.png]]
it entails building **applications usable by agents, not humans**. many software products, especially new ones, have already adapted their documentation to be more markdown-based and thus llm-friendly. a buncha companies have sprung up to do what they call geo for generative engine optimization, or aiseo, or llmseo, or whatever the fuck.
when shopping experiences get integrated into chatbot websites en masse, the shops that do it first will benefit from increased revenue and lock-in effects - but the storefronts themselves will fade into obscurity as people stop visiting shops (or any websites, for that matter), ceding control to their ai assistants. for a variety of applications, optimizing for human users will become irrelevant, as more traffic and monetization will be driven by agents (even though there is a human at the bottom with an actual need to eat, fuck, travel, or do taxes).
thus, i do not think that screen navigation is a way to build a generational company - although it might well be a massive opportunity to get acquired by google for $10b, so shoot your shot.
i see the future of screen navigation as similar to that of prompting - a lot of people were predicting the end of prompting, and in some way, it has happened. reasoning-model-powered ai agents understand much more, much better, and what they don't understand they try to read from your repository. still not true for a wide audience, but it's coming, so i'll check that box.
some thinkers extend the definition of self to the tools a human uses, and if we include ai, we'll see that we prompt more than ever before. it's just that we delegate the right to prompt to the ai in the process of [[metaprompting]]. all ai agents, complex prompt orchestrations, and all chatbot products use some form of metaprompting, as they understand that we humans are lazy and generally bad at prompting, just as we are generally bad at programming or writing machine code - communicating with computers is not natural or obvious for us.
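to make that concrete, here is a minimal sketch of metaprompting - the function names and the wrapper prompt are my own illustration, not how any particular product does it:

```python
# minimal metaprompting sketch: the human stays lazy, the ai writes the real prompt.
# call_model() is a placeholder for whatever chat-completion api you use.
def call_model(prompt: str) -> str:
    raise NotImplementedError  # wire up your llm provider here

def metaprompt(lazy_request: str) -> str:
    # step 1: delegate the "right to prompt" - ask the model to turn a terse
    # request into a detailed, well-structured prompt on the user's behalf
    expanded_prompt = call_model(
        "rewrite the following terse request as a detailed prompt with context, "
        f"constraints, and the desired output format:\n\n{lazy_request}"
    )
    # step 2: answer the prompt the ai wrote, not the one the human typed
    return call_model(expanded_prompt)
```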
i think that screen navigation will be similar. people won't think about it much, and although it's probably going to take 30 years until all interfaces are agent-friendly, screen navigation will become a less and less important capability of ai agents.
and here comes an important distinction.
---
the distinction is that what anthropic computer use, openai agent, and manus are doing does not make them agentic.
screen navigation (i.e., clicking interface elements) is just a capability - one a language model acquires through the use of tools, the same way those models perform web search or run python code. they navigate screens by sending text commands to a tool made by human developers that does the actual clicking and then reports back what happened. this distinction is probably clear to most experts, but not to everybody reading this.
> there is no screen navigation, there is no web browsing, there is no function calling - there is just a language model generating sequences of words that we as developers catch to see what the language model demands of us.
>
> we then do it and give it the results.
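to show what that means in code, here is a minimal sketch of that loop - call_model(), the json command format, and the "click" tool are illustrative assumptions, not any vendor's actual api:

```python
# the model never clicks anything: it only emits text, which our code
# parses, executes, and reports back on.
import json

def call_model(messages: list[dict]) -> str:
    """placeholder for a chat-completion call; returns the model's raw text."""
    raise NotImplementedError  # wire up your llm provider here

def click(x: int, y: int) -> str:
    """the 'screen navigation tool': human-written code that does the clicking."""
    # e.g. pyautogui.click(x, y), then grab a screenshot or accessibility tree
    return f"clicked at ({x}, {y}); new screen state: ..."

def agent_loop(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    reply = ""
    for _ in range(max_steps):
        reply = call_model(messages)                  # the model just generates words
        messages.append({"role": "assistant", "content": reply})
        try:
            command = json.loads(reply)               # we catch them and interpret them
        except json.JSONDecodeError:
            return reply                              # no tool demand -> final answer
        if command.get("tool") == "click":
            result = click(command["x"], command["y"])            # we do the clicking
            messages.append({"role": "tool", "content": result})  # and hand back the result
        else:
            return reply
    return reply
```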
---
as a next step of adapting our information systems to be consumed, managed, and built by ai, not only the interfaces have to be designed with ax in mind, but the information systems themselves.
i call it **ai-first systems**.
this term has been thrown around a lot in recent months - like, oh, imagine a law firm that does not hire human lawyers but replaces them 100% with ai, wow, bro!
that's positively not what i'm talking about. ai-first, ai-native, whatever gets the hype.
what i am talking about is software built from the ground up **by ai and for ai**.
it is built to be easily malleable. it is built to communicate easily. to use the advantages that ai systems have and we humans do not - for example, reading through tens of thousands of pages of text in a few seconds, shuffling massive data sets and drawing conclusions from them in a highly parallel fashion, or getting rid of most of what we think of as transactional documents, such as invoices, customer orders, requests, and the like - because those, especially and excruciatingly in the form of pdf documents, are completely unnecessary for ai-first systems.
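to make the "no pdfs" point concrete: a transactional document in an ai-first system can just be structured data on the wire. the schema below is invented for illustration, not any standard:

```python
# a hypothetical machine-readable order record - field names are illustrative.
# nothing is ever rendered to, or parsed back out of, a pdf.
from dataclasses import dataclass, asdict
from datetime import date
import json

@dataclass
class OrderLine:
    sku: str
    quantity: int
    unit_price_eur: float

@dataclass
class CustomerOrder:
    order_id: str
    buyer_org: str
    seller_org: str
    issued_on: date
    lines: list[OrderLine]

order = CustomerOrder(
    order_id="ord-2025-000042",
    buyer_org="bmw",
    seller_org="some-battery-supplier",
    issued_on=date(2025, 8, 31),
    lines=[OrderLine(sku="cell-21700", quantity=10_000, unit_price_eur=3.20)],
)

# the whole "document" is just structured data any agent can consume directly
print(json.dumps(asdict(order), default=str, indent=2))
```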
ultimately, guis will give way to machine-to-machine communication, and that will be the true end of work for humans.
right now, we are just dropping ai into our inefficient, ill-defined, malfunctioning processes, mostly accelerating garbage creation. but once it is economically viable to rebuild the system from scratch, we will do so, and in those systems there will be no place for what a lot of consultants call "human-ai augmentation" - because there will be no place for humans. there will be (mostly) no need for human actors to be part of those systems. **human actors would make them slower, less precise, more error-prone, and more expensive**. people are appalled by that - but ask yourself, deep in your heart: wouldn't you want to use products and services, and live a life, that is faster, safer, cheaper, and better in every way?
analogous to the [tragedy of the commons](https://en.wikipedia.org/wiki/Tragedy_of_the_commons), we can call it a [[tragedy of total automation]] - completely reasonable and desirable on the individual level, it might lead to complete joblessness, market collapse, and the downfall of civilization. but juuuust for today, we're gonna leave those big questions aside and stick to the topic of screen navigation.
---
the question is: when will ai-first software become economically viable?
obviously, rebuilding all of the graphical user interfaces in the world, as well as the entirety of work-related software, is by no means an easy feat. i'm far from believing that this is going to happen anytime soon, and i sometimes gotta laugh at people who talk about agi coming in the next two or three years and automating the shit out of the economy. as i like to say, those people have never worked with a corporate it department that needs 2 months to approve vm access.
however, two economic drivers are at play:
1. decreasing the cost of such software adaptations, i.e. automating the interface re-engineering
2. increasing the value from doing so
on the topic of interface re-engineering, you can read my article [[teaching ai to navigate screens]].
the increase in value is trickier because it requires network effects to kick in. what kind of network effects? well, for ai-first systems to make sense, there need to be many of them. imagine a single organization whose systems are fully automated and document-less, but which still somehow needs to get hundreds of humans on video calls, write thousands of emails, and wait months for every possible government approval and confirmation - just to run a business. it's not going to run much faster when the whole surrounding economic system is as broken as it is right now (from the perspective of a super-intelligent ai, of course. i actually think the system is relatively fine, except for massive wealth inequality, complete lack of alignment on global needs, rising fascism, and the erosion of land, human dignity, and our attention. well!).
much more value can be derived from such systems as adoption unfolds in stages:
1. first, ai becomes better than humans at navigating screens - likely 6-24 months away (as of sept 2025)
2. second, we begin to automate interactions between apps (any-to-any) at scale. this is already underway, but given the complexity and sheer number of apps, it can easily take a dozen years, although the most important apps will be connected within the next 3 years
3. third, machine-to-machine communication rises up the levels of abstraction, from individual ai agents to organizations - which could be companies, teams, or even individual humans (thought of in the entirety of what they do every day - the kind of automation you'd tell your diary about)
once m2m communication becomes org2org communication, the value of ai-first systems rises sharply.
---
let's look at a practical case.
over two years ago, i tried to acquire the german market-data provider statista as a customer with the following story. i wrote them an email and said:
> imagine an analyst from bmw needs to know: how did the market for batteries in thailand change over the last few years?
as of today, the analyst needs to understand the manager's request, go to the statista website, search for and load a bunch of documents, read through them, and compile a report - not counting coffee drinking, ball scratching, and other breaks.
in a world of org2org communication, they can simply ask their **ai@bmw** the same question - "how did the market for batteries in thailand develop over the last 10 years?" - hit submit, and instead of an obnoxious web search aka deep research, the agent sends a request to statista.
the bmw agent is not a freaking research agent; the statista agent is. that agent is highly complex, sending 10 queries to 50 different databases to find the answer to the question, compiling the result, and kindly sending it back. and it can do so for a million requests per second. ai@bmw doesn't care, shouldn't care, and shouldn't need to be able to do any of that. delegation of work, responsibility, and power is just as applicable to artificial-intelligence workloads as it is to human ones.
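a hedged sketch of what such an org2org exchange could look like - the endpoint, payload fields, and response shape are all invented for illustration, not a real statista or bmw api:

```python
# ai@bmw delegates the research question over a plain https call;
# it never browses, scrapes, or clicks anything itself.
import requests

def ask_statista(question: str) -> dict:
    payload = {
        "question": question,
        "requesting_org": "bmw",
        "response_format": "structured",   # machine-readable, no pdfs
    }
    resp = requests.post(
        "https://agent.statista.example/v1/research",  # hypothetical endpoint
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    # e.g. {"answer": "...", "series": [...], "sources": [...]}
    return resp.json()

report = ask_statista(
    "how did the market for batteries in thailand develop over the last 10 years?"
)
```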
in this situation, everybody wins - statista gets more requests and can serve higher-complexity queries and better service to their customers. bmw gets answers a hundred times faster and cheaper and can run market research unrivalled in comprehensiveness and complexity. maybe it will even help them get back on their feet??
and the question of why we need to research battery markets in thailand can be answered by ai@bmw itself - because of course it has access to all of the company's production and transactional systems and receives real-time reports from the thousands of agents supporting and running the organization.
at this point, does anybody really care about clicking screens in a world run through machine-to-machine communication?
---
// 31 aug 2025, berlin