# os-complete navigation environment
recently, i have been thinking about what i call navigation, or screen navigation - known in the community as computer use and browser use - the foundation of anthropic's computer use and openai's operator
the main idea is that a language model looks at a screen, invokes a screen parser tool like [microsoft omniparser](https://github.com/microsoft/OmniParser) to understand what elements are there, gets back a textual representation of those elements from the screen parser, reasons about where to click to complete the task, and commands where to click; a py-click navigation tool then does the actual clicking on the screen
![[Pasted image 20250310003417.png]]
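a minimal sketch of that loop in python, assuming pyautogui as the click tool and treating the screen parser and the lm call as placeholders (omniparser's real interface isn't shown here, and the element format is made up):

```python
# minimal sketch of the screenshot -> parse -> reason -> click loop
import pyautogui

def parse_screen(image):
    """placeholder for a screen parser (e.g. omniparser): returns a list of
    {'label': str, 'bbox': (x1, y1, x2, y2)} dicts describing ui elements."""
    raise NotImplementedError

def ask_lm(task, elements):
    """placeholder for the language model call: given the task and the parsed
    elements, returns the label of the element to click next."""
    raise NotImplementedError

def navigate(task, max_steps=20):
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()            # slow: full screen capture
        elements = parse_screen(screenshot)            # slow: vision + parsing
        target = ask_lm(task, elements)                # slow: lm reasoning + generation
        match = next((e for e in elements if e["label"] == target), None)
        if match is None:
            continue                                   # lm named an element that isn't on screen
        x1, y1, x2, y2 = match["bbox"]
        pyautogui.click((x1 + x2) / 2, (y1 + y2) / 2)  # click the element's center
```

every iteration pays for a screenshot, a vision parse, and an lm call, which is where both the latency and the misses come from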
that's the current sota, and there are multiple issues with it
## issues with computer use
it isn't precise - often, the lm fails to generate the exact code to hit the correct element on screen. sometimes the language model's reasoning fails, sometimes the parsing does. it's also very slow, because it needs to continuously take screenshots, read them with vision capabilities, parse them, interpret the result, and generate new commands. because of that, i was thinking about ways to make it faster
one of them was the idea of creating the "imprint api". essentially, it's a javascript-based api exposed to the language model through simple endpoints that are translated into actual clicks on screen. the language model only learns which endpoints to hit; which actual elements, at which coordinates, get clicked is handled by the imprint api
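to make the shape of this concrete, here is a rough python sketch of the idea (the actual imprint api is javascript-based as described above; the endpoint names and coordinates are made up):

```python
# rough sketch of an imprint api: the lm only sees named endpoints,
# the mapping to on-screen coordinates stays inside the api
import pyautogui

class ImprintAPI:
    def __init__(self, endpoint_map):
        # endpoint_map: {"transport.play": (412, 38), "browser.search": (120, 90), ...}
        self.endpoint_map = endpoint_map

    def endpoints(self):
        """what gets exposed to the language model: names only, no pixels."""
        return sorted(self.endpoint_map)

    def call(self, endpoint):
        """translate an endpoint hit into an actual click on screen."""
        x, y = self.endpoint_map[endpoint]
        pyautogui.click(x, y)

# the lm's action space collapses from "generate coordinates" to "pick a name"
daw = ImprintAPI({"transport.play": (412, 38), "transport.stop": (446, 38)})
daw.call("transport.play")
```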
## bridging the chasm between human ui & llm reasoning with imprint apis
the imprint api is itself an artifact generated by a screen reader and a vision-language model that is pre-trained and tuned on interface images and actions in those interfaces. this goes far beyond regular screen parsing: the model learns latent representations of interfaces, such that, for example, the interfaces of different digital audio workstations like ableton, fl studio, or reaper, while different on the ui level, are very close in latent representation space and hence translate into the same or very similar imprint api endpoints
![[_media/Pasted image 20250310000335.png]]
the imprint api is an artifact generated by the screen reader, and it is pre-computed. if a language model agent encounters a website it has never seen before, it asks the user to run the screen reader on that interface. the screen reader itself is also rl fine-tuned, generating complex click paths across a variety of website and app categories
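a hypothetical shape for such a pre-computed artifact: one shared endpoint vocabulary, with per-app coordinate tables filled in by the screen reader (all names and numbers below are invented for illustration):

```python
# hypothetical pre-computed imprint artifacts: a shared endpoint vocabulary,
# per-app coordinate tables produced by the screen reader
IMPRINT_ARTIFACTS = {
    "ableton": {
        "transport.play": (406, 22),
        "transport.record": (430, 22),
        "browser.search": (60, 90),
    },
    "fl_studio": {
        "transport.play": (512, 45),
        "transport.record": (540, 45),
        "browser.search": (35, 120),
    },
    "reaper": {
        "transport.play": (180, 64),
        "transport.record": (228, 64),
        "browser.search": (900, 40),
    },
}

# the lm reasons over the shared vocabulary; the app-specific table is picked at runtime
def resolve(app, endpoint):
    return IMPRINT_ARTIFACTS[app][endpoint]
```

a click path learned on ableton - say, `transport.play` then `transport.record` - transfers to reaper unchanged, which is exactly what the shared latent representation is supposed to buy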
## portables as simulation environments
one way to create a multitude of uis with low latency, and without spamming the actual original websites, is to build simulation environments in the form of what i call "portables"
a portable is a simple html file with a clear interface reminiscent of some other standard interface
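to show how little machinery a portable needs, here is a toy python sketch that writes a self-contained html file (a made-up counter, not the tetris portable below) and opens it in the default browser:

```python
# a portable is just a single self-contained html file - this writes a toy one
# and opens it in the default browser
import pathlib
import webbrowser

PORTABLE = """<!doctype html>
<html><body>
  <h1>toy counter portable</h1>
  <button onclick="n++; out.textContent = n">+1</button>
  <button onclick="n = 0; out.textContent = n">reset</button>
  <p id="out">0</p>
  <script>let n = 0; const out = document.getElementById('out');</script>
</body></html>"""

path = pathlib.Path("counter_portable.html")
path.write_text(PORTABLE)                    # the portable is just this one file
webbrowser.open(path.resolve().as_uri())     # open it in the default browser
```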
a simple portable i built for fun with claude 3.5 about a month ago while watching the tetris movie:
![[_media/Pasted image 20250310000436.png]]
---
play it!
[download the html file](https://drive.google.com/file/d/1xfh40A3FKTdBnMIIeLYe2wG-v7i4g-4F/view?usp=sharing) and open it in any browser
![[_media/tetris.html]]
---
the app is limited, but it captures the most important parts of the game. there is no pause button, there aren't any levels. this might make the game less exciting or sticky in the long term, but it gives it a certain beauty
if you don't play - you lose
there are no pauses, there are no compromises, there are no discussions
kinda like life itself
## darwin completeness
i was listening to [jeff clune @ mlst podcast](https://www.youtube.com/watch?v=mw5WIDGRLnA) talk about his research in open-ended ai. when he mentioned "darwin complete environments" for ai agents, i made a connection to navigation environments, which are... complete over operating systems?
a true [[screen navigation]] agent *must* be trained in an os-complete environment. by that i mean an environment that encapsulates all possible, or at least all of the highly used, apps on any operating system, be it windows, linux, android, macos, or ios
an os-complete navigation environment would contain the portables (which are simulations), a screen reader model, and the traces of clicks and interface responses that can be used to tune a language model - and thus create a general representation of an operating system that is easily accessible to the model
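a hypothetical record format for those traces, sketched in python - the field names are invented, the point is just that each trace is a sequence of (screen state, action, response) steps that can later become tuning data:

```python
# hypothetical record format for the click traces such an environment would collect
from dataclasses import dataclass, field

@dataclass
class ClickStep:
    endpoint: str                 # imprint-api endpoint that was hit, e.g. "transport.play"
    coords: tuple[float, float]   # where the click actually landed
    screen_before: str            # parsed textual representation of the ui before the click
    screen_after: str             # parsed textual representation of the ui after the click

@dataclass
class NavigationTrace:
    app: str                      # which portable / app the trace comes from
    task: str                     # natural-language task, e.g. "start a new game"
    steps: list[ClickStep] = field(default_factory=list)
    success: bool = False         # did the trace complete the task

# traces like this, collected over many portables, are what a model would be tuned on
```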
my guess is that it would improve both accuracy, from some 80% (???) per task step to some 95%, and speed, from ~20 seconds per step to the sub-second range
## conclusion
i believe that a kind of os-complete navigation environment will be essential for training models to use our systems effectively: high scalability, latent representations of interfaces, and a middleware layer unifying llm-to-ui communication all suggest that such tools will come
anyone building them already?
---
// 10 mar 2025
#navigation #reinforcement_learning