Before I start, I would like to say:
Congratulations to the winners, and thank you for the competition. I don't intend to disrespect or undermine the value of your accomplishment in any way with this text (I kinda will as a byproduct, I'm sorry). I am still writing motivated by personal emotions, but I hope I can share my thoughts without disturbing anyone in the process. I haven't commented on the forums, but during the competition I read as many messages as I could and spent enough time to have a general idea of the submissions. Jayu was one of the first I saw because of its popularity, and it was surprising to see it as the overall winner. I don't expect the results to change, and I have nothing to gain other than my own emotional satisfaction and maybe the chance to share some of my opinions in the process.
Important note: Jayu is a scapegoat here because it is in a similar category to mine (AI assistant). I don't think it is the worst among the submissions, and the person behind it looks like a nice guy. I just have more knowledge of this specific area (compared to my knowledge of what the other submissions cover), and it was chosen as the "overall winner," which is the main prize.
About my submission and my understanding of the subject:
My submission was called ATLAS, and it was basically an agentic system. Like Jayu, part of it is a chatbot wrapper with tool (function) calling. It is a basic idea that existed in different flavors before ChatGPT came out; people even tried these "computer use agents" with basic RL algorithms before transformers. After ChatGPT came out, the most popular approach on the topic came from this really popular GitHub repo: GitHub - OthersideAI/self-operating-computer: A framework to enable multimodal models to operate a computer. The project was called "self-operating-computer".
The idea is simple: take a screenshot of the screen, use OCR or some other kind of visual-to-text information extraction, then have the LLM use that textual representation to trigger code blocks based on the user request, extracting the model's responses in a fixed format (a minimal sketch of this loop is included at the end of this background section). But the problems were:
- Operating systems were really different from each other visually, and even simple things like changing visual elements (such as the desktop background color) would affect the performance of the system.
- Beyond the visual information extraction level, the steps required to complete an action would change from user to user, since everyone uses their devices differently and organizes their stuff differently.
- There isn't much "clean" step-by-step plan data on how to perform specific actions on these varying computer systems.
- Even if there were data, LLMs weren't good at planning, reasoning, and analogical thinking. They are highly language- and context-dependent: an action could require the same steps and be in the same format, yet the model wouldn't be able to do it consistently. (GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models - Apple Machine Learning Research)
- Even though libs/tools like "GitHub - microsoft/OmniParser: A simple screen parsing tool towards pure vision based GUI agent" got better over time, the visual-to-text extraction process still wasn't reliable, especially back then.
- Apps get updates, new versions come out, the folder structure changes, and so on. The computer system is dynamic, which caused even more instability. Every time a new version of an app came out, you would need to add a knowledge base of the new app structure, somehow fine-tune the model again, or find some other workaround that I can't think of.
I might have missed some, but basically, visual approaches to computer control have their advantages, and these problems made already unreliable systems even less reliable. After "GitHub - OthersideAI/self-operating-computer: A framework to enable multimodal models to operate a computer.", there were a lot of other approaches, like "https://www.multion.ai", which focused only on web control since that didn't strictly require visual approaches and the web alone was a smaller chunk to eat, or agentic flow systems like "GitHub - microsoft/autogen: A programming framework for agentic AI 🤖", which divided tasks into smaller chunks so the full process could be debugged chunk by chunk, and so on. These are just the ones I can think of on the fly; there were A LOT more agentic, tool-calling LLM systems.
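To make that background concrete, here is roughly what the core loop of these screenshot-driven systems looks like. The prompt, the JSON action format, and the `call_llm` helper below are my own assumptions for illustration, not any specific project's actual implementation; each project wires this up differently:

```python
# A minimal sketch of the screenshot -> text -> LLM -> action loop that
# these systems are built around. Prompt, action format, and call_llm
# are placeholders, not anyone's real code.
import json
import pyautogui           # mouse/keyboard control
from PIL import ImageGrab  # screenshots

SYSTEM_PROMPT = """You control a computer. Given a screenshot and a goal,
reply with ONE JSON action: {"action": "click", "x": ..., "y": ...},
{"action": "type", "text": ...}, or {"action": "done"}."""

def call_llm(prompt, image):
    """Placeholder for whatever multimodal API is used (e.g. Gemini)."""
    raise NotImplementedError

def run(goal, max_steps=10):
    for _ in range(max_steps):
        screenshot = ImageGrab.grab()                       # 1. see the screen
        reply = call_llm(f"{SYSTEM_PROMPT}\nGoal: {goal}", screenshot)
        step = json.loads(reply)                            # 2. parse the action
        if step["action"] == "done":                        # 3. model decides when to stop
            break
        elif step["action"] == "click":
            pyautogui.click(step["x"], step["y"])
        elif step["action"] == "type":
            pyautogui.typewrite(step["text"])
```

All of the problems in the list above (visual variance, user-specific steps, weak planning) hit exactly this loop.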
I am sorry if that was a bit boring, but to explain myself as clearly as possible I had to give this background information. You could skip it if you already knew all of this, of course. I left a lot out, but as I write this section, not much time has passed since they announced the competition winners. I am already writing motivated by emotion, so there is no need to turn this into a research piece on its own. Now that we are done with the general info, let me move on to the actual specific case.
Jayu and why it is MID (“not good” in new generation slang):
I am being a bit arrogant, since I'm talking about it without having seen its code. But from what I've seen in the forums, the Google judges seem to have skipped reviewing some people's code anyway (mine didn't have app statistics, so I don't know what happened in my case), so I assume the first filtering was done solely based on the submission video.
The submission video of Jayu:
- In the first section of the demo, we see it using Chrome. It doesn't seem to use visual cues, just basic function calls for browser control. There are a lot of Python libs and dedicated AI libs doing this, with far more complex approaches than basic "search in the browser" and "extract the text from the current page" functions.
- In the second section we see it running in the terminal. The user asks it to "memorize something on their screen." With basic prompting, when the user uses the keyword ("memorize" or something similar), it takes a screenshot, makes another model call with the screenshot attached, and that call saves it as a memory because the user said "memorize." Don't forget LLMs are highly text-dependent. (I will explain later how this was already implemented in ChatGPT and why it didn't work as well as it could have.)
- The third demo shows basic speech-to-text/text-to-speech (still a chatbot wrapper); the user selects the input box, and then it is a basic pyautogui "type text" implementation.
- The Minecraft example is no different. Changing where the text box sits on the screen doesn't mean much; you still use the exact same functions and logic. "Show the same function in six different demos, maybe they will think it is something more!"
- In the fifth demo there seem to be two model calls happening: one deciding whether to include the screenshot in the subsequent model calls, and one for the actual function calling. It seems to loop until the model decides the action is complete, which is very unreliable for the previous reasons (see the sketch after this list). I'm not sure if he uses specific template images (which would require everything to be captured as images and would be really inconsistent), a CV lib that he made/trained/fine-tuned (I am not sure 2-3 months was enough time to make that robust), or something revolutionary(…).
- The screen-translation task: basically taking screenshots and including them in repeated model calls with the same prompt, using the same functions over and over. You could just repeatedly send ChatGPT screenshots of what you are reading and ask "translate this to {language}".
- The rest are just… first using the same typing and screenshot functions for the tenth time… then detecting lung cancer from an X-ray? You are using an LLM with billions of params (probably, I don't know) to detect cancer? Wow bro, you made it all. Well done fixing the problem of "cancer detection models already have higher accuracy than most doctors, but there are bureaucratic/responsibility issues" with "let's use my screenshotting, pyautogui-typing chatbot wrapper to fix it all!" Then he shows gesture control, which as far as I can tell has nothing to do with the LLM… (Still cool, though. Do more CV stuff. I would watch it.)
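To be concrete about what I think that fifth-demo loop reduces to, here is a rough guess at the pattern. This is speculation on my part, not Jayu's actual code; the prompts, the decision step, and the helper names are placeholders:

```python
# A guess at the "loop until the model says the action is complete" pattern.
# Prompts, function names, and the decision step are my own placeholders.
import json
from PIL import ImageGrab

def call_llm(prompt, image=None):
    """Placeholder for the multimodal API call."""
    raise NotImplementedError

def dispatch_tool(name, args):
    """Placeholder: map tool names to pyautogui/browser functions."""
    raise NotImplementedError

def run_task(user_request, max_iters=15):
    # Call 1: decide whether screenshots are needed for this request.
    needs_screen = call_llm(
        f"Does this request need to see the screen? Answer yes/no: {user_request}"
    ).strip().lower().startswith("yes")

    for _ in range(max_iters):
        image = ImageGrab.grab() if needs_screen else None
        # Call 2: pick the next tool call, or declare the task finished.
        reply = json.loads(call_llm(
            f"Request: {user_request}\n"
            'Reply with {"tool": ..., "args": ...} or {"tool": "done"}.',
            image,
        ))
        if reply["tool"] == "done":   # the model itself decides completion,
            break                     # which is exactly the unreliable part
        dispatch_tool(reply["tool"], reply["args"])
```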
There was literally a better version of this project in the competition, called "Omni" (Omni 1 | Gemini API Developer Competition | Google AI for Developers). There was still a lot of dubious stuff and plenty of unsolved problems with that one too, since at the submission stage it wasn't that different from Jayu (at least it had more functions), but I think the guy improved it afterwards with his bunny's help and leaned on a lot of functions rather than a fully visual approach. There is no way he didn't try the visual stuff beforehand.
There are privacy concerns too (which are very hard to fix, although Apple has tried (Introducing Apple's On-Device and Server Foundation Models - Apple Machine Learning Research)), but this is more of an LLM API provider problem than Jayu's problem.
Understanding what is going on:
So… none of the previous problems with LLMs were fixed (problems that directly impact this use case; they're still fine for other projects, like understanding a dog's emotions with an LLM or something, I don't know). A chatbot wrapper with tool calling was made, remaking something that already existed two years ago in countless GitHub repos, but worse.
Let's say, "Yeah, but your opinion doesn't matter! It was based on the competition criteria!"
I still don't understand. Please help me understand, based on the official criteria:
Category 1: Impact
- Is the solution easy and enjoyable to use for everyone, including people with disabilities? (maximum 5 points)
- Does this solution have potential to contribute meaningfully to environmental sustainability? (maximum 5 points)
- Does this solution have potential to contribute meaningfully to improving people's lives? (maximum 5 points)
Category 2: Remarkability
- Is the submission surprising to those that are well-versed in Large Language Models ("LLM")? (maximum 5 points)
- Is the submission surprising to those that are not well-versed in LLM? (maximum 5 points)
Category 3: Creativity
- Does the submission differ from existing, well known, applications in functionality? (maximum 5 points)
- Does the submission differ from existing, well known, applications in user experience? (maximum 5 points)
- Is the submission implemented through the use of creative problem-solving approaches? (maximum 5 points)
Category 4: Usefulness
- Does the submission include a well-defined target user persona/segmentation? (maximum 5 points)
- Does the submission identify how the solution addresses specific user needs? (maximum 5 points)
- How well does the solution, as implemented, help users meet these needs? (maximum 5 points)
Category 5: Execution
- Is the solution well-designed and adhere to software engineering practices? (maximum 5 points)
- Is the LLM component of the solution well-designed and adhere to Machine Learning (ML)/LLM best practices? (maximum 5 points)
Maximum score: 65
"In the event of a tie(s), Sponsor will rejudge the Submissions for overall video impact to determine the applicable winner. Sponsor's decisions are final and binding."
Where did it get all of these points from? Was it really that surprising to LLM experts (those "well-versed in LLMs") to see the same thing done for the 1000th time? Did having an extra function call that types the email into the email box, instead of copy-pasting it from a chat with the chatbot, contribute meaningfully to environmental sustainability? It is mind-boggling how irrelevant the project is to the competition's own evaluation criteria.
Why am I crying in a Google forum when it won't change the competition outcome:
- Here is the thing:
- 3.1k people took time out of their own lives; they were interested in the competition and trusted your ability/integrity to evaluate their chances fairly (I'm talking to the judges or whoever is responsible). If people thought it was rigged, they most likely wouldn't have joined in the first place, right? (Of course, we know the experience and the innovation that you can feel in your delicate soul is very enlightening, and that that was the sole reason you joined the competition. But please focus on the argument rather than the specific sentence structures I'm forming out of practical necessity.) So please help me understand and help me rebuild my trust in your integrity.
- I am in no way saying that this was the case (that it was rigged), but I have to bring it up to underline what is at stake. I find "poor evaluation", whether through incompetence or bias (the bad, preventable kind), to be very disrespectful and not very honest. Yes, you did get 3.1k submissions, and there was this and that and all those other excuses, so you expect empathy, and of course you deserve it. BUT you need to empathize with the competitors to expect the same from them, right?
- To make it clear: not everyone was going to win, of course, and it is impossible to make a perfect analysis and evaluation. No sane person expected you to. A jury's choices will be inherently subjective, and you guys are GOOGLE, so surely you took your time and had all the judges look at the submissions in detail. You were very eager to inspect all of the submissions. You looked at the current AI landscape of approaches, looked at the competition criteria, and finally, after a THREE-MONTH inspection period, you decided to choose A CHATBOT WRAPPER WITH TOOL CALLING THAT HAS AT MOST 4 TOOLS.
- Even though I got hopeful after seeing other submissions like Jayu, before that I didn't expect to ever win a competition with such a large prize pool. I am relatively new to a relatively new subfield, and I am nowhere close to being competent. I am giving my honest, intimate opinion, and I repeat: I have nothing to gain other than my own emotional satisfaction, since they won't change the competition outcome. I hope that lends my honesty a bit more credibility.
What was my project? What did I do that lets me judge the overall competition winner? (I am not required to be qualified in order to have an opinion on something, but since I'm writing this arrogantly, I have to at least try to raise my ethos a bit):
Since we needed to use a publicly available model (no fine-tuning, as I understood the rules) and only had 2-3 months, I made some "not the best" attempts at fixing the previously mentioned problems as much as I could.
If you'd rather have a visual, here is a small one I previously made to explain it more easily to friends and family: Vite + React
If you want to understand my project further for some reason, I really recommend taking a look at that link, but here is a small summary with some insights and reasoning:
This is the short explanation I sent to Google (there was a character limit; I didn't think that would be a problem, assuming the judges work in THESE FIELDS at GOOGLE (how else would you check whether it was surprising to experts well-versed in LLMs? That was the competition criterion you gave…)). (I didn't send the "Specific insights" parts; I thought they would figure those out themselves xD):
ATLAS is an AI assistant that leverages the Gemini API to create a smart multi-agent system that learns and adapts to you.
Multi-Agent System: ATLAS coordinates specialized agents to handle complex tasks across various domains.
- Specific insights: Not novel. But the problem was that there are a lot of different types of operations you can do on an OS, and a single Gemini model wasn't able to manage all of them with one prompt. Yes, the model has a big context length, but it isn't able to use/process its context as intelligently as some other state-of-the-art models; context-usage-wise it feels more like a RAG system. I don't blame Google for this; it is already really hard to build/train/run inference on a model with a context this big, and it would be hard to RLHF the model over such a long context. What I found (which is one of the reasons companies are attracted to multi-agent flows) is that breaking the work down into smaller "agents" helped the model perform better. Each model call had its own "role", something we hear a lot about in these systems: "companies", "workplaces", "an orchestra", and so on. And making it an agentic system with a specific but wide range of function calls that didn't rely on visual cues, even though it was limiting, was much more reliable. (I still added all of the required function calls for visual interaction to the GUIAutomationAgent in the system, though.) A rough sketch of the routing idea follows below.
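Here is a minimal sketch of that routing idea: a cheap "router" call picks a specialized agent, and that agent gets only its own role prompt. This is a simplified illustration, not my actual ATLAS code; the agent names, prompts, and `call_llm` helper are placeholders:

```python
# A minimal sketch of multi-agent routing: pick a specialist, then give it
# only its own role prompt instead of one giant do-everything prompt.
# Simplified illustration, not the actual ATLAS implementation.
AGENTS = {
    "file":    "You are the file agent. You only handle file operations.",
    "browser": "You are the browser agent. You only handle web tasks.",
    "gui":     "You are the GUI automation agent. You only handle clicks/typing.",
}

def call_llm(prompt):
    """Placeholder for the Gemini API call."""
    raise NotImplementedError

def route(user_request):
    # Router call: pick which specialist should handle the request.
    choice = call_llm(
        f"Pick one of {list(AGENTS)} for this request, reply with the name only:\n"
        f"{user_request}"
    ).strip()
    # Specialist call: short, focused role prompt for better reliability.
    return call_llm(f"{AGENTS.get(choice, AGENTS['gui'])}\nTask: {user_request}")
```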
Memory System: Organizes information to provide context-aware, personalized assistance over time.
- Specific insights: Not novel. Even ChatGPT had this feature, but there were problems:
- The implementations I saw didn't categorize and leverage different types of memory the way humans do. For example, "short-term" memories can have an expiration date after which the system deletes them (simulating human forgetting; you could also make a system that forgets memories that are used less and less). When you tell the model "today's date is 21.11.2024", keeping that memory tomorrow confuses the model, which is exactly what happened with ChatGPT and other LLMs with simple memory implementations. I added four different memory types: long-term, short-term, abilities memory, and user preferences. It was a conceptual thing; I literally saved basic JSON files. It could of course be further optimized; my implementation was nothing special (relatively bad, even).
- Another problem I saw with other implementations was that memory saving happened in the same model call, before the actual response you asked for. I understand it from a computational-efficiency perspective, but it isn't that much extra work since you already limit and concatenate these memories. It caused a delay whenever the model saved a memory before speaking, and it saved the stupidest stuff (since there is only one model call happening, the model is trying to handle both tasks in one response). I fixed this by running the memory call in parallel with the main call, with the "should I save this?" decision handled by a different prompt (a small sketch of this is included after this list).
- Categorizing memories also helped the model to use them in the correct context.
- The abilities memory was especially good: when the system made a mistake during a task, it could review the process afterwards, understand what went wrong and where (using the specific errors from the chat), see how it previously fixed it, and then save the procedure without the faulty part, so the model wouldn't repeat a mistake it had already made because it could see it in its memory. That made execution a lot more consistent for that specific action. It doesn't substitute for an actual human supervising the process, but I was surprised by how well it worked in practice.
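For illustration, here is roughly the shape of the categorized-memory idea: JSON files per category, short-term entries that expire, and the "should I save this?" call running in parallel with the main reply. The file layout, prompt, and threading here are simplified assumptions, not my exact ATLAS implementation:

```python
# A simplified sketch of categorized memories stored as JSON, with
# expiring short-term entries and a parallel memory-extraction call.
# Not the exact ATLAS implementation.
import json, threading, time
from pathlib import Path

MEMORY_FILES = {k: Path(f"{k}.json") for k in
                ("long_term", "short_term", "abilities", "user_preferences")}

def load(category):
    path = MEMORY_FILES[category]
    items = json.loads(path.read_text()) if path.exists() else []
    if category == "short_term":  # drop expired short-term memories
        items = [m for m in items if m["expires_at"] > time.time()]
    return items

def save(category, text, ttl_seconds=24 * 3600):
    items = load(category)
    entry = {"text": text}
    if category == "short_term":
        entry["expires_at"] = time.time() + ttl_seconds
    items.append(entry)
    MEMORY_FILES[category].write_text(json.dumps(items, indent=2))

def call_llm(prompt):
    """Placeholder for the Gemini API call."""
    raise NotImplementedError

def answer(user_message):
    # Memory extraction runs in parallel with the main reply, with its own
    # prompt, so it neither delays the answer nor pollutes it.
    def extract():
        decision = call_llm(
            f'If this message contains something worth remembering, reply '
            f'{{"category": ..., "text": ...}}, else reply "none":\n{user_message}'
        )
        if decision.strip().lower() != "none":
            parsed = json.loads(decision)
            save(parsed["category"], parsed["text"])

    threading.Thread(target=extract, daemon=True).start()
    context = {c: load(c) for c in MEMORY_FILES}   # hand the main call its memories
    return call_llm(f"Memories: {json.dumps(context)}\nUser: {user_message}")
```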
Train Function: Learns from your actions to streamline future tasks, similar to a plan retrieval system. (VM Learn, coming soon, will enhance this with simulated scenarios.)
- Specific insights: VM Learn is a basic idea and was an unfinished promise (we don't talk about it, I didn't have time), but I thought the Train function was relatively novel. Even though it probably exists somewhere in some corner of the internet, I haven't seen a similar function anywhere.
- This was my fix for the problems I mentioned earlier:
- "the steps required to complete an action would change from user to user, since everyone uses their devices differently."
- "There isn't much 'clean' step-by-step plan data on how to perform specific actions on these varying computer systems."
- "Even if there were data, LLMs weren't good at planning, reasoning, and analogical thinking. They are highly language- and context-dependent: an action could require the same steps, yet the model wouldn't be able to do it even though it is basically the same thing."
- According to people actually working in these fields, one of the easy fixes for LLMs was a "plan retrieval system that uses human-made plans" (https://youtu.be/y1WnHpedi2A?t=5366, at 1:33:16), and even the smallest human nudges seemed to help these models get a lot better (https://youtu.be/BGvQmHd4QPE?t=2481). This was basically a way to bring human planning into the model's decision-making process. I'd had this idea for a long time, using some combination of RL + LSTM models, but the problem was creating the dataset. Once the competition came out, I realized we could create the dataset from a combination of keyboard/mouse action recording and the model's video understanding. That was my way of implementing it.
- "Apps get updates, new versions come out, the folder structure changes, and so on. The computer system is dynamic, which caused even more instability. Every time a new version of an app came out, you would need to add a knowledge base of the new app structure, somehow fine-tune the model again, or find some other workaround that I can't think of (xD)."
Since the process learned just by watching the user do it (kinda like an intern sitting at the desk watching you), it "kinda" fixed the app-updates problem: you could just retrain it by demonstration, without fine-tuning or RAG over app documentation, which most of the time doesn't include every action in step-by-step format. A rough sketch of the recording-to-plan idea is below.
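Roughly, the Train idea reduces to: record the user's demonstration, have the model turn it into a named step-by-step plan, and retrieve that plan the next time a similar request comes in. The recording details, prompt, and storage format below are simplified assumptions, not my exact code:

```python
# A simplified sketch of the Train function: record a demonstration, ask
# the model to turn it into a reusable plan, and retrieve that plan later.
# Not the exact ATLAS implementation.
import json, time
from pathlib import Path
from pynput import mouse, keyboard   # user-action recording

PLANS = Path("plans.json")

def record_demo(duration_s=30):
    """Record raw mouse clicks and key presses for a fixed demo window."""
    events = []

    def on_click(x, y, button, pressed):
        if pressed:
            events.append({"t": time.time(), "click": [x, y]})

    def on_press(key):
        events.append({"t": time.time(), "key": str(key)})

    with mouse.Listener(on_click=on_click), keyboard.Listener(on_press=on_press):
        time.sleep(duration_s)
    return events

def call_llm(prompt):
    """Placeholder for the Gemini API call (optionally fed the screen video too)."""
    raise NotImplementedError

def train(task_name, duration_s=30):
    events = record_demo(duration_s)
    plan = call_llm(f"Turn this event log into numbered, generic steps for "
                    f"'{task_name}':\n{json.dumps(events)}")
    plans = json.loads(PLANS.read_text()) if PLANS.exists() else {}
    plans[task_name] = plan
    PLANS.write_text(json.dumps(plans, indent=2))

def retrieve_plan(user_request):
    plans = json.loads(PLANS.read_text()) if PLANS.exists() else {}
    if not plans:
        return None
    # Let the model pick the closest stored plan (or none) for this request.
    choice = call_llm(f"Pick the best plan name from {list(plans)} for: "
                      f"{user_request}. Reply with the name or 'none'.").strip()
    return plans.get(choice)
```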
Observe Function: Summarizes daily activities to provide insights and suggest improvements.
- Specific insights: This isn't revolutionary either, just like Jayu; even this basic feature has a bit more novelty, I guess, but it is a privacy/security demolisher. It could still be useful for a lot of people, though. Basically two workers running on separate threads: one records the screen in one-minute chunks in a loop, while the other summarizes the previous chunk and appends it to a file as the next recording starts. That way you can continuously summarize the whole day into a single file without ever hitting the 40-minute-to-1-hour video limit (I don't remember which it was). The downsides are pretty clear. The upside is that you can use it to remember things, organize information, and let the AI system "provide tailored experiences" using this knowledge. It is basically good for the memory system: you can textualize your actions on your PC. A small sketch of the two-thread loop is below.
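Sketched out, the Observe loop is just a producer/consumer pair. The recording and summarizing helpers below are placeholders for the actual screen-capture and Gemini video calls, not my exact implementation:

```python
# A small sketch of the Observe loop: one thread records the screen in
# one-minute chunks, another summarizes each finished chunk into a daily
# log file while the next chunk is already being recorded.
import queue
import threading

chunks = queue.Queue()

def record_screen_chunk(seconds=60):
    """Placeholder: capture the screen for `seconds`, return a video path."""
    raise NotImplementedError

def summarize_video(path):
    """Placeholder: send the chunk to the multimodal model, return a summary."""
    raise NotImplementedError

def recorder():
    while True:                      # producer: keep making 1-minute chunks
        chunks.put(record_screen_chunk(60))

def summarizer(logfile="daily_log.txt"):
    while True:                      # consumer: summarize chunks as they finish
        path = chunks.get()
        with open(logfile, "a") as f:
            f.write(summarize_video(path) + "\n")

def start_observe():
    threading.Thread(target=recorder, daemon=True).start()
    threading.Thread(target=summarizer, daemon=True).start()
```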
Live Fix: Delivers real-time problem-solving with continuous processing, unlike the discrete main multi-agent system.
- Specific insights: Saying "discrete" was not entirely correct; all PC processes are discrete by nature. What I meant by continuous processing is basically that it can loop. There was a plain screenshot loop, a loop-until-the-screen-changes mode, and a loop-with-a-time-limit mode, so the model could handle tasks that need looping (e.g., "tell me when there is a circle on my screen," "tell me when I do something wrong in the code," etc.). It was still very unreliable, since the currently popular way of integrating LLMs with visual processing (and thus visual reasoning) is very primitive; we just plug them together (even in most architectural approaches). A sketch of the loop-until-screen-change mode is below.
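For reference, here is roughly what the loop-until-screen-change mode looks like: only bother the model when the pixels have actually changed, and give up after a time limit. The threshold, prompt, and `call_llm` helper are assumptions for illustration:

```python
# A sketch of the "loop until the screen changes" mode: grab screenshots,
# call the model only when the screen has changed, stop after a time limit.
import time
import numpy as np
from PIL import ImageGrab

def call_llm(prompt, image):
    """Placeholder for the multimodal model call."""
    raise NotImplementedError

def watch(condition, time_limit_s=300, change_threshold=5.0):
    previous = np.asarray(ImageGrab.grab().convert("L"), dtype=np.float32)
    deadline = time.time() + time_limit_s
    while time.time() < deadline:
        time.sleep(1)
        current = np.asarray(ImageGrab.grab().convert("L"), dtype=np.float32)
        # Mean absolute pixel difference as a cheap "did the screen change?" check.
        if np.abs(current - previous).mean() > change_threshold:
            verdict = call_llm(f"Does this screen satisfy: '{condition}'? yes/no",
                               ImageGrab.grab())
            if verdict.strip().lower().startswith("yes"):
                return True
        previous = current
    return False
```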
Agent Framework: Allows you to build and visually connect custom agents and workflows.
- Specific insights: Back then, only systems like AutoGen existed, and they weren't that easy for the general public to use. They still aren't, but I wanted to make at least an "easy to use" version, and one of my ideas for the future was to let advanced users add their own agents to the "main assistant" through a validation process, expanding/customizing the capabilities of the general ATLAS system themselves. But since I made it, and even while I was making it, a lot of new agent frameworks came out, AutoGen was updated, and so on. It was still kinda neat to integrate it with the other systems in ATLAS, but that was all there was to it. There was also a builder bot that could build these workflows for you, and you could test them in real time, and so on. But those were optimizations to the user experience; the core idea didn't change and was not novel. A tiny sketch of the underlying workflow structure is below.
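Under the hood, "visually connecting custom agents" boils down to a named graph of agents whose outputs feed each other. The agent names, prompts, and runner below are simplified placeholders, not my actual framework code:

```python
# A tiny sketch of an agent workflow: each node is an agent with a role
# prompt; edges say whose output it consumes. Simplified placeholder code.
def call_llm(prompt):
    """Placeholder for the Gemini API call."""
    raise NotImplementedError

workflow = {
    "researcher": {"prompt": "Research the topic and list key facts.", "inputs": []},
    "writer":     {"prompt": "Write a summary from the research.", "inputs": ["researcher"]},
    "reviewer":   {"prompt": "Review and correct the summary.", "inputs": ["writer"]},
}

def run_workflow(workflow, task):
    outputs = {}
    for name, node in workflow.items():          # assumes nodes are listed in execution order
        upstream = "\n".join(outputs[i] for i in node["inputs"])
        outputs[name] = call_llm(f"{node['prompt']}\nTask: {task}\nInput:\n{upstream}")
    return outputs
```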
Speech Control: Understands and acts on voice commands, enabling everyone to use it.
- Specific insights: This wasn't the cliché speech-to-text/text-to-speech; I had those options too, but they are too generic. You just copy-paste some libs, or if you have nothing better to do, you write them yourself. I implemented this specifically for the "impact/accessibility" criterion. It is basically a feature that lets non-technical or disabled people (in the improved version of the feature) use the app with speech alone. You can ask questions about the app (for which I didn't have time to write full documentation), and you can ask it to do things for you inside the app. It knows the current state of the UI and which elements are visible on your screen, and it does NOT get that information visually, which makes it more reliable. A sketch of that non-visual UI-state idea is below.
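The key point is that the app can report its own UI state programmatically, so no screenshots or OCR are involved. The state dictionary, actions, and prompts below are simplified placeholders, not my exact implementation:

```python
# A sketch of non-visual speech control: the app exposes its own UI state,
# and the model maps a spoken command onto one of the app's actions.
import json

def call_llm(prompt):
    """Placeholder for the Gemini API call."""
    raise NotImplementedError

def transcribe(audio):
    """Placeholder for speech-to-text."""
    raise NotImplementedError

# The app knows its own widgets, so no OCR/vision is needed.
UI_STATE = {
    "current_tab": "Agents",
    "visible_buttons": ["New Agent", "Train", "Observe", "Settings"],
}
ACTIONS = {"open_tab": lambda name: print(f"opening {name}"),
           "click_button": lambda name: print(f"clicking {name}")}

def handle_voice(audio):
    command = transcribe(audio)
    reply = json.loads(call_llm(
        f"UI state: {json.dumps(UI_STATE)}\n"
        f'User said: "{command}"\n'
        'Reply with {"action": "open_tab"|"click_button", "target": ...}.'
    ))
    ACTIONS[reply["action"]](reply["target"])
```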
ATLAS includes over 200 specialized functions and harnesses Gemini’s multi-modal capabilities to manage complex tasks. It improves over time and is customizable: you can add knowledge sources, select text-to-speech options, choose different AI models, work with various file types, and switch between safe and efficient modes—all while keeping your data secure on your local system.
Now it's time to humble myself (since the main point is not to talk about my own project but to criticize the competition and the people responsible (NOT THE GOOGLE COMPANY; THIS IS A CRITIQUE OF THE JUDGES)):
- ATLAS has a lot of functionality. It can be hard for an average person to use, even with all the AI explanation (the speech-control feature). The average grandmother doesn't know how to send a text message. This limits the user count.
- (Note: the competition expected demos (as specified in the video/rules). What I am talking about is an improvement in the user experience, which counts as further refinement; you don't expect that from an app built in 2-3 months. Jayu was running in a terminal?)
- The system got better the more you used it (thanks to the memory, train, and observe features), but the default/untrained system wasn't that consistent, because without those features in effect it was a multi-agent system still running on the base Gemini models, which have all of the problems I listed earlier. Most users don't want to train their own stuff, so they would need to download it with some memory preloaded, and so on.
- It wasn't refined. Even though I can say "yeah, but we only had 2-3 months!", facts are facts. I fixed the bugs I noticed, but I didn't do any red-teaming, in-depth testing, and so on. I only had a list of "test tasks" with varying complexity that I changed from time to time, unit tests for the 200+ functions, and general testing of the program, but that was it. I could still find 174147174178 more bugs if I tried, especially in the untrained/default system.
- It ate tokens like crazy (which might be good for marketing the Gemini model? But it was caused by unrefined prompts in the system). I gave a really long prompt telling the models what to do without too many specific examples (since they got hyperfocused and worse at other tasks when I gave specific examples for specific tasks), so I had to create a language-level algorithmic guide for them to follow. I was using a premium tier or something with the free trial on Google Cloud, but after the trial was over, I realized I couldn't run it within the free rate limits.
- The base intelligence doesn't improve. It still depends heavily on the model's quality and capability. I can do whatever I want, all the prompting tricks and such, which help with task performance, but in the end the base model is the same. I didn't make a system that can discover a pattern it didn't already have; it didn't gain higher abstraction and reasoning out of nowhere.
- It was slow. Even though it was fast on simple tasks (e.g., the kind of thing Jayu does when typing text, or Omni checking battery power), it still took around 14 minutes on the task I showed in the demo:
- "Research the top 5 emerging technologies in renewable energy, get the top 3 results, analyze their potential impact on global carbon emissions, compile a Word document with your findings and recommendations for future investments, then create a comparative chart of their efficiencies using code, and save the code file and the chart.png on my desktop."
- For the integrity of the demo, I had to use the base system without any memories and so on:
- It took three tries. First, it forgot that it needed to check my desktop path before creating the files (even though the base prompt specifies adding that step before any file operation). Second, it forgot to concatenate/summarize the text from the web and hit a rate limit.
- It is a privacy nightmare. Yeah, I mentioned this before too (for Jayu) and said it was the LLM provider's problem, but this app is even more of a nightmare. It can access EVERYTHING; there are registry and file-operation agents, which isn't a good selling point. In the demo I say things like "yeah, it saves it locally," but in the end you are sending data to Google's servers AND keeping all that new information locally on your computer (which could be mitigated with encryption or something, but it is still very dangerous).
For more information, here is the YouTube video of ATLAS: https://youtu.be/OfCHMGhtr_I?feature=shared
And so on…
Before I close this out, I wanna say:
- Thank you if you have read this far, "dear competitor." I hope it was worth it, and I apologize if it wasn't. I am currently a human being with some emotions, and I might be very biased, but I hope I conveyed my points even with my current limitation of being a bit mad and confused. I don't know you, but I lose nothing by wishing you good luck with your life.
- I am an amateur, maybe worse. Anything I say can change as soon as I learn something new. I don't stand that firmly by my points, and given sufficient logical counterarguments, I will apologize and then run away.
- To the person who created Jayu: I am really sorry. You didn't do anything bad at all; on the contrary, you made a great video and a great project, and compared to a lot of other projects I would still place it high. I just wanted to criticize the judging process and the judges themselves, but your project got caught in the middle. I couldn't think of another way to write this; it is solely my own shortcoming, and I am open to and welcome any harsh words from you in my private messages.
- Some parts may come off as overly emotional and unprofessional, and I understand and agree with that. But I am not a professional, nor a corporation. I competed in this competition/hackathon and wanted to give my opinions about it without directly insulting "specific people," only "concepts/processes." I might have slipped up somewhere without realizing it, and I apologize if that is the case. "I looked through a lot of stuff and there was this and that reason, please excuse my behaviour." I meant what I said in the text above, and anger is a great tool. I am not going to give an AI-generated, basic formal response. I am in my rebellious period(!).