Before I start, I would like to say:
Congratulations to the winners, and thank you for the competition. I don't intend to disrespect or undermine the value of your accomplishment in any way with this text (I kinda will as a byproduct, I'm sorry). I am still writing motivated by personal emotions, but I hope I can share my thoughts without disturbing anyone in the process. I haven't commented on the forums, but during the competition I read as many messages as I could and spent enough time to have a general idea of the submissions. Jayu was one of the first I saw because of its popularity, and it was surprising to see it as the overall winner. I don't expect the results to change, and I have nothing to gain other than my own emotional satisfaction and maybe the chance to share some of my opinions in the process.
Important note: Jayu is a scapegoat here because it is in a similar category to mine (AI assistant). I don't think it is the worst among the submissions, and the person behind it looks like a nice guy. I just have more knowledge of this specific area (compared to my knowledge of what the other submissions cover), and it was chosen as the "overall winner," which is the main prize.
About my submission and my understanding of the subject:
My submission was called ATLAS, and it was basically an agentic system. Like Jayu, part of it is a chatbot wrapper with tool (function) calling. It is a basic idea that existed in different flavors before ChatGPT came out; people even tried these "computer use agents" with basic RL algorithms before transformers. After ChatGPT came out, the most popular approach on the topic came from this really popular GitHub repo: GitHub - OthersideAI/self-operating-computer: A framework to enable multimodal models to operate a computer. The project was called "self-operating-computer".
The idea is simple: take a screenshot of the screen, use OCR or some other kind of visual-to-text information extraction, then have the LLM use that textual representation to trigger code blocks based on the user request, extracting the model's responses in a fixed format (a minimal sketch of this loop is included at the end of this background section). But the problems were:
- Operating systems were really different from each other visually, and even simple things like changing visual elements (such as the desktop background color) would affect the performance of the system.
- Beyond the visual information extraction level, the steps required to complete an action would change from user to user, since everyone uses their devices differently and organizes their stuff differently.
- There isn't much "clean" step-by-step plan data on how to perform specific actions on these varying computer systems.
- Even if there were data, LLMs weren't good at planning, reasoning, and analogical thinking. They are highly language- and context-dependent: an action could require the same steps and be in the same format, yet the model wouldn't be able to do it consistently. (GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models - Apple Machine Learning Research)
- Even though libs/tools like "GitHub - microsoft/OmniParser: A simple screen parsing tool towards pure vision based GUI agent" got better over time, the visual-to-text extraction process still wasn't reliable, especially back then.
- Apps get updates, new versions come out, the folder structure changes, and so on. The computer system is dynamic, which caused even more instability. Every time a new version of an app came out, you would need to add a knowledge base of the new app structure, somehow fine-tune the model again, or find some other workaround that I can't think of.
I might have missed some, but basically, visual approaches to computer control have their advantages, and these problems made already unreliable systems even less reliable. After "GitHub - OthersideAI/self-operating-computer: A framework to enable multimodal models to operate a computer.", there were a lot of other approaches, like "https://www.multion.ai", which focused only on web control since that didn't strictly require visual approaches and the web alone was a smaller chunk to eat, or agentic flow systems like "GitHub - microsoft/autogen: A programming framework for agentic AI 🤖", which divided tasks into smaller chunks so the full process could be debugged chunk by chunk, and so on. These are just the ones I can think of on the fly; there were A LOT more agentic, tool-calling LLM systems.
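To make that background concrete, here is roughly what the core loop of these screenshot-driven systems looks like. The prompt, the JSON action format, and the `call_llm` helper below are my own assumptions for illustration, not any specific project's actual implementation; each project wires this up differently:

```python
# A minimal sketch of the screenshot -> text -> LLM -> action loop that
# these systems are built around. Prompt, action format, and call_llm
# are placeholders, not anyone's real code.
import json
import pyautogui           # mouse/keyboard control
from PIL import ImageGrab  # screenshots

SYSTEM_PROMPT = """You control a computer. Given a screenshot and a goal,
reply with ONE JSON action: {"action": "click", "x": ..., "y": ...},
{"action": "type", "text": ...}, or {"action": "done"}."""

def call_llm(prompt, image):
    """Placeholder for whatever multimodal API is used (e.g. Gemini)."""
    raise NotImplementedError

def run(goal, max_steps=10):
    for _ in range(max_steps):
        screenshot = ImageGrab.grab()                       # 1. see the screen
        reply = call_llm(f"{SYSTEM_PROMPT}\nGoal: {goal}", screenshot)
        step = json.loads(reply)                            # 2. parse the action
        if step["action"] == "done":                        # 3. model decides when to stop
            break
        elif step["action"] == "click":
            pyautogui.click(step["x"], step["y"])
        elif step["action"] == "type":
            pyautogui.typewrite(step["text"])
```

All of the problems in the list above (visual variance, user-specific steps, weak planning) hit exactly this loop.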
I am sorry if that was a bit boring, but to explain myself as clearly as possible I had to give this background information. You could skip it if you already knew all of this, of course. I left a lot out, but as I write this section, not much time has passed since they announced the competition winners. I am already writing motivated by emotion, so there is no need to turn this into a research piece on its own. Now that we are done with the general info, let me move on to the actual specific case.
Jayu and why it is MID (“not good” in new generation slang):
I am being a bit arrogant, since I'm talking about it without having seen its code. But from what I've seen in the forums, the Google judges seem to have skipped reviewing some people's code anyway (mine didn't have app statistics, so I don't know what happened in my case), so I assume the first filtering was done solely based on the submission video.
The submission video of Jayu:
- In the first section of the demo, we see it using Chrome. It doesn't seem to use visual cues, just basic function calls for browser control. There are a lot of Python libs and dedicated AI libs doing this, with far more complex approaches than basic "search in the browser" and "extract the text from the current page" functions.
- In the second section we see it running in the terminal. The user asks it to "memorize something on their screen." With basic prompting, when the user uses the keyword ("memorize" or something similar), it takes a screenshot, makes another model call with the screenshot attached, and that call saves it as a memory because the user said "memorize." Don't forget LLMs are highly text-dependent. (I will explain later how this was already implemented in ChatGPT and why it didn't work as well as it could have.)
- The third demo shows basic speech-to-text/text-to-speech (still a chatbot wrapper); the user selects the input box, and then it is a basic pyautogui "type text" implementation.
- The Minecraft example is no different. Changing where the text box sits on the screen doesn't mean much; you still use the exact same functions and logic. "Show the same function in six different demos, maybe they will think it is something more!"
- In the fifth demo there seem to be two model calls happening: one deciding whether to include the screenshot in the subsequent model calls, and one for the actual function calling. It seems to loop until the model decides the action is complete, which is very unreliable for the previous reasons (see the sketch after this list). I'm not sure if he uses specific template images (which would require everything to be captured as images and would be really inconsistent), a CV lib that he made/trained/fine-tuned (I am not sure 2-3 months was enough time to make that robust), or something revolutionary(…).
- The screen-translation task: basically taking screenshots and including them in repeated model calls with the same prompt, using the same functions over and over. You could just repeatedly send ChatGPT screenshots of what you are reading and ask "translate this to {language}".
- The rest are just… first using the same typing and screenshot functions for the tenth time… then detecting lung cancer from an X-ray? You are using an LLM with billions of params (probably, I don't know) to detect cancer? Wow bro, you made it all. Well done fixing the problem of "cancer detection models already have higher accuracy than most doctors, but there are bureaucratic/responsibility issues" with "let's use my screenshotting, pyautogui-typing chatbot wrapper to fix it all!" Then he shows gesture control, which as far as I can tell has nothing to do with the LLM… (Still cool, though. Do more CV stuff. I would watch it.)
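To be concrete about what I think that fifth-demo loop reduces to, here is a rough guess at the pattern. This is speculation on my part, not Jayu's actual code; the prompts, the decision step, and the helper names are placeholders:

```python
# A guess at the "loop until the model says the action is complete" pattern.
# Prompts, function names, and the decision step are my own placeholders.
import json
from PIL import ImageGrab

def call_llm(prompt, image=None):
    """Placeholder for the multimodal API call."""
    raise NotImplementedError

def dispatch_tool(name, args):
    """Placeholder: map tool names to pyautogui/browser functions."""
    raise NotImplementedError

def run_task(user_request, max_iters=15):
    # Call 1: decide whether screenshots are needed for this request.
    needs_screen = call_llm(
        f"Does this request need to see the screen? Answer yes/no: {user_request}"
    ).strip().lower().startswith("yes")

    for _ in range(max_iters):
        image = ImageGrab.grab() if needs_screen else None
        # Call 2: pick the next tool call, or declare the task finished.
        reply = json.loads(call_llm(
            f"Request: {user_request}\n"
            'Reply with {"tool": ..., "args": ...} or {"tool": "done"}.',
            image,
        ))
        if reply["tool"] == "done":   # the model itself decides completion,
            break                     # which is exactly the unreliable part
        dispatch_tool(reply["tool"], reply["args"])
```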
There was literally a better version of this project in the competition, called "Omni" (Omni 1 | Gemini API Developer Competition | Google AI for Developers). There was still a lot of dubious stuff and plenty of unsolved problems with that one too, since at the submission stage it wasn't that different from Jayu (at least it had more functions), but I think the guy improved it afterwards with his bunny's help and leaned on a lot of functions rather than a fully visual approach. There is no way he didn't try the visual stuff beforehand.
There are privacy concerns too (which are very hard to fix, although Apple has tried (Introducing Apple's On-Device and Server Foundation Models - Apple Machine Learning Research)), but this is more of an LLM API provider problem than Jayu's problem.
Understanding what is going on:
So… none of the previous problems with LLMs were fixed (problems that directly impact this use case; they're still fine for other projects, like understanding a dog's emotions with an LLM or something, I don't know). A chatbot wrapper with tool calling was made, remaking something that already existed two years ago in countless GitHub repos, but worse.
Let's say, "Yeah, but your opinion doesn't matter! It was based on the competition criteria!"
I still don't understand. Please help me understand, based on the official criteria:
Category 1: Impact
- Is the solution easy and enjoyable to use for everyone, including people with disabilities? (maximum 5 points)
- Does this solution have potential to contribute meaningfully to environmental sustainability? (maximum 5 points)
- Does this solution have potential to contribute meaningfully to improving people's lives? (maximum 5 points)
Category 2: Remarkability
- Is the submission surprising to those that are well-versed in Large Language Models ("LLM")? (maximum 5 points)
- Is the submission surprising to those that are not well-versed in LLM? (maximum 5 points)
Category 3: Creativity
- Does the submission differ from existing, well known, applications in functionality? (maximum 5 points)
- Does the submission differ from existing, well known, applications in user experience? (maximum 5 points)
- Is the submission implemented through the use of creative problem-solving approaches? (maximum 5 points)
Category 4: Usefulness
- Does the submission include a well-defined target user persona/segmentation? (maximum 5 points)
- Does the submission identify how the solution addresses specific user needs? (maximum 5 points)
- How well does the solution, as implemented, help users meet these needs? (maximum 5 points)
Category 5: Execution
- Is the solution well-designed and adhere to software engineering practices? (maximum 5 points)
- Is the LLM component of the solution well-designed and adhere to Machine Learning (ML)/LLM best practices? (maximum 5 points)
Maximum score: 65
"In the event of a tie(s), Sponsor will rejudge the Submissions for overall video impact to determine the applicable winner. Sponsor's decisions are final and binding."
Where did it get all of these points from? Was it really that surprising to LLM experts (those "well-versed in LLMs") to see the same thing done for the 1000th time? Did having an extra function call that types the email into the email box, instead of copy-pasting it from a chat with the chatbot, contribute meaningfully to environmental sustainability? It is mind-boggling how irrelevant the project is to the competition's own evaluation criteria.
Why am I crying in a Google forum when it won't change the competition outcome:
- Here is the thing:
- 3.1k people took time out of their own lives; they were interested in the competition and trusted your ability/integrity to evaluate their chances fairly (I'm talking to the judges or whoever is responsible). If people thought it was rigged, they most likely wouldn't have joined in the first place, right? (Of course, we know the experience and the innovation that you can feel in your delicate soul is very enlightening, and that that was the sole reason you joined the competition. But please focus on the argument rather than the specific sentence structures I'm forming out of practical necessity.) So please help me understand and help me rebuild my trust in your integrity.
- I am in no way saying that this was the case (that it was rigged), but I have to bring it up to underline what is at stake. I find "poor evaluation", whether through incompetence or bias (the bad, preventable kind), to be very disrespectful and not very honest. Yes, you did get 3.1k submissions, and there was this and that and all those other excuses, so you expect empathy, and of course you deserve it. BUT you need to empathize with the competitors to expect the same from them, right?
- To make it clear: not everyone was going to win, of course, and it is impossible to make a perfect analysis and evaluation. No sane person expected you to. A jury's choices will be inherently subjective, and you guys are GOOGLE, so surely you took your time and had all the judges look at the submissions in detail. You were very eager to inspect all of the submissions. You looked at the current AI landscape of approaches, looked at the competition criteria, and finally, after a THREE-MONTH inspection period, you decided to choose A CHATBOT WRAPPER WITH TOOL CALLING THAT HAS AT MOST 4 TOOLS.
- Even though I got hopeful after seeing other submissions like Jayu, before that I didn't expect to ever win a competition with such a large prize pool. I am relatively new to a relatively new subfield, and I am nowhere close to being competent. I am giving my honest, intimate opinion, and I repeat: I have nothing to gain other than my own emotional satisfaction, since they won't change the competition outcome. I hope that lends my honesty a bit more credibility.
What was my project? What did I do that lets me judge the overall competition winner? (I am not required to be qualified in order to have an opinion on something, but since I'm writing this arrogantly, I have to at least try to raise my ethos a bit):
Since we needed to use a publicly available model (no fine-tuning, as I understood the rules) and only had 2-3 months, I made some "not the best" attempts at fixing the previously mentioned problems as much as I could.
If you'd rather have a visual, here is a small one I previously made to explain it more easily to friends and family: Vite + React
If you want to understand my project further for some reason, I really recommend taking a look at that link, but here is a small summary with some insights and reasoning:
This is the short explanation I sent to Google (there was a character limit; I didn't think that would be a problem, assuming the judges work in THESE FIELDS at GOOGLE (how else would you check whether it was surprising to experts well-versed in LLMs? That was the competition criterion you gave…)). (I didn't send the "Specific insights" parts; I thought they would figure those out themselves xD):
ATLAS is an AI assistant that leverages the Gemini API to create a smart multi-agent system that learns and adapts to you.
Multi-Agent System: ATLAS coordinates specialized agents to handle complex tasks across various domains.
- Specific insights: Not novel. But the problem was that there are a lot of different types of operations you can do on an OS, and a single Gemini model wasn't able to manage all of them with one prompt. Yes, the model has a big context length, but it isn't able to use/process its context as intelligently as some other state-of-the-art models; context-usage-wise it feels more like a RAG system. I don't blame Google for this; it is already really hard to build/train/run inference on a model with a context this big, and it would be hard to RLHF the model over such a long context. What I found (which is one of the reasons companies are attracted to multi-agent flows) is that breaking the work down into smaller "agents" helped the model perform better. Each model call had its own "role", something we hear a lot about in these systems: "companies", "workplaces", "an orchestra", and so on. And making it an agentic system with a specific but wide range of function calls that didn't rely on visual cues, even though it was limiting, was much more reliable. (I still added all of the required function calls for visual interaction to the GUIAutomationAgent in the system, though.) A rough sketch of the routing idea follows below.
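Here is a minimal sketch of that routing idea: a cheap "router" call picks a specialized agent, and that agent gets only its own role prompt. This is a simplified illustration, not my actual ATLAS code; the agent names, prompts, and `call_llm` helper are placeholders:

```python
# A minimal sketch of multi-agent routing: pick a specialist, then give it
# only its own role prompt instead of one giant do-everything prompt.
# Simplified illustration, not the actual ATLAS implementation.
AGENTS = {
    "file":    "You are the file agent. You only handle file operations.",
    "browser": "You are the browser agent. You only handle web tasks.",
    "gui":     "You are the GUI automation agent. You only handle clicks/typing.",
}

def call_llm(prompt):
    """Placeholder for the Gemini API call."""
    raise NotImplementedError

def route(user_request):
    # Router call: pick which specialist should handle the request.
    choice = call_llm(
        f"Pick one of {list(AGENTS)} for this request, reply with the name only:\n"
        f"{user_request}"
    ).strip()
    # Specialist call: short, focused role prompt for better reliability.
    return call_llm(f"{AGENTS.get(choice, AGENTS['gui'])}\nTask: {user_request}")
```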
Memory System: Organizes information to provide context-aware, personalized assistance over time.
- Specific insights: Not novel. Even ChatGPT had this feature, but there were problems:
- The implementations I saw didn't categorize and leverage different types of memory the way humans do. For example, "short-term" memories can have an expiration date after which the system deletes them (simulating human forgetting; you could also make a system that forgets memories that are used less and less). When you tell the model "today's date is 21.11.2024", keeping that memory tomorrow confuses the model, which is exactly what happened with ChatGPT and other LLMs with simple memory implementations. I added four different memory types: long-term, short-term, abilities memory, and user preferences. It was a conceptual thing; I literally saved basic JSON files. It could of course be further optimized; my implementation was nothing special (relatively bad, even).
- Another problem I saw with other implementations was that memory saving happened in the same model call, before the actual response you asked for. I understand it from a computational-efficiency perspective, but it isn't that much extra work since you already limit and concatenate these memories. It caused a delay whenever the model saved a memory before speaking, and it saved the stupidest stuff (since there is only one model call happening, the model is trying to handle both tasks in one response). I fixed this by running the memory call in parallel with the main call, with the "should I save this?" decision handled by a different prompt (a small sketch of this is included after this list).
- Categorizing memories also helped the model to use them in the correct context.
- The abilities memory was especially good: when the system made a mistake during a task, it could review the process afterwards, understand what went wrong and where (using the specific errors from the chat), see how it previously fixed it, and then save the procedure without the faulty part, so the model wouldn't repeat a mistake it had already made because it could see it in its memory. That made execution a lot more consistent for that specific action. It doesn't substitute for an actual human supervising the process, but I was surprised by how well it worked in practice.
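For illustration, here is roughly the shape of the categorized-memory idea: JSON files per category, short-term entries that expire, and the "should I save this?" call running in parallel with the main reply. The file layout, prompt, and threading here are simplified assumptions, not my exact ATLAS implementation:

```python
# A simplified sketch of categorized memories stored as JSON, with
# expiring short-term entries and a parallel memory-extraction call.
# Not the exact ATLAS implementation.
import json, threading, time
from pathlib import Path

MEMORY_FILES = {k: Path(f"{k}.json") for k in
                ("long_term", "short_term", "abilities", "user_preferences")}

def load(category):
    path = MEMORY_FILES[category]
    items = json.loads(path.read_text()) if path.exists() else []
    if category == "short_term":  # drop expired short-term memories
        items = [m for m in items if m["expires_at"] > time.time()]
    return items

def save(category, text, ttl_seconds=24 * 3600):
    items = load(category)
    entry = {"text": text}
    if category == "short_term":
        entry["expires_at"] = time.time() + ttl_seconds
    items.append(entry)
    MEMORY_FILES[category].write_text(json.dumps(items, indent=2))

def call_llm(prompt):
    """Placeholder for the Gemini API call."""
    raise NotImplementedError

def answer(user_message):
    # Memory extraction runs in parallel with the main reply, with its own
    # prompt, so it neither delays the answer nor pollutes it.
    def extract():
        decision = call_llm(
            f'If this message contains something worth remembering, reply '
            f'{{"category": ..., "text": ...}}, else reply "none":\n{user_message}'
        )
        if decision.strip().lower() != "none":
            parsed = json.loads(decision)
            save(parsed["category"], parsed["text"])

    threading.Thread(target=extract, daemon=True).start()
    context = {c: load(c) for c in MEMORY_FILES}   # hand the main call its memories
    return call_llm(f"Memories: {json.dumps(context)}\nUser: {user_message}")
```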
Train Function: Learns from your actions to streamline future tasks, similar to a plan retrieval system. (VM Learn, coming soon, will enhance this with simulated scenarios.)
- Specific insights: VM Learn is a basic idea and was an unfinished promise (we don't talk about it, I didn't have time), but I thought the Train function was relatively novel. Even though it probably exists somewhere in some corner of the internet, I haven't seen a similar function anywhere.
- This was my fix for the problems I mentioned earlier:
- "the steps required to complete an action would change from user to user, since everyone uses their devices differently."
- "There isn't much 'clean' step-by-step plan data on how to perform specific actions on these varying computer systems."
- "Even if there were data, LLMs weren't good at planning, reasoning, and analogical thinking. They are highly language- and context-dependent: an action could require the same steps, yet the model wouldn't be able to do it even though it is basically the same thing."
- According to people actually working in these fields, one of the easy fixes for LLMs was a "plan retrieval system that uses human-made plans" (https://youtu.be/y1WnHpedi2A?t=5366, at 1:33:16), and even the smallest human nudges seemed to help these models get a lot better (https://youtu.be/BGvQmHd4QPE?t=2481). This was basically a way to bring human planning into the model's decision-making process. I'd had this idea for a long time, using some combination of RL + LSTM models, but the problem was creating the dataset. Once the competition came out, I realized we could create the dataset from a combination of keyboard/mouse action recording and the model's video understanding. That was my way of implementing it.
- "Apps get updates, new versions come out, the folder structure changes, and so on. The computer system is dynamic, which caused even more instability. Every time a new version of an app came out, you would need to add a knowledge base of the new app structure, somehow fine-tune the model again, or find some other workaround that I can't think of (xD)."
Since the process learned just by watching the user do it (kinda like an intern sitting at the desk watching you), it "kinda" fixed the app-updates problem: you could just retrain it by demonstration, without fine-tuning or RAG over app documentation, which most of the time doesn't include every action in step-by-step format. A rough sketch of the recording-to-plan idea is below.
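Roughly, the Train idea reduces to: record the user's demonstration, have the model turn it into a named step-by-step plan, and retrieve that plan the next time a similar request comes in. The recording details, prompt, and storage format below are simplified assumptions, not my exact code:

```python
# A simplified sketch of the Train function: record a demonstration, ask
# the model to turn it into a reusable plan, and retrieve that plan later.
# Not the exact ATLAS implementation.
import json, time
from pathlib import Path
from pynput import mouse, keyboard   # user-action recording

PLANS = Path("plans.json")

def record_demo(duration_s=30):
    """Record raw mouse clicks and key presses for a fixed demo window."""
    events = []

    def on_click(x, y, button, pressed):
        if pressed:
            events.append({"t": time.time(), "click": [x, y]})

    def on_press(key):
        events.append({"t": time.time(), "key": str(key)})

    with mouse.Listener(on_click=on_click), keyboard.Listener(on_press=on_press):
        time.sleep(duration_s)
    return events

def call_llm(prompt):
    """Placeholder for the Gemini API call (optionally fed the screen video too)."""
    raise NotImplementedError

def train(task_name, duration_s=30):
    events = record_demo(duration_s)
    plan = call_llm(f"Turn this event log into numbered, generic steps for "
                    f"'{task_name}':\n{json.dumps(events)}")
    plans = json.loads(PLANS.read_text()) if PLANS.exists() else {}
    plans[task_name] = plan
    PLANS.write_text(json.dumps(plans, indent=2))

def retrieve_plan(user_request):
    plans = json.loads(PLANS.read_text()) if PLANS.exists() else {}
    if not plans:
        return None
    # Let the model pick the closest stored plan (or none) for this request.
    choice = call_llm(f"Pick the best plan name from {list(plans)} for: "
                      f"{user_request}. Reply with the name or 'none'.").strip()
    return plans.get(choice)
```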
Observe Function: Summarizes daily activities to provide insights and suggest improvements.
- Specific insights: This isn't revolutionary either, just like Jayu; even this basic feature has a bit more novelty, I guess, but it is a privacy/security demolisher. It could still be useful for a lot of people, though. Basically two workers running on separate threads: one records the screen in one-minute chunks in a loop, while the other summarizes the previous chunk and appends it to a file as the next recording starts. That way you can continuously summarize the whole day into a single file without ever hitting the 40-minute-to-1-hour video limit (I don't remember which it was). The downsides are pretty clear. The upside is that you can use it to remember things, organize information, and let the AI system "provide tailored experiences" using this knowledge. It is basically good for the memory system: you can textualize your actions on your PC. A small sketch of the two-thread loop is below.
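Sketched out, the Observe loop is just a producer/consumer pair. The recording and summarizing helpers below are placeholders for the actual screen-capture and Gemini video calls, not my exact implementation:

```python
# A small sketch of the Observe loop: one thread records the screen in
# one-minute chunks, another summarizes each finished chunk into a daily
# log file while the next chunk is already being recorded.
import queue
import threading

chunks = queue.Queue()

def record_screen_chunk(seconds=60):
    """Placeholder: capture the screen for `seconds`, return a video path."""
    raise NotImplementedError

def summarize_video(path):
    """Placeholder: send the chunk to the multimodal model, return a summary."""
    raise NotImplementedError

def recorder():
    while True:                      # producer: keep making 1-minute chunks
        chunks.put(record_screen_chunk(60))

def summarizer(logfile="daily_log.txt"):
    while True:                      # consumer: summarize chunks as they finish
        path = chunks.get()
        with open(logfile, "a") as f:
            f.write(summarize_video(path) + "\n")

def start_observe():
    threading.Thread(target=recorder, daemon=True).start()
    threading.Thread(target=summarizer, daemon=True).start()
```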
Live Fix: Delivers real-time problem-solving with continuous processing, unlike the discrete main multi-agent system.
- Specific insights: Saying "discrete" was not entirely correct; all PC processes are discrete by nature. What I meant by continuous processing is basically that it can loop. There was a plain screenshot loop, a loop-until-the-screen-changes mode, and a loop-with-a-time-limit mode, so the model could handle tasks that need looping (e.g., "tell me when there is a circle on my screen," "tell me when I do something wrong in the code," etc.). It was still very unreliable, since the currently popular way of integrating LLMs with visual processing (and thus visual reasoning) is very primitive; we just plug them together (even in most architectural approaches). A sketch of the loop-until-screen-change mode is below.
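For reference, here is roughly what the loop-until-screen-change mode looks like: only bother the model when the pixels have actually changed, and give up after a time limit. The threshold, prompt, and `call_llm` helper are assumptions for illustration:

```python
# A sketch of the "loop until the screen changes" mode: grab screenshots,
# call the model only when the screen has changed, stop after a time limit.
import time
import numpy as np
from PIL import ImageGrab

def call_llm(prompt, image):
    """Placeholder for the multimodal model call."""
    raise NotImplementedError

def watch(condition, time_limit_s=300, change_threshold=5.0):
    previous = np.asarray(ImageGrab.grab().convert("L"), dtype=np.float32)
    deadline = time.time() + time_limit_s
    while time.time() < deadline:
        time.sleep(1)
        current = np.asarray(ImageGrab.grab().convert("L"), dtype=np.float32)
        # Mean absolute pixel difference as a cheap "did the screen change?" check.
        if np.abs(current - previous).mean() > change_threshold:
            verdict = call_llm(f"Does this screen satisfy: '{condition}'? yes/no",
                               ImageGrab.grab())
            if verdict.strip().lower().startswith("yes"):
                return True
        previous = current
    return False
```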
Agent Framework: Allows you to build and visually connect custom agents and workflows.
- Specific insights: Back then, only systems like AutoGen existed, and they weren't that easy for the general public to use. They still aren't, but I wanted to make at least an "easy to use" version, and one of my ideas for the future was to let advanced users add their own agents to the "main assistant" through a validation process, expanding/customizing the capabilities of the general ATLAS system themselves. But since I made it, and even while I was making it, a lot of new agent frameworks came out, AutoGen was updated, and so on. It was still kinda neat to integrate it with the other systems in ATLAS, but that was all there was to it. There was also a builder bot that could build these workflows for you, and you could test them in real time, and so on. But those were optimizations to the user experience; the core idea didn't change and was not novel. A tiny sketch of the underlying workflow structure is below.
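Under the hood, "visually connecting custom agents" boils down to a named graph of agents whose outputs feed each other. The agent names, prompts, and runner below are simplified placeholders, not my actual framework code:

```python
# A tiny sketch of an agent workflow: each node is an agent with a role
# prompt; edges say whose output it consumes. Simplified placeholder code.
def call_llm(prompt):
    """Placeholder for the Gemini API call."""
    raise NotImplementedError

workflow = {
    "researcher": {"prompt": "Research the topic and list key facts.", "inputs": []},
    "writer":     {"prompt": "Write a summary from the research.", "inputs": ["researcher"]},
    "reviewer":   {"prompt": "Review and correct the summary.", "inputs": ["writer"]},
}

def run_workflow(workflow, task):
    outputs = {}
    for name, node in workflow.items():          # assumes nodes are listed in execution order
        upstream = "\n".join(outputs[i] for i in node["inputs"])
        outputs[name] = call_llm(f"{node['prompt']}\nTask: {task}\nInput:\n{upstream}")
    return outputs
```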
Speech Control: Understands and acts on voice commands, enabling everyone to use it.
- Specific insights: This wasn't the cliché speech-to-text/text-to-speech; I had those options too, but they are too generic. You just copy-paste some libs, or if you have nothing better to do, you write them yourself. I implemented this specifically for the "impact/accessibility" criterion. It is basically a feature that lets non-technical or disabled people (in the improved version of the feature) use the app with speech alone. You can ask questions about the app (for which I didn't have time to write full documentation), and you can ask it to do things for you inside the app. It knows the current state of the UI and which elements are visible on your screen, and it does NOT get that information visually, which makes it more reliable. A sketch of that non-visual UI-state idea is below.
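The key point is that the app can report its own UI state programmatically, so no screenshots or OCR are involved. The state dictionary, actions, and prompts below are simplified placeholders, not my exact implementation:

```python
# A sketch of non-visual speech control: the app exposes its own UI state,
# and the model maps a spoken command onto one of the app's actions.
import json

def call_llm(prompt):
    """Placeholder for the Gemini API call."""
    raise NotImplementedError

def transcribe(audio):
    """Placeholder for speech-to-text."""
    raise NotImplementedError

# The app knows its own widgets, so no OCR/vision is needed.
UI_STATE = {
    "current_tab": "Agents",
    "visible_buttons": ["New Agent", "Train", "Observe", "Settings"],
}
ACTIONS = {"open_tab": lambda name: print(f"opening {name}"),
           "click_button": lambda name: print(f"clicking {name}")}

def handle_voice(audio):
    command = transcribe(audio)
    reply = json.loads(call_llm(
        f"UI state: {json.dumps(UI_STATE)}\n"
        f'User said: "{command}"\n'
        'Reply with {"action": "open_tab"|"click_button", "target": ...}.'
    ))
    ACTIONS[reply["action"]](reply["target"])
```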
ATLAS includes over 200 specialized functions and harnesses Gemini’s multi-modal capabilities to manage complex tasks. It improves over time and is customizable: you can add knowledge sources, select text-to-speech options, choose different AI models, work with various file types, and switch between safe and efficient modes—all while keeping your data secure on your local system.
Now it's time to humble myself (since the main point is not to talk about my own project but to criticize the competition and the people responsible (NOT THE GOOGLE COMPANY; THIS IS A CRITIQUE OF THE JUDGES)):
- ATLAS has a lot of functionality. It can be hard for an average person to use, even with all the AI explanation (the speech-control feature). The average grandmother doesn't know how to send a text message. This limits the user count.
- (Note: the competition expected demos (as specified in the video/rules). What I am talking about is an improvement in the user experience, which counts as further refinement; you don't expect that from an app built in 2-3 months. Jayu was running in a terminal?)
- The system got better the more you used it (thanks to the memory, train, and observe features), but the default/untrained system wasn't that consistent, because without those features in effect it was a multi-agent system still running on the base Gemini models, which have all of the problems I listed earlier. Most users don't want to train their own stuff, so they would need to download it with some memory preloaded, and so on.
- It wasn't refined. Even though I can say "yeah, but we only had 2-3 months!", facts are facts. I fixed the bugs I noticed, but I didn't do any red-teaming, in-depth testing, and so on. I only had a list of "test tasks" with varying complexity that I changed from time to time, unit tests for the 200+ functions, and general testing of the program, but that was it. I could still find 174147174178 more bugs if I tried, especially in the untrained/default system.
- It ate tokens like crazy (which might be good for marketing the Gemini model? But it was caused by unrefined prompts in the system). I gave a really long prompt telling the models what to do without too many specific examples (since they got hyperfocused and worse at other tasks when I gave specific examples for specific tasks), so I had to create a language-level algorithmic guide for them to follow. I was using a premium tier or something with the free trial on Google Cloud, but after the trial was over, I realized I couldn't run it within the free rate limits.
- The base intelligence doesn't improve. It still depends heavily on the model's quality and capability. I can do whatever I want, all the prompting tricks and such, which help with task performance, but in the end the base model is the same. I didn't make a system that can discover a pattern it didn't already have; it didn't gain higher abstraction and reasoning out of nowhere.
- It was slow. Even though it was fast on simple tasks (e.g., the kind of thing Jayu does when typing text, or Omni checking battery power), it still took around 14 minutes on the task I showed in the demo:
- "Research the top 5 emerging technologies in renewable energy, get the top 3 results, analyze their potential impact on global carbon emissions, compile a Word document with your findings and recommendations for future investments, then create a comparative chart of their efficiencies using code, and save the code file and the chart.png on my desktop."
- For the integrity of the demo, I had to use the base system without any memories and so on:
- It took three tries. First, it forgot that it needed to check my desktop path before creating the files (even though the base prompt specifies adding that step before any file operation). Second, it forgot to concatenate/summarize the text from the web and hit a rate limit.
- It is a privacy nightmare. Yeah, I mentioned this before too (for Jayu) and said it was the LLM provider's problem, but this app is even more of a nightmare. It can access EVERYTHING; there are registry and file-operation agents, which isn't a good selling point. In the demo I say things like "yeah, it saves it locally," but in the end you are sending data to Google's servers AND keeping all that new information locally on your computer (which could be mitigated with encryption or something, but it is still very dangerous).
For more information, here is the YouTube video of ATLAS: https://youtu.be/OfCHMGhtr_I?feature=shared
And so on…
Before I close this out, I wanna say:
- Thank you if you have read this far, "dear competitor." I hope it was worth it, and I apologize if it wasn't. I am currently a human being with some emotions, and I might be very biased, but I hope I conveyed my points even with my current limitation of being a bit mad and confused. I don't know you, but I lose nothing by wishing you good luck with your life.
- I am an amateur, maybe worse. Anything I say can change as soon as I learn something new. I don't stand that firmly by my points, and given sufficient logical counterarguments, I will apologize and then run away.
- To the person who created Jayu: I am really sorry. You didn't do anything bad at all; on the contrary, you made a great video and a great project, and compared to a lot of other projects I would still place it high. I just wanted to criticize the judging process and the judges themselves, but your project got caught in the middle. I couldn't think of another way to write this; it is solely my own shortcoming, and I am open to and welcome any harsh words from you in my private messages.
- Some parts may come off as overly emotional and unprofessional, and I understand and agree with that. But I am not a professional, nor a corporation. I competed in this competition/hackathon and wanted to give my opinions about it without directly insulting "specific people," only "concepts/processes." I might have slipped up somewhere without realizing it, and I apologize if that is the case. "I looked through a lot of stuff and there was this and that reason, please excuse my behaviour." I meant what I said in the text above, and anger is a great tool. I am not going to give an AI-generated, basic formal response. I am in my rebellious period(!).