I have been doing some intensive experiments with 1.5 Pro Experimental for large Python project development across multiple modules. My findings so far (testing on about 50,000 lines of Python across a dozen modules): the model is wonderful in that it has such a large context window. But it struggles with PEP 8 compliance, and it is particularly weak at writing type-safe Python code, especially when complex types are involved. Getting correct type hints out of the model takes many iterations. It also struggles with avoiding circular imports between modules. I've had to develop a set of prompts for managing this - basically a virtual meta-environment for coding with the model - to help it keep track of the rules it has to apply.

My sense is that the model needs much more extensive training on Python coding - in particular type-safe code and modular code. I tested debugging Pylance and mypy errors with the model on a large codebase, side-by-side against GPT-4o, and 4o was far more efficient at resolving type errors - at least a 50% improvement in efficiency. One example: when debugging type errors, 1.5 Pro E tends to get stuck going in circles - trying solution A, failing, trying solution B, failing, then trying solution A again, and so on. Another thing the model struggles with is consistently applying rules, fixes, patterns, and solutions across code - for example, it might get something right a few times but then fail to apply it consistently across the next 40 components. There are many dozens of basic principles the model seems to struggle with, yielding large volumes (thousands) of type checker errors in the code it writes.

I think 1.5 Pro Experimental is extremely promising for large code development, but it's not trained well enough on larger and more complex code projects to really be efficient for this yet. Also, I noticed it really starts to slow down a lot after about 500k tokens in a chat. (And Chrome gets slower and slower in the text input field as the number of fields in a chat increases - that's apparently a known issue with Chrome. Not sure why that isn't fixed yet, but for a long-context model a chat could have a hundred or more turns, and this is a big issue there, as it causes text input to grind to a halt and the browser eventually crashes every time.)

Anyway - kudos to the team working on 1.5 Pro - it's really great in many ways - but I would love to see it drastically improved on training for coding!
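To give a concrete sense of one rule I keep having to re-teach it: most of the circular imports in my project only exist because of type hints, and the standard fix is the `typing.TYPE_CHECKING` guard plus postponed annotations. A minimal sketch, with hypothetical module and class names (`orders.py` / `customers.py` are just illustrations, not my actual modules):

```python
# orders.py - hypothetical module; shows the pattern for breaking a circular
# import that exists only because of type hints.
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only imported by type checkers (mypy/Pylance), never at runtime,
    # so customers.py can safely import orders.py as well.
    from customers import Customer


class Order:
    def __init__(self, order_id: int, customer: Customer) -> None:
        self.order_id = order_id
        self.customer = customer

    def describe(self) -> str:
        return f"Order {self.order_id} for {self.customer.name}"
```

The model knows this pattern when asked directly, but it won't apply it consistently across dozens of modules without being reminded.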
Welcome to the forum. I expect you tested the model ‘gemini-1.5-pro-exp-0827’, correct?
I largely agree with your findings, and to be fair, Claude 3.5 Sonnet also beats Gemini, so Gemini ranks third among the big models in coding tasks.
I used a workaround to avoid the slowdown when the codebase is larger and the browser can’t handle it anymore: direct API calls. I wrote a Python script to open the project source files, assemble the prompt, set the system instruction (which grew iteratively larger as I was testing), and formulate the actual task in such a way that the model output stays within the output token limit. The response times remained acceptable with large sources; I suspect the browser slowing down has a significant impact on the perceived response time.
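The script is nothing fancy - roughly this shape, using the google-generativeai Python package (the API key, system instruction text, project layout, and task string below are placeholders, not my actual setup):

```python
# Simplified sketch: read the project sources, build one large prompt, and
# call the model directly instead of pasting into the browser.
import pathlib

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

SYSTEM_INSTRUCTION = (
    "You are refactoring a large Python project. Follow PEP 8 and PEP 484 "
    "strictly and keep type hints consistent across modules."
)

model = genai.GenerativeModel(
    "gemini-1.5-pro-exp-0827",
    system_instruction=SYSTEM_INSTRUCTION,
)

# Concatenate every module so the whole project sits in context.
sources = []
for path in sorted(pathlib.Path("src").rglob("*.py")):  # placeholder layout
    sources.append(f"### FILE: {path}\n{path.read_text()}")

# Kept narrow enough that the reply fits within the output token limit.
task = "<the concrete refactoring task goes here>"

response = model.generate_content("\n\n".join(sources) + "\n\n" + task)
print(response.text)
```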
One of the tasks was: “A number of support classes are required to implement PlusPy. Looking at the following set of classes: IndexExpression, TupleExpression, SetExpression, FilterExpression, NumberExpression and StringExpression, the code is missing PEP 484 type hints. Rewrite the classes, adding the type hints. Only rewrite the named classes.”
The model doesn’t do well on this type of task; it seems to have been trained far more on traditional Python code without type hints, and it tends to rely on documentation comments rather than type hints.
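For context, the expected output is nothing exotic - just ordinary PEP 484 annotations on those expression classes. A hypothetical illustration (the real PlusPy classes have different attributes; this only shows the shape of the change I was asking for):

```python
# Hypothetical example, not the real PlusPy code: the original class had no
# annotations and relied on its docstring; the rewrite adds PEP 484 hints.
from typing import Any, Mapping


class NumberExpression:
    """Wraps a numeric literal."""

    def __init__(self, value: int) -> None:
        self.value = value

    def eval(self, env: Mapping[str, Any]) -> int:
        # Before the rewrite, __init__ and eval had no parameter or return
        # annotations at all.
        return self.value
```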
Thanks for the reply. The challenge I am undertaking is to refactor a 50K-line Python project with 12 modules. To refactor any one module, the model ideally has to see all the other modules. For that, a long context window is essential (and a model that doesn’t limit the user to just a handful of tokens per day… ). And yes, I want strict type safety and type hints throughout, as well as strict PEP 8 compliance, and of course no errors…
Claude is not bad, but they limit usage so much that I can barely get through a few iterations on a few modules before I have to wait 6 hours for the rate limit to go away. That’s infuriating! I would gladly pay for unlimited token usage as needed (even though my project is completely open source and noncommercial). I can’t believe companies like Anthropic and OpenAI don’t let users just pay as they go in their chat products - they are leaving so much money on the table. In any case, right now the only practical choice seems to be Google with 1.5 Pro Experimental, because at least there you can truly be productive and get through large iterative tasks on lots of code without endless rate limit pauses.
That said, I would be curious to know which models, if any, are actually better at writing type-safe Python code AND have a sufficiently large context window… I’m guessing there isn’t really any other choice.
And to be honest, no other AI tool has a chat GUI that is as productive as Google’s so far, so there’s that too. For example, the ability to delete or change earlier responses is key; nobody else has that so far. I don’t want a watered-down consumer GUI. And (LOL) just sharing the Pylance errors for the project uses up 100K+ tokens… so what I do is paste them in, have the model analyze and classify them to come up with fix patterns, and then delete them from the chat. It’s very useful.
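If it helps anyone, I also pre-group the errors before pasting them, so the chat only has to see a summary rather than the full dump. A rough sketch of the idea, assuming mypy output with error codes shown in trailing brackets (e.g. `file.py:12: error: ...  [return-value]`):

```python
# summarize_errors.py - rough sketch: collapse a mypy error dump into counts
# per error code so a compact summary can be pasted into the chat instead of
# the full 100K-token listing.
import re
import sys
from collections import Counter

# Matches the "[error-code]" suffix mypy appends when error codes are shown.
CODE_RE = re.compile(r"\[([a-z-]+)\]\s*$")


def summarize(lines):
    counts = Counter()
    for line in lines:
        match = CODE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Usage: mypy src/ | python summarize_errors.py
    for code, count in summarize(sys.stdin).most_common():
        print(f"{count:5d}  {code}")
```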
Another interesting finding: if you put code modules into GPT-4o in the ChatGPT Pro GUI, it doesn’t really put them into context the way Google does. For example, if you ask it to analyze them to see whether a class is present, GPT-4o confidently insists it isn’t there… but if you ask repeatedly, it will eventually analyze a portion of the file, then another portion, and if you are lucky it will find that the class actually is there. That’s useless, non-transparent model and UX behavior, and it leads to egregious errors! Google does NOT have this issue - the entire file is truly in context, not in some mutated RAG system that misses stuff. That’s a huge benefit.