LLM APIs as programming functions: what is the best design?

These days, I am using LLM APIs as programming functions.

In particular, with structured output and renderable prompt templates, I define the function’s logic in the prompt and call it from my Python code to perform text-understanding or reasoning tasks. This is extremely helpful and far easier to develop than traditional NLP functions.
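As a concrete example, this is roughly what one of these functions looks like (a minimal sketch assuming the google-genai SDK; the sentiment task, model name, and schema are just placeholders for illustration):

```python
# Rough sketch: wrapping a single LLM call as a typed Python function.
# Assumes the google-genai SDK and a GEMINI_API_KEY in the environment;
# the task, model name, and schema are placeholders.
from google import genai
from pydantic import BaseModel

client = genai.Client()


class Sentiment(BaseModel):
    label: str  # e.g. "positive" / "negative" / "neutral"
    confidence: float


PROMPT = "Classify the sentiment of the following review:\n\n{text}"


def classify_sentiment(text: str) -> Sentiment:
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=PROMPT.format(text=text),
        config={
            "response_mime_type": "application/json",
            "response_schema": Sentiment,
        },
    )
    return response.parsed  # the SDK parses the JSON into the Pydantic model
```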

However, I haven’t found the best way to run such a function repeatedly. For example, if I have 1,000 inputs, which approach gives the best trade-off between cost and output quality?

  1. Just call the function 1,000 times:
    I assume output quality will be the best, but it is not very cost-efficient.

  2. Use context caching:
    This is the easiest way to reduce cost, and there is no performance degradation. However, the maximum cost reduction is 4x (cached tokens are billed at 25% of the normal price).

  3. Batch input:
    Instead of sending one item per call, I can write the prompt so that the input is a list of items, instruct the model to perform the task for each item, and define the output schema as a list of the desired per-item structure (see the sketch after this list).
    This is the option with the largest potential cost reduction; in theory, I could send all 1,000 items in a single API call.
    I haven’t measured the performance degradation, but thanks to Gemini’s long context window it could work well. Still, I am worried about degradation if I put too many items into one call.
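For reference, the batched version of the single-item sketch above would look roughly like this (same assumptions; it reuses `client` and `Sentiment` from the earlier sketch, and the prompt is untuned):

```python
# Rough sketch of option 3: many items in one call, list-shaped structured output.
# Reuses `client` and `Sentiment` from the single-item sketch above.
from pydantic import BaseModel


class SentimentBatch(BaseModel):
    results: list[Sentiment]  # one entry per input item, in the same order


BATCH_PROMPT = (
    "You will receive a numbered list of reviews. For EACH review, classify its "
    "sentiment. Return exactly one result per review, in the same order.\n\n{items}"
)


def classify_sentiment_batch(texts: list[str]) -> list[Sentiment]:
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(texts))
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=BATCH_PROMPT.format(items=numbered),
        config={
            "response_mime_type": "application/json",
            "response_schema": SentimentBatch,
        },
    )
    return response.parsed.results
```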

The best method probably combines #2 and #3. My final question is: how can I figure out the ideal number of inputs for a single API call?
Or do you have better ideas?
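The only idea I have so far is a brute-force sweep over candidate batch sizes, roughly like the sketch below, but I am not sure it is the right approach. The scoring here is deliberately crude (it only checks whether the output count matches the input count); a real evaluation would also score quality against labelled examples.

```python
# Rough sketch for picking a batch size empirically.
# Uses classify_sentiment_batch() from the sketch above; the "score" is only
# whether the model returned one result per input, which is a weak proxy.
import random


def count_match_rate(items: list[str], batch_size: int, trials: int = 5) -> float:
    ok = 0
    for _ in range(trials):
        batch = random.sample(items, batch_size)
        results = classify_sentiment_batch(batch)
        ok += len(results) == len(batch)
    return ok / trials


def sweep(items: list[str]) -> None:
    # items: a held-out set of representative inputs
    for batch_size in (5, 10, 20, 50, 100):
        rate = count_match_rate(items, batch_size)
        print(f"batch_size={batch_size}: count-match rate {rate:.0%}")
```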

Hello,

You have summarized all three cases correctly.
Regarding your question: the ideal number of inputs, or batch size, is highly dependent on your use case, and there is no single answer to that. You will have to find the right balance for your use case through experimentation. This is a good blog to begin your research on this topic.


Hi @Lalit_Kumar,

Thanks for your answer.

It is great to know that my concern is at least valid. It is much easier to find the solution to a given problem than to figure out whether it is a problem worth solving in the first place.

I haven’t built a tool to measure the ideal batch size yet, but my experience so far gives me some hints.

I have found that the LLM sometimes returns a single object even though I input 10 items. I read that as a symptom of the model finding the task difficult and confusing, which means the batch size needs to be reduced.
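As a stopgap, I now guard against that case with something like the sketch below (it reuses `classify_sentiment_batch` and `Sentiment` from the sketches in my first post): when the counts do not line up, it splits the batch in half and retries.

```python
# Rough sketch of a runtime guard: if the model returns a different number of
# results than inputs, split the batch in half and retry recursively.
def classify_with_fallback(texts: list[str]) -> list[Sentiment]:
    results = classify_sentiment_batch(texts)
    if len(results) == len(texts):
        return results
    if len(texts) == 1:
        raise RuntimeError("Model did not return exactly one result for one input")
    mid = len(texts) // 2
    return classify_with_fallback(texts[:mid]) + classify_with_fallback(texts[mid:])
```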

Happy that I could help. Happy coding :+1:
