These days, I am using LLM APIs as programming functions.
Using structured output and renderable prompt templates, I define the function logic in the prompt and call it from my Python code to perform textual understanding or reasoning tasks. It is extremely helpful and much easier to develop than traditional NLP functions.
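For concreteness, here is a minimal sketch of the pattern I mean, assuming the google-genai SDK with a Pydantic output schema (the model name, schema, and task are illustrative placeholders, not my real ones):

```python
# Minimal "LLM as a function" sketch: structured output via a Pydantic schema.
# Illustrative only; assumes GEMINI_API_KEY is set in the environment.
from google import genai
from google.genai import types
from pydantic import BaseModel

client = genai.Client()

class Sentiment(BaseModel):
    label: str        # e.g. "positive" / "negative" / "neutral"
    confidence: float

def classify(text: str) -> Sentiment:
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=f"Classify the sentiment of this review:\n{text}",
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=Sentiment,
        ),
    )
    return response.parsed  # the SDK parses the JSON into the Pydantic model
```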
However, I haven’t found the best way to run such a function repeatedly. For example, if I have 1,000 inputs, what is the best solution in terms of cost efficiency and output quality?
1. Just repeat the function 1,000 times: I assume output quality will be the best, but it is not very cost-efficient.
2. Use context caching: This is the easiest way to reduce cost, with no degradation in quality. However, the maximum cost reduction is 4x (cached tokens are billed at 25% of the original price).
3. Batch input: Instead of sending one item per call, I can write the prompt so that the input is a list of items, instruct the model to repeat the task for each item, and define the output as a list of the desired output structure (see the sketch after this list). This is where the largest cost reduction is possible; in theory, I could put all 1,000 items into a single API call. I haven’t measured the performance degradation, but thanks to Gemini’s long context window, it could work well. Still, I am worried about degradation if I mistakenly put too many inputs in one call.
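Here is what I mean by #3, reusing the client and Sentiment schema from the sketch above; the ordering instruction and `response_schema=list[...]` are the key parts:

```python
# Batch-input sketch (#3): pack a chunk of items into one prompt and ask
# for a list of results. Reuses client / Sentiment from the sketch above.
def classify_batch(texts: list[str]) -> list[Sentiment]:
    numbered = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(texts))
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=(
            "Classify the sentiment of each review below. "
            "Return exactly one result per review, in the same order.\n"
            + numbered
        ),
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=list[Sentiment],
        ),
    )
    return response.parsed
```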
For the best result, #2 and #3 should be combined. My final question is: how can I figure out the ideal number of inputs for a single API call?
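The only method I can come up with is an empirical sweep: run a small labeled sample at several batch sizes and keep the largest size whose accuracy stays close to the size-1 baseline. A rough sketch, assuming `classify_batch` from above and a list of (text, gold_label) pairs:

```python
# Empirical sweep: accuracy per batch size on a labeled sample.
# Hypothetical helper; the sizes to try would need tuning for the task.
def sweep_batch_sizes(
    samples: list[tuple[str, str]],             # (text, gold_label) pairs
    sizes: tuple[int, ...] = (1, 10, 50, 100, 250),
) -> dict[int, float]:
    accuracy = {}
    for size in sizes:
        correct, total = 0, 0
        for start in range(0, len(samples), size):
            chunk = samples[start:start + size]
            preds = classify_batch([text for text, _ in chunk])
            if len(preds) != len(chunk):
                # the model dropped or invented items: a failure mode in itself
                print(f"size={size}: length mismatch at offset {start}")
                continue
            correct += sum(p.label == gold for p, (_, gold) in zip(preds, chunk))
            total += len(chunk)
        accuracy[size] = correct / total if total else 0.0
        print(f"batch size {size}: accuracy {accuracy[size]:.3f}")
    return accuracy
```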
Or, do you have better ideas?