Hi,
I have observed that prompts of a certain complexity (a few hundred words) applied to large contexts (500K tokens) frequently fail with 504 Deadline errors. I asked how to manage the deadline but got no answer. While experimenting, I found that reducing either the prompt complexity or the context size made the 504s go away.
There must be some rough mathematics we can apply to work out whether a given inference is likely to finish within the deadline. (I say “rough” because length of prompt != inferential complexity.)
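To make the question concrete, here is the kind of back-of-envelope check I have in mind. It is only a sketch: it assumes latency is roughly linear in context size and output length, and the function names, the throughput rates, and the safety factor are all hypothetical placeholders, not measured or documented values for any service. It also ignores prompt complexity entirely, which is exactly the part I don't know how to model.

```python
def estimate_seconds(context_tokens, output_tokens,
                     prefill_tokens_per_s, decode_tokens_per_s):
    """Rough latency guess: time to read the context plus time to generate.

    Both rates are assumptions the caller must supply, ideally from
    timing their own successful requests.
    """
    if prefill_tokens_per_s <= 0 or decode_tokens_per_s <= 0:
        raise ValueError("rates must be positive")
    read_time = context_tokens / prefill_tokens_per_s
    write_time = output_tokens / decode_tokens_per_s
    return read_time + write_time


def likely_within_deadline(estimate_s, deadline_s, safety_factor=1.5):
    """True if the padded estimate still fits inside the deadline."""
    return estimate_s * safety_factor <= deadline_s


# Placeholder numbers only: 500K-token context, 1K-token answer.
estimate = estimate_seconds(500_000, 1_000,
                            prefill_tokens_per_s=10_000,
                            decode_tokens_per_s=50)
print(estimate, likely_within_deadline(estimate, deadline_s=120))
```

If the real cost grows faster than linearly with context size, or depends heavily on what the prompt asks for, a model like this would underestimate, so I would treat it as a lower bound at best.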
Can someone provide insight on this?