This is a very important question for developers.
Hey, I am using context caching and it looks good so far, but I have a question about Gemini pricing. There is a limit of 4 million tokens processed per minute. Say each query carries 500k tokens and I have 500 users: in a single minute Gemini can then process only 8 requests and will block the rest, because the 4 million tokens for that minute are already used up. So here is my question:
“Will my context-caching tokens count toward this 4 million or not?”
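For reference, here is roughly my setup. This is a minimal sketch with the google-generativeai SDK; the model name, TTL, and file path are placeholders for my real values. As far as I can tell, the response's usage_metadata reports a cached_content_token_count alongside the prompt_token_count, which is exactly why I'm unsure how the rate limiter counts them:

```python
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_API_KEY")

# Cache the large shared context once (placeholder file, model, and TTL).
big_context = open("big_context.txt").read()
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",
    contents=[big_context],
    ttl=datetime.timedelta(minutes=60),
)

# Every per-user request reuses the cache and only sends a short prompt.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("short per-user question")

# usage_metadata reports how many prompt tokens were served from the cache.
# My question: do those cached tokens also draw from the 4M-per-minute budget?
print(response.usage_metadata.prompt_token_count)
print(response.usage_metadata.cached_content_token_count)
```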
Let's take two situations:
1st: I send 1 million tokens as fresh input, so a minute covers only 4 requests. If 100 requests come in, only the first 4 return and the rest are rejected.
2nd: the same 1 million tokens are context cached. If 100 requests come in, will it process all 100, or still only the first 4? (Rough arithmetic below.)
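Put as back-of-the-envelope arithmetic (my own sketch; it assumes the limit is enforced per calendar minute, and the ~1,000-token fresh prompt in situation 2 is just an assumed figure):

```python
TPM_LIMIT = 4_000_000  # tokens per minute
REQUESTS = 100

# Situation 1: the full 1M tokens arrive as fresh input on every request.
tokens_billed_per_request = 1_000_000
served = min(REQUESTS, TPM_LIMIT // tokens_billed_per_request)
print(served)  # 4, so the other 96 requests get rejected that minute

# Situation 2: the 1M tokens sit in the cache and only a short prompt
# (say ~1,000 tokens) is fresh. IF cached tokens are exempt from TPM:
tokens_billed_per_request = 1_000
served = min(REQUESTS, TPM_LIMIT // tokens_billed_per_request)
print(served)  # 100, every request fits within the minute's budget
```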
This directly affects my request handling.
For example: even if the pricing page says Gemini can handle 1,000 requests per minute per user, with user prompts of 50,000 tokens it will actually serve only 80 requests, not 1,000, because of the 4-million-token limit. A non-technical person would simply conclude it cannot handle the 1,000 requests the pricing page promises. That hits the user experience directly.
On the other side, if context-cached tokens are not part of the 4 million, then it can easily serve far more than 80 requests, which directly improves how many user requests the AI can handle.
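The same calculation, generalized (again just a sketch: the 1,000 RPM and 4M TPM figures are from the pricing page, while the ~1k-token fresh prompt is my assumption):

```python
def effective_rpm(rpm_limit: int, tpm_limit: int, billed_tokens_per_request: int) -> int:
    """Requests actually served per minute: the tighter of the two limits wins."""
    return min(rpm_limit, tpm_limit // billed_tokens_per_request)

# If cached tokens DO count toward TPM, 50k-token prompts cap me at 80 RPM:
print(effective_rpm(1_000, 4_000_000, 50_000))  # 80

# If cached tokens DON'T count and only ~1k fresh tokens are billed,
# the advertised 1,000 RPM actually becomes reachable:
print(effective_rpm(1_000, 4_000_000, 1_000))   # 1000
```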