### Description of the bug:
I'm trying to create a cache from the contents of multiple PDF files, but when the total number of tokens across the files exceeds approximately 500,000, I receive a 503 error (Service Unavailable) from Google API Core.
The error is not returned immediately, but only after about 40 to 50 seconds, which might indicate that a timeout is occurring inside Google API Core.
### Code
```
import google.generativeai as genai
import os
gemini_api_key = os.environ.get("GEMINI_API_KEY")
genai.configure(api_key=gemini_api_key)
documents = []
file_list = ["xxx.pdf", "yyy.pdf", ...]
for file in file_list:
    gemini_file = genai.upload_file(path=file, display_name=file)
    documents.append(gemini_file)
gemini_client = genai.GenerativeModel("models/gemini-1.5-flash-001")
total_token = gemini_client.count_tokens(documents).total_tokens
print(f"total_token: {total_token}")
# total_token: 592403
gemini_cache = genai.caching.CachedContent.create(model="models/gemini-1.5-flash-001", display_name="sample", contents=documents)
```
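For what it's worth, UNAVAILABLE is normally treated as a retryable status, so I also show a minimal retry sketch below (the `create_cache_with_retry` helper is purely illustrative, not part of the SDK, and assumes the `documents` list from the snippet above). It could rule out a transient outage, although if the failure is tied to input size, each attempt would presumably fail the same way.
```
import time
from google.api_core import exceptions as gapi_exceptions

def create_cache_with_retry(documents, attempts=3, backoff=10):
    # Illustrative helper: retry cache creation on 503 UNAVAILABLE with a
    # fixed backoff, to check whether the error is merely transient.
    for attempt in range(attempts):
        try:
            return genai.caching.CachedContent.create(
                model="models/gemini-1.5-flash-001",
                display_name="sample",
                contents=documents,
            )
        except gapi_exceptions.ServiceUnavailable:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff)
```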
### Version
- Python 3.9.19
- google==3.0.0
- google-ai-generativelanguage==0.6.6
- google-api-core==2.19.0
- google-api-python-client==2.105.0
- google-auth==2.29.0
- google-auth-httplib2==0.2.0
- google-generativeai==0.7.2
- googleapis-common-protos==1.63.0
### Actual vs expected behavior:
### Actual behavior
```
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/google/api_core/grpc_helpers.py", line 76, in error_remapped_callable
return callable_(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 1176, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.9/site-packages/grpc/_channel.py", line 1005, in _end_unary_response_blocking
raise _InactiveRpcError(state) # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "The service is currently unavailable."
debug_error_string = "UNKNOWN:Error received from peer ipv4:172.217.175.234:443 {created_time:"2024-08-06T13:37:03.077186006+09:00", grpc_status:14, grpc_message:"The service is currently unavailable."}"
>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.9/site-packages/google/generativeai/caching.py", line 219, in create
response = client.create_cached_content(request)
File "/usr/local/lib/python3.9/site-packages/google/ai/generativelanguage_v1beta/services/cache_service/client.py", line 874, in create_cached_content
response = rpc(
File "/usr/local/lib/python3.9/site-packages/google/api_core/gapic_v1/method.py", line 131, in __call__
return wrapped_func(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/google/api_core/grpc_helpers.py", line 78, in error_remapped_callable
raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.ServiceUnavailable: 503 The service is currently unavailable.
```
### Expected behavior
```
gemini_cache = genai.caching.CachedContent.create(model="models/gemini-1.5-flash-001", display_name="sample", contents=documents)
print(gemini_cache)
# CachedContent(
# name='cachedContents/l5ataay9naq2',
# model='models/gemini-1.5-flash-001',
# display_name='sample',
# usage_metadata={
# 'total_token_count': 592403,
# },
# create_time=2024-08-08 01:21:44.925021+00:00,
# update_time=2024-08-08 01:21:44.925021+00:00,
# expire_time=2024-08-08 02:21:43.787890+00:00
# )
```
### Any other information you'd like to share?
- https://ai.google.dev/gemini-api/docs/caching?lang=python#considerations
> The minimum input token count for context caching is 32,768, and the maximum is the same as the maximum for the given model. (For more on counting tokens, see the [Token guide](https://ai.google.dev/gemini-api/docs/tokens)).
Upon reviewing the Gemini API documentation, I noticed a mismatch regarding token limits. The maximum token count is described as depending on the specific model in use; in my case I'm using the `models/gemini-1.5-flash-001` model, which has a maximum input token limit of 1,048,576. Based on this, I initially assumed that caching around 500,000 tokens should work without any issues.
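As a sanity check, that limit can also be read from the model metadata returned by the API (small sketch below, using `genai.get_model` with the client configured as in the code above):
```
# Confirm the input token limit the API reports for this model.
model_info = genai.get_model("models/gemini-1.5-flash-001")
print(model_info.input_token_limit)
# 1048576
```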
Moreover, I was able to successfully create a cache with token counts exceeding 800,000 when the cached contents were a plain string. This leads me to suspect that there might be a bug specific to creating caches from uploaded files with high token counts, as opposed to string-based caching.
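For comparison, the string-based case that succeeds looks roughly like this (simplified sketch; the actual text is elided and `long_text` is just a placeholder name):
```
# Sketch of the string-based caching that succeeds in my environment.
long_text = "..."  # plain text amounting to roughly 800k tokens
text_cache = genai.caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",
    display_name="text-sample",
    contents=[long_text],
)
print(text_cache.usage_metadata)
```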