Some batch embedding jobs finish with a 0-byte result file

:rocket: [Issue] Batch Embedding result file is 0 bytes after JOB_STATE_SUCCEEDED

Summary

When using the google-genai Python SDK, client.files.download() intermittently returns an empty bytes object (0 bytes) even after the Batch Embedding job status has reached JOB_STATE_SUCCEEDED. No exceptions are raised, but the downloaded file contains no data.


Step-by-Step Workflow

We are using the Gemini Batch Embedding API (asynchronous) to process large JSONL corpora. Our implementation follows these steps:

1. Input Preparation & Upload

We generate a JSONL file where each line follows the required Batch API structure. Note the use of the key field for metadata mapping:

{
  "key": "unique_chunk_id_001",
  "request": {
    "model": "models/gemini-embedding-001",
    "content": { "parts": [{ "text": "The text to embed..." }] },
    "taskType": "RETRIEVAL_DOCUMENT",
    "title": "Optional Document Title",
    "outputDimensionality": 3072
  }
}
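The object above is pretty-printed for readability; in the uploaded file each request has to sit on a single line. A minimal sketch of how such lines could be produced is below. The helper names `make_request_line` and `write_jsonl` are hypothetical and only for illustration; the field names are the ones shown above.

```python
import json
from typing import Dict, Optional


def make_request_line(key: str, text: str, title: Optional[str] = None,
                      dims: int = 3072) -> str:
    """Serialize one embedding request as a single JSONL line."""
    request = {
        "model": "models/gemini-embedding-001",
        "content": {"parts": [{"text": text}]},
        "taskType": "RETRIEVAL_DOCUMENT",
        "outputDimensionality": dims,
    }
    if title is not None:
        request["title"] = title
    # json.dumps escapes newlines inside strings, so the record stays on one line.
    return json.dumps({"key": key, "request": request})


def write_jsonl(path: str, chunks: Dict[str, str]) -> int:
    """Write one request line per (key, text) pair; return the line count."""
    with open(path, "w", encoding="utf-8") as f:
        for key, text in chunks.items():
            f.write(make_request_line(key, text) + "\n")
    return len(chunks)
```

Keeping the `key` unique per chunk is what lets us map each result back to its source text after download.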

2. Job Submission

The file is uploaded via client.files.upload() and the batch job is created:

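from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment
# uploaded_file_name is the name of the file returned by client.files.upload()
job = client.batches.create_embeddings(
    model="models/gemini-embedding-001",
    src=types.EmbeddingsBatchJobSource(file_name=uploaded_file_name),
    config={"display_name": "batch_emb_run_v1"}
)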

3. Polling & Success State

We poll client.batches.get() until the state is JOB_STATE_SUCCEEDED.
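For reference, our polling loop looks roughly like the sketch below. The helper `wait_for_job`, the polling interval, the timeout, and the exact set of terminal states are assumptions made for illustration, not something taken from the SDK. The job lookup is passed in as a callable so the loop can be exercised without network access.

```python
import time
from typing import Any, Callable

# Assumed set of terminal states; adjust if the API reports others.
TERMINAL_STATES = {
    "JOB_STATE_SUCCEEDED",
    "JOB_STATE_FAILED",
    "JOB_STATE_CANCELLED",
    "JOB_STATE_EXPIRED",
}


def wait_for_job(get_job: Callable[[], Any],
                 poll_seconds: float = 30.0,
                 timeout_seconds: float = 24 * 3600,
                 sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call get_job() until the job reaches a terminal state or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = get_job()
        if job.state.name in TERMINAL_STATES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still in {job.state.name} after timeout")
        sleep(poll_seconds)
```

In our setup this would be called as `job = wait_for_job(lambda: client.batches.get(name=job.name))`, and the caller then checks whether `job.state.name` is `JOB_STATE_SUCCEEDED` before touching `job.dest`.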

4. Result Download (The Failure Point)

Once successful, we retrieve the output file name from job.dest.file_name and attempt to download:

# job.state.name == "JOB_STATE_SUCCEEDED"
result_file_name = job.dest.file_name  # e.g., "files/output_123"
content: bytes = client.files.download(file=result_file_name)

# Result: len(content) is 0

Problem Description

  • Intermittent Occurrence: This does not happen every time. Some of the batch jobs download perfectly, while others (run under the same conditions and projects) return 0 bytes.
  • No Exceptions: The download() call completes without 404, 403, or any SDK-level errors.
  • Valid Metadata: job.dest.file_name always contains a valid string path.
  • Empty Output: The returned bytes object is b"".

Environment

| Key | Value |
| --- | --- |
| SDK | google-genai (Python) |
| Model | models/gemini-embedding-001 |
| Request Schema | Wrapped in "request" field with "key" identifier |
| Parallelism | Multiple billing projects/API keys used concurrently |

Questions to the Community

  1. Consistency/Race Condition: Is it possible for the job state to hit SUCCEEDED before the File API has finished committing the result buffer to storage?
  2. Pre-download Validation: Is it recommended to poll client.files.get(name=result_file_name) and check if size_bytes > 0 before calling download()?
  3. Correct Endpoint: Is client.files.download() the standard way to fetch batch results, or is there a more robust method (e.g., streaming) recommended for embeddings?
  4. Retry Strategy: Since the API reports the call as successful (no error code), should we implement a manual retry loop specifically for 0-byte responses?
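To make question 4 concrete, this is the kind of manual retry loop we have in mind. It is only a sketch: the helper name `download_non_empty`, the attempt count, and the backoff values are our own assumptions, and we do not know whether a later attempt actually returns data. The download is passed in as a callable so the logic can be tested without calling the API.

```python
import time
from typing import Callable


def download_non_empty(download: Callable[[], bytes],
                       max_attempts: int = 5,
                       base_delay: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Retry a download with exponential backoff while it returns 0 bytes."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        content = download()
        if content:  # non-empty bytes: treat as a real result
            return content
        if attempt < max_attempts:
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"download returned 0 bytes on all {max_attempts} attempts")
```

With the SDK this would be used as `content = download_non_empty(lambda: client.files.download(file=result_file_name))`. Raising after the last attempt at least turns the silent "empty success" into an explicit failure we can log and re-queue.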

Any insights from the Google team or other developers who have faced this “empty success” issue would be greatly appreciated!

This post was written by AI.
Any help would be appreciated.