Crashing my applications on 2.5 flash. These 503s have gotten consistently worse. What is the deal? I am seriously considering switching models. All morning so far. Can’t test new rollout losing tons of money on devs sitting around. Par for the course Google.
This is the fix I'm using for these 503s in production, in case it helps anyone:
Invocation #1:

- _validate_video_quality is called.
- The Gemini API is overloaded.
- The function waits 2.5 minutes, then raises a RetryError.
- This error is not caught inside the function; it propagates up to the main lambda_handler.
- The lambda_handler's main try…except block catches the error, logs it, and then re-raises it.
- Because the handler exits with an unhandled exception, the entire Lambda invocation is marked as FAILED.
- Crucially, no subsequent Gemini calls are made in this run.
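The handler side of that first step can be sketched like this. The names here are placeholders (process_s3_event stands in for whatever your lambda_handler actually orchestrates); the point is the catch, log, and re-raise pattern that makes the invocation fail cleanly:

```python
import logging

logger = logging.getLogger(__name__)


def process_s3_event(event):
    # Stub standing in for the real pipeline (download, upload,
    # _validate_video_quality, the PASS_0 calls, ...). Here it simulates
    # the Tier 0 check hitting a persistent overload.
    raise RuntimeError("RetryError: Gemini API overloaded (503)")


def lambda_handler(event, context):
    """Orchestrator entry point: catch, log, and re-raise so the invocation fails."""
    try:
        return process_s3_event(event)
    except Exception as e:
        logger.error(f"Orchestration failed: {e}", exc_info=True)
        # Re-raising marks the asynchronous invocation as FAILED, which is
        # exactly what triggers the automatic retry with the same S3 event.
        raise
```

The deliberate re-raise is the whole trick: swallowing the error here would mark the event as processed and no retry would ever happen.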
AWS/S3 Takes Over:

- The S3 event trigger sees that the Lambda invocation failed, so the event counts as unprocessed.
- Lambda's built-in retry policy for asynchronous (event) invocations kicks in: it waits (roughly 1 minute before the first retry) and then re-invokes the entire orchestrator function from the beginning with the exact same S3 event.
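That retry behavior is configurable per function. Assuming the AWS CLI and a function named orchestrator (the name is hypothetical), something like:

```shell
# Cap automatic retries for asynchronous (event) invocations at 2 (the default)
# and discard events older than an hour instead of retrying them forever.
aws lambda put-function-event-invoke-config \
    --function-name orchestrator \
    --maximum-retry-attempts 2 \
    --maximum-event-age-in-seconds 3600
```

Worth setting explicitly so the retry budget is visible in config rather than implied by defaults.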
Invocation #2 (The Retry):

- The orchestrator starts over: it downloads the file, uploads it, checks the status, and calls _validate_video_quality again.
- By now, the transient overload on the Gemini API may have resolved.
- If the call now succeeds, the orchestrator proceeds as normal to the subsequent calls (PASS_0_TRIAGE…), which will now also likely succeed.
- If the API is still overloaded, this entire retry process repeats, up to the retry limit for asynchronous invocations (usually 2 retries).
# logger, PASS_0_0_PRE_ANALYSIS_SANITY_CHECK_PROMPT, and
# _get_json_from_gemini_response are defined elsewhere in the module.
from google.api_core import retry


def _validate_video_quality(model, gemini_file) -> dict:
    logger.info("--- Running Tier 0: Pre-Analysis Sanity Check ---")
    try:
        # Define a custom retry strategy with a shorter total deadline to prevent Lambda timeouts.
        custom_retry_policy = retry.Retry(deadline=150)  # Total budget of 150 seconds for all retries.
        response = model.generate_content(
            [gemini_file, PASS_0_0_PRE_ANALYSIS_SANITY_CHECK_PROMPT],
            request_options={"timeout": 120, "retry": custom_retry_policy},
        )
        data = _get_json_from_gemini_response(response, "Pre-Analysis Sanity Check")
        assessment = data.get('video_quality_assessment')
        # Check for a well-formed response from the AI model.
        if not assessment or 'is_analyzable' not in assessment:
            raise ValueError("Malformed response from Pre-Analysis Sanity Check. "
                             "AI output did not match expected schema.")
        logger.info(f"Pre-analysis check complete. Is analyzable: {assessment.get('is_analyzable')}.")
        return assessment
    # --- FIX START: More specific exception handling ---
    # Only catch ValueError (e.g., malformed JSON), which is truly fatal for this step.
    # Let RetryError and other transient API errors propagate up to the main handler.
    # This allows the Lambda invocation to fail and trigger an automatic retry of the S3 event.
    except ValueError as e:
        logger.error(f"Pre-Analysis Sanity Check failed due to a data or schema error: {e}", exc_info=True)
        return {
            "is_analyzable": False,
            "reason": f"The AI model returned a malformed response during the initial quality check: {e}"
        }
    # --- FIX END ---
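If you'd rather not burn a whole Lambda invocation per attempt, the same idea can be done in-process with the stdlib only. This is a sketch, not the google-api-core Retry class: exponential backoff under a total deadline, with a placeholder predicate for spotting transient 503s:

```python
import time


def call_with_backoff(fn, *, deadline=150.0, initial=1.0, multiplier=2.0,
                      is_transient=lambda e: "503" in str(e),
                      sleep=time.sleep, clock=time.monotonic):
    """Retry fn() on transient errors with exponential backoff until the deadline."""
    start = clock()
    delay = initial
    while True:
        try:
            return fn()
        except Exception as e:
            # Non-transient error, or the next sleep would blow the budget:
            # give up and let the caller (or the S3 retry) deal with it.
            if not is_transient(e) or clock() - start + delay > deadline:
                raise
            sleep(delay)
            delay *= multiplier
```

The injectable sleep and clock make it trivial to unit test without real waiting; the 150-second deadline mirrors the custom_retry_policy above so it still fits inside a Lambda timeout.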