503s on 2.5 flash


2.5 Flash is crashing my applications. These 503s have gotten consistently worse. What is the deal? I am seriously considering switching models. It has been like this all morning. I can't test our new rollout, and I'm losing tons of money on devs sitting around. Par for the course, Google.

This is the fix I am using for these 503s in production, if it helps anyone:

  1. Invocation #1:

    • _validate_video_quality is called.

    • The Gemini API is overloaded.

    • The function waits 2.5 minutes, then raises a RetryError.

    • This error is not caught inside the function. It propagates up to the main lambda_handler.

    • The lambda_handler’s main try…except block catches the error, logs it, and then re-raises it (see the handler sketch after this list).

    • Because the handler exits with an unhandled exception, the entire Lambda invocation is marked as FAILED.

    • Crucially, no subsequent Gemini calls are made in this run.

  2. AWS/S3 Takes Over:

    • The S3 event trigger sees that the Lambda invocation failed. It considers the event unprocessed.

    • Because S3 invokes the Lambda asynchronously, failed invocations are retried automatically. The service waits for a period (e.g., 1 minute) and then re-invokes the entire orchestrator function from the beginning with the exact same S3 event.

  3. Invocation #2 (The Retry):

    • The orchestrator starts over. It downloads the file, uploads it, checks the status, and calls _validate_video_quality again.

    • By now, the transient overload on the Gemini API may have resolved.

    • If the call now succeeds, the orchestrator proceeds as normal to the subsequent calls (PASS_0_TRIAGE…), which will now also likely succeed.

    • If the API is still overloaded, this entire retry process will happen again (up to the asynchronous invocation retry limit, 2 retries by default).
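
For reference, here is a minimal sketch of the handler pattern the flow above relies on. The name _run_orchestration is just a stand-in for the real pipeline (download from S3, upload to Gemini, _validate_video_quality, then the PASS_0_TRIAGE… calls); the key point is that transient errors are logged and re-raised so the invocation fails and the event gets retried automatically:

import json
import logging

from google.api_core.exceptions import RetryError

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    try:
        # Stand-in for the real orchestration: download the file, upload it to
        # Gemini, run _validate_video_quality, then the PASS_0_TRIAGE… passes.
        result = _run_orchestration(event)
        return {"statusCode": 200, "body": json.dumps(result)}
    except RetryError as e:
        # Gemini is still overloaded after the in-function retries (503s).
        # Log and re-raise so the invocation is marked FAILED and retried.
        logger.error(f"Gemini API still overloaded, failing this invocation: {e}", exc_info=True)
        raise
    except Exception as e:
        # Any other unexpected failure: same idea, fail loudly and let the retry happen.
        logger.error(f"Orchestration failed: {e}", exc_info=True)
        raise

The _validate_video_quality function itself only swallows the truly fatal errors and lets the transient ones propagate: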

from google.api_core import retry  # provides the Retry policy passed via request_options below

def _validate_video_quality(model, gemini_file) -> dict:
    logger.info("--- Running Tier 0: Pre-Analysis Sanity Check ---")
    try:
        # Define a custom retry strategy with a shorter total deadline to prevent Lambda timeouts.
        custom_retry_policy = retry.Retry(deadline=150) # Total timeout of 150 seconds for all retries.

        response = model.generate_content(
            [gemini_file, PASS_0_0_PRE_ANALYSIS_SANITY_CHECK_PROMPT],
            request_options={"timeout": 120, "retry": custom_retry_policy}
        )
        data = _get_json_from_gemini_response(response, "Pre-Analysis Sanity Check")
        assessment = data.get('video_quality_assessment')
        
        # Check for a well-formed response from the AI model.
        if not assessment or 'is_analyzable' not in assessment:
            raise ValueError("Malformed response from Pre-Analysis Sanity Check. AI output did not match expected schema.")
        
        logger.info(f"Pre-analysis check complete. Is analyzable: {assessment.get('is_analyzable')}.")
        return assessment

    # --- FIX START: More specific exception handling ---
    # Only catch ValueErrors (e.g., malformed JSON) which are truly fatal for this step.
    # Let RetryErrors and other transient API errors propagate up to the main handler.
    # This allows the Lambda invocation to fail and trigger an automatic retry from S3.
    except ValueError as e:
        logger.error(f"Pre-Analysis Sanity Check failed due to a data or schema error: {e}", exc_info=True)
        return {
            "is_analyzable": False,
            "reason": f"The AI model returned a malformed response during the initial quality check: {e}"
        }
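
If you want to tighten the in-function retry further, google.api_core also lets you scope the policy with a predicate so it only backs off on the overload errors (503 / ServiceUnavailable) rather than every transient failure. This is just a sketch, with starting values you would want to tune against your own Lambda timeout:

from google.api_core import exceptions, retry

# Only retry the "model overloaded" errors (503 ServiceUnavailable), back off
# exponentially, and give up well inside the Lambda timeout.
overload_retry_policy = retry.Retry(
    predicate=retry.if_exception_type(exceptions.ServiceUnavailable),
    initial=2.0,     # first wait: 2 seconds
    maximum=30.0,    # cap the wait between attempts at 30 seconds
    multiplier=2.0,  # double the wait after each attempt
    deadline=150,    # same 150-second total budget as above
)

response = model.generate_content(
    [gemini_file, PASS_0_0_PRE_ANALYSIS_SANITY_CHECK_PROMPT],
    request_options={"timeout": 120, "retry": overload_retry_policy},
)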