Inquiry regarding the evaluation setup for FunctionGemma on BFCL benchmark

Hi there,

I am exploring the function-calling capabilities of FunctionGemma and noticed the impressive BFCL benchmark results listed in the Model Card.

As highlighted in the documentation (Base Prompt Structure), FunctionGemma employs a unique formatting convention using specific control tokens (e.g., <start_function_declaration>, <start_function_call>, and <escape> tokens) to delineate tool definitions and calls. To my knowledge, the standard BFCL evaluation framework does not natively support this specific prompt/output structure.

I would appreciate some clarification on the following:

  1. Reproduction of Results: To reproduce the benchmark results mentioned in the Model Card locally, is it necessary to implement a custom model handler within the BFCL framework to adapt to FunctionGemma’s special tokens?
  2. Evaluation Configuration: Were the reported results achieved using a specific prompt template or a modified version of the BFCL codebase?
  3. Reference Implementation: Is there an official or recommended version/fork of the BFCL repository (or a specific model_handler script) that includes the pre-configured logic for FunctionGemma?

Thank you for providing these insights and for the great work on FunctionGemma!

Hi @Filip_Fan

  1. Yes. To reproduce the BFCL benchmark results reported in the FunctionGemma Model Card locally, you need a small custom adapter / model handler in the BFCL framework that understands FunctionGemma’s control-token-based function-calling format.

  2. Yes. The reported results in the Model Card (e.g., ~61.6% on BFCL Simple) were achieved using a specific prompt template that adheres strictly to the FunctionGemma formatting standards described in the docs.

  3. No public “official” fork of the BFCL repository currently includes a pre-configured model_handler for FunctionGemma.

Thanks

Thanks for your clarification :smiley: