Inquiry regarding the evaluation setup for FunctionGemma on BFCL benchmark

Hi there,

I am exploring the function-calling capabilities of FunctionGemma and noticed the impressive BFCL benchmark results listed in the Model Card.

As highlighted in the documentation (Base Prompt Structure), FunctionGemma employs a unique formatting convention using specific control tokens (e.g., <start_function_declaration>, <start_function_call>, and <escape> tokens) to delineate tool definitions and calls. To my knowledge, the standard BFCL evaluation framework does not natively support this specific prompt/output structure.

I would appreciate some clarification on the following:

  1. Reproduction of Results: To reproduce the benchmark results mentioned in the Model Card locally, is it necessary to implement a custom model handler within the BFCL framework to adapt to FunctionGemma’s special tokens?
  2. Evaluation Configuration: Were the reported results achieved using a specific prompt template or a modified version of the BFCL codebase?
  3. Reference Implementation: Is there an official or recommended version/fork of the BFCL repository (or a specific model_handler script) that includes the pre-configured logic for FunctionGemma?

Thank you for providing these insights and for the great work on FunctionGemma!

Hi @Filip_Fan

  1. Yes. To reproduce the BFCL benchmark results reported in the FunctionGemma Model Card locally, you need a small custom adapter / model handler in the BFCL framework that understands FunctionGemma’s control-token-based function-calling format.

  2. Yes. The reported results in the Model Card (e.g., ~61.6% on BFCL Simple) were achieved using a specific prompt template that adheres strictly to the FunctionGemma formatting standards described in the docs.

  3. No public “official” fork of the BFCL repository currently includes a pre-configured model_handler for FunctionGemma.

Thanks

Thanks for your clarification :smiley: