Since the end of 2022, Large Language Models (LLMs) have been in the spotlight for their powerful ability to perform generic tasks. It has become a trend to apply these models not only to chat applications but to the development of a wide range of AI applications and software (e.g., LangChain, LlamaIndex, AutoGPT, Voyager). Models such as GPT, Gemini, Llama, and Mistral show their great potential by interacting with the external world through, for example, function calls and executions. The
Berkeley Function-Calling Leaderboard (BFCL)[1] is an online platform for comprehensively evaluating the ability of Large Language Models (LLMs) to accurately call functions and tools across a variety of programming languages and application scenarios. Using question-function-answer pairs, it evaluates model performance in scenarios such as simple function calls, parallel function calls, and multiple function calls. BFCL also tracks the cost and latency of models; the BFCL V2 dataset, released on August 19, 2024, addresses bias and data contamination and focuses on dynamic real-world scenarios. Features of this leaderboard include:
- Comprehensive LLM assessment: Evaluates the function-calling capabilities of large language models across a wide range of programming languages and application scenarios.
- Real-world data: Assessments use actual datasets to ensure accuracy and relevance.
- Regular updates: The rankings are updated to reflect the latest advances in AI technology.
- Detailed error analysis: Provides an in-depth analysis of the strengths and weaknesses of different models.
- Model comparison: Makes it easy to compare models side by side in order to make informed decisions.
- Cost and latency estimates: Provides estimates of each model's economic efficiency and performance.
The leaderboard includes 2,000 question-function-answer pairs across multiple programming languages (e.g., Python, Java, JavaScript, REST APIs) and complex usage scenarios. These include multiple function calls, where the model must select one or more functions from several provided, and parallel function calls, where several calls must be made simultaneously.
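To make the format concrete, below is a hypothetical question-function-answer pair in the BFCL style. The function name, schema, and expected answer are illustrative assumptions, not an actual entry from the dataset:

```python
# A hypothetical BFCL-style question-function-answer pair (illustrative only,
# not an actual dataset entry).

# The natural-language question posed to the model.
question = "What is the weather like in Berkeley, CA today, in fahrenheit?"

# The function(s) the model may call, described with a JSON-Schema-like spec.
functions = [
    {
        "name": "get_current_weather",
        "description": "Get the current weather for a location.",
        "parameters": {
            "type": "dict",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and state, e.g. 'Berkeley, CA'.",
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                },
            },
            "required": ["location"],
        },
    }
]

# The ground-truth call, against which a checker compares the function name
# and arguments that the model actually produces.
expected_answer = {
    "get_current_weather": {
        "location": ["Berkeley, CA"],
        "unit": ["fahrenheit"],
    }
}
```

In a parallel-call scenario, the expected answer would contain several such calls, all of which the model must produce in a single turn.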
The Berkeley Function-Calling Leaderboard is an invaluable resource for researchers, developers, and educational institutions, helping them compare models, select the ones that best fit their needs, and evaluate their economic efficiency and performance. By providing comprehensive evaluations, real-world insights, and practical comparisons through a transparent, data-driven methodology, it empowers the AI community to make informed decisions that improve the efficiency and effectiveness of their AI applications.
Released on September 19, 2024, the Berkeley Function-Calling Leaderboard V3 (BFCL V3)[2] extends the platform to evaluate the capabilities of Large Language Models (LLMs) in multi-turn and multi-step function calling (tool usage). It allows models to interact back and forth with users, navigating complex tasks by asking clarifying questions. In addition, for the first time, BFCL V3 validates API state as its real-world ground truth, rather than only matching parameters through an Abstract Syntax Tree (AST) against a list of possible answers. Key features of BFCL V3 include:
- Multi-turn function calls: The model can interact back and forth with the user, processing new inputs and thus handling more dynamic and realistic interactions.
- Multi-step function calls: The LLM can decompose a response into multiple steps, modeling the way real-world AI assistants may need to plan an execution path, request and extract critical information, and process sequential function calls to complete a task.
- API state validation: BFCL V3 validates API state as ground truth for the first time.
- Dataset composition: Includes categories such as base multi-turn, augmented multi-turn, missing parameters, missing functions, long-context multi-turn, and composite cases designed to pose extremely difficult but important challenges.
- Multi-turn model inference and execution: Evaluates how models handle multiple rounds of function calls, including how native function-calling models and prompted models differ in their reasoning process.
- State-based evaluation: Assesses model performance by comparing the final state of each instance after every dialogue turn (see the sketch after this list).
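As a rough illustration of state-based evaluation, here is a minimal sketch, not the actual BFCL V3 harness: the model's sequence of calls is executed against a stateful mock API, and the resulting state is compared with the state produced by the ground-truth sequence, rather than matching call arguments one by one:

```python
# Minimal sketch of state-based evaluation (an assumption-laden toy,
# not the actual BFCL V3 harness). Instead of matching call arguments,
# each sequence of calls is executed against a stateful mock API and
# the final states are compared.

class MockFileSystem:
    """A toy stateful API: a flat in-memory file store."""

    def __init__(self):
        self.files = {}

    def write_file(self, path, content):
        self.files[path] = content

    def delete_file(self, path):
        self.files.pop(path, None)


def run_calls(calls):
    """Execute a list of (method_name, kwargs) calls; return the final state."""
    api = MockFileSystem()
    for name, kwargs in calls:
        getattr(api, name)(**kwargs)
    return api.files


# A ground-truth sequence and a model's differently ordered sequence.
ground_truth = [
    ("write_file", {"path": "a.txt", "content": "hello"}),
    ("write_file", {"path": "b.txt", "content": "world"}),
]
model_calls = [
    ("write_file", {"path": "b.txt", "content": "world"}),
    ("write_file", {"path": "a.txt", "content": "hello"}),
]

# The call sequences differ, but the final states match, so a
# state-based checker accepts the model's trajectory.
assert run_calls(model_calls) == run_calls(ground_truth)
print("State-based check passed.")
```

The design point is that many distinct call trajectories can be equally correct; checking the end state of the API tolerates this variation, whereas strict argument matching would not.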
These innovations and improvements make BFCL V3 an important tool for evaluating LLMs' ability to call functions in complex scenarios; more detailed information and the datasets can be found on the project's official UC Berkeley website.
RELATED:
- [1] Berkeley Function-Calling Leaderboard (BFCL). ./
- [2] Berkeley Function-Calling Leaderboard V3 (BFCL V3). ./blogs/13_bfcl_v3_multi_turn.html