
Evaluating a Local Qwen Large Model with OpenCompass and the LawBench Dataset

2024-11-19 17:44:08

I. Mind map overview

 

II. Introduction to OpenCompass

OpenCompass is an open-source, efficient evaluation system for large models. Its ecosystem also includes the CompassKit evaluation toolkit, the CompassHub benchmark community, and the CompassRank leaderboard.

Official website address: /home

III. OpenCompass Installation

3.1 Creating a virtual environment

conda create --name opencompass python=3.10 -y
conda activate opencompass

3.2 Installing OpenCompass via pip

# Supports most data sets and models
pip install -U opencompass

# Full installation (more datasets supported)
# pip install "opencompass[full]"

# Model inference backends. These backends often have conflicting dependencies, so it is recommended to manage them in separate virtual environments.
# pip install "opencompass[lmdeploy]"
# pip install "opencompass[vllm]"

# API testing (e.g. OpenAI, Qwen)
# pip install "opencompass[api]"

3.3 Installing OpenCompass Based on Source Code

git clone https:///open-compass/opencompass opencompass
cd opencompass
pip install -e .
# pip install -e ".[full]"
# pip install -e ".[vllm]"
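After either installation route, it can help to confirm that the package is visible in the active environment. The following is a minimal sketch using only the Python standard library; the helper name installed_version is made up for illustration and is not part of OpenCompass.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("opencompass")
    if version is None:
        print("opencompass is not installed in this environment")
    else:
        print(f"opencompass {version} is installed")
```

Run it inside the activated opencompass conda environment so that it inspects the right interpreter.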

3.4 Downloading the built-in datasets (optional)

Since we use a dataset we download ourselves, the built-in datasets are not strictly necessary. Downloading them is still recommended to keep the original program robust, because I have not verified that everything works without them.

# Download the dataset into the data/ directory
wget https:///open-compass/opencompass/releases/download/0.2.2.rc1/

# Unzipping produces a folder named data (the LawBench data is placed there later, so remember this data folder)
unzip OpenCompassData-core-20240207.zip
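If you prefer to script the unzip step, the sketch below does the same thing with the standard library and reports which top-level folders the archive created, so you can confirm that a data folder appeared. The function name extract_dataset is an illustrative assumption, not an OpenCompass API.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_dataset(archive: Path, target: Path) -> List[str]:
    """Extract `archive` into `target` and return the top-level names it contains."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
        # Collect the first path component of every archive member.
        top_level = sorted({name.split("/")[0] for name in zf.namelist() if name})
    return top_level


if __name__ == "__main__":
    created = extract_dataset(Path("OpenCompassData-core-20240207.zip"), Path("."))
    print("top-level entries:", created)
```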

3.5 Automatic download of models and data using ModelScope (optional)

Because we also use a local model, the program does not need to download anything itself. If you run online evaluations, you can configure the following:

pip install modelscope
export DATASET_SOURCE=ModelScope
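When the evaluation is started from a Python wrapper rather than an interactive shell, the same variable can be set on a copy of the environment and passed to the child process. This is only a sketch; the helper name modelscope_env is hypothetical.

```python
import os
from typing import Dict, Mapping, Optional


def modelscope_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with DATASET_SOURCE set to ModelScope."""
    env = dict(os.environ if base is None else base)
    env["DATASET_SOURCE"] = "ModelScope"
    return env


if __name__ == "__main__":
    # The returned dict can be passed as env= to subprocess.run(...).
    print(modelscope_env()["DATASET_SOURCE"])
```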

3.6 Online evaluation (optional)

At this point, if you have unrestricted network access (for example, to Hugging Face), you can go straight to the online evaluation.

IV. OpenCompass online evaluation (optional)

Online evaluation downloads many models directly from Hugging Face before evaluating them, which requires unrestricted network access, so I do not demonstrate it here. Instead, I reproduce the official online evaluation workflow below; if you do not need online evaluation, skip this section.

4.1 First Evaluation

OpenCompass supports setting up configurations via the command line interface (CLI) or Python scripts. For simple evaluation setups, we recommend using the CLI; for more complex evaluations, the scripting approach is recommended. You can find more scripting examples in the configs folder.

# Command Line Interface (CLI)
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen

# Python Script
opencompass ./configs/eval_chat_demo.py


4.2 API Evaluation

By design, OpenCompass does not distinguish between open-source models and API models. You can evaluate both types of models in the same way, or even in the same setup.

export OPENAI_API_KEY="YOUR_OPEN_API_KEY"
# Command Line Interface (CLI)
opencompass --models gpt_4o_2024_05_13 --datasets demo_gsm8k_chat_gen

# Python Script
opencompass ./configs/eval_api_demo.py

# The o1_mini_2024_09_12/o1_preview_2024_09_12 models are now supported, with max_completion_tokens=8192 by default.

4.3 Inference backends

If you want to use an inference backend other than HuggingFace for accelerated evaluation, such as LMDeploy or vLLM, you can do so with the following command.

opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen -a lmdeploy
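The CLI invocations in this section all follow one pattern: a model name, a dataset name, and optionally an accelerator. The sketch below assembles that argument list in Python, which is convenient if you want to launch several evaluations from a script. It uses only the flags shown above (--models, --datasets, -a); the function name build_cli_command is an assumption made for illustration.

```python
import shlex
from typing import List, Optional


def build_cli_command(model: str, dataset: str,
                      accelerator: Optional[str] = None) -> List[str]:
    """Build the opencompass CLI argument list for one model/dataset pair."""
    cmd = ["opencompass", "--models", model, "--datasets", dataset]
    if accelerator:
        # -a selects an inference backend such as lmdeploy or vllm.
        cmd += ["-a", accelerator]
    return cmd


if __name__ == "__main__":
    cmd = build_cli_command("hf_internlm2_5_1_8b_chat", "demo_gsm8k_chat_gen", "lmdeploy")
    # Print the command instead of running it; pass cmd to subprocess.run to execute.
    print(shlex.join(cmd))
```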

V. Loading local test datasets

5.1 Download the LawBench dataset we want to use locally via git

git clone /ljn20001229/

Note: location 1 is the dataset zip downloaded through git; unzip it into the data folder at the same level, marked as location 2.

Note: location 3, LawBench, contains the unzipped files.

At this point our custom local dataset is downloaded and in place.
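Before moving on, a quick check that the expected folders exist can save a failed run later. The sketch below is generic: the folder name data/LawBench is an assumption based on the notes above, so adjust the list to match your actual layout.

```python
from pathlib import Path
from typing import Iterable, List


def missing_paths(root: str, required: Iterable[str]) -> List[str]:
    """Return the entries of `required` that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in required if not (base / rel).exists()]


if __name__ == "__main__":
    # "data/LawBench" is an assumed location; change it to your unzip target.
    missing = missing_paths(".", ["data", "data/LawBench"])
    print("missing:", missing if missing else "nothing, layout looks complete")
```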

VI. Configuring the local Qwen model

6.1 Download the model directly from ModelScope.

6.2 Add the downloaded model to the project directory.

VII. Writing local assessment scripts

7.1 Create eval_local_qwen_1_8b_chat.py in the configs folder of the project root to serve as the evaluation launch script for our Qwen1.8B model, with the following code:

# eval_local_qwen_1_8b_chat.py

# Module name missing in the original listing; OpenCompass configs normally use mmengine.config here.
from  import read_base

with read_base():
    # Import the datasets
    from ..lawbench_zero_shot_gen_002588 import lawbench_datasets as zero
    from ..lawbench_one_shot_gen_002588 import lawbench_datasets as one
    # Import the model
    from .local_qwen_1_8b_chat import models

# Evaluate on both the zero-shot and the one-shot dataset lists
datasets = [*zero, *one]
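The last line of the script merges two lists of dataset configs by unpacking them into a new list. The stand-alone sketch below shows the same idiom with placeholder configs and adds a check for duplicated abbr values, which is useful when combining zero-shot and one-shot variants of the same tasks. The abbr strings and the helper duplicate_abbrs are invented for illustration and are not the real LawBench entries.

```python
from collections import Counter
from typing import Dict, List

# Placeholder configs standing in for the imported zero-shot and one-shot lists.
zero = [{"abbr": "task_a-0-shot"}, {"abbr": "task_b-0-shot"}]
one = [{"abbr": "task_a-1-shot"}, {"abbr": "task_b-1-shot"}]

# Unpacking both lists builds one flat list without modifying either input.
datasets = [*zero, *one]


def duplicate_abbrs(configs: List[Dict[str, str]]) -> List[str]:
    """Return the abbr values that occur more than once, sorted."""
    counts = Counter(cfg["abbr"] for cfg in configs)
    return sorted(abbr for abbr, n in counts.items() if n > 1)


if __name__ == "__main__":
    print(len(datasets), "dataset configs; duplicates:", duplicate_abbrs(datasets))
```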

7.2 Modify the lawbench_zero_shot_gen_002588 file referenced by from ..lawbench_zero_shot_gen_002588 import lawbench_datasets as zero:

7.3 Similarly, modify the lawbench_one_shot_gen_002588 file referenced by from ..lawbench_one_shot_gen_002588 import lawbench_datasets as one.

7.4 Create the qwen.local_qwen_1_8b_chat file referenced by from .local_qwen_1_8b_chat import models.
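The contents of this model file are not shown here, so the following is only a rough sketch of what a local Hugging Face chat model config commonly looks like in OpenCompass, not the file actually used. The local path, the abbr, and the numeric values are assumptions; the try/except fallback exists only so the snippet also runs where OpenCompass is not installed.

```python
# Sketch of a local model config; every value below is an assumption.
try:
    from opencompass.models import HuggingFacewithChatTemplate as model_type
except ImportError:
    # Fallback so the sketch can be inspected without OpenCompass installed.
    model_type = "HuggingFacewithChatTemplate"

models = [
    dict(
        type=model_type,
        abbr="local-qwen-1.8b-chat",      # label shown in the result tables (made up)
        path="./models/Qwen-1_8B-Chat",   # hypothetical local model directory
        max_out_len=1024,                 # maximum number of generated tokens
        batch_size=8,                     # samples per inference batch
        run_cfg=dict(num_gpus=1),         # GPUs requested for this model
    )
]
```

Point path at the directory where you placed the model in step 6.2, and lower batch_size if GPU memory is tight.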

VIII. Launching the local evaluation

Run the local evaluation directly with Python, using the configs/eval_local_qwen_1_8b_chat.py file we created and the following arguments:

 python  configs/eval_local_qwen_1_8b_chat.py  --debug

IX. Explanation of evaluation parameters

  • --debug: debug mode; log messages are printed to the console
  • --dry-run: dry-run mode; the datasets and tasks are loaded, but the evaluation itself is not executed
  • --accelerator vllm: use vLLM to accelerate inference for locally deployed large models
  • --reuse: reuse the results of a previous run
  • --work-dir: the path where results are stored; the default is outputs/default
  • --max-num-worker: the number of parallel workers, used for data parallelism
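Since results land under the work directory, a small helper can locate the most recent run. The sketch below assumes each run is written to its own timestamp-named subfolder (for example 20241119_174408) inside outputs/default, so the lexicographically largest name is the newest. Treat that layout as an assumption and check your own outputs folder. The function name latest_run is hypothetical.

```python
from pathlib import Path
from typing import Optional


def latest_run(work_dir: str = "outputs/default") -> Optional[Path]:
    """Return the newest run folder under `work_dir`, or None if there is none."""
    root = Path(work_dir)
    if not root.is_dir():
        return None
    # Timestamp-style names sort chronologically when sorted as strings.
    runs = sorted(p for p in root.iterdir() if p.is_dir())
    return runs[-1] if runs else None


if __name__ == "__main__":
    run = latest_run()
    print("latest run:", run if run else "no runs found")
```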

X. Evaluation results

This completes the record of using OpenCompass to evaluate the local Qwen1.8B_chat model on the local LawBench dataset. Thank you for reading all the way to the end!