Integrate Terminal Bench Evaluation #1154
Open

XinyuJiangCMU wants to merge 6 commits into THUDM:main from XinyuJiangCMU:feat/tb-eval-integration
Changes from all commits (6 commits):

- 610e766 Add TerminalBench eval scaffold
- e98c713 feat(eval): add Terminal Bench eval delegate
- d99048b successfully integrate tb in slime delegate eval with train
- 75540ce write quick-start for slime + tb delegate eval
- 98b5ce4 modify code and quick-start based on review comments
- 1fd519d add README-cn.md
Files changed:
File renamed without changes.
New file (@@ -0,0 +1,29 @@):

```yaml
eval:
  defaults:
    n_samples_per_eval_prompt: 1
    temperature: 0.6
    top_p: 0.95
    top_k: -1
    max_response_len: 24576
  datasets: # these eval tasks go through slime dataset config and default rollout function (slime.rollout.sglang_rollout.generate_rollout)
    - name: gpqa # huggingface-cli download --repo-type dataset zyzshishui0627/gpqa_diamond --local-dir /root/gpqa
      path: /root/gpqa/gpqa_eval.jsonl
      rm_type: gpqa
      n_samples_per_eval_prompt: 2
    - name: ifbench # huggingface-cli download --repo-type dataset zyzshishui0627/IFBench --local-dir /root/ifbench
      path: /root/ifbench/IFBench_eval.jsonl
      rm_type: ifbench
      n_samples_per_eval_prompt: 1
  delegate:
    - name: terminal_bench
      url: http://172.17.0.1:9051 # Port must match the TB server running on the host machine
      timeout_secs: 86400 # 24 hours
      max_retries: 1 # HTTP request retries from Slime to the TB server
      model_name: qwen3-8b
      api_base: http://127.0.0.1:30005/v1 # Port must match the sglang router port set in run-eval-tb-qwen.sh
      dataset_path: /mnt/data/xinyu/program/slime-tb/terminal-bench/tasks # Dataset path on the host machine
      # task_ids:
      #   - hello-world
      # n_tasks: 10
      n_attempts: 1 # TB task-level retries (per task within tb run)
      n_concurrent: 8
```
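Before a run, it is worth checking that the two endpoints referenced under `delegate` are actually reachable. A minimal sketch, assuming the TB server is already running on the host and that the sglang router (once the training job has started it) serves the standard OpenAI-compatible `/v1/models` route:

```bash
# From inside the Slime container: is the TB delegate server reachable on the host?
# Any HTTP status code (even 404) means the port is open; 000 means the connection failed.
curl -sS -o /dev/null -w 'TB server     -> HTTP %{http_code}\n' http://172.17.0.1:9051

# Is the sglang router serving the OpenAI-compatible API?
# (/v1/models is the standard listing route; adjust if your deployment differs.)
curl -sS -o /dev/null -w 'sglang router -> HTTP %{http_code}\n' http://127.0.0.1:30005/v1/models
```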
New file (@@ -0,0 +1,159 @@):

```bash
#!/bin/bash

# Example launcher that reuses the Qwen3-8B recipe but delegates evaluation to an
# external Terminal Bench server via the eval_delegate_rollout wrapper.

# Clean up any stale processes from a previous run.
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
pkill -9 ray
pkill -9 python

set -ex

export PYTHONBUFFERED=16
export SLIME_HOST_IP=${SLIME_HOST_IP:-"127.0.0.1"}

MODEL_DIR="${MODEL_DIR:-/root/.cache}"
export MODEL_DIR

NVLINK_COUNT=$(nvidia-smi topo -m 2>/dev/null | grep -o 'NV[0-9][0-9]*' | wc -l)
if [ "$NVLINK_COUNT" -gt 0 ]; then
    HAS_NVLINK=1
else
    HAS_NVLINK=0
fi
echo "HAS_NVLINK: $HAS_NVLINK (detected $NVLINK_COUNT NVLink references)"

SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
REPO_ROOT="$(cd -- "${SCRIPT_DIR}/../../.." &>/dev/null && pwd)"
source "${REPO_ROOT}/scripts/models/qwen3-8B.sh"

# Store eval/delegate settings in a YAML config similar to examples/eval_multi_task.
EVAL_CONFIG_PATH=${TB_EVAL_CONFIG_PATH:-"${REPO_ROOT}/examples/eval/scripts/eval_tb_example.yaml"}

CKPT_ARGS=(
   --hf-checkpoint ${MODEL_DIR}/OpenThinker-Agent-v1 # huggingface-cli download open-thoughts/OpenThinker-Agent-v1
   --ref-load ${MODEL_DIR}/OpenThinker-Agent-v1_torch_dist
   # --load ${MODEL_DIR}/OpenThinker-Agent-v1_slime/
   --save ${MODEL_DIR}/OpenThinker-Agent-v1_slime/
   --save-interval 20
)

ROLLOUT_ARGS=(
   --prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
   --input-key prompt
   --label-key label
   --apply-chat-template
   --rollout-shuffle
   --rm-type deepscaler
   --num-rollout 3000
   --rollout-batch-size 32
   --n-samples-per-prompt 8
   --rollout-max-response-len 8192
   --rollout-temperature 0.8
   --global-batch-size 256
   --balance-data
)

EVAL_ARGS=(
   --eval-interval 5
   --eval-config "${EVAL_CONFIG_PATH}"
   --eval-function-path examples.eval.eval_delegate_rollout.generate_rollout
)

PERF_ARGS=(
   --tensor-model-parallel-size 1
   --pipeline-model-parallel-size 1
   --context-parallel-size 1
   --expert-model-parallel-size 1
   --expert-tensor-parallel-size 1

   --recompute-granularity full
   --recompute-method uniform
   --recompute-num-layers 1

   --use-dynamic-batch-size
   --max-tokens-per-gpu 9216
)

GRPO_ARGS=(
   --advantage-estimator grpo
   --use-kl-loss
   --kl-loss-coef 0.00
   --kl-loss-type low_var_kl
   --entropy-coef 0.00
   --eps-clip 0.2
   --eps-clip-high 0.28
)

OPTIMIZER_ARGS=(
   --optimizer adam
   --lr 1e-6
   --lr-decay-style constant
   --weight-decay 0.1
   --adam-beta1 0.9
   --adam-beta2 0.98
)

WANDB_ARGS=(
   --use-wandb
   --wandb-project slime-eval
   --wandb-group qwen3-8b-eval
   --wandb-key ${WANDB_KEY} # export WANDB_KEY="your_key"
)

SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 1
   --sglang-mem-fraction-static 0.7
   --sglang-router-port 30005
)

MISC_ARGS=(
   --attention-dropout 0.0
   --hidden-dropout 0.0
   --accumulate-allreduce-grads-in-fp32
   --attention-softmax-in-fp32
   --attention-backend flash
)

export MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
export CUDA_VISIBLE_DEVICES=0,1

ray start --head --node-ip-address ${MASTER_ADDR} --port 6380 --num-gpus 2 \
  --disable-usage-stats \
  --dashboard-host=0.0.0.0 \
  --dashboard-port=8266 \
  --dashboard-agent-listen-port 52366 \
  --dashboard-agent-grpc-port 52367 \
  --runtime-env-agent-port 52368

RUNTIME_ENV_JSON="{
  \"env_vars\": {
    \"PYTHONPATH\": \"/root/Megatron-LM/\",
    \"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\"
  }
}"

ray job submit --address="http://${MASTER_ADDR}:8266" \
   --working-dir "${REPO_ROOT}" \
   --runtime-env-json="${RUNTIME_ENV_JSON}" \
   -- python3 train.py \
   --actor-num-nodes 1 \
   --actor-num-gpus-per-node 2 \
   --colocate \
   ${MODEL_ARGS[@]} \
   ${CKPT_ARGS[@]} \
   ${ROLLOUT_ARGS[@]} \
   ${OPTIMIZER_ARGS[@]} \
   ${GRPO_ARGS[@]} \
   ${WANDB_ARGS[@]} \
   ${PERF_ARGS[@]} \
   ${EVAL_ARGS[@]} \
   ${SGLANG_ARGS[@]} \
   ${MISC_ARGS[@]}
```
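The launcher reads a few environment variables before falling back to its defaults, so a run can be customized without editing the file. A short usage sketch; the values below are placeholders, and `WANDB_KEY` is required by the `--wandb-key` flag above:

```bash
# Optional overrides consumed by run-eval-tb-qwen.sh (defaults are shown in the script above)
export MODEL_DIR=/root/.cache                              # where the OpenThinker-Agent-v1* checkpoints live
export TB_EVAL_CONFIG_PATH=/path/to/eval_tb_example.yaml   # custom eval/delegate YAML (placeholder path)
export WANDB_KEY="your_key"                                # consumed by --wandb-key
export MASTER_ADDR=127.0.0.1                               # Ray head address

# Run from the Slime repo root.
bash examples/eval/scripts/run-eval-tb-qwen.sh 2>&1 | tee run.log
```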
New file (@@ -0,0 +1,122 @@):

# Terminal Bench Evaluation Integration

This directory wraps Terminal Bench (TB) as an eval delegate for Slime. The evaluation itself runs on the host machine through the `tb` CLI; Slime reads back and aggregates the metrics, including `accuracy`, `n_resolved`, `n_unresolved`, `pass_at_k/*`, and token statistics such as `total_input_tokens_mean/median` and `total_output_tokens_mean/median`.

## Architecture

* **Inside Slime**: runs the training/evaluation main loop and calls the TB delegate client.
* **Host machine**: runs the TB delegate server (`tb_server.py`), which executes `tb run ...`.
* **Server logic**: reads the latest TB JSON results and returns the metrics to Slime (a manual equivalent is sketched below).
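The "read the latest TB JSON results" step can be reproduced by hand on the host once a run has finished. A rough sketch, assuming the server was started with `--output-root tb_eval_output` (as in step 5 below) and that `tb` writes a `results.json` per run; check the actual layout under the output root, and note that `jq` is only used here for pretty-printing:

```bash
# Locate the newest results.json under the server's output root and print the headline metrics.
latest=$(find tb_eval_output -name 'results.json' -printf '%T@ %p\n' 2>/dev/null | sort -rn | head -n 1 | cut -d' ' -f2-)
jq '{accuracy, n_resolved, n_unresolved, pass_at_k}' "$latest"
```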
## 1) Get the code (host machine)

```bash
mkdir slime-tb
cd slime-tb
git clone https://github.com/THUDM/slime.git
git clone https://github.com/laude-institute/terminal-bench
```

## 2) Start the Slime container

```bash
docker run \
  -itd \
  --gpus all \
  --shm-size 32g \
  --network host \
  --ipc=host \
  --privileged \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --ulimit nofile=65536:65536 \
  -v /mnt/data/.cache:/root/.cache \
  -v $(pwd):/shared/slime-tb \
  --name <slime_container_name> \
  slimerl/slime:latest \
  /bin/bash
```

## 3) Enter the Slime container

```bash
docker exec -it <slime_container_name> /bin/bash
```

## 4) Set up the Terminal Bench environment (host machine)

On the host machine that will run `tb_server.py`:

```bash
# Run in a host terminal (not inside Docker)
uv venv --python 3.13 .venv
source .venv/bin/activate
uv pip install terminal-bench/.
uv pip install -r slime/examples/eval/terminal_bench/requirements.txt
```

*If the repositories are not located at `./slime` and `./terminal-bench`, adjust the paths to match your layout.*
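To confirm the host environment is ready before starting the server, a quick sanity check (assuming the virtualenv created above is still active):

```bash
# The tb CLI must resolve from inside the virtualenv created in step 4.
source .venv/bin/activate
which tb
tb --help | head -n 5
```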
## 5) Start the Terminal Bench server

Start it on the host machine (i.e., in the environment where the `tb` command is available):

```bash
python slime/examples/eval/terminal_bench/tb_server.py \
    --host 0.0.0.0 --port 9051 \
    --output-root tb_eval_output
```

**What this script does:**

* Sets `OPENAI_API_KEY=EMPTY` by default.
* Executes `tb run -a terminus-2 -m openai/<model> ... --n-concurrent 8` (a rough manual equivalent is sketched below).
* Waits for the run to finish, then returns `accuracy`, `pass_at_k`, and token-usage statistics.
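For reference, a hedged manual approximation of the command the server issues. Only the `tb run -a terminus-2 -m openai/<model> ... --n-concurrent 8` portion is quoted from this PR; the `OPENAI_BASE_URL` variable and the `--dataset-path` / `--n-attempts` flag names are assumptions, so check `tb run --help` and `tb_server.py` for the exact invocation:

```bash
# Sketch only: flag names other than -a / -m / --n-concurrent are assumptions.
export OPENAI_API_KEY=EMPTY
export OPENAI_BASE_URL=http://127.0.0.1:30005/v1   # assumed way to point tb's OpenAI client at the sglang router

tb run \
  -a terminus-2 \
  -m openai/qwen3-8b \
  --dataset-path /mnt/data/xinyu/program/slime-tb/terminal-bench/tasks \
  --n-attempts 1 \
  --n-concurrent 8
```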
## 6) Run the evaluation script (example)

If you use the provided Qwen evaluation launcher (`run-eval-tb-qwen.sh`), follow these steps:

**Update the path**: change `dataset_path` in `eval_tb_example.yaml` to the **absolute path** of `terminal-bench/tasks` on the host machine (not the path inside Docker).

**Download the model**: download the HuggingFace weights inside the Slime container:
```bash
huggingface-cli download open-thoughts/OpenThinker-Agent-v1 \
  --local-dir /root/.cache/OpenThinker-Agent-v1
```

**Convert the format**: convert the HuggingFace weights to Slime's torch distributed format. From the Slime repo root:
```bash
cd /shared/slime-tb/slime
source scripts/models/qwen3-8B.sh

export PYTHONPATH=/root/Megatron-LM:/shared/slime-tb/slime

python tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --hf-checkpoint /root/.cache/OpenThinker-Agent-v1 \
    --save /root/.cache/OpenThinker-Agent-v1_torch_dist
```

**Start the evaluation**: inside the Slime container, run:
```bash
bash slime/examples/eval/scripts/run-eval-tb-qwen.sh 2>&1 | tee run.log
```

*For a quick test, restrict the run in `eval_tb_example.yaml` by listing specific tasks under `task_ids` or by limiting the number of tasks with `n_tasks`.*
## 7) FAQ

When Slime runs in a Docker container with `--network host`, Ray may hit port conflicts because it shares the network namespace with the host.

This can cause Ray startup failures or Redis/session-related errors. It can usually be fixed by explicitly choosing unused ports when starting the Ray head, for example a non-default `--port` and `--dashboard-port`.

Sometimes `ray job submit` also fails with a message that no available agent can accept the job. This usually means the dashboard agent or runtime-env agent ports are in conflict as well. In that case, set those ports explicitly when starting Ray (e.g. `--dashboard-agent-listen-port`, `--dashboard-agent-grpc-port`, `--runtime-env-agent-port`), as shown below.
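For example, the launcher script above starts the Ray head with all of these ports pinned to non-default values; the specific numbers are arbitrary, any free ports work:

```bash
# Restart the Ray head with explicit, non-default ports (taken from run-eval-tb-qwen.sh).
ray stop --force
ray start --head --node-ip-address ${MASTER_ADDR:-127.0.0.1} --port 6380 --num-gpus 2 \
  --disable-usage-stats \
  --dashboard-host=0.0.0.0 \
  --dashboard-port=8266 \
  --dashboard-agent-listen-port 52366 \
  --dashboard-agent-grpc-port 52367 \
  --runtime-env-agent-port 52368
```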
If the TB server cannot reach Slime through the sglang router (`InternalServerError`), check which address the router port (e.g. 30005) is actually listening on and update `api_base` in `eval_tb_example.yaml` accordingly:

```bash
ss -lntp | grep 30005
```

Once the TB server starts accepting requests, you may see messages such as `Parser warnings`, `Context length exceeded`, `Command 1 should end with newline`, or `Harness execution failed` in its output. These are Terminal Bench warnings and can be ignored as long as the run proceeds normally.
Review discussion:

- Reviewer: Add comment here. About port conflict
- Author: Added this in the quick-start README.