
Conversation

@digitsisyph

Add JSC eval setup

@gemini-code-assist

Summary of Changes

Hello @digitsisyph, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes the foundational setup for conducting SWE-bench evaluations using the Harbor framework on JSC supercomputers. It provides all necessary documentation, configuration files, and an automated SLURM script to streamline the process from environment preparation to running benchmarks and optionally uploading results, ensuring a robust and reproducible evaluation pipeline.

Highlights

  • JSC Evaluation Setup Documentation: A comprehensive README.md has been added, detailing the complete process for setting up the environment and running SWE-bench evaluations on JSC supercomputers. This includes prerequisites, one-time setup steps, execution commands, monitoring, troubleshooting, and file structure.
  • Harbor Evaluation Configuration: A new jsc_eval_config.yaml file defines the configuration for Harbor evaluations, specifying parameters such as job directory, number of attempts, timeout settings, orchestrator type (local with concurrent trials and retry logic), and environment type (Daytona).
  • Automated SLURM Evaluation Script: A jsc_eval_harbor.sbatch SLURM script has been introduced to automate the entire evaluation workflow. This script handles environment activation, cache configuration for various tools (vLLM, Triton, UV, Hugging Face), secret loading, vLLM server management, execution of Harbor evaluation commands, metadata saving, and optional result uploading to HuggingFace. A minimal submission sketch follows this list.
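
For orientation, here is a minimal sketch of how these pieces are likely used together on a login node. The variables and file name are taken from the PR description; the exact arguments and any extra flags live in the added README.

```bash
# Hypothetical end-to-end usage, assuming the one-time setup from the README is done
source ${MAMBA_INSTALL}/bin/activate ${EVAL_ENV}   # activate the eval Conda environment
sbatch jsc_eval_harbor.sbatch                      # submit the Harbor evaluation job
squeue -u "$USER"                                  # monitor the SLURM job
```
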
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@digitsisyph (Author)

Hi @neginraoof, this is the setup and code for the JSC eval.

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a setup for running evaluations on JSC supercomputers. The changes include a detailed README guide, a SLURM batch script, and a Harbor configuration file. My review focuses on ensuring the setup instructions are correct and the scripts are robust. I've found a critical inconsistency between the Python version specified in the setup guide (3.11) and the one required by the evaluation script (3.13), which will cause issues for users. I've provided suggestions to align the documentation with the script's requirements. Additionally, I've identified a critical bug in the SLURM script's logic for counting errors, which would prevent the auto-upload skipping feature from working correctly. I've also included suggestions to improve the script's robustness and clarity by refining how it locates dependencies and manages environment modules. Overall, the contribution is valuable, and addressing these points will significantly improve its reliability and user experience.

Comment on lines +44 to +49
### 4. Create Python 3.11 Environment
```bash
eval "$(${MAMBA_INSTALL}/bin/conda shell.bash hook)"
export EVAL_ENV="/p/project1/ccstdl/${USER}/eval_py311"
${MAMBA_INSTALL}/bin/mamba create --prefix ${EVAL_ENV} python=3.11 -c conda-forge --override-channels -y
source ${MAMBA_INSTALL}/bin/activate ${EVAL_ENV}
```


critical

The setup instructions guide the user to create a Python 3.11 environment, but the jsc_eval_harbor.sbatch script is configured for and recommends Python 3.13. This inconsistency will cause errors and confusion. To fix this, the instructions should be updated to create a Python 3.13 environment.

Suggested change (create the environment with Python 3.13):

### 4. Create Python 3.13 Environment
```bash
eval "$(${MAMBA_INSTALL}/bin/conda shell.bash hook)"
export EVAL_ENV="/p/project1/ccstdl/${USER}/eval_py313"
${MAMBA_INSTALL}/bin/mamba create --prefix ${EVAL_ENV} python=3.13 -c conda-forge --override-channels -y
source ${MAMBA_INSTALL}/bin/activate ${EVAL_ENV}
```
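
A quick sanity check after the environment is created (a small sketch; it assumes only the variables defined above):

```bash
# Confirm the interpreter that the sbatch script will pick up
source ${MAMBA_INSTALL}/bin/activate ${EVAL_ENV}
python --version   # expected: Python 3.13.x
```
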

Comment on lines +71 to +73
export MAMBA_INSTALL="/p/project1/ccstdl/${USER}/mamba"
export EVAL_ENV="/p/project1/ccstdl/${USER}/eval_py311"
alias activate_eval='source ${MAMBA_INSTALL}/bin/activate ${EVAL_ENV}'


critical

To maintain consistency with the required Python 3.13 environment, the EVAL_ENV variable configured in .bashrc should point to the eval_py313 directory.

Suggested change (point EVAL_ENV at the 3.13 environment):

export MAMBA_INSTALL="/p/project1/ccstdl/${USER}/mamba"
export EVAL_ENV="/p/project1/ccstdl/${USER}/eval_py313"
alias activate_eval='source ${MAMBA_INSTALL}/bin/activate ${EVAL_ENV}'
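
Once ~/.bashrc has been updated, the alias gives one-command activation in interactive shells (a usage sketch using only the lines above):

```bash
source ~/.bashrc   # pick up the new exports and alias in the current shell
activate_eval      # activates the eval_py313 environment
```
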

Comment on lines +447 to +475
DAYTONA_COUNT=$(python3 -c "
import json
import sys

try:
    with open('$RESULT_FILE', 'r') as f:
        data = json.load(f)

    total_errors = 0
    error_ids = []

    if 'stats' in data and 'evals' in data['stats']:
        for eval_key, eval_data in data['stats']['evals'].items():
            if 'exception_stats' in eval_data and 'DaytonaError' in eval_data['exception_stats']:
                daytona_errors = eval_data['exception_stats']['DaytonaError']
                if isinstance(daytona_errors, list):
                    total_errors += len(daytona_errors)
                    error_ids.extend(daytona_errors)

    print(total_errors)

    # Also write detailed info to stderr for logging
    if total_errors > 0:
        print(f'Found {total_errors} DaytonaError(s): {error_ids[:5]}...', file=sys.stderr)

except Exception as e:
    print(f'Error parsing result.json: {e}', file=sys.stderr)
    print('0')
" 2>&1 | tail -n 1)


critical

The command used to capture DAYTONA_COUNT is buggy. By redirecting stderr to stdout (2>&1) and taking the last line (tail -n 1), you are capturing a log message from stderr instead of the actual count printed to stdout. This will cause the subsequent integer comparison to fail.

The Python script correctly prints only the count to stdout. You should capture stdout directly by removing the pipe.

            DAYTONA_COUNT=$(python3 -c "
import json
import sys

try:
    with open('$RESULT_FILE', 'r') as f:
        data = json.load(f)

    total_errors = 0
    error_ids = []

    if 'stats' in data and 'evals' in data['stats']:
        for eval_key, eval_data in data['stats']['evals'].items():
            if 'exception_stats' in eval_data and 'DaytonaError' in eval_data['exception_stats']:
                daytona_errors = eval_data['exception_stats']['DaytonaError']
                if isinstance(daytona_errors, list):
                    total_errors += len(daytona_errors)
                    error_ids.extend(daytona_errors)

    print(total_errors)

    # Also write detailed info to stderr for logging
    if total_errors > 0:
        print(f'Found {total_errors} DaytonaError(s): {error_ids[:5]}...', file=sys.stderr)

except Exception as e:
    print(f'Error parsing result.json: {e}', file=sys.stderr)
    print('0')
")

```
/p/project1/ccstdl/<username>/
├── mamba/ # Miniforge installation
├── eval_py311/ # Python 3.11 environment
```


medium

To reflect the correct Python 3.13 environment, this line in the file structure diagram should be updated.

Suggested change:

├── eval_py313/ # Python 3.13 environment

# Option 1: Use latest 2024 stack
ml Stages/2024
ml StdEnv/2024
ml Python/3.11.3


medium

The script loads the Python/3.11.3 module but then activates a Conda environment with Python 3.13. This is confusing and potentially problematic if there are conflicting libraries. Since the Conda environment provides the Python interpreter and its dependencies, loading the system Python module is likely unnecessary and could be removed to avoid confusion and potential conflicts.
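
One possible simplification, assuming nothing else in the job depends on the module-provided interpreter (module names taken from the snippet above):

```bash
# Keep the base software stack, but let the Conda environment supply Python
ml Stages/2024
ml StdEnv/2024
source ${MAMBA_INSTALL}/bin/activate ${EVAL_ENV}
python --version   # expect 3.13.x from ${EVAL_ENV}, not the 3.11.3 module
```
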

Comment on lines +268 to +272
SNAPSHOT_SCRIPT="${SLURM_SUBMIT_DIR:-$(pwd)}/snapshot_download.py"
if [ ! -f "$SNAPSHOT_SCRIPT" ]; then
# Try parent directory (tacc/)
SNAPSHOT_SCRIPT="${SLURM_SUBMIT_DIR:-$(pwd)}/../tacc/snapshot_download.py"
fi


medium

Locating snapshot_download.py relative to SLURM_SUBMIT_DIR is fragile because it depends on the directory where sbatch is executed. A more robust approach is to define the path relative to this script's own location. This makes the dependency on the tacc directory's script explicit and reliable, regardless of the submission directory.

Suggested change:

# Use the snapshot_download.py script from the sibling tacc directory
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
SNAPSHOT_SCRIPT="${SCRIPT_DIR}/../tacc/snapshot_download.py"

@penfever (Collaborator)

@digitsisyph Please resolve the outstanding comments and request a review to keep this moving!
