hubstrauss/mle-bench-hpc
Running MLE-Bench on a Slurm-based cluster with Apptainer

The goal of this fork is to run MLE-Bench agents on a Slurm-based cluster that uses Apptainer instead of Docker, working around the loss of Docker's root/user separation.

What To Look Out For

The agent must not have access to the private test answers. This is why the root/user separation exists in the original MLE-Bench project, where the grading server and the agent run inside the same container. Since Apptainer offers no such separation, we:

  • Run the grading server in an Apptainer container with the private data mounted
  • Run the agent in a different Apptainer container without the private data mounted
  • Have the agent validate submissions over HTTP (http://<grading-server>:5000/validate)
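The agent-side validation call can be sketched as a plain HTTP POST. The /validate endpoint URL comes from this setup, but the request and response payload shapes below are assumptions for illustration (check environment/run_grading_server.py for the actual contract); a local mock server stands in for the real grading container so the sketch is self-contained:

```python
"""Sketch of agent-to-grading-server validation over HTTP.
The JSON payload shape ({"submission": ...} / {"valid": ...}) is an
assumption; see environment/run_grading_server.py for the real contract."""
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class MockGradingHandler(BaseHTTPRequestHandler):
    # Minimal stand-in for the grading server's /validate route.
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        payload = json.loads(body)
        ok = "submission" in payload  # pretend-validate the CSV text
        resp = json.dumps({"valid": ok}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(resp)

    def log_message(self, *args):  # silence per-request logging
        pass


def validate_submission(server_url, csv_text):
    """POST a submission to <server_url>/validate and return the JSON reply."""
    req = urllib.request.Request(
        f"{server_url}/validate",
        data=json.dumps({"submission": csv_text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 0), MockGradingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}"
    print(validate_submission(url, "PassengerId,Transported\n0001_01,True\n"))
    server.shutdown()
```

In the real setup, the agent container would call `validate_submission("http://<grading-server>:5000", ...)` against the server started in Step 2.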

Prerequisites

We assume familiarity with MLE-Bench; for setup instructions, see the MLE-Bench README.

Step 1: Build Apptainer Image

Note: If you are on an arm64 machine, you need to add --platform=linux/amd64 when building locally.

On a machine with Docker access:

# Build Docker images
docker build -t mlebench-env -f environment/Dockerfile .
docker build -t aide agents/aide/ \
    --build-arg SUBMISSION_DIR=/home/submission \
    --build-arg LOGS_DIR=/home/logs \
    --build-arg CODE_DIR=/home/code \
    --build-arg AGENT_DIR=/home/agent

Save each Docker image as a .tar file, transfer it to the HPC system, and convert it to an Apptainer image:

docker save -o mlebench-env.tar mlebench-env
docker save -o aide.tar aide

apptainer build mlebench-env.sif docker-archive://mlebench-env.tar
apptainer build aide.sif docker-archive://aide.tar

For Princeton University users: scp aide.tar netid@della.princeton.edu:/home/netid/path/to/save/

Step 2: Start Grading Server (Manual Method)

Note: If using the heterogeneous job script (scripts_hpc/slurm_hetjob.sh), skip to Step 4. The script handles Steps 2-3 automatically.

Option A: SLURM Job

# Edit paths in script first, then:
sbatch scripts_hpc/slurm_grading_server.sh spaceship-titanic

# Check output for the grading server URL
cat slurm_output/mlebench/grading-<jobid>.out
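Extracting the URL from the job output can be scripted. A minimal sketch, assuming the server prints its URL (e.g. http://node123:5000) somewhere in the log; the exact log wording may differ:

```python
import re


def find_grading_url(log_text):
    """Return the first http(s)://host:port URL found in a grading-server
    job log, or None. Assumes the server prints its URL on startup."""
    m = re.search(r"https?://[\w.\-]+:\d+", log_text)
    return m.group(0) if m else None


# e.g. find_grading_url(open("slurm_output/mlebench/grading-12345.out").read())
```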

Option B: Interactive

On a node that has access to the private test data:

COMPETITION="spaceship-titanic"
DATA_DIR="/path/to/mlebench/data"
SIF_IMAGE="/path/to/mlebench-env.sif"

apptainer exec \
    --contain \
    --cleanenv \
    --no-home \
    --bind ${DATA_DIR}:/data:ro \
    ${SIF_IMAGE} \
    /opt/conda/bin/conda run -n mleb python /mlebench/environment/run_grading_server.py \
        --competition-id ${COMPETITION} \
        --data-dir /data \
        --host 0.0.0.0 \
        --port 5000

Step 3: Run Agent (Manual Method)

Option A: SLURM Job

# With explicit grading server URL:
sbatch scripts_hpc/slurm_agent.sh spaceship-titanic http://node123:5000

# Or auto-discover from grading job ID:
sbatch scripts_hpc/slurm_agent.sh spaceship-titanic auto:<grading-job-id>

Add the --nv flag to the apptainer command for NVIDIA GPU support.
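The auto:<grading-job-id> form can be understood as asking Slurm which node the grading job landed on and building the URL from it. A sketch of what slurm_agent.sh likely does (the script's actual logic may differ; `url_from_job` is a hypothetical helper, though `squeue -j <id> -h -o %N` is standard Slurm):

```python
import subprocess

GRADING_PORT = 5000  # port used throughout this guide


def url_from_node(node, port=GRADING_PORT):
    """Format a grading-server URL for a given compute node name."""
    return f"http://{node}:{port}"


def url_from_job(job_id):
    """Resolve a grading job ID to its server URL by asking Slurm
    which node the job is running on (squeue must be on PATH)."""
    node = subprocess.run(
        ["squeue", "-j", str(job_id), "-h", "-o", "%N"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return url_from_node(node)
```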

Step 4: Grade Submission

After the agent finishes:

mlebench grade \
    --submission ${OUTPUT_DIR}/submission/submission.csv \
    --competition ${COMPETITION}
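Before grading, a quick sanity check of the submission file can catch malformed agent output early. A minimal sketch; the required columns shown in the usage comment are what the spaceship-titanic competition expects, but verify against that competition's sample_submission.csv:

```python
import csv


def check_submission(path, required_cols):
    """Verify a submission CSV has the expected header columns and
    return the number of non-empty data rows."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = [c for c in required_cols if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        return sum(1 for row in reader if any(row.values()))


# e.g. check_submission("submission/submission.csv", ["PassengerId", "Transported"])
```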

SLURM Heterogeneous Job

Use a heterogeneous job to schedule the grading server on a CPU node and the agent on GPU nodes within a single job:

sbatch scripts_hpc/slurm_hetjob.sh spaceship-titanic

Make sure to edit scripts_hpc/slurm_hetjob.sh to set your paths:

  • MLEBENCH_DIR: path to mle-bench repo
  • DATA_DIR: path to data
  • SIF_IMAGE: path to Apptainer image
  • OUTPUT_BASE: base output directory

Todo

  • Update and test heterogeneous scripts on della

About

MLE-bench-HPC is MLE-bench adapted to work on HPC
