Make use of self hosted runners #848

Merged
vgvassilev merged 1 commit into compiler-research:main from mcbarton:cuda-runners
Mar 21, 2026

@mcbarton
Collaborator

closes #847

@vgvassilev
Contributor

Note that the self-hosted bots are not ephemeral, and anything installed stays until the next regeneration.

@mcbarton
Collaborator Author

mcbarton commented Mar 10, 2026

> Note that the self-hosted bots are not ephemeral, and anything installed stays until the next regeneration.

Is it possible for someone to SSH into the runner and manually install CUDA 12 and 13 using the commands in this PR? I can then change the PR to use the correct CUDA version depending on the cuda matrix option.
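
One way the matrix-driven selection could look, assuming both toolkits are already installed on the runner under NVIDIA's conventional `/usr/local/cuda-<version>` prefix. The function names `cuda_home_for` and `use_cuda`, and the idea of passing the matrix value as an argument, are illustrative and not taken from this PR.

```shell
# Map a CUDA version string (e.g. the workflow's matrix value) to an install prefix.
# Assumption: toolkits live under /usr/local/cuda-<major.minor>.
cuda_home_for() {
  printf '/usr/local/cuda-%s\n' "$1"
}

# Point the current shell at one toolkit; fails if that toolkit is not installed.
use_cuda() {
  cuda_home="$(cuda_home_for "$1")"
  if [ ! -d "$cuda_home" ]; then
    echo "CUDA $1 not found at $cuda_home" >&2
    return 1
  fi
  export CUDA_HOME="$cuda_home"
  export PATH="$CUDA_HOME/bin:$PATH"
  export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}
```

A workflow step would then call something like `use_cuda 12.6` or `use_cuda 13.2` before configuring the build, so the stateful runner keeps both toolkits installed and each job only picks one.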

@mcbarton
Collaborator Author

> Note that the self-hosted bots are not ephemeral, and anything installed stays until the next regeneration.

Alternatively (this will probably take some thought to implement correctly), what about using Docker on the self-hosted runner to provide a clean environment each time someone runs a CI job? We could then run the CI inside the container and delete the image at the end of the job to stop us from filling up the runner's disk space. That way no one's PR risks influencing anyone else's that runs on the self-hosted runner.
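
The container idea could be sketched roughly as follows. This is an illustration only, not something the PR implements: the helper name `run_in_clean_container`, the `/work` mount point and the example image tag are all hypothetical, and it assumes Docker is installed on the runner.

```shell
# Run a command in a throwaway container, then remove the image to reclaim disk space.
# Usage: run_in_clean_container <image> <command> [args...]
run_in_clean_container() {
  image="$1"
  shift
  status=0
  # --rm deletes the container (not the image) when the command exits.
  docker run --rm -v "$PWD:/work" -w /work "$image" "$@" || status=$?
  # Remove the image as well; ignore errors so cleanup never masks the job result.
  docker image rm "$image" >/dev/null 2>&1 || true
  return "$status"
}
```

For example, `run_in_clean_container ubuntu:24.04 make check` would run the checks in a fresh container. Because the image is removed afterwards, it has to be pulled or rebuilt on every job.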

@vgvassilev
Contributor

> Note that the self-hosted bots are not ephemeral, and anything installed stays until the next regeneration.

> Alternatively (this will probably take some thought to implement correctly), what about using Docker on the self-hosted runner to provide a clean environment each time someone runs a CI job? We could then run the CI inside the container and delete the image at the end of the job to stop us from filling up the runner's disk space. That way no one's PR risks influencing anyone else's that runs on the self-hosted runner.

It is stateful on purpose because it takes much longer to install cuda, etc.

@vgvassilev
Contributor

Note that these systems are cuda ready -- take a look at how they are used in clad.

@codecov

codecov bot commented Mar 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.56%. Comparing base (852f6d4) to head (57b5d13).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #848   +/-   ##
=======================================
  Coverage   79.56%   79.56%           
=======================================
  Files          11       11           
  Lines        4013     4013           
=======================================
  Hits         3193     3193           
  Misses        820      820           

@vgvassilev
Contributor

Any news on that?

@mcbarton
Collaborator Author

mcbarton commented Mar 15, 2026

> Any news on that?

I will finish this PR off either later today or tomorrow morning.

@mcbarton force-pushed the cuda-runners branch 3 times, most recently from 69d8790 to 3a1b4a5 (March 15, 2026 18:27)
@mcbarton changed the title from "Make use of self hosted runners to have cuda 12.6 and cuda 13.2 jobs" to "Make use of self hosted runners" (Mar 15, 2026)
Comment on lines +18 to +59
  prepare-dell:
    name: Activate self-host infrastructure
    runs-on: self-hosted
    steps:
      - name: Send Magic Packet
        env:
          TARGET_IP: 192.168.100.30
          MAC_ADDR: a4:bb:6d:51:d5:d2
        # The container has no ping, emulate it.
        run: |
          # Mask the IP and potential broadcast to keep logs clean
          echo "::add-mask::$MAC_ADDR"
          echo "::add-mask::$BROADCAST"
          echo "::add-mask::$TARGET_IP"
          BROADCAST=$(echo $TARGET_IP | sed 's/\.[0-9]*$/ .255/' | tr -d ' ')
          PING="timeout 1 bash -c 'cat < /dev/null > /dev/tcp/$TARGET_IP/22' 2>/dev/null"

          # Install tool silently
          sudo apt-get update -qq && sudo apt-get install -y -qq wakeonlan > /dev/null

          # Check if already awake (using the Bash TCP PING variable)
          if eval "$PING"; then
            echo "Target machine is already awake. Exiting."
            exit 0
          fi

          # If offline, send WoL
          echo "Machine is offline. Sending WoL..."
          wakeonlan -i $BROADCAST $MAC_ADDR > /dev/null

          # Wait & Verify Loop (checks every 10s for 4 minutes)
          echo "Waiting for response (checking Port 22)..."
          for i in {1..24}; do
            if eval "$PING"; then
              echo "Machine is online and SSH is ready."
              exit 0
            fi
            sleep 10
          done

          echo "Error: Target hardware did not respond within the timeout period."
          exit 1

Collaborator Author

@mcbarton mcbarton Mar 15, 2026

Not sure what this is actually doing, or how it activates the self-hosted infrastructure. I took it from Clad's workflows.
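
For readers with the same question: judging only from the script itself, the step checks whether the target machine accepts TCP connections on port 22 and, if not, broadcasts a Wake-on-LAN packet to its MAC address and polls until the port opens. The two building blocks can be sketched in isolation as below. The function names `broadcast_for` and `port_open` are made up for illustration, and the broadcast derivation assumes a /24 network, as the original `sed` expression implicitly does.

```shell
# Replace the last octet of an IPv4 address with 255 (only correct for a /24 subnet).
broadcast_for() {
  printf '%s\n' "$1" | sed 's/\.[0-9]*$/.255/'
}

# Succeed if host $1 accepts a TCP connection on port $2 within one second.
# Relies on bash's /dev/tcp pseudo-device, so no ping or nc is needed.
port_open() {
  timeout 1 bash -c 'cat < /dev/null > "/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

Here `broadcast_for 192.168.100.30` prints `192.168.100.255`, the address the magic packet is sent to, and `port_open "$TARGET_IP" 22` plays the role of the workflow's `PING` variable.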

Collaborator Author

The self-hosted runner will now try to run this section, but it fails. Since I don't know exactly what it does (I'm guessing from the comments that it allows you to SSH into the runner for debug builds if needed), I don't know how to fix it.
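
Reading the script, the step does not open an SSH session; port 22 is only used as a sign that the machine has booted. The wait-and-verify part is a bounded retry loop (24 attempts, 10 seconds apart, about 4 minutes), which could be written generically as in this sketch. `wait_until` is a hypothetical helper, not part of the workflow.

```shell
# Retry a command up to <attempts> times, sleeping <delay> seconds between tries.
# Usage: wait_until <attempts> <delay> <command> [args...]
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0   # command succeeded, stop waiting
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1       # gave up after all attempts
}
```

With a port probe as the command, `wait_until 24 10 <probe>` would mirror the timing of the loop in the workflow.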

  cancel-in-progress: true

jobs:
  prepare-dell:
Collaborator Author

This is not running at the moment; it just displays the message below. I was able to use the self-hosted runners the other day, so hopefully this is just a GitHub issue and it will run soon.

Image

Contributor

@vgvassilev vgvassilev left a comment

LGTM!

@vgvassilev vgvassilev merged commit 4c6e2c2 into compiler-research:main Mar 21, 2026
11 of 15 checks passed
@mcbarton
Collaborator Author

mcbarton commented Mar 21, 2026

@vgvassilev @aaronj0 this got merged in, but this PR was broken. It repeated what clad had, but the first stage (prepare-dell) doesn't pass the CI. The CI on main will not pass now that this has been merged in. I am away from a computer at the moment, so one of you will need to make the reversion PR.

Development

Successfully merging this pull request may close these issues.

[Bug]: Add a cuda builder in our ci.

2 participants