Make use of self hosted runners #848
Note that the self-hosted bots are not ephemeral, and anything installed stays until the next regeneration.
Is it possible for someone to SSH onto the runner and manually install CUDA 12 and 13 using the commands in this PR? I can then change the PR to use the correct CUDA version depending on the cuda matrix option.
Alternatively (this will probably take some thinking about how to implement correctly), what about using Docker on the self-hosted runner to provide a clean environment each time someone runs a CI job? We could then run the CI inside the container and delete the image at the end of the job to stop us from filling up the runner's disk space. That way no one's PR risks influencing anyone else's that runs on the self-hosted runner.
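A rough sketch of what that container idea could look like, not a worked-out design. The image name `ci-image:latest`, the `/workspace` mount point, and the `build_and_test.sh` script are hypothetical placeholders, and `DRY_RUN=1` only prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Sketch only: run one CI job inside a throwaway container.
# Image name, mount point and job command are hypothetical.

# run_ci_in_container IMAGE CMD [ARGS...]
run_ci_in_container() {
  local image=$1; shift
  # --rm removes the container (not the image) when the job exits.
  local cmd=(docker run --rm -v "$PWD:/workspace" -w /workspace "$image" "$@")
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command so it can be inspected without Docker installed.
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

For example, `DRY_RUN=1 run_ci_in_container ci-image:latest ./build_and_test.sh` prints the `docker run` line. Deleting the image itself afterwards would need a separate `docker image rm`, and CUDA jobs would additionally need GPU access inside the container (for example Docker's `--gpus` flag with the NVIDIA Container Toolkit on the runner).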
It is stateful on purpose because it takes much longer to install CUDA, etc.
Note that these systems are cuda ready -- take a look at how they are used in clad. |
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff           @@
##             main     #848   +/-   ##
=======================================
  Coverage   79.56%   79.56%
=======================================
  Files          11       11
  Lines        4013     4013
=======================================
  Hits         3193     3193
  Misses        820      820
Any news on that?
I will finish this PR off either later today or tomorrow morning |
Force-pushed from 69d8790 to 3a1b4a5
  prepare-dell:
    name: Activate self-host infrastructure
    runs-on: self-hosted
    steps:
      - name: Send Magic Packet
        env:
          TARGET_IP: 192.168.100.30
          MAC_ADDR: a4:bb:6d:51:d5:d2
        # The container has no ping, emulate it.
        run: |
          # Mask the IP and potential broadcast to keep logs clean
          echo "::add-mask::$MAC_ADDR"
          echo "::add-mask::$BROADCAST"
          echo "::add-mask::$TARGET_IP"
          BROADCAST=$(echo $TARGET_IP | sed 's/\.[0-9]*$/ .255/' | tr -d ' ')
          PING="timeout 1 bash -c 'cat < /dev/null > /dev/tcp/$TARGET_IP/22' 2>/dev/null"

          # Install tool silently
          sudo apt-get update -qq && sudo apt-get install -y -qq wakeonlan > /dev/null

          # Check if already awake (using the Bash TCP PING variable)
          if eval "$PING"; then
            echo "Target machine is already awake. Exiting."
            exit 0
          fi

          # If offline, send WoL
          echo "Machine is offline. Sending WoL..."
          wakeonlan -i $BROADCAST $MAC_ADDR > /dev/null

          # Wait & Verify Loop (checks every 10s for 4 minutes)
          echo "Waiting for response (checking Port 22)..."
          for i in {1..24}; do
            if eval "$PING"; then
              echo "Machine is online and SSH is ready."
              exit 0
            fi
            sleep 10
          done

          echo "Error: Target hardware did not respond within the timeout period."
          exit 1
Not sure what this is actually doing, or how it activates the self-hosted infrastructure. Took it from Clad's workflows.
The self-hosted runner will now try to run this section, but it fails. Since I don't know exactly what it does (I'm guessing from the comments that it allows you to SSH into the runner for debug builds if needed), I don't know how to fix it.
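Reading the quoted step as written, it appears to wake a second machine with a Wake-on-LAN packet and then poll TCP port 22 until something answers; it does not itself open an SSH session. A minimal sketch of that probe-and-wait logic follows. The function names and parameters are hypothetical, and the broadcast helper assumes a /24 network, as the quoted `sed` expression does.

```shell
#!/usr/bin/env bash
# Sketch only: the port probe and retry loop from the quoted step,
# split into small functions. All names here are hypothetical.

# port_open HOST PORT -> status 0 if a TCP connection succeeds within 1s.
# Uses Bash's /dev/tcp redirection in place of ping.
port_open() {
  local host=$1 port=$2
  timeout 1 bash -c "cat < /dev/null > /dev/tcp/$host/$port" 2>/dev/null
}

# wait_for_port HOST PORT ATTEMPTS DELAY -> retry port_open,
# sleeping DELAY seconds after each failed attempt.
wait_for_port() {
  local host=$1 port=$2 attempts=$3 delay=$4 i
  for ((i = 1; i <= attempts; i++)); do
    if port_open "$host" "$port"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# broadcast_of IP -> the IP with its last octet replaced by 255 (assumes /24).
broadcast_of() {
  printf '%s\n' "${1%.*}.255"
}
```

Called as `wait_for_port "$TARGET_IP" 22 24 10`, this corresponds to the quoted loop: 24 attempts 10 seconds apart, roughly 4 minutes in total.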
  cancel-in-progress: true

jobs:
  prepare-dell:
@vgvassilev @aaronj0 this got merged in, but this PR was broken. It repeated what Clad had, but the first stage (prepare-dell) doesn't pass the CI. The CI on main will not pass now that this has been merged in. I am away from a computer at the moment, so one of you will need to make the reversion PR.

closes #847