195 changes: 195 additions & 0 deletions scripts/ISOLATION.md
@@ -0,0 +1,195 @@
# Test Isolation for floxenvs

For the full Nix-native isolation proposal (recommended),
see [nix-isolation-proposal.md](nix-isolation-proposal.md).

## Problem

When multiple environment tests run on the same builder
machine, services fight over ports and leave orphaned
processes:

- MySQL tests bind to port 3306 -- two concurrent runs clash
- Elasticsearch binds to its configured port -- same problem
- PostgreSQL on port 15432 -- same problem
- Orphaned service processes from failed runs block new runs

The Tailscale+SSH runner infrastructure works reliably. The
isolation between tests does not.
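
The clash described above can be detected before a run starts. A minimal sketch, assuming bash (it relies on bash's `/dev/tcp` redirection); `port_in_use` is a hypothetical helper and not part of the repository:

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if something accepts TCP connections on
# 127.0.0.1:<port>. Uses bash's built-in /dev/tcp redirection.
port_in_use() {
  local port="$1"
  # The subshell keeps file descriptor 3 from leaking into the caller.
  (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null
}

# Ports taken from the list above.
for port in 3306 15432; do
  if port_in_use "$port"; then
    echo "port ${port} is busy: a concurrent or orphaned service holds it"
  else
    echo "port ${port} is free"
  fi
done
```

A check like this only reports the conflict; it does not prevent two runs from racing for the same port, which is why the options below isolate the port space instead.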

## Approaches Considered

### Option A: Linux namespaces (chosen for prototype)

Use `unshare` to give each test its own network and PID
namespace. Each test gets its own loopback interface with its
own port space -- no conflicts possible.

Pros:

- Lightest option, no VM overhead
- No KVM required; unprivileged use needs Linux 3.8+
(user namespaces)
- Negligible startup overhead (no guest kernel to boot)
- PID namespace kills any leftover services when its init
process exits
- Available on all current builders

Cons:

- Shared kernel (less isolation than VMs)
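- Unprivileged user namespaces may be disabled on some
builders (for example via the Debian/Ubuntu sysctl
`kernel.unprivileged_userns_clone`)
- A new network namespace has only a loopback device, so
tests get no outbound network unless more is configured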
- No GPU isolation (but GPU envs are future work)
- Linux only (not applicable to macOS builders)
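
What Option A amounts to on the command line can be sketched as follows. This is an illustration, not the prototype script: `run_isolated` is a hypothetical helper, and the `--user --map-root-user` flags assume the builder permits unprivileged user namespaces. When the namespaces cannot be created, it runs the command directly.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command in fresh user, network, and PID
# namespaces, or directly if the kernel or policy does not allow that.
UNSHARE_FLAGS=(--user --map-root-user --net --pid --fork)

run_isolated() {
  # Probe with the same flags first so a restricted builder degrades
  # to plain execution instead of failing the test run.
  if command -v unshare >/dev/null 2>&1 &&
     unshare "${UNSHARE_FLAGS[@]}" true 2>/dev/null; then
    unshare "${UNSHARE_FLAGS[@]}" -- "$@"
  else
    echo "namespaces unavailable, running without isolation" >&2
    "$@"
  fi
}

# A new network namespace starts with its loopback device down; bring it
# up (if iproute2 is present) before a service binds to it.
run_isolated sh -c 'ip link set lo up 2>/dev/null || true; echo "running as PID $$"'
```

Inside the PID namespace the wrapped command is the init process, so when it exits the kernel kills whatever services it left behind.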

### Option B: systemd-nspawn containers

Lightweight containers using systemd-nspawn. Full filesystem
and network isolation.

Pros:

- Stronger isolation than raw namespaces
- Built into systemd (available on most Linux builders)
- Supports private networking and filesystem overlay

Cons:

- Requires systemd (not available on all build systems)
- More complex setup than raw unshare
- Heavier than namespaces for the problem we're solving

### Option C: Firecracker microVMs

Full VM isolation with ~125ms boot time.

Pros:

- Strongest isolation (separate kernel)
- ~125ms boot, ~5MB memory overhead
- Battle-tested (used by AWS Lambda)

Cons:

- Requires KVM (`/dev/kvm`)
- More complex provisioning (needs kernel + rootfs images)
- Overkill for port conflict isolation
- No GPU passthrough
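
Whether a builder meets the KVM requirement can be checked up front. A small sketch; `kvm_available` is a hypothetical helper, and the optional path argument exists only so the check can be tried against a stand-in device:

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the KVM device node exists as a
# character device and the current user may open it read-write.
kvm_available() {
  local dev="${1:-/dev/kvm}"
  [ -c "$dev" ] && [ -r "$dev" ] && [ -w "$dev" ]
}

if kvm_available; then
  echo "KVM usable: microVM-based isolation is an option on this builder"
else
  echo "KVM not usable: only namespace or container isolation applies"
fi
```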

### Option D: QEMU/KVM with VFIO

Full VMs with GPU passthrough capability.

Pros:

- Supports GPU passthrough via VFIO
- Full OS isolation
- NixOS test framework provides declarative QEMU VMs

Cons:

- Heaviest option (the ~50-200ms boot estimate is optimistic;
a full guest OS typically boots in seconds)
- Requires dedicated GPU hardware for passthrough
- Complex setup

## Prototype: isolated-test.sh

The `scripts/isolated-test.sh` wrapper implements Option A
(Linux namespaces). It:

1. Detects available namespace support (unprivileged user
   namespaces, root, or none)
2. Creates new namespaces:
   - Network namespace (own loopback, own port space)
   - PID namespace (orphaned services killed on exit)
   - Mount namespace (if root available)
3. Sets up loopback networking inside the namespace
4. Copies the environment to an isolated temp directory
5. Runs `flox activate [--start-services] -- bash test.sh`
6. Cleans up on exit
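
Steps 4 to 6 can be sketched as below. This is not the actual `isolated-test.sh`: `copy_env`, `run_env_test`, and the `FLOX_CMD` override are hypothetical names, the last one added only so the sketch can be dry-run on a machine without Flox.

```bash
#!/usr/bin/env bash
# Step 4: copy an environment directory (including dotfiles such as
# .flox) into a fresh temp directory and print the new path.
copy_env() {
  local src="$1" dest
  dest="$(mktemp -d "${TMPDIR:-/tmp}/floxenvs-XXXXXX")" || return 1
  cp -R "$src"/. "$dest"/ || return 1
  printf '%s\n' "$dest"
}

# Steps 5 and 6: run test.sh inside the copy, then remove the copy
# whether or not the test passed. Returns the test's exit status.
run_env_test() {
  local src="$1" start_services="${2:-}" testdir rc=0
  local args=(activate)
  if [ "$start_services" = "--start-services" ]; then
    args+=(--start-services)
  fi
  testdir="$(copy_env "$src")" || return 1
  (cd "$testdir" && "${FLOX_CMD:-flox}" "${args[@]}" -- bash test.sh) || rc=$?
  rm -rf "$testdir"
  return "$rc"
}

# Example (requires flox and an environment directory):
#   run_env_test ./postgres --start-services
```

Passing the optional flag through an array avoids the string concatenation that produces `flox activate --start-services` by word splitting.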

### Usage

```bash
# Run postgres test in isolation
./scripts/isolated-test.sh postgres --start-services

# Run go test (no services)
./scripts/isolated-test.sh go

# Run mysql test in isolation
./scripts/isolated-test.sh mysql --start-services
```

### Requirements

- Linux with user namespace support (kernel 3.8+)
- `unshare` from util-linux
- For full isolation (including the mount namespace): root
or CAP_SYS_ADMIN
- Falls back to running the test without isolation if
namespaces are unavailable
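
The three cases in these requirements could be told apart with a probe like the one below. A sketch only: `detect_ns_mode` is a hypothetical name, and the real script may detect support differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which isolation mode is usable.
#   root   - running as root, plain `unshare --net --pid` works
#   userns - unprivileged user namespaces are permitted
#   none   - no namespace support; run the test unisolated
detect_ns_mode() {
  if ! command -v unshare >/dev/null 2>&1; then
    echo none
  elif [ "$(id -u)" -eq 0 ] && unshare --net --pid --fork true 2>/dev/null; then
    echo root
  elif unshare --user --map-root-user --net --pid --fork true 2>/dev/null; then
    echo userns
  else
    echo none
  fi
}

case "$(detect_ns_mode)" in
  root|userns) echo "namespace isolation available" ;;
  none)        echo "falling back to unisolated execution" ;;
esac
```

Probing by actually running `unshare ... true` covers every restriction at once (missing binary, sysctl, container policy) without having to know which one applies.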

## CI Integration

To use `isolated-test.sh` in the current CI workflow, the
test execution in `ci.yml` would change from running the Nix
flake app directly to running it through the isolation
wrapper.

### Current flow (flake.nix test runner)

```bash
ssh remote-server nix run .#apps.system.test-env -- true
```

The Nix flake app (`flake.nix`) copies the environment to
a temp dir and runs `flox activate -- bash test.sh`.

### Proposed flow (with isolation)

The isolation wrapper can be integrated at two levels:

**Level 1: Wrap the flake app (minimal change)**

Modify `mkFloxEnvPkg` in `flake.nix` to run the test inside
an `unshare` namespace. This is the smallest CI change --
the SSH+Tailscale infrastructure stays identical.

**Level 2: Replace the flake app (bigger change)**

Use `isolated-test.sh` directly over SSH instead of the
Nix flake app. This simplifies the test runner but requires
Flox to be installed on the builder (instead of using Nix
to provide it).

### Recommended: Level 1

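Modify the shell script inside `mkFloxEnvPkg` in
`flake.nix` to wrap the `flox activate` call with
`unshare --net --pid --fork`. This keeps the existing CI
pipeline intact and adds isolation transparently. These
flags alone need root or CAP_SYS_ADMIN; an unprivileged
builder user must also pass `--user --map-root-user`,
otherwise `unshare` fails with a permission error.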

## Future Directions

### Phase 2: GPU testing (VFIO passthrough)

For CUDA/GPU environments (ollama, pytorch), we need
dedicated hardware with VFIO passthrough. This requires:

- Bare metal servers with NVIDIA GPUs
- IOMMU enabled (Intel VT-d or AMD-Vi)
- One physical GPU per test VM
- QEMU/KVM as the VM layer (Firecracker has no GPU support)
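
The IOMMU requirement can be verified on a candidate host before any VM work. A sketch under the assumption that the kernel exposes IOMMU groups under `/sys/kernel/iommu_groups` when the IOMMU is active; `iommu_enabled` is a hypothetical helper, and its path argument exists only so the check can be tried against a stand-in directory:

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if at least one IOMMU group is exposed.
iommu_enabled() {
  local groups_dir="${1:-/sys/kernel/iommu_groups}"
  [ -d "$groups_dir" ] && [ -n "$(ls -A "$groups_dir" 2>/dev/null)" ]
}

if iommu_enabled; then
  echo "IOMMU groups present: VFIO passthrough is possible in principle"
else
  echo "no IOMMU groups: check the VT-d/AMD-Vi firmware and kernel settings"
fi
```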

### Phase 3: macOS isolation (Tart)

For macOS environment testing:

- Mac mini fleet with Apple Silicon
- Tart (by Cirrus Labs) for macOS VMs
- ~15-30s boot time per VM
- No GPU passthrough (Metal requires bare metal)

### Hardware provisioning

Phases 2 and 3 require hardware procurement -- tracked as a
separate planning item in the environment health testing
effort.
92 changes: 92 additions & 0 deletions scripts/flake-unshare-patch.nix
@@ -0,0 +1,92 @@
# Prototype: patched mkFloxEnvPkg with namespace isolation
#
# This shows the minimal change to flake.nix that adds
# network + PID namespace isolation to every test run.
#
# Diff from current:
# + util-linux and iproute2 in packages
# + unshare wrapper around flox activate
# + loopback setup inside namespace
#
# To apply: replace mkFloxEnvPkg in flake.nix with this version.

# mkFloxEnvPkg = name: {
# path ? "${inputs.self}/${name}",
# packages ? with pkgs; [
# coreutils
# util-linux # provides unshare
# iproute2 # provides ip (for loopback)
# flox.packages."${system}".default
# ],
# isolated ? true,
# }: pkgs.writeShellScriptBin "test-${name}" ''
# set -exo pipefail
#
# export FLOX_DISABLE_METRICS=true
# export FLOX_ENVS_TESTING=1
# export PATH="${lib.makeBinPath packages}:$PATH"
# export LANG=
# export LC_COLLATE="C"
# export LC_CTYPE="C"
# export LC_MESSAGES="C"
# export LC_MONETARY="C"
# export LC_NUMERIC="C"
# export LC_TIME="C"
# export LC_ALL=
#
# mkdir -p /tmp/floxenvs
# # Assign first, export later: "export VAR=$(cmd)" always returns 0
# # and would hide a mktemp failure.
# if ! TESTDIR="$(mktemp --directory --tmpdir=/tmp/floxenvs --suffix floxenvs-${name}-example || mktemp --directory --tmpdir=/tmp --suffix floxenvs-${name}-example)" || [ -z "$TESTDIR" ]; then
# echo "Error: unable to create temp directory"
# exit 1
# fi
# export TESTDIR
#
# chmod g=rwx "$TESTDIR"
# cp -R ${path}/* $TESTDIR
# cp -R ${path}/.flox* $TESTDIR
# if [ -f ${path}/.env ]; then
# cp -R ${path}/.env $TESTDIR
# fi
# chown -R $(whoami) $TESTDIR/.flox*
# chmod -R a+w,g+rw $TESTDIR/.flox*
#
# cd $TESTDIR
# echo "Running tests in $TESTDIR"
#
# start_services=""
# if [ "$1" == "true" ]; then
# start_services=" --start-services"
# fi
#
# if [ ! -f test.sh ]; then
# echo "Error: No test.sh script found"
# exit 1
# fi
#
# echo "Running ${name} test..."
#
# # --- ISOLATION WRAPPER ---
# # On Linux, first probe whether unshare can create the
# # namespaces (this needs root or CAP_SYS_ADMIN), then wrap
# # the test in network + PID isolation. On Darwin, or if the
# # probe fails, fall back to direct execution. unshare is run
# # without exec and failures are captured with "|| ret=$?",
# # so the fallback and the status check below stay reachable
# # under "set -e".
# ret=0
# if [ "$(uname)" == "Linux" ] && unshare --net --pid --fork true 2>/dev/null; then
# echo "Isolating test in network+PID namespace..."
# unshare --net --pid --fork \
# ${pkgs.bashInteractive}/bin/bash -c '
# # Set up loopback in the new network namespace
# ip link set lo up 2>/dev/null || true
# cd "'"$TESTDIR"'"
# flox activate'"$start_services"' -- ${pkgs.bashInteractive}/bin/bash test.sh
# ' || ret=$?
# else
# echo "No namespace support, running without isolation..."
# flox activate$start_services -- ${pkgs.bashInteractive}/bin/bash test.sh || ret=$?
# fi
#
# if [ $ret -ne 0 ]; then
# echo "Error: Tests failed"
# exit $ret
# fi
# '';