
Respect device selection in system stats #72

Merged: loribonna merged 2 commits into aimagelab:master from lxr2:device-aware-stats on Nov 22, 2025


lxr2 commented Nov 20, 2025

Hi, I encountered a severe performance degradation when selecting the GPU with --device:

uv run main.py --model sgd --dataset seq-cifar100 --device 1 --n_epochs 1 --lr 1e-3
Task 1 - Epoch 1: 100%|███████████████████████████| 157/157 [02:45<00:00,  1.05s/it, loss=2.24, lr=0.001]STOP ITERATION
Task 1 - Epoch 1: 100%|███████████████████████████| 157/157 [02:45<00:00,  1.05s/it, loss=2.24, lr=0.001]
Evaluating Task 1: 100%|████████████████████████████████| 32/32 [00:00<00:00, 61.30it/s, acc_task_1=33.7]
Accuracy for 1 task(s):          [Class-IL]: 33.7 %      [Task-IL]: 33.7 %
[INFO] 20-Nov-25 02:50:41 - Accuracy for 1 task(s):      [Class-IL]: 33.7 %      [Task-IL]: 33.7 %
        Raw accuracy values: Class-IL [33.7] | Task-IL [33.7]
[INFO] 20-Nov-25 02:50:41 -     Raw accuracy values: Class-IL [33.7] | Task-IL [33.7]


[INFO] 20-Nov-25 02:50:42 - Using 8 workers for the dataloader.
[INFO] 20-Nov-25 02:50:42 - Using 8 workers for the dataloader.
Task 2 - Epoch 1: 100%|███████████████████████████| 157/157 [02:43<00:00,  1.01s/it, loss=2.45, lr=0.001]STOP ITERATION
Task 2 - Epoch 1: 100%|███████████████████████████| 157/157 [02:43<00:00,  1.04s/it, loss=2.45, lr=0.001]
Evaluating Task 2: 100%|████████████████████████████████| 64/64 [00:00<00:00, 68.21it/s, acc_task_2=32.4]
Accuracy for 2 task(s):          [Class-IL]: 16.2 %      [Task-IL]: 31.85 %
[INFO] 20-Nov-25 02:53:27 - Accuracy for 2 task(s):      [Class-IL]: 16.2 %      [Task-IL]: 31.85 %
        Raw accuracy values: Class-IL [0.0, 32.4] | Task-IL [31.3, 32.4]
[INFO] 20-Nov-25 02:53:27 -     Raw accuracy values: Class-IL [0.0, 32.4] | Task-IL [31.3, 32.4]

CUDA_VISIBLE_DEVICES=1 uv run main.py --model sgd --dataset seq-cifar100 --n_epochs 1 --lr 1e-3

Task 1 - Epoch 1: 100%|███████████████████████████| 157/157 [00:25<00:00,  6.05it/s, loss=2.36, lr=0.001]STOP ITERATION
Task 1 - Epoch 1: 100%|███████████████████████████| 157/157 [00:25<00:00,  6.14it/s, loss=2.36, lr=0.001]
Evaluating Task 1: 100%|████████████████████████████████| 32/32 [00:00<00:00, 83.61it/s, acc_task_1=30.4]
Accuracy for 1 task(s):          [Class-IL]: 30.4 %      [Task-IL]: 30.4 %
[INFO] 20-Nov-25 02:54:18 - Accuracy for 1 task(s):      [Class-IL]: 30.4 %      [Task-IL]: 30.4 %
        Raw accuracy values: Class-IL [30.4] | Task-IL [30.4]
[INFO] 20-Nov-25 02:54:18 -     Raw accuracy values: Class-IL [30.4] | Task-IL [30.4]


[INFO] 20-Nov-25 02:54:19 - Using 8 workers for the dataloader.
[INFO] 20-Nov-25 02:54:19 - Using 8 workers for the dataloader.
Task 2 - Epoch 1: 100%|███████████████████████████| 157/157 [00:23<00:00,  6.99it/s, loss=2.36, lr=0.001]STOP ITERATION
Task 2 - Epoch 1: 100%|███████████████████████████| 157/157 [00:23<00:00,  6.55it/s, loss=2.36, lr=0.001]
Evaluating Task 2: 100%|████████████████████████████████| 64/64 [00:00<00:00, 83.98it/s, acc_task_2=28.9]
Accuracy for 2 task(s):          [Class-IL]: 14.45 %     [Task-IL]: 28.75 %
[INFO] 20-Nov-25 02:54:44 - Accuracy for 2 task(s):      [Class-IL]: 14.45 %     [Task-IL]: 28.75 %
        Raw accuracy values: Class-IL [0.0, 28.9] | Task-IL [28.599999999999998, 28.9]
[INFO] 20-Nov-25 02:54:44 -     Raw accuracy values: Class-IL [0.0, 28.9] | Task-IL [28.599999999999998, 28.9]

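As you can see, training with --device 1 is about 6.6 times slower than masking the GPU with CUDA_VISIBLE_DEVICES=1 (2:45 vs. 0:25 per epoch on Task 1).
This PR contains the fix proposed by GPT-5.1-Codex. Feel free to merge it or just use it as a reference. I tested it and it worked well.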
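
For reference, the slowdown can be recomputed from the epoch times that tqdm printed in the two logs above. This is only a small arithmetic sketch; the helper name `to_seconds` is made up for illustration.

```python
def to_seconds(mmss: str) -> int:
    """Convert a tqdm-style 'MM:SS' elapsed time into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# (elapsed with --device 1, elapsed with CUDA_VISIBLE_DEVICES=1), copied from the logs
timings = {
    "Task 1": ("02:45", "00:25"),
    "Task 2": ("02:43", "00:23"),
}

slowdowns = {
    task: to_seconds(slow) / to_seconds(fast)
    for task, (slow, fast) in timings.items()
}

for task, ratio in slowdowns.items():
    print(f"{task}: {ratio:.1f}x slower with --device 1")
```

Task 1 gives the 6.6x figure quoted above; Task 2 comes out slightly higher.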


Summary

  • limit system stats GPU sampling to the user-selected device(s) instead of every visible GPU
  • parse device strings/lists into CUDA ids and map GPU readings by id for logging
  • plumb args.device into stats tracking in the training loops (including deprecated path)
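
The second bullet can be pictured with a minimal sketch like the one below. It is not the code from this PR: `parse_device_ids` and `select_readings` are hypothetical names, and the accepted input forms (`1`, `"1"`, `"cuda:1"`, `"0,1"`, lists of these) are assumptions about what a device argument might look like.

```python
from typing import Iterable, Optional, Union

# Hypothetical type for whatever a --device style argument may carry.
DeviceSpec = Union[None, int, str, Iterable[Union[int, str]]]


def parse_device_ids(device: DeviceSpec) -> Optional[list[int]]:
    """Turn a device spec into CUDA ids; None means 'no restriction'."""
    if device is None:
        return None
    if isinstance(device, int):
        return [device]
    if isinstance(device, str):
        parts = [p.strip() for p in device.split(",") if p.strip()]
    else:
        parts = [str(p).strip() for p in device]

    ids: list[int] = []
    for part in parts:
        token = part.lower()
        if token.startswith("cuda:"):
            token = token[len("cuda:"):]
        # Entries without a numeric index (e.g. "cpu") are skipped.
        if token.isdigit() and int(token) not in ids:
            ids.append(int(token))
    return ids


def select_readings(readings: dict[int, float], device: DeviceSpec) -> dict[int, float]:
    """Keep only the per-GPU readings whose id was selected by the user."""
    ids = parse_device_ids(device)
    if ids is None:
        return dict(readings)
    return {i: readings[i] for i in ids if i in readings}


# Example: three GPUs were sampled, but the user asked for --device 1.
print(select_readings({0: 10.0, 1: 20.0, 2: 30.0}, "1"))
```

Keying the readings by GPU id, rather than by position in a list, keeps the logged values attached to the right device when only a subset is sampled.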

Context

When all GPUs are visible, stats collection scanned every GPU each step and flushed CUDA caches, making --device 1 slower than masking via CUDA_VISIBLE_DEVICES=1.
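
A rough way to see why this matters: the per-step cost of stats collection grows with the number of GPUs that get probed. The sketch below uses a hypothetical `sample_gpu_stats` function with an injected `probe` callable standing in for the real per-GPU query, so it runs without any GPU and makes no claim about the project's actual stats API.

```python
from typing import Callable, Iterable, Optional


def sample_gpu_stats(
    visible_ids: Iterable[int],
    probe: Callable[[int], float],
    selected_ids: Optional[Iterable[int]] = None,
) -> dict[int, float]:
    """Probe every visible GPU, or only the selected ones when given."""
    visible = list(visible_ids)
    if selected_ids is None:
        targets = visible
    else:
        targets = [i for i in selected_ids if i in visible]
    return {i: probe(i) for i in targets}


calls: list[int] = []


def fake_probe(gpu_id: int) -> float:
    """Stand-in for an expensive per-GPU query; records which GPU was hit."""
    calls.append(gpu_id)
    return 0.0


# Old behaviour (as described above): all four visible GPUs are probed.
sample_gpu_stats(range(4), fake_probe)
probed_all = list(calls)

# Device-aware behaviour: only the GPU chosen with --device 1 is probed.
calls.clear()
sample_gpu_stats(range(4), fake_probe, selected_ids=[1])
probed_selected = list(calls)

print(probed_all, probed_selected)
```

With `CUDA_VISIBLE_DEVICES=1` the process only ever sees a single GPU (exposed to it as index 0), which is why the masked run never paid the cost of probing the other devices.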

loribonna merged commit ea967c3 into aimagelab:master on Nov 22, 2025
2 checks passed