
Bug: Process + children CPU usage systematically undercounted in resource monitor #12

@zachtheyek

Description

Problem Description

The resource utilization plot generated by monitor/monitor.py shows that system total CPU usage is consistently much higher than process + children CPU usage. Based on observation, the system total CPU metric appears accurate, which means the process + children CPU usage is being systematically undercounted.

This is a monitoring bug that gives an inaccurate picture of resource utilization during pipeline execution.

Evidence

In the resource utilization plot, the orange "System Total" CPU line is consistently and significantly higher than the blue "Process + Children" CPU line, even though the process tree should account for most/all of the machine's CPU usage. This is especially the case during multiprocessing work, where many worker processes are spawned & saturated.

Expected behavior: Process + children CPU should track closely with system total CPU during active pipeline work (accounting for baseline system overhead).

Actual behavior: Process + children CPU is substantially lower, suggesting significant undercounting.

Root Cause Analysis

The issue is in the get_process_tree_stats() function in monitor/monitor.py:32-97, specifically in how child process CPU usage is measured.

The Core Problem

```python
# Lines 49-51: fresh Process objects created every monitoring interval
processes = [process]
with contextlib.suppress(psutil.NoSuchProcess):
    processes.extend(process.children(recursive=True))

# Lines 57-61: cpu_percent() called on the newly created Process objects
for proc in processes:
    try:
        cpu = proc.cpu_percent(interval=0.0)  # returns 0.0 on first call!
        total_cpu += cpu
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        continue
```
Key issue: Every monitoring interval (every ~1s), we call process.children(recursive=True), which creates brand new psutil.Process objects for all child processes.

When cpu_percent(interval=0.0) is called on a newly created Process object for the first time, psutil has no baseline CPU measurements to compare against, so it returns 0.0 or a meaningless value. This is documented behavior in psutil.

Since there can be ~96 child processes (multiprocessing workers) and the Process objects are recreated every interval, every child reports 0.0 CPU on every sample, leading to severe undercounting.
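The first-call behavior is easy to reproduce in isolation. This is a minimal standalone sketch (not code from monitor.py): a fresh `psutil.Process` object has no stored CPU-time baseline, so its first non-blocking `cpu_percent()` call returns 0.0 no matter how busy the process actually is.

```python
# Repro sketch: first non-blocking cpu_percent() on a fresh Process
# object returns 0.0 because there is no baseline to diff against.
import os
import time

import psutil

proc = psutil.Process(os.getpid())
first = proc.cpu_percent(interval=0.0)   # no baseline yet -> 0.0

# Burn CPU so the second reading has real usage to measure.
t0 = time.monotonic()
while time.monotonic() - t0 < 0.2:
    pass

second = proc.cpu_percent(interval=0.0)  # delta since first call -> > 0
print(first, second)
```

The monitoring loop hits the `first` case for every child on every interval, because it discards the objects (and their baselines) each time.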

Why System Total Works Correctly

In contrast, system total CPU (line 317) uses:

```python
psutil.cpu_percent(interval=0.1)
```

This measures CPU usage over a 0.1 second window by blocking and comparing system-wide CPU times at the start vs. end of the interval. This gives accurate real-time measurements without needing a baseline from previous calls.

Proposed Solutions

Option 1: Cache Process Objects (Recommended)

Approach: Maintain a persistent cache of Process objects across monitoring intervals.

Implementation:

  1. Add `self._process_cache = {}  # maps PID -> Process object` to `ResourceMonitor.__init__()`
  2. Modify get_process_tree_stats() to:
    • Get current PIDs in process tree
    • For each PID:
      • If in cache: Reuse existing Process object (cpu_percent will work correctly)
      • If new: Create Process object, add to cache (cpu_percent returns 0.0 first time, but accurate thereafter)
    • Remove dead PIDs from cache
    • Aggregate CPU and RAM from cached Process objects
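The steps above could look roughly like this. This is a hedged sketch with hypothetical names (`ProcessTreeMonitor`, `sample`), not the actual monitor.py implementation; the key point is that `setdefault` keeps the first-seen `Process` object alive across intervals so `cpu_percent(interval=0.0)` has a baseline:

```python
# Sketch of Option 1: persist Process objects across monitoring
# intervals so non-blocking cpu_percent() readings are meaningful.
import contextlib

import psutil


class ProcessTreeMonitor:
    def __init__(self, pid):
        self._root = psutil.Process(pid)
        self._cache = {}  # pid -> psutil.Process, reused across samples

    def sample(self):
        # Collect the current process tree (children() returns fresh
        # Process objects every call, hence the cache below).
        procs = [self._root]
        with contextlib.suppress(psutil.NoSuchProcess):
            procs.extend(self._root.children(recursive=True))

        # Evict cache entries for processes that have exited.
        current_pids = {p.pid for p in procs}
        for pid in list(self._cache):
            if pid not in current_pids:
                del self._cache[pid]

        total_cpu = 0.0
        total_rss = 0
        for p in procs:
            # Reuse the cached object so cpu_percent has a baseline;
            # a new PID reports 0.0 once, then accurately thereafter.
            proc = self._cache.setdefault(p.pid, p)
            try:
                total_cpu += proc.cpu_percent(interval=0.0)
                total_rss += proc.memory_info().rss
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                self._cache.pop(p.pid, None)
        return total_cpu, total_rss
```

After the first `sample()` call primes the cache, subsequent calls return real per-tree CPU figures without any blocking.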

Pros:

  • ✅ Accurate measurements after first interval (one-time 0.0 for new processes)
  • ✅ No blocking delays
  • ✅ Handles dynamic process creation/termination gracefully
  • ✅ Minimal performance overhead

Cons:

  • Slightly more complex code
  • Requires cache management

Option 2: Use cpu_times() with Manual Calculation

Approach: Replace cpu_percent() with cpu_times() and manually calculate CPU percentage based on time deltas.

Implementation:

  1. Store previous (cpu_times, timestamp) for each process
  2. On each interval:
    • Get current cpu_times and timestamp
    • Calculate delta: (cpu_times_now - cpu_times_prev) / (timestamp_now - timestamp_prev)
    • Convert to percentage
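For reference, the manual delta calculation would amount to something like the following. The class name and structure are hypothetical, not existing code; it illustrates why this option largely reimplements what psutil's own baseline tracking already does:

```python
# Sketch of Option 2: CPU % = delta of (user + system) CPU seconds
# divided by wall-clock seconds elapsed, times 100.
import time

import psutil


class CpuTimesTracker:
    def __init__(self):
        self._prev = {}  # pid -> (cpu_seconds, wall_timestamp)

    def percent(self, proc):
        times = proc.cpu_times()
        busy = times.user + times.system
        now = time.monotonic()  # monotonic clock avoids wall-clock jumps
        prev = self._prev.get(proc.pid)
        self._prev[proc.pid] = (busy, now)
        if prev is None:
            return 0.0  # no baseline yet: same caveat as cpu_percent()
        prev_busy, prev_now = prev
        elapsed = now - prev_now
        if elapsed <= 0:
            return 0.0
        return 100.0 * (busy - prev_busy) / elapsed
```

Note that per-PID baselines still have to be stored and evicted, so the cache-management burden of Option 1 does not go away.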

Pros:

  • ✅ Full control over calculation
  • ✅ No dependency on psutil's internal state

Cons:

  • ❌ More complex calculation logic
  • ❌ Need to handle edge cases (new processes, clock changes, etc.)
  • ❌ Essentially reimplementing what psutil already does

Option 3: Block on Each Process (Not Recommended)

Approach: Use cpu_percent(interval=0.1) for each child process.

Pros:

  • ✅ Simple, accurate measurements

Cons:

  • ❌ CRITICAL: With the ~96 child processes seen here, this would block for 96 × 0.1 ≈ 10 seconds per monitoring interval
  • ❌ Completely unacceptable for real-time monitoring
  • ❌ Would severely impact performance

Recommendation

Implement Option 1 (Cache Process Objects). This provides accurate measurements with minimal complexity and no blocking issues. The approach is idiomatic with psutil's design and handles the dynamic nature of multiprocessing workers gracefully.

Related Code

  • monitor/monitor.py:32-97 - get_process_tree_stats() function
  • monitor/monitor.py:226-234 - _get_process_tree_stats() wrapper
  • monitor/monitor.py:301-382 - _monitor_loop() where monitoring happens
  • monitor/monitor.py:317 - System total CPU measurement (works correctly)

Additional Context

This bug was discovered when examining resource utilization plots and noticing that system total CPU was consistently 2-3x higher than process + children CPU during active pipeline processing, which shouldn't be possible given that the pipeline is the primary workload on the system.

Metadata

Labels: bug (Something isn't working), needs-discussion (Issue lacks a preceding discussion)
