Description
Problem Description
The resource utilization plot generated by monitor/monitor.py shows that system total CPU usage is consistently much higher than process + children CPU usage. Based on observation, the system total CPU metric appears accurate, which means the process + children CPU usage is being systematically undercounted.
This is a monitoring bug that gives an inaccurate picture of resource utilization during pipeline execution.
Evidence
In the resource utilization plot, the orange "System Total" CPU line is consistently and significantly higher than the blue "Process + Children" CPU line, even though the process tree should account for most or all of the machine's CPU usage. This is especially true during multiprocessing work, when many worker processes are spawned and saturated with work.
Expected behavior: Process + children CPU should track closely with system total CPU during active pipeline work (accounting for baseline system overhead).
Actual behavior: Process + children CPU is substantially lower, suggesting significant undercounting.
Root Cause Analysis
The issue is in the get_process_tree_stats() function in monitor/monitor.py:32-97, specifically in how child process CPU usage is measured.
The Core Problem
```python
# Lines 49-51: Fresh Process objects created every monitoring interval
processes = [process]
with contextlib.suppress(psutil.NoSuchProcess):
    processes.extend(process.children(recursive=True))

# Lines 57-61: cpu_percent() called on newly created Process objects
for proc in processes:
    try:
        cpu = proc.cpu_percent(interval=0.0)  # Returns 0.0 on first call!
        total_cpu += cpu
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        continue
```
Key issue: every monitoring interval (~1 s), we call process.children(recursive=True), which creates brand-new psutil.Process objects for all child processes.
When cpu_percent(interval=0.0) is called on a newly created Process object for the first time, psutil has no baseline CPU measurements to compare against, so it returns 0.0 or a meaningless value. This is documented behavior in psutil.
Since we can have 96 child processes (multiprocessing workers) and we recreate Process objects every interval, we're systematically measuring 0.0 CPU for most children, leading to severe undercounting.
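The first-call behavior is easy to reproduce in isolation. A minimal standalone sketch (not taken from monitor/monitor.py):

```python
# Minimal sketch of the first-call pitfall: cpu_percent(interval=0.0)
# on a fresh psutil.Process has no baseline and returns 0.0.
import time

import psutil


def burn_cpu(seconds):
    """Spin so there is real CPU usage to measure."""
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        pass


proc = psutil.Process()  # current process

burn_cpu(0.25)
first = proc.cpu_percent(interval=0.0)   # first call: no baseline -> 0.0

burn_cpu(0.25)
second = proc.cpu_percent(interval=0.0)  # baseline now exists -> meaningful value
```

The monitoring loop hits the `first` case over and over, because every interval it measures through freshly constructed Process objects.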
Why System Total Works Correctly
In contrast, system total CPU (line 317) uses:
```python
psutil.cpu_percent(interval=0.1)
```
This measures CPU usage over a 0.1-second window by blocking and comparing system-wide CPU times at the start and end of the interval. It gives accurate real-time measurements without needing a baseline from previous calls.
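A minimal sketch of the blocking, interval-based call (the timing check is added for illustration only):

```python
# psutil.cpu_percent(interval=0.1) blocks for the whole interval and
# compares system-wide CPU times before and after it, so the very
# first call already returns a valid measurement.
import time

import psutil

start = time.monotonic()
total = psutil.cpu_percent(interval=0.1)  # blocks ~0.1 s
elapsed = time.monotonic() - start

# total is a system-wide percentage in [0, 100]; no prior call needed.
```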
Proposed Solutions
Option 1: Cache Process Objects (Recommended)
Approach: Maintain a persistent cache of Process objects across monitoring intervals.
Implementation:
- Add `self._process_cache = {}` (maps PID -> Process object) to `ResourceMonitor.__init__()`
- Modify `get_process_tree_stats()` to:
  - Get the current PIDs in the process tree
  - For each PID:
    - If it is in the cache: reuse the existing Process object (`cpu_percent()` will work correctly)
    - If it is new: create a Process object and add it to the cache (`cpu_percent()` returns 0.0 the first time, but is accurate thereafter)
  - Remove dead PIDs from the cache
  - Aggregate CPU and RAM from the cached Process objects
Pros:
- ✅ Accurate measurements after first interval (one-time 0.0 for new processes)
- ✅ No blocking delays
- ✅ Handles dynamic process creation/termination gracefully
- ✅ Minimal performance overhead
Cons:
- Slightly more complex code
- Requires cache management
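A sketch of the cached approach as a standalone class, assuming the steps above; the names (`CachedTreeStats`, `_process_cache`, `sample`) are illustrative, not the actual `ResourceMonitor` code:

```python
# Sketch of Option 1: persistent Process-object cache across intervals.
import contextlib
import os

import psutil


class CachedTreeStats:
    def __init__(self, root_pid):
        self._root = psutil.Process(root_pid)
        self._process_cache = {}  # PID -> psutil.Process

    def sample(self):
        """Return (total_cpu_percent, total_rss_bytes) for the tree."""
        procs = [self._root]
        with contextlib.suppress(psutil.NoSuchProcess):
            procs.extend(self._root.children(recursive=True))

        live_pids = set()
        total_cpu = 0.0
        total_rss = 0
        for proc in procs:
            pid = proc.pid
            live_pids.add(pid)
            # Reuse the cached object so cpu_percent() has a baseline;
            # a brand-new Process would report 0.0 on its first call.
            cached = self._process_cache.setdefault(pid, proc)
            try:
                total_cpu += cached.cpu_percent(interval=0.0)
                total_rss += cached.memory_info().rss
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                self._process_cache.pop(pid, None)

        # Drop PIDs that have exited so the cache does not grow unbounded.
        for pid in set(self._process_cache) - live_pids:
            del self._process_cache[pid]

        return total_cpu, total_rss


mon = CachedTreeStats(os.getpid())
cpu_pct, rss_bytes = mon.sample()  # first sample: new PIDs report 0.0 CPU
```

Only PIDs seen for the first time pay the one-off 0.0 reading; from the second interval on, every surviving worker is measured correctly.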
Option 2: Use cpu_times() with Manual Calculation
Approach: Replace cpu_percent() with cpu_times() and manually calculate CPU percentage based on time deltas.
Implementation:
- Store the previous `(cpu_times, timestamp)` for each process
- On each interval:
  - Get the current `cpu_times` and timestamp
  - Calculate the delta: `(cpu_times_now - cpu_times_prev) / (timestamp_now - timestamp_prev)`
  - Convert to a percentage
Pros:
- ✅ Full control over calculation
- ✅ No dependency on psutil's internal state
Cons:
- ❌ More complex calculation logic
- ❌ Need to handle edge cases (new processes, clock changes, etc.)
- ❌ Essentially reimplementing what psutil already does
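For comparison, a sketch of the manual `cpu_times()` delta calculation; `ManualCpuTracker` is a hypothetical name and the bookkeeping is deliberately minimal:

```python
# Sketch of Option 2: compute CPU% from cpu_times() deltas ourselves.
import time

import psutil


class ManualCpuTracker:
    def __init__(self):
        self._prev = {}  # PID -> (busy CPU seconds, wall-clock timestamp)

    def cpu_percent(self, proc):
        times = proc.cpu_times()
        now = time.monotonic()
        busy = times.user + times.system
        prev = self._prev.get(proc.pid)
        self._prev[proc.pid] = (busy, now)
        if prev is None:
            return 0.0  # new process: no baseline yet (same first-call caveat)
        prev_busy, prev_ts = prev
        wall = now - prev_ts
        if wall <= 0:
            return 0.0
        return 100.0 * (busy - prev_busy) / wall


tracker = ManualCpuTracker()
me = psutil.Process()
first = tracker.cpu_percent(me)   # no baseline -> 0.0
time.sleep(0.1)
second = tracker.cpu_percent(me)  # busy/wall delta since last call
```

Note this hand-rolled version has the same new-process blind spot as the cache in Option 1, while also owning the edge cases psutil already handles internally.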
Option 3: Block on Each Process (Not Recommended)
Approach: Use cpu_percent(interval=0.1) for each child process.
Pros:
- ✅ Simple, accurate measurements
Cons:
- ❌ CRITICAL: With 100+ child processes, this would block for 100 * 0.1 = 10+ seconds per monitoring interval
- ❌ Completely unacceptable for real-time monitoring
- ❌ Would severely impact performance
Recommendation
Implement Option 1 (Cache Process Objects). This provides accurate measurements with minimal complexity and no blocking issues. The approach is idiomatic with psutil's design and handles the dynamic nature of multiprocessing workers gracefully.
Related Code
- `monitor/monitor.py:32-97` - `get_process_tree_stats()` function
- `monitor/monitor.py:226-234` - `_get_process_tree_stats()` wrapper
- `monitor/monitor.py:301-382` - `_monitor_loop()` where monitoring happens
- `monitor/monitor.py:317` - system total CPU measurement (works correctly)
Additional Context
This bug was discovered when examining resource utilization plots and noticing that system total CPU was consistently 2-3x higher than process + children CPU during active pipeline processing, which shouldn't be possible given that the pipeline is the primary workload on the system.