
Bug: Process + children CPU usage systematically undercounted in resource monitor #12

@zachtheyek

Description

Problem Description

The resource utilization plot generated by monitor/monitor.py shows that system total CPU usage is consistently much higher than process + children CPU usage. Based on observation, the system total CPU metric appears accurate, which means the process + children CPU usage is being systematically undercounted.

This is a monitoring bug that gives an inaccurate picture of resource utilization during pipeline execution.

Evidence

In the resource utilization plot, the orange "System Total" CPU line is consistently and significantly higher than the blue "Process + Children" CPU line, even though the process tree should account for most/all of the machine's CPU usage. This is especially the case during multiprocessing work, where many worker processes are spawned & saturated.

Expected behavior: Process + children CPU should track closely with system total CPU during active pipeline work (accounting for baseline system overhead).

Actual behavior: Process + children CPU is substantially lower, suggesting significant undercounting.

Root Cause Analysis

The issue is in the get_process_tree_stats() function in monitor/monitor.py:32-97, specifically in how child process CPU usage is measured.

The Core Problem

```python
# Lines 49-51: fresh Process objects created every monitoring interval
processes = [process]
with contextlib.suppress(psutil.NoSuchProcess):
    processes.extend(process.children(recursive=True))

# Lines 57-61: cpu_percent() called on the newly created Process objects
for proc in processes:
    try:
        cpu = proc.cpu_percent(interval=0.0)  # returns 0.0 on first call!
        total_cpu += cpu
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        continue
```
Key issue: Every monitoring interval (every ~1s), we call process.children(recursive=True), which creates brand new psutil.Process objects for all child processes.

When cpu_percent(interval=0.0) is called on a newly created Process object for the first time, psutil has no baseline CPU measurements to compare against, so it returns 0.0 or a meaningless value. This is documented behavior in psutil.

Since there can be ~96 child processes (multiprocessing workers) and the Process objects are recreated every interval, every child reports 0.0 CPU on every sample, leading to severe undercounting.
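The first-call behavior is easy to reproduce in isolation. This is a minimal standalone sketch (not code from monitor.py): a fresh `psutil.Process` object has no stored CPU-time baseline, so its first non-blocking `cpu_percent()` call returns 0.0 no matter how busy the process actually is.

```python
# Repro sketch: first non-blocking cpu_percent() on a fresh Process
# object returns 0.0 because there is no baseline to diff against.
import os
import time

import psutil

proc = psutil.Process(os.getpid())
first = proc.cpu_percent(interval=0.0)   # no baseline yet -> 0.0

# Burn CPU so the second reading has real usage to measure.
t0 = time.monotonic()
while time.monotonic() - t0 < 0.2:
    pass

second = proc.cpu_percent(interval=0.0)  # delta since first call -> > 0
print(first, second)
```

The monitoring loop hits the `first` case for every child on every interval, because it discards the objects (and their baselines) each time.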

Why System Total Works Correctly

In contrast, system total CPU (line 317) uses:

```python
psutil.cpu_percent(interval=0.1)
```

This measures CPU usage over a 0.1 second window by blocking and comparing system-wide CPU times at the start vs. end of the interval. This gives accurate real-time measurements without needing a baseline from previous calls.

Proposed Solutions

Option 1: Cache Process Objects (Recommended)

Approach: Maintain a persistent cache of Process objects across monitoring intervals.

Implementation:

  1. Add `self._process_cache = {}  # maps PID -> Process object` to `ResourceMonitor.__init__()`
  2. Modify get_process_tree_stats() to:
    • Get current PIDs in process tree
    • For each PID:
      • If in cache: Reuse existing Process object (cpu_percent will work correctly)
      • If new: Create Process object, add to cache (cpu_percent returns 0.0 first time, but accurate thereafter)
    • Remove dead PIDs from cache
    • Aggregate CPU and RAM from cached Process objects
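The steps above could look roughly like this. This is a hedged sketch with hypothetical names (`ProcessTreeMonitor`, `sample`), not the actual monitor.py implementation; the key point is that `setdefault` keeps the first-seen `Process` object alive across intervals so `cpu_percent(interval=0.0)` has a baseline:

```python
# Sketch of Option 1: persist Process objects across monitoring
# intervals so non-blocking cpu_percent() readings are meaningful.
import contextlib

import psutil


class ProcessTreeMonitor:
    def __init__(self, pid):
        self._root = psutil.Process(pid)
        self._cache = {}  # pid -> psutil.Process, reused across samples

    def sample(self):
        # Collect the current process tree (children() returns fresh
        # Process objects every call, hence the cache below).
        procs = [self._root]
        with contextlib.suppress(psutil.NoSuchProcess):
            procs.extend(self._root.children(recursive=True))

        # Evict cache entries for processes that have exited.
        current_pids = {p.pid for p in procs}
        for pid in list(self._cache):
            if pid not in current_pids:
                del self._cache[pid]

        total_cpu = 0.0
        total_rss = 0
        for p in procs:
            # Reuse the cached object so cpu_percent has a baseline;
            # a new PID reports 0.0 once, then accurately thereafter.
            proc = self._cache.setdefault(p.pid, p)
            try:
                total_cpu += proc.cpu_percent(interval=0.0)
                total_rss += proc.memory_info().rss
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                self._cache.pop(p.pid, None)
        return total_cpu, total_rss
```

After the first `sample()` call primes the cache, subsequent calls return real per-tree CPU figures without any blocking.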

Pros:

  • ✅ Accurate measurements after first interval (one-time 0.0 for new processes)
  • ✅ No blocking delays
  • ✅ Handles dynamic process creation/termination gracefully
  • ✅ Minimal performance overhead

Cons:

  • Slightly more complex code
  • Requires cache management

Option 2: Use cpu_times() with Manual Calculation

Approach: Replace cpu_percent() with cpu_times() and manually calculate CPU percentage based on time deltas.

Implementation:

  1. Store previous (cpu_times, timestamp) for each process
  2. On each interval:
    • Get current cpu_times and timestamp
    • Calculate delta: (cpu_times_now - cpu_times_prev) / (timestamp_now - timestamp_prev)
    • Convert to percentage
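For reference, the manual delta calculation would amount to something like the following. The class name and structure are hypothetical, not existing code; it illustrates why this option largely reimplements what psutil's own baseline tracking already does:

```python
# Sketch of Option 2: CPU % = delta of (user + system) CPU seconds
# divided by wall-clock seconds elapsed, times 100.
import time

import psutil


class CpuTimesTracker:
    def __init__(self):
        self._prev = {}  # pid -> (cpu_seconds, wall_timestamp)

    def percent(self, proc):
        times = proc.cpu_times()
        busy = times.user + times.system
        now = time.monotonic()  # monotonic clock avoids wall-clock jumps
        prev = self._prev.get(proc.pid)
        self._prev[proc.pid] = (busy, now)
        if prev is None:
            return 0.0  # no baseline yet: same caveat as cpu_percent()
        prev_busy, prev_now = prev
        elapsed = now - prev_now
        if elapsed <= 0:
            return 0.0
        return 100.0 * (busy - prev_busy) / elapsed
```

Note that per-PID baselines still have to be stored and evicted, so the cache-management burden of Option 1 does not go away.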

Pros:

  • ✅ Full control over calculation
  • ✅ No dependency on psutil's internal state

Cons:

  • ❌ More complex calculation logic
  • ❌ Need to handle edge cases (new processes, clock changes, etc.)
  • ❌ Essentially reimplementing what psutil already does

Option 3: Block on Each Process (Not Recommended)

Approach: Use cpu_percent(interval=0.1) for each child process.

Pros:

  • ✅ Simple, accurate measurements

Cons:

  • ❌ CRITICAL: With the ~96 child processes seen here, this would block for 96 × 0.1 ≈ 10 seconds per monitoring interval
  • ❌ Completely unacceptable for real-time monitoring
  • ❌ Would severely impact performance

Recommendation

Implement Option 1 (Cache Process Objects). This provides accurate measurements with minimal complexity and no blocking issues. The approach is idiomatic with psutil's design and handles the dynamic nature of multiprocessing workers gracefully.

Related Code

  • monitor/monitor.py:32-97 - get_process_tree_stats() function
  • monitor/monitor.py:226-234 - _get_process_tree_stats() wrapper
  • monitor/monitor.py:301-382 - _monitor_loop() where monitoring happens
  • monitor/monitor.py:317 - System total CPU measurement (works correctly)

Additional Context

This bug was discovered when examining resource utilization plots and noticing that system total CPU was consistently 2-3x higher than process + children CPU during active pipeline processing, which shouldn't be possible given that the pipeline is the primary workload on the system.

Metadata

Labels: bug (Something isn't working), needs-discussion (Issue lacks a preceding discussion)
