Skip to content

fix: prevent duplicate model downloads with atomic check-and-set#578

Open
ianliuy wants to merge 1 commit intoome-projects:mainfrom
ianliuy:fix/issue-308
Open

fix: prevent duplicate model downloads with atomic check-and-set#578
ianliuy wants to merge 1 commit intoome-projects:mainfrom
ianliuy:fix/issue-308

Conversation

@ianliuy
Copy link
Copy Markdown

@ianliuy ianliuy commented Apr 16, 2026

What's broken?

When a new ClusterBaseModel is created with --num-download-worker=2 or more, two download workers both start downloading the same model simultaneously. This duplicates all model shards and can produce a corrupted config.json from concatenated writes.

Who is affected?

Any OME deployment with multiple download workers (--num-download-worker>=2). Single-worker deployments are unaffected.

When does it trigger?

When a Download and DownloadOverride task for the same model are processed within milliseconds of each other — both workers enter processTask() before either registers in activeDownloads.

Where is the bug?

pkg/modelagent/gopher.go, lines 278-284 (before this fix). The context creation and activeDownloads registration were separate from any existence check:

// Both workers reach here before either registers
ctx, cancel = context.WithCancel(context.Background())
s.activeDownloadsMutex.Lock()
s.activeDownloads[modelUID] = cancel  // No check before registration
s.activeDownloadsMutex.Unlock()

Why does it happen?

The check-then-act on activeDownloads was not atomic. Worker 1 could create its context and be about to lock the mutex, while Worker 2 also creates its context — neither has registered yet, so neither sees the other.

How did we fix it?

Atomic check-and-set inside the mutex: lock → check if model is already downloading → if not, create context and register → unlock. If another worker is already downloading the model, the duplicate task returns early with an info log.

s.activeDownloadsMutex.Lock()
if _, isDownloading := s.activeDownloads[modelUID]; isDownloading {
    s.activeDownloadsMutex.Unlock()
    s.logger.Infof("Model %s is already being downloaded, skipping duplicate download", modelInfo)
    return nil
}
ctx, cancel = context.WithCancel(context.Background())
s.activeDownloads[modelUID] = cancel
s.activeDownloadsMutex.Unlock()

This is the minimal fix — 1 file, 12 insertions / 8 deletions. The existing defer cleanup and Delete task cancellation logic are unchanged.

How do we know it works?

The fix ensures the TOCTOU window is eliminated: no two workers can both pass the "not downloading" check for the same model, because the check and registration happen inside the same lock acquisition.

Fixes #308

Move the activeDownloads check and registration into the same mutex-locked
critical section in processTask(). Previously, two download workers could
both pass the check before either registered, causing duplicate downloads
of the same model.

The fix performs an atomic check-and-set: lock the mutex, check if the
model is already downloading, and if not, register the cancel function
before releasing the lock. If another worker is already downloading the
model, the duplicate task is skipped with an info log.

Fixes ome-projects#308

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the model-agent Model agent changes label Apr 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

model-agent Model agent changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Duplicate model downloads when creating new ClusterBaseModel

1 participant