InProcessTestHost Orchestration Not Resuming After Activity Completion Following ContinueAsNew #689
Description
Environment
- Package: Microsoft.DurableTask.InProcessTestHost
- Version: 0.2.1-preview.1
- Runtime: .NET 8.0
- Host: Azure Functions Worker (Isolated) v2.1.0
- DurableTask Version: 1.23.0
Summary
Orchestrations using ContinueAsNew intermittently fail to resume after an activity completes, causing the orchestration to hang indefinitely. The orchestrator never processes the activity result and never completes, even though the activity successfully returned.
Symptoms
- Activity completes and returns result successfully ✅
- Orchestrator never resumes to process the result ❌
- No StreamInstanceHistory gRPC call made after activity completion ❌
- Orchestration never completes ❌
- Tests time out waiting for orchestration completion ❌
Reproduction
Minimal Repro Code
```csharp
// Orchestrator that uses ContinueAsNew
[Function("TestOrchestrator")]
public async Task RunAsync([OrchestrationTrigger] TaskOrchestrationContext context, AsyncOperation input)
{
    var result = await context.CallActivityAsync<AsyncOperation>("TestActivity", input);
    input.Update(result);

    if (!input.Closed)
    {
        await context.CreateTimer(TimeSpan.FromSeconds(5), CancellationToken.None);
        context.ContinueAsNew(input); // Restart the orchestration with updated state
        return;
    }
}

// Activity that completes quickly (synchronous; no await needed)
[Function("TestActivity")]
public Task<AsyncOperation> TestActivityAsync([ActivityTrigger] AsyncOperation input)
{
    // Flips status so the work completes on the second iteration
    if (input.Status == Status.InProgress)
    {
        input.Status = Status.Succeeded;
    }
    else
    {
        input.Status = Status.InProgress;
    }
    return Task.FromResult(input);
}
```
Steps to Reproduce
1. Start an orchestration that uses ContinueAsNew
2. The orchestration schedules an activity
3. Activity completes and returns its result
4. Orchestration processes the result and calls ContinueAsNew
5. Orchestration restarts (2nd iteration)
6. Orchestration schedules another activity
7. Activity completes very quickly (timing-critical race condition)
8. Orchestrator never resumes ❌
Expected Behavior
After activity completes:
1. Activity completion triggers the gRPC CompleteActivityTask call ✅
2. InMemoryOrchestrationService.CompleteTaskActivityWorkItemAsync is called ✅
3. instanceStore.AddMessage(responseMessage) adds the activity result ✅
4. Orchestration instance should be scheduled to run ✅
5. Orchestrator resumes via a StreamInstanceHistory call ✅
6. Orchestrator processes the activity result ✅
7. Orchestration completes ✅
Actual Behavior
After activity completes:
1-3. Same as expected ✅
4. Orchestration instance is NOT scheduled ❌
5-7. Never happens ❌
Evidence from Logs
✅ Successful Orchestration Completion (for comparison):
```
[00:10.837] ExecuteAsyncOperationJob: returning OperationId=d5e70f22, Closed=True
[00:10.839] Activity 'ExecuteOperationActivity_Version1' of orchestration ID 'd5e70f22' completed.
[00:10.843] POST /TaskHubSidecarService/StreamInstanceHistory (orchestrator checking for result)
[00:10.846] AsyncOperationOrchestrator: Activity returned - Closed=True, InProgress=False
[00:10.847] AsyncOperationOrchestrator: Operation orchestration end
[00:10.852] 'AsyncOperationOrchestrator_Version1' orchestration with ID 'd5e70f22' completed.
```
❌ Failed Orchestration (hanging after ContinueAsNew):
```
[00:10.880] ExecuteAsyncOperationJob: returning OperationId=5e6356ca, Closed=True
[00:10.??? ] Activity 'ExecuteOperationActivity_Version1' of orchestration ID '5e6356ca' completed.
... NO StreamInstanceHistory call ...
... NO "Activity returned" log ...
... NO "Operation orchestration end" log ...
... NO orchestration completion log ...
[01:00.617] Test timeout - orchestration never completed
```
Timeline Showing the Race Condition
```
[00:10.845] 5e6356ca: Scheduling ContinueAsNew (Status=Waiting, blocked on d5e70f22)
[00:10.852] 5e6356ca: Orchestration #1 completed
[00:10.855] 5e6356ca: Orchestration #2 started (ContinueAsNew worked!) ✅
[00:10.855] 5e6356ca: Calling ExecuteOperationActivity_Version1
[00:10.859] 5e6356ca: Activity started
[00:10.867] 5e6356ca: Sees blocking operation as "Succeeded" ✅ (unblocked!)
[00:10.875] 5e6356ca: Sets environment to "Succeeded" ✅
[00:10.880] 5e6356ca: Activity returns Closed=True ✅
... ORCHESTRATOR NEVER RESUMES ❌
[01:00.617] Timeout
```
Root Cause Analysis
Location in Source
File: src/InProcessTestHost/Sidecar/InMemoryOrchestrationService.cs
Class: InMemoryInstanceStore
Method: AddMessage(TaskMessage message)
Lines: ~220-280
The Bug
When an activity completes, CompleteTaskActivityWorkItemAsync calls instanceStore.AddMessage(responseMessage). The AddMessage method contains this logic:
```csharp
public void AddMessage(TaskMessage message)
{
    // ... message handling ...
    state.MessagesJson.Add(message);

    if (!state.IsLoaded) // ← BUG: only schedules if NOT loaded
    {
        // The orchestration isn't running, so schedule it to run now.
        // If it is running, it will be scheduled again automatically when it's released.
        this.readyToRunQueue.Schedule(state);
    }
}
```
The problem: after ContinueAsNew:
1. Orchestration completes and calls CompleteTaskOrchestrationWorkItemAsync
2. State is saved with the continuedAsNewMessage
3. Orchestration restarts with ExecutionStartedEvent
4. New iteration starts and schedules an activity
5. Activity completes very quickly (before the orchestrator calls LockNextTaskOrchestrationWorkItemAsync again)
6. AddMessage is called with the activity result
7. At this point, state.IsLoaded might still be true from the previous iteration ❌
8. The message is added but the instance is NOT scheduled ❌
9. The orchestrator never gets the work item to process the result
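The lost wake-up above can be replayed deterministically. The sketch below is a standalone model of the interleaving, not the actual InMemoryOrchestrationService code; the class and field names only mirror the report:

```csharp
using System;
using System.Collections.Generic;

// Minimal model of one instance's state, as described in the report.
class InstanceState
{
    public bool IsLoaded;
    public List<string> Messages = new();
}

class Program
{
    static readonly Queue<InstanceState> readyToRunQueue = new();

    static void Main()
    {
        var state = new InstanceState { IsLoaded = true }; // orchestrator still holds it

        // Thread 1 (ReleaseLock) reads the pending-message count first...
        int pendingSeenByReleaseLock = state.Messages.Count; // 0

        // ...then Thread 2 (activity completion) runs AddMessage:
        state.Messages.Add("TaskCompletedEvent");
        if (!state.IsLoaded)                  // still true -> scheduling skipped
            readyToRunQueue.Enqueue(state);

        // ...then Thread 1 finishes ReleaseLock using its stale count:
        state.IsLoaded = false;
        if (pendingSeenByReleaseLock > 0)     // stale 0 -> scheduling skipped
            readyToRunQueue.Enqueue(state);

        // Net effect: one pending message, nothing scheduled — the hang.
        Console.WriteLine($"Pending={state.Messages.Count}, Scheduled={readyToRunQueue.Count}");
        // Prints: Pending=1, Scheduled=0
    }
}
```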
Race Condition Window
```
Thread 1 (Orchestrator):                    Thread 2 (Activity):
─────────────────────────────────────────   ──────────────────────────────────
CompleteTaskOrchestrationWorkItemAsync()
  └─ SaveState(continuedAsNewMessage)
  └─ [state.IsLoaded still TRUE]
                                            CompleteTaskActivityWorkItemAsync()
                                              └─ AddMessage(activityResult)
                                                  └─ state.MessagesJson.Add(msg) ✅
                                                  └─ if (!state.IsLoaded) ← FALSE!
                                                      └─ NOT SCHEDULED ❌
ReleaseLock(instanceId)
  └─ state.IsLoaded = false
  └─ if (state.MessagesJson.Count > 0) ← FALSE!
      └─ [message arrived AFTER this check]
      └─ NOT SCHEDULED ❌
```
The activity result arrives in the narrow window after the orchestration saved state but before ReleaseLock checks for pending messages.
Proposed Fix
Option 1: Always Check Message Queue in AddMessage (Safest)
```csharp
public void AddMessage(TaskMessage message)
{
    string instanceId = message.OrchestrationInstance.InstanceId;
    string? executionId = message.OrchestrationInstance.ExecutionId;

    SerializedInstanceState state = this.store.GetOrAdd(
        instanceId, id => new SerializedInstanceState(id, executionId));

    lock (state)
    {
        // ... existing logic ...
        state.MessagesJson.Add(message);

        if (!state.IsLoaded)
        {
            this.readyToRunQueue.Schedule(state);
        }
        // FIX: also schedule if an activity/sub-orchestration result arrives while loaded
        else if (message.Event is TaskCompletedEvent or TaskFailedEvent or
                 SubOrchestrationInstanceCompletedEvent)
        {
            // Ensure the orchestrator processes completion events even if marked as loaded
            this.readyToRunQueue.Schedule(state);
        }
    }
}
```
Option 2: Re-check Queue in ReleaseLock
```csharp
public void ReleaseLock(string instanceId)
{
    if (!this.store.TryGetValue(instanceId, out SerializedInstanceState? state) || !state.IsLoaded)
    {
        throw new InvalidOperationException($"Instance {instanceId} is not in the store or is not loaded!");
    }

    lock (state)
    {
        state.IsLoaded = false;

        // FIX: always re-check the message queue when releasing the lock.
        // Messages might have arrived while the instance was still loaded.
        if (state.MessagesJson.Count > 0 && !state.IsCompleted)
        {
            this.readyToRunQueue.Schedule(state);
        }
    }
}
```
Option 3: Atomic Lock Management
Use a more robust locking mechanism that prevents the race condition entirely by ensuring messages can't arrive between state save and lock release.
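As a sketch of what Option 3 could look like (illustrative names, not the real sidecar classes): if AddMessage and ReleaseLock share one per-instance critical section covering both the IsLoaded transition and the pending-message check, the window between state save and lock release disappears, because any interleaving of the two locked sections schedules the instance.

```csharp
using System.Collections.Generic;

// Hypothetical minimal store demonstrating the atomic approach.
class InstanceState
{
    public bool IsLoaded;
    public readonly List<object> Messages = new();
}

class InstanceStore
{
    public readonly Queue<InstanceState> ReadyToRunQueue = new();

    public void AddMessage(InstanceState state, object message)
    {
        lock (state) // same lock object as ReleaseLock
        {
            state.Messages.Add(message);
            if (!state.IsLoaded)
                ReadyToRunQueue.Enqueue(state);
            // If IsLoaded is still true here, ReleaseLock has not yet entered
            // its own locked section, so its count check below is guaranteed
            // to observe this message and schedule the instance.
        }
    }

    public void ReleaseLock(InstanceState state)
    {
        lock (state)
        {
            state.IsLoaded = false;
            if (state.Messages.Count > 0) // count read under the same lock
                ReadyToRunQueue.Enqueue(state);
        }
    }
}
```

Either ordering of the two critical sections now results in exactly one scheduling of the instance, which is what makes this variant race-free.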
Workaround
Since this is a test-only issue, we've worked around it by polling the database directly instead of relying on orchestration completion events:
```csharp
// Instead of: await WaitForFunctionAsync(orchestrator, condition)
// use direct polling:
var timeout = TimeSpan.FromSeconds(60);
var stopwatch = System.Diagnostics.Stopwatch.StartNew();

while (stopwatch.Elapsed < timeout)
{
    using var context = CreateDbContext();
    var entity = await context.GetEntityAsync();
    if (entity.ProvisioningState == "Succeeded")
        return;

    await Task.Delay(TimeSpan.FromMilliseconds(250));
}

throw new TimeoutException("Did not complete");
```
Impact
- Severity: High for tests, None for production (only affects InProcessTestHost)
- Frequency: Intermittent - depends on timing (activity completing very quickly after ContinueAsNew)
- Workaround: Available (direct database polling)
Additional Context
This issue only manifests when:
- Using ContinueAsNew in orchestrations
- Activity completes very quickly (within a few milliseconds)
- Running in the InProcessTestHost test environment
- Multiple orchestrations running concurrently
The issue does NOT affect:
- Production deployments using Azure Storage backend
- SQL Server backend
- Real Azure Functions deployments
Test Case
We can provide a complete integration test that reproduces this issue consistently if needed.
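As a rough outline of that test: the host bootstrap is hypothetical (CreateClientAsync is a placeholder, since we don't know the exact InProcessTestHost setup this repo's tests use), while the client calls are the standard Microsoft.DurableTask.Client surface; the loop repeats the scenario to widen the odds of hitting the narrow race window.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.DurableTask.Client;
using Xunit;

public class ContinueAsNewRaceTests
{
    // Placeholder: wire this to the InProcessTestHost bootstrap used in this repo.
    static Task<DurableTaskClient> CreateClientAsync() =>
        throw new NotImplementedException("start InProcessTestHost and return its client");

    [Fact]
    public async Task ContinueAsNew_CompletesEvenWhenActivityFinishesQuickly()
    {
        DurableTaskClient client = await CreateClientAsync();

        // Repeat the scenario; the hang is intermittent and timing-dependent.
        for (int i = 0; i < 50; i++)
        {
            string instanceId = await client.ScheduleNewOrchestrationInstanceAsync(
                "TestOrchestrator", input: new AsyncOperation());

            using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
            OrchestrationMetadata metadata = await client.WaitForInstanceCompletionAsync(
                instanceId, cts.Token);

            // With the bug present, WaitForInstanceCompletionAsync times out instead.
            Assert.Equal(OrchestrationRuntimeStatus.Completed, metadata.RuntimeStatus);
        }
    }
}
```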
Related Patterns
Our codebase has other tests with similar workarounds, suggesting this is a known issue:
```csharp
// From FlexWorkloadProfileTests.cs line 168
while (!poolTask.IsCompleted)
{
    await rp.FunctionsBackgroundTasksClient.InvokeFunctionAsync("NodePoolUpdateTimer", CancellationToken.None);
    await Task.Delay(TimeSpan.FromMilliseconds(250));
}
```
Request
- Confirm if this is a known issue in v0.2.1-preview.1
- Provide ETA for a fix or stable version
- Advise on whether Option 1, 2, or 3 is the preferred fix approach
- Consider adding more diagnostic logging to the test host so race conditions like this are easier to spot