
InProcessTestHost Orchestration Not Resuming After Activity Completion Following ContinueAsNew #689

@Tratcher

Description

Environment

  • Package: Microsoft.DurableTask.InProcessTestHost
  • Version: 0.2.1-preview.1
  • Runtime: .NET 8.0
  • Host: Azure Functions Worker (Isolated) v2.1.0
  • DurableTask Version: 1.23.0

Summary

Orchestrations using ContinueAsNew intermittently fail to resume after an activity completes, causing the orchestration to hang indefinitely. The orchestrator never processes the activity result and never completes, even though the activity successfully returned.

Symptoms

  1. Activity completes and returns result successfully ✅
  2. Orchestrator never resumes to process the result
  3. No StreamInstanceHistory gRPC call made after activity completion ❌
  4. Orchestration never completes ❌
  5. Tests timeout waiting for orchestration completion ❌

Reproduction

Minimal Repro Code

// Orchestrator that uses ContinueAsNew
[Function("TestOrchestrator")]
public async Task RunAsync([OrchestrationTrigger] TaskOrchestrationContext context, AsyncOperation input)
{
    var result = await context.CallActivityAsync<AsyncOperation>("TestActivity", input);
    input.Update(result);

    if (!input.Closed)
    {
        await context.CreateTimer(TimeSpan.FromSeconds(5), CancellationToken.None);
        context.ContinueAsNew(input); // Restart orchestration
        return;
    }
}

// Activity that completes quickly
[Function("TestActivity")]
[Function("TestActivity")]
public Task<AsyncOperation> TestActivityAsync([ActivityTrigger] AsyncOperation input)
{
    // Toggle the status: the first iteration moves to InProgress, the second
    // to Succeeded, so the orchestrator calls ContinueAsNew exactly once.
    input.Status = input.Status == Status.InProgress
        ? Status.Succeeded
        : Status.InProgress;
    return Task.FromResult(input);
}

Steps to Reproduce

  1. Start an orchestration that uses ContinueAsNew
  2. The orchestration schedules an activity
  3. Activity completes and returns result
  4. Orchestration processes result and calls ContinueAsNew
  5. Orchestration restarts (2nd iteration)
  6. Orchestration schedules another activity
  7. Activity completes very quickly (timing critical - race condition)
  8. Orchestrator never resumes

Expected Behavior

After activity completes:

  1. Activity completion triggers gRPC CompleteActivityTask call ✅
  2. InMemoryOrchestrationService.CompleteTaskActivityWorkItemAsync is called ✅
  3. instanceStore.AddMessage(responseMessage) adds the activity result ✅
  4. Orchestration instance should be scheduled to run
  5. Orchestrator resumes via StreamInstanceHistory call ✅
  6. Orchestrator processes activity result ✅
  7. Orchestration completes ✅

Actual Behavior

After activity completes:
  1-3. Same as expected ✅
  4. Orchestration instance is NOT scheduled ❌
  5-7. Never happen ❌

Evidence from Logs

Successful Orchestration Completion (for comparison):

[00:10.837] ExecuteAsyncOperationJob: returning OperationId=d5e70f22, Closed=True
[00:10.839] Activity 'ExecuteOperationActivity_Version1' of orchestration ID 'd5e70f22' completed.
[00:10.843] POST /TaskHubSidecarService/StreamInstanceHistory (orchestrator checking for result)
[00:10.846] AsyncOperationOrchestrator: Activity returned - Closed=True, InProgress=False
[00:10.847] AsyncOperationOrchestrator: Operation orchestration end
[00:10.852] 'AsyncOperationOrchestrator_Version1' orchestration with ID 'd5e70f22' completed.

Failed Orchestration (hanging after ContinueAsNew):

[00:10.880] ExecuteAsyncOperationJob: returning OperationId=5e6356ca, Closed=True
[00:10.??? ] Activity 'ExecuteOperationActivity_Version1' of orchestration ID '5e6356ca' completed.
... NO StreamInstanceHistory call ...
... NO "Activity returned" log ...
... NO "Operation orchestration end" log ...
... NO orchestration completion log ...
[01:00.617] Test timeout - orchestration never completed

Timeline Showing the Race Condition

[00:10.845] 5e6356ca: Scheduling ContinueAsNew (Status=Waiting, blocked on d5e70f22)
[00:10.852] 5e6356ca: Orchestration #1 completed
[00:10.855] 5e6356ca: Orchestration #2 started (ContinueAsNew worked!) ✅
[00:10.855] 5e6356ca: Calling ExecuteOperationActivity_Version1
[00:10.859] 5e6356ca: Activity started
[00:10.867] 5e6356ca: Sees blocking operation as "Succeeded" ✅ (unblocked!)
[00:10.875] 5e6356ca: Sets environment to "Succeeded" ✅
[00:10.880] 5e6356ca: Activity returns Closed=True ✅
... ORCHESTRATOR NEVER RESUMES ❌
[01:00.617] Timeout

Root Cause Analysis

Location in Source

File: src/InProcessTestHost/Sidecar/InMemoryOrchestrationService.cs
Class: InMemoryInstanceStore
Method: AddMessage(TaskMessage message)
Lines: ~220-280

The Bug

When an activity completes, CompleteTaskActivityWorkItemAsync calls instanceStore.AddMessage(responseMessage). The AddMessage method contains this logic:

public void AddMessage(TaskMessage message)
{
    // ... message handling ...
    
    state.MessagesJson.Add(message);

    if (!state.IsLoaded)  // ← BUG: Only schedules if NOT loaded
    {
        // The orchestration isn't running, so schedule it to run now.
        // If it is running, it will be scheduled again automatically when it's released.
        this.readyToRunQueue.Schedule(state);
    }
}

The problem: after ContinueAsNew, the following sequence plays out:

  1. Orchestration completes and calls CompleteTaskOrchestrationWorkItemAsync
  2. State is saved with the continuedAsNewMessage
  3. Orchestration restarts with ExecutionStartedEvent
  4. New iteration starts and schedules an activity
  5. Activity completes very quickly (before orchestrator calls LockNextTaskOrchestrationWorkItemAsync again)
  6. AddMessage is called with activity result
  7. At this point, state.IsLoaded might still be true from the previous iteration
  8. Message is added but instance is NOT scheduled
  9. The orchestrator never gets the work item to process the result

Race Condition Window

Thread 1 (Orchestrator):                    Thread 2 (Activity):
─────────────────────────────────────────   ──────────────────────────────────
CompleteTaskOrchestrationWorkItemAsync()
  └─ SaveState(continuedAsNewMessage)
  └─ [state.IsLoaded still TRUE]
ReleaseLock(instanceId)
  └─ if (state.MessagesJson.Count > 0) ← FALSE (no messages yet)
                                            CompleteTaskActivityWorkItemAsync()
                                              └─ AddMessage(activityResult)
                                                  └─ state.MessagesJson.Add(msg) ✅
                                                  └─ if (!state.IsLoaded) ← FALSE (still loaded!)
                                                      └─ NOT SCHEDULED ❌
  └─ state.IsLoaded = false
      └─ [message arrived AFTER the pending-message check]
      └─ NOT SCHEDULED ❌

The activity result arrives in the narrow window after ReleaseLock has checked for pending messages but before it marks the instance as unloaded. Neither side schedules the instance: AddMessage skips scheduling because the instance still looks loaded, and ReleaseLock already took its (now stale) snapshot of the message queue.
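The lost-wakeup pattern can be modeled deterministically. The sketch below is a simplified stand-in (our own `InstanceState`, `AddMessage`, and `ReleaseLock`, not the actual InMemoryInstanceStore code) that forces the bad interleaving by taking the "no pending messages" snapshot before the message arrives:

```csharp
using System;
using System.Collections.Generic;

class InstanceState
{
    public bool IsLoaded;
    public readonly List<string> Messages = new();
}

class Demo
{
    static readonly List<InstanceState> ReadyToRun = new();

    // Mirrors the buggy AddMessage: only schedules when the instance is unloaded.
    static void AddMessage(InstanceState state, string message)
    {
        state.Messages.Add(message);
        if (!state.IsLoaded)
            ReadyToRun.Add(state);
    }

    // Mirrors ReleaseLock acting on a stale "no pending messages" snapshot.
    static void ReleaseLock(InstanceState state, bool pendingSnapshot)
    {
        state.IsLoaded = false;
        if (pendingSnapshot)
            ReadyToRun.Add(state);
    }

    static void Main()
    {
        var state = new InstanceState { IsLoaded = true };

        // Orchestrator thread: snapshots the (still empty) message queue...
        bool pendingSnapshot = state.Messages.Count > 0;   // false

        // ...then the activity result lands while the instance is still loaded:
        AddMessage(state, "TaskCompleted");                // !IsLoaded is false → not scheduled

        // ...and ReleaseLock completes using the stale snapshot:
        ReleaseLock(state, pendingSnapshot);               // snapshot is false → not scheduled

        Console.WriteLine($"Messages={state.Messages.Count}, Scheduled={ReadyToRun.Count}");
        // Prints: Messages=1, Scheduled=0 — the completion message is stranded.
    }
}
```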

Proposed Fix

Option 1: Always Check Message Queue in AddMessage (Safest)

public void AddMessage(TaskMessage message)
{
    string instanceId = message.OrchestrationInstance.InstanceId;
    string? executionId = message.OrchestrationInstance.ExecutionId;

    SerializedInstanceState state = this.store.GetOrAdd(instanceId, id => new SerializedInstanceState(id, executionId));
    lock (state)
    {
        // ... existing logic ...
        
        state.MessagesJson.Add(message);

        if (!state.IsLoaded)
        {
            this.readyToRunQueue.Schedule(state);
        }
        // FIX: Also schedule if an activity/sub-orchestration result arrives while loaded
        else if (message.Event is TaskCompletedEvent or TaskFailedEvent or 
                 SubOrchestrationInstanceCompletedEvent)
        {
            // Ensure orchestrator processes completion events even if marked as loaded
            this.readyToRunQueue.Schedule(state);
        }
    }
}

Option 2: Re-check Queue in ReleaseLock

public void ReleaseLock(string instanceId)
{
    if (!this.store.TryGetValue(instanceId, out SerializedInstanceState? state) || !state.IsLoaded)
    {
        throw new InvalidOperationException($"Instance {instanceId} is not in the store or is not loaded!");
    }

    lock (state)
    {
        state.IsLoaded = false;
        
        // FIX: Always re-check message queue after releasing lock
        // Messages might have arrived while we were releasing
        if (state.MessagesJson.Count > 0 && !state.IsCompleted)
        {
            this.readyToRunQueue.Schedule(state);
        }
    }
}

Option 3: Atomic Lock Management

Use a more robust locking mechanism that prevents the race condition entirely by ensuring messages can't arrive between state save and lock release.
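As a sketch of how Options 2 and 3 could combine, the following simplified model (again with stand-in types, not the real implementation) guards both paths with the same per-state lock and re-checks the queue inside ReleaseLock. Under that discipline a message is either seen by ReleaseLock's re-check or arrives after IsLoaded is cleared, so every interleaving schedules the instance:

```csharp
using System;
using System.Collections.Generic;

class InstanceState
{
    public bool IsLoaded;
    public readonly List<string> Messages = new();
}

class FixedStore
{
    readonly List<InstanceState> readyToRun = new();

    public void AddMessage(InstanceState state, string message)
    {
        lock (state)
        {
            state.Messages.Add(message);
            if (!state.IsLoaded)
                readyToRun.Add(state);
            // else: ReleaseLock has not finished yet; its re-check under the
            // same lock will see this message and schedule the instance.
        }
    }

    public void ReleaseLock(InstanceState state)
    {
        lock (state)
        {
            state.IsLoaded = false;
            // Re-check inside the SAME lock AddMessage takes: a message either
            // landed before this point (scheduled here) or lands afterwards,
            // when IsLoaded is already false (scheduled by AddMessage).
            if (state.Messages.Count > 0)
                readyToRun.Add(state);
        }
    }

    public int ScheduledCount => readyToRun.Count;
}

class Demo
{
    static void Main()
    {
        var store = new FixedStore();
        var state = new InstanceState { IsLoaded = true };

        // The worst-case interleaving from this report: the activity result
        // arrives while the instance is still marked as loaded.
        store.AddMessage(state, "TaskCompleted");  // IsLoaded → deferred to ReleaseLock
        store.ReleaseLock(state);                  // re-check schedules the instance

        Console.WriteLine($"Scheduled={store.ScheduledCount}");
        // Prints: Scheduled=1 — the completion message is no longer lost.
    }
}
```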

Workaround

Since this is a test-only issue, we've worked around it by polling the database directly instead of relying on orchestration completion events:

// Instead of: await WaitForFunctionAsync(orchestrator, condition)

// Use direct polling:
var timeout = TimeSpan.FromSeconds(60);
var stopwatch = System.Diagnostics.Stopwatch.StartNew();

while (stopwatch.Elapsed < timeout)
{
    using var context = CreateDbContext();
    var entity = await context.GetEntityAsync();
    
    if (entity.ProvisioningState == "Succeeded")
        return;
    
    await Task.Delay(TimeSpan.FromMilliseconds(250));
}

throw new TimeoutException("Did not complete");
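The polling loop above can be factored into a reusable helper. `PollingHelper.PollUntilAsync` below is our own utility, not part of any DurableTask or test-host API:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

static class PollingHelper
{
    // Polls `condition` until it returns true or `timeout` elapses.
    public static async Task PollUntilAsync(
        Func<Task<bool>> condition,
        TimeSpan timeout,
        TimeSpan? interval = null)
    {
        interval ??= TimeSpan.FromMilliseconds(250);
        var stopwatch = Stopwatch.StartNew();

        while (stopwatch.Elapsed < timeout)
        {
            if (await condition())
                return;
            await Task.Delay(interval.Value);
        }

        throw new TimeoutException($"Condition not met within {timeout}.");
    }
}
```

With this, the workaround reads as a one-liner against the same `CreateDbContext`/`GetEntityAsync` helpers used above: `await PollingHelper.PollUntilAsync(async () => { using var ctx = CreateDbContext(); return (await ctx.GetEntityAsync()).ProvisioningState == "Succeeded"; }, TimeSpan.FromSeconds(60));`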

Impact

  • Severity: High for tests, None for production (only affects InProcessTestHost)
  • Frequency: Intermittent - depends on timing (activity completing very quickly after ContinueAsNew)
  • Workaround: Available (direct database polling)

Additional Context

This issue only manifests when:

  1. Using ContinueAsNew in orchestrations
  2. Activity completes very quickly (within a few milliseconds)
  3. Running in the InProcessTestHost test environment
  4. Multiple orchestrations running concurrently

The issue does NOT affect:

  • Production deployments using Azure Storage backend
  • SQL Server backend
  • Real Azure Functions deployments

Test Case

We can provide a complete integration test that reproduces this issue consistently if needed.

Related Patterns

Our codebase has other tests with similar workarounds, suggesting this is a known issue:

// From FlexWorkloadProfileTests.cs line 168
while (!poolTask.IsCompleted)
{
    await rp.FunctionsBackgroundTasksClient.InvokeFunctionAsync("NodePoolUpdateTimer", CancellationToken.None);
    await Task.Delay(TimeSpan.FromMilliseconds(250));
}

Request

  1. Confirm if this is a known issue in v0.2.1-preview.1
  2. Provide ETA for a fix or stable version
  3. Advise on whether Option 1, 2, or 3 is the preferred fix approach
  4. Consider adding more diagnostic logging to help diagnose these race conditions
