
InProcessTestHost Orchestration Not Resuming After Activity Completion Following ContinueAsNew #689

@Tratcher

Description

Environment

  • Package: Microsoft.DurableTask.InProcessTestHost
  • Version: 0.2.1-preview.1
  • Runtime: .NET 8.0
  • Host: Azure Functions Worker (Isolated) v2.1.0
  • DurableTask Version: 1.23.0

Summary

Orchestrations using ContinueAsNew intermittently fail to resume after an activity completes, causing the orchestration to hang indefinitely. The orchestrator never processes the activity result and never completes, even though the activity successfully returned.

Symptoms

  1. Activity completes and returns result successfully ✅
  2. Orchestrator never resumes to process the result
  3. No StreamInstanceHistory gRPC call made after activity completion ❌
  4. Orchestration never completes ❌
  5. Tests timeout waiting for orchestration completion ❌

Reproduction

Minimal Repro Code

// Orchestrator that uses ContinueAsNew
[Function("TestOrchestrator")]
public async Task RunAsync([OrchestrationTrigger] TaskOrchestrationContext context, AsyncOperation input)
{
    var result = await context.CallActivityAsync<AsyncOperation>("TestActivity", input);
    input.Update(result);

    if (!input.Closed)
    {
        await context.CreateTimer(TimeSpan.FromSeconds(5), CancellationToken.None);
        context.ContinueAsNew(input); // Restart orchestration
        return;
    }
}

// Activity that completes quickly
[Function("TestActivity")]
[Function("TestActivity")]
public Task<AsyncOperation> TestActivityAsync([ActivityTrigger] AsyncOperation input)
{
    // Toggle the status: the first iteration moves to InProgress, the second
    // to Succeeded, so the orchestrator calls ContinueAsNew exactly once.
    input.Status = input.Status == Status.InProgress
        ? Status.Succeeded
        : Status.InProgress;
    return Task.FromResult(input);
}

Steps to Reproduce

  1. Start an orchestration that uses ContinueAsNew
  2. The orchestration schedules an activity
  3. Activity completes and returns result
  4. Orchestration processes result and calls ContinueAsNew
  5. Orchestration restarts (2nd iteration)
  6. Orchestration schedules another activity
  7. Activity completes very quickly (timing critical - race condition)
  8. Orchestrator never resumes

Expected Behavior

After activity completes:

  1. Activity completion triggers gRPC CompleteActivityTask call ✅
  2. InMemoryOrchestrationService.CompleteTaskActivityWorkItemAsync is called ✅
  3. instanceStore.AddMessage(responseMessage) adds the activity result ✅
  4. Orchestration instance should be scheduled to run
  5. Orchestrator resumes via StreamInstanceHistory call ✅
  6. Orchestrator processes activity result ✅
  7. Orchestration completes ✅

Actual Behavior

After activity completes:
  1-3. Same as expected ✅
  4. Orchestration instance is NOT scheduled ❌
  5-7. Never happen ❌

Evidence from Logs

Successful Orchestration Completion (for comparison):

[00:10.837] ExecuteAsyncOperationJob: returning OperationId=d5e70f22, Closed=True
[00:10.839] Activity 'ExecuteOperationActivity_Version1' of orchestration ID 'd5e70f22' completed.
[00:10.843] POST /TaskHubSidecarService/StreamInstanceHistory (orchestrator checking for result)
[00:10.846] AsyncOperationOrchestrator: Activity returned - Closed=True, InProgress=False
[00:10.847] AsyncOperationOrchestrator: Operation orchestration end
[00:10.852] 'AsyncOperationOrchestrator_Version1' orchestration with ID 'd5e70f22' completed.

Failed Orchestration (hanging after ContinueAsNew):

[00:10.880] ExecuteAsyncOperationJob: returning OperationId=5e6356ca, Closed=True
[00:10.??? ] Activity 'ExecuteOperationActivity_Version1' of orchestration ID '5e6356ca' completed.
... NO StreamInstanceHistory call ...
... NO "Activity returned" log ...
... NO "Operation orchestration end" log ...
... NO orchestration completion log ...
[01:00.617] Test timeout - orchestration never completed

Timeline Showing the Race Condition

[00:10.845] 5e6356ca: Scheduling ContinueAsNew (Status=Waiting, blocked on d5e70f22)
[00:10.852] 5e6356ca: Orchestration #1 completed
[00:10.855] 5e6356ca: Orchestration #2 started (ContinueAsNew worked!) ✅
[00:10.855] 5e6356ca: Calling ExecuteOperationActivity_Version1
[00:10.859] 5e6356ca: Activity started
[00:10.867] 5e6356ca: Sees blocking operation as "Succeeded" ✅ (unblocked!)
[00:10.875] 5e6356ca: Sets environment to "Succeeded" ✅
[00:10.880] 5e6356ca: Activity returns Closed=True ✅
... ORCHESTRATOR NEVER RESUMES ❌
[01:00.617] Timeout

Root Cause Analysis

Location in Source

File: src/InProcessTestHost/Sidecar/InMemoryOrchestrationService.cs
Class: InMemoryInstanceStore
Method: AddMessage(TaskMessage message)
Lines: ~220-280

The Bug

When an activity completes, CompleteTaskActivityWorkItemAsync calls instanceStore.AddMessage(responseMessage). The AddMessage method contains this logic:

public void AddMessage(TaskMessage message)
{
    // ... message handling ...
    
    state.MessagesJson.Add(message);

    if (!state.IsLoaded)  // ← BUG: Only schedules if NOT loaded
    {
        // The orchestration isn't running, so schedule it to run now.
        // If it is running, it will be scheduled again automatically when it's released.
        this.readyToRunQueue.Schedule(state);
    }
}

The problem: after ContinueAsNew, the following sequence plays out:

  1. Orchestration completes and calls CompleteTaskOrchestrationWorkItemAsync
  2. State is saved with the continuedAsNewMessage
  3. Orchestration restarts with ExecutionStartedEvent
  4. New iteration starts and schedules an activity
  5. Activity completes very quickly (before orchestrator calls LockNextTaskOrchestrationWorkItemAsync again)
  6. AddMessage is called with activity result
  7. At this point, state.IsLoaded might still be true from the previous iteration
  8. Message is added but instance is NOT scheduled
  9. The orchestrator never gets the work item to process the result

Race Condition Window

Thread 1 (Orchestrator):                    Thread 2 (Activity):
─────────────────────────────────────────   ──────────────────────────────────
CompleteTaskOrchestrationWorkItemAsync()
  └─ SaveState(continuedAsNewMessage)
  └─ [state.IsLoaded still TRUE]
ReleaseLock(instanceId)
  └─ if (state.MessagesJson.Count > 0) ← FALSE (no messages yet)
                                            CompleteTaskActivityWorkItemAsync()
                                              └─ AddMessage(activityResult)
                                                  └─ state.MessagesJson.Add(msg) ✅
                                                  └─ if (!state.IsLoaded) ← FALSE (still loaded!)
                                                      └─ NOT SCHEDULED ❌
  └─ state.IsLoaded = false
      └─ [message arrived AFTER the pending-message check]
      └─ NOT SCHEDULED ❌

The activity result arrives in the narrow window after ReleaseLock has checked for pending messages but before it marks the instance as unloaded. Neither side schedules the instance: AddMessage skips scheduling because the instance still looks loaded, and ReleaseLock already took its (now stale) snapshot of the message queue.
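The lost-wakeup pattern can be modeled deterministically. The sketch below is a simplified stand-in (our own `InstanceState`, `AddMessage`, and `ReleaseLock`, not the actual InMemoryInstanceStore code) that forces the bad interleaving by taking the "no pending messages" snapshot before the message arrives:

```csharp
using System;
using System.Collections.Generic;

class InstanceState
{
    public bool IsLoaded;
    public readonly List<string> Messages = new();
}

class Demo
{
    static readonly List<InstanceState> ReadyToRun = new();

    // Mirrors the buggy AddMessage: only schedules when the instance is unloaded.
    static void AddMessage(InstanceState state, string message)
    {
        state.Messages.Add(message);
        if (!state.IsLoaded)
            ReadyToRun.Add(state);
    }

    // Mirrors ReleaseLock acting on a stale "no pending messages" snapshot.
    static void ReleaseLock(InstanceState state, bool pendingSnapshot)
    {
        state.IsLoaded = false;
        if (pendingSnapshot)
            ReadyToRun.Add(state);
    }

    static void Main()
    {
        var state = new InstanceState { IsLoaded = true };

        // Orchestrator thread: snapshots the (still empty) message queue...
        bool pendingSnapshot = state.Messages.Count > 0;   // false

        // ...then the activity result lands while the instance is still loaded:
        AddMessage(state, "TaskCompleted");                // !IsLoaded is false → not scheduled

        // ...and ReleaseLock completes using the stale snapshot:
        ReleaseLock(state, pendingSnapshot);               // snapshot is false → not scheduled

        Console.WriteLine($"Messages={state.Messages.Count}, Scheduled={ReadyToRun.Count}");
        // Prints: Messages=1, Scheduled=0 — the completion message is stranded.
    }
}
```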

Proposed Fix

Option 1: Always Check Message Queue in AddMessage (Safest)

public void AddMessage(TaskMessage message)
{
    string instanceId = message.OrchestrationInstance.InstanceId;
    string? executionId = message.OrchestrationInstance.ExecutionId;

    SerializedInstanceState state = this.store.GetOrAdd(instanceId, id => new SerializedInstanceState(id, executionId));
    lock (state)
    {
        // ... existing logic ...
        
        state.MessagesJson.Add(message);

        if (!state.IsLoaded)
        {
            this.readyToRunQueue.Schedule(state);
        }
        // FIX: Also schedule if an activity/sub-orchestration result arrives while loaded
        else if (message.Event is TaskCompletedEvent or TaskFailedEvent or 
                 SubOrchestrationInstanceCompletedEvent)
        {
            // Ensure orchestrator processes completion events even if marked as loaded
            this.readyToRunQueue.Schedule(state);
        }
    }
}

Option 2: Re-check Queue in ReleaseLock

public void ReleaseLock(string instanceId)
{
    if (!this.store.TryGetValue(instanceId, out SerializedInstanceState? state) || !state.IsLoaded)
    {
        throw new InvalidOperationException($"Instance {instanceId} is not in the store or is not loaded!");
    }

    lock (state)
    {
        state.IsLoaded = false;
        
        // FIX: Always re-check message queue after releasing lock
        // Messages might have arrived while we were releasing
        if (state.MessagesJson.Count > 0 && !state.IsCompleted)
        {
            this.readyToRunQueue.Schedule(state);
        }
    }
}

Option 3: Atomic Lock Management

Use a more robust locking mechanism that prevents the race condition entirely by ensuring messages can't arrive between state save and lock release.
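As a sketch of how Options 2 and 3 could combine, the following simplified model (again with stand-in types, not the real implementation) guards both paths with the same per-state lock and re-checks the queue inside ReleaseLock. Under that discipline a message is either seen by ReleaseLock's re-check or arrives after IsLoaded is cleared, so every interleaving schedules the instance:

```csharp
using System;
using System.Collections.Generic;

class InstanceState
{
    public bool IsLoaded;
    public readonly List<string> Messages = new();
}

class FixedStore
{
    readonly List<InstanceState> readyToRun = new();

    public void AddMessage(InstanceState state, string message)
    {
        lock (state)
        {
            state.Messages.Add(message);
            if (!state.IsLoaded)
                readyToRun.Add(state);
            // else: ReleaseLock has not finished yet; its re-check under the
            // same lock will see this message and schedule the instance.
        }
    }

    public void ReleaseLock(InstanceState state)
    {
        lock (state)
        {
            state.IsLoaded = false;
            // Re-check inside the SAME lock AddMessage takes: a message either
            // landed before this point (scheduled here) or lands afterwards,
            // when IsLoaded is already false (scheduled by AddMessage).
            if (state.Messages.Count > 0)
                readyToRun.Add(state);
        }
    }

    public int ScheduledCount => readyToRun.Count;
}

class Demo
{
    static void Main()
    {
        var store = new FixedStore();
        var state = new InstanceState { IsLoaded = true };

        // The worst-case interleaving from this report: the activity result
        // arrives while the instance is still marked as loaded.
        store.AddMessage(state, "TaskCompleted");  // IsLoaded → deferred to ReleaseLock
        store.ReleaseLock(state);                  // re-check schedules the instance

        Console.WriteLine($"Scheduled={store.ScheduledCount}");
        // Prints: Scheduled=1 — the completion message is no longer lost.
    }
}
```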

Workaround

Since this is a test-only issue, we've worked around it by polling the database directly instead of relying on orchestration completion events:

// Instead of: await WaitForFunctionAsync(orchestrator, condition)

// Use direct polling:
var timeout = TimeSpan.FromSeconds(60);
var stopwatch = System.Diagnostics.Stopwatch.StartNew();

while (stopwatch.Elapsed < timeout)
{
    using var context = CreateDbContext();
    var entity = await context.GetEntityAsync();
    
    if (entity.ProvisioningState == "Succeeded")
        return;
    
    await Task.Delay(TimeSpan.FromMilliseconds(250));
}

throw new TimeoutException("Did not complete");
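The polling loop above can be factored into a reusable helper. `PollingHelper.PollUntilAsync` below is our own utility, not part of any DurableTask or test-host API:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

static class PollingHelper
{
    // Polls `condition` until it returns true or `timeout` elapses.
    public static async Task PollUntilAsync(
        Func<Task<bool>> condition,
        TimeSpan timeout,
        TimeSpan? interval = null)
    {
        interval ??= TimeSpan.FromMilliseconds(250);
        var stopwatch = Stopwatch.StartNew();

        while (stopwatch.Elapsed < timeout)
        {
            if (await condition())
                return;
            await Task.Delay(interval.Value);
        }

        throw new TimeoutException($"Condition not met within {timeout}.");
    }
}
```

With this, the workaround reads as a one-liner against the same `CreateDbContext`/`GetEntityAsync` helpers used above: `await PollingHelper.PollUntilAsync(async () => { using var ctx = CreateDbContext(); return (await ctx.GetEntityAsync()).ProvisioningState == "Succeeded"; }, TimeSpan.FromSeconds(60));`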

Impact

  • Severity: High for tests, None for production (only affects InProcessTestHost)
  • Frequency: Intermittent - depends on timing (activity completing very quickly after ContinueAsNew)
  • Workaround: Available (direct database polling)

Additional Context

This issue only manifests when:

  1. Using ContinueAsNew in orchestrations
  2. Activity completes very quickly (within a few milliseconds)
  3. Running in the InProcessTestHost test environment
  4. Multiple orchestrations running concurrently

The issue does NOT affect:

  • Production deployments using Azure Storage backend
  • SQL Server backend
  • Real Azure Functions deployments

Test Case

We can provide a complete integration test that reproduces this issue consistently if needed.

Related Patterns

Our codebase has other tests with similar workarounds, suggesting this is a known issue:

// From FlexWorkloadProfileTests.cs line 168
while (!poolTask.IsCompleted)
{
    await rp.FunctionsBackgroundTasksClient.InvokeFunctionAsync("NodePoolUpdateTimer", CancellationToken.None);
    await Task.Delay(TimeSpan.FromMilliseconds(250));
}

Request

  1. Confirm if this is a known issue in v0.2.1-preview.1
  2. Provide ETA for a fix or stable version
  3. Advise on whether Option 1, 2, or 3 is the preferred fix approach
  4. Consider adding more diagnostic logging to help diagnose these race conditions
