Description
Screen.Recording.2026-03-05.at.1.11.36.AM.mov
Investigation details
Root cause
An orphaned AppHost process from a previous session was still running and holding onto the DCP proxy ports (7390/5557) and the dashboard (port 17092). When a new AppHost was started via `aspire start`, the new DCP could not bind to those proxy ports since the old DCP still owned them.
This caused the dashboard and CLI to disagree on state because:
- Dashboard: connected to the old AppHost's DCP instance (from 12:24AM), which showed the apiservice as `Finished` with `exitCode: 0` and a `start` command available.
- CLI (`aspire describe`): connected to the new AppHost (started at 1:07AM), which showed the apiservice as `Running / Unhealthy`.
Why Unhealthy?
The health check (`apiservice_https_/health_200_check`) was configured to hit `https://localhost:7390/health`, but port 7390 was owned by the old DCP proxy, which was no longer forwarding traffic. The health check timed out with a `TaskCanceledException`.
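The `TaskCanceledException` is the standard .NET symptom of an `HttpClient` timeout. The same failure mode can be sketched outside Aspire with any listener that accepts connections but never answers (a free port stands in for 7390; the timeout value and request are illustrative, not the real health-check configuration):

```python
import socket
import threading

# A listener that accepts connections but never answers: the behavior of a
# stale proxy port whose DCP is no longer forwarding traffic.
stale = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
stale.bind(("127.0.0.1", 0))   # any free port stands in for 7390
stale.listen(1)
port = stale.getsockname()[1]
threading.Thread(target=lambda: stale.accept(), daemon=True).start()

def health_check(port: int, timeout: float = 0.5) -> str:
    """Probe /health and classify the result, mirroring a timeout-based check."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout) as c:
            c.settimeout(timeout)
            c.sendall(b"GET /health HTTP/1.1\r\nHost: localhost\r\n\r\n")
            return "Healthy" if c.recv(1024) else "Unhealthy"
    except OSError:                # includes socket.timeout
        return "Unhealthy"         # the .NET analogue: TaskCanceledException

print(health_check(port))  # Unhealthy: connect succeeds, but no bytes ever arrive
```

Note that the TCP connect itself succeeds, so a connectivity check would pass; only the request-level timeout exposes the dead proxy.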
How it happened
- An AppHost was started initially (PID 27434, 12:24AM). It owned the DCP, the dashboard (port 17092), and the proxy ports (7390/5557).
- `aspire stop` was called, followed by `aspire start` multiple times during development.
- The old AppHost process (PID 27434) and its DCP child processes (PIDs 27515, 27477, 27527) were not fully terminated by `aspire stop`. They continued running and holding the proxy ports.
- The new AppHost started successfully, but its apiservice was assigned different internal ports (e.g., 53557/53558) while the proxy ports (7390/5557) remained bound to the old DCP.
- `aspire ps` only showed the new AppHost; the old one was invisible to the CLI but still alive.
Evidence
- `ps aux` showed two sets of DCP processes: old (12:24AM) and new (1:07AM)
- The old DCP run-controllers (PID 27515) and apiserver (PID 27477) were still running
- Port 7390 returned HTTP 404 (old code without the new endpoints) or hung entirely
- The direct internal port (53557) served requests correctly with the new code
- After manually terminating the orphaned processes and doing a clean restart, everything worked — dashboard and CLI agreed, proxy ports worked, health checks passed.
Why the old AppHost was invisible to aspire ps
AuxiliaryBackchannelMonitor discovers running AppHosts by scanning `aux*.sock.*` files in `~/.aspire/cli/backchannels/`. At the time of the issue, the directory contained:

```
auxi.sock.897ab4262e3c5cd8.4058              ← only auxi socket, PID 4058 (new AppHost)
cli.sock.08656a70215440ecbee54720aa637697    ← old format, from 00:50
cli.sock.2ae0eede5c2a465792345d11168effe9    ← old format, from 01:15
... (7 more cli.sock files)
```

There was no `auxi.sock` file for the old AppHost (PID 27434, from 12:24AM), so `aspire ps` couldn't see it.
At the very start of the session, aspire ps returned [] — the old AppHost was already invisible before any agent interaction.
Process tree analysis
The process chain is:
CLI → AppHost (CliOrphanDetector watches CLI PID) → DCP (--monitor watches AppHost PID)
If the AppHost dies, DCP should die. If DCP dies, ports should be released. So if two sets of DCP processes were running, two AppHost processes were likely alive.
Possible theory: race condition in stop-then-start
The session log shows many aspire stop && aspire start cycles. One possibility:
- `aspire stop` sends a stop signal to the AppHost via the backchannel
- `aspire stop` returns "AppHost stopped successfully"
- `aspire start` launches a new AppHost immediately
- But the old AppHost/DCP process tree hasn't fully exited yet; DCP may still be shutting down and holding ports
- The new DCP starts, can't bind the proxy ports (still held by old DCP mid-shutdown), and gets assigned different internal ports
- The old DCP eventually dies, but now the proxy port mapping is broken
This could be a race condition between stop completing and the process tree fully releasing resources. `aspire stop` may report success when the stop signal is acknowledged, but before the actual process teardown (AppHost → DCP → port release) finishes.
However, there could be other explanations — this needs further investigation to confirm.
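If the theory holds, the fix direction would be for stop to wait on actual process exit rather than on acknowledgement. A toy reproduction of the race with plain sockets (the 0.5s teardown delay is an assumption; this is not Aspire's real shutdown path):

```python
import socket
import threading
import time

def old_dcp(sock: socket.socket, teardown_delay: float) -> None:
    """A DCP that has acknowledged 'stop' but releases its port only later."""
    time.sleep(teardown_delay)
    sock.close()

def start_new_dcp(port: int) -> bool:
    """The new DCP trying to bind the same proxy port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", port))
        return True
    except OSError:
        return False                 # port still held by the old DCP
    finally:
        s.close()

# Old DCP owns a proxy port (a free port stands in for 7390).
old = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
old.bind(("127.0.0.1", 0))
old.listen(1)
port = old.getsockname()[1]

# 'stop' is acknowledged immediately; real teardown takes another 0.5s.
teardown = threading.Thread(target=old_dcp, args=(old, 0.5))
teardown.start()

racy = start_new_dcp(port)   # restart immediately: bind fails, port still held
teardown.join()              # wait for the process tree to actually exit
clean = start_new_dcp(port)  # now the bind succeeds

print(racy, clean)  # False True
```

The `teardown.join()` line is the missing step in the theorized sequence: waiting on real process exit rather than on the stop acknowledgement eliminates the window.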