core: cut codex-core compile time 57% by type-erasing ToolHandler by bolinfest · Pull Request #16627 · openai/codex

bolinfest · 2026-04-02T22:17:04Z

Why

codex-core compile time was dominated by repeated trait solving and monomorphization work from the old ToolHandler<Output = T> shape plus the blanket async adapter into AnyToolHandler.

That pattern instantiated one async wrapper per concrete tool handler, and the compiler then had to prove Send/outlives obligations across a large set of generated future types. The profile showed this was not an LLVM/codegen problem; it was a trait-solving + monomorphization problem.
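
For readers who have not seen the old shape, here is a minimal sketch of the pattern described above. It is an illustration under stated assumptions, not the actual codex-core code: the signatures are simplified, `EchoHandler`, `run_echo`, and `block_on` are hypothetical, and the boxed futures are written by hand to approximate what `#[async_trait]` generates.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Boxed `Send` future, roughly what `#[async_trait]` produces for an `async fn`.
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

// Hypothetical, simplified stand-in for the old generic trait.
trait ToolHandler: Send + Sync {
    type Output: ToString + Send + 'static;
    fn handle<'a>(&'a self, input: &'a str) -> BoxFuture<'a, Result<Self::Output, String>>;
}

// Object-safe trait that a registry can store as `Arc<dyn AnyToolHandler>`.
trait AnyToolHandler: Send + Sync {
    fn handle_any<'a>(&'a self, input: &'a str) -> BoxFuture<'a, Result<String, String>>;
}

// Blanket adapter: the async block below is a distinct future type for every
// concrete `T`, and each one needs its own `Send` and outlives proofs.
impl<T: ToolHandler> AnyToolHandler for T {
    fn handle_any<'a>(&'a self, input: &'a str) -> BoxFuture<'a, Result<String, String>> {
        Box::pin(async move {
            let result = self.handle(input).await;
            result.map(|output| output.to_string())
        })
    }
}

// Hypothetical concrete handler.
struct EchoHandler;

impl ToolHandler for EchoHandler {
    type Output = String;
    fn handle<'a>(&'a self, input: &'a str) -> BoxFuture<'a, Result<String, String>> {
        Box::pin(async move {
            let reply: Result<String, String> = Ok(format!("echo: {input}"));
            reply
        })
    }
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Minimal busy-polling executor; adequate here because the demo future is
// ready on its first poll.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

// Drives the handler through the type-erased adapter.
fn run_echo(input: &str) -> Result<String, String> {
    let handler: Arc<dyn AnyToolHandler> = Arc::new(EchoHandler);
    let result = block_on(handler.handle_any(input));
    result
}
```

With one handler the cost is invisible; the profile above suggests it adds up once dozens of handlers and tasks each instantiate their own adapter future.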

The win from this PR is large enough that I want the numbers in the PR body, because this is a broad API change and reviewers are right to ask whether the tradeoff is worth it.

Headline result

Cold codex-core lib compile time dropped from 187.15s to 79.90s on my machine after this refactor: -57.3% wall-clock in rustc's total pass.

The biggest hot passes dropped by roughly 70%:

Metric	Before	After	Delta
`total`	187.15s	79.90s	-57.3%
`generate_crate_metadata`	84.53s	25.88s	-69.4%
`MIR_borrow_checking`	84.13s	25.69s	-69.5%
`monomorphization_collector_graph_walk`	79.74s	23.54s	-70.5%
`evaluate_obligation` self-time	180.62s	43.29s	-76.0%

Important caveat: -Z time-passes timings are nested, so generate_crate_metadata and monomorphization_collector_graph_walk are mostly overlapping, not additive.
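
The Delta column is the relative change `(after - before) / before`, expressed in percent. A small sketch of that arithmetic, assuming one-decimal rounding; `pct_delta` is a hypothetical helper name, not something from the PR:

```rust
/// Relative change from `before` to `after`, in percent, rounded to one decimal.
fn pct_delta(before: f64, after: f64) -> f64 {
    let raw = (after - before) / before * 100.0;
    (raw * 10.0).round() / 10.0
}

/// Recomputes the Delta column of the headline table from its Before/After values.
fn headline_deltas() -> Vec<(&'static str, f64)> {
    vec![
        ("total", pct_delta(187.15, 79.90)),
        ("generate_crate_metadata", pct_delta(84.53, 25.88)),
        ("MIR_borrow_checking", pct_delta(84.13, 25.69)),
        ("monomorphization_collector_graph_walk", pct_delta(79.74, 23.54)),
        ("evaluate_obligation self-time", pct_delta(180.62, 43.29)),
    ]
}
```

Because the pass timings are nested, these per-pass deltas should be read individually and not summed.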

Profile data

Baseline before this change

cargo +nightly build -p codex-core --lib -Z unstable-options --timings=json after cargo clean -p codex-core:

Crate	`duration`	`rmeta_time`
`codex-core`	187.380776583s	174.474507208s
`starlark`	17.90s	n/a

So codex-core was far and away the dominant crate in the build.

cargo +nightly rustc -p codex-core --lib -- -Z time-passes -Z time-passes-format=json after cargo clean -p codex-core:

Pass	Time
`total`	187.150662083s
`generate_crate_metadata`	84.531864625s
`MIR_borrow_checking`	84.131389375s
`monomorphization_collector_graph_walk`	79.737515042s
`codegen_crate`	12.362532292s
`type_check_crate`	4.4765405s
`coherence_checking`	3.311121208s
process `real` / `user` / `sys`	187.70s / 201.87s / 4.99s

-Z self-profile + measureme summarize -p 0.5 before this change:

Query / phase	Self time	Total time	% total CPU	Item count	Cache hits
`evaluate_obligation`	180.62s	182.08s	70.821%	572,234	1,130,998
`mir_borrowck`	1.42s	93.77s	n/a	n/a	n/a
`typeck`	1.84s	2.38s	n/a	n/a	n/a
`LLVM_module_codegen_emit_obj`	n/a	17.01s	n/a	n/a	n/a
`LLVM_passes`	n/a	12.95s	n/a	n/a	n/a
`codegen_module`	n/a	12.22s	n/a	n/a	n/a
self-profile CPU total	255.042999997s	n/a	n/a	n/a	n/a
process `real` / `user` / `sys`	220.96s / 235.02s / 7.09s	n/a	n/a	n/a	n/a

Top normalized obligation buckets before this change:

Obligation bucket	Samples	Duration
`outlives:tasks::review::ReviewTask`	1,067	6.33s
`outlives:tools::handlers::unified_exec::UnifiedExecHandler`	896	5.63s
`trait:T as tools::registry::ToolHandler`	876	5.45s
`outlives:tools::handlers::shell::ShellHandler`	888	5.37s
`outlives:tools::handlers::shell::ShellCommandHandler`	870	5.29s
`outlives:tools::runtimes::shell::unix_escalation::CoreShellActionProvider`	637	3.73s
`outlives:tools::handlers::mcp::McpHandler`	695	3.61s
`outlives:tasks::regular::RegularTask`	726	3.57s

Top items_of_instance entries before this change were mostly concrete async handler/task impls:

Instance	Duration
`tasks::regular::{impl#2}::run`	3.79s
`tools::handlers::mcp::{impl#0}::handle`	3.27s
`tools::runtimes::shell::unix_escalation::{impl#2}::determine_action`	3.09s
`tools::handlers::agent_jobs::{impl#11}::handle`	3.07s
`tools::handlers::multi_agents::spawn::{impl#1}::handle`	2.84s
`tasks::review::{impl#4}::run`	2.82s
`tools::handlers::multi_agents_v2::spawn::{impl#2}::handle`	2.80s
`tools::handlers::multi_agents::resume_agent::{impl#1}::handle`	2.73s
`tools::handlers::unified_exec::{impl#2}::handle`	2.54s
`tasks::compact::{impl#4}::run`	2.45s

After this change

cargo +nightly rustc -p codex-core --lib -- -Z time-passes -Z time-passes-format=json after cargo clean -p codex-core:

Pass	Time
`total`	79.896377542s
`generate_crate_metadata`	25.882309084s
`MIR_borrow_checking`	25.694140042s
`monomorphization_collector_graph_walk`	23.542859708s
`codegen_crate`	12.729979125s
process `real` / `user` / `sys`	81.06s / 87.82s / 2.76s

-Z self-profile + measureme summarize -p 0.5 after this change:

Query / phase	Self time	Total time	% total CPU	Item count	Cache hits
`evaluate_obligation`	43.29s	44.61s	33.319%	366,783	1,056,994
`mir_borrowck`	1.42s	27.77s	n/a	n/a	n/a
`typeck`	1.83s	2.36s	n/a	n/a	n/a
`type_op_prove_predicate`	886.60ms	23.38s	n/a	n/a	n/a
`LLVM_module_codegen_emit_obj`	n/a	17.88s	n/a	n/a	n/a
`LLVM_passes`	n/a	17.33s	n/a	n/a	n/a
`codegen_module`	n/a	16.07s	n/a	n/a	n/a
self-profile CPU total	129.915927377s	n/a	n/a	n/a	n/a
process `real` / `user` / `sys`	90.10s / 101.70s / 2.68s	n/a	n/a	n/a	n/a

Artifact size deltas from the self-profile summaries:

Artifact	Before	After	Delta
`crate_metadata`	26,534,471	26,475,240	-59,231
`dep_graph`	253,181,425	236,801,785	-16,379,640
`linked_artifact`	565,366,624	557,429,424	-7,937,200
`object_file`	513,127,264	505,350,688	-7,776,576
`query_cache`	137,440,945	136,793,279	-647,666
`cgu_instructions`	3,586,307	3,565,670	-20,637
`codegen_unit_size_estimate`	2,084,846	2,073,774	-11,072
`work_product_index`	19,565	19,565	0

What changed

The main refactor is in codex-rs/core/src/tools/registry.rs:

ToolHandler is now object-safe and returns BoxFuture<'_, Result<AnyToolResult, FunctionCallError>> directly.
ToolRegistry stores Arc<dyn ToolHandler> directly.
AnyToolResult now wraps Box<dyn ToolOutput> plus the invocation metadata.
The blanket generic impl<T: ToolHandler> AnyToolHandler for T adapter is gone, which is the point: stop generating one generic async adapter per concrete handler type.
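
A minimal sketch of the type-erased shape described in the list above, using only the standard library. The type names come from this PR description, but the fields, signatures, and helpers (`call_id`, `EchoHandler`, `EchoOutput`, `dispatch`, `demo_registry`, `block_on`) are assumptions made for illustration and will differ from the real `registry.rs`.

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// std-only stand-in for a `BoxFuture` alias.
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

trait ToolOutput: Send {
    fn render(&self) -> String;
}

// Hypothetical fields: `call_id` stands in for the invocation metadata.
struct AnyToolResult {
    call_id: String,
    output: Box<dyn ToolOutput>,
}

// Hypothetical error shape for this sketch.
#[derive(Debug)]
struct FunctionCallError(String);

// Object-safe: no associated type and no generic methods.
trait ToolHandler: Send + Sync {
    fn handle(&self, call_id: String) -> BoxFuture<'_, Result<AnyToolResult, FunctionCallError>>;
}

struct ToolRegistry {
    handlers: HashMap<String, Arc<dyn ToolHandler>>,
}

struct EchoOutput(String);
impl ToolOutput for EchoOutput {
    fn render(&self) -> String {
        self.0.clone()
    }
}

struct EchoHandler;
impl ToolHandler for EchoHandler {
    fn handle(&self, call_id: String) -> BoxFuture<'_, Result<AnyToolResult, FunctionCallError>> {
        // The handler boxes both its future and its output itself.
        Box::pin(async move {
            let output: Box<dyn ToolOutput> = Box::new(EchoOutput(format!("echo {call_id}")));
            let result: Result<AnyToolResult, FunctionCallError> =
                Ok(AnyToolResult { call_id, output });
            result
        })
    }
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Minimal busy-polling executor; adequate because the demo future is ready immediately.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

fn demo_registry() -> ToolRegistry {
    let mut handlers: HashMap<String, Arc<dyn ToolHandler>> = HashMap::new();
    handlers.insert("echo".to_string(), Arc::new(EchoHandler));
    ToolRegistry { handlers }
}

// Looks up a handler by name and calls it through the trait object.
fn dispatch(registry: &ToolRegistry, tool: &str, call_id: &str) -> Result<String, FunctionCallError> {
    let handler = registry
        .handlers
        .get(tool)
        .ok_or_else(|| FunctionCallError(format!("unknown tool: {tool}")))?;
    let result = block_on(handler.handle(call_id.to_string()))?;
    Ok(format!("{}: {}", result.call_id, result.output.render()))
}
```

In this shape there is no generic adapter left to instantiate: every handler compiles against the same `dyn ToolHandler` signature.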

Representative handler updates:

Tradeoff

This intentionally moves a bit of work to runtime: each tool call now boxes the returned future and boxes the concrete ToolOutput behind AnyToolResult.

That is a reasonable tradeoff here because these handlers are typically dominated by process spawning, MCP I/O, agent orchestration, or filesystem work, while the compile-time savings are very large and directly address the crate's current scaling problem.

Validation

- cargo test -p codex-core --lib --no-run passed
- cargo test -p codex-core --lib ran 1,541 tests: 1,533 passed, 3 ignored, 5 failed
- The 5 failures were all in untouched config tests in codex-rs/core/src/config/config_tests.rs, so I do not believe they are caused by this refactor:
  - config::tests::approvals_reviewer_defaults_to_manual_only_without_guardian_feature
  - config::tests::approvals_reviewer_stays_manual_only_when_guardian_feature_is_enabled
  - config::tests::approvals_reviewer_can_be_set_in_config_without_guardian_approval
  - config::tests::smart_approvals_alias_is_ignored
  - config::tests::smart_approvals_alias_is_ignored_in_profiles
- just fix -p codex-core
- just fmt

I intentionally did not rerun tests after just fix / just fmt, per the repo instructions in this checkout.

…16630)

## Why

`ToolHandler` was still paying a large compile-time tax from `#[async_trait]` on every concrete handler impl, even though the only object-safe boundary the registry actually stores is the internal `AnyToolHandler` adapter.

This PR removes that macro-generated async wrapper layer from concrete `ToolHandler` impls while keeping the existing object-safe shim in `AnyToolHandler`. In practice, that gets essentially the same compile-time win as the larger type-erasure refactor in #16627, but with a much smaller diff and without changing the public shape of `ToolHandler<Output = T>`.

That tradeoff matters here because this is a broad `codex-core` hotspot and reviewers should be able to judge the compile-time impact from hard numbers, not vibes.

## Headline result

On a clean `codex-core` package rebuild (`cargo clean -p codex-core` before each command), rustc `total` dropped from **187.15s to 68.98s** versus the shared `0bd31dc382bd` baseline: **-63.1%**.

The biggest hot passes dropped by roughly **71-72%**:

| Metric | Baseline `0bd31dc382bd` | This PR `41f7ac0adeac` | Delta |
|---|---:|---:|---:|
| `total` | 187.15s | 68.98s | **-63.1%** |
| `generate_crate_metadata` | 84.53s | 24.49s | **-71.0%** |
| `MIR_borrow_checking` | 84.13s | 24.58s | **-70.8%** |
| `monomorphization_collector_graph_walk` | 79.74s | 22.19s | **-72.2%** |
| `evaluate_obligation` self-time | 180.62s | 46.91s | **-74.0%** |

Important caveat: `-Z time-passes` timings are nested, so `generate_crate_metadata` and `monomorphization_collector_graph_walk` are mostly overlapping, not additive.

## Why this PR over #16627

#16627 already proved that the `ToolHandler` stack was the right hotspot, but it got there by making `ToolHandler` object-safe and changing every handler to return `BoxFuture<Result<AnyToolResult, _>>` directly.

This PR keeps the lower-churn shape:

- `ToolHandler` remains generic over `type Output`.
- Concrete handlers use native RPITIT futures with explicit `Send` bounds.
- `AnyToolHandler` remains the only object-safe adapter and still does the boxing at the registry boundary, as before.
- The implementation diff is only **33 files, +28/-77**.

The measurements are at least comparable, and in this run this PR is slightly faster than #16627 on the pass-level total:

| Metric | #16627 | This PR | Delta |
|---|---:|---:|---:|
| `total` | 79.90s | 68.98s | **-13.7%** |
| `generate_crate_metadata` | 25.88s | 24.49s | **-5.4%** |
| `monomorphization_collector_graph_walk` | 23.54s | 22.19s | **-5.7%** |
| `evaluate_obligation` self-time | 43.29s | 46.91s | +8.4% |

## Profile data

### Crate-level timings

`cargo +nightly build -p codex-core --lib -Z unstable-options --timings=json` after `cargo clean -p codex-core`.

Baseline data below is reused from the shared parent `0bd31dc382bd` profile because this PR and #16627 are both one commit on top of that same parent.

| Crate | Baseline `duration` | This PR `duration` | Delta | Baseline `rmeta_time` | This PR `rmeta_time` | Delta |
|---|---:|---:|---:|---:|---:|---:|
| `codex_core` | 187.380776583s | 69.171113833s | **-63.1%** | 174.474507208s | 55.873015583s | **-68.0%** |
| `starlark` | 17.90s | 16.773824125s | -6.3% | n/a | 8.8999965s | n/a |

### Pass-level timings

`cargo +nightly rustc -p codex-core --lib -- -Z time-passes -Z time-passes-format=json` after `cargo clean -p codex-core`.

| Pass | Baseline | This PR | Delta |
|---|---:|---:|---:|
| `total` | 187.150662083s | 68.978770375s | **-63.1%** |
| `generate_crate_metadata` | 84.531864625s | 24.487462958s | **-71.0%** |
| `MIR_borrow_checking` | 84.131389375s | 24.575553875s | **-70.8%** |
| `monomorphization_collector_graph_walk` | 79.737515042s | 22.190207417s | **-72.2%** |
| `codegen_crate` | 12.362532292s | 12.695237625s | +2.7% |
| `type_check_crate` | 4.4765405s | 5.442019542s | +21.6% |
| `coherence_checking` | 3.311121208s | 4.239935292s | +28.0% |
| process `real` / `user` / `sys` | 187.70s / 201.87s / 4.99s | 69.52s / 85.90s / 2.92s | n/a |

### Self-profile query summary

`cargo +nightly rustc -p codex-core --lib -- -Z self-profile=... -Z self-profile-events=default,query-keys,args,llvm,artifact-sizes` after `cargo clean -p codex-core`, summarized with `measureme summarize -p 0.5`.

| Query / phase | Baseline self time | This PR self time | Delta | Baseline total time | This PR total time | Baseline item count | This PR item count | Baseline cache hits | This PR cache hits |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| `evaluate_obligation` | 180.62s | 46.91s | **-74.0%** | 182.08s | 48.37s | 572,234 | 388,659 | 1,130,998 | 1,058,553 |
| `mir_borrowck` | 1.42s | 1.49s | +4.9% | 93.77s | 29.59s | n/a | 6,184 | n/a | 15,298 |
| `typeck` | 1.84s | 1.87s | +1.6% | 2.38s | 2.44s | n/a | 9,367 | n/a | 79,247 |
| `LLVM_module_codegen_emit_obj` | n/a | 17.12s | n/a | 17.01s | 17.12s | n/a | 256 | n/a | 0 |
| `LLVM_passes` | n/a | 13.07s | n/a | 12.95s | 13.07s | n/a | 1 | n/a | 0 |
| `codegen_module` | n/a | 12.33s | n/a | 12.22s | 13.64s | n/a | 256 | n/a | 0 |
| `items_of_instance` | n/a | 676.00ms | n/a | n/a | 24.96s | n/a | 99,990 | n/a | 0 |
| `type_op_prove_predicate` | n/a | 660.79ms | n/a | n/a | 24.78s | n/a | 78,762 | n/a | 235,877 |

| Summary | Baseline | This PR |
|---|---:|---:|
| `evaluate_obligation` % of total CPU | 70.821% | 38.880% |
| self-profile total CPU time | 255.042999997s | 120.661175956s |
| process `real` / `user` / `sys` | 220.96s / 235.02s / 7.09s | 86.35s / 103.66s / 3.54s |

### Artifact sizes

From the same `measureme summarize` output:

| Artifact | Baseline | This PR | Delta |
|---|---:|---:|---:|
| `crate_metadata` | 26,534,471 bytes | 26,545,248 bytes | +10,777 |
| `dep_graph` | 253,181,425 bytes | 239,240,806 bytes | -13,940,619 |
| `linked_artifact` | 565,366,624 bytes | 562,673,176 bytes | -2,693,448 |
| `object_file` | 513,127,264 bytes | 510,464,096 bytes | -2,663,168 |
| `query_cache` | 137,440,945 bytes | 136,982,566 bytes | -458,379 |
| `cgu_instructions` | 3,586,307 bytes | 3,575,121 bytes | -11,186 |
| `codegen_unit_size_estimate` | 2,084,846 bytes | 2,078,773 bytes | -6,073 |
| `work_product_index` | 19,565 bytes | 19,565 bytes | 0 |

### Baseline hotspots before this change

These are the top normalized obligation buckets from the shared baseline profile:

| Obligation bucket | Samples | Duration |
|---|---:|---:|
| `outlives:tasks::review::ReviewTask` | 1,067 | 6.33s |
| `outlives:tools::handlers::unified_exec::UnifiedExecHandler` | 896 | 5.63s |
| `trait:T as tools::registry::ToolHandler` | 876 | 5.45s |
| `outlives:tools::handlers::shell::ShellHandler` | 888 | 5.37s |
| `outlives:tools::handlers::shell::ShellCommandHandler` | 870 | 5.29s |
| `outlives:tools::runtimes::shell::unix_escalation::CoreShellActionProvider` | 637 | 3.73s |
| `outlives:tools::handlers::mcp::McpHandler` | 695 | 3.61s |
| `outlives:tasks::regular::RegularTask` | 726 | 3.57s |

Top `items_of_instance` entries before this change were mostly concrete async handler/task impls:

| Instance | Duration |
|---|---:|
| `tasks::regular::{impl#2}::run` | 3.79s |
| `tools::handlers::mcp::{impl#0}::handle` | 3.27s |
| `tools::runtimes::shell::unix_escalation::{impl#2}::determine_action` | 3.09s |
| `tools::handlers::agent_jobs::{impl#11}::handle` | 3.07s |
| `tools::handlers::multi_agents::spawn::{impl#1}::handle` | 2.84s |
| `tasks::review::{impl#4}::run` | 2.82s |
| `tools::handlers::multi_agents_v2::spawn::{impl#2}::handle` | 2.80s |
| `tools::handlers::multi_agents::resume_agent::{impl#1}::handle` | 2.73s |
| `tools::handlers::unified_exec::{impl#2}::handle` | 2.54s |
| `tasks::compact::{impl#4}::run` | 2.45s |

## What changed

Relevant pre-change registry shape: [`codex-rs/core/src/tools/registry.rs`](https://github.com/openai/codex/blob/0bd31dc382bd1c33dc2bb6b97069c76aa10ba14b/codex-rs/core/src/tools/registry.rs#L38-L219)

Current registry shape in this PR: [`codex-rs/core/src/tools/registry.rs`](https://github.com/openai/codex/blob/41f7ac0adeac81d667541853d6546267d6083613/codex-rs/core/src/tools/registry.rs#L38-L203)

- `ToolHandler::{is_mutating, handle}` now return native `impl Future + Send` futures instead of using `#[async_trait]`.
- `AnyToolHandler` remains the object-safe adapter and boxes those futures at the registry boundary with explicit lifetimes.
- Concrete handlers and the registry test handler drop `#[async_trait]` but otherwise keep their async method bodies intact.
- Representative examples: [`codex-rs/core/src/tools/handlers/shell.rs`](https://github.com/openai/codex/blob/41f7ac0adeac81d667541853d6546267d6083613/codex-rs/core/src/tools/handlers/shell.rs#L223-L379), [`codex-rs/core/src/tools/handlers/unified_exec.rs`](https://github.com/openai/codex/blob/41f7ac0adeac81d667541853d6546267d6083613/codex-rs/core/src/tools/handlers/unified_exec.rs), [`codex-rs/core/src/tools/registry_tests.rs`](https://github.com/openai/codex/blob/41f7ac0adeac81d667541853d6546267d6083613/codex-rs/core/src/tools/registry_tests.rs)

## Tradeoff

This is intentionally less invasive than #16627: it does **not** move result boxing into every concrete handler and does **not** change `ToolHandler` into an object-safe trait.

Instead, it keeps the existing registry-level type-erasure boundary and only removes the macro-generated async wrapper layer from concrete impls. So the runtime boxing story stays basically the same as before, while the compile-time savings are still large.

## Verification

Existing verification for this branch still applies:

- Ran `cargo test -p codex-core`; this change compiled and the suite reached the known unrelated `config::tests::*guardian*` failures, with no local diff under `codex-rs/core/src/config/`.

Profiling commands used for the tables above:

- `cargo clean -p codex-core`
- `cargo +nightly build -p codex-core --lib -Z unstable-options --timings=json`
- `cargo +nightly rustc -p codex-core --lib -- -Z time-passes -Z time-passes-format=json`
- `cargo +nightly rustc -p codex-core --lib -- -Z self-profile=... -Z self-profile-events=default,query-keys,args,llvm,artifact-sizes`
- `measureme summarize -p 0.5`
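
The shape this commit message describes (generic `ToolHandler` with native `impl Future + Send` return types, boxed only in the `AnyToolHandler` adapter) can be sketched as follows. This is a std-only illustration under assumptions: the method signatures are simplified, and `Upper`, `Text`, `run_erased`, and `block_on` are hypothetical names that do not appear in the PR.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

trait ToolOutput: Send {
    fn render(&self) -> String;
}

// Still generic over `type Output`; the future is a return-position
// `impl Trait` in the trait (RPITIT) with an explicit `Send` bound, no macro.
trait ToolHandler: Send + Sync {
    type Output: ToolOutput + 'static;
    fn handle(&self, input: String) -> impl Future<Output = Result<Self::Output, String>> + Send;
}

// The only object-safe layer; this is what a registry would store.
trait AnyToolHandler: Send + Sync {
    fn handle_any<'a>(&'a self, input: String) -> BoxFuture<'a, Result<Box<dyn ToolOutput>, String>>;
}

// Boxing happens once, here, at the type-erasure boundary.
impl<T: ToolHandler> AnyToolHandler for T {
    fn handle_any<'a>(&'a self, input: String) -> BoxFuture<'a, Result<Box<dyn ToolOutput>, String>> {
        Box::pin(async move {
            let result = self.handle(input).await;
            result.map(|output| Box::new(output) as Box<dyn ToolOutput>)
        })
    }
}

// Hypothetical output and handler types.
struct Text(String);
impl ToolOutput for Text {
    fn render(&self) -> String {
        self.0.clone()
    }
}

struct Upper;
impl ToolHandler for Upper {
    type Output = Text;
    // A plain `async fn` satisfies the `impl Future + Send` signature.
    async fn handle(&self, input: String) -> Result<Text, String> {
        Ok(Text(input.to_uppercase()))
    }
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Minimal busy-polling executor; adequate because the demo future is ready immediately.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

// Calls the handler through the erased adapter and renders its output.
fn run_erased(input: &str) -> Result<String, String> {
    let handler: Arc<dyn AnyToolHandler> = Arc::new(Upper);
    let result = block_on(handler.handle_any(input.to_string()));
    result.map(|output| output.render())
}
```

Concrete handlers keep ordinary `async fn` bodies in this sketch, which matches the commit message's claim that the diff to each handler stays small.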

bolinfest · 2026-04-02T23:06:23Z

Closing in favor of #16630.


…penai#16630) `ToolHandler` was still paying a large compile-time tax from `#[async_trait]` on every concrete handler impl, even though the only object-safe boundary the registry actually stores is the internal `AnyToolHandler` adapter. This PR removes that macro-generated async wrapper layer from concrete `ToolHandler` impls while keeping the existing object-safe shim in `AnyToolHandler`. In practice, that gets essentially the same compile-time win as the larger type-erasure refactor in openai#16627, but with a much smaller diff and without changing the public shape of `ToolHandler<Output = T>`. That tradeoff matters here because this is a broad `codex-core` hotspot and reviewers should be able to judge the compile-time impact from hard numbers, not vibes. On a clean `codex-core` package rebuild (`cargo clean -p codex-core` before each command), rustc `total` dropped from **187.15s to 68.98s** versus the shared `0bd31dc382bd` baseline: **-63.1%**. The biggest hot passes dropped by roughly **71-72%**: | Metric | Baseline `0bd31dc382bd` | This PR `41f7ac0adeac` | Delta | |---|---:|---:|---:| | `total` | 187.15s | 68.98s | **-63.1%** | | `generate_crate_metadata` | 84.53s | 24.49s | **-71.0%** | | `MIR_borrow_checking` | 84.13s | 24.58s | **-70.8%** | | `monomorphization_collector_graph_walk` | 79.74s | 22.19s | **-72.2%** | | `evaluate_obligation` self-time | 180.62s | 46.91s | **-74.0%** | Important caveat: `-Z time-passes` timings are nested, so `generate_crate_metadata` and `monomorphization_collector_graph_walk` are mostly overlapping, not additive. hotspot, but it got there by making `ToolHandler` object-safe and changing every handler to return `BoxFuture<Result<AnyToolResult, _>>` directly. This PR keeps the lower-churn shape: - `ToolHandler` remains generic over `type Output`. - Concrete handlers use native RPITIT futures with explicit `Send` bounds. 
- `AnyToolHandler` remains the only object-safe adapter and still does the boxing at the registry boundary, as before. - The implementation diff is only **33 files, +28/-77**. The measurements are at least comparable, and in this run this PR is slightly faster than openai#16627 on the pass-level total: | Metric | openai#16627 | This PR | Delta | |---|---:|---:|---:| | `total` | 79.90s | 68.98s | **-13.7%** | | `generate_crate_metadata` | 25.88s | 24.49s | **-5.4%** | | `monomorphization_collector_graph_walk` | 23.54s | 22.19s | **-5.7%** | | `evaluate_obligation` self-time | 43.29s | 46.91s | +8.4% | `cargo +nightly build -p codex-core --lib -Z unstable-options --timings=json` after `cargo clean -p codex-core`. Baseline data below is reused from the shared parent `0bd31dc382bd` profile because this PR and openai#16627 are both one commit on top of that same parent. | Crate | Baseline `duration` | This PR `duration` | Delta | Baseline `rmeta_time` | This PR `rmeta_time` | Delta | |---|---:|---:|---:|---:|---:|---:| | `codex_core` | 187.380776583s | 69.171113833s | **-63.1%** | 174.474507208s | 55.873015583s | **-68.0%** | | `starlark` | 17.90s | 16.773824125s | -6.3% | n/a | 8.8999965s | n/a | `cargo +nightly rustc -p codex-core --lib -- -Z time-passes -Z time-passes-format=json` after `cargo clean -p codex-core`. 
| Pass | Baseline | This PR | Delta |
|---|---:|---:|---:|
| `total` | 187.150662083s | 68.978770375s | **-63.1%** |
| `generate_crate_metadata` | 84.531864625s | 24.487462958s | **-71.0%** |
| `MIR_borrow_checking` | 84.131389375s | 24.575553875s | **-70.8%** |
| `monomorphization_collector_graph_walk` | 79.737515042s | 22.190207417s | **-72.2%** |
| `codegen_crate` | 12.362532292s | 12.695237625s | +2.7% |
| `type_check_crate` | 4.4765405s | 5.442019542s | +21.6% |
| `coherence_checking` | 3.311121208s | 4.239935292s | +28.0% |
| process `real` / `user` / `sys` | 187.70s / 201.87s / 4.99s | 69.52s / 85.90s / 2.92s | n/a |

`cargo +nightly rustc -p codex-core --lib -- -Z self-profile=... -Z self-profile-events=default,query-keys,args,llvm,artifact-sizes` after `cargo clean -p codex-core`, summarized with `measureme summarize -p 0.5`.

| Query / phase | Baseline self time | This PR self time | Delta | Baseline total time | This PR total time | Baseline item count | This PR item count | Baseline cache hits | This PR cache hits |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| `evaluate_obligation` | 180.62s | 46.91s | **-74.0%** | 182.08s | 48.37s | 572,234 | 388,659 | 1,130,998 | 1,058,553 |
| `mir_borrowck` | 1.42s | 1.49s | +4.9% | 93.77s | 29.59s | n/a | 6,184 | n/a | 15,298 |
| `typeck` | 1.84s | 1.87s | +1.6% | 2.38s | 2.44s | n/a | 9,367 | n/a | 79,247 |
| `LLVM_module_codegen_emit_obj` | n/a | 17.12s | n/a | 17.01s | 17.12s | n/a | 256 | n/a | 0 |
| `LLVM_passes` | n/a | 13.07s | n/a | 12.95s | 13.07s | n/a | 1 | n/a | 0 |
| `codegen_module` | n/a | 12.33s | n/a | 12.22s | 13.64s | n/a | 256 | n/a | 0 |
| `items_of_instance` | n/a | 676.00ms | n/a | n/a | 24.96s | n/a | 99,990 | n/a | 0 |
| `type_op_prove_predicate` | n/a | 660.79ms | n/a | n/a | 24.78s | n/a | 78,762 | n/a | 235,877 |

| Summary | Baseline | This PR |
|---|---:|---:|
| `evaluate_obligation` % of total CPU | 70.821% | 38.880% |
| self-profile total CPU time | 255.042999997s | 120.661175956s |
| process `real` / `user` / `sys` | 220.96s / 235.02s / 7.09s | 86.35s / 103.66s / 3.54s |

From the same `measureme summarize` output:

| Artifact | Baseline | This PR | Delta |
|---|---:|---:|---:|
| `crate_metadata` | 26,534,471 bytes | 26,545,248 bytes | +10,777 |
| `dep_graph` | 253,181,425 bytes | 239,240,806 bytes | -13,940,619 |
| `linked_artifact` | 565,366,624 bytes | 562,673,176 bytes | -2,693,448 |
| `object_file` | 513,127,264 bytes | 510,464,096 bytes | -2,663,168 |
| `query_cache` | 137,440,945 bytes | 136,982,566 bytes | -458,379 |
| `cgu_instructions` | 3,586,307 bytes | 3,575,121 bytes | -11,186 |
| `codegen_unit_size_estimate` | 2,084,846 bytes | 2,078,773 bytes | -6,073 |
| `work_product_index` | 19,565 bytes | 19,565 bytes | 0 |

These are the top normalized obligation buckets from the shared baseline profile:

| Obligation bucket | Samples | Duration |
|---|---:|---:|
| `outlives:tasks::review::ReviewTask` | 1,067 | 6.33s |
| `outlives:tools::handlers::unified_exec::UnifiedExecHandler` | 896 | 5.63s |
| `trait:T as tools::registry::ToolHandler` | 876 | 5.45s |
| `outlives:tools::handlers::shell::ShellHandler` | 888 | 5.37s |
| `outlives:tools::handlers::shell::ShellCommandHandler` | 870 | 5.29s |
| `outlives:tools::runtimes::shell::unix_escalation::CoreShellActionProvider` | 637 | 3.73s |
| `outlives:tools::handlers::mcp::McpHandler` | 695 | 3.61s |
| `outlives:tasks::regular::RegularTask` | 726 | 3.57s |

Top `items_of_instance` entries before this change were mostly concrete async handler/task impls:

| Instance | Duration |
|---|---:|
| `tasks::regular::{impl#2}::run` | 3.79s |
| `tools::handlers::mcp::{impl#0}::handle` | 3.27s |
| `tools::runtimes::shell::unix_escalation::{impl#2}::determine_action` | 3.09s |
| `tools::handlers::agent_jobs::{impl#11}::handle` | 3.07s |
| `tools::handlers::multi_agents::spawn::{impl#1}::handle` | 2.84s |
| `tasks::review::{impl#4}::run` | 2.82s |
| `tools::handlers::multi_agents_v2::spawn::{impl#2}::handle` | 2.80s |
| `tools::handlers::multi_agents::resume_agent::{impl#1}::handle` | 2.73s |
| `tools::handlers::unified_exec::{impl#2}::handle` | 2.54s |
| `tasks::compact::{impl#4}::run` | 2.45s |

Relevant pre-change registry shape: [`codex-rs/core/src/tools/registry.rs`](https://github.com/openai/codex/blob/0bd31dc382bd1c33dc2bb6b97069c76aa10ba14b/codex-rs/core/src/tools/registry.rs#L38-L219)

Current registry shape in this PR: [`codex-rs/core/src/tools/registry.rs`](https://github.com/openai/codex/blob/41f7ac0adeac81d667541853d6546267d6083613/codex-rs/core/src/tools/registry.rs#L38-L203)

- `ToolHandler::{is_mutating, handle}` now return native `impl Future + Send` futures instead of using `#[async_trait]`.
- `AnyToolHandler` remains the object-safe adapter and boxes those futures at the registry boundary with explicit lifetimes.
- Concrete handlers and the registry test handler drop `#[async_trait]` but otherwise keep their async method bodies intact.
- Representative examples: [`codex-rs/core/src/tools/handlers/shell.rs`](https://github.com/openai/codex/blob/41f7ac0adeac81d667541853d6546267d6083613/codex-rs/core/src/tools/handlers/shell.rs#L223-L379), [`codex-rs/core/src/tools/handlers/unified_exec.rs`](https://github.com/openai/codex/blob/41f7ac0adeac81d667541853d6546267d6083613/codex-rs/core/src/tools/handlers/unified_exec.rs), [`codex-rs/core/src/tools/registry_tests.rs`](https://github.com/openai/codex/blob/41f7ac0adeac81d667541853d6546267d6083613/codex-rs/core/src/tools/registry_tests.rs)

This is intentionally less invasive than openai#16627: it does **not** move result boxing into every concrete handler and does **not** change `ToolHandler` into an object-safe trait. Instead, it keeps the existing registry-level type-erasure boundary and only removes the macro-generated async wrapper layer from concrete impls.

So the runtime boxing story stays basically the same as before, while the compile-time savings are still large.

Existing verification for this branch still applies:

- Ran `cargo test -p codex-core`; this change compiled and the suite reached the known unrelated `config::tests::*guardian*` failures, with no local diff under `codex-rs/core/src/config/`.

Profiling commands used for the tables above:

- `cargo clean -p codex-core`
- `cargo +nightly build -p codex-core --lib -Z unstable-options --timings=json`
- `cargo +nightly rustc -p codex-core --lib -- -Z time-passes -Z time-passes-format=json`
- `cargo +nightly rustc -p codex-core --lib -- -Z self-profile=... -Z self-profile-events=default,query-keys,args,llvm,artifact-sizes`
- `measureme summarize -p 0.5`
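For readers who want a concrete picture of the registry shape described above (native `impl Future + Send` methods on `ToolHandler`, with boxing confined to `AnyToolHandler`), here is a minimal sketch. It is not the code from either PR. `ToolInvocation`, `ToolOutput`, `EchoHandler`, the `*_any` method names, the blanket impl, and the toy `block_on` executor are all hypothetical stand-ins chosen for illustration; only the trait names and the two method names `is_mutating` and `handle` come from the text above.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical, simplified stand-ins for the real codex-core types.
pub struct ToolInvocation {
    pub arguments: String,
}
pub struct ToolOutput(pub String);

// Statically dispatched trait: methods return native `impl Future + Send`
// (return-position `impl Trait` in traits), with no `#[async_trait]`.
pub trait ToolHandler: Send + Sync {
    fn is_mutating(&self, invocation: &ToolInvocation) -> impl Future<Output = bool> + Send;
    fn handle(&self, invocation: ToolInvocation) -> impl Future<Output = ToolOutput> + Send;
}

type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

// Object-safe adapter: the only place where futures are boxed.
// Method names here are assumptions, not the real API.
pub trait AnyToolHandler: Send + Sync {
    fn is_mutating_any<'a>(&'a self, invocation: &'a ToolInvocation) -> BoxFuture<'a, bool>;
    fn handle_any<'a>(&'a self, invocation: ToolInvocation) -> BoxFuture<'a, ToolOutput>;
}

// Assumed blanket adapter: box each future with an explicit lifetime.
impl<T: ToolHandler> AnyToolHandler for T {
    fn is_mutating_any<'a>(&'a self, invocation: &'a ToolInvocation) -> BoxFuture<'a, bool> {
        Box::pin(self.is_mutating(invocation))
    }
    fn handle_any<'a>(&'a self, invocation: ToolInvocation) -> BoxFuture<'a, ToolOutput> {
        Box::pin(self.handle(invocation))
    }
}

// A concrete handler keeps ordinary `async fn` bodies.
pub struct EchoHandler;
impl ToolHandler for EchoHandler {
    async fn is_mutating(&self, _invocation: &ToolInvocation) -> bool {
        false
    }
    async fn handle(&self, invocation: ToolInvocation) -> ToolOutput {
        ToolOutput(format!("echo: {}", invocation.arguments))
    }
}

// Toy busy-polling executor so the sketch runs with std only.
// Suitable only for futures that complete without real I/O.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = std::pin::pin!(fut);
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

fn main() {
    // The registry stores type-erased handlers behind the adapter trait.
    let handler: Arc<dyn AnyToolHandler> = Arc::new(EchoHandler);
    let registry = vec![handler];
    let invocation = ToolInvocation { arguments: "ls".to_string() };
    let mutating = block_on(registry[0].is_mutating_any(&invocation));
    let output = block_on(registry[0].handle_any(invocation));
    println!("mutating={mutating} output={}", output.0); // prints "mutating=false output=echo: ls"
}
```

In this sketch the concrete handler never names a boxed future type; the single `Box::pin` per call lives in the adapter, which matches the description of boxing staying at the registry boundary. Whether the real adapter is a blanket impl or a wrapper type is not stated above, so treat that detail as an assumption.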

core: type-erase ToolHandler outputs

2810b26

bolinfest changed the title ~~core: type-erase ToolHandler outputs~~ core: cut codex-core compile time 57% by type-erasing ToolHandler Apr 2, 2026

bolinfest mentioned this pull request Apr 2, 2026

core: cut codex-core compile time 63% with native async ToolHandler #16630

Merged

bolinfest closed this Apr 2, 2026

github-actions bot mentioned this pull request Apr 3, 2026

📊 AI CLI Tools Community Activity Daily Report 2026-04-03 gsscsd/big_model_radar#125

Open

bolinfest wants to merge 1 commit into main from pr16627
