We optimize the context itself, not the model weights.
Powered by Tooliense Crux Agent Architecture
IC‑RL is a meta‑learning framework that treats the prompt (context) as a trainable policy and leverages natural‑language feedback (NL reward) as the supervisory signal, while the underlying large language model (agent) remains frozen. By re‑interpreting a multi‑turn dialogue as an RL loop, the method extracts solutions already latent in the LLM through pure prompt search.
This implementation is based on the Crux agent system developed by Tooliense, featuring an enhanced architecture that overcomes the limitations of traditional Self-Evolve mechanisms through hierarchical agent orchestration.
| 🔑 Principle | 📝 Description |
|---|---|
| G1 Expressive Prompt Space | A sufficiently large LLM will output the desired answer given some prompt θ* ∈ 𝒫, so we train prompts—not weights. |
| G2 Self‑Diagnostics | Even if one‑shot answers are imperfect, LLMs can accurately articulate their own errors in natural language. |
| G3 Rich NL Reward | Natural‑language feedback carries orders of magnitude more information than a scalar reward—crucial for hard reasoning tasks. |
| G4 Context‑Search Exploration | The agent need not become smarter than the base model; it only needs an exploration policy to discover θ*. |
| G5 Hierarchical Independence | Independent agents in a hierarchical structure can explore state spaces more efficiently than single‑layer mechanisms. |
Where:
- Evaluator generates natural‑language feedback φ
- Refiner (π_φ) converts φ into a prompt update Δθ
- r(·) is simply the raw natural‑language feedback φ (no scalar projection by default)
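In symbols, one iteration of the loop these components define is:

```math
y_t = f(x;\,\theta_t), \qquad
\varphi_t = \mathrm{Evaluator}(x,\, y_t), \qquad
\theta_{t+1} = \theta_t + \pi_\varphi(\varphi_t), \qquad
r_t = \varphi_t
```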
Building on IC-RL, we developed the Self-Evolve mechanism, a workflow that implements the core IC-RL loop:
1. Generate a response using the current prompt θₜ
2. Evaluate response quality against prepared QA sets
3. Feed back the natural-language evaluation signal
4. Refine the initial prompt using the feedback
5. Repeat the loop for continuous improvement

The Self-Evolve mechanism works effectively when the three components (Generator, Evaluator, Refiner) operate independently (based on Idea 3). This independence allows for:
- Unbiased evaluation of generated responses
- Objective refinement based purely on feedback signals
- Continuous evolution through iterative improvement cycles
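The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the repo's API: `llm(system, user)` is a hypothetical completion helper you would back with any provider client.

```python
def self_evolve(task, llm, theta0, qa_set, max_iters=4):
    """One Self-Evolve run: the prompt theta is the trainable policy;
    the underlying LLM stays frozen.

    `llm(system, user)` is a hypothetical completion function.
    `qa_set` is the prepared QA set the Evaluator checks against.
    """
    theta = theta0
    history = []
    for t in range(max_iters):
        # 1. Generate a response with the current prompt theta_t
        response = llm(theta, task)
        # 2-3. Evaluate against the QA set; feedback is natural language, not a scalar
        feedback = llm(
            "You are an independent Evaluator. Critique the answer in plain "
            f"language against these references: {qa_set}",
            f"Task: {task}\nAnswer: {response}",
        )
        history.append((theta, response, feedback))
        # 4. Refine: the Refiner pi_phi maps feedback phi_t to a prompt update
        theta = llm(
            "You are an independent Refiner. Rewrite the system prompt so the "
            "next attempt fixes the critique. Output only the new prompt.",
            f"Current prompt: {theta}\nFeedback: {feedback}",
        )
    # 5. Final roll-out under the evolved prompt
    return llm(theta, task), history
```

Keeping the three system prompts separate is what makes the Generator, Evaluator, and Refiner independent in the sense of the bullets above.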
While the Self-Evolve mechanism showed promising results, we discovered it had limitations when facing harder, more complex tasks. The single-layer approach would get stuck and plateau, unable to break through challenging problem domains.
To overcome these limitations, we developed an enhanced architecture inspired by graduate school research structures:
🎓 Professor Agent
│
┌─────────────┼─────────────┐
│ │ │
🔬 Specialist 🔬 Specialist 🔬 Specialist
Agent Agent Agent
│ │ │
[Function Call] [Function Call] [Function Call]
- Professor Agent: Acts as the research leader, orchestrating and controlling specialist agents through function calling
- Specialist Agents: Independent experts that can also act as sub-professors, managing their own team of specialists
- Recursive Hierarchy: Each specialist can recursively become a professor for lower-level specialists, creating deep hierarchical structures
- Function Calling Interface: Enables any agent at any level to utilize sub-agents as sophisticated tools
This mirrors how a graduate school professor leads research by directing specialists in their respective fields, and those specialists may lead their own research groups with sub-specialists, creating a natural recursive hierarchy.
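The recursive professor/specialist relationship can be sketched as a tiny Python class. `Agent`, `solve_leaf`, and the string-based integration below are illustrative stand-ins, not the Crux implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Minimal sketch of the recursive hierarchy: any Agent may hold
    sub-agents, in which case it acts as a Professor that delegates via
    'function calls' and integrates the results; otherwise it is a leaf
    Specialist running its own solver."""
    name: str
    solve_leaf: callable = None               # hypothetical domain solver for leaves
    team: list = field(default_factory=list)  # empty => plain Specialist

    def solve(self, problem, depth=0):
        if not self.team:
            # Leaf Specialist: run its own (here stubbed) Self-Evolve loop
            return self.solve_leaf(problem)
        # Professor role: delegate subproblems, then integrate
        partials = [member.solve(f"{problem} [{member.name}]", depth + 1)
                    for member in self.team]
        return f"{self.name}:integrate({'; '.join(partials)})"
```

Because `team` members are themselves `Agent`s, a specialist with a non-empty team automatically behaves as a sub-professor, giving the fractal structure described above.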
The enhanced architecture draws parallels to neural networks:
Traditional NN Layer → Single Self-Evolve Mechanism
Deep NN Architecture → Multi-Layer Agent Architecture
Key Insight: Just as deeper neural networks can solve more complex problems, deeper layers of agents in our system demonstrate enhanced capability for complex problem-solving.
In this view, Transformers can themselves be read as agent systems built from neural networks, not just language models. The same architectural principles that make deep Transformers powerful apply to our multi-layer agent systems:
- Attention Mechanisms → Agent Communication Patterns
- Layer Depth → Hierarchical Agent Depth
- Parallel Processing → Concurrent Agent Operations
The key insight is that any Specialist can become a Professor for its own sub-specialists, creating a fractal-like recursive structure:
Input → [Generator → Evaluator → Refiner] → Output
Input → [Professor] → [Specialist₁, Specialist₂, Specialist₃] → Integration → Output
Input → [Professor] → [Specialist₁-Prof, Specialist₂-Prof, Specialist₃-Prof] → Integration → Output
│ │ │
[Sub-Spec₁₁, [Sub-Spec₂₁, [Sub-Spec₃₁,
Sub-Spec₁₂, Sub-Spec₂₂, Sub-Spec₃₂,
Sub-Spec₁₃] Sub-Spec₂₃] Sub-Spec₃₃]
```mermaid
---
config:
  layout: dagre
---
flowchart TD
    Input[["🎯 Complex Problem"]]
    subgraph L0["Level 0 - Root Professor"]
        Prof0["🎓 Root Professor"]
    end
    subgraph L1["Level 1 - Department Heads"]
        Prof1["🎓 Math Prof-Specialist"]
        Prof2["🎓 Logic Prof-Specialist"]
        Prof3["🎓 Creative Prof-Specialist"]
    end
    subgraph L2["Level 2 - Sub-Departments"]
        Prof11["🎓 Algebra Prof-Spec"]
        Prof12["🎓 Geometry Prof-Spec"]
        Prof21["🎓 Formal Prof-Spec"]
        Prof22["🎓 Reasoning Prof-Spec"]
        Prof31["🎓 Writing Prof-Spec"]
        Prof32["🎓 Design Prof-Spec"]
    end
    subgraph L3["Level 3 - Individual Specialists"]
        Spec111["🔬 Linear Algebra"]
        Spec112["🔬 Abstract Algebra"]
        Spec121["🔬 Euclidean Geo"]
        Spec122["🔬 Topology"]
        Spec211["🔬 Propositional"]
        Spec212["🔬 Predicate Logic"]
        Spec221["🔬 Deductive"]
        Spec222["🔬 Inductive"]
        Spec311["🔬 Technical Writing"]
        Spec312["🔬 Creative Writing"]
        Spec321["🔬 UI Design"]
        Spec322["🔬 System Design"]
    end
    subgraph L4["Level 4 - Tool Specialists"]
        Tool1["⚙️ Calculation"]
        Tool2["⚙️ Verification"]
        Tool3["⚙️ Formatting"]
        Tool4["⚙️ Research"]
    end
    Integration[["🔗 Recursive Integration"]]
    Output[["✅ Deep Solution"]]
    %% Main flow
    Input --> Prof0
    Prof0 --> Prof1 & Prof2 & Prof3
    Prof1 --> Prof11 & Prof12
    Prof2 --> Prof21 & Prof22
    Prof3 --> Prof31 & Prof32
    Prof11 --> Spec111 & Spec112
    Prof12 --> Spec121 & Spec122
    Prof21 --> Spec211 & Spec212
    Prof22 --> Spec221 & Spec222
    Prof31 --> Spec311 & Spec312
    Prof32 --> Spec321 & Spec322
    Spec111 & Spec112 & Spec121 & Spec122 --> Tool1 & Tool2
    Spec211 & Spec212 & Spec221 & Spec222 --> Tool2 & Tool4
    Spec311 & Spec312 & Spec321 & Spec322 --> Tool3 & Tool4
    %% Integration flow
    Tool1 & Tool2 & Tool3 & Tool4 --> Integration
    Spec111 & Spec112 & Spec121 & Spec122 & Spec211 & Spec212 & Spec221 & Spec222 & Spec311 & Spec312 & Spec321 & Spec322 --> Integration
    Prof11 & Prof12 & Prof21 & Prof22 & Prof31 & Prof32 --> Integration
    Prof1 & Prof2 & Prof3 --> Integration
    Prof0 --> Integration
    Integration --> Output
    classDef prof fill:#FFD700,stroke:#FF8C00,stroke-width:3px,color:#000
    classDef spec fill:#98FB98,stroke:#228B22,stroke-width:2px,color:#000
    classDef tool fill:#FFB6C1,stroke:#DC143C,stroke-width:2px,color:#000
    classDef integration fill:#DDA0DD,stroke:#9370DB,stroke-width:3px,color:#000
    class Prof0,Prof1,Prof2,Prof3,Prof11,Prof12,Prof21,Prof22,Prof31,Prof32 prof
    class Spec111,Spec112,Spec121,Spec122,Spec211,Spec212,Spec221,Spec222,Spec311,Spec312,Spec321,Spec322 spec
    class Tool1,Tool2,Tool3,Tool4 tool
    class Integration integration
```
Through extensive testing, we discovered optimal Self-Evolve loop configurations for each architecture depth:
| 🏗️ Architecture | 🔄 Loop Configuration | 📊 Dynamic Calls | 💡 Reasoning |
|---|---|---|---|
| Basic Mode (Depth-1) | 4 loops total | Fixed: 1 agent | Single agent needs multiple iterations to converge |
| Enhanced Mode (Depth-2) | Specialists: 6 loops<br>Professor: 2-3 loops | Avg: 3-4 specialists (dynamic function calls) | Specialists need deep refinement; Professor adapts team size |
| Deep Mode (Depth-3+) | Each level: 4-8 loops<br>Higher levels: 2-4 loops | Avg: 3-4 per level (recursive dynamic calls) | Each professor-specialist adapts sub-team size based on problem complexity |
Unlike static architectures, our system uses dynamic function calling where each Professor-level agent determines the optimal number of specialists based on problem complexity. Testing on IMO/USAMO-level mathematical problems showed an average of 3-4 specialist calls per professor.
```python
def calculate_dynamic_api_calls(depth, avg_specialists=3.5, base_loops=4):
    """
    Calculate API calls with dynamic specialist allocation.
    Based on IMO/USAMO complexity testing showing 3-4 avg specialist calls.
    """
    total_calls = 0
    for level in range(depth):
        if level == 0:  # Root professor
            agents_at_level = 1
            loops = 3  # Professor coordination loops
        else:  # Specialist levels (dynamically allocated)
            agents_at_level = round(avg_specialists ** level)
            loops = base_loops + (2 if level == depth - 1 else 0)  # Leaf specialists get more loops
        level_calls = agents_at_level * loops * 3  # 3 components (Generator/Evaluator/Refiner) per agent
        total_calls += level_calls
        print(f"Level {level}: {agents_at_level} agents × {loops} loops × 3 = {level_calls} calls")
    return total_calls

# Observed call totals with dynamic allocation (3.5 avg):
# Depth-1: 12 calls
# Depth-2: 72 calls (6x increase)
# Depth-3: 231 calls (3.2x increase)
# Depth-4: 774 calls (3.4x increase)
```

The growth pattern is more moderate than a static 3^N because of:
- Adaptive specialist allocation based on problem complexity
- Professor agents intelligently determine optimal team size
- IMO/USAMO testing showed consistent 3-4 specialist pattern
- Diminishing returns as deeper levels require fewer additional specialists
| 🎚️ Depth Level | 🏗️ Structure | 🤖 Total Agents | 🎯 Problem Types |
|---|---|---|---|
| Depth-1 | Single Agent | 1 | Simple Q&A, Basic calculations |
| Depth-2 | 1 Prof + 3 Specs | 4 | Multi-step reasoning, Code debugging |
| Depth-3 | 1 + 3 + 9 | 13 | Complex proofs, Research synthesis |
| Depth-4 | 1 + 3 + 9 + 27 | 40 | Scientific discovery, System design |
| Depth-N | ∑ 3ⁱ (i=0 to N-1) | (3ᴺ - 1) / 2 | Arbitrarily complex problems |
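The closed form in the last row follows from the geometric-series identity ∑ 3ⁱ = (3ᴺ - 1)/2; a quick check against the table:

```python
def total_agents(depth, branching=3):
    """Total agents in a full hierarchy: sum of branching**i over levels 0..depth-1."""
    return sum(branching ** i for i in range(depth))

for n in range(1, 5):
    # Geometric-series closed form for branching factor 3
    closed_form = (3 ** n - 1) // 2
    assert total_agents(n) == closed_form
# Depth 1..4 give 1, 4, 13, 40 agents, matching the table rows above.
```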
The recursive structure creates a fractal pattern where:
- Each specialist can become a professor
- Problem decomposition happens naturally at each level
- Self-evolve loops operate independently at all levels
- Integration occurs recursively from bottom to top
This mirrors how human expertise develops - specialists in narrow fields can become generalists who manage other specialists, creating natural hierarchies of knowledge and problem-solving capability.
1. Initialise prompt θ₀
2. Roll‑out yₜ ← f(x; θₜ)
3. Evaluate φₜ ← Evaluator(x, yₜ)
4. Refine θₜ₊₁ ← θₜ + π_φ(φₜ)
5. Repeat until budget or convergence

1. Initialise Professor θ_prof, Specialists {θ_spec₁, θ_spec₂, ...}
2. Orchestrate y_prof ← Professor(x; θ_prof)
3. Delegate y_specᵢ ← Specialist_i(subproblem; θ_specᵢ)
4. Integrate y_final ← Professor.integrate({y_specᵢ})
5. Multi-Evaluate φ_prof, {φ_specᵢ} ← Multi-Evaluator(y_final)
6. Multi-Refine θ_prof, {θ_specᵢ} ← Multi-Refiner({φ_specᵢ})
7. Repeat with enhanced exploration capability

┌─────────────────────────────────────────────────────────────────┐
│ Professor Agent │
│ ┌─────────┐ x ┌─────────┐ y_prof ┌──────────────────┐ │
│ │ Prompt │ ───▶ │ LLM f │ ────────▶ │ Integration │ │
│ │ θ_prof │ │ (Prof) │ │ Module │ │
│ └─────────┘ └─────────┘ └──────────────────┘ │
│ ▲ │ │ │
│ │ ▼ (Function Calls) ▼ │
│ ┌─────────┐ ┌──────────────────────┐ ┌─────────────────┐ │
│ │Refiner │ │ Specialist Agents │ │ Evaluator │ │
│ │ π_φ │ │ ┌─────┬─────┬─────┐ │ │ (Multi) │ │
│ └─────────┘ │ │Spec1│Spec2│Spec3│ │ └─────────────────┘ │
│ ▲ │ └─────┴─────┴─────┘ │ │ │
│ └─────────────┼──────────────────────┼──────────┘ │
│ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
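A minimal sketch of one iteration of the enhanced loop (steps 2-6 above), again assuming a generic `llm(system, user)` completion helper rather than the actual Crux API:

```python
def enhanced_step(llm, x, theta_prof, theta_specs):
    """One enhanced-loop iteration: delegate, integrate, multi-evaluate,
    multi-refine. All prompts are the trainable quantities."""
    # 2-3. Professor decomposes x and delegates one subproblem per specialist
    subproblems = [f"{x} (part {i})" for i in range(len(theta_specs))]
    y_specs = [llm(theta, sub) for theta, sub in zip(theta_specs, subproblems)]
    # 4. Integrate specialist outputs under the Professor's prompt
    y_final = llm(theta_prof, f"Integrate these partial results: {y_specs}")
    # 5. Multi-Evaluate: one NL critique per prompt being trained
    critiques = [llm("Critique this contribution in plain language.", y)
                 for y in y_specs + [y_final]]
    # 6. Multi-Refine: every prompt is updated from its own critique
    new_specs = [llm("Rewrite the prompt to address the critique.",
                     f"{theta}\n{crit}")
                 for theta, crit in zip(theta_specs, critiques)]
    new_prof = llm("Rewrite the prompt to address the critique.",
                   f"{theta_prof}\n{critiques[-1]}")
    return y_final, new_prof, new_specs
```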
Empirical Discovery: Testing revealed that as the layer depth of agents increases, the system's ability to solve complex problems grows significantly. This mirrors the scaling behavior observed in deep neural networks.
State Space Exploration: The enhanced architecture searches the solution state space more efficiently by:
- Parallel Exploration: Multiple specialists explore different aspects simultaneously
- Hierarchical Decomposition: Complex problems broken into manageable subproblems
- Enhanced Time Utilization: Better resource allocation across the agent hierarchy
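Parallel exploration can be sketched with `asyncio`; `consult` below is a hypothetical stand-in for one specialist's Self-Evolve run:

```python
import asyncio

async def consult(name, problem):
    """Stand-in for one specialist's Self-Evolve run; a real
    implementation would await an API call here."""
    await asyncio.sleep(0)  # yield point
    return f"{name}:{problem}"

async def parallel_explore(problem, specialists):
    """All specialists explore their aspect of the problem concurrently;
    gather preserves input order for integration."""
    return await asyncio.gather(
        *(consult(name, problem) for name in specialists))

results = asyncio.run(parallel_explore("P", ["math", "logic", "writing"]))
```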
```mermaid
---
config:
  layout: dagre
---
flowchart LR
    subgraph Main_Pipeline["Professor-Led Pipeline"]
        direction LR
        Professor["Professor Agent"]
        input(["input"])
        Integration["Integration Module"]
        output(["output"])
        MultiEvaluator["Multi-Evaluator"]
        MultiRefiner["Multi-Refiner"]
    end
    subgraph Specialist_Layer_1["Specialist Layer 1"]
        direction LR
        S1Gen["Specialist #1"]
        S1Out(("output #1"))
        S1Eval["Evaluator #1"]
        S1Ref["Refiner #1"]
    end
    subgraph Specialist_Layer_2["Specialist Layer 2"]
        direction LR
        S2Gen["Specialist #2"]
        S2Out(("output #2"))
        S2Eval["Evaluator #2"]
        S2Ref["Refiner #2"]
    end
    subgraph Specialist_Layer_3["Specialist Layer 3"]
        direction LR
        S3Gen["Specialist #3"]
        S3Out(("output #3"))
        S3Eval["Evaluator #3"]
        S3Ref["Refiner #3"]
    end
    subgraph Self_Evolve["Self-Evolve Core"]
        direction LR
        OutSE(["response"])
        GenSE["Generator"]
        EvalSE["Evaluator"]
        RefSE["Refiner"]
    end
    input --> Professor
    MultiRefiner --> Professor
    Professor --> Integration
    Professor -.->|"Function Calls"| S1Gen & S2Gen & S3Gen
    Integration --> output
    output --> MultiEvaluator
    MultiEvaluator --> MultiRefiner
    S1Gen --> S1Out
    S1Out --> S1Eval & Integration
    S1Eval --> S1Ref
    S1Ref --> S1Gen
    S2Gen --> S2Out
    S2Out --> S2Eval & Integration
    S2Eval --> S2Ref
    S2Ref --> S2Gen
    S3Gen --> S3Out
    S3Out --> S3Eval & Integration
    S3Eval --> S3Ref
    S3Ref --> S3Gen
    GenSE --> OutSE
    OutSE --> EvalSE
    EvalSE --> RefSE
    RefSE --> GenSE
    MultiRefiner:::green
    S1Ref:::green
    S2Ref:::green
    S3Ref:::green
    RefSE:::green
    classDef green fill:#006400,color:#FFFFFF,stroke-width:0
```
Each specialist operates completely independently:
```mermaid
---
config:
  layout: dagre
---
flowchart TD
    Professor["🎓 Professor Agent"]
    ProblemAnalysis["🔍 Problem Analysis"]
    TeamDesign["👥 Dynamic Team Design"]
    subgraph DynamicCreation["🎭 Dynamic Specialist Creation"]
        FC1["Function Call: Math Specialist"]
        FC2["Function Call: Logic Specialist"]
        FC3["Function Call: Writing Specialist"]
        FC4["Function Call: Research Specialist"]
    end
    subgraph Spec1["🔬 Math Specialist"]
        S1Gen["Generator"]
        S1Eval["Evaluator"]
        S1Ref["Refiner"]
        S1Gen --> S1Eval --> S1Ref --> S1Gen
    end
    subgraph Spec2["🔬 Logic Specialist"]
        S2Gen["Generator"]
        S2Eval["Evaluator"]
        S2Ref["Refiner"]
        S2Gen --> S2Eval --> S2Ref --> S2Gen
    end
    subgraph Spec3["🔬 Writing Specialist"]
        S3Gen["Generator"]
        S3Eval["Evaluator"]
        S3Ref["Refiner"]
        S3Gen --> S3Eval --> S3Ref --> S3Gen
    end
    subgraph Spec4["🔬 Research Specialist"]
        S4Gen["Generator"]
        S4Eval["Evaluator"]
        S4Ref["Refiner"]
        S4Gen --> S4Eval --> S4Ref --> S4Gen
    end
    Integration["🔗 Result Integration"]
    MetaEval["📊 Meta-Evaluation"]
    Output["✅ Final Output"]
    Professor --> ProblemAnalysis
    ProblemAnalysis --> TeamDesign
    TeamDesign --> DynamicCreation
    FC1 --> Spec1
    FC2 --> Spec2
    FC3 --> Spec3
    FC4 --> Spec4
    Spec1 --> Integration
    Spec2 --> Integration
    Spec3 --> Integration
    Spec4 --> Integration
    Integration --> MetaEval
    MetaEval --> Output
    classDef professor fill:#FFD700,stroke:#FF8C00,stroke-width:3px,color:#000
    classDef specialist fill:#98FB98,stroke:#228B22,stroke-width:2px,color:#000
    classDef dynamic fill:#87CEEB,stroke:#4682B4,stroke-width:2px,color:#000
    classDef evolve fill:#DDA0DD,stroke:#9370DB,stroke-width:2px,color:#000
    class Professor professor
    class Spec1,Spec2,Spec3,Spec4 specialist
    class DynamicCreation,FC1,FC2,FC3,FC4 dynamic
    class S1Gen,S1Eval,S1Ref,S2Gen,S2Eval,S2Ref,S3Gen,S3Eval,S3Ref,S4Gen,S4Eval,S4Ref evolve
```
| 🧩 Module | 💻 Practical Tips |
|---|---|
| Professor Agent | Orchestration-focused prompt design; function calling capabilities; integration logic for specialist outputs. |
| Specialist Agents | Domain-specific prompts; independent Self-Evolve mechanisms; specialized evaluation criteria. |
| Function Calling | Structured interfaces between Professor and Specialists; clear input/output schemas; error handling. |
| Multi-Evaluator | Hierarchical evaluation: specialist-level and integration-level feedback; coherence checking across outputs. |
| Multi-Refiner | Coordinated refinement: individual specialist improvements and Professor orchestration updates. |
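As an illustration of the Function Calling row, a Professor could expose delegation as a JSON-Schema tool definition. The `consult_specialist` name and its fields below are hypothetical, not the actual Crux schema:

```python
# Hypothetical tool definition a Professor might register so the model can
# delegate to a specialist; the structure follows the common JSON-Schema
# "tools" convention, not a specific Crux API.
consult_specialist = {
    "type": "function",
    "function": {
        "name": "consult_specialist",
        "description": "Delegate a subproblem to a domain specialist agent "
                       "and return its refined answer.",
        "parameters": {
            "type": "object",
            "properties": {
                "domain": {"type": "string",
                           "description": "e.g. 'algebra', 'formal logic'"},
                "subproblem": {"type": "string"},
                "max_evolve_loops": {"type": "integer", "default": 6},
            },
            "required": ["domain", "subproblem"],
        },
    },
}
```

A clear schema like this doubles as the error-handling boundary: malformed calls can be rejected and fed back to the Professor as natural-language feedback.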
Key Finding: The enhanced Crux architecture shows dramatic improvements on complex, multi-domain tasks where basic Self-Evolve mechanisms plateau.
- 🌍 Universal Prompt Formalism – Every policy π can be encoded by some prompt θ; thus Π ≅ 𝒫
- 📡 High‑Bandwidth Reward – A scalar conveys log₂|𝑨| bits, while an NL sequence of T tokens conveys O(T) bits, enabling faster exploration
- 🎯 Convergence with Noisy Refiners – If E[Δθ | φ] · ∇J > 0, Robbins‑Monro conditions yield θₜ → θ* almost surely
- 🏗️ Hierarchical Exploration Enhancement – Multi-layer agent architectures provide exponentially larger effective search spaces compared to single-layer mechanisms
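The bandwidth claim in the second bullet can be checked with a back-of-the-envelope calculation; the vocabulary size of 50,000 is an assumed figure for illustration:

```python
from math import log2

def scalar_reward_bits(num_actions):
    """A scalar reward distinguishing |A| outcomes carries log2|A| bits."""
    return log2(num_actions)

def nl_reward_bits(tokens, vocab_size=50_000):
    """Upper bound: T tokens over a vocabulary V carry up to T * log2(V) bits."""
    return tokens * log2(vocab_size)

# A binary pass/fail signal carries 1 bit, while a 100-token critique can
# carry three orders of magnitude more information.
```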
| 🚧 Challenge | 💡 Solution |
|---|---|
| API Cost Scaling | Implement intelligent caching and selective specialist activation |
| Coordination Complexity | Develop robust communication protocols and failure recovery |
| Depth vs. Efficiency Trade-off | Dynamic architecture adaptation based on problem complexity |
| Specialist Specialization | Automated specialist role discovery and optimization |
- Automatic Architecture Discovery: Learning optimal Professor-Specialist configurations
- Cross-Domain Transfer: Leveraging specialist knowledge across different problem domains
- Resource-Aware Orchestration: Dynamic specialist allocation based on computational budgets
- Meta-Learning Integration: Learning to learn across different agent hierarchies
```bash
# Clone the repository
git clone https://github.com/tooliense/icrl-crux
cd icrl-crux

# Install dependencies
pip install -r requirements.txt

# Set API keys
export OPENAI_API_KEY="your-key-here"
export DEEPSEEK_API_KEY="your-key-here"
```

```bash
# Basic iterative improvement with prompt refinement
python examples/example_usage.py
```

```bash
# Full Professor + Graduate system with o3 models
python examples/run_professor_graduate.py

# Quick test with gpt-4o models
python examples/run_professor_graduate.py --simple

# Test Responses API features
python examples/run_professor_graduate.py --test

# Show help and options
python examples/run_professor_graduate.py --help
```

```bash
# Required
export OPENAI_API_KEY="your-api-key"

# Optional model configuration
export PROFESSOR_MODEL="o3"                 # Default: o3
export EVALUATOR_MODEL="o3"                 # Default: o3
export WORKER_MODEL="o3"                    # Default: o3
export PROBLEM_FILE="path/to/problem.xml"   # Custom problem file
```

```bibtex
@misc{tooliense2025icrl,
  title  = {IC-RL: In-Context Reinforcement Learning with Natural-Language Rewards and Enhanced Agent Architecture},
  author = {Tooliense Team},
  year   = {2025},
  note   = {Crux Agent System Implementation},
  url    = {https://github.com/tooliense/icrl-crux}
}
```

We welcome contributions to the IC-RL Crux project! Please see our Contributing Guidelines for details.
- Automated specialist discovery
- Cross-domain transfer learning
- Resource optimization algorithms
- Integration with popular ML frameworks
MIT License. Respect the terms of your model provider (OpenAI, DeepSeek, etc.).
