Feature request
Hello,
Recent research on reasoning metrics such as the Deep-Thinking Ratio (DTR) and related work
(e.g. https://arxiv.org/abs/2602.13517 and https://arxiv.org/abs/2603.10165)
shows that intermediate transformer activations carry useful signals
about reasoning quality during generation.
Motivation
These signals can be used for:
• reasoning metrics (e.g. DTR)
• candidate pruning in multi-sample reasoning
• inference-time compute allocation
• router calibration
• model interpretability
Currently, most optimized inference frameworks return only the final logits or sampled tokens,
which makes these metrics impossible to compute.
It would be very helpful if the framework supported an optional debugging/research mode
that exposes intermediate layer outputs during generation.
For example:
generate(
    prompt,
    return_hidden_states=True,
    layers=[4, 8, 12, 16],
)
or
generate(
    prompt,
    return_layer_logits=True,
)
This mode could be:
• disabled by default
• restricted to batch size = 1
• limited to selected layers
• marked as a research/debug feature
Providing this capability would enable experimentation with reasoning metrics
without requiring users to reimplement the entire inference stack in raw PyTorch.
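To illustrate the downstream use case: assuming a generate call like the one sketched above returned per-layer hidden states as nested lists (one list of per-token vectors per requested layer), a simple metric over them could be computed entirely outside the framework. The function and signal below are hypothetical, not part of any existing API; the cosine-similarity heuristic stands in for DTR-style metrics only as a minimal sketch.

```python
import math


def depth_signal(hidden_states):
    """Hypothetical per-token signal over returned hidden states.

    Computes cosine similarity between each token's vectors at
    consecutive layers; low similarity suggests the representation is
    still being transformed at that depth. `hidden_states` is a list
    of layers, each a list of per-token vectors (lists of floats).
    Returns a (n_layers - 1) x seq_len nested list.
    """
    out = []
    for prev, cur in zip(hidden_states, hidden_states[1:]):
        row = []
        for p, c in zip(prev, cur):
            dot = sum(a * b for a, b in zip(p, c))
            norm = (math.sqrt(sum(a * a for a in p))
                    * math.sqrt(sum(b * b for b in c)))
            row.append(dot / norm if norm else 0.0)
        out.append(row)
    return out


# Mock hidden states: 3 layers, 2 tokens, d_model = 2.
hs = [
    [[1.0, 0.0], [0.0, 1.0]],  # layer A
    [[1.0, 0.0], [1.0, 0.0]],  # layer B
    [[0.0, 1.0], [1.0, 0.0]],  # layer C
]
print(depth_signal(hs))  # [[1.0, 0.0], [0.0, 1.0]]
```

The point is that the framework only needs to expose the raw activations; any metric logic stays in user code.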
Your contribution
Thank you for considering this feature.