This repository implements transport operators for mapping neural network activations between layers, with a focus on analysing how features propagate through transformer models. The work studies how high-quality sparse autoencoder (SAE) features can be "transported" from one layer to another using learned affine transformations.
[📄 ArXiv Paper](https://arxiv.org/abs/2508.17540)
Transport operators learn linear mappings that predict downstream layer activations from upstream layer activations. This enables analysis of:
- How semantic features evolve across transformer layers
- Quality of feature representations at different depths
- Causal effects of feature interventions on model behaviour
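At their core, transport operators are regularized affine regressions from one layer's activations to another's. A minimal sketch of the idea with scikit-learn, using synthetic data in place of real cached activations (the repository's actual training code may differ in details):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic stand-ins for residual-stream activations:
# upstream layer L (n_tokens x d_model) and downstream layer L+k.
n_tokens, d_model = 2000, 64
X_up = rng.normal(size=(n_tokens, d_model))
true_map = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
X_down = X_up @ true_map + 0.01 * rng.normal(size=(n_tokens, d_model))

# A transport operator: one affine map predicting downstream
# activations from upstream activations.
transport = Ridge(alpha=1.0)
transport.fit(X_up, X_down)

r2 = transport.score(X_up, X_down)  # uniform-average R^2 over dimensions
print(f"R^2 = {r2:.3f}")
```

A high R² here means the downstream layer is largely linearly predictable from the upstream one, which is the property the pipeline below measures at scale.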
## Installation

1. Clone the repository.

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## Pipeline Overview

The repository implements a complete pipeline for studying activation transport:
- Activation Collection: Extract activations from transformer models
- Feature Selection: Identify high-quality SAE features
- Feature Explanation: Get interpretable descriptions of features
- Transport Operator Training: Learn linear mappings between layers
- Evaluation: Assess transport quality and run transport efficiency analysis
- Causal Analysis: Measure causal effects via perplexity evaluation
## 1. Activation Collection

First, collect activations from your target model on a dataset:

```bash
python collect_activations.py --config-path=configs --config-name=activation_collection
```

Configuration: Edit `configs/activation_collection.yaml` to specify:

- Model to analyze (default: `gemma_2_2b`)
- Dataset for activation collection (default: `slimpajama`)
- Output directory and batch size
- Number of sequences to process

Key parameters in `configs/activation_collection/slimpajama.yaml`:

- `num_sequences`: Number of text sequences to process
- `max_length`: Maximum sequence length
- `batch_size`: Processing batch size
- `output_dir`: Where to save activations

Output: Creates a directory of zarr-compressed activation files, e.g. `activations-gemma2-2b-slimpajama-250k/`
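Downstream stages consume these files as paired (source layer, target layer) activation matrices. A sketch of the stacking logic, using a hypothetical one-`.npy`-file-per-chunk layout for illustration (the repository actually stores zarr-compressed arrays, but the idea is the same):

```python
import numpy as np
from pathlib import Path

def load_activation_pairs(act_dir, src_layer, tgt_layer):
    """Stack per-chunk activation files for two layers into
    (n_tokens, d_model) design matrices.

    Hypothetical layout: one file per layer per chunk, e.g.
    layer_5_chunk_000.npy -- the real repository uses zarr-compressed
    arrays, but the concatenation logic is the same.
    """
    act_dir = Path(act_dir)

    def stack(layer):
        files = sorted(act_dir.glob(f"layer_{layer}_chunk_*.npy"))
        return np.concatenate([np.load(f) for f in files], axis=0)

    return stack(src_layer), stack(tgt_layer)
```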
## 2. Feature Selection

Use the automated feature discovery script to identify high-quality SAE features for any layer:

```bash
python scripts/find_feats.py --model-name google/gemma-2-2b --sae-l0 100 --top-percent 5.0
```

What the script does:

1. Loads SAE models from Gemma Scope for each layer automatically

2. Computes multiple quality metrics for each feature:
   - Token coherence: Low entropy over tokens that activate the feature (more interpretable)
   - Vocabulary focus: How sharply the feature targets specific tokens in vocabulary space
   - Activation patterns: Frequency and magnitude of feature activations
   - Redundancy: Correlation with other features (lower is better)
   - Causal effect: Impact on model predictions when ablated

3. Combines metrics into a composite score and ranks features

4. Saves results to `feature_scores_{layer}.json` files

Key parameters:

- `--model-name`: Transformer model to analyze (default: `google/gemma-2-2b`)
- `--sae-l0`: Target sparsity level for the SAE (default: 100)
- `--top-percent`: Percentage of top features to save (default: 5%)
- `--max-tokens`: Number of tokens to analyze for statistics (default: 120k)
- `--candidate-percent`: Percentage of features sent to causal testing (default: 15%)

Output format: Each JSON file contains:

- `"high_quality_feature_ids"`: Ranked list of best feature indices
- `"scores"`: Detailed metrics for all features
- `"meta"`: Analysis parameters and statistics
The script automatically processes all layers and creates comprehensive feature rankings based on interpretability, uniqueness, and causal importance.
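The metric combination step can be sketched as follows: z-score each metric across features so scales are comparable, flip the sign of lower-is-better metrics, and sum. The equal weighting here is illustrative, not necessarily the script's actual weighting:

```python
import numpy as np

def composite_scores(metrics, lower_is_better=("redundancy",)):
    """Combine per-feature quality metrics into one ranking score.

    metrics: dict mapping metric name -> array of shape (n_features,).
    Each metric is z-scored across features; lower-is-better metrics
    (e.g. redundancy) are negated before summing. Equal weights are
    an illustrative assumption.
    """
    total = None
    for name, values in metrics.items():
        v = np.asarray(values, dtype=float)
        z = (v - v.mean()) / (v.std() + 1e-8)
        if name in lower_is_better:
            z = -z
        total = z if total is None else total + z
    return total

# Rank four toy features best-first.
metrics = {
    "token_coherence": np.array([0.9, 0.1, 0.5, 0.7]),
    "redundancy":      np.array([0.2, 0.9, 0.4, 0.1]),
}
scores = composite_scores(metrics)
ranked = np.argsort(-scores)  # indices of features, best first
```

Feature 0 scores best here: it is both highly coherent and minimally redundant.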
## 3. Feature Explanation

Fetch interpretable explanations for your selected features from Neuronpedia:

```bash
python explain_features.py --config-path=configs --config-name=explain_features
```

Configuration: Edit `configs/explain_features.yaml` to specify:

- `sae_layer`: Which layer's features to explain
- `feature_ids_file`: Path to JSON file with feature IDs
- `cache_dir`: Directory for caching downloaded explanations
- `output_file`: Where to save the explanations

Example configuration:

```yaml
sae_layer: 6
feature_ids_file: "./feature_ids/feature_scores_6.json"
cache_dir: "./explanations_dict_cache"
output_file: "./feature_explanations/feature_explanations_6.json"
```

Output: JSON file containing human-readable explanations for each feature.
## 4. Transport Operator Training

Train linear transport operators to map activations between layers:

```bash
python train_transport_operators.py --config-path=configs --config-name=default
```

Configuration: Edit the files in `configs/experiment/` to specify:

- `L`: Source layer numbers (list)
- `k`: Target layer offsets (list); the target layer is `L + k`
- Training parameters (regularization, method, etc.)

Key parameters:

- `method`: Regression method (`"ridge"`, `"linear"`, `"lasso"`, `"elasticnet"`)
- `regularization`: Regularization strength
- `auto_tune`: Whether to use cross-validation for hyperparameter tuning
- `cv_folds`: Number of CV folds for tuning

Output: Trained models saved to `cache/transport_model_*.pkl`
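The `auto_tune` option corresponds to standard cross-validated selection of the regularization strength. A sketch of the idea with scikit-learn on synthetic data (the parameter names echo the config; the code itself is illustrative, not the repository's implementation):

```python
import pickle
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X_up = rng.normal(size=(500, 32))                       # layer L activations
W = rng.normal(size=(32, 32)) / np.sqrt(32)
X_down = X_up @ W + 0.05 * rng.normal(size=(500, 32))   # layer L+k activations

# auto_tune=true: choose the regularization strength by cross-validation;
# cv=5 plays the role of cv_folds.
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5)
model.fit(X_up, X_down)
print("chosen alpha:", model.alpha_)

# Trained operators are pickled, mirroring cache/transport_model_*.pkl.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
```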
Example experiment configuration:

```yaml
# configs/experiment/residual_stream.yaml
L: [5, 6, 10]    # Source layers
k: [1, 2, 4, 8]  # Target layer offsets
```

## 5. Evaluation

Evaluate the quality of trained transport operators and run matched-rank analysis:
```bash
python eval.py --config-path=configs --config-name=eval
```

Configuration: Edit `configs/eval.yaml` to specify:

- `eval_mode`: `"pretrained"` (use saved models) or `"baselines"` (identity operators)
- `Ls`: Layers to evaluate
- `ks`: Layer offsets to evaluate
- `run_matched_rank`: Whether to run matched-rank analysis
- `matched_rank_only`: Skip other evaluations if true

Key evaluation metrics:

- R² scores: How well transport operators predict downstream activations
- Feature-wise metrics: Performance on individual SAE features
- Matched-rank analysis: Compare transported features against all candidate features at the target layer
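One way to read the matched-rank analysis: transport a feature's direction to the target layer, rank its cosine similarity against every feature direction in the target layer's SAE decoder, and record where the true counterpart lands. A well-transported feature should rank its counterpart near the top. An illustrative sketch (not the repository's exact computation):

```python
import numpy as np

def matched_rank(transported_dir, target_decoder, true_idx):
    """Rank of the true target feature among all target-layer features,
    by cosine similarity to the transported direction (0 = best match)."""
    v = transported_dir / np.linalg.norm(transported_dir)
    D = target_decoder / np.linalg.norm(target_decoder, axis=1, keepdims=True)
    sims = D @ v               # cosine similarity to every target feature
    order = np.argsort(-sims)  # most similar first
    return int(np.where(order == true_idx)[0][0])

rng = np.random.default_rng(0)
decoder = rng.normal(size=(100, 32))               # 100 target-layer features
target = decoder[7] + 0.1 * rng.normal(size=32)    # noisy transported version
rank = matched_rank(target, decoder, true_idx=7)
```

Here the noisy copy of feature 7 should still match feature 7 best, so the rank is 0.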
Output:

- JSON files with evaluation results in `outputs/`
- Plots and visualizations in `plots/`
- Matched-rank analysis results showing feature similarity rankings
## 6. Causal Analysis

Measure the causal impact of transport interventions on model behavior:

```bash
python causal_perplexity_eval.py --config-path=configs --config-name=causal_eval
```

This evaluation:
- Applies transport operator interventions during model inference
- Measures changes in model perplexity on text sequences
- Compares interventions against baselines (no intervention, zero intervention)
Configuration: Edit `configs/causal_eval.yaml` and the `configs/causal_eval/` subdirectories to specify:
- Intervention layers and target layers
- Text dataset for evaluation
- Number of sequences to evaluate
- Specific features or feature sets to intervene on
Key intervention types:
- Transport interventions: Replace target layer activations with transported upstream activations
- Zero interventions: Zero out specific feature activations
- Feature-specific interventions: Target specific SAE features
Output: JSON files with perplexity changes, measuring causal effects of different interventions.
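Perplexity is the exponential of the mean per-token negative log-likelihood, so the causal effect of an intervention can be read off as a perplexity difference against the no-intervention baseline. A sketch with synthetic per-token losses standing in for real model outputs:

```python
import numpy as np

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return float(np.exp(np.mean(token_nlls)))

rng = np.random.default_rng(0)
# Hypothetical per-token NLLs from three runs over the same sequences.
baseline_nll  = rng.uniform(2.0, 3.0, size=1000)  # no intervention
transport_nll = baseline_nll + 0.02               # transport intervention
zero_nll      = baseline_nll + 0.50               # zero intervention

ppl_base = perplexity(baseline_nll)
ppl_transport = perplexity(transport_nll)
ppl_zero = perplexity(zero_nll)

# A faithful transport operator should perturb perplexity far less
# than zeroing out the same activations.
delta_transport = ppl_transport - ppl_base
delta_zero = ppl_zero - ppl_base
```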
## Configuration

The repository uses Hydra for configuration management. Key configuration directories:

- `configs/model/`: Model specifications (`gemma_2_2b`, etc.)
- `configs/sae/`: SAE configurations (`residual_stream`, `mlp`, etc.)
- `configs/experiment/`: Experiment parameters for training
- `configs/datasets/`: Dataset configurations
- `configs/logger/`: Logging and W&B setup
Common configuration patterns:

```bash
# Use a different model
python script.py model=gemma_2_9b

# Override specific parameters
python script.py experiment.L=[8,12,16] experiment.k=[2,4]

# Use a different SAE type
python script.py sae=mlp

# Change experiment name for tracking
python script.py experiment_name=my_experiment
```

## Example Workflow

Here's a complete example workflow:
```bash
# 1. Collect activations (if not already available)
python collect_activations.py --config-name=activation_collection

# 2. Get feature explanations for layer 6
python explain_features.py sae_layer=6

# 3. Train transport operators for layers 5→6, 6→10
python train_transport_operators.py experiment.L=[5,6] experiment.k=[1,4]

# 4. Evaluate transport quality
python eval.py eval.Ls=[5,6] eval.ks=[1,4] run_matched_rank=true

# 5. Run causal evaluation
python causal_perplexity_eval.py causal_eval=layer_6_to_10
```

## Repository Structure

- `collect_activations.py`: Activation extraction from models
- `train_transport_operators.py`: Transport operator training
- `eval.py`: Evaluation and matched-rank analysis
- `causal_perplexity_eval.py`: Causal intervention evaluation
- `explain_features.py`: Feature explanation fetching
- `src/`: Core implementation modules
  - `transport_operator.py`: Transport operator implementation
  - `activation_loader.py`: Activation data loading utilities
  - `sae_loader.py`: SAE loading and configuration
  - `causal_eval/`: Causal evaluation hooks and utilities
- `configs/`: Hydra configuration files
- `cache/`: Cached models and data
- `feature_ids/`: High-quality feature selections per layer
- `outputs/`: Evaluation results and analysis
## Citation

If you use this code in your research, please cite the associated paper:

```bibtex
@misc{szablewski2025activationtransportoperators,
  title={Activation Transport Operators},
  author={Andrzej Szablewski and Marek Masiak},
  year={2025},
  eprint={2508.17540},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2508.17540},
}
```
## License

[License information to be added]