RLVR is a fresh implementation of multi-step question-answering training built on the modern verifiers library. It trains language models with reinforcement learning on the MuSiQue dataset for multi-hop reasoning tasks.
This project represents a migration from a custom verifiers fork to the official verifiers library, taking advantage of modern improvements like:
- Native tool calling support
- Clean environment protocols (`ToolEnv`, `MultiTurnEnv`)
- Modern GRPO trainer with async batch generation
- Installable environment packages
- Proper rubric system for evaluation
- **MuSiQue Environment** (`environments/vf_musique/`)
  - Custom `ToolEnv` implementation for multi-hop question answering
  - Document retrieval tools (BM25, semantic, hybrid retrieval)
  - MuSiQue-specific dataset preprocessing
  - Citation tracking and multi-hop reasoning support
  - Custom …
- **Training Infrastructure** (`scripts/train_musique.py`)
  - Modern GRPO training using the official verifiers library
  - LoRA fine-tuning support
  - Configurable retrieval strategies
  - WandB integration for experiment tracking
- **Evaluation System**
  - Custom rubrics for MuSiQue evaluation
  - Exact match and F1 scoring
  - Retrieval quality metrics (recall, precision)
  - Multi-hop difficulty weighting
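The exact-match and F1 metrics follow the standard token-overlap definitions used in extractive QA; a minimal self-contained sketch (illustrative, not the project's exact implementation):

```python
import re
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and articles, then tokenize."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return [t for t in text.split() if t not in {"a", "an", "the"}]

def exact_match(prediction: str, reference: str) -> float:
    """Binary accuracy against the ground-truth answer."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction: str, reference: str) -> float:
    """Token-level overlap between prediction and reference."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `f1_score("Eiffel Tower in Paris", "Eiffel Tower")` gives 2/3: two overlapping tokens, precision 0.5, recall 1.0.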
Install dependencies:

```bash
uv venv
source .venv/bin/activate
uv sync
uv pip install flash-attn --no-build-isolation

# Install the MuSiQue environment package
vf-install vf-musique -p environments

# Evaluate the environment against a local endpoint
vf-eval vf-musique --model Qwen/Qwen2.5-3B-Instruct \
    --api-base-url http://0.0.0.0:8000/v1 --api-key local
```

Start vLLM inference server:
```bash
CUDA_VISIBLE_DEVICES=0 vf-vllm --model meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 1 --data-parallel-size 1 \
    --enforce-eager --disable-log-requests \
    --gpu-memory-utilization 0.7
```

Train on the MuSiQue environment with GRPO:
```bash
CUDA_VISIBLE_DEVICES=1,2,3 accelerate launch --num-processes 3 \
    --config-file configs/zero3.yaml \
    scripts/train_musique.py train
```

Quick prediction:
```bash
python scripts/train_musique.py predict \
    --model outputs/your-trained-model \
    --batch-size 8
```

The evaluation script provides multiple commands for comprehensive analysis:
Basic evaluation:

```bash
python scripts/evaluate_musique.py outputs/your-trained-model
```

Detailed evaluation:
```bash
python scripts/evaluate_musique.py evaluate \
    outputs/your-trained-model \
    --dataset-split validation \
    --num-examples 100 \
    --retriever hybrid \
    --verbose
```

Benchmark multiple models:
```bash
python scripts/evaluate_musique.py benchmark \
    --models model1,model2,model3 \
    --retrievers bm25,hybrid \
    --output-dir benchmark_results
```

Analyze results:
```bash
python scripts/evaluate_musique.py analyze benchmark_results
```

Key environment features:

- Tool-Based Interaction: Models use retrieval tools to gather information
- Citation Requirements: Models must cite sources used in reasoning
- Multi-Hop Questions: Dataset requires connecting information across multiple documents
- Completion Detection: Environment detects when reasoning is complete
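A toy sketch of the citation and completion checks, assuming a hypothetical `<answer>` tag and `[n]`-style citation markers (the environment's real conventions may differ):

```python
import re

def is_complete(message: str) -> bool:
    """Treat a rollout as finished once a final answer tag appears."""
    return "<answer>" in message and "</answer>" in message

def extract_citations(message: str) -> set[int]:
    """Collect [n]-style citation markers from the reasoning text."""
    return {int(n) for n in re.findall(r"\[(\d+)\]", message)}

reply = "Released in 1999 [2] by the band formed in Oslo [5]. <answer>1999</answer>"
```

Here `extract_citations(reply)` yields `{2, 5}`, and `is_complete(reply)` is true, so the environment would stop the rollout.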
Retrieval strategies:

- BM25: Classic lexical retrieval
- Semantic: Neural embedding-based retrieval
- Hybrid: Combination of lexical and semantic approaches
- Golden: Oracle retrieval for debugging (returns supporting documents)
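A common way to implement the hybrid strategy is min-max normalization of each retriever's scores followed by a weighted sum; a sketch under that assumption (the project's actual fusion method is not specified here):

```python
def minmax(scores: list[float]) -> list[float]:
    """Scale a score list into [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(bm25: list[float], semantic: list[float],
                  alpha: float = 0.5) -> list[float]:
    """Blend lexical and semantic scores for the same candidate documents."""
    return [alpha * b + (1 - alpha) * s
            for b, s in zip(minmax(bm25), minmax(semantic))]

# Rank three candidate documents by the blended score
blended = hybrid_scores([12.0, 3.0, 7.5], [0.2, 0.9, 0.6])
ranked = sorted(range(len(blended)), key=blended.__getitem__, reverse=True)
```

Normalization matters because raw BM25 scores and cosine similarities live on very different scales; without it one retriever would dominate the sum.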
Evaluation metrics:

- Exact Match: Binary accuracy against ground truth
- F1 Score: Token-level overlap with references
- Retrieval Recall: Fraction of supporting documents retrieved
- Weighted Scoring: Multi-hop questions receive higher weight
- Combined Reward: Balances answer quality and retrieval performance
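The combined reward can be sketched as answer quality blended with retrieval recall, scaled by a hop-based weight; the coefficients below are illustrative, not the project's actual constants:

```python
def combined_reward(em: float, f1: float,
                    retrieval_recall: float, num_hops: int) -> float:
    """Blend answer quality with retrieval quality, upweighting harder questions."""
    answer_quality = 0.5 * em + 0.5 * f1
    base = 0.8 * answer_quality + 0.2 * retrieval_recall
    hop_weight = 1.0 + 0.25 * (num_hops - 2)  # 2-hop -> 1.0, 4-hop -> 1.5
    return base * hop_weight
```

Under these example weights, a perfect 4-hop rollout earns 1.5x the reward of a perfect 2-hop one.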
The MuSiQue environment (`MuSiQueToolEnv`) extends verifiers' `ToolEnv` to provide:
- Document Injection: Tools receive access to question-specific documents
- Multi-Turn Interaction: Up to 10 turns of tool usage per question
- Native Tool Calling: Python functions automatically converted to OpenAI format
- State Management: Tracks retrieval history and completion status
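The native tool calling above relies on turning plain Python functions into OpenAI-format tool schemas. A minimal sketch of that conversion using `inspect` (an illustrative reimplementation, not verifiers' actual code):

```python
import inspect

# Map Python annotations to JSON Schema types (fallback: "string")
TYPE_MAP = {str: "string", int: "integer", float: "number", bool: "boolean"}

def to_openai_tool(fn) -> dict:
    """Build an OpenAI-style tool schema from a function's signature and docstring."""
    sig = inspect.signature(fn)
    properties = {
        name: {"type": TYPE_MAP.get(p.annotation, "string")}
        for name, p in sig.parameters.items()
    }
    # Parameters without defaults are required
    required = [name for name, p in sig.parameters.items()
                if p.default is inspect.Parameter.empty]
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": (fn.__doc__ or "").strip(),
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }

def search(query: str, top_k: int = 5) -> str:
    """Search the corpus for documents matching the query."""
    return ""

schema = to_openai_tool(search)
```

The docstring becomes the tool description the model sees, which is why the project's tools carry detailed docstrings.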
Tools are implemented as Python functions with docstrings that become tool descriptions:

```python
def retrieve_documents(query: str) -> str:
    """
    Retrieve relevant documents by the query.

    Args:
        query: The query to retrieve documents for.

    Returns:
        Retrieved documents formatted as text.
    """
    # Implementation...
```

The `MuSiQueRubric` class provides comprehensive evaluation:
```python
def score(self, prompt, completion, answer, **kwargs) -> vf.RolloutScore:
    # Compute EM, F1, retrieval metrics
    # Weight by question difficulty (number of hops)
    # Return combined reward and detailed metrics
```

Before (Custom Fork):
- Complex XML parsing and tool integration
- Manual environment state management
- Custom training loops and reward computation
- Difficulty staying up-to-date with improvements
After (Official Verifiers):
- Native tool calling with automatic OpenAI conversion
- Clean environment protocols and state management
- Modern GRPO trainer with async batch generation
- Easy updates and community improvements
- Async Training: Improved throughput with async batch generation
- Native Tools: Cleaner tool integration without custom parsers
- Modern Config: Better hyperparameter management
- Standardized Evaluation: Consistent metrics across environments
- Import Errors: Ensure verifiers library is installed and environment is in Python path
- Service Dependencies: Rerank and wiki search services must be running for advanced retrievers
- Dataset Loading: MuSiQue dataset download may take time on first run
- GPU Memory: Adjust batch size and gradient accumulation for available hardware
```bash
# Debug with minimal examples
python scripts/train_musique.py --num-train-examples 10 --max-steps 5
```

Planned extensions:

- Multi-dataset support (HotpotQA, 2WikiMultiHopQA)
- Advanced reward shaping techniques
- Integration with external knowledge bases
- Self-supervised document filtering
- Hierarchical reasoning decomposition
- Meta-learning for few-shot adaptation
- Interpretability and reasoning visualization
- Verifiers Library - Official documentation
- MuSiQue Dataset - Multi-hop QA dataset
- GRPO Paper - Reinforcement learning algorithm