# ILearner-LLM + RLearner-LLM: Explanation Generation for PeerWise
This repository contains two progressive lines of work for automatically generating and evaluating educational explanations for PeerWise exam questions:

- **ILearner-LLM** (original) — an iterative framework in which an LLM improves its generated explanations over K rounds of prompting (accepted at AAAI 2025).
- **RLearner-LLM** (this extension) — replaces the K-round iterative prompt loop with reinforcement learning (DPO/PPO) using modern models (Llama-3, Qwen3) and LoRA for parameter-efficient fine-tuning.
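For reference, a minimal sketch of a LoRA configuration for parameter-efficient fine-tuning with `peft`. The rank, scaling factor, and target modules below are illustrative assumptions, not this repo's actual settings:

```python
# Hypothetical LoRA config; r/alpha/target_modules are assumptions,
# not the values used by this repository's training scripts.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                    # low-rank adapter dimension
    lora_alpha=32,           # scaling factor applied to the adapter output
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # adapt only the attention projections of the base model
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```

This config can then be passed to a `peft`/`trl` trainer so that only the adapter weights are updated.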
## RLearner-LLM Environment

Recommended: use the existing `trl` conda env.

```bash
conda activate trl

# Install any missing packages (quote the specifiers so the shell
# does not interpret ">" as a redirect)
pip install "trl>=0.10.0" "peft>=0.10.0" "bitsandbytes>=0.43.0" \
    "deepspeed>=0.14.0" "datasets>=2.18.0" "accelerate>=0.27.0"

# Optional: Flash Attention 2 for ~2x faster training
pip install flash-attn --no-build-isolation
```
## Datasets

| Dataset | Domain | Train (~) | Test (~) |
|---|---|---|---|
| Cardiff | Biology | 4,000 | 1,400 |
| Sydney | Biology | 500 | 460 |
| Auckland Law | Law | 1,000 | — |
| UK Medicine Year 1 | Medicine | 1,000 | — |
| UK Medicine Year 2 | Medicine | 1,000 | — |
| Merged All | Mixed | 5,000 | — |
Data format (JSON, one object per question):

```json
{
  "instruction": "As an explanation generation expert, can you generate the explanation for the given input?",
  "input": "Question: [stem] Option A: ... Option B: ... The correct answer is A",
  "output": "Student-written explanation text..."
}
```
## ILearner-LLM Pipeline (Original)

### 1. Data Preprocessing

```bash
# Full merged dataset for generator training
python scripts/preprocessing/data_preprocessing_generator.py

# Single-domain: Cardiff only, avg_rating >= 3, explanation length >= 10
python scripts/preprocessing/data_preprocess_generator_one_dataset_cardiff.py

# Verifier data (Way 2: input = question + explanation -> output = score)
python scripts/preprocessing/data_preprocessing_verifier_way2.py
```
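The quality filter mentioned above (avg_rating >= 3, explanation length >= 10) can be sketched as follows; the field names and the choice of words as the length unit are assumptions, not the scripts' actual logic:

```python
def keep(record: dict, min_rating: float = 3.0, min_len: int = 10) -> bool:
    """Keep explanations rated >= min_rating with at least min_len words.
    Field names ("avg_rating", "output") are assumed, not verified."""
    return (
        record.get("avg_rating", 0.0) >= min_rating
        and len(record.get("output", "").split()) >= min_len
    )

records = [
    {"avg_rating": 4.2, "output": "a " * 12},    # kept
    {"avg_rating": 2.1, "output": "a " * 12},    # dropped: rating below 3
    {"avg_rating": 4.8, "output": "too short"},  # dropped: under 10 words
]
filtered = [r for r in records if keep(r)]
print(len(filtered))  # → 1
```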
## ILearner-LLM Baseline (reproduced from saved evaluation files)

### Cardiff Dataset

| Model | Training Data | N (test) | BLEU ↑ | BERTScore F1 ↑ |
|---|---|---|---|---|
| Alpaca-7B (no Cardiff SFT) | — | 4,203 | 0.1952 | 0.5550 |
| LLaMA-7B (Cardiff SFT) | Cardiff | 4,203 | 0.2378 | 0.5466 |
| Alpaca-7B (Cardiff SFT) | Cardiff | 4,203 | 0.2415 | 0.5494 |
| GPT4-X-Alpaca-13B (Cardiff SFT) | Cardiff | 4,203 | 0.2469 | 0.5466 |
| Vicuna-13B (Cardiff SFT) | Cardiff | 4,203 | 0.2402 | 0.5562 |
| Vicuna-13B (merged all, 100 samples) | All domains | 100 | 0.1523 | 0.5797 |
### Sydney Dataset

| Model | Training Data | N (test) | BLEU ↑ | BERTScore F1 ↑ |
|---|---|---|---|---|
| Vicuna-13B (no SFT) | — | 463 | 0.0818 | 0.1845 |
| Vicuna-13B (Sydney SFT, avg ≥ 3, len ≥ 10) | Sydney | 463 | 0.3317 | 0.6255 |
| Vicuna-13B (merged all, 100 samples) | All domains | 100 | 0.1589 | 0.5343 |
## RLearner-LLM Results — Cardiff Biology (100-question test set)

| Model | BLEU ↑ | BERT(Stu) ↑ | BERT(Ans) ↑ | ACR ↑ | NLI ↑ | Verifier ↑ | Time (s) ↓ |
|---|---|---|---|---|---|---|---|
| SFT (LLaMA-2-13B + LoRA) | 0.0160 | 0.8070 | 0.7820 | 0.8087 | 0.0555 | 3.1976 | 19.947 |
| DPO v1 (165 pairs) | 0.0173 | 0.8238 | 0.8325 | 0.7698 | 0.2969 | 3.0467 | 6.567 |
| DPO v2 (458 pairs) | 0.0247 | 0.8300 | 0.8422 | 0.8682 | 0.2905 | 3.0648 | 5.774 |
| PPO (125 steps) | 0.0175 | 0.8245 | 0.8255 | 0.7390 | 0.2260 | 3.0750 | 7.234 |
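The DPO rows above are trained on (chosen, rejected) preference pairs. A minimal sketch of how such pairs might be assembled from verifier-scored explanations — the field names and the scoring margin are assumptions, not this repo's actual pair-construction pipeline:

```python
def build_dpo_pairs(samples: list[dict], margin: float = 0.5) -> list[dict]:
    """For each question, pair the highest- and lowest-scored explanations
    when their verifier scores differ by at least `margin`."""
    by_question: dict[int, list[dict]] = {}
    for s in samples:
        by_question.setdefault(s["question_id"], []).append(s)

    pairs = []
    for group in by_question.values():
        group.sort(key=lambda s: s["verifier_score"], reverse=True)
        best, worst = group[0], group[-1]
        if best["verifier_score"] - worst["verifier_score"] >= margin:
            pairs.append({
                "prompt": best["prompt"],
                "chosen": best["explanation"],
                "rejected": worst["explanation"],
            })
    return pairs

samples = [
    {"question_id": 1, "prompt": "Q1", "explanation": "good", "verifier_score": 3.4},
    {"question_id": 1, "prompt": "Q1", "explanation": "weak", "verifier_score": 2.1},
    {"question_id": 2, "prompt": "Q2", "explanation": "a", "verifier_score": 3.0},
    {"question_id": 2, "prompt": "Q2", "explanation": "b", "verifier_score": 2.9},
]
pairs = build_dpo_pairs(samples)
print(len(pairs))  # → 1 (question 2's 0.1 margin is below the 0.5 threshold)
```

The resulting `{"prompt", "chosen", "rejected"}` records match the shape `trl`'s `DPOTrainer` expects for its training dataset.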
## RLearner-LLM Results — Sydney Biology (100-question test set)

| Model | BLEU ↑ | BERT(Stu) ↑ | BERT(Ans) ↑ | ACR ↑ | NLI ↑ | Verifier ↑ | Time (s) ↓ |
|---|---|---|---|---|---|---|---|
| SFT (LLaMA-2-13B + LoRA) | 0.0222 | 0.8244 | 0.7870 | 0.6249 | 0.0537 | 3.1937 | 19.001 |
| DPO v1 (165 pairs) | 0.0314 | 0.8262 | 0.8272 | 0.6034 | 0.2171 | 2.9094 | 9.049 |
| DPO v2 (458 pairs) | 0.0364 | 0.8367 | 0.8426 | 0.6290 | 0.2774 | 2.9474 | 6.370 |
| PPO (125 steps) | 0.0421 | 0.8364 | 0.8294 | 0.6606 | 0.2269 | 2.9609 | 7.596 |
**Key finding:** NLI entailment is the most discriminative metric: SFT scores ~0.05, while every RL-trained model reaches 0.22–0.30 (a 4–5× gap).
See README_RL.md for full analysis, metric definitions, and future experiments.
We introduce "ILearner-LLM", a framework that uses iterative enhancement with LLMs to improve generated explanations. The paper was accepted to the Proceedings of the AAAI Conference on Artificial Intelligence 2025.