This repository provides a fully reproducible pipeline for studying whether language models (BERT / Qwen2 / LLaMA) can perform systematic logical reasoning, and how robustly that reasoning survives rule perturbations.
It includes:
- Synthetic data generation with multiple controlled perturbation variants
- LoRA-based model training with a two-stage training pipeline
- Multiple training strategies: SFT, Generative, Mixed, DPO, CoT, Fusion, RA-CoT
- Detailed evaluation across 11 test splits with prediction logging
- Logical equivalence stress tests (single-law & multi-law)
- Real-world NLI generalization evaluation (LogicNLI / MNLI)
```bash
conda create -n logic python=3.10 -y
conda activate logic
pip install -r requirements.txt
```
```
├── train.py                 # Main LoRA training script
├── evaluate.py              # Main evaluation suite
├── data_gen.py              # Data generator for all variants
├── requirements.txt
│
├── data/
│   ├── train.csv                    # Base training set (80%)
│   ├── test_base.csv                # Base test set (20%)
│   ├── test_variant{1-3}.csv        # Rule perturbation variants
│   ├── test_variant4_equiv_*.csv    # Logical equivalence variants (×7)
│   ├── train_cot.csv                # CoT training data
│   ├── train_dpo.jsonl              # DPO preference pairs
│   ├── train_fusion.csv             # Fusion (SFT+CoT) training data
│   ├── train_mixed.csv              # Mixed generative training data
│   ├── train_ra_cot.csv             # RA-CoT training data
│   └── real_world/                  # LogicNLI / MNLI evaluation data
│
├── scripts/
│   ├── data_generation/     # Data generation scripts
│   ├── training/            # Advanced training scripts
│   ├── evaluation/          # Extended evaluation scripts
│   └── utils/               # Utilities, debug, reporting
│
├── evals_data/              # OpenAI Evals format test data
├── evals_submission/        # OpenAI Evals submission
├── results/                 # Evaluation summary CSVs
└── docs/                    # Documentation, paper, reports
```
```
┌──────────────────────────────────────────────────────────────────────────┐
│ DATA GENERATION                                                          │
│                                                                          │
│ data_gen.py ──► train.csv / test_base.csv / test_variant{1-4}.csv        │
│                                                                          │
│ scripts/data_generation/                                                 │
│   stage1_data_gen.py       ──► data/stage1_train_{bert,generative}.csv   │
│   generate_cot_data.py     ──► data/train_cot.csv                        │
│   generate_dpo_data.py     ──► data/train_dpo.jsonl                      │
│   generate_fusion_data.py  ──► data/train_fusion.csv                     │
│   generate_mixed_data.py   ──► data/train_mixed.csv                      │
│   generate_ra_cot_data.py  ──► data/train_ra_cot.csv                     │
└──────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ STAGE 1 TRAINING                                                         │
│                                                                          │
│  ┌─────────────────────┐    ┌────────────────────────────────────────┐   │
│  │ SFT on variant      │    │ Generative: rule generation task       │   │
│  │ 2/3 style data      │    │ facts+masked_rules → missing_rule      │   │
│  │ stage1_train.py     │    │ stage1_train_generative.py             │   │
│  └──────────┬──────────┘    └──────────────────┬─────────────────────┘   │
│             │                                  │                         │
│  bert_stage1 / qwen_stage1           qwen_stage1_gen checkpoint          │
└─────────────┼──────────────────────────────────┼─────────────────────────┘
              │                                  │
              └────────────────┬─────────────────┘
                               ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ STAGE 2 TRAINING                                                         │
│ (fine-tune on Stage-1 checkpoint; Qwen2 / LLaMA)                         │
│                                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌───────────┐     │
│  │ Mixed SFT    │  │ DPO          │  │ CoT          │  │ Fusion    │     │
│  │ (T/F + rule  │  │ (preference  │  │ (step-by-    │  │ (SFT+CoT) │     │
│  │ prediction)  │  │ pairs)       │  │ step trace)  │  │           │     │
│  │ stage2_train_│  │ stage2_train_│  │ stage2_train_│  │ stage2_   │     │
│  │ generative.py│  │ dpo.py       │  │ cot.py       │  │ fusion.py │     │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  └─────┬─────┘     │
│         │                 │                 │                │           │
│  qwen_stage2_mixed  qwen_stage2_dpo    (cot model)    (fusion model)     │
│  llama_stage2_mixed                                                      │
└──────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ EVALUATION                                                               │
│                                                                          │
│ evaluate.py ──► 11 synthetic splits (base + variant1-3 + variant4 ×7)    │
│                                                                          │
│ scripts/evaluation/                                                      │
│   evaluate_real_world.py  ──► LogicNLI / MNLI generalization test        │
│   evaluate_cot.py         ──► CoT model evaluation                       │
│   evaluate_generative.py  ──► Generative model evaluation                │
└──────────────────────────────────────────────────────────────────────────┘
```
```bash
python data_gen.py
```

| Split | Description | Expected Behavior |
|---|---|---|
| `test_base.csv` | Original reasoning chain | All correct |
| `test_variant1.csv` | Redundant rule removed | Unchanged answers |
| `test_variant2.csv` | Critical rule removed | Answers change |
| `test_variant3.csv` | Contradictory facts injected | All False |
| `test_variant4_equiv_contrapositive.csv` | Contrapositive rewrite | Unchanged answers |
| `test_variant4_equiv_double_negation.csv` | Double negation rewrite | Unchanged answers |
| `test_variant4_equiv_implication.csv` | Implication law rewrite | Unchanged answers |
| `test_variant4_equiv_demorgan.csv` | De Morgan rewrite | Unchanged answers |
| `test_variant4_equiv_identity.csv` | Identity rewrite | Unchanged answers |
| `test_variant4_equiv_commutativity.csv` | Commutativity rewrite | Unchanged answers |
| `test_variant4_equiv_multi.csv` | 2–5 laws combined | Unchanged answers |
```bash
python train.py --model bert     # BERT-base-uncased
python train.py --model qwen     # Qwen2-1.5B
python train.py --model llama    # TinyLlama-1.1B
```

Stage 1:

```bash
python scripts/data_generation/stage1_data_gen.py
python scripts/training/stage1_train.py --model qwen
python scripts/training/stage1_train_generative.py --model qwen
```

Stage 2 (run after Stage 1):

```bash
# Generate data first
python scripts/data_generation/generate_mixed_data.py
python scripts/data_generation/generate_dpo_data.py
python scripts/data_generation/generate_cot_data.py

# Train
python scripts/training/stage2_train_generative.py   # Mixed SFT
python scripts/training/stage2_train_dpo.py          # DPO
python scripts/training/stage2_train_cot.py          # CoT
python scripts/training/stage2_train_fusion.py       # Fusion
```

```bash
python evaluate.py --model bert
python evaluate.py --model qwen
python evaluate.py --model llama
```

Predictions are saved to `trained_models/{model}/predictions/{model}_{split}_predictions.csv`.
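A split's accuracy can be recomputed from its predictions CSV with only the standard library. This is a sketch, assuming the `ground_truth` / `prediction` columns documented in the appendix; the demo uses an in-memory file rather than a real predictions CSV.

```python
# Sketch: recompute a split's accuracy from a predictions CSV (stdlib only).
import csv
import io

def split_accuracy(csv_file):
    """Fraction of rows where the model's prediction matches ground truth."""
    rows = list(csv.DictReader(csv_file))
    correct = sum(r["prediction"] == r["ground_truth"] for r in rows)
    return correct / len(rows)

# Tiny in-memory stand-in for a real predictions file:
demo = io.StringIO(
    "facts,rules,question,ground_truth,prediction\n"
    "f,r,Anne is cold,True,True\n"
    "f,r,Anne is rough,True,False\n"
)
print(split_accuracy(demo))  # → 0.5
```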
– = not evaluated on this split. V4-avg = average over all evaluated logical equivalence splits.
| Model | Strategy | Base | V1 | V2 | V3 | V4-avg |
|---|---|---|---|---|---|---|
| BERT | Stage-1 SFT | 1.000 | 1.000 | 0.295 | 0.000 | 0.999 |
| BERT (stage2) | Stage-2 SFT | 1.000 | 1.000 | 0.250 | 0.000 | 1.000 |
| LLaMA (TinyLlama) | Stage-1 SFT | 1.000 | 1.000 | 0.250 | 0.000 | 0.999 |
| LLaMA | Stage-2 Mixed | 0.538 | 0.693 | 0.533 | 0.145 | 0.797 |
| Qwen2-1.5B | Stage-1 SFT | 1.000 | 1.000 | 0.250 | 0.000 | 0.943 |
| Qwen2 | Stage-1 Generative | 0.175 | 0.185 | 0.555 | 0.908 | 0.165 † |
| Qwen2 | Stage-2 DPO | 0.000 | 0.000 | 0.750 | 1.000 | 0.000 † |
| Qwen2 | Stage-2 Mixed | 0.525 | 0.938 | 0.405 | 0.973 | 0.444 |
| Qwen2 | Stage-2 Mixed+Aug | 0.488 | 0.908 | 0.450 | 0.988 | 0.400 |
† Only 3 of 7 V4 splits evaluated.
| Law | BERT | BERT-S2 | LLaMA | LLaMA-S2 | Qwen | Qwen-Mixed |
|---|---|---|---|---|---|---|
| Commutativity | 0.993 | 1.000 | 1.000 | 0.858 | 1.000 | 0.498 |
| Contrapositive | 1.000 | 1.000 | 1.000 | 0.705 | 1.000 | 0.318 |
| De Morgan | 1.000 | 1.000 | 1.000 | 0.910 | 1.000 | 0.163 |
| Double Negation | 1.000 | 1.000 | 1.000 | 0.803 | 1.000 | 0.545 |
| Identity | 1.000 | 1.000 | 1.000 | 0.815 | 1.000 | 0.570 |
| Implication | 1.000 | 1.000 | 1.000 | 0.745 | 0.953 | 0.590 |
| Multi-law | 1.000 | 1.000 | 0.993 | 0.745 | 0.645 | 0.428 |
Models trained on synthetic logic data were evaluated on real-world NLI benchmarks.
| Model | Dataset | Predictions | Accuracy |
|---|---|---|---|
| Qwen2 Fusion-Conflict | LogicNLI (n=500) | All "Unknown" | 0.000 |
| Qwen2 Fusion-Conflict | MNLI (n=349) | All "Unknown" | 0.000 |
| Qwen2 RealWorld-SFT | LogicNLI (n=500) | All "Unknown" | 0.000 |
All models predict "Unknown" on real-world NLI, indicating zero generalization from synthetic logic reasoning to natural language inference. The reasoning skills learned are tightly coupled to the synthetic template format.
Standard LoRA SFT is robust to logical equivalence but brittle to contradictions:
- Perfect on Base / Variant 1 / Variant 4 (equivalence rewriting)
- ~0.25 on Variant 2 (near-random once the critical rule is removed): the model relies on complete rule chains
- 0.00 on Variant 3 (contradictions fully break reasoning)
Mixed/Generative training recovers contradiction robustness at a cost:
- Stage-2 Mixed reaches 0.97–0.99 on Variant 3 and 0.94 on Variant 1
- But Variant 4 (logical equivalence) accuracy drops to ~0.40–0.45
- Variant 2 also stays weak at 0.40–0.45
DPO maximizes contradiction robustness but collapses elsewhere:
- Best Variant 3 (1.00) and Variant 2 (0.75)
- Catastrophic failure on Base (0.00) and all Variant 4 splits: the model collapses to always predicting False
Core trade-off:
Models robust to logical equivalence rewrites (SFT) are brittle to contradictions. Models that handle contradictions (Mixed/DPO) lose logical equivalence robustness. No single training strategy dominates across all perturbation types.
No generalization to real-world NLI:
All models predict "Unknown" on LogicNLI and MNLI, showing the learned reasoning is format-specific and does not transfer to natural language.
We submitted the Variant 3 test set to the Humanity's Last Exam benchmark. All state-of-the-art models fail, including claude-sonnet-4-5, gpt-4.1, gpt-5.2, claude-opus-4-5, and gemini-3-pro-preview.
```
Facts: Anne is green or blue
Rules:
  If someone is green then they are cold.
  If someone is blue then they are cold.
  If someone is cold then they are rough.
  If someone is not young then they are not rough.
  If someone is young then they are cold.
  If someone is young then they are nice.
```
| Q | Base | V1 (remove redundant) | V2 (remove key) | V3 (contradiction) |
|---|---|---|---|---|
| Anne is cold | T | T | T | F |
| Anne is rough | T | T | F | F |
| Anne is young | T | T | F | F |
| Anne is nice | T | T | F | F |
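The ground-truth answers above follow from forward chaining over the rules. The toy solver below mirrors that semantics for a simplified version of the example (a single fact "green" instead of the disjunction, and rules as premise/conclusion pairs); it is an illustration, not the repo's actual solver. Removing the critical rule ("cold" → "rough"), as Variant 2 does, flips "Anne is rough" from True to False:

```python
# Toy forward-chaining closure: propositions are strings, rules are
# (premise, conclusion) pairs. Illustrative only.

def closure(facts, rules):
    """Repeatedly apply rules until no new propositions can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

rules = [("green", "cold"), ("blue", "cold"), ("cold", "rough")]
base = closure({"green"}, rules)                                  # full chain
v2 = closure({"green"}, [r for r in rules if r != ("cold", "rough")])  # critical rule removed
print("rough" in base, "rough" in v2)  # → True False
```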
Original: If someone is green then they are cold.
| Law | Rewritten Form |
|---|---|
| Contrapositive | If someone is not cold then they are not green. |
| Double Negation | If someone is not not green then they are not not cold. |
| Implication | Someone is not green or they are cold. |
| De Morgan | If someone is not green and not blue then they are not cold. |
| Identity | If someone is not not green then they are cold. |
| Commutativity | If someone is blue or green then they are cold. |
| Multi-law | `equiv_laws_used="contrapositive,implication,demorgan"` |
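For rules in the fixed template "If someone is X then they are Y.", the single-law rewrites above reduce to string transformations. A minimal regex sketch for two of the laws (a real generator would work on a parsed rule representation; these function names are illustrative):

```python
# String-level sketches of two equivalence-law rewrites for rules of the
# form "If someone is X then they are Y." Illustrative, not the repo's API.
import re

RULE = re.compile(r"If someone is (.+) then they are (.+)\.")

def contrapositive(rule):
    """If X then Y  ==>  If not Y then not X."""
    x, y = RULE.fullmatch(rule).groups()
    return f"If someone is not {y} then they are not {x}."

def double_negation(rule):
    """If X then Y  ==>  If not not X then not not Y."""
    x, y = RULE.fullmatch(rule).groups()
    return f"If someone is not not {x} then they are not not {y}."

r = "If someone is green then they are cold."
print(contrapositive(r))   # → If someone is not cold then they are not green.
print(double_negation(r))  # → If someone is not not green then they are not not cold.
```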
Predictions saved to `trained_models/{model}/predictions/{model}_{split}_predictions.csv`:

| Column | Description |
|---|---|
| `facts` | Input facts |
| `rules` | Rule list |
| `question` | Question text |
| `ground_truth` | Correct answer |
| `prediction` | Model prediction |
| `equiv_laws_used` | Logical laws applied (V4 only) |
| `equiv_law_count` | Number of laws applied |
| `changed_rule` | Human-readable description of the change |