TempusLogic bundles two complementary evaluation tracks for advanced natural language processing research:
- Logical reasoning diagnostics based on custom dyad and triad arguments.
- Mathematical reasoning stress tests on the GSM8K benchmark with adversarial prompting.
The repository is organised so you can run either track independently or compare results across both to understand model robustness to context edits and misleading exemplars.
```
TempusLogic/
├── data/                  # Shared data assets
├── dyad_triad_gen.py      # Generator for dyad/triad problem sets
├── eval.py                # Evaluation script for logical reasoning
├── eval_utils.py          # Parsing and scoring helpers for dyads/triads
├── gsm8k/                 # GSM8K robustness experiments
│   ├── bad_answers.py
│   ├── count_token.py
│   ├── test.py
│   ├── data/
│   ├── logs/
│   ├── plots/
│   └── results/
├── parse_utils.py         # Text parsing utilities shared by experiments
├── req.py                 # API request helpers
└── requirements.txt
```
```
pip install -r requirements.txt
```

Set the required API keys as environment variables before running any experiment:

```powershell
# PowerShell example
$env:OPENROUTER = "your-openrouter-api-key"
$env:GROQ = "your-groq-api-key"
$env:GENAI_API_KEY = "your-google-genai-api-key"
```

This track probes how well a model preserves conclusions when the supporting context is modified.
- Dyad: Two statements, one question, answer in {yes, no, inconclusive}.
- Triad: Three statements, one question, answer in {yes, no, inconclusive}.
- Modified variants: Additional statements alter the context while keeping the question identical.
- `data/dyads.txt`, `data/dyads_gpt5.txt`
- `data/triads.txt`, `data/triads_gpt5.txt`
Entries follow the template:
**Dyad N**
1. Statement 1
2. Statement 2
**Question:** Does ... ?
**Answer:** Yes/No/Inconclusive
**Modified Dyad N**
1. Additional statement ...
2. Additional statement ...
**Question:** (same as above)
**Answer:** Yes/No/Inconclusive
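Entries in this template can be parsed mechanically. The snippet below is an illustrative sketch of such a parser; the repository's actual parsing logic lives in `eval_utils.py` and `parse_utils.py`:

```python
import re

def parse_entry(text: str) -> dict:
    """Parse one dyad/triad entry in the template above into a dict."""
    header = re.search(r"\*\*(Modified )?(Dyad|Triad) (\d+)\*\*", text)
    statements = re.findall(r"^\d+\.\s*(.+)$", text, flags=re.MULTILINE)
    question = re.search(r"\*\*Question:\*\*\s*(.+)", text)
    answer = re.search(r"\*\*Answer:\*\*\s*(\w+)", text)
    return {
        "modified": header.group(1) is not None,  # "Modified Dyad N" vs "Dyad N"
        "kind": header.group(2).lower(),          # "dyad" or "triad"
        "index": int(header.group(3)),
        "statements": statements,
        "question": question.group(1).strip(),
        "answer": answer.group(1).lower(),        # "yes" / "no" / "inconclusive"
    }
```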
```
python dyad_triad_gen.py
```

- Uses the configured LLM (default: GPT-oss-120b) to synthesise fresh dyads/triads.
- Adds modified counterparts that perturb the original context while keeping the question fixed.
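Structurally, a modified counterpart just extends the statement list while keeping the question fixed (the gold answer may change). A toy illustration of that shape — the actual script uses an LLM to write the extra statements:

```python
def make_modified(entry, extra_statements, new_answer):
    """Return a modified variant: same question, extended context."""
    return {
        **entry,
        "statements": entry["statements"] + list(extra_statements),
        "answer": new_answer,
        "modified": True,
    }

dyad = {
    "kind": "dyad", "index": 1, "modified": False,
    "statements": ["All cats are mammals.", "Tom is a cat."],
    "question": "Is Tom a mammal?",
    "answer": "yes",
}
variant = make_modified(dyad, ["Tom is actually a robot toy."], "no")
```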
```
python eval.py
```

- Reads problem sets from `data/`.
- Sends each prompt through the API gateway defined in `req.py`.
- Computes accuracy, precision, recall, and confusion matrices for original vs. modified arguments.
- Prints metrics to stdout; redirect to a file if needed for record keeping.
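The metrics themselves are standard. Below is a self-contained sketch of the kind of scoring `eval.py` performs over the three-way labels (illustrative, not the script's actual code):

```python
from collections import Counter

LABELS = ("yes", "no", "inconclusive")

def score(gold, pred):
    """Accuracy plus per-label precision/recall over the three answers."""
    confusion = Counter(zip(gold, pred))  # (gold, predicted) -> count
    accuracy = sum(confusion[(l, l)] for l in LABELS) / len(gold)
    metrics = {"accuracy": accuracy}
    for label in LABELS:
        tp = confusion[(label, label)]
        predicted = sum(confusion[(g, label)] for g in LABELS)
        actual = sum(confusion[(label, p)] for p in LABELS)
        metrics[label] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
        }
    return metrics
```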
Typical findings (example using Kimi K2 instruct 1T) highlight that modified arguments reduce accuracy, especially on triads where recall drops sharply—suggesting sensitivity to contextual shifts.
This track inspects how GSM8K mathematical reasoning performance changes when prompts include misleading exemplars.
- `gsm8k/bad_answers.py`: Generates intentionally wrong solutions for training examples.
- `gsm8k/test.py`: Runs comparative evaluations with bad prompts, good prompts, and zero-shot queries.
- `gsm8k/count_token.py`: Estimates token budgets for large-context experiments.
```
cd gsm8k
python bad_answers.py
```

- Downloads the GSM8K split from Hugging Face on first run.
- Queries the selected model (e.g. `models/gemini-2.5-flash-lite` or an Ollama-hosted alternative) with a system prompt that demands incorrect answers.
- Verifies numeric disagreement with the ground-truth response before accepting a bad answer.
- Persists progress to `data/gsm8k_with_bad_llm_answers.parquet` for recoverability.
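The disagreement check can be approximated by extracting the final number from each solution and comparing. A sketch under the assumption that GSM8K-style answers end with a numeric value (helper names are hypothetical):

```python
import re

def final_number(solution):
    """Extract the last number in a solution string, ignoring commas."""
    matches = re.findall(r"-?\d[\d,]*\.?\d*", solution)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def is_valid_bad_answer(bad, gold):
    """Accept a 'bad' answer only if its final number disagrees with gold."""
    b, g = final_number(bad), final_number(gold)
    return b is not None and g is not None and b != g
```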
Edit the configuration block at the bottom of `gsm8k/test.py`, then execute:
```
python gsm8k/test.py
```

The script:

- Builds prompts containing `num_bad_examples` exemplars (either incorrect or correct).
- Evaluates `num_tests` held-out questions under three conditions:
  - Bad prompt (misleading exemplars).
  - Good prompt (correct exemplars).
  - No prompt (question only).
- Supports three prompt templates:
  - Test 1: Standard few-shot prompting.
  - Test 2: Forces the model to complete a truncated incorrect answer.
  - Test 3: Asks the model to re-answer a question already answered incorrectly.
- Scores outputs via numeric comparison or by delegating to a stronger LLM (`models/gemma-3-12b-it`) for qualitative checking.
- Logs detailed traces in `gsm8k/logs/` and appends summary statistics to `gsm8k/results/scores.csv`.
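The three conditions differ only in which exemplars are prepended to the question. A simplified sketch of that prompt assembly (the wording and helper name are illustrative, not the script's actual template):

```python
def build_prompt(question, exemplars=None):
    """Assemble a few-shot prompt; exemplars is a list of (q, a) pairs."""
    parts = []
    for q, a in (exemplars or []):
        parts.append(f"Q: {q}\nA: {a}\n")
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)

good = [("2 + 2?", "4"), ("3 * 3?", "9")]   # correct exemplars
bad = [("2 + 2?", "5"), ("3 * 3?", "7")]    # misleading exemplars
conditions = {
    "bad_prompt": build_prompt("5 + 5?", bad),
    "good_prompt": build_prompt("5 + 5?", good),
    "no_prompt": build_prompt("5 + 5?"),
}
```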
```
python gsm8k/count_token.py
```

- Uses the GPT-2 tokenizer to approximate token counts.
- Reports milestone context sizes (4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, 1M tokens) to guide prompt design.
- Provides average tokens per GSM8K Q&A pair (~817 tokens per five samples in the supplied dataset).
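For quick budgeting without loading a tokenizer, a rough stdlib stand-in can suffice. The ~1.3-tokens-per-word ratio below is a common heuristic (the real script uses the actual GPT-2 tokenizer), and treating ~817 tokens as a per-pair average is an assumption:

```python
def approx_tokens(text):
    """Rough GPT-2 token estimate: ~1.3 tokens per whitespace word."""
    return int(len(text.split()) * 1.3)

def pairs_per_context(avg_tokens_per_pair, context_sizes=(4096, 8192, 16384)):
    """How many average Q&A pairs fit into each milestone context size."""
    return {size: size // avg_tokens_per_pair for size in context_sizes}
```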
- Enable retry logic (`retry=True`) when using rate-limited APIs; exponential backoff is implemented by default.
- Set `DRY_RUN=1` to exercise the pipelines without incurring API calls during development.
- Keep API keys secure: `keys.py` is intentionally excluded from version control.
- Large runs can be expensive; start with a small `num_tests` to validate settings.
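The backoff behaviour mentioned above can be sketched as a small wrapper (illustrative; the real implementation lives in `req.py`):

```python
import time

def with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on failure with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Injecting the `sleep` function keeps the wrapper testable without incurring real delays.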
Contributions are welcome. Please open an issue or pull request describing the change and the evaluation track it targets.
Project repository for the course Advanced Natural Language Processing.
To test models for context rot on code-related tasks, our focus has been to design the syntax for a toy language and to feed the model increasingly large code snippets written in that language. This draws the model's focus onto the context we provide, since the code examples will not be part of its training data, and as the snippets grow this becomes an effective test for context rot.
However, at the start of this project we were unsure how capable models would be when faced with such a task. Since we did not know how hard we could push them in the first place, we spent the last month gauging their understanding of code written in existing languages. The L-99 problem set consists of 99 basic problems for people picking up Lisp for the first time, and many variations of it exist for other languages. We chose to test the model on this set of problems in both Lisp and Python. The idea is simple: we feed the model the code, which grows increasingly complex from problem 1 to 99, and over this set of problems we test whether the model can work out what each snippet is trying to accomplish. Lisp and Python were chosen because while Python features prominently in the model's training data, Lisp features much less substantially, providing a good mix of model familiarity with the code.