A lightweight Python environment for NLP experiments using LLMs via API calls (primarily OpenAI). Each module has a single, focused purpose to keep experiments readable and easy to modify.
```
├── config.py                    # Configuration — reads settings from env vars
├── client.py                    # OpenAI client factory
├── prompts.py                   # Prompt template building and management
├── experiments/
│   ├── base.py                  # Abstract base class for experiments
│   ├── completion.py            # General-purpose text completion experiment
│   ├── acorn_laaj.py            # ACORN explanation-quality LLM-as-a-Judge experiment
│   └── llm_reasoning_survey.py  # Claim-level paper annotation for reasoning surveys
├── utils/
│   └── logging.py               # Logging configuration helpers
├── tests/                       # Pytest unit tests
├── .env.example                 # Environment variable template (never commit .env)
├── requirements.txt             # Runtime dependencies (uv-compatible)
└── requirements-dev.txt         # Development / test dependencies (uv-compatible)
```
```
git clone https://github.com/a-brassard/LLM_prompt_env.git
cd LLM_prompt_env
uv venv .venv
source .venv/bin/activate               # Windows: .venv\Scripts\activate
uv pip install -r requirements-dev.txt  # includes runtime deps
```

If uv is not installed yet, install it first:

```
brew install uv
```

```
cp .env.example .env
# open .env and fill in OPENAI_API_KEY=sk-...
```

```python
from config import Config
from client import build_client
from experiments.completion import TextCompletionExperiment
from utils.logging import setup_logging

setup_logging()
config = Config()  # reads OPENAI_API_KEY from environment
client = build_client(config)

exp = TextCompletionExperiment(
    config,
    client,
    system="You are a helpful assistant.",
)
result = exp.run("Summarize the history of transformers in one sentence.")
print(result)
```

Run the test suite with:

```
uv run pytest
```

Use `ACORNLaaJExperiment` when you want to score ACORN explanations with the same AMT-style rubric prompt used in prior experiments.
```python
from client import build_client
from config import Config
from experiments.acorn_laaj import ACORNLaaJExperiment

config = Config()
client = build_client(config)
exp = ACORNLaaJExperiment(config, client)

sample = {
    "question": "It got dark outside. What happened as a result?",
    "choices": [
        "Snowflakes began to fall from the sky.",
        "The moon became visible in the sky.",
    ],
    "explanation": "At night, darkness makes the moon visible.",
}

response = exp.run_sample(sample, temperature=0)
print(response)  # raw model text

ratings = exp.parse_ratings(response)
print(ratings)
# {'supports': 1, 'overall': 4, 'well_written': 1, 'related': 1,
#  'factual': 1, 'new_info': 2, 'unnecessary_info': 0, 'contrastive': 0}
```

- Keep your `.env` out of version control; only `.env.example` is tracked.
- `Config` raises an `EnvironmentError` if a required variable is missing.
- Project coding guidance lives in `.github/copilot-instructions.md`.
- Logging standard: capture run metadata, API-call settings, and outcomes to support reproducibility.
Optional environment variables for logging and API-call behavior:
```
OPENAI_MAX_RETRIES=2  # default: 2
```

By default, experiment logs include structured JSON events with run IDs, model/settings metadata, token usage, durations, and error details.
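As a rough illustration of that pattern (field names here are invented; the project's actual helpers live in `utils/logging.py`), one structured JSON event might be emitted like so:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("experiment")

def log_event(event: str, **fields) -> dict:
    """Emit one structured JSON log line carrying run metadata."""
    record = {"event": event, "ts": time.time(), **fields}
    logger.info(json.dumps(record))
    return record

run_id = str(uuid.uuid4())
event = log_event(
    "api_call",
    run_id=run_id,
    model="gpt-4o",
    temperature=0.0,
    prompt_tokens=412,     # example token usage, not real numbers
    completion_tokens=88,
    duration_s=1.73,
)
```

One JSON object per line keeps logs both greppable and machine-parseable.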
scripts/run_acorn_batch.py processes a JSONL dataset end-to-end and writes
rated results to a JSONL file. If the output file already exists, the script
resumes by default (appends and skips already written rows).
```
python scripts/run_acorn_batch.py \
  --input data/ACORN.jsonl \
  --output results/acorn_rated.jsonl \
  --model gpt-4o \
  --temperature 0
```

Key arguments:
| Flag | Default | Description |
|---|---|---|
| `--input` | (required) | Input JSONL path. Each line needs `question`, `choices`, `explanation`. |
| `--output` | (required) | Output JSONL path (parent dirs created automatically). |
| `--model` | `OPENAI_DEFAULT_MODEL` | Model name; override here or via the env var. |
| `--temperature` | `0.0` | Sampling temperature. |
| `--limit` | (none) | Process only the first N samples (for dry runs). |
| `--log-file` | (none) | Write logs to this file in addition to stdout; monitor with `tail -f`. |
| `--overwrite` | `false` | Truncate `--output` and start from scratch instead of resuming. |
Each output line mirrors the input and adds `batch_run_id`, `item_run_id`, `model`, `temperature`, `response`, `ratings`, `error`, and `ts` fields.
Results are flushed to disk after every item, so partial results survive a crash.
To restart from zero on an existing output path, pass --overwrite.
The script exits with code 1 if any item failed.
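The resume-and-flush behavior described above can be sketched roughly as follows (helper and field names here are illustrative; the real logic lives in `scripts/run_acorn_batch.py`):

```python
import json
from pathlib import Path

def run_batch(samples, rate_fn, output_path, overwrite=False):
    """Rate samples one by one, appending each result and flushing immediately.

    If the output file already exists, previously written rows are skipped
    so an interrupted run can resume where it left off.
    """
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    done = 0
    if out.exists() and not overwrite:
        done = sum(1 for _ in out.open())  # count rows already written
    failures = 0
    with out.open("w" if overwrite else "a") as f:
        for i, sample in enumerate(samples):
            if i < done:                   # already processed: skip on resume
                continue
            row = dict(sample)
            try:
                row["ratings"] = rate_fn(sample)
                row["error"] = None
            except Exception as exc:
                row["ratings"], row["error"] = None, str(exc)
                failures += 1
            f.write(json.dumps(row) + "\n")
            f.flush()                      # partial results survive a crash
    return failures                        # nonzero would map to exit code 1
```

Counting existing lines as "done" assumes inputs are processed in a stable order, which is what makes append-based resume safe.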
scripts/run_reasoning_survey_batch.py annotates paper PDFs and writes
paper-level JSONL rows (one main-claim row per paper),
including taxonomy labels, verdict-with-evidence, and supporting passages.
This survey catalogues how the field asks "Can LLMs reason?" under operationalized conditions. Each paper must be rephraseable as:
Assuming [performance_bar] [ideal_target] [goal/skill/mechanism] in a [testing_environment] with [relationship_with_language], can LLMs [operationalization]?
Papers that cannot be framed this way are marked exclude=true and excluded from operationalization labeling.
Examples:
- "Can GPT-4 solve grade-school math?" → "Assuming layman-level precise skill in a structured setting integrated with language, can GPT-4 solve grade-school arithmetic problems?"
- "Does Claude have an internal representation of other agents' beliefs?" → "Assuming expert-level precise mechanism in a naturalistic setting integrated with language, can Claude develop theory-of-mind representations in activation space?"
```
python scripts/run_reasoning_survey_batch.py \
  --papers-dir "data/Reasoning x LLM survey papers" \
  --output results/reasoning_survey_claims.jsonl \
  --metadata-output results/reasoning_survey_run_metadata.json \
  --model gpt-5.4 \
  --temperature 0
```

Key arguments:
| Flag | Default | Description |
|---|---|---|
| `--papers-dir` | `data/Reasoning x LLM survey papers` | Directory of PDF papers to annotate. |
| `--output` | `results/reasoning_survey_claims.jsonl` | Claim-level JSONL output path. |
| `--metadata-output` | `results/reasoning_survey_run_metadata.json` | Run summary JSON output path. |
| `--model` | `OPENAI_DEFAULT_MODEL` | Model name to use for annotation. |
| `--sections` | `abstract introduction experiments evaluation conclusion` | Priority section keywords used for page ranking before fallback. |
| `--max-chars` | `30000` | Max extracted text characters sent per paper. |
| `--timeout-retries` | `2` | Retry count for timeout failures on a single paper. |
| `--timeout-backoff-seconds` | `5.0` | Base exponential backoff between timeout retries. |
| `--request-timeout-seconds` | (none) | Optional per-request timeout override for model calls. |
| `--limit` | (none) | Process only the first N papers (pilot run). |
| `--overwrite` | `false` | Restart from scratch instead of resume mode. |
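The `--sections` page ranking might work roughly as follows: pages mentioning an earlier keyword in the list are preferred, with unmatched pages as fallback. A sketch under that assumption, not the script's actual implementation:

```python
def rank_pages(pages, sections=("abstract", "introduction", "experiments",
                                "evaluation", "conclusion")):
    """Order page indices so pages matching higher-priority keywords come first.

    `pages` is a list of page-text strings; ties keep original page order.
    """
    def priority(item):
        index, text = item
        lowered = text.lower()
        for rank, keyword in enumerate(sections):
            if keyword in lowered:
                return (rank, index)
        return (len(sections), index)   # fallback: no keyword matched
    ordered = sorted(enumerate(pages), key=priority)
    return [index for index, _ in ordered]
```

Ranked page text would then be concatenated and truncated to `--max-chars` before being sent to the model.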
The runner resumes by default: if output already exists, papers with matching
paper_id are skipped. For each successful paper, the script writes one or
more rows (typically one per paper under current policy) with paper_id, operationalization_id,
taxonomy labels, structured verdict, model settings, and run IDs.
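The timeout handling behind `--timeout-retries` and `--timeout-backoff-seconds` amounts to exponential backoff per paper; a minimal sketch (the real script's wrapper may differ):

```python
import time

def call_with_timeout_retries(fn, retries=2, backoff_seconds=5.0, sleep=time.sleep):
    """Retry `fn` on TimeoutError, waiting backoff_seconds * 2**attempt between tries."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except TimeoutError:
            if attempt == retries:
                raise                               # out of retries: surface the error
            sleep(backoff_seconds * (2 ** attempt))  # 5s, 10s, 20s, ...
```

Injecting `sleep` keeps the helper testable without real waiting.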
Current extraction policy keeps only the paper's main claim:
- exactly one operationalization entry is retained per included paper
- additional significant settings should be summarized in `notes`
- if multiple entries are returned, the parser prefers `is_main_claim=true`; otherwise it picks the most complete candidate

Each paper also gets a scope pre-tagging decision:

- `exclude=false`: normal operationalization labeling is produced.
- `exclude=true`: the row includes `exclusion_justification`, and `individuation`, `operationalization`, and `verdict` are `null`.
Additional row-level fields:
- `core_question`: one general framing question (guided by `level_of_analysis`: goal/skill/mechanism) describing how reasoning is evaluated and what counts as success
- `reasoning_general_link`: why this setup informs reasoning in general
- `reasoning_definition`: explicit paper definition of "reasoning" when present, with supporting citation passages; `null` if absent
- `paper_type`: `empirical | theoretical | survey | other`
- `prescriptive_or_descriptive`: `descriptive | prescriptive | both`
Within `operationalization.level_of_analysis`, the output also includes `focus_summary`: a short, non-quoted summary of which specific goal, skill, or mechanism is under analysis.
Taxonomy dimensions and N/A handling:

Each operationalization is classified across five dimensions. All dimensions support an optional `n_a` label for cases where that dimension does not clearly apply to the operationalization (use sparingly).
scripts/run_reasoning_paper_crawler.py crawls recent papers about reasoning in LLMs, deduplicates results across arXiv and Semantic Scholar, applies explicit inclusion/exclusion heuristics, and optionally downloads PDFs for included papers.
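Cross-source deduplication typically keys on a DOI when available and a normalized title otherwise; a minimal sketch under that assumption (the crawler's actual keying may differ):

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for matching."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9 ]", "", title.lower())).strip()

def deduplicate(candidates):
    """Keep one record per DOI (preferred) or normalized title; first source wins."""
    seen, unique = set(), []
    for paper in candidates:
        key = paper.get("doi") or normalize_title(paper["title"])
        if key not in seen:
            seen.add(key)
            unique.append(paper)
    return unique
```

"First source wins" means source ordering doubles as a preference order when the same paper appears in both catalogs.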
```
python scripts/run_reasoning_paper_crawler.py \
  --lookback-years 3 \
  --max-results-per-source 5000 \
  --semantic-scholar-search-mode bulk \
  --semantic-scholar-fields-of-study "Computer Science" "Philosophy" "Psychology" "Linguistics" \
  --semantic-scholar-publication-types Review JournalArticle Conference \
  --semantic-scholar-min-citation-count 20 \
  --latest-only \
  --download-pdfs \
  --output results/reasoning_llm_papers_catalog.jsonl \
  --metadata-output results/reasoning_llm_papers_run_metadata.json
```

Key arguments:
| Flag | Default | Description |
|---|---|---|
| `--sources` | `semantic_scholar` | Data sources to crawl. |
| `--search-query` | built-in query set | Repeatable custom query terms. |
| `--semantic-scholar-search-mode` | `bulk` | `bulk` for large-scale fetch, `relevance` for ranked retrieval. |
| `--semantic-scholar-rps` | `1.0` | Semantic Scholar request rate cap (requests per second). |
| `--semantic-scholar-bulk-sort` | `citationCount:desc` | Sort expression for bulk mode. |
| `--semantic-scholar-fields-of-study` | `Computer Science Philosophy Psychology Linguistics` | Server-side fields-of-study filter in Semantic Scholar. |
| `--semantic-scholar-publication-types` | `Review JournalArticle Conference` | Server-side publication type filter in Semantic Scholar. |
| `--semantic-scholar-min-citation-count` | `0` | Server-side citation floor in Semantic Scholar. |
| `--semantic-scholar-open-access-only` | `false` | Restrict retrieval to open-access PDF papers in Semantic Scholar. |
| `--lookback-years` | `3` | Rolling time window ending at `--as-of-year`. |
| `--latest-only` | `false` | Keep only the latest candidate per inferred topic cluster. |
| `--include-rejected` | `false` | Also write non-included candidates to the output JSONL. |
| `--download-pdfs` | `false` | Download available PDFs for included candidates. |
| `--download-dir` | `data/reasoning_llm_papers` | PDF destination folder. |
| `--semantic-scholar-api-key` | (none) | Optional key for higher Semantic Scholar rate limits. |
| `--overwrite` | `false` | Restart output instead of resume mode. |
Each output row includes:
- source metadata (`title`, `authors`, `venue`, `citation_count`, `reference_count`)
- inclusion decision (`include`) plus detailed gate checks under `inclusion`
- download status (`download_path`, `download_error`)
The inclusion logic approximates the survey's criteria with deterministic heuristics:

- topic gate: must match LLM + reasoning signals
- first group: latest-of-kind and survey/book or concrete analysis framing
- second group: peer-reviewed, highly cited (100+), or a substantial-preprint proxy
- exclusion gate: known out-of-scope topics (e.g., non-LLM, deep MI-focused, coursework/blog-like sources)
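As an illustration of the deterministic style of these gates (the real heuristics live in the crawler script and are more detailed; all field names below are invented), an inclusion check might chain them like this:

```python
def passes_gates(paper, citation_floor=100):
    """Toy version of the gate chain; field names are illustrative only."""
    text = (paper.get("title", "") + " " + paper.get("abstract", "")).lower()
    # topic gate: must match LLM + reasoning signals
    topic_ok = ("llm" in text or "language model" in text) and "reason" in text
    # first group: latest-of-kind and survey/book or concrete analysis framing
    first_group = paper.get("is_latest_of_kind", False) and (
        paper.get("is_survey_or_book", False)
        or paper.get("has_concrete_analysis", False)
    )
    # second group: peer-reviewed, highly cited, or substantial-preprint proxy
    second_group = (
        paper.get("peer_reviewed", False)
        or paper.get("citation_count", 0) >= citation_floor
        or paper.get("substantial_preprint", False)
    )
    # exclusion gate: known out-of-scope topics
    excluded = paper.get("out_of_scope", False)
    return topic_ok and first_group and second_group and not excluded
```

Because every gate is a pure function of the candidate record, the same catalog always yields the same inclusion decisions.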
Dimension-specific guidance:

- `level_of_analysis`: `goal | skill | mechanism | n_a`
- `ideal_target`: `precise | human_like | n_a`
  - Use `n_a` only if the operationalization does not align with either standard.
- `performance_bar`: `expert | layman | n_a`
  - Classifies the expected reasoning complexity of the model (not the evaluator).
  - `layman`: everyday reasoning (e.g., GSM8K, basic math, common sense).
  - `expert`: expert-level reasoning (e.g., Humanity's Last Exam, domain knowledge, multi-step deduction).
  - Use `n_a` only if reasoning complexity is fundamentally unclear.
- `testing_environment`: `structured | naturalistic | n_a`
- `relationship_with_language`: `integrated | independent | n_a`
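For orientation, an output row with the taxonomy labels filled in might look like this (all values and the exact nesting are invented for illustration; consult the script's output for the authoritative schema):

```json
{
  "paper_id": "example-2024-gsm8k",
  "exclude": false,
  "operationalization": {
    "level_of_analysis": "skill",
    "ideal_target": "precise",
    "performance_bar": "layman",
    "testing_environment": "structured",
    "relationship_with_language": "integrated"
  },
  "paper_type": "empirical",
  "prescriptive_or_descriptive": "descriptive"
}
```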
- Create `experiments/my_experiment.py`.
- Subclass `BaseExperiment` and implement `_build_prompt` and `_parse_response`.
- Add a corresponding test file `tests/test_my_experiment.py`.