A lightweight Python environment for NLP experiments using LLMs via API calls (primarily OpenAI). Each module has a single, focused purpose to keep experiments readable and easy to modify.
```
├── config.py                    # Configuration — reads settings from env vars
├── client.py                    # OpenAI client factory
├── prompts.py                   # Prompt template building and management
├── experiments/
│   ├── base.py                  # Abstract base class for experiments
│   ├── completion.py            # General-purpose text completion experiment
│   ├── acorn_laaj.py            # ACORN explanation-quality LLM-as-a-Judge experiment
│   └── llm_reasoning_survey.py  # Claim-level paper annotation for reasoning surveys
├── utils/
│   └── logging.py               # Logging configuration helpers
├── tests/                       # Pytest unit tests
├── .env.example                 # Environment variable template (never commit .env)
├── requirements.txt             # Runtime dependencies (uv-compatible)
└── requirements-dev.txt         # Development / test dependencies (uv-compatible)
```
```
git clone https://github.com/a-brassard/LLM_prompt_env.git
cd LLM_prompt_env
uv venv .venv
source .venv/bin/activate               # Windows: .venv\Scripts\activate
uv pip install -r requirements-dev.txt  # includes runtime deps
```

If uv is not installed yet, install it first:

```
brew install uv
```

```
cp .env.example .env
# open .env and fill in OPENAI_API_KEY=sk-...
```

```python
from config import Config
from client import build_client
from experiments.completion import TextCompletionExperiment
from utils.logging import setup_logging

setup_logging()
config = Config()  # reads OPENAI_API_KEY from environment
client = build_client(config)

exp = TextCompletionExperiment(
    config,
    client,
    system="You are a helpful assistant.",
)
result = exp.run("Summarize the history of transformers in one sentence.")
print(result)
```

Run the test suite with:

```
uv run pytest
```

Use `ACORNLaaJExperiment` when you want to score ACORN explanations with the same AMT-style rubric prompt used in prior experiments.
```python
from client import build_client
from config import Config
from experiments.acorn_laaj import ACORNLaaJExperiment

config = Config()
client = build_client(config)
exp = ACORNLaaJExperiment(config, client)

sample = {
    "question": "It got dark outside. What happened as a result?",
    "choices": [
        "Snowflakes began to fall from the sky.",
        "The moon became visible in the sky.",
    ],
    "explanation": "At night, darkness makes the moon visible.",
}

response = exp.run_sample(sample, temperature=0)
print(response)  # raw model text

ratings = exp.parse_ratings(response)
print(ratings)
# {'supports': 1, 'overall': 4, 'well_written': 1, 'related': 1,
#  'factual': 1, 'new_info': 2, 'unnecessary_info': 0, 'contrastive': 0}
```

- Keep your `.env` out of version control; only `.env.example` is tracked.
- `Config` raises an `EnvironmentError` if a required variable is missing.
- Project coding guidance lives in `.github/copilot-instructions.md`.
- Logging standard: capture run metadata, API-call settings, and outcomes to support reproducibility.
Optional environment variables for logging and API-call behavior:
```
OPENAI_MAX_RETRIES=2  # default: 2
```

By default, experiment logs include structured JSON events with run IDs, model/settings metadata, token usage, durations, and error details.
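As a rough illustration of that pattern (field names here are invented; the project's actual helpers live in `utils/logging.py`), one structured JSON event might be emitted like so:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("experiment")

def log_event(event: str, **fields) -> dict:
    """Emit one structured JSON log line carrying run metadata."""
    record = {"event": event, "ts": time.time(), **fields}
    logger.info(json.dumps(record))
    return record

run_id = str(uuid.uuid4())
event = log_event(
    "api_call",
    run_id=run_id,
    model="gpt-4o",
    temperature=0.0,
    prompt_tokens=412,     # example token usage, not real numbers
    completion_tokens=88,
    duration_s=1.73,
)
```

One JSON object per line keeps logs both greppable and machine-parseable.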
scripts/run_acorn_batch.py processes a JSONL dataset end-to-end and writes
rated results to a JSONL file. If the output file already exists, the script
resumes by default (appends and skips already written rows).
```
python scripts/run_acorn_batch.py \
  --input data/ACORN.jsonl \
  --output results/acorn_rated.jsonl \
  --model gpt-4o \
  --temperature 0
```

Key arguments:
| Flag | Default | Description |
|---|---|---|
| `--input` | (required) | Input JSONL path. Each line needs `question`, `choices`, `explanation`. |
| `--output` | (required) | Output JSONL path (parent dirs created automatically). |
| `--model` | `OPENAI_DEFAULT_MODEL` | Model name; override here or via the env var. |
| `--temperature` | `0.0` | Sampling temperature. |
| `--limit` | (none) | Process only the first N samples (for dry runs). |
| `--log-file` | (none) | Write logs to this file in addition to stdout; monitor with `tail -f`. |
| `--overwrite` | `false` | Truncate `--output` and start from scratch instead of resuming. |
Each output line mirrors the input and adds `batch_run_id`, `item_run_id`, `model`, `temperature`, `response`, `ratings`, `error`, and `ts` fields.
Results are flushed to disk after every item, so partial results survive a crash.
To restart from zero on an existing output path, pass --overwrite.
The script exits with code 1 if any item failed.
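The resume-and-flush behavior described above can be sketched roughly as follows (helper and field names here are illustrative; the real logic lives in `scripts/run_acorn_batch.py`):

```python
import json
from pathlib import Path

def run_batch(samples, rate_fn, output_path, overwrite=False):
    """Rate samples one by one, appending each result and flushing immediately.

    If the output file already exists, previously written rows are skipped
    so an interrupted run can resume where it left off.
    """
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    done = 0
    if out.exists() and not overwrite:
        done = sum(1 for _ in out.open())  # count rows already written
    failures = 0
    with out.open("w" if overwrite else "a") as f:
        for i, sample in enumerate(samples):
            if i < done:                   # already processed: skip on resume
                continue
            row = dict(sample)
            try:
                row["ratings"] = rate_fn(sample)
                row["error"] = None
            except Exception as exc:
                row["ratings"], row["error"] = None, str(exc)
                failures += 1
            f.write(json.dumps(row) + "\n")
            f.flush()                      # partial results survive a crash
    return failures                        # nonzero would map to exit code 1
```

Counting existing lines as "done" assumes inputs are processed in a stable order, which is what makes append-based resume safe.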
scripts/run_reasoning_survey_batch.py annotates paper PDFs and writes
paper-level JSONL rows (one main-claim row per paper),
including taxonomy labels, verdict-with-evidence, and supporting passages.
This survey catalogues how the field asks "Can LLMs reason?" under operationalized conditions. Each paper must be rephraseable as:
Assuming [performance_bar] [ideal_target] [goal/skill/mechanism] in a [testing_environment] with [relationship_with_language], can LLMs [operationalization]?
Papers that cannot be framed this way are marked exclude=true and excluded from operationalization labeling.
Examples:
- "Can GPT-4 solve grade-school math?" → "Assuming layman-level precise skill in a structured setting integrated with language, can GPT-4 solve grade-school arithmetic problems?"
- "Does Claude have an internal representation of other agents' beliefs?" → "Assuming expert-level precise mechanism in a naturalistic setting integrated with language, can Claude develop theory-of-mind representations in activation space?"
```
python scripts/run_reasoning_survey_batch.py \
  --papers-dir "data/Reasoning x LLM survey papers" \
  --output results/reasoning_survey_claims.jsonl \
  --metadata-output results/reasoning_survey_run_metadata.json \
  --model gpt-5.4 \
  --temperature 0
```

Key arguments:
| Flag | Default | Description |
|---|---|---|
| `--papers-dir` | `data/Reasoning x LLM survey papers` | Directory of PDF papers to annotate. |
| `--output` | `results/reasoning_survey_claims.jsonl` | Claim-level JSONL output path. |
| `--metadata-output` | `results/reasoning_survey_run_metadata.json` | Run summary JSON output path. |
| `--model` | `OPENAI_DEFAULT_MODEL` | Model name to use for annotation. |
| `--sections` | `abstract introduction experiments evaluation conclusion` | Priority section keywords used for page ranking before fallback. |
| `--max-chars` | `30000` | Max extracted text characters sent per paper. |
| `--timeout-retries` | `2` | Retry count for timeout failures on a single paper. |
| `--timeout-backoff-seconds` | `5.0` | Base exponential backoff between timeout retries. |
| `--request-timeout-seconds` | (none) | Optional per-request timeout override for model calls. |
| `--limit` | (none) | Process only the first N papers (pilot run). |
| `--overwrite` | `false` | Restart from scratch instead of resume mode. |
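The `--sections` page ranking might work roughly as follows: pages mentioning an earlier keyword in the list are preferred, with unmatched pages as fallback. A sketch under that assumption, not the script's actual implementation:

```python
def rank_pages(pages, sections=("abstract", "introduction", "experiments",
                                "evaluation", "conclusion")):
    """Order page indices so pages matching higher-priority keywords come first.

    `pages` is a list of page-text strings; ties keep original page order.
    """
    def priority(item):
        index, text = item
        lowered = text.lower()
        for rank, keyword in enumerate(sections):
            if keyword in lowered:
                return (rank, index)
        return (len(sections), index)   # fallback: no keyword matched
    ordered = sorted(enumerate(pages), key=priority)
    return [index for index, _ in ordered]
```

Ranked page text would then be concatenated and truncated to `--max-chars` before being sent to the model.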
The runner resumes by default: if output already exists, papers with matching
paper_id are skipped. For each successful paper, the script writes one or
more rows (typically one per paper under current policy) with paper_id, operationalization_id,
taxonomy labels, structured verdict, model settings, and run IDs.
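The timeout handling behind `--timeout-retries` and `--timeout-backoff-seconds` amounts to exponential backoff per paper; a minimal sketch (the real script's wrapper may differ):

```python
import time

def call_with_timeout_retries(fn, retries=2, backoff_seconds=5.0, sleep=time.sleep):
    """Retry `fn` on TimeoutError, waiting backoff_seconds * 2**attempt between tries."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except TimeoutError:
            if attempt == retries:
                raise                               # out of retries: surface the error
            sleep(backoff_seconds * (2 ** attempt))  # 5s, 10s, 20s, ...
```

Injecting `sleep` keeps the helper testable without real waiting.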
Current extraction policy keeps only the paper's main claim:
- exactly one operationalization entry is retained per included paper
- additional significant settings should be summarized in `notes`
- if multiple entries are returned, the parser prefers `is_main_claim=true`; otherwise it picks the most complete candidate

Each paper also gets a scope pre-tagging decision:

- `exclude=false`: normal operationalization labeling is produced.
- `exclude=true`: the row includes `exclusion_justification`, and `individuation`, `operationalization`, and `verdict` are `null`.
Additional row-level fields:
- `core_question`: one general framing question (guided by `level_of_analysis`: goal/skill/mechanism) describing how reasoning is evaluated and what counts as success
- `reasoning_general_link`: why this setup informs reasoning in general
- `reasoning_definition`: explicit paper definition of "reasoning" when present, with supporting citation passages; `null` if absent
- `paper_type`: `empirical | theoretical | survey | other`
- `prescriptive_or_descriptive`: `descriptive | prescriptive | both`
Within `operationalization.level_of_analysis`, the output also includes `focus_summary`: a short, non-quoted summary of which specific goal, skill, or mechanism is under analysis.
Taxonomy dimensions and N/A handling:

Each operationalization is classified across five dimensions. All dimensions support an optional `n_a` label for cases where that dimension does not clearly apply to the operationalization (use sparingly).
scripts/run_reasoning_paper_crawler.py crawls recent papers about reasoning in LLMs, deduplicates results across arXiv and Semantic Scholar, applies explicit inclusion/exclusion heuristics, and optionally downloads PDFs for included papers.
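Cross-source deduplication typically keys on a DOI when available and a normalized title otherwise; a minimal sketch under that assumption (the crawler's actual keying may differ):

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for matching."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9 ]", "", title.lower())).strip()

def deduplicate(candidates):
    """Keep one record per DOI (preferred) or normalized title; first source wins."""
    seen, unique = set(), []
    for paper in candidates:
        key = paper.get("doi") or normalize_title(paper["title"])
        if key not in seen:
            seen.add(key)
            unique.append(paper)
    return unique
```

"First source wins" means source ordering doubles as a preference order when the same paper appears in both catalogs.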
```
python scripts/run_reasoning_paper_crawler.py \
  --lookback-years 3 \
  --max-results-per-source 5000 \
  --semantic-scholar-search-mode bulk \
  --semantic-scholar-fields-of-study "Computer Science" "Philosophy" "Psychology" "Linguistics" \
  --semantic-scholar-publication-types Review JournalArticle Conference \
  --semantic-scholar-min-citation-count 20 \
  --latest-only \
  --download-pdfs \
  --output results/reasoning_llm_papers_catalog.jsonl \
  --metadata-output results/reasoning_llm_papers_run_metadata.json
```

Key arguments:
| Flag | Default | Description |
|---|---|---|
| `--sources` | `semantic_scholar` | Data sources to crawl. |
| `--search-query` | built-in query set | Repeatable custom query terms. |
| `--semantic-scholar-search-mode` | `bulk` | `bulk` for large-scale fetch, `relevance` for ranked retrieval. |
| `--semantic-scholar-rps` | `1.0` | Semantic Scholar request rate cap (requests per second). |
| `--semantic-scholar-bulk-sort` | `citationCount:desc` | Sort expression for bulk mode. |
| `--semantic-scholar-fields-of-study` | `Computer Science Philosophy Psychology Linguistics` | Server-side fields-of-study filter in Semantic Scholar. |
| `--semantic-scholar-publication-types` | `Review JournalArticle Conference` | Server-side publication type filter in Semantic Scholar. |
| `--semantic-scholar-min-citation-count` | `0` | Server-side citation floor in Semantic Scholar. |
| `--semantic-scholar-open-access-only` | `false` | Restrict retrieval to open-access PDF papers in Semantic Scholar. |
| `--lookback-years` | `3` | Rolling time window ending at `--as-of-year`. |
| `--latest-only` | `false` | Keep only the latest candidate per inferred topic cluster. |
| `--include-rejected` | `false` | Also write non-included candidates to the output JSONL. |
| `--download-pdfs` | `false` | Download available PDFs for included candidates. |
| `--download-dir` | `data/reasoning_llm_papers` | PDF destination folder. |
| `--semantic-scholar-api-key` | (none) | Optional key for higher Semantic Scholar rate limits. |
| `--overwrite` | `false` | Restart output instead of resume mode. |
Each output row includes:
- source metadata (`title`, `authors`, `venue`, `citation_count`, `reference_count`)
- inclusion decision (`include`) plus detailed gate checks under `inclusion`
- download status (`download_path`, `download_error`)
The inclusion logic approximates the survey's criteria with deterministic heuristics:

- topic gate: must match LLM + reasoning signals
- first group: latest-of-kind and survey/book or concrete analysis framing
- second group: peer-reviewed, highly cited (100+), or a substantial-preprint proxy
- exclusion gate: known out-of-scope topics (e.g., non-LLM, deep MI-focused, coursework/blog-like sources)
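As an illustration of the deterministic style of these gates (the real heuristics live in the crawler script and are more detailed; all field names below are invented), an inclusion check might chain them like this:

```python
def passes_gates(paper, citation_floor=100):
    """Toy version of the gate chain; field names are illustrative only."""
    text = (paper.get("title", "") + " " + paper.get("abstract", "")).lower()
    # topic gate: must match LLM + reasoning signals
    topic_ok = ("llm" in text or "language model" in text) and "reason" in text
    # first group: latest-of-kind and survey/book or concrete analysis framing
    first_group = paper.get("is_latest_of_kind", False) and (
        paper.get("is_survey_or_book", False)
        or paper.get("has_concrete_analysis", False)
    )
    # second group: peer-reviewed, highly cited, or substantial-preprint proxy
    second_group = (
        paper.get("peer_reviewed", False)
        or paper.get("citation_count", 0) >= citation_floor
        or paper.get("substantial_preprint", False)
    )
    # exclusion gate: known out-of-scope topics
    excluded = paper.get("out_of_scope", False)
    return topic_ok and first_group and second_group and not excluded
```

Because every gate is a pure function of the candidate record, the same catalog always yields the same inclusion decisions.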
Dimension-specific guidance:

- `level_of_analysis`: `goal | skill | mechanism | n_a`
- `ideal_target`: `precise | human_like | n_a`
  - Use `n_a` only if the operationalization does not align with either standard.
- `performance_bar`: `expert | layman | n_a`
  - Classifies the expected reasoning complexity of the model (not the evaluator).
  - `layman`: everyday reasoning (e.g., GSM8K, basic math, common sense).
  - `expert`: expert-level reasoning (e.g., Humanity's Last Exam, domain knowledge, multi-step deduction).
  - Use `n_a` only if reasoning complexity is fundamentally unclear.
- `testing_environment`: `structured | naturalistic | n_a`
- `relationship_with_language`: `integrated | independent | n_a`
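For orientation, an output row with the taxonomy labels filled in might look like this (all values and the exact nesting are invented for illustration; consult the script's output for the authoritative schema):

```json
{
  "paper_id": "example-2024-gsm8k",
  "exclude": false,
  "operationalization": {
    "level_of_analysis": "skill",
    "ideal_target": "precise",
    "performance_bar": "layman",
    "testing_environment": "structured",
    "relationship_with_language": "integrated"
  },
  "paper_type": "empirical",
  "prescriptive_or_descriptive": "descriptive"
}
```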
- Create `experiments/my_experiment.py`.
- Subclass `BaseExperiment` and implement `_build_prompt` and `_parse_response`.
- Add a corresponding test file `tests/test_my_experiment.py`.