LLM prompt env

A lightweight Python environment for NLP experiments using LLMs via API calls (primarily OpenAI). Each module has a single, focused purpose to keep experiments readable and easy to modify.


Project layout

├── config.py               # Configuration — reads settings from env vars
├── client.py               # OpenAI client factory
├── prompts.py              # Prompt template building and management
├── experiments/
│   ├── base.py             # Abstract base class for experiments
│   ├── completion.py       # General-purpose text completion experiment
│   ├── acorn_laaj.py       # ACORN explanation-quality LLM-as-a-Judge experiment
│   └── llm_reasoning_survey.py  # Claim-level paper annotation for reasoning surveys
├── utils/
│   └── logging.py          # Logging configuration helpers
├── tests/                  # Pytest unit tests
├── .env.example            # Environment variable template (never commit .env)
├── requirements.txt        # Runtime dependencies (uv-compatible)
└── requirements-dev.txt    # Development / test dependencies (uv-compatible)

Quick start

1 — Clone and create a virtual environment

git clone https://github.com/a-brassard/LLM_prompt_env.git
cd LLM_prompt_env
uv venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

2 — Install dependencies

uv pip install -r requirements-dev.txt   # includes runtime deps

If uv is not installed yet, install it first:

brew install uv                                  # macOS (Homebrew)
curl -LsSf https://astral.sh/uv/install.sh | sh  # Linux/macOS (official installer)

3 — Configure your API key

cp .env.example .env
# open .env and fill in OPENAI_API_KEY=sk-...

4 — Run an experiment

from config import Config
from client import build_client
from experiments.completion import TextCompletionExperiment
from utils.logging import setup_logging

setup_logging()

config = Config()          # reads OPENAI_API_KEY from environment
client = build_client(config)

exp = TextCompletionExperiment(
    config,
    client,
    system="You are a helpful assistant.",
)

result = exp.run("Summarize the history of transformers in one sentence.")
print(result)

Running tests

uv run pytest

ACORN LLM-as-a-Judge experiment

Use ACORNLaaJExperiment when you want to score ACORN explanations with the same AMT-style rubric prompt used in prior experiments.

from client import build_client
from config import Config
from experiments.acorn_laaj import ACORNLaaJExperiment

config = Config()
client = build_client(config)
exp = ACORNLaaJExperiment(config, client)

sample = {
    "question": "It got dark outside. What happened as a result?",
    "choices": [
        "Snowflakes began to fall from the sky.",
        "The moon became visible in the sky.",
    ],
    "explanation": "At night, darkness makes the moon visible.",
}

response = exp.run_sample(sample, temperature=0)
print(response)  # raw model text

ratings = exp.parse_ratings(response)
print(ratings)
# {'supports': 1, 'overall': 4, 'well_written': 1, 'related': 1,
#  'factual': 1, 'new_info': 2, 'unnecessary_info': 0, 'contrastive': 0}

Notes

  • Keep your .env out of version control — only .env.example is tracked.
  • Config raises an EnvironmentError if a required variable is missing.
  • Project coding guidance lives in .github/copilot-instructions.md.
  • Logging standard: capture run metadata, API-call settings, and outcomes to support reproducibility.
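
As a rough illustration of the Config behavior noted above, here is a simplified sketch (the actual class reads more settings than shown; only the required-variable check and one default are reproduced):

```python
import os

class Config:
    """Simplified stand-in for config.py's Config: reads settings from
    environment variables and raises EnvironmentError when a required
    one is missing."""

    def __init__(self):
        api_key = os.environ.get("OPENAI_API_KEY")
        if not api_key:
            raise EnvironmentError(
                "Missing required environment variable: OPENAI_API_KEY"
            )
        self.api_key = api_key
        # Optional settings fall back to documented defaults
        self.max_retries = int(os.environ.get("OPENAI_MAX_RETRIES", "2"))
```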

Logging configuration

Optional environment variables for logging and API-call behavior:

OPENAI_MAX_RETRIES=2                  # default: 2

By default, experiment logs include structured JSON events with run IDs, model/settings metadata, token usage, durations, and error details.
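
For illustration, a structured event of the kind described above might be assembled as follows. The field names here (run_id, ts, duration_s) are assumptions, not the project's exact schema:

```python
import json
import time
import uuid

def build_event(event, **fields):
    """Assemble one structured log event as a dict ready for json.dumps.
    Field names are illustrative."""
    return {
        "event": event,
        "run_id": str(uuid.uuid4()),  # ties all events of one run together
        "ts": time.time(),
        **fields,
    }

payload = build_event("api_call", model="gpt-4o", temperature=0, duration_s=1.2)
print(json.dumps(payload))
```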


Batch evaluation script

scripts/run_acorn_batch.py processes a JSONL dataset end-to-end and writes rated results to a JSONL file. If the output file already exists, the script resumes by default (appends and skips already written rows).

python scripts/run_acorn_batch.py \
    --input  data/ACORN.jsonl \
    --output results/acorn_rated.jsonl \
    --model  gpt-4o \
    --temperature 0

Key arguments:

| Flag | Default | Description |
| --- | --- | --- |
| --input | (required) | Input JSONL path. Each line needs question, choices, explanation. |
| --output | (required) | Output JSONL path (parent dirs created automatically). |
| --model | OPENAI_DEFAULT_MODEL | Model name; override here or via the env var. |
| --temperature | 0.0 | Sampling temperature. |
| --limit | (none) | Process only the first N samples (for dry runs). |
| --log-file | (none) | Write logs to this file in addition to stdout; monitor with tail -f. |
| --overwrite | false | Truncate --output and start from scratch instead of resuming. |

Each output line mirrors the input and adds batch_run_id, item_run_id, model, temperature, response, ratings, error, and ts fields. Results are flushed to disk after every item, so partial results survive a crash. To restart from zero on an existing output path, pass --overwrite. The script exits with code 1 if any item failed.
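
A minimal sketch of the resume-and-flush behavior described above (the real script keys rows differently and records more fields; the "id" key and rate_fn callback are assumptions for illustration):

```python
import json
import os

def run_batch(samples, output_path, rate_fn, overwrite=False):
    """Resume by default: skip rows already present in the output file,
    and flush after every item so partial results survive a crash."""
    done = set()
    if os.path.exists(output_path) and not overwrite:
        with open(output_path) as f:
            done = {json.loads(line)["id"] for line in f if line.strip()}
    mode = "w" if overwrite else "a"
    with open(output_path, mode) as out:
        for sample in samples:
            if sample["id"] in done:
                continue  # already rated on a previous run
            row = {**sample, "ratings": rate_fn(sample)}
            out.write(json.dumps(row) + "\n")
            out.flush()  # persist immediately, item by item
```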


Reasoning survey batch annotation

scripts/run_reasoning_survey_batch.py annotates paper PDFs and writes paper-level JSONL rows (one main-claim row per paper), including taxonomy labels, verdict-with-evidence, and supporting passages.

Core scope

This survey catalogues how the field asks "Can LLMs reason?" under operationalized conditions. Each paper must be rephraseable as:

Assuming [performance_bar] [ideal_target] [goal/skill/mechanism] in a [testing_environment] with [relationship_with_language], can LLMs [operationalization]?

Papers that cannot be framed this way are marked exclude=true and excluded from operationalization labeling.

Examples:

  • "Can GPT-4 solve grade-school math?" → "Assuming layman-level precise skill in a structured setting integrated with language, can GPT-4 solve grade-school arithmetic problems?"
  • "Does Claude have an internal representation of other agents' beliefs?" → "Assuming expert-level precise mechanism in a naturalistic setting integrated with language, can Claude develop theory-of-mind representations in activation space?"

python scripts/run_reasoning_survey_batch.py \
    --papers-dir "data/Reasoning x LLM survey papers" \
    --output results/reasoning_survey_claims.jsonl \
    --metadata-output results/reasoning_survey_run_metadata.json \
    --model gpt-5.4 \
    --temperature 0

Key arguments:

| Flag | Default | Description |
| --- | --- | --- |
| --papers-dir | data/Reasoning x LLM survey papers | Directory of PDF papers to annotate. |
| --output | results/reasoning_survey_claims.jsonl | Claim-level JSONL output path. |
| --metadata-output | results/reasoning_survey_run_metadata.json | Run summary JSON output path. |
| --model | OPENAI_DEFAULT_MODEL | Model name to use for annotation. |
| --sections | abstract introduction experiments evaluation conclusion | Priority section keywords used for page ranking before fallback. |
| --max-chars | 30000 | Max extracted text characters sent per paper. |
| --timeout-retries | 2 | Retry count for timeout failures on a single paper. |
| --timeout-backoff-seconds | 5.0 | Base exponential backoff between timeout retries. |
| --request-timeout-seconds | (none) | Optional per-request timeout override for model calls. |
| --limit | (none) | Process only the first N papers (pilot run). |
| --overwrite | false | Restart from scratch instead of resume mode. |

The runner resumes by default: if output already exists, papers with matching paper_id are skipped. For each successful paper, the script writes one or more rows (typically one per paper under current policy) with paper_id, operationalization_id, taxonomy labels, structured verdict, model settings, and run IDs.

Current extraction policy keeps only the paper's main claim:

  • exactly one operationalization entry is retained per included paper
  • additional significant settings should be summarized in notes
  • if multiple entries are returned, the parser prefers is_main_claim=true; otherwise it picks the most complete candidate
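
The selection policy in the last bullet could be sketched as follows (field names beyond is_main_claim are assumptions; "most complete" is interpreted here as the candidate with the fewest null fields):

```python
def select_main_claim(entries):
    """Pick one operationalization entry per paper: prefer the entry
    flagged is_main_claim, else the most complete candidate."""
    flagged = [e for e in entries if e.get("is_main_claim")]
    if flagged:
        return flagged[0]
    # Fallback: the entry with the most non-null field values
    return max(entries, key=lambda e: sum(v is not None for v in e.values()))
```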

Each paper also gets a scope pre-tagging decision:

  • exclude=false: normal operationalization labeling is produced.
  • exclude=true: the row includes exclusion_justification, and individuation, operationalization, and verdict are null.

Additional row-level fields:

  • core_question: one general framing question (guided by level_of_analysis: goal/skill/mechanism) describing how reasoning is evaluated and what counts as success
  • reasoning_general_link: why this setup informs reasoning in general
  • reasoning_definition: explicit paper definition of "reasoning" when present, with supporting citation passages; null if absent
  • paper_type: empirical | theoretical | survey | other
  • prescriptive_or_descriptive: descriptive | prescriptive | both

Within operationalization.level_of_analysis, the output also includes focus_summary: a short non-quoted summary of which specific goal, skill, or mechanism is under analysis.

Taxonomy dimensions and N/A handling:

Each operationalization is classified across five dimensions. All dimensions support an optional n_a label for cases where that dimension does not clearly apply to the operationalization (use sparingly).

Dimension-specific guidance:

  • level_of_analysis: goal | skill | mechanism | n_a

  • ideal_target: precise | human_like | n_a

    • Use n_a only if the operationalization does not align with either standard.
  • performance_bar: expert | layman | n_a

    • Classifies the expected reasoning complexity of the model (not the evaluator).
    • layman: everyday reasoning (e.g., GSM8K, basic math, common sense).
    • expert: expert-level reasoning (e.g., Humanity's Last Exam, domain knowledge, multi-step deduction).
    • Use n_a only if reasoning complexity is fundamentally unclear.
  • testing_environment: structured | naturalistic | n_a

  • relationship_with_language: integrated | independent | n_a


Reasoning paper crawler (ArXiv + Semantic Scholar)

scripts/run_reasoning_paper_crawler.py crawls recent papers about reasoning in LLMs, deduplicates results across ArXiv and Semantic Scholar, applies explicit inclusion/exclusion heuristics, and optionally downloads PDFs for included papers.

python scripts/run_reasoning_paper_crawler.py \
  --lookback-years 3 \
  --max-results-per-source 5000 \
  --semantic-scholar-search-mode bulk \
  --semantic-scholar-fields-of-study "Computer Science" "Philosophy" "Psychology" "Linguistics" \
  --semantic-scholar-publication-types Review JournalArticle Conference \
  --semantic-scholar-min-citation-count 20 \
  --latest-only \
  --download-pdfs \
  --output results/reasoning_llm_papers_catalog.jsonl \
  --metadata-output results/reasoning_llm_papers_run_metadata.json

Key arguments:

| Flag | Default | Description |
| --- | --- | --- |
| --sources | semantic_scholar | Data sources to crawl. |
| --search-query | built-in query set | Repeatable custom query terms. |
| --semantic-scholar-search-mode | bulk | bulk for large-scale fetch, relevance for ranked retrieval. |
| --semantic-scholar-rps | 1.0 | Semantic Scholar request rate cap (requests per second). |
| --semantic-scholar-bulk-sort | citationCount:desc | Sort expression for bulk mode. |
| --semantic-scholar-fields-of-study | Computer Science Philosophy Psychology Linguistics | Server-side fields-of-study filter in Semantic Scholar. |
| --semantic-scholar-publication-types | Review JournalArticle Conference | Server-side publication-type filter in Semantic Scholar. |
| --semantic-scholar-min-citation-count | 0 | Server-side citation floor in Semantic Scholar. |
| --semantic-scholar-open-access-only | false | Restrict retrieval to open-access PDF papers in Semantic Scholar. |
| --lookback-years | 3 | Rolling time window ending at --as-of-year. |
| --latest-only | false | Keep only the latest candidate per inferred topic cluster. |
| --include-rejected | false | Also write non-included candidates to the output JSONL. |
| --download-pdfs | false | Download available PDFs for included candidates. |
| --download-dir | data/reasoning_llm_papers | PDF destination folder. |
| --semantic-scholar-api-key | (none) | Optional key for higher Semantic Scholar rate limits. |
| --overwrite | false | Restart output instead of resume mode. |

Each output row includes:

  • source metadata (title, authors, venue, citation_count, reference_count)
  • inclusion decision (include) plus detailed gate checks under inclusion
  • download status (download_path, download_error)

The inclusion logic approximates the survey's criteria with deterministic heuristics:

  • topic gate: must match LLM + reasoning signals
  • first group: latest-of-kind and survey/book or concrete analysis framing
  • second group: peer-reviewed or highly cited (100+) or substantial preprint proxy
  • exclusion gate: known out-of-scope topics (e.g., non-LLM, deep MI-focused, coursework/blog-like sources)

Adding a new experiment

  1. Create experiments/my_experiment.py.
  2. Subclass BaseExperiment and implement _build_prompt and _parse_response.
  3. Add a corresponding test file tests/test_my_experiment.py.
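
A hypothetical skeleton following the steps above. BaseExperiment's real interface may have more hooks; only the two methods named here, _build_prompt and _parse_response, are taken from this README, and the stand-in base class below exists purely to keep the sketch self-contained:

```python
from abc import ABC, abstractmethod

class BaseExperiment(ABC):
    """Stand-in for experiments.base.BaseExperiment (illustrative only)."""

    @abstractmethod
    def _build_prompt(self, sample):
        """Turn an input sample into the prompt text sent to the model."""

    @abstractmethod
    def _parse_response(self, text):
        """Turn raw model output into a structured result."""

class MyExperiment(BaseExperiment):
    def _build_prompt(self, sample):
        return f"Classify the sentiment of: {sample['text']}"

    def _parse_response(self, text):
        return text.strip().lower()
```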
