Description
Why
lm_eval/tasks/ is by far the largest code surface in this repo — hundreds of YAML-configured evaluation tasks across dozens of subdirectories. Most external contributions are new task additions. Path-scoped Copilot instructions for this directory will guide agents (and code review) with task-specific schema rules, naming conventions, and testing requirements without bloating the repo-wide instructions.
Scope / Proposed changes
- New file: `.github/instructions/tasks.instructions.md` (~100 lines)
- New directory: `.github/instructions/` (if it doesn't exist)
Proposed contents
---
applyTo: "lm_eval/tasks/**"
---
# Task YAML Authoring Instructions
These instructions apply when creating or modifying files in `lm_eval/tasks/`.
## Canonical Reference
For the full task authoring guide, see `docs/new_task_guide.md`.
For the full task configuration reference, see `docs/task_guide.md`.
## Directory Structure
Each task lives in its own subdirectory under `lm_eval/tasks/`:
lm_eval/tasks/<dataset_name>/
├── <task_name>.yaml # Task configuration (required)
├── _default_template.yaml # Shared defaults inherited via the include key (optional)
├── utils.py # Custom functions for process_docs, doc_to_text, etc. (optional)
└── README.md # Task documentation with paper citation (required for new tasks)
## Required YAML Fields
Every task YAML must include:
| Field | Description |
|-------|-------------|
| `task` | Unique task name (snake_case) |
| `dataset_path` | HuggingFace dataset name or path |
| `output_type` | One of: `multiple_choice`, `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
| `doc_to_text` | Jinja2 template or function for model input |
| `doc_to_target` | Jinja2 template or function for expected output |
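As a rough illustration of the table above, the sketch below checks a parsed task config (represented as a plain dict) for the five required fields and a valid `output_type`. The helper `check_task_config` and the example values are hypothetical and not part of `lm_eval`.

```python
# Hypothetical helper, not part of lm_eval: checks a parsed task config
# (a plain dict, e.g. the result of loading the YAML) against the table above.
REQUIRED_FIELDS = ("task", "dataset_path", "output_type", "doc_to_text", "doc_to_target")
VALID_OUTPUT_TYPES = {
    "multiple_choice",
    "generate_until",
    "loglikelihood",
    "loglikelihood_rolling",
}


def check_task_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config passes."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in config
    ]
    output_type = config.get("output_type")
    if output_type is not None and output_type not in VALID_OUTPUT_TYPES:
        problems.append(f"invalid output_type: {output_type!r}")
    return problems


example = {
    "task": "my_task",
    "dataset_path": "my_org/my_dataset",  # placeholder dataset name
    "output_type": "generate_until",
    "doc_to_text": "Question: {{question}}\nAnswer:",
    "doc_to_target": "{{answer}}",
}
print(check_task_config(example))
```

A config that inherits fields from a shared template would need the template merged in before a check like this is meaningful.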
## Common Optional Fields
| Field | Default | Notes |
|-------|---------|-------|
| `dataset_name` | `null` | HF dataset config name |
| `test_split` | — | Split to evaluate on |
| `validation_split` | — | Split for validation |
| `fewshot_split` | — | Split for few-shot examples |
| `num_fewshot` | `0` | Number of few-shot examples |
| `metric_list` | — | List of metrics to compute |
| `filter_list` | — | Output post-processing filters |
## Naming Conventions
- Task names: `snake_case`, matching the dataset or paper name
- Subdirectory name: matches the dataset/benchmark name
- YAML filenames: match the task name (e.g., `gsm8k.yaml` for task `gsm8k`)
- Group YAML: name group definition files `_<group_name>.yaml` (note the leading underscore)
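The naming rules above could be checked mechanically along these lines. This is a minimal sketch: the regular expression encodes one assumed reading of "snake_case" (lowercase letters and digits separated by single underscores), and both helper names are hypothetical.

```python
import re
from pathlib import Path

# Assumed definition of snake_case: lowercase letters and digits,
# separated by single underscores, with no leading or trailing underscore.
SNAKE_CASE = re.compile(r"[a-z0-9]+(_[a-z0-9]+)*")


def is_snake_case(name: str) -> bool:
    """True when the whole name matches the assumed snake_case pattern."""
    return SNAKE_CASE.fullmatch(name) is not None


def filename_matches_task(yaml_path: str, task_name: str) -> bool:
    """True when the YAML file stem equals the task name."""
    return Path(yaml_path).stem == task_name


print(is_snake_case("gsm8k"))
print(filename_matches_task("lm_eval/tasks/gsm8k/gsm8k.yaml", "gsm8k"))
```

Group files with a leading underscore are deliberately rejected by `is_snake_case`, so a real check would treat them separately.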
## Jinja2 Templates
- Use `{{variable}}` for dataset column references
- For multiple choice: `doc_to_choice` must return a list of answer strings
- For generative tasks: set `generation_kwargs` (temperature, max_gen_toks, until)
- Multiline templates: use YAML literal block scalar `|` or folded block scalar `>`
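Since `doc_to_text` and related fields accept a function as well as a template, a template that grows unwieldy can be moved into `utils.py`. The sketch below shows what such functions might look like; the column names `question` and `choices` are assumed for illustration and depend on the dataset.

```python
# Hypothetical utils.py functions. The column names "question" and "choices"
# are assumptions for this example, not a fixed lm_eval schema.
def doc_to_text(doc: dict) -> str:
    """Build the model input from one dataset row."""
    return f"Question: {doc['question']}\nAnswer:"


def doc_to_choice(doc: dict) -> list[str]:
    """Return the candidate answers as a list of strings."""
    return [str(choice) for choice in doc["choices"]]


row = {"question": "2 + 2 = ?", "choices": [3, 4]}
print(doc_to_text(row))
print(doc_to_choice(row))
```

These would be referenced from the YAML with the `!function` tag, for example `doc_to_text: !function utils.doc_to_text`.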
## Testing
- Run `pytest tests/test_tasks.py -x -s -vv` to validate all task configs load correctly
- The CI workflow `new_tasks.yml` automatically runs task tests when `lm_eval/tasks/**` changes
- If possible, validate with a small model: `lm-eval run --model hf --model_args pretrained=EleutherAI/pythia-160m --tasks <your_task> --limit 10`
## Process Docs Function
If using `process_docs` to preprocess data, define the function in a `utils.py` file in the same directory:
```python
def process_docs(dataset):
    """Preprocess the dataset before evaluation."""
    def transform_fn(doc):
        # Placeholder transform: edit or add fields here, then return the row
        return doc
    # Return the modified dataset
    return dataset.map(transform_fn)
```
Reference it in the YAML as `process_docs: !function utils.process_docs`.
## Common Patterns
- Shared defaults: Use `include` to inherit from a parent YAML: `include: _default_template.yaml`
- Task groups: Create a group YAML that lists subtasks under `task` as a list
- Custom metrics: Define metric functions in `utils.py` and reference via `!function`
- Answer extraction: Use filters (regex, take_first) to extract answers from model output
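To make the answer-extraction pattern concrete, here is a standalone sketch of what a regex filter followed by take_first does conceptually. It is not lm_eval's filter implementation; the pattern, the `[invalid]` fallback, and the function names are assumptions for this example.

```python
import re

# Illustrative only: mimics the idea of a regex filter followed by take_first.
# The pattern and the "[invalid]" fallback are assumptions for this example.
ANSWER_PATTERN = re.compile(r"The answer is (-?\d+)")


def extract_answers(responses: list[str], fallback: str = "[invalid]") -> list[str]:
    """Apply the regex to each response, keeping its first match or the fallback."""
    extracted = []
    for response in responses:
        match = ANSWER_PATTERN.search(response)
        extracted.append(match.group(1) if match else fallback)
    return extracted


def take_first(items: list[str]) -> str:
    """Keep only the first response for scoring."""
    return items[0]


responses = ["Let me think. The answer is 42.", "The answer is 7"]
print(take_first(extract_answers(responses)))
```

In a task YAML, the equivalent behavior would be configured under `filter_list` rather than written by hand.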
## Labels to apply
- **Base**: `agent-readiness`
- **Priority**: `priority:high`
- **Area**: `documentation`
## Depends on
- #3610 (label creation)
- #3611 (.github/copilot-instructions.md must exist; this file supplements it)
## Related existing issues
None directly, though many open issues involve task configuration problems (e.g., #2552, #2479).
## Acceptance criteria
- [ ] `.github/instructions/tasks.instructions.md` exists and is ≤120 lines
- [ ] YAML frontmatter has `applyTo: "lm_eval/tasks/**"`
- [ ] No `excludeAgent` field (both coding agent and code review should see these instructions)
- [ ] All referenced paths and patterns exist in the repo
- [ ] Guidance links to `docs/new_task_guide.md` and `docs/task_guide.md` rather than duplicating them
- [ ] File passes pymarkdown lint (excluding YAML frontmatter)
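Several of these criteria lend themselves to an automated check. The following is a minimal sketch, assuming the file's text has already been read into a string; the function name `check_instructions_text` is hypothetical, and the lint and path-existence criteria are out of its scope.

```python
# Hypothetical check for the line-count and frontmatter criteria above.
def check_instructions_text(text: str, max_lines: int = 120) -> list[str]:
    """Return a list of unmet criteria; an empty list means all checks pass."""
    problems = []
    stripped = [line.strip() for line in text.splitlines()]
    if len(stripped) > max_lines:
        problems.append(f"file has {len(stripped)} lines, limit is {max_lines}")
    # Frontmatter is the block between an opening '---' on line 1 and the next '---'.
    if not stripped or stripped[0] != "---" or "---" not in stripped[1:]:
        problems.append("missing YAML frontmatter")
        return problems
    end = stripped.index("---", 1)
    frontmatter = stripped[1:end]
    if 'applyTo: "lm_eval/tasks/**"' not in frontmatter:
        problems.append("frontmatter lacks the expected applyTo line")
    if any(line.startswith("excludeAgent") for line in frontmatter):
        problems.append("frontmatter sets excludeAgent")
    return problems


sample = '---\napplyTo: "lm_eval/tasks/**"\n---\n# Task YAML Authoring Instructions\n'
print(check_instructions_text(sample))
```

Running it against the real file would only require reading `.github/instructions/tasks.instructions.md` into a string first.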
## Avoid drift/duplication notes
- This file covers **task-specific** guidance only. General repo conventions are in `.github/copilot-instructions.md`.
- Canonical task authoring procedures live in `docs/new_task_guide.md` — this file summarizes key rules and links there.
- If task YAML schema changes, update both this file and `docs/task_guide.md`.
## References
- [GitHub Docs: Path-specific custom instructions](https://docs.github.com/en/copilot/customizing-copilot/adding-repository-custom-instructions-for-github-copilot#creating-path-specific-custom-instructions)
- Example syntax for `applyTo` frontmatter:
```yaml
---
applyTo: "lm_eval/tasks/**"
---
```