
SkillLearnBench


SkillLearnBench: the first benchmark for continual learning methods that automatically generate agent skills for real-world tasks.
20 skill-dependent tasks · 15 sub-domains · 100 verified instances

Installation

pip install anthropic openai rich tomli dataclaw json-repair   # tomli is only required on Python < 3.11
dataclaw --help                                                # verify that the dataclaw CLI is on your PATH
cp .env.example .env                                           # fill in your API keys (some tasks also require extra variables such as GH_TOKEN — see .env.example)
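
The variables required in .env vary by task. A minimal sketch of typical contents is shown below; the ANTHROPIC_API_KEY and OPENAI_API_KEY names are assumptions inferred from the installed SDKs, and the authoritative list lives in .env.example.

# Illustrative .env contents; defer to .env.example for the real variable names
ANTHROPIC_API_KEY=...   # assumed name: key for the Claude solving agent
OPENAI_API_KEY=...      # assumed name: key for the GPT-5-mini judge
GH_TOKEN=...            # extra variable required by some tasks (see .env.example)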

Docker is a hard requirement, since every agent trial runs inside a container. Install it from docs.docker.com/get-docker.

Quick Start: Evaluation

# Always dry-run first — preview what will run, no execution
python evaluate_skills.py court-form-filling github-repo-analytics --dry-run

# Evaluate with human-authored skills (default, `skills/human_authored`) on two tasks (court-form-filling, github-repo-analytics)
python evaluate_skills.py court-form-filling github-repo-analytics

# The committed `skills/<method>/` tree holds pre-generated skills for all baselines.
# Example: compare three methods: one-shot (claude-sonnet-4-6), human-authored, and the no-skill baseline.
# Warning: this command evaluates all 3 methods across all 20 tasks in SkillLearnBench and can take a long time to complete.
python evaluate_skills.py --skill-path skills/b1-one-shot-claude-sonnet-4-6 skills/human_authored none

When you run evaluate_skills.py, it loads the corresponding skills from --skill-path and evaluates them on the SkillLearnBench tasks. Evaluation results are written to output/evaluation_reports/<method>/<task>/, where report.csv records the average of each metric.
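
For example, after the human-authored run above finishes, you can inspect a single task's report directly. The path below assumes the <method> folder mirrors the skill folder name (human_authored):

# Assumed path: <method> mirrors the skill folder name, <task> is the task id
cat output/evaluation_reports/human_authored/court-form-filling/report.csv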

⚠️ Keep --max-workers ≤ 50 to avoid API rate limits.
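
If you adjust concurrency, pass the flag explicitly, for example:

python evaluate_skills.py court-form-filling github-repo-analytics --max-workers 20   # well below the 50-worker ceiling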

(1) Tasks

SkillLearnBench contains 20 tasks across 6 real-world categories, with 100 instances in total.

| Category | Task | Instances |
|---|---|---|
| Software Engineering | python-scala-translation | 2 |
| Software Engineering | nlp-paper-reproduction | 3 |
| Software Engineering | dependency-vulnerability-check | 5 |
| Software Engineering | github-repo-analytics | 5 |
| Software Engineering | fix-security-bug | 3 |
| Information Retrieval | enterprise-information-search | 6 |
| Information Retrieval | travel-planning | 5 |
| Productivity Tools | schedule-planning | 5 |
| Productivity Tools | offer-letter-generator | 6 |
| Productivity Tools | court-form-filling | 6 |
| Data & Analytics | earthquake-plate-calculation | 6 |
| Data & Analytics | financial-analysis | 6 |
| Data & Analytics | weighted-gdp-calculation | 6 |
| Data & Analytics | dbscan-parameter-tuning | 5 |
| Data & Analytics | stock-data-visualization | 5 |
| Content & Creative | anthropic-poster-design | 5 |
| Content & Creative | chinese-poem-generator | 5 |
| Content & Creative | video-object-counting | 5 |
| Utilities & Other | organize-messy-files | 6 |
| Utilities & Other | temperature-simulation | 5 |

(2) Evaluation Dimensions

| Dimension | Metrics | What it measures |
|---|---|---|
| Task Success | Pass rate | Binary verifier outcome per trial |
| Skill Quality | Functional coverage, executability, safety | How well a skill describes the task and avoids unsafe instructions |
| Trajectory Quality | Key-point recall, execution order, completeness | Whether the solving agent's action trace matches the expected solution path |

The solving agent is powered by Claude Sonnet 4.6, and the LLM-as-judge uses GPT-5-mini.

Baselines

We implement four continual learning methods based on skill generation.

| ID | Name | Description |
|---|---|---|
| b1 | One-Shot | The agent generates a skill set in a single pass. |
| b2 | Self-Feedback | The agent first generates an initial skill set and uses it to attempt the task. After execution, it reviews the trajectory, identifies issues, and refines the skills. This cycle repeats K=2 times (i.e., K−1 rounds of feedback) without any external supervision. |
| b3 | Teacher-Feedback | After each failed attempt, the agent asks the teacher questions, and the teacher provides directional guidance without revealing the ground-truth skill. The agent then updates its skills and retries the task. The skill set is regenerated up to K=3 times, with up to K−1 QA rounds triggered by failed attempts. This setting simulates a domain expert helping the agent improve. |
| b4 | Skill Creator | Claude's official skill-creator. The agent follows a structured multi-stage process: analyzing the task intent, investigating edge cases and dependencies, writing a skill specification, and validating it with automated checks. |

The agent can be powered by any LLM. In our codebase we provide results for Claude and Gemini (claude-haiku-4-5, claude-sonnet-4-6, and claude-opus-4-6; gemini-3.1-flash-lite-preview, gemini-3-flash-preview, and gemini-3.1-pro-preview).

Run a baseline to generate skills:

python generate_skills.py --tasks court-form-filling --methods b1-one-shot --models claude-sonnet-4-6
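
Generated skills can then be scored with evaluate_skills.py. The sketch below assumes the generated skill folder follows the <method>-<model> naming of the committed baselines (e.g., skills/b1-one-shot-claude-sonnet-4-6); check your run's actual output location before copying it.

# Sketch only: the skill folder name below is an assumption based on the committed baseline folders
python generate_skills.py --tasks court-form-filling --methods b1-one-shot --models claude-sonnet-4-6
python evaluate_skills.py court-form-filling --skill-path skills/b1-one-shot-claude-sonnet-4-6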

(1) Evaluate other LLMs with the existing four baselines

You can plug additional LLMs into the four baselines above. See BASELINES.md for more details.

(2) Evaluate other continual learning methods

To evaluate your own continual learning method, use the tasks folder, which provides the verifier (in each tests subfolder) and the instance data for every task. Feed the instance-1 data, WITHOUT its verifier, into your method to generate skills. The generated skills should follow the same layout as any subfolder under skills. Then evaluate your method with python evaluate_skills.py --skill-path your_skill_path (see the sketch below).
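
A minimal sketch of that workflow, assuming a hypothetical skills/my_method folder that mirrors the per-task layout of the committed skill sets:

# Hypothetical workflow; the my_method folder name is illustrative
ls skills/human_authored                    # study the expected per-task layout
mkdir -p skills/my_method                   # place your generated skills here, matching that layout
python evaluate_skills.py --skill-path skills/my_method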

See CONTRIBUTING.md for the full set of options and instructions on adding new methods.

Citation

@article{zhong2026skilllearnbench,
  title={SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks},
  author={Zhong, Shanshan and Lu, Yi and Ning, Jingjie and Wan, Yibing and Feng, Lihan and Ao, Yuyi and Ribeiro, Leonardo F. R. and Dreyer, Markus and Ammirati, Sean and Xiong, Chenyan},
  journal={arXiv preprint arXiv:2604.20087},
  year={2026},
  url={https://arxiv.org/abs/2604.20087}
}
