GSoC 2026 Draft Proposal: Long-Context & Complex Reasoning Coding Evaluation Dataset #24143
kambleakash0 started this conversation in Ideas
Contact & Identity
Title
Long-Context & Complex Reasoning Coding Evaluation Dataset for Gemini CLI
Summary
Current AI coding agent benchmarks (e.g., SWE-bench Pro, Terminal-Bench) are saturating and fail to capture real-world production complexity. This project builds an evaluation suite of 30-50 large-scale, multi-language open-source repositories with authentic engineering problems that demand multi-step reasoning across expansive code contexts. The dataset will serve as a proving ground for Gemini CLI's ability to navigate, understand, and resolve architectural dependencies spanning thousands of lines of code.
Community Impact
This benchmark directly benefits:
Background
Relevant Experience
AI Agent Development & Evaluation:
agent-skills for AI coding assistants (Claude Code, Codex, Gemini CLI): https://github.com/kambleakash0/agent-skills
Multi-Language Proficiency:
Large Codebase Navigation:
Git Expertise:
Understanding of Existing Benchmarks
Work Breakdown
Phase 1: Repository Curation (Weeks 1-3, ~50 hours)
Goal: Identify and onboard 30-50 large-scale, actively maintained OSS repositories.
Selection Criteria:
Deliverable: Repository catalog with metadata (language, size, architecture type, complexity score, test infrastructure quality).
Example targets: Large Django/FastAPI projects, Kubernetes operators, React enterprise apps, Go microservices, Rust systems tools.
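The selection criteria could be encoded as a simple filter over candidate repository metadata. A minimal sketch; the field names and thresholds below are illustrative assumptions, not finalized Phase 1 criteria:

```python
from dataclasses import dataclass

@dataclass
class RepoCandidate:
    # Hypothetical catalog fields; the real metadata set is a Phase 1 deliverable.
    name: str
    language: str
    loc: int                 # approximate lines of code
    stars: int
    has_ci: bool             # proxy for test infrastructure quality
    commits_last_90d: int    # proxy for active maintenance

def passes_selection(repo: RepoCandidate,
                     min_loc: int = 50_000,
                     min_stars: int = 1_000,
                     min_recent_commits: int = 20) -> bool:
    """Assumed thresholds: large, popular, actively maintained, testable."""
    return (repo.loc >= min_loc
            and repo.stars >= min_stars
            and repo.has_ci
            and repo.commits_last_90d >= min_recent_commits)
```

In practice these fields would be populated from the GitHub API and a line-count tool, then frozen into the repository catalog deliverable.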
Phase 2: Task Formulation (Weeks 4-7, ~60 hours)
Goal: Extract 2-3 complex engineering tasks per repository (60-150 total tasks).
Task Categories:
Task Extraction Process:
Deliverable: Task dataset with standardized fields (repo, commit_before, commit_after, task_description, files_changed, difficulty_score, category).
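One way to derive the standardized fields from a before/after commit pair is a small builder plus a difficulty heuristic. A sketch under stated assumptions: `files_changed` would come from `git diff --name-only commit_before..commit_after`, and the scoring thresholds are purely illustrative:

```python
def score_difficulty(files_changed: list[str]) -> str:
    # Illustrative heuristic only: more touched files across more
    # distinct directories suggests broader architectural impact.
    dirs = {f.rsplit("/", 1)[0] for f in files_changed if "/" in f}
    if len(files_changed) >= 8 or len(dirs) >= 4:
        return "hard"
    if len(files_changed) >= 3 or len(dirs) >= 2:
        return "medium"
    return "easy"

def make_task_record(repo: str, commit_before: str, commit_after: str,
                     task_description: str, category: str,
                     files_changed: list[str]) -> dict:
    """Assemble the standardized fields named in the Phase 2 deliverable."""
    return {
        "repo": repo,
        "commit_before": commit_before,
        "commit_after": commit_after,
        "task_description": task_description,
        "files_changed": files_changed,
        "difficulty_score": score_difficulty(files_changed),
        "category": category,
    }
```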
Phase 3: Schema Design (Week 8, ~15 hours)
Goal: Develop a standardized, reproducible dataset schema.
```json
{
  "id": "django-rest-framework-042",
  "repo": "encode/django-rest-framework",
  "language": "python",
  "commit_before": "abc123",
  "commit_after": "def456",
  "task_description": "...",
  "category": "architectural_bug_fix",
  "files_changed": [
    "rest_framework/serializers.py",
    "rest_framework/fields.py",
    "tests/test_serializers.py"
  ],
  "context_window_estimate": 45000,
  "difficulty": "hard",
  "reasoning_steps": 7,
  "test_command": "pytest tests/test_serializers.py",
  "validation": {
    "tests_pass_before": false,
    "tests_pass_after": true
  }
}
```
Deliverable: JSON Schema specification with validation tooling.
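The validation tooling could start as a stdlib-only check before graduating to a full JSON Schema validator (e.g., the `jsonschema` package). The required fields below mirror the example record; the exact required set is a draft assumption:

```python
# Draft field set mirroring the example record; to be finalized in Phase 3.
REQUIRED_FIELDS = {
    "id": str, "repo": str, "language": str,
    "commit_before": str, "commit_after": str,
    "task_description": str, "category": str,
    "files_changed": list, "context_window_estimate": int,
    "difficulty": str, "reasoning_steps": int,
    "test_command": str, "validation": dict,
}

def validate_task(record: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = [f"missing field: {k}" for k in REQUIRED_FIELDS if k not in record]
    errors += [
        f"wrong type for {k}: expected {t.__name__}"
        for k, t in REQUIRED_FIELDS.items()
        if k in record and not isinstance(record[k], t)
    ]
    v = record.get("validation")
    if isinstance(v, dict) and not (v.get("tests_pass_before") is False
                                    and v.get("tests_pass_after") is True):
        errors.append("validation must show tests failing before and passing after")
    return errors
```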
Phase 4: Pipeline Integration (Weeks 9-10, ~25 hours)
Goal: Integrate the dataset into Gemini CLI's existing evaluation infrastructure.
Deliverable: Evaluation runner integrated with Gemini CLI's testing pipeline.
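The core of the evaluation runner is a per-task loop: check out `commit_before`, let the agent attempt the task, then run `test_command` and confirm the expected flip from failing to passing. A minimal sketch with the git, Gemini CLI, and test steps injected as callables so the control flow is testable without a real repository; the actual integration points in Gemini CLI's pipeline are still to be designed:

```python
from typing import Callable

def run_task(task: dict,
             checkout: Callable[[str], None],
             run_agent: Callable[[str], None],
             run_tests: Callable[[str], bool]) -> dict:
    """Evaluate one task; callables abstract git, the agent, and the test runner."""
    checkout(task["commit_before"])
    # Sanity check that the task reproduces: tests must fail pre-fix.
    failed_before = not run_tests(task["test_command"])
    run_agent(task["task_description"])
    passed_after = run_tests(task["test_command"])
    return {
        "id": task["id"],
        "reproduced": failed_before,
        "resolved": failed_before and passed_after,
    }
```

Keeping the side-effecting steps behind callables also makes it straightforward to swap in sandboxed execution later.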
Phase 5: Baseline Analysis (Weeks 11-12, ~25 hours)
Goal: Comprehensive performance report.
Deliverable: Baseline report with visualizations, failure taxonomy, and recommendations for Gemini CLI improvements.
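The baseline report reduces per-task results to pass rates by category, which then seed the failure taxonomy. A sketch of that aggregation; the result field names follow the runner output assumed above:

```python
from collections import defaultdict

def summarize(results: list[dict]) -> dict:
    """Pass rate per task category; each result carries 'category' and 'resolved'."""
    totals = defaultdict(lambda: {"resolved": 0, "total": 0})
    for r in results:
        bucket = totals[r["category"]]
        bucket["total"] += 1
        bucket["resolved"] += int(r["resolved"])
    return {cat: round(b["resolved"] / b["total"], 3)
            for cat, b in totals.items()}
```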
Time Availability
Related Work