Merged
38 changes: 37 additions & 1 deletion .github/workflows/ci-monitor.yml
@@ -2,7 +2,8 @@ name: CI Monitor

 on:
   schedule:
-    - cron: '0 */12 * * *'
+    - cron: '0 */12 * * *' # Every 12 hours for main analysis
+    - cron: '0 6 * * *' # Daily at 6:00 AM UTC for balance analysis
   workflow_dispatch:
     inputs:
       limit:
@@ -63,3 +64,38 @@ jobs:
             scripts/ci_monitor/ci_analysis_*.json
             scripts/ci_monitor/performance_tables_*
           retention-days: 30
+
+  ci-monitor-balance:
+    if: github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request'
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.9'
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install requests
+
+      - name: Run Test Balance Analysis
+        env:
+          GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI }}
+          PYTHONUNBUFFERED: 1
+          PYTHONIOENCODING: utf-8
+        run: |
+          cd scripts/ci_monitor
+          python ci_analyzer_balance.py --token $GITHUB_TOKEN --limit ${{ inputs.limit || '1000' }} --output test_balance_report_$(date +%Y%m%d_%H%M%S).json
+
+      - name: Upload Balance Analysis Results
+        uses: actions/upload-artifact@v4
+        with:
+          name: test-balance-results-${{ github.run_number }}
+          path: |
+            scripts/ci_monitor/test_balance_report_*.json
+            scripts/ci_monitor/test_balance_report_*.csv
+          retention-days: 30
46 changes: 45 additions & 1 deletion scripts/ci_monitor/README.md
@@ -2,10 +2,11 @@

> **Note**: This README.md is primarily generated by Claude 4 with some manual adjustments.

-A comprehensive toolkit to analyze CI failures and performance trends for the SGLang project. This toolkit includes two main tools:
+A comprehensive toolkit to analyze CI failures and performance trends for the SGLang project. This toolkit includes three main tools:

1. **CI Analyzer** (`ci_analyzer.py`): Analyzes CI failures and provides detailed failure pattern analysis
2. **Performance Analyzer** (`ci_analyzer_perf.py`): Tracks performance metrics over time and generates trend charts
3. **Test Balance Analyzer** (`ci_analyzer_balance.py`): Analyzes gaps between elapsed and estimated test times to help balance CI

## Features

@@ -26,6 +27,15 @@ A comprehensive toolkit to analyze CI failures and performance trends for the SG
- **Comprehensive Metrics**: Track output throughput, E2E latency, TTFT, accept length, and more
- **Time-Based Sampling**: Intelligent sampling strategy to cover extended time periods (up to 30 days) with limited API calls

### Test Balance Analyzer (`ci_analyzer_balance.py`)
- **Time Gap Analysis**: Identify GPU tests with large gaps between elapsed and estimated times
- **CI Balancing**: Help optimize CI by identifying tests that need time adjustments
- **Gap Tracking**: Track maximum time gaps for each test across multiple CI runs
- **PR Test Focus**: Analyzes only GPU jobs from the `pr-test.yml` workflow (AMD jobs and other workflows are excluded)
- **Ranking System**: Sort tests by time gap severity to prioritize adjustments
- **CSV Export**: Export analysis results in CSV format for easy review
- **GitHub Integration**: Generate GitHub Actions summaries with recommendations
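
The gap tracking, ranking, and CSV export described above could be sketched roughly as follows. This is a minimal illustration, not the code in `ci_analyzer_balance.py`: the `TimingRecord` shape, the function names, and the choice of an absolute difference as the "gap" are all assumptions made for the example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class TimingRecord:
    """One observation of a test in one CI run (hypothetical record shape)."""
    name: str
    elapsed_s: float    # time the test actually took
    estimated_s: float  # time budgeted for the test


def max_gaps(observations):
    """Return {test name: largest |elapsed - estimated| seen across runs}."""
    gaps = {}
    for obs in observations:
        # Assumption: the gap is the absolute difference, so tests that are
        # over- or under-estimated both show up.
        gap = abs(obs.elapsed_s - obs.estimated_s)
        if obs.name not in gaps or gap > gaps[obs.name]:
            gaps[obs.name] = gap
    return gaps


def rank_by_gap(gaps):
    """Sort (name, gap) pairs so the largest gap comes first."""
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


def to_csv(ranked):
    """Render the ranking as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["rank", "test", "max_gap_seconds"])
    for rank, (name, gap) in enumerate(ranked, start=1):
        writer.writerow([rank, name, f"{gap:.1f}"])
    return buf.getvalue()
```

Keeping only the maximum gap per test means a single slow run is enough to surface a test, which fits the goal of spotting estimates that need adjusting.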

### Common Features
- **Automated Monitoring**: GitHub Actions workflow for continuous CI and performance monitoring

@@ -45,6 +55,13 @@ Additional dependencies required for chart generation:
pip install requests matplotlib pandas
```

### For Test Balance Analyzer
Requires only the Python standard library and `requests`:

```bash
pip install requests
```

## Usage

### CI Analyzer
@@ -97,6 +114,25 @@ python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --start-date $(date -d '7 d
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 1000 --upload-to-github
```

### Test Balance Analyzer

#### Basic Usage

```bash
# Analyze PR Test GPU job time gaps from recent CI runs
python ci_analyzer_balance.py --token YOUR_GITHUB_TOKEN
```

#### Advanced Usage

```bash
# Analyze last 1000 PR Test GPU CI runs for comprehensive test balance analysis
python ci_analyzer_balance.py --token YOUR_GITHUB_TOKEN --limit 1000

# Custom output file
python ci_analyzer_balance.py --token YOUR_GITHUB_TOKEN --limit 500 --output my_balance_analysis.json
```

**Important**: Make sure your GitHub token has `repo` and `workflow` permissions; otherwise you'll get 404 errors.
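
To show where the token and `--limit` end up, here is a hedged sketch of how requests for workflow runs might be planned against the GitHub REST API. The endpoint (`/repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs`) and the 100-item page cap are part of the public API; the helper names `auth_headers` and `plan_run_requests` are hypothetical, and the real scripts may page through results differently.

```python
import math

API_ROOT = "https://api.github.com"


def auth_headers(token):
    """Headers for an authenticated GitHub REST API call."""
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }


def plan_run_requests(repo, workflow_file, limit, per_page=100):
    """Return (url, params) pairs that together cover `limit` workflow runs.

    `repo` is "owner/name"; `workflow_file` is a workflow file name such as
    "pr-test.yml". Both helper and argument names are illustrative only.
    """
    per_page = min(per_page, 100)  # the API returns at most 100 items per page
    url = f"{API_ROOT}/repos/{repo}/actions/workflows/{workflow_file}/runs"
    pages = math.ceil(limit / per_page)
    return [(url, {"per_page": per_page, "page": page})
            for page in range(1, pages + 1)]
```

Each pair can then be passed to `requests.get(url, headers=auth_headers(token), params=params)`; a 404 on the first call is a hint that the token lacks the permissions noted above.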

## Data Collection Strategies
@@ -183,6 +219,14 @@ Use `--start-date` and `--end-date` parameters to get **ALL** CI runs within a s
| `--end-date` | None | End date for date range query (YYYY-MM-DD format) |
| `--upload-to-github` | False | Upload results to sglang-bot/sglang-ci-data repository |

### Test Balance Analyzer Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--token` | Required | GitHub Personal Access Token |
| `--limit` | 1000 | Number of CI runs to analyze |
| `--output` | test_balance_report.json | Output JSON file for detailed analysis data |
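
For reference, the table above maps onto an `argparse` definition along these lines. This is a sketch that mirrors the documented flags and defaults; the actual parser in `ci_analyzer_balance.py` may be organized differently, and `build_parser` is a name chosen for this example.

```python
import argparse


def build_parser():
    """Build a parser with the three flags documented in the table."""
    parser = argparse.ArgumentParser(
        description="Test balance analyzer (illustrative sketch)"
    )
    parser.add_argument("--token", required=True,
                        help="GitHub Personal Access Token")
    parser.add_argument("--limit", type=int, default=1000,
                        help="Number of CI runs to analyze")
    parser.add_argument("--output", default="test_balance_report.json",
                        help="Output JSON file for detailed analysis data")
    return parser
```

Calling `build_parser().parse_args(["--token", "YOUR_GITHUB_TOKEN"])` yields the defaults shown in the table.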

## Getting GitHub Token

1. Go to [GitHub Settings > Personal Access Tokens](https://github.com/settings/tokens)