A YouTube transcription plugin for Agent Zero that extracts audio to text, detects visual references (charts, graphs, slides), and generates AI-powered summaries and detailed timestamped notes. Supports single videos, playlists, and entire channels with adaptive rate-limiting to stay under YouTube's throttling thresholds.
- Transcribe single videos, entire playlists, or entire channels to text
- Adaptive rate-limiting for channel jobs — classifies channels as small/medium/large and applies per-video delays + inter-batch cooldowns so YouTube doesn't flag the traffic
- Resumable channel jobs — progress is persisted to a state file; re-run the tool to pick up where it left off, or use
action=retryto re-attempt failed videos - Visual context extraction -- detects when speakers reference charts, graphs, or slides and extracts/analyzes video frames
- AI-powered summaries with main topics, key points, data/evidence, and conclusions
- Detailed timestamped notes broken into sections with visual content annotations
- Automatic memory storage -- transcripts, summaries, and notes are saved to Agent Zero's memory system
- Full file export -- all outputs saved as markdown files for reference
Docker (recommended):
# Copy plugin into the container
docker cp a0-youtube-transcribe/ <container_name>:/a0/usr/plugins/youtube_transcribe
# Create symlink for Python imports
docker exec <container_name> ln -sf /a0/usr/plugins/youtube_transcribe /a0/plugins/youtube_transcribe
# Install Python dependencies
docker exec <container_name> /opt/venv-a0/bin/python /a0/usr/plugins/youtube_transcribe/initialize.py
# Copy skills
docker exec <container_name> bash -c 'cp -r /a0/usr/plugins/youtube_transcribe/skills/* /a0/usr/skills/'
# Enable the plugin
docker exec <container_name> touch /a0/usr/plugins/youtube_transcribe/.toggle-1
# Restart to load
docker exec <container_name> supervisorctl restart run_uiUsing the install script (inside the container):
# Copy the plugin source into the container first
docker cp a0-youtube-transcribe/ <container_name>:/tmp/a0-youtube-transcribe
# Run the installer
docker exec <container_name> bash /tmp/a0-youtube-transcribe/install.shThe install script auto-detects the Agent Zero root (/a0/ or /git/agent-zero/), copies files, creates the symlink, installs dependencies, and enables the plugin.
Open Agent Zero's web interface, navigate to the YouTube Transcriber plugin dashboard, and click Check Status. All four dependencies should show as available:
- yt-dlp -- YouTube download and metadata
- youtube-transcript-api -- Caption retrieval
- ffmpeg -- Frame extraction
- Pillow -- Image handling
Open Agent Zero's chat and try:
| What you want | What to say |
|---|---|
| Transcribe a video | "Transcribe this YouTube video: https://youtube.com/watch?v=..." |
| Summarize a video | "Summarize this YouTube video: https://youtube.com/watch?v=..." |
| Detailed study notes | "Create detailed notes from this video: https://youtube.com/watch?v=..." |
| Transcribe a playlist | "Transcribe this playlist: https://youtube.com/playlist?list=..." |
| Transcribe a channel | "Transcribe this YouTube channel: https://youtube.com/@ChannelName" |
| Check channel progress | "What's the status of the YouTube channel transcription job for @ChannelName?" |
| Retry failed channel videos | "Retry the failed videos from the @ChannelName channel job" |
| Transcribe with visuals | "Transcribe this video with visual analysis: https://youtube.com/watch?v=..." |
| Tool | Description |
|---|---|
youtube_transcribe |
Transcribe a video or playlist with optional visual context extraction |
youtube_channel |
Transcribe all videos from a YouTube channel with adaptive rate-limiting and resumable state |
youtube_summary |
Generate an AI-powered structured summary from a video transcript |
youtube_notes |
Generate detailed, timestamped study notes broken into sections |
Extracts text from YouTube captions (or auto-generated subtitles), detects references to visual content, extracts video frames at those timestamps, and uses the LLM to describe charts, graphs, and slides.
Arguments:
url(required) -- YouTube video or playlist URLaction--transcribe(default) orplaylistextract_visuals--true/false(default: from config)language-- Language hint (e.g.,en,es)save_to_memory--true/false(default:true)max_videos-- Max videos from a playlist (default: 10)
Transcribes every video on a YouTube channel with a batching strategy that adapts to channel size so YouTube's throttling thresholds don't trip. Progress is persisted to a state file, so repeated invocations resume where the last run left off; failed videos (e.g., captions disabled) are tracked separately and re-attemptable.
Sizing strategy:
- Small channels (≤ 50 videos): transcribe all in one pass with a short per-video delay
- Medium channels (51-200 videos): batch into groups of 15 with a cooldown between batches
- Large channels (> 200 videos): smaller batches of 10 with longer per-video delays and multi-minute cooldowns
Arguments:
url(required) -- Channel URL (@handle,/channel/UCxxx,/c/Name, or/user/Name)action--transcribe(default),status(check progress), orretry(re-attempt failed videos)language-- Language hint (e.g.,en,es)save_to_memory--true/false(default:true)max_videos-- Limit how many videos to process in this run (useful for testing or pacing large channels)
Generates a structured summary with overview, main topics, key points, data/evidence, and conclusions. Long transcripts are automatically chunked and synthesized.
Arguments:
url-- YouTube video URLtranscript-- Pre-extracted transcript text (alternative to url)language-- Language hintsave_to_memory--true/false(default:true)
Creates detailed study notes organized by time sections. Each section includes key points, visual content references, detailed explanations, and notable quotes.
Arguments:
url-- YouTube video URLtranscript-- Pre-extracted transcript text (alternative to url)section_minutes-- Section length in minutes (default: 5)language-- Language hintsave_to_memory--true/false(default:true)
- YouTube captions (via
youtube-transcript-api) -- fastest, uses existing manual or auto-generated captions - yt-dlp subtitle download (fallback) -- downloads auto-generated subtitles in json3 format
- Scans transcript for visual keywords ("this chart", "as you can see", "this slide", etc.)
- Extracts video frames at detected timestamps using ffmpeg
- Sends frames to the LLM for detailed visual analysis
- If no explicit references found, periodic frame sampling provides general context
- Resolves the channel URL (
@handle,/channel/UCxxx,/c/Name,/user/Name) and enumerates video IDs via yt-dlp - Classifies the channel as small / medium / large and picks a per-video delay + batch size + inter-batch cooldown
- Persists progress to a per-channel state file so repeated invocations resume automatically
- Tracks failed videos (e.g., captions disabled) separately — re-attemptable with
action=retry action=statusreturns completion stats without doing any more work
- Markdown files -- Full transcripts, summaries, and notes saved to the plugin's
data/directory - Agent Zero memory -- Content indexed in the vector database for future retrieval
- Response preview -- Only a truncated preview is returned to the agent to avoid context window overflow
| Document | Description |
|---|---|
| docs/README.md | Full reference -- all tools, configuration, examples, architecture |
| docs/QUICKSTART.md | 5-minute setup guide |
| docs/VISUAL_CONTEXT.md | Visual context pipeline deep-dive |
| docs/API_REFERENCE.md | API endpoints and data formats |
| docs/DEVELOPMENT.md | How to extend and contribute |
Settings can be adjusted via the WebUI config page or default_config.yaml:
| Section | Setting | Default | Description |
|---|---|---|---|
| Transcription | preferred_source |
captions |
Caption source (captions or whisper) |
| Transcription | language |
(auto) | ISO 639-1 language code |
| Visual | enabled |
true |
Enable visual context extraction |
| Visual | frame_interval |
30 |
Seconds between periodic frame samples |
| Visual | max_frames |
20 |
Max frames per video |
| Visual | analyze_frames |
true |
Use LLM to analyze extracted frames |
| Output | save_to_memory |
true |
Auto-save to Agent Zero memory |
| Output | include_timestamps |
true |
Include timestamps in transcripts |
| Playlist | max_videos |
10 |
Max videos per playlist |
| Channel | small_threshold |
50 |
Videos ≤ this → one-pass small-channel strategy |
| Channel | large_threshold |
200 |
Videos > this → large-channel strategy with long cooldowns |
| Channel | delay_per_video |
3 |
Per-video delay (s) for small channels |
| Channel | delay_per_video_medium |
5 |
Per-video delay (s) for medium channels |
| Channel | delay_per_video_large |
8 |
Per-video delay (s) for large channels |
| Channel | batch_size_medium |
15 |
Videos per batch for medium channels |
| Channel | batch_size_large |
10 |
Videos per batch for large channels |
| Channel | batch_cooldown_medium |
60 |
Cooldown (s) between batches for medium channels |
| Channel | batch_cooldown_large |
300 |
Cooldown (s) between batches for large channels |
| Channel | skip_no_captions |
true |
Skip videos without captions to avoid wasted requests |
| Channel | extract_visuals |
false |
Extract video frames on channel runs (heavy at scale) |
- Agent Zero (development branch with plugin framework)
- Python 3.10+
- System:
ffmpeg(for frame extraction; usually pre-installed in Agent Zero containers) - Python packages:
yt-dlp,youtube-transcript-api,Pillow(auto-installed byinitialize.py)
usr/plugins/youtube_transcribe/
├── plugin.yaml # Plugin manifest
├── default_config.yaml # Default settings
├── initialize.py # Dependency installer
├── install.sh # Automated installer
├── helpers/
│ └── youtube_client.py # URL parsing, metadata, transcripts, frames
├── tools/
│ ├── youtube_transcribe.py # Transcription + visual context
│ ├── youtube_channel.py # Channel transcription + adaptive rate-limiting + resumable state
│ ├── youtube_summary.py # AI summary generation
│ └── youtube_notes.py # Timestamped study notes
├── prompts/ # LLM tool descriptions
├── api/
│ └── youtube_test.py # Dependency check endpoint
├── webui/
│ ├── main.html # Dashboard with status + tool cards
│ └── config.html # Settings page
├── skills/ # 2 skill definitions
└── data/ # Runtime output (transcripts, frames)
- "No transcript available" -- The video may not have captions or auto-generated subtitles. Some videos have captions disabled by the creator.
- Token/context overflow -- If you hit LLM token limits, the plugin now saves full output to files and returns only a preview. Check the
data/directory for complete transcripts. - No frames extracted -- Ensure
ffmpegis installed:which ffmpeg. In Agent Zero containers it's usually pre-installed. - Plugin not loading -- Ensure the symlink exists:
ls -la /a0/plugins/youtube_transcribeshould point to/a0/usr/plugins/youtube_transcribe - Import errors -- The symlink at
/a0/plugins/youtube_transcribe->/a0/usr/plugins/youtube_transcribeis required forfrom plugins.youtube_transcribe.helpers...imports