# Voice-controlled AI video generation for Daydream Scope
Speak into your microphone and watch AI-generated imagery transform in real time. Say "butterfly" and a butterfly appears. Say "ocean sunset" and the scene shifts. Your voice becomes the paintbrush.
Built for live performance and interactive installation. Based on The Mirror's Echo by Krista Faist.
```
Microphone   →   faster-whisper   →   spaCy NLP        →   StreamDiffusion
    ↓                  ↓                   ↓                     ↓
48kHz audio      "that's my          [Freddy,             AI-generated
capture           little guy          little guy]          imagery of
                  Freddy"            (nouns only)          Freddy
```
## How it works

The plugin runs as a preprocessor in front of StreamDiffusion:
- Captures microphone audio in 3-second chunks at 48kHz
- Resamples to 16kHz and transcribes with faster-whisper (CPU, int8 quantized)
- Extracts concrete nouns and noun phrases using spaCy NLP
- Injects extracted nouns as the generation prompt with cache reset
- Filters filler speech — "um okay so like" produces no prompt change
- Falls back to the UI prompt box when no voice nouns are detected
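The filler-speech gate can be sketched in a few lines. This is a simplified stand-in: the real plugin uses spaCy part-of-speech tags and noun chunks to keep only concrete nouns, while this sketch approximates the same behavior with a hypothetical stop-word list.

```python
import re
from typing import Optional

# Hypothetical filler vocabulary -- the actual plugin relies on spaCy POS
# tagging rather than a fixed stop list.
FILLER = {"um", "uh", "okay", "ok", "so", "like", "you", "know",
          "well", "yeah", "a", "an", "the", "and"}

def prompt_from_transcript(transcript: str) -> Optional[str]:
    """Return a candidate prompt, or None when the speech is all filler."""
    words = re.findall(r"[a-z']+", transcript.lower())
    content = [w for w in words if w not in FILLER]
    if not content:
        return None  # e.g. "um okay so like" produces no prompt change
    return ", ".join(content)
```

When this returns `None`, the pipeline leaves the current prompt untouched, so the UI text box remains the active fallback.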
## Features

- Real-time voice-to-image — speak and see results in ~2 seconds
- Noun extraction — only concrete nouns drive the image, not filler words
- UI prompt fallback — text box prompt stays active when you're not speaking
- Whisper on CPU — faster-whisper int8 keeps your GPU free for StreamDiffusion
- Cache reset on change — clean transitions between prompts, no ghosting
- LIFO audio queue — always transcribes the most recent speech, not a backlog
- Prompt monitor — included tkinter overlay shows what's driving the video
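The LIFO queue behavior can be modeled with a bounded deque: the capture callback appends 3-second chunks, and the transcriber always takes the newest one, so a slow transcription pass never builds a backlog. Names here are illustrative, not the plugin's actual identifiers.

```python
from collections import deque
from threading import Lock

# Bounded stack of audio chunks: append() silently evicts the oldest chunk
# when full, and pop() hands the transcriber the most recent speech.
audio_chunks: deque = deque(maxlen=4)
_lock = Lock()

def push_chunk(chunk) -> None:
    """Called from the audio callback; never blocks."""
    with _lock:
        audio_chunks.append(chunk)

def latest_chunk():
    """Called from the transcription loop; newest chunk first."""
    with _lock:
        return audio_chunks.pop() if audio_chunks else None
```

`deque(maxlen=...)` keeps eviction O(1) and lock-holding time negligible next to a 3-second capture window.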
## Requirements

- Daydream Scope installed
- Python 3.10+ (Scope's bundled Python works)
- A microphone
## Installation

```shell
# From Scope's virtual environment
pip install -e .
python -m spacy download en_core_web_sm
```

The plugin defaults to device 27 (Intel Smart Sound) at 48kHz. To find your mic device number:
```python
import sounddevice as sd
print(sd.query_devices())
```

Then edit the `mic_device = 27` line in `pipeline.py` to match your device index.
## Usage

1. Open Daydream Scope
2. Select Audio Transcription as the first pipeline (preprocessor)
3. Select StreamDiffusion as the second pipeline
4. Set Input Mode to Video (the plugin overrides this to text-only internally)
5. Type a base prompt in the text box (this is your fallback prompt)
6. Click Play, then speak into your mic and watch the imagery respond
Example console output:

```
AUDIO-PLUGIN: transcribing...
AUDIO-PLUGIN: audio amplitude=0.0352
AUDIO-PLUGIN: result='That's my little guy Freddy.'
AUDIO-PLUGIN: nouns extracted: ['my little guy', 'Freddy']
AUDIO-PLUGIN: >>> NEW PROMPT: 'my little guy, Freddy' (from: 'That's my little guy Freddy.')
```
## Prompt priority

| Source | Behavior |
|---|---|
| Voice nouns | Immediately override the active prompt with cache reset |
| UI text box | Accepted after the user types a new value; clears voice prompt |
| No speech | Voice prompt persists until UI text box changes |
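The table above amounts to a small state machine. A minimal sketch, assuming hypothetical names (the real logic lives in `pipeline.py`):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PromptState:
    ui_prompt: str = ""
    voice_prompt: Optional[str] = None

    def on_voice_nouns(self, nouns: list) -> bool:
        """Voice nouns immediately override; True signals a cache reset."""
        self.voice_prompt = ", ".join(nouns)
        return True

    def on_ui_change(self, new_text: str) -> None:
        """A newly typed UI prompt clears the active voice prompt."""
        if new_text != self.ui_prompt:
            self.ui_prompt = new_text
            self.voice_prompt = None

    def active(self) -> str:
        # During silence the voice prompt persists until the UI text changes.
        return self.voice_prompt if self.voice_prompt else self.ui_prompt
```

Note that re-submitting the same UI text does not clear the voice prompt; only a genuinely new value does, matching the "No speech" row above.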
## Prompt monitor

An always-on-top tkinter overlay that shows what's driving the video output in real time.
```shell
# Launch the monitor
python tools/scope-prompt-monitor.pyw
```

It shows:
- 🎤 VOICE (green) — voice noun prompt active
- 📝 UI PROMPT (yellow) — text box prompt active
- 🔶 FALLBACK (orange) — voice timed out, reverted to text box
- Amplitude bars, extracted nouns, raw transcription, skipped filler
## Project structure

```
scope_audio_transcription/
├── __init__.py                 # Package version
├── plugin.py                   # @hookimpl registration for Scope
└── pipelines/
    ├── __init__.py             # Pipeline exports
    ├── pipeline.py             # Main pipeline (voice capture + NLP + prompt injection)
    └── schema.py               # Scope UI configuration schema
tools/
└── scope-prompt-monitor.pyw    # Real-time prompt overlay (tkinter)
```
## Required Scope patches

The plugin requires three edits to Scope's pipeline_processor.py to ensure prompt overrides from the preprocessor always reach StreamDiffusion:
- Queue bypass — preprocessor parameters merge directly into the next processor's state instead of going through the parameter queue (which can fill up and drop overrides)
- Larger parameter queue —
maxsize=64instead of 8 - Larger output queue —
maxsize=64instead of 8
See the installation guide for exact edit locations.
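Why the queue bypass matters can be seen in a toy model: a bounded queue silently drops overrides once the consumer falls behind, while a direct merge into the next processor's state always lands. This is an assumption about the failure mode, not Scope's actual code.

```python
import queue

# Model of the stock parameter queue with its default maxsize of 8.
param_queue: queue.Queue = queue.Queue(maxsize=8)

def send_via_queue(params: dict) -> bool:
    """Queued delivery: fails silently once the queue fills."""
    try:
        param_queue.put_nowait(params)
        return True
    except queue.Full:
        return False  # the prompt override is lost

def send_via_merge(state: dict, params: dict) -> None:
    """Queue bypass: merge directly into the processor's state."""
    state.update(params)
```

Raising `maxsize` to 64 only shrinks the drop window; the direct merge removes it entirely, which is why the patch does both.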
## Choosing a Whisper model

| Model | Size | Speed | Accuracy | RAM |
|---|---|---|---|---|
| `tiny.en` | 39 MB | Fastest | Basic | ~0.5 GB |
| `base.en` | 74 MB | Fast | Good | ~0.5 GB |
| `small.en` | 244 MB | Default | Great | ~0.5 GB |
| `medium.en` | 769 MB | Slower | Best | ~1 GB |
All models run on CPU with int8 quantization via faster-whisper, keeping GPU memory free for StreamDiffusion.
## About The Mirror's Echo

This plugin is the technical core of The Mirror's Echo, an interactive AI projection installation by Krista Faist. The installation transforms spoken words into evolving visual landscapes using Whisper AI, spaCy NLP, TouchDesigner, and StreamDiffusion.
- Artist: Krista Faist
- Gallery: Chaos Contemporary Craft, Sarasota FL
- Fuse Factory Artist-in-Residence 2024, Columbus OH
## License

MIT — Copyright (c) 2025 Krista Faist