Iris is an autonomous browser system that can perceive, reason, and act on web pages — similar to how a human interacts with a browser.
It combines browser automation, structured page understanding, and LLM-based decision making to execute tasks step-by-step.
Key capabilities:
- Connects to a real browser using Playwright (CDP)
- Extracts structured interactive elements from the DOM
- Uses visual context (screenshots) for reasoning
- Executes actions: click, type, scroll, navigate
- Runs an agent loop: observe → think → act
- Streams live browser screen using noVNC
How it works:
- Captures browser screenshots
- Extracts interactive elements (buttons, inputs, links)
- Builds a structured representation of the page
- Sends context (DOM + screenshot) to an LLM
- Decides the next best action
- Executes actions using Playwright:
  - click
  - type
  - scroll
  - navigate
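The action set above maps naturally onto a small dispatcher. This is an illustrative sketch, not Iris's actual code: the `execute_action` name and the action dict schema are assumptions, though `click`, `fill`, `goto`, and `mouse.wheel` are real Playwright sync-API methods, so a real `Page` object would work here.

```python
def execute_action(page, action: dict) -> None:
    """Dispatch an LLM-chosen action onto browser calls.

    `page` is duck-typed: a Playwright sync-API Page works, since
    click/fill/goto and mouse.wheel are standard Playwright methods.
    The action schema ({"type": ..., "selector": ...}) is illustrative.
    """
    kind = action["type"]
    if kind == "click":
        page.click(action["selector"])
    elif kind == "type":
        page.fill(action["selector"], action["text"])
    elif kind == "scroll":
        page.mouse.wheel(0, action.get("dy", 600))  # positive dy scrolls down
    elif kind == "navigate":
        page.goto(action["url"])
    else:
        raise ValueError(f"unknown action type: {kind!r}")
```

Keeping the action vocabulary this small makes the LLM's output easy to validate before anything touches the page.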
Observe → Understand → Decide → Act → Repeat
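That cycle is a plain closed-loop control skeleton. A minimal sketch, assuming injected callables rather than Iris's real interfaces (all four parameter names are illustrative):

```python
def agent_loop(observe, decide, act, is_done, max_steps: int = 20):
    """Closed-loop control: observe the page, let the LLM decide,
    execute the action, and repeat until the goal is reached.

    observe/decide/act/is_done are injected callables (assumed names):
    observe() returns the current state (e.g. screenshot + structured DOM),
    decide(state) asks the LLM for the next action, act(action) executes it.
    """
    for _ in range(max_steps):
        state = observe()
        if is_done(state):
            return state
        action = decide(state)
        act(action)
    raise RuntimeError("max_steps exceeded without completing the task")
```

A step budget like `max_steps` is what keeps an autonomous loop from running forever when the model never reaches the goal.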
Architecture:
User
↓
Backend (FastAPI)
↓
Playwright Browser (via CDP)
↓
DOM Extraction + Screenshot
↓
LLM (Vertex AI - Gemini Flash)
↓
Action Execution
↓
noVNC (Live Screen Streaming)
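The LLM stage in the pipeline above replies with text, but the executor needs a machine-readable action, so the reply has to be parsed defensively (models often wrap JSON in markdown fences or prose). A sketch under that assumption; the function name and schema are illustrative:

```python
import json

def parse_action(reply: str) -> dict:
    """Extract the JSON action object from an LLM reply.

    Models often wrap JSON in markdown fences or surrounding prose,
    so keep only the substring between the first '{' and the last '}'.
    """
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model reply")
    action = json.loads(reply[start:end + 1])
    if "type" not in action:
        raise ValueError("action is missing its 'type' field")
    return action
```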
Current limitations (V1):
- Single VM deployment (shared session across users)
- No per-user isolation
- Heavy reliance on vision (screenshots → higher cost)
- Uses coordinate-based clicking (can be fragile)
Planned for V2:
- Per-user isolated browser sessions
- Containerized execution (Docker)
- Scalable architecture (multi-instance support)
- Voice input for natural interaction
- Hybrid reasoning (DOM-first, vision fallback)
- Reduced LLM cost via selective screenshot usage
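The DOM-first, vision-fallback idea can be sketched as a simple gate: only attach a screenshot to the LLM request when the structured DOM looks insufficient. This is an illustrative sketch of the planned approach, not implemented Iris logic; the heuristic and the 50% threshold are assumptions.

```python
def needs_screenshot(elements: list[dict], last_action_failed: bool) -> bool:
    """DOM-first policy sketch: skip the expensive screenshot unless
    the structured view looks unreliable. Heuristics are illustrative.
    """
    if last_action_failed:
        return True   # fall back to vision after a missed action
    if not elements:
        return True   # canvas-heavy or not-yet-rendered page
    # If most controls carry no text or aria-label, the DOM alone
    # probably isn't enough for the LLM to ground its decision.
    unlabeled = sum(
        1 for el in elements
        if not el.get("text") and not el.get("aria_label")
    )
    return unlabeled / len(elements) > 0.5
```

Since each screenshot adds multimodal tokens to the request, gating vision like this is what turns "selective screenshot usage" into a direct cost reduction.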
Design principles:
- Instead of raw pixel-based control, Iris builds a structured interaction map
- LLM is used for decision making, not raw extraction
- System follows a closed-loop agent architecture
- Combines symbolic (DOM) + perceptual (vision) inputs
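The "structured interaction map" above can be illustrated as indexing the extracted interactive elements, so the LLM answers with an element index (e.g. "click [3]") instead of screen coordinates. A minimal sketch; the function name and element field names (`tag`, `text`, `aria_label`) are assumptions:

```python
def build_interaction_map(elements: list[dict]) -> str:
    """Render extracted interactive elements as an indexed,
    LLM-readable list, so the model can refer to an element by
    index instead of by pixel coordinates. Field names are assumed.
    """
    lines = []
    for i, el in enumerate(elements):
        label = el.get("text") or el.get("aria_label") or ""
        lines.append(f"[{i}] <{el['tag']}> {label!r}")
    return "\n".join(lines)
```

Index-based references are also more robust than coordinates: they survive layout shifts as long as the element extraction itself is stable.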
Tech stack:
- Backend: FastAPI, Python
- Browser Automation: Playwright (CDP)
- LLM: Vertex AI (Gemini Flash)
- Streaming: noVNC
- Infra: Google Cloud VM (Compute Engine)
Iris is evolving toward a multi-user, scalable autonomous agent system, with:
- distributed browser sessions
- intelligent routing
- efficient multimodal reasoning
This is an early version (V1) focused on validating the core idea of autonomous browser interaction.
Open to ideas, improvements, and collaborations.