Xu Cao*†, Houze Yang*, Vipin Gunda, Zhongyi Zhou, Tianyu Xu, Adarsh Kowdle, Inki Kim, Jim Rehg†
* core contributor, † project lead
- SAM 3 style gaze target estimation foundation model.
- The first text and visual concept-driven gaze target estimation model.
- Defines the Promptable Gaze Target Estimation (PGE) task.
- AnyGaze Agent, the first gaze target estimation agent, which connects GazeAnywhere to Gemini APIs.
Estimating human gaze targets from images in-the-wild is an important and formidable task. Existing approaches primarily employ brittle, multi-stage pipelines that require explicit inputs, like head bounding boxes and human pose, in order to identify the subject of gaze analysis. As a result, detection errors can cascade and lead to failure. Moreover, these prior works lack the flexibility of specifying the gaze analysis task via natural language prompting, an approach which has been shown to have significant benefits in convenience and scalability for other image analysis tasks.
To overcome these limitations, we introduce the Promptable Gaze Target Estimation (PGE) task, a new end-to-end, concept-driven paradigm for gaze analysis. PGE conditions gaze prediction on flexible user text or visual prompts (e.g., "the boy in the red shirt" or "person at point [0.52, 0.48]") to identify a specific subject for gaze analysis. This approach integrates subject localization with gaze estimation and eliminates the rigid dependency on intermediate analysis stages.
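To make the point-prompt format above concrete, a normalized [x, y] coordinate must be mapped to pixel coordinates before it can index into an image. The helper below is a hypothetical sketch for illustration only (the function name and signature are ours, not part of the released API):

```python
def point_prompt_to_pixels(point, image_size):
    """Convert a normalized point prompt (x, y in [0, 1]) to pixel
    coordinates for an image of size (width, height).

    Hypothetical helper for illustration; not part of the
    GazeAnywhere API.
    """
    x, y = point
    w, h = image_size
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError("point prompts use normalized [0, 1] coordinates")
    return (round(x * w), round(y * h))
```

For example, the prompt "person at point [0.52, 0.48]" on a 1000x500 image refers to pixel (520, 240).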
We also propose GazeAnywhere, the first foundation model designed for PGE. GazeAnywhere uses a multi-layer transformer-based detector to fuse features from frozen encoders and simultaneously solves subject localization, in/out-of-frame presence, and gaze target heatmap estimation.
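For intuition about the heatmap output, a predicted gaze heatmap is typically decoded into a single target point by locating its peak. The sketch below is our own simplified argmax decoder, not the model's actual post-processing (which may use sub-pixel refinement such as DARK-style decoding):

```python
def decode_gaze_heatmap(heatmap):
    """Return the gaze target as normalized (x, y) by locating the
    peak of a 2D heatmap given as a list of rows.

    Simplified argmax decoding for illustration; the released model
    may refine the peak location further.
    """
    h, w = len(heatmap), len(heatmap[0])
    best_val, best_r, best_c = max(
        (v, r, c)
        for r, row in enumerate(heatmap)
        for c, v in enumerate(row)
    )
    # Normalize to [0, 1] so the point is resolution-independent.
    return (best_c / (w - 1), best_r / (h - 1))
```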
- Python 3.12 or higher
- PyTorch 2.7 or higher
- CUDA-compatible GPU with CUDA 12.6 or higher
- Create a new Conda environment (optional, but recommended):
conda create -n anygaze python=3.12
conda activate anygaze
Alternatively, you can use a virtual environment:
python3.12 -m venv anygaze
source anygaze/bin/activate  # On Windows: anygaze\Scripts\activate
- Install PyTorch with CUDA support:
pip3 install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
- Install dependencies:
pip3 install -r requirements.txt
- Install detectron2:
Follow the detectron2 documentation for installation, or use:
pip3 install "git+https://github.com/facebookresearch/detectron2.git@017abbfa5f2c2a2afa045200c2af9ccf2fc6227f#egg=detectron2" --no-build-isolation
- Log in to Hugging Face to download the pre-trained weights (run `hf auth login` after generating an access token).
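After installation, it can help to confirm that the installed versions satisfy the minimums listed above. The snippet below is a small, hypothetical sanity check (the helper name is ours); it compares dotted version strings while ignoring local suffixes such as `+cu126`:

```python
def meets_min_version(current, minimum):
    """Return True if a dotted version string (e.g. '2.7.0+cu126')
    satisfies a minimum version (e.g. '2.7'). Local suffixes after
    '+' are ignored. Hypothetical helper for a quick sanity check.
    """
    def parse(s):
        return tuple(int(part) for part in s.split("+")[0].split("."))
    return parse(current) >= parse(minimum)

# Example usage (assumes torch is installed in the environment):
# import torch
# assert meets_min_version(torch.__version__, "2.7")
```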
python tools/inference.py \
--config_file configs/gazeanywhere_config.py \
--model_weights TODO \
--image_path assets/example.jpg \
--text "appearance: light brown hair girl with blue and white striped shirt" \
--save_path visualization.jpg \
--use_dark_inference
- [Feb 2026] GazeAnywhere (AnyGaze) paper is accepted to CVPR 2026! 🎉
- [Feb 2026] Released initial inference code and environment setup.
- [Mar 2026] Release pre-trained model weights on Hugging Face.
- [Mar 2026] Release the Gaze-Co benchmark for community use.
- [Apr 2026] Release full training, validation, and evaluation scripts.
- [May 2026] Launch local Gradio Web UI and interactive Hugging Face Spaces demo.
Our implementation is inspired by DINOv3, DINOv2 Meets Text, SAM 3, ViTGaze, sharingan, Gaze-LLE, and TransGesture. Thanks for their remarkable contributions and released code! If we have missed any open-source project or related article, please let us know and we will add an acknowledgement promptly.
We would like to thank the following people for their contributions prior to GazeAnywhere: Fiona Ryan, Yuehao Song, Samy Tafasca, the authors of DINOv3 and SAM 3 at Meta, and the authors of OWLv2 at Google DeepMind. Part of our idea is inspired by their papers.
We welcome technical contributors to join this project. Independent researchers who make significant contributions to GazeAnywhere (exploring new applications, accelerating training or inference, validating new components, or providing additional training data) will be added to the author list of GazeAnywhere 2. We regularly review pull requests and contact contributors.
If you use GazeAnywhere or the Gaze-Co dataset in your research, please use the following BibTeX entry.
@inproceedings{cao2026gaze,
title={Gaze Target Estimation Anywhere with Concepts},
author={Cao, Xu and Yang, Houze and Gunda, Vipin and Zhou, Zhongyi and Xu, Tianyu and Kowdle, Adarsh and Kim, Inki and Rehg, James M},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2026}
}
