Gaze Target Estimation Anywhere with Concepts

Xu Cao*†, Houze Yang*, Vipin Gunda, Zhongyi Zhou, Tianyu Xu, Adarsh Kowdle, Inki Kim, Jim Rehg

* core contributor, † project lead

UIUC Rehg LabGoogle AR

  • A SAM 3-style foundation model for gaze target estimation.
  • The first text- and visual-concept-driven gaze target estimation model.
  • Defines the Promptable Gaze Target Estimation (PGE) task.
  • The first gaze target estimation agent, AnyGaze Agent, which connects GazeAnywhere to Gemini APIs.

[Figure: GazeAnywhere architecture]

Estimating human gaze targets from images in-the-wild is an important and formidable task. Existing approaches primarily employ brittle, multi-stage pipelines that require explicit inputs, like head bounding boxes and human pose, in order to identify the subject of gaze analysis. As a result, detection errors can cascade and lead to failure. Moreover, these prior works lack the flexibility of specifying the gaze analysis task via natural language prompting, an approach which has been shown to have significant benefits in convenience and scalability for other image analysis tasks.

To overcome these limitations, we introduce the Promptable Gaze Target Estimation (PGE) task, a new end-to-end, concept-driven paradigm for gaze analysis. PGE conditions gaze prediction on flexible user text or visual prompts (e.g., "the boy in the red shirt" or "person at point [0.52, 0.48]") to identify a specific subject for gaze analysis. This approach integrates subject localization with gaze estimation and eliminates the rigid dependency on intermediate analysis stages.
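
To make the prompting idea concrete, here is a minimal sketch of how PGE-style prompts could be represented. The field names ("type", "text", "point") and the normalized-coordinate convention are illustrative assumptions, not GazeAnywhere's actual API.

```python
# Hypothetical prompt schema for PGE: a subject can be selected either by
# a natural-language description or by a normalized image coordinate.

def make_text_prompt(description):
    """Select the gaze subject by a natural-language description."""
    return {"type": "text", "text": description}

def make_point_prompt(x, y):
    """Select the gaze subject by a point in normalized [0, 1] coordinates."""
    assert 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0, "coordinates are normalized"
    return {"type": "point", "point": (x, y)}

text_prompt = make_text_prompt("the boy in the red shirt")
point_prompt = make_point_prompt(0.52, 0.48)
print(text_prompt, point_prompt)
```

Either prompt identifies one subject in the image; the model then predicts that subject's gaze target directly, with no separate head-detection stage.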

We also propose GazeAnywhere, the first foundation model designed for PGE. GazeAnywhere uses a multi-layer transformer-based detector to fuse features from frozen encoders and simultaneously solves subject localization, in/out-of-frame presence, and gaze target heatmap estimation.
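
Since the model outputs a gaze target heatmap, a common way to turn it into a point prediction is to take the peak and normalize it to image coordinates. The sketch below shows that standard argmax decoding; the heatmap layout and cell-center convention are assumptions for illustration, not GazeAnywhere's exact code.

```python
# Decode a gaze heatmap (row-major grid of scores) into a normalized
# (x, y) point prediction by locating its peak.

def decode_gaze_point(heatmap):
    """Return the (x, y) of the heatmap peak, normalized to [0, 1]."""
    h, w = len(heatmap), len(heatmap[0])
    r, c = max(
        ((r, c) for r in range(h) for c in range(w)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    # Use the cell center so peaks in edge cells still map inside [0, 1].
    return ((c + 0.5) / w, (r + 0.5) / h)

# A 4x4 heatmap with its peak at row 1, column 2.
hm = [[0.0] * 4 for _ in range(4)]
hm[1][2] = 1.0
print(decode_gaze_point(hm))  # -> (0.625, 0.375)
```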

Installation

Prerequisites

  • Python 3.12 or higher
  • PyTorch 2.7 or higher
  • CUDA-compatible GPU with CUDA 12.6 or higher

  1. Create a new Conda environment (optional, but recommended):
conda create -n anygaze python=3.12
conda activate anygaze

Alternatively, you can use a virtual environment:

python3.12 -m venv anygaze
source anygaze/bin/activate  # On Windows: anygaze\Scripts\activate
  2. Install PyTorch with CUDA support:
pip3 install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
  3. Install dependencies:
pip3 install -r requirements.txt
  4. Install detectron2:

    Follow the detectron2 documentation for installation, or use:

pip3 install "git+https://github.com/facebookresearch/detectron2.git@017abbfa5f2c2a2afa045200c2af9ccf2fc6227f#egg=detectron2" --no-build-isolation

Getting Started

⚠️ Before using GazeAnywhere, please request access to the checkpoints on the GazeAnywhere Hugging Face repo. Once accepted, authenticate so you can download the checkpoints (e.g., run hf auth login after generating an access token).

Basic Usage

python tools/inference.py \
  --config_file configs/gazeanywhere_config.py \
  --model_weights TODO \
  --image_path assets/example.jpg \
  --text "appearance: light brown hair girl with blue and white striped shirt" \
  --save_path visualization.jpg \
  --use_dark_inference
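
The --use_dark_inference flag suggests DARK-style sub-pixel decoding (Distribution-Aware coordinate Representation of Keypoints, Zhang et al., 2020); we assume that is what it enables, since the repo's internals are not shown here. Per axis, DARK shifts the integer argmax m by -h'(m)/h''(m), with derivatives of the log-heatmap estimated by central finite differences. A 1-D sketch:

```python
import math

# Hedged sketch of DARK-style sub-pixel refinement on one axis of a
# score profile. For a Gaussian-shaped peak, the log-profile is exactly
# quadratic, so the Taylor-expansion shift recovers the true sub-pixel
# location from integer bins.

def dark_refine_1d(scores):
    """Refine the argmax of a 1-D score profile to sub-pixel precision."""
    m = max(range(len(scores)), key=lambda i: scores[i])
    if m == 0 or m == len(scores) - 1:
        return float(m)  # no neighbors for finite differences
    eps = 1e-10
    ln = [math.log(max(s, eps)) for s in scores]
    d1 = (ln[m + 1] - ln[m - 1]) / 2.0        # first derivative at m
    d2 = ln[m + 1] - 2.0 * ln[m] + ln[m - 1]  # second derivative at m
    if d2 >= 0:  # not a proper peak; keep the integer argmax
        return float(m)
    return m - d1 / d2

# A Gaussian peaked at 2.3: refinement recovers ~2.3 from integer bins.
profile = [math.exp(-((i - 2.3) ** 2) / 2.0) for i in range(6)]
print(round(dark_refine_1d(profile), 3))  # -> 2.3
```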

Release Timeline

  • [Feb 2026] GazeAnywhere (AnyGaze) paper is accepted to CVPR 2026! 🎉
  • [Feb 2026] Released initial inference code and environment setup.
  • [Mar 2026] Release pre-trained model weights on Hugging Face.
  • [Mar 2026] Release the Gaze-Co benchmark for community use.
  • [Apr 2026] Release full training, validation, and evaluation scripts.
  • [May 2026] Launch local Gradio Web UI and interactive Hugging Face Spaces demo.

Acknowledgement

Our implementation is inspired by DINOv3, DINOv2 Meets Text, SAM 3, ViTGaze, sharingan, Gaze-LLE, and TransGesture. Thanks for their remarkable contributions and released code! If we have missed any open-source project or related article, please let us know and we will add an acknowledgement promptly.

We would also like to thank the following people for their contributions prior to GazeAnywhere: Fiona Ryan, Yuehao Song, Samy Tafasca, the authors of DINOv3 and SAM 3 at Meta, and the authors of OWLv2 at Google DeepMind. Parts of our ideas are inspired by their papers.

Collaboration

We welcome technical contributors to join this project. Independent researchers who make significant contributions to GazeAnywhere (exploring new applications, accelerating training/inference, validating new components, or providing additional training data) will be added to the author list of GazeAnywhere 2. We review pull requests regularly and will contact contributors.

Citing GazeAnywhere

If you use GazeAnywhere or the Gaze-Co dataset in your research, please use the following BibTeX entry.

@inproceedings{cao2026gaze,
  title={Gaze Target Estimation Anywhere with Concepts},
  author={Cao, Xu and Yang, Houze and Gunda, Vipin and Zhou, Zhongyi and Xu, Tianyu and Kowdle, Adarsh and Kim, Inki and Rehg, James M},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}

About

[CVPR 2026] GazeAnywhere: Gaze Target Estimation Anywhere with Concepts
