AHs 2025 GazeLLM: Multimodal LLMs incorporating Human Visual Attention
Description
Processing high-resolution video with AI requires massive computational resources. GazeLLM addresses this with an approach inspired by human vision: it uses eye-tracking to focus only on the part of the scene the user is looking at. By cropping first-person video to a small region around the user's gaze point, the system reduces the pixel input to about one-tenth of the original while achieving task comprehension equal to or better than that obtained with full-resolution video. User evaluations across six real-world activities, including cooking, bike repair, first aid, and sports, showed that gaze-focused video produces higher-quality task descriptions than both full videos and center-cropped alternatives.
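The gaze-centered cropping described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `gaze_crop_box` and `crop_frame`, the choice to keep the frame's aspect ratio, and the clamping of the window at the frame edges are all assumptions made for illustration. Only the one-tenth pixel budget comes from the description.

```python
import math

def gaze_crop_box(frame_w, frame_h, gaze_x, gaze_y, area_fraction=0.1):
    """Return (left, top, width, height) of a crop window centered on the gaze point.

    Hypothetical helper: the window keeps the frame's aspect ratio and covers
    roughly `area_fraction` of the frame's pixels (0.1 mirrors the one-tenth
    figure in the description).
    """
    # Scaling both sides by sqrt(area_fraction) scales the area by area_fraction.
    scale = math.sqrt(area_fraction)
    crop_w = max(1, round(frame_w * scale))
    crop_h = max(1, round(frame_h * scale))

    # Center the window on the gaze point.
    left = round(gaze_x) - crop_w // 2
    top = round(gaze_y) - crop_h // 2

    # Assumption: shift the window back inside the frame near the edges
    # instead of padding, so the output size stays constant.
    left = min(max(left, 0), frame_w - crop_w)
    top = min(max(top, 0), frame_h - crop_h)
    return left, top, crop_w, crop_h

def crop_frame(frame, gaze_x, gaze_y, area_fraction=0.1):
    """Crop a frame given as a list of pixel rows around the gaze point."""
    frame_h = len(frame)
    frame_w = len(frame[0])
    left, top, crop_w, crop_h = gaze_crop_box(
        frame_w, frame_h, gaze_x, gaze_y, area_fraction
    )
    return [row[left:left + crop_w] for row in frame[top:top + crop_h]]
```

For a 1920×1080 frame, this window is 607×342 pixels, which is close to one-tenth of the original pixel count. In a full pipeline the same box would be computed per frame from the eye-tracker's gaze sample before the cropped frames are passed to the multimodal model.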
Jun Rekimoto. 2025. GazeLLM: Multimodal LLMs incorporating Human Visual Attention. In Proceedings of the Augmented Humans International Conference 2025 (AHs '25). Association for Computing Machinery, New York, NY, USA, 10 pages. https://doi.org/10.1145/3745900.3746075























