Abstract
Understanding user intent is essential for build-
ing better human interaction agents, as it enables
personalization, co-creation, and contextual adaptation.
However, existing approaches are restricted to text
environments, rely on human annotation, or merely
predict future user actions without reasoning
explicitly about user goals.
In this work, we introduce EARL (Early Action
Reasoning for Latent intent), a theory-of-mind-inspired
inference-time algorithm that models
user intent as an inverse planning problem, inferring
latent goals from observed user actions.
EARL hypothesizes potential user intent at mul-
tiple stages during the course of task execution,
enabling timely intervention and personalization.
Evaluating on three diverse benchmarks (Mind2Web,
AiTz, and VideoGUI) with two
strong LLMs (Gemini-1.5-Pro and GPT-4o), we
show that EARL consistently outperforms CoT-
based LLM baselines in accurately deciphering
user intent, especially under partial observations