Abstract: Explainability is vital for establishing user trust, also in robotics. Recently, foundation models (e.g. vision-language models, VLMs) fostered a wave of embodied agents that answer ...