In this paper we investigate the usability of speech-centric multimodal interaction by comparing two systems that support the same unfamiliar task, viz. bathroom design. One version implements a conversational agent (CA) metaphor, while the other is based on direct manipulation (DM). Twenty subjects, 10 males and 10 females, none of whom had recent experience with bathroom (re-)design, completed the same task with both systems. After each task we collected objective measures (task completion time, task completion rate, number of actions performed, speech and pen recognition errors) and subjective measures in the form of Likert-scale ratings. We found that the task completion rate was higher for the CA system than for the DM system. Nevertheless, subjects did not agree in their preference for one of the systems: those subjects who were able to use the DM system effectively preferred that system, mainly because it was faster for them and they felt more in control. We conclude that for multimodal CA systems to become widely accepted, substantial improvements in system architecture and in the performance of almost all individual modules are needed. © 2005 Elsevier B.V. All rights reserved.