Multimodal communication is essential in human interactions, as it allows for a more comprehensive and nuanced exchange of information and emotions. Using multiple communication channels, such as speech, body language, and gaze, can enhance the clarity and richness of communication, leading to better understanding and more effective social interactions. This paper investigates the importance of multimodal expressive communication, specifically voice, arm gestures, and gaze, in regulating human-agent interaction when a participant joins a group of two virtual agents in a virtual reality environment. One of the virtual agents in the group uses politeness behaviors based on Brown and Levinson's politeness theory to invite participants to join the group at the side farther from them, even though a closer side is available. The study finds that a combination of all modalities (verbal, gaze, arm gesture) is more effective in persuading participants to join the group at the farther side, and that arm gestures alone are more effective than gaze behavior, although they are perceived as less polite. Furthermore, although verbal-only communication can be as persuasive as other modalities, it can place a greater cognitive load on participants, which may lead to delayed responses in comparison to other modalities. The findings offer designers of human-agent interaction systems insight into how multiple communication channels, particularly nonverbal behaviors such as arm gestures, can enhance the effectiveness of persuasive communication. Designers also need to balance this effectiveness against other factors, such as the impression and perceived politeness of virtual agents.