Background
Sentiment expression and detection are crucial for effective and empathetic human-robot interaction. Previous work in this field has often focused on non-verbal emotion expression, such as facial expressions and gestures. Less is known about which specific prosodic speech elements are required in human-robot interaction. Our research question was: which prosodic elements are related to emotional speech in human-computer/robot interaction?

Methods
The scoping review was conducted in alignment with the Arksey and O'Malley framework. Literature was identified from the SCOPUS, IEEE Xplore, ACM Digital Library and PsycINFO databases in May 2021. After screening and de-duplication, data were extracted into an Excel coding sheet and summarised.

Results
Thirteen papers, published from 2012 to 2020, were included in the review. The most commonly used prosodic elements were tone/pitch (n = 8), loudness/volume (n = 6), speech speed (n = 4) and pauses (n = 3). Non-linguistic vocalisations (n = 1) were less frequently used. The prosodic elements were generally effective in helping to convey or detect emotion, but were less effective for negative sentiment (e.g., anger, fear, frustration, sadness and disgust).

Discussion
Future research should explore the effectiveness of the commonly used prosodic elements (tone, loudness, speed and pauses) in emotional speech, using larger sample sizes and real-life interaction scenarios. The success of prosody in conveying negative sentiment to humans may be improved with additional non-verbal cues (e.g., coloured light or motion). More research is needed to determine how these cues may be combined with prosody and which combination is most effective in human-robot affective interaction.
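For illustration only (this sketch is not drawn from the reviewed papers): the four most commonly reported prosodic elements, pitch, volume, speech rate and pauses, map directly onto standard SSML markup, which many text-to-speech engines accept. The following minimal Python sketch assumes an SSML-capable TTS engine; the parameter values are arbitrary examples, not recommendations from the review.

```python
def sad_utterance(text: str) -> str:
    """Wrap text in SSML prosody markup approximating a low-arousal,
    negative-valence (e.g., sad) speaking style.

    Covers the four prosodic elements most often reported in the review:
    pitch, volume, rate, and pauses. Values are illustrative only.
    """
    return (
        "<speak>"
        '<prosody pitch="low" rate="slow" volume="soft">'
        f"{text}"
        "</prosody>"
        '<break time="600ms"/>'  # trailing pause suggesting hesitation
        "</speak>"
    )

print(sad_utterance("I could not find your keys."))
```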